
April 01, 2025 · Medellín

VOZY: Voice AI Sales Agent

Demonstration of VOZY, a voice‑enabled AI receptionist that handles inbound sales calls, answers FAQs, gathers lead information, and logs contacts in real time.

Overview
Tech stack
  • Speech-to-Text
    Speech-to-Text (STT) converts spoken audio into written text in near real time: it is the core engine behind voice assistants like Alexa and live captioning in 100+ languages.
    Speech-to-Text, more formally Automatic Speech Recognition (ASR), uses deep learning models to transform human speech into digital text. The technology powers critical enterprise applications: transcribing contact-center calls, generating subtitles for live media, and enabling voice commands on smart devices. Major providers, including Google Cloud Speech-to-Text and Amazon Transcribe, offer APIs with high accuracy on clear audio and features like speaker diarization and custom vocabularies, making voice data actionable across nearly every industry.
  • Text-to-Speech
    Text-to-Speech (TTS) is the AI-driven technology that converts written text into synthesized, human-like audio: it gives your digital content a voice.
    TTS is a core speech synthesis technology: deep neural networks transform raw text into natural-sounding speech. The process involves linguistic analysis (parsing grammar and context) followed by acoustic modeling (generating the audio waveform). Modern neural TTS systems deliver high-fidelity voices in 50+ languages, far beyond the robotic sound of older concatenative systems. Key applications drive major efficiency gains: accessibility tools for users with reading disabilities, automated customer service via IVR systems, and high-volume content production for audiobooks and video narration.
  • Generative AI
    Generative AI employs foundation models (e.g., Large Language Models) to create novel, complex content—text, images, code, and audio—from simple user prompts.
    Generative AI is a deep learning paradigm focused on *creating* new output, not just classifying data. Key models like OpenAI's GPT-4 and Stability AI's Stable Diffusion have hundreds of billions of parameters and are trained on massive datasets (trillions of tokens) to identify complex patterns. This enables them to generate high-quality, original content: from drafting software code and summarizing 50-page reports to producing photorealistic images in seconds. It fundamentally shifts the human-computer interaction model from command-based to prompt-based creation, driving immediate, high-impact productivity gains across all industries.
  • Vector store
    A vector store is a specialized index for high-dimensional vector embeddings, enabling millisecond-speed semantic similarity search for AI and RAG applications.
    A vector store is a specialized data engine: it efficiently stores, indexes, and manages high-dimensional vector embeddings—the numerical representations of unstructured data (text, images, audio). Unlike traditional databases, the vector store utilizes Approximate Nearest Neighbor (ANN) algorithms (e.g., HNSW) to quickly find data points that are conceptually similar, not just keyword-matched. This core capability is indispensable for modern AI workflows, specifically powering semantic search, recommendation systems, and Retrieval-Augmented Generation (RAG) in Large Language Models (LLMs). Look at platforms like Milvus and Qdrant for open-source implementations, or Pinecone for a fully managed service.
  • RAG
    RAG (Retrieval-Augmented Generation) is the GenAI framework that grounds LLMs (like GPT-4) on external, verified data, drastically reducing model hallucinations and providing verifiable sources.
    RAG is a critical GenAI architecture: it mitigates the LLM 'hallucination' problem by inserting a retrieval step before generation. A user query is vectorized, then used to query an external knowledge base (e.g., a Pinecone vector database) for relevant document chunks (often segments of a few hundred tokens). These retrieved facts augment the original prompt, providing the LLM (e.g., Gemini or Llama 3) the specific, current, or proprietary context it requires. This process grounds the final response in domain-specific data, avoiding the high cost and latency of full model retraining.
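The STT, Generative AI, and TTS pieces above compose into the inbound-call loop the demo describes: transcribe the caller, generate a reply, capture lead details, and speak the answer back. A minimal sketch of that loop, with stub functions standing in for the real provider SDKs (the function names here are hypothetical, not VOZY's actual API):

```python
# Sketch of one conversational turn: STT -> LLM -> TTS, with lead logging.
# All three stages are illustrative stubs, not a real provider integration.

def transcribe(audio_chunk: bytes) -> str:
    """Stub STT: a real system would stream audio to an ASR API."""
    return audio_chunk.decode("utf-8")  # pretend the audio bytes are the transcript

def generate_reply(transcript: str, lead: dict) -> str:
    """Stub LLM step: answer the caller and capture lead fields from the transcript."""
    if "@" in transcript:                       # naive lead-info capture
        lead["email"] = transcript.split()[-1]  # log the contact in real time
        return "Thanks, I've logged your email. We'll follow up shortly."
    return "Happy to help! Could you share your email so we can follow up?"

def synthesize(text: str) -> bytes:
    """Stub TTS: a real system would call a neural TTS voice."""
    return text.encode("utf-8")

def handle_turn(audio_chunk: bytes, lead: dict) -> bytes:
    """One full turn of the voice agent, mutating the lead record in place."""
    reply = generate_reply(transcribe(audio_chunk), lead)
    return synthesize(reply)

lead: dict = {}
handle_turn(b"my address is ana@example.com", lead)
print(lead["email"])  # -> ana@example.com
```

In production each stub would be a streaming call to the provider's SDK, and the lead dict would be written to a CRM instead of mutated locally.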
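The vector-store idea can be shown with a brute-force cosine-similarity scan over toy embeddings. Real stores (Milvus, Qdrant, Pinecone) replace this linear scan with ANN indexes such as HNSW; the 3-dimensional vectors below are made up purely for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy "knowledge base": chunk name -> embedding (invented values).
store = {
    "pricing FAQ":   [0.9, 0.1, 0.0],
    "support hours": [0.1, 0.9, 0.1],
    "refund policy": [0.8, 0.2, 0.1],
}

def top_k(query, k=2):
    """Return the k stored chunks most semantically similar to the query embedding."""
    return sorted(store, key=lambda name: cosine(query, store[name]), reverse=True)[:k]

print(top_k([1.0, 0.0, 0.0]))  # -> ['pricing FAQ', 'refund policy']
```

Note that the two pricing-related chunks win even though no keyword matched: similarity is computed in embedding space, which is exactly what distinguishes semantic search from keyword search.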
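The RAG step described above is, at its core, prompt assembly: retrieved chunks are prepended to the caller's question before it reaches the LLM. A sketch under stated assumptions, where `retrieve()` is a hypothetical stand-in for the vector-store query and the knowledge snippets are invented:

```python
# Illustrative RAG prompt assembly; retrieve() stands in for a vector-store query.

def retrieve(question: str) -> list[str]:
    """Stub retriever: a real one embeds the question and queries a vector store."""
    knowledge = {
        "price": "Plan A costs $49/month; Plan B costs $99/month.",   # made-up facts
        "hours": "Support is available 9am-6pm COT, Monday to Friday.",
    }
    return [text for key, text in knowledge.items() if key in question.lower()]

def build_prompt(question: str) -> str:
    """Ground the LLM on retrieved facts instead of its parametric memory."""
    context = "\n".join(retrieve(question)) or "No relevant documents found."
    return (
        "Answer using ONLY the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

print(build_prompt("What is the price of Plan A?"))
```

Because the model is instructed to answer only from the supplied context, an empty retrieval result produces an explicit "no documents" signal rather than an invitation to hallucinate.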

Related projects