Speech LLM
Speech Large Language Models (SpeechLLMs) unify LLM capabilities with audio processing, enabling end-to-end spoken language understanding and generation.
SpeechLLMs are multi-modal architectures that process raw audio and text inputs concurrently, bypassing traditional cascaded pipelines (ASR followed by an LLM). The core design couples an audio encoder (e.g., Whisper, WavLM) to a pretrained LLM through a modality adapter that maps acoustic features into the LLM's embedding space (Source 2, 3).

This unified approach enables deeper contextual reasoning: models like Qwen-Audio have reported gains of up to a 10% reduction in Word Error Rate (WER) and an 8% improvement in sentiment accuracy over single-task models (Source 5).

Key deployments focus on advanced conversational AI, including full-duplex interaction of the kind demonstrated by GPT-4o, and on real-time applications such as customer support analytics and automated meeting transcription (Source 4, 1).
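To make the adapter pattern concrete, below is a minimal PyTorch sketch of a modality adapter that downsamples audio-encoder frames in time and projects them into an LLM's embedding space. The class name, the dimensions (1280 for a Whisper-large-style encoder, 4096 for the LLM), and the convolutional downsampling are illustrative assumptions, not the design of any specific SpeechLLM.

```python
import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    """Hypothetical adapter bridging an audio encoder and a pretrained LLM.

    Dimensions and downsampling scheme are illustrative; real SpeechLLMs
    vary (frame stacking, Q-Former-style query tokens, etc.).
    """

    def __init__(self, audio_dim: int = 1280, llm_dim: int = 4096, stride: int = 4):
        super().__init__()
        # Audio frames are far denser than text tokens, so adapters
        # typically downsample in time before projecting.
        self.downsample = nn.Conv1d(audio_dim, audio_dim,
                                    kernel_size=stride, stride=stride)
        self.proj = nn.Linear(audio_dim, llm_dim)

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, frames, audio_dim), e.g. Whisper encoder output
        x = self.downsample(audio_feats.transpose(1, 2)).transpose(1, 2)
        return self.proj(x)  # (batch, frames // stride, llm_dim)

adapter = ModalityAdapter()
audio_feats = torch.randn(1, 1500, 1280)  # ~30 s through a Whisper-large-style encoder
speech_embeds = adapter(audio_feats)      # (1, 375, 4096)
```

In most published designs of this kind, the projected speech embeddings are then concatenated with the text token embeddings and consumed by the (typically frozen or LoRA-tuned) LLM as a single sequence, which is what enables the end-to-end reasoning described above.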