
Speech LLM

Speech Large Language Models (SpeechLLMs) unify LLM capabilities with audio processing, enabling end-to-end spoken language understanding and generation.

SpeechLLMs are multi-modal architectures that process raw audio and text inputs concurrently, bypassing traditional cascaded pipelines (ASR followed by an LLM). The core design integrates an audio encoder (e.g., Whisper, WavLM) with a pretrained LLM via a modality adapter (Sources 2, 3). This unified approach enables deeper contextual reasoning: models such as Qwen-Audio have demonstrated performance gains, including up to a 10% reduction in Word Error Rate (WER) and an 8% improvement in sentiment accuracy over single-task models (Source 5). Key deployments focus on advanced conversational AI, including full-duplex interaction like that seen in GPT-4o, and on real-time applications such as customer support analytics and automated meeting transcription (Sources 1, 4).

https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/multimodal/speech_llm/overview.html
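The encoder-adapter-LLM pattern described above can be sketched in a few lines. This is a minimal illustration, not the API of any specific model: the dimensions, the linear adapter, and all function names are assumptions made for the example, and the encoder is a stand-in for a real pretrained model such as Whisper or WavLM.

```python
import numpy as np

# Illustrative dimensions (real models use much larger sizes).
D_AUDIO = 8   # audio-encoder feature size
D_LLM = 16    # LLM embedding size

rng = np.random.default_rng(0)

def audio_encoder(waveform: np.ndarray) -> np.ndarray:
    """Stand-in for a pretrained encoder (e.g., Whisper/WavLM):
    maps raw samples to a sequence of frame-level features."""
    n_frames = max(1, len(waveform) // 160)  # ~10 ms hop at 16 kHz
    return rng.standard_normal((n_frames, D_AUDIO))

class LinearAdapter:
    """Modality adapter: projects audio features into the LLM's
    token-embedding space so the LLM can attend over them."""
    def __init__(self):
        self.W = rng.standard_normal((D_AUDIO, D_LLM)) * 0.02
        self.b = np.zeros(D_LLM)

    def __call__(self, feats: np.ndarray) -> np.ndarray:
        return feats @ self.W + self.b

def build_llm_input(waveform: np.ndarray,
                    text_embeddings: np.ndarray,
                    adapter: LinearAdapter) -> np.ndarray:
    """Prepend projected audio tokens to the text-token embeddings,
    forming the single multi-modal sequence the LLM consumes."""
    audio_tokens = adapter(audio_encoder(waveform))
    return np.concatenate([audio_tokens, text_embeddings], axis=0)

waveform = rng.standard_normal(16000)        # 1 s of dummy 16 kHz audio
text_emb = rng.standard_normal((5, D_LLM))   # 5 text-token embeddings
seq = build_llm_input(waveform, text_emb, LinearAdapter())
print(seq.shape)  # 100 audio frames + 5 text tokens, each of width D_LLM
```

The key design point this sketch captures is that the encoder and LLM stay pretrained and (typically) frozen, while the lightweight adapter learns to map audio features into the LLM's embedding space.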
