Multi-task Audio Transformer Model

This talk presents a unified autoregressive transformer that handles both audio and text, covering tokenization and multi-task training for TTS, ASR, and voice completion.

Overview

We have pretrained and fine-tuned a single model that accepts audio or text as input and produces audio or text as output. The same model serves multiple audio tasks, such as text-to-speech (TTS), automatic speech recognition (ASR), and text-to-voice completion. We will demo the TTS capability and walk through the overall architecture of the model.
The model is hosted for fast, low-latency inference.
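
As a rough illustration of how one autoregressive model can serve several tasks, the sketch below places text and audio tokens in a single shared vocabulary and uses a leading task token to say which direction to go. Everything specific here is an assumption made for the example, not a description of the model in the talk: the byte-level text tokens, the audio codebook size of 1024, the special tokens `<tts>`, `<asr>`, `<sep>`, `<eos>`, and the helper names.

```python
from typing import List, Sequence

# All sizes and special tokens are illustrative assumptions, not the talk's actual setup.
TEXT_VOCAB = 256            # byte-level text tokens occupy ids 0..255
AUDIO_VOCAB = 1024          # hypothetical audio codec codebook size
AUDIO_OFFSET = TEXT_VOCAB   # audio code c is mapped to id AUDIO_OFFSET + c
SPECIAL = {
    name: AUDIO_OFFSET + AUDIO_VOCAB + i
    for i, name in enumerate(["<tts>", "<asr>", "<sep>", "<eos>"])
}


def text_to_ids(text: str) -> List[int]:
    """Encode text as UTF-8 bytes, one token id per byte."""
    return list(text.encode("utf-8"))


def audio_to_ids(codes: Sequence[int]) -> List[int]:
    """Shift audio codec codes into their own range of the shared vocabulary."""
    for c in codes:
        if not 0 <= c < AUDIO_VOCAB:
            raise ValueError(f"audio code out of range: {c}")
    return [AUDIO_OFFSET + c for c in codes]


def build_example(task: str, text: str, audio_codes: Sequence[int]) -> List[int]:
    """Lay out one training sequence: task token, source, separator, target, end token."""
    t, a = text_to_ids(text), audio_to_ids(audio_codes)
    sep, eos = SPECIAL["<sep>"], SPECIAL["<eos>"]
    if task == "tts":   # text in, audio out
        return [SPECIAL["<tts>"], *t, sep, *a, eos]
    if task == "asr":   # audio in, text out
        return [SPECIAL["<asr>"], *a, sep, *t, eos]
    raise ValueError(f"unknown task: {task}")


tts_seq = build_example("tts", "hi", [3, 7])
asr_seq = build_example("asr", "hi", [3, 7])
```

Under this layout the same text and audio pair yields two training sequences that differ only in the task token and the order of the two segments, so a single next-token objective could cover both directions. A completion-style task would need its own layout, which is not sketched here.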

Links

Tech stack