Transcriber R&D project | San Francisco


February 28, 2025 · San Francisco

Transcriber R&D project

A demo of a Next.js app that evaluates and improves transcription quality and timestamps using multiple speech-to-text models, alignment, and merging techniques.
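
The alignment and merging mentioned above could work in many ways; here is a minimal, hypothetical sketch that aligns word-level output from two ASR engines by longest-common-subsequence over normalized word text and averages the timestamps where both engines agree. The `Word` shape, the LCS alignment, and the averaging rule are illustrative assumptions, not the project's actual method.

```typescript
// Hypothetical sketch: merge word-level timestamps from two ASR outputs.
// Matched words (by normalized text, in order) get averaged start/end times;
// unmatched words are kept with their original timings.

type Word = { text: string; start: number; end: number };

function normalize(w: string): string {
  return w.toLowerCase().replace(/[^a-z0-9]/g, "");
}

function mergeTimestamps(a: Word[], b: Word[]): Word[] {
  const m = a.length;
  const n = b.length;
  // Standard LCS table over normalized word text.
  const lcs: number[][] = Array.from({ length: m + 1 }, () =>
    new Array(n + 1).fill(0),
  );
  for (let i = m - 1; i >= 0; i--) {
    for (let j = n - 1; j >= 0; j--) {
      lcs[i][j] =
        normalize(a[i].text) === normalize(b[j].text)
          ? lcs[i + 1][j + 1] + 1
          : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }
  // Walk the table; average timings where the two engines agree on a word.
  const merged: Word[] = [];
  let i = 0;
  let j = 0;
  while (i < m && j < n) {
    if (normalize(a[i].text) === normalize(b[j].text)) {
      merged.push({
        text: a[i].text,
        start: (a[i].start + b[j].start) / 2,
        end: (a[i].end + b[j].end) / 2,
      });
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      merged.push(a[i++]); // keep the unmatched word from engine A as-is
    } else {
      merged.push(b[j++]); // keep the unmatched word from engine B as-is
    }
  }
  while (i < m) merged.push(a[i++]);
  while (j < n) merged.push(b[j++]);
  return merged;
}
```

A real pipeline would also need to resolve disagreements more carefully (e.g. weighting by each model's confidence), but the LCS walk shows the basic shape of consensus timing.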

Tech stack
  • Next.js
    Next.js is the full-stack React framework: it delivers high-performance web applications via hybrid rendering and powerful, Rust-based tooling.
    Next.js is a React framework for building full-stack web applications in production. It supports hybrid rendering (Server-Side Rendering, Static Site Generation, and Incremental Static Regeneration) for speed and SEO, along with React Server Components, Server Actions for running code on the server, and the App Router for advanced routing and nested layouts. Developed by Vercel, it uses Rust-based tooling such as Turbopack and SWC (the Speedy Web Compiler) for fast builds.
  • Whisper
    Whisper: OpenAI's robust, open-source ASR model for multilingual speech recognition, translation, and language identification.
    Whisper is OpenAI's general-purpose Automatic Speech Recognition (ASR) model, trained on a large, diverse dataset for robust performance. It is a multitask system, handling multilingual transcription, speech-to-English translation, and language identification. The architecture processes audio in sliding 30-second windows, predicting tokens autoregressively. It is released in several model sizes (from tiny to large), letting developers trade speed against accuracy for large-scale audio processing.
  • AssemblyAI
    AssemblyAI provides Speech AI models, such as its Universal model with claimed accuracy above 93.3%, via a developer-first API for transcription and audio intelligence.
    AssemblyAI is a Speech AI platform delivering models for both high-accuracy transcription (Speech-to-Text) and deeper audio understanding (Audio Intelligence). Developers access these through a scalable API that includes real-time streaming with sub-500 ms latency. The platform goes beyond raw text: its LeMUR framework applies LLMs to transcribed speech for summarization, sentiment analysis, and PII redaction. Companies including CallRail, Fireflies, and Spotify use AssemblyAI to build voice-powered products and extract insights from conversational data.
  • Reverb
    Reverb is Rev's open-source ASR and speaker diarization toolkit, built on models trained on human-transcribed English speech.
    Reverb packages Rev's speech recognition and diarization models for open use. The ASR model is trained on a large corpus of human-transcribed English audio (on the order of 200,000 hours) and exposes a "verbatimicity" control that tunes output between strictly verbatim and lightly cleaned-up transcripts. Alongside Whisper and AssemblyAI, it provides an additional independent source of transcripts and word timings for alignment and merging.
  • VAD
    Voice Activity Detection (VAD): The core signal-processing technology that precisely isolates human speech from noise and silence in real-time audio streams.
    VAD, or Voice Activity Detection, is a foundational signal-processing technique that acts as a binary classifier over an audio stream: speech (1) versus non-speech (0). Its primary function is to conserve resources and improve downstream performance in applications like Voice over IP (VoIP) and Automatic Speech Recognition (ASR). In a VoIP application like Zoom or Discord, for example, VAD ensures data is transmitted only during spoken segments, sharply reducing bandwidth and computational load. VAD algorithms have evolved from simple energy-based thresholds through statistical models such as Gaussian Mixture Models (GMMs) to modern deep-learning architectures that distinguish speech from complex background noise. Commercial solutions such as Picovoice's Cobra VAD are benchmarked by their vendor at roughly twice the accuracy of older standards like Google's WebRTC VAD, while processing audio chunks in milliseconds.
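
The simple energy-based approach that modern VADs improve on can be sketched as a toy frame classifier. The frame size, sample rate, and threshold below are illustrative assumptions; production VADs (WebRTC, Silero, Cobra) use statistical or neural models rather than a fixed RMS threshold.

```typescript
// Toy energy-based VAD: classify fixed-size frames of PCM samples as
// speech (true) or non-speech (false) by comparing each frame's RMS
// energy against a threshold.

function frameRms(frame: Float32Array): number {
  let sum = 0;
  for (let i = 0; i < frame.length; i++) sum += frame[i] * frame[i];
  return Math.sqrt(sum / frame.length);
}

function detectVoice(
  samples: Float32Array,
  frameSize = 160, // 10 ms at 16 kHz (assumed sample rate)
  threshold = 0.02, // depends on input gain; an assumption here
): boolean[] {
  const flags: boolean[] = [];
  for (let i = 0; i + frameSize <= samples.length; i += frameSize) {
    flags.push(frameRms(samples.subarray(i, i + frameSize)) > threshold);
  }
  return flags;
}
```

In a transcription pipeline, the `false` frames would be dropped before sending audio to the ASR engines, cutting cost and avoiding hallucinated text on silence.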
