SparseSync
SparseSync is a transformer-based framework designed to synchronize audio and visual streams in unconstrained videos where alignment cues are infrequent or spatially small.
SparseSync tackles the challenge of audio-visual synchronization in "in-the-wild" videos where cues are intermittent, such as a single dog bark or a distant wood chop. Unlike traditional models optimized for dense signals like talking heads, SparseSync uses a SparseSelector architecture to compress long temporal sequences into a small, fixed set of learnable tokens. This reduces the attention cost from quadratic to linear in sequence length, allowing the model to process high-resolution, long-duration clips without sacrificing accuracy. Trained on the VGGSound-Sparse dataset, the system achieves state-of-the-art performance in predicting precise temporal offsets even when the synchronization signal is sparse in both space and time.
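The compression step described above can be sketched as cross-attention from a small set of learnable query tokens onto a long feature sequence. The snippet below is a minimal NumPy illustration of that idea, not SparseSync's actual implementation: the function name `sparse_select` and the sizes (1024-step sequence, 8 selector tokens) are hypothetical choices for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_select(features, queries):
    """Cross-attention from k query tokens onto an N-step feature
    sequence. Cost is O(N*k) in sequence length, versus the O(N^2)
    of full self-attention over the sequence itself."""
    d = features.shape[-1]
    attn = softmax(queries @ features.T / np.sqrt(d))  # (k, N) weights
    return attn @ features                             # (k, d) summary

rng = np.random.default_rng(0)
N, k, d = 1024, 8, 64                 # long sequence -> 8 tokens
feats = rng.normal(size=(N, d))       # e.g. per-frame features
queries = rng.normal(size=(k, d))     # learnable in a real model
compressed = sparse_select(feats, queries)
print(compressed.shape)               # (8, 64)
```

Because the number of selector tokens stays fixed as the clip grows, any later (quadratic) processing operates on k tokens rather than N frames, which is what makes long clips tractable.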