NVIDIA Introduces Cache-Aware Streaming ASR for High-Concurrency Voice Agents

Aremi Olu

Updated:
January 9, 2026

NVIDIA has released Nemotron Speech ASR, an open automatic speech recognition model designed to address scalability and latency challenges in real-time voice AI applications. The model introduces a cache-aware streaming architecture intended to replace traditional buffered inference methods.


Reported Technical Innovation

The core advancement is a shift from buffered inference, in which overlapping audio windows are repeatedly reprocessed, to a cache-aware system. The new approach reuses past computations and processes only the newly arrived audio, which NVIDIA claims eliminates redundant calculation. The model is based on a FastConformer architecture with 8x downsampling, producing fewer tokens per second of audio to reduce memory footprint and increase throughput.
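The difference between the two strategies can be sketched with a simple compute-count model. This is an illustrative toy, not NVIDIA's implementation: the window size, hop size, and per-frame "cost" unit are all assumptions chosen to make the comparison visible.

```python
# Toy cost model contrasting buffered inference (re-encode the whole
# overlapping window every step) with cache-aware streaming (encode
# each frame once, reusing cached states). All sizes are illustrative.

WINDOW = 8  # frames per buffered window (assumed)
HOP = 2     # new frames arriving per streaming step (assumed)

def buffered_cost(total_frames):
    """Buffered inference: every step re-encodes the full window."""
    cost = 0
    for _start in range(0, total_frames - WINDOW + 1, HOP):
        cost += WINDOW  # all frames in the window are reprocessed
    return cost

def cache_aware_cost(total_frames):
    """Cache-aware streaming: past encoder states are cached, so each
    step after the first only processes the newly arrived frames."""
    cost = WINDOW  # the first window is encoded once
    steps = (total_frames - WINDOW) // HOP
    cost += steps * HOP  # afterwards, only new frames are encoded
    return cost

if __name__ == "__main__":
    total = 64
    print("buffered:", buffered_cost(total))      # → 232 frame-encodes
    print("cache-aware:", cache_aware_cost(total))  # → 64 frame-encodes
```

Under this toy model the cache-aware path encodes each frame exactly once, while the buffered path's cost grows with the window/hop overlap ratio, which is the redundancy the article says the new architecture removes.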

Claimed Performance Benefits

According to NVIDIA's benchmarks, the architecture delivers:

· Increased Concurrency: Up to 3x higher concurrent stream support on an NVIDIA H100 GPU compared to previous baselines.

· Stable Latency: Maintains linear latency scaling under high load, avoiding the "latency drift" where delays accumulate with more users.

· Runtime Flexibility: Offers dynamically configurable latency modes (80ms to 1.12s) without model retraining, allowing a trade-off between speed and word error rate (WER).
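The runtime-configurable latency modes suggest a simple selection policy: given an application's latency budget, pick the largest mode that fits, since more look-ahead generally improves WER. The sketch below is hypothetical; the 80 ms and 1.12 s endpoints come from the article, but the intermediate mode values and the helper itself are invented for illustration.

```python
# Hypothetical latency-mode picker. Endpoints (80 ms, 1120 ms) are from
# the article; the intermediate steps (160 ms, 480 ms) are assumed.
LATENCY_MODES_MS = [80, 160, 480, 1120]

def pick_latency_mode(budget_ms):
    """Return the largest supported latency mode that fits the budget.

    Larger modes give the model more audio context per emission, which
    typically trades higher latency for a lower word error rate (WER).
    """
    fitting = [m for m in LATENCY_MODES_MS if m <= budget_ms]
    if not fitting:
        raise ValueError(f"no supported mode fits a {budget_ms} ms budget")
    return max(fitting)

if __name__ == "__main__":
    print(pick_latency_mode(500))  # → 480
    print(pick_latency_mode(80))   # → 80
```

Because the article says modes can be switched without retraining, a deployment could in principle apply this choice per-session rather than per-model.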


Industry Validation and Implementation

The announcement includes reported validation from partners:

· Modal: Testing with 127 concurrent WebSocket clients showed a stable median delay of 182ms with minimal drift over a three-minute stream.

· Daily: Integration into a full voice-agent pipeline (with an LLM and TTS) resulted in a median time-to-final transcription of 24ms and a complete voice-to-voice loop under 900ms in local deployment.

NVIDIA positions this technology as a foundational step for scalable, real-time conversational agents, moving beyond systems designed for offline transcription. The model and tools are available on Hugging Face and through NVIDIA NeMo.


About the Author

Aremi Olu

Aremi Olu is an AI news correspondent from Nigeria.
