NVIDIA Introduces Cache-Aware Streaming ASR for High-Concurrency Voice Agents
NVIDIA has released Nemotron Speech ASR, an open automatic speech recognition model designed to address scalability and latency challenges in real-time voice AI applications. The model introduces a cache-aware streaming architecture intended to replace traditional buffered inference methods.
Reported Technical Innovation
The core advancement is a shift from buffered inference, where overlapping audio windows are repeatedly reprocessed, to a cache-aware system. This new approach reuses past computations to process only new audio input, which NVIDIA claims eliminates redundant calculations. The model is based on a FastConformer architecture with 8x downsampling, processing fewer tokens per second to reduce memory footprint and increase throughput.
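The compute savings of the cache-aware approach can be seen with a toy cost model. The sketch below compares how much audio a buffered decoder re-encodes per minute of speech versus a cache-aware one; the chunk and window sizes are illustrative assumptions, not values from NVIDIA's release, and the real model APIs differ.

```python
# Toy comparison of buffered vs. cache-aware streaming inference cost.
# All constants are illustrative assumptions for this sketch.

CHUNK_MS = 80     # new audio arriving per streaming step
WINDOW_MS = 560   # full window a buffered decoder re-encodes each step

def buffered_cost(total_ms: int) -> int:
    """Milliseconds of audio encoded when each step re-processes a full
    overlapping window (old audio is encoded again and again)."""
    steps = total_ms // CHUNK_MS
    return sum(min(WINDOW_MS, (i + 1) * CHUNK_MS) for i in range(steps))

def cache_aware_cost(total_ms: int) -> int:
    """Milliseconds of audio encoded when cached encoder states let each
    chunk be processed exactly once."""
    return total_ms

if __name__ == "__main__":
    minute = 60_000
    print(buffered_cost(minute), cache_aware_cost(minute))
```

For one minute of audio this toy model encodes roughly 7x more audio in the buffered case, which is the redundancy the cache-aware design removes.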
Claimed Performance Benefits
According to NVIDIA's benchmarks, the architecture delivers:
· Increased Concurrency: Up to 3x higher concurrent stream support on an NVIDIA H100 GPU compared to previous baselines.
· Stable Latency: Maintains linear latency scaling under high load, avoiding the "latency drift" where delays accumulate with more users.
· Runtime Flexibility: Offers dynamically configurable latency modes (80ms to 1.12s) without model retraining, allowing a trade-off between speed and word error rate (WER).
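The runtime-selectable latency modes can be pictured as choosing how many encoder frames to accumulate before each decode step. The helper below is a hypothetical sketch: the mode names and intermediate values are assumptions interpolated from the reported 80 ms to 1.12 s range, and the actual NeMo configuration interface may differ.

```python
# Hypothetical latency-mode selector; names and values are assumptions
# based on the reported 80 ms - 1.12 s configurable range.

LATENCY_MODES_MS = {
    "ultra_low": 80,   # fastest response, highest WER
    "low": 160,
    "medium": 560,
    "high": 1120,      # most context, lowest WER
}

def frames_for_mode(mode: str, frame_ms: int = 80) -> int:
    """Number of 80 ms encoder frames to buffer before each decode step."""
    if mode not in LATENCY_MODES_MS:
        raise ValueError(f"unknown latency mode: {mode}")
    return LATENCY_MODES_MS[mode] // frame_ms
```

Because only this buffering choice changes, the same trained weights serve every mode, which is what "without model retraining" means in practice.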
Industry Validation and Implementation
The announcement includes reported validation from partners:
· Modal: Testing with 127 concurrent WebSocket clients showed a stable median delay of 182ms with minimal drift over a three-minute stream.
· Daily: Integration into a full voice-agent pipeline (with an LLM and TTS) resulted in a median time-to-final transcription of 24ms and a complete voice-to-voice loop under 900ms in local deployment.
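The sub-900 ms voice-to-voice figure is easiest to interpret as a latency budget across pipeline stages. In the sketch below, only the 24 ms ASR time-to-final and the 900 ms budget come from the article; the LLM, TTS, and network figures are hypothetical placeholders.

```python
# Illustrative latency budget for a full voice-to-voice agent loop.
# Only asr_time_to_final (24 ms) and BUDGET_MS (900 ms) are reported
# figures; the other stage timings are assumptions for this sketch.

BUDGET_MS = 900

STAGES_MS = {
    "asr_time_to_final": 24,   # reported by Daily
    "llm_first_token": 450,    # assumption
    "tts_first_audio": 250,    # assumption
    "network_overhead": 100,   # assumption
}

def headroom_ms(stages: dict[str, int], budget: int = BUDGET_MS) -> int:
    """Milliseconds left in the budget after summing all pipeline stages."""
    return budget - sum(stages.values())
```

The point of the exercise: with ASR contributing only tens of milliseconds, the loop's budget is dominated by the LLM and TTS stages.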
NVIDIA positions this technology as a foundational step for scalable, real-time conversational agents, moving beyond systems designed for offline transcription. The model and tools are available on Hugging Face and through NVIDIA NeMo.
About the Author

Aremi Olu
Aremi Olu is an AI news correspondent from Nigeria.