In this episode of the AI Engineering Podcast, Carter Huffman, co-founder and CTO of Modulate, discusses the engineering behind low-latency, high-accuracy voice AI. He explains why voice is a uniquely challenging modality, given its rich non-textual signals like tone, emotion, and context, and why simple speech-to-text-to-speech pipelines can't capture the necessary nuance. Carter introduces Modulate's Ensemble Listening Model (ELM) architecture, which uses dynamic routing and cost-based optimization to achieve scalability and precision across varied audio environments. He covers topics such as reliability under distributed-systems constraints, watchdogging with periodic model checks, structured long-horizon memory for conversations, and the trade-offs that make ensemble approaches compelling for repeated tasks at scale. Carter also shares insights on how ELMs generalize beyond voice, draws parallels to database query planners and mixture-of-experts, and discusses strategies for observability and evaluation in complex processing pipelines.
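To make the "dynamic routing and cost-based optimization" idea concrete, here is a minimal sketch of how an ensemble router might pick among specialized listeners. Everything here (the `Listener` shape, the latency estimates, the confidence threshold) is a hypothetical illustration, not Modulate's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Listener:
    """A specialized model with a cheap suitability estimate."""
    name: str
    est_latency_ms: float                 # estimated processing cost
    confidence: Callable[[bytes], float]  # cheap check: can I handle this segment?
    run: Callable[[bytes], str]           # the expensive inference call

def route(segment: bytes, listeners: list[Listener],
          min_confidence: float = 0.7) -> str:
    # Cost-based routing: try the cheapest model first, and only escalate
    # to slower models when the cheap one isn't confident enough.
    for listener in sorted(listeners, key=lambda l: l.est_latency_ms):
        if listener.confidence(segment) >= min_confidence:
            return listener.run(segment)
    # Fallback: if nothing clears the bar, use the most capable (slowest) model.
    return max(listeners, key=lambda l: l.est_latency_ms).run(segment)
```

The analogy to a database query planner is direct: each candidate "plan" (model) has an estimated cost, and the router picks the cheapest one expected to produce an acceptable result.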
Announcements
- Hello and welcome to the AI Engineering Podcast, your guide to the fast-moving world of building scalable and maintainable AI systems
- Unlock the full potential of your AI workloads with a seamless and composable data infrastructure. Bruin is an open source framework that streamlines integration from the command line, allowing you to focus on what matters most - building intelligent systems. Write Python code for your business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. With native support for ML/AI workloads, Bruin empowers data teams to deliver faster, more reliable, and scalable AI solutions. Harness Bruin's connectors for hundreds of platforms, including popular machine learning frameworks like TensorFlow and PyTorch. Build end-to-end AI workflows that integrate seamlessly with your existing tech stack. Join the ranks of forward-thinking organizations that are revolutionizing their data engineering with Bruin. Get started today at aiengineeringpodcast.com/bruin, and for dbt Cloud customers, enjoy a $1,000 credit to migrate to Bruin Cloud.
- Your host is Tobias Macey and today I'm interviewing Carter Huffman about his work building an ensemble approach to low latency voice AI
Interview
- Introduction
- How did you get involved in machine learning?
- Can you describe the "Ensemble Listening" approach and the story behind why Modulate moved away from monolithic architectures?
- When designing a real-time voice system, how do you handle the routing logic between specialized models without blowing your latency budget?
- What does the "gatekeeper" or routing layer actually look like in code?
- You’ve mentioned "evals that don’t lie." How do you build a validation pipeline for noisy, adversarial voice data that catches regressions that a simple word-error-rate (WER) might miss?
- In an ensemble of models, a failure in one specialized node might not crash the system, but it can degrade the output quality. How do you monitor for these "silent failures" in real-time without introducing massive overhead?
- For many teams, the default is to call an API for a frontier model. At what point in the scaling or latency curve does it become technically (or economically) necessary to swap a general LLM for a suite of specialized, smaller models?
- How do you track the real-world costs associated with the technical and human overhead of this more complex system?
- What are the most interesting, innovative, or unexpected ways that you have seen orchestrated ensembles used in live conversation environments?
- What are the most interesting, unexpected, or challenging lessons that you have learned while managing the lifecycle of multiple specialized models simultaneously?
- When is an ensemble approach the wrong choice? (e.g., At what level of complexity or throughput is the overhead of orchestration more trouble than it’s worth?)
- What do you have planned for the future of Ensemble Listening Models?
- Are we looking at self-optimizing routers, or perhaps moving these ensembles closer to the edge?
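The "silent failure" question above can be sketched with a simple canary-based watchdog: periodically run a known input through each model in the ensemble and flag any node whose output drifts from the expected answer. This is an illustrative pattern only; the function and structure here are hypothetical, not Modulate's monitoring code.

```python
from typing import Callable

def watchdog(models: dict[str, Callable[[str], str]],
             canaries: dict[str, tuple[str, str]]) -> list[str]:
    """Return the names of models whose canary output no longer matches.

    models   -- model name -> inference function
    canaries -- model name -> (known input, expected output)
    """
    degraded = []
    for name, model in models.items():
        canary_input, expected = canaries[name]
        if model(canary_input) != expected:
            # The model still "works" (no crash), but its output has
            # drifted: exactly the silent-failure mode the ensemble
            # needs to surface before it degrades end-to-end quality.
            degraded.append(name)
    return degraded
```

Run on a timer (or a small sample of live traffic), this adds only one extra inference per model per check interval, keeping monitoring overhead low.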
Contact Info
Parting Question
- From your perspective, what are the biggest gaps in tooling, technology, or training for AI systems today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@aiengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Links
- Modulate
- NASA Jet Propulsion Laboratory
- OpenAI Whisper
- Multi-Armed Bandit
- Cost-Based Optimizer
- GPT-5
- LLM Attention
- Transformer Architecture
- Mixture of Experts
- Dilated Convolution
- WaveNet
The intro and outro music is from Hitman's Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0