May 9, 2026

The model choices that defined our voice agent

Following up on last week's post about the voice agent I shipped for two Pilates studios.

I said the model was the easy part. At the LLM layer that still holds. But the "model" decision is really three decisions (STT, LLM, and runtime), and together they are where I burned the most time.

Three calls that mattered:

1) Speech-to-text was the hardest pick.
Tested Azure, Deepgram (nova-2, then nova-3), and Speechmatics on Portuguese speech. Speechmatics won on accuracy. Lesson under the lesson: always wire a fallback. STT models drift, regions blip, vendors deprecate. Design for "if STT-A fails, route to STT-B" from day one, not after the first 3 a.m. incident.
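
A minimal sketch of the "if STT-A fails, route to STT-B" idea, not the production code. The `Transcriber` type, the provider names, and the rule that an empty transcript counts as a failure are all assumptions made for illustration; real vendor SDK calls would sit behind each callable.

```python
from typing import Callable, Sequence

# Hypothetical type: a provider takes raw audio bytes and returns a transcript.
Transcriber = Callable[[bytes], str]


class AllProvidersFailed(Exception):
    """Raised when every STT provider in the chain fails."""


def transcribe_with_fallback(
    audio: bytes,
    providers: Sequence[tuple[str, Transcriber]],
) -> tuple[str, str]:
    """Try each provider in order and return (provider_name, transcript)."""
    errors: list[str] = []
    for name, transcribe in providers:
        try:
            text = transcribe(audio)
        except Exception as exc:  # broad on purpose: any vendor error triggers fallback
            errors.append(f"{name}: {exc}")
            continue
        if text.strip():
            return name, text
        # Assumption: an empty transcript is treated like a failure.
        errors.append(f"{name}: empty transcript")
    raise AllProvidersFailed("; ".join(errors))
```

Called as `transcribe_with_fallback(audio, [("stt_a", primary), ("stt_b", backup)])`, it returns which provider answered, which is worth logging so a quiet shift to the backup shows up in metrics. The sketch does not enforce a per-provider timeout; on a voice path a slow primary hurts as much as a failed one, so a real version would likely need one.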

2) Picking the LLM is a prompt-size problem disguised as a model problem.
A big prompt with multi-tool reasoning needs both robustness and speed. Tested gpt-5-mini, gpt-4o-mini, and gpt-4o. Landed on gpt-4o: the mini variants lost tool-call accuracy as the system prompt grew. Small prompt? A mini will probably do the job. Extreme complexity? An orchestration framework such as LangChain becomes a credible option.
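
One way to make "tool-call accuracy dropped as the prompt grew" measurable is a small labelled set scored per model and per prompt version. This is a sketch under assumptions: `Case`, `ToolPicker`, and `tool_call_accuracy` are hypothetical names, and `pick_tool` stands in for whatever wrapper calls the model and extracts the chosen tool name.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    user_message: str
    expected_tool: str


# Hypothetical wrapper: (system_prompt, user_message) -> name of the tool the model chose.
ToolPicker = Callable[[str, str], str]


def tool_call_accuracy(pick_tool: ToolPicker, system_prompt: str, cases: list[Case]) -> float:
    """Fraction of cases where the model picked the expected tool."""
    if not cases:
        raise ValueError("need at least one case")
    hits = sum(
        1 for case in cases
        if pick_tool(system_prompt, case.user_message) == case.expected_tool
    )
    return hits / len(cases)
```

Running the same cases against each candidate model with the short prompt and again with the full prompt gives a number to compare instead of a feeling.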

3) Cloud runtime: Lambda kept warm beat ECS Fargate — for our volume.
A voice latency budget of ~900 ms round-trip doesn't tolerate cold starts. Tested ECS Fargate vs Lambda and picked Lambda with two keep-warm techniques: provisioned concurrency on critical paths, plus periodic warm-up pings on the rest. At our call volume that was cheaper than Fargate and fast enough for the budget. At higher sustained volumes Fargate flips back to being the right answer, because always-on cost amortizes better when traffic is constant.
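
The warm-up ping half of that can be sketched as a handler that returns early when it sees a ping. The `{"warmup": true}` payload is an assumed convention (for example, a constant JSON input on a scheduled rule), and `handle_turn` is a hypothetical placeholder for the real call logic.

```python
# Module scope runs once per cold start, so expensive setup
# (SDK clients, prompt loading) belongs here rather than in the handler.
_invocations = 0


def handle_turn(event: dict) -> str:
    """Hypothetical placeholder for the real voice-turn logic."""
    return f"echo: {event.get('text', '')}"


def handler(event: dict, context: object) -> dict:
    """Lambda entry point: answer warm-up pings cheaply, otherwise do real work."""
    global _invocations
    _invocations += 1
    # Assumed convention: the scheduled ping sends {"warmup": true}.
    if event.get("warmup"):
        # "cold" is True only if this ping was the first call in this environment.
        return {"warmed": True, "cold": _invocations == 1}
    return {"statusCode": 200, "body": handle_turn(event)}
```

A single scheduled ping only keeps one execution environment warm, so concurrent calls can still hit cold starts. That is the gap provisioned concurrency covers on the critical paths.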

Gen-AI in production isn't picking one model. It's picking three, with fallback paths, against a millisecond budget you can't fake.

Next post: the thing that matters even more than prompt engineering — evaluation.

What model decision cost you the most time?

P.S. New tech post every Wednesday.

#GenAI #AppliedAI #SoftwareEngineering