Show HN: I built a sub-500ms latency voice agent from scratch

jedberg · 2026-03-03T01:04:08 1772499848

Oh, this is really interesting to me. This is what I worked on at Amazon Alexa (and have patents on).

An interesting fact I learned at the time: The median delay between human speakers during a conversation is 0ms (zero). In other words, in many cases, the listener starts speaking before the speaker is done. You've probably experienced this, and you talk about how you "finish each other's sentences".

It's because your brain is predicting what they will say while they speak, and processing an answer at the same time. It's also why when they say what you didn't expect, you say, "what?" and then answer half a second later, when your brain corrects.

Fact 2: Humans expect a delay on their voice assistants, for two reasons. One reason is because they know it's a computer that has to think. And secondly, cell phones. Cell phones have a built in delay that breaks human to human speech, and your brain thinks of a voice assistant like a cell phone.

Fact 3: Almost no response from Alexa is under 500ms. Even the ones that are served locally, like "what time is it".

Semantic end-of-turn is the key here. It's something we were working on years ago, but didn't have the compute power to do it. So at least back then, end-of-turn was just 300ms of silence.
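
For readers who haven't seen it, silence-based end-of-turn can be sketched roughly like this. It is a minimal illustration, not Alexa's implementation: the 300ms threshold comes from the comment, while the 20ms frame size, the class name, and the per-frame `is_speech` flag (which a VAD would supply) are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class SilenceEndpointer:
    """Declares end-of-turn after a fixed run of silence. No semantics involved."""
    silence_ms: int = 300   # threshold mentioned in the comment
    frame_ms: int = 20      # assumed audio frame size
    _quiet_ms: int = 0
    _heard_speech: bool = False

    def feed(self, is_speech: bool) -> bool:
        """Feed one VAD decision per frame; returns True when the turn ends."""
        if is_speech:
            self._heard_speech = True
            self._quiet_ms = 0
            return False
        if not self._heard_speech:
            return False  # leading silence before the user has said anything
        self._quiet_ms += self.frame_ms
        if self._quiet_ms >= self.silence_ms:
            # Reset so the next utterance starts from a clean state.
            self._heard_speech = False
            self._quiet_ms = 0
            return True
        return False
```

Note that any mid-thought pause longer than the threshold ends the turn, which is exactly the weakness a semantic detector is meant to fix.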

This is pretty awesome. It's been a few years since I worked on Alexa (and everything I wrote has been talked about publicly). But I do wonder if they've made progress on semantic detection of end-of-turn.

Edit: Oh yeah, you are totally right about geography too. That was a huge unlock for Alexa. Getting the processing closer to the user.

emmelaich · 2026-03-03T04:20:18 1772511618

Regarding 2, I believe that talking on mobile phones drives older people crazy. They remember talking on normal land lines when there was almost no latency at all. The thing is -- they don't know why they don't like it.

sinuhe69 · 2026-03-03T05:53:32 1772517212

Yeah, I remember the time when we had to use satellites to connect. The long delay was really annoying and so unusual that most people without "training" could not even use the phone for conversation and just wasted the dollars.

jermaustin1 · 2026-03-03T13:04:26 1772543066

A former boss of mine took off to Everest for a month leaving me (a 22 year old, at the time) in charge of the office. I was out to dinner with my now wife when I got a call from a very long phone number I didn't recognize, so I ignored it. I then got another one right after, and picked it up. It was my boss, he needed me to log into his personal email to grab a phone number for the medical insurance he purchased for the trip, because he had been vomiting for days due to altitude sickness, and needed a medical evacuation.

That was the most stressfully hard to use phone call I've ever had. The delay was nearly 10 seconds, and eventually I just said I was only going to speak yes or no, if he needed a longer answer he needed to shut up. And that worked. We no longer talked over each other.

BobaFloutist · 2026-03-03T15:53:09 1772553189

Maybe you bring back radio etiquette and just say "over" at the end of every thought?

47282847 · 2026-03-03T14:15:39 1772547339

> The median delay between human speakers during a conversation is 0ms (zero). In other words, in many cases, the listener starts speaking before the speaker is done.

This reminds me of a great diversity training at a previous employer, where we dug into the different expectations of when and how to take your turn in conversation and how that can create a lot of friction just from different cultural/familial habits. In my family, we’re expecting to talk over each other and it’s not offensive at all to do so, whereas some of my friends really get upset if we don’t take clear turns, a mode which would cause high levels of irritation in my family (and still does in me).

mywacaday · 2026-03-03T11:22:23 1772536943

No. 2 is interesting. Our national lottery in Ireland has an app that lets you scan the barcode on your ticket to check whether you have won. At some stage they updated the app so that the scan picks up the barcode even before you center it on the screen and tells you instantly if you have lost or won. I thought it was my IT background that made me uncomfortable with it happening so fast. I wonder what other examples like this exist, where the result or action being too fast causes doubt in the user?

GrayShade · 2026-03-03T12:21:19 1772540479

The Signal device linking feature is just as fast. It's partly a trick -- it will look for QR codes even outside the central area, so under good conditions it can get a read before you even get a rough orientation.

nicktikhonov · 2026-03-03T01:14:33 1772500473

This is fascinating, thanks for sharing! I wonder why amazon/google/apple didn't hop on the voice assistant/agent train in the last few years. All 3 have existing products with existing users and can pretty much define and capture the category with a single over-the-air update.

jedberg · 2026-03-03T01:28:14 1772501294

Two main reasons:

1. Compute. It's easy to make a voice assistant for a few people. But it takes a hell of a lot of GPU to serve millions.

2. Guard Rails. All of those assistants have the ability to affect the real world. With Alexa you can close a garage or turn on the stove. It would be real bad if you told it to close the garage as you went to bed for the night and instead it turned on the stove and burned down the house while you slept. So you need some really strong guard rails for those popular assistants.

3. And a bonus reason: Money. Voice assistants aren't all that profitable. There isn't a lot of money in "what time is it" and "what's the weather". :)

mcbits · 2026-03-03T01:35:33 1772501733

> There isn't a lot of money in "what time is it" and "what's the weather". :)

- Alexa, what time is it?

- Current time is 5:35 P.M. - the perfect time to crack open a can of ice cold Budweiser! A fresh 12-pack can be delivered within one hour if you order now!

jedberg · 2026-03-03T01:39:10 1772501950

If your Alexa did that, how quickly would you box it up and send it to me? :)

I am serious though about having it sent to me: if anyone has an Alexa they no longer want, I'm happy to take it off your hands. I have eight and have never bought one. Having worked there I actually trust the security more than before I worked there. It was basically impossible for me, even as a Principal Engineer, to get copies of the Text to Speech of a customer and I literally never heard a customer voice recording.

stavros · 2026-03-03T03:12:51 1772507571

I'm puzzled by this conversation, because Amazon did get on the agent bandwagon with Alexa Plus (I have it, it's buggier than regular Alexa and it's all making me throw my Echos away since they can't even play Spotify reliably).

Also, my Alexa does advertise stuff to me when I talk to it. It's not Budweiser, but it'll try to upsell me on Amazon services all the time.

llbbdd · 2026-03-03T05:13:24 1772514804

I upgraded to Alexa+ and initially hated it but I've kept it because it's sooo much better at some things. This last December I bought a handful of smart plugs for my Christmas lights all around the house, and I did almost all the setup trivially over voice, e.g. fuzzy run-on stuff like this just worked on the first try:

- "Alexa, name the new unnamed outlet 'Living Room Lights', and the other unnamed one 'Stair Lights', then add them to a new group called 'Christmas Lights', and add the other three outlets as well"

- "Alexa, create a routine to turn off all the Christmas lights if there's nobody in the room and it's after 11pm"

- "Alexa, turn off all the Christmas lights except the tree in this room and the mantle"

That same fuzziness has definitely fucked up things that used to work more reliably like music playback though. Sometimes it works when I fall back to giving it more "robotic" commands in those cases but not always. They've also gone completely overboard with the cutesy responses because it's so trivial to do now ("I've set your spaghetti sauce timer for ten minutes. Happy to help with getting this evening's Italian-inspired dinner ready!")

stavros · 2026-03-03T10:30:35 1772533835

Hm yeah, that's helpful. For me it'll randomly stop or stutter when playing Spotify, it'll randomly not answer commands, it'll refuse to listen and let some other Alexa in another room reply, it's super janky.

I only use it for music, and use two commands, but apparently having this work correctly is too much to ask for these days.

jedberg · 2026-03-03T03:18:38 1772507918

> because Amazon did get on the agent bandwagon with Alexa Plus

Which just launched last year, about four years after ChatGPT had AI voice chat. And it costs extra money to cover the costs. And as you aptly point out, all the guardrails they had to put in made the experience less than ideal.

> Also, my Alexa does advertise stuff to me when I talk to it.

Yes, that is how they try to make money. And it's gotten worse. But how many times does it get you to buy something?

ghrl · 2026-03-03T06:30:24 1772519424

I would say that depends. When it tries to upsell Prime subscriptions into even more Amazon subscriptions I always interrupt it and say the command again so it stops, but a few times it told me "this item in your cart is on sale by some %" and that did make me buy the item.

derangedHorse · 2026-03-03T12:16:00 1772540160

Alexa Plus sucks. It takes way too long to respond even when given simple commands. I either had to turn it off or trash my Echo. Luckily there was an option to turn it off, but Amazon is on thin ice with me.

stavros · 2026-03-03T12:20:14 1772540414

I agree, I can't wait for the trial to end.

vidarh · 2026-03-03T12:42:09 1772541729

I already swear at mine when it tries to suggest setting up a routine for me or otherwise fails to just immediately shut up after answering my query.

Still not boxing them up. Though I now have a Pi with a HomeAssistant setup I'm trialling, so maybe that'll change.

alexastoplying · 2026-03-03T02:58:04 1772506684

What a way to throw away goodwill. I also worked there, and to get access to text you simply had to grab the DSN of your device, attest that it’s yours, and it gets put in a “pool” of devices that are tracked until removed. On each end you are basically waved through with no checks. This was usually done when debugging tricky UI bugs or new features as the request flowed through several microservices. I do not believe that a PE would not know this. And one with patents.

jedberg · 2026-03-03T03:09:19 1772507359

That was your own device. Not other customers.

argee · 2026-03-03T03:29:38 1772508578

Don't feed the trolls, Jeremy.

jedberg · 2026-03-03T03:37:19 1772509039

But they're hungry!

aabdi · 2026-03-03T08:38:15 1772527095

It was too hard. They all tried real hard and the models just kept failing. The models only got good enough ~1.5 years ago.

I mean, it's deployed now (Alexa+/Gemini), but it's expensive as hell and also kinda useless. Claude cowork/clawbot form factors are better.

Wrong form factor/use case really. People really wanna buy stuff using clawbot.

garettmd · 2026-03-03T15:37:51 1772552271

> It's because your brain is predicting what they will say while they speak, and processing an answer at the same time. It's also why when they say what you didn't expect, you say, "what?" and then answer half a second later, when your brain corrects.

that's super interesting. do you know of any resources to learn more about this phenomenon?

ismailmaj · 2026-03-03T08:50:31 1772527831

End-of-turn being just 300ms of silence is horrible, because I ended up intentionally um-ing to finish my thoughts before getting an answer.

It was difficult to detrain and that made me stop using voice chat with LLMs altogether.

mungoman2 · 2026-03-03T06:44:22 1772520262

I think you’re implying that it would be useful to have the LLM predict the end of the speaker’s speech, and continue with its reply based on that.

If, when the speaker actually stops speaking, there is a match vs predicted, the response can be played without any latency.

Seems like an awesome approach! One could imagine doing this prediction for the K most likely threads simultaneously, subject to the compute power available, and prune/branch as some threads become inaccurate.

YesBox · 2026-03-03T04:57:59 1772513879

Why don't voice assistants use a finishing word or sound?

People are already trained to say a name to start. Curious why the tech has avoided a cap?

“Alexa, what’s tomorrow’s weather [dada]?”

ktos · 2026-03-03T07:59:22 1772524762

"Alexa, what's tomorrow's weather? Over."

"It will be sunny with a high of 10 degrees. Over"

"Thank you. Over and out."

Just add some noise and Push-To-Talk and it will be great for ham radio enthusiasts!

miki_oomiri · 2026-03-03T07:49:41 1772524181

When I speak to an agent, Siri, or whatnot, I am always worried that they will assume I'm done talking when I'm thinking. Sometimes I need a many-seconds pause. Even maybe a minute… For Siri and such, I want to ask something simple: "Hey Siri, remind me to call dad tomorrow". Easy. But for Claude and such, I want to go on a long monologue (20s, a minute, multiple minutes).

To me, the best solution would be semantic + keyword + silence.

Hey Agent, blablablabla, thank you.

Hey Agent, blablablabla, please.

Hey Agent, blablablabla, oops cancel.
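
A rough sketch of how "semantic + keyword + silence" could be combined into one decision. Everything here is illustrative: the keyword lists are lifted from the examples above, the thresholds are made-up defaults, and `looks_complete` is a crude stand-in for a real semantic end-of-turn model.

```python
END_KEYWORDS = ("thank you", "please")   # closing words from the examples above
CANCEL_KEYWORDS = ("oops cancel",)

# Words that suggest the speaker is mid-thought (hypothetical heuristic).
DANGLING = {"and", "but", "or", "so", "because", "to", "the", "a", "um", "uh"}


def looks_complete(transcript: str) -> bool:
    """Stand-in for a semantic model: a turn ending on a dangling word is unfinished."""
    words = transcript.lower().rstrip(".?!, ").split()
    return bool(words) and words[-1] not in DANGLING


def decide(transcript: str, silence_ms: int,
           short_ms: int = 700, long_ms: int = 60_000) -> str:
    """Return 'respond', 'cancel', or 'wait' for the current partial transcript."""
    text = transcript.lower().strip().rstrip(".?!,")
    if text.endswith(CANCEL_KEYWORDS):
        return "cancel"
    if text.endswith(END_KEYWORDS):
        return "respond"        # explicit keyword: no need to wait for silence
    if silence_ms >= long_ms:
        return "respond"        # hard timeout so the agent never hangs forever
    if silence_ms >= short_ms and looks_complete(transcript):
        return "respond"        # short pause plus a finished-sounding sentence
    return "wait"
```

With these defaults, "Hey Agent, remind me to call dad tomorrow, please." responds immediately, while "I want to refactor the parser and" keeps waiting through a five-second pause.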

tmstieff · 2026-03-03T10:01:23 1772532083

I have the same issue. It gives this very weird minor sense of public speaking anxiety where I almost feel the need to write down what I'm about to say, which negates the whole purpose. Only solution I've found is using push-to-talk with some of the system wide STS applications.

iso1631 · 2026-03-03T08:03:33 1772525013

And suddenly your address book has changed the name from "Dad" to "Tomorrow"

layer8 · 2026-03-03T09:35:26 1772530526

Never skip an opportunity for a dad joke.

azinman2 · 2026-03-03T06:31:42 1772519502

Because that’s extremely unnatural.

russdill · 2026-03-03T05:26:08 1772515568

I've experimented with having different sized LLMs cooperating. The smaller LLM starts a response while the larger LLM is starting. It's fed the initial response so it can continue it.

Then there's the idea of having an LLM follow and continuously predict the speaker. It would allow a response to be continually generated. If the prediction is correct, the response can be started with zero latency.

Barbing · 2026-03-03T05:38:29 1772516309

Google seems to be experimenting with this with their AI Mode. They used to be more likely to send 10 blue links in response to complex queries, but now they may instead start you off with slop.

(Meanwhile at OpenAI: testing out the free ChatGPT, it feels like they prompted GPT 3.5 to write at length based on the last one or maybe two prompts)

russdill · 2026-03-03T07:27:21 1772522841

This is more of a "Are all the windows closed upstairs?"

"The windows upstairs..."

"...are all closed except for the bedroom window"

The first portion of the response requires a couple of seconds to play but only a few tens of milliseconds to start streaming using a small model. Currently I just break the small model's response off at whatever point will produce about enough time to spin up the larger model.

But all responses spin up both models.
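
The handoff described above could be sketched like this with `asyncio`. The two model functions are stubs with made-up sleep times standing in for real inference calls, and the `respond`/`play` names are assumptions for illustration, not the setup actually used here.

```python
import asyncio


async def small_model(prompt: str) -> str:
    """Stand-in for a small, fast model that only produces an opener."""
    await asyncio.sleep(0.01)   # pretend: tens of milliseconds
    return "The windows upstairs..."


async def large_model(prompt: str, opener: str) -> str:
    """Stand-in for the larger model; it is fed the opener so it can continue it."""
    await asyncio.sleep(0.05)   # pretend: slower spin-up
    return "...are all closed except for the bedroom window"


async def respond(prompt: str, play) -> None:
    opener = await small_model(prompt)
    # Kick off the large model as soon as the opener is known...
    continuation = asyncio.create_task(large_model(prompt, opener))
    # ...and play the opener, which buys time while the large model works.
    await play(opener)
    await play(await continuation)


async def demo() -> list:
    spoken = []

    async def play(text: str) -> None:
        await asyncio.sleep(0)  # yield to the event loop, like real audio playback would
        spoken.append(text)

    await respond("Are all the windows closed upstairs?", play)
    return spoken
```

Running `asyncio.run(demo())` yields the opener followed by the continuation, in that order.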

Barbing · 2026-03-03T15:54:03 1772553243

Whoa, that thing's fast. Very nice! Bet that's fun to play with, least probably fun the first time you saw it working :)

esperent · 2026-03-03T02:27:59 1772504879

> median delay

Does that mean that half of responses have a negative delay? As in, humans interrupt each other's sentences precisely half of the time?

jedberg · 2026-03-03T03:10:32 1772507432

Yes, about 1/2 of human speech is interrupting others.

vcxy · 2026-03-03T02:50:23 1772506223

I assume 0 delay is the minimum here, and a median of 0 means over half of the data are exactly 0.

jedberg · 2026-03-03T03:10:51 1772507451

No, about 1/2 of human speech is interrupting others.

vcxy · 2026-03-03T04:04:09 1772510649

oh, interesting, I assumed the data came from interruptions (that seemed obvious) but I'm surprised you had some specific negative measurements. How do you decide the magnitude of the number? Just counting how long both parties are talking?

jedberg · 2026-03-03T04:10:37 1772511037

To be clear, it wasn't my research, I got it from studying some linguistics papers. But it was pretty straightforward. If I am talking, and then you interrupt, and 300ms later I stop talking, then the delay is -300ms.

Same the other way. If I stop talking and then 300ms later you start talking, then the delay is 300ms.

And if you start talking right when I stop, the delay is 0ms.

You can get the info by just listening to recorded conversations of two people and tagging them.
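
As a worked example of that tagging scheme, here is a small sketch that computes the offset at each change of speaker and takes the median. The turn timings are invented purely to reproduce the three cases above; they are not data from any study.

```python
from statistics import median


def transition_offsets_ms(turns):
    """turns: list of (speaker, start_ms, end_ms), sorted by start time.

    Returns next_start - previous_end for every change of speaker.
    Negative means the next speaker started before the previous one stopped.
    """
    offsets = []
    for (spk_a, _, end_a), (spk_b, start_b, _) in zip(turns, turns[1:]):
        if spk_a != spk_b:
            offsets.append(start_b - end_a)
    return offsets


# Hypothetical hand-tagged conversation (milliseconds).
turns = [
    ("A", 0, 2000),
    ("B", 1700, 4000),  # B starts 300 ms before A stops
    ("A", 4300, 6000),  # A starts 300 ms after B stops
    ("B", 6000, 7500),  # B starts exactly when A stops
]

offsets = transition_offsets_ms(turns)
median_offset = median(offsets)
```

Here `offsets` is `[-300, 300, 0]` and the median is 0, so a zero median is compatible with roughly half the transitions being overlaps.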

esperent · 2026-03-03T05:17:47 1772515067

I assume there was a lot of variance? As in, some people interrupt others constantly and some do it rarely. Also probably a lot of adjustment depending on the situation, like depending on the relative status of the people, or when people are talking to a young child or non-native speaker.

All that to say, I'd imagine people are adaptable enough to easily handle 100ms+ delay when they know they're talking to an AI.

layer8 · 2026-03-03T09:25:28 1772529928

I disagree with fact 2, voice assistant latency is annoyingly slow. It often causes a conscious wait like “did it work or did it not?”. Cell phone delay is bad as well, it’s certainly not an expectation that carries over to other devices for me.

kvirani · 2026-03-03T10:54:01 1772535241

Isn't fact 2 just a now problem though? Will people's latency expectation not change over time, as it gradually goes down?

brody_hamer · 2026-03-03T00:23:04 1772497384

> Voice is a turn-taking problem

It really feels to me like there’s some low hanging fruit with voice that no one is capitalizing on: filler words and pacing. When the LLM notices a silence, it fills it with a contextually aware filler word while the real response generates. Just an “mhmm” or a “right, right”. It’d go so far to make the back and forth feel more like a conversation, and if the speaker wasn’t done speaking, there’s no talking-over-the-user garbage. (Say the filler word, then continue listening.)

nicktikhonov · 2026-03-03T00:29:12 1772497752

100% - I thought about that shortly after writing this up. One way to make this work is to have a tiny, lower latency model generate that first reply out of a set of options, then aggressively cache TTS responses to get the latency super low. Responses like "Hmm, let me think about that..." would be served within milliseconds.
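
A minimal sketch of the caching half of that idea: pre-render a fixed set of filler phrases once, then serve them from memory. The `synthesize` function is a hypothetical stand-in for a real (slow) TTS call, and the phrase list is only an example.

```python
import hashlib

FILLERS = ["Hmm, let me think about that...", "Mhmm.", "Right, right."]


def synthesize(text: str) -> bytes:
    """Hypothetical stand-in for a slow TTS request; returns fake audio bytes."""
    return hashlib.sha256(text.encode()).digest()


class FillerCache:
    """Renders each filler phrase once at startup and serves it from memory."""

    def __init__(self, phrases, tts=synthesize):
        self._tts = tts
        self.calls = 0  # number of real TTS calls made so far
        self._audio = {p: self._render(p) for p in phrases}

    def _render(self, phrase: str) -> bytes:
        self.calls += 1
        return self._tts(phrase)

    def get(self, phrase: str) -> bytes:
        audio = self._audio.get(phrase)
        if audio is None:
            # Cache miss: pay the TTS cost once, then remember the result.
            audio = self._audio[phrase] = self._render(phrase)
        return audio
```

The small model then only has to pick a key from `FILLERS`, so the audio lookup itself involves no network round trip.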

dotancohen · 2026-03-03T01:47:36 1772502456

Years ago I wrote a system that would generate Lucene queries on the fly and return results. The ~250 ms response time was deemed too long, so I added some information about where the response data originated, and started returning "According to..." within 50 ms of the end of user input. So the actual information got to the user after a longer delay, but it felt almost as fast as conversation.

eru · 2026-03-03T05:02:11 1772514131

See also any public speaker who starts every answer to a question from the audience (or in a verbal interview) with something like "that is a good question!" or "thank you for asking me that!"

Same strategy but employed by humans.

digitallyamar · 2026-03-03T15:56:23 1772553383

"You are absolutely right!"

Rohunyyy · 2026-03-03T05:26:39 1772515599

I am not sure about the low hanging fruit. It's not easy to make something robotic more human. Based on personal experience, I thought it would be low hanging fruit for text: take a simple LLM answer to anything and replace the "-" and the "it's not x, it's y" thingy that people almost always associate with LLMs with something else. Guess what? Now those answers sound even MORE robotic. Obviously this was a pet project that I cooked up in less than an hour, but the more I tried to make it human, the more it became AI.

starkparker · 2026-03-03T00:52:02 1772499122

Recently: https://blog.livekit.io/prompting-voice-agents-to-sound-more...

phkahler · 2026-03-03T01:04:26 1772499866

Better if it can anticipate its response before you're done speaking. That would be subject to change depending what the speaker says, but it might be able to start immediately.

fragmede · 2026-03-03T02:39:25 1772505565

it's bad enough having to deal with people that don't think before they speak, now we gotta make the computers do it as well‽

eru · 2026-03-03T05:03:18 1772514198

Huh, the grandparent comment was suggesting to have the computer think while you speak.

That's different from banning the computer from thinking before they speak, ain't it?

fragmede · 2026-03-03T08:15:55 1772525755

Thinking while I'm speaking means it isn't listening to everything I've said before thinking what to say. If I start my reply with "no, because...", and it's already formulating its response based on the "no" and not what comes after the because, then it's not thinking before it speaks.

eru · 2026-03-03T16:40:17 1772556017

The model can have a reasonably good guess of what you are trying to say, and use 'speculative' thinking. Just like CPUs use branch prediction.

In the common case, you say what the model predicted, and thus the model can use its speculative thinking. In the rare case where you deviated from the prediction, the model thinks from scratch.

(You can further cut down on latency, by speculatively thinking about the top two predictions, instead of just the top prediction. Just costs you more parallel compute.)

This is also all very similar to a chess player who thinks about her next turn, on your turn.
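
To make the branch-prediction analogy concrete, here is a toy sketch. The `predict` and `think` callables are placeholders for real models, and matching the final utterance by exact string is a big simplification (a real system would need fuzzy or semantic matching).

```python
from concurrent.futures import ThreadPoolExecutor


class SpeculativeResponder:
    """Precomputes replies for the top-k predicted utterances while the user talks."""

    def __init__(self, predict, think, k=2):
        self.predict = predict   # partial utterance -> ranked list of full guesses
        self.think = think       # full utterance -> reply (the expensive step)
        self.k = k
        self._pool = ThreadPoolExecutor(max_workers=k)
        self._futures = {}

    def on_partial(self, partial: str) -> None:
        """Called while the user is still speaking; starts speculative work."""
        for guess in self.predict(partial)[: self.k]:
            if guess not in self._futures:
                self._futures[guess] = self._pool.submit(self.think, guess)

    def on_final(self, utterance: str):
        """Returns (reply, was_speculated) once the user has finished."""
        hit = self._futures.pop(utterance, None)
        for other in self._futures.values():
            other.cancel()       # best effort: drop the mispredicted branches
        self._futures.clear()
        if hit is not None:
            return hit.result(), True
        return self.think(utterance), False  # misprediction: think from scratch
```

The cost is the same trade as in a CPU: wasted work on the branches that were not taken, in exchange for lower latency on the common path.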

DoctorOetker · 2026-03-03T05:32:56 1772515976

1) if the system misdetected end-of-turn and realized its error too late, and if we take the syllables that cover 90% of English and find a filler that starts with each one, the system could back out of its commitment to interrupt the speaker by turning the syllable it already started into background filler

2) if end-of-turn was detected very late, we can randomly select a first phonetic syllable, and then add it in the prompt that the reply should start with this syllable!

arcadianalpaca · 2026-03-03T14:56:51 1772549811

The filler word idea is interesting but I suspect the uncanny valley risk is super high. A mistimed "mhm" from a computer would probably feel way worse than just silence, because now your brain is pattern matching against human conversation and every small timing error stands out more

armcat · 2026-03-02T22:56:29 1772492189

This is an outstanding write up, thank you! Regarding LLM latency, OpenAI introduced web sockets in their Responses client recently so it should be a bit faster. An alternative is to have a super small LLM running locally on your device. I built my own pipeline fully local and it was sub second RTT, with no streaming nor optimisations https://github.com/acatovic/ova

nicktikhonov · 2026-03-02T22:58:13 1772492293

Very cool! starred and on my reading list. Would love to chat and share notes, if you'd like

alfalfasprout · 2026-03-03T00:39:59 1772498399

Also consider using Cerebras' inference APIs. They released a voice demo a while back and the latency of their model inference is insane.

ilaksh · 2026-03-03T07:06:38 1772521598

I tried to use Cerebras and it was unbeatable at first, but the client didn't want to pay $1300 a month and the $50/month or pay as you go was just not reliable. It would give service unavailable errors or falsely claim we were over our rate limit.

Also Groq is very fast, but the latency wasn't always consistent and I saw some very strange responses on a few calls that I had to attribute to quantization.

riquito · 2026-03-03T06:05:40 1772517940

You may be interested in gemini-2.5-flash-preview-tts

Text in, audio out, so you can merge in a single step LLM+TTS (streamable)

https://ai.google.dev/gemini-api/docs/models/gemini-2.5-flas...

lukax · 2026-03-02T22:25:12 1772490312

Or you could use Soniox Real-time (supports 60 languages) which natively supports endpoint detection - the model is trained to figure out when a user's turn ended. This always works better than VAD.

https://soniox.com/docs/stt/rt/endpoint-detection

Soniox also wins the independent benchmarks done by Daily, the company behind Pipecat.

https://www.daily.co/blog/benchmarking-stt-for-voice-agents/

You can try a demo on the home page:

https://soniox.com/

Disclaimer: I used to work for Soniox

Edit: I commented too soon. I only saw VAD and immediately thought of Soniox which was the first service to implement real time endpoint detection last year.

nicktikhonov · 2026-03-02T22:29:10 1772490550

If you read the post, you'll see that I used Deepgram's Flux. It also does endpointing and is a higher-level abstraction than VAD.

lukax · 2026-03-02T22:37:11 1772491031

Sorry, I commented too soon. Did you also try Soniox? Why did you decide to use Deepgram's Flux (English only)?

nicktikhonov · 2026-03-02T22:41:31 1772491291

I didn't try Soniox, but I made a note to check it out! I chose Flux because I was already using Deepgram for STT and just happened to discover it when I was doing research. It would definitely be a good follow-up to try out all the different endpointing solutions to see what would shave off additional latency and feel most natural.

Another good follow-up would be to try PersonaPlex, Nvidia's new model that would completely replace this architecture with a single model that does everything:

https://research.nvidia.com/labs/adlr/personaplex/

satvikpendem · 2026-03-03T02:54:18 1772506458

I second Soniox as well, as a user. It really does do way better than Deepgram and others. If your app architecture is good enough then maybe replacing providers shouldn't be too hard.

satvikpendem · 2026-03-03T02:54:53 1772506493

I'm using them, what has it been like working there? I see they have some consumer products as well. I wonder how they get state of the art for such low prices over the competition.

modeless · 2026-03-02T22:49:27 1772491767

IMO STT -> LLM -> TTS is a dead end. The future is end-to-end. I played with this two years ago and even made a demo you can install locally on a gaming GPU: https://github.com/jdarpinian/chirpy, but concluded that making something worth using for real tasks would require training of end-to-end models. A really interesting problem I would love to tackle, but out of my budget for a side project.

cootsnuck · 2026-03-03T06:15:17 1772518517

I've been working solely on voice agents for the past couple years (and have worked at one of the frontier voice AI companies).

The cascading model (STT -> LLM -> TTS) is unlikely to go away anytime soon for a whole lot of reasons. A big one is observability. The people paying for voice agents are enterprises. Enterprises care about reliability and liability. The cascading model approach is much more amenable to specialization (rather than raw flexibility / generality) and auditability.

Organizations in regulated industries (e.g. healthcare, finance, education) need to be able to see what a voice agent "heard" before it tries to "act" on transcribed text, and same goes for seeing what LLM output text is going to be "said" before it's actually synthesized and played back.

Speech-to-Speech (end-to-end) models definitely have a place for more "narrative" use cases (think interviewing, conducting surveys / polls, etc.).

But from my experience from working with clients, they are clamoring for systems and orchestration that actually use some good ol' fashioned engineering and that don't solely rely on the latest-and-greatest SoTA ML models.

nicktikhonov · 2026-03-02T22:53:09 1772491989

If you're of that opinion, you'll enjoy the new stuff coming out from nvidia:

https://research.nvidia.com/labs/adlr/personaplex/

woodson · 2026-03-02T23:17:27 1772493447

You mean Moshi (https://github.com/kyutai-labs/moshi)? Since Personaplex is just a finetuned Moshi model.

mountainriver · 2026-03-03T00:04:30 1772496270

Yeah except moshi doesn’t sound good at all

ilaksh · 2026-03-03T07:11:12 1772521872

It just about works for our current use case but can't comprehend the concept of an outgoing call. So I am trying to fine tune it. Tricky thing is personaplex forked some of the kyutai code and has not integrated the LoRA stuff they added. So we tried to update personaplex with the fine tuning stuff. Going to find out tonight or tomorrow whether it's actually feasible when I finish debugging/testing.

rockwotj · 2026-03-03T03:01:49 1772506909

Fundamentally, the "guessing when it's your turn" thing needs to be baked into the model. I think the full duplex mode that Moshi pioneered is probably where the puck is going to end up: https://arxiv.org/abs/2410.00037

com2kid · 2026-03-03T01:07:13 1772500033

The advantage is being able to plug in new models to each piece of the pipeline.

Is it super sexy? No. But each individual type of model is developing at a different rate (TTS moves really fast, low latency STT/ASR moved slower, LLMs move at a pretty good pace).

eru · 2026-03-03T05:05:21 1772514321

You should probably split it up: an end-to-end model for great latency (especially for baked in turn taking), but under the hood it can call out to any old text based model to answer more intricate questions. You just need to teach the speech model to stall for a bit, while the LLM is busy.

Just use the same tricks humans are using for that.

donpark · 2026-03-03T01:39:14 1772501954

But I've read somewhere that KV cache for speech-to-speech model explodes in size with each turn which could make on-device full-duplex S2S unusable except for quick chats.

tmzt · 2026-03-03T02:38:24 1772505504

Gemini Nano is supposedly doing it on device. It looks like something similar should work with Apple GPU and ANE.

coppsilgold · 2026-03-03T06:06:34 1772517994

Some of the best current voice tokenizers achieve ~12 Hz, that's many more tokens than a regular LLM would use for ultimately the same content.
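
A back-of-the-envelope version of that comparison. The 12 Hz figure is from the comment; the speaking rate and tokens-per-word values are rough assumptions, and the sketch ignores that some codecs emit several codebook tokens per frame.

```python
SPEECH_TOKENS_PER_SEC = 12.0   # frame rate quoted in the comment
WORDS_PER_MIN = 150            # assumption: typical conversational pace
TEXT_TOKENS_PER_WORD = 1.3     # assumption: rough rule of thumb for English


def text_tokens_per_sec(wpm: float = WORDS_PER_MIN,
                        tokens_per_word: float = TEXT_TOKENS_PER_WORD) -> float:
    """Text tokens needed per second of speech at the given speaking rate."""
    return wpm / 60 * tokens_per_word


def speech_to_text_ratio(speech_rate: float = SPEECH_TOKENS_PER_SEC) -> float:
    """How many audio tokens are spent per text token for the same content."""
    return speech_rate / text_tokens_per_sec()


def tokens_for(seconds: float, rate: float) -> float:
    """Total tokens consumed by a turn of the given length."""
    return seconds * rate
```

Under these assumptions a one-minute turn costs about 720 audio tokens versus about 195 text tokens, a bit under 4x, which is also why the KV cache grows faster for speech-to-speech models.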

russdill · 2026-03-03T05:27:49 1772515669

At least running things locally, such a model completely blows up your latency

NickNaraghi · 2026-03-02T21:34:09 1772487249

Pretty exciting breakthrough. This actually mirrors the early days of game engine netcode evolution. Since latency is an orchestration problem (not a model problem) you can beat general-purpose frameworks by co-locating and pipelining aggressively.

Carmack's 2013 "Latency Mitigation Strategies" paper[0] made the same point for VR too: every millisecond hides in a different stage of the pipeline, and you only find them by tracing the full path yourself. Great find with the warm TTS websocket pool saving ~300ms, perfect example of this.

[0]: https://danluu.com/latency-mitigation/
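
For anyone curious what a warm connection pool looks like in its simplest form, here is a generic sketch. It is not the post's code: `connect` is a placeholder for whatever opens a TTS websocket, and a real pool would also need health checks and reconnects.

```python
import queue


class WarmPool:
    """Keeps connections open ahead of time so requests skip the handshake."""

    def __init__(self, connect, size: int = 2):
        self._connect = connect            # callable that opens one connection
        self._idle = queue.SimpleQueue()
        for _ in range(size):
            self._idle.put(connect())      # pay the setup cost up front

    def acquire(self):
        """Return a warm connection, or open a cold one if the pool is empty."""
        try:
            return self._idle.get_nowait()
        except queue.Empty:
            return self._connect()

    def release(self, conn) -> None:
        """Hand a connection back so the next turn can reuse it."""
        self._idle.put(conn)
```

The latency win comes entirely from moving connection setup off the critical path of a turn.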

stonelazy · 2026-03-03T06:02:06 1772517726

This is a really solid writeup. The streaming pipeline architecture, the detailed latency breakdown per stage are genuinely useful. Building the core turn-taking loop from scratch is such a good exercise, and you did an excellent job explaining why each part matters and where the actual bottlenecks live. Strongly recommend this to anyone who wants to understand what’s really going on under the hood of a voice agent.

The one spot where it feels a bit off is the "2x faster than Vapi" claim. Your system is a clean straight pipe: transcript -> LLM -> TTS -> audio. No tool calls, no function execution, no webhooks, no mid-turn branching.

Production platforms like Vapi are doing way more work on every single turn. The LLM might decide to call a tool—search a knowledge base, hit an API, check a calendar—which means pausing token streaming, executing the tool, injecting the result back into context, re-prompting the LLM, and only then resuming the stream to TTS. That loop can happen multiple times in a single turn. Then layer on call recording, webhook delivery, transcript logging, multi-tenant routing, and all the reliability machinery you need for thousands of concurrent calls… and you’re comparing two pretty different workloads.
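
The extra per-turn loop can be sketched as follows. This is a generic illustration, not Vapi's implementation: the reply dictionaries, the `llm` callable, and the `tools` mapping are all hypothetical shapes chosen for the example.

```python
def run_turn(llm, tools, messages, max_rounds: int = 5) -> str:
    """Call the model until it answers in text, executing tool calls in between."""
    for _ in range(max_rounds):
        reply = llm(messages)
        if reply["type"] == "text":
            return reply["content"]   # only now can tokens flow on to TTS
        # The model asked for a tool: run it and inject the result into context.
        result = tools[reply["name"]](**reply["args"])
        messages.append(
            {"role": "tool", "name": reply["name"], "content": str(result)}
        )
    raise RuntimeError("tool loop did not converge")
```

Each pass through the loop adds a full model round trip plus the tool's own latency before any audio can start, which is the work a straight pipe never has to do.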

The core value of the post is that deep dive into the orchestration loop you built yourself. If it had just been "here’s what I learned rolling my own from scratch," it would’ve been an unqualified win. The 2x comparison just needs a quick footnote acknowledging that the two systems aren’t actually doing the same amount of work per turn.

age123456gpg · 2026-03-02T22:51:14 1772491874

Hi all! Check out this Handy app https://github.com/cjpais/Handy - a free, open source, and extensible speech-to-text application that works completely offline.

I am using it daily to drive Claude and it works really well for me (much better than macOS dictation mode).

mister_tars · 2026-03-05T17:14:48 1772730888

The barge-in cancellation challenge is real — we've found that tracking the exact timing of when Twilio's media stream sends the interrupt signal vs when ElevenLabs actually stops generating is the key to isolating whether it's a network, provider, or orchestration-layer issue. Have you tried correlating Twilio's call event timestamps with ElevenLabs' generation logs?

I built something that automates exactly this kind of cross-provider investigation across Twilio + ElevenLabs + Deepgram — happy to share if you want

perelin · 2026-03-02T22:40:53 1772491253

Great writeup! For VAD did you use a headphone/mic combo, or an open mic? If open, how did you deal with the agent interrupting itself?

nicktikhonov · 2026-03-02T22:43:03 1772491383

I was using Twilio, and as far as I'm aware they handle any echoes that may arise. I'm actually not sure where in the telephony stack this is handled, but luckily I didn't see any issues or have to solve this problem myself.

ilaksh · 2026-03-03T07:24:22 1772522662

Just to mention, I have a similar solution on GitHub under my username runvnc, repo mindroot with plugins from repos mr_sip (should work with any SIP vendor although only tested on Telnyx), mr_eleven_stream or mr_pocket-tts (which is free since it runs on CPU), and an LLM plugin like ah_openrouter, ah_anthropic or mr_gemini.

I also have a setting in mr_sip to use gpt-realtime via plugin ah_openai, which is very low latency speech-to-speech but quite expensive.

But my client saw the Sesame demo page, and so now I am trying to fine tune PersonaPlex.

evara-ai · 2026-03-03T09:38:40 1772530720

Great writeup. I've been building production voice agents and automation systems with the same stack (Twilio + Deepgram + ElevenLabs + LLM APIs) for client-facing use cases — appointment booking, lead qualification, guest concierge, and workflow orchestration.

The "turn-taking problem, not transcription problem" framing is exactly right. We burned weeks early on optimizing STT accuracy when the actual UX killer was the agent jumping in mid-sentence or waiting too long. Switching from fixed silence thresholds to semantic end-of-turn detection was night and day.

One dimension I'd add: geography matters even more when your callers are in a different region than your infrastructure. We serve callers in India connecting to US-East, and the Twilio edge hop alone adds 150-250ms depending on the carrier. Region-specific deployments with caller-based routing helped a lot.

The barge-in teardown is the part most people underestimate. It's not just canceling LLM + TTS — if you have downstream automation (updating booking state, triggering webhook workflows, writing to DB), you need to handle the race condition where the system already committed to a response path that's now invalid. We had a bug where a barged-in appointment confirmation was still triggering the downstream booking pipeline.
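
To make that race concrete, here is a minimal sketch of one way to guard downstream side effects. It is not the code from our system: `TurnGuard`, `start_turn`, `barge_in`, and `commit` are hypothetical names, and the idea is only that every side effect carries the id of the turn that produced it and is dropped if a barge-in has since invalidated that turn.

```python
from dataclasses import dataclass, field


@dataclass
class TurnGuard:
    """Tracks which turn is current so stale side effects can be dropped.

    Hypothetical helper for illustration; not part of any library.
    """
    current_turn: int = 0
    committed: list = field(default_factory=list)

    def start_turn(self) -> int:
        # Each agent response gets a fresh turn id.
        self.current_turn += 1
        return self.current_turn

    def barge_in(self) -> None:
        # The caller interrupted: invalidate whatever turn is in flight.
        self.current_turn += 1

    def commit(self, turn_id: int, action: str) -> bool:
        # Run the side effect only if the turn that produced it is still current.
        if turn_id != self.current_turn:
            return False
        self.committed.append(action)
        return True


guard = TurnGuard()
turn = guard.start_turn()
guard.barge_in()  # caller interrupts the confirmation mid-sentence
booked = guard.commit(turn, "book_appointment")  # stale turn, so booking is skipped
print(booked)
```

In a real async system the check and the side effect would also need to be atomic (a lock, or a transactional write keyed on the turn id), otherwise the race just moves one step further down.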

docheinestages · 2026-03-02T22:58:04 1772492284

Does anyone know about a fully offline, open-source project like this voice agent (i.e. STT -> LLM -> TTS)?

nicktikhonov · 2026-03-02T23:02:17 1772492537

A friend built this, everything working in-browser:

https://ttslab.dev/voice-agent

numpad0 · 2026-03-03T11:45:17 1772538317

This is hardly a novel implementation of [stream responses and chunk on sentences] + [stop on VAD and memory hole the chat log] concept. This takes <1k vibecoded lines to replicate it with an all-local setup.
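
For reference, the "chunk on sentences" half can be sketched in a few lines. This is my own illustration, not the author's code; `chunk_sentences` is a made-up name, and the sentence rule (split after `.`, `!` or `?` followed by whitespace) is deliberately naive and will mis-split things like "Dr. Smith".

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., ! or ? followed by whitespace (naive on purpose).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def chunk_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Buffer streamed LLM tokens and yield complete sentences for TTS."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    # Flush whatever is left when the stream ends.
    if buffer.strip():
        yield buffer.strip()


stream = ["Hel", "lo there. ", "How are", " you? I am", " fine"]
sentences = list(chunk_sentences(stream))
print(sentences)
```

Each yielded sentence can go to TTS while the LLM is still generating the next one, which is where the latency win comes from.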

nfrench17 · 2026-03-03T17:44:38 1772559878

pipecat is the best! (imo) https://github.com/pipecat-ai/pipecat

loevborg · 2026-03-02T22:35:41 1772490941

Nice write-up, thanks for sharing. How does your hand-vibed python program compare to frameworks like pipecat or livekit agents? Both are also written in python.

nicktikhonov · 2026-03-02T22:45:37 1772491537

I'm sure LiveKit or similar would be best to use in production. I'm sure these libraries handle a lot of edge cases, or at least let you configure things quite well out of the box. Though maybe that argument will become less and less potent over time. The results I got were genuinely impressive, and of course most of the credit goes to the LLM. I think it's worth building this stuff from scratch, just so that you can be sure you understand what you'll actually be running. I now know how every piece works and can configure/tune things more confidently.

MbBrainz · 2026-03-02T21:31:44 1772487104

Love it! Solving the latency problem is essential to making voice ai usable and comfortable. Your point on VAD is interesting - hadn't thought about that.

kaonwarb · 2026-03-03T04:02:07 1772510527

One of the challenges with trying to achieve IRL human-level latency is that we rely on nonverbal cues for face-to-face turn-taking. See e.g. https://www.sciencedirect.com/science/article/pii/S001002772...

pritesh1908 · 2026-03-03T17:39:32 1772559572

Great writeup. The speaking vs listening framing is underrated. TTFT with Groq and colocation are both real wins that don't get talked about enough.

For anyone wanting this production ready out of the box, Dograh is an OSS project built on the same principles and goes much beyond ( https://github.com/dograh-hq/dograh ).

Groq, Flux (Deepgram), instant barge-in cancel, full streaming pipeline, etc., but also telephony, echo handling, tool calls for external services, variable extraction, and domain dictionary baked in. All the parts needed in production are already solved.

bachittle · 2026-03-03T12:32:57 1772541177

I'm running a local voice agent on a Mac Mini M4. Qwen ASR for STT and Qwen TTS on Apple Silicon via MLX, Claude for the LLM. No API costs besides the Claude subscription but the interesting part is the LLM is agentic because it's using Claude Code. It reads and writes files, spawns background agents, controls devices, all through voice.

The insights about VAD and streaming pipelines in this thread are exactly what I'm looking at for v2. Moving to a WebSocket streaming pipeline with proper voice activity detection would close the latency gap significantly, even with local models.

juliendorra · 2026-03-03T08:14:07 1772525647

Nice write-up! Though I think turn-taking is a very simplified model of conversation. There’s collaborative overlapping while the other person continues, there are all the confirmations that the listener agrees, there are the phatic messages that keep the "listening channel" open, and there are even completions (filling in a word or a name). None of these are turn-taking and should not be treated as such, yet the model should be able to produce and accept them. They are probably modeled poorly, or not at all, by a turn-taking process.

cootsnuck · 2026-03-03T05:55:28 1772517328

Yea, Deepgram Flux is the secret sauce. Doesn't get talked about much.

For anyone curious: https://flux.deepgram.com/

totetsu · 2026-03-03T06:08:24 1772518104

What is the difference between Flux’s end-of-turn detection and OpenAI's automatic turn detection in semantic mode?

cootsnuck · 2026-03-03T06:34:12 1772519652

In OpenAI's own words about semantic_vad:

> Chunks the audio when the model believes based on the words said by the user that they have completed their utterance.

Source: https://developers.openai.com/api/docs/guides/realtime-vad

OpenAI's Semantic mode is looking at the semantic meaning of the transcribed text to make an educated guess about where the user's end of utterance is.

According to Deepgram, Flux's end-of-turn detection is not just a semantic VAD (which inherently is a separate model from the STT model that's doing the transcribing). Deepgram describes Flux as:

> the same model that produces transcripts is also responsible for modeling conversational flow and turn detection.

[...]

> With complete semantic, acoustic, and full-turn context in a fused model, Flux is able to very accurately detect turn ends and avoid the premature interruptions common with traditional approaches.

Source: https://deepgram.com/learn/introducing-flux-conversational-s...

So according to them, end-of-turn detection isn't just based on the semantic content of the transcript (which makes sense given the latency), but also on the characteristics of the actual audio waveform itself.

Pipecat (an open source voice AI orchestration platform) seemingly does this as well with their native smart-turn turn detection model: https://github.com/pipecat-ai/smart-turn (minus the built-in transcription)

totetsu · 2026-03-03T13:59:28 1772546368

Thanks. Then maybe it’s similar to Moshi https://github.com/kyutai-labs/moshi?tab=readme-ov-file

recognity · 2026-03-03T10:48:14 1772534894

The insight about TTFT dominating everything resonates. We're seeing the same pattern in CLI tools — the perceived speed of AI features comes down to how fast you get the first useful output, not total processing time.

Curious about your semantic end-of-turn detection: are you using a separate lightweight model for that, or is it baked into the main LLM inference? That seems like the hardest part to get right without adding latency.

red2awn · 2026-03-03T12:46:08 1772541968

What's the SOTA open source or weight available turn taking model these days? I tried pipecat/smart-turn-v3 and the results are not good. It only works well when you say a short sentence in a clear voice. Anything else will cause it to wait indefinitely. Closed source API models are obviously a lot better but add network latency and the cost adds up.

foxes · 2026-03-03T01:51:09 1772502669

<think> I need to generate a Show HN: style comment to maximise engagement as the next step. Let's break this down:

First I'll describe the performance metrics and the architecture.

Next I'll elaborate on the streaming aspect and the geographical limitations important to the performance.

Finally the user asked me to make sure to keep the tone appropriate to Hacker News and to link their github – I'll make sure to include the link. </think>

nmstoker · 2026-03-03T00:22:58 1772497378

This was discussed 21 days ago:

https://news.ycombinator.com/item?id=46946705

upmind · 2026-03-03T00:55:24 1772499324

"extensively" = 2 comments?

nmstoker · 2026-03-03T01:30:02 1772501402

You're right, fixed it. I discussed it extensively with a colleague and that got conflated. It's a great article.

dotancohen · 2026-03-03T01:54:58 1772502898

  > "extensively" = 2 comments

Possibly GP has teenagers. Two comments is a pretty extensive discussion with teenagers ))

boznz · 2026-03-02T22:45:03 1772491503

"Voice is an orchestration problem" is basically correct. The two takeaways from this for me are

1. I wonder if it could be optimised more by just having a single language, and

2. How do we get around the problem of interference? Humans are good at conversation discrimination, i.e. listening while multiple conversations, TV, music, etc. are going on in the background. I've not had too much success with voice in noisy environments.

hosaka · 2026-03-03T03:38:52 1772509132

Depending on the TTS model being used latency can be reduced further yet with an LRU cache, fetching common phrases from cache instead of generating fresh with TTS.

However the naturalness of how it sounds will depend on how the TTS model works and whether two identical chunks of text will sound alike every generation.
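
A rough sketch of what I mean, assuming the TTS call can be wrapped as a plain function from text to audio bytes. `TTSCache` and `fake_tts` are hypothetical names for illustration; the fake stands in for a real provider call.

```python
from collections import OrderedDict
from typing import Callable


class TTSCache:
    """Small LRU cache in front of a text-to-speech function (illustrative)."""

    def __init__(self, synthesize: Callable[[str], bytes], max_items: int = 256):
        self._synthesize = synthesize
        self._max_items = max_items
        self._items = OrderedDict()  # normalized text -> audio bytes

    def get_audio(self, text: str) -> bytes:
        # Normalize case and whitespace so trivially different phrasings share an entry.
        key = " ".join(text.lower().split())
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            return self._items[key]
        audio = self._synthesize(text)
        self._items[key] = audio
        if len(self._items) > self._max_items:
            self._items.popitem(last=False)  # evict the least recently used entry
        return audio


calls = []


def fake_tts(text: str) -> bytes:
    # Stand-in for a real TTS request; records each call so hits are visible.
    calls.append(text)
    return text.encode("utf-8")


cache = TTSCache(fake_tts, max_items=2)
cache.get_audio("One moment please")
cache.get_audio("one  moment please")  # cache hit, no second synthesis
print(len(calls))
```

This only pays off for phrases that recur verbatim (greetings, "one moment please", confirmations), and it pins those phrases to a single rendition.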

nicktikhonov · 2026-03-03T09:18:07 1772529487

Yep. Seems like caching more broadly is something worth exploring next if I were to do a pt2.

ggm · 2026-03-03T03:40:38 1772509238

That's half a second delay. 0.4 to 0.5 seconds. That's the same as the delay in a GEO orbit satellite mediated phone conversation.

Perhaps I'm in an older cohort, but I remember this delay, and what it felt like sustaining a conversation with this class of delay.

(it's still a remarkable advance, but do bear in mind the UX)
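
Back-of-envelope for the GEO figure, using the standard geostationary altitude and the speed of light. This is the propagation floor for a satellite directly overhead; slant range and equipment delay only add to it.

```python
GEO_ALTITUDE_KM = 35_786            # geostationary altitude above the equator
SPEED_OF_LIGHT_KM_S = 299_792.458   # speed of light in vacuum

# One hop: ground -> satellite -> ground, i.e. one person's voice reaching the other.
one_hop_s = 2 * GEO_ALTITUDE_KM / SPEED_OF_LIGHT_KM_S

# Conversational round trip: your words go out, their reply comes back.
round_trip_s = 2 * one_hop_s

print(f"one hop: {one_hop_s * 1000:.0f} ms, round trip: {round_trip_s * 1000:.0f} ms")
```

That works out to roughly 240 ms per hop and a little under 480 ms before you can hear a reply, which is where the 0.4 to 0.5 second feel comes from.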

grayhatter · 2026-03-03T00:35:54 1772498154

You made, or you asked an LLM to generate?

nicktikhonov · 2026-03-03T00:41:56 1772498516

I'd say it was a collaboration. I had to hand-hold Claude quite a bit in the early stages, especially with architecture, and find the right services to get the outcome I wanted. But if you care most about where the code came from - it was probably 85-90% LLM, and that's fantastic given that the result is as performant as anything you'll be able to find out of the box.

eudamoniac · 2026-03-03T14:47:17 1772549237

This post is AI except you manually replaced the em dashes with hyphens

thatsadude · 2026-03-03T15:18:50 1772551130

On semantic VAD, I recommend https://github.com/pipecat-ai/smart-turn

melvinodsa · 2026-03-03T13:59:11 1772546351

Groq performance is really good. Were you using Llama for response generation? Did you try any voice synthesizers apart from APIs for low-latency voice generation?

kelvinjps10 · 2026-03-03T03:04:13 1772507053

The quality of the post was amazing. I'm not that interested in voice agents yet, but I was engaged through the whole post. And the little animation made it easier to understand the loop.

nicktikhonov · 2026-03-03T09:15:56 1772529356

Glad to hear! I built my blog on top of NextJS - it basically just renders .mdx files with contentlayer. One of the things I discovered is that you can easily vibe-code these explainer widgets. Seems like a perfect use case for vibe coding - each is a simple react component and I can keep iterating until I get it working just the way I like. And super easy to interleave with content. Seems like this could be an obvious feature addition to all the blogging platforms.

mjbonanno · 2026-03-03T13:35:27 1772544927

This is awesome! Exactly the kind of low-latency agent tooling I've been looking for. How are you handling long-term memory/context between calls?

swaminarayan · 2026-03-03T09:15:09 1772529309

How are you doing semantic end-of-turn detection without adding latency to the critical path? Is it a separate lightweight model or integrated into the LLM stream?

aanet · 2026-03-03T05:30:55 1772515855

The voice samples sound fantastic. The interruption handling is amazing. I felt you were talking to an actual person. It might have helped that he had a British accent :)

NitpickLawyer · 2026-03-03T06:13:32 1772518412

> The voice samples sound fantastic [...] I felt you were talking to an actual person.

I like to listen to space content when going to sleep. Channels like History of the Universe, Astrum, PBS space time, SEA, etc.

Lately there's been a bunch of new-ish channels that produce content in that space (heh) and I'm amazed at how good the voices sound. Sometimes it takes a few good minutes to figure out they're genai voices, they're that good. If it weren't for small mistakes I bet more than 80% of the general population wouldn't have a clue.

wordglyph · 2026-03-03T17:36:42 1772559402

What do you think about https://app.sesame.com/

yarivk · 2026-03-03T11:17:17 1772536637

This looks really interesting.

Curious how you handled latency and response time. Voice agents usually struggle with that.

Nice work.

CharlesLau · 2026-03-03T05:37:10 1772516230

I was surprised to notice that the GitHub repository's name is actually a Mandarin character: 说 (speak).

nicktikhonov · 2026-03-03T09:16:35 1772529395

Yep. I've been learning Chinese for the past 3 months, so the name was a fold-in inspiration from my other hobby :)

eru · 2026-03-03T04:57:57 1772513877

> [...] and no precomputed responses.

You could probably improve your metrics even more with those in the mix again?

nicktikhonov · 2026-03-03T09:14:14 1772529254

You're probably right, at least at scale this could help

waynerisner · 2026-03-03T02:36:15 1772505375

I am really curious about this for enunciation, articulation, and accessibility applications.

bronco21016 · 2026-03-03T02:28:33 1772504913

When someone is able to put something like this together on their own it leaves me feeling infuriated that we can’t have nice things on consumer hardware.

At a minimum Siri, Alexa, and Google Home should at least have a path to plug in a tool like this. Instead I’m hacking together conversation loops in iOS Shortcuts to make something like this style of interaction with significantly worse UX.

nicktikhonov · 2026-03-03T09:11:36 1772529096

I feel like you could get pretty far with a raspberry pi and microphone/speaker. I think the hard part is running a model that can detect a "Hey agent" on-device, so that it can run 24/7 and hand off to the orchestrator when it catches a real question/query.

bronco21016 · 2026-03-03T14:37:08 1772548628

I think you’re right. I’ve been seeing more and more DIY hardware setups popping up. There are even wake word models for hardware as low powered as the ESP32.

In the middle of moving though so probably have to wait before taking on hardware.

saghul · 2026-03-03T08:26:30 1772526390

Really nice writeup, thanks for sharing!

tete · 2026-03-03T14:16:37 1772547397

Training people to be rude. :D

medi8r · 2026-03-03T06:39:32 1772519972

Maybe we have a keyword to say we are done talking. Like "over to you". This may be better as it gives you thinking time.

Even a minute if you need it!

And you can get the agent to crunch when you are ready.

Imagine you speak. you need to look something up. find it. speak some more. then "over to you!"

The agent doesn't have to behave like a human and figure out when to butt in.

After all chat rooms and Slack also have realtime 2 way but we didn't worry about emulating that in agent chat. We can be convention breaking in agentic voice chat too.
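
A toy sketch of the keyword idea, assuming you already have a running transcript string from STT. `split_handoff` is a made-up name, and the phrase only counts when it ends the utterance, so "I'll hand it over to you later" does not trigger it.

```python
import re

# "over to you" counts only at the very end, optionally followed by punctuation.
HANDOFF = re.compile(r"\bover to you\b[\s.!?]*$", re.IGNORECASE)


def split_handoff(transcript: str):
    """Return (is_done, text_without_keyword) for the transcript so far."""
    match = HANDOFF.search(transcript)
    if match is None:
        return False, transcript
    # Strip the keyword so the LLM never sees it as part of the request.
    return True, transcript[: match.start()].rstrip(" ,")


done, text = split_handoff("Let me check... okay, it's Tuesday. Over to you!")
print(done, repr(text))
```

Until `is_done` is true the agent just keeps accumulating the transcript, however long the pauses are.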

nicktikhonov · 2026-03-03T09:13:05 1772529185

One thing you can get the LLM to do is to call a "skip turn" tool, which will basically trigger the system to wait without saying anything. Then all it will take is clever prompting to get the desired result.

shubh-chat · 2026-03-02T23:53:19 1772495599

This is superb, Nick! Thanks for this. Will try it out at some point for a project I am trying to build.

nthypes · 2026-03-03T11:52:23 1772538743

What about the costs?

mst98 · 2026-03-03T04:16:01 1772511361

This is so cool

gytdev · 2026-03-03T08:20:58 1772526058

I hate that LLMs interrupt me when I'm talking, even though I haven't finished my thought and was just thinking quite slowly.

jangletown · 2026-03-02T21:52:13 1772488333

impressive

suganesh95 · 2026-03-03T02:44:16 1772505856

I built something very similar and comparable to this with wake word detection on my Raspberry Pi.

Groq 8b instant is the fastest LLM from my tests. I used smallest.ai for TTS as it has the smallest TTFT

My Raspberry Pi stack: porcupine for wake word detection + elevenlabs for STT + groq scout as it supports home automation better + smallest.ai for 70ms ttfb

Call stack: twilio + groq whisper for STT + groq 8b instant + smallest.ai for tts

Alexa skill stack: wrote an Alexa skill to contact my stack running on a VPS

suganesh95 · 2026-03-03T02:53:02 1772506382

This is great. I built 3 assistants last week for same purpose with entirely different tech stack.

(Raspberry Pi Voice Assistant)

Jarvis uses Porcupine for wake word detection with the built-in "jarvis" keyword. Speech input flows through ElevenLabs Scribe v2 for transcription. The LLM layer uses Groq llama-3.3-70b-versatile as primary with Groq llama-3.1-8b-instant as fallback. Text-to-speech uses Smallest.ai Lightning with Chetan voice. Audio input/output handled by ALSA (arecord/aplay). End-to-end latency is 3.8–7.3 seconds.

(Twilio + VPS)

This setup ingests audio via Twilio Media Streams in μ-law 8kHz format. Silero VAD detects speech for turn boundaries. Groq Whisper handles batch transcription. The LLM stack chains Groq llama-4-scout-17b (primary), Groq llama-3.3-70b-versatile (fallback 1), and Groq llama-3.1-8b-instant (fallback 2) with automatic failover. Text-to-speech uses Smallest.ai Lightning with Pooja voice. Audio is encoded from PCM to μ-law 8kHz before streaming back via Twilio. End-to-end latency is 0.5–1.1 seconds.
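
The failover chain is conceptually just "try each backend in order". A simplified sketch, not the actual code: `complete_with_failover` is a hypothetical name, and the two stub functions stand in for real Groq client calls so the example runs on its own.

```python
from typing import Callable, Sequence


def complete_with_failover(
    prompt: str,
    backends: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, call) backend in order; return the first success."""
    errors = []
    for name, call in backends:
        try:
            return name, call(prompt)
        except Exception as exc:  # any failure moves on to the next backend
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))


def flaky_primary(prompt: str) -> str:
    # Stub that simulates the primary model timing out.
    raise TimeoutError("primary timed out")


def healthy_fallback(prompt: str) -> str:
    # Stub that simulates a working model.
    return f"echo: {prompt}"


backends = [
    ("llama-4-scout-17b", flaky_primary),
    ("llama-3.3-70b-versatile", healthy_fallback),
    ("llama-3.1-8b-instant", healthy_fallback),
]
used, reply = complete_with_failover("hello", backends)
print(used, reply)
```

In a voice loop you would also want a tight per-backend timeout, since waiting several seconds before failing over costs more than the fallback model's quality gap.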

───

(Alexa Skill)

Tina receives voice input through Alexa's built-in ASR, followed by Alexa's NLU for intent detection. The LLM is Claude Haiku routed through the OpenClaw gateway. Voice output uses Alexa's native text-to-speech. End-to-end latency is 1.5–2.5 seconds.

CagedJean · 2026-03-02T23:04:14 1772492654

[flagged]

nicktikhonov · 2026-03-02T23:21:21 1772493681

Gross