19 comments

  • jedberg 21 minutes ago
    Oh, this is really interesting to me. This is what I worked on at Amazon Alexa (and have patents on).

    An interesting fact I learned at the time: The median delay between human speakers during a conversation is 0ms (zero). In other words, in many cases, the listener starts speaking before the speaker is done. You've probably experienced this, and you talk about how you "finish each other's sentences".

    It's because your brain is predicting what they will say while they speak, and processing an answer at the same time. It's also why when they say what you didn't expect, you say, "what?" and then answer half a second later, when your brain corrects.

    Fact 2: Humans expect a delay on their voice assistants, for two reasons. One is that they know it's a computer that has to think. The other is cell phones: they have a built-in delay that breaks human-to-human speech, and your brain treats a voice assistant like a cell phone.

    Fact 3: Almost no response from Alexa is under 500ms. Even the ones that are served locally, like "what time is it".

    Semantic end-of-turn is the key here. It's something we were working on years ago, but didn't have the compute power to do it. So at least back then, end-of-turn was just 300ms of silence.
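
    A minimal sketch of the silence-based rule described above, for illustration only. The 300ms threshold comes from the comment; the 20ms frame size, the energy cutoff, and the function name are assumptions, not anything Alexa is known to have used.

```python
from typing import Iterable, Optional

FRAME_MS = 20            # assumed audio frame length
SILENCE_MS = 300         # silence threshold mentioned in the comment
ENERGY_THRESHOLD = 0.01  # hypothetical cutoff separating speech from silence


def end_of_turn_ms(frame_energies: Iterable[float]) -> Optional[int]:
    """Return the time (ms) at which the turn is declared over, or None."""
    heard_speech = False
    silent_ms = 0
    for i, energy in enumerate(frame_energies):
        if energy >= ENERGY_THRESHOLD:
            # Any speech frame resets the silence counter.
            heard_speech = True
            silent_ms = 0
        elif heard_speech:
            # Only count silence that follows some speech.
            silent_ms += FRAME_MS
            if silent_ms >= SILENCE_MS:
                return (i + 1) * FRAME_MS
    return None
```

    A rule like this cannot tell a mid-sentence pause from a finished thought, which is the gap semantic end-of-turn detection aims to close.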

    This is pretty awesome. It's been a few years since I worked on Alexa (and everything I wrote has been talked about publicly). But I do wonder if they've made progress on semantic detection of end-of-turn.

    Edit: Oh yeah, you are totally right about geography too. That was a huge unlock for Alexa. Getting the processing closer to the user.

    • nicktikhonov 11 minutes ago
      This is fascinating, thanks for sharing! I wonder why Amazon/Google/Apple didn't hop on the voice assistant/agent train in the last few years. All three have existing products with existing users and could pretty much define and capture the category with a single over-the-air update.
  • brody_hamer 1 hour ago
    > Voice is a turn-taking problem

    It really feels to me like there’s some low-hanging fruit with voice that no one is capitalizing on: filler words and pacing. When the LLM notices a silence, it fills it with a contextually aware filler word while the real response generates. Just an “mhmm” or a “right, right”. It’d go a long way toward making the back-and-forth feel more like a conversation, and if the speaker wasn’t done speaking, there’s none of that talking-over-the-user garbage. (Say the filler word, then continue listening.)

    • nicktikhonov 56 minutes ago
      100% - I thought about that shortly after writing this up. One way to make this work is to have a tiny, lower latency model generate that first reply out of a set of options, then aggressively cache TTS responses to get the latency super low. Responses like "Hmm, let me think about that..." would be served within milliseconds.
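
      A rough sketch of the cached-filler idea, under stated assumptions: `fake_tts` is a placeholder for a real TTS call, the filler list is made up from phrases in this thread, and a random choice stands in for the tiny low-latency model.

```python
import random
from typing import Callable, Dict

# Hypothetical filler set; phrases taken from the comments above.
FILLERS = ["Mhmm.", "Right, right.", "Hmm, let me think about that..."]


def build_filler_cache(synthesize: Callable[[str], bytes]) -> Dict[str, bytes]:
    """Synthesize every filler once at startup so playback needs no TTS call."""
    return {text: synthesize(text) for text in FILLERS}


def pick_filler(cache: Dict[str, bytes], rng: random.Random) -> bytes:
    """Stand-in for the tiny model: choose one cached filler at random."""
    return cache[rng.choice(FILLERS)]


def fake_tts(text: str) -> bytes:
    """Placeholder for a real TTS request; returns bytes so the sketch runs."""
    return text.encode("utf-8")
```

      Because the audio is synthesized ahead of time, serving a filler is a dictionary lookup rather than a network round trip.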
    • phkahler 21 minutes ago
      Better if it can anticipate its response before you're done speaking. That would be subject to change depending what the speaker says, but it might be able to start immediately.
    • starkparker 33 minutes ago
  • armcat 2 hours ago
    This is an outstanding write-up, thank you! Regarding LLM latency, OpenAI introduced WebSockets in their Responses client recently, so it should be a bit faster. An alternative is to have a super small LLM running locally on your device. I built my own pipeline fully local and it was sub-second RTT, with no streaming or optimisations: https://github.com/acatovic/ova
    • nicktikhonov 2 hours ago
      Very cool! starred and on my reading list. Would love to chat and share notes, if you'd like
      • alfalfasprout 45 minutes ago
        Also consider using Cerebras' inference APIs. They released a voice demo a while back and the latency of their model inference is insane.
  • modeless 2 hours ago
    IMO STT -> LLM -> TTS is a dead end. The future is end-to-end. I played with this two years ago and even made a demo you can install locally on a gaming GPU: https://github.com/jdarpinian/chirpy, but concluded that making something worth using for real tasks would require training of end-to-end models. A really interesting problem I would love to tackle, but out of my budget for a side project.
  • NickNaraghi 3 hours ago
    Pretty exciting breakthrough. This actually mirrors the early days of game engine netcode evolution. Since latency is an orchestration problem (not a model problem) you can beat general-purpose frameworks by co-locating and pipelining aggressively.

    Carmack's 2013 "Latency Mitigation Strategies" paper[0] made the same point for VR too: every millisecond hides in a different stage of the pipeline, and you only find them by tracing the full path yourself. Great find with the warm TTS websocket pool saving ~300ms, perfect example of this.

    [0]: https://danluu.com/latency-mitigation/
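
    One way to picture the pipelining point, as a hedged sketch rather than the write-up's actual code: split the streamed LLM output into sentences so that TTS can start on the first sentence while later ones are still generating. The function name and the punctuation-based splitting rule are assumptions.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")  # naive sentence boundary, assumed for the sketch


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as its closing punctuation arrives."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    # Flush whatever is left when the stream ends mid-sentence.
    if buffer.strip():
        yield buffer.strip()
```

    Each yielded sentence could be handed to the TTS stage immediately, so time to first audio depends on the first sentence rather than the full reply.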

  • lukax 3 hours ago
    Or you could use Soniox Real-time (supports 60 languages) which natively supports endpoint detection - the model is trained to figure out when a user's turn ended. This always works better than VAD.

    https://soniox.com/docs/stt/rt/endpoint-detection

    Soniox also wins the independent benchmarks done by Daily, the company behind Pipecat.

    https://www.daily.co/blog/benchmarking-stt-for-voice-agents/

    You can try a demo on the home page:

    https://soniox.com/

    Disclaimer: I used to work for Soniox

    Edit: I commented too soon. I only saw VAD and immediately thought of Soniox which was the first service to implement real time endpoint detection last year.

    • nicktikhonov 2 hours ago
      If you read the post, you'll see that I used Deepgram's Flux. It also does endpointing and is a higher-level abstraction than VAD.
      • lukax 2 hours ago
        Sorry, I commented too soon. Did you also try Soniox? Why did you decide to use Deepgram's Flux (English only)?
        • nicktikhonov 2 hours ago
          I didn't try Soniox, but I made a note to check it out! I chose Flux because I was already using Deepgram for STT and just happened to discover it when I was doing research. It would definitely be a good follow-up to try out all the different endpointing solutions to see what would shave off additional latency and feel most natural.

          Another good follow-up would be to try PersonaPlex, Nvidia's new model that would completely replace this architecture with a single model that does everything:

          https://research.nvidia.com/labs/adlr/personaplex/

  • age123456gpg 2 hours ago
    Hi all! Check out this Handy app https://github.com/cjpais/Handy - a free, open source, and extensible speech-to-text application that works completely offline.

    I am using it daily to drive Claude and it works really well for me (much better than macOS dictation mode).

  • nmstoker 1 hour ago
    This was discussed extensively 21 days ago:

    https://news.ycombinator.com/item?id=46946705

    • upmind 30 minutes ago
      "extensively" = 2 comments?
  • docheinestages 2 hours ago
    Does anyone know about a fully offline, open-source project like this voice agent (i.e. STT -> LLM -> TTS)?
  • loevborg 2 hours ago
    Nice write-up, thanks for sharing. How does your hand-vibed Python program compare to frameworks like Pipecat or LiveKit Agents? Both are also written in Python.
    • nicktikhonov 2 hours ago
      I'm sure LiveKit or similar would be best to use in production. I'm sure these libraries handle a lot of edge cases, or at least let you configure things quite well out of the box. Though maybe that argument will become less and less potent over time. The results I got were genuinely impressive, and of course most of the credit goes to the LLM. I think it's worth building this stuff from scratch, just so that you can be sure you understand what you'll actually be running. I now know how every piece works and can configure/tune things more confidently.
  • perelin 2 hours ago
    Great writeup! For VAD, did you use a headphone/mic combo or an open mic? If open, how did you deal with the agent interrupting itself?
    • nicktikhonov 2 hours ago
      I was using Twilio, and as far as I'm aware they handle any echoes that may arise. I'm actually not sure where in the telephony stack this is handled, but luckily I didn't see any issues or have to solve this problem myself.
  • MbBrainz 3 hours ago
    Love it! Solving the latency problem is essential to making voice AI usable and comfortable. Your point on VAD is interesting - hadn't thought about that.
  • boznz 2 hours ago
    "Voice is an orchestration problem" is basically correct. The two takeaways from this for me are

    1. I wonder if it could be optimised more by just having a single language, and

    2. How do we get around the problem of interference? Humans are good at conversation discrimination, i.e. listening while multiple conversations, TV, music, etc. are going on in the background. I've not had too much success with voice in noisy environments.

  • grayhatter 49 minutes ago
    You made, or you asked an LLM to generate?
    • nicktikhonov 43 minutes ago
      I'd say it was a collaboration. I had to hand-hold Claude quite a bit in the early stages, especially with architecture, and find the right services to get the outcome I wanted. But if you care most about where the code came from - it was probably 85-90% LLM, and that's fantastic given that the result is as performant as anything you'll be able to find out of the box.
  • shubh-chat 1 hour ago
    This is superb, Nick! Thanks for this. Will try it out at some point for a project I am trying to build.
  • jangletown 3 hours ago
    impressive
  • CagedJean 2 hours ago
    Do you have hot talk when you are alone in the shower with HER?