When you press call, your voice travels through three streaming services - speech-to-text, a language model, and text-to-speech - stitched together and orchestrated at the edge. Here's the whole thing, end to end.
Your browser
AudioWorklet · mic capture + playback
VoiceSession - Cloudflare Durable Object
Auth-gated · one per call · owns the conversation + turn loop
STT
Deepgram · nova-3
Streaming British-English transcription. 150ms endpointing for snappy turn-taking, with a 1s utterance-end backstop so the agent never talks over you. CV proper nouns are primed via keyterm prompting.
LLM
Claude Haiku 4.5
Forced to call exactly one tool per turn - respond (an answer with confidence + which CV sections it used) or escalate. The CV and persona prompt are cached, and warmed during the greeting so your first question hits a warm cache.
TTS
ElevenLabs · Flash v2.5
Sub-100ms first audio, streamed back as 24kHz PCM frames. The whole answer is synthesised in one request so the voice keeps full-sentence context (per-sentence streaming sounded disjointed, so it was dropped).
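To make the speech-to-text settings above concrete, here is a minimal sketch of how the Deepgram streaming URL could be assembled. The function and type names (`buildListenUrl`, `SttOptions`) are hypothetical, and the uplink encoding and sample rate are assumptions rather than values taken from the project; the query parameters themselves (`endpointing`, `utterance_end_ms`, `keyterm`, `interim_results`) are standard Deepgram streaming options.

```typescript
// Sketch only: names below are illustrative, not the project's own code.
interface SttOptions {
  keyterms: string[]; // proper nouns from the CV (employers, products, tools)
  sampleRate: number; // sample rate of the PCM sent by the browser (assumed)
}

function buildListenUrl(opts: SttOptions): string {
  const params = new URLSearchParams({
    model: "nova-3",
    language: "en-GB",
    encoding: "linear16", // assumption: raw 16-bit PCM on the uplink
    sample_rate: String(opts.sampleRate),
    interim_results: "true", // needed for utterance_end_ms to fire
    endpointing: "150", // ms of silence before a segment is finalised
    utterance_end_ms: "1000", // backstop when endpointing alone misses the end
  });
  // Keyterm prompting: one `keyterm` parameter per primed term.
  for (const term of opts.keyterms) {
    params.append("keyterm", term);
  }
  return `wss://api.deepgram.com/v1/listen?${params.toString()}`;
}
```

The socket itself would then be opened from the Durable Object with the API key supplied server-side, so the key never reaches the browser.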
Cloudflare-native, end to end - no separate servers, queues, or origin to manage.
Frontend
Nuxt 4 · Nuxt UI
SSR landing + a single call page. WebAudio + an AudioWorklet handle mic capture and playback in the browser.
Transport
One WebSocket
Binary PCM frames up, audio + JSON control events down. One connection, not three.
Orchestration
Cloudflare Durable Object
One stateful actor per call - holds the conversation, bridges the browser and Deepgram sockets, and runs the turn loop.
Persistence
D1 · Drizzle
Every turn is written with its latencies, confidence, grounding, tokens and cost - the eval log.
Auth
better-auth magic link
Passwordless, domain-gated. The demo sits behind a small allow-list.
Reporting
Resend
A transcript + metrics report is emailed after each call to the caller and admins.
Hosting
Cloudflare Workers
Edge runtime with first-class WebSocket and Durable Object support. No servers to babysit.
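The single-WebSocket transport listed above can be sketched as two small helpers: one that converts microphone samples to 16-bit PCM for the uplink, and one that tells audio frames from JSON control events on the downlink. The event names and shapes here are hypothetical; the text only says that binary PCM goes up and audio plus JSON control events come down.

```typescript
// Hypothetical control events; the real event names are not given in the text.
type ControlEvent =
  | { type: "transcript"; text: string; final: boolean }
  | { type: "state"; value: "listening" | "thinking" | "speaking" }
  | { type: "flush" }; // e.g. tells the client to drop queued audio

type Downlink =
  | { kind: "audio"; pcm: Int16Array }
  | { kind: "control"; event: ControlEvent };

// Uplink: AudioWorklet samples are floats in [-1, 1]; convert to 16-bit PCM.
function floatToPcm16(samples: Float32Array): Int16Array {
  return Int16Array.from(samples, (s) => {
    const clamped = Math.max(-1, Math.min(1, s));
    return clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
  });
}

// Downlink: text frames are JSON control events, binary frames are PCM audio.
// Assumes binary frames carry whole 16-bit samples (even byte length).
function decodeDownlink(data: string | ArrayBuffer): Downlink {
  if (typeof data === "string") {
    return { kind: "control", event: JSON.parse(data) as ControlEvent };
  }
  return { kind: "audio", pcm: new Int16Array(data) };
}
```

In the browser this would sit behind a socket with `binaryType = "arraybuffer"`, sending `floatToPcm16(chunk).buffer` up and passing each incoming `event.data` to `decodeDownlink`.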
The parts I'd actually want to talk through in an interview.
Let's be honest: for a real product, a native speech-to-speech model like Gemini Live would be the better call - lower latency, less to maintain. But the point of this piece was to show I can build the thing, not piggyback on one prebuilt box. Orchestrating three streaming APIs - managing their sockets, latencies, and failure modes, and turning the result into a structured, inspectable decision every turn - is the actual engineering, and it's far more telling than wiring up a single Google product.
A call is inherently stateful and single-threaded: there's a conversation history, two live WebSockets to bridge (browser ↔ Deepgram), an in-flight abort controller, and a hard time cap. A Durable Object is exactly that - a single addressable actor that owns all of it, with no shared-state race conditions to reason about.
The model must call one of two tools each turn: respond (with an answer, a confidence level, and the CV sections it drew on) or escalate (with a reason). There is no free-text path. That constraint is what turns a chat into a measurable system - confidence and grounding fall out of every single turn for free.
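As a rough illustration of the two-tool constraint, the sketch below builds an Anthropic Messages API request body that forces a tool call and parses the result into a typed decision. The field names inside each tool, the confidence levels, the `max_tokens` value, and the model alias are assumptions for illustration; `tool_choice: { type: "any" }` and `cache_control` are the documented mechanisms for forcing a tool call and caching a prompt prefix.

```typescript
type Confidence = "high" | "medium" | "low"; // assumed levels
type Decision =
  | { kind: "respond"; answer: string; confidence: Confidence; sections: string[] }
  | { kind: "escalate"; reason: string };

const tools = [
  {
    name: "respond",
    description: "Answer the caller using only facts found in the CV.",
    input_schema: {
      type: "object",
      properties: {
        answer: { type: "string" },
        confidence: { type: "string", enum: ["high", "medium", "low"] },
        sections: { type: "array", items: { type: "string" } },
      },
      required: ["answer", "confidence", "sections"],
    },
  },
  {
    name: "escalate",
    description: "Hand the question to a human when the CV cannot answer it.",
    input_schema: {
      type: "object",
      properties: { reason: { type: "string" } },
      required: ["reason"],
    },
  },
];

type Message = { role: "user" | "assistant"; content: string };

function buildRequest(persona: string, cv: string, history: Message[]) {
  return {
    model: "claude-haiku-4-5", // assumed alias for Claude Haiku 4.5
    max_tokens: 512, // assumed budget for a spoken answer
    system: [
      { type: "text", text: persona },
      // Cache breakpoint on the last stable block: tools + persona + CV.
      { type: "text", text: cv, cache_control: { type: "ephemeral" } },
    ],
    tools,
    // "any" forces a tool call; disabling parallel use keeps it to one.
    tool_choice: { type: "any", disable_parallel_tool_use: true },
    messages: history,
  };
}

interface ContentBlock { type: string; name?: string; input?: unknown }

function parseDecision(content: ContentBlock[]): Decision {
  const call = content.find((b) => b.type === "tool_use");
  if (!call) throw new Error("model returned no tool call");
  if (call.name === "escalate") {
    const { reason } = call.input as { reason: string };
    return { kind: "escalate", reason };
  }
  if (call.name === "respond") {
    const { answer, confidence, sections } = call.input as {
      answer: string; confidence: Confidence; sections: string[];
    };
    return { kind: "respond", answer, confidence, sections };
  }
  throw new Error(`unexpected tool: ${String(call.name)}`);
}
```

Because every turn resolves to a `Decision`, confidence and grounding can be logged directly from the parsed value rather than inferred from free text.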
The agent answers only from a structured version of my CV - if a fact isn't there, it doesn't guess, it hands you to me. It opens by telling you it's an AI, hedges when confidence is low, and refuses sensitive questions (salary, references, contact details) outright. Responsible automation, not a parlour trick.
On a real call, the wait before the first word is everything. So: Haiku over a frontier model (much lower time-to-first-token), tight 150ms endpointing, a warmed prompt cache, and barge-in - the moment you start talking, the in-flight LLM and TTS are aborted and the audio queue is flushed. And I'll be straight with you: it doesn't feel totally natural yet. There's a real last 20% here - tuning turn-taking, smoothing the gaps - and every tweak is a trade-off against cost, structured tool use, and reliable escalation. I optimised hard, but I'd rather show honest engineering judgement than pretend I nailed it.
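Barge-in as described above comes down to one abort signal per turn plus an audio queue that can be emptied in a single step. A minimal sketch, assuming a hypothetical `TurnController` class (the project's real structure is not shown in the text):

```typescript
// Sketch only: one AbortController per turn, shared by the LLM and TTS requests.
class TurnController {
  private inFlight: AbortController | null = null;
  private audioQueue: Int16Array[] = [];

  // Start a new turn; any previous turn is cancelled first.
  beginTurn(): AbortSignal {
    this.bargeIn();
    this.inFlight = new AbortController();
    return this.inFlight.signal;
  }

  // Queue a TTS frame, ignoring late frames from a cancelled turn.
  enqueueAudio(frame: Int16Array, signal: AbortSignal): void {
    if (signal.aborted) return;
    this.audioQueue.push(frame);
  }

  // Called when the caller starts speaking: abort in-flight work, flush audio.
  // Returns how many queued frames were dropped.
  bargeIn(): number {
    this.inFlight?.abort();
    this.inFlight = null;
    const dropped = this.audioQueue.length;
    this.audioQueue.length = 0;
    return dropped;
  }

  get queuedFrames(): number {
    return this.audioQueue.length;
  }
}
```

The signal returned by `beginTurn()` would be passed as `{ signal }` to the `fetch` calls for the LLM and TTS requests, so a single `bargeIn()` cancels both.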
The live dashboard beside the call shows which service is running, its latency, a per-turn cost breakdown, and anything escalated - all driven by the same telemetry written to the database. The thinking is in the instrumentation, and I wanted that to be visible while you use it, not buried in a log.
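One way to picture the per-turn telemetry behind that dashboard is a single record per turn plus a function that derives the cost breakdown from it. The field names and the `Rates` structure below are hypothetical, and the prices are deliberately left as inputs rather than hard-coded, since the text does not state them.

```typescript
// Hypothetical shape of one row in the eval log.
interface TurnRecord {
  sttMs: number; // speech-to-text latency
  llmMs: number; // time to the model's tool call
  ttsMs: number; // time to first audio
  inputTokens: number;
  outputTokens: number;
  ttsChars: number; // characters sent for synthesis
  confidence: "high" | "medium" | "low" | null; // null when escalated
  sections: string[]; // CV sections the answer drew on
  escalated: boolean;
}

// Prices are supplied by the caller; none are assumed here.
interface Rates {
  usdPerMillionInputTokens: number;
  usdPerMillionOutputTokens: number;
  usdPerThousandTtsChars: number;
}

function turnCostUsd(turn: TurnRecord, rates: Rates) {
  const llm =
    (turn.inputTokens / 1_000_000) * rates.usdPerMillionInputTokens +
    (turn.outputTokens / 1_000_000) * rates.usdPerMillionOutputTokens;
  const tts = (turn.ttsChars / 1_000) * rates.usdPerThousandTtsChars;
  return { llm, tts, total: llm + tts };
}
```

Writing the same record to the database and pushing it to the browser as a control event keeps the dashboard and the eval log from drifting apart.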
This is a weekend portfolio piece, not a product. Calls are capped at two minutes, there's a daily spend cap, and the agent only knows my CV - it's deliberately small. The goal was the right architecture and the right instincts around grounding, escalation, and observability - the same shape as a production voice agent, at a smaller scale.
It'll tell you it's an AI, then answer questions about my work in real time.