How this works

A real-time voice agent, built in the open.

When you press call, your voice travels through three streaming services - speech-to-text, a language model, and text-to-speech - stitched together and orchestrated at the edge. Here's the whole thing, end to end.

Your browser

AudioWorklet · mic capture + playback

WebSocket · PCM up / audio + JSON down

VoiceSession - Cloudflare Durable Object

Auth-gated · one per call · owns the conversation + turn loop

per-turn loop

STT

Deepgram · nova-3

Streaming British-English transcription. 150ms endpointing for snappy turn-taking, with a 1s utterance-end backstop so it never talks over you. CV proper nouns are primed via keyterm prompting.
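As a rough illustration of those settings, here is a minimal sketch of how the streaming URL could be assembled. The function name, the 16 kHz sample rate, and the keyterm list are assumptions for the example, not taken from the project; the query parameters are the ones Deepgram documents for live transcription.

```typescript
// Builds a URL for Deepgram's live transcription endpoint.
// buildDeepgramUrl, the sample rate and the keyterms are illustrative assumptions.
function buildDeepgramUrl(keyterms: string[]): string {
  const params = new URLSearchParams({
    model: "nova-3",
    language: "en-GB",         // British-English transcription
    encoding: "linear16",      // raw 16-bit PCM from the browser
    sample_rate: "16000",      // assumed capture rate
    interim_results: "true",   // utterance-end events rely on interim results
    endpointing: "150",        // ms of silence before speech is finalised
    utterance_end_ms: "1000",  // backstop gap before an utterance-end event
  });
  // One keyterm parameter per proper noun to prime.
  for (const term of keyterms) params.append("keyterm", term);
  return `wss://api.deepgram.com/v1/listen?${params.toString()}`;
}
```

The short endpointing value gives the fast turn hand-off, while the longer utterance-end gap acts as the slower, more conservative signal.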

LLM

Claude Haiku 4.5

Forced to call exactly one tool per turn - respond (with confidence + which CV sections it used) or escalate. The CV and persona prompt are cached, and warmed during the greeting so your first question hits a warm cache.

TTS

ElevenLabs · Flash v2.5

Sub-100ms first audio, streamed back as 24kHz PCM frames. The whole answer is synthesised in one request so the voice keeps full-sentence context (per-sentence streaming sounded disjointed, so it was dropped).
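To make the 24kHz PCM frames concrete, here is a small sketch of turning one frame into the float samples that WebAudio plays. It assumes 16-bit little-endian mono PCM, which the text above does not state; the function names are hypothetical.

```typescript
const SAMPLE_RATE_HZ = 24_000; // from the text: 24kHz PCM frames

// Convert one 16-bit little-endian PCM frame (assumed format) to floats in [-1, 1).
function pcm16ToFloat32(frame: ArrayBuffer): Float32Array {
  const view = new DataView(frame);
  const sampleCount = Math.floor(frame.byteLength / 2);
  const out = new Float32Array(sampleCount);
  for (let i = 0; i < sampleCount; i++) {
    out[i] = view.getInt16(i * 2, true) / 32768;
  }
  return out;
}

// How much audio a frame holds, useful for scheduling playback.
function frameDurationMs(frame: ArrayBuffer): number {
  const sampleCount = Math.floor(frame.byteLength / 2);
  return (sampleCount * 1000) / SAMPLE_RATE_HZ;
}
```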

D1 - every turn, with metrics

Resend - post-call report

The stack

Cloudflare-native, end to end - no separate servers, queues, or origin to manage.

Frontend

Nuxt 4 · Nuxt UI

SSR landing + a single call page. WebAudio + an AudioWorklet handle mic capture and playback in the browser.
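On the capture side, the worklet hands over float samples that need packing into PCM before upload. A minimal sketch of that step, assuming 16-bit little-endian PCM (the function name is hypothetical):

```typescript
// Pack Web Audio float samples into 16-bit little-endian PCM (assumed wire format).
function float32ToPcm16(samples: Float32Array): ArrayBuffer {
  const buffer = new ArrayBuffer(samples.length * 2);
  const view = new DataView(buffer);
  samples.forEach((sample, i) => {
    // Clamp first so out-of-range input cannot wrap around.
    const clamped = Math.max(-1, Math.min(1, sample));
    const scaled = clamped < 0 ? clamped * 32768 : clamped * 32767;
    view.setInt16(i * 2, Math.round(scaled), true);
  });
  return buffer;
}
```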

Transport

One WebSocket

Binary PCM frames up, audio + JSON control events down. One connection, not three.
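Sharing one socket works because binary and text frames are easy to tell apart. A sketch of how the browser side might split downstream messages, with the event shape and names as assumptions:

```typescript
// Control events are JSON text frames; the "type" field is an assumed convention.
type ControlEvent = { type: string; [key: string]: unknown };

type Downstream =
  | { kind: "audio"; pcm: ArrayBuffer }
  | { kind: "control"; event: ControlEvent };

// Text frames carry JSON control events, binary frames carry audio.
function classifyMessage(data: string | ArrayBuffer): Downstream {
  if (typeof data === "string") {
    return { kind: "control", event: JSON.parse(data) as ControlEvent };
  }
  return { kind: "audio", pcm: data };
}
```

With a browser WebSocket, setting binaryType to "arraybuffer" makes binary frames arrive as ArrayBuffer, so the message payload can be passed straight to a function like this.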

Orchestration

Cloudflare Durable Object

One stateful actor per call - holds the conversation, bridges the browser and Deepgram sockets, and runs the turn loop.

Persistence

D1 · Drizzle

Every turn is written with its latencies, confidence, grounding, tokens and cost - the eval log.
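For a sense of what one row in that log might hold, here is a hypothetical record shape and a per-call summary. Every field name is an assumption made for illustration, not the actual schema.

```typescript
type Confidence = "high" | "medium" | "low";

// Hypothetical shape of one logged turn.
interface TurnRecord {
  callId: string;
  turnIndex: number;
  sttMs: number;
  llmMs: number;
  ttsMs: number;
  confidence: Confidence | null; // null when the turn escalated
  cvSections: string[];          // grounding: which CV sections were used
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
}

// Roll a call's turns up into the headline numbers.
function summariseCall(turns: TurnRecord[]) {
  const totalCostUsd = turns.reduce((sum, t) => sum + t.costUsd, 0);
  const escalations = turns.filter((t) => t.confidence === null).length;
  const totalLatencyMs = turns.reduce((sum, t) => sum + t.sttMs + t.llmMs + t.ttsMs, 0);
  const meanLatencyMs = turns.length === 0 ? 0 : totalLatencyMs / turns.length;
  return { totalCostUsd, escalations, meanLatencyMs };
}
```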

Auth

better-auth magic link

Passwordless, domain-gated. The demo sits behind a small allow-list.

Reporting

Resend

A transcript + metrics report is emailed after each call to the caller and admins.

Hosting

Cloudflare Workers

Edge runtime with first-class WebSocket and Durable Object support. No servers to babysit.

Decisions & trade-offs

The parts I'd actually want to talk through in an interview.

A discrete pipeline, not an end-to-end speech model

Let's be honest: for a real product, a native speech-to-speech model like Gemini Live would be the better call - lower latency, less to maintain. But the point of this piece was to show I can build the thing, not piggyback on one prebuilt box. Orchestrating three streaming APIs - managing their sockets, latencies, and failure modes, and turning the result into a structured, inspectable decision every turn - is the actual engineering, and it's far more telling than wiring up a single Google product.

One Durable Object per call

A call is inherently stateful and single-threaded: there's a conversation history, two live WebSockets to bridge (browser ↔ Deepgram), an in-flight abort controller, and a hard time cap. A Durable Object is exactly that - a single addressable actor that owns all of it, with no shared-state race conditions to reason about.
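The addressing side of that can be sketched in a few lines. Durable Object namespaces expose idFromName and get; the interface below is a structural stand-in for just those two methods, and the function name and the "call:" key prefix are assumptions.

```typescript
// Minimal stand-in for the two namespace methods used here.
interface ActorNamespace<Id, Stub> {
  idFromName(name: string): Id;
  get(id: Id): Stub;
}

// The same call ID always maps to the same actor, so each call has one owner.
function sessionForCall<Id, Stub>(ns: ActorNamespace<Id, Stub>, callId: string): Stub {
  const id = ns.idFromName(`call:${callId}`);
  return ns.get(id);
}
```

In a Worker, the namespace would come from an environment binding and the returned stub would receive the upgraded WebSocket request.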

Forced tool use makes every turn structured

The model must call one of two tools each turn: respond (with an answer, a confidence level, and the CV sections it drew on) or escalate (with a reason). There is no free-text path. That constraint is what turns a chat into a measurable system - confidence and grounding fall out of every single turn for free.
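Roughly, that constraint could be expressed against Anthropic's Messages API as below. This is a sketch, not the project's code: the model alias, token budget, confidence levels, and input field names are assumptions, while tool_choice and cache_control are the documented request fields.

```typescript
type Confidence = "high" | "medium" | "low"; // assumed levels
type Message = { role: "user" | "assistant"; content: string };
type ContentBlock = { type: string; name?: string; input?: unknown };
type TurnDecision =
  | { tool: "respond"; answer: string; confidence: Confidence; cvSections: string[] }
  | { tool: "escalate"; reason: string };

function buildTurnRequest(cvAndPersona: string, history: Message[]) {
  return {
    model: "claude-haiku-4-5", // assumed model alias
    max_tokens: 512,           // assumed budget for a spoken answer
    // Mark the large, stable prefix as cacheable so later turns can reuse it.
    system: [{ type: "text", text: cvAndPersona, cache_control: { type: "ephemeral" } }],
    tools: [
      {
        name: "respond",
        description: "Answer the caller using only the CV.",
        input_schema: {
          type: "object",
          properties: {
            answer: { type: "string" },
            confidence: { type: "string", enum: ["high", "medium", "low"] },
            cv_sections: { type: "array", items: { type: "string" } },
          },
          required: ["answer", "confidence", "cv_sections"],
        },
      },
      {
        name: "escalate",
        description: "Hand the question to a human.",
        input_schema: {
          type: "object",
          properties: { reason: { type: "string" } },
          required: ["reason"],
        },
      },
    ],
    // "any" forces a tool call; disabling parallel use keeps it to exactly one.
    tool_choice: { type: "any", disable_parallel_tool_use: true },
    messages: history,
  };
}

// Turn the response's content blocks into a typed decision for this turn.
function parseDecision(content: ContentBlock[]): TurnDecision {
  const call = content.find((block) => block.type === "tool_use");
  if (!call) throw new Error("model returned no tool call");
  const input = (call.input ?? {}) as Record<string, unknown>;
  if (call.name === "escalate") {
    return { tool: "escalate", reason: String(input["reason"] ?? "") };
  }
  if (call.name === "respond") {
    const sections = input["cv_sections"];
    return {
      tool: "respond",
      answer: String(input["answer"] ?? ""),
      confidence: input["confidence"] as Confidence,
      cvSections: Array.isArray(sections) ? sections.map(String) : [],
    };
  }
  throw new Error(`unexpected tool: ${call.name}`);
}
```

Because every response passes through something like parseDecision, confidence and grounding arrive as fields rather than as text that has to be interpreted afterwards.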

Grounded answers, honest escalation

The agent answers only from a structured version of my CV - if a fact isn't there, it doesn't guess, it hands you to me. It opens by telling you it's an AI, hedges when confidence is low, and refuses sensitive questions (salary, references, contact details) outright. Responsible automation, not a parlour trick.

Chasing perceived latency

On a real call, the wait before the first word is everything. So: Haiku over a frontier model (much lower time-to-first-token), tight 150ms endpointing, a warmed prompt cache, and barge-in - the moment you start talking, the in-flight LLM and TTS are aborted and the audio queue is flushed. And I'll be straight with you: it doesn't feel totally natural yet. There's a real last 20% here - tuning turn-taking, smoothing the gaps - and every tweak is a trade-off against cost, structured tool use, and reliable escalation. I optimised hard, but I'd rather show honest engineering judgement than pretend I nailed it.
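The barge-in part can be sketched with a plain AbortController. The class and method names below are hypothetical; the idea is that one signal is shared by the in-flight LLM and TTS requests, and cancelling it also empties whatever audio is queued.

```typescript
// Hypothetical per-call helper: one abort signal per turn plus the pending audio.
class TurnController {
  private inFlight: AbortController | null = null;
  private audioQueue: ArrayBuffer[] = [];

  // Start a turn; pass the returned signal to the LLM and TTS fetch calls.
  beginTurn(): AbortSignal {
    this.bargeIn(); // never let two turns overlap
    this.inFlight = new AbortController();
    return this.inFlight.signal;
  }

  enqueueAudio(frame: ArrayBuffer, signal: AbortSignal): void {
    if (signal.aborted) return; // drop late frames from a cancelled turn
    this.audioQueue.push(frame);
  }

  // Caller started talking: cancel upstream work and flush queued audio.
  bargeIn(): void {
    this.inFlight?.abort();
    this.inFlight = null;
    this.audioQueue.length = 0;
  }

  get queuedFrames(): number {
    return this.audioQueue.length;
  }
}
```

Checking the signal on enqueue matters because a cancelled stream can still deliver a frame or two after the abort.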

Observability is the product

The live dashboard beside the call shows which service is running, its latency, a per-turn cost breakdown, and anything escalated - all driven by the same telemetry written to the database. The thinking is in the instrumentation, and I wanted that to be visible while you use it, not buried in a log.

Scoped on purpose

This is a weekend portfolio piece, not a product. Calls are capped at two minutes, there's a daily spend cap, and the agent only knows my CV - it's deliberately small. The goal was the right architecture and the right instincts around grounding, escalation, and observability - the same shape as a production voice agent, just smaller.
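Both caps reduce to small guards. A sketch, with the function names as assumptions and the daily limit left as a parameter because the figure isn't stated here:

```typescript
const MAX_CALL_MS = 2 * 60 * 1000; // the two-minute call cap

// Time left before the call is cut off, never negative.
function remainingCallMs(startedAtMs: number, nowMs: number): number {
  return Math.max(0, MAX_CALL_MS - (nowMs - startedAtMs));
}

// Refuse new calls once today's spend reaches the cap.
function canStartCall(spentTodayUsd: number, dailyCapUsd: number): boolean {
  return spentTodayUsd < dailyCapUsd;
}
```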

Want to hear it?

It'll tell you it's an AI, then answer questions about my work in real time.