Build a voice agent that can make outbound calls with AssemblyAI

Why outbound voice agents matter

A voice agent that can dial out, not just answer, unlocks workflows that text channels handle poorly:

Use case	What the agent does	Why outbound beats inbound
Appointment reminders	Calls the patient 24 h before, confirms or reschedules	Reaches people who never read the SMS
Lead qualification	Calls a fresh inbound lead, qualifies, books with sales	Engages while interest is still hot
Survey + NPS	Reads the prompt, captures freeform answers	Higher response rate than email
Past-due collections	Calls the account holder, takes payment via a tool call	Lower agent cost than a human dialer
Recall and renewal	Notifies of a recall, prescription refill, or expiring policy	Cuts through inbox noise
Customer winback	Reaches lapsed customers with a personalized offer	More personal than a marketing email

In every case the win is the same: the agent reaches the customer through the channel they actually pick up, holds a real conversation, and writes the outcome to your system of record.

Architecture

The system has three components connected by two WebSockets:

Twilio, which places the phone call and streams its audio over a Media Streams WebSocket
Your Node.js server, which serves the TwiML and bridges the two sockets
The AssemblyAI Voice Agent API, which receives the caller's audio and returns the agent's speech over a second WebSocket

The key insight: both legs use audio/pcmu (G.711 μ-law at 8 kHz). Twilio Media Streams already deliver base64-encoded μ-law audio, and the Voice Agent API accepts and emits the same format natively. That means zero resampling — bytes pass through end-to-end.
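
To make the passthrough claim concrete, here is a minimal sketch (not taken from the repo) that inspects one base64 payload without altering it. The helper name `describeMulawPayload` is hypothetical. It relies only on two facts: G.711 μ-law stores one byte per sample, and both legs run at 8 kHz.

```javascript
// Hypothetical helper: measure a base64 μ-law payload without changing it.
const SAMPLE_RATE_HZ = 8000; // G.711 μ-law on both legs
const BYTES_PER_SAMPLE = 1;  // μ-law packs each sample into one byte

function describeMulawPayload(payloadB64) {
  const bytes = Buffer.from(payloadB64, "base64").length;
  const durationMs = (bytes * 1000) / (BYTES_PER_SAMPLE * SAMPLE_RATE_HZ);
  return { bytes, durationMs };
}

// Example: 160 bytes of μ-law silence (0xff) represent 20 ms of audio.
const frame = Buffer.alloc(160, 0xff).toString("base64");
console.log(describeMulawPayload(frame)); // { bytes: 160, durationMs: 20 }
```

Because the payload string is the same on both sides, the bridge can forward it as-is and only needs a helper like this for logging or debugging.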

Prerequisites

Node.js 18+ and npm
An AssemblyAI API key — free tier available
A Twilio account plus a voice-capable phone number in your console
ngrok (or any public HTTPS tunnel) so Twilio can reach your dev machine

Consent matters. Automated outbound calls are regulated almost everywhere — TCPA in the US, the various state DNC registries, GDPR in the EU, two-party-consent rules for recording, and more. Disclose that the call is automated in the opener, honor “remove me from the list” requests, and consult counsel before dialing real prospects.

Quick start

1. Clone and install

git clone https://github.com/kelsey-aai/voice-agent-outbound-calls
cd voice-agent-outbound-calls
npm install

2. Configure your environment

cp .env.example .env
# Fill in:
#   ASSEMBLYAI_API_KEY     — from the AssemblyAI dashboard
#   TWILIO_ACCOUNT_SID     — from console.twilio.com
#   TWILIO_AUTH_TOKEN      — from console.twilio.com
#   TWILIO_FROM_NUMBER     — your Twilio voice number, e.g. +15551234567
#   PUBLIC_URL             — leave blank for now; we'll fill it after ngrok

3. Run the server

npm start
# → Listening on http://localhost:3000

4. Expose it with ngrok

In a second terminal:

ngrok http 3000

Copy the HTTPS forwarding URL (e.g. https://ab12cd34.ngrok-free.app) and paste it into .env as PUBLIC_URL. Restart npm start.
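
A small startup check can catch a blank or non-HTTPS PUBLIC_URL before the first call. The sketch below is an illustration, not code from the repo: the function name `checkEnv` is hypothetical, and the variable names are the ones listed in the .env comments above.

```javascript
// Hypothetical startup check; variable names match the .env comments above.
const REQUIRED_VARS = [
  "ASSEMBLYAI_API_KEY",
  "TWILIO_ACCOUNT_SID",
  "TWILIO_AUTH_TOKEN",
  "TWILIO_FROM_NUMBER",
  "PUBLIC_URL",
];

function checkEnv(env) {
  const problems = REQUIRED_VARS
    .filter((name) => !env[name])
    .map((name) => `${name} is not set`);
  // Twilio has to reach the server over a public HTTPS URL.
  if (env.PUBLIC_URL && !/^https:\/\//.test(env.PUBLIC_URL)) {
    problems.push("PUBLIC_URL must start with https://");
  }
  return problems;
}

const configProblems = checkEnv(process.env);
if (configProblems.length > 0) {
  // A real server would probably refuse to start here.
  console.error("Configuration problems:\n- " + configProblems.join("\n- "));
}
```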

5. Place a call

curl -X POST http://localhost:3000/call \
  -H 'content-type: application/json' \
  -d '{"to":"+15551234567"}'

Use your own phone number for the first call so you become the prospect. The phone rings, the agent greets you with the disclosure, and you can talk to it as you would to a person.

How it works

1. Place the call

POST /call receives a JSON body with the target number and asks Twilio to dial it. Twilio's Calls API does the actual dialing and, when the recipient picks up, fetches the URL we passed as url: to get TwiML instructions for the call.

const call = await twilioClient.calls.create({
  to,
  from: TWILIO_FROM_NUMBER,
  url: `${PUBLIC_URL}/twiml`,
});
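
Before handing the number to Twilio, it may be worth rejecting malformed input early. The following is a sketch under assumptions: `parseCallRequest` is a hypothetical helper, and the only rule it enforces is the E.164 shape (a plus sign followed by up to 15 digits, with no leading zero), which matches the example number used in this guide.

```javascript
// Hypothetical validator for the POST /call body.
// E.164: "+" followed by 2 to 15 digits, the first of which is not 0.
const E164_PATTERN = /^\+[1-9]\d{1,14}$/;

function parseCallRequest(body) {
  const to = typeof body?.to === "string" ? body.to.trim() : "";
  if (!E164_PATTERN.test(to)) {
    return { ok: false, error: 'Expected "to" in E.164 format, e.g. +15551234567' };
  }
  return { ok: true, to };
}
```

A route handler could return HTTP 400 with `error` when `ok` is false and call `twilioClient.calls.create` only when it is true.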

2. Return TwiML that opens a media stream

When Twilio fetches /twiml, the server returns a tiny piece of XML that wraps the live call in a <Connect><Stream> verb. That verb tells Twilio to open a WebSocket back to our server and pipe the call audio over it.

app.post("/twiml", (_req, res) => {
  const wsUrl = PUBLIC_URL.replace(/^http/, "ws") + "/twilio-stream";
  res.type("text/xml").send(`<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="${wsUrl}" />
  </Connect>
</Response>`);
});
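
The same response can be built by a pure function, which is easier to test than a route handler. This is a sketch, not the repo's code: `buildStreamTwiml` and `escapeXmlAttr` are hypothetical names, and the trailing-slash handling is an added precaution for a PUBLIC_URL pasted with a final `/`.

```javascript
// Hypothetical helpers: build the <Connect><Stream> TwiML from a public URL.
function escapeXmlAttr(value) {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

function buildStreamTwiml(publicUrl) {
  // https:// becomes wss:// and http:// becomes ws://; trailing slashes are dropped.
  const wsUrl = publicUrl.replace(/^http/, "ws").replace(/\/+$/, "") + "/twilio-stream";
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    "  <Connect>",
    `    <Stream url="${escapeXmlAttr(wsUrl)}" />`,
    "  </Connect>",
    "</Response>",
  ].join("\n");
}

console.log(buildStreamTwiml("https://ab12cd34.ngrok-free.app/"));
```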

3. Bridge two WebSockets

When Twilio connects to /twilio-stream, we open a second WebSocket to AssemblyAI and shuttle messages between them. The first message we send to AssemblyAI is session.update — it configures the agent's personality, voice, and audio formats.

aaiWs.send(JSON.stringify({
  type: "session.update",
  session: {
    system_prompt: SYSTEM_PROMPT,
    greeting: GREETING,
    input:  { format: { encoding: "audio/pcmu" } },
    output: { voice: "ivy", format: { encoding: "audio/pcmu" } },
  },
}));

Both formats are audio/pcmu. Twilio Media Streams already deliver base64-encoded μ-law 8 kHz audio. AssemblyAI accepts that format natively and can emit it back, which means we never decode, resample, or re-encode any audio in this server.

Greeting is set in session.update. Outbound calls need the agent to speak first — the prospect has no idea why their phone is ringing. Setting session.greeting makes the agent open the conversation as soon as the session is ready.
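
One practical detail when bridging: Twilio may start sending messages before the AssemblyAI socket has finished connecting. The sketch below shows one way to hold messages until the upstream socket is open. It is an assumption about how you might structure the bridge, not a description of the repo: `createBufferedSender` is a hypothetical helper, and the source does not say whether audio sent before `session.ready` is accepted, so treat the flush point as something to verify.

```javascript
// Hypothetical wrapper: queue messages until the upstream socket is open.
const SOCKET_OPEN = 1; // value of WebSocket.OPEN in the standard API and the ws package

function createBufferedSender(socket, maxQueued = 200) {
  const queue = [];
  return {
    send(message) {
      if (socket.readyState === SOCKET_OPEN) {
        socket.send(message);
      } else if (queue.length < maxQueued) {
        queue.push(message);
      }
      // When the queue is full the message is dropped: stale audio is
      // usually worse than a short gap.
    },
    flush() {
      // Intended to be called from the socket's "open" handler,
      // after session.update has been sent.
      while (queue.length > 0 && socket.readyState === SOCKET_OPEN) {
        socket.send(queue.shift());
      }
    },
    get queued() {
      return queue.length;
    },
  };
}
```

With this in place, the Twilio handler would call `sender.send(...)` instead of `aaiWs.send(...)`, and the `open` handler would call `sender.flush()`.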

4. Forward audio in both directions

The Twilio side emits connected, start, media, and stop events. We capture streamSid from start, forward media payloads to AssemblyAI as input.audio events, and close the AAI socket on stop.

case "media": {
  const payload = msg.media.payload;  // already base64 μ-law 8 kHz
  aaiWs.send(JSON.stringify({ type: "input.audio", audio: payload }));
  break;
}
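
The Twilio side of the bridge can be written as a pure function that returns what to do next, which keeps the socket code thin. This is an illustrative sketch: `handleTwilioMessage` and the shape of its return value are hypothetical, while the event names and the `input.audio` message come from the text above.

```javascript
// Hypothetical pure handler for messages arriving on the Twilio WebSocket.
// Returns the next state, an optional message for AssemblyAI, and a close flag.
function handleTwilioMessage(state, raw) {
  const msg = JSON.parse(raw);
  switch (msg.event) {
    case "start":
      // Twilio reports the stream identifier here; replies must carry it.
      return { state: { ...state, streamSid: msg.start.streamSid }, toAai: null, close: false };
    case "media":
      // The payload is already base64 μ-law 8 kHz, so it is forwarded untouched.
      return {
        state,
        toAai: JSON.stringify({ type: "input.audio", audio: msg.media.payload }),
        close: false,
      };
    case "stop":
      // The call ended; the caller should close the AssemblyAI socket.
      return { state, toAai: null, close: true };
    default:
      // "connected" and any other events need no action.
      return { state, toAai: null, close: false };
  }
}
```

The socket's message handler would then send `toAai` when it is not null and close the AssemblyAI socket when `close` is true.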

Each reply.audio chunk from AssemblyAI is base64 μ-law that we wrap in a Twilio media event and ship straight back to the call:

case "reply.audio":
  twilioWs.send(JSON.stringify({
    event: "media",
    streamSid,
    media: { payload: evt.data },
  }));
  break;

5. Handle barge-in cleanly

When the user speaks while the agent is talking, AssemblyAI emits reply.done with status: "interrupted". On a phone call we also need to flush whatever audio Twilio still has buffered. Twilio supports a clear event for exactly this:

case "reply.done":
  if (evt.status === "interrupted" && streamSid) {
    twilioWs.send(JSON.stringify({ event: "clear", streamSid }));
  }
  break;
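
Both AssemblyAI-to-Twilio cases can be combined into one mapping function. The sketch below is illustrative rather than the repo's implementation: `toTwilioMessage` is a hypothetical name, and it returns `null` whenever there is nothing to send, including before `streamSid` is known.

```javascript
// Hypothetical mapper from a Voice Agent event to a Twilio Media Streams message.
function toTwilioMessage(evt, streamSid) {
  if (!streamSid) return null; // no stream to address yet
  switch (evt.type) {
    case "reply.audio":
      // Agent speech: wrap the base64 μ-law chunk in a Twilio media event.
      return JSON.stringify({ event: "media", streamSid, media: { payload: evt.data } });
    case "reply.done":
      // On barge-in, tell Twilio to drop whatever audio it still has buffered.
      return evt.status === "interrupted"
        ? JSON.stringify({ event: "clear", streamSid })
        : null;
    default:
      return null;
  }
}
```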

6. Echo cancellation is the carrier's job

On a phone call you don't have to think about acoustic echo cancellation — the carrier and the handset handle it. That's a meaningful difference from terminal-based clients, which need headphones to keep the agent from interrupting itself.

Tuning the agent

Voice

Drop any voice ID from the Voices catalog into session.output.voice. There are 18 English voices and 16 multilingual voices available; multilingual voices code-switch with English automatically.

output: { voice: "james",  format: { encoding: "audio/pcmu" } }
output: { voice: "sophie", format: { encoding: "audio/pcmu" } }
output: { voice: "diego",  format: { encoding: "audio/pcmu" } }

System prompt and greeting

Both live near the top of server.js. Keep them short — phone-call replies should be one or two sentences. Always disclose that the call is automated in the first sentence; several US states require it.

Turn detection

Outbound calls often run on noisier lines than browser-based agents. The defaults in server.js are tuned a little tighter:

turn_detection: {
  vad_threshold: 0.5,        // 0.0–1.0; raise for noisy lines
  min_silence: 400,          // ms; raise for deliberate speech
  max_silence: 1200,         // ms; max wait before forcing end-of-turn
  interrupt_response: true,  // false to disable barge-in
}

See the session configuration reference for every available knob.
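
When experimenting with these values, a quick sanity check can catch typos before they reach the API. The sketch below is an assumption-laden illustration: `checkTurnDetection` is a hypothetical helper, the ranges come from the comments in the block above, and the rule that `min_silence` should not exceed `max_silence` is an inference rather than something the source states.

```javascript
// Hypothetical sanity check for a turn_detection object.
function checkTurnDetection(cfg) {
  const problems = [];
  const isMs = (v) => Number.isFinite(v) && v >= 0;
  if (!(Number.isFinite(cfg.vad_threshold) && cfg.vad_threshold >= 0 && cfg.vad_threshold <= 1)) {
    problems.push("vad_threshold must be a number between 0.0 and 1.0");
  }
  if (!isMs(cfg.min_silence)) problems.push("min_silence must be a non-negative number of ms");
  if (!isMs(cfg.max_silence)) problems.push("max_silence must be a non-negative number of ms");
  // Assumption: the minimum wait should not be longer than the hard cap.
  if (isMs(cfg.min_silence) && isMs(cfg.max_silence) && cfg.min_silence > cfg.max_silence) {
    problems.push("min_silence should not exceed max_silence");
  }
  if (typeof cfg.interrupt_response !== "boolean") {
    problems.push("interrupt_response must be true or false");
  }
  return problems;
}

console.log(checkTurnDetection({
  vad_threshold: 0.5,
  min_silence: 400,
  max_silence: 1200,
  interrupt_response: true,
})); // []
```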

Recording, machine detection, and time limits

Twilio's Calls API takes optional flags that you almost certainly want in production:

const call = await twilioClient.calls.create({
  to,
  from: TWILIO_FROM_NUMBER,
  url: `${PUBLIC_URL}/twiml`,
  record: true,
  machineDetection: "Enable",
  timeLimit: 600,  // hard cap in seconds
});

record: true saves the call to Twilio's media store. machineDetection: "Enable" lets you branch on voicemail vs. live human. timeLimit puts a ceiling on a single call so a stuck LLM can't burn budget.
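
With machine detection enabled, Twilio includes an `AnsweredBy` parameter in its request for the TwiML, with values such as `human`, `machine_start`, `fax`, and `unknown`. A minimal sketch of branching on it follows. The function names are hypothetical, and treating `unknown` as a live answer is a policy choice made here for illustration. In an Express handler the value would typically be read from the form-encoded request body, which assumes a urlencoded body parser is registered.

```javascript
// Hypothetical branch on Twilio's AnsweredBy request parameter.
function shouldBridgeToAgent(answeredBy) {
  // Without machineDetection the parameter is absent; treat that as a live answer.
  if (!answeredBy) return true;
  // Policy choice for this sketch: "unknown" is given the benefit of the doubt.
  return answeredBy === "human" || answeredBy === "unknown";
}

function twimlForAnswer(answeredBy, wsUrl) {
  // wsUrl is assumed to be a trusted, already-escaped wss:// URL.
  const body = shouldBridgeToAgent(answeredBy)
    ? `  <Connect>\n    <Stream url="${wsUrl}" />\n  </Connect>`
    : "  <Hangup />";
  return `<?xml version="1.0" encoding="UTF-8"?>\n<Response>\n${body}\n</Response>`;
}
```

Hanging up on voicemail is only one option; a production agent might instead play a short recorded message.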

Tools (Function Calling)

Once the conversation works, add tools to let the agent do things — book a meeting, look up an account, mark the lead as DNC. Tools register on the same session.update you already send. The full pattern is covered in the tool-calling guide.

Troubleshooting

The phone rings but the call drops immediately. Check the Twilio console call log. Most often it's a TwiML fetch failure — Twilio couldn't reach PUBLIC_URL/twiml because ngrok died, the URL still says localhost, or the protocol is http:// instead of https://.

Twilio connects but the agent never speaks. Look for [aai] session.ready in your server logs. If you see UNAUTHORIZED, your AssemblyAI key is wrong. If you see no AAI logs at all, your environment variables aren't loaded — confirm .env is next to server.js.

The agent's voice sounds chipmunky or muffled. Both session.input.format.encoding and session.output.format.encoding must be audio/pcmu. If either is left at the default audio/pcm (24 kHz), the formats won't match and Twilio will play the audio at the wrong rate.
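
A small guard can confirm both encodings before the session.update message is sent. This is a sketch with a hypothetical helper name, `checkAudioFormats`, applied to the same message shape shown earlier in this guide.

```javascript
// Hypothetical guard: both legs of the session must use audio/pcmu.
function checkAudioFormats(sessionUpdate) {
  const session = sessionUpdate?.session ?? {};
  const problems = [];
  for (const leg of ["input", "output"]) {
    const encoding = session[leg]?.format?.encoding;
    if (encoding !== "audio/pcmu") {
      problems.push(`session.${leg}.format.encoding is ${encoding ?? "unset"}, expected audio/pcmu`);
    }
  }
  return problems;
}

console.log(checkAudioFormats({
  type: "session.update",
  session: {
    input: { format: { encoding: "audio/pcmu" } },
    output: { voice: "ivy" }, // output format left unset
  },
})); // [ 'session.output.format.encoding is unset, expected audio/pcmu' ]
```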

The agent keeps talking over me after I interrupt. Make sure you forward the clear event to Twilio when you receive reply.done with status: "interrupted". Without it, Twilio plays out the rest of its buffered audio.

Twilio trial accounts only call verified numbers. That's a Twilio limitation, not a bug in this code. Verify the recipient number in the Twilio console, or upgrade the account.

The full troubleshooting guide is in the Voice Agent API docs.

Frequently asked questions

How do I make a voice agent that places outbound phone calls?

Use Twilio's Calls API to dial the target number and pass it TwiML that opens a <Connect><Stream> to your server. On your server, accept the resulting Media Streams WebSocket and bridge it to AssemblyAI's Voice Agent API. Configure both session.input.format.encoding and session.output.format.encoding as audio/pcmu so Twilio's μ-law 8 kHz audio passes through without resampling.

What audio format should I use for a Twilio voice agent?

Use audio/pcmu (G.711 μ-law, 8 kHz) on both the input and output of the Voice Agent API. Twilio Media Streams emit base64-encoded μ-law 8 kHz audio natively, and the Voice Agent API accepts and emits the same format. That means no decoding, no resampling, and no re-encoding.

How does the Voice Agent API handle barge-in over a phone call?

When the user speaks while the agent is talking, the Voice Agent API emits reply.done with status: "interrupted". On a Twilio call you also need to flush Twilio's outbound buffer by sending {event: "clear", streamSid} over the Media Streams WebSocket.

Do I need separate STT, LLM, and TTS for an outbound voice agent?

No. The AssemblyAI Voice Agent API bundles speech recognition, the language model, and text-to-speech behind a single WebSocket. You stream telephony audio in and get the agent's spoken audio back, with neural turn detection, barge-in, and tool calling built in.

How do I authenticate from a Node.js server?

Pass your AssemblyAI API key as a Bearer token in the Authorization HTTP header on the WebSocket upgrade request: new WebSocket(url, { headers: { Authorization: "Bearer YOUR_API_KEY" } }).

Is it legal to call prospects with an AI voice agent?

It depends on jurisdiction and use case. In the US, the TCPA and state DNC registries restrict automated calls. Several states require AI disclosure in the opener. The EU's GDPR and ePrivacy rules add their own requirements. Disclose that the call is automated, honor opt-out requests, and consult counsel before dialing real prospects.

How much does it cost?

AssemblyAI offers a free tier so you can prototype without a credit card. For current pricing, see the AssemblyAI pricing page. Twilio bills separately for outbound minutes and the phone number itself.
