Voice Settings

Control how your agent sounds and processes speech.

Text-to-Speech (TTS)

Configure the voice your agent uses when speaking.

Setting	Description
Voice	The TTS voice model (e.g. Alloy, Nova, Shimmer for OpenAI; Neural voices for Azure)
Speed	Playback speed, from 0.5× (slow) to 2.0× (fast). Default: 1.0×
Pitch	Voice pitch adjustment (where supported)
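As a rough illustration of the settings above, the sketch below models them as a small Python object and rejects speeds outside the documented 0.5× to 2.0× range. The class name `TTSSettings` and its field names are assumptions made for this example, not the product's actual configuration schema.

```python
from dataclasses import dataclass
from typing import Optional

# Speed range and default taken from the settings table above.
MIN_SPEED = 0.5
MAX_SPEED = 2.0


@dataclass
class TTSSettings:
    """Hypothetical container for TTS settings (names are illustrative)."""

    voice: str                     # e.g. "Alloy", "Nova", "Shimmer"
    speed: float = 1.0             # documented default
    pitch: Optional[float] = None  # None when the provider has no pitch control

    def __post_init__(self) -> None:
        # Reject speeds outside the documented range instead of silently clamping.
        if not MIN_SPEED <= self.speed <= MAX_SPEED:
            raise ValueError(
                f"speed must be between {MIN_SPEED} and {MAX_SPEED}, got {self.speed}"
            )
```

For example, `TTSSettings(voice="Nova", speed=1.25)` is accepted, while `TTSSettings(voice="Nova", speed=2.5)` raises `ValueError`.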

Speech-to-Text (STT)

Setting	Description
Language	The language the agent listens for. It should match the agent's primary language
Model	STT model used for transcription

Voice Activity Detection (VAD)

VAD determines when the caller is speaking versus silent.

Setting	Description
Silence threshold	How long, in milliseconds, the agent waits after speech stops before processing the input. Lower values respond faster but can cut off slow speakers
End-of-speech pause	Pause duration that signals the caller has finished their turn
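To make the silence-threshold trade-off concrete, here is a minimal sketch of how such a threshold could be applied to a stream of per-frame speech/silence flags. The function name, the frame representation, and the 100 ms frame length are assumptions for illustration; the product's actual VAD may work differently.

```python
from typing import Iterable, Optional


def end_of_speech_frame(
    frames: Iterable[bool], frame_ms: int, silence_threshold_ms: int
) -> Optional[int]:
    """Return the index of the frame where the silence threshold is reached.

    `frames` holds one flag per audio frame: True for speech, False for silence.
    Returns None if the threshold is never reached after speech has started.
    """
    heard_speech = False
    silence_ms = 0
    for i, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            silence_ms = 0  # any speech resets the silence timer
        elif heard_speech:
            silence_ms += frame_ms
            if silence_ms >= silence_threshold_ms:
                return i
    return None


# A caller who pauses for 200 ms mid-sentence, then continues.
frames = [True, True, False, False, True, False, False, False]
print(end_of_speech_frame(frames, frame_ms=100, silence_threshold_ms=300))  # 7
print(end_of_speech_frame(frames, frame_ms=100, silence_threshold_ms=200))  # 3
```

With a 300 ms threshold the caller's mid-sentence pause is tolerated and the turn ends at frame 7. With a 200 ms threshold the turn ends at frame 3, before the caller has finished, which is the cut-off risk described above.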

Interruption handling

When enabled, the caller can interrupt the agent mid-response. The agent stops speaking and processes the new input immediately. Useful for natural, flowing conversations.
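The behaviour described above can be sketched as a simple playback loop that checks for caller speech before each chunk of the agent's response. This is a simplified, synchronous illustration; `speak_with_interruption` and its parameters are hypothetical names, and a real system would react to live audio events rather than a precomputed list.

```python
from typing import List, Sequence, Tuple


def speak_with_interruption(
    chunks: Sequence[str],
    caller_speaking: Sequence[bool],
    interruptions_enabled: bool = True,
) -> Tuple[List[str], bool]:
    """Play response chunks in order, stopping if the caller starts speaking.

    `caller_speaking[i]` says whether the caller is talking when chunk i is due;
    both sequences are assumed to have the same length.
    Returns (chunks actually spoken, whether the agent was interrupted).
    """
    spoken: List[str] = []
    for chunk, caller_active in zip(chunks, caller_speaking):
        if interruptions_enabled and caller_active:
            # Stop immediately so the new caller input can be processed.
            return spoken, True
        spoken.append(chunk)
    return spoken, False


chunks = ["Your order", "will arrive", "on Friday"]
print(speak_with_interruption(chunks, [False, True, False]))
# (['Your order'], True)
print(speak_with_interruption(chunks, [False, True, False], interruptions_enabled=False))
# (['Your order', 'will arrive', 'on Friday'], False)
```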

Latency vs quality

For real-time voice conversations, lower latency is usually preferable to maximum quality. Recommended settings:

Model: gpt-4o-mini or gemini-2.0-flash for speed
Temperature: 0.3–0.5 for consistent, focused responses
Max tokens: 150–250 per response (keeps responses concise for voice)
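One way to keep a configuration within these recommendations is a small check like the sketch below. The function name `latency_warnings` and its parameter names are assumptions for this example; only the model names and numeric ranges come from the list above.

```python
from typing import List

# Values taken from the recommendations above.
RECOMMENDED_MODELS = {"gpt-4o-mini", "gemini-2.0-flash"}
TEMPERATURE_RANGE = (0.3, 0.5)
MAX_TOKENS_RANGE = (150, 250)


def latency_warnings(model: str, temperature: float, max_tokens: int) -> List[str]:
    """Return a warning for each setting outside the recommended values."""
    warnings: List[str] = []
    if model not in RECOMMENDED_MODELS:
        warnings.append(f"model {model!r} is not one of the recommended fast models")
    if not TEMPERATURE_RANGE[0] <= temperature <= TEMPERATURE_RANGE[1]:
        warnings.append(f"temperature {temperature} is outside 0.3-0.5")
    if not MAX_TOKENS_RANGE[0] <= max_tokens <= MAX_TOKENS_RANGE[1]:
        warnings.append(f"max tokens {max_tokens} is outside 150-250")
    return warnings
```

Calling `latency_warnings("gpt-4o-mini", 0.4, 200)` returns an empty list, while a slower model with a high temperature and a large token limit produces one warning per setting.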
