Voice Settings

Control how your agent sounds and processes speech.

Text-to-Speech (TTS)

Configure the voice your agent uses when speaking.

Setting    Description
Voice      The TTS voice model (e.g. Alloy, Nova, Shimmer for OpenAI; Neural voices for Azure)
Speed      Playback speed, from 0.5× (slow) to 2.0× (fast). Default: 1.0
Pitch      Voice pitch adjustment (where supported)
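
As an illustration of the Speed range above, here is a minimal sketch of how these settings could be represented and validated in code. The class and field names (TTSSettings, voice, speed, pitch) are hypothetical and are not this product's actual configuration schema; only the 0.5 to 2.0 range and the 1.0 default come from the settings listed here.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical names: this is not the product's actual configuration schema.
MIN_SPEED = 0.5  # slowest playback speed listed in the settings
MAX_SPEED = 2.0  # fastest playback speed listed in the settings


@dataclass(frozen=True)
class TTSSettings:
    voice: str                     # e.g. "Alloy", "Nova", "Shimmer"
    speed: float = 1.0             # documented default
    pitch: Optional[float] = None  # None when the provider has no pitch control

    def __post_init__(self) -> None:
        # Reject speeds outside the documented range instead of silently clamping.
        if not MIN_SPEED <= self.speed <= MAX_SPEED:
            raise ValueError(
                f"speed must be between {MIN_SPEED} and {MAX_SPEED}, got {self.speed}"
            )


settings = TTSSettings(voice="Nova", speed=1.2)
```

Raising an error rather than clamping is a design choice in this sketch: it surfaces a mistyped value early instead of quietly changing how the agent sounds.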

Speech-to-Text (STT)

Setting     Description
Language    The language the agent listens for. Should match the agent's primary language
Model       STT model used for transcription

Voice Activity Detection (VAD)

VAD determines when the caller is speaking versus silent.

Setting                Description
Silence threshold      How long (ms) the agent waits after speech stops before processing. Lower = faster but can cut off slow speakers
End-of-speech pause    Pause duration that signals the caller has finished their turn
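
To make the silence-threshold trade-off concrete, the sketch below shows one simple way a threshold could be applied to a stream of per-frame speech flags. It is an illustration only, not the detector this product uses: the function name end_of_turn_ms, the 20 ms frame size, and the 500 ms default are all assumptions.

```python
from typing import Iterable, Optional


def end_of_turn_ms(
    frames: Iterable[bool],
    frame_ms: int = 20,               # assumed frame length
    silence_threshold_ms: int = 500,  # assumed default, not a documented value
) -> Optional[int]:
    """Return the time (ms) at which the turn is considered finished, or None.

    `frames` holds one flag per audio frame: True for speech, False for silence.
    """
    heard_speech = False
    silence_ms = 0
    for index, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            silence_ms = 0          # any speech resets the silence timer
        elif heard_speech:
            silence_ms += frame_ms  # only count silence that follows speech
            if silence_ms >= silence_threshold_ms:
                return (index + 1) * frame_ms
    return None


# A slow speaker: 100 ms of speech, a 300 ms pause, then 100 ms more speech.
slow_speaker = [True] * 5 + [False] * 15 + [True] * 5 + [False] * 30

print(end_of_turn_ms(slow_speaker, silence_threshold_ms=200))  # 300
print(end_of_turn_ms(slow_speaker, silence_threshold_ms=500))  # 1000
```

With a 200 ms threshold the turn ends at 300 ms, inside the speaker's pause and before they resume. With 500 ms the pause is tolerated and the turn ends at 1000 ms, after the caller has actually finished.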

Interruption handling

When enabled, the caller can interrupt the agent mid-response. The agent stops speaking and processes the new input immediately. Useful for natural, flowing conversations.
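
The behavior described above can be sketched as a small state machine. Everything named here (AgentTurn, allow_interruptions, on_caller_speech) is hypothetical. The choice to ignore caller speech while interruptions are disabled is also an assumption of this sketch, since that case is not specified here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AgentTurn:
    allow_interruptions: bool = True
    speaking: bool = False
    pending_input: List[str] = field(default_factory=list)

    def start_speaking(self) -> None:
        self.speaking = True

    def on_caller_speech(self, transcript: str) -> bool:
        """Handle caller speech; return True if it interrupted the agent."""
        if self.speaking and not self.allow_interruptions:
            # Assumption: with interruptions off, input during playback is ignored.
            return False
        interrupted = self.speaking
        self.speaking = False  # stop playback (a no-op if the agent was silent)
        self.pending_input.append(transcript)  # queue the new input for processing
        return interrupted


turn = AgentTurn()
turn.start_speaking()
was_interrupted = turn.on_caller_speech("Actually, I need to reschedule.")
```

After the call, was_interrupted is True, the agent is no longer speaking, and the caller's words are queued for processing.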

Latency vs quality

For real-time voice conversations, low latency usually matters more than maximum quality. Recommended settings:

  • Model: gpt-4o-mini or gemini-2.0-flash for speed
  • Temperature: 0.3–0.5 for consistent, focused responses
  • Max tokens: 150–250 per response (keeps responses concise for voice)
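
The recommendations above can be turned into a quick check, as in the hedged sketch below. It reads the ranges as 0.3 to 0.5 for temperature and 150 to 250 for max tokens. The function name latency_warnings and its parameters are hypothetical and do not correspond to a documented API.

```python
from typing import List

# Values taken from the recommendations above; all names are hypothetical.
FAST_MODELS = ("gpt-4o-mini", "gemini-2.0-flash")
TEMPERATURE_RANGE = (0.3, 0.5)
MAX_TOKENS_RANGE = (150, 250)


def latency_warnings(model: str, temperature: float, max_tokens: int) -> List[str]:
    """List the ways a configuration departs from the low-latency recommendations."""
    warnings = []
    if model not in FAST_MODELS:
        warnings.append(f"model {model!r} is not one of the recommended fast models")
    if not TEMPERATURE_RANGE[0] <= temperature <= TEMPERATURE_RANGE[1]:
        warnings.append(f"temperature {temperature} is outside 0.3 to 0.5")
    if not MAX_TOKENS_RANGE[0] <= max_tokens <= MAX_TOKENS_RANGE[1]:
        warnings.append(f"max_tokens {max_tokens} is outside 150 to 250")
    return warnings


print(latency_warnings("gpt-4o-mini", temperature=0.4, max_tokens=200))  # []
```

The function returns warnings instead of raising errors because these are recommendations, not hard limits: a configuration outside them may still be a deliberate choice.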