Voice Settings
Control how your agent sounds and processes speech.
Text-to-Speech (TTS)
Configure the voice your agent uses when speaking.
| Setting | Description |
|---|---|
| Voice | The TTS voice model (e.g. Alloy, Nova, Shimmer for OpenAI; Neural voices for Azure) |
| Speed | Playback speed, from 0.5× (slow) to 2.0× (fast). Default: 1.0× |
| Pitch | Voice pitch adjustment (where supported) |
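
As an illustration of the ranges in the table above, here is a minimal Python sketch that validates TTS settings before they are applied. The `TTSSettings` class and its field names are hypothetical and not part of any product API; only the speed range (0.5 to 2.0) and the default of 1.0 come from the table.

```python
from dataclasses import dataclass
from typing import Optional

# Speed limits taken from the settings table.
MIN_SPEED = 0.5
MAX_SPEED = 2.0


@dataclass
class TTSSettings:
    """Hypothetical container for the TTS settings described above."""

    voice: str                     # e.g. "Alloy", "Nova", "Shimmer"
    speed: float = 1.0             # documented default
    pitch: Optional[float] = None  # None = leave pitch unchanged (assumed convention)

    def __post_init__(self) -> None:
        # Reject speeds outside the documented range.
        if not MIN_SPEED <= self.speed <= MAX_SPEED:
            raise ValueError(
                f"speed must be between {MIN_SPEED} and {MAX_SPEED}, got {self.speed}"
            )


settings = TTSSettings(voice="Nova")
```

Constructing `TTSSettings(voice="Alloy", speed=2.5)` raises a `ValueError`, which catches out-of-range values before they reach the voice provider.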
Speech-to-Text (STT)
| Setting | Description |
|---|---|
| Language | The language the agent listens for. Should match the agent's primary language |
| Model | STT model used for transcription |
Voice Activity Detection (VAD)
VAD determines when the caller is speaking versus silent.
| Setting | Description |
|---|---|
| Silence threshold | How long (in ms) the agent waits after speech stops before processing. Lower values respond faster but can cut off slow speakers |
| End-of-speech pause | Pause duration that signals the caller has finished their turn |
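
To make the silence-threshold trade-off concrete, here is a rough Python sketch of end-of-turn detection. It is not the platform's actual VAD: it assumes an upstream detector has already labelled each fixed-length audio frame as speech or silence, and the function name and parameters are hypothetical.

```python
from typing import Iterable, Optional


def end_of_speech_ms(
    frames: Iterable[bool], frame_ms: int, silence_threshold_ms: int
) -> Optional[int]:
    """Return the elapsed time (ms) at which the turn is treated as finished.

    frames: per-frame flags, True where speech was detected (assumed input).
    Returns None if the silence threshold is never reached after speech.
    """
    heard_speech = False
    silence_ms = 0
    elapsed_ms = 0
    for is_speech in frames:
        elapsed_ms += frame_ms
        if is_speech:
            heard_speech = True
            silence_ms = 0  # any speech resets the silence timer
        elif heard_speech:
            silence_ms += frame_ms
            if silence_ms >= silence_threshold_ms:
                return elapsed_ms
    return None


# 200 ms speech, a 300 ms mid-sentence pause, 200 ms speech, then 600 ms silence.
frames = [True] * 10 + [False] * 15 + [True] * 10 + [False] * 30

short = end_of_speech_ms(frames, frame_ms=20, silence_threshold_ms=200)
long = end_of_speech_ms(frames, frame_ms=20, silence_threshold_ms=500)
```

With the 200 ms threshold, the turn ends at 400 ms, in the middle of the caller's pause, so the second phrase is cut off. With the 500 ms threshold, the turn ends at 1200 ms, after the caller has actually finished.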
Interruption handling
When enabled, the caller can interrupt the agent mid-response. The agent stops speaking and processes the new input immediately. Useful for natural, flowing conversations.
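
The behaviour described above can be sketched as a small state holder in Python. The `AgentTurn` class is hypothetical. The branch for disabled interruptions (caller speech is ignored while the agent is talking) is an assumption made for the example, since only the enabled case is described here.

```python
from typing import Optional


class AgentTurn:
    """Hypothetical state for one agent response."""

    def __init__(self, allow_interruptions: bool) -> None:
        self.allow_interruptions = allow_interruptions
        self.speaking = False
        self.pending_input: Optional[str] = None

    def start_speaking(self) -> None:
        self.speaking = True

    def on_caller_speech(self, transcript: str) -> bool:
        """Handle caller speech; return True if the agent was interrupted."""
        if self.speaking and self.allow_interruptions:
            self.speaking = False            # stop the current response
            self.pending_input = transcript  # process the new input next
            return True
        if not self.speaking:
            self.pending_input = transcript  # normal turn-taking
        # Assumption: with interruptions disabled, speech during playback is ignored.
        return False


turn = AgentTurn(allow_interruptions=True)
turn.start_speaking()
interrupted = turn.on_caller_speech("Actually, I need to change my order.")
```

Here `interrupted` is `True`, the agent is no longer speaking, and the caller's sentence is queued as the next input.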
Latency vs quality
Real-time voice conversations usually benefit more from low latency than from maximum quality. Recommended settings:
- Model: `gpt-4o-mini` or `gemini-2.0-flash` for speed
- Temperature: 0.3–0.5 for consistent, focused responses
- Max tokens: 150–250 per response (keeps responses concise for voice)
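
As a rough illustration, the recommendations above could be captured in a config and checked with a small helper. This is a Python sketch only: the dictionary keys and the `latency_warnings` function are hypothetical names, not a documented configuration schema.

```python
from typing import Any, Dict, List

# Values chosen from within the recommended ranges above.
RECOMMENDED: Dict[str, Any] = {
    "model": "gpt-4o-mini",  # or "gemini-2.0-flash"
    "temperature": 0.4,
    "max_tokens": 200,
}

FAST_MODELS = {"gpt-4o-mini", "gemini-2.0-flash"}


def latency_warnings(config: Dict[str, Any]) -> List[str]:
    """Return one warning per setting that falls outside the recommendations."""
    warnings: List[str] = []
    if config["model"] not in FAST_MODELS:
        warnings.append("model is not one of the recommended low-latency models")
    if not 0.3 <= config["temperature"] <= 0.5:
        warnings.append("temperature is outside 0.3-0.5")
    if not 150 <= config["max_tokens"] <= 250:
        warnings.append("max_tokens is outside 150-250")
    return warnings


ok = latency_warnings(RECOMMENDED)
too_long = latency_warnings({**RECOMMENDED, "max_tokens": 1000})
```

`ok` is an empty list, while `too_long` contains a single warning about `max_tokens`, flagging a response budget that is likely too long for spoken replies.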