Voice & Turn-Taking Settings

These settings control how aggressively the agent detects user speech and decides when to respond.

Interruptions

  • allow_interruptions
    • Allows the user to interrupt the agent mid-speech.
  • min_interruption_duration
    • Minimum duration (in seconds) of detected user speech required for it to count as an interruption.

Defaults:

  • allow_interruptions=True
  • min_interruption_duration=0.08
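
As a rough illustration of how these two settings interact, the sketch below gates an interruption on both the flag and the minimum speech duration. The class name InterruptionSettings, the function should_interrupt, and the use of an inclusive comparison are assumptions made for illustration, not the framework's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterruptionSettings:
    # Hypothetical container; the values mirror the documented defaults.
    allow_interruptions: bool = True
    min_interruption_duration: float = 0.08  # seconds


def should_interrupt(settings: InterruptionSettings, speech_duration: float) -> bool:
    """Return True if user speech of this length (seconds) should stop the agent."""
    if not settings.allow_interruptions:
        # Interruptions are disabled, so the duration is irrelevant.
        return False
    # Assumption: speech exactly at the minimum counts as an interruption.
    return speech_duration >= settings.min_interruption_duration


if __name__ == "__main__":
    defaults = InterruptionSettings()
    print(should_interrupt(defaults, 0.05))  # shorter than the minimum
    print(should_interrupt(defaults, 0.10))  # longer than the minimum
```

Under this reading, raising min_interruption_duration filters out short sounds such as coughs, while allow_interruptions=False ignores user speech entirely while the agent is talking.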

Turn-taking / endpointing

These parameters control Voice Activity Detection (VAD) and determine when the agent decides the user has finished speaking:

  • min_silence_duration - How long (in seconds) the user must be silent before the system considers they might be done speaking.
  • activation_threshold - Confidence level (0.0-1.0) required to detect speech. Higher values reduce false positives from background noise.
  • prefix_padding_duration - Amount of audio (in seconds) to capture before detected speech starts. Helps catch the beginning of words.
  • min_endpointing_delay - Minimum time (in seconds) to wait before ending the user's turn.
  • max_endpointing_delay - Maximum time (in seconds) to wait before forcefully ending the turn.
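
One possible reading of the two endpointing delays is sketched below: the turn is never ended before min_endpointing_delay, and it is always ended once max_endpointing_delay is reached. The function turn_should_end and the user_seems_done signal (for example, the output of a turn-detection step) are hypothetical names, and the exact combination rule is an assumption rather than the documented behavior.

```python
def turn_should_end(
    silence_elapsed: float,
    user_seems_done: bool,
    min_endpointing_delay: float,
    max_endpointing_delay: float,
) -> bool:
    """Decide whether to end the user's turn after `silence_elapsed` seconds of silence."""
    if silence_elapsed >= max_endpointing_delay:
        # Upper bound reached: end the turn regardless of other signals.
        return True
    # Below the upper bound, end only if the user appears finished
    # and the minimum wait has passed.
    return user_seems_done and silence_elapsed >= min_endpointing_delay


if __name__ == "__main__":
    # Example values chosen for illustration only.
    print(turn_should_end(0.2, True, 0.45, 3.0))
    print(turn_should_end(1.0, True, 0.45, 3.0))
    print(turn_should_end(3.0, False, 0.45, 3.0))
```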

Defaults:

  • min_silence_duration=2.0
  • activation_threshold=0.4
  • prefix_padding_duration=0.5
  • min_endpointing_delay=0.45
  • max_endpointing_delay=3.0

These defaults are optimized to handle natural conversation pauses and prevent the agent from cutting off users who are thinking or formulating their thoughts.
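
When overriding these values, it can help to keep them in one place and sanity-check them before use. The sketch below is a minimal example, assuming a hypothetical TurnTakingSettings class. The field names and defaults come from the list above, while the validation rules (non-negative durations, a threshold within 0.0-1.0, and a minimum delay no greater than the maximum) are assumptions inferred from the parameter descriptions.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TurnTakingSettings:
    # Hypothetical container; defaults copied from the documented list.
    min_silence_duration: float = 2.0
    activation_threshold: float = 0.4
    prefix_padding_duration: float = 0.5
    min_endpointing_delay: float = 0.45
    max_endpointing_delay: float = 3.0

    def problems(self) -> list:
        """Return a list of human-readable problems; empty if the settings look sane."""
        found = []
        if not 0.0 <= self.activation_threshold <= 1.0:
            found.append("activation_threshold must be between 0.0 and 1.0")
        for f in fields(self):
            # Every field except the threshold is a duration in seconds.
            if f.name != "activation_threshold" and getattr(self, f.name) < 0:
                found.append(f"{f.name} must not be negative")
        if self.min_endpointing_delay > self.max_endpointing_delay:
            found.append("min_endpointing_delay must not exceed max_endpointing_delay")
        return found


if __name__ == "__main__":
    print(TurnTakingSettings().problems())
    print(TurnTakingSettings(activation_threshold=1.5).problems())
```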

Practical guidance

  • If the agent responds too quickly: Increase min_silence_duration or min_endpointing_delay.
  • If the agent cuts off users mid-thought: Increase min_endpointing_delay and max_endpointing_delay.
  • If turns end while the user is still pausing: Check whether the turn is being closed prematurely by increasing max_endpointing_delay to 4.0-5.0 seconds.
  • If the agent is interrupted too easily (for example, by brief noises): Increase min_interruption_duration, or set allow_interruptions=False to disable interruptions entirely.
  • If speech is missed at the beginning: Increase prefix_padding_duration.
  • For noisy environments: Increase activation_threshold to 0.65-0.7 to reduce false speech detection.
  • For users who speak slowly or thoughtfully: Increase min_silence_duration above its current value and raise max_endpointing_delay to 4.0-5.0.
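
The guidance above can be captured as named adjustments layered on top of the defaults. This is a minimal sketch: the scenario names and the tuned helper are hypothetical, and the override values are taken from the upper ends of the ranges suggested in this section.

```python
# Documented defaults, collected into one mapping.
DEFAULTS = {
    "allow_interruptions": True,
    "min_interruption_duration": 0.08,
    "min_silence_duration": 2.0,
    "activation_threshold": 0.4,
    "prefix_padding_duration": 0.5,
    "min_endpointing_delay": 0.45,
    "max_endpointing_delay": 3.0,
}

# Hypothetical scenario names; values follow the ranges suggested above.
ADJUSTMENTS = {
    "noisy_environment": {"activation_threshold": 0.7},
    "long_pauses": {"max_endpointing_delay": 5.0},
}


def tuned(*scenarios: str) -> dict:
    """Return a copy of the defaults with each named adjustment applied in order."""
    settings = dict(DEFAULTS)  # copy so the defaults stay untouched
    for name in scenarios:
        settings.update(ADJUSTMENTS[name])
    return settings


if __name__ == "__main__":
    for key, value in tuned("noisy_environment", "long_pauses").items():
        print(f"{key}={value}")
```

Keeping the adjustments separate from the defaults makes it easy to see which values were changed for a given deployment and to revert a single change when tuning.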