
Voice Activity Detection (VAD)

VAD (Voice Activity Detection) detects whether audio contains speech.

  • When voice input is detected, it triggers a Speaking event.
  • When there is no voice input for a period of time, it triggers a Silence event.

VAD is configured in the vad field of the CallOption passed to Invite (call) or Accept (answer). The field's format is VADOption.

Parameters

type

Selects the VAD implementation:

  • webrtc: Use WebRTC's VAD implementation
  • silero: Use Silero's VAD implementation
  • ten: Use TEN's VAD implementation

samplerate

Sample rate in Hz, default 16000. As with ASR, it must match the Track sample rate:

  • SIP calls default to 16000 Hz
  • WebRTC calls depend on the codec:
    • G722: 16000 Hz
    • Opus: 48000 Hz
    • Others: 8000 Hz
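
The mapping above can be sketched as a small lookup. This is only an illustrative helper: the function name track_sample_rate and the call-type strings "sip" and "webrtc" are assumptions, not part of the documented API.

```python
from typing import Optional

# Hypothetical helper: names and call-type strings are illustrative assumptions.
WEBRTC_CODEC_RATES = {"g722": 16000, "opus": 48000}
DEFAULT_WEBRTC_RATE = 8000  # "Others" in the list above
SIP_DEFAULT_RATE = 16000


def track_sample_rate(call_type: str, codec: Optional[str] = None) -> int:
    """Return the sample rate (Hz) to use for the VAD samplerate field."""
    if call_type.lower() == "sip":
        return SIP_DEFAULT_RATE
    # WebRTC: the rate depends on the negotiated codec.
    return WEBRTC_CODEC_RATES.get((codec or "").lower(), DEFAULT_WEBRTC_RATE)
```

For example, track_sample_rate("webrtc", "Opus") yields 48000, which is the value to set as samplerate for that call.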

voice_threshold

Trigger threshold, default 0.5. A higher value is stricter but may miss speech. A lower value is more lenient but may produce false positives.

[Figure: decision flow. vad → score > voice_threshold? yes → voice (speaking); no → not speaking.]
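
The threshold comparison can be illustrated with a minimal sketch. The function name is_voice and the sample scores are made up for illustration; only the comparison score > voice_threshold and the default of 0.5 come from the text above.

```python
def is_voice(score: float, voice_threshold: float = 0.5) -> bool:
    """Classify one VAD score: voice only if it exceeds the threshold."""
    return score > voice_threshold


# Made-up scores, only to show how the threshold shifts the decision.
scores = [0.2, 0.55, 0.7, 0.9]

default_result = [is_voice(s) for s in scores]      # threshold 0.5
strict_result = [is_voice(s, 0.8) for s in scores]  # stricter threshold
```

With the default threshold, three of the four scores count as voice. With 0.8, only the last one does, which is the "stricter but may miss detections" trade-off.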

speech_padding

Detection starts speech_padding milliseconds after speech begins. Default 250.

silence_padding

A Silence event is triggered once silence has lasted longer than silence_padding milliseconds.

silence_timeout

Silence detection timeout in milliseconds. Unset by default.
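
Putting the parameters together, a VADOption might look like the sketch below. The field names are taken from the parameter headings above. The silence_padding and silence_timeout values are arbitrary examples, and the exact serialized key names and wire format are assumptions, since this page does not show them.

```python
import json

# Illustrative VADOption. The silence_padding and silence_timeout values are
# arbitrary examples, not documented defaults.
vad_option = {
    "type": "silero",         # one of: webrtc, silero, ten
    "samplerate": 16000,      # must match the Track sample rate
    "voice_threshold": 0.5,   # documented default
    "speech_padding": 250,    # documented default (milliseconds)
    "silence_padding": 100,   # example value (milliseconds)
    "silence_timeout": 5000,  # example value (milliseconds); optional
}

# The option goes in the vad field of CallOption (JSON encoding assumed here).
call_option = {"vad": vad_option}
payload = json.dumps(call_option, indent=2)
```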

Events

Silence Event

[Figure: timeline of speaking and not speaking segments, with durations t1 and t2.]
A Silence event is triggered when either condition holds:

  • t1 > speech_padding and t2 > silence_padding
  • t2 > silence_timeout (if silence_timeout is configured)
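
The two conditions can be written as a single predicate. This is a sketch, not the library's implementation: the function name is hypothetical, and it assumes t1 is the duration of the preceding speech and t2 the duration of the silence that follows, both in milliseconds.

```python
from typing import Optional


def should_emit_silence(
    t1_ms: int,
    t2_ms: int,
    speech_padding: int,
    silence_padding: int,
    silence_timeout: Optional[int] = None,
) -> bool:
    """Hypothetical check for the Silence event conditions listed above."""
    # Condition 1: enough speech, followed by enough silence.
    if t1_ms > speech_padding and t2_ms > silence_padding:
        return True
    # Condition 2: only applies when silence_timeout is configured.
    if silence_timeout is not None and t2_ms > silence_timeout:
        return True
    return False
```

Note that the second condition ignores t1, so a configured silence_timeout can trigger the event even when the preceding speech was shorter than speech_padding.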

Speaking Event

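A Speaking event is triggered when any of the following holds:

  • Speech is detected for the first time
  • More than speech_padding + silence_padding milliseconds have passed since the last speech
  • More than silence_timeout milliseconds have passed since the last speech (if silence_timeout is configured)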
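
These conditions can likewise be sketched as a predicate. The function name and the convention of passing None for "no earlier speech" are assumptions made for illustration.

```python
from typing import Optional


def should_emit_speaking(
    ms_since_last_speech: Optional[int],
    speech_padding: int,
    silence_padding: int,
    silence_timeout: Optional[int] = None,
) -> bool:
    """Hypothetical check for the Speaking event conditions listed above."""
    # First speech: there is no earlier speech to measure from.
    if ms_since_last_speech is None:
        return True
    # Long enough gap since the last speech.
    if ms_since_last_speech > speech_padding + silence_padding:
        return True
    # Timeout-based trigger, only when silence_timeout is configured.
    if silence_timeout is not None and ms_since_last_speech > silence_timeout:
        return True
    return False
```

With speech_padding of 250 and silence_padding of 100, a gap of 300 ms since the last speech does not trigger a new Speaking event, while a gap of 400 ms does.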