Voice Activity Detection (VAD)

VAD (Voice Activity Detection) detects whether an audio stream contains speech.

When voice input is detected, VAD triggers a Speaking event.
When there is no voice input for a period of time, it triggers a Silence event.

VAD is configured through the vad field of the CallOption passed to Invite (call) or Accept (answer). The field uses the VADOption format.

Parameters

type

Selects the VAD implementation:

- silero: use Silero's VAD implementation
- ten: use TEN's VAD implementation

samplerate

Sample rate in Hz, default 16000. As with ASR, this must match the Track sample rate:

SIP calls default to 16000 Hz
WebRTC calls depend on the codec:
- G722: 16000 Hz
- Opus: 48000 Hz
- Others: 8000 Hz
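
The mapping above can be written as a small lookup. This is a minimal sketch only: the function name `track_samplerate` and the `"sip"`/`"webrtc"` transport labels are hypothetical and not part of the documented API.

```python
from typing import Optional

# Codec-specific sample rates for WebRTC calls, as listed above.
WEBRTC_CODEC_RATES = {"g722": 16000, "opus": 48000}


def track_samplerate(transport: str, codec: Optional[str] = None) -> int:
    """Return the sample rate in Hz to use for the `samplerate` field.

    `transport` is "sip" or "webrtc" (labels invented for this sketch).
    """
    if transport == "sip":
        return 16000  # SIP calls default to 16000 Hz
    if transport == "webrtc":
        # Any codec other than G722 or Opus falls back to 8000 Hz.
        return WEBRTC_CODEC_RATES.get((codec or "").lower(), 8000)
    raise ValueError(f"unknown transport: {transport!r}")
```

For example, `track_samplerate("webrtc", "Opus")` gives 48000, so `samplerate` would be set to 48000 for that call.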

voice_threshold

Trigger threshold, default 0.5. A higher value is stricter but may miss speech. A lower value is more lenient but may produce false positives.
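
To illustrate the trade-off, the sketch below applies different thresholds to a list of per-frame speech probabilities. The probabilities are made-up values, the helper `frames_over_threshold` is hypothetical, and the strict `>` comparison is an assumption, since the exact comparison used by the VAD implementations is not stated here.

```python
def frames_over_threshold(probabilities, voice_threshold=0.5):
    """Mark a frame as speech when its probability exceeds the threshold."""
    # Assumption: strict ">" comparison; the real implementation may differ.
    return [p > voice_threshold for p in probabilities]


# Hypothetical per-frame speech probabilities from a VAD model.
probs = [0.10, 0.45, 0.55, 0.80, 0.95]

default = frames_over_threshold(probs)       # threshold 0.5
strict = frames_over_threshold(probs, 0.9)   # fewer frames count as speech
lenient = frames_over_threshold(probs, 0.3)  # more frames count as speech

print(sum(default), sum(strict), sum(lenient))
```

With the strict threshold only the clearest frame counts as speech, while the lenient threshold also accepts the borderline 0.45 frame.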

speech_padding

Detection starts speech_padding milliseconds after speech begins, default 250.

silence_padding

A Silence event is triggered once the silence has lasted longer than silence_padding milliseconds.

silence_timeout

Silence detection timeout in milliseconds. Not set by default.
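
Putting the parameters together, a VADOption might look like the sketch below. Only the field names come from this page. The values for silence_padding and silence_timeout are illustrative rather than defaults, and serializing the CallOption as JSON is an assumption about the transport format.

```python
import json

# Field names are taken from the parameter list above; values are examples.
vad_option = {
    "type": "silero",
    "samplerate": 16000,      # must match the Track sample rate
    "voice_threshold": 0.5,   # documented default
    "speech_padding": 250,    # documented default, in milliseconds
    "silence_padding": 100,   # illustrative value, not a documented default
    "silence_timeout": 5000,  # optional; illustrative value
}

# The vad field sits inside the CallOption used by Invite/Accept.
call_option = {"vad": vad_option}

print(json.dumps(call_option, indent=2))
```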

Events

Silence Event

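Triggered under the following conditions:
- t1 > speech_padding and t2 > silence_padding
- t2 > silence_timeout (if silence_timeout is configured)

Here t1 is presumably the duration of the preceding speech and t2 the duration of the current silence, both in milliseconds.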
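
The Silence conditions can be expressed as a predicate. This is a sketch under stated assumptions, not the actual implementation: it treats the two conditions as alternatives, takes t1 as the speech duration and t2 as the silence duration in milliseconds, and uses a hypothetical function name `should_emit_silence`.

```python
from typing import Optional


def should_emit_silence(
    t1: int,
    t2: int,
    silence_padding: int,
    speech_padding: int = 250,
    silence_timeout: Optional[int] = None,
) -> bool:
    """Return True when a Silence event would fire under the rules above.

    Assumption: t1 is the speech duration and t2 the silence duration (ms).
    """
    # Condition 1: enough speech was heard, followed by enough silence.
    if t1 > speech_padding and t2 > silence_padding:
        return True
    # Condition 2: silence exceeded the timeout, if one is configured.
    return silence_timeout is not None and t2 > silence_timeout
```

For instance, with silence_padding set to 100, 300 ms of speech followed by 150 ms of silence satisfies the first condition, while 200 ms of speech does not unless a silence_timeout is configured and exceeded.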

Speaking Event

Triggered under the following conditions:
- First speech
- More than speech_padding + silence_padding milliseconds have passed since the last speech
- More than silence_timeout milliseconds have passed since the last speech (if silence_timeout is configured)
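
The Speaking conditions can be sketched the same way. The function name `should_emit_speaking` is hypothetical, `None` is used here to represent "no speech has been seen yet", and treating the conditions as alternatives is an assumption.

```python
from typing import Optional


def should_emit_speaking(
    ms_since_last_speech: Optional[int],
    silence_padding: int,
    speech_padding: int = 250,
    silence_timeout: Optional[int] = None,
) -> bool:
    """Return True when newly detected speech would fire a Speaking event.

    `ms_since_last_speech` is None when no speech has been seen before.
    """
    # Condition 1: first speech in the call.
    if ms_since_last_speech is None:
        return True
    # Condition 2: the gap since the last speech exceeds both paddings.
    if ms_since_last_speech > speech_padding + silence_padding:
        return True
    # Condition 3: the gap exceeds the timeout, if one is configured.
    return silence_timeout is not None and ms_since_last_speech > silence_timeout
```

With the default speech_padding of 250 and a silence_padding of 100, speech resuming 400 ms after the last speech fires a new event, while speech resuming after 300 ms does not.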
