Voice Activity Detection (VAD)
VAD detects whether an audio stream contains speech and reports the result through two events:
- Speaking: triggered when voice input is detected.
- Silence: triggered when there has been no voice input for a period of time.
VAD is configured through the vad field of the CallOption passed to Invite (call) or Accept (answer). The field has the format VADOption.
Parameters
type
Selects the VAD implementation:
- webrtc: Use WebRTC's VAD implementation
- silero: Use Silero's VAD implementation
- ten: Use TEN's VAD implementation
samplerate
Sample rate in Hz, default 16000. As with ASR, it must match the sample rate of the Track:
- SIP calls default to 16000 Hz
- WebRTC calls depend on the codec:
  - G722: 16000 Hz
  - Opus: 48000 Hz
  - Others: 8000 Hz
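The mapping above can be sketched as a small lookup. This is only an illustration: the function name vad_samplerate and the "sip"/"webrtc" call-type strings are hypothetical and not part of the API; only the rates come from the list above.

```python
def vad_samplerate(call_type: str, codec: str = "") -> int:
    """Return the VAD sample rate (Hz) that matches the Track.

    call_type and codec are hypothetical string labels used for this sketch.
    """
    call_type = call_type.lower()
    if call_type == "sip":
        return 16000  # SIP calls default to 16000 Hz
    if call_type == "webrtc":
        codec = codec.lower()
        if codec == "g722":
            return 16000
        if codec == "opus":
            return 48000
        return 8000  # every other codec
    raise ValueError(f"unknown call type: {call_type!r}")


if __name__ == "__main__":
    for call_type, codec in [("sip", ""), ("webrtc", "opus"), ("webrtc", "pcmu")]:
        print(call_type, codec or "-", vad_samplerate(call_type, codec))
```

The returned value would then be used for the samplerate parameter so that VAD and the Track agree.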
voice_threshold
Trigger threshold, default 0.5. A higher value is stricter and may miss real speech; a lower value is more lenient and may produce false positives.
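To illustrate the trade-off, the sketch below assumes the detector produces a speech probability between 0 and 1 for each audio frame and treats a frame as speech when that probability exceeds voice_threshold. That is how probability-based models such as Silero typically behave, but it is an assumption here; the probabilities are made-up values and count_speech_frames is a hypothetical helper.

```python
from typing import Sequence


def count_speech_frames(probabilities: Sequence[float], voice_threshold: float = 0.5) -> int:
    """Count frames whose (assumed) speech probability exceeds the threshold."""
    return sum(1 for p in probabilities if p > voice_threshold)


# Made-up per-frame probabilities, for illustration only.
frame_probabilities = [0.10, 0.45, 0.55, 0.80, 0.95]

if __name__ == "__main__":
    for threshold in (0.3, 0.5, 0.9):
        n = count_speech_frames(frame_probabilities, threshold)
        print(f"voice_threshold={threshold}: {n} speech frames")
```

Raising the threshold shrinks the set of frames counted as speech, which is why a strict setting can miss quiet speech while a lenient one lets noise through.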
speech_padding
Detection starts speech_padding milliseconds after the start of speech. Default 250.
silence_padding
Silence duration, in milliseconds, that must be exceeded before a Silence event is triggered.
silence_timeout
Silence detection timeout in milliseconds. Unset by default.
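A minimal sketch of assembling these parameters into a vad field is shown below. The field names are the ones documented above; the helper build_vad_option, the chosen silence_padding and silence_timeout values, and the assumption that the option is serialized as JSON are illustrative only. The rest of CallOption is not shown.

```python
import json
from typing import Any, Dict, Optional

VAD_TYPES = {"webrtc", "silero", "ten"}


def build_vad_option(
    vad_type: str,
    samplerate: int = 16000,
    voice_threshold: float = 0.5,
    speech_padding: int = 250,
    silence_padding: Optional[int] = None,   # no default is documented
    silence_timeout: Optional[int] = None,   # unset by default
) -> Dict[str, Any]:
    """Build a dict shaped like VADOption (hypothetical helper)."""
    if vad_type not in VAD_TYPES:
        raise ValueError(f"type must be one of {sorted(VAD_TYPES)}")
    option: Dict[str, Any] = {
        "type": vad_type,
        "samplerate": samplerate,
        "voice_threshold": voice_threshold,
        "speech_padding": speech_padding,
    }
    # Optional fields are only included when explicitly set.
    if silence_padding is not None:
        option["silence_padding"] = silence_padding
    if silence_timeout is not None:
        option["silence_timeout"] = silence_timeout
    return option


if __name__ == "__main__":
    # Illustrative values: 100 ms silence padding, 5 s silence timeout.
    call_option = {"vad": build_vad_option("silero", silence_padding=100, silence_timeout=5000)}
    print(json.dumps(call_option, indent=2))
```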
Events
Silence Event
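Triggered when either condition holds. t1 and t2 are not defined elsewhere in this section; they most likely denote the duration of the preceding speech and the duration of the current silence, both in milliseconds:
- t1 > speech_padding and t2 > silence_padding
- t2 > silence_timeout (if silence_timeout is configured)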
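The two Silence conditions can be written as a single predicate. This is a sketch, not the library's implementation: it assumes t1 is the duration of the preceding speech and t2 the duration of the current silence, both in milliseconds, and should_emit_silence is a hypothetical name.

```python
from typing import Optional


def should_emit_silence(
    t1_ms: int,
    t2_ms: int,
    speech_padding: int,
    silence_padding: int,
    silence_timeout: Optional[int] = None,
) -> bool:
    """Return True when the documented Silence conditions are met.

    t1_ms: assumed duration of the preceding speech.
    t2_ms: assumed duration of the current silence.
    """
    # Condition 1: enough speech followed by enough silence.
    if t1_ms > speech_padding and t2_ms > silence_padding:
        return True
    # Condition 2: silence exceeded the timeout, only if one is configured.
    return silence_timeout is not None and t2_ms > silence_timeout


if __name__ == "__main__":
    print(should_emit_silence(400, 150, speech_padding=250, silence_padding=100))
    print(should_emit_silence(100, 150, speech_padding=250, silence_padding=100))
```

With speech_padding 250 and an illustrative silence_padding of 100, the first call satisfies condition 1, while the second does not because the speech was too short and no timeout is set.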
Speaking Event
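Triggered when speech is detected and one of the following holds:
- It is the first speech detected
- More than speech_padding + silence_padding milliseconds have passed since the last speech
- More than silence_timeout milliseconds have passed since the last speech (if silence_timeout is configured)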
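The Speaking conditions can be sketched the same way. The function below is hypothetical and only restates the rules above; it assumes the caller tracks the time elapsed since the last detected speech and passes None when no speech has been seen yet.

```python
from typing import Optional


def should_emit_speaking(
    ms_since_last_speech: Optional[int],
    speech_padding: int,
    silence_padding: int,
    silence_timeout: Optional[int] = None,
) -> bool:
    """Return True when newly detected speech should raise a Speaking event."""
    # First speech: nothing has been detected before.
    if ms_since_last_speech is None:
        return True
    # Long enough gap since the last speech.
    if ms_since_last_speech > speech_padding + silence_padding:
        return True
    # Gap exceeded the timeout, only if one is configured.
    return silence_timeout is not None and ms_since_last_speech > silence_timeout


if __name__ == "__main__":
    print(should_emit_speaking(None, speech_padding=250, silence_padding=100))
    print(should_emit_speaking(200, speech_padding=250, silence_padding=100))
    print(should_emit_speaking(400, speech_padding=250, silence_padding=100))
```

Under these rules, speech that resumes within speech_padding + silence_padding milliseconds of the previous speech does not raise a second Speaking event unless a shorter silence_timeout has already elapsed.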