
Text-to-Speech (TTS)

[Figure: TTS flow within a session. The SDK sends a TTS command (text: Hello) to RustPBX; RustPBX sends Hello (text) to the TTS Provider, receives Hello (audio), and delivers Hello (audio) to the Phone.]

The TTS command converts text to speech and plays it.

TTS Track

The TTS command creates a TTS Track for forwarding audio. When the TTS Track starts and ends, it triggers TrackStart and TrackEnd events respectively.

[Figure: a TTS/Play Track forwarding audio to the RTP Track (RTP connection to the Phone) and to an External Service (audio stream).]

When playId is set in the TTS command, the corresponding TrackEnd event will contain this playId, which can be used to get playback completion notifications.
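As a rough illustration of using playId for completion notifications, the sketch below matches incoming events against a set of pending playIds. The event JSON shape (an `event` field with the value `trackEnd`) and the helper name `finished_play_id` are assumptions made for this example; check the event reference for the actual field names.

```python
import json
from typing import Optional


def finished_play_id(raw_event: str, pending: set) -> Optional[str]:
    """Return the playId that just finished, or None if the event is unrelated.

    Assumed event shape: {"event": "trackEnd", "playId": "..."}.
    """
    event = json.loads(raw_event)
    if event.get("event") != "trackEnd":
        return None
    play_id = event.get("playId")
    if play_id in pending:
        # This playback is complete, so stop waiting for it.
        pending.discard(play_id)
        return play_id
    return None


pending = {"greeting-1"}
done = finished_play_id('{"event": "trackEnd", "playId": "greeting-1"}', pending)
print(done, pending)
```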

TTS Command Parameters

text

Text to convert.

info

Some providers limit text length and content, and special characters may cause errors. Refer to the provider documentation at the end of this document.

speaker

Sets the voice. Different models support different voices, and voices differ in the languages they support and in price.

Voice List

playId

playId lets multiple commands make up a single playback. Consecutive TTS commands that use the same playId reuse the same TTS Track.

warning

If a TTS command uses a different playId from the previous TTS command and the previous TTS Track has not ended, the previous TTS Track is terminated and a new TTS Track is created.

endOfStream

Set endOfStream=true to indicate that all TTS commands for the current playId have been sent. After all command results finish playing, the TTS Track exits and sends a TrackEnd event.
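The playId and endOfStream rules can be sketched as a small generator that turns text fragments into commands sharing one playId, with endOfStream set only on the last one. The `command: "tts"` envelope and the helper name `tts_commands` are assumptions for illustration; only `text`, `playId`, and `endOfStream` come from the parameter list in this section.

```python
import json


def tts_commands(fragments, play_id):
    """Yield one TTS command per fragment, all reusing the same playId."""
    last = len(fragments) - 1
    for i, text in enumerate(fragments):
        # "command": "tts" is an assumed wire-format field.
        cmd = {"command": "tts", "text": text, "playId": play_id}
        if i == last:
            # Tells the TTS Track that no more commands follow for this playId.
            cmd["endOfStream"] = True
        yield cmd


for cmd in tts_commands(["Hello, ", "how can I help you?"], "reply-1"):
    print(json.dumps(cmd))
```

Keeping the playId constant across the fragments means they play on one TTS Track, so a single TrackEnd event marks the end of the whole reply.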

[Figure: TTS commands with playId: 1 share TTS Track 1, and TTS commands with playId: 2 share TTS Track 2. Each track emits a TrackStart event when it starts and a TrackEnd event after the command with endOfStream: true.]

streaming

streaming is used to choose between the provider's streaming and non-streaming APIs:

  • Set streaming=true to use the provider's streaming API. Multiple TTS commands are sent consecutively over the same WebSocket connection.

  • Set streaming=false (default) to use the non-streaming API. Each TTS command uses a separate HTTP request.

Because streaming text fragments are shorter, the caching mechanism is disabled in streaming mode.

Non-streaming mode uses separate requests, so the max_concurrent_tasks option can be used to set request concurrency. The default is 1.
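A minimal sketch of choosing between the two modes follows. It assumes a `command: "tts"` envelope and places `max_concurrent_tasks` inside the command's `option` field, as the option list in this document suggests; the helper name `build_tts` is hypothetical.

```python
def build_tts(text, *, streaming=False, max_concurrent_tasks=1):
    """Build a TTS command for either streaming or non-streaming synthesis."""
    # "command": "tts" is an assumed wire-format field.
    cmd = {"command": "tts", "text": text, "streaming": streaming}
    if not streaming and max_concurrent_tasks != 1:
        # Concurrency only applies to non-streaming requests.
        cmd["option"] = {"max_concurrent_tasks": max_concurrent_tasks}
    return cmd


# Short fragments, e.g. from an incrementally generated reply: streaming, no cache.
print(build_tts("Sure, ", streaming=True))
# Fixed prompts that repeat across calls: non-streaming, so results can be cached.
print(build_tts("Welcome to our service.", max_concurrent_tasks=4))
```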

base64

Set base64=true to use the text field to pass base64 encoded audio.

The audio must be in PCM format with a sample rate of 16 kHz.
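The sketch below generates a short test tone and wraps it in a base64 TTS command. It assumes the PCM is 16-bit signed little-endian mono, which this document does not specify, and that commands use a `command: "tts"` envelope; the helper names are hypothetical.

```python
import base64
import math
import struct

SAMPLE_RATE = 16000  # Required sample rate, per this document.


def sine_pcm(freq_hz, millis, amplitude=0.3):
    """Return a sine tone as 16-bit little-endian mono PCM (assumed layout)."""
    n = SAMPLE_RATE * millis // 1000
    samples = [
        int(amplitude * 32767 * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE))
        for i in range(n)
    ]
    return struct.pack(f"<{n}h", *samples)


def base64_tts_command(pcm):
    """Carry raw audio in the text field, flagged with base64=true."""
    encoded = base64.b64encode(pcm).decode("ascii")
    return {"command": "tts", "text": encoded, "base64": True}


cmd = base64_tts_command(sine_pcm(440, 100))
print(len(cmd["text"]), cmd["base64"])
```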

autoHangup

Set autoHangup=true to automatically hang up after playback completes.

waitInputTimeout

Set waitInputTimeout to trigger a Silence event when no audio input is received for more than waitInputTimeout milliseconds.

Caching

When streaming=false is set, the caching mechanism is enabled.

The audio returned for a TTS command is saved to /tmp/mediacache (the default path) with the file name {text_hash}-{sample_rate}-{speaker}-{speed}.pcm.

When subsequent commands use the same text, sample rate, speaker, and speed, the cached file will be used directly.
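To make the cache key concrete, here is a hypothetical sketch of how such a file name could be derived. The hash function RustPBX actually uses is not stated in this document, so SHA-256 (truncated) is only a stand-in, and the resulting names will not match real cache files.

```python
import hashlib
from pathlib import Path


def cache_path(text, sample_rate, speaker, speed, root="/tmp/mediacache"):
    """Illustrative cache path: {text_hash}-{sample_rate}-{speaker}-{speed}.pcm."""
    # Stand-in hash; the real text_hash algorithm is an unknown here.
    text_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    return Path(root) / f"{text_hash}-{sample_rate}-{speaker}-{speed}.pcm"


a = cache_path("Welcome.", 16000, "voice-a", 1.0)
b = cache_path("Welcome.", 16000, "voice-a", 1.2)
print(a.name)
print(a == b)  # A different speed yields a different file, so no cache hit.
```

The point of the sketch is that all four fields take part in the key: changing any one of them results in a new provider request.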

The cache path can be changed in the RustPBX configuration file, for example:

config.toml
[tts]
media_cache_path = "/tmp/mediacache"

Other Configuration

More TTS options can be configured in both the tts field of CallOption in the Invite/Accept commands and the option field in TTS commands.

Their parameters are the same, see SynthesisOption for details.

If configured in both places, options in TTS command will override options in CallOption.
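The override rule can be pictured as a dictionary merge in which the TTS command's option wins. Whether RustPBX merges field by field or replaces the whole option is not stated here, so the shallow per-field merge below is an assumption, and `effective_option` is a hypothetical helper.

```python
def effective_option(call_tts, command_option):
    """Combine CallOption.tts with a TTS command's option; the command wins."""
    # Shallow merge: later keys override earlier ones (assumed behavior).
    return {**call_tts, **command_option}


call_tts = {"samplerate": 16000, "speed": 1.0, "volume": 5}
command_option = {"speed": 1.2}
print(effective_option(call_tts, command_option))
```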

Mainly includes:

  • samplerate: Sample rate
    • Default: 16000. Ideally set it to the Track sample rate, since resampling adds performance cost.
  • speed: Speech rate
    • Defaults to the provider's standard speech rate. Deepgram does not support adjusting speech rate. See the provider documentation for specific values.
  • volume: Volume
    • Defaults to the provider's standard volume. Deepgram does not support adjusting volume. See the provider documentation for specific values.
  • emotion: Emotion
    • Models and voices differ in which emotions they support. See the provider documentation for specific values.
  • endpoint: Custom service endpoint URL
    • tencent default: non-streaming wss://tts.cloud.tencent.com/stream_ws, streaming wss://tts.cloud.tencent.com/stream_wsv2.
    • aliyun default: wss://dashscope.aliyuncs.com/api-ws/v1/inference.
    • deepgram default: (https/wss)://api.deepgram.com/v1/speak.
  • extra: Provider-specific parameters, passed directly to provider.
  • max_concurrent_tasks: Maximum concurrent tasks for non-streaming TTS commands, default is 1.

If the provider has other parameters, set them in the extra field using the provider's own field names. These parameters are passed directly to the provider.

API Key Configuration

The API key can be configured in environment variables when starting RustPBX, or in SynthesisOption.

Configure in environment variables:

  • Tencent Cloud:

    • TENCENT_APPID: Tencent Cloud appId
    • TENCENT_SECRET_ID: Tencent Cloud secretId
    • TENCENT_SECRET_KEY: Tencent Cloud secretKey
  • Alibaba Cloud:

    • DASHSCOPE_API_KEY: Alibaba Cloud Model Studio Bailian API Key
  • Deepgram:

    • DEEPGRAM_API_KEY: Deepgram API Key

Configure in SynthesisOption:

Tencent Cloud:

  • appId: Tencent Cloud appId
  • secretId: Tencent Cloud secretId
  • secretKey: Tencent Cloud secretKey

Other providers:

  • secretKey: Provider API Key
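For clients that prefer to pass credentials in SynthesisOption, the mapping between the environment variable names and the option fields listed above can be sketched as follows. The lowercase provider keys and the helper `credentials_from_env` are assumptions for this example, not part of the RustPBX API.

```python
import os
from typing import Mapping

# Environment variable name -> SynthesisOption field, per provider.
ENV_TO_FIELD = {
    "tencent": {
        "TENCENT_APPID": "appId",
        "TENCENT_SECRET_ID": "secretId",
        "TENCENT_SECRET_KEY": "secretKey",
    },
    "aliyun": {"DASHSCOPE_API_KEY": "secretKey"},
    "deepgram": {"DEEPGRAM_API_KEY": "secretKey"},
}


def credentials_from_env(provider: str, env: Mapping[str, str] = os.environ) -> dict:
    """Collect a provider's credentials as SynthesisOption fields."""
    mapping = ENV_TO_FIELD[provider]
    missing = [name for name in mapping if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {field: env[name] for name, field in mapping.items()}


example_env = {"DEEPGRAM_API_KEY": "example-key"}
print(credentials_from_env("deepgram", example_env))
```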

Interruption

Use the Interrupt command to interrupt TTS that is currently playing.

  • If graceful=true is set, the TTS Track will exit after the current TTS command finishes playing (only effective in non-streaming TTS).
  • If graceful=false (default) is set, playback will be interrupted immediately.

If there is TTS currently playing and the provider supports subtitles, an Interruption event will be triggered, containing the played time and text position (if supported by the provider).
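A small sketch of building the Interrupt command is shown below. The `command: "interrupt"` envelope and the helper name are assumptions; the only behavior taken from this section is that graceful defaults to false and has an effect only for non-streaming TTS.

```python
def interrupt_command(graceful=False, streaming=False):
    """Build an Interrupt command for the TTS that is currently playing."""
    # graceful only takes effect for non-streaming TTS, so it is dropped
    # here when the current playback is streaming.
    return {"command": "interrupt", "graceful": graceful and not streaming}


print(interrupt_command())                # Immediate interruption (default).
print(interrupt_command(graceful=True))   # Let the current command finish first.
```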

[Figure: TTS commands with playId: 1 playing on TTS Track 1. An Interrupt with graceful = true arrives during playback, and the track emits TrackEnd.]
tip

For more information, see: