Overview

Architecture

RustPBX adopts a decoupled architecture.

Separation of Responsibilities:

- RustPBX: handles audio processing, protocol communication, and integration with TTS, ASR, and other services.
- Client: handles business logic, AI interaction, and process control.

Clients connect to the server via WebSocket and interact with the server through the Command/Event pattern.

- Client → RustPBX: the client sends Commands to control calls, for example to make a call, play TTS, or hang up.
- RustPBX → Client: the server pushes Events as status notifications, for example speech recognition results and call status changes.
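
The Command/Event pattern above can be sketched in a few lines of Python. This is a minimal illustration, not the actual wire format: the JSON envelope keys ("command" and "event"), the "hangup" command name, and the "text" field are assumptions made for the example. The WebSocket API specification defines the real message shapes.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical envelope keys; the real field names come from the WebSocket API.
COMMAND_KEY = "command"
EVENT_KEY = "event"


def encode_command(name: str, **params: Any) -> str:
    """Serialize one command (client -> RustPBX) as a JSON text frame."""
    return json.dumps({COMMAND_KEY: name, **params})


class EventDispatcher:
    """Routes events (RustPBX -> client) to handlers registered by event name."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[Dict[str, Any]], None]] = {}

    def on(self, name: str, handler: Callable[[Dict[str, Any]], None]) -> None:
        """Register the handler to call for events with this name."""
        self._handlers[name] = handler

    def dispatch(self, raw: str) -> bool:
        """Decode one frame; return True if a handler was found and called."""
        message = json.loads(raw)
        handler = self._handlers.get(message.get(EVENT_KEY))
        if handler is None:
            return False
        handler(message)
        return True
```

In a real client, `encode_command` output would be written to the WebSocket and every received text frame would be passed to `dispatch`.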

Concepts

Session

A Session represents one complete voice call. A call is either incoming (answering a call) or outgoing (initiating a call).

Each session is identified by a Session ID. By default the server generates a UUID automatically; you can also supply your own ID when connecting to RustPBX.

Track

RustPBX has several types of Tracks. The main ones are:

- SIP/WebRTC Track: handles media transmission.
- TTS/Play Track: plays an audio stream to the SIP/WebRTC Track.
- Media Pass: takes over audio stream processing.

Each Track triggers a Track Start event when it starts and a Track End event when it ends.

Connect to RustPBX

Clients connect to the server via WebSocket, and different call types are distinguished by path. For detailed parameters, see Connect to RustPBX.

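For example, connecting to ws://localhost:8080/call/sip?id=session123&dump=true does the following:

- Selects a SIP call (the /call/sip path)
- Sets the Session ID to session123 (the id parameter)
- Enables dump (the dump parameter)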
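
As a small sketch, the connection URL above can be assembled with the Python standard library. The helper name `build_call_url` is hypothetical. Only the /call/sip path and the id and dump parameters come from the example; other paths and parameters are listed on the Connect to RustPBX page.

```python
from urllib.parse import urlencode


def build_call_url(host: str, call_type: str, session_id: str = "", dump: bool = False) -> str:
    """Build the WebSocket URL for a call; call_type selects the path (e.g. "sip")."""
    query = {}
    if session_id:
        # Leave the id out to let the server generate a UUID.
        query["id"] = session_id
    if dump:
        query["dump"] = "true"
    url = f"ws://{host}/call/{call_type}"
    return f"{url}?{urlencode(query)}" if query else url
```

Calling `build_call_url("localhost:8080", "sip", "session123", dump=True)` reproduces the example URL.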

Call Establishment

Calls are either incoming (answering a call) or outgoing (initiating a call). Once a call is connected, you can transfer it with the Refer command, for example to hand over from the AI to a human agent.

Incoming and outgoing calls take largely the same CallOption parameters; the main difference is that an outgoing call also sets the callee. CallOption mainly configures features such as TTS, ASR, VAD, and recording.

Outgoing

To initiate a call, send the Invite command. Both SIP and WebRTC are supported. See Call Control for details.

- SIP calls must set the caller and callee SIP addresses (the caller and callee fields) in CallOption.
- WebRTC calls must set the SDP Offer.

If the call succeeds, you receive an Answer event; otherwise, you receive a Reject event.
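
The outgoing flow could look roughly like the following sketch. The caller and callee fields and the Answer and Reject event names come from the text above. The "command", "option", "offer", and "event" keys and the lowercase "invite" spelling are assumptions for illustration; see Call Control for the actual schema.

```python
import json
from typing import Optional


def build_invite(caller: Optional[str] = None,
                 callee: Optional[str] = None,
                 offer: Optional[str] = None) -> str:
    """Build an Invite command for a SIP call (caller + callee) or a WebRTC call (SDP offer)."""
    if offer is not None:
        option = {"offer": offer}  # WebRTC: carry the SDP Offer (key name assumed)
    elif caller and callee:
        option = {"caller": caller, "callee": callee}  # SIP: both addresses required
    else:
        raise ValueError("set caller and callee for SIP, or offer for WebRTC")
    return json.dumps({"command": "invite", "option": option})


def call_outcome(event: dict) -> Optional[bool]:
    """Return True on Answer, False on Reject, None for any other event."""
    name = event.get("event")
    if name == "Answer":
        return True
    if name == "Reject":
        return False
    return None
```

Returning None for unrelated events lets the caller keep waiting, since other events (such as Track Start) may arrive before the call outcome.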

[Diagram: SIP call flow]

Incoming

Answering an incoming call takes the following steps:

1. Configure a Webhook to receive incoming call notifications.
2. When a call comes in, RustPBX sends a request to the configured Webhook address. The request contains a dialogId that identifies the incoming call.
3. Connect to RustPBX with id set to the dialogId, then send the Accept command to answer the call or the Reject command to reject it.
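
The steps above can be sketched as a function that turns a Webhook notification into the connection URL and the first command to send. Several details are assumptions for illustration: that the Webhook request body is JSON with a top-level "dialogId" key, that incoming calls use the /call/sip path, and the "command" envelope with lowercase "accept" and "reject" names.

```python
import json
from typing import Tuple
from urllib.parse import urlencode


def plan_incoming(webhook_body: str, accept: bool, host: str = "localhost:8080") -> Tuple[str, str]:
    """Return (websocket_url, command_frame) for answering or rejecting an incoming call."""
    # The dialogId from the Webhook request identifies the incoming call.
    dialog_id = json.loads(webhook_body)["dialogId"]
    # Connect with id = dialogId so the session is bound to that call.
    url = f"ws://{host}/call/sip?{urlencode({'id': dialog_id})}"
    command = json.dumps({"command": "accept" if accept else "reject"})
    return url, command
```

A client would open a WebSocket to the returned URL and send the returned frame as its first message.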

[Diagram: Answer/Reject flow]

Transfer

During a call, use the Refer command to transfer the call to another SIP address, for example in "transfer to human" scenarios. The process is:

1. The client sends the Refer command.
2. RustPBX calls the transfer target.
3. After the target answers, RustPBX forwards audio between the two ends.

Call Control

For details, see: Call Control

Audio Processing

Play (Play/TTS)

Play audio files via the Play command, and convert text to speech via the TTS command.

The TTS and Play commands each create a corresponding Track. As with other Tracks, the start and end of a TTS Track or Play Track trigger Track Start and Track End events respectively.

If the TTS/Play command includes a playId parameter, the corresponding TrackEnd event carries the same playId, so the client can tell when that playback has finished.
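
One way to use playId for completion notifications is to keep a map from playId to a callback, as in this sketch. The playId parameter and the TrackEnd event name come from the text above. The "command", "text", and "event" keys and the class name `PlaybackTracker` are hypothetical.

```python
import json
from typing import Callable, Dict


class PlaybackTracker:
    """Matches TrackEnd events to the TTS commands that carried a playId."""

    def __init__(self) -> None:
        self._pending: Dict[str, Callable[[], None]] = {}

    def tts(self, text: str, play_id: str, on_done: Callable[[], None]) -> str:
        """Return a TTS command frame and remember the completion callback."""
        self._pending[play_id] = on_done
        return json.dumps({"command": "tts", "text": text, "playId": play_id})

    def handle_event(self, event: dict) -> None:
        """Fire the callback once when the TrackEnd for a known playId arrives."""
        if event.get("event") != "TrackEnd":
            return
        on_done = self._pending.pop(event.get("playId"), None)
        if on_done is not None:
            on_done()
```

Removing the entry on completion means a duplicate or unrelated TrackEnd does not trigger the callback twice.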

- The Play command accepts http and https addresses and supports the wav and mp3 formats.
- TTS supports four providers (Alibaba Cloud, Tencent Cloud, Deepgram, and VoiceApi), both streaming and non-streaming APIs, and base64-encoded audio.

TTS/Play

For details, see: TTS/Play

Automatic Speech Recognition (ASR)

ASR is configured in the CallOption of the call establishment commands (Invite, Accept, and Refer).

When enabled, RustPBX converts speech to text in real time and reports recognition results through two events:

- AsrDelta: an intermediate recognition result; its content may still change.
- AsrFinal: the final recognition result; its content is stable.

Both events contain the recognized text along with its start and end times.
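
A client typically treats the two ASR events differently: AsrDelta overwrites a provisional line, while AsrFinal commits text. The sketch below assumes an "event" key holding the event name, a "text" key holding the recognized text, and that each AsrDelta carries the full current hypothesis for the segment rather than an increment. None of these details are confirmed by this page.

```python
from typing import List


class Transcript:
    """Accumulates stable AsrFinal text and keeps the latest AsrDelta as provisional."""

    def __init__(self) -> None:
        self.final: List[str] = []
        self.interim: str = ""

    def handle_event(self, event: dict) -> None:
        name = event.get("event")
        if name == "AsrDelta":
            # Intermediate result: may still change, so replace rather than append.
            self.interim = event.get("text", "")
        elif name == "AsrFinal":
            # Final result: stable, so commit it and clear the provisional text.
            self.final.append(event.get("text", ""))
            self.interim = ""

    def text(self) -> str:
        """Committed text followed by any provisional text."""
        parts = self.final + ([self.interim] if self.interim else [])
        return " ".join(parts)
```

Business logic such as sending text to an LLM would usually act only on committed text, and use the provisional text for display or early intent hints.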

Four providers are supported: Tencent Cloud, Alibaba Cloud, Deepgram, and VoiceApi.

Voice Activity Detection (VAD)

Voice Activity Detection is also configured in CallOption.

Three implementations are supported: webrtc, silero, and ten.

When voice input is detected, RustPBX triggers a Speaking event; when there has been no voice input for a period of time, it triggers a Silence event.
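
On the client side, Speaking and Silence events can drive simple turn detection, for example to interrupt playback when the caller starts talking. The following sketch assumes an "event" key with the event name. The "turn_started" and "turn_ended" return values are hypothetical client-side labels, not part of RustPBX.

```python
from typing import Optional


class TurnDetector:
    """Collapses repeated Speaking/Silence events into turn boundaries."""

    def __init__(self) -> None:
        self.user_speaking = False

    def handle_event(self, event: dict) -> Optional[str]:
        """Return "turn_started", "turn_ended", or None if nothing changed."""
        name = event.get("event")
        if name == "Speaking" and not self.user_speaking:
            self.user_speaking = True
            return "turn_started"
        if name == "Silence" and self.user_speaking:
            self.user_speaking = False
            return "turn_ended"
        return None
```

Tracking state this way means a run of consecutive Speaking events produces a single "turn_started", which keeps downstream logic from reacting more than once per turn.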

Noise Reduction

Noise reduction is configured in the denoise field of CallOption and is off by default. It is implemented with the RNNoise algorithm.

Recording

Recording is configured in the recorder field of CallOption and is off by default. The wav, mp3, ogg, and flac formats are supported.

The recording directory and file format are set in the RustPBX configuration file:

config.toml
recorder_path = "/tmp/recorders"
recorder_format = "wav"

The file name defaults to the session ID (session_id); to use a custom name, set the recorderFile field.

Media Pass

As an alternative to TTS + ASR, you can use Media Pass to delegate audio processing entirely to an external system, for example to integrate an end-to-end large model.

Like TTS and Play, Media Pass also creates a corresponding Track. The difference is that Media Pass forwards audio in both directions:

- Speech from the phone is forwarded to the WebSocket server.
- Speech received from the WebSocket server is forwarded to the phone.
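
The bidirectional forwarding can be pictured as two independent pumps running at the same time. The sketch below is a conceptual model only, not RustPBX's implementation: asyncio queues stand in for the phone leg and the external WebSocket server, and a None frame is used as a hypothetical end-of-stream marker.

```python
import asyncio


async def pump(source: asyncio.Queue, sink: asyncio.Queue) -> None:
    """Forward audio frames from source to sink until a None sentinel arrives."""
    while True:
        frame = await source.get()
        await sink.put(frame)  # the sentinel is forwarded too, so the peer sees the end
        if frame is None:
            return


async def bridge(phone_in: asyncio.Queue, ws_out: asyncio.Queue,
                 ws_in: asyncio.Queue, phone_out: asyncio.Queue) -> None:
    """Run both directions concurrently: phone -> WebSocket and WebSocket -> phone."""
    await asyncio.gather(pump(phone_in, ws_out), pump(ws_in, phone_out))
```

Because each direction has its own pump, a stall in one direction (for example, the model taking time to respond) does not block audio flowing the other way.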

Next Steps

- WebSocket API: WebSocket API specification
- RustPBX Go SDK: RustPBX Go SDK specification
