Overview
Architecture
RustPBX adopts a decoupled architecture.
Separation of Responsibilities:
- RustPBX: Responsible for audio processing, protocol communication, TTS, ASR and other service integrations
- Client: Responsible for business logic, AI interaction, process control
Clients connect to the server via WebSocket and interact with the server through the Command/Event pattern.
- Client → RustPBX: Send Commands to control calls
  - Examples: make calls, play TTS, hang up calls
- RustPBX → Client: Push Events for status notifications
  - Examples: speech recognition results, call status changes
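The Command/Event pattern can be sketched as two small helpers: one that serializes a command into a JSON text frame, and one that routes an incoming event to a handler. This is a minimal sketch, not the RustPBX wire format: the `command` and `event` key names, the lower-case command names, and the helper names are all assumptions made for illustration. See the WebSocket API specification for the real schema.

```python
import json


def make_command(name, **fields):
    """Serialize a command as a JSON text frame.

    The "command" key is an assumed field name, not confirmed by this page.
    """
    return json.dumps({"command": name, **fields})


def dispatch_event(raw, handlers):
    """Decode one event frame and call the handler registered for its type.

    The "event" key is an assumed field name. Unknown event types are ignored.
    """
    event = json.loads(raw)
    handler = handlers.get(event.get("event"))
    if handler is None:
        return None
    return handler(event)


# Hypothetical usage: send a hangup command, react to a hangup event.
frame = make_command("hangup")
result = dispatch_event('{"event": "hangup"}', {"hangup": lambda e: "call ended"})
```

Keeping dispatch table-driven means the client only has to register handlers for the events its business logic cares about.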
Concepts
Session
A Session represents one complete voice call, which is either incoming (answering a call) or outgoing (initiating a call).
Each session is identified by a Session ID. By default, the server generates a UUID automatically; you can also supply your own ID when connecting to RustPBX.
Track
RustPBX has multiple types of Tracks, mainly including:
- SIP/WebRTC Track: Responsible for media transmission.
- TTS/Play Track: Responsible for playing audio streams to SIP/WebRTC Track.
- Media Pass: Takes over audio stream processing.
The start and end of each Track will trigger Track Start and Track End events respectively.
Connect to RustPBX
Clients connect to the server via WebSocket, and different call types are distinguished by path. For detailed parameters, see Connect to RustPBX.
Example ws://localhost:8080/call/sip?id=session123&dump=true:
- Use SIP call
- Set sessionId to session123
- Enable dump
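A connection URL of this shape can be assembled from its parts. The sketch below uses only the path and query parameters shown in the example (`/call/sip`, `id`, `dump`); the function name and its arguments are hypothetical, and other parameters the server may accept are not covered.

```python
from urllib.parse import urlencode


def call_url(host, call_type, session_id=None, dump=False):
    """Build a RustPBX WebSocket URL such as ws://host/call/sip?id=...&dump=true."""
    params = {}
    if session_id is not None:
        params["id"] = session_id  # omit to let the server generate a UUID
    if dump:
        params["dump"] = "true"
    base = f"ws://{host}/call/{call_type}"
    query = urlencode(params)
    return f"{base}?{query}" if query else base


print(call_url("localhost:8080", "sip", "session123", dump=True))
# prints ws://localhost:8080/call/sip?id=session123&dump=true
```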
Call Establishment
Calls are either incoming (answering a call) or outgoing (initiating a call). Once a call is connected, you can transfer it with the Refer command, for example to hand over from AI to a human.
The CallOption parameters for incoming and outgoing calls are mostly the same, apart from the callee setting. CallOption is mainly used to configure features such as TTS, ASR, VAD, and recording.
Outgoing
Initiate a call by sending the Invite command, supporting both SIP and WebRTC protocols. See Call Control for details.
- SIP calls need to set the caller (caller) and callee (callee) SIP addresses in CallOption
- WebRTC calls need to set the SDP Offer
If the call succeeds, you receive an Answer event; otherwise, you receive a Reject event.
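As an illustration of the outgoing SIP case, the sketch below builds an Invite command carrying `caller` and `callee` and classifies the resulting event. Only `caller`, `callee`, and the Invite, Answer, and Reject names come from this page; the `option` key, the lower-case wire names, and the SIP addresses are assumptions.

```python
import json


def invite_command(caller, callee):
    """Build an Invite command for an outgoing SIP call.

    The "command" and "option" key names are assumed for illustration.
    """
    option = {"caller": caller, "callee": callee}
    return json.dumps({"command": "invite", "option": option})


def call_outcome(event):
    """Map an event (already decoded to a dict) to the state of the outgoing call."""
    kind = event.get("event")
    if kind == "answer":
        return "connected"
    if kind == "reject":
        return "rejected"
    return "pending"  # any other event does not settle the call


# Hypothetical addresses, for illustration only.
frame = invite_command("sip:bot@example.com", "sip:alice@example.com")
```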
SIP Call Flow:
Incoming
Answering incoming calls has the following steps:
- Configure a Webhook to receive incoming call notifications.
- When a call comes in, RustPBX sends a request to the configured Webhook address. The request contains dialogId, which identifies the incoming call.
- Connect to RustPBX with id = dialogId, then send the Accept command to answer or the Reject command to reject.
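The steps above can be sketched as a single function that takes the Webhook request body and returns the URL to connect to plus the first command to send. It assumes the Webhook body is JSON with a top-level `dialogId` field and that commands are JSON objects with a `command` key; both are assumptions, as are the function and parameter names.

```python
import json
from urllib.parse import urlencode


def handle_incoming(webhook_body, base_url, should_answer):
    """Turn an incoming-call notification into (connect URL, first command).

    Assumes the Webhook body is JSON containing "dialogId".
    """
    dialog_id = json.loads(webhook_body)["dialogId"]
    # The session id must equal the dialogId so RustPBX can match the call.
    url = f"{base_url}?{urlencode({'id': dialog_id})}"
    command = {"command": "accept" if should_answer else "reject"}
    return url, json.dumps(command)


url, first_command = handle_incoming(
    '{"dialogId": "abc-123"}', "ws://localhost:8080/call/sip", should_answer=True
)
```

The decision to answer or reject is left to the caller here, since it belongs to the client's business logic.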
Answer/Reject Flow:
Transfer
Use the Refer command during a call to transfer the call to another SIP address, which can be used for "transfer to human" scenarios. Process:
- Send the Refer command
- RustPBX calls the transfer target
- After the target answers, RustPBX forwards audio between both ends
For details, see: Call Control
Audio Processing
Play (Play/TTS)
Play audio files via the Play command, and convert text to speech via the TTS command.
TTS and Play commands will create corresponding Tracks.
Like other Tracks, the start and end of TTS Track and Play Track will trigger Track Start and Track End events respectively.
If the TTS/Play command contains a playId parameter, the corresponding TrackEnd event will contain this playId, used to get playback completion notifications.
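One way to use playId for completion notifications is to remember a callback per playId and fire it when the matching TrackEnd event arrives. This is a minimal sketch: `playId` and the TrackEnd event come from this page, while the `command`, `text`, and `event` key names, the `trackEnd` wire spelling, and the `PlaybackTracker` class are assumptions.

```python
import json
import uuid


class PlaybackTracker:
    """Correlates TTS commands with their TrackEnd events through playId."""

    def __init__(self):
        self.pending = {}  # playId -> callback to run when playback ends

    def tts_command(self, text, on_done):
        """Build a TTS command with a fresh playId and register its callback."""
        play_id = str(uuid.uuid4())
        self.pending[play_id] = on_done
        return json.dumps({"command": "tts", "text": text, "playId": play_id})

    def on_event(self, event):
        """Run the callback for a matching TrackEnd event; return True if one ran."""
        if event.get("event") != "trackEnd":
            return False
        callback = self.pending.pop(event.get("playId"), None)
        if callback is None:
            return False  # track was not started by this tracker
        callback()
        return True
```

Popping the entry on completion keeps the table from growing over a long call and makes duplicate events harmless.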
- The Play command supports http and https addresses, and the wav and mp3 formats.
- TTS supports four providers: Alibaba Cloud, Tencent Cloud, Deepgram, and VoiceApi. It supports both streaming and non-streaming APIs, as well as base64 encoded audio.
For details, see: TTS/Play
Automatic Speech Recognition (ASR)
Configure ASR in the CallOption of the call establishment commands (Invite, Accept, or Refer).
ASR converts speech to text in real time and reports recognition results through events:
- AsrDelta: Intermediate recognition result, content may change.
- AsrFinal: Final recognition result, content is stable.
Events contain the recognized text result, as well as start and end times.
ASR supports four providers: Tencent Cloud, Alibaba Cloud, Deepgram, and VoiceApi.
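Because AsrDelta content may change while AsrFinal content is stable, a client typically overwrites the partial text on each delta and appends only final results. The sketch below shows that bookkeeping. The `asrDelta` and `asrFinal` wire spellings, the `event` and `text` key names, and the `Transcript` class are assumptions made for illustration.

```python
class Transcript:
    """Accumulates ASR results: stable final segments plus one changing partial."""

    def __init__(self):
        self.final_segments = []
        self.partial = ""

    def on_event(self, event):
        kind = event.get("event")
        if kind == "asrDelta":
            # Intermediate result: replace the previous partial, never append.
            self.partial = event.get("text", "")
        elif kind == "asrFinal":
            # Final result: commit it and clear the partial it supersedes.
            self.final_segments.append(event.get("text", ""))
            self.partial = ""

    def text(self):
        """Return committed segments followed by the current partial, if any."""
        parts = self.final_segments + ([self.partial] if self.partial else [])
        return " ".join(parts)
```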
Voice Activity Detection (VAD)
Voice Activity Detection is also configured in CallOption.
VAD supports three implementations: webrtc, silero, and ten.
A Speaking event is triggered when voice input is detected, and a Silence event is triggered when there has been no voice input for a period of time.
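A client can use the Speaking and Silence events to detect when the caller has finished a turn. The sketch below counts a turn each time Silence follows Speaking. The `speaking` and `silence` wire spellings, the `event` key, and the `TurnDetector` class are assumptions, and this is one possible policy rather than anything RustPBX prescribes.

```python
class TurnDetector:
    """Tracks whether the caller is speaking and counts completed turns."""

    def __init__(self):
        self.speaking = False
        self.turns = 0

    def on_event(self, event):
        """Return True exactly when this event completes a turn."""
        kind = event.get("event")
        if kind == "speaking":
            self.speaking = True
            return False
        if kind == "silence" and self.speaking:
            # Silence after speech: the caller finished a turn.
            self.speaking = False
            self.turns += 1
            return True
        return False  # silence without prior speech, or an unrelated event
```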
Noise Reduction
Noise reduction is configured in the denoise field of CallOption and is off by default. It is implemented with the RNNoise algorithm.
Recording
Recording is configured in the recorder field of CallOption and is off by default. It supports the wav, mp3, ogg, and flac formats.
The recording directory and format are set in the RustPBX configuration file:
recorder_path = "/tmp/recorders"
recorder_format = "wav"
The file name defaults to the session ID and can be customized with the recorderFile field.
Media Pass
In addition to TTS + ASR, you can use Media Pass to delegate audio processing entirely to an external system, which is useful for integrating end-to-end large models.
Like TTS and Play, Media Pass also creates a corresponding Track. The difference is that Media Pass audio forwarding is bidirectional.
- Speech sent from the phone will be forwarded to the WebSocket server
- Speech received from the WebSocket server will be forwarded to the phone
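The bidirectional forwarding can be pictured as two independent pumps running at the same time, one per direction. The sketch below is a conceptual model only, using in-memory queues in place of the phone leg and the WebSocket server; it is not how RustPBX is implemented. The `pump` and `bridge` names, and the use of `None` as an end-of-stream marker, are assumptions.

```python
import asyncio


async def pump(source, sink):
    """Copy audio frames from source to sink until the end-of-stream marker (None)."""
    while True:
        frame = await source.get()
        await sink.put(frame)  # the marker is forwarded too, so the peer can stop
        if frame is None:
            return


async def bridge(phone_in, ws_out, ws_in, phone_out):
    """Forward phone -> WebSocket and WebSocket -> phone concurrently."""
    await asyncio.gather(pump(phone_in, ws_out), pump(ws_in, phone_out))


async def demo():
    phone_in, ws_out, ws_in, phone_out = (asyncio.Queue() for _ in range(4))
    for frame in (b"caller-1", b"caller-2", None):
        phone_in.put_nowait(frame)
    for frame in (b"model-1", None):
        ws_in.put_nowait(frame)
    await bridge(phone_in, ws_out, ws_in, phone_out)
    upstream = [ws_out.get_nowait() for _ in range(3)]
    downstream = [phone_out.get_nowait() for _ in range(2)]
    return upstream, downstream
```

Running the two directions as separate tasks means a stall in one direction does not block audio in the other.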
Next Steps
- WebSocket API: WebSocket API specification
- RustPBX Go SDK: RustPBX Go SDK specification