Overview

Architecture

RustPBX adopts a decoupled architecture.

Separation of Responsibilities:

  • RustPBX: Responsible for audio processing, protocol communication, and integration with services such as TTS and ASR
  • Client: Responsible for business logic, AI interaction, process control

[Figure: architecture diagram. Phone and Web clients reach RustPBX (UserAgent, Media Engine, TTS, ASR, VAD) over SIP and WebRTC; the AI Agent (SDK, LLM) exchanges Commands and Events with RustPBX over WebSocket.]

Clients connect to the server via WebSocket and interact with the server through the Command/Event pattern.

  • Client → RustPBX: Send Commands to control calls
    • Examples: Make calls, play TTS, hang up calls
  • RustPBX → Client: Push Events for status notifications
    • Examples: Speech recognition results, call status changes
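
The Command/Event exchange can be pictured as JSON messages travelling over the WebSocket. The following is a minimal sketch in Python, assuming each message is a JSON object. The `command` and `event` keys, the lowercase names (`hangup`, `asrFinal`), and the helper names `encode_command` and `EventDispatcher` are illustrative assumptions, not the documented wire format.

```python
import json
from typing import Callable, Dict

# Hypothetical wire format: the "command"/"event" keys and the names used
# in the examples are assumptions, not the documented RustPBX schema.


def encode_command(name: str, **params) -> str:
    """Serialize a Command that the client would send to RustPBX."""
    return json.dumps({"command": name, **params})


class EventDispatcher:
    """Routes Events pushed by the server to handlers registered by name."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[dict], None]] = {}

    def on(self, event_name: str, handler: Callable[[dict], None]) -> None:
        """Register the handler to call for one event name."""
        self._handlers[event_name] = handler

    def dispatch(self, raw_message: str) -> bool:
        """Decode one message; return True if a handler consumed it."""
        message = json.loads(raw_message)
        handler = self._handlers.get(message.get("event"))
        if handler is None:
            return False
        handler(message)
        return True
```

In a real client, `dispatch` would be called for every text frame received from the WebSocket, and the string returned by `encode_command` would be sent as a text frame.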

[Figure: Command/Event flow. Audio passes between the Phone and RustPBX (UserAgent, Media Engine); RustPBX pushes ASR Events to the Agent (SDK, LLM), and the Agent sends TTS Commands back.]

Concepts

Session

A Session represents one complete voice call. A call is either incoming (the client answers it) or outgoing (the client initiates it).

Each session is identified by a Session ID. By default the server generates a UUID; you can also supply your own ID when connecting to RustPBX.

Track

RustPBX has several types of Track. The main ones are:

  • SIP/WebRTC Track: Responsible for media transmission.
  • TTS/Play Track: Responsible for playing audio streams to the SIP/WebRTC Track.
  • Media Pass Track: Hands audio stream processing over to an external system.

Each Track triggers a Track Start event when it starts and a Track End event when it ends.

Connect to RustPBX

Clients connect to the server via WebSocket, and different call types are distinguished by path. For detailed parameters, see Connect to RustPBX.

For example, connecting to ws://localhost:8080/call/sip?id=session123&dump=true does the following:

  • Uses a SIP call
  • Sets the session ID to session123
  • Enables dump
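
The example URL can be assembled programmatically. The sketch below uses only the `id` and `dump` query parameters and the `/call/sip` path shown above; the function name `build_call_url` and its `host` and `call_type` arguments are hypothetical helpers introduced for illustration.

```python
from urllib.parse import urlencode


def build_call_url(host: str, call_type: str,
                   session_id: str = "", dump: bool = False) -> str:
    """Build the WebSocket URL for a call of the given type (e.g. "sip")."""
    query = {}
    if session_id:
        # Omitting the id lets the server generate a UUID for the session.
        query["id"] = session_id
    if dump:
        query["dump"] = "true"
    url = f"ws://{host}/call/{call_type}"
    return f"{url}?{urlencode(query)}" if query else url
```

Calling `build_call_url("localhost:8080", "sip", "session123", dump=True)` reproduces the example URL. `urlencode` also percent-encodes a custom session ID that contains reserved characters.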

Call Establishment

Calls are either incoming (answering a call) or outgoing (initiating a call). Once a call is connected, you can transfer it with the Refer command, for example to hand over from an AI to a human.

The CallOption parameters are mostly the same for incoming and outgoing calls; the exception is setting the callee. CallOption is mainly used to configure features such as TTS, ASR, VAD, and recording.

Outgoing

Initiate a call by sending the Invite command, supporting both SIP and WebRTC protocols. See Call Control for details.

  • SIP calls must set the SIP addresses of the caller (caller) and the callee (callee) in CallOption
  • WebRTC calls must set the SDP Offer

If the call succeeds, you receive an Answer event; otherwise you receive a Reject event.
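
As a rough illustration of the outgoing flow, the sketch below builds an Invite command and classifies the event that comes back. Only the `caller` and `callee` option names come from the text above; the JSON envelope (`command`, `option`), the `offer` field name, the lowercase event names, and both function names are assumptions.

```python
import json
from typing import Optional


def build_invite(caller: Optional[str] = None,
                 callee: Optional[str] = None,
                 offer: Optional[str] = None) -> str:
    """Build an Invite command for a SIP call or a WebRTC call."""
    option = {}
    if offer is not None:
        # WebRTC: the SDP Offer is required ("offer" is an assumed field name).
        option["offer"] = offer
    elif caller and callee:
        # SIP: both SIP addresses are required.
        option["caller"] = caller
        option["callee"] = callee
    else:
        raise ValueError(
            "SIP calls need caller and callee; WebRTC calls need an SDP offer")
    return json.dumps({"command": "invite", "option": option})


def call_established(event: dict) -> Optional[bool]:
    """Return True on Answer, False on Reject, None for any other event."""
    return {"answer": True, "reject": False}.get(event.get("event"))
```

A client would send the result of `build_invite` once after connecting and then wait until `call_established` returns something other than `None`.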

SIP Call Flow:

[Figure: SIP call flow. 1: The Client sends Invite to RustPBX; 2: RustPBX sends INVITE to the Phone; 3: The Phone returns 200 OK; 4: RustPBX sends Answer to the Client; 5: RustPBX sends ACK to the Phone.]

Incoming

Answering an incoming call involves the following steps:

  1. Configure a Webhook to receive incoming call notifications.
  2. When a call comes in, RustPBX sends a request to the configured Webhook address. The request contains a dialogId that identifies the incoming call.
  3. Connect to RustPBX with id set to the dialogId, then send the Accept command to answer the call or the Reject command to reject it.
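
Steps 2 and 3 amount to mapping the dialogId from the Webhook request onto the `id` parameter of the WebSocket URL. The sketch below assumes the dialogId arrives as a `dialogId` query parameter on the Webhook request URL; how the request actually carries it is not specified here, and the function names are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def dialog_id_from_webhook(request_url: str) -> str:
    """Extract dialogId from the webhook request URL (assumed query parameter)."""
    values = parse_qs(urlsplit(request_url).query).get("dialogId")
    if not values:
        raise ValueError("webhook request carries no dialogId")
    return values[0]


def incoming_call_url(host: str, dialog_id: str) -> str:
    """Build the WebSocket URL that attaches to the pending incoming call."""
    return f"ws://{host}/call/sip?{urlencode({'id': dialog_id})}"
```

After connecting to the resulting URL, the client sends either the Accept or the Reject command.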

Answer/Reject Flow:

[Figure: answer/reject flow. 1: The Phone sends INVITE to RustPBX; 2: RustPBX calls the Webhook at http://localhost:8090/webhook with dialogId 666; 3: The Client connects to ws://localhost:8080/call/sip?id=666; 4: The Client sends Accept or Reject; 5: RustPBX replies 200 OK or 603 Decline to the Phone, followed by ACK.]

Transfer

During a call, use the Refer command to transfer the call to another SIP address, for example to hand the caller over to a human. The process is:

  1. Send Refer command
  2. RustPBX calls the transfer target
  3. After the target answers, RustPBX forwards audio between both ends

[Figure: transfer flow. 1: A call is in progress between the Phone and RustPBX; 2: The Client sends Refer; 3: RustPBX sends Invite to the call center agent; 4: The agent returns 200 OK; 5: The call continues, with RustPBX forwarding audio between the Phone and the agent.]

Call Control

For details, see: Call Control

Audio Processing

Play (Play/TTS)

Play audio files via the Play command, and convert text to speech via the TTS command.

TTS and Play commands will create corresponding Tracks.

Like other Tracks, the start and end of TTS Track and Play Track will trigger Track Start and Track End events respectively.

If the TTS/Play command contains a playId parameter, the corresponding TrackEnd event contains the same playId, so you can tell when that playback has finished.
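
One way to use playId is to remember each ID that was sent and tick it off when the matching TrackEnd event arrives. The sketch below assumes events are JSON objects with an `event` key whose value is `trackEnd` and a `playId` key; those names and the `PlaybackTracker` class are illustrative assumptions.

```python
from typing import Optional, Set


class PlaybackTracker:
    """Tracks playIds that were sent and not yet reported as finished."""

    def __init__(self) -> None:
        self._pending: Set[str] = set()

    def started(self, play_id: str) -> None:
        """Record a playId that was attached to a TTS or Play command."""
        self._pending.add(play_id)

    def handle_event(self, event: dict) -> Optional[str]:
        """Return the playId that just finished, or None if nothing matched."""
        if event.get("event") != "trackEnd":
            return None
        play_id = event.get("playId")
        if play_id in self._pending:
            self._pending.discard(play_id)
            return play_id
        return None

    @property
    def pending(self) -> Set[str]:
        """A copy of the playIds still waiting for their TrackEnd event."""
        return set(self._pending)
```

A TrackEnd event without a playId, or with one that was never registered, is ignored, so Track End events from other Tracks do not count as completed playback.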

  • The Play command accepts http and https URLs and supports the wav and mp3 formats.

  • TTS supports four providers: Alibaba Cloud, Tencent Cloud, Deepgram, and VoiceApi. It supports both streaming and non-streaming APIs, as well as base64-encoded audio.

[Figure: TTS/Play data flow. An audio stream from the External Service feeds the TTS/Play Track, which is forwarded to the RTP Track and sent to the Phone over the RTP connection.]

TTS/Play

For details, see: TTS/Play

Automatic Speech Recognition (ASR)

ASR is configured in the CallOption of the call establishment commands (Invite, Accept, and Refer).

ASR converts speech to text in real time and reports recognition results through events:

  • AsrDelta: Intermediate recognition result, content may change.
  • AsrFinal: Final recognition result, content is stable.

Events contain the recognized text result, as well as start and end times.
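
Because AsrDelta text may still change while AsrFinal text is stable, a client typically overwrites its provisional text on every delta and commits it on the final result. The sketch below assumes events are JSON objects with an `event` key (`asrDelta` or `asrFinal`) and a `text` key; those names and the `Transcript` class are illustrative assumptions.

```python
from typing import List


class Transcript:
    """Accumulates final ASR segments and keeps the latest provisional text."""

    def __init__(self) -> None:
        self.final_segments: List[str] = []
        self.provisional = ""

    def handle_event(self, event: dict) -> None:
        kind = event.get("event")
        if kind == "asrDelta":
            # Intermediate result: replace, never append, since it may change.
            self.provisional = event.get("text", "")
        elif kind == "asrFinal":
            # Final result: commit it and clear the provisional text.
            self.final_segments.append(event.get("text", ""))
            self.provisional = ""

    def text(self) -> str:
        """Committed segments followed by any provisional text."""
        parts = list(self.final_segments)
        if self.provisional:
            parts.append(self.provisional)
        return " ".join(parts)
```

Only `final_segments` should be passed on to downstream logic such as an LLM prompt; the provisional text is suited to live display.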

ASR supports four providers: Tencent Cloud, Alibaba Cloud, Deepgram, and VoiceApi.

[Figure: ASR flow within a Session. The Phone sends "Hello" as audio to RustPBX, which passes the audio to the ASR Provider; the provider returns "Hello" as text, and RustPBX pushes an AsrFinal event (text: Hello) to the Client.]

Voice Activity Detection (VAD)

Voice Activity Detection is also configured in CallOption.

Three implementations are supported: webrtc, silero, and ten.

A Speaking event is triggered when voice input is detected, and a Silence event is triggered when there has been no voice input for a period of time.
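
A client can fold the Speaking and Silence events into a simple two-state tracker that reports only the transitions. The sketch below assumes events are JSON objects with an `event` key whose value is `speaking` or `silence`; those names, the `SpeechState` class, and the returned labels are illustrative assumptions.

```python
from typing import Optional


class SpeechState:
    """Tracks whether the caller is currently speaking."""

    def __init__(self) -> None:
        self.speaking = False

    def handle_event(self, event: dict) -> Optional[str]:
        """Return a transition label, or None if the state did not change."""
        kind = event.get("event")
        if kind == "speaking" and not self.speaking:
            self.speaking = True
            return "speech_started"
        if kind == "silence" and self.speaking:
            self.speaking = False
            return "speech_ended"
        return None
```

Reporting only the transitions keeps repeated Speaking or Silence events from triggering the same client action twice.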

Noise Reduction

Noise reduction is configured in the denoise field of CallOption and is off by default. It is implemented with the RNNoise algorithm.

Recording

Recording is configured in the recorder field of CallOption and is off by default. The wav, mp3, ogg, and flac formats are supported.

The recording directory and format are set in the RustPBX configuration file:

config.toml
recorder_path = "/tmp/recorders"
recorder_format = "wav"

The file name defaults to the session_id and can be customized with the recorderFile field.

Media Pass

As an alternative to TTS plus ASR, you can use Media Pass to delegate audio processing entirely to an external system, which suits integration with end-to-end large models.

Like TTS and Play, Media Pass creates a corresponding Track. The difference is that Media Pass forwards audio in both directions:

  • Speech sent from the phone will be forwarded to the WebSocket server
  • Speech received from the WebSocket server will be forwarded to the phone

[Figure: Media Pass data flow. The Media Pass Track relays the audio stream in both directions between the WebSocket Service and the RTP Track, which connects to the Phone over RTP.]

Next Steps