
Preface

RustPBX is a high-performance media processing engine designed for building Voice Agents.

Voice Agent Architecture

General-purpose agents exchange text with large models, whereas a Voice Agent takes voice as input and produces voice as output.

There are two main approaches:

  • End-to-end architecture: Uses voice as input and output. Offers low latency and multi-modal understanding capabilities.
  • Chained architecture: Recognizes speech as text, sends the text to a large model, then converts the model's text response to speech and plays it back.

End-to-end large models are expensive (at least 10x the cost of text models), and their tool calling requirements are complex.

The chained architecture interacts with large models through text. This makes the output more controllable, allows flexible voice selection through the TTS (Text-to-Speech) service, and keeps tool calling the same as with text models.
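The three stages of the chained architecture can be sketched as a small pipeline. This is a minimal illustration, not RustPBX code: `ChainedPipeline`, `fake_asr`, `fake_llm`, and `fake_tts` are hypothetical names, and the stub functions stand in for real speech recognition, large model, and TTS services.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ChainedPipeline:
    """Hypothetical wrapper around the three chained stages."""
    recognize: Callable[[bytes], str]    # speech -> text (ASR)
    generate: Callable[[str], str]       # text -> text (large model)
    synthesize: Callable[[str], bytes]   # text -> speech (TTS)

    def handle_turn(self, audio_in: bytes) -> bytes:
        """Run one conversational turn through the three stages in order."""
        user_text = self.recognize(audio_in)
        reply_text = self.generate(user_text)
        return self.synthesize(reply_text)


# Stub stages (assumptions for illustration only): the "audio" is just
# UTF-8 text so the example runs without any external service.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = ChainedPipeline(fake_asr, fake_llm, fake_tts)
audio_out = pipeline.handle_turn(b"hello")
```

Because each stage is an ordinary function, the TTS voice or the model can be swapped without touching the other two stages, which is the flexibility the chained approach is meant to offer.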

[Figure: Chained Architecture. Labels: Audio Input, Speech Recognition, Text Request, AI SDK, Model, Text Response, Text to Speech, Audio Output]

Using the SDK

Clients connect to RustPBX over WebSocket and interact with it through a Command/Event pattern:

[Figure: SDK interaction diagram. Labels: Phone, UserAgent, Audio, RustPBX, Media Engine, SDK, TTS Command, ASR Event, Agent, LLM]

Call Control

Clients send Commands to RustPBX to control call behavior. For example:

  • Initiate calls via the Invite command
  • Hang up calls via the Hangup command
  • Play text via the TTS command
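As a rough sketch of what sending these Commands might look like, the helpers below serialize each one as a JSON text message suitable for a WebSocket. The JSON field names (`command`, `callee`, `text`) and the helper function names are assumptions made for illustration; the actual RustPBX message schema may differ and should be taken from its API reference.

```python
import json

# NOTE: the field names below are hypothetical, not the confirmed
# RustPBX wire format.


def invite_command(callee: str) -> str:
    """Build a message asking the server to start a call."""
    return json.dumps({"command": "invite", "callee": callee})


def tts_command(text: str) -> str:
    """Build a message asking the server to speak the given text."""
    return json.dumps({"command": "tts", "text": text})


def hangup_command() -> str:
    """Build a message asking the server to end the call."""
    return json.dumps({"command": "hangup"})


# A typical call would send these in order over the WebSocket connection.
messages = [
    invite_command("sip:alice@example.com"),
    tts_command("Hello, how can I help?"),
    hangup_command(),
]
```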

Event Notifications

RustPBX sends Events to clients over the WebSocket connection to notify them of call status changes and processing results. For example:

  • Report speech recognition results via the AsrFinal event
  • Report speaking status via the Speaking event
  • Report that the other party has hung up via the Hangup event
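On the client side, incoming Events are typically routed to handlers by name. The sketch below shows one way to do that, assuming each Event arrives as a JSON object with an `event` field; that field name, the lowercase event names (`asrFinal`, `hangup`), the `text` field, and the `EventDispatcher` class are all illustrative assumptions rather than the documented RustPBX format.

```python
import json
from typing import Callable, Dict

Handler = Callable[[dict], None]


class EventDispatcher:
    """Hypothetical router from event name to handler function."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def on(self, event_name: str, handler: Handler) -> None:
        """Register the handler to call for a given event name."""
        self._handlers[event_name] = handler

    def dispatch(self, raw_message: str) -> bool:
        """Parse one message; return True if a handler was found and run."""
        event = json.loads(raw_message)
        handler = self._handlers.get(event.get("event"))
        if handler is None:
            return False
        handler(event)
        return True


transcripts = []
call_state = {"active": True}

dispatcher = EventDispatcher()
# Assumed event names and payload fields, for illustration only.
dispatcher.on("asrFinal", lambda e: transcripts.append(e["text"]))
dispatcher.on("hangup", lambda e: call_state.update(active=False))

dispatcher.dispatch('{"event": "asrFinal", "text": "book a table"}')
dispatcher.dispatch('{"event": "hangup"}')
```

In a Voice Agent, the recognized text handler is where the transcript would be forwarded to the large model, and the model's reply would go back to RustPBX as a TTS command.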
