Preface

RustPBX is a high-performance media processing engine designed for building Voice Agents.

Voice Agent Architecture

General-purpose agents interact with large models through text, whereas a Voice Agent takes voice as input and produces voice as output.

There are two main approaches:

End-to-end architecture: The model takes voice as input and produces voice as output. It offers low latency and multi-modal understanding.
Chained architecture: Recognizes speech as text, sends the text to a large model, then converts the model's reply to speech and plays it back.

End-to-end large models are expensive (at least 10x the cost of text models), and tool calling with them is more complex.

The chained architecture interacts with the large model through text, which makes output more controllable and allows flexible voice selection through the TTS (Text-to-Speech) service. Tool calling works the same way as with text models.
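The chained flow can be sketched as three stages wired together. This is a minimal illustration rather than RustPBX code: `chained_turn` and the stage functions `fake_asr`, `fake_llm`, and `fake_tts` are hypothetical stand-ins for a speech recognition service, a text model, and a TTS service.

```python
from typing import Callable

# Hypothetical stage signatures; none of these are RustPBX APIs.
Recognizer = Callable[[bytes], str]    # audio in, transcript out (ASR)
TextModel = Callable[[str], str]       # prompt in, reply text out (LLM)
Synthesizer = Callable[[str], bytes]   # reply text in, audio out (TTS)


def chained_turn(audio: bytes, recognize: Recognizer,
                 complete: TextModel, synthesize: Synthesizer) -> bytes:
    """Run one conversational turn through the ASR -> LLM -> TTS chain."""
    transcript = recognize(audio)   # speech to text
    reply = complete(transcript)    # text-only model call
    return synthesize(reply)        # text to speech


# Stub stages so the sketch runs without any external service.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")
```

Because the model only ever sees and returns text, each stage can be swapped independently, for example to change the voice by replacing the TTS stage alone.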

Using the SDK

Clients connect to RustPBX over WebSocket and interact with it through a Command/Event pattern:

Call Control

Clients send Commands to RustPBX to control call behavior. For example:

Initiate calls via the Invite command
Hang up calls via the Hangup command
Play text via the TTS command
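As a rough sketch of what sending these commands could look like, the snippet below serializes each one as a JSON text message. The wire format is an assumption made for illustration: the `command` key, the lowercase command names, and the `callee` and `text` fields are hypothetical, and `make_command` is a helper invented here. The SDK specification is the authority on the real message shape.

```python
import json


def make_command(name: str, **fields) -> str:
    """Serialize a command as a JSON string.

    Assumed shape: a "command" key naming the command, plus its fields.
    """
    return json.dumps({"command": name, **fields})


# Hypothetical payloads; field names are illustrative only.
invite = make_command("invite", callee="sip:alice@example.com")
tts = make_command("tts", text="Hello, how can I help?")
hangup = make_command("hangup")
```

Each resulting string would be sent as one message over the WebSocket connection.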

Event Notifications

RustPBX sends Events to clients over the WebSocket connection to report call status changes and processing results. For example:

Report speech recognition results via the AsrFinal event
Report speaking status via the Speaking event
Report that the other party hung up via the Hangup event
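On the client side, incoming events are typically routed to handlers by name. The sketch below shows one way to do that, assuming each event arrives as a JSON object with an `event` key. That key, the `asrFinal` spelling, and the `text` field are assumptions for illustration, and `EventDispatcher` is a hypothetical helper, not part of any RustPBX SDK.

```python
import json
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class EventDispatcher:
    """Route decoded events to the handler registered for their name."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def on(self, name: str, handler: Handler) -> None:
        # One handler per event name; a later registration replaces an earlier one.
        self._handlers[name] = handler

    def dispatch(self, raw: str) -> bool:
        """Decode one message; return True if a handler was called."""
        event = json.loads(raw)
        handler = self._handlers.get(event.get("event"))
        if handler is None:
            return False  # unknown events are ignored, not treated as errors
        handler(event)
        return True


# Usage: collect recognized text from hypothetical asrFinal events.
transcripts: List[str] = []
dispatcher = EventDispatcher()
dispatcher.on("asrFinal", lambda e: transcripts.append(e["text"]))
```

Ignoring unknown event names keeps the client working if the server adds new event types later.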

More Content

RustPBX Go SDK

RustPBX Go SDK specification
