Preface
RustPBX is a high-performance media processing engine designed for building Voice Agents.
Voice Agent Architecture
General-purpose agents interact with large models through text, whereas a Voice Agent takes voice as input and produces voice as output.
There are two main approaches:
- End-to-end architecture: Uses voice as input and output. Offers low latency and multi-modal understanding capabilities.
- Chained architecture: Recognizes speech as text, sends the text to a large model, then converts the model's text reply to speech and plays it back.
End-to-end large models are expensive (at least 10x more than text models) and have complex tool calling requirements.
The chained architecture interacts with large models through text, which makes output more controllable, allows flexible voice selection depending on the TTS (Text-to-Speech) service, and keeps tool calling the same as with text models.
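The chained flow described above can be sketched as three stages composed in sequence. This is a minimal illustration, not RustPBX code: `run_turn` and the `fake_*` stages are hypothetical stand-ins for a real ASR service, large model, and TTS service.

```python
from typing import Callable


def run_turn(
    audio: bytes,
    recognize: Callable[[bytes], str],
    complete: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """Run one conversational turn of a chained pipeline: ASR -> LLM -> TTS."""
    text_in = recognize(audio)      # speech to text
    text_out = complete(text_in)    # text to text (large model)
    return synthesize(text_out)     # text to speech


# Hypothetical stand-in stages so the sketch runs without external services.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


reply_audio = run_turn(b"hello", fake_asr, fake_llm, fake_tts)
```

Because every stage boundary is plain text, the model's reply can be inspected or filtered before synthesis, and the TTS stage can be swapped without touching the other two.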
Using the SDK
Clients connect to RustPBX over WebSocket and interact through a Command/Event pattern:
Call Control
Clients send Commands to RustPBX to control call behavior. For example:
- Initiate calls via Invite command
- Hang up calls via Hangup command
- Play text via TTS command
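As a rough sketch of what sending such Commands might involve, the snippet below serializes them as JSON text messages. The wire format is an assumption made for illustration: the `command` key, the lowercase command names, and the `callee` and `text` fields are hypothetical and should be checked against the SDK specification.

```python
import json


def make_command(name: str, **fields) -> str:
    """Serialize a command as a JSON text message.

    The {"command": ...} envelope and the field names are assumptions,
    not the documented RustPBX wire format.
    """
    return json.dumps({"command": name, **fields})


invite = make_command("invite", callee="sip:alice@example.com")  # hypothetical field
tts = make_command("tts", text="Hello, how can I help?")         # hypothetical field
hangup = make_command("hangup")
```

Each resulting string would be sent as one text frame over the WebSocket connection the client has already opened.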
Event Notifications
RustPBX sends Events to clients over the WebSocket connection to notify them of call status changes and processing results. For example:
- Notify speech recognition results via AsrFinal event
- Notify speaking status via Speaking event
- Notify when the other party hangs up via Hangup event
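On the receiving side, a client typically routes each incoming Event to a handler by name. The dispatcher below is a minimal sketch under assumptions: the `EventDispatcher` class is hypothetical, and the `event` key, the event name spellings, and the `text` field are illustrative rather than the documented RustPBX event schema.

```python
import json
from typing import Callable, Dict

Handler = Callable[[dict], None]


class EventDispatcher:
    """Route JSON events to handlers keyed by event name (hypothetical helper)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def on(self, name: str, handler: Handler) -> None:
        self._handlers[name] = handler

    def dispatch(self, raw: str) -> bool:
        """Parse one message; return True if a handler was registered for it."""
        event = json.loads(raw)
        handler = self._handlers.get(event.get("event"))  # assumed envelope key
        if handler is None:
            return False
        handler(event)
        return True


transcripts = []
call_ended = []

dispatcher = EventDispatcher()
dispatcher.on("asrFinal", lambda e: transcripts.append(e["text"]))  # assumed field
dispatcher.on("hangup", lambda e: call_ended.append(True))

# Simulated messages standing in for frames received over the WebSocket.
dispatcher.dispatch('{"event": "asrFinal", "text": "book a table"}')
dispatcher.dispatch('{"event": "hangup"}')
```

Returning `False` for unregistered events lets the caller log or ignore event types it does not handle, instead of failing on them.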
More Content
📄️ RustPBX Go SDK
RustPBX Go SDK specification