Preface

Active Call is a dedicated User Agent and high-performance media processing engine designed specifically for building Voice AI applications. It handles complex telephony protocols and audio processing, allowing you to focus on your AI’s business logic.

Active Call Architecture

General agents interact with large models using text, while Active Call's input and output are voice.

There are two main approaches:

  • End-to-end architecture: Uses voice as input and output. Offers low latency and multi-modal understanding capabilities.
  • Chained architecture: Recognizes speech as text (ASR), sends the text to a large model, then converts the model's reply back to speech and plays it.

End-to-end large models are expensive (at least 10x the price of text models) and have complex tool calling requirements.

The chained architecture interacts with large models through text, which makes the output more controllable, allows flexible voice selection via the TTS (Text-to-Speech) service, and keeps tool calling identical to text models.
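The chained flow can be sketched as three stages wired in sequence. The stage functions below are hypothetical stand-ins (stubs), not Active Call APIs; only the ASR → LLM → TTS structure comes from the description above.

```python
# Sketch of the chained architecture: ASR -> LLM -> TTS.
# Each stage function is a placeholder for a real service call.

def recognize_speech(audio: bytes) -> str:
    """ASR stage: convert caller audio into text (stubbed)."""
    return "what are your opening hours"

def ask_llm(text: str) -> str:
    """LLM stage: text in, text out -- tool calling works as with text models."""
    return f"Reply to: {text}"

def synthesize_speech(text: str) -> bytes:
    """TTS stage: convert the reply text into playable audio (stubbed)."""
    return text.encode("utf-8")

def chained_turn(audio: bytes) -> bytes:
    """One conversational turn through the chained pipeline."""
    text = recognize_speech(audio)
    reply = ask_llm(text)
    return synthesize_speech(reply)
```

Because the middle stage is plain text, swapping the LLM, the TTS voice, or the ASR provider only touches one stage.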

Agent

Using the SDK

Clients connect to Active Call over WebSocket and interact through a Command/Event pattern.
Call Control

Clients send Commands to Active Call to control call behavior. For example:

  • Initiate calls via Invite command
  • Hang up calls via Hangup command
  • Play text via TTS command
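Commands like the ones above travel to Active Call as messages over the WebSocket. The sketch below assumes a simple JSON wire format with a "command" field; only the command names (Invite, Hangup, TTS) come from this document, and the field names are illustrative.

```python
import json

def make_command(command: str, **params) -> str:
    """Serialize a command as a JSON message.

    The {"command": ..., ...params} shape is an assumed wire format,
    not the documented Active Call schema.
    """
    return json.dumps({"command": command, **params})

# Examples mirroring the commands listed above (parameters are hypothetical):
invite = make_command("Invite", callee="sip:alice@example.com")
tts = make_command("TTS", text="Hello, how can I help you?")
hangup = make_command("Hangup")
```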

Event Notifications

Active Call sends Events to clients over the WebSocket connection to notify them of call status changes and processing results. For example:

  • Notify speech recognition results via AsrFinal event
  • Notify speaking status via Speaking event
  • Notify when the other party hangs up via Hangup event
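On the receiving side, a client typically parses each incoming message and routes it by event type. This is a minimal dispatcher sketch: the event names (AsrFinal, Speaking, Hangup) come from the list above, but the JSON shape and field names are assumptions.

```python
import json

def handle_event(message: str) -> str:
    """Route an incoming Active Call event message.

    The {"event": ..., ...fields} shape is an assumed wire format;
    only the event names come from the documentation.
    """
    event = json.loads(message)
    kind = event.get("event")
    if kind == "AsrFinal":
        # Final speech recognition result for the user's utterance.
        return f"user said: {event.get('text', '')}"
    if kind == "Speaking":
        # The remote party has started speaking.
        return "user started speaking"
    if kind == "Hangup":
        # The other party hung up.
        return "call ended"
    return f"unhandled event: {kind}"
```

A real client would feed messages from the WebSocket into a handler like this and react, e.g. by sending the AsrFinal text to the LLM.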

More Content