Learn how to integrate the Beyond Presence real-time API into your tech stack.
There are several different options for integrating Beyond Presence avatars
into your tech stack. Which option is best for you depends on which components
of the pipeline you want to manage yourself.
Integration via the end-to-end API with a custom LLM is still under
development. In the meantime, we recommend using the
audio-to-video API if your use case
requires a custom LLM.
If your use case requires one of the other integration options that are still
in development, please reach out to us at support@beyondpresence.ai!
A typical conversational video agent pipeline consists of the following layers
(a conceptual code sketch of the full flow follows the list):
Transport In
The input transport layer defines how the audio and video of the user
reach the agent. Usually this is handled by a video conferencing
solution or other WebRTC service. Beyond Presence agents currently use
LiveKit to manage all data transport.
ASR (Automatic Speech Recognition)
The ASR component is the “ear” of the agent that is responsible for
listening to user audio, transcribing it, and detecting interruptions.
LLM
The LLM layer is the “brain” of the agent that is responsible for generating
agent responses. This might also include RAG, function calling,
or other methods for handling knowledge and external tools.
TTS (Text-to-Speech)
The TTS component is the “voice” of the agent that is responsible for
turning agent responses into audio. Beyond Presence provides the TTS audio
using latency-optimized voice models powered by ElevenLabs and Cartesia.
Real-Time Avatar
The real-time avatar is the “face” of the agent. Beyond Presence provides
lightning-fast, high-quality avatars powered by our proprietary 3D AI models.
Transport Out
The output transport layer defines how the audio and video of the agent
get back to the user. Usually this is handled by a video conferencing
solution or other WebRTC service. All Beyond Presence APIs currently use
LiveKit to manage the output transport.
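To make this flow concrete, here is a conceptual Python sketch of how the layers compose in a single conversational turn. Every function is an illustrative stub, not a Beyond Presence API:

```python
# Conceptual sketch only: every function below is an illustrative stub,
# not a Beyond Presence API.

def transcribe(user_audio: bytes) -> str:
    """ASR: the "ear" — turn user audio into text."""
    return "Hello, agent!"

def generate_reply(transcript: str) -> str:
    """LLM: the "brain" — produce the agent's response."""
    return f"You said: {transcript}"

def synthesize(reply: str) -> bytes:
    """TTS: the "voice" — turn the response into audio."""
    return reply.encode()

def render_avatar(agent_audio: bytes) -> bytes:
    """Real-time avatar: the "face" — turn audio into lip-synced video."""
    return b"<video frames>"

def handle_turn(user_audio: bytes) -> tuple[bytes, bytes]:
    """One conversational turn. Transport In delivers user_audio, and
    Transport Out carries the returned audio/video (both via LiveKit)."""
    transcript = transcribe(user_audio)
    reply = generate_reply(transcript)
    agent_audio = synthesize(reply)
    agent_video = render_avatar(agent_audio)
    return agent_audio, agent_video
```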
Below is a closer description of the different options for integrating
Beyond Presence real-time avatars into your existing tech stack:
End-to-End
🎯 Ideal for non-technical teams
✅ Easy to use, fully managed, optimized end-to-end latency
⚠️ No control over technical components
The end-to-end option is the easiest to integrate and also has the lowest
conversation latency. However, since Beyond Presence manages every component
of the pipeline, it is not ideal for developers who need control over
individual components.
To use this option, simply create an agent through the
Beyond Presence Dashboard
and embed it as an iframe in your frontend for users to interact with.
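As a rough illustration, the embed boils down to a single iframe. The sketch below serves a minimal page from Python; the `src` URL is a placeholder, so copy the real embed URL for your agent from the dashboard:

```python
# Minimal sketch: serve a page that embeds a Beyond Presence agent as an
# iframe. The src URL is a placeholder — copy the real embed URL for your
# agent from the Beyond Presence Dashboard.
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"""<!doctype html>
<html>
  <body>
    <iframe
      src="https://example.com/your-agent-embed-url"
      allow="camera; microphone"
      width="640" height="480">
    </iframe>
  </body>
</html>"""

class EmbedPage(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(PAGE)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), EmbedPage).serve_forever()
```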
End-to-End with Custom LLM
🎯 Ideal for low-tech teams who want to bring their own LLM
⚠️ Requires a deployed LLM, no control over ASR/TTS, tricky to get latency right
If you need support for custom LLMs but don’t want to build your own agent
pipeline in-house, the end-to-end API with custom LLM is the option for you.
However, to use this option, you will need to either deploy your custom LLM
as a streaming text-to-text WebSocket API that the Beyond Presence agent can
access, or connect to the Beyond Presence agent via WebSocket yourself
and modify your LLM logic to continuously fetch conversation history
changes from the agent and stream responses back.
Overall, this solution is most straightforward if you already have a
deployed LLM server, but somewhat difficult to set up otherwise, and the
overall call latency will also depend heavily on the efficiency of your
LLM server implementation.
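For the first variant, a deployed LLM server might look roughly like the sketch below. The JSON message shapes are assumptions for illustration only, not the actual Beyond Presence protocol, so consult the API reference for the real message format:

```python
# Rough sketch of a streaming text-to-text WebSocket server. The JSON
# message shapes are assumptions for illustration only — check the
# Beyond Presence API reference for the actual protocol.
# Requires: pip install websockets
import asyncio
import json
import websockets

async def handle_conversation(ws):
    async for raw in ws:
        request = json.loads(raw)  # assumed shape: {"history": [...]}
        # Replace this stub with a call to your own LLM.
        reply = "Placeholder response from your custom LLM."
        # Stream the reply back in chunks instead of one message.
        for chunk in reply.split():
            await ws.send(json.dumps({"type": "delta", "text": chunk + " "}))
        await ws.send(json.dumps({"type": "done"}))

async def main():
    async with websockets.serve(handle_conversation, "0.0.0.0", 8765):
        await asyncio.Future()  # run forever

if __name__ == "__main__":
    asyncio.run(main())
```

Streaming the reply in chunks rather than as a single message is what keeps latency manageable, since TTS can start speaking before the LLM has finished generating.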
Audio-to-Video
🎯 Ideal for Python developers and technical teams with existing audio agents
✅ Modular, full control over all pipeline components, most extensible
⚠️ Requires building your own audio agent
The audio-to-video API is our recommended option for deeply technical teams,
since it is the most modular solution and gives you maximum control over your
audio agent stack. It is a great integration option if you already have an
existing audio agent pipeline, if you want full control over the
interaction with the end user, or if you plan to integrate custom components
such as visual perception of the user. To use it, you stream
the audio output of your TTS / audio agent to the Beyond Presence API, and
have the resulting avatar video displayed directly to your user.
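Since transport runs over LiveKit, publishing your agent's audio could look roughly like the hedged sketch below, using the LiveKit Python SDK. The room URL, token, and sample rate are placeholders, and creating the avatar session itself is covered in the audio-to-video API reference:

```python
# Hedged sketch: publish your TTS / audio agent's output into the LiveKit
# room of a Beyond Presence avatar session. The URL, token, and sample
# rate below are placeholders — session setup is covered in the
# audio-to-video API reference. Requires: pip install livekit numpy
import asyncio
import numpy as np
from livekit import rtc

SAMPLE_RATE = 16_000  # placeholder: use the rate your session expects
FRAME_SAMPLES = SAMPLE_RATE // 100  # 10 ms frames

async def main():
    room = rtc.Room()
    await room.connect("wss://your-livekit-url", "your-access-token")

    source = rtc.AudioSource(SAMPLE_RATE, num_channels=1)
    track = rtc.LocalAudioTrack.create_audio_track("agent-audio", source)
    await room.local_participant.publish_track(track)

    # Replace this silence with PCM frames from your TTS / audio agent;
    # capture_frame applies backpressure, so the loop paces itself.
    while True:
        frame = rtc.AudioFrame.create(SAMPLE_RATE, 1, FRAME_SAMPLES)
        np.frombuffer(frame.data, dtype=np.int16)[:] = 0
        await source.capture_frame(frame)

if __name__ == "__main__":
    asyncio.run(main())
```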