Realtime Models Overview

Realtime models provide native audio-to-audio conversation capabilities with end-to-end latency optimizations. These models handle the complete voice pipeline (speech-to-text, reasoning, and text-to-speech) in a single integrated system.

What are Realtime Models?

Realtime models are designed specifically for voice conversations and offer:

  • Native Audio Processing: Direct audio input and output without separate STT/TTS components
  • Ultra-Low Latency: Optimized for real-time conversations with minimal delay
  • Integrated Pipeline: Single model handles listening, reasoning, and speaking
  • Natural Interruptions: Better handling of conversational dynamics

When to Use Realtime Models

Use realtime models when:

  • You need the absolute lowest latency for voice conversations
  • You want simplified architecture (one model instead of LLM + STT + TTS)
  • Your use case benefits from native audio understanding
  • You're building interactive voice experiences

Available Providers

SIPHON supports the following realtime model providers:

  • OpenAI — via the openai plugin
  • Google Gemini — via the gemini plugin

Usage Pattern

from siphon.agent import Agent
from siphon.plugins import openai  # or gemini

# Use realtime model instead of separate LLM/STT/TTS
agent = Agent(
    agent_name="RealtimeAssistant",
    llm=openai.Realtime(
        model="gpt-realtime",
        voice="alloy",
        temperature=0.3
    ),
    system_instructions="You are a helpful voice assistant.",
)

if __name__ == "__main__":
    agent.dev()
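The same pattern applies to Gemini. A minimal sketch, assuming the gemini plugin exposes a parallel Realtime class; the model and voice values here are illustrative assumptions, not confirmed API:

from siphon.agent import Agent
from siphon.plugins import gemini

# Hypothetical Gemini realtime configuration. Parameter names mirror the
# OpenAI example above and are assumptions for illustration.
agent = Agent(
    agent_name="RealtimeAssistant",
    llm=gemini.Realtime(
        model="gemini-live",  # illustrative model name
        voice="Puck",         # illustrative voice name
    ),
    system_instructions="You are a helpful voice assistant.",
)

if __name__ == "__main__":
    agent.dev()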

Key Differences from Traditional Pipeline

  Traditional (LLM + STT + TTS)  | Realtime Model
  -------------------------------|-----------------------------
  Three separate components      | Single integrated component
  Higher overall latency         | Ultra-low latency
  More configuration options     | Simplified configuration
  Flexible provider mixing       | Provider-specific
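For contrast, a traditional pipeline wires the three components explicitly. A sketch assuming Agent accepts separate llm, stt, and tts parameters; the component classes and model names shown are assumptions for illustration, not confirmed siphon API:

from siphon.agent import Agent
from siphon.plugins import openai

# Hypothetical traditional pipeline: three separately configured components
# instead of one realtime model. Names are illustrative assumptions.
agent = Agent(
    agent_name="PipelineAssistant",
    llm=openai.LLM(model="gpt-4o-mini"),   # reasoning
    stt=openai.STT(model="whisper-1"),     # speech-to-text
    tts=openai.TTS(voice="alloy"),         # text-to-speech
    system_instructions="You are a helpful voice assistant.",
)

This is the "flexible provider mixing" row above: each slot could come from a different plugin, at the cost of extra configuration and added latency between stages.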

Notes

  • Realtime models typically require specific API access or preview features
  • Configuration options may vary by provider
  • Some providers may have usage limits or pricing differences for realtime APIs