Convonet Sequence Diagram

FastAPI · WebRTC/WebSocket · GCP Cloud Run · Domain Agents · Agent Monitor

Voice Flow (WebSocket → STT → Agent → TTS)

For one utterance, the browser and voice-gateway-service communicate over WebSocket, and the gateway calls Deepgram and agent-llm-service over HTTP. For restaurant flows, agent-llm-service calls hanok-table-service via MCP/HTTP.

Reference: microservices call flow (v2)

Shows how voice-gateway and agent-llm relate to call-center (UI), crm-integration, hanok-table-service, Redis, and PostgreSQL, both during a single voice turn and when serving dashboards.

Sequence Phases

Phase 1: Authentication

WebSocket connect, PIN auth (PostgreSQL or env)

Phase 2: Conversation Loop

Record → STT → agent-llm HTTP → TTS → playback

Phase 3: Transfer Request

User requests transfer; agent-llm returns transfer_marker

Phase 4: Twilio Transfer

Voice-gateway → Twilio → FusionPBX → Agent Dashboard

Phase 1: Authentication

WebSocket connection and PIN validation (FastAPI voice-gateway-service)

User Browser → Voice Gateway: Connect WebSocket

User opens the voice assistant UI at /voice_assistant and establishes a WebSocket connection to wss://v2.convonetai.com/webrtc/ws (FastAPI voice-gateway-service). No LiveKit on GCP.

User Browser → Voice Gateway: authenticate (PIN)

Client sends a message with type: "authenticate" and PIN. Voice gateway requires this when ENABLE_VOICE_PIN is true.

Voice Gateway → PostgreSQL (or env): Validate PIN

Voice gateway validates the PIN against the voice_pin column of the users_anthropic table (reached via DB_URI). If DB_URI is not set, it falls back to the VOICE_PIN environment variable.
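The database-or-environment fallback described above can be sketched as follows. This is a minimal illustration, not the service's actual code: validate_pin and the db_lookup callback are hypothetical names, the callback stands in for a query against users_anthropic.voice_pin, and the placeholder identity returned on the VOICE_PIN path is an assumption.

```python
import hmac
from typing import Callable, Mapping, Optional


def validate_pin(
    pin: str,
    env: Mapping[str, str],
    db_lookup: Optional[Callable[[str], Optional[dict]]] = None,
) -> Optional[dict]:
    """Return user details for a valid PIN, or None if the PIN is rejected."""
    if env.get("DB_URI") and db_lookup is not None:
        # Database path: db_lookup (hypothetical) is expected to query
        # users_anthropic by voice_pin and return a user dict or None.
        return db_lookup(pin)

    # Fallback path: compare against the single VOICE_PIN value.
    expected = env.get("VOICE_PIN", "")
    if expected and hmac.compare_digest(pin.encode(), expected.encode()):
        # Placeholder identity (assumption): the env path has no user row.
        return {"user_id": "env-user", "user_name": "Voice User"}
    return None
```

Passing the environment as a mapping (for example os.environ) keeps the function easy to test; compare_digest avoids leaking PIN contents through timing differences.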

Voice Gateway: Store session (in-memory)

On success, voice gateway stores user_id, user_name, and authenticated in per-connection session state (no Redis required for this step).

Voice Gateway → User Browser: auth_ok

Voice gateway sends auth_ok with user details. The client hides the PIN form and shows the Start/Stop recording controls.
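The last two steps (storing the session and replying with auth_ok) might look like the sketch below. The VoiceSession class, the complete_auth helper, and the exact keys of the auth_ok message are assumptions made for illustration; the source only states that user_id, user_name, and authenticated are kept per connection and that auth_ok carries user details.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoiceSession:
    """Per-connection state; it lives only as long as the WebSocket."""
    authenticated: bool = False
    user_id: Optional[str] = None
    user_name: Optional[str] = None


def complete_auth(session: VoiceSession, user: dict) -> dict:
    """Record the validated user on the session and build the auth_ok reply."""
    session.user_id = user["user_id"]
    session.user_name = user["user_name"]
    session.authenticated = True
    # Message shape is hypothetical; only the "auth_ok" type comes from the source.
    return {
        "type": "auth_ok",
        "user_id": session.user_id,
        "user_name": session.user_name,
    }
```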

Phase 2: Conversation Loop (STT → Agent → TTS)

Audio capture in browser, batch STT and agent call in voice-gateway, TTS and playback

User Browser → Voice Gateway: start_recording

User clicks Start. Browser uses getUserMedia and MediaRecorder to capture microphone audio. Client sends start_recording over the WebSocket. Voice gateway sets recording state and clears the in-memory audio buffer for this session.

User Browser → Voice Gateway: audio_chunk (base64)

While recording, the browser sends one or more audio_chunk messages (base64-encoded WebM audio). Voice gateway appends decoded bytes to the session buffer.
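A rough sketch of the buffering behaviour in the start_recording and audio_chunk steps, assuming a hypothetical AudioBuffer class held on the session. Only the base64 decoding and append-to-buffer logic are taken from the description; the class and method names are illustrative.

```python
import base64


class AudioBuffer:
    """In-memory buffer for one recording within a WebSocket session."""

    def __init__(self) -> None:
        self._buf = bytearray()
        self.recording = False

    def start(self) -> None:
        # start_recording: set the recording flag and drop any earlier audio.
        self.recording = True
        self._buf.clear()

    def append_chunk(self, b64_data: str) -> int:
        # audio_chunk: decode base64 and append; ignore chunks when idle.
        if not self.recording:
            return 0
        raw = base64.b64decode(b64_data)
        self._buf.extend(raw)
        return len(raw)

    def stop(self) -> bytes:
        # stop_recording: freeze the buffer and hand back the full audio.
        self.recording = False
        return bytes(self._buf)
```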

User Browser → Voice Gateway: stop_recording

User clicks Stop. Client sends stop_recording. Voice gateway runs the pipeline in a background task: STT → agent-llm HTTP → TTS, then sends results back over the WebSocket.
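Running the pipeline as a background task keeps the WebSocket receive loop free while STT, the agent call, and TTS complete. The sketch below shows one way to do that with asyncio.create_task; the stt, agent, tts, and send callables are hypothetical stand-ins for the real Deepgram, agent-llm-service, and WebSocket calls, and the single result message is a simplification.

```python
import asyncio
from typing import Awaitable, Callable


async def run_pipeline(
    audio: bytes,
    stt: Callable[[bytes], Awaitable[str]],
    agent: Callable[[str], Awaitable[str]],
    tts: Callable[[str], Awaitable[bytes]],
) -> tuple:
    """Run the three stages in order: STT, agent call, TTS."""
    transcript = await stt(audio)
    reply = await agent(transcript)
    speech = await tts(reply)
    return transcript, reply, speech


def on_stop_recording(audio, stt, agent, tts, send) -> asyncio.Task:
    """Schedule the pipeline without blocking the caller.

    Must be called from inside a running event loop.
    """
    async def job() -> None:
        transcript, reply, speech = await run_pipeline(audio, stt, agent, tts)
        # Simplified: one summary message instead of the real message sequence.
        await send({"transcript": transcript, "reply": reply,
                    "audio_bytes": len(speech)})

    return asyncio.create_task(job())
```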

Voice Gateway → Deepgram: Batch STT

Voice gateway sends the concatenated audio to Deepgram for batch transcription (e.g. transcribe_audio_with_deepgram_webrtc). Deepgram returns the transcript text.
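For orientation, a batch transcription call against Deepgram's pre-recorded REST endpoint could be prepared and parsed roughly as below. This is not the body of transcribe_audio_with_deepgram_webrtc, which the source does not show; the helper names are hypothetical, the audio/webm content type is an assumption based on the WebM chunks described earlier, and the request is only built, not sent.

```python
import urllib.request

DEEPGRAM_LISTEN_URL = "https://api.deepgram.com/v1/listen"


def build_stt_request(audio: bytes, api_key: str) -> urllib.request.Request:
    """Prepare (but do not send) a batch transcription request."""
    return urllib.request.Request(
        DEEPGRAM_LISTEN_URL,
        data=audio,
        headers={
            "Authorization": f"Token {api_key}",
            "Content-Type": "audio/webm",  # assumption: buffer holds WebM audio
        },
        method="POST",
    )


def extract_transcript(response_json: dict) -> str:
    """Pull the first alternative's transcript out of a response body."""
    try:
        return response_json["results"]["channels"][0]["alternatives"][0]["transcript"]
    except (KeyError, IndexError):
        # Treat a malformed or empty response as "nothing was said".
        return ""
```

Sending the request (for example with urllib.request.urlopen or an async HTTP client) and decoding the JSON body is left out so the sketch stays free of network access.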

Voice Gateway → Agent LLM Service: POST /agent/process

Voice gateway sends an HTTP POST to AGENT_LLM_URL/agent/process with transcript, user_id, session_id, and metadata (source: voice, t0, voice_timing, stt_provider); the metadata feeds Agent Monitor. Agent-llm-service runs LangGraph and selects a domain agent (Todo/Mortgage/Healthcare/Hanok) by intent. The agent calls Multi-LLM and MCP tools and can call hanok-table-service for reservation APIs. Agent-llm-service tracks the interaction (tool_calls, voice_timing) in Redis and returns the response text (and an optional transfer_marker).
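The request to agent-llm-service might be assembled as in this sketch. The top-level field names and metadata keys come from the description above; the nesting, the value types, the default stt_provider value, and the build_agent_request helper itself are assumptions.

```python
import json
import urllib.request


def build_agent_request(
    agent_llm_url: str,
    transcript: str,
    user_id: str,
    session_id: str,
    t0: float,
    voice_timing: dict,
    stt_provider: str = "deepgram",
) -> urllib.request.Request:
    """Prepare (but do not send) the POST to AGENT_LLM_URL/agent/process."""
    payload = {
        "transcript": transcript,
        "user_id": user_id,
        "session_id": session_id,
        # Metadata consumed by Agent Monitor; the exact schema is assumed.
        "metadata": {
            "source": "voice",
            "t0": t0,
            "voice_timing": voice_timing,
            "stt_provider": stt_provider,
        },
    }
    return urllib.request.Request(
        agent_llm_url.rstrip("/") + "/agent/process",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```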

Voice Gateway → Deepgram: TTS

Voice gateway synthesizes the agent response text with Deepgram TTS and receives audio bytes.

Voice Gateway → User Browser: transcript_final, agent_final, audio_chunk

Voice gateway sends transcript_final (STT text), agent_final (LLM response text), and audio_chunk (base64 TTS audio). The browser displays transcript and agent text and plays the TTS audio.
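The three outbound messages could be built as shown here. The message types are taken from the text; the "text" and "audio" keys and the build_turn_messages helper are hypothetical, since the source does not give the wire format.

```python
import base64
from typing import List


def build_turn_messages(transcript: str, reply: str, tts_audio: bytes) -> List[dict]:
    """Return the messages sent to the browser at the end of one voice turn."""
    return [
        {"type": "transcript_final", "text": transcript},
        {"type": "agent_final", "text": reply},
        # TTS bytes are base64-encoded so they can travel in a JSON text frame.
        {"type": "audio_chunk", "audio": base64.b64encode(tts_audio).decode("ascii")},
    ]
```

On the browser side, the client would decode the base64 payload back to bytes before handing it to the audio player.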

Loop: Steps 6–12 (start_recording through TTS playback) repeat for each utterance. Tool execution (e.g. todos, mortgage, healthcare) happens inside agent-llm-service; voice-gateway only forwards the transcript and sends back the final response and TTS audio.

Phase 3: Transfer Request

User requests transfer to a human agent; agent-llm returns transfer_marker

User says “transfer to agent” (in voice)

The utterance is captured and sent through the same pipeline (STT → agent-llm). LangGraph detects transfer intent and the LLM returns a response with transfer_marker.

Agent LLM → Voice Gateway: response + transfer_marker

Agent-llm-service returns the reply and transfer_marker. Voice gateway can then trigger the Twilio transfer flow (Phase 4).
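On the gateway side, branching on transfer_marker can be as small as the sketch below. The response is assumed to be a JSON object with a "response" text field and an optional "transfer_marker" field; both the shape and the next_action helper are illustrative.

```python
def next_action(agent_response: dict) -> str:
    """Decide what the gateway does after an agent-llm reply.

    Returns "transfer" when a transfer_marker is present and truthy,
    otherwise "speak" (continue the normal TTS path).
    """
    if agent_response.get("transfer_marker"):
        return "transfer"
    return "speak"
```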

Phase 4: Twilio Transfer Flow

Voice-gateway calls Twilio to bridge the call to FusionPBX; agent dashboard (JsSIP) receives the call

Voice Gateway → Twilio API: transfer_bridge

Voice gateway (or Twilio webhook flow) calls the transfer-bridge endpoint so Twilio places a SIP call to FusionPBX (e.g. sip:2001@FREEPBX_DOMAIN;transport=udp).
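One common way to have Twilio place a SIP leg is to return TwiML containing a Dial verb with a nested Sip noun. The source does not say whether the transfer-bridge endpoint works this way, so treat the following as an assumption; the helper names are hypothetical, and FREEPBX_DOMAIN is read from an environment-style mapping.

```python
import xml.etree.ElementTree as ET
from typing import Mapping


def build_sip_uri(extension: str, env: Mapping[str, str]) -> str:
    """Build the SIP URI for an extension, e.g. sip:2001@<domain>;transport=udp."""
    domain = env["FREEPBX_DOMAIN"]
    return f"sip:{extension}@{domain};transport=udp"


def build_transfer_twiml(extension: str, env: Mapping[str, str]) -> str:
    """Return TwiML that dials the SIP URI (assumed bridge mechanism)."""
    response = ET.Element("Response")
    dial = ET.SubElement(response, "Dial")
    sip = ET.SubElement(dial, "Sip")
    sip.text = build_sip_uri(extension, env)
    return ET.tostring(response, encoding="unicode")
```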

Twilio → FusionPBX: SIP INVITE (e.g. extension 2001)

Twilio sends SIP INVITE to FusionPBX. FusionPBX rings the target extension (e.g. 2001).

FusionPBX → Agent Dashboard (JsSIP): Incoming call

The agent dashboard at /call-center uses a JsSIP client registered with FusionPBX over WSS. It receives the incoming call and can show user info (e.g. from PostgreSQL) and answer/hangup controls.

Agent answers → Twilio bridges audio

When the agent answers, Twilio bridges the user leg and the SIP leg. Live conversation continues between the user and the human agent.

Alternative: Reject / Timeout

If the agent rejects or the call times out, FusionPBX/Twilio report failure. Voice gateway can send a transfer-failed message to the browser so the user sees an error state.
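A sketch of how a failed dial outcome might be turned into a browser message. The status strings follow Twilio's documented dial outcomes (busy, no-answer, failed, canceled); the "transfer_failed" message type, its "reason" key, and the helper name are assumptions, as the source only mentions a transfer-failed message.

```python
from typing import Optional

# Dial outcomes that mean the human agent was never connected.
FAILED_STATUSES = {"busy", "no-answer", "failed", "canceled"}


def transfer_outcome_message(dial_status: str) -> Optional[dict]:
    """Return a transfer-failed message for the browser, or None on success."""
    if dial_status in FAILED_STATUSES:
        return {"type": "transfer_failed", "reason": dial_status}
    return None
```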

Key Sequence Points

Authentication

PIN is validated by voice-gateway against PostgreSQL users_anthropic.voice_pin (when DB_URI is set) or VOICE_PIN env. Session state is in-memory per WebSocket.

Voice Pipeline

Browser sends audio over WebSocket. Voice-gateway runs batch STT (Deepgram) → HTTP POST to agent-llm-service → TTS (Deepgram), then sends transcript_final, agent_final, and audio_chunk back. No LiveKit on GCP.

AI & Tools

Agent-llm-service runs LangGraph, multi-LLM (Claude, Gemini, OpenAI), and MCP tools (todo, mortgage, healthcare, hanok reservations). Intent routing selects the domain agent. Agent Monitor UI is served by call-center-service at /agent-monitor.

Transfer

When the user requests transfer, agent-llm returns transfer_marker. Voice-gateway triggers Twilio to bridge to FusionPBX; the agent dashboard (JsSIP at /call-center) receives the call.