Convonet Sequence Diagram
FastAPI · WebRTC/WebSocket · GCP Cloud Run · Domain Agents · Agent Monitor
Voice Flow (WebSocket → STT → Agent → TTS)
For a single utterance, the browser and voice-gateway-service communicate over WebSocket, and the gateway calls Deepgram and agent-llm-service over HTTP. For restaurant flows, agent-llm-service calls hanok-table-service via MCP/HTTP.
Reference: microservices call flow (v2)
Shows how voice-gateway and agent-llm interact with call-center (UI), crm-integration, hanok-table-service, Redis, and PostgreSQL during one voice turn and when serving dashboards.
Sequence Phases
Phase 1: Authentication
WebSocket connect, PIN auth (PostgreSQL or env)
Phase 2: Conversation Loop
Record → STT → agent-llm HTTP → TTS → playback
Phase 3: Transfer Request
User requests transfer; agent-llm returns transfer_marker
Phase 4: Twilio Transfer
Voice-gateway → Twilio → FusionPBX → Agent Dashboard
Phase 1: Authentication
WebSocket connection and PIN validation (FastAPI voice-gateway-service)
User Browser → Voice Gateway: Connect WebSocket
User opens the voice assistant UI at /voice_assistant and establishes a WebSocket connection to wss://v2.convonetai.com/webrtc/ws (FastAPI voice-gateway-service). No LiveKit on GCP.
User Browser → Voice Gateway: authenticate (PIN)
Client sends a message with type: "authenticate" and PIN. Voice gateway requires this when ENABLE_VOICE_PIN is true.
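As a rough illustration of the authenticate message, the sketch below serializes and parses a JSON frame. Only the type value "authenticate" comes from the description above. The JSON encoding, the "pin" field name, and both helper names are assumptions made for this example.

```python
import json


def build_authenticate_message(pin: str) -> str:
    """Serialize the client's authenticate frame as JSON text.

    The "pin" key is a hypothetical field name; only the "type" value
    is taken from the flow description.
    """
    return json.dumps({"type": "authenticate", "pin": pin})


def parse_client_message(raw: str) -> dict:
    """Decode an incoming text frame and check that it carries a type."""
    msg = json.loads(raw)
    if not isinstance(msg, dict) or "type" not in msg:
        raise ValueError("message must be a JSON object with a 'type' field")
    return msg
```

A gateway handler could branch on the parsed type field and reject any other message type until authentication has succeeded.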
Voice Gateway → PostgreSQL (or env): Validate PIN
Voice gateway validates the PIN against the users_anthropic table (via DB_URI) using voice_pin. If DB_URI is not set, it falls back to VOICE_PIN environment variable.
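The fallback order described above can be sketched as follows. This is a minimal sketch, not the gateway's actual code: the function name, the injected db_lookup callable (standing in for the users_anthropic query), and the shape of the returned user record are all assumptions.

```python
import hmac
import os
from typing import Callable, Mapping, Optional


def validate_pin(
    pin: str,
    env: Mapping[str, str] = os.environ,
    db_lookup: Optional[Callable[[str], Optional[dict]]] = None,
) -> Optional[dict]:
    """Return a user record on success, or None when the PIN is rejected.

    db_lookup is a hypothetical stand-in for the query against
    users_anthropic.voice_pin; it is used only when DB_URI is set.
    """
    if env.get("DB_URI") and db_lookup is not None:
        return db_lookup(pin)

    # No database configured: compare against the VOICE_PIN variable.
    expected = env.get("VOICE_PIN")
    if expected and hmac.compare_digest(pin.encode(), expected.encode()):
        # Assumption: the env fallback has no per-user details to return.
        return {"user_id": None, "user_name": None}
    return None
```

Using hmac.compare_digest for the environment comparison avoids leaking timing information about the expected PIN; whether the real service does this is not stated.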
Voice Gateway: Store session (in-memory)
On success, voice gateway stores user_id, user_name, and an authenticated flag in per-connection session state (no Redis required for this step).
Voice Gateway → User Browser: auth_ok
Voice gateway sends auth_ok with user details. The client hides the PIN form and shows the Start/Stop recording controls.
Phase 2: Conversation Loop (STT → Agent → TTS)
Audio capture in browser, batch STT and agent call in voice-gateway, TTS and playback
User Browser → Voice Gateway: start_recording
User clicks Start. Browser uses getUserMedia and MediaRecorder to capture microphone audio. Client sends start_recording over the WebSocket. Voice gateway sets recording state and clears the in-memory audio buffer for this session.
User Browser → Voice Gateway: audio_chunk (base64)
While recording, the browser sends one or more audio_chunk messages (base64-encoded WebM audio). Voice gateway appends decoded bytes to the session buffer.
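The recording state and per-session buffer described above might look like the following. The class and method names are hypothetical, and ignoring chunks that arrive outside a recording is an assumption rather than documented behavior.

```python
import base64
from dataclasses import dataclass, field


@dataclass
class VoiceSession:
    """In-memory state for one WebSocket connection (illustrative only)."""

    authenticated: bool = False
    recording: bool = False
    audio_buffer: bytearray = field(default_factory=bytearray)

    def start_recording(self) -> None:
        # start_recording: set the recording flag and clear stale audio.
        self.recording = True
        self.audio_buffer.clear()

    def add_chunk(self, b64_audio: str) -> None:
        # audio_chunk: decode base64 and append the raw bytes.
        if not self.recording:
            return  # assumption: chunks outside a recording are dropped
        self.audio_buffer.extend(base64.b64decode(b64_audio))

    def stop_recording(self) -> bytes:
        # stop_recording: freeze the buffer for the STT step.
        self.recording = False
        return bytes(self.audio_buffer)
```

Returning an immutable bytes copy from stop_recording means a later start_recording cannot alter audio that is already on its way to transcription.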
User Browser → Voice Gateway: stop_recording
User clicks Stop. Client sends stop_recording. Voice gateway runs the pipeline in a background task: STT → agent-llm HTTP → TTS, then sends results back over the WebSocket.
Voice Gateway → Deepgram: Batch STT
Voice gateway sends the concatenated audio to Deepgram for batch transcription (e.g. transcribe_audio_with_deepgram_webrtc). Deepgram returns the transcript text.
Voice Gateway → Agent LLM Service: POST /agent/process
Voice gateway sends an HTTP POST to AGENT_LLM_URL/agent/process with transcript, user_id, session_id, and metadata (source: voice, t0, voice_timing, stt_provider) for Agent Monitor. Agent-llm-service runs LangGraph and selects a domain agent (Todo/Mortgage/Healthcare/Hanok) by intent. The selected agent calls Multi-LLM and MCP tools and, for reservation APIs, can call hanok-table-service. The service tracks the interaction (tool_calls, voice_timing) in Redis and returns the response text, plus an optional transfer_marker.
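To make the request concrete, here is one way the POST could be assembled with the standard library. The field names come from the description above, but nesting them under a "metadata" key, the JSON content type, and the function name are assumptions.

```python
import json
import urllib.request


def build_agent_request(
    base_url: str,
    transcript: str,
    user_id: str,
    session_id: str,
    t0: float,
    voice_timing: dict,
    stt_provider: str,
) -> urllib.request.Request:
    """Build (but do not send) the POST to <AGENT_LLM_URL>/agent/process."""
    payload = {
        "transcript": transcript,
        "user_id": user_id,
        "session_id": session_id,
        # Assumed layout: monitoring fields grouped under "metadata".
        "metadata": {
            "source": "voice",
            "t0": t0,
            "voice_timing": voice_timing,
            "stt_provider": stt_provider,
        },
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/agent/process",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would perform the call; an async gateway would more likely use an async HTTP client, which is not shown here.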
Voice Gateway → Deepgram: TTS
Voice gateway synthesizes the agent response text with Deepgram TTS and receives audio bytes.
Voice Gateway → User Browser: transcript_final, agent_final, audio_chunk
Voice gateway sends transcript_final (STT text), agent_final (LLM response text), and audio_chunk (base64 TTS audio). The browser displays transcript and agent text and plays the TTS audio.
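The whole turn (STT, agent call, TTS, then the three outbound messages) can be sketched as a single coroutine with injectable stages. The stage callables, the "text" and "audio" field names, and the demo helper are hypothetical; only the message types and the step order come from the flow above.

```python
import asyncio
import base64
from typing import Awaitable, Callable


async def run_voice_turn(
    audio: bytes,
    stt: Callable[[bytes], Awaitable[str]],
    agent: Callable[[str], Awaitable[str]],
    tts: Callable[[str], Awaitable[bytes]],
    send: Callable[[dict], Awaitable[None]],
) -> None:
    """Run STT -> agent -> TTS, then push the results to the browser."""
    transcript = await stt(audio)
    reply = await agent(transcript)
    speech = await tts(reply)

    await send({"type": "transcript_final", "text": transcript})
    await send({"type": "agent_final", "text": reply})
    await send({
        "type": "audio_chunk",
        "audio": base64.b64encode(speech).decode("ascii"),
    })


async def demo() -> list:
    """Drive the pipeline with fake stages and collect what was sent."""
    sent: list = []

    async def fake_stt(_audio: bytes) -> str:
        return "hello"

    async def fake_agent(text: str) -> str:
        return text.upper()

    async def fake_tts(text: str) -> bytes:
        return text.encode("utf-8")

    async def fake_send(msg: dict) -> None:
        sent.append(msg)

    await run_voice_turn(b"raw", fake_stt, fake_agent, fake_tts, fake_send)
    return sent
```

In a running gateway the coroutine could be scheduled with asyncio.create_task when stop_recording arrives, so the WebSocket receive loop stays responsive while the turn is processed.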
Phase 3: Transfer Request
User requests transfer to a human agent; agent-llm returns transfer_marker
User says “transfer to agent” (in voice)
The utterance is captured and sent through the same pipeline (STT → agent-llm). LangGraph detects transfer intent and the LLM returns a response with transfer_marker.
Agent LLM → Voice Gateway: response + transfer_marker
Agent-llm-service returns the reply and transfer_marker. Voice gateway can then trigger the Twilio transfer flow (Phase 4).
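A minimal sketch of how the gateway might branch on the agent response is shown below. It assumes the response is a JSON object in which transfer_marker is absent or falsy when no transfer was requested; the function name and the two action labels are hypothetical.

```python
def next_action(agent_response: dict) -> str:
    """Decide what the gateway does after an agent reply.

    Returns "transfer" when a transfer_marker is present and truthy,
    otherwise "speak" (continue the normal TTS path).
    """
    if agent_response.get("transfer_marker"):
        return "transfer"
    return "speak"
```

Whether the reply text is still spoken to the user before the transfer starts is not specified, so the sketch leaves that decision to the caller.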
Phase 4: Twilio Transfer Flow
Voice-gateway calls Twilio to bridge the call to FusionPBX; agent dashboard (JsSIP) receives the call
Voice Gateway → Twilio API: transfer_bridge
Voice gateway (or Twilio webhook flow) calls the transfer-bridge endpoint so Twilio places a SIP call to FusionPBX (e.g. sip:2001@FREEPBX_DOMAIN;transport=udp).
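One plausible shape for the transfer-bridge response is a TwiML document that dials the SIP URI. The Dial verb and Sip noun are standard TwiML, but whether this endpoint actually returns TwiML is an assumption, and the function name and example domain are illustrative.

```python
import xml.etree.ElementTree as ET


def build_transfer_twiml(extension: str, pbx_domain: str) -> str:
    """Return TwiML that dials a SIP URI on the PBX over UDP."""
    sip_uri = f"sip:{extension}@{pbx_domain};transport=udp"

    response = ET.Element("Response")
    dial = ET.SubElement(response, "Dial")
    sip = ET.SubElement(dial, "Sip")
    sip.text = sip_uri  # ElementTree escapes any XML-special characters

    return ET.tostring(response, encoding="unicode")
```

For example, build_transfer_twiml("2001", "pbx.example.com") yields a Response element wrapping Dial and Sip with the URI sip:2001@pbx.example.com;transport=udp.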
Twilio → FusionPBX: SIP INVITE (e.g. extension 2001)
Twilio sends SIP INVITE to FusionPBX. FusionPBX rings the target extension (e.g. 2001).
FusionPBX → Agent Dashboard (JsSIP): Incoming call
The agent dashboard at /call-center uses a JsSIP client registered with FusionPBX over WSS. It receives the incoming call and can show user info (e.g. from PostgreSQL) and answer/hangup controls.
Agent answers → Twilio bridges audio
When the agent answers, Twilio bridges the user leg and the SIP leg. Live conversation continues between the user and the human agent.
Alternative: Reject / Timeout
If the agent rejects or the call times out, FusionPBX/Twilio report failure. Voice gateway can send a transfer-failed message to the browser so the user sees an error state.
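The failure path could be reduced to a small mapping from a reported dial status to a browser message. The status strings are modeled on the values Twilio reports for a completed Dial, and the "transfer_failed" message type and "reason" field are hypothetical names for the transfer-failed message mentioned above.

```python
from typing import Optional

# Assumed set of statuses that count as a failed transfer.
FAILED_STATUSES = {"busy", "no-answer", "failed", "canceled"}


def transfer_failure_message(dial_status: str) -> Optional[dict]:
    """Return a message for the browser, or None if the transfer succeeded."""
    if dial_status in FAILED_STATUSES:
        return {"type": "transfer_failed", "reason": dial_status}
    return None
```

The gateway would forward a non-None result over the existing WebSocket so the UI can show its error state.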
Key Sequence Points
Authentication
PIN is validated by voice-gateway against PostgreSQL users_anthropic.voice_pin (when DB_URI is set) or VOICE_PIN env. Session state is in-memory per WebSocket.
Voice Pipeline
The browser sends audio over WebSocket. Voice-gateway runs batch STT (Deepgram) → HTTP POST to agent-llm-service → TTS (Deepgram), then sends transcript_final, agent_final, and audio_chunk back. LiveKit is not used on GCP.
AI & Tools
Agent-llm-service runs LangGraph, multi-LLM (Claude, Gemini, OpenAI), and MCP tools (todo, mortgage, healthcare, hanok reservations). Intent routing selects the domain agent. Agent Monitor UI is served by call-center-service at /agent-monitor.
Transfer
When the user requests transfer, agent-llm returns transfer_marker. Voice-gateway triggers Twilio to bridge to FusionPBX; the agent dashboard (JsSIP at /call-center) receives the call.