A voice agent that uses Google Gemini for LLM reasoning, Deepgram for speech-to-text, ElevenLabs for text-to-speech, and Silero VAD for voice activity detection. This implementation uses direct API integration without any orchestration frameworks, connected via Plivo telephony.
- Google Gemini LLM for conversational reasoning
- Deepgram real-time STT via WebSocket
- ElevenLabs TTS for natural voice synthesis
- Silero VAD for client-side voice activity detection and barge-in
- Plivo telephony with bidirectional WebSocket audio streaming
- Native asyncio orchestration (no frameworks)
- Inbound and outbound call support
- 3 concurrent tasks: plivo_rx, deepgram_rx, plivo_tx
Plivo (u-law 8kHz) --> Deepgram STT (PCM 8kHz) --> Gemini LLM
|
Plivo (u-law 8kHz) <-- ElevenLabs TTS (PCM 24kHz) <----+
Silero VAD runs on Plivo audio for barge-in detection and turn management.
Audio Flow:
- Caller speaks --> Plivo captures audio (u-law 8kHz)
- Audio converted to PCM --> sent to Deepgram for transcription
- VAD runs in parallel for speech start/end detection
- Transcript sent to Gemini for response generation
- Response text sent to ElevenLabs for speech synthesis
- TTS audio (PCM 24kHz) converted to u-law 8kHz --> sent back to caller
- Barge-in: if user speaks during response, audio queue is cleared
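The two conversions in the flow above (u-law to PCM on the way in, PCM 24kHz to u-law 8kHz on the way out) can be sketched in pure Python. This is a minimal illustration, not the code in utils.py: the function names are hypothetical, and the 3:1 averaging used for downsampling is a naive stand-in for a proper low-pass resampler.

```python
import struct

BIAS = 0x84   # G.711 mu-law bias added before companding
CLIP = 32635  # largest magnitude that fits after the bias is added


def ulaw_byte_to_pcm16(u: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit sample."""
    u = ~u & 0xFF
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if u & 0x80 else magnitude


def pcm16_to_ulaw_byte(sample: int) -> int:
    """Encode one signed 16-bit sample as a G.711 mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    exponent, mask = 7, 0x4000
    while exponent > 0 and not magnitude & mask:
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def ulaw_to_pcm16(data: bytes) -> bytes:
    """Inbound path: u-law 8kHz bytes -> little-endian PCM16 8kHz bytes."""
    samples = [ulaw_byte_to_pcm16(b) for b in data]
    return struct.pack(f"<{len(samples)}h", *samples)


def pcm24k_to_ulaw8k(pcm: bytes) -> bytes:
    """Outbound path: PCM16 24kHz -> u-law 8kHz (naive 3:1 averaging)."""
    n = len(pcm) // 2
    samples = struct.unpack(f"<{n}h", pcm[: n * 2])
    out = bytearray()
    for i in range(0, n - n % 3, 3):
        average = sum(samples[i:i + 3]) // 3
        out.append(pcm16_to_ulaw_byte(average))
    return bytes(out)
```

Silence encodes to the byte 0xFF in u-law, which is a quick sanity check when inspecting frames in logs.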
- Python 3.10+
- uv package manager
- ngrok for local development
- API keys for:
- Plivo - Telephony
- Google AI Studio - Gemini API
- Deepgram - Speech-to-text
- ElevenLabs - Text-to-speech
- Navigate to the project:
    cd gemini2-deepgramnova2-elevenflashv2.5-native
- Create environment file:
    cp .env.example .env
- Configure your .env file with your API keys and Plivo phone number.
- Install dependencies:
    uv sync
- Start ngrok (in a separate terminal):
    ngrok http 8000
- Update PUBLIC_URL in your .env with the ngrok HTTPS URL.
- Run the inbound server:
    uv run python -m inbound.server
  Or the outbound server:
    uv run python -m outbound.server
- Call your Plivo phone number to test the voice agent.
| Variable | Description | Default |
|---|---|---|
| GEMINI_API_KEY | Google AI API key | Required |
| GEMINI_MODEL | Gemini model name | gemini-2.0-flash |
| DEEPGRAM_API_KEY | Deepgram API key | Required |
| DEEPGRAM_MODEL | Deepgram model | nova-2-phonecall |
| ELEVENLABS_API_KEY | ElevenLabs API key | Required |
| ELEVENLABS_VOICE_ID | ElevenLabs voice ID | 21m00Tcm4TlvDq8ikWAM |
| ELEVENLABS_MODEL | ElevenLabs model | eleven_flash_v2_5 |
| PLIVO_AUTH_ID | Plivo account auth ID | Required |
| PLIVO_AUTH_TOKEN | Plivo account auth token | Required |
| PLIVO_PHONE_NUMBER | Your Plivo phone number | Required |
| PUBLIC_URL | Public URL for webhooks | Required |
| SERVER_PORT | Server port | 8000 |
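As an illustration of how the required and defaulted variables above might be read at startup, here is a minimal sketch. The `load_settings` helper is hypothetical and is not taken from the project; only the variable names and default values come from the table.

```python
import os

# Variables marked "Required" in the table.
REQUIRED = (
    "GEMINI_API_KEY",
    "DEEPGRAM_API_KEY",
    "ELEVENLABS_API_KEY",
    "PLIVO_AUTH_ID",
    "PLIVO_AUTH_TOKEN",
    "PLIVO_PHONE_NUMBER",
    "PUBLIC_URL",
)

# Variables with a documented default.
DEFAULTS = {
    "GEMINI_MODEL": "gemini-2.0-flash",
    "DEEPGRAM_MODEL": "nova-2-phonecall",
    "ELEVENLABS_VOICE_ID": "21m00Tcm4TlvDq8ikWAM",
    "ELEVENLABS_MODEL": "eleven_flash_v2_5",
    "SERVER_PORT": "8000",
}


def load_settings(env=None) -> dict:
    """Return a settings dict, failing fast if a required variable is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required variables: " + ", ".join(missing))
    settings = {name: env[name] for name in REQUIRED}
    for name, default in DEFAULTS.items():
        settings[name] = env.get(name) or default
    settings["SERVER_PORT"] = int(settings["SERVER_PORT"])
    return settings
```

Failing at startup with the full list of missing names tends to be easier to debug than a failed API call in the middle of a live phone call.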
gemini2-deepgramnova2-elevenflashv2.5-native/
|-- inbound/
| |-- __init__.py
| |-- agent.py # Voice agent for inbound calls
| |-- server.py # FastAPI: /answer, /ws, /hangup
| +-- system_prompt.md # System prompt for inbound calls
|-- outbound/
| |-- __init__.py
| |-- agent.py # Voice agent + CallManager for outbound
| |-- server.py # FastAPI: /outbound/call, /outbound/ws
| +-- system_prompt.md # System prompt for outbound calls
|-- utils.py # Audio conversion, VAD, phone utils
|-- tests/
| |-- conftest.py
| |-- helpers.py
| |-- test_integration.py # Unit + local integration tests
| |-- test_e2e_live.py # E2E with real APIs
| |-- test_live_call.py # Real inbound call test
| |-- test_multiturn_voice.py
| +-- test_outbound_call.py # Real outbound call test
|-- pyproject.toml
|-- .env.example
|-- Dockerfile
+-- README.md
- utils.py: Audio format conversion (u-law, PCM, resampling), SileroVADProcessor, phone normalization
- inbound/agent.py: VoiceAgent class with 3 concurrent tasks (plivo_rx, deepgram_rx, plivo_tx)
- inbound/server.py: FastAPI server for inbound call webhooks and WebSocket handling
- outbound/agent.py: VoiceAgent + OutboundCallRecord + CallManager for outbound calls
- outbound/server.py: FastAPI server for outbound call management
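The three-task layout named above (plivo_rx, deepgram_rx, plivo_tx) can be pictured as coroutines joined by asyncio queues. The sketch below is an assumption about the general shape, not the actual VoiceAgent code: strings stand in for audio frames and transcripts, the LLM and TTS calls are replaced by a placeholder, and `None` is used as a hypothetical end-of-call sentinel.

```python
import asyncio


async def plivo_rx(frames, stt_queue: asyncio.Queue) -> None:
    """Receive caller audio and forward it toward speech-to-text."""
    for frame in frames:
        await stt_queue.put(frame)
    await stt_queue.put(None)  # end-of-call sentinel


async def deepgram_rx(stt_queue: asyncio.Queue, tx_queue: asyncio.Queue) -> None:
    """Consume STT results and queue the agent's reply for playback."""
    while (item := await stt_queue.get()) is not None:
        # Placeholder for the real LLM and TTS calls.
        await tx_queue.put(f"reply-to:{item}")
    await tx_queue.put(None)


async def plivo_tx(tx_queue: asyncio.Queue, sent: list) -> None:
    """Send queued reply audio back to the caller."""
    while (chunk := await tx_queue.get()) is not None:
        sent.append(chunk)


async def run_call(frames) -> list:
    stt_queue: asyncio.Queue = asyncio.Queue()
    tx_queue: asyncio.Queue = asyncio.Queue()
    sent: list = []
    # All three tasks run concurrently for the lifetime of the call.
    await asyncio.gather(
        plivo_rx(frames, stt_queue),
        deepgram_rx(stt_queue, tx_queue),
        plivo_tx(tx_queue, sent),
    )
    return sent
```

Calling `asyncio.run(run_call(["hello", "bye"]))` returns the replies in arrival order. Keeping a queue between the receive and send sides is also what makes barge-in simple: clearing the outbound queue stops playback without touching the other tasks.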
- Plivo: u-law 8kHz mono (telephony standard)
- Deepgram: Linear PCM 16-bit 8kHz
- ElevenLabs: Linear PCM 16-bit 24kHz
- Silero VAD: Float32 16kHz
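Since the VAD consumes Float32 at 16kHz while the decoded telephony audio is 16-bit PCM at 8kHz, some conversion step is implied. The following is a rough sketch of one way to do it, assuming linear interpolation is acceptable for VAD purposes. The helper names are hypothetical and the project's SileroVADProcessor may do this differently.

```python
import struct
from array import array


def pcm16_to_float32(pcm: bytes) -> array:
    """Scale little-endian 16-bit PCM to float32 samples in [-1.0, 1.0)."""
    n = len(pcm) // 2
    ints = struct.unpack(f"<{n}h", pcm[: n * 2])
    return array("f", (s / 32768.0 for s in ints))


def upsample_2x(samples: array) -> array:
    """Double the sample rate (8kHz -> 16kHz) by linear interpolation."""
    out = array("f")
    for i, current in enumerate(samples):
        # Hold the last sample when there is no next neighbour.
        following = samples[i + 1] if i + 1 < len(samples) else current
        out.append(current)
        out.append((current + following) / 2.0)
    return out
```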
Client-side Silero VAD runs on every Plivo audio frame:
- Speech start during agent response triggers barge-in (clears audio queue, sends clearAudio)
- Speech end signals turn completion for natural conversation flow
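The barge-in behaviour described above could look roughly like this. It is a sketch under assumptions: the handler name, the `send_text` callback, and the exact JSON shape of the clearAudio message (an `event` field plus a `streamId`) are illustrative and should be checked against the Plivo audio streaming documentation.

```python
import asyncio
import json


def drain(queue: asyncio.Queue) -> int:
    """Discard everything waiting in the queue; return how many items were dropped."""
    dropped = 0
    while True:
        try:
            queue.get_nowait()
        except asyncio.QueueEmpty:
            return dropped
        dropped += 1


async def on_speech_start(agent_speaking: bool, tx_queue: asyncio.Queue,
                          send_text, stream_id: str) -> int:
    """On VAD speech start: if the agent is talking, stop playback immediately."""
    if not agent_speaking:
        return 0
    dropped = drain(tx_queue)  # drop audio not yet sent
    # Ask the telephony side to discard audio it has already buffered.
    await send_text(json.dumps({"event": "clearAudio", "streamId": stream_id}))
    return dropped


async def demo():
    tx_queue: asyncio.Queue = asyncio.Queue()
    for _ in range(3):
        tx_queue.put_nowait(b"\xff" * 160)  # three queued u-law frames
    sent = []

    async def send_text(message: str) -> None:
        sent.append(message)

    dropped = await on_speech_start(True, tx_queue, send_text, "stream-1")
    return dropped, tx_queue.empty(), json.loads(sent[0])
```

Both steps matter: draining the local queue stops new audio from being sent, while the clearAudio message handles audio that has already left the server.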
# Unit tests (offline, no API keys needed)
uv run pytest tests/test_integration.py -v -k "unit"
# Local integration (starts server, needs API keys)
uv run pytest tests/test_integration.py -v -k "local"
# E2E live tests (needs API keys)
uv run pytest tests/test_e2e_live.py -v -s
# Live call tests (needs Plivo + API keys + ngrok)
uv run pytest tests/test_live_call.py -v -s
# Outbound call tests
uv run pytest tests/test_outbound_call.py -v -s
- Check ElevenLabs API key and voice ID
- Verify audio format conversion is working
- Check server logs for TTS errors
- Ensure you are using the nova-2-phonecall model for phone audio
- Check that audio is being received from Plivo (check logs)
- Verify ngrok is running and the URL is correct in .env
- Check Plivo console for webhook configuration
- Ensure PUBLIC_URL uses HTTPS
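A small check like the one below can catch a non-HTTPS PUBLIC_URL before a call is attempted. It is a hypothetical helper, not part of the project, and it assumes the WebSocket endpoint lives on the same host as the webhooks; the /ws path matches the inbound server route listed in the project structure.

```python
from urllib.parse import urlsplit, urlunsplit


def websocket_url(public_url: str, path: str = "/ws") -> str:
    """Validate PUBLIC_URL and derive the matching secure WebSocket URL."""
    parts = urlsplit(public_url.strip())
    if parts.scheme != "https" or not parts.netloc:
        raise ValueError(f"PUBLIC_URL must be an https:// URL, got {public_url!r}")
    # Same host, secure WebSocket scheme, fixed path.
    return urlunsplit(("wss", parts.netloc, path, "", ""))
```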