Local voice transcription with AI-powered refinement for developers
Transform your speech into clean, structured prompts using Whisper.cpp (local, GPU-accelerated) + Gemini API (cloud refinement).
- 🎤 Hotkey Recording: F8/F9 to start/stop
- 🚀 GPU Acceleration: CUDA-powered Whisper transcription
- 🤖 AI Refinement: Gemini cleans up filler words, fixes grammar, structures output
- 📝 Structured Output: XML/JSON/plain text formats
- 🔒 Privacy-First: Transcription runs locally, only refined text hits API
- ⚡ Auto-Paste: Seamlessly inserts text at cursor
- 🔌 VS Code Extension: Integrated workflow
Demo: `test_lq.mp4`
- Dictate code comments without "um" and "uh"
- Convert rambling thoughts into structured prompts
- Hands-free coding when keyboard is unavailable
- Faster brainstorming and documentation
1. **Build whisper.cpp with CUDA:**

   ```bash
   git clone https://github.com/ggerganov/whisper.cpp.git
   cd whisper.cpp
   mkdir build && cd build
   cmake .. -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
   cmake --build . --config Release -j$(nproc)
   cd ../..
   ```

2. **Download the model:**

   ```bash
   cd whisper.cpp/models
   bash download-ggml-model.sh medium.en
   cd ../..
   ```

3. **Install Python dependencies:**

   ```bash
   pip install sounddevice scipy numpy pyperclip pynput python-dotenv google-genai
   ```

4. **Configure AI refinement (optional but recommended):**

   Copy the example config:

   ```bash
   cp .env.example .env
   ```

   Edit `.env` and add your Gemini API key:

   ```bash
   GEMINI_API_KEY=your-api-key-here
   VC_ENABLE_LLM=true
   VC_LLM_FORMAT=xml  # Options: plain, xml, json
   ```

   Get a free API key: https://aistudio.google.com/apikey

5. **Run Voice Commander:**

   ```bash
   python Linux/portable_commander_gpu.py
   ```
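Once whisper.cpp is built, the app shells out to its CLI for local transcription. A minimal sketch of that step, assuming the binary and model paths produced by the build commands above (the actual script may use different paths or flags):

```python
import subprocess
from pathlib import Path

# Assumed locations based on the build steps above.
WHISPER_BIN = Path("whisper.cpp/build/bin/whisper-cli")
MODEL = Path("whisper.cpp/models/ggml-medium.en.bin")

def build_transcribe_cmd(wav_path: str) -> list:
    """Assemble the whisper.cpp CLI invocation for one recording."""
    return [
        str(WHISPER_BIN),
        "-m", str(MODEL),   # GGML model file
        "-f", wav_path,     # 16 kHz mono WAV input
        "-nt",              # no timestamps, plain text only
    ]

def transcribe(wav_path: str) -> str:
    """Run whisper.cpp locally and return the raw transcript from stdout."""
    result = subprocess.run(
        build_transcribe_cmd(wav_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

Because transcription is a subprocess call rather than an API request, the audio never leaves the machine; only the resulting text is optionally sent to Gemini.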
CPU-only setup (no CUDA required):

1. **Install whisper.cpp:**

   ```bash
   git clone https://github.com/ggerganov/whisper.cpp.git
   cd whisper.cpp
   make
   ```

2. **Download the model:**

   ```bash
   bash ./models/download-ggml-model.sh medium.en
   ```

3. **Install Python dependencies:**

   ```bash
   pip install sounddevice scipy numpy pyperclip
   ```

4. **Run Voice Commander:**

   ```bash
   python portable_commander.py
   ```
See the `VScode_extension/` folder for VS Code integration.
- Press F8 to start recording
- Press F9 to stop and paste text
- Works in any application
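Between F8 and F9 the microphone frames have to be buffered and written out as a 16 kHz mono WAV file for whisper.cpp. A sketch of that conversion using `scipy` and `numpy` from the dependency list (the function name and exact clipping/scaling are illustrative, not the project's actual code):

```python
import io
import numpy as np
from scipy.io import wavfile

SAMPLE_RATE = 16_000  # whisper.cpp expects 16 kHz mono input

def frames_to_wav(frames) -> bytes:
    """Convert buffered float32 audio frames (as captured by sounddevice
    between the F8 and F9 hotkeys) into 16-bit PCM WAV bytes."""
    audio = np.concatenate(frames)
    # Clip to [-1, 1] and scale to the int16 range expected by WAV PCM.
    pcm = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)
    buf = io.BytesIO()
    wavfile.write(buf, SAMPLE_RATE, pcm)
    return buf.getvalue()
```

The resulting bytes can be written to a temporary file and handed to the whisper.cpp binary.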
Edit the `.env` file:

| Variable | Options | Default | Description |
|---|---|---|---|
| `VC_ENABLE_LLM` | `true`/`false` | `true` | Enable AI refinement |
| `VC_LLM_FORMAT` | `plain`/`xml`/`json` | `xml` | Output structure |
| `GEMINI_API_KEY` | Your API key | - | Required for refinement |
| `VC_PASTE_MODE` | `auto`/`ctrl_v`/`ctrl_shift_v` | `auto` | Paste behavior |
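The variables above can be read with `python-dotenv` from the dependency list. A minimal sketch of loading them with the table's defaults (the `Config` class and field names are illustrative, not the project's actual structure):

```python
import os
from dataclasses import dataclass
from typing import Optional

try:
    from dotenv import load_dotenv  # python-dotenv, from the dependency list
except ImportError:
    def load_dotenv():
        pass  # fall back to plain environment variables

@dataclass
class Config:
    enable_llm: bool
    llm_format: str
    api_key: Optional[str]
    paste_mode: str

def load_config() -> Config:
    """Read the .env variables from the table above, applying the defaults."""
    load_dotenv()
    return Config(
        enable_llm=os.getenv("VC_ENABLE_LLM", "true").lower() == "true",
        llm_format=os.getenv("VC_LLM_FORMAT", "xml"),
        api_key=os.getenv("GEMINI_API_KEY"),
        paste_mode=os.getenv("VC_PASTE_MODE", "auto"),
    )
```

With no `.env` present and no variables set, this yields the documented defaults (`xml` output, refinement enabled, `auto` paste).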
- Python 3.7+
- CUDA-capable GPU (for acceleration)
- whisper.cpp compiled in parent directory
- Microphone access
- Gemini API key (free tier available)
- Press F8 → Start recording
- Speak naturally → "um, so like, I need a function that uh calculates fibonacci"
- Press F9 → Stop recording
- Whisper transcribes (local, GPU-accelerated)
- Gemini refines → Removes fillers, fixes grammar, structures output
- Auto-pastes → Clean text appears at cursor
Example:

- **Input:** `um so like I want to [NOISE] create a function that uh calculates fibonacci`
- **Output:** `<prompt><task>Create a function that calculates the Fibonacci sequence</task></prompt>`
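The actual cleanup is done by a Gemini prompt, but the transformation can be approximated offline with a rough regex pass; this sketch only illustrates the shape of the step (strip fillers and noise tags, wrap in the XML format) and is not the project's refinement logic:

```python
import re

# Filler words and Whisper noise tags to strip; the lists are illustrative.
FILLERS = re.compile(r"\b(um+|uh+|like|so)\b[, ]*", flags=re.IGNORECASE)
NOISE = re.compile(r"\[[A-Z]+\]\s*")  # tags such as [NOISE]

def refine_locally(raw: str) -> str:
    """Rough offline approximation of the Gemini cleanup step."""
    text = NOISE.sub("", raw)
    text = FILLERS.sub("", text)
    text = re.sub(r"\s+", " ", text).strip()
    return f"<prompt><task>{text}</task></prompt>"
```

Unlike the LLM, a regex pass cannot rephrase or restructure ("I want to create…" stays as dictated rather than becoming an imperative task), which is why the Gemini step exists.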
This project demonstrates:
- End-to-end ML pipeline integration
- GPU optimization (CUDA)
- API integration (Gemini)
- Production-ready error handling
- Real-world developer tooling
PRs welcome! Areas for improvement:
- Additional LLM providers (OpenAI, Anthropic)
- Custom prompt templates
- Multi-language support
- Voice command macros
MIT License - see LICENSE file