Overview
Dictation turns your voice into text in real time, anywhere a cursor blinks on macOS. Use it for hands-free writing, voice commands, AI prompts, and accessibility. Audio is captured locally with built-in voice-activity detection and routed to the speech-to-text provider you choose, either cloud or fully on-device. For transcribing pre-recorded audio or video files, see Transcription.
Activation Modes
| Mode | How to trigger | Best for |
|---|---|---|
| Toggle Dictation | Press the dictation hotkey (default ⌥ V) to start, press again to finish | Long-form dictation, multi-sentence input |
| Push to Dictation | Hold Fn to start, release to finish | Short utterances, chat-style replies |
| Dictation in SmartBar | Open SmartBar, press the mic icon or hotkey | Voice-driven Agent commands and AI prompts |
| Voice Commands | Speak a registered voice command phrase | Triggering workflows, opening apps, running tools |
Using Dictation
Speak
Speak naturally at a normal pace. Built-in voice-activity detection (Silero VAD via FluidAudio) ignores silence and background noise.
Pause or cancel
Press Esc to discard the current take. The indicator disappears and nothing is inserted.
Choosing a Dictation Provider
The dictation provider is set under Settings → Dictation → Dictation Model Provider. EnConvo supports both real-time streaming providers (lowest latency, partials appear as you speak) and batch providers (transcript appears after you finish).
- Real-time streaming
- Batch (post-utterance)
- Local (offline)
Streaming providers send partial transcripts back as you speak. They are the best choice for live dictation because there is no wait at the end.
| Provider | Model | Notes |
|---|---|---|
| Microsoft Azure | Azure Realtime | 100+ languages, free via Cloud Plan, recommended default |
| Soniox | STT Realtime v4 | Strong multilingual code-switching, language hints |
| AssemblyAI | Universal-Streaming v3 | Best-in-class diarization, US English focus |
| ElevenLabs | Scribe v2 Realtime | High quality, automatic language detection |
| Volcengine | BigASR Realtime (sauc.duration) | Best for Mandarin and Chinese dialects |
You can pick a different provider for Dictation and for Transcription of audio/video files — for example, Azure Realtime for live dictation and AssemblyAI for meeting transcripts.
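To make the streaming versus batch distinction concrete, here is a small simulation. It is only an illustrative sketch: `TranscriptEvent`, `streaming_transcribe`, and `batch_transcribe` are hypothetical names, not part of EnConvo or any provider SDK, and each string stands in for the words recognized in one audio chunk.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class TranscriptEvent:
    text: str        # transcript text available at this point
    is_final: bool   # True once the provider has committed the text


def streaming_transcribe(chunks: List[str]) -> Iterator[TranscriptEvent]:
    """Hypothetical streaming provider: emits a growing partial after every chunk."""
    words: List[str] = []
    for chunk in chunks:
        words.append(chunk)
        yield TranscriptEvent(" ".join(words), is_final=False)
    yield TranscriptEvent(" ".join(words), is_final=True)


def batch_transcribe(chunks: List[str]) -> Iterator[TranscriptEvent]:
    """Hypothetical batch provider: emits nothing until all audio has arrived."""
    yield TranscriptEvent(" ".join(chunks), is_final=True)


# Each string stands in for the words recognized in one audio chunk.
spoken = ["open", "the", "quarterly", "report"]

streaming_events = list(streaming_transcribe(spoken))
batch_events = list(batch_transcribe(spoken))

for event in streaming_events:
    label = "final" if event.is_final else "partial"
    print(f"streaming {label}: {event.text}")
for event in batch_events:
    print(f"batch final: {event.text}")
```

Both paths end with the same final text. The difference is that the streaming path has already shown most of it before the speaker stops, which is why it feels faster at the end of a turn.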
Language
Set your primary language under Settings → Dictation → Primary Language. Most providers also support Auto Detect, but specifying the language usually improves both accuracy and speed because the model can skip language identification. For multilingual or code-switching speech (e.g. mixing Chinese and English), use providers that accept language hints: Soniox, Volcengine, and ElevenLabs all support biasing toward several languages at once.
Voice Commands
Dictation in SmartBar can drive any EnConvo command. Common patterns:
- Conversational AI: “Summarize the email I just opened”
- Workflow trigger: “Create a reminder for tomorrow at 9am”
- System action: “Open Cursor in this folder”
Tips for Better Recognition
Speak clearly, but naturally
Maintain a consistent pace and enunciate clearly — but don’t over-articulate. Modern ASR is trained on natural speech.
Use a directional or close-talk microphone
Built-in MacBook mics work well in quiet rooms; in noisy environments, a USB headset or lavalier microphone can substantially lower the word error rate (WER).
Add domain terms to hot-words
For Soniox, Volcengine, and AssemblyAI, drop product names, acronyms, and jargon into the provider’s hot-words / context-terms field.
Use the prompt field for context
Whisper-style providers (Groq, OpenAI, Local Whisper) accept a free-form prompt up to 224 tokens. Add a sentence describing the topic to bias the model.
Set the language explicitly
Auto Detect is convenient, but specifying the language is faster and usually more accurate.
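The microphone tip above refers to WER (word error rate). If you want to compare microphones or providers on your own voice, one option is to dictate a known sentence and score the result. The sketch below computes WER with a standard word-level edit distance; the function name and the sample sentences are illustrative assumptions, and EnConvo does not expose this calculation itself.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the ref words seen so far and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion: reference word missing from hypothesis
                current[j - 1] + 1,      # insertion: extra word in hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)


# Hypothetical example: what was said versus what was transcribed.
reference = "schedule a meeting with the design team"
hypothesis = "schedule a meeting with design teams"
print(f"WER: {word_error_rate(reference, hypothesis):.2%}")
```

Lower is better: 0% means every word matched. This version ignores punctuation handling and number normalization, so treat the result as a rough comparison between setups, not a benchmark figure.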
Privacy
Audio is captured locally and sent to your selected provider. To keep audio fully on-device, use NVIDIA Parakeet, Qwen3 ASR, or Local Whisper — none of them require an internet connection after the model has been downloaded.
Troubleshooting
Microphone not detected
- Open System Settings → Privacy & Security → Microphone
- Confirm EnConvo has microphone access
- Make sure the right microphone is selected as the system input
- Restart EnConvo after granting permissions
Dictation doesn't insert text
- Check that the active app accepts text input (some sandboxed inputs reject paste)
- Open System Settings → Privacy & Security → Accessibility and confirm EnConvo is enabled — it needs Accessibility to paste
- Try Push to Dictation (hold Fn) to rule out a problem with the toggle hotkey
High end-of-turn latency
Switch to a streaming provider (Azure Realtime, Soniox Realtime, AssemblyAI Streaming). Batch providers like Groq Whisper only return the transcript after you stop speaking.
Poor transcription accuracy
- Reduce background noise or move closer to the mic
- Set the language explicitly instead of Auto Detect
- Add domain terms to hot-words / context terms
- Try a higher-tier model — e.g. Whisper Large V3 over Turbo
Local model download fails
Local Whisper, Parakeet, and Qwen ASR weights are downloaded on first use. Make sure you have a stable internet connection and about 600 MB of free disk space, and check Activity Monitor for blocked network connections.
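If you want to confirm the free-space requirement before retrying a download, a quick check along these lines may help. This is a sketch under stated assumptions: it inspects the volume that holds your home folder, because the actual model storage location is not documented here, and `has_room_for_model` is a hypothetical helper name.

```python
import shutil
from pathlib import Path

REQUIRED_BYTES = 600 * 1000 * 1000  # roughly 600 MB, the figure quoted above


def has_room_for_model(path: Path, required_bytes: int = REQUIRED_BYTES) -> bool:
    """Return True if the volume containing `path` has enough free space."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_bytes


# Assumption: the home volume is checked; the real model directory may differ.
target = Path.home()
free_mb = shutil.disk_usage(target).free / 1_000_000
status = "enough" if has_room_for_model(target) else "not enough"
print(f"{free_mb:,.0f} MB free on the volume holding {target}: {status} space")
```

Larger models may need more than this, so leaving some headroom beyond the minimum is a sensible precaution.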
Related Features
Transcription
Transcribe pre-recorded audio and video files
Speech Recognition
Compare every supported STT provider and model
SmartBar
Use voice input in SmartBar for AI commands
Meeting Recording
Record meetings with live transcription