Overview

Dictation turns your voice into text in real time, anywhere a cursor blinks on macOS. Use it for hands-free writing, voice commands, AI prompts, and accessibility. Audio is captured locally with built-in voice-activity detection and routed to the speech-to-text provider you choose — cloud or fully on-device. For transcribing pre-recorded audio or video files, see Transcription.

Activation Modes

| Mode | How to trigger | Best for |
| --- | --- | --- |
| Toggle Dictation | Press the dictation hotkey (default ⌥ V) to start, press again to finish | Long-form dictation, multi-sentence input |
| Push to Dictation | Hold Fn to start, release to finish | Short utterances, chat-style replies |
| Dictation in SmartBar | Open SmartBar, press the mic icon or hotkey | Voice-driven Agent commands and AI prompts |
| Voice Commands | Speak a registered voice command phrase | Triggering workflows, opening apps, running tools |

Push-to-talk on Fn is the fastest path for chat. Toggle mode (⌥ V) is better when you want to think between sentences without re-pressing a key.
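
The difference between toggle and push activation can be sketched as a small state machine. This is an illustrative model only, not EnConvo's implementation; the `Mode` and `DictationSession` names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    TOGGLE = "toggle"  # press once to start, press again to finish
    PUSH = "push"      # hold to record, release to finish


@dataclass
class DictationSession:
    """Hypothetical model of how key events map to recording state."""
    mode: Mode
    recording: bool = False

    def key_down(self) -> None:
        if self.mode is Mode.TOGGLE:
            # Each press flips the state.
            self.recording = not self.recording
        else:
            # Holding the key starts recording.
            self.recording = True

    def key_up(self) -> None:
        if self.mode is Mode.PUSH:
            # Releasing the key finishes the take.
            self.recording = False
        # In toggle mode, releasing the key changes nothing.
```

In toggle mode the session stays active between key presses, which is why it suits long-form dictation with pauses; in push mode the take ends the moment the key is released.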

Using Dictation

  1. Activate: Use any of the activation modes above. A floating indicator confirms EnConvo is listening.
  2. Speak: Speak naturally at a normal pace. Built-in voice-activity detection (Silero VAD via FluidAudio) ignores silence and background noise.
  3. Pause or cancel: Press Esc to discard the current take. The indicator disappears and nothing is inserted.
  4. Finish: Press the hotkey again (toggle mode) or release the key (push mode). EnConvo finalizes the transcript and pastes it at the cursor.
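
To give a feel for what voice-activity detection does in the Speak step, here is a deliberately simple energy-threshold sketch. It is not Silero VAD (a neural model) and not how EnConvo works internally; the function names and the `0.02` threshold are assumptions chosen for illustration.

```python
import math
from typing import Iterable, List, Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def speech_frames(frames: Iterable[Sequence[float]], threshold: float = 0.02) -> List[bool]:
    """Mark a frame True when its energy exceeds the (assumed) threshold."""
    return [rms(f) > threshold for f in frames]


# Two silent frames around one loud frame.
silence = [0.0] * 160
loud = [0.5, -0.5] * 80
flags = speech_frames([silence, loud, silence])
```

A real detector such as Silero VAD classifies speech with a trained model rather than raw energy, which makes it far more robust to background noise than this threshold.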

Choosing a Dictation Provider

The dictation provider is set under Settings → Dictation → Dictation Model Provider. EnConvo supports both real-time streaming providers (lowest latency, partials appear as you speak) and batch providers (transcript appears after you finish).
Streaming providers send partial transcripts back as you speak. They are the best choice for live dictation because there is no wait at the end.

| Provider | Model | Notes |
| --- | --- | --- |
| Microsoft Azure | Azure Realtime | 100+ languages, free via Cloud Plan, recommended default |
| Soniox | STT Realtime v4 | Strong multilingual code-switching, language hints |
| AssemblyAI | Universal-Streaming v3 | Best-in-class diarization, US English focus |
| ElevenLabs | Scribe v2 Realtime | High quality, automatic language detection |
| Volcengine | BigASR Realtime (sauc.duration) | Best for Mandarin and Chinese dialects |

You can pick a different provider for Dictation and for Transcription of audio/video files. For example, use Azure Realtime for live dictation and AssemblyAI for meeting transcripts.
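
The streaming versus batch distinction described above can be illustrated with a toy sketch. No real provider API is called here: the function names are hypothetical, and each string in `chunks` stands in for the text recognized from one slice of audio.

```python
from typing import Iterator, List


def streaming_transcribe(chunks: List[str]) -> Iterator[str]:
    """Yield a growing partial transcript after every chunk (streaming style)."""
    words: List[str] = []
    for chunk in chunks:
        words.append(chunk)
        yield " ".join(words)


def batch_transcribe(chunks: List[str]) -> str:
    """Return one transcript only after all chunks are available (batch style)."""
    return " ".join(chunks)


partials = list(streaming_transcribe(["hello", "world"]))
final = batch_transcribe(["hello", "world"])
```

With the streaming function, text is available after the first chunk; with the batch function, nothing is returned until the whole input has been collected.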

Language

Set your primary language under Settings → Dictation → Primary Language. Most providers also support Auto Detect, but specifying the language usually improves both accuracy and speed because the model can skip language identification. For multilingual or code-switching speech (e.g. mixing Chinese and English), use providers that accept language hints — Soniox, Volcengine, and ElevenLabs all support biasing toward several languages at once.

Voice Commands

Dictation in SmartBar can drive any EnConvo command. Common patterns:
  • Conversational AI — “Summarize the email I just opened”
  • Workflow trigger — “Create a reminder for tomorrow at 9am”
  • System action — “Open Cursor in this folder”
Voice commands are matched against the registered command list, so you can build custom phrases through the Workflow editor.
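
As a rough illustration of matching a spoken phrase against a registered command list, the sketch below uses fuzzy string matching from Python's standard library. The registry contents, command identifiers, and the `0.6` cutoff are all hypothetical; EnConvo's actual matching logic is not documented here and may work differently.

```python
import difflib
from typing import Dict, Optional

# Hypothetical registry: spoken phrase -> command identifier.
COMMANDS: Dict[str, str] = {
    "summarize this email": "ai.summarize_email",
    "create a reminder": "workflow.create_reminder",
    "open cursor in this folder": "system.open_cursor",
}


def match_command(transcript: str, cutoff: float = 0.6) -> Optional[str]:
    """Return the identifier of the closest registered phrase, or None."""
    # Lowercase and collapse whitespace so casing and spacing do not matter.
    normalized = " ".join(transcript.lower().split())
    hits = difflib.get_close_matches(normalized, list(COMMANDS), n=1, cutoff=cutoff)
    return COMMANDS[hits[0]] if hits else None
```

A fuzzy match tolerates small transcription differences (for example "the folder" instead of "this folder"), at the cost of occasionally matching a phrase the speaker did not intend; the cutoff controls that trade-off.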

Tips for Better Recognition

  • Maintain a consistent pace and enunciate clearly, but don’t over-articulate. Modern ASR is trained on natural speech.
  • Built-in MacBook mics work well in quiet rooms; in noisy environments, a USB headset or lavalier mic substantially lowers the word error rate (WER).
  • For Soniox, Volcengine, and AssemblyAI, add product names, acronyms, and jargon to the provider’s hot-words / context-terms field.
  • Whisper-style providers (Groq, OpenAI, Local Whisper) accept a free-form prompt of up to 224 tokens. Add a sentence describing the topic to bias the model.
  • Auto Detect is convenient, but specifying the language is faster and usually more accurate.
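
WER, mentioned in the microphone tip above, is conventionally computed as the word-level edit distance between a reference transcript and the recognized text, divided by the number of reference words. The sketch below is a minimal version of that standard formula; the function name is illustrative, and it does no punctuation or number normalization.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
wer = word_error_rate("open the smart bar", "open a smart bar")
```

Reading the same sentence with two microphones and comparing the resulting WER is a simple way to check whether a hardware change actually helps.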

Privacy

Audio is captured locally and sent to your selected provider. To keep audio fully on-device, use NVIDIA Parakeet, Qwen3 ASR, or Local Whisper. None of them requires an internet connection after the model has been downloaded.
For cloud providers, audio is processed under the provider’s privacy policy. The EnConvo Cloud Plan adds a proxy layer so your provider API keys never need to leave your machine.

Troubleshooting

No audio is captured:
  1. Open System Settings → Privacy & Security → Microphone
  2. Confirm EnConvo has microphone access
  3. Make sure the right microphone is selected as the system input
  4. Restart EnConvo after granting permissions

Transcript is not inserted at the cursor:
  1. Check that the active app accepts text input (some sandboxed inputs reject paste)
  2. Open System Settings → Privacy & Security → Accessibility and confirm EnConvo is enabled; it needs Accessibility access to paste
  3. Try Push to Dictation (Fn) to rule out the toggle hotkey

Transcript appears only after you stop speaking:
Switch to a streaming provider (Azure Realtime, Soniox Realtime, AssemblyAI Streaming). Batch providers like Groq Whisper only return the transcript after you stop speaking.

Recognition accuracy is poor:
  1. Reduce background noise or move closer to the mic
  2. Set the language explicitly instead of Auto Detect
  3. Add domain terms to hot-words / context terms
  4. Try a higher-tier model, e.g. Whisper Large V3 over Turbo

On-device model download fails or stalls:
Local Whisper, Parakeet, and Qwen3 ASR weights are downloaded on first use. Ensure stable internet, ~600 MB free disk, and check Activity Monitor for blocked network connections.
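
The free-disk check can be scripted as a quick sanity test. This sketch uses Python's standard `shutil.disk_usage`; the function name is hypothetical, and because the location where EnConvo stores model weights is not stated here, the path to check is an assumption you should adjust.

```python
import shutil

# Roughly the ~600 MB of free space suggested above.
REQUIRED_BYTES = 600 * 1024 * 1024


def has_room_for_model(path: str = "/", required_bytes: int = REQUIRED_BYTES) -> bool:
    """Return True if the volume containing `path` has enough free space."""
    usage = shutil.disk_usage(path)
    return usage.free >= required_bytes
```

Calling `has_room_for_model()` with no arguments checks the root volume; pass the volume that holds your home folder if it differs.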

Transcription

Transcribe pre-recorded audio and video files

Speech Recognition

Compare every supported STT provider and model

SmartBar

Use voice input in SmartBar for AI commands

Meeting Recording

Record meetings with live transcription