Overview
EnConvo supports at least a dozen speech recognition providers, from high-accuracy cloud services to fully local models that keep your audio data on your Mac. Whether you need real-time dictation, long-form file transcription, or multilingual support, there is a provider suited to the job. For the user-facing features built on top of these providers, see Dictation and Transcription.
Supported Providers
Cloud Providers
| Provider | Model | Best For | Languages |
|---|---|---|---|
| Groq Whisper | Whisper Large V3 Turbo, Whisper Large V3 | Fast cloud transcription | 57+ |
| OpenAI | gpt-4o-transcribe, gpt-4o-mini-transcribe | High accuracy with AI post-processing | 57+ |
| Microsoft Azure | Azure Realtime, Azure Fast Transcription | Enterprise-grade streaming and batch | 100+ |
| Google Gemini | Gemini 3.1 Flash Lite | Multimodal audio understanding, up to 9.5 hours | 100+ |
| Mistral | Voxtral Mini | Lightweight, fast transcription | 20+ |
| Soniox | STT Realtime v4, STT Async v4 | Multilingual real-time dictation and file transcription | 60+ |
| AssemblyAI | Universal, Universal-Streaming v3 | Best-in-class speaker diarization | 12+ |
| ElevenLabs | Scribe v2, Scribe v2 Realtime | High quality, automatic language detection | 29+ |
| Volcengine BigASR | Realtime, Flash, Async (ByteDance) | Mandarin and Chinese dialects | 10+ |
Local Providers (Offline)
| Provider | Model | Best For | Languages |
|---|---|---|---|
| NVIDIA Parakeet | Parakeet TDT 0.6B V3 | Fast local ASR on Apple Silicon | 25 |
| Qwen ASR | Qwen3-ASR 0.6B/1.7B | Multilingual local ASR, Chinese dialects | 30+ |
| Local Whisper | whisper-base / small / large-v3 | Privacy-first offline recognition | 57+ |
Provider Highlights
NVIDIA Parakeet TDT (Local)
The flagship local ASR model in EnConvo, powered by the FluidAudio framework.
- 209.8x real-time speed on Apple Silicon: a 1-minute audio file transcribes in under 0.3 seconds
- 2.1% Word Error Rate (WER), approaching cloud provider accuracy
- 25 European languages with automatic language detection
- Automatic punctuation, capitalization, and word-level timestamps
- Supports long audio up to 24 minutes (full attention) or 3 hours (local attention)
- Released under the open CC-BY-4.0 license
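The speed figure above can be checked with a line of arithmetic. The sketch below is only a worked example of the 209.8x claim; the function name is hypothetical and not part of EnConvo or FluidAudio, and real timings will vary with hardware and audio content.

```python
def transcription_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Estimated processing time for audio at a given real-time speed multiple."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


# A 1-minute file at the quoted 209.8x real-time speed.
one_minute = transcription_seconds(60, 209.8)
print(f"{one_minute:.3f} s")  # prints 0.286 s, i.e. under 0.3 seconds
```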
Qwen ASR (Local)
Alibaba’s state-of-the-art open-source ASR model with exceptional multilingual support.
- 30 languages and 22 Chinese dialects, ideal for Chinese-language users
- Available in 0.6B and 1.7B parameter sizes
- 4-bit and 8-bit quantized variants for efficient local inference via MLX
- Automatic language identification
- Streaming and offline unified inference
| Model | Parameters | Quantization | Memory |
|---|---|---|---|
| Qwen3-ASR-0.6B-4bit | 0.6B | 4-bit | Minimal |
| Qwen3-ASR-0.6B-8bit | 0.6B | 8-bit | Low |
| Qwen3-ASR-1.7B-4bit | 1.7B | 4-bit | Moderate |
| Qwen3-ASR-1.7B-8bit | 1.7B | 8-bit | Moderate |
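As a rough guide to why the memory column grows down the table, the weights of a quantized model take about parameters × bits ÷ 8 bytes. The sketch below computes that weights-only lower bound. It is an approximation under the assumption that every weight is quantized to the stated width; actual downloads and runtime memory will be larger because of unquantized layers, activations, and framework overhead.

```python
def weight_megabytes(parameters: float, bits_per_weight: int) -> float:
    """Weights-only size estimate in MB (10^6 bytes); a lower bound, not a measurement."""
    return parameters * bits_per_weight / 8 / 1_000_000


# Variants from the table above: (name, parameter count, bits per weight).
VARIANTS = [
    ("Qwen3-ASR-0.6B-4bit", 0.6e9, 4),
    ("Qwen3-ASR-0.6B-8bit", 0.6e9, 8),
    ("Qwen3-ASR-1.7B-4bit", 1.7e9, 4),
    ("Qwen3-ASR-1.7B-8bit", 1.7e9, 8),
]

for name, params, bits in VARIANTS:
    print(f"{name}: at least ~{weight_megabytes(params, bits):.0f} MB of weights")
```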
Groq Whisper (Cloud)
Ultra-fast cloud inference powered by Groq’s LPU hardware.
- Whisper Large V3 Turbo: fastest cloud Whisper variant
- Whisper Large V3: highest accuracy Whisper model
- Custom prompts for domain-specific vocabulary (up to 224 tokens)
- Language-specific hints for improved accuracy
- Available free through Enconvo Cloud Plan
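To illustrate the custom prompt and language hint options, here is a minimal sketch that builds a vocabulary prompt and keeps it within the 224-token limit. Several things here are assumptions: the token count is a crude character-based estimate rather than the real Whisper tokenizer, the helper names are hypothetical, and the parameter dictionary mirrors the shape of an OpenAI-compatible transcription request rather than anything EnConvo documents. In EnConvo itself you would enter the prompt in the provider settings.

```python
MAX_PROMPT_TOKENS = 224  # limit quoted in the list above


def rough_token_count(text: str) -> int:
    """Crude estimate (about 3 characters per token). NOT the Whisper tokenizer."""
    return -(-len(text) // 3)  # ceiling division


def build_prompt(terms: list[str], max_tokens: int = MAX_PROMPT_TOKENS) -> str:
    """Join domain terms into a prompt, dropping trailing terms that would exceed the budget."""
    kept: list[str] = []
    for term in terms:
        candidate = ", ".join(kept + [term])
        if rough_token_count(candidate) > max_tokens:
            break
        kept.append(term)
    return ", ".join(kept)


# Hypothetical request parameters; field names are an assumption.
params = {
    "model": "whisper-large-v3-turbo",
    "language": "en",  # explicit language hint
    "prompt": build_prompt(["Kubernetes", "kubectl", "etcd", "Istio"]),
}
print(params["prompt"])  # prints Kubernetes, kubectl, etcd, Istio
```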
Google Gemini STT (Cloud)
Uses Gemini’s multimodal understanding for audio transcription.
- Supports audio files up to 9.5 hours in duration
- Gemini 3.1 Flash Lite: cost-efficient, fastest performance
- Supports structured output and advanced text post-processing
- Available through Enconvo Cloud Plan or your own Google API key
Soniox STT (Cloud)
Soniox provides multilingual speech recognition for both real-time dictation and file-based transcription.
- Soniox STT Realtime v4 for live dictation through EnConvo’s macOS realtime client
- Soniox STT Async v4 for audio/video file transcription after recording
- Supports automatic language detection, language hints, punctuation, and speaker diarization
- Uses your own Soniox API key through EnConvo Credentials
Volcengine BigASR (Cloud)
ByteDance’s Doubao speech model, optimized for Mandarin and Chinese dialects.
- Realtime (volc.bigasr.sauc.duration) for live dictation via WebSocket
- Flash (sync) for files ≤ 2 hours and ≤ 100 MB; automatically falls back to async for longer files
- Hot words and language hints for biasing recognition
- Works through your own Volcengine credentials or via Enconvo Cloud Plan
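The Flash-versus-async fallback described above amounts to a simple threshold check. The sketch below shows one way to express it; the function and constant names are hypothetical, 100 MB is taken as 100 × 10^6 bytes, and sending oversized (not just overlong) files to async is an assumption, since the text only mentions longer files.

```python
FLASH_MAX_SECONDS = 2 * 60 * 60      # 2 hours
FLASH_MAX_BYTES = 100 * 1000 * 1000  # 100 MB, assuming decimal megabytes


def pick_file_mode(duration_seconds: float, size_bytes: int) -> str:
    """Return "flash" when the file fits both limits, otherwise "async"."""
    if duration_seconds <= FLASH_MAX_SECONDS and size_bytes <= FLASH_MAX_BYTES:
        return "flash"
    return "async"


print(pick_file_mode(45 * 60, 30_000_000))   # prints flash
print(pick_file_mode(3 * 3600, 30_000_000))  # prints async
```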
AssemblyAI (Cloud)
Industry-leading speaker diarization for multi-speaker recordings.
- Universal-Streaming v3 for real-time dictation
- Universal for async file transcription with diarization, word-level timestamps, and confidence scores
- Word boost (custom vocabulary) and content moderation
- Available via Enconvo Cloud Plan or your own AssemblyAI key
ElevenLabs Scribe (Cloud)
ElevenLabs’ STT, strong on accented and noisy speech.
- Scribe v2 Realtime for live dictation
- Scribe v2 for async file transcription with word-level timestamps
- Automatic language detection across 29+ languages
- Available via Enconvo Cloud Plan or your own ElevenLabs key
Configuration
Open Dictation Settings
Navigate to the Dictation command settings in EnConvo. The speech recognition provider is configured as the Dictation Model Provider.
Select a Provider
Choose your preferred provider from the dropdown. Cloud providers offer the highest accuracy, while local providers offer privacy and offline support.
Set Up Credentials
For cloud providers, configure your API key through the Credential Provider. Enconvo Cloud Plan users get access to Groq Whisper, Microsoft Azure, Google Gemini, OpenAI, and Mistral without separate API keys.
Choose a Language
Set your primary language for transcription. Most providers support Auto Detect, but specifying a language can improve accuracy and speed.
Use Cases
Real-Time Dictation
The primary use of speech recognition in EnConvo is real-time dictation: speaking and having your words appear as text instantly.
- Press the dictation hotkey (default: Option + V) or hold Fn
- Speak naturally at a comfortable pace
- Press the hotkey again (or release Fn) to finish
- The transcribed text is inserted at your cursor position
For real-time dictation, streaming providers like Microsoft Azure, Soniox, AssemblyAI, ElevenLabs, and Volcengine provide the lowest latency. Batch providers like Groq Whisper and OpenAI process after you finish speaking.
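The streaming and batch grouping above can be captured in a small lookup, which is handy if you keep notes on which provider to pick for dictation. This is only a sketch: the provider names are plain labels copied from this page, not identifiers from any EnConvo API, and providers not named in the paragraph are reported as unknown rather than guessed.

```python
# Grouping taken from the paragraph above.
STREAMING_PROVIDERS = {"Microsoft Azure", "Soniox", "AssemblyAI", "ElevenLabs", "Volcengine"}
BATCH_PROVIDERS = {"Groq Whisper", "OpenAI"}


def dictation_mode(provider: str) -> str:
    """Return "streaming", "batch", or "unknown" for a provider label."""
    if provider in STREAMING_PROVIDERS:
        return "streaming"
    if provider in BATCH_PROVIDERS:
        return "batch"
    return "unknown"


print(dictation_mode("Soniox"))        # prints streaming
print(dictation_mode("Groq Whisper"))  # prints batch
```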
Dictation in SmartBar
Use voice input directly within the SmartBar for AI interactions:
- Open the SmartBar
- Click the microphone icon or use the dictation shortcut
- Speak your query
- The transcribed text becomes your AI prompt
Audio/Video File Transcription
Transcribe pre-recorded audio and video files:
- Use the Transcribe Audio/Video files command
- Drag and drop audio or video files
- Choose your output format (plain text or .txt file)
- EnConvo processes the files with speaker diarization support
Meeting Transcription
Combined with EnConvo’s Meeting Recording feature, speech recognition powers live meeting transcription. See the Meeting Recording documentation for details.
Voice Activity Detection (VAD)
EnConvo includes built-in Voice Activity Detection powered by Silero VAD via the FluidAudio framework:
- Automatic start/stop: Detects when you begin and stop speaking
- Background noise filtering: Ignores ambient sounds and only captures speech
- Low latency: Sub-100ms detection for natural dictation flow
- Runs locally: VAD processing happens entirely on your Mac
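To make the start/stop idea concrete, here is a minimal energy-threshold detector. It is deliberately not Silero VAD, which is a neural model; it swaps in a much simpler technique (per-frame RMS with a short hangover) purely to show how speech boundaries can be derived from audio frames. The frame size, threshold, and hangover values are illustrative assumptions.

```python
import math


def frame_rms(frame: list[float]) -> float:
    """Root-mean-square level of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_speech(samples, frame_size=160, threshold=0.02, hangover_frames=3):
    """Return (start, end) sample ranges whose frames exceed the RMS threshold.

    A segment closes only after more than `hangover_frames` quiet frames in a
    row, so short pauses do not split it.
    """
    segments, start, quiet = [], None, 0
    n_frames = len(samples) // frame_size
    for i in range(n_frames):
        frame = samples[i * frame_size:(i + 1) * frame_size]
        if frame_rms(frame) >= threshold:
            if start is None:
                start = i * frame_size
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet > hangover_frames:
                # End the segment at the first quiet frame of this run.
                segments.append((start, (i - quiet + 1) * frame_size))
                start, quiet = None, 0
    if start is not None:
        segments.append((start, (n_frames - quiet) * frame_size))
    return segments


# 16 kHz test signal: 50 ms silence, 100 ms of a 440 Hz tone, 100 ms silence.
tone = [0.5 * math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
samples = [0.0] * 800 + tone + [0.0] * 1600
segments = detect_speech(samples)
print(segments)  # prints [(800, 2400)]
```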
Language Support
Provider Language Coverage
Most Languages
Microsoft Azure supports 100+ languages and dialects, making it the broadest choice. Combined with the free Enconvo Cloud Plan tier, it is an excellent default option.
Setting the Language
Most providers support automatic language detection, but explicitly setting the language is recommended for:
- Better accuracy: The model focuses on the correct language’s phonemes
- Faster processing: Skips the language detection step
- Mixed content: When you know the primary language in multilingual audio
Offline Speech Recognition
For users who need speech recognition without an internet connection:
| Solution | Model Size | Speed | Accuracy | Setup |
|---|---|---|---|---|
| NVIDIA Parakeet | ~600 MB | Fastest | Excellent | Auto-download on first use |
| Qwen ASR (4-bit) | ~400 MB | Fast | Very Good | Auto-download on first use |
| Local Whisper (small) | ~216 MB | Good | Good | Auto-download on first use |
| Local Whisper (large-v3) | ~626 MB | Moderate | Excellent | Auto-download on first use |
Advanced Configuration
Custom Prompts (Groq Whisper / Local Whisper)
Provide context to improve transcription of domain-specific terms.
Audio Preprocessing
EnConvo automatically preprocesses audio for optimal recognition:
- Converts to 16kHz mono FLAC/WAV format
- Splits large files into manageable chunks with overlap
- Merges transcription results using sequence alignment
- Cleans up temporary files after processing
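The chunk-with-overlap step above can be sketched as a function that returns time ranges for a file of a given length. The chunk length and overlap used here are illustrative assumptions; EnConvo's actual values depend on the provider and are not documented on this page.

```python
def chunk_ranges(total_seconds: float, chunk_seconds: float = 600.0,
                 overlap_seconds: float = 5.0) -> list[tuple[float, float]]:
    """Split [0, total_seconds] into chunks where each chunk starts
    `overlap_seconds` before the previous one ends."""
    if chunk_seconds <= overlap_seconds:
        raise ValueError("chunk_seconds must be larger than overlap_seconds")
    ranges = []
    start = 0.0
    step = chunk_seconds - overlap_seconds
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        ranges.append((start, end))
        if end >= total_seconds:
            break
        start += step
    return ranges


# A 25-minute file in 10-minute chunks with 5 seconds of overlap.
print(chunk_ranges(1500))
# prints [(0.0, 600.0), (595.0, 1195.0), (1190.0, 1500)]
```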
Large File Handling
For long audio files, EnConvo automatically:
- Splits audio into provider-appropriate chunks
- Adds overlap between chunks to avoid word loss at boundaries
- Processes chunks in parallel where possible
- Merges results using intelligent sequence alignment
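Merging overlapping chunk transcripts can be approximated with Python's standard `difflib.SequenceMatcher`. The sketch below is one possible alignment approach, not EnConvo's actual algorithm: it finds the longest run of words shared by the end of one transcript and the start of the next, keeps the left text up to the end of that run, and continues with the right text after it. The window size and minimum match length are assumed values.

```python
from difflib import SequenceMatcher


def merge_transcripts(left: str, right: str, max_overlap_words: int = 20,
                      min_match_words: int = 2) -> str:
    """Join two transcripts whose audio overlapped, removing the duplicated words."""
    a, b = left.split(), right.split()
    tail = a[-max_overlap_words:]   # end of the left transcript
    head = b[:max_overlap_words]    # start of the right transcript
    matcher = SequenceMatcher(None, tail, head, autojunk=False)
    match = matcher.find_longest_match(0, len(tail), 0, len(head))
    if match.size < min_match_words:
        # No reliable anchor found: fall back to plain concatenation.
        return " ".join(a + b)
    keep_left = a[: len(a) - len(tail) + match.a + match.size]
    keep_right = b[match.b + match.size:]
    return " ".join(keep_left + keep_right)


merged = merge_transcripts(
    "the quick brown fox jumps over the",
    "jumps over the lazy dog",
)
print(merged)  # prints the quick brown fox jumps over the lazy dog
```

Requiring at least two matching words avoids anchoring on a single common word such as "the".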
Enconvo Cloud Plan
The Enconvo Cloud Plan includes access to multiple STT providers:
| Provider | Model | Points per Use |
|---|---|---|
| Groq Whisper | Whisper Large V3 Turbo | 100 |
| Groq Whisper | Whisper Large V3 | 200 |
| OpenAI | gpt-4o-mini-transcribe | 100 |
| OpenAI | gpt-4o-transcribe | 200 |
| Google Gemini | Gemini 3.1 Flash Lite | 100 |
| Mistral | Voxtral Mini | 100 |
| Microsoft Azure | Azure Realtime / Fast Transcription | Free |
| AssemblyAI | Universal, Universal-Streaming v3 | Via Cloud |
| Soniox | STT Realtime v4, STT Async v4 | Via Cloud |
| ElevenLabs | Scribe v2, Scribe v2 Realtime | Via Cloud |
| Volcengine | BigASR Realtime, Flash, Async | Via Cloud |
Microsoft Azure STT via the Enconvo Cloud Plan is free and does not consume points. It is the recommended default for most users.
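For budgeting, the point costs in the table can be totalled with a short script. This sketch only covers the rows that list a numeric cost or Free; models marked "Via Cloud" have no published per-use figure here, so they are left out and will raise a `KeyError` rather than being guessed.

```python
# Points per use, copied from the table above (Free is treated as 0).
POINTS_PER_USE = {
    "Whisper Large V3 Turbo": 100,
    "Whisper Large V3": 200,
    "gpt-4o-mini-transcribe": 100,
    "gpt-4o-transcribe": 200,
    "Gemini 3.1 Flash Lite": 100,
    "Voxtral Mini": 100,
    "Azure Realtime / Fast Transcription": 0,
}


def total_points(uses: dict[str, int]) -> int:
    """Sum points for a mapping of model name to number of uses."""
    return sum(POINTS_PER_USE[model] * count for model, count in uses.items())


month = {
    "Whisper Large V3 Turbo": 10,
    "gpt-4o-transcribe": 5,
    "Azure Realtime / Fast Transcription": 20,
}
print(total_points(month))  # prints 2000
```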
Troubleshooting
Microphone not detected
- Open System Settings > Privacy & Security > Microphone
- Ensure EnConvo has microphone access enabled
- Check that the correct microphone is selected in your system audio settings
- Try restarting EnConvo after granting permissions
Poor transcription accuracy
- Speak clearly at a moderate pace
- Reduce background noise or use a directional microphone
- Explicitly set the language instead of using Auto Detect
- Try a larger model (e.g., Whisper Large V3 instead of V3 Turbo)
- Use custom prompts for specialized vocabulary
Local model download fails
- Ensure you have a stable internet connection for the initial download
- Check available disk space — models range from 200 MB to 1 GB
- Try a smaller model variant first (e.g., Qwen3-ASR-0.6B-4bit)
- Check the console logs for download errors
Transcription is slow
- For real-time dictation, switch to a streaming provider (Azure, Soniox, AssemblyAI, ElevenLabs, Volcengine)
- For batch transcription, Groq Whisper offers the fastest cloud processing
- For local transcription, NVIDIA Parakeet is the fastest option
- Ensure your Mac is not thermally throttling (check Activity Monitor)
Related Features
Dictation
Voice-to-text dictation interface
Text to Speech
Convert text back to speech
Soniox
Configure Soniox speech-to-text
Meeting Recording
Record and transcribe meetings
SmartBar
Use voice input in SmartBar