Overview

EnConvo supports at least a dozen speech recognition providers, from cloud services with the highest accuracy to fully local models that keep your audio on your Mac. Whether you need real-time dictation, long-form file transcription, or multilingual support, one of these providers should fit. For the user-facing features built on top of them, see Dictation and Transcription.

Supported Providers

Cloud Providers

| Provider | Model | Best For | Languages |
| --- | --- | --- | --- |
| Groq Whisper | Whisper Large V3 Turbo, Whisper Large V3 | Fast cloud transcription | 57+ |
| OpenAI | gpt-4o-transcribe, gpt-4o-mini-transcribe | High accuracy with AI post-processing | 57+ |
| Microsoft Azure | Azure Realtime, Azure Fast Transcription | Enterprise-grade streaming and batch | 100+ |
| Google Gemini | Gemini 3.1 Flash Lite | Multimodal audio understanding, up to 9.5 hours | 100+ |
| Mistral | Voxtral Mini | Lightweight, fast transcription | 20+ |
| Soniox | STT Realtime v4, STT Async v4 | Multilingual real-time dictation and file transcription | 60+ |
| AssemblyAI | Universal, Universal-Streaming v3 | Best-in-class speaker diarization | 12+ |
| ElevenLabs | Scribe v2, Scribe v2 Realtime | High quality, automatic language detection | 29+ |
| Volcengine BigASR | Realtime, Flash, Async (ByteDance) | Mandarin and Chinese dialects | 10+ |

Local Providers (Offline)

| Provider | Model | Best For | Languages |
| --- | --- | --- | --- |
| NVIDIA Parakeet | Parakeet TDT 0.6B V3 | Fast local ASR on Apple Silicon | 25 |
| Qwen ASR | Qwen3-ASR 0.6B/1.7B | Multilingual local ASR, Chinese dialects | 30+ |
| Local Whisper | whisper-base / small / large-v3 | Privacy-first offline recognition | 57+ |

Provider Highlights

NVIDIA Parakeet TDT (Local)

The flagship local ASR model in EnConvo, powered by the FluidAudio framework.
  • 209.8x real-time speed on Apple Silicon — a 1-minute audio file transcribes in under 0.3 seconds
  • 2.1% Word Error Rate (WER) — approaching cloud provider accuracy
  • 25 European languages with automatic language detection
  • Automatic punctuation, capitalization, and word-level timestamps
  • Supports long audio up to 24 minutes (full attention) or 3 hours (local attention)
  • Released under the open CC-BY-4.0 license
Parakeet TDT is recommended for users who prioritize privacy and speed. It runs entirely on your Mac with no internet connection required.
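The two headline numbers above can be checked with a few lines of arithmetic. The sketch below is illustrative only: `transcription_seconds` and `word_error_rate` are hypothetical helper names, and the WER function uses the standard definition (word-level edit distance divided by the number of reference words), which may differ in detail from how the published 2.1% figure was measured.

```python
def transcription_seconds(audio_seconds: float, speed_factor: float) -> float:
    """Processing time for audio of a given length at N-times real-time speed."""
    return audio_seconds / speed_factor


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One minute of audio at 209.8x real time.
print(f"{transcription_seconds(60.0, 209.8):.3f} s")  # 0.286 s

# One wrong word out of four gives a WER of 25%.
print(word_error_rate("the quick brown fox", "the quick brown box"))  # 0.25
```

At a 2.1% WER, roughly 2 words in every 100 would need correction, under the same definition.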

Qwen ASR (Local)

Alibaba’s state-of-the-art open-source ASR model with exceptional multilingual support.
  • 30 languages and 22 Chinese dialects — ideal for Chinese language users
  • Available in 0.6B and 1.7B parameter sizes
  • 4-bit and 8-bit quantized variants for efficient local inference via MLX
  • Automatic language identification
  • Streaming and offline unified inference
| Model | Parameters | Quantization | Memory |
| --- | --- | --- | --- |
| Qwen3-ASR-0.6B-4bit | 0.6B | 4-bit | Minimal |
| Qwen3-ASR-0.6B-8bit | 0.6B | 8-bit | Low |
| Qwen3-ASR-1.7B-4bit | 1.7B | 4-bit | Moderate |
| Qwen3-ASR-1.7B-8bit | 1.7B | 8-bit | Moderate |
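To put rough numbers on the Memory column, the size of the quantized weights can be approximated as parameters × bits per weight ÷ 8. This is a back-of-the-envelope sketch, not a measurement: it ignores quantization metadata, activations, and runtime overhead, so actual memory use will be higher. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate size of the quantized weights alone, in GB (10^9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


variants = [
    ("Qwen3-ASR-0.6B-4bit", 0.6, 4),
    ("Qwen3-ASR-0.6B-8bit", 0.6, 8),
    ("Qwen3-ASR-1.7B-4bit", 1.7, 4),
    ("Qwen3-ASR-1.7B-8bit", 1.7, 8),
]
for name, params, bits in variants:
    print(f"{name}: ~{weight_memory_gb(params, bits):.2f} GB of weights")
```

Under this estimate the 4-bit variants need about half the weight memory of their 8-bit counterparts, which is the main reason to prefer them on memory-constrained Macs.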

Groq Whisper (Cloud)

Ultra-fast cloud inference powered by Groq’s LPU hardware.
  • Whisper Large V3 Turbo — fastest cloud Whisper variant
  • Whisper Large V3 — highest accuracy Whisper model
  • Custom prompts for domain-specific vocabulary (up to 224 tokens)
  • Language-specific hints for improved accuracy
  • Available free through Enconvo Cloud Plan

Google Gemini STT (Cloud)

Uses Gemini’s multimodal understanding for audio transcription.
  • Supports audio files up to 9.5 hours in duration
  • Gemini 3.1 Flash Lite — cost-efficient, fastest performance
  • Supports structured output and advanced text post-processing
  • Available through Enconvo Cloud Plan or your own Google API key

Soniox STT (Cloud)

Soniox provides multilingual speech recognition for both real-time dictation and file-based transcription.
  • Soniox STT Realtime v4 for live dictation through EnConvo’s macOS realtime client
  • Soniox STT Async v4 for audio/video file transcription after recording
  • Supports automatic language detection, language hints, punctuation, and speaker diarization
  • Uses your own Soniox API key through EnConvo Credentials
Use Soniox when you want a bring-your-own-key cloud STT provider with real-time dictation and broad multilingual coverage.

Volcengine BigASR (Cloud)

ByteDance’s Doubao speech model, optimized for Mandarin and Chinese dialects.
  • Realtime (volc.bigasr.sauc.duration) for live dictation via WebSocket
  • Flash (sync) for files ≤ 2 hours and ≤ 100 MB; falls back automatically to async for longer files
  • Hot words and language hints for biasing recognition
  • Works through your own Volcengine credentials or via Enconvo Cloud Plan
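The Flash limits above amount to a simple routing rule. The following is a minimal sketch of that rule, not EnConvo's implementation: the `AudioFile` type and `pick_bigasr_mode` function are hypothetical, 100 MB is assumed to mean 100 × 1024 × 1024 bytes, and files that exceed only the size limit are assumed to be routed to async as well.

```python
from dataclasses import dataclass

FLASH_MAX_SECONDS = 2 * 60 * 60        # 2 hours
FLASH_MAX_BYTES = 100 * 1024 * 1024    # 100 MB (assumed binary megabytes)


@dataclass
class AudioFile:
    duration_seconds: float
    size_bytes: int


def pick_bigasr_mode(audio: AudioFile) -> str:
    """Return "flash" when both limits are met, otherwise "async"."""
    within_duration = audio.duration_seconds <= FLASH_MAX_SECONDS
    within_size = audio.size_bytes <= FLASH_MAX_BYTES
    return "flash" if within_duration and within_size else "async"


print(pick_bigasr_mode(AudioFile(duration_seconds=1800, size_bytes=40_000_000)))   # flash
print(pick_bigasr_mode(AudioFile(duration_seconds=10_800, size_bytes=40_000_000)))  # async
```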

AssemblyAI (Cloud)

Industry-leading speaker diarization for multi-speaker recordings.
  • Universal-Streaming v3 for real-time dictation
  • Universal for async file transcription with diarization, word-level timestamps, and confidence scores
  • Word boost (custom vocabulary) and content moderation
  • Available via Enconvo Cloud Plan or your own AssemblyAI key

ElevenLabs Scribe (Cloud)

ElevenLabs’ STT, strong on accented and noisy speech.
  • Scribe v2 Realtime for live dictation
  • Scribe v2 for async file transcription with word-level timestamps
  • Automatic language detection across 29+ languages
  • Available via Enconvo Cloud Plan or your own ElevenLabs key

Configuration

  1. Open Dictation Settings: Navigate to the Dictation command settings in EnConvo. The speech recognition provider is configured as the Dictation Model Provider.
  2. Select a Provider: Choose your preferred provider from the dropdown. Cloud providers offer the highest accuracy, while local providers offer privacy and offline support.
  3. Set Up Credentials: For cloud providers, configure your API key through the Credential Provider. Enconvo Cloud Plan users get access to Groq Whisper, Microsoft Azure, Google Gemini, OpenAI, and Mistral without separate API keys.
  4. Choose a Language: Set your primary language for transcription. Most providers support Auto Detect, but specifying a language can improve accuracy and speed.
  5. Select a Model (if applicable): Some providers offer multiple models. For example, Local Whisper offers base (fast), small (balanced), and large-v3 (most accurate) variants.

Use Cases

Real-Time Dictation

The primary use of speech recognition in EnConvo is real-time dictation — speaking and having your words appear as text instantly.
  1. Press the dictation hotkey (default: Option + V) or hold Fn
  2. Speak naturally at a comfortable pace
  3. Press the hotkey again (or release Fn) to finish
  4. The transcribed text is inserted at your cursor position
For real-time dictation, streaming providers like Microsoft Azure, Soniox, AssemblyAI, ElevenLabs, and Volcengine provide the lowest latency. Batch providers like Groq Whisper and OpenAI process after you finish speaking.
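The split described above can be written down as a small lookup. This sketch only covers the providers named in the preceding paragraph; the `delivery_mode` function and the set names are hypothetical and are not part of any EnConvo API.

```python
# Providers grouped exactly as the paragraph above names them.
STREAMING_PROVIDERS = {"Microsoft Azure", "Soniox", "AssemblyAI", "ElevenLabs", "Volcengine"}
BATCH_PROVIDERS = {"Groq Whisper", "OpenAI"}


def delivery_mode(provider: str) -> str:
    """Return "streaming", "batch", or "unknown" for a provider name."""
    if provider in STREAMING_PROVIDERS:
        return "streaming"   # text arrives while you are still speaking
    if provider in BATCH_PROVIDERS:
        return "batch"       # audio is processed after you finish speaking
    return "unknown"


print(delivery_mode("Soniox"))        # streaming
print(delivery_mode("Groq Whisper"))  # batch
```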

Dictation in SmartBar

Use voice input directly within the SmartBar for AI interactions:
  1. Open the SmartBar
  2. Click the microphone icon or use the dictation shortcut
  3. Speak your query
  4. The transcribed text becomes your AI prompt

Audio/Video File Transcription

Transcribe pre-recorded audio and video files:
  1. Use the Transcribe Audio/Video files command
  2. Drag and drop audio or video files
  3. Choose your output format (plain text or .txt file)
  4. EnConvo processes the files with speaker diarization support
Supported formats include MP3, WAV, FLAC, M4A, MP4, MOV, and more.
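When preparing a batch of files, it can help to separate the formats named above from everything else before dragging them in. The sketch below assumes only the six extensions listed in this section; since the list ends with "and more", a file in the second group is not necessarily unsupported. `split_by_known_format` is a hypothetical helper.

```python
from pathlib import Path

# Only the formats named in this section; other formats may also be accepted.
KNOWN_EXTENSIONS = {".mp3", ".wav", ".flac", ".m4a", ".mp4", ".mov"}


def split_by_known_format(paths):
    """Return (known, other) lists of Path objects, matching extensions case-insensitively."""
    known, other = [], []
    for path in map(Path, paths):
        target = known if path.suffix.lower() in KNOWN_EXTENSIONS else other
        target.append(path)
    return known, other


known, other = split_by_known_format(["standup.MP3", "demo.mov", "notes.txt"])
print([p.name for p in known])  # ['standup.MP3', 'demo.mov']
print([p.name for p in other])  # ['notes.txt']
```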

Meeting Transcription

Combined with EnConvo’s Meeting Recording feature, speech recognition powers live meeting transcription. See the Meeting Recording documentation for details.

Voice Activity Detection (VAD)

EnConvo includes built-in Voice Activity Detection powered by Silero VAD via the FluidAudio framework:
  • Automatic start/stop: Detects when you begin and stop speaking
  • Background noise filtering: Ignores ambient sounds and only captures speech
  • Low latency: Sub-100ms detection for natural dictation flow
  • Runs locally: VAD processing happens entirely on your Mac
VAD works automatically with all speech recognition providers — you do not need to configure it separately.
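To illustrate what automatic start/stop detection involves, here is a deliberately simple energy-gate detector. It is not Silero VAD and not what EnConvo runs: Silero uses a neural model, while this sketch just thresholds per-frame RMS energy. The frame size, threshold, and hangover values are assumptions chosen for the example.

```python
import math


def detect_speech(samples, sample_rate=16000, frame_ms=30,
                  threshold=0.02, hangover_frames=5):
    """Return (start_s, end_s) speech segments based on frame RMS energy."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    segments = []
    start = None   # frame index where the current segment began
    quiet = 0      # consecutive quiet frames seen inside a segment
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        if rms >= threshold:
            if start is None:
                start = i
            quiet = 0
        elif start is not None:
            quiet += 1
            # Only close the segment after a run of quiet frames, so short
            # pauses between words do not split it.
            if quiet > hangover_frames:
                end = i - quiet + 1  # index of the first quiet frame
                segments.append((start * frame_ms / 1000, end * frame_ms / 1000))
                start, quiet = None, 0
    if start is not None:  # audio ended while still inside a segment
        end = n_frames - quiet
        segments.append((start * frame_ms / 1000, end * frame_ms / 1000))
    return segments


# Synthetic input: 0.3 s silence, 0.6 s of a 440 Hz tone, 0.6 s silence.
sr = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sr) for n in range(9600)]
signal = [0.0] * 4800 + tone + [0.0] * 9600
print(detect_speech(signal))  # [(0.3, 0.9)]
```

With 30 ms frames, an onset is flagged within one frame of the speech starting, which is the kind of budget a sub-100ms detector has to work within. An energy gate like this cannot tell speech from other loud sounds, which is why model-based detectors are used in practice.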

Language Support

Provider Language Coverage

Microsoft Azure supports 100+ languages and dialects, which puts it among the providers with the broadest coverage (Google Gemini also lists 100+). Because Azure STT is free on the Enconvo Cloud Plan, it is an excellent default option.

Setting the Language

Most providers support automatic language detection, but explicitly setting the language is recommended for:
  • Better accuracy: The model focuses on the correct language’s phonemes
  • Faster processing: Skips the language detection step
  • Mixed content: When you know the primary language in multilingual audio

Offline Speech Recognition

For users who need speech recognition without an internet connection:
| Solution | Model Size | Speed | Accuracy | Setup |
| --- | --- | --- | --- | --- |
| NVIDIA Parakeet | ~600 MB | Fastest | Excellent | Auto-download on first use |
| Qwen ASR (4-bit) | ~400 MB | Fast | Very Good | Auto-download on first use |
| Local Whisper (small) | ~216 MB | Good | Good | Auto-download on first use |
| Local Whisper (large-v3) | ~626 MB | Moderate | Excellent | Auto-download on first use |
For the best offline experience, we recommend NVIDIA Parakeet TDT 0.6B V3 — it offers the fastest speed with near-cloud accuracy. Models are automatically downloaded when you first select them.

Advanced Configuration

Custom Prompts (Groq Whisper / Local Whisper)

Provide context to improve transcription of domain-specific terms:
Medical terminology: cardiovascular, myocardial infarction, electrocardiogram
Prompts are limited to 224 tokens and help the model correctly spell specialized vocabulary.
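If you maintain a long vocabulary list, a prompt like the one above can be assembled programmatically while staying inside the limit. The sketch below is an approximation under stated assumptions: it estimates tokens as roughly four characters each, which is not the real Whisper tokenizer, so treat the result as a conservative guide and keep some margin. `rough_token_count` and `build_prompt` are hypothetical helpers.

```python
MAX_PROMPT_TOKENS = 224


def rough_token_count(text: str) -> int:
    """Crude estimate of about 4 characters per token (NOT the Whisper tokenizer)."""
    return (len(text) + 3) // 4


def build_prompt(label: str, terms: list[str], budget: int = MAX_PROMPT_TOKENS) -> str:
    """Build 'label: a, b, c', dropping trailing terms once the budget is exceeded."""
    kept: list[str] = []
    for term in terms:
        candidate = f"{label}: {', '.join(kept + [term])}"
        if rough_token_count(candidate) > budget:
            break
        kept.append(term)
    return f"{label}: {', '.join(kept)}"


terms = ["cardiovascular", "myocardial infarction", "electrocardiogram"]
print(build_prompt("Medical terminology", terms))
# Medical terminology: cardiovascular, myocardial infarction, electrocardiogram
```

Putting the most important terms first means they survive if the list has to be truncated.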

Audio Preprocessing

EnConvo automatically preprocesses audio for optimal recognition:
  • Converts to 16 kHz mono FLAC/WAV format
  • Splits large files into manageable chunks with overlap
  • Merges transcription results using sequence alignment
  • Cleans up temporary files after processing
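The chunk-with-overlap step can be sketched as a function that computes time spans. The chunk and overlap lengths EnConvo actually uses are not documented here, so the values in the example are placeholders, and `chunk_spans` is a hypothetical name.

```python
def chunk_spans(total_seconds: float, chunk_seconds: float, overlap_seconds: float):
    """Return (start, end) spans covering the audio, each overlapping the previous one."""
    if not 0 <= overlap_seconds < chunk_seconds:
        raise ValueError("overlap must be non-negative and shorter than a chunk")
    spans = []
    start = 0.0
    step = chunk_seconds - overlap_seconds  # how far each new chunk advances
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:
            break
        start += step
    return spans


# A 100-second file in 40-second chunks with 10 seconds of overlap.
print(chunk_spans(100, 40, 10))
```

In this example the spans start at 0, 30, and 60 seconds, so each chunk repeats the last 10 seconds of the one before it. A word cut in half at one boundary then appears whole in the neighbouring chunk.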

Large File Handling

For long audio files, EnConvo automatically:
  1. Splits audio into provider-appropriate chunks
  2. Adds overlap between chunks to avoid word loss at boundaries
  3. Processes chunks in parallel where possible
  4. Merges results using intelligent sequence alignment
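The merge step can be illustrated with Python's standard `difflib`. This is a simplified stand-in for whatever alignment EnConvo performs internally: it finds the longest run of words shared by the end of one chunk's transcript and the start of the next, then joins the two at that run. The function name and both default parameters are assumptions.

```python
from difflib import SequenceMatcher


def merge_transcripts(left: str, right: str,
                      max_overlap_words: int = 20, min_match_words: int = 2) -> str:
    """Join two chunk transcripts, removing words duplicated in the overlap."""
    a, b = left.split(), right.split()
    tail = a[-max_overlap_words:]   # end of the earlier chunk
    head = b[:max_overlap_words]    # start of the later chunk
    matcher = SequenceMatcher(None, tail, head, autojunk=False)
    match = matcher.find_longest_match(0, len(tail), 0, len(head))
    if match.size < min_match_words:
        return " ".join(a + b)  # no reliable overlap: plain concatenation
    # Keep the left side through the end of the shared run, then continue
    # on the right side immediately after it.
    keep_left = a[:len(a) - len(tail) + match.a + match.size]
    keep_right = b[match.b + match.size:]
    return " ".join(keep_left + keep_right)


print(merge_transcripts(
    "the patient shows signs of myocardial",
    "signs of myocardial infarction today",
))
# the patient shows signs of myocardial infarction today
```

Requiring at least two matching words reduces the chance that a common word such as "the" is mistaken for the overlap. Because the left side is cut at the end of the shared run, a word truncated at the chunk boundary is dropped in favour of the complete version from the next chunk.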

Enconvo Cloud Plan

The Enconvo Cloud Plan includes access to multiple STT providers:
| Provider | Model | Points per Use |
| --- | --- | --- |
| Groq Whisper | Whisper Large V3 Turbo | 100 |
| Groq Whisper | Whisper Large V3 | 200 |
| OpenAI | gpt-4o-mini-transcribe | 100 |
| OpenAI | gpt-4o-transcribe | 200 |
| Google Gemini | Gemini 3.1 Flash Lite | 100 |
| Mistral | Voxtral Mini | 100 |
| Microsoft Azure | Azure Realtime / Fast Transcription | Free |
| AssemblyAI | Universal, Universal-Streaming v3 | Via Cloud |
| Soniox | STT Realtime v4, STT Async v4 | Via Cloud |
| ElevenLabs | Scribe v2, Scribe v2 Realtime | Via Cloud |
| Volcengine | BigASR Realtime, Flash, Async | Via Cloud |
Microsoft Azure STT via the Enconvo Cloud Plan is free and does not consume points. It is the recommended default for most users.
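For budgeting, the point values in the table can be totalled for an expected usage mix. This sketch includes only the rows with a numeric or Free value; the "Via Cloud" rows list no point cost, so they are left out rather than guessed. What counts as one "use" is not defined in this section, and `total_points` is a hypothetical helper.

```python
# Points per use, copied from the table above. "Free" is recorded as 0.
POINTS_PER_USE = {
    ("Groq Whisper", "Whisper Large V3 Turbo"): 100,
    ("Groq Whisper", "Whisper Large V3"): 200,
    ("OpenAI", "gpt-4o-mini-transcribe"): 100,
    ("OpenAI", "gpt-4o-transcribe"): 200,
    ("Google Gemini", "Gemini 3.1 Flash Lite"): 100,
    ("Mistral", "Voxtral Mini"): 100,
    ("Microsoft Azure", "Azure Realtime / Fast Transcription"): 0,
}


def total_points(usage: dict) -> int:
    """Sum points for a {(provider, model): number_of_uses} mapping."""
    return sum(POINTS_PER_USE[key] * count for key, count in usage.items())


usage = {
    ("Groq Whisper", "Whisper Large V3 Turbo"): 10,
    ("OpenAI", "gpt-4o-transcribe"): 3,
    ("Microsoft Azure", "Azure Realtime / Fast Transcription"): 50,
}
print(total_points(usage))  # 1600
```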

Troubleshooting

Microphone not working

  1. Open System Settings > Privacy & Security > Microphone
  2. Ensure EnConvo has microphone access enabled
  3. Check that the correct microphone is selected in your system audio settings
  4. Try restarting EnConvo after granting permissions

Poor transcription accuracy

  1. Speak clearly at a moderate pace
  2. Reduce background noise or use a directional microphone
  3. Explicitly set the language instead of using Auto Detect
  4. Try a larger model (e.g., Whisper Large V3 instead of V3 Turbo)
  5. Use custom prompts for specialized vocabulary

Local model download fails

  1. Ensure you have a stable internet connection for the initial download
  2. Check available disk space (models range from 200 MB to 1 GB)
  3. Try a smaller model variant first (e.g., Qwen3-ASR-0.6B-4bit)
  4. Check the console logs for download errors

Slow transcription

  1. For real-time dictation, switch to a streaming provider (Azure, Soniox, AssemblyAI, ElevenLabs, Volcengine)
  2. For batch transcription, Groq Whisper offers the fastest cloud processing
  3. For local transcription, NVIDIA Parakeet is the fastest option
  4. Ensure your Mac is not thermally throttling (check Activity Monitor)
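The disk-space check from the model download steps above can be done from a script using Python's standard library. This is a sketch under assumptions: where EnConvo stores downloaded models is not documented here, so the example checks the volume holding the home directory, and the 1.5× headroom factor is an arbitrary safety margin. `has_room_for_model` is a hypothetical helper.

```python
import shutil
from pathlib import Path


def has_room_for_model(path: str, model_bytes: int, headroom: float = 1.5) -> bool:
    """True if the volume containing `path` has model_bytes * headroom free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= model_bytes * headroom


# Largest model size mentioned in the troubleshooting steps: about 1 GB.
one_gb = 1_000_000_000
print(has_room_for_model(str(Path.home()), one_gb))
```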

Dictation

Voice-to-text dictation interface

Text to Speech

Convert text back to speech

Soniox

Configure Soniox speech-to-text

Meeting Recording

Record and transcribe meetings

SmartBar

Use voice input in SmartBar