Speech Recognition - EnConvo Documentation

Overview

EnConvo supports more than a dozen speech recognition providers, ranging from high-accuracy cloud services to fully local models that keep your audio data private. Whether you need real-time dictation, long-form file transcription, or multilingual support, there is a provider for the job. For the user-facing features built on top of these providers, see Dictation and Transcription.

Supported Providers

Cloud Providers

Provider	Model	Best For	Languages
Groq Whisper	Whisper Large V3 Turbo, Whisper Large V3	Fast cloud transcription	57+
OpenAI	gpt-4o-transcribe, gpt-4o-mini-transcribe	High accuracy with AI post-processing	57+
Microsoft Azure	Azure Realtime, Azure Fast Transcription	Enterprise-grade streaming and batch	100+
Google Gemini	Gemini 3.1 Flash Lite	Multimodal audio understanding, up to 9.5 hours	100+
Mistral	Voxtral Mini	Lightweight, fast transcription	20+
Soniox	STT Realtime v5, STT Async v5	Multilingual real-time dictation and file transcription	60+
OpenAI OAuth	Transcribe	Speech recognition through a connected OpenAI subscription account	Provider-dependent
xAI	Speech-to-text	Speech recognition through xAI provider support	Provider-dependent
AssemblyAI	Universal, Universal-Streaming v3	Best-in-class speaker diarization	12+
ElevenLabs	Scribe v2, Scribe v2 Realtime	High quality, automatic language detection	29+
Volcengine BigASR	Realtime, Flash, Async (ByteDance)	Mandarin and Chinese dialects	10+

Local Providers (Offline)

Provider	Model	Best For	Languages
NVIDIA Parakeet	Parakeet TDT 0.6B V3	Fast local ASR on Apple Silicon	25
Qwen ASR	Qwen3-ASR 0.6B/1.7B	Multilingual local ASR, Chinese dialects	30+
Local Whisper	whisper-base / small / large-v3	Privacy-first offline recognition	57+

Provider Highlights

NVIDIA Parakeet TDT (Local)

The flagship local ASR model in EnConvo, powered by the FluidAudio framework.

209.8x real-time speed on Apple Silicon — a 1-minute audio file transcribes in under 0.3 seconds
2.1% Word Error Rate (WER) — approaching cloud provider accuracy
25 European languages with automatic language detection
Automatic punctuation, capitalization, and word-level timestamps
Supports long audio up to 24 minutes (full attention) or 3 hours (local attention)
Released under the CC-BY-4.0 license

Parakeet TDT is recommended for users who prioritize privacy and speed. It runs entirely on your Mac with no internet connection required.
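To make the two headline numbers concrete, here is a small sketch of the arithmetic behind them. The function names are illustrative and not part of EnConvo. It assumes "209.8x real-time" means audio duration divided by processing time, and uses the standard WER definition: word-level edit distance divided by the number of reference words.

```python
def transcription_seconds(audio_seconds: float, speed_factor: float) -> float:
    """Processing time for audio at a given multiple of real-time speed."""
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return audio_seconds / speed_factor


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Dynamic-programming edit distance over words, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One minute of audio at 209.8x real time.
    print(round(transcription_seconds(60, 209.8), 3))  # 0.286
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

At a 2.1% WER, roughly 2 words in every 100 reference words would be inserted, deleted, or substituted.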

Qwen ASR (Local)

Alibaba’s state-of-the-art open-source ASR model with exceptional multilingual support.

30 languages and 22 Chinese dialects — ideal for Chinese language users
Available in 0.6B and 1.7B parameter sizes
4-bit and 8-bit quantized variants for efficient local inference via MLX
Automatic language identification
Streaming and offline unified inference

Model	Parameters	Quantization	Memory
Qwen3-ASR-0.6B-4bit	0.6B	4-bit	Minimal
Qwen3-ASR-0.6B-8bit	0.6B	8-bit	Low
Qwen3-ASR-1.7B-4bit	1.7B	4-bit	Moderate
Qwen3-ASR-1.7B-8bit	1.7B	8-bit	Moderate
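As a rough guide to what the Memory column means in practice, the sketch below estimates the size of the quantized weights alone from parameter count and bit width. This is a back-of-the-envelope assumption, not a figure from EnConvo: real memory use is higher because of quantization scales, any layers kept at higher precision, and runtime buffers.

```python
def approx_weight_megabytes(params_billions: float, bits_per_weight: int) -> float:
    """Lower-bound estimate: parameters * bits / 8, expressed in MB (10^6 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e6


if __name__ == "__main__":
    for name, params, bits in [
        ("Qwen3-ASR-0.6B-4bit", 0.6, 4),
        ("Qwen3-ASR-0.6B-8bit", 0.6, 8),
        ("Qwen3-ASR-1.7B-4bit", 1.7, 4),
        ("Qwen3-ASR-1.7B-8bit", 1.7, 8),
    ]:
        print(f"{name}: at least ~{approx_weight_megabytes(params, bits):.0f} MB")
```

By this estimate the 8-bit variant of a given model needs about twice the weight memory of its 4-bit counterpart.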

Groq Whisper (Cloud)

Ultra-fast cloud inference powered by Groq’s LPU hardware.

Whisper Large V3 Turbo — fastest cloud Whisper variant
Whisper Large V3 — highest accuracy Whisper model
Custom prompts for domain-specific vocabulary (up to 224 tokens)
Language-specific hints for improved accuracy
Available through the Enconvo Cloud Plan without a separate API key

Google Gemini STT (Cloud)

Uses Gemini’s multimodal understanding for audio transcription.

Supports audio files up to 9.5 hours in duration
Gemini 3.1 Flash Lite — cost-efficient, fastest performance
Supports structured output and advanced text post-processing
Available through Enconvo Cloud Plan or your own Google API key

Soniox STT (Cloud)

Soniox provides multilingual speech recognition for both real-time dictation and file-based transcription.

Soniox STT Realtime v5 for live dictation through EnConvo’s macOS realtime client
Soniox STT Async v5 for audio/video file transcription after recording
Supports automatic language detection, language hints, punctuation, and speaker diarization
Uses your own Soniox API key through EnConvo Credentials

Use Soniox when you want a bring-your-own-key cloud STT provider with real-time dictation and broad multilingual coverage.

Volcengine BigASR (Cloud)

ByteDance’s Doubao speech model, optimized for Mandarin and Chinese dialects.

Realtime (volc.bigasr.sauc.duration) for live dictation via WebSocket
Flash (sync) for files ≤ 2 hours and ≤ 100 MB; auto-falls-back to async for longer files
Hot words and language hints for biasing recognition
Works through your own Volcengine credentials or via Enconvo Cloud Plan
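The Flash-to-async fallback described above amounts to a simple check on duration and file size. The sketch below is illustrative only: the function and constant names are hypothetical, and it assumes "100 MB" means 100 × 10^6 bytes, which the documentation does not specify.

```python
FLASH_MAX_SECONDS = 2 * 60 * 60       # 2 hours
FLASH_MAX_BYTES = 100 * 1000 * 1000   # assumption: decimal megabytes


def pick_file_mode(duration_seconds: float, size_bytes: int) -> str:
    """Return "flash" when both limits are met, otherwise "async"."""
    if duration_seconds <= FLASH_MAX_SECONDS and size_bytes <= FLASH_MAX_BYTES:
        return "flash"
    return "async"


if __name__ == "__main__":
    print(pick_file_mode(45 * 60, 40_000_000))    # short, small file
    print(pick_file_mode(3 * 60 * 60, 40_000_000))  # too long for Flash
```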

AssemblyAI (Cloud)

Industry-leading speaker diarization for multi-speaker recordings.

Universal-Streaming v3 for real-time dictation
Universal for async file transcription with diarization, word-level timestamps, and confidence scores
Word boost (custom vocabulary) and content moderation
Available via Enconvo Cloud Plan or your own AssemblyAI key

ElevenLabs Scribe (Cloud)

ElevenLabs’ STT, strong on accented and noisy speech.

Scribe v2 Realtime for live dictation
Scribe v2 for async file transcription with word-level timestamps
Automatic language detection across 29+ languages
Available via Enconvo Cloud Plan or your own ElevenLabs key

Configuration

Open Dictation Settings

Navigate to the Dictation command settings in EnConvo. The speech recognition provider is configured as the Dictation Model Provider.

Select a Provider

Choose your preferred provider from the dropdown. Cloud providers offer the highest accuracy, while local providers offer privacy and offline support.

Set Up Credentials

For cloud providers, configure your API key through the Credential Provider. Enconvo Cloud Plan users get access to Groq Whisper, Microsoft Azure, Google Gemini, OpenAI, and Mistral without separate API keys.

Choose a Language

Set your primary language for transcription. Most providers support Auto Detect, but specifying a language can improve accuracy and speed.

Select a Model (if applicable)

Some providers offer multiple models. For example, Local Whisper offers base (fast), small (balanced), and large-v3 (most accurate) variants.

Use Cases

Real-Time Dictation

The primary use of speech recognition in EnConvo is real-time dictation — speaking and having your words appear as text instantly.

Press the dictation hotkey (default: Option + V) or hold Fn
Speak naturally at a comfortable pace
Press the hotkey again (or release Fn) to finish
The transcribed text is inserted at your cursor position

For real-time dictation, streaming providers like Microsoft Azure, Soniox, AssemblyAI, ElevenLabs, and Volcengine provide the lowest latency. Batch providers like Groq Whisper and OpenAI process after you finish speaking.
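The streaming versus batch distinction can be captured as a small lookup. This sketch only encodes the providers named in the paragraph above; the function name and the "unknown" fallback are illustrative, and providers not listed there are deliberately left unclassified.

```python
# Providers named in this section as streaming (text appears while you speak).
STREAMING_PROVIDERS = {
    "Microsoft Azure", "Soniox", "AssemblyAI", "ElevenLabs", "Volcengine",
}
# Providers named in this section as batch (text appears after you finish).
BATCH_PROVIDERS = {"Groq Whisper", "OpenAI"}


def dictation_mode(provider: str) -> str:
    """Classify a provider as "streaming", "batch", or "unknown"."""
    if provider in STREAMING_PROVIDERS:
        return "streaming"
    if provider in BATCH_PROVIDERS:
        return "batch"
    return "unknown"


if __name__ == "__main__":
    for name in ("Soniox", "Groq Whisper", "Mistral"):
        print(name, "->", dictation_mode(name))
```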

Dictation in SmartBar

Use voice input directly within the SmartBar for AI interactions:

Open the SmartBar
Click the microphone icon or use the dictation shortcut
Speak your query
The transcribed text becomes your AI prompt

Voice Input in Chat Window

Use the microphone button in the Chat Window to dictate a normal chat message. This uses the same speech recognition provider as Dictation.

Audio/Video File Transcription

Transcribe pre-recorded audio and video files:

Use the Transcribe Audio/Video files command
Drag and drop audio or video files
Choose your output format (plain text or .txt file)
EnConvo processes the files with speaker diarization support

Supported formats include MP3, WAV, FLAC, M4A, MP4, MOV, and more.
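If you are preparing a folder of recordings, a quick pre-check by file extension can help. The sketch below is a hypothetical helper, not an EnConvo feature. It knows only the six formats named above, so a file it reports as unrecognized may still be accepted, since the list ends with "and more".

```python
from pathlib import Path

# Only the formats named in this section; the real list is longer.
KNOWN_EXTENSIONS = {".mp3", ".wav", ".flac", ".m4a", ".mp4", ".mov"}


def split_by_known_format(paths):
    """Return (known, unrecognized) lists based on the file extension."""
    known, unrecognized = [], []
    for path in paths:
        if Path(path).suffix.lower() in KNOWN_EXTENSIONS:
            known.append(path)
        else:
            unrecognized.append(path)
    return known, unrecognized


if __name__ == "__main__":
    print(split_by_known_format(["talk.MP3", "clip.mov", "notes.txt"]))
```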

Meeting Transcription

Combined with EnConvo’s Meeting Recording feature, speech recognition powers live meeting transcription. See the Meeting Recording documentation for details.

Voice Activity Detection (VAD)

EnConvo includes built-in Voice Activity Detection powered by Silero VAD via the FluidAudio framework:

Automatic start/stop: Detects when you begin and stop speaking
Background noise filtering: Ignores ambient sounds and only captures speech
Low latency: Sub-100ms detection for natural dictation flow
Runs locally: VAD processing happens entirely on your Mac

VAD works automatically with all speech recognition providers — you do not need to configure it separately.
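For intuition about what voice activity detection does, here is a deliberately simple energy-threshold detector. It is not Silero VAD (a neural model) and not EnConvo's implementation. It only illustrates the idea of marking frames as speech or silence and bridging short pauses. All names and parameters are illustrative.

```python
import math


def frame_rms(frame):
    """Root-mean-square level of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def speech_segments(samples, frame_len, threshold, hangover_frames=2):
    """Return (start, end) sample indices where frame RMS >= threshold.

    Up to `hangover_frames` quiet frames are tolerated inside a segment,
    so brief pauses do not split one utterance in two.
    """
    segments = []
    start = None   # sample index where the current segment began
    quiet = 0      # consecutive quiet frames seen inside the segment
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        if frame_rms(frame) >= threshold:
            if start is None:
                start = i * frame_len
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet > hangover_frames:
                # Close the segment at the end of the last loud frame.
                segments.append((start, (i - quiet + 1) * frame_len))
                start, quiet = None, 0
    if start is not None:
        segments.append((start, (n_frames - quiet) * frame_len))
    return segments


if __name__ == "__main__":
    audio = [0.0] * 40 + [0.5] * 40 + [0.0] * 40
    print(speech_segments(audio, frame_len=10, threshold=0.1, hangover_frames=1))
```

A neural VAD replaces the RMS test with a learned speech probability per frame, which is what makes it robust to background noise that a plain energy threshold would mistake for speech.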

Language Support

Provider Language Coverage

Most Languages

Microsoft Azure supports 100+ languages and dialects, making it the broadest choice. Combined with the free Enconvo Cloud Plan tier, it is an excellent default option.

Setting the Language

Most providers support automatic language detection, but explicitly setting the language is recommended for:

Better accuracy: The model focuses on the correct language’s phonemes
Faster processing: Skips the language detection step
Mixed content: When you know the primary language in multilingual audio

Offline Speech Recognition

For users who need speech recognition without an internet connection:

Solution	Model Size	Speed	Accuracy	Setup
NVIDIA Parakeet	~600 MB	Fastest	Excellent	Auto-download on first use
Qwen ASR (4-bit)	~400 MB	Fast	Very Good	Auto-download on first use
Local Whisper (small)	~216 MB	Good	Good	Auto-download on first use
Local Whisper (large-v3)	~626 MB	Moderate	Excellent	Auto-download on first use

For the best offline experience, we recommend NVIDIA Parakeet TDT 0.6B V3 — it offers the fastest speed with near-cloud accuracy. Models are automatically downloaded when you first select them.

Advanced Configuration

Custom Prompts (Groq Whisper / Local Whisper)

Provide context to improve transcription of domain-specific terms:

Medical terminology: cardiovascular, myocardial infarction, electrocardiogram

Prompts are limited to 224 tokens and help the model correctly spell specialized vocabulary.
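When you have a long glossary, you need to decide which terms fit in the prompt. The sketch below builds a prompt in the same "Label: term, term" shape as the example above and stops adding terms once a token budget is reached. The token counter is a crude assumption (about one token per three characters). Whisper's real tokenizer will give different counts, so pass your own counter if you need accuracy. All names are illustrative.

```python
PROMPT_TOKEN_LIMIT = 224  # limit stated in this section


def rough_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer: ceil(len(text) / 3). An assumption."""
    return -(-len(text) // 3)


def build_vocabulary_prompt(label, terms, limit=PROMPT_TOKEN_LIMIT,
                            count_tokens=rough_token_count):
    """Join terms into "label: a, b, c", keeping only as many as fit."""
    kept = []
    for term in terms:
        candidate = f"{label}: " + ", ".join(kept + [term])
        if count_tokens(candidate) > limit:
            break  # adding this term would exceed the budget
        kept.append(term)
    return f"{label}: " + ", ".join(kept) if kept else ""


if __name__ == "__main__":
    print(build_vocabulary_prompt(
        "Medical terminology",
        ["cardiovascular", "myocardial infarction", "electrocardiogram"],
    ))
```

Putting the most important terms first means they survive when the list has to be cut short.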

Audio Preprocessing

EnConvo automatically preprocesses audio for optimal recognition:

Converts to 16 kHz mono FLAC/WAV format
Splits large files into manageable chunks with overlap
Merges transcription results using sequence alignment
Cleans up temporary files after processing
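EnConvo performs this conversion for you. If you want to reproduce the same 16 kHz mono format outside the app, for example to inspect what a provider receives, one common way is ffmpeg. The sketch below only builds the command. It assumes ffmpeg is installed, and it is not a description of how EnConvo does the conversion internally. The function name is illustrative.

```python
import subprocess


def ffmpeg_16k_mono_command(src: str, dst: str) -> list:
    """Build an ffmpeg command that writes 16 kHz mono audio to dst.

    The output container is chosen by dst's extension (.flac or .wav).
    """
    return [
        "ffmpeg",
        "-y",            # overwrite dst if it exists
        "-i", src,       # input audio or video file
        "-vn",           # drop any video stream
        "-ac", "1",      # one audio channel (mono)
        "-ar", "16000",  # 16 kHz sample rate
        dst,
    ]


if __name__ == "__main__":
    cmd = ffmpeg_16k_mono_command("meeting.mp4", "meeting.flac")
    print(" ".join(cmd))
    # To actually run it (requires ffmpeg on PATH):
    # subprocess.run(cmd, check=True)
```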

Large File Handling

For long audio files, EnConvo automatically:

Splits audio into provider-appropriate chunks
Adds overlap between chunks to avoid word loss at boundaries
Processes chunks in parallel where possible
Merges results using intelligent sequence alignment
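The chunk-and-merge steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not EnConvo's algorithm: chunk length and overlap are free parameters, and the merge uses Python's difflib to find the longest run of words shared by the end of one transcript and the start of the next. The function names and the `min_match` guard are hypothetical.

```python
from difflib import SequenceMatcher


def chunk_spans(total_seconds, chunk_seconds, overlap_seconds):
    """Return (start, end) times covering the audio with overlapping chunks."""
    if not 0 <= overlap_seconds < chunk_seconds:
        raise ValueError("overlap must be >= 0 and shorter than the chunk")
    spans, start = [], 0.0
    step = chunk_seconds - overlap_seconds
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:
            break
        start += step
    return spans


def merge_transcripts(left, right, max_overlap_words=30, min_match=2):
    """Join two chunk transcripts, removing the words they share at the seam."""
    a, b = left.split(), right.split()
    tail, head = a[-max_overlap_words:], b[:max_overlap_words]
    matcher = SequenceMatcher(None, tail, head, autojunk=False)
    m = matcher.find_longest_match(0, len(tail), 0, len(head))
    if m.size < min_match:
        # No convincing overlap found: fall back to plain concatenation.
        return " ".join(a + b)
    cut_a = len(a) - len(tail) + m.a + m.size  # keep left through the match
    cut_b = m.b + m.size                       # resume right after the match
    # Words after the match in `left` and before it in `right` are dropped,
    # since they are the fragments most likely clipped at the chunk boundary.
    return " ".join(a[:cut_a] + b[cut_b:])


if __name__ == "__main__":
    print(chunk_spans(100, 40, 10))
    print(merge_transcripts("the quick brown fox jumps over",
                            "fox jumps over the lazy dog"))
```

Because each chunk is independent, the spans returned by `chunk_spans` can be transcribed in parallel and merged pairwise in order afterwards.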

Enconvo Cloud Plan

The Enconvo Cloud Plan includes access to multiple STT providers:

Provider	Model	Points per Use
Groq Whisper	Whisper Large V3 Turbo	100
Groq Whisper	Whisper Large V3	200
OpenAI	gpt-4o-mini-transcribe	100
OpenAI	gpt-4o-transcribe	200
Google Gemini	Gemini 3.1 Flash Lite	100
Mistral	Voxtral Mini	100
Microsoft Azure	Azure Realtime / Fast Transcription	Free
AssemblyAI	Universal, Universal-Streaming v3	Via Cloud
Soniox	STT Realtime v5, STT Async v5	Via Cloud
ElevenLabs	Scribe v2, Scribe v2 Realtime	Via Cloud
Volcengine	BigASR Realtime, Flash, Async	Via Cloud

Microsoft Azure STT via the Enconvo Cloud Plan is free and does not consume points. It is the recommended default for most users.
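To estimate point usage from the table above, you can sum the per-use costs. The sketch below copies only the rows that have a numeric cost (treating "Free" as 0). Rows marked "Via Cloud" have no published number here, so they are omitted and raise a KeyError. The names are illustrative.

```python
# (provider, model) -> points per use, copied from the table in this section.
POINTS_PER_USE = {
    ("Groq Whisper", "Whisper Large V3 Turbo"): 100,
    ("Groq Whisper", "Whisper Large V3"): 200,
    ("OpenAI", "gpt-4o-mini-transcribe"): 100,
    ("OpenAI", "gpt-4o-transcribe"): 200,
    ("Google Gemini", "Gemini 3.1 Flash Lite"): 100,
    ("Mistral", "Voxtral Mini"): 100,
    ("Microsoft Azure", "Azure Realtime / Fast Transcription"): 0,  # Free
}


def total_points(uses):
    """Sum points for a sequence of (provider, model) uses."""
    return sum(POINTS_PER_USE[use] for use in uses)


if __name__ == "__main__":
    session = [
        ("Groq Whisper", "Whisper Large V3"),
        ("OpenAI", "gpt-4o-mini-transcribe"),
        ("Microsoft Azure", "Azure Realtime / Fast Transcription"),
    ]
    print(total_points(session))  # 300
```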

Troubleshooting

Microphone not detected

Open System Settings > Privacy & Security > Microphone
Ensure EnConvo has microphone access enabled
Check that the correct microphone is selected in your system audio settings
Try restarting EnConvo after granting permissions

Poor transcription accuracy

Speak clearly at a moderate pace
Reduce background noise or use a directional microphone
Explicitly set the language instead of using Auto Detect
Try a larger model (e.g., Whisper Large V3 instead of V3 Turbo)
Use custom prompts for specialized vocabulary

Local model download fails

Ensure you have a stable internet connection for the initial download
Check available disk space — models range from 200 MB to 1 GB
Try a smaller model variant first (e.g., Qwen3-ASR-0.6B-4bit)
Check the console logs for download errors
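The disk space check can be scripted with Python's standard library. This is a generic sketch, not an EnConvo tool: the path to check and the 20% headroom factor are assumptions, and the function name is illustrative.

```python
import shutil


def has_room_for_model(path: str, model_bytes: int, headroom: float = 1.2) -> bool:
    """True if the volume containing `path` has model_bytes * headroom free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= model_bytes * headroom


if __name__ == "__main__":
    one_gb = 1_000_000_000  # upper end of the model size range given above
    print(has_room_for_model(".", one_gb))
```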

Transcription is slow

For real-time dictation, switch to a streaming provider (Azure, Soniox, AssemblyAI, ElevenLabs, Volcengine)
For batch transcription, Groq Whisper offers the fastest cloud processing
For local transcription, NVIDIA Parakeet is the fastest option
Ensure your Mac is not thermally throttling (check Activity Monitor)

Related Features

Dictation: Voice-to-text dictation interface
Text to Speech: Convert text back to speech
Soniox: Configure Soniox speech-to-text
Meeting Recording: Record and transcribe meetings
SmartBar: Use voice input in SmartBar
