Overview
The Transcribe Audio/Video files command turns recorded audio and video files into text. Drop in a meeting recording, lecture, podcast, or voice memo and EnConvo runs it through your chosen speech-to-text provider — cloud or fully local — returning plain text or a .txt file with optional speaker labels and timestamps.
Where Dictation is for live voice input, Transcription is for files you already have on disk.
Supported Formats
EnConvo accepts most common audio and video containers:
- Audio: .mp3, .wav, .flac, .aac, .ogg, .m4a, .aiff, .amr, .webm, .opus
- Video: .mp4, .mov, .mkv, .avi (audio track is extracted automatically)
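For workflow authors, a small guard can filter inputs down to the formats above before invoking the command. This is a sketch, not EnConvo's actual validation; the extension sets are copied from this page and `supported_files` is a hypothetical helper name:

```python
from pathlib import Path

# Extensions listed under "Supported Formats" above.
AUDIO_EXTS = {".mp3", ".wav", ".flac", ".aac", ".ogg",
              ".m4a", ".aiff", ".amr", ".webm", ".opus"}
VIDEO_EXTS = {".mp4", ".mov", ".mkv", ".avi"}

def supported_files(paths):
    """Keep only paths whose extension EnConvo can transcribe."""
    return [p for p in paths if Path(p).suffix.lower() in AUDIO_EXTS | VIDEO_EXTS]
```

Note the case-insensitive check: `memo.M4A` passes, while unrelated files like PDFs are dropped before they reach the provider.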
Triggering Transcription
Open the Transcribe command
Search for Transcribe Audio/Video files in SmartBar, or invoke it via a workflow.
Provide the files
Drag and drop one or more audio/video files, or pass them as filePaths parameters from a workflow or shortcut.
Pick output format
Choose Plain Text to receive the transcript inline, or TXT File to save it next to the source media.
Provider Selection
The Transcription command works with every speech-to-text provider EnConvo supports. The right choice depends on file length, language, accuracy needs, and whether the audio can leave your machine. Recommended defaults:
| Use case | Recommended provider |
|---|---|
| Free, accurate, broad language coverage | Microsoft Azure (Cloud Plan, free tier) |
| Long meetings with multiple speakers | AssemblyAI Universal |
| Long audio, multimodal post-processing | Google Gemini |
| Privacy / offline | NVIDIA Parakeet or Local Whisper |
| Chinese & dialects | Qwen3 ASR (local) or Volcengine BigASR (cloud) |
| Fast turnaround on short clips | Groq Whisper Large V3 Turbo |
Speaker Diarization
When transcribing meetings or interviews, enable Speaker Diarization in the command preferences. EnConvo will:
- Pass the diarization flag to providers that support it natively (AssemblyAI, Soniox, Volcengine, Azure, ElevenLabs)
- Group transcript segments by speaker
- Label each line as Speaker 0:, Speaker 1:, etc.
Diarization quality depends on the provider. AssemblyAI and Soniox produce the cleanest speaker turns; cloud Whisper-style models do not separate speakers and will return a single track.
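The grouping and labeling steps above can be sketched as follows. The segment shape (`speaker` index plus `text`) is an assumption for illustration; real provider payloads differ:

```python
def format_diarized(segments):
    """Merge consecutive segments from the same speaker into turns,
    each labeled `Speaker N:` as in EnConvo's diarized output."""
    lines = []
    for seg in segments:
        label = f"Speaker {seg['speaker']}:"
        if lines and lines[-1].startswith(label):
            # Same speaker as the previous segment: extend the turn.
            lines[-1] += " " + seg["text"]
        else:
            lines.append(f"{label} {seg['text']}")
    return "\n".join(lines)
```

For example, two consecutive segments from speaker 0 followed by one from speaker 1 collapse into two lines: `Speaker 0: …` and `Speaker 1: …`.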
Output
Plain Text
The transcript is returned inline as a string — useful for chaining into AI summarization, translation, or knowledge base ingestion in a workflow.
TXT File
A .txt file is written next to the source media (or to output_dir if specified). When diarization is enabled, the file is laid out with one speaker turn per paragraph.
Segments & Words (raw)
Providers that return per-segment / per-word timestamps include them on the SpeechToTextResult.segments and .words fields. Workflows can read these to build subtitles, jump-to-time UIs, or speaker-aware summaries.
Hot Words & Domain Vocabulary
Most providers accept a list of hot words or context terms to bias recognition toward names, jargon, and product terms. Set them in the provider’s settings:
- Soniox → Context Terms (one per line)
- Volcengine → Hot Words
- AssemblyAI → Word Boost
- Groq / OpenAI / Local Whisper → Prompt (free-form, up to 224 tokens)
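For the Whisper-style providers, a workflow step might assemble the free-form prompt from a vocabulary list while staying under the 224-token budget. This is a sketch: it approximates tokens by whitespace-separated words, whereas real tokenizers count differently, and `build_whisper_prompt` is a hypothetical helper:

```python
def build_whisper_prompt(terms, max_tokens=224):
    """Join domain terms into a free-form bias prompt, trimming to a
    rough token budget (whitespace words here; real tokenizers differ)."""
    words = []
    for term in terms:
        term_words = term.split()
        if len(words) + len(term_words) > max_tokens:
            break  # adding this term would exceed the budget
        words.extend(term_words)
    return " ".join(words)
```

Terms that would push the prompt past the budget are dropped whole, so put the most important vocabulary first.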
Working With Long Files
EnConvo handles large media without manual splitting:
- Audio is decoded to a 16 kHz mono WAV
- If the file exceeds the provider’s per-call limit, it is split into chunks with several seconds of overlap
- Chunks are transcribed (in parallel where the provider permits)
- Sequence alignment merges chunk transcripts so words at boundaries are not duplicated or dropped
- Groq / OpenAI / ElevenLabs: 25 MB per request — chunked automatically
- Volcengine Flash: ≤ 2 hours and ≤ 100 MB per call; longer files fall back to the async submit/query pipeline
- Google Gemini: up to 9.5 hours per file with no chunking required
- AssemblyAI / Soniox / Azure Fast: server-side chunking, no client-side split needed
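The split step above comes down to simple window arithmetic. The chunk and overlap durations below are illustrative, not EnConvo's internal values:

```python
def chunk_spans(total_s, chunk_s=600.0, overlap_s=5.0):
    """Yield (start, end) second offsets covering the file, with each
    chunk overlapping the previous one so boundary words survive the
    merge step (sequence alignment then deduplicates the overlap)."""
    spans, start = [], 0.0
    while True:
        end = min(start + chunk_s, total_s)
        spans.append((start, end))
        if end >= total_s:
            return spans
        start = end - overlap_s  # back up so adjacent chunks share audio
```

A 1000-second file with 600-second chunks and 5 seconds of overlap yields spans (0, 600) and (595, 1000); the shared 5 seconds is what the alignment step uses to stitch transcripts without duplicating or dropping boundary words.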
Troubleshooting
Transcription returns empty text
- Confirm the file actually contains speech (silent or music-only files return empty)
- Try a different provider — Whisper-based models can drop very short clips
- For Volcengine and Soniox, set a language_hints value matching the spoken language
Transcription is too slow
- For cloud: prefer Groq Whisper Large V3 Turbo or Azure Fast Transcription for the highest throughput
- For local: use NVIDIA Parakeet on Apple Silicon — ~210x realtime
- Local Whisper large-v3 is the slowest local option; switch to small or Parakeet if speed matters
Speaker labels look wrong
- Diarization assumes distinct speakers — overlapping speech produces noisy labels
- Try AssemblyAI Universal, which has the most robust diarization
- Provide a hint of the expected speaker count where the provider exposes one (e.g. maxSpeakers)
Domain terms are misspelled
- Add the terms to the provider’s hot-words / context-terms / prompt field
- Set the language explicitly instead of using auto-detect
- Try a larger model — whisper-large-v3 over turbo, AssemblyAI Universal over base
Video file is rejected
EnConvo extracts the audio track automatically using ffmpeg. If extraction fails, re-export the video to MP4 with a standard AAC track or convert to WAV first.
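When converting manually, the equivalent ffmpeg invocation can be assembled like this. The sketch only builds the argument list (run it with `subprocess.run` if ffmpeg is on your PATH); `aac_remux_cmd` and the output filename are illustrative:

```python
def aac_remux_cmd(src, dst="fixed.mp4"):
    """ffmpeg arguments to re-export a video with a standard AAC audio track."""
    return ["ffmpeg", "-i", src,
            "-c:v", "copy",   # keep the video stream untouched
            "-c:a", "aac",    # re-encode the audio track to AAC
            dst]
```

Copying the video stream keeps the conversion fast; only the problematic audio track is re-encoded.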
Related Features
Dictation
Live voice-to-text, push-to-talk and SmartBar dictation
Speech Recognition
Provider deep-dive: models, languages, Cloud Plan pricing
Meeting Recording
Capture meetings end-to-end with live transcription
Soniox
Configure Soniox real-time and async transcription