
Overview

The Transcribe Audio/Video files command turns recorded audio and video files into text. Drop in a meeting recording, lecture, podcast, or voice memo and EnConvo runs it through your chosen speech-to-text provider — cloud or fully local — returning plain text or a .txt file with optional speaker labels and timestamps. Where Dictation is for live voice input, Transcription is for files you already have on disk.

Supported Formats

EnConvo accepts most common audio and video containers:
  • Audio: .mp3, .wav, .flac, .aac, .ogg, .m4a, .aiff, .amr, .webm, .opus
  • Video: .mp4, .mov, .mkv, .avi (audio track is extracted automatically)
Files are normalized to 16 kHz mono before transcription. Large files are chunked with overlap, transcribed in parallel where the provider allows, and stitched back together using sequence alignment so words at chunk boundaries are not lost.
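
Under the hood, this normalization is a plain ffmpeg resample. A minimal sketch of the equivalent preprocessing step, assuming ffmpeg is available on PATH (the normalizeTo16kMono helper is illustrative, not EnConvo's internal API):

```typescript
import { spawn } from "node:child_process";

// Decode any supported container to the 16 kHz mono WAV that
// transcription providers expect. -ar sets the sample rate,
// -ac the channel count; -vn drops any video stream.
function normalizeTo16kMono(input: string, output: string): Promise<void> {
  return new Promise((resolve, reject) => {
    const ff = spawn("ffmpeg", [
      "-y",           // overwrite the output file if it exists
      "-i", input,    // source audio or video file
      "-vn",          // ignore the video track
      "-ac", "1",     // downmix to mono
      "-ar", "16000", // resample to 16 kHz
      output,
    ]);
    ff.on("error", reject);
    ff.on("close", (code) =>
      code === 0 ? resolve() : reject(new Error(`ffmpeg exited with ${code}`))
    );
  });
}

// Usage: await normalizeTo16kMono("meeting.mp4", "meeting-16k.wav");
```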

Triggering Transcription

  1. Open the Transcribe command: search Transcribe Audio/Video files in SmartBar, or invoke it via a workflow.
  2. Provide the files: drag and drop one or more audio/video files, or pass them as filePaths parameters from a workflow or shortcut.
  3. Pick an output format: choose Plain Text to receive the transcript inline, or TXT File to save it next to the source media.
  4. Run: EnConvo preprocesses the audio, sends it to your selected transcription provider, and streams progress back to SmartBar.
The transcription provider used here is configured separately from Dictation. Set it in Settings → Transcribe Audio/Video files → Transcription Provider so file transcription can use a heavier, more accurate model than your real-time dictation provider.

Provider Selection

The Transcription command works with every speech-to-text provider EnConvo supports. The right choice depends on file length, language, accuracy needs, and whether the audio can leave your machine.

Speaker Diarization

When transcribing meetings or interviews, enable Speaker Diarization in the command preferences. EnConvo will:
  • Pass the diarization flag to providers that support it natively (AssemblyAI, Soniox, Volcengine, Azure, ElevenLabs)
  • Group transcript segments by speaker
  • Label each line as Speaker 0:, Speaker 1:, etc.
Diarization quality depends on the provider. AssemblyAI and Soniox produce the cleanest speaker turns; cloud Whisper-style models do not separate speakers and return a single unlabeled transcript.
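
For providers that return per-segment speaker IDs, the grouping and labeling step works roughly like the sketch below. The Segment shape here is an assumption for illustration; actual field names vary by provider:

```typescript
// Assumed shape of a diarized segment; real field names vary by provider.
interface Segment {
  speaker: number; // 0, 1, 2, ...
  text: string;
}

// Merge consecutive segments from the same speaker into one turn,
// then label each turn "Speaker N:" as in the transcript output.
function labelSpeakers(segments: Segment[]): string {
  const turns: { speaker: number; parts: string[] }[] = [];
  for (const seg of segments) {
    const last = turns[turns.length - 1];
    if (last && last.speaker === seg.speaker) {
      last.parts.push(seg.text);
    } else {
      turns.push({ speaker: seg.speaker, parts: [seg.text] });
    }
  }
  return turns
    .map((t) => `Speaker ${t.speaker}: ${t.parts.join(" ").trim()}`)
    .join("\n\n"); // one speaker turn per paragraph, as in the .txt output
}
```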

Output

  • Plain Text: the transcript is returned inline as a string — useful for chaining into AI summarization, translation, or knowledge base ingestion in a workflow.
  • TXT File: a .txt file is written next to the source media (or to output_dir if specified). When diarization is enabled, the file is laid out one speaker turn per paragraph.
  • Timestamps: providers that return per-segment / per-word timestamps include them on the SpeechToTextResult.segments and .words fields. Workflows can read these to build subtitles, jump-to-time UIs, or speaker-aware summaries.
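
For example, a workflow could turn those timestamps into SRT subtitles. A minimal sketch, assuming each segment exposes start and end times in seconds plus a text field (the exact field names depend on the provider):

```typescript
// Assumed segment shape; adjust to the fields your provider returns.
interface TimedSegment {
  start: number; // seconds
  end: number;   // seconds
  text: string;
}

// Format seconds as the HH:MM:SS,mmm timestamp SRT requires.
function srtTime(seconds: number): string {
  const ms = Math.round(seconds * 1000);
  const h = Math.floor(ms / 3_600_000);
  const m = Math.floor((ms % 3_600_000) / 60_000);
  const s = Math.floor((ms % 60_000) / 1000);
  const pad = (n: number, w = 2) => String(n).padStart(w, "0");
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms % 1000, 3)}`;
}

// Number each cue, print its time range, then its text.
function toSrt(segments: TimedSegment[]): string {
  return segments
    .map(
      (seg, i) =>
        `${i + 1}\n${srtTime(seg.start)} --> ${srtTime(seg.end)}\n${seg.text.trim()}\n`
    )
    .join("\n");
}
```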

Hot Words & Domain Vocabulary

Most providers accept a list of hot words or context terms to bias recognition toward names, jargon, and product terms. Set them in the provider’s settings:
  • Soniox → Context Terms (one per line)
  • Volcengine → Hot Words
  • AssemblyAI → Word Boost
  • Groq / OpenAI / Local Whisper → Prompt (free-form, up to 224 tokens)
Useful for medical terms, code names, acronyms, brand names, or anything not in the model’s general vocabulary.
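
To illustrate what the Prompt field maps to for the OpenAI-compatible providers: a direct call with the openai Node SDK passes the same terms through its prompt parameter. The file name and term list below are examples, not EnConvo internals:

```typescript
import fs from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// The prompt biases recognition toward these spellings; it is not an
// instruction the model follows, just decoding context.
const transcription = await openai.audio.transcriptions.create({
  file: fs.createReadStream("standup.m4a"),
  model: "whisper-1",
  prompt: "EnConvo, SmartBar, Soniox, diarization, Volcengine",
});

console.log(transcription.text);
```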

Working With Long Files

EnConvo handles large media without manual splitting:
  1. Audio is decoded to a 16 kHz mono WAV
  2. If the file exceeds the provider’s per-call limit, it is split into chunks with several seconds of overlap
  3. Chunks are transcribed (in parallel where the provider permits)
  4. Sequence alignment merges chunk transcripts so words at boundaries are not duplicated or dropped
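Step 4 can be pictured as finding the longest run of words shared by the tail of one chunk and the head of the next, then dropping the duplicate run. A simplified exact-match sketch of that idea (EnConvo's actual alignment is fuzzier and more robust than this):

```typescript
// Merge two chunk transcripts whose audio overlapped by a few seconds.
// Look for the longest word run that both ends chunk A and starts
// chunk B, and keep only one copy of it.
function mergeOverlap(a: string, b: string, maxOverlapWords = 30): string {
  const aWords = a.split(/\s+/);
  const bWords = b.split(/\s+/);
  const limit = Math.min(maxOverlapWords, aWords.length, bWords.length);
  for (let n = limit; n > 0; n--) {
    const tail = aWords.slice(aWords.length - n).join(" ").toLowerCase();
    const head = bWords.slice(0, n).join(" ").toLowerCase();
    if (tail === head) {
      return [...aWords, ...bWords.slice(n)].join(" ");
    }
  }
  // No overlap found: fall back to plain concatenation.
  return `${a} ${b}`;
}
```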
Concrete provider limits worth knowing:
  • Groq / OpenAI / ElevenLabs: 25 MB per request — chunked automatically
  • Volcengine Flash: ≤ 2 hours and ≤ 100 MB per call; longer files fall back to the async submit/query pipeline
  • Google Gemini: up to 9.5 hours per file with no chunking required
  • AssemblyAI / Soniox / Azure Fast: server-side chunking, no client-side split needed

Troubleshooting

Transcript comes back empty
  1. Confirm the file actually contains speech (silent or music-only files return an empty transcript)
  2. Try a different provider — Whisper-based models can drop very short clips
  3. For Volcengine and Soniox, set a language_hints value matching the spoken language

Transcription is slow
  1. For cloud: prefer Groq Whisper Large V3 Turbo or Azure Fast Transcription for the highest throughput
  2. For local: use NVIDIA Parakeet on Apple Silicon — ~210x realtime
  3. Local Whisper large-v3 is the slowest local option; switch to small or Parakeet if speed matters

Speaker labels are noisy or wrong
  1. Diarization assumes distinct speakers — overlapping speech produces noisy labels
  2. Try AssemblyAI Universal, which has the most robust diarization
  3. Provide a hint of the expected speaker count where the provider exposes one (e.g. maxSpeakers)

Names and jargon are misrecognized
  1. Add the terms to the provider’s hot-words / context-terms / prompt field
  2. Set the language explicitly instead of using auto-detect
  3. Try a larger model — whisper-large-v3 over turbo, AssemblyAI Universal over base

Video files fail to transcribe
EnConvo extracts the audio track automatically using ffmpeg. If extraction fails, re-export the video to MP4 with a standard AAC track (e.g. ffmpeg -i input.avi -c:v copy -c:a aac output.mp4) or convert the audio to WAV first (ffmpeg -i input.avi -vn output.wav).

Related

  • Dictation: live voice-to-text, push-to-talk, and SmartBar dictation
  • Speech Recognition: provider deep-dive covering models, languages, and Cloud Plan pricing
  • Meeting Recording: capture meetings end-to-end with live transcription
  • Soniox: configure Soniox real-time and async transcription