
Overview

The Transcribe Audio/Video files command turns recorded audio and video files into text. Drop in a meeting recording, lecture, podcast, or voice memo and EnConvo runs it through your chosen speech-to-text provider — cloud or fully local — returning plain text or a .txt file with optional speaker labels and timestamps. Where Dictation is for live voice input, Transcription is for files you already have on disk.

Supported Formats

EnConvo accepts most common audio and video containers:
  • Audio: .mp3, .wav, .flac, .aac, .ogg, .m4a, .aiff, .amr, .webm, .opus
  • Video: .mp4, .mov, .mkv, .avi (audio track is extracted automatically)
Files are normalized to 16 kHz mono before transcription. Large files are chunked with overlap, transcribed in parallel where the provider allows, and stitched back together using sequence alignment so words at chunk boundaries are not lost.
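
Under the hood, this normalization is a plain ffmpeg resample. A minimal sketch of the equivalent preprocessing step, assuming ffmpeg is available on PATH (the normalizeTo16kMono helper is illustrative, not EnConvo's internal API):

```typescript
import { spawn } from "node:child_process";

// Decode any supported container to the 16 kHz mono WAV that
// transcription providers expect. -ar sets the sample rate,
// -ac the channel count; -vn drops any video stream.
function normalizeTo16kMono(input: string, output: string): Promise<void> {
  return new Promise((resolve, reject) => {
    const ff = spawn("ffmpeg", [
      "-y",           // overwrite the output file if it exists
      "-i", input,    // source audio or video file
      "-vn",          // ignore the video track
      "-ac", "1",     // downmix to mono
      "-ar", "16000", // resample to 16 kHz
      output,
    ]);
    ff.on("error", reject);
    ff.on("close", (code) =>
      code === 0 ? resolve() : reject(new Error(`ffmpeg exited with ${code}`))
    );
  });
}

// Usage: await normalizeTo16kMono("meeting.mp4", "meeting-16k.wav");
```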

Triggering Transcription

  1. Open the Transcribe command: search Transcribe Audio/Video files in SmartBar, or invoke it via a workflow.
  2. Provide the files: drag and drop one or more audio/video files, or pass them as filePaths parameters from a workflow or shortcut.
  3. Pick an output format: choose Plain Text to receive the transcript inline, or TXT File to save it next to the source media.
  4. Run: EnConvo preprocesses the audio, sends it to your selected transcription provider, and streams progress back to SmartBar.
The transcription provider used here is configured separately from Dictation. Set it in Settings → Transcribe Audio/Video files → Transcription Provider so file transcription can use a heavier, more accurate model than your real-time dictation provider.

Provider Selection

The Transcription command works with every speech-to-text provider EnConvo supports. The right choice depends on file length, language, accuracy needs, and whether the audio can leave your machine.

Speaker Diarization

When transcribing meetings or interviews, enable Speaker Diarization in the command preferences. EnConvo will:
  • Pass the diarization flag to providers that support it natively (AssemblyAI, Soniox, Volcengine, Azure, ElevenLabs)
  • Group transcript segments by speaker
  • Label each line as Speaker 0:, Speaker 1:, etc.
Diarization quality depends on the provider. AssemblyAI and Soniox produce the cleanest speaker turns; cloud Whisper-style models do not separate speakers and return a single unlabeled transcript.
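
For providers that return per-segment speaker IDs, the grouping and labeling step works roughly like the sketch below. The Segment shape here is an assumption for illustration; actual field names vary by provider:

```typescript
// Assumed shape of a diarized segment; real field names vary by provider.
interface Segment {
  speaker: number; // 0, 1, 2, ...
  text: string;
}

// Merge consecutive segments from the same speaker into one turn,
// then label each turn "Speaker N:" as in the transcript output.
function labelSpeakers(segments: Segment[]): string {
  const turns: { speaker: number; parts: string[] }[] = [];
  for (const seg of segments) {
    const last = turns[turns.length - 1];
    if (last && last.speaker === seg.speaker) {
      last.parts.push(seg.text);
    } else {
      turns.push({ speaker: seg.speaker, parts: [seg.text] });
    }
  }
  return turns
    .map((t) => `Speaker ${t.speaker}: ${t.parts.join(" ").trim()}`)
    .join("\n\n"); // one speaker turn per paragraph, as in the .txt output
}
```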

Output

  • Plain Text: the transcript is returned inline as a string — useful for chaining into AI summarization, translation, or knowledge base ingestion in a workflow.
  • TXT File: a .txt file is written next to the source media (or to output_dir if specified). When diarization is enabled, the file is laid out one speaker turn per paragraph.
  • Timestamps: providers that return per-segment / per-word timestamps include them on the SpeechToTextResult.segments and .words fields. Workflows can read these to build subtitles, jump-to-time UIs, or speaker-aware summaries.
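
For example, a workflow could turn those timestamps into SRT subtitles. A minimal sketch, assuming each segment exposes start and end times in seconds plus a text field (the exact field names depend on the provider):

```typescript
// Assumed segment shape; adjust to the fields your provider returns.
interface TimedSegment {
  start: number; // seconds
  end: number;   // seconds
  text: string;
}

// Format seconds as the HH:MM:SS,mmm timestamp SRT requires.
function srtTime(seconds: number): string {
  const ms = Math.round(seconds * 1000);
  const h = Math.floor(ms / 3_600_000);
  const m = Math.floor((ms % 3_600_000) / 60_000);
  const s = Math.floor((ms % 60_000) / 1000);
  const pad = (n: number, w = 2) => String(n).padStart(w, "0");
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms % 1000, 3)}`;
}

// Number each cue, print its time range, then its text.
function toSrt(segments: TimedSegment[]): string {
  return segments
    .map(
      (seg, i) =>
        `${i + 1}\n${srtTime(seg.start)} --> ${srtTime(seg.end)}\n${seg.text.trim()}\n`
    )
    .join("\n");
}
```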

Hot Words & Domain Vocabulary

Most providers accept a list of hot words or context terms to bias recognition toward names, jargon, and product terms. Set them in the provider’s settings:
  • Soniox → Context Terms (one per line)
  • Volcengine → Hot Words
  • AssemblyAI → Word Boost
  • Groq / OpenAI / Local Whisper → Prompt (free-form, up to 224 tokens)
Useful for medical terms, code names, acronyms, brand names, or anything not in the model’s general vocabulary.
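
To illustrate what the Prompt field maps to for the OpenAI-compatible providers: a direct call with the openai Node SDK passes the same terms through its prompt parameter. The file name and term list below are examples, not EnConvo internals:

```typescript
import fs from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// The prompt biases recognition toward these spellings; it is not an
// instruction the model follows, just decoding context.
const transcription = await openai.audio.transcriptions.create({
  file: fs.createReadStream("standup.m4a"),
  model: "whisper-1",
  prompt: "EnConvo, SmartBar, Soniox, diarization, Volcengine",
});

console.log(transcription.text);
```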

Working With Long Files

EnConvo handles large media without manual splitting:
  1. Audio is decoded to a 16 kHz mono WAV
  2. If the file exceeds the provider’s per-call limit, it is split into chunks with several seconds of overlap
  3. Chunks are transcribed (in parallel where the provider permits)
  4. Sequence alignment merges chunk transcripts so words at boundaries are not duplicated or dropped
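Step 4 can be pictured as finding the longest run of words shared by the tail of one chunk and the head of the next, then dropping the duplicate run. A simplified exact-match sketch of that idea (EnConvo's actual alignment is fuzzier and more robust than this):

```typescript
// Merge two chunk transcripts whose audio overlapped by a few seconds.
// Look for the longest word run that both ends chunk A and starts
// chunk B, and keep only one copy of it.
function mergeOverlap(a: string, b: string, maxOverlapWords = 30): string {
  const aWords = a.split(/\s+/);
  const bWords = b.split(/\s+/);
  const limit = Math.min(maxOverlapWords, aWords.length, bWords.length);
  for (let n = limit; n > 0; n--) {
    const tail = aWords.slice(aWords.length - n).join(" ").toLowerCase();
    const head = bWords.slice(0, n).join(" ").toLowerCase();
    if (tail === head) {
      return [...aWords, ...bWords.slice(n)].join(" ");
    }
  }
  // No overlap found: fall back to plain concatenation.
  return `${a} ${b}`;
}
```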
Concrete provider limits worth knowing:
  • Groq / OpenAI / ElevenLabs: 25 MB per request — chunked automatically
  • Volcengine Flash: ≤ 2 hours and ≤ 100 MB per call; longer files fall back to the async submit/query pipeline
  • Google Gemini: up to 9.5 hours per file with no chunking required
  • AssemblyAI / Soniox / Azure Fast: server-side chunking, no client-side split needed

Troubleshooting

Transcript comes back empty
  1. Confirm the file actually contains speech (silent or music-only files return an empty transcript)
  2. Try a different provider — Whisper-based models can drop very short clips
  3. For Volcengine and Soniox, set a language_hints value matching the spoken language

Transcription is slow
  1. For cloud: prefer Groq Whisper Large V3 Turbo or Azure Fast Transcription for the highest throughput
  2. For local: use NVIDIA Parakeet on Apple Silicon — ~210x realtime
  3. Local Whisper large-v3 is the slowest local option; switch to small or Parakeet if speed matters

Speaker labels are noisy or wrong
  1. Diarization assumes distinct speakers — overlapping speech produces noisy labels
  2. Try AssemblyAI Universal, which has the most robust diarization
  3. Provide a hint of the expected speaker count where the provider exposes one (e.g. maxSpeakers)

Names and jargon are misrecognized
  1. Add the terms to the provider’s hot-words / context-terms / prompt field
  2. Set the language explicitly instead of using auto-detect
  3. Try a larger model — whisper-large-v3 over turbo, AssemblyAI Universal over base

Video files fail to transcribe
EnConvo extracts the audio track automatically using ffmpeg. If extraction fails, re-export the video to MP4 with a standard AAC track (e.g. ffmpeg -i input.avi -c:v copy -c:a aac output.mp4) or convert the audio to WAV first (ffmpeg -i input.avi -vn output.wav).

Related

  • Dictation: live voice-to-text, push-to-talk, and SmartBar dictation
  • Speech Recognition: provider deep-dive covering models, languages, and Cloud Plan pricing
  • Meeting Recording: capture meetings end-to-end with live transcription
  • Soniox: configure Soniox real-time and async transcription