Mistral Voxtral from the command line.

Voxtral Mini Transcribe is Mistral's speech-to-text model. It ships with open weights under Apache 2.0, offers the best price-to-accuracy ratio of any transcription API at $0.003/min, and is built on a language model backbone that understands audio rather than just transcribing it. OSTT connects it to your terminal, hotkey, and shell on Linux and macOS.

Mistral Voxtral

Open weights. Best price-performance. Built on a language model.

Voxtral Mini Transcribe V2 delivers ~4% WER at $0.003/min, outperforming GPT-4o-mini-transcribe, Gemini 2.5 Flash, AssemblyAI Universal, and Deepgram Nova on accuracy, while processing audio roughly 3x faster than ElevenLabs Scribe v2. It's open-weight (Apache 2.0), supports 13 languages, speaker diarization, and context biasing for technical vocabulary, and processes recordings of up to 3 hours in a single request.

# ~/.config/ostt/ostt.toml
[transcription]
provider = "mistral"
model = "voxtral-mini-latest"

[mistral.voxtral-mini-latest.params]
# language = "en"  # Optional: improves accuracy when known
# context_bias = ["OSTT", "Voxtral"]

# Pick interactively
ostt model

# Record with hotkey, transcribe with Voxtral, copy to clipboard
ostt launch -c

Open weights, Apache 2.0

Voxtral is open-source. The 3B and 24B model weights are available on Hugging Face for self-hosting, private deployment, or on-premise use. The API routes to a transcription-optimised version of the mini model when you don't want to manage infrastructure.

Best price-performance

At $0.003/min, Voxtral Mini Transcribe V2 costs half of GPT-4o-mini-transcribe and one-fifth of ElevenLabs Scribe v2, while matching or beating both on accuracy benchmarks. For high-volume transcription work, no other API comes close on this ratio.
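
To make the per-minute price concrete, here is a minimal sketch of the arithmetic in shell. The function name estimate_cost is hypothetical and not part of OSTT; the default rate is the $0.003/min figure quoted above, and the second argument lets you plug in another provider's rate for comparison.

```shell
#!/bin/sh
# Hypothetical helper (not part of OSTT): estimate API cost for some audio.
# $1: minutes of audio
# $2: price per minute in dollars (defaults to the $0.003/min quoted above)
estimate_cost() {
    minutes=$1
    rate_per_min=${2:-0.003}
    awk -v m="$minutes" -v r="$rate_per_min" 'BEGIN { printf "%.2f\n", m * r }'
}

# 100 hours of audio is 6000 minutes
estimate_cost 6000   # prints 18.00
```

At that rate, 100 hours of audio comes to about $18 before any other charges your account may carry.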

Context biasing for technical vocabulary

Provide up to 100 words or phrases to guide the model toward correct spellings of names, technical terms, and domain-specific vocabulary. OSTT sends your configured keywords as Voxtral context_bias terms automatically.
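
If you keep your vocabulary in a plain text file, one term per line, a quick check against the 100-term limit might look like the sketch below. The function name check_bias_terms and the one-term-per-line layout are assumptions for illustration; OSTT's own keyword storage may differ.

```shell
#!/bin/sh
# Hypothetical helper (not part of OSTT): read terms from stdin, one per line,
# and fail if there are more than 100 non-blank entries.
check_bias_terms() {
    # grep -c exits non-zero on zero matches, so guard it with || true
    count=$(grep -c '[^[:space:]]' || true)
    if [ "$count" -gt 100 ]; then
        echo "too many terms: $count (limit 100)" >&2
        return 1
    fi
    echo "$count terms OK"
}
```

For example, `check_bias_terms < terms.txt` reports the count, or exits non-zero when the list is too long.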

13 languages with speaker diarization

Voxtral Mini Transcribe V2 supports English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch — with speaker diarization and word-level timestamps in all 13.

3-hour recordings in a single request

Unlike most transcription APIs, which require chunking at 25MB or 25 minutes, Voxtral processes recordings up to 3 hours in one request. Transcribe a full lecture or a long meeting without writing chunking logic.
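
Before sending a long file, you may want to confirm it fits under the 3-hour limit. The sketch below assumes you already have the duration in seconds (for instance from ffprobe, which is a separate tool and not part of OSTT); the function name fits_single_request is hypothetical.

```shell
#!/bin/sh
# 3 hours in seconds, the single-request limit quoted above
MAX_SECONDS=10800

# Hypothetical helper (not part of OSTT).
# $1: duration in seconds; fractional values are accepted.
# Returns 0 if the recording fits in one request, 1 otherwise.
fits_single_request() {
    awk -v d="$1" -v max="$MAX_SECONDS" \
        'BEGIN { exit (d + 0 <= max + 0) ? 0 : 1 }'
}

# Example use, assuming ffprobe is installed:
#   duration=$(ffprobe -v error -show_entries format=duration \
#       -of default=noprint_wrappers=1:nokey=1 lecture.mp3)
#   fits_single_request "$duration" && ostt transcribe lecture.mp3 -o notes.md
```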

Global hotkey

Bind OSTT to a system-wide shortcut. Press to open the recorder, speak, press again to stop. Voxtral transcribes and the result lands in your clipboard or stdout — without touching the mouse.

Workflow

From speech to useful output.

1. Record: Press your global hotkey or run ostt in the terminal.

2. Transcribe: Voxtral Mini Transcribe processes the audio via the Mistral API.

3. Process: Optionally run AI prompts or shell commands on the result.

4. Send: Print to stdout, copy to clipboard, write to a file, or pipe onward.

Pipeline

Open-source accuracy in your shell.

OSTT routes Voxtral output to wherever your workflow needs it — stdout, clipboard, file, or piped to any CLI tool. Use the -p flag to chain processing actions. OSTT keywords map directly to Voxtral context_bias, so domain vocabulary you add once improves every recording automatically.

# Transcribe a 2-hour recording in one call
ostt transcribe lecture.mp3 -o notes.md

# Record, run processing action, copy result
ostt -p clean -c

# Transcribe and pipe to downstream command
ostt | my-tool.sh
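
The downstream command can be any program that reads stdin. As an illustration, here is a hypothetical stand-in for my-tool.sh that turns each transcript line into a bullet; the function name to_bullets is made up for this sketch, and it assumes the transcript arrives as plain text.

```shell
#!/bin/sh
# Hypothetical stand-in for my-tool.sh: read transcript text on stdin and
# emit each non-empty line as a bullet, skipping blank lines.
to_bullets() {
    # The || clause keeps a final line that lacks a trailing newline
    while IFS= read -r line || [ -n "$line" ]; do
        if [ -n "$line" ]; then
            printf '%s\n' "- $line"
        fi
    done
    return 0
}
```

Saved as a script that ends by calling `to_bullets`, it slots into the pipe position shown above.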

Voxtral accuracy at half the cost of the alternatives.
