ElevenLabs Scribe v2 from the command line.

ElevenLabs Scribe v2 holds the top position on the Artificial Analysis speech-to-text benchmark with a 2.2% word error rate — lower than GPT-4o-transcribe, AssemblyAI, Deepgram, and Google. It transcribes in 99 languages, detects non-speech events, and identifies entities. OSTT connects it directly to your terminal, hotkey, and shell.

ElevenLabs Scribe v2

2.2% WER. The most accurate model available.

On the Artificial Analysis AA-WER v2 benchmark — which weights real-world audio including diverse accents and domain-specific language — Scribe v2 scores 2.2%, outperforming every major provider. It handles long-form audio with consistent accuracy across speaker changes, accents, and recording conditions. It detects non-speech events (laughter, footsteps), identifies entities across 56 categories, and supports smart speaker diarization. OSTT brings all of this to your keyboard.

# ~/.config/ostt/ostt.toml
[transcription]
provider = "elevenlabs"
model = "scribe_v2"

[elevenlabs.scribe_v2.params]
# language_code = "eng"  # Optional: set to force language
# diarize = true
# keyterms = ["OSTT", "Scribe"]

# Pick interactively
ostt model

# Record with hotkey, transcribe with Scribe v2, copy to clipboard
ostt launch -c

#1 accuracy benchmark

Scribe v2 leads the Artificial Analysis AA-WER v2 leaderboard at 2.2% WER, ahead of AssemblyAI Universal-3 Pro (3.3%), GPT-4o-transcribe (4.1%), and Deepgram Nova-3 (5.3%). For work where every word matters (legal, medical, research), the gap adds up: on these figures, roughly 11 fewer word errors per 1,000 words than AssemblyAI Universal-3 Pro and 31 fewer than Deepgram Nova-3.

99 languages, auto-detected

Scribe v2 supports 99 languages with automatic language detection. It handles mid-file language switches without configuration. Set a language_code in ostt.toml to lock a specific language and improve accuracy when the source is known.

Non-speech event detection

Scribe v2 detects and labels non-speech events such as laughter, applause, and background noise — useful for meeting notes, research transcription, and media workflows where context matters beyond the words spoken.

Entity detection

Built-in detection across 56 entity categories including PII, health data, and payment details, with precise timestamps. Useful for redaction workflows and structured audio analysis.

Smart speaker diarization

Identify and label multiple speakers with precise word-level timestamps. Scribe v2 handles a wide range of speaker counts with accurate attribution across accents and delivery styles.

Keyterm prompting

Provide up to 100 domain-specific terms, product names, or technical vocabulary. Scribe v2 applies them in context — not just as keyword matching — for accurate transcription of the terms that matter most.
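
A keyterm list is often easier to maintain in its own file than inline in the config. The sketch below turns a plain-text list (one term per line) into a keyterms line that can be pasted under [elevenlabs.scribe_v2.params] in ostt.toml. keyterms_line is a hypothetical helper, not an OSTT command. It drops blank lines and duplicates, skips terms containing a double quote or backslash rather than trying to escape them, and stops at the 100-term limit mentioned above. Whether OSTT itself enforces that limit is an assumption this page does not confirm.

```shell
# keyterms_line: read terms on stdin (one per line) and print a TOML
# "keyterms = [...]" line. Hypothetical helper, not part of OSTT.
keyterms_line() {
    awk '
        # Keep non-blank, quote-free, not-yet-seen terms, up to 100 of them.
        NF && $0 !~ /["\\]/ && !seen[$0]++ && n < 100 {
            terms = terms (n++ ? ", " : "") "\"" $0 "\""
        }
        END { print "keyterms = [" terms "]" }
    '
}

# Usage (terms.txt is a hypothetical file name):
#   keyterms_line < terms.txt
```

Terms are kept in the order they first appear, so placing the most important vocabulary at the top of the file ensures it survives the cap.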

Workflow

From speech to useful output.

1. Record: Press your global hotkey or run ostt in the terminal.

2. Transcribe: Scribe v2 processes the audio via the ElevenLabs API.

3. Process: Optionally run AI prompts or shell commands on the result.

4. Send: Print to stdout, copy to clipboard, write to a file, or pipe onward.
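
For recordings that already exist on disk, steps 2 and 4 can be applied to a whole folder at once. The following is a minimal sketch, not an OSTT feature: transcribe_dir is a hypothetical helper that relies only on the ostt transcribe FILE -o OUT form shown elsewhere on this page, and it assumes ostt exits non-zero when a transcription fails.

```shell
# transcribe_dir: transcribe every .mp3 in a directory to a sibling .txt file.
# Hypothetical helper built on the "ostt transcribe FILE -o OUT" form.
transcribe_dir() {
    dir="${1:-.}"
    for f in "$dir"/*.mp3; do
        [ -e "$f" ] || continue        # no matches: the glob stays literal
        out="${f%.mp3}.txt"
        [ -e "$out" ] && continue      # already transcribed, skip it
        # Report failures but keep going with the remaining files.
        ostt transcribe "$f" -o "$out" || echo "failed: $f" >&2
    done
}

# Usage (the directory name is only an example):
#   transcribe_dir ~/recordings
```

Skipping files that already have a transcript makes the function safe to re-run after an interruption without sending the same audio to the API twice.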

Pipeline

Benchmark accuracy in your shell.

OSTT connects Scribe v2 to your shell like any other Unix tool. Transcription output lands on stdout — pipe it through jq, sed, or any CLI tool. Use -p to chain AI processing actions on the result. Add technical vocabulary to OSTT keywords once and improve accuracy across every recording.

# Transcribe an interview with Scribe v2
ostt transcribe interview.mp3 -o transcript.txt

# Record, process with AI action, copy result
ostt -p clean -c

# Transcribe a long recording, summarize, write to file
ostt transcribe lecture.mp3 -p summary -o notes.md
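
Because the transcript is plain text on stdout, standard text tools can be applied to it directly. As a small illustration, the sketch below lists the most frequent words in a transcript. top_words is a hypothetical helper, not part of OSTT, and the usage line assumes that ostt transcribe prints the transcript to stdout when -o is omitted.

```shell
# top_words: read transcript text on stdin and print the N most frequent
# words as "word count", most frequent first. Hypothetical helper.
top_words() {
    n="${1:-10}"
    tr -cs '[:alpha:]' '\n' |          # split into one word per line
        tr '[:upper:]' '[:lower:]' |   # fold case so "The" and "the" match
        grep -v '^$' |                 # drop a possible empty first line
        sort | uniq -c |               # count occurrences of each word
        sort -k1,1nr -k2 |             # by count descending, then by word
        head -n "$n" |
        awk '{ print $2 " " $1 }'
}

# Usage (assumes the transcript goes to stdout without -o):
#   ostt transcribe interview.mp3 | top_words 20
```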

The most accurate transcription model in your terminal.

Read the docs | ElevenLabs provider reference