
Choosing a Transcription Model or Provider

OSTT is provider-neutral. You can use hosted cloud transcription, built-in local Whisper-compatible models, or your own external command or HTTP engine. The right choice depends on privacy, language, latency, cost, hardware, and how much setup you want to own.

If you are unsure, start with one good default. OSTT makes comparison practical: record once, then run ostt retry -m PROVIDER/MODEL to transcribe the same audio with another model.

Quick Recommendations

| Need | Good starting point | Why |
| --- | --- | --- |
| Fastest setup | OpenAI openai/gpt-4o-transcribe or Deepgram deepgram/nova-3 | Hosted APIs avoid local model downloads and local runtime setup. |
| Strong general cloud quality | OpenAI openai/gpt-4o-transcribe, Deepgram deepgram/nova-3, or AssemblyAI assemblyai/universal-3-pro | These are current general-purpose cloud transcription models with useful params for language, formatting, prompts, or keyterms. |
| Swedish or EU-focused work | Berget berget/KBLab/kb-whisper-large | Berget is a Swedish provider with Swedish-optimized and European-hosted transcription options. |
| Norwegian work | Berget berget/NbAiLab/nb-whisper-large | NB-Whisper is documented for Norwegian, Bokmål, Nynorsk, and English. |
| Offline or privacy-sensitive work | Built-in local Whisper, such as whisper/turbo if it fits your hardware | Audio stays on your machine after the model file is downloaded. |
| Fast local dictation | Built-in local Whisper with daemon mode, or an external hot local HTTP engine | Daemon mode avoids reloading the local model every call. External servers can keep other engines hot. |
| Newer local ASR engines | command/<profile> or http/<profile> external engines | Run faster-whisper, Parakeet, Cohere Transcribe, Speaches, LocalAI, or your own wrapper without making OSTT bundle every runtime. |
| Long recordings or meetings | Cloud providers with diarization/formatting params, or local batch engines if privacy matters | Long files often benefit from provider-specific params such as diarization, formatting, language hints, and prompts. |
| Developer dictation | Any accurate model plus ostt keyword, ostt replace, and processing actions | Technical terms need vocabulary hints and deterministic cleanup as much as raw model accuracy. |

These are starting points, not benchmark rankings. Results vary with hardware, microphone quality, language, accent, and noise, and pricing and provider policies change over time.

Compare Models With Retry

Most dictation tools hide model choice behind a global setting. OSTT saves recordings locally, so model choice becomes testable instead of theoretical.

bash
# Record once with your current default
ostt -o first.txt

# Retry the same audio with different providers
ostt retry -m deepgram/nova-3 -o deepgram.txt
ostt retry -m openai/gpt-4o-transcribe -o openai.txt
ostt retry -m berget/KBLab/kb-whisper-large -o berget.txt
ostt retry -m whisper/turbo -o local.txt

Use this for real audio from your microphone, your accent, your room, and your vocabulary. That is more useful than a generic benchmark table.
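
When you compare more than two or three models, the retry commands above can be wrapped in a small loop. The following is a minimal sketch, not part of OSTT: compare_models and the compare directory are illustrative names, and it assumes only the ostt retry -m and -o flags shown above.

```shell
#!/bin/sh
# Sketch: retry the saved recording against several models and keep one
# transcript per model. compare_models is an illustrative helper, not an
# OSTT command.

compare_models() {
    out_dir=$1
    shift
    mkdir -p "$out_dir" || return 1

    for model in "$@"; do
        # Model IDs contain slashes (e.g. berget/KBLab/kb-whisper-large),
        # so flatten them into a safe file name.
        safe_name=$(printf '%s' "$model" | tr '/' '_')
        out_file="$out_dir/$safe_name.txt"

        if ostt retry -m "$model" -o "$out_file"; then
            # Word count is only a rough sanity check on transcript length.
            words=$(wc -w < "$out_file" | tr -d ' ')
            printf '%s\t%s words\t%s\n' "$model" "$words" "$out_file"
        else
            printf '%s\tFAILED\n' "$model" >&2
        fi
    done
}

# Example usage (uncomment to run against your saved recording):
# compare_models compare deepgram/nova-3 openai/gpt-4o-transcribe whisper/turbo
# diff compare/deepgram_nova-3.txt compare/openai_gpt-4o-transcribe.txt
```

Reading the transcripts side by side is still the real test. The word counts only flag obvious truncation or runaway output.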

Cloud Providers

Cloud providers are usually easiest when you want strong transcription without downloading local models. They also take the CPU/GPU load off your machine.

| Provider | Good for | Watch out for |
| --- | --- | --- |
| OpenAI | GPT-4o transcription, GPT-4o Mini, hosted Whisper, prompt hints, diarization model | Audio leaves your machine. OSTT returns plain text even when JSON metadata is requested. |
| Deepgram | Nova models, low-latency cloud transcription, formatting, diarization, language detection, keyterms | Advanced params are provider-specific. Pick nova-3 or nova-2 intentionally. |
| Groq | Very fast hosted Whisper variants and OpenAI-compatible request shape | Groq model choices differ in accuracy, cost, and translation support. |
| DeepInfra | Hosted open speech-recognition models, including Whisper and Voxtral options | Model availability and pricing can change. Check the provider docs. |
| AssemblyAI | Universal-3 Pro, promptable transcription, speaker labels, language detection, keyterms | Async-provider behavior and params differ from OpenAI-style endpoints. |
| Berget | Swedish and Norwegian optimized Whisper models, European hosting | Best fit when Berget's regional and model choices match your use case. |
| ElevenLabs | Scribe transcription and multilingual speech-to-text workflows | Advanced diarization and role params have provider-specific constraints. |
| Mistral | Voxtral transcription, context bias, diarization, timestamp granularity | OSTT uses the synchronous transcription endpoint, not streaming. |

Run ostt auth login before selecting a cloud provider:

bash
ostt auth login
ostt model

Built-In Local Whisper

Use local Whisper-compatible models when privacy, offline use, or predictable cost matters. OSTT's built-in local path uses whisper-rs with GGUF or ggml-*.bin model files.

Open the model picker:

bash
ostt model

Choose Local provider to download curated models, activate a downloaded model, inspect metadata, delete model files, or add a custom Hugging Face/direct model URL.

| Hardware | Suggested starting point | Notes |
| --- | --- | --- |
| Low-end CPU | tiny, base, or small | Faster, lower accuracy. Good for quick notes and testing. |
| Modern laptop CPU | small, medium, or turbo | Balance speed and quality. Try short samples first. |
| Apple Silicon | turbo or large if latency is acceptable | Metal acceleration is enabled on macOS builds. |
| NVIDIA Linux GPU | turbo or large | Use the CUDA build when the NVIDIA driver and cuBLAS runtime are available. |
| AMD/Intel Linux GPU | turbo with the Vulkan build | Vulkan support is useful but hardware-dependent. |
| Privacy-sensitive work | Largest model that feels fast enough | No audio leaves the machine, but local performance depends on hardware. |
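
To find out which local model feels fast enough on your own hardware, you can time a retry of the same recording per model. This is a rough sketch: time_model is a hypothetical helper, it relies only on the ostt retry flags shown elsewhere on this page, and it measures whole seconds, which is coarse but usually enough to separate model sizes.

```shell
#!/bin/sh
# Sketch: time one retry of the saved recording with a given model.
# time_model is an illustrative helper, not an OSTT command.

time_model() {
    model=$1
    out_file=$(mktemp) || return 1

    start=$(date +%s)
    status=0
    ostt retry -m "$model" -o "$out_file" || status=$?
    end=$(date +%s)

    # The transcript itself is not needed here, only the elapsed time.
    rm -f "$out_file"

    printf '%s\t%ss\texit=%s\n' "$model" "$((end - start))" "$status"
    return "$status"
}

# Example usage (uncomment to run):
# time_model whisper/turbo
```

Only whisper/turbo appears as a model ID on this page. IDs for other sizes are best copied from ostt model rather than guessed.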

For repeated local dictation, start Daemon Mode:

bash
ostt daemon start
ostt launch --paste

Daemon mode keeps the active local model loaded so each transcription avoids model load time.

External Local Engines

Built-in local support intentionally stays focused on Whisper-compatible models. If you want faster-whisper, Parakeet, Cohere Transcribe, Speaches, LocalAI, onnx-asr, or a custom research model, run that engine yourself and let OSTT call it.

Use command/<profile> when you have a CLI or wrapper script:

toml
[transcription]
provider = "command"
model = "parakeet"

[command.parakeet]
display_name = "Parakeet"
command = "/home/you/asr/parakeet-transcribe.sh {audio_path}"
output_format = "pcm_s16le -ar 16000"
timeout_secs = 300
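
A wrapper script for a command profile can be very small. The sketch below shows one possible shape for parakeet-transcribe.sh. It assumes that OSTT replaces {audio_path} with the path of the recording and reads the transcript from the script's standard output; the External Engines page is the authority on the actual contract. ASR_ENGINE_CMD is a hypothetical environment variable standing in for whatever CLI your engine provides.

```shell
#!/bin/sh
# Sketch of a wrapper such as /home/you/asr/parakeet-transcribe.sh.
# ASSUMPTION: OSTT passes the audio path as the first argument and reads
# the transcript from stdout. Verify this against the External Engines docs.
# ASR_ENGINE_CMD is a hypothetical variable naming your real engine CLI.

transcribe() {
    audio_path=$1

    if [ -z "$audio_path" ]; then
        echo "usage: parakeet-transcribe.sh AUDIO_PATH" >&2
        return 2
    fi
    if [ ! -r "$audio_path" ]; then
        echo "cannot read audio file: $audio_path" >&2
        return 1
    fi
    if [ -z "${ASR_ENGINE_CMD:-}" ]; then
        echo "ASR_ENGINE_CMD is not set" >&2
        return 3
    fi

    # Whatever the engine prints to stdout becomes the transcript, so the
    # engine should keep its own logging on stderr.
    # The variable is quoted, so it must be a single command name or path.
    "$ASR_ENGINE_CMD" "$audio_path"
}

# Run only when an argument is given, so the file can also be sourced.
if [ $# -gt 0 ]; then
    transcribe "$1"
fi
```

Keeping diagnostics on stderr and using distinct exit codes makes failures easier to tell apart from empty transcripts, whichever contract the engine ends up following.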

Use http/<profile> when your engine exposes an OpenAI-compatible /v1/audio/transcriptions endpoint:

toml
[transcription]
provider = "http"
model = "speaches"

[http.speaches]
display_name = "Speaches"
endpoint = "http://127.0.0.1:8000/v1/audio/transcriptions"
output_format = "pcm_s16le -ar 16000"
timeout_secs = 300

[http.speaches.params]
model = "Systran/faster-whisper-large-v3"
response_format = "json"

Then select or use it like any other model:

bash
ostt model select http/speaches
ostt -m http/speaches --paste
ostt retry -m command/parakeet

See External Engines for the full contract.
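
Before pointing OSTT at a local server, it can help to confirm that the endpoint answers a transcription request by itself. The sketch below sends one with curl, using the multipart fields of the OpenAI-style transcription API (file, model, response_format). check_transcribe_endpoint and sample.wav are illustrative names, and the model value should match whatever your server actually serves.

```shell
#!/bin/sh
# Sketch: send one transcription request straight to an OpenAI-compatible
# endpoint. check_transcribe_endpoint is an illustrative helper, not an
# OSTT command.

check_transcribe_endpoint() {
    endpoint=$1
    audio_path=$2
    model=$3

    if [ ! -r "$audio_path" ]; then
        echo "cannot read audio file: $audio_path" >&2
        return 1
    fi

    # --fail turns HTTP error statuses into a non-zero exit code.
    # --max-time mirrors the 300-second timeout used in the profile above.
    curl --silent --show-error --fail \
        --max-time 300 \
        -F "file=@$audio_path" \
        -F "model=$model" \
        -F "response_format=json" \
        "$endpoint"
}

# Example usage (sample.wav is a placeholder for any short recording):
# check_transcribe_endpoint \
#     http://127.0.0.1:8000/v1/audio/transcriptions \
#     sample.wav \
#     Systran/faster-whisper-large-v3
```

With response_format set to json, an OpenAI-compatible server is expected to reply with a JSON object that carries the transcript in a text field. If this request fails, the problem is in the server or the model name rather than in the OSTT profile.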

Developer Dictation Needs More Than a Model

For code, product names, acronyms, APIs, and unusual proper nouns, combine model choice with OSTT's transcript cleanup tools.

Add recognition hints before transcription:

bash
ostt keyword add Kubernetes
ostt keyword add VitePress
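
If you keep a project glossary, the same command can be run once per term. A minimal sketch, assuming a plain text file with one keyword per line; add_keywords_from_file and keywords.txt are illustrative names, and only ostt keyword add comes from OSTT.

```shell
#!/bin/sh
# Sketch: register every term from a glossary file as a recognition hint.
# add_keywords_from_file is an illustrative helper, not an OSTT command.

add_keywords_from_file() {
    keyword_file=$1

    if [ ! -r "$keyword_file" ]; then
        echo "cannot read keyword file: $keyword_file" >&2
        return 1
    fi

    # The "|| [ -n ... ]" part keeps a final line that lacks a newline.
    while IFS= read -r keyword || [ -n "$keyword" ]; do
        case "$keyword" in
            ''|'#'*) continue ;;  # skip blank lines and comment lines
        esac
        # Redirect stdin so the command cannot consume the rest of the file.
        ostt keyword add "$keyword" </dev/null
    done < "$keyword_file"
}

# Example usage (uncomment to run):
# add_keywords_from_file keywords.txt
```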

Fix final casing and common misrecognitions after transcription:

toml
[text.replace]
"ostt" = "OSTT"
"api" = "API"
"github" = "GitHub"
"open ai" = "OpenAI"

Then use processing actions for transformations that need AI or shell commands:

bash
ostt launch --paste -p clean
ostt transcribe meeting.mp3 -p summary -o summary.md

Tradeoffs

| Path | Privacy | Setup | Latency | Cost | Maintenance |
| --- | --- | --- | --- | --- | --- |
| Built-in local Whisper | Best | Medium | Hardware-dependent | Free after hardware | Low |
| Cloud STT | Audio leaves your machine | Easy once key exists | Usually good | Usage-based | Low |
| External command engine | Local if the command is local | Advanced | Depends on engine startup | Free after hardware | User-managed |
| External HTTP engine | Local if endpoint is local | Advanced | Good when server stays hot | Free after hardware | User-managed |