DeepInfra

DeepInfra hosts open speech recognition models behind its inference API. OSTT supports DeepInfra-hosted Whisper models and Voxtral speech-recognition models.

DeepInfra documentation:

Models

| Model ID | Notes |
| --- | --- |
| deepinfra/openai/whisper-large-v3 | High-accuracy multilingual Whisper Large V3 model. |
| deepinfra/openai/whisper-large-v3-turbo | Faster pruned Whisper Large V3 Turbo model. DeepInfra lists this at $0.00020 / minute. |
| deepinfra/openai/whisper-large | DeepInfra-documented best-accuracy Whisper model. |
| deepinfra/openai/whisper-medium | Faster, lighter Whisper model. |
| deepinfra/openai/whisper-small | Smaller Whisper model for lightweight transcription. |
| deepinfra/openai/whisper-base | Smallest supported Whisper model. |
| deepinfra/openai/whisper-timestamped-medium | Whisper Medium variant documented for per-word timestamp segmentation. |
| deepinfra/mistralai/Voxtral-Mini-3B-2507 | Voxtral Mini speech-recognition model for transcription, translation, and audio understanding. |
| deepinfra/mistralai/Voxtral-Small-24B-2507 | Larger Voxtral speech-recognition model. |

Params

```toml
[deepinfra."openai/whisper-large-v3".params]
language = "sv"
initial_prompt = "Names: OSTT, DeepInfra, Whisper."
temperature = 0.0
task = "transcribe"
chunk_level = "segment"
chunk_length_s = 30
```

```bash
ostt transcribe meeting.mp3 -m deepinfra/openai/whisper-large-v3 --param language=sv --param initial_prompt=OSTT
ostt model params deepinfra/openai/whisper-large-v3-turbo --format json
```
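
The two examples above name the same model in two ways: the CLI takes the full model ID, while the TOML table splits it into a provider key and a quoted model path. The following is a minimal sketch of that mapping, inferred only from the examples on this page. The function name `params_table_header` is hypothetical and not part of OSTT.

```python
def params_table_header(model_id: str) -> str:
    """Build a TOML params table header from a model ID.

    Assumption (inferred from the examples on this page, not from OSTT's
    source): the first path component is the provider key and the rest is
    the model path.
    """
    provider, sep, model_path = model_id.partition("/")
    if not sep or not provider or not model_path:
        raise ValueError(f"expected '<provider>/<model path>', got {model_id!r}")
    # The model path contains '/', so it has to be a quoted TOML key.
    return f'[{provider}."{model_path}".params]'


print(params_table_header("deepinfra/openai/whisper-large-v3"))
# prints: [deepinfra."openai/whisper-large-v3".params]
```

The quoting matters because a bare TOML key cannot contain `/`.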
| Param | Type | Description |
| --- | --- | --- |
| language | string | Optional language hint. |
| initial_prompt | string | Optional text prompt for the first transcription window. Saved ostt keyword terms are used as fallback only when initial_prompt is not set. |
| temperature | number | Sampling temperature, 0.0 to 1.0. |
| task | string | Supported values: transcribe, translate. |
| chunk_level | string | Supported values: segment, word. DeepInfra documents this as the chunk level for timestamp segmentation. |
| chunk_length_s | integer | Chunk length in seconds. DeepInfra documents 1 to 30, default 30. |
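
The documented types and ranges can be checked before a request is sent. The sketch below encodes only the constraints listed above. `validate_params` is a hypothetical helper, not an OSTT or DeepInfra function, and OSTT may validate these values differently or not at all.

```python
# Allowed values and ranges as listed in the params table on this page.
ALLOWED_TASKS = ("transcribe", "translate")
ALLOWED_CHUNK_LEVELS = ("segment", "word")
KNOWN_PARAMS = {
    "language", "initial_prompt", "temperature",
    "task", "chunk_level", "chunk_length_s",
}


def validate_params(params: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []

    for name in sorted(set(params) - KNOWN_PARAMS):
        errors.append(f"unknown param: {name}")

    for name in ("language", "initial_prompt"):
        if name in params and not isinstance(params[name], str):
            errors.append(f"{name} must be a string")

    if "temperature" in params:
        t = params["temperature"]
        # bool is a subclass of int in Python, so reject it explicitly.
        is_number = isinstance(t, (int, float)) and not isinstance(t, bool)
        if not is_number or not 0.0 <= t <= 1.0:
            errors.append("temperature must be a number from 0.0 to 1.0")

    if "task" in params and params["task"] not in ALLOWED_TASKS:
        errors.append("task must be 'transcribe' or 'translate'")

    if "chunk_level" in params and params["chunk_level"] not in ALLOWED_CHUNK_LEVELS:
        errors.append("chunk_level must be 'segment' or 'word'")

    if "chunk_length_s" in params:
        n = params["chunk_length_s"]
        is_int = isinstance(n, int) and not isinstance(n, bool)
        if not is_int or not 1 <= n <= 30:
            errors.append("chunk_length_s must be an integer from 1 to 30")

    return errors


example = {
    "language": "sv",
    "initial_prompt": "Names: OSTT, DeepInfra, Whisper.",
    "temperature": 0.0,
    "task": "transcribe",
    "chunk_level": "segment",
    "chunk_length_s": 30,
}
print(validate_params(example))                   # prints: []
print(validate_params({"chunk_length_s": 45}))    # one range error
```

Returning a list of problems rather than raising on the first one lets a caller report every invalid value in a single pass.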

Audio Formats

DeepInfra's speech API documents direct upload support for mp3 and wav. Responses include a top-level text field and segment timestamps; OSTT returns the transcript text.
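
As a rough illustration of the two points above, the sketch below checks a file name against the documented direct-upload formats and reads the top-level `text` field from a response body. Both function names are hypothetical. The sample payload is made up for this example, and the `segments` entry fields (`start`, `end`, `text`) are an assumption rather than a documented schema.

```python
import json
from pathlib import Path

# Formats this page lists as documented for direct upload.
DIRECT_UPLOAD_SUFFIXES = {".mp3", ".wav"}


def is_direct_upload_format(path: str) -> bool:
    """True if the file extension is one of the documented upload formats."""
    return Path(path).suffix.lower() in DIRECT_UPLOAD_SUFFIXES


def transcript_text(response_body: str) -> str:
    """Return the top-level transcript text, ignoring segment timestamps."""
    payload = json.loads(response_body)
    return payload["text"]


# Made-up response body; only the top-level "text" field is relied upon.
sample = json.dumps({
    "text": "hello world",
    "segments": [{"start": 0.0, "end": 1.2, "text": "hello world"}],
})

print(is_direct_upload_format("meeting.mp3"))   # prints: True
print(is_direct_upload_format("meeting.flac"))  # prints: False
print(transcript_text(sample))                  # prints: hello world
```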