
Mistral launched its first speech understanding models on Tuesday. Dubbed Voxtral, the open-source audio artificial intelligence (AI) models not only transcribe speech into text but can also understand spoken content and respond to it natively. They are available in two sizes: 24 billion parameters and three billion parameters. The Paris-based AI firm highlighted that Voxtral is not only free to download but is also offered at an affordable rate via an application programming interface (API).
In a newsroom post, Mistral calls voice “humanity’s first interface,” highlighting it as a foundational pillar of communication. As AI models become more capable, the French AI firm said it was important to bring human-computer interactions to this natural interface.
However, there are some gaps in this effort. Mistral claimed today’s voice-focused AI models can be grouped into two categories: open-source models that have a high word error rate and limited semantic understanding, and closed proprietary models that are very expensive and not accessible to all.
Voxtral, an open-source model with native semantic understanding, is aimed at closing this gap, the company added. There are three models in total: Voxtral Small with 24B parameters, Voxtral Mini with 3B parameters, and Voxtral Mini Transcribe with 3B parameters. All of these models are available to the open community under the Apache 2.0 licence, which permits both academic and commercial usage.
Mistral claims Voxtral offers the best balance between performance and cost efficiency
Photo Credit: Mistral
Notably, Voxtral Small is the company’s premium model aimed at production-scale applications, while Voxtral Mini is designed for local and edge deployments. Voxtral Mini Transcribe is focused on transcription-related tasks and is said to outperform OpenAI Whisper.
Voxtral models have a context window of 32,000 tokens, which translates to up to 30 minutes of transcription or 40 minutes of voice understanding. They can also answer questions about audio content and generate summaries natively. Additionally, Voxtral can detect multiple languages, including English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian, and more.
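The duration limit above implies that longer recordings would need to be split before transcription. A minimal Python sketch of that planning step follows, assuming the 30-minute transcription figure is treated as a hard cap per request. The function name `plan_segments` and the fixed-length splitting strategy are illustrative assumptions, not part of Mistral's tooling.

```python
import math

# Transcription limit quoted in the article, treated here as a hard cap.
MAX_TRANSCRIPTION_MINUTES = 30


def plan_segments(total_minutes, max_minutes=MAX_TRANSCRIPTION_MINUTES):
    """Return (start, end) offsets in minutes, each span at most max_minutes long."""
    if total_minutes <= 0:
        return []
    count = math.ceil(total_minutes / max_minutes)
    return [
        (i * max_minutes, min((i + 1) * max_minutes, total_minutes))
        for i in range(count)
    ]


# A 75-minute recording needs three segments.
print(plan_segments(75))  # prints [(0, 30), (30, 60), (60, 75)]
```

In practice, cutting at silences rather than at fixed offsets would likely avoid splitting words, but that requires inspecting the audio itself.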
Built on top of Mistral Small 3.1, Voxtral models also offer function calling via voice, so users can command the AI system without having to type anything. Mistral claims that the Voxtral Small model outperforms GPT-4o mini Transcribe and Gemini 2.5 Flash across tasks, and surpasses ElevenLabs Scribe in multilingual capabilities.
The Voxtral models can be downloaded from the company’s Hugging Face listing, accessed via API at a starting price of $0.001 (roughly Rs. 0.09) per minute, or tried out via Mistral’s Le Chat platform.
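To put the per-minute price in perspective, here is a rough Python sketch that estimates the API bill for a given amount of audio. It assumes the $0.001 starting rate applies uniformly to every minute. The helper name `estimate_cost_usd` is hypothetical, and actual pricing may vary by model.

```python
from decimal import Decimal

# Starting API price quoted in the article; actual rates may differ by model.
PRICE_PER_MINUTE_USD = Decimal("0.001")


def estimate_cost_usd(audio_minutes):
    """Estimate the API cost in US dollars for a number of audio minutes."""
    if audio_minutes < 0:
        raise ValueError("audio_minutes must be non-negative")
    # Convert through str so float inputs do not carry binary rounding noise.
    return Decimal(str(audio_minutes)) * PRICE_PER_MINUTE_USD


# 1,000 hours of audio is 60,000 minutes.
print(estimate_cost_usd(1000 * 60))  # prints 60.000
```

At the quoted starting rate, 1,000 hours of audio would come to about $60.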