Note
This feature is currently in public preview. This preview is provided without a service-level agreement, and is not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
LLM speech is powered by a large-language-model-enhanced speech model that delivers improved quality, deep contextual understanding, multilingual support, and prompt-tuning capabilities. It uses GPU acceleration for ultra-fast inference, making it ideal for a wide range of scenarios including generating captions and subtitles from audio files, summarizing meeting notes, assisting call center agents, transcribing voicemails, and more.
Feature availability
This table shows which transcription features are supported by the fast transcription API, with and without LLM speech (enhanced mode):
| Feature | Fast Transcription (default) | LLM Speech (enhanced) |
|---|---|---|
| Transcription | ✅ (transcription Speech models) | ✅ (multimodal model) |
| Translation | ❌ | ✅ (multimodal model) |
| Diarization | ✅ | ✅ |
| Channel (stereo) | ✅ | ✅ |
| Profanity filtering | ✅ | ✅ |
| Specify locale | ✅ | ❌ (use prompting to implement) |
| Custom prompting | ❌ | ✅ |
| Phrase list | ✅ | ❌ (use prompting to implement) |
For LLM speech (enhanced mode), use prompting to guide the output style instead of using explicit locale or phrase lists.
You can try LLM speech in Microsoft Foundry without writing any code.
Prerequisites
- An Azure subscription. Create one for free.
- A Foundry project. If you need to create a project, see Create a Microsoft Foundry project.
Try LLM speech
- Sign in to Microsoft Foundry. Make sure the New Foundry toggle is on. These steps refer to Foundry (new).
- Select Build from the top right menu.
- Select Models on the left pane.
- The AI Services tab shows the Foundry models that can be used out of the box in the Foundry portal. Select Azure Speech - Speech to text to open the Speech to Text playground.
- In the top dropdown, select LLM speech.
- Optionally use the Parameters section to change the language, profanity policy, and other settings. You can also add special instructions for the LLM.
- Use the Upload files section to select your audio file. Then select Start.
- View the transcription output in the Transcript tab. Optionally view the raw API response output in the JSON tab.
- Switch to the Code tab to get sample code for using LLM speech in your application.
Prerequisites
An Azure Speech in Foundry Tools resource in one of the regions where the LLM speech API is available. For the current list of supported regions, see Speech service regions.
An audio file (less than 5 hours long and less than 500 MB in size) in one of the formats and codecs supported by the batch transcription API: WAV, MP3, OPUS/OGG, FLAC, WMA, AAC, ALAW in WAV container, MULAW in WAV container, AMR, WebM, and SPEEX. For more information about supported audio formats, see supported audio formats.
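If you want to fail fast before uploading, the size and container limits above can be checked locally. A minimal sketch (the helper name and extension list are illustrative, not part of any SDK; duration would require an audio library and isn't checked here):

```python
import os

MAX_BYTES = 500 * 1024 * 1024  # fast transcription limit: less than 500 MB

# File extensions for the containers listed above (illustrative, not exhaustive)
SUPPORTED_EXTENSIONS = {
    ".wav", ".mp3", ".opus", ".ogg", ".flac", ".wma",
    ".aac", ".amr", ".webm", ".spx",
}

def check_audio_file(path):
    """Raise ValueError if the file obviously violates the documented limits."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported audio extension: {ext}")
    if os.path.getsize(path) >= MAX_BYTES:
        raise ValueError("Audio file must be less than 500 MB")
```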
Use the LLM speech API
Supported languages
The following languages are currently supported for both transcribe and translate tasks:
English, Chinese, German, French, Italian, Japanese, Spanish, Portuguese, and Korean.
Upload audio
You can provide audio data in the following ways:

- Pass inline audio data:

  --form 'audio=@"YourAudioFile"'

- Upload an audio file from a public audioUrl:

  --form 'definition="{\"audioUrl\": \"https://crbn.us/hello.wav\"}"'

Tip

For long audio files, uploading from a public URL is recommended.

The following sections use inline audio upload as an example.
Call the LLM speech API
Make a multipart/form-data POST request to the transcriptions endpoint with the audio file and the request body properties.
The following examples show how to transcribe an audio file. If you know the locale of the audio file, you can specify it to improve transcription accuracy and minimize latency. In each request:

- Replace YourSpeechResourceKey with your Speech resource key.
- Replace YourServiceRegion with your Speech resource region.
- Replace YourAudioFile with the path to your audio file.
Important
For the recommended keyless authentication with Microsoft Entra ID, replace --header 'Ocp-Apim-Subscription-Key: YourSpeechResourceKey' with --header "Authorization: Bearer YourAccessToken". For more information about keyless authentication, see the role-based access control how-to guide.
Use LLM speech to transcribe an audio file
You can transcribe audio in the input language without specifying a locale code. The model automatically detects and selects the appropriate language based on the audio content.
curl --location 'https://<YourServiceRegion>.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?api-version=2025-10-15' \
--header 'Content-Type: multipart/form-data' \
--header 'Ocp-Apim-Subscription-Key: <YourSpeechResourceKey>' \
--form 'audio=@"YourAudioFile.wav"' \
--form 'definition={
    "enhancedMode": {
        "enabled": true,
        "task": "transcribe"
    }
}'
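If you're calling the REST API from code rather than curl, the same request pieces can be assembled programmatically. The following Python sketch mirrors the request above (the helper name is ours; actually sending the request requires an HTTP client such as the third-party requests library, shown in the trailing comment):

```python
import json

API_VERSION = "2025-10-15"

def build_transcribe_request(region, key, task="transcribe"):
    """Assemble URL, query, headers, and form fields for transcriptions:transcribe."""
    url = (
        f"https://{region}.api.cognitive.microsoft.com"
        "/speechtotext/transcriptions:transcribe"
    )
    definition = {"enhancedMode": {"enabled": True, "task": task}}
    return {
        "url": url,
        "params": {"api-version": API_VERSION},
        "headers": {"Ocp-Apim-Subscription-Key": key},
        "data": {"definition": json.dumps(definition)},
    }

# Sending it with the requests library would look like:
#   req = build_transcribe_request("<YourServiceRegion>", "<YourSpeechResourceKey>")
#   with open("YourAudioFile.wav", "rb") as audio:
#       resp = requests.post(req["url"], params=req["params"],
#                            headers=req["headers"], data=req["data"],
#                            files={"audio": audio})
```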
Use LLM speech to translate an audio file
You can translate audio into a specified target language. To enable translation, you must provide the target language code in the request.
curl --location 'https://<YourServiceRegion>.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?api-version=2025-10-15' \
--header 'Content-Type: multipart/form-data' \
--header 'Ocp-Apim-Subscription-Key: <YourSpeechResourceKey>' \
--form 'audio=@"YourAudioFile.wav"' \
--form 'definition={
    "enhancedMode": {
        "enabled": true,
        "task": "translate",
        "targetLanguage": "ko"
    }
}'
Use prompt-tuning to alter performance
You can provide optional prompt text to guide the output style for the transcribe or translate task.
curl --location 'https://<YourServiceRegion>.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?api-version=2025-10-15' \
--header 'Content-Type: multipart/form-data' \
--header 'Ocp-Apim-Subscription-Key: <YourSpeechResourceKey>' \
--form 'audio=@"YourAudioFile.wav"' \
--form 'definition={
    "enhancedMode": {
        "enabled": true,
        "task": "transcribe",
        "prompt": ["Output must be in lexical format."]
    }
}'
Here are some best practices for prompts:
- Prompts are subject to a maximum length of 4,096 characters.
- Prompts should preferably be written in English.
- Prompts can guide output formatting. By default, responses use a display format optimized for readability. To enforce lexical formatting, include: `Output must be in lexical format.`
- Prompts can amplify the salience of specific phrases or acronyms, improving recognition likelihood. Use: `Pay attention to *phrase1*, *phrase2*, …`. For best results, limit the number of phrases per prompt.
- Prompts that aren't related to speech tasks (for example, `Tell me a story.`) are typically disregarded.
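Because LLM speech has no phrase list parameter, phrase emphasis goes through the prompt. A small illustrative helper (the function name is ours, not part of any SDK) that builds such a prompt and guards the 4,096-character limit:

```python
MAX_PROMPT_CHARS = 4096  # documented maximum prompt length

def build_phrase_prompt(phrases):
    """Build a 'Pay attention to ...' prompt from a list of phrases or acronyms."""
    if not phrases:
        return ""
    prompt = "Pay attention to " + ", ".join(phrases) + "."
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError("Prompt exceeds the 4,096-character limit")
    return prompt
```

The result can be passed as one element of the prompt array in the request definition. Keep the phrase count small, per the guidance above.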
More configuration options
You can combine LLM speech with other fast transcription configuration options to enable features such as diarization, profanityFilterMode, and channels.
curl --location 'https://<YourServiceRegion>.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?api-version=2025-10-15' \
--header 'Content-Type: multipart/form-data' \
--header 'Ocp-Apim-Subscription-Key: <YourSpeechResourceKey>' \
--form 'audio=@"YourAudioFile.wav"' \
--form 'definition={
    "enhancedMode": {
        "enabled": true,
        "task": "transcribe",
        "prompt": ["Output must be in lexical format."]
    },
    "diarization": {
        "maxSpeakers": 2,
        "enabled": true
    },
    "profanityFilterMode": "Masked"
}'
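For reference, a definition like the one above can be built in code and then serialized into the definition form field. A sketch using plain Python dictionaries (nothing here is an SDK type; the values mirror the example above):

```python
import json

definition = {
    "enhancedMode": {
        "enabled": True,
        "task": "transcribe",
        "prompt": ["Output must be in lexical format."],
    },
    "diarization": {"maxSpeakers": 2, "enabled": True},
    "profanityFilterMode": "Masked",
}

# Serialize for the multipart form field named "definition"
definition_field = json.dumps(definition)
```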
Some configuration options, such as locales and phraseLists, are either not required or not applicable with LLM speech, and can be omitted from the request. For more information, see the configuration options for fast transcription.
Use the mai-transcribe model (Preview)
You can also use the mai-transcribe-1 model provided by Microsoft AI (MAI) with the LLM Speech API.
For the current list of regions where the mai-transcribe model is supported, see Speech service regions.
The following languages are currently supported for mai-transcribe-1 model:
Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Norwegian Bokmål, Polish, Portuguese, Romanian, Russian, Spanish, Swedish, Thai, Turkish, and Vietnamese.
To use the mai-transcribe-1 model, set the model property accordingly in the request.
curl --location 'https://<YourServiceRegion>.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?api-version=2025-10-15' \
--header 'Content-Type: multipart/form-data' \
--header 'Ocp-Apim-Subscription-Key: <YourSpeechResourceKey>' \
--form 'audio=@"YourAudioFile.wav"' \
--form 'definition={
"enhancedMode": {
"enabled": true,
"model":"mai-transcribe-1"
}
}'
There are a few extra limits when using the mai-transcribe model:
- The audio file must be less than 70 MB in size.
- Diarization isn't supported.
Sample response
In the JSON response, the combinedPhrases property contains the full transcribed or translated text, and the phrases property contains segment-level and word-level details.
{
"durationMilliseconds": 57187,
"combinedPhrases": [
{
"text": "With custom speech,you can evaluate and improve the microsoft speech to text accuracy for your applications and products 现成的语音转文本,利用通用语言模型作为一个基本模型,使用microsoft自有数据进行训练,并反映常用的口语。此基础模型使用那些代表各常见领域的方言和发音进行了预先训练。 Quand vous effectuez une demande de reconnaissance vocale, le modèle de base le plus récent pour chaque langue prise en charge est utilisé par défaut. Le modèle de base fonctionne très bien dans la plupart des scénarios de reconnaissance vocale. A custom model can be used to augment the base model to improve recognition of domain specific vocabulary specified to the application by providing text data to train the model. It can also be used to improve recognition based for the specific audio conditions of the application by providing audio data with reference transcriptions."
}
],
"phrases": [
{
"offsetMilliseconds": 80,
"durationMilliseconds": 6960,
"text": "With custom speech,you can evaluate and improve the microsoft speech to text accuracy for your applications and products.",
"words": [
{
"text": "with",
"offsetMilliseconds": 80,
"durationMilliseconds": 160
},
{
"text": "custom",
"offsetMilliseconds": 240,
"durationMilliseconds": 480
},
{
"text": "speech",
"offsetMilliseconds": 720,
"durationMilliseconds": 360
},
// More transcription results...
// Redacted for brevity
],
"locale": "en-us",
"confidence": 0
},
{
"offsetMilliseconds": 8000,
"durationMilliseconds": 8600,
"text": "现成的语音转文本,利用通用语言模型作为一个基本模型,使用microsoft自有数据进行训练,并反映常用的口语。此基础模型使用那些代表各常见领域的方言和发音进行了预先训练。",
"words": [
{
"text": "现",
"offsetMilliseconds": 8000,
"durationMilliseconds": 40
},
{
"text": "成",
"offsetMilliseconds": 8040,
"durationMilliseconds": 40
},
// More transcription results...
// Redacted for brevity
{
"text": "训",
"offsetMilliseconds": 16400,
"durationMilliseconds": 40
},
{
"text": "练",
"offsetMilliseconds": 16560,
"durationMilliseconds": 40
},
],
"locale": "zh-cn",
"confidence": 0
// More transcription results...
// Redacted for brevity
{
"text": "with",
"offsetMilliseconds": 54720,
"durationMilliseconds": 200
},
{
"text": "reference",
"offsetMilliseconds": 54920,
"durationMilliseconds": 360
},
{
"text": "transcriptions.",
"offsetMilliseconds": 55280,
"durationMilliseconds": 1200
}
],
"locale": "en-us",
"confidence": 0
}
]
}
The response format is consistent with other existing speech-to-text outputs, such as fast transcription and batch transcription. Key differences include:
- Word-level `durationMilliseconds` and `offsetMilliseconds` aren't supported for the `translate` task.
- Diarization isn't supported for the `translate` task; only the `speaker1` label is returned.
- `confidence` isn't available and is always `0`.
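The fields described above can be extracted with a few lines of code. A Python sketch against a trimmed-down response shaped like the sample (the function and variable names are ours):

```python
def summarize_phrases(response):
    """Return (offset ms, locale, text) for each phrase in a transcription response."""
    return [
        (p["offsetMilliseconds"], p["locale"], p["text"])
        for p in response.get("phrases", [])
    ]

# A trimmed-down response shaped like the sample above
sample = {
    "combinedPhrases": [{"text": "Hello world."}],
    "phrases": [
        {"offsetMilliseconds": 80, "durationMilliseconds": 500,
         "locale": "en-us", "confidence": 0, "text": "Hello world.", "words": []},
    ],
}

full_text = sample["combinedPhrases"][0]["text"]  # the full transcript
```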
Reference documentation | Package (PyPi) | GitHub Samples
Prerequisites
- An Azure subscription. Create one for free.
- Python 3.9 or later version. If you don't have a suitable version of Python installed, you can follow the instructions in the VS Code Python Tutorial for the easiest way of installing Python on your operating system.
- A Microsoft Foundry resource created in one of the supported regions. For more information about region availability, see Region support.
- A sample `.wav` audio file to transcribe.
Microsoft Entra ID prerequisites
For the recommended keyless authentication with Microsoft Entra ID, you need to:
- Install the Azure CLI used for keyless authentication with Microsoft Entra ID.
- Assign the `Cognitive Services User` role to your user account. You can assign roles in the Azure portal under Access control (IAM) > Add role assignment.
Setup
Create a new folder named `llm-speech-quickstart` and go to the quickstart folder with the following command:

```shell
mkdir llm-speech-quickstart && cd llm-speech-quickstart
```

Create and activate a virtual Python environment to install the packages you need for this tutorial. We recommend you always use a virtual or conda environment when installing Python packages. Otherwise, you can break your global installation of Python. If you already have Python 3.9 or higher installed, create a virtual environment with the venv module and activate it.

When you activate the Python environment, running `python` or `pip` from the command line uses the Python interpreter in the `.venv` folder of your application. Use the `deactivate` command to exit the Python virtual environment. You can reactivate it later when needed.

Create a file named `requirements.txt`. Add the following packages to the file:

```text
azure-ai-transcription
azure-identity
python-dotenv
```

Install the packages:

```shell
pip install -r requirements.txt
```
Set environment variables
You need to retrieve your resource endpoint and API key for authentication.
Sign in to Foundry portal (classic).
Select Management center from the left menu.
Select Connected resources on the left, and find your Microsoft Foundry resource (or add a connection if it isn't there). Then copy the API Key and Target (endpoint) values. Use these values to set environment variables.
Set the following environment variables:
Note
For Microsoft Entra ID authentication (recommended for production), install azure-identity and configure authentication as described in the Microsoft Entra ID prerequisites section.
Transcribe audio with LLM speech
LLM speech uses the EnhancedModeProperties class to enable large-language-model-enhanced transcription. The model automatically detects the language in your audio.
Create a file named `llm_speech_transcribe.py` with the following code:

```python
import os
from dotenv import load_dotenv
from azure.core.credentials import AzureKeyCredential
from azure.ai.transcription import TranscriptionClient

load_dotenv()

from azure.ai.transcription.models import (
    TranscriptionContent,
    TranscriptionOptions,
    EnhancedModeProperties,
)

# Get configuration from environment variables
endpoint = os.environ["AZURE_SPEECH_ENDPOINT"]

# Optional: we recommend using role based access control (RBAC) for production scenarios
api_key = os.environ.get("AZURE_SPEECH_API_KEY")

if api_key:
    credential = AzureKeyCredential(api_key)
else:
    from azure.identity import DefaultAzureCredential
    credential = DefaultAzureCredential()

# Create the transcription client
client = TranscriptionClient(endpoint=endpoint, credential=credential)

# Path to your audio file (replace with your own file path)
audio_file_path = "<path-to-your-audio-file.wav>"

# Open and read the audio file
with open(audio_file_path, "rb") as audio_file:
    # Create enhanced mode properties for LLM speech transcription
    enhanced_mode = EnhancedModeProperties(
        task="transcribe",
        prompt=[],
    )

    # Create transcription options with enhanced mode
    options = TranscriptionOptions(enhanced_mode=enhanced_mode)

    # Create the request content
    request_content = TranscriptionContent(definition=options, audio=audio_file)

    # Transcribe the audio
    result = client.transcribe(request_content)

# Print the transcription result
print(f"Transcription: {result.combined_phrases[0].text}")

# Print detailed phrase information
if result.phrases:
    print("\nDetailed phrases:")
    for phrase in result.phrases:
        print(f"  [{phrase.offset_milliseconds}ms]: {phrase.text}")
```

Reference: TranscriptionClient | TranscriptionContent | TranscriptionOptions | EnhancedModeProperties
Replace `<path-to-your-audio-file.wav>` with the path to your audio file. The service supports WAV, MP3, FLAC, OGG, and other common audio formats.

Run the Python script:
python llm_speech_transcribe.py
Output
The script prints the transcription result to the console:
Transcription: Hi there. This is a sample voice recording created for speech synthesis testing. The quick brown fox jumps over the lazy dog. Just a fun way to include every letter of the alphabet. Numbers, like one, two, three, are spoken clearly. Let's see how well this voice captures tone, timing, and natural rhythm. This audio is provided by samplefiles.com.
Detailed phrases:
[40ms]: Hi there.
[800ms]: This is a sample voice recording created for speech synthesis testing.
[5440ms]: The quick brown fox jumps over the lazy dog.
[9040ms]: Just a fun way to include every letter of the alphabet.
[12720ms]: Numbers, like one, two, three, are spoken clearly.
[17200ms]: Let's see how well this voice captures tone, timing, and natural rhythm.
[22480ms]: This audio is provided by samplefiles.com.
Translate audio with LLM speech
You can also use LLM speech to translate audio into a target language. Set the task to translate and specify the target_language.
Create a file named `llm_speech_translate.py` with the following code:

```python
import os
from dotenv import load_dotenv
from azure.core.credentials import AzureKeyCredential
from azure.ai.transcription import TranscriptionClient

load_dotenv()

from azure.ai.transcription.models import (
    TranscriptionContent,
    TranscriptionOptions,
    EnhancedModeProperties,
)

# Get configuration from environment variables
endpoint = os.environ["AZURE_SPEECH_ENDPOINT"]

# Optional: we recommend using role based access control (RBAC) for production scenarios
api_key = os.environ.get("AZURE_SPEECH_API_KEY")

if api_key:
    credential = AzureKeyCredential(api_key)
else:
    from azure.identity import DefaultAzureCredential
    credential = DefaultAzureCredential()

# Create the transcription client
client = TranscriptionClient(endpoint=endpoint, credential=credential)

# Path to your audio file (replace with your own file path)
audio_file_path = "<path-to-your-audio-file.wav>"

# Open and read the audio file
with open(audio_file_path, "rb") as audio_file:
    # Create enhanced mode properties for LLM speech translation
    # Translate to another language
    enhanced_mode = EnhancedModeProperties(
        task="translate",
        target_language="de",
        prompt=[
            "Translate the following audio to German.",
            "Convert number words to numbers."
        ],  # Optional prompts to guide the enhanced mode
    )

    # Create transcription options with enhanced mode
    options = TranscriptionOptions(locales=["en-US"], enhanced_mode=enhanced_mode)

    # Create the request content
    request_content = TranscriptionContent(definition=options, audio=audio_file)

    # Translate the audio
    result = client.transcribe(request_content)

# Print the translation result
print(f"Translation: {result.combined_phrases[0].text}")
```

Reference: TranscriptionClient | EnhancedModeProperties
Replace `<path-to-your-audio-file.wav>` with the path to your audio file.

Run the Python script:
python llm_speech_translate.py
Use prompt-tuning
You can provide an optional prompt to guide the output style for transcription or translation tasks. Replace the prompt value in the EnhancedModeProperties object.
```python
import os
from dotenv import load_dotenv
from azure.core.credentials import AzureKeyCredential
from azure.ai.transcription import TranscriptionClient

load_dotenv()

from azure.ai.transcription.models import (
    TranscriptionContent,
    TranscriptionOptions,
    EnhancedModeProperties,
)

# Get configuration from environment variables
endpoint = os.environ["AZURE_SPEECH_ENDPOINT"]

# Optional: we recommend using role based access control (RBAC) for production scenarios
api_key = os.environ.get("AZURE_SPEECH_API_KEY")

if api_key:
    credential = AzureKeyCredential(api_key)
else:
    from azure.identity import DefaultAzureCredential
    credential = DefaultAzureCredential()

# Create the transcription client
client = TranscriptionClient(endpoint=endpoint, credential=credential)

# Path to your audio file (replace with your own file path)
audio_file_path = "<path-to-your-audio-file.wav>"

# Open and read the audio file
with open(audio_file_path, "rb") as audio_file:
    # Create enhanced mode properties for LLM speech transcription
    enhanced_mode = EnhancedModeProperties(
        task="transcribe",
        prompt=[
            "Create lexical output only.",
            "Convert number words to numbers."
        ],  # Optional prompts to guide the enhanced mode
    )

    # Create transcription options with enhanced mode
    options = TranscriptionOptions(enhanced_mode=enhanced_mode)

    # Create the request content
    request_content = TranscriptionContent(definition=options, audio=audio_file)

    # Print request content for debugging
    print("Request Content:", request_content, "\n")

    # Transcribe the audio
    result = client.transcribe(request_content)

# Print the transcription result
print(f"Transcription: {result.combined_phrases[0].text}")

# Print detailed phrase information
if result.phrases:
    print("\nDetailed phrases:")
    for phrase in result.phrases:
        print(f"  [{phrase.offset_milliseconds}ms]: {phrase.text}")
```
Best practices for prompts:
- Prompts are subject to a maximum length of 4,096 characters.
- Prompts should preferably be written in English.
- Use `Output must be in lexical format.` to enforce lexical formatting instead of the default display format.
- Use `Pay attention to *phrase1*, *phrase2*, …` to improve recognition of specific phrases or acronyms.
Reference: EnhancedModeProperties
Output
The script prints the transcription result to the console:
Transcription: Hello, this is a test of the LLM speech transcription service.
Detailed phrases:
[0ms]: Hello, this is a test
[1500ms]: of the LLM speech transcription service.
Reference documentation | Package (NuGet) | GitHub samples
Prerequisites
- An Azure subscription. Create one for free.
- .NET 8.0 SDK or later.
- A Microsoft Foundry resource created in one of the supported regions. For more information about region availability, see Region support.
- A sample `.wav` audio file to transcribe.
Microsoft Entra ID prerequisites
For the recommended keyless authentication with Microsoft Entra ID, you need to:
- Install the Azure CLI used for keyless authentication with Microsoft Entra ID.
- Sign in with the Azure CLI by running `az login`.
- Assign the `Cognitive Services User` role to your user account. You can assign roles in the Azure portal under Access control (IAM) > Add role assignment.
Set up the project
Create a new console application with the .NET CLI:
```shell
dotnet new console -n llm-speech-quickstart
cd llm-speech-quickstart
```

Install the required packages:

```shell
dotnet add package Azure.AI.Speech.Transcription --prerelease
dotnet add package Azure.Identity
```
Retrieve resource information
You need to retrieve your resource endpoint for authentication.
Sign in to Foundry portal.
Select Management center from the left menu. Under Connected resources, select your Speech or multi-service resource.
Select Keys and Endpoint.
Copy the Endpoint value and set it as an environment variable:
$env:AZURE_SPEECH_ENDPOINT="<your-speech-endpoint>"
Transcribe audio with LLM speech
LLM speech uses the EnhancedModeProperties class to enable large-language-model-enhanced transcription. Enhanced mode is automatically enabled when you create an EnhancedModeProperties instance. The model automatically detects the language in your audio.
Replace the contents of Program.cs with the following code:
using System;
using System.ClientModel;
using System.Linq;
using System.Threading.Tasks;
using Azure.AI.Speech.Transcription;
using Azure.Identity;
Uri endpoint = new Uri(Environment.GetEnvironmentVariable("AZURE_SPEECH_ENDPOINT")
?? throw new InvalidOperationException("Set the AZURE_SPEECH_ENDPOINT environment variable."));
// Use DefaultAzureCredential for keyless authentication (recommended).
// To use an API key instead, replace with:
// ApiKeyCredential credential = new ApiKeyCredential("<your-api-key>");
var credential = new DefaultAzureCredential();
TranscriptionClient client = new TranscriptionClient(endpoint, credential);
string audioFilePath = "<path-to-your-audio-file.wav>";
using FileStream audioStream = File.OpenRead(audioFilePath);
// Create enhanced mode properties for LLM speech transcription
TranscriptionOptions options = new TranscriptionOptions(audioStream)
{
EnhancedMode = new EnhancedModeProperties
{
Task = "transcribe"
}
};
ClientResult<TranscriptionResult> response = await client.TranscribeAsync(options);
// Print combined transcription
foreach (var combinedPhrase in response.Value.CombinedPhrases)
{
Console.WriteLine($"Transcription: {combinedPhrase.Text}");
}
// Print detailed phrase information
foreach (var channel in response.Value.PhrasesByChannel)
{
Console.WriteLine("\nDetailed phrases:");
foreach (var phrase in channel.Phrases)
{
Console.WriteLine($" [{phrase.Offset}] ({phrase.Locale}): {phrase.Text}");
}
}
Replace <path-to-your-audio-file.wav> with the path to your audio file. The service supports WAV, MP3, FLAC, OGG, and other common audio formats.
Run the application:
dotnet run
Reference: TranscriptionClient, EnhancedModeProperties
Translate audio with LLM speech
You can also use LLM speech to translate audio into a target language. Set the Task to translate and specify the TargetLanguage:
using System;
using System.ClientModel;
using System.Linq;
using System.Threading.Tasks;
using Azure.AI.Speech.Transcription;
using Azure.Identity;
Uri endpoint = new Uri(Environment.GetEnvironmentVariable("AZURE_SPEECH_ENDPOINT")
?? throw new InvalidOperationException("Set the AZURE_SPEECH_ENDPOINT environment variable."));
var credential = new DefaultAzureCredential();
TranscriptionClient client = new TranscriptionClient(endpoint, credential);
string audioFilePath = "<path-to-your-audio-file.wav>";
using FileStream audioStream = File.OpenRead(audioFilePath);
// Create enhanced mode properties for LLM speech translation
TranscriptionOptions options = new TranscriptionOptions(audioStream)
{
EnhancedMode = new EnhancedModeProperties
{
Task = "translate",
TargetLanguage = "de"
}
};
ClientResult<TranscriptionResult> response = await client.TranscribeAsync(options);
// Print translation result
foreach (var combinedPhrase in response.Value.CombinedPhrases)
{
Console.WriteLine($"Translation: {combinedPhrase.Text}");
}
Replace <path-to-your-audio-file.wav> with the path to your audio file.
Reference: EnhancedModeProperties
Use prompt-tuning
You can provide an optional prompt to guide the output style for transcription or translation tasks:
TranscriptionOptions options = new TranscriptionOptions(audioStream)
{
EnhancedMode = new EnhancedModeProperties
{
Task = "transcribe",
Prompt = { "Output must be in lexical format." }
}
};
ClientResult<TranscriptionResult> response = await client.TranscribeAsync(options);
foreach (var combinedPhrase in response.Value.CombinedPhrases)
{
Console.WriteLine($"Transcription: {combinedPhrase.Text}");
}
Best practices for prompts
- Prompts are subject to a maximum length of 4,096 characters.
- Prompts should preferably be written in English.
- Use `Output must be in lexical format.` to enforce lexical formatting instead of the default display format.
- Use `Pay attention to *phrase1*, *phrase2*, …` to improve recognition of specific phrases or acronyms.
Reference: EnhancedModeProperties
Clean up resources
When you finish the quickstart, delete the project folder:
Remove-Item -Recurse -Force llm-speech-quickstart
Reference documentation | Package (Maven) | GitHub Samples
Prerequisites
- An Azure subscription. Create one for free.
- Java Development Kit (JDK) 8 or later.
- Apache Maven for dependency management and building the project.
- A Speech resource in one of the supported regions. For more information about region availability, see Speech service supported regions.
- A sample `.wav` audio file to transcribe.
Set up the environment
Create a new folder named `llm-speech-quickstart` and navigate to it:

```shell
mkdir llm-speech-quickstart && cd llm-speech-quickstart
```

Create a `pom.xml` file in the root of your project directory with the following content:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>transcription-quickstart</artifactId>
  <version>1.0.0</version>
  <packaging>jar</packaging>
  <name>Speech Transcription Quickstart</name>
  <description>Quickstart sample for Azure Speech Transcription client library.</description>
  <url>https://github.com/Azure/azure-sdk-for-java</url>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
  <dependencies>
    <dependency>
      <groupId>com.azure</groupId>
      <artifactId>azure-ai-speech-transcription</artifactId>
      <version>1.0.0-beta.2</version>
    </dependency>
    <dependency>
      <groupId>com.azure</groupId>
      <artifactId>azure-identity</artifactId>
      <version>1.18.1</version>
    </dependency>
  </dependencies>
  <build>
    <sourceDirectory>.</sourceDirectory>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.11.0</version>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.codehaus.mojo</groupId>
        <artifactId>exec-maven-plugin</artifactId>
        <version>3.1.0</version>
        <configuration>
          <mainClass>LlmSpeechQuickstart</mainClass>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
```

Note

The `<sourceDirectory>.</sourceDirectory>` configuration tells Maven to look for Java source files in the current directory instead of the default `src/main/java` structure. This configuration change allows for a simpler flat project structure.

Install the dependencies:

```shell
mvn clean install
```
Set environment variables
Your application must be authenticated to access the Speech service. The SDK supports both API key and Microsoft Entra ID authentication. It automatically detects which method to use based on the environment variables you set.
First, set the AZURE_SPEECH_ENDPOINT environment variable to the endpoint of your Speech resource, replacing <your-speech-endpoint> with your actual endpoint.
Then, choose one of the following authentication methods:
Option 1: API key authentication (recommended for getting started)
Set the AZURE_SPEECH_API_KEY environment variable to your resource's API key.
Option 2: Microsoft Entra ID authentication (recommended for production)
Instead of setting AZURE_SPEECH_API_KEY, configure one of the following credential sources:
- Azure CLI: Run `az login` on your development machine.
- Managed Identity: For apps running in Azure (App Service, Azure Functions, VMs).
- Environment Variables: Set `AZURE_TENANT_ID`, `AZURE_CLIENT_ID`, and `AZURE_CLIENT_SECRET`.
- Visual Studio Code or IntelliJ: Sign in through your IDE.
You also need to assign the Cognitive Services User role to your identity:
az role assignment create --assignee <your-identity> \
--role "Cognitive Services User" \
--scope /subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.CognitiveServices/accounts/<speech-resource-name>
Note
After setting environment variables on Windows, restart any running programs that need to read them, including the console window. On Linux or macOS, run source ~/.bashrc (or your equivalent shell configuration file) to make the changes effective.
Transcribe audio with LLM speech
LLM speech uses the EnhancedModeOptions class to enable large-language-model-enhanced transcription. Enhanced mode is automatically enabled when you create an EnhancedModeOptions instance. The model automatically detects the language in your audio.
Create a file named LlmSpeechQuickstart.java in your project directory with the following code:
import com.azure.ai.speech.transcription.TranscriptionClient;
import com.azure.ai.speech.transcription.TranscriptionClientBuilder;
import com.azure.ai.speech.transcription.models.AudioFileDetails;
import com.azure.ai.speech.transcription.models.EnhancedModeOptions;
import com.azure.ai.speech.transcription.models.TranscriptionOptions;
import com.azure.ai.speech.transcription.models.TranscriptionResult;
import com.azure.core.credential.KeyCredential;
import com.azure.core.util.BinaryData;
import com.azure.identity.DefaultAzureCredentialBuilder;
import java.nio.file.Files;
import java.nio.file.Paths;
public class LlmSpeechQuickstart {
    public static void main(String[] args) {
        try {
            // Get credentials from environment variables
            String endpoint = System.getenv("AZURE_SPEECH_ENDPOINT");
            String apiKey = System.getenv("AZURE_SPEECH_API_KEY");

            // Create client with API key or Entra ID authentication
            TranscriptionClientBuilder builder = new TranscriptionClientBuilder()
                .endpoint(endpoint);

            TranscriptionClient client;
            if (apiKey != null && !apiKey.isEmpty()) {
                // Use API key authentication
                client = builder.credential(new KeyCredential(apiKey)).buildClient();
            } else {
                // Use Entra ID authentication
                client = builder.credential(new DefaultAzureCredentialBuilder().build()).buildClient();
            }

            // Load audio file
            String audioFilePath = "<path-to-your-audio-file.wav>";
            byte[] audioData = Files.readAllBytes(Paths.get(audioFilePath));

            // Create audio file details
            AudioFileDetails audioFileDetails = new AudioFileDetails(BinaryData.fromBytes(audioData));

            // Create enhanced mode options for LLM speech transcription
            // Enhanced mode is automatically enabled when you create EnhancedModeOptions
            EnhancedModeOptions enhancedModeOptions = new EnhancedModeOptions()
                .setTask("transcribe");

            // Create transcription options with enhanced mode
            TranscriptionOptions options = new TranscriptionOptions(audioFileDetails)
                .setEnhancedModeOptions(enhancedModeOptions);

            // Transcribe the audio
            TranscriptionResult result = client.transcribe(options);

            // Print result
            System.out.println("Transcription:");
            result.getCombinedPhrases().forEach(phrase ->
                System.out.println(phrase.getText())
            );

            // Print detailed phrase information
            if (result.getPhrases() != null) {
                System.out.println("\nDetailed phrases:");
                result.getPhrases().forEach(phrase ->
                    System.out.println(String.format("  [%dms] (%s): %s",
                        phrase.getOffset(),
                        phrase.getLocale(),
                        phrase.getText()))
                );
            }
        } catch (Exception e) {
            System.err.println("Error: " + e.getMessage());
            e.printStackTrace();
        }
    }
}
Replace <path-to-your-audio-file.wav> with the path to your audio file. The service supports WAV, MP3, FLAC, OGG, and other common audio formats.
Run the application
Run the application using Maven:
mvn compile exec:java
Translate audio by using LLM speech
You can also use LLM speech to translate audio into a target language. Modify the EnhancedModeOptions configuration to set the task to translate and specify the target language.
Create a file named LlmSpeechTranslate.java with the following code:
import com.azure.ai.speech.transcription.TranscriptionClient;
import com.azure.ai.speech.transcription.TranscriptionClientBuilder;
import com.azure.ai.speech.transcription.models.AudioFileDetails;
import com.azure.ai.speech.transcription.models.EnhancedModeOptions;
import com.azure.ai.speech.transcription.models.TranscriptionOptions;
import com.azure.ai.speech.transcription.models.TranscriptionResult;
import com.azure.core.credential.KeyCredential;
import com.azure.core.util.BinaryData;
import java.nio.file.Files;
import java.nio.file.Paths;
public class LlmSpeechTranslate {
    public static void main(String[] args) {
        try {
            // Get credentials from environment variables
            String endpoint = System.getenv("AZURE_SPEECH_ENDPOINT");
            String apiKey = System.getenv("AZURE_SPEECH_API_KEY");

            // Create client
            TranscriptionClient client = new TranscriptionClientBuilder()
                .endpoint(endpoint)
                .credential(new KeyCredential(apiKey))
                .buildClient();

            // Load audio file
            String audioFilePath = "<path-to-your-audio-file.wav>";
            byte[] audioData = Files.readAllBytes(Paths.get(audioFilePath));

            // Create audio file details
            AudioFileDetails audioFileDetails = new AudioFileDetails(BinaryData.fromBytes(audioData));

            // Create enhanced mode options for LLM speech translation
            // Translate to Korean (supported languages: en, zh, de, fr, it, ja, es, pt, ko)
            EnhancedModeOptions enhancedModeOptions = new EnhancedModeOptions()
                .setTask("translate")
                .setTargetLanguage("ko");

            // Create transcription options with enhanced mode
            TranscriptionOptions options = new TranscriptionOptions(audioFileDetails)
                .setEnhancedModeOptions(enhancedModeOptions);

            // Translate the audio
            TranscriptionResult result = client.transcribe(options);

            // Print translation result
            System.out.println("Translation:");
            result.getCombinedPhrases().forEach(phrase ->
                System.out.println(phrase.getText())
            );
        } catch (Exception e) {
            System.err.println("Error: " + e.getMessage());
            e.printStackTrace();
        }
    }
}
Replace <path-to-your-audio-file.wav> with the path to your audio file.
To run the translation example, update the pom.xml main class configuration or run:
mvn exec:java -Dexec.mainClass="LlmSpeechTranslate"
Use prompt-tuning
You can provide an optional prompt to guide the output style for transcription or translation tasks.
import java.util.Arrays;
// Create enhanced mode options with prompt-tuning
EnhancedModeOptions enhancedModeOptions = new EnhancedModeOptions()
.setTask("transcribe")
.setPrompts(Arrays.asList("Output must be in lexical format."));
// Create transcription options with enhanced mode
TranscriptionOptions options = new TranscriptionOptions(audioFileDetails)
.setEnhancedModeOptions(enhancedModeOptions);
Best practices for prompts:
- Prompts are subject to a maximum length of 4,096 characters.
- Prompts should preferably be written in English.
- Use `Output must be in lexical format.` to enforce lexical formatting instead of the default display format.
- Use `Pay attention to *phrase1*, *phrase2*, …` to improve recognition of specific phrases or acronyms.
Clean up resources
When you finish the quickstart, delete the project folder:
rm -rf llm-speech-quickstart
Transcription error handling
Implement retry logic with exponential backoff
When calling the fast transcription API, implement retry logic to handle transient errors and rate limiting. The API enforces rate limits, which can result in HTTP 429 responses during high-concurrency operations.
Recommended retry configuration
- Retry up to 5 times on transient errors.
- Use exponential backoff: 2s, 4s, 8s, 16s, 32s.
- Total backoff time: 62 seconds.
This configuration provides sufficient time for the API to recover during rate-limiting windows, especially when running batch operations with multiple concurrent workers.
When to use retry logic
Implement retry logic for the following error categories:
- HTTP errors - Retry on:
  - HTTP 429 (rate limit)
  - HTTP 500, 502, 503, 504 (server errors)
  - `status_code=None` (incomplete response downloads)
- Azure SDK network errors - Retry on:
  - `ServiceRequestError`
  - `ServiceResponseError`

  These errors wrap low-level network exceptions like `urllib3.exceptions.ReadTimeoutError`, connection resets, and TLS failures.
- Python network exceptions - Retry on:
  - `ConnectionError`
  - `TimeoutError`
  - `OSError`
Don't retry on the following errors, as they indicate client-side issues that require correction:
- HTTP 400 (bad request)
- HTTP 401 (unauthorized)
- HTTP 422 (unprocessable entity)
- Other client errors (4xx status codes)
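The retry/no-retry split above can be expressed as a simple status-code predicate. The following is a minimal sketch (a hypothetical helper, not part of any SDK); a missing status code, such as an incomplete response download, must be treated as retryable before this check applies:

```java
public class RetryPolicy {
    // Classify HTTP status codes per the guidance above:
    // retry 429 and common 5xx; never retry 4xx client errors.
    public static boolean isRetryable(int statusCode) {
        if (statusCode == 429) {
            return true;                 // rate limited
        }
        switch (statusCode) {
            case 500: case 502: case 503: case 504:
                return true;             // transient server errors
            default:
                return false;            // 400, 401, 422, other 4xx: fix the request instead
        }
    }
}
```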
Implementation notes
- Reset the audio file stream (`seek(0)`) before each retry attempt.
- When using concurrent workers, be aware that the default HTTP read timeout (300 seconds) might be exceeded under heavy rate limiting.
- Be aware that the API might accept a request but time out while generating the response, which can appear as an SDK-wrapped network error rather than a standard HTTP error.
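Putting the recommendations together, a generic retry loop with exponential backoff might look like the following. This is a minimal sketch, not part of the Azure SDK: the task you pass in should re-create or rewind the audio stream on each call (the analogue of `seek(0)`), and the transient-error predicate should apply the retry/no-retry classification above. With `maxRetries = 5` and `baseDelayMillis = 2000`, the delays are 2s, 4s, 8s, 16s, 32s (62 seconds total):

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

public class RetryWithBackoff {
    /**
     * Run the task, retrying up to maxRetries times on exceptions that
     * the isTransient predicate accepts. The delay doubles after each
     * failed attempt, starting at baseDelayMillis.
     */
    public static <T> T run(Callable<T> task,
                            Predicate<Exception> isTransient,
                            int maxRetries,
                            long baseDelayMillis) throws Exception {
        long delay = baseDelayMillis;
        for (int attempt = 0; ; attempt++) {
            try {
                // The task should rewind/re-create its input stream here,
                // so each retry sends the complete audio payload.
                return task.call();
            } catch (Exception e) {
                if (attempt >= maxRetries || !isTransient.test(e)) {
                    throw e; // retries exhausted, or a non-transient client error
                }
                Thread.sleep(delay);
                delay *= 2; // exponential backoff: 2s, 4s, 8s, 16s, 32s
            }
        }
    }
}
```

In practice you would call `client.transcribe(options)` inside the task and map the caught exception's HTTP status (or SDK error type) through your transient-error predicate.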