Create a batch transcription

With batch transcriptions, you submit audio data in a batch. The service transcribes the audio data and stores the results in a storage container. You can then retrieve the results from the storage container.

Batch transcription completion can take several minutes to hours, depending on the size of the audio data and the number of files submitted. Even the same size of audio data can take different amounts of time to transcribe, depending on service load and other factors. The service doesn't provide a way to estimate the time it takes to transcribe a batch of audio data.

Tip

If you need consistently fast results for audio files less than 2 hours long and less than 300 MB in size, consider using the fast transcription API instead.

Prerequisites

You need a Microsoft Foundry resource for Speech.

Create a transcription job

To create a batch transcription job, use the Transcriptions - Submit operation of the speech to text REST API. Construct the request body according to the following instructions:

  • You must set either the contentContainerUrl or contentUrls property. For more information about Azure blob storage for batch transcription, see Locate audio files for batch transcription.
  • Set the required locale property. This value should match the expected locale of the audio data to transcribe. You can't change the locale later.
  • Set the required displayName property. Choose a transcription name that you can refer to later. The transcription name doesn't have to be unique and can be changed later.
  • Set the required timeToLiveHours property. This property specifies how long the transcription should be kept in the system after it's completed. The shortest supported duration is 6 hours; the longest supported duration is 31 days. The recommended value is 48 hours (two days) when data is consumed directly.
  • Optionally, to use a model other than the base model, set the model property to the model ID. For more information, see Use a custom model and Use a Whisper model.
  • Optionally, set the wordLevelTimestampsEnabled property to true to enable word-level timestamps in the transcription results. The default value is false. For Whisper models, set the displayFormWordLevelTimestampsEnabled property instead. Whisper is a display-only model, so the lexical field isn't populated in the transcription.
  • Optionally, set the languageIdentification property. Language identification is used to identify the languages spoken in audio by comparing them against a list of supported languages. If you set the languageIdentification property, then you must also set languageIdentification.candidateLocales with candidate locales.

For more information, see Request configuration options.

Make an HTTP POST request that uses the URI as shown in the following Transcriptions - Submit example.

  • Replace YourSpeechResourceKey with your Microsoft Foundry resource key.
  • Replace YourServiceRegion with your Microsoft Foundry resource region.
  • Set the request body properties as previously described.
curl -v -X POST -H "Ocp-Apim-Subscription-Key: YourSpeechResourceKey" -H "Content-Type: application/json" -d '{
  "contentUrls": [
    "https://crbn.us/hello.wav",
    "https://crbn.us/whatstheweatherlike.wav"
  ],
  "locale": "en-US",
  "displayName": "My Transcription",
  "model": null,
  "properties": {
    "wordLevelTimestampsEnabled": true,
    "languageIdentification": {
      "candidateLocales": [
        "en-US", "de-DE", "es-ES"
      ],
      "mode": "Continuous"
    },
    "timeToLiveHours": 48
  }
}'  "https://YourServiceRegion.api.cognitive.microsoft.com/speechtotext/transcriptions:submit?api-version=2024-11-15"

You should receive a response body in the following format:

{
  "self": "https://eastus.api.cognitive.microsoft.com/speechtotext/transcriptions/788a1f24-f980-4809-8978-e5cf41f77b35?api-version=2024-11-15",
  "displayName": "My Transcription 2",
  "locale": "en-US",
  "createdDateTime": "2025-05-24T03:20:39Z",
  "lastActionDateTime": "2025-05-24T03:20:39Z",
  "links": {
    "files": "https://eastus.api.cognitive.microsoft.com/speechtotext/transcriptions/788a1f24-f980-4809-8978-e5cf41f77b35/files?api-version=2024-11-15"
  },
  "properties": {
    "wordLevelTimestampsEnabled": true,
    "displayFormWordLevelTimestampsEnabled": false,
    "channels": [
      0,
      1
    ],
    "punctuationMode": "DictatedAndAutomatic",
    "profanityFilterMode": "Masked",
    "timeToLiveHours": 48,
    "languageIdentification": {
      "candidateLocales": [
        "en-US",
        "de-DE",
        "es-ES"
      ],
      "mode": "Continuous"
    }
  },
  "status": "NotStarted"
}

The top-level self property in the response body is the transcription's URI. Use this URI to get details such as the URI of the transcriptions and transcription report files. You also use this URI to update or delete a transcription.

You can query the status of your transcriptions with the Transcriptions - Get operation.
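As a sketch of that polling flow (the helper names here are illustrative, not part of the API), you can request the transcription's self URI until the status reaches a terminal state. Reported status values are NotStarted, Running, Succeeded, and Failed.

```python
import json
import time
import urllib.request

TERMINAL_STATUSES = {"Succeeded", "Failed"}

def is_terminal(status: str) -> bool:
    """Return True when a transcription has reached a final state."""
    return status in TERMINAL_STATUSES

def poll_transcription(self_uri: str, key: str, interval_s: float = 30.0) -> dict:
    """Poll the transcription's self URI until it reaches a terminal status."""
    while True:
        req = urllib.request.Request(
            self_uri, headers={"Ocp-Apim-Subscription-Key": key}
        )
        with urllib.request.urlopen(req) as resp:
            transcription = json.load(resp)
        if is_terminal(transcription["status"]):
            return transcription
        time.sleep(interval_s)
```

A fixed polling interval is the simplest choice; because transcription time isn't predictable, a longer interval or exponential backoff is usually kinder to your request quota.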

After you retrieve the results, call Transcriptions - Delete regularly to remove transcriptions from the service. Alternatively, set the timeToLiveHours property to ensure the eventual deletion of the results.

Tip

You can also try the Batch Transcription API using Python, C#, or Node.js on GitHub.

To create a transcription, use the spx batch transcription create command. Construct the request parameters according to the following instructions:

  • Set the required content parameter. You can specify a comma delimited list of individual files or the URL for an entire container. For more information about Azure blob storage for batch transcription, see Locate audio files for batch transcription.
  • Set the required language property. This value should match the expected locale of the audio data to transcribe. You can't change the locale later. The Speech CLI language parameter corresponds to the locale property in the JSON request and response.
  • Set the required name property. Choose a transcription name that you can refer to later. The transcription name doesn't have to be unique and can be changed later. The Speech CLI name parameter corresponds to the displayName property in the JSON request and response.
  • Set the required api-version parameter to v3.2. The Speech CLI doesn't support version 2024-11-15 or later yet, so you must use v3.2 for now.

Here's an example Speech CLI command that creates a transcription job:

spx batch transcription create --api-version v3.2 --name "My Transcription" --language "en-US" --content https://crbn.us/hello.wav,https://crbn.us/whatstheweatherlike.wav

You should receive a response body in the following format:

{
  "self": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2/transcriptions/bbbbcccc-1111-dddd-2222-eeee3333ffff",
  "model": {
    "self": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2/models/base/ccccdddd-2222-eeee-3333-ffff4444aaaa"
  },
  "links": {
    "files": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2/transcriptions/7f4232d5-9873-47a7-a6f7-4a3f00d00dc0/files"
  },
  "properties": {
    "diarizationEnabled": false,
    "wordLevelTimestampsEnabled": false,
    "channels": [
      0,
      1
    ],
    "punctuationMode": "DictatedAndAutomatic",
    "profanityFilterMode": "Masked"
  },
  "lastActionDateTime": "2025-05-24T03:20:39Z",
  "status": "NotStarted",
  "createdDateTime": "2025-05-24T03:20:39Z",
  "locale": "en-US",
  "displayName": "My Transcription",
  "description": ""
}

The top-level self property in the response body is the transcription's URI. Use this URI to get details such as the URI of the transcriptions and transcription report files. You also use this URI to update or delete a transcription.

For Speech CLI help with transcriptions, run the following command:

spx help batch transcription

Request configuration options

Here are some property options to configure a transcription when you call the Transcriptions - Submit operation. You can find more examples on the same page, such as creating a transcription with language identification.

The request body has two distinct levels. Misplacing a property causes the service to silently ignore it or return a validation error.

  • Root level: Metadata that describes the transcription job itself (displayName, locale, model, contentUrls, contentContainerUrl).
  • Inside properties: Options that control transcription behavior. Wrap these in a "properties": { } object.

Important

destinationContainerUrl belongs inside the properties object, not at the root level of the request body. Placing it at the root causes the service to ignore it, and transcription results are silently written to the Microsoft-managed container instead.

The following example shows the correct structure:

{
  "contentUrls": ["https://..."],
  "locale": "en-US",
  "displayName": "My Transcription",
  "model": null,
  "properties": {
    "destinationContainerUrl": "https://<storage>.blob.core.windows.net/<container>?<SAS>",
    "wordLevelTimestampsEnabled": true,
    "timeToLiveHours": 48
  }
}
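To guard against the misplacement described above, a small sketch (illustrative only, not an official validator) can check property placement before you submit a request body:

```python
# Properties allowed at the root level of the request body.
ROOT_PROPERTIES = {
    "contentUrls", "contentContainerUrl", "locale",
    "displayName", "model", "properties",
}
# Options that must live inside the "properties" object.
NESTED_PROPERTIES = {
    "destinationContainerUrl", "wordLevelTimestampsEnabled", "timeToLiveHours",
    "diarization", "diarizationEnabled", "languageIdentification",
    "profanityFilterMode", "punctuationMode", "channels",
    "displayFormWordLevelTimestampsEnabled",
}

def check_placement(body: dict) -> list[str]:
    """Return warnings for properties placed at the wrong level of the body."""
    warnings = []
    for key in body:
        if key in NESTED_PROPERTIES:
            warnings.append(f"'{key}' belongs inside 'properties', not at the root")
        elif key not in ROOT_PROPERTIES:
            warnings.append(f"unknown root-level property '{key}'")
    return warnings
```

Running such a check before serializing the body catches the silent destinationContainerUrl mistake locally instead of discovering it after results land in the Microsoft-managed container.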
The following properties are listed with their required location in the request body.

contentContainerUrl (root level): You can submit individual audio files or a whole storage container.

You must specify the audio data location by using either the contentContainerUrl or contentUrls property. For more information about Azure blob storage for batch transcription, see Locate audio files for batch transcription.

This property isn't returned in the response.

contentUrls (root level): You can submit individual audio files or a whole storage container.

You must specify the audio data location by using either the contentContainerUrl or contentUrls property. For more information, see Locate audio files for batch transcription.

This property isn't returned in the response.

displayName (root level): The name of the batch transcription. Choose a name that you can refer to later. The display name doesn't have to be unique.

This property is required.

locale (root level): The locale of the batch transcription. This value should match the expected locale of the audio data to transcribe. The locale can't be changed later.

This property is required.

model (root level): You can set the model property to use a specific base model or custom speech model. If you don't specify the model, the default base model for the locale is used. For more information, see Use a custom model and Use a Whisper model.

channels (inside properties): An array of channel numbers to process. Channels 0 and 1 are transcribed by default.

destinationContainerUrl (inside properties): The result can be stored in an Azure container. If you don't specify a container, the Speech service stores the results in a container managed by Microsoft. When the transcription job is deleted, the transcription result data is also deleted. For more information, such as the supported security scenarios, see Specify a destination container URL.

diarization (inside properties): Indicates that the Speech service should attempt diarization analysis on the input, which is expected to be a mono channel that contains multiple voices. The feature isn't available with stereo recordings.

Diarization is the process of separating speakers in audio data. The batch pipeline can recognize and separate multiple speakers on mono channel recordings.

Specify the minimum and maximum number of people who might be speaking. You must also set the diarizationEnabled property to true. The transcription file contains a speaker entry for each transcribed phrase.

You need to use this property when you expect three or more speakers. For two speakers, setting the diarizationEnabled property to true is enough. For an example of the property usage, see Transcriptions - Submit.

The maximum number of speakers for diarization must be less than 36 and greater than or equal to the minCount property. For an example, see Transcriptions - Submit.

When this property is set, source audio length can't exceed 240 minutes per file.

Note: This property is only available with Speech to text REST API version 3.1 and later. If you set this property with any previous version, such as version 3.0, it's ignored and only two speakers are identified.

diarizationEnabled (inside properties): Specifies that the Speech service should attempt diarization analysis on the input, which is expected to be a mono channel that contains two voices. The default value is false.

For three or more voices, you also need to use the diarization property. Use only with Speech to text REST API version 3.1 and later.

When this property is set, source audio length can't exceed 240 minutes per file.

displayFormWordLevelTimestampsEnabled (inside properties): Specifies whether to include word-level timestamps on the display form of the transcription results. The results are returned in the displayWords property of the transcription file. The default value is false.

Note: This property is only available with Speech to text REST API version 3.1 and later.

languageIdentification (inside properties): Language identification is used to identify the languages spoken in audio by comparing them against a list of supported languages.

If you set the languageIdentification property, then you must also set its enclosed candidateLocales property.

languageIdentification.candidateLocales (inside properties): The candidate locales for language identification, such as "properties": { "languageIdentification": { "candidateLocales": ["en-US", "de-DE", "es-ES"]}}. A minimum of two and a maximum of ten candidate locales, including the main locale for the transcription, is supported.

profanityFilterMode (inside properties): Specifies how to handle profanity in recognition results. Accepted values are None to disable profanity filtering, Masked to replace profanity with asterisks, Removed to remove all profanity from the result, or Tags to add profanity tags. The default value is Masked.

punctuationMode (inside properties): Specifies how to handle punctuation in recognition results. Accepted values are None to disable punctuation, Dictated to imply explicit (spoken) punctuation, Automatic to let the decoder deal with punctuation, or DictatedAndAutomatic to use dictated and automatic punctuation. The default value is DictatedAndAutomatic.

This property isn't applicable for Whisper models.

timeToLiveHours (inside properties): This required property specifies how long the transcription should be kept in the system after it's completed.

Once the transcription reaches the time to live after completion (successful or failed), it's automatically deleted.

The shortest supported duration is 6 hours; the longest supported duration is 31 days. The recommended value is 48 hours (two days) when data is consumed directly.

As an alternative, you can call Transcriptions - Delete regularly after you retrieve the transcription results.

wordLevelTimestampsEnabled (inside properties): Specifies if word-level timestamps should be included in the output. The default value is false.

This property isn't applicable for Whisper models. Whisper is a display-only model, so the lexical field isn't populated in the transcription.

For Speech CLI help with transcription configuration options, run the following command:

spx help batch transcription create advanced

Use a custom model

Batch transcription uses the default base model for the locale that you specify. You don't need to set any properties to use the default base model.

Optionally, you can modify the previous create transcription example by setting the model property to use a specific base model or custom speech model.

curl -v -X POST -H "Ocp-Apim-Subscription-Key: YourSpeechResourceKey" -H "Content-Type: application/json" -d '{
  "contentUrls": [
    "https://crbn.us/hello.wav",
    "https://crbn.us/whatstheweatherlike.wav"
  ],
  "locale": "en-US",
  "displayName": "My Transcription",
  "model": {
    "self": "https://eastus.api.cognitive.microsoft.com/speechtotext/models/base/ccccdddd-2222-eeee-3333-ffff4444aaaa"
  },
  "properties": {
    "wordLevelTimestampsEnabled": true
  }
}'  "https://YourServiceRegion.api.cognitive.microsoft.com/speechtotext/transcriptions:submit?api-version=2024-11-15"
spx batch transcription create --name "My Transcription" --language "en-US" --content https://crbn.us/hello.wav,https://crbn.us/whatstheweatherlike.wav --model "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2/models/base/ccccdddd-2222-eeee-3333-ffff4444aaaa"

To use a custom speech model for batch transcription, you need the model's URI. The top-level self property in the response body is the model's URI. You can retrieve the model location when you create or get a model. For more information, see the JSON response example in Create a model.

Tip

A hosted deployment endpoint isn't required to use custom speech with the batch transcription service. You can conserve resources if you use the custom speech model only for batch transcription.

Batch transcription requests for expired models fail with a 4xx error. Set the model property to a base model or custom model that isn't expired. Alternatively, omit the model property to always use the latest base model. For more information, see Choose a model and Custom speech model lifecycle.

Language identification

To identify languages with the batch transcription REST API, use the languageIdentification property in the body of your Transcriptions - Submit request.

Warning

Batch transcription only supports language identification for default base models. If both language identification and a custom model are specified in the transcription request, the service falls back to use the base models for the specified candidate languages. This might result in unexpected recognition results.

If your speech to text scenario requires both language identification and custom models, use real-time speech to text instead of batch transcription.

The following example shows the usage of the languageIdentification property with four candidate languages. For more information about request properties, see Create a batch transcription.

{
  <...>
  "properties": {
    <...>
    "languageIdentification": {
      "candidateLocales": [
        "en-US",
        "ja-JP",
        "zh-CN",
        "hi-IN"
      ]
    },
    <...>
  }
}
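Since the service accepts between 2 and 10 candidate locales (including the main locale), it can help to validate the list before submitting. The following sketch (the function name is illustrative) builds the languageIdentification block and enforces that limit:

```python
def build_language_identification(candidate_locales: list[str]) -> dict:
    """Build the languageIdentification property, validating the candidate count."""
    if not 2 <= len(candidate_locales) <= 10:
        raise ValueError(
            "languageIdentification requires between 2 and 10 candidate locales"
        )
    return {"languageIdentification": {"candidateLocales": candidate_locales}}

# Merge the block into the "properties" object of the request body.
properties = {"timeToLiveHours": 48}
properties.update(build_language_identification(["en-US", "ja-JP", "zh-CN", "hi-IN"]))
```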

Use a Whisper model

Azure Speech in Foundry Tools supports OpenAI's Whisper model through the batch transcription API.

Note

Azure OpenAI in Microsoft Foundry Models also supports OpenAI's Whisper model for speech to text with a synchronous REST API. To learn more, see Speech to text with the Azure OpenAI Whisper model. For more information about when to use Azure Speech vs. Azure OpenAI in Microsoft Foundry Models, see What is the Whisper model?

To use a Whisper model for batch transcription, you need to set the model property. Whisper is a display-only model, so the lexical field isn't populated in the response.

Important

Batch transcription using Whisper models is available in a subset of the regions that support batch transcription. For the current list of supported regions, see the Speech service regions table.

You can make a Models - List Base Models request to get available base models for all locales.

Make an HTTP GET request as shown in the following example for the eastus region. Replace YourSpeechResourceKey with your Microsoft Foundry resource key. Replace eastus if you're using a different region.

curl -v -X GET "https://eastus.api.cognitive.microsoft.com/speechtotext/models/base?api-version=2024-11-15" -H "Ocp-Apim-Subscription-Key: YourSpeechResourceKey"

By default, only the 100 oldest base models are returned. Use the skip and top query parameters to page through the results. For example, the following request returns the next 100 base models after the first 100.

curl -v -X GET "https://eastus.api.cognitive.microsoft.com/speechtotext/models/base?api-version=2024-11-15&skip=100&top=100" -H "Ocp-Apim-Subscription-Key: YourSpeechResourceKey"
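That paging pattern can be scripted. The following sketch assumes the response lists models under a values array and stops when a short page comes back; treat both assumptions as illustrative rather than guaranteed response contract.

```python
import json
import urllib.request

def models_page_url(region: str, skip: int, top: int = 100) -> str:
    """Build the Models - List Base Models URL for one page of results."""
    return (
        f"https://{region}.api.cognitive.microsoft.com/speechtotext/models/base"
        f"?api-version=2024-11-15&skip={skip}&top={top}"
    )

def list_base_models(region: str, key: str):
    """Yield every base model, paging with skip/top until a short page is returned."""
    skip, top = 0, 100
    while True:
        req = urllib.request.Request(
            models_page_url(region, skip, top),
            headers={"Ocp-Apim-Subscription-Key": key},
        )
        with urllib.request.urlopen(req) as resp:
            page = json.load(resp)
        # Assumption: the page body carries its models in a "values" array.
        values = page.get("values", [])
        yield from values
        if len(values) < top:
            return
        skip += top
```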

Make sure that you set the configuration variables for a Foundry resource for Speech in one of the supported regions. You can run the spx csr list --base command to get available base models for all locales.

Set the required api-version parameter to v3.2. The Speech CLI doesn't support version 2024-11-15 or later yet, so you must use v3.2 for now.

spx csr list --base --api-version v3.2

The displayName property of a Whisper model contains "Whisper" as shown in this example. Whisper is a display-only model, so the lexical field isn't populated in the transcription.

{
  "links": {
    "manifest": "https://eastus.api.cognitive.microsoft.com/speechtotext/models/base/69adf293-9664-4040-932b-02ed16332e00/manifest?api-version=2024-11-15"
  },
  "properties": {
    "deprecationDates": {
      "adaptationDateTime": "2025-04-15T00:00:00Z",
      "transcriptionDateTime": "2026-04-15T00:00:00Z"
    },
    "features": {
      "supportsAdaptationsWith": [
        "Acoustic"
      ],
      "supportsTranscriptionsSubmit": true,
      "supportsTranscriptionsTranscribe": false,
      "supportsEndpoints": false,
      "supportsTranscriptionsOnSpeechContainers": false,
      "supportedOutputFormats": [
        "Display"
      ]
    },
    "chargeForAdaptation": true
  },
  "self": "https://eastus.api.cognitive.microsoft.com/speechtotext/models/base/69adf293-9664-4040-932b-02ed16332e00?api-version=2024-11-15",
  "displayName": "20240228 Whisper Large V2",
  "description": "OpenAI Whisper Model in Azure Speech (Whisper v2-large)",
  "locale": "en-US",
  "createdDateTime": "2024-02-29T15:46:31Z",
  "lastActionDateTime": "2024-02-29T15:51:53Z",
  "status": "Succeeded"
},
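Given model entries shaped like the JSON above, a sketch (illustrative helper, not an official SDK function) can pick out Whisper base models that haven't passed their transcription deprecation date:

```python
from datetime import datetime, timezone

def usable_whisper_models(models: list[dict]) -> list[str]:
    """Return self URIs of Whisper base models still usable for transcription."""
    now = datetime.now(timezone.utc)
    uris = []
    for model in models:
        # The displayName of a Whisper model contains "Whisper".
        if "Whisper" not in model.get("displayName", ""):
            continue
        deprecation = model["properties"]["deprecationDates"]["transcriptionDateTime"]
        # Timestamps use the trailing-Z UTC format, e.g. "2026-04-15T00:00:00Z".
        expires = datetime.fromisoformat(deprecation.replace("Z", "+00:00"))
        if expires > now:
            uris.append(model["self"])
    return uris
```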

You set the full model URI as shown in this example for the eastus region. Replace YourSpeechResourceKey with your Microsoft Foundry resource key. Replace eastus if you're using a different region.

curl -v -X POST -H "Ocp-Apim-Subscription-Key: YourSpeechResourceKey" -H "Content-Type: application/json" -d '{
  "contentUrls": [
    "https://crbn.us/hello.wav",
    "https://crbn.us/whatstheweatherlike.wav"
  ],
  "locale": "en-US",
  "displayName": "My Transcription",
  "model": {
    "self": "https://eastus.api.cognitive.microsoft.com/speechtotext/models/base/69adf293-9664-4040-932b-02ed16332e00?api-version=2024-11-15"
  },
  "properties": {
    "wordLevelTimestampsEnabled": true
  }
}'  "https://eastus.api.cognitive.microsoft.com/speechtotext/transcriptions:submit?api-version=2024-11-15"

You set the full model URI as shown in this example for the eastus region. Replace eastus if you're using a different region.

Set the required api-version parameter to v3.2. The Speech CLI doesn't support version 2024-11-15 or later yet, so you must use v3.2 for now.

spx batch transcription create --name "My Transcription" --language "en-US" --content https://crbn.us/hello.wav,https://crbn.us/whatstheweatherlike.wav --model "https://eastus.api.cognitive.microsoft.com/speechtotext/models/base/ddddeeee-3333-ffff-4444-aaaa5555bbbb" --api-version v3.2

Set up webhook notifications

Instead of polling for transcription status, you can register a webhook to receive a notification when a transcription job completes (or reaches any other terminal state).

Use the Web hooks - Create operation to register a webhook endpoint. The Speech service sends HTTP POST callbacks to your endpoint for transcription.created, transcription.processing, transcription.succeeded, transcription.failed, and transcription.deleted events.

Firewall requirements

The Speech service initiates outbound HTTPS calls from its managed infrastructure to your webhook endpoint. To accept those calls, your endpoint's firewall or network security group must allow inbound traffic from the Azure service tag CognitiveServicesManagement.

Important

If inbound traffic from CognitiveServicesManagement is blocked, the Speech service can't reach your endpoint. The webhook registration itself succeeds, because registration only involves an outbound call from your client to the Speech API, but all subsequent event callbacks silently fail.

For Azure-hosted endpoints, add an inbound rule in your network security group:

  • Source: Service tag CognitiveServicesManagement
  • Destination port: 443 (HTTPS)
  • Action: Allow

Webhook validation handshake

When you register a new webhook, the Speech service immediately sends a validation request to prove that your endpoint is reachable and under your control. Your endpoint must respond correctly or registration fails.

The validation request looks like this:

POST https://your-endpoint.example.com/?validationToken=<token>

Key details:

  • The token is delivered as a query string parameter (validationToken), not in the request body.
  • Your endpoint must respond with HTTP 200 OK and return the raw token string as plain text (Content-Type: text/plain).
  • Returning JSON (for example, {"validationToken": "..."}) causes validation to fail.

Important

Common mistake: Echoing the token as a JSON object instead of plain text causes webhook registration to fail with no clear error from the service. Always return the token as a plain text string.

Python example: validation handler

The following example shows a minimal Python webhook endpoint using Flask. It reads validationToken from the query string and returns it as plain text.

import flask

app = flask.Flask(__name__)

@app.route("/webhook", methods=["POST"])
def webhook():
    # Validation handshake: Speech sends the token as a query parameter.
    # Return it as plain text; do NOT return JSON.
    validation_token = flask.request.args.get("validationToken")
    if validation_token:
        return flask.Response(validation_token, status=200, mimetype="text/plain")

    # Normal event callback: parse the JSON body.
    event = flask.request.get_json(silent=True) or {}
    event_type = event.get("events", [{}])[0].get("kind", "")

    if event_type == "TranscriptionSucceeded":
        transcription_url = event.get("self", "")
        print(f"Transcription completed: {transcription_url}")
        # TODO: fetch results from transcription_url

    return flask.Response(status=200)

if __name__ == "__main__":
    app.run(port=5000)

Register the webhook

After your endpoint passes validation, register it with the Web hooks - Create operation:

curl -X POST \
  -H "Ocp-Apim-Subscription-Key: YourSpeechResourceKey" \
  -H "Content-Type: application/json" \
  -d '{
    "displayName": "My Transcription Webhook",
    "events": {
      "transcriptionSucceeded": true,
      "transcriptionFailed": true
    },
    "webUrl": "https://your-endpoint.example.com/webhook"
  }' \
  "https://YourServiceRegion.api.cognitive.microsoft.com/speechtotext/webhooks?api-version=2024-11-15"

  • Replace YourSpeechResourceKey with your Microsoft Foundry resource key.
  • Replace YourServiceRegion with your resource region.
  • Replace the webUrl with your publicly reachable HTTPS endpoint URL.

For the full list of supported event types, see the Web hooks reference.

Specify a destination container URL

The transcription result can be stored in an Azure container. If you don't specify a container, the Speech service stores the results in a container managed by Microsoft. In that case, when the transcription job is deleted, the transcription result data is also deleted.

You can store the results of a batch transcription to a writable Azure Blob storage container by using the destinationContainerUrl option in the batch transcription creation request. This option uses only an ad hoc SAS URI and doesn't support the Trusted Azure services security mechanism. This option also doesn't support access policy based SAS. The storage account resource of the destination container must allow all external traffic.
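Assuming you already generated an ad hoc SAS token for the container (the account, container, and token below are placeholders), composing the destinationContainerUrl value is simple string assembly:

```python
def destination_container_url(account: str, container: str, sas_token: str) -> str:
    """Compose the destinationContainerUrl value from an ad hoc container SAS token."""
    # Accept the token with or without a leading "?".
    return f"https://{account}.blob.core.windows.net/{container}?{sas_token.lstrip('?')}"

# Placeholder values for illustration only.
properties = {
    "destinationContainerUrl": destination_container_url(
        "mystorageaccount", "transcripts", "sv=2024-11-04&sp=wl&sig=..."
    ),
    "timeToLiveHours": 48,
}
```

Remember that this value goes inside the properties object of the request body, as described in Request configuration options.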

If you want to store the transcription results in an Azure Blob storage container by using the Trusted Azure services security mechanism, consider using Bring-your-own-storage (BYOS). For more information, see Use the Bring your own storage (BYOS) Microsoft Foundry resource for speech to text.