
Quickstart: Voice Agent with Foundry Agent Service (new)

Note

This feature is currently in public preview. This preview is provided without a service-level agreement, and is not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

Note

Foundry agent integration currently only supports agents available on public endpoints. Foundry agents deployed in private VNet aren't supported.

Learn how to use Voice Live with Microsoft Foundry Agent Service and Azure Speech in Foundry Tools in the Microsoft Foundry portal.

You can create and run an application to use Voice Live with agents for real-time voice agents.

  • Agents let you use a built-in prompt and configuration managed within the agent itself, rather than specifying instructions in the session code.

  • Agents encapsulate more complex logic and behaviors, making it easier to manage and update conversational flows without changing the client code.

  • The agent approach streamlines integration. The agent ID is used to connect, and all necessary settings are handled internally, reducing the need for manual configuration in the code.

  • This separation also supports better maintainability and scalability for scenarios where multiple conversational experiences or business logic variations are needed.

To use the Voice Live API without Foundry agents, see the Voice Live API quickstart.

Tip

To use Voice Live, you don't need to deploy an audio model with your Microsoft Foundry resource. Voice Live is fully managed, and the model is automatically deployed for you. For more information about model availability, see the Voice Live overview documentation.

Prerequisites

Note

This document refers to the Microsoft Foundry (new) portal.

Try out Voice Live in the playground

To try out the Voice Live demo, follow these steps:

  1. Sign in to Microsoft Foundry. Make sure the New Foundry toggle is on. These steps refer to Foundry (new).

  2. Select Build in the upper right menu, and select Agents from the left pane.

  3. Select the agent you created previously to go to the Agent playground.

  4. Switch the Voice mode toggle On. Your agent now connects to Voice Live.

  5. Expand the right pane, which contains the Voice Live settings. Optionally choose a voice, adjust the VAD settings, set the voice temperature and speed, and change other settings to configure voice behavior.

  6. Select Start session to start the voice conversation, and select End to end the chat session.

Learn how to use Voice Live with Microsoft Foundry Agent Service using the VoiceLive SDK for Python.

Reference documentation | Package (PyPI) | Additional samples on GitHub

You can create and run an application to use Voice Live with agents for real-time voice agents.

  • Agents let you use a built-in prompt and configuration managed within the agent itself, rather than specifying instructions in the session code.

  • Agents encapsulate more complex logic and behaviors, making it easier to manage and update conversational flows without changing the client code.

  • The agent approach streamlines integration. The agent ID is used to connect, and all necessary settings are handled internally, reducing the need for manual configuration in the code.

  • This separation also supports better maintainability and scalability for scenarios where multiple conversational experiences or business logic variations are needed.

To use the Voice Live API without Foundry agents, see the Voice Live API quickstart.

Tip

To use Voice Live, you don't need to deploy an audio model with your Microsoft Foundry resource. Voice Live is fully managed, and the model is automatically deployed for you. For more information about model availability, see the Voice Live overview documentation.

Follow the quickstart below, or start from a fully working web app with a browser-based voice UI.

Prerequisites

Note

This document refers to the Microsoft Foundry (new) portal and the latest Foundry Agent Service version.

  • Assign the Azure AI User role to your user account. You can assign roles in the Azure portal under Access control (IAM) > Add role assignment.

Prepare the environment

  1. Create a new folder voice-live-quickstart and go to the quickstart folder with the following command:

    mkdir voice-live-quickstart && cd voice-live-quickstart
    
  2. Create a virtual environment. If you already have Python 3.10 or higher installed, you can create a virtual environment using the following commands:

    py -3 -m venv .venv
    .venv\scripts\activate
    

    Activating the Python environment means that when you run python or pip from the command line, you use the Python interpreter contained in the .venv folder of your application. Use the deactivate command to exit the Python virtual environment; you can reactivate it later when needed.

    Tip

    We recommend that you create and activate a new Python environment to install the packages you need for this tutorial. Don't install packages into your global Python installation. Always use a virtual or conda environment when installing Python packages; otherwise, you risk breaking your global installation of Python.

  3. Create a file named requirements.txt. Add the following packages to the file:

    azure-ai-projects>=2.0.0b3
    openai
    azure-ai-voicelive>=1.2.0b4
    pyaudio
    python-dotenv
    azure-identity
    
  4. Install the packages:

    pip install -r requirements.txt
    

Retrieve resource information

Note

The agent integration requires Entra ID authentication. Key-based authentication isn't supported in Agent mode.

Create a new file named .env in the folder where you want to run the code.

In the .env file, add the following environment variables for authentication:

# Settings for Foundry Agent
PROJECT_ENDPOINT=<endpoint copied from welcome screen>
AGENT_NAME="MyVoiceAgent"
MODEL_DEPLOYMENT_NAME="gpt-4.1-mini"
# Settings for Voice Live (AGENT_NAME above is reused by Voice Live)
AGENT_VERSION=<version-of-the-agent>
CONVERSATION_ID=<specific conversation id to reconnect to>
PROJECT_NAME=<your_project_name>
VOICELIVE_ENDPOINT=<your_endpoint>
VOICELIVE_API_VERSION=2026-01-01-preview

Replace the default values with your actual project name, agent name, and endpoint values.

  • PROJECT_ENDPOINT: The Foundry project endpoint copied from the project welcome screen.

  • AGENT_NAME: The name of the agent to use.

  • AGENT_VERSION: Optional. The version of the agent to use.

  • CONVERSATION_ID: Optional. A specific conversation ID to reconnect to.

  • PROJECT_NAME: The name of your Microsoft Foundry project. The project name is the last element of the project endpoint value.

  • VOICELIVE_ENDPOINT: This value can be found in the Keys and Endpoint section when examining your resource from the Azure portal.

  • FOUNDRY_RESOURCE_OVERRIDE: Optional. The Foundry resource name hosting the agent project (for example, my-resource-name).

  • AGENT_AUTHENTICATION_IDENTITY_CLIENT_ID: Optional. The managed identity client ID of the Voice Live resource.

Learn more about keyless authentication and setting environment variables.
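The sample exits early if any required variable is unset. A minimal sketch of that check, where missing_settings is an illustrative helper (not part of the SDK) and the variable names come from the table above:

```python
# Required Voice Live agent settings; the remaining variables are optional.
REQUIRED = ("VOICELIVE_ENDPOINT", "AGENT_NAME", "PROJECT_NAME")

def missing_settings(env: dict) -> list:
    """Return the names of required settings that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]

# Example: PROJECT_NAME is absent, so it is reported as missing.
example_env = {
    "VOICELIVE_ENDPOINT": "https://example.services.ai.azure.com",
    "AGENT_NAME": "MyVoiceAgent",
}
print(missing_settings(example_env))  # ['PROJECT_NAME']
```

In the full sample, this corresponds to the sys.exit call in main() when VOICELIVE_ENDPOINT, AGENT_NAME, or PROJECT_NAME is missing.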

Create an agent with Voice Live settings

  1. Create a file create_agent_with_voicelive.py with the following code:

    import os
    import json
    from dotenv import load_dotenv
    from azure.identity import DefaultAzureCredential
    from azure.ai.projects import AIProjectClient
    from azure.ai.projects.models import PromptAgentDefinition
    
    load_dotenv()
    
    # Helper functions for Voice Live configuration chunking (512-char metadata limit)
    def chunk_config(config_json: str, limit: int = 512) -> dict:
        """Split config into chunked metadata entries."""
        metadata = {"microsoft.voice-live.configuration": config_json[:limit]}
        remaining = config_json[limit:]
        chunk_num = 1
        while remaining:
            metadata[f"microsoft.voice-live.configuration.{chunk_num}"] = remaining[:limit]
            remaining = remaining[limit:]
            chunk_num += 1
        return metadata
    
    def reassemble_config(metadata: dict) -> str:
        """Reassemble chunked Voice Live configuration."""
        config = metadata.get("microsoft.voice-live.configuration", "")
        chunk_num = 1
        while f"microsoft.voice-live.configuration.{chunk_num}" in metadata:
            config += metadata[f"microsoft.voice-live.configuration.{chunk_num}"]
            chunk_num += 1
        return config
    
    # Setup client
    project_client = AIProjectClient(
        endpoint=os.environ["PROJECT_ENDPOINT"],
        credential=DefaultAzureCredential(),
    )
    agent_name = os.environ["AGENT_NAME"]
    
    # Define Voice Live session settings
    voice_live_config = {
        "session": {
            "voice": {
                "name": "en-US-Ava:DragonHDLatestNeural",
                "type": "azure-standard",
                "temperature": 0.8
            },
            "input_audio_transcription": {
                "model": "azure-speech"
            },
            "turn_detection": {
                "type": "azure_semantic_vad",
                "end_of_utterance_detection": {
                    "model": "semantic_detection_v1_multilingual"
                }
            },
            "input_audio_noise_reduction": {"type": "azure_deep_noise_suppression"},
            "input_audio_echo_cancellation": {"type": "server_echo_cancellation"}
        }
    }
    
    # Create agent with Voice Live configuration in metadata
    agent = project_client.agents.create_version(
        agent_name=agent_name,
        definition=PromptAgentDefinition(
            model=os.environ["MODEL_DEPLOYMENT_NAME"],
            instructions="You are a helpful assistant that answers general questions",
        ),
        metadata=chunk_config(json.dumps(voice_live_config))
    )
    print(f"Agent created: {agent.name} (version {agent.version})")
    
    # Verify Voice Live configuration was stored correctly
    retrieved_agent = project_client.agents.get(agent_name=agent_name)
    stored_metadata = (retrieved_agent.versions or {}).get("latest", {}).get("metadata", {})
    stored_config = reassemble_config(stored_metadata)
    
    if stored_config:
        print("\nVoice Live configuration:")
        print(json.dumps(json.loads(stored_config), indent=2))
    else:
        print("\nVoice Live configuration not found in agent metadata.")
    
  2. Sign in to Azure with the following command:

    az login
    
  3. Run the Python file.

    python create_agent_with_voicelive.py
    
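Because each metadata entry is limited to 512 characters, longer Voice Live configurations are split across numbered keys and reassembled on read. You can sanity-check the round trip with a standalone copy of the helpers from create_agent_with_voicelive.py (the 1,200-character filler value is only for illustration):

```python
import json

# Standalone copies of the chunking helpers from create_agent_with_voicelive.py.
def chunk_config(config_json: str, limit: int = 512) -> dict:
    """Split config into chunked metadata entries of at most `limit` characters."""
    metadata = {"microsoft.voice-live.configuration": config_json[:limit]}
    remaining = config_json[limit:]
    chunk_num = 1
    while remaining:
        metadata[f"microsoft.voice-live.configuration.{chunk_num}"] = remaining[:limit]
        remaining = remaining[limit:]
        chunk_num += 1
    return metadata

def reassemble_config(metadata: dict) -> str:
    """Concatenate the chunked entries back into the original JSON string."""
    config = metadata.get("microsoft.voice-live.configuration", "")
    chunk_num = 1
    while f"microsoft.voice-live.configuration.{chunk_num}" in metadata:
        config += metadata[f"microsoft.voice-live.configuration.{chunk_num}"]
        chunk_num += 1
    return config

# A config longer than 512 characters round-trips without loss.
config = json.dumps({"session": {"notes": "x" * 1200}})
metadata = chunk_config(config)
assert all(len(v) <= 512 for v in metadata.values())
assert reassemble_config(metadata) == config
```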

Talk with a voice agent

The sample code in this quickstart uses Microsoft Entra ID for authentication, because the current integration supports only this authentication method.

The sample connects to Foundry Agent Service by passing agent_config in connect(...) using these fields:

  • agent_name: The agent name to invoke.
  • project_name: The Foundry project containing the agent.
  • agent_version: Optional pinned version for controlled rollouts. If omitted, the latest version is used.
  • conversation_id: Optional conversation ID to continue prior conversation context.
  • foundry_resource_override: Optional resource name when the agent is hosted on a different Foundry resource.
  • authentication_identity_client_id: Optional managed identity client ID used with cross-resource agent connections.
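Only agent_name and project_name are required; the sample's BasicVoiceAssistant constructor assembles these fields into a plain dictionary, leaving optional entries as None when unset. A standalone sketch of that assembly (build_agent_config is an illustrative helper, not an SDK function):

```python
from typing import Optional

def build_agent_config(
    agent_name: str,
    project_name: str,
    agent_version: Optional[str] = None,
    conversation_id: Optional[str] = None,
    foundry_resource_override: Optional[str] = None,
    authentication_identity_client_id: Optional[str] = None,
) -> dict:
    """Assemble the agent_config mapping passed to connect(...)."""
    return {
        "agent_name": agent_name,
        "project_name": project_name,
        "agent_version": agent_version,
        "conversation_id": conversation_id,
        "foundry_resource_override": foundry_resource_override,
        # The identity client ID only applies together with a cross-resource override.
        "authentication_identity_client_id": (
            authentication_identity_client_id if foundry_resource_override else None
        ),
    }

cfg = build_agent_config("MyVoiceAgent", "my-project")
```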

Note

Agent mode in Voice Live doesn't support key-based authentication for agent invocation. Use Microsoft Entra ID (for example, AzureCliCredential) for agent access. Voice Live resource configuration might still include API keys for non-agent scenarios.

  1. Create the voice-live-agents-quickstart.py file with the following code:

    from __future__ import annotations
    import os
    import sys
    import asyncio
    import base64
    from datetime import datetime
    import logging
    import queue
    import signal
    from typing import Any, Union, Optional, TYPE_CHECKING, cast
    
    from azure.core.credentials import AzureKeyCredential
    from azure.core.credentials_async import AsyncTokenCredential
    from azure.identity.aio import AzureCliCredential
    
    from azure.ai.voicelive.aio import connect, AgentSessionConfig
    from azure.ai.voicelive.models import (
        InputAudioFormat,
        Modality,
        OutputAudioFormat,
        RequestSession,
        ServerEventType,
        MessageItem,
        InputTextContentPart,
        LlmInterimResponseConfig,
        InterimResponseTrigger,
        AzureStandardVoice,
        AudioNoiseReduction,
        AudioEchoCancellation,
        AzureSemanticVadMultilingual
    )
    from dotenv import load_dotenv
    import pyaudio
    
    if TYPE_CHECKING:
        # Only needed for type checking; avoids runtime import issues
        from azure.ai.voicelive.aio import VoiceLiveConnection
    
    # Environment variable loading
    _script_dir = os.path.dirname(os.path.abspath(__file__))
    load_dotenv(os.path.join(_script_dir, './.env'), override=True)
    
    # Set up logging
    ## Add folder for logging
    os.makedirs(os.path.join(_script_dir, 'logs'), exist_ok=True)
    
    ## Add timestamp for logfiles
    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    
    ## Create conversation log filename
    logfilename = f"{timestamp}_conversation.log"
    
    ## Set up logging
    logging.basicConfig(
        filename=os.path.join(_script_dir, 'logs', f'{timestamp}_voicelive.log'),
        filemode="w",
        format='%(asctime)s:%(name)s:%(levelname)s:%(message)s',
        level=logging.INFO
    )
    logger = logging.getLogger(__name__)
    
    class AudioProcessor:
        """
        Handles real-time audio capture and playback for the voice assistant.
    
        Threading Architecture:
        - Main thread: Event loop and UI
        - Capture thread: PyAudio input stream reading
        - Send thread: Async audio data transmission to VoiceLive
        - Playback thread: PyAudio output stream writing
        """
        
        loop: asyncio.AbstractEventLoop
        
        class AudioPlaybackPacket:
            """Represents a packet that can be sent to the audio playback queue."""
            def __init__(self, seq_num: int, data: Optional[bytes]):
                self.seq_num = seq_num
                self.data = data
    
        def __init__(self, connection: VoiceLiveConnection) -> None:
            self.connection = connection
            self.audio = pyaudio.PyAudio()
    
            # Audio configuration - PCM16, 24kHz, mono as specified
            self.format = pyaudio.paInt16
            self.channels = 1
            self.rate = 24000
            self.chunk_size = 1200 # 50ms
    
            # Capture and playback state
            self.input_stream = None
    
            self.playback_queue: queue.Queue[AudioProcessor.AudioPlaybackPacket] = queue.Queue()
            self.playback_base = 0
            self.next_seq_num = 0
            self.output_stream: Optional[pyaudio.Stream] = None
    
            logger.info("AudioProcessor initialized with 24kHz PCM16 mono audio")
    
        def start_capture(self) -> None:
            """Start capturing audio from microphone."""
            def _capture_callback(
                in_data,      # data
                _frame_count,  # number of frames
                _time_info,    # dictionary
                _status_flags):
                """Audio capture thread - runs in background."""
                audio_base64 = base64.b64encode(in_data).decode("utf-8")
                asyncio.run_coroutine_threadsafe(
                    self.connection.input_audio_buffer.append(audio=audio_base64), self.loop
                )
                return (None, pyaudio.paContinue)
    
            if self.input_stream:
                return
    
            # Store the current event loop for use in threads
            self.loop = asyncio.get_event_loop()
    
            try:
                self.input_stream = self.audio.open(
                    format=self.format,
                    channels=self.channels,
                    rate=self.rate,
                    input=True,
                    frames_per_buffer=self.chunk_size,
                    stream_callback=_capture_callback,
                )
                logger.info("Started audio capture")
    
            except Exception:
                logger.exception("Failed to start audio capture")
                raise
    
        def start_playback(self) -> None:
            """Initialize audio playback system."""
            if self.output_stream:
                return
    
            remaining = bytes()
            def _playback_callback(
                _in_data,
                frame_count,  # number of frames
                _time_info,
                _status_flags):
    
                nonlocal remaining
                frame_count *= pyaudio.get_sample_size(pyaudio.paInt16)
    
                out = remaining[:frame_count]
                remaining = remaining[frame_count:]
    
                while len(out) < frame_count:
                    try:
                        packet = self.playback_queue.get_nowait()
                    except queue.Empty:
                        out = out + bytes(frame_count - len(out))
                        continue
                    except Exception:
                        logger.exception("Error in audio playback")
                        raise
    
                    if not packet or not packet.data:
                        # None packet indicates end of stream
                        logger.info("End of playback queue.")
                        break
    
                    if packet.seq_num < self.playback_base:
                        # skip requested
                        # ignore skipped packet and clear remaining
                        if len(remaining) > 0:
                            remaining = bytes()
                        continue
    
                    num_to_take = frame_count - len(out)
                    out = out + packet.data[:num_to_take]
                    remaining = packet.data[num_to_take:]
    
                if len(out) >= frame_count:
                    return (out, pyaudio.paContinue)
                else:
                    return (out, pyaudio.paComplete)
    
            try:
                self.output_stream = self.audio.open(
                    format=self.format,
                    channels=self.channels,
                    rate=self.rate,
                    output=True,
                    frames_per_buffer=self.chunk_size,
                    stream_callback=_playback_callback
                )
                logger.info("Audio playback system ready")
            except Exception:
                logger.exception("Failed to initialize audio playback")
                raise
    
        def _get_and_increase_seq_num(self) -> int:
            seq = self.next_seq_num
            self.next_seq_num += 1
            return seq
    
        def queue_audio(self, audio_data: Optional[bytes]) -> None:
            """Queue audio data for playback."""
            self.playback_queue.put(
                AudioProcessor.AudioPlaybackPacket(
                    seq_num=self._get_and_increase_seq_num(),
                    data=audio_data))
    
        def skip_pending_audio(self) -> None:
            """Skip current audio in playback queue."""
            self.playback_base = self._get_and_increase_seq_num()
    
        def shutdown(self) -> None:
            """Clean up audio resources."""
            if self.input_stream:
                self.input_stream.stop_stream()
                self.input_stream.close()
                self.input_stream = None
    
            logger.info("Stopped audio capture")
    
            # Inform thread to complete
            if self.output_stream:
                self.skip_pending_audio()
                self.queue_audio(None)
                self.output_stream.stop_stream()
                self.output_stream.close()
                self.output_stream = None
    
            logger.info("Stopped audio playback")
    
            if self.audio:
                self.audio.terminate()
    
            logger.info("Audio processor cleaned up")
    
    class BasicVoiceAssistant:
        """
        Basic voice assistant implementing the VoiceLive SDK patterns with Foundry Agent.
        
        Uses the new AgentSessionConfig for strongly-typed agent configuration at connection time.
        This sample also demonstrates how to collect a conversation log of user and agent interactions.
        """
    
        def __init__(
            self,
            endpoint: str,
            credential: Union[AzureKeyCredential, AsyncTokenCredential],
            voice: str,
            agent_name: str,
            project_name: str,
            agent_version: Optional[str] = None,
            conversation_id: Optional[str] = None,
            foundry_resource_override: Optional[str] = None,
            agent_authentication_identity_client_id: Optional[str] = None,
        ):
            self.endpoint = endpoint
            self.credential = credential
            self.voice = voice
            # Build AgentSessionConfig internally
            self.agent_config: AgentSessionConfig = {
                "agent_name": agent_name,
                "agent_version": agent_version if agent_version else None,
                "project_name": project_name,
                "conversation_id": conversation_id if conversation_id else None,
                "foundry_resource_override": foundry_resource_override if foundry_resource_override else None, 
                "authentication_identity_client_id": agent_authentication_identity_client_id if agent_authentication_identity_client_id and foundry_resource_override else None,                
            }        
    
            self.connection: Optional["VoiceLiveConnection"] = None
            self.audio_processor: Optional[AudioProcessor] = None
            self.session_ready = False
            self.greeting_sent = False
            self._active_response = False
            self._response_api_done = False
    
        async def start(self) -> None:
            """Start the voice assistant session."""
            try:
                logger.info(
                    "Connecting to VoiceLive API with agent %s for project %s (version=%s, conversation_id=%s, foundry_override=%s, auth_identity=%s)",
                    self.agent_config.get("agent_name"),
                    self.agent_config.get("project_name"),
                    self.agent_config.get("agent_version"),
                    self.agent_config.get("conversation_id"),
                    self.agent_config.get("foundry_resource_override"),
                    self.agent_config.get("authentication_identity_client_id")
                )
    
                # Connect using AgentSessionConfig (new SDK pattern)
                async with connect(
                    endpoint=self.endpoint,
                    credential=self.credential,
                    api_version="2026-01-01-preview",
                    agent_config=self.agent_config,
                ) as connection:
                    conn = connection
                    self.connection = conn
    
                    # Initialize audio processor
                    ap = AudioProcessor(conn)
                    self.audio_processor = ap
    
                    # Configure session for voice conversation
                    await self._setup_session()
    
                    # Start audio systems
                    ap.start_playback()
    
                    logger.info("Voice assistant ready! Start speaking...")
                    print("\n" + "=" * 65)
                    print("šŸŽ¤ VOICE ASSISTANT READY")
                    print("Start speaking to begin conversation")
                    print("Press Ctrl+C to exit")
                    print("=" * 65 + "\n")
    
                    # Process events
                    await self._process_events()
            finally:
                if self.audio_processor:
                    self.audio_processor.shutdown()
    
        async def _setup_session(self) -> None:
            """Configure the VoiceLive session for audio conversation."""
            logger.info("Setting up voice conversation session...")
    
            # Set up interim response configuration to bridge latency gaps during processing
            interim_response_config = LlmInterimResponseConfig(
                triggers=[InterimResponseTrigger.TOOL, InterimResponseTrigger.LATENCY],
                latency_threshold_ms=100,
                instructions="""Create friendly interim responses indicating wait time due to ongoing processing, if any. Do not include
                                in all responses! Do not say you don't have real-time access to information when calling tools!"""
            )
    
            # Create session configuration
            session_config = RequestSession(
                modalities=[Modality.TEXT, Modality.AUDIO],
                input_audio_format=InputAudioFormat.PCM16,
                output_audio_format=OutputAudioFormat.PCM16,
                interim_response=interim_response_config,
                # Uncomment the following, if not stored with agent configuration on the service side
                # voice=AzureStandardVoice(name=self.voice),
                # turn_detection=AzureSemanticVadMultilingual(),
                # input_audio_echo_cancellation=AudioEchoCancellation(),
                # input_audio_noise_reduction=AudioNoiseReduction(type="azure_deep_noise_suppression")
            )
    
            conn = self.connection
            if conn is None:
                raise RuntimeError("Connection must be established before setting up session")
            await conn.session.update(session=session_config)
    
            logger.info("Session configuration sent")
    
        async def _process_events(self) -> None:
            """Process events from the VoiceLive connection."""
            try:
                conn = self.connection
                if conn is None:
                    raise RuntimeError("Connection must be established before processing events")
                async for event in conn:
                    await self._handle_event(event)
            except Exception:
                logger.exception("Error processing events")
                raise
    
        async def _handle_event(self, event: Any) -> None:
            """Handle different types of events from VoiceLive."""
            logger.debug("Received event: %s", event.type)
            ap = self.audio_processor
            conn = self.connection
            if ap is None or conn is None:
                raise RuntimeError("AudioProcessor and Connection must be initialized")
    
            if event.type == ServerEventType.SESSION_UPDATED:
                logger.info("Session ready: %s", event.session.id)
                s, a, v = event.session, event.session.agent, event.session.voice
                await write_conversation_log("\n".join([
                    f"SessionID: {s.id}", f"Agent Name: {a.name}",
                    f"Agent Description: {a.description}", f"Agent ID: {a.agent_id}",
                    f"Voice Name: {v['name']}", f"Voice Type: {v['type']}",
                    f"Voice Temperature: {v['temperature']}", ""
                ]))
                self.session_ready = True
    
                # Invoke Proactive greeting
                if not self.greeting_sent:
                    self.greeting_sent = True
                    logger.info("Sending proactive greeting request")
                    try:
                        await conn.conversation.item.create(
                            item=MessageItem(
                                role="system",
                                content=[
                                    InputTextContentPart(
                                        text="Say something to welcome the user in English."
                                    )
                                ]
                            )
                        )
                        await conn.response.create()
                    except Exception:
                        logger.exception("Failed to send proactive greeting request")
    
                # Start audio capture once session is ready
                ap.start_capture()
    
            elif event.type == ServerEventType.CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_COMPLETED:
                print(f'šŸ‘¤ You said:\t{event.get("transcript", "")}')
                await write_conversation_log(f'User Input:\t{event.get("transcript", "")}')
    
            elif event.type == ServerEventType.RESPONSE_TEXT_DONE:
                print(f'šŸ¤– Agent responded with text:\t{event.get("text", "")}')
                await write_conversation_log(f'Agent Text Response:\t{event.get("text", "")}')
    
            elif event.type == ServerEventType.RESPONSE_AUDIO_TRANSCRIPT_DONE:
                print(f'šŸ¤– Agent responded with audio transcript:\t{event.get("transcript", "")}')
                await write_conversation_log(f'Agent Audio Response:\t{event.get("transcript", "")}')
    
            elif event.type == ServerEventType.INPUT_AUDIO_BUFFER_SPEECH_STARTED:
                logger.info("User started speaking - stopping playback")
                print("šŸŽ¤ Listening...")
    
                ap.skip_pending_audio()
    
                # Only cancel if response is active and not already done
                if self._active_response and not self._response_api_done:
                    try:
                        await conn.response.cancel()
                        logger.debug("Cancelled in-progress response due to barge-in")
                    except Exception as e:
                        if "no active response" in str(e).lower():
                            logger.debug("Cancel ignored - response already completed")
                        else:
                            logger.warning("Cancel failed: %s", e)
    
            elif event.type == ServerEventType.INPUT_AUDIO_BUFFER_SPEECH_STOPPED:
                logger.info("šŸŽ¤ User stopped speaking")
                print("šŸ¤” Processing...")
    
            elif event.type == ServerEventType.RESPONSE_CREATED:
                logger.info("šŸ¤– Assistant response created")
                self._active_response = True
                self._response_api_done = False
    
            elif event.type == ServerEventType.RESPONSE_AUDIO_DELTA:
                logger.debug("Received audio delta")
                ap.queue_audio(event.delta)
    
            elif event.type == ServerEventType.RESPONSE_AUDIO_DONE:
                logger.info("šŸ¤– Assistant finished speaking")
                print("šŸŽ¤ Ready for next input...")
    
            elif event.type == ServerEventType.RESPONSE_DONE:
                logger.info("āœ… Response complete")
                self._active_response = False
                self._response_api_done = True
    
            elif event.type == ServerEventType.ERROR:
                msg = event.error.message
                if "Cancellation failed: no active response" in msg:
                    logger.debug("Benign cancellation error: %s", msg)
                else:
                    logger.error("āŒ VoiceLive error: %s", msg)
                    print(f"Error: {msg}")
    
            elif event.type == ServerEventType.CONVERSATION_ITEM_CREATED:
                logger.debug("Conversation item created: %s", event.item.id)
    
            else:
                logger.debug("Unhandled event type: %s", event.type)
    
    async def write_conversation_log(message: str) -> None:
        """Append a message to the conversation log."""
        log_path = os.path.join(_script_dir, 'logs', logfilename)

        def _append() -> None:
            # Use a context manager so the file handle is closed deterministically
            with open(log_path, 'a', encoding='utf-8') as f:
                f.write(message + "\n")

        await asyncio.to_thread(_append)
    
    def main() -> None:
        """Main function."""
        endpoint = os.environ.get("VOICELIVE_ENDPOINT", "")
        voice_name = os.environ.get("VOICE_NAME", "en-US-Ava:DragonHDLatestNeural")
        agent_name = os.environ.get("AGENT_NAME", "")
        agent_version = os.environ.get("AGENT_VERSION")
        project_name = os.environ.get("PROJECT_NAME", "")
        conversation_id = os.environ.get("CONVERSATION_ID")
        foundry_resource_override = os.environ.get("FOUNDRY_RESOURCE_OVERRIDE")
        agent_authentication_identity_client_id = os.environ.get("AGENT_AUTHENTICATION_IDENTITY_CLIENT_ID")
    
        print("Environment variables:")
        print(f"VOICELIVE_ENDPOINT: {endpoint}")
        print(f"VOICE_NAME: {voice_name}")
        print(f"AGENT_NAME: {agent_name}")
        print(f"AGENT_VERSION: {agent_version}")
        print(f"PROJECT_NAME: {project_name}")
        print(f"CONVERSATION_ID: {conversation_id}")
        print(f"FOUNDRY_RESOURCE_OVERRIDE: {foundry_resource_override}")
        print(f"AGENT_AUTHENTICATION_IDENTITY_CLIENT_ID: {agent_authentication_identity_client_id}")
    
        if not endpoint or not agent_name or not project_name:
            sys.exit("Set VOICELIVE_ENDPOINT, AGENT_NAME, and PROJECT_NAME in your .env file.")
    
        # Create client with appropriate credential (Entra ID required for Agent mode)
        credential = AzureCliCredential()
        logger.info("Using Azure token credential")
    
        # Create and start voice assistant
        assistant = BasicVoiceAssistant(
            endpoint=endpoint,
            credential=credential,
            voice=voice_name,
            agent_name=agent_name,
            agent_version=agent_version,
            project_name=project_name,
            conversation_id=conversation_id,
            foundry_resource_override=foundry_resource_override,
            agent_authentication_identity_client_id=agent_authentication_identity_client_id,
        )
    
        # Handle SIGTERM for graceful shutdown (SIGINT already raises KeyboardInterrupt)
        def _raise_keyboard_interrupt(*_args):
            raise KeyboardInterrupt
        signal.signal(signal.SIGTERM, _raise_keyboard_interrupt)
    
        # Start the assistant
        try:
            asyncio.run(assistant.start())
        except KeyboardInterrupt:
            print("\nšŸ‘‹ Voice assistant shut down. Goodbye!")
        except Exception as e:
            print("Fatal Error: ", e)
    
    def _check_audio_devices() -> None:
        """Verify audio input/output devices are available."""
        p = pyaudio.PyAudio()
        try:
            def _has_channels(key):
                return any(
                    cast(Union[int, float], p.get_device_info_by_index(i).get(key, 0) or 0) > 0
                    for i in range(p.get_device_count())
                )
            if not _has_channels("maxInputChannels"):
                sys.exit("āŒ No audio input devices found. Please check your microphone.")
            if not _has_channels("maxOutputChannels"):
                sys.exit("āŒ No audio output devices found. Please check your speakers.")
        finally:
            p.terminate()
    
    if __name__ == "__main__":
        try:
            _check_audio_devices()
        except SystemExit:
            raise
        except Exception as e:
            sys.exit(f"āŒ Audio system check failed: {e}")
    
        print("šŸŽ™ļø Basic Foundry Voice Agent with Azure VoiceLive SDK (Agent Mode)")
        print("=" * 65)
        main()
    
  2. Sign in to Azure with the following command:

    az login
    
  3. Run the Python file.

    python voice-live-agents-quickstart.py
    
  4. Start speaking with the agent and listen to its responses. You can interrupt the model at any time by speaking. Press Ctrl+C to quit the conversation.

Output

The output of the script is printed to the console. You see messages indicating the status of the connection, audio stream, and playback. The audio is played back through your speakers or headphones.

šŸŽ™ļø  Basic Voice Assistant with Azure VoiceLive SDK
==================================================

============================================================
šŸŽ¤ VOICE ASSISTANT READY
Start speaking to begin conversation
Press Ctrl+C to exit
============================================================

šŸŽ¤ Listening...
šŸ¤” Processing...
šŸ‘¤ You said:  User Input:       Hello.
šŸŽ¤ Ready for next input...
šŸ¤– Agent responded with audio transcript:  Agent Audio Response:        Hello! I'm Tobi the agent. How can I assist you today?
šŸŽ¤ Listening...
šŸ¤” Processing...
šŸ‘¤ You said:  User Input:       What are the opening hours of the Eiffel Tower?
šŸŽ¤ Ready for next input...
šŸ¤– Agent responded with audio transcript:  Agent Audio Response:        The Eiffel Tower's opening hours can vary depending on the season and any special events or maintenance. Generally, the Eiffel Tower is open every day of the year, with the following typical hours:

- Mid-June to early September: 9:00 AM to 12:45 AM (last elevator ride up at 12:00 AM)
- Rest of the year: 9:30 AM to 11:45 PM (last elevator ride up at 11:00 PM)

These times can sometimes change, so it's always best to check the official Eiffel Tower website or contact them directly for the most up-to-date information before your visit.

Would you like me to help you find the official website or any other details about visiting the Eiffel Tower?

šŸ‘‹ Voice assistant shut down. Goodbye!

The script that you ran creates a log file named <timestamp>_voicelive.log in the logs folder.

logging.basicConfig(
    filename=f'logs/{timestamp}_voicelive.log',
    filemode="w",
    format='%(asctime)s:%(name)s:%(levelname)s:%(message)s',
    level=logging.INFO
)

The voicelive.log file contains information about the connection to the Voice Live API, including the request and response data. You can view the log file to see the details of the conversation.

2026-02-10 18:40:19,183:__main__:INFO:Using Azure token credential
2026-02-10 18:40:19,184:__main__:INFO:Connecting to VoiceLive API with Foundry agent connection MyVoiceAgent for project my-voiceagent-project
2026-02-10 18:40:20,801:azure.identity.aio._internal.decorators:INFO:AzureCliCredential.get_token succeeded
2026-02-10 18:40:21,847:__main__:INFO:AudioProcessor initialized with 24kHz PCM16 mono audio
2026-02-10 18:40:21,847:__main__:INFO:Setting up voice conversation session...
2026-02-10 18:40:21,848:__main__:INFO:Session configuration sent
2026-02-10 18:40:22,174:__main__:INFO:Audio playback system ready
2026-02-10 18:40:22,174:__main__:INFO:Voice assistant ready! Start speaking...
2026-02-10 18:40:22,384:__main__:INFO:Session ready: sess_1m1zrSLJSPjJpzbEOyQpTL
2026-02-10 18:40:22,386:__main__:INFO:Sending proactive greeting request
2026-02-10 18:40:22,419:__main__:INFO:Started audio capture
2026-02-10 18:40:22,722:__main__:INFO:\U0001f916 Assistant response created
2026-02-10 18:40:26,054:__main__:INFO:\U0001f916 Assistant finished speaking
2026-02-10 18:40:26,074:__main__:INFO:\u2705 Response complete
2026-02-10 18:40:32,015:__main__:INFO:User started speaking - stopping playback
2026-02-10 18:40:32,866:__main__:INFO:\U0001f3a4 User stopped speaking
2026-02-10 18:40:32,972:__main__:INFO:\U0001f916 Assistant response created
2026-02-10 18:40:35,750:__main__:INFO:User started speaking - stopping playback
2026-02-10 18:40:35,751:__main__:INFO:\U0001f916 Assistant finished speaking
2026-02-10 18:40:36,171:__main__:INFO:\u2705 Response complete
2026-02-10 18:40:37,117:__main__:INFO:\U0001f3a4 User stopped speaking
2026-02-10 18:40:37,207:__main__:INFO:\U0001f916 Assistant response created
2026-02-10 18:40:41,016:__main__:INFO:\U0001f916 Assistant finished speaking
2026-02-10 18:40:41,023:__main__:INFO:\u2705 Response complete
2026-02-10 18:40:44,818:__main__:INFO:Stopped audio capture
2026-02-10 18:40:44,949:__main__:INFO:Stopped audio playback
2026-02-10 18:40:44,950:__main__:INFO:Audio processor cleaned up
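When diagnosing a failed session, a short filter over the technical log surfaces only warnings and errors. This is a minimal sketch that keys off the levelname field of the `%(asctime)s:%(name)s:%(levelname)s:%(message)s` format shown above; it's a heuristic, since a message body that happens to contain a level name between colons would also match:

```python
import re

# Matches the levelname field between colons, e.g. ":WARNING:"
LEVEL_RE = re.compile(r":(DEBUG|INFO|WARNING|ERROR|CRITICAL):")

def filter_log(lines, levels=frozenset({"WARNING", "ERROR", "CRITICAL"})):
    """Yield only the log lines whose level is in `levels`."""
    for line in lines:
        match = LEVEL_RE.search(line)
        if match and match.group(1) in levels:
            yield line

sample = [
    "2026-02-10 18:40:21,847:__main__:INFO:Session configuration sent",
    "2026-02-10 18:40:25,001:__main__:WARNING:Cancel failed: timeout",
]
print(list(filter_log(sample)))
```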

Additionally, a conversation log file named <timestamp>_conversation.log is created in the logs folder. This file contains the session configuration and a transcript of the user and agent turns.

SessionID: sess_1m1zrSLJSPjJpzbEOyQpTL
Agent Name: VoiceAgentQuickstartTest
Agent Description: 
Agent ID: None
Voice Name: en-US-Ava:DragonHDLatestNeural
Voice Type: azure-standard
Voice Temperature: 0.8

User Input:	Hello.
Agent Audio Response:	Hello! I'm Tobi the agent. How can I assist you today?
User Input:	What are the opening hours of the Eiffel Tower?
Agent Audio Response:	The Eiffel Tower's opening hours can vary depending on the season and any special events or maintenance. Generally, the Eiffel Tower is open every day of the year, with the following typical hours:

- Mid-June to early September: 9:00 AM to 12:45 AM (last elevator ride up at 12:00 AM)
- Rest of the year: 9:30 AM to 11:45 PM (last elevator ride up at 11:00 PM)

These times can sometimes change, so it's always best to check the official Eiffel Tower website or contact them directly for the most up-to-date information before your visit.

Would you like me to help you find the official website or any other details about visiting the Eiffel Tower?

Here are the key differences between the technical log and the conversation log:

Aspect           Conversation log                      Technical log
Audience         Business users, content reviewers     Developers, IT operations
Content          What was said in conversations        How the system is working
Level            Application/conversation level        System/infrastructure level
Troubleshooting  "What did the agent say?"             "Why did the connection fail?"

Example: If your agent wasn't responding, you'd check:

  • voicelive.log → "WebSocket connection failed" or "Audio stream error"
  • conversation.log → "Did the user actually say anything?"

Both logs are complementary: use conversation logs for conversation analysis and testing, and technical logs for system diagnostics.
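For quick conversation analysis, the tab-separated transcript lines can be split with a few lines of Python. This is a minimal sketch based on the `User Input:`/`Agent Audio Response:` prefixes in the sample conversation log above; continuation lines of multi-paragraph responses are ignored here:

```python
def parse_turns(log_text: str) -> list[tuple[str, str]]:
    """Extract (speaker, text) pairs from a conversation log transcript."""
    turns = []
    for line in log_text.splitlines():
        # Each transcript line is "<prefix>:\t<text>"
        prefix, _, text = line.partition("\t")
        if prefix == "User Input:":
            turns.append(("user", text.strip()))
        elif prefix == "Agent Audio Response:":
            turns.append(("agent", text.strip()))
    return turns

sample = (
    "User Input:\tHello.\n"
    "Agent Audio Response:\tHello! How can I assist you today?"
)
print(parse_turns(sample))
```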

Technical log

Purpose: Technical debugging and system monitoring

Contents:

  • WebSocket connection events
  • Audio stream status
  • Error messages and stack traces
  • System-level events (session.created, response.done, etc.)
  • Network connectivity issues
  • Audio processing diagnostics

Format: Structured logging with timestamps, log levels, and technical details

Use Cases:

  • Debugging connection problems
  • Monitoring system performance
  • Troubleshooting audio issues
  • Developer/operations analysis

Conversation log

Purpose: Conversation transcript and user experience tracking

Contents:

  • Agent and project identification
  • Session configuration details
  • User transcripts: "Tell me a story", "Stop"
  • Agent responses: Full story text and follow-up responses
  • Conversation flow and interactions

Format: Plain text, human-readable conversation format

Use Cases:

  • Analyzing conversation quality
  • Reviewing what was actually said
  • Understanding user interactions and agent responses
  • Business/content analysis

Learn how to use Voice Live with Microsoft Foundry Agent Service using the VoiceLive SDK for C#.

Reference documentation | Package (NuGet) | Additional samples on GitHub

You can create and run an application to use Voice Live with agents for real-time voice agents.

  • Using agents allows leveraging a built-in prompt and configuration managed within the agent itself, rather than specifying instructions in the session code.

  • Agents encapsulate more complex logic and behaviors, making it easier to manage and update conversational flows without changing the client code.

  • The agent approach streamlines integration. The agent ID is used to connect and all necessary settings are handled internally, reducing the need for manual configuration in the code.

  • This separation also supports better maintainability and scalability for scenarios where multiple conversational experiences or business logic variations are needed.

To use the Voice Live API without Foundry agents, see the Voice Live API quickstart.

Tip

To use Voice Live, you don't need to deploy an audio model with your Microsoft Foundry resource. Voice Live is fully managed, and the model is automatically deployed for you. For more information about models availability, see the Voice Live overview documentation.

Follow the quickstart below, or start from a fully working web app with a browser-based voice UI.

Prerequisites

Note

This document refers to the Microsoft Foundry (new) portal and the latest Foundry Agent Service version.

  • Assign the Azure AI User role to your user account. You can assign roles in the Azure portal under Access control (IAM) > Add role assignment.

Prepare the environment

  1. Create a new folder voice-live-quickstart and go to the quickstart folder with the following command:

    mkdir voice-live-quickstart && cd voice-live-quickstart
    
  2. Create a .csproj file with the following project configuration:

    <Project Sdk="Microsoft.NET.Sdk">
    
      <PropertyGroup>
        <OutputType>Exe</OutputType>
        <TargetFramework>net8.0</TargetFramework>
        <Nullable>enable</Nullable>
        <ImplicitUsings>enable</ImplicitUsings>
      </PropertyGroup>
    
      <!-- Exclude CreateAgentWithVoiceLive.cs from the main build.
           It's a separate utility that should be compiled independently:
           dotnet run CreateAgentWithVoiceLive.cs -->
      <ItemGroup>
        <Compile Remove="CreateAgentWithVoiceLive.cs" />
      </ItemGroup>
    
      <ItemGroup>
        <PackageReference Include="Azure.AI.VoiceLive" Version="1.1.0-beta.3" />
        <PackageReference Include="Azure.AI.Projects" Version="1.0.0-beta.8" />
        <PackageReference Include="Azure.Identity" Version="1.13.2" />
        <PackageReference Include="NAudio" Version="2.2.1" />
      </ItemGroup>
    
    </Project>
    
  3. Restore NuGet packages:

    dotnet restore
    

Retrieve resource information

Note

The agent integration requires Entra ID authentication. Key-based authentication isn't supported in Agent mode.

Create a new file named .env in the folder where you want to run the code.

In the .env file, add the following environment variables for authentication:

# Settings for the Foundry agent creation utility
PROJECT_ENDPOINT=<endpoint copied from welcome screen>
AGENT_NAME="MyVoiceAgent"
MODEL_DEPLOYMENT_NAME="gpt-4.1-mini"
# Settings for Voice Live (AGENT_NAME above is reused to connect to the agent)
AGENT_VERSION=<version-of-the-agent>
CONVERSATION_ID=<specific conversation id to reconnect to>
PROJECT_NAME=<your_project_name>
VOICELIVE_ENDPOINT=<your_endpoint>
VOICELIVE_API_VERSION=2026-01-01-preview

Replace the default values with your actual project name, agent name, and endpoint values.

Variable name                            Value
PROJECT_ENDPOINT                         The Foundry project endpoint copied from the project welcome screen.
AGENT_NAME                               The name of the agent to use.
AGENT_VERSION                            Optional: The version of the agent to use.
CONVERSATION_ID                          Optional: A specific conversation ID to reconnect to.
PROJECT_NAME                             The name of your Microsoft Foundry project. The project name is the last element of the project endpoint value.
VOICELIVE_ENDPOINT                       This value can be found in the Keys and Endpoint section when examining your resource from the Azure portal.
FOUNDRY_RESOURCE_OVERRIDE                Optional: The Foundry resource name hosting the agent project (for example, my-resource-name).
AGENT_AUTHENTICATION_IDENTITY_CLIENT_ID  Optional: The managed identity client ID of the Voice Live resource.
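As the table notes, PROJECT_NAME is the last element of the project endpoint value, so it can be derived from PROJECT_ENDPOINT. A quick sketch (the endpoint shown is a hypothetical example):

```python
def project_name_from_endpoint(endpoint: str) -> str:
    """The project name is the last path segment of the project endpoint."""
    return endpoint.rstrip("/").split("/")[-1]

# Hypothetical endpoint, for illustration only
print(project_name_from_endpoint(
    "https://my-resource.services.ai.azure.com/api/projects/my-voiceagent-project"
))  # → my-voiceagent-project
```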

Learn more about keyless authentication and setting environment variables.

Note

The C# Foundry Agent SDK (Azure.AI.Projects) uses a connection string instead of an endpoint URL. The agent creation utility in this quickstart reads the connection string from the PROJECT_ENDPOINT environment variable, so set that variable to your project connection string (found in the Foundry portal under Project settings > Connected resources).

Create an agent with Voice Live settings

The agent creation script is a separate utility. Create a temporary console project to run it:

  1. Create a separate folder for the agent creation utility:

    mkdir create-agent && cd create-agent
    dotnet new console --framework net8.0
    
  2. Add the required NuGet packages:

    dotnet add package Azure.AI.Projects --prerelease
    dotnet add package Azure.Identity
    
  3. Replace the contents of Program.cs with the following code:

    // Copyright (c) Microsoft Corporation. All rights reserved.
    // Licensed under the MIT License.
    
    using System.Text;
    using Azure.AI.Projects;
    using Azure.Identity;
    
    /// <summary>
    /// Creates an Azure AI Foundry agent configured for Voice Live sessions.
    ///
    /// Voice Live session settings (voice, VAD, noise reduction, etc.) are stored
    /// in the agent's metadata using a chunking strategy because each metadata value
    /// is limited to 512 characters.
    ///
    /// Required environment variables:
    ///   PROJECT_ENDPOINT - Azure AI Foundry project connection string
    ///   AGENT_NAME               - Name for the agent
    ///   MODEL_DEPLOYMENT_NAME    - Model deployment name (e.g., gpt-4o-mini)
    /// </summary>
    
    // <create_agent>
    var connectionString = Environment.GetEnvironmentVariable("PROJECT_ENDPOINT");
    var agentName = Environment.GetEnvironmentVariable("AGENT_NAME");
    var model = Environment.GetEnvironmentVariable("MODEL_DEPLOYMENT_NAME");
    
    if (string.IsNullOrEmpty(connectionString) || string.IsNullOrEmpty(agentName)
        || string.IsNullOrEmpty(model))
    {
        Console.Error.WriteLine("Set PROJECT_ENDPOINT, AGENT_NAME, and MODEL_DEPLOYMENT_NAME environment variables.");
        return;
    }
    
    // Create the Agents client with Entra ID authentication
    var projectClient = new AIProjectClient(connectionString, new DefaultAzureCredential());
    var agentsClient = projectClient.GetAgentsClient();
    
    // Define Voice Live session settings
    var voiceLiveConfig = """
    {
      "session": {
        "voice": {
          "name": "en-US-Ava:DragonHDLatestNeural",
          "type": "azure-standard",
          "temperature": 0.8
        },
        "input_audio_transcription": {
          "model": "azure-speech"
        },
        "turn_detection": {
          "type": "azure_semantic_vad",
          "end_of_utterance_detection": {
            "model": "semantic_detection_v1_multilingual"
          }
        },
        "input_audio_noise_reduction": { "type": "azure_deep_noise_suppression" },
        "input_audio_echo_cancellation": { "type": "server_echo_cancellation" }
      }
    }
    """;
    
    // Chunk the config into metadata entries (512-char limit per value)
    var metadata = ChunkConfig(voiceLiveConfig.Trim());
    
    // Create the agent with Voice Live configuration in metadata
    var agent = await agentsClient.CreateAgentAsync(
        model: model,
        name: agentName,
        instructions: "You are a helpful assistant that answers general questions",
        metadata: metadata);
    
    Console.WriteLine($"Agent created: {agent.Value.Name} (id: {agent.Value.Id})");
    
    // Verify Voice Live configuration was stored correctly
    var retrieved = await agentsClient.GetAgentAsync(agent.Value.Id);
    var storedConfig = ReassembleConfig(retrieved.Value.Metadata);
    
    if (!string.IsNullOrEmpty(storedConfig))
    {
        Console.WriteLine("\nVoice Live configuration:");
        Console.WriteLine(storedConfig);
    }
    else
    {
        Console.WriteLine("\nVoice Live configuration not found in agent metadata.");
    }
    // </create_agent>
    
    // <chunk_config>
    /// <summary>
    /// Splits a configuration JSON string into chunked metadata entries.
    /// Each metadata value is limited to 512 characters.
    /// </summary>
    static Dictionary<string, string> ChunkConfig(string configJson)
    {
        const int limit = 512;
        var metadata = new Dictionary<string, string>
        {
            ["microsoft.voice-live.configuration"] = configJson[..Math.Min(configJson.Length, limit)]
        };
    
        var remaining = configJson.Length > limit ? configJson[limit..] : "";
        var chunkNum = 1;
        while (remaining.Length > 0)
        {
            var chunk = remaining[..Math.Min(remaining.Length, limit)];
            metadata[$"microsoft.voice-live.configuration.{chunkNum}"] = chunk;
            remaining = remaining.Length > limit ? remaining[limit..] : "";
            chunkNum++;
        }
        return metadata;
    }
    // </chunk_config>
    
    // <reassemble_config>
    /// <summary>
    /// Reassembles chunked Voice Live configuration from agent metadata.
    /// </summary>
    static string ReassembleConfig(IReadOnlyDictionary<string, string>? metadata)
    {
        if (metadata == null) return "";
    
        var config = new StringBuilder();
        if (metadata.TryGetValue("microsoft.voice-live.configuration", out var baseValue))
        {
            config.Append(baseValue);
        }
        var chunkNum = 1;
        while (metadata.TryGetValue($"microsoft.voice-live.configuration.{chunkNum}", out var chunk))
        {
            config.Append(chunk);
            chunkNum++;
        }
        return config.ToString();
    }
    // </reassemble_config>
    
  4. Sign in to Azure with the following command:

    az login
    
  5. Build and run the agent creation script:

    dotnet run
    
  6. Return to the quickstart folder:

    cd ..
    

Talk with a voice agent

The sample code in this quickstart uses Microsoft Entra ID for authentication because the integration currently supports only this authentication method.

The sample connects to Foundry Agent Service by passing AgentSessionConfig to StartSessionAsync(SessionTarget.FromAgent(...)) using these properties:

  • agentName: The agent name to invoke.
  • projectName: The Foundry project containing the agent.
  • AgentVersion: Optional pinned version for controlled rollouts. If omitted, the latest version is used.
  • ConversationId: Optional conversation ID to continue prior conversation context.
  • FoundryResourceOverride: Optional resource name when the agent is hosted on a different Foundry resource.
  • AuthenticationIdentityClientId: Optional managed identity client ID used with cross-resource agent connections.

Note

Agent mode in Voice Live doesn't support key-based authentication for agent invocation. Use Microsoft Entra ID (for example, AzureCliCredential) for agent access. Voice Live resource configuration might still include API keys for non-agent scenarios.

  1. Create the VoiceLiveWithAgentV2.cs file with the following code:

    // Copyright (c) Microsoft Corporation. All rights reserved.
    // Licensed under the MIT License.
    
    using System.Collections.Concurrent;
    using System.Text;
    using System.Text.Json;
    using Azure.AI.VoiceLive;
    using Azure.Identity;
    using NAudio.Wave;
    
    // <all>
    /// <summary>
    /// Voice assistant using Azure AI Voice Live SDK with Foundry Agent support.
    ///
    /// This sample demonstrates:
    /// - Connecting to Voice Live with AgentSessionConfig via SessionTarget.FromAgent()
    /// - Configuring interim responses to bridge latency gaps
    /// - Proactive greeting message on session start
    /// - Real-time audio capture and playback with barge-in support
    /// - Conversation logging to a file
    ///
    /// Required environment variables:
    ///   VOICELIVE_ENDPOINT - Voice Live service endpoint
    ///   AGENT_NAME         - Name of the Foundry agent
    ///   PROJECT_NAME       - Foundry project name (e.g., myproject)
    ///
    /// Optional environment variables:
    ///   VOICE_NAME                              - Voice name (default: en-US-Ava:DragonHDLatestNeural)
    ///   AGENT_VERSION                           - Specific agent version
    ///   CONVERSATION_ID                         - Resume a previous conversation
    ///   FOUNDRY_RESOURCE_OVERRIDE               - Cross-resource Foundry endpoint
    ///   AGENT_AUTHENTICATION_IDENTITY_CLIENT_ID - Managed identity client ID for cross-resource auth
    /// </summary>
    
    // <audio_processor>
    /// <summary>
    /// Manages real-time audio capture from the microphone and playback to the speakers.
    /// Uses a blocking collection for audio buffering and supports barge-in (skip pending audio).
    /// </summary>
    class AudioProcessor : IDisposable
    {
        private readonly VoiceLiveSession _session;
        private const int SampleRate = 24000;
        private const int BitsPerSample = 16;
        private const int Channels = 1;
    
        private WaveInEvent? _waveIn;
        private WaveOutEvent? _waveOut;
        private BufferedWaveProvider? _playbackBuffer;
    
        private readonly BlockingCollection<byte[]> _sendQueue = new(new ConcurrentQueue<byte[]>());
        private readonly BlockingCollection<byte[]> _playbackQueue = new(new ConcurrentQueue<byte[]>());
        private CancellationTokenSource _playbackCts = new();
        private Task? _sendTask;
        private Task? _playbackTask;
        private bool _isCapturing;
    
        public AudioProcessor(VoiceLiveSession session)
        {
            _session = session ?? throw new ArgumentNullException(nameof(session));
        }
    
        public void StartCapture()
        {
            if (_isCapturing) return;
            _isCapturing = true;
    
            _waveIn = new WaveInEvent
            {
                WaveFormat = new WaveFormat(SampleRate, BitsPerSample, Channels),
                BufferMilliseconds = 50
            };
    
            _waveIn.DataAvailable += (sender, e) =>
            {
                if (e.BytesRecorded > 0 && _isCapturing)
                {
                    var audioData = new byte[e.BytesRecorded];
                    Array.Copy(e.Buffer, audioData, e.BytesRecorded);
                    _sendQueue.TryAdd(audioData);
                }
            };
    
            _waveIn.StartRecording();
            _sendTask = Task.Run(ProcessSendQueueAsync);
            Console.WriteLine("šŸŽ¤ Audio capture started");
        }
    
        public void StartPlayback()
        {
            _playbackBuffer = new BufferedWaveProvider(new WaveFormat(SampleRate, BitsPerSample, Channels))
            {
                BufferDuration = TimeSpan.FromSeconds(10),
                DiscardOnBufferOverflow = true
            };
    
            _waveOut = new WaveOutEvent { DesiredLatency = 100 };
            _waveOut.Init(_playbackBuffer);
            _waveOut.Play();
    
            _playbackCts = new CancellationTokenSource();
            _playbackTask = Task.Run(() => ProcessPlaybackQueue(_playbackCts.Token));
        }
    
        public void QueueAudio(byte[] audioData)
        {
            if (audioData.Length > 0)
            {
                _playbackQueue.TryAdd(audioData);
            }
        }
    
        public void SkipPendingAudio()
        {
            // Clear queued audio for barge-in
            while (_playbackQueue.TryTake(out _)) { }
            _playbackBuffer?.ClearBuffer();
        }
    
        private async Task ProcessSendQueueAsync()
        {
            try
            {
                foreach (var audioData in _sendQueue.GetConsumingEnumerable())
                {
                    try
                    {
                        await _session.SendInputAudioAsync(audioData).ConfigureAwait(false);
                    }
                    catch (Exception ex)
                    {
                        Console.Error.WriteLine($"Error sending audio: {ex.Message}");
                    }
                }
            }
            catch (OperationCanceledException) { }
        }
    
        private void ProcessPlaybackQueue(CancellationToken ct)
        {
            try
            {
                foreach (var audioData in _playbackQueue.GetConsumingEnumerable(ct))
                {
                    _playbackBuffer?.AddSamples(audioData, 0, audioData.Length);
                }
            }
            catch (OperationCanceledException) { }
        }
    
        public void Dispose()
        {
            _isCapturing = false;
            _sendQueue.CompleteAdding();
            _playbackCts.Cancel();
    
            _waveIn?.StopRecording();
            _waveIn?.Dispose();
            _waveOut?.Stop();
            _waveOut?.Dispose();
    
            _sendTask?.Wait(TimeSpan.FromSeconds(2));
            _playbackTask?.Wait(TimeSpan.FromSeconds(2));
    
            _sendQueue.Dispose();
            _playbackQueue.Dispose();
            _playbackCts.Dispose();
        }
    }
    // </audio_processor>
    
    // <voice_assistant>
    /// <summary>
    /// Voice assistant that connects to a Foundry Agent via the Voice Live service.
    /// Handles session lifecycle, event processing, and audio I/O.
    /// </summary>
    class BasicVoiceAssistant : IDisposable
    {
        private readonly string _endpoint;
        private readonly AgentSessionConfig _agentConfig;
        private VoiceLiveSession? _session;
        private AudioProcessor? _audioProcessor;
        private bool _greetingSent;
        private bool _activeResponse;
        private bool _responseApiDone;
    
        // Conversation log
        private static readonly string LogFilename = $"conversation_{DateTime.Now:yyyyMMdd_HHmmss}.log";
    
        // <agent_config>
        public BasicVoiceAssistant(string endpoint, string agentName, string projectName,
            string? agentVersion = null, string? conversationId = null,
            string? foundryResourceOverride = null, string? authIdentityClientId = null)
        {
            _endpoint = endpoint;
    
            // Build the agent session configuration
            var config = new AgentSessionConfig(agentName, projectName);
            if (!string.IsNullOrEmpty(agentVersion))
            {
                config.AgentVersion = agentVersion;
            }
            if (!string.IsNullOrEmpty(conversationId))
            {
                config.ConversationId = conversationId;
            }
            if (!string.IsNullOrEmpty(foundryResourceOverride))
            {
                config.FoundryResourceOverride = foundryResourceOverride;
                if (!string.IsNullOrEmpty(authIdentityClientId))
                {
                    config.AuthenticationIdentityClientId = authIdentityClientId;
                }
            }
            _agentConfig = config;
        }
        // </agent_config>
    
        // <start_session>
        public async Task StartAsync(CancellationToken cancellationToken = default)
        {
            Console.WriteLine("Connecting to VoiceLive API with agent config...");
    
            // Create the Voice Live client with Entra ID authentication
            var client = new VoiceLiveClient(
                new Uri(_endpoint),
                new AzureCliCredential());
    
            // Connect using SessionTarget.FromAgent(AgentSessionConfig)
            _session = await client.StartSessionAsync(
                SessionTarget.FromAgent(_agentConfig), cancellationToken).ConfigureAwait(false);
    
            try
            {
                _audioProcessor = new AudioProcessor(_session);
    
                // Configure session options
                await SetupSessionAsync(cancellationToken).ConfigureAwait(false);
    
                _audioProcessor.StartPlayback();
    
                Console.WriteLine();
                Console.WriteLine(new string('=', 65));
                Console.WriteLine("šŸŽ¤ VOICE ASSISTANT READY");
                Console.WriteLine("Start speaking to begin conversation");
                Console.WriteLine("Press Ctrl+C to exit");
                Console.WriteLine(new string('=', 65));
                Console.WriteLine();
    
                // Process events (blocking)
                await ProcessEventsAsync(cancellationToken).ConfigureAwait(false);
            }
            finally
            {
                _audioProcessor?.Dispose();
                _session?.Dispose();
            }
        }
        // </start_session>
    
        // <setup_session>
        private async Task SetupSessionAsync(CancellationToken cancellationToken)
        {
            Console.WriteLine("Setting up voice conversation session...");
    
            // Create session configuration with interim response to bridge latency gaps
            var interimConfig = new LlmInterimResponseConfig
            {
                Instructions = "Create friendly interim responses indicating wait time due to "
                    + "ongoing processing, if any. Do not include in all responses! Do not "
                    + "say you don't have real-time access to information when calling tools!",
            };
            interimConfig.Triggers.Add(InterimResponseTrigger.Tool);
            interimConfig.Triggers.Add(InterimResponseTrigger.Latency);
            interimConfig.LatencyThresholdMs = 100;
    
            var options = new VoiceLiveSessionOptions
            {
                InputAudioFormat = InputAudioFormat.Pcm16,
                OutputAudioFormat = OutputAudioFormat.Pcm16,
                InterimResponse = BinaryData.FromObjectAsJson(interimConfig)
            };
    
            // Send session configuration
            await _session!.ConfigureSessionAsync(options, cancellationToken).ConfigureAwait(false);
    
            Console.WriteLine("Session configuration sent");
        }
        // </setup_session>
    
        // <process_events>
        private async Task ProcessEventsAsync(CancellationToken cancellationToken)
        {
            await foreach (SessionUpdate serverEvent in _session!.GetUpdatesAsync(cancellationToken).ConfigureAwait(false))
            {
                await HandleEventAsync(serverEvent, cancellationToken).ConfigureAwait(false);
            }
        }
        // </process_events>
    
        // <handle_events>
        private async Task HandleEventAsync(SessionUpdate serverEvent, CancellationToken cancellationToken)
        {
            switch (serverEvent)
            {
                case SessionUpdateSessionUpdated sessionUpdated:
                    Console.WriteLine("Session updated and ready");
    
                    var sessionId = sessionUpdated.Session?.Id;
                    WriteLog($"SessionID: {sessionId}\n");
    
                    // Send a proactive greeting
                    if (!_greetingSent)
                    {
                        _greetingSent = true;
                        await SendProactiveGreetingAsync(cancellationToken).ConfigureAwait(false);
                    }
    
                    // Start audio capture once session is ready
                    _audioProcessor?.StartCapture();
                    break;
    
                case SessionUpdateConversationItemInputAudioTranscriptionCompleted transcription:
                    var userText = transcription.Transcript;
                    Console.WriteLine($"šŸ‘¤ You said:\t{userText}");
                    WriteLog($"User Input:\t{userText}");
                    break;
    
                case SessionUpdateResponseAudioTranscriptDone audioTranscriptDone:
                    var agentText = audioTranscriptDone.Transcript;
                    Console.WriteLine($"šŸ¤– Agent responded:\t{agentText}");
                    WriteLog($"Agent Audio Response:\t{agentText}");
                    break;
    
                case SessionUpdateInputAudioBufferSpeechStarted:
                    Console.WriteLine("šŸŽ¤ Listening...");
                    _audioProcessor?.SkipPendingAudio();
    
                    // Cancel in-progress response for barge-in
                    if (_activeResponse && !_responseApiDone)
                    {
                        try
                        {
                            await _session!.CancelResponseAsync(cancellationToken).ConfigureAwait(false);
                        }
                        catch (Exception ex) when (ex.Message?.Contains("no active response") == true)
                        {
                            // Benign - response already completed
                        }
                    }
                    break;
    
                case SessionUpdateInputAudioBufferSpeechStopped:
                    Console.WriteLine("šŸ¤” Processing...");
                    break;
    
                case SessionUpdateResponseCreated:
                    _activeResponse = true;
                    _responseApiDone = false;
                    break;
    
                case SessionUpdateResponseAudioDelta audioDelta:
                    if (audioDelta.Delta != null)
                    {
                        _audioProcessor?.QueueAudio(audioDelta.Delta.ToArray());
                    }
                    break;
    
                case SessionUpdateResponseAudioDone:
                    Console.WriteLine("šŸŽ¤ Ready for next input...");
                    break;
    
                case SessionUpdateResponseDone:
                    _activeResponse = false;
                    _responseApiDone = true;
                    break;
    
                case SessionUpdateError errorEvent:
                    var errorMsg = errorEvent.Error?.Message;
                    if (errorMsg?.Contains("Cancellation failed: no active response") == true)
                    {
                        // Benign cancellation error
                    }
                    else
                    {
                        Console.Error.WriteLine($"VoiceLive error: {errorMsg}");
                    }
                    break;
            }
        }
        // </handle_events>
    
        // <proactive_greeting>
        private async Task SendProactiveGreetingAsync(CancellationToken cancellationToken)
        {
            Console.WriteLine("Sending proactive greeting request");
            try
            {
                // Create a system message to trigger greeting
                await _session!.SendCommandAsync(
                    BinaryData.FromObjectAsJson(new
                    {
                        type = "conversation.item.create",
                        item = new
                        {
                            type = "message",
                            role = "system",
                            content = new[]
                            {
                                new { type = "input_text", text = "Say something to welcome the user in English." }
                            }
                        }
                    }), cancellationToken).ConfigureAwait(false);
    
                // Request a response
                await _session!.SendCommandAsync(
                    BinaryData.FromObjectAsJson(new { type = "response.create" }),
                    cancellationToken).ConfigureAwait(false);
            }
            catch (Exception ex)
            {
                Console.Error.WriteLine($"Failed to send proactive greeting: {ex.Message}");
            }
        }
        // </proactive_greeting>
    
        private static void WriteLog(string message)
        {
            try
            {
                var logDir = Path.Combine(Directory.GetCurrentDirectory(), "logs");
                Directory.CreateDirectory(logDir);
                File.AppendAllText(Path.Combine(logDir, LogFilename), message + Environment.NewLine);
            }
            catch (IOException ex)
            {
                Console.Error.WriteLine($"Failed to write conversation log: {ex.Message}");
            }
        }
    
        public void Dispose()
        {
            _audioProcessor?.Dispose();
            _session?.Dispose();
        }
    }
    // </voice_assistant>
    
    // <main>
    class Program
    {
        static async Task Main(string[] args)
        {
            var endpoint = Environment.GetEnvironmentVariable("VOICELIVE_ENDPOINT");
            var agentName = Environment.GetEnvironmentVariable("AGENT_NAME");
            var projectName = Environment.GetEnvironmentVariable("PROJECT_NAME");
            var agentVersion = Environment.GetEnvironmentVariable("AGENT_VERSION");
            var conversationId = Environment.GetEnvironmentVariable("CONVERSATION_ID");
            var foundryResourceOverride = Environment.GetEnvironmentVariable("FOUNDRY_RESOURCE_OVERRIDE");
            var authIdentityClientId = Environment.GetEnvironmentVariable("AGENT_AUTHENTICATION_IDENTITY_CLIENT_ID");
    
            Console.WriteLine("Environment variables:");
            Console.WriteLine($"VOICELIVE_ENDPOINT: {endpoint}");
            Console.WriteLine($"AGENT_NAME: {agentName}");
            Console.WriteLine($"PROJECT_NAME: {projectName}");
            Console.WriteLine($"AGENT_VERSION: {agentVersion}");
            Console.WriteLine($"CONVERSATION_ID: {conversationId}");
            Console.WriteLine($"FOUNDRY_RESOURCE_OVERRIDE: {foundryResourceOverride}");
    
            if (string.IsNullOrEmpty(endpoint) || string.IsNullOrEmpty(agentName)
                || string.IsNullOrEmpty(projectName))
            {
                Console.Error.WriteLine("Set VOICELIVE_ENDPOINT, AGENT_NAME, and PROJECT_NAME environment variables.");
                return;
            }
    
            // Verify audio devices
            CheckAudioDevices();
    
            Console.WriteLine("šŸŽ™ļø Basic Foundry Voice Agent with Azure VoiceLive SDK (Agent Mode)");
            Console.WriteLine(new string('=', 65));
    
            using var assistant = new BasicVoiceAssistant(
                endpoint, agentName, projectName,
                agentVersion, conversationId,
                foundryResourceOverride, authIdentityClientId);
    
            // Handle graceful shutdown
            using var cts = new CancellationTokenSource();
            Console.CancelKeyPress += (sender, e) =>
            {
                e.Cancel = true;
                cts.Cancel();
            };
    
            try
            {
                await assistant.StartAsync(cts.Token);
            }
            catch (OperationCanceledException)
            {
                Console.WriteLine("\nšŸ‘‹ Voice assistant shut down. Goodbye!");
            }
            catch (Exception ex)
            {
                Console.Error.WriteLine($"Fatal Error: {ex.Message}");
            }
        }
    
        // <check_audio>
        static void CheckAudioDevices()
        {
            if (WaveInEvent.DeviceCount == 0)
            {
                Console.Error.WriteLine("āŒ No audio input devices found. Please check your microphone.");
                Environment.Exit(1);
            }
            // WaveOutEvent doesn't expose a static DeviceCount; verify by
            // attempting to create a playback instance.
            try
            {
                using var testOut = new WaveOutEvent();
            }
            catch
            {
                Console.Error.WriteLine("āŒ No audio output devices found. Please check your speakers.");
                Environment.Exit(1);
            }
        }
        // </check_audio>
    }
    // </main>
    // </all>
    
  2. Sign in to Azure with the following command:

    az login
    
  3. Build and run the voice assistant:

    dotnet run
    
  4. Start speaking with the agent and listen to its responses. You can interrupt the model by speaking over it. Press Ctrl+C to quit the conversation.

Output

The output of the script is printed to the console. You see messages indicating the status of the connection, audio stream, and playback. The audio is played back through your speakers or headphones.

šŸŽ™ļø  Basic Foundry Voice Agent with Azure VoiceLive SDK (Agent Mode)
=================================================================

=================================================================
šŸŽ¤ VOICE ASSISTANT READY
Start speaking to begin conversation
Press Ctrl+C to exit
=================================================================

šŸŽ¤ Listening...
šŸ¤” Processing...
šŸ‘¤ You said:	Hello.
šŸŽ¤ Ready for next input...
šŸ¤– Agent responded:	Hello! I'm Tobi the agent. How can I assist you today?
šŸŽ¤ Listening...
šŸ¤” Processing...
šŸ‘¤ You said:	What are the opening hours of the Eiffel Tower?
šŸŽ¤ Ready for next input...
šŸ¤– Agent responded:	The Eiffel Tower's opening hours can vary depending on the season and any special events or maintenance. Generally, the Eiffel Tower is open every day of the year, with the following typical hours:

- Mid-June to early September: 9:00 AM to 12:45 AM (last elevator ride up at 12:00 AM)
- Rest of the year: 9:30 AM to 11:45 PM (last elevator ride up at 11:00 PM)

These times can sometimes change, so it's always best to check the official Eiffel Tower website or contact them directly for the most up-to-date information before your visit.

Would you like me to help you find the official website or any other details about visiting the Eiffel Tower?

šŸ‘‹ Voice assistant shut down. Goodbye!

A conversation log file is created in the logs folder with the name conversation_YYYYMMDD_HHmmss.log. This file contains session metadata and the conversation transcript, including user inputs and agent responses.

SessionID: sess_1m1zrSLJSPjJpzbEOyQpTL

User Input:	Hello.
Agent Audio Response:	Hello! I'm Tobi the agent. How can I assist you today?
User Input:	What are the opening hours of the Eiffel Tower?
Agent Audio Response:	The Eiffel Tower's opening hours can vary depending on the season...

Here are the key differences between the technical log and the conversation log:

Aspect           Conversation log                     Technical log
Audience         Business users, content reviewers    Developers, IT operations
Content          What was said in conversations       How the system is working
Level            Application/conversation level       System/infrastructure level
Troubleshooting  "What did the agent say?"            "Why did the connection fail?"

Example: If your agent isn't responding, you'd check:

  • Technical log → "WebSocket connection failed" or "Audio stream error"
  • Conversation log → "Did the user actually say anything?"

The two logs are complementary: use conversation logs for conversation analysis and testing, and technical logs for system diagnostics.

Technical log

Purpose: Technical debugging and system monitoring

Contents:

  • WebSocket connection events
  • Audio stream status
  • Error messages and stack traces
  • System-level events (session.created, response.done, etc.)
  • Network connectivity issues
  • Audio processing diagnostics

Format: Structured logging with timestamps, log levels, and technical details

Use Cases:

  • Debugging connection problems
  • Monitoring system performance
  • Troubleshooting audio issues
  • Developer/operations analysis

Conversation log

Purpose: Conversation transcript and user experience tracking

Contents:

  • Agent and project identification
  • Session configuration details
  • User transcripts: "Tell me a story", "Stop"
  • Agent responses: Full story text and follow-up responses
  • Conversation flow and interactions

Format: Plain text, human-readable conversation format

Use Cases:

  • Analyzing conversation quality
  • Reviewing what was actually said
  • Understanding user interactions and agent responses
  • Business/content analysis

Learn how to use Voice Live with Microsoft Foundry Agent Service using the VoiceLive SDK for JavaScript.

Reference documentation | Package (npm) | Additional samples on GitHub

You can create and run an application that uses Voice Live with agents for real-time voice conversations.

  • Agents let you use a built-in prompt and configuration managed within the agent itself, rather than specifying instructions in the session code.

  • Agents encapsulate more complex logic and behaviors, making it easier to manage and update conversational flows without changing the client code.

  • The agent approach streamlines integration: the agent ID is used to connect, and all necessary settings are handled internally, reducing the need for manual configuration in the code.

  • This separation also supports better maintainability and scalability for scenarios where multiple conversational experiences or business logic variations are needed.

To use the Voice Live API without Foundry agents, see the Voice Live API quickstart.

Tip

To use Voice Live, you don't need to deploy an audio model with your Microsoft Foundry resource. Voice Live is fully managed, and the model is automatically deployed for you. For more information about model availability, see the Voice Live overview documentation.

Follow the quickstart below, or start from a fully working web app with a browser-based voice UI.

Note

The JavaScript Voice Live SDK is designed for browser-based applications with built-in WebSocket and Web Audio support. This quickstart uses Node.js with node-record-lpcm16 and speaker for a console experience.

Prerequisites

Note

This document refers to the Microsoft Foundry (new) portal and the latest Foundry Agent Service version.

  • Assign the Azure AI User role to your user account. You can assign roles in the Azure portal under Access control (IAM) > Add role assignment.

Prepare the environment

  1. Create a new folder voice-live-quickstart and go to the quickstart folder with the following command:

    mkdir voice-live-quickstart && cd voice-live-quickstart
    
  2. Create a package.json file with the following content:

    {
      "name": "voice-live-quickstart",
      "version": "1.0.0",
      "private": true,
      "type": "module",
      "dependencies": {
        "@azure/ai-voicelive": "1.0.0-beta.3",
        "@azure/ai-agents": "1.2.0-beta.2",
        "@azure/identity": "^4.6.0",
        "dotenv": "^16.4.7"
      },
      "optionalDependencies": {
        "node-record-lpcm16": "^1.0.1",
        "speaker": "^0.5.5"
      }
    }
    
  3. Install the dependencies:

    npm install
    

Retrieve resource information

Note

The agent integration requires Entra ID authentication. Key-based authentication isn't supported in Agent mode.

Create a new file named .env in the folder where you want to run the code.

In the .env file, add the following environment variables for authentication:

# Settings for Foundry Agent
PROJECT_ENDPOINT=<endpoint copied from welcome screen>
AGENT_NAME="MyVoiceAgent"
MODEL_DEPLOYMENT_NAME="gpt-4.1-mini"
# Settings for Voice Live (AGENT_NAME above is reused by the voice client)
AGENT_VERSION=<version-of-the-agent>
CONVERSATION_ID=<specific conversation id to reconnect to>
PROJECT_NAME=<your_project_name>
VOICELIVE_ENDPOINT=<your_endpoint>
VOICELIVE_API_VERSION=2026-01-01-preview

Replace the placeholder values with your actual project name, agent name, and endpoint values.

Variable name                             Value
PROJECT_ENDPOINT                          The Foundry project endpoint copied from the project welcome screen.
AGENT_NAME                                The name of the agent to use.
AGENT_VERSION                             Optional: The version of the agent to use.
CONVERSATION_ID                           Optional: A specific conversation ID to reconnect to.
PROJECT_NAME                              The name of your Microsoft Foundry project. The project name is the last element of the project endpoint value.
VOICELIVE_ENDPOINT                        The Voice Live endpoint, found in the Keys and Endpoint section of your resource in the Azure portal.
VOICELIVE_API_VERSION                     The Voice Live API version to use, for example 2026-01-01-preview.
FOUNDRY_RESOURCE_OVERRIDE                 Optional: The Foundry resource name hosting the agent project (for example, my-resource-name).
AGENT_AUTHENTICATION_IDENTITY_CLIENT_ID   Optional: The managed identity client ID of the Voice Live resource.

Learn more about keyless authentication and setting environment variables.
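Because the project name is the last element of the project endpoint, you can derive PROJECT_NAME programmatically instead of copying it by hand. The helper below is a hypothetical sketch, and the endpoint URL in it is a made-up example following the usual `https://<resource>.services.ai.azure.com/api/projects/<project>` shape:

```javascript
// Hypothetical helper: derive PROJECT_NAME from PROJECT_ENDPOINT.
// The project name is the last path segment of the endpoint URL.
function projectNameFromEndpoint(endpoint) {
  const segments = new URL(endpoint).pathname.split("/").filter(Boolean);
  return segments[segments.length - 1];
}

// Example endpoint (made up for illustration):
const endpoint =
  "https://my-resource.services.ai.azure.com/api/projects/my-project";
console.log(projectNameFromEndpoint(endpoint)); // my-project
```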

Create an agent with Voice Live settings

  1. Create a file create-agent-with-voicelive.js with the following code:

    // Copyright (c) Microsoft Corporation. All rights reserved.
    // Licensed under the MIT License.
    
    // Create a Foundry agent with Voice Live session configuration in metadata.
    // Uses @azure/ai-agents SDK to create the agent and store chunked Voice Live
    // session settings so the VoiceLive service can pick them up at connection time.
    
    import "dotenv/config";
    import { AgentsClient } from "@azure/ai-agents";
    import { DefaultAzureCredential } from "@azure/identity";
    
    // ---------------------------------------------------------------------------
    // Voice Live configuration chunking helpers (512-char metadata value limit)
    // ---------------------------------------------------------------------------
    
    /**
     * Split a JSON config string into chunked metadata entries.
     * @param {string} configJson - Serialized JSON configuration.
     * @param {number} [limit=512] - Maximum characters per metadata value.
     * @returns {Record<string, string>} Metadata key/value pairs.
     */
    function chunkConfig(configJson, limit = 512) {
      const metadata = {
        "microsoft.voice-live.configuration": configJson.slice(0, limit),
      };
      let remaining = configJson.slice(limit);
      let chunkNum = 1;
      while (remaining.length > 0) {
        metadata[`microsoft.voice-live.configuration.${chunkNum}`] =
          remaining.slice(0, limit);
        remaining = remaining.slice(limit);
        chunkNum++;
      }
      return metadata;
    }
    
    /**
     * Reassemble a chunked Voice Live configuration from metadata.
     * @param {Record<string, string>} metadata - Agent metadata.
     * @returns {string} The full JSON configuration string.
     */
    function reassembleConfig(metadata) {
      let config = metadata["microsoft.voice-live.configuration"] ?? "";
      let chunkNum = 1;
      while (`microsoft.voice-live.configuration.${chunkNum}` in metadata) {
        config += metadata[`microsoft.voice-live.configuration.${chunkNum}`];
        chunkNum++;
      }
      return config;
    }
    
    // ---------------------------------------------------------------------------
    // Main
    // ---------------------------------------------------------------------------
    async function main() {
      const endpoint = process.env.PROJECT_ENDPOINT;
      const agentName = process.env.AGENT_NAME;
      const modelDeployment = process.env.MODEL_DEPLOYMENT_NAME;
    
      if (!endpoint || !agentName || !modelDeployment) {
        console.error(
          "Set PROJECT_ENDPOINT, AGENT_NAME, and MODEL_DEPLOYMENT_NAME in your .env file.",
        );
        process.exit(1);
      }
    
      const credential = new DefaultAzureCredential();
      const client = new AgentsClient(endpoint, credential);
    
      // Define Voice Live session settings
      const voiceLiveConfig = {
        session: {
          voice: {
            name: "en-US-Ava:DragonHDLatestNeural",
            type: "azure-standard",
            temperature: 0.8,
          },
          input_audio_transcription: {
            model: "azure-speech",
          },
          turn_detection: {
            type: "azure_semantic_vad",
            end_of_utterance_detection: {
              model: "semantic_detection_v1_multilingual",
            },
          },
          input_audio_noise_reduction: { type: "azure_deep_noise_suppression" },
          input_audio_echo_cancellation: { type: "server_echo_cancellation" },
        },
      };
    
      // Create the agent with Voice Live configuration stored in metadata
      const configJson = JSON.stringify(voiceLiveConfig);
      const agent = await client.createAgent(modelDeployment, {
        name: agentName,
        instructions:
          "You are a helpful assistant that answers general questions",
        metadata: chunkConfig(configJson),
      });
      console.log(`Agent created: ${agent.name} (id: ${agent.id})`);
    
      // Verify the stored configuration
      const retrieved = await client.getAgent(agent.id);
      const storedConfig = reassembleConfig(retrieved.metadata ?? {});
    
      if (storedConfig) {
        console.log("\nVoice Live configuration:");
        console.log(JSON.stringify(JSON.parse(storedConfig), null, 2));
      } else {
        console.log("\nVoice Live configuration not found in agent metadata.");
      }
    }
    
    main().catch((err) => {
      console.error("Error:", err);
      process.exit(1);
    });
    
  2. Sign in to Azure with the following command:

    az login
    
  3. Run the agent creation script:

    node create-agent-with-voicelive.js
    
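The 512-character metadata chunking used by the script above can be exercised on its own. The following standalone sketch duplicates the two helpers so you can verify that a configuration longer than one metadata value survives the split/reassemble round trip:

```javascript
// Split a JSON config string into chunked metadata entries (512-char limit).
function chunkConfig(configJson, limit = 512) {
  const metadata = {
    "microsoft.voice-live.configuration": configJson.slice(0, limit),
  };
  let remaining = configJson.slice(limit);
  let chunkNum = 1;
  while (remaining.length > 0) {
    metadata[`microsoft.voice-live.configuration.${chunkNum}`] =
      remaining.slice(0, limit);
    remaining = remaining.slice(limit);
    chunkNum++;
  }
  return metadata;
}

// Reassemble the full JSON string from the numbered chunks.
function reassembleConfig(metadata) {
  let config = metadata["microsoft.voice-live.configuration"] ?? "";
  let chunkNum = 1;
  while (`microsoft.voice-live.configuration.${chunkNum}` in metadata) {
    config += metadata[`microsoft.voice-live.configuration.${chunkNum}`];
    chunkNum++;
  }
  return config;
}

// A ~1,225-character config splits into three 512-char-max chunks.
const config = JSON.stringify({ session: { filler: "x".repeat(1200) } });
const metadata = chunkConfig(config);
console.log(Object.keys(metadata).length); // 3
console.log(reassembleConfig(metadata) === config); // true
```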

Talk with a voice agent

The sample code in this quickstart uses Microsoft Entra ID for authentication, because the current integration supports only this authentication method.

The sample connects to Foundry Agent Service by passing an agent config object to client.createSession(...) using these fields:

  • agentName: The agent name to invoke.
  • projectName: The Foundry project containing the agent.
  • agentVersion: Optional pinned version for controlled rollouts. If omitted, the latest version is used.
  • conversationId: Optional conversation ID to continue prior conversation context.
  • foundryResourceOverride: Optional resource name when the agent is hosted on a different Foundry resource.
  • authenticationIdentityClientId: Optional managed identity client ID used with cross-resource agent connections.
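Built from the environment variables in your .env file, the agent config object might look like the sketch below. The field names come from the list above; including the optional fields only when the corresponding variable is set mirrors the conditional configuration in the C# sample:

```javascript
// Sketch: assemble the agent config passed to client.createSession(...).
// agentName and projectName are required; the rest are optional.
const agentConfig = {
  agentName: process.env.AGENT_NAME,
  projectName: process.env.PROJECT_NAME,
  // Spread in optional fields only when the variable is set
  // (spreading a falsy value into an object literal is a no-op):
  ...(process.env.AGENT_VERSION && {
    agentVersion: process.env.AGENT_VERSION,
  }),
  ...(process.env.CONVERSATION_ID && {
    conversationId: process.env.CONVERSATION_ID,
  }),
  ...(process.env.FOUNDRY_RESOURCE_OVERRIDE && {
    foundryResourceOverride: process.env.FOUNDRY_RESOURCE_OVERRIDE,
    // The identity client ID is only meaningful with a resource override:
    ...(process.env.AGENT_AUTHENTICATION_IDENTITY_CLIENT_ID && {
      authenticationIdentityClientId:
        process.env.AGENT_AUTHENTICATION_IDENTITY_CLIENT_ID,
    }),
  }),
};
console.log(agentConfig);
```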

Note

Agent mode in Voice Live doesn't support key-based authentication for agent invocation. Use Microsoft Entra ID (for example, DefaultAzureCredential) for agent access. Voice Live resource configuration might still include API keys for non-agent scenarios.

  1. Create the voice-live-with-agent.js file with the following code:

    // Copyright (c) Microsoft Corporation. All rights reserved.
    // Licensed under the MIT License.
    
    // Voice Live with Foundry Agent Service v2 - Node.js Console Voice Assistant
    // Uses @azure/ai-voicelive SDK with handler-based event subscription pattern.
    
    import "dotenv/config";
    import { VoiceLiveClient } from "@azure/ai-voicelive";
    import { DefaultAzureCredential } from "@azure/identity";
    import { spawn } from "node:child_process";
    import { existsSync, mkdirSync, appendFileSync } from "node:fs";
    import { join, dirname } from "node:path";
    import { fileURLToPath } from "node:url";
    
    const __dirname = dirname(fileURLToPath(import.meta.url));
    
    // ---------------------------------------------------------------------------
    // Logging and conversation log setup
    // ---------------------------------------------------------------------------
    const logsDir = join(__dirname, "logs");
    if (!existsSync(logsDir)) mkdirSync(logsDir, { recursive: true });
    
    const timestamp = new Date()
      .toISOString()
      .replace(/[:.]/g, "-")
      .replace("T", "_")
      .slice(0, 19);
    const conversationLogFile = join(logsDir, `conversation_${timestamp}.log`);
    
    function writeConversationLog(message) {
      appendFileSync(conversationLogFile, message + "\n", "utf-8");
    }
    
    // ---------------------------------------------------------------------------
    // Audio helpers
    // ---------------------------------------------------------------------------
    
    /**
     * AudioProcessor manages microphone capture via node-record-lpcm16
     * and playback via the speaker npm package. Audio format: 24 kHz, 16-bit, mono.
     */
    class AudioProcessor {
      constructor(enableAudio = true, inputDevice = undefined) {
        this._enableAudio = enableAudio;
        this._inputDevice = inputDevice;
        this._recorder = null;
        this._soxProcess = null;
        this._speaker = null;
        this._skipSeq = 0;
        this._nextSeq = 0;
        this._recordModule = null;
        this._speakerCtor = null;
      }
    
      async _ensureAudioModulesLoaded() {
        if (!this._enableAudio) return;
        if (this._recordModule && this._speakerCtor) return;
    
        try {
          const recordModule = await import("node-record-lpcm16");
          const speakerModule = await import("speaker");
          this._recordModule = recordModule.default;
          this._speakerCtor = speakerModule.default;
        } catch {
          throw new Error(
            "Audio dependencies are unavailable. Install optional packages (node-record-lpcm16, speaker) and required native build tools, or run with --no-audio for connectivity-only validation.",
          );
        }
      }
    
      /** Start capturing microphone audio and forward PCM chunks to the session. */
      async startCapture(session) {
        if (!this._enableAudio) {
          console.log("[audio] --no-audio enabled: microphone capture skipped");
          return;
        }
        if (this._recorder || this._soxProcess) return;
    
        if (this._inputDevice) {
          console.log(`[audio] Using explicit input device: ${this._inputDevice}`);
    
          const soxArgs = [
            "-q",
            "-t",
            "waveaudio",
            this._inputDevice,
            "-r",
            "24000",
            "-c",
            "1",
            "-e",
            "signed-integer",
            "-b",
            "16",
            "-t",
            "raw",
            "-",
          ];
    
          this._soxProcess = spawn("sox", soxArgs, {
            stdio: ["ignore", "pipe", "pipe"],
          });
    
          this._soxProcess.stdout.on("data", (chunk) => {
            if (session.isConnected) {
              session.sendAudio(new Uint8Array(chunk)).catch(() => {
                /* ignore send errors after disconnect */
              });
            }
          });
    
          this._soxProcess.stderr.on("data", (data) => {
            const msg = data.toString().trim();
            if (msg) {
              console.error(`[audio] sox stderr: ${msg}`);
            }
          });
    
          this._soxProcess.on("error", (error) => {
            console.error(`[audio] SoX process error: ${error?.message ?? error}`);
          });
    
          this._soxProcess.on("close", (code) => {
            if (code !== 0) {
              console.error(`[audio] SoX exited with code ${code}`);
            }
            this._soxProcess = null;
          });
    
          console.log("[audio] Microphone capture started");
          return;
        }
    
        await this._ensureAudioModulesLoaded();
    
        this._recorder = this._recordModule.record({
          sampleRate: 24000,
          channels: 1,
          audioType: "raw",
          recorder: "sox",
          encoding: "signed-integer",
          bitwidth: 16,
        });
    
        const recorderStream = this._recorder.stream();
    
        recorderStream.on("data", (chunk) => {
          if (session.isConnected) {
            session.sendAudio(new Uint8Array(chunk)).catch(() => {
              /* ignore send errors after disconnect */
            });
          }
        });
    
        recorderStream.on("error", (error) => {
          console.error(`[audio] Recorder stream error: ${error?.message ?? error}`);
          console.error(
            "[audio] SoX capture failed. Check microphone permissions/device and run with DEBUG=record for details.",
          );
        });
    
        console.log("[audio] Microphone capture started");
      }
    
      /** Initialize the speaker for playback. */
      async startPlayback() {
        if (!this._enableAudio) {
          console.log("[audio] --no-audio enabled: speaker playback skipped");
          return;
        }
        if (this._speaker) return;
        await this._resetSpeaker();
        console.log("[audio] Playback ready");
      }
    
      /** Queue a PCM16 buffer (base64 from service) for playback. */
      queueAudio(base64Delta) {
        const seq = this._nextSeq++;
        if (seq < this._skipSeq) return; // skip if barge-in happened
        const buf = Buffer.from(base64Delta, "base64");
        if (this._speaker && !this._speaker.destroyed) {
          this._speaker.write(buf);
        }
      }
    
      /** Discard queued audio (barge-in). */
      skipPendingAudio() {
        if (!this._enableAudio) return;
        this._skipSeq = this._nextSeq++;
        // Reset speaker to flush its internal buffer
        this._resetSpeaker().catch(() => {
          // best-effort reset
        });
      }
    
      /** Shut down capture and playback. */
      shutdown() {
        if (this._soxProcess) {
          try {
            this._soxProcess.kill();
          } catch {
            /* ignore */
          }
          this._soxProcess = null;
        }
    
        if (this._recorder) {
          this._recorder.stop();
          this._recorder = null;
        }
        if (this._speaker) {
          this._speaker.end();
          this._speaker = null;
        }
        console.log("[audio] Audio processor shut down");
      }
    
      /** (Re-)create the Speaker instance. */
      async _resetSpeaker() {
        await this._ensureAudioModulesLoaded();
    
        if (this._speaker && !this._speaker.destroyed) {
          try {
            this._speaker.destroy();
          } catch {
            /* ignore */
          }
        }
        this._speaker = new this._speakerCtor({
          channels: 1,
          bitDepth: 16,
          sampleRate: 24000,
          signed: true,
        });
        // Swallow speaker errors (e.g. device busy after barge-in reset)
        this._speaker.on("error", () => {});
      }
    }
    
    // ---------------------------------------------------------------------------
    // BasicVoiceAssistant
    // ---------------------------------------------------------------------------
    class BasicVoiceAssistant {
      /**
       * @param {object} opts
       * @param {string} opts.endpoint
       * @param {import("@azure/identity").TokenCredential} opts.credential
       * @param {string} opts.agentName
       * @param {string} opts.projectName
       * @param {string} [opts.agentVersion]
       * @param {string} [opts.conversationId]
       * @param {string} [opts.foundryResourceOverride]
       * @param {string} [opts.authenticationIdentityClientId]
       * @param {string} [opts.audioInputDevice]
       * @param {string} [opts.greetingText]
       * @param {boolean} [opts.noAudio]
       */
      constructor(opts) {
        this.endpoint = opts.endpoint;
        this.credential = opts.credential;
        this.greetingText = opts.greetingText;
        this.noAudio = opts.noAudio;
        this.agentConfig = {
          agentName: opts.agentName,
          projectName: opts.projectName,
          ...(opts.agentVersion && { agentVersion: opts.agentVersion }),
          ...(opts.conversationId && { conversationId: opts.conversationId }),
          ...(opts.foundryResourceOverride && {
            foundryResourceOverride: opts.foundryResourceOverride,
          }),
          ...(opts.foundryResourceOverride &&
            opts.authenticationIdentityClientId && {
              authenticationIdentityClientId: opts.authenticationIdentityClientId,
            }),
        };
    
        this._session = null;
        this._audio = new AudioProcessor(!opts.noAudio, opts.audioInputDevice);
        this._greetingSent = false;
        this._activeResponse = false;
        this._responseApiDone = false;
      }
    
      /** Connect, subscribe to events, and run until interrupted. */
      async start() {
        const client = new VoiceLiveClient(this.endpoint, this.credential);
        const session = client.createSession({ agent: this.agentConfig });
        this._session = session;
    
        console.log(
          `[init] Connecting to VoiceLive with agent "${this.agentConfig.agentName}" ` +
            `for project "${this.agentConfig.projectName}" ...`,
        );
    
        // Subscribe to VoiceLive events BEFORE connecting, so the
        // SESSION_UPDATED event is not missed.
        const subscription = session.subscribe({
          onSessionUpdated: async (event, context) => {
            const s = event.session;
            const agent = s?.agent;
            const voice = s?.voice;
            console.log(`[session] Session ready: ${context.sessionId}`);
            writeConversationLog(
              [
                `SessionID: ${context.sessionId}`,
                `Agent Name: ${agent?.name ?? ""}`,
                `Agent Description: ${agent?.description ?? ""}`,
                `Agent ID: ${agent?.agentId ?? ""}`,
                `Voice Name: ${voice?.name ?? ""}`,
                `Voice Type: ${voice?.type ?? ""}`,
                "",
              ].join("\n"),
            );
          },
    
          onConversationItemInputAudioTranscriptionCompleted: async (event) => {
            const transcript = event.transcript ?? "";
            console.log(`šŸ‘¤ You said:\t${transcript}`);
            writeConversationLog(`User Input:\t${transcript}`);
          },
    
          onResponseTextDone: async (event) => {
            const text = event.text ?? "";
            console.log(`šŸ¤– Agent responded with text:\t${text}`);
            writeConversationLog(`Agent Text Response:\t${text}`);
          },
    
          onResponseAudioTranscriptDone: async (event) => {
            const transcript = event.transcript ?? "";
            console.log(`šŸ¤– Agent responded with audio transcript:\t${transcript}`);
            writeConversationLog(`Agent Audio Response:\t${transcript}`);
          },
    
          onInputAudioBufferSpeechStarted: async () => {
            console.log("šŸŽ¤ Listening...");
            this._audio.skipPendingAudio();
    
            // Cancel in-progress response (barge-in)
            if (this._activeResponse && !this._responseApiDone) {
              try {
                await session.sendEvent({ type: "response.cancel" });
              } catch (err) {
                const msg = err?.message ?? "";
                if (!msg.toLowerCase().includes("no active response")) {
                  console.warn("[barge-in] Cancel failed:", msg);
                }
              }
            }
          },
    
          onInputAudioBufferSpeechStopped: async () => {
            console.log("šŸ¤” Processing...");
          },
    
          onResponseCreated: async () => {
            this._activeResponse = true;
            this._responseApiDone = false;
          },
    
          onResponseAudioDelta: async (event) => {
            if (event.delta) {
              this._audio.queueAudio(event.delta);
            }
          },
    
          onResponseAudioDone: async () => {
            console.log("šŸŽ¤ Ready for next input...");
          },
    
          onResponseDone: async () => {
            console.log("āœ… Response complete");
            this._activeResponse = false;
            this._responseApiDone = true;
          },
    
          onServerError: async (event) => {
            const msg = event.error?.message ?? "";
            if (msg.includes("Cancellation failed: no active response")) {
              // Benign – ignore
              return;
            }
            console.error(`āŒ VoiceLive error: ${msg}`);
          },
    
          onConversationItemCreated: async (event) => {
            console.log(`[event] Conversation item created: ${event.item?.id ?? ""}`);
          },
        });
    
        // Connect after subscribing so SESSION_UPDATED is not missed
        await session.connect();
        console.log("[init] Connected to VoiceLive session websocket");
    
        // Configure session eagerly after connect
        await this._setupSession();
    
        // Proactive greeting
        if (!this._greetingSent) {
          this._greetingSent = true;
          await this._sendProactiveGreeting();
        }
    
        // Start audio after session is configured
        await this._audio.startPlayback();
        await this._audio.startCapture(session);
    
        console.log("\n" + "=".repeat(65));
        console.log("šŸŽ¤ VOICE ASSISTANT READY");
        console.log("Start speaking to begin conversation");
        console.log("Press Ctrl+C to exit");
        console.log("=".repeat(65) + "\n");
    
        if (this.noAudio) {
          setTimeout(() => {
            process.emit("SIGINT");
          }, 6000);
        }
    
        // Keep the process alive until disconnect or Ctrl+C
        await new Promise((resolve) => {
          const onSigint = () => {
            resolve();
          };
          process.once("SIGINT", onSigint);
          process.once("SIGTERM", onSigint);
    
          // Also resolve if subscription closes (e.g. server-side disconnect)
          const poll = setInterval(() => {
            if (!session.isConnected) {
              clearInterval(poll);
              resolve();
            }
          }, 500);
        });
    
        // Cleanup
        await subscription.close();
        try {
          await session.disconnect();
        } catch {
          // ignore disconnect errors during shutdown
        }
        this._audio.shutdown();
        try {
          await session.dispose();
        } catch {
          // ignore dispose errors during shutdown
        }
      }
    
      /**
       * Send a proactive greeting when the session starts.
       * Supports pre-defined (--greeting-text) or LLM-generated (default).
       */
      async _sendProactiveGreeting() {
        const session = this._session;
    
        if (this.greetingText) {
          // Pre-generated assistant message (deterministic)
          console.log("[session] Sending pre-generated greeting ...");
          try {
            await session.sendEvent({
              type: "response.create",
              response: {
                preGeneratedAssistantMessage: {
                  content: [{ type: "text", text: this.greetingText }],
                },
              },
            });
          } catch (err) {
            console.error("[session] Failed to send pre-generated greeting:", err.message);
          }
        } else {
          // LLM-generated greeting (default)
          console.log("[session] Sending proactive greeting ...");
          try {
            await session.addConversationItem({
              type: "message",
              role: "system",
              content: [
                {
                  type: "input_text",
                  text: "Say something to welcome the user in English.",
                },
              ],
            });
            await session.sendEvent({ type: "response.create" });
          } catch (err) {
            console.error("[session] Failed to send greeting:", err.message);
          }
        }
      }
    
      /** Configure session modalities, audio format, and interim response. */
      async _setupSession() {
        console.log("[session] Configuring session ...");
        await this._session.updateSession({
          modalities: ["text", "audio"],
          inputAudioFormat: "pcm16",
          outputAudioFormat: "pcm16",
          interimResponse: {
            type: "llm_interim_response",
            triggers: ["tool", "latency"],
            latencyThresholdInMs: 100,
            instructions:
              "Create friendly interim responses indicating wait time due to ongoing processing, if any. " +
              "Do not include in all responses! Do not say you don't have real-time access to information when calling tools!",
          },
        });
        console.log("[session] Session configuration sent");
      }
    }
    
    // ---------------------------------------------------------------------------
    // CLI helpers
    // ---------------------------------------------------------------------------
    
    function printUsage() {
      console.log("Usage: node voice-live-with-agent-v2.js [options]");
      console.log("");
      console.log("Options:");
      console.log("  --endpoint <url>            VoiceLive endpoint URL");
      console.log("  --agent-name <name>         Foundry agent name");
      console.log("  --project-name <name>       Foundry project name");
      console.log("  --agent-version <ver>       Agent version");
      console.log("  --conversation-id <id>      Conversation ID to resume");
      console.log("  --foundry-resource <name>   Foundry resource override");
      console.log("  --auth-client-id <id>       Authentication identity client ID");
      console.log("  --audio-input-device <name> Explicit SoX input device name (Windows)");
      console.log("  --list-audio-devices        List available audio input devices and exit");
      console.log("  --greeting-text <text>      Send a pre-defined greeting instead of LLM-generated");
      console.log("  --no-audio                  Connect and configure session without mic/speaker");
      console.log("  -h, --help                  Show this help text");
    }
    
    function parseArguments(argv) {
      const parsed = {
        endpoint: process.env.VOICELIVE_ENDPOINT ?? "",
        agentName: process.env.AGENT_NAME ?? "",
        projectName: process.env.PROJECT_NAME ?? "",
        agentVersion: process.env.AGENT_VERSION,
        conversationId: process.env.CONVERSATION_ID,
        foundryResourceOverride: process.env.FOUNDRY_RESOURCE_OVERRIDE,
        authenticationIdentityClientId:
          process.env.AGENT_AUTHENTICATION_IDENTITY_CLIENT_ID,
        audioInputDevice: process.env.AUDIO_INPUT_DEVICE,
        listAudioDevices: false,
        greetingText: undefined,
        noAudio: false,
        help: false,
      };
    
      for (let i = 0; i < argv.length; i++) {
        const arg = argv[i];
        switch (arg) {
          case "--endpoint":
            parsed.endpoint = argv[++i];
            break;
          case "--agent-name":
            parsed.agentName = argv[++i];
            break;
          case "--project-name":
            parsed.projectName = argv[++i];
            break;
          case "--agent-version":
            parsed.agentVersion = argv[++i];
            break;
          case "--conversation-id":
            parsed.conversationId = argv[++i];
            break;
          case "--foundry-resource":
            parsed.foundryResourceOverride = argv[++i];
            break;
          case "--auth-client-id":
            parsed.authenticationIdentityClientId = argv[++i];
            break;
          case "--audio-input-device":
            parsed.audioInputDevice = argv[++i];
            break;
          case "--list-audio-devices":
            parsed.listAudioDevices = true;
            break;
          case "--greeting-text":
            parsed.greetingText = argv[++i];
            break;
          case "--no-audio":
            parsed.noAudio = true;
            break;
          case "--help":
          case "-h":
            parsed.help = true;
            break;
          default:
            if (arg?.startsWith("-")) {
              throw new Error(`Unknown option: ${arg}`);
            }
            break;
        }
      }
    
      return parsed;
    }
    
    /**
     * List available audio input devices on Windows (AudioEndpoint via WMI).
     */
    async function listAudioDevices() {
      if (process.platform !== "win32") {
        console.log("Device listing is currently supported on Windows only.");
        console.log("On macOS/Linux, run: sox -V6 -n -t coreaudio -n trim 0 0  (or similar)");
        return;
      }
    
      const { execSync } = await import("node:child_process");
      try {
        const output = execSync(
          'powershell -NoProfile -Command "Get-CimInstance Win32_PnPEntity | Where-Object { $_.PNPClass -eq \'AudioEndpoint\' } | Select-Object -ExpandProperty Name"',
          { encoding: "utf-8", timeout: 10000 },
        ).trim();
    
        if (!output) {
          console.log("No audio endpoint devices found.");
          return;
        }
    
        console.log("Available audio endpoint devices:");
        console.log("");
        for (const line of output.split(/\r?\n/)) {
          const name = line.trim();
          if (name) console.log(`  ${name}`);
        }
        console.log("");
        console.log("Use the device name (or a unique substring) with --audio-input-device.");
        console.log('Example: node voice-live-with-agent.js --audio-input-device "Microphone"');
      } catch (err) {
        console.error("Failed to query audio devices:", err.message);
      }
    }
    
    // ---------------------------------------------------------------------------
    // Main
    // ---------------------------------------------------------------------------
    async function main() {
      let args;
      try {
        args = parseArguments(process.argv.slice(2));
      } catch (err) {
        console.error(`āŒ ${err.message}`);
        printUsage();
        process.exit(1);
      }
    
      if (args.help) {
        printUsage();
        return;
      }
    
      if (args.listAudioDevices) {
        await listAudioDevices();
        return;
      }
    
      if (!args.endpoint || !args.agentName || !args.projectName) {
        console.error(
          "āŒ Set VOICELIVE_ENDPOINT, AGENT_NAME, and PROJECT_NAME in your .env file or pass via CLI.",
        );
        printUsage();
        process.exit(1);
      }
    
      console.log("Configuration:");
      console.log(`  VOICELIVE_ENDPOINT: ${args.endpoint}`);
      console.log(`  AGENT_NAME: ${args.agentName}`);
      console.log(`  PROJECT_NAME: ${args.projectName}`);
      console.log(`  AGENT_VERSION: ${args.agentVersion ?? "(not set)"}`);
      console.log(`  CONVERSATION_ID: ${args.conversationId ?? "(not set)"}`);
      console.log(
        `  FOUNDRY_RESOURCE_OVERRIDE: ${args.foundryResourceOverride ?? "(not set)"}`,
      );
      console.log(
        `  AGENT_AUTHENTICATION_IDENTITY_CLIENT_ID: ${args.authenticationIdentityClientId ?? "(not set)"}`,
      );
      console.log(`  AUDIO_INPUT_DEVICE: ${args.audioInputDevice ?? "(not set)"}`);
      if (args.greetingText) {
        console.log(`  Proactive greeting: pre-defined`);
      } else {
        console.log(`  Proactive greeting: LLM-generated (default)`);
      }
      console.log(`  No audio mode: ${args.noAudio ? "enabled" : "disabled"}`);
    
      const credential = new DefaultAzureCredential();
    
      const assistant = new BasicVoiceAssistant({
        endpoint: args.endpoint,
        credential,
        agentName: args.agentName,
        projectName: args.projectName,
        agentVersion: args.agentVersion,
        conversationId: args.conversationId,
        foundryResourceOverride: args.foundryResourceOverride,
        authenticationIdentityClientId: args.authenticationIdentityClientId,
        audioInputDevice: args.audioInputDevice,
        greetingText: args.greetingText,
        noAudio: args.noAudio,
      });
    
      try {
        await assistant.start();
      } catch (err) {
        if (err?.code === "ERR_USE_AFTER_CLOSE") return; // normal on Ctrl+C
        console.error("Fatal error:", err);
        process.exit(1);
      }
    }
    
    console.log("šŸŽ™ļø  Basic Foundry Voice Agent with Azure VoiceLive SDK (Agent Mode)");
    console.log("=".repeat(65));
    main().then(
      () => console.log("\nšŸ‘‹ Voice assistant shut down. Goodbye!"),
      (err) => {
        console.error("Unhandled error:", err);
        process.exit(1);
      },
    );
    
  2. Sign in to Azure with the following command:

    az login
    
  3. Run the voice assistant:

    node voice-live-with-agent.js
    
  4. Start speaking with the agent to hear its responses. You can interrupt the model at any time by speaking over it. Press Ctrl+C to quit the conversation.

Output

The output of the script is printed to the console. You see messages indicating the status of the connection, audio stream, and playback. The audio is played back through your speakers or headphones.

šŸŽ™ļø  Basic Foundry Voice Agent with Azure VoiceLive SDK (Agent Mode)
=================================================================

=================================================================
šŸŽ¤ VOICE ASSISTANT READY
Start speaking to begin conversation
Press Ctrl+C to exit
=================================================================

šŸŽ¤ Listening...
šŸ¤” Processing...
šŸ‘¤ You said:	Hello.
šŸŽ¤ Ready for next input...
šŸ¤– Agent responded with audio transcript:	Hello! I'm Tobi the agent. How can I assist you today?
šŸŽ¤ Listening...
šŸ¤” Processing...
šŸ‘¤ You said:	What are the opening hours of the Eiffel Tower?
šŸŽ¤ Ready for next input...
šŸ¤– Agent responded with audio transcript:	The Eiffel Tower's opening hours can vary depending on the season and any special events or maintenance. Generally, the Eiffel Tower is open every day of the year, with the following typical hours:

- Mid-June to early September: 9:00 AM to 12:45 AM (last elevator ride up at 12:00 AM)
- Rest of the year: 9:30 AM to 11:45 PM (last elevator ride up at 11:00 PM)

These times can sometimes change, so it's always best to check the official Eiffel Tower website or contact them directly for the most up-to-date information before your visit.

Would you like me to help you find the official website or any other details about visiting the Eiffel Tower?

šŸ‘‹ Voice assistant shut down. Goodbye!

A conversation log file is created in the logs folder with the name conversation_YYYYMMDD_HHmmss.log. This file contains session metadata and the conversation transcript, including user inputs and agent responses.

SessionID: sess_1m1zrSLJSPjJpzbEOyQpTL
Agent Name: VoiceAgentQuickstartTest
Agent Description:
Agent ID:
Voice Name: en-US-Ava:DragonHDLatestNeural
Voice Type: azure-standard

User Input:	Hello.
Agent Audio Response:	Hello! I'm Tobi the agent. How can I assist you today?
User Input:	What are the opening hours of the Eiffel Tower?
Agent Audio Response:	The Eiffel Tower's opening hours can vary depending on the season...
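The log file's timestamped name and append-style writes can be sketched as follows. This is a minimal illustration under stated assumptions, not the sample's actual helper: the function name makeConversationLogger and its details are hypothetical, and the real writeConversationLog helper is defined earlier in the full sample.

```javascript
import { appendFileSync, mkdirSync, readFileSync } from "node:fs";
import { join } from "node:path";
import { tmpdir } from "node:os";

// Hypothetical sketch of a conversation-log helper. It creates the logs
// folder, derives a conversation_YYYYMMDD_HHmmss.log file name from the
// current time, and appends one line per call.
function makeConversationLogger(dir) {
  mkdirSync(dir, { recursive: true });
  const stamp = new Date()
    .toISOString()          // e.g. 2026-02-03T04:05:06.789Z
    .replace(/[-:]/g, "")   // 20260203T040506.789Z
    .replace("T", "_")      // 20260203_040506.789Z
    .slice(0, 15);          // 20260203_040506
  const file = join(dir, `conversation_${stamp}.log`);
  const log = (line) => appendFileSync(file, line + "\n");
  log.file = file; // expose the path for inspection
  return log;
}

// Demo: write one transcript line and read it back.
const log = makeConversationLogger(join(tmpdir(), "voice-live-demo-logs"));
log("User Input:\tHello.");
console.log(readFileSync(log.file, "utf-8").trim());
```

Because metadata and transcript lines are plain text appended in order, the file can be tailed while the assistant runs.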

Here are the key differences between the technical log and the conversation log:

| Aspect | Conversation log | Technical log |
| --- | --- | --- |
| Audience | Business users, content reviewers | Developers, IT operations |
| Content | What was said in conversations | How the system is working |
| Level | Application/conversation level | System/infrastructure level |
| Troubleshooting | "What did the agent say?" | "Why did the connection fail?" |

Example: If your agent isn't responding, you'd check both logs:

  • Console log → "WebSocket connection failed" or "Audio stream error"
  • Conversation log → "Did the user actually say anything?"

The two logs are complementary: use conversation logs for conversation analysis and testing, and technical logs for system diagnostics.

Technical log

Purpose: Technical debugging and system monitoring

Contents:

  • WebSocket connection events
  • Audio stream status
  • Error messages and stack traces
  • System-level events (session.created, response.done, etc.)
  • Network connectivity issues
  • Audio processing diagnostics

Format: Console output with bracketed prefixes (for example, [session], [audio], [init])

Use Cases:

  • Debugging connection problems
  • Monitoring system performance
  • Troubleshooting audio issues
  • Developer/operations analysis

Conversation log

Purpose: Conversation transcript and user experience tracking

Contents:

  • Agent and project identification
  • Session configuration details
  • User transcripts: "Tell me a story", "Stop"
  • Agent responses: Full story text and follow-up responses
  • Conversation flow and interactions

Format: Plain text, human-readable conversation format

Use Cases:

  • Analyzing conversation quality
  • Reviewing what was actually said
  • Understanding user interactions and agent responses
  • Business/content analysis
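For conversation analysis, the tab-separated transcript lines shown above can be parsed back into turns. The helper below is a hypothetical sketch, not part of the sample; only the label strings are taken from the log format shown earlier.

```javascript
// Hypothetical sketch: parse conversation-log transcript lines into turns.
// Labels such as "User Input:" and "Agent Audio Response:" match the log
// format shown above; metadata lines without a tab are skipped.
function parseConversationLog(text) {
  const turns = [];
  for (const line of text.split(/\r?\n/)) {
    const tab = line.indexOf("\t");
    if (tab === -1) continue; // metadata or blank line
    const label = line.slice(0, tab);
    const content = line.slice(tab + 1);
    if (label === "User Input:") {
      turns.push({ role: "user", text: content });
    } else if (label.startsWith("Agent ")) {
      turns.push({ role: "agent", text: content });
    }
  }
  return turns;
}

const sample = "SessionID: sess_123\nUser Input:\tHello.\nAgent Audio Response:\tHi there!";
console.log(parseConversationLog(sample).map((t) => t.role).join(","));
// → user,agent
```

A structure like this makes it straightforward to count turns, compute response lengths, or diff transcripts across test runs.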

Learn how to use Voice Live with Microsoft Foundry Agent Service using the VoiceLive SDK for Java.

Reference documentation | Package (Maven) | Additional samples on GitHub

You can create and run an application to use Voice Live with agents for real-time voice agents.

  • Agents let you use a built-in prompt and configuration managed within the agent itself, rather than specifying instructions in the session code.

  • Agents encapsulate more complex logic and behaviors, making it easier to manage and update conversational flows without changing the client code.

  • The agent approach streamlines integration: you connect with just the agent ID, and all necessary settings are handled internally, reducing the need for manual configuration in the code.

  • This separation also supports better maintainability and scalability for scenarios where multiple conversational experiences or business logic variations are needed.

To use the Voice Live API without Foundry agents, see the Voice Live API quickstart.

Tip

To use Voice Live, you don't need to deploy an audio model with your Microsoft Foundry resource. Voice Live is fully managed, and the model is automatically deployed for you. For more information about model availability, see the Voice Live overview documentation.

Follow the quickstart below, or get a fully working web app with a browser-based voice UI.

Prerequisites

Note

This document refers to the Microsoft Foundry (new) portal and the latest Foundry Agent Service version.

  • Assign the Azure AI User role to your user account. You can assign roles in the Azure portal under Access control (IAM) > Add role assignment.

Prepare the environment

  1. Create a new folder voice-live-quickstart and go to the quickstart folder with the following command:

    mkdir voice-live-quickstart && cd voice-live-quickstart
    
  2. Create a file named pom.xml with the following Maven configuration:

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
    
        <groupId>com.azure.ai.voicelive.samples</groupId>
        <artifactId>voice-live-agent-quickstart</artifactId>
        <version>1.0.0</version>
        <packaging>jar</packaging>
    
        <name>Voice Live Agent Quickstart</name>
        <description>Azure AI Voice Live Agent quickstart sample for Java</description>
    
        <properties>
            <maven.compiler.source>11</maven.compiler.source>
            <maven.compiler.target>11</maven.compiler.target>
            <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        </properties>
    
        <dependencies>
            <!-- Azure AI Voice Live SDK -->
            <dependency>
                <groupId>com.azure</groupId>
                <artifactId>azure-ai-voicelive</artifactId>
                <version>1.0.0-beta.5</version>
            </dependency>
    
            <!-- Azure AI Agents SDK (for agent creation) -->
            <dependency>
                <groupId>com.azure</groupId>
                <artifactId>azure-ai-agents</artifactId>
                <version>1.0.0-beta.1</version>
            </dependency>
    
            <!-- Azure Identity for Entra ID authentication -->
            <dependency>
                <groupId>com.azure</groupId>
                <artifactId>azure-identity</artifactId>
                <version>1.15.4</version>
            </dependency>
    
            <!-- SLF4J logging -->
            <dependency>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-simple</artifactId>
                <version>1.7.36</version>
            </dependency>
        </dependencies>
    
        <build>
            <plugins>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.11.0</version>
                    <configuration>
                        <source>11</source>
                        <target>11</target>
                    </configuration>
                </plugin>
            </plugins>
        </build>
    </project>
    
  3. Create the Java source directory structure (on macOS or Linux, use mkdir -p src/main/java instead):

    mkdir src\main\java
    
  4. Download the Maven dependencies:

    mvn dependency:resolve
    

Retrieve resource information

Note

The agent integration requires Entra ID authentication. Key-based authentication isn't supported in Agent mode.

Create a new file named .env in the folder where you want to run the code.

In the .env file, add the following environment variables for authentication:

# Settings for Foundry Agent
PROJECT_ENDPOINT=<endpoint copied from welcome screen>
AGENT_NAME="MyVoiceAgent"
MODEL_DEPLOYMENT_NAME="gpt-4.1-mini"
# Settings for Voice Live (AGENT_NAME above is reused)
AGENT_VERSION=<version-of-the-agent>
CONVERSATION_ID=<specific conversation id to reconnect to>
PROJECT_NAME=<your_project_name>
VOICELIVE_ENDPOINT=<your_endpoint>
VOICELIVE_API_VERSION=2026-01-01-preview

Replace the default values with your actual project name, agent name, and endpoint values.

| Variable name | Value |
| --- | --- |
| PROJECT_ENDPOINT | The Foundry project endpoint copied from the project welcome screen. |
| AGENT_NAME | The name of the agent to use. |
| AGENT_VERSION | Optional: The version of the agent to use. |
| CONVERSATION_ID | Optional: A specific conversation ID to reconnect to. |
| PROJECT_NAME | The name of your Microsoft Foundry project. The project name is the last element of the project endpoint value. |
| VOICELIVE_ENDPOINT | This value can be found in the Keys and Endpoint section when examining your resource from the Azure portal. |
| FOUNDRY_RESOURCE_OVERRIDE | Optional: The Foundry resource name hosting the agent project (for example, my-resource-name). |
| AGENT_AUTHENTICATION_IDENTITY_CLIENT_ID | Optional: The managed identity client ID of the Voice Live resource. |

Learn more about keyless authentication and setting environment variables.
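The note above that the project name is the last element of the project endpoint value can be illustrated with a small JavaScript sketch; the endpoint URL shown is an assumption for illustration, not a real resource.

```javascript
// Sketch: PROJECT_NAME is the last path segment of PROJECT_ENDPOINT.
// The example endpoint below is illustrative, not a real resource.
function projectNameFromEndpoint(endpoint) {
  const path = new URL(endpoint).pathname.replace(/\/+$/, ""); // drop trailing slash
  return path.split("/").pop();
}

console.log(
  projectNameFromEndpoint("https://my-resource.services.ai.azure.com/api/projects/myProject"),
);
// → myProject
```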

Create an agent with Voice Live settings

  1. Create a file src/main/java/CreateAgentWithVoiceLive.java with the following code:

    // Copyright (c) Microsoft Corporation. All rights reserved.
    // Licensed under the MIT License.
    
    import com.azure.ai.agents.AgentsClient;
    import com.azure.ai.agents.AgentsClientBuilder;
    import com.azure.ai.agents.models.AgentDetails;
    import com.azure.ai.agents.models.AgentVersionDetails;
    import com.azure.ai.agents.models.PromptAgentDefinition;
    import com.azure.identity.DefaultAzureCredentialBuilder;
    
    import java.util.LinkedHashMap;
    import java.util.Map;
    
    /**
     * Creates an Azure AI Foundry agent configured for Voice Live sessions.
     *
     * <p>Voice Live session settings (voice, VAD, noise reduction, etc.) are stored
     * in the agent's metadata using a chunking strategy because each metadata value
     * is limited to 512 characters.</p>
     *
     * <p>Required environment variables:</p>
     * <ul>
     *   <li>PROJECT_ENDPOINT - Azure AI Foundry project endpoint</li>
     *   <li>AGENT_NAME - Name for the agent</li>
     *   <li>MODEL_DEPLOYMENT_NAME - Model deployment name (e.g., gpt-4o-mini)</li>
     * </ul>
     */
    public class CreateAgentWithVoiceLive {
    
        private static final int METADATA_VALUE_LIMIT = 512;
    
        // <create_agent>
        public static void main(String[] args) {
            String endpoint = System.getenv("PROJECT_ENDPOINT");
            String agentName = System.getenv("AGENT_NAME");
            String model = System.getenv("MODEL_DEPLOYMENT_NAME");
    
            if (endpoint == null || agentName == null || model == null) {
                System.err.println("Set PROJECT_ENDPOINT, AGENT_NAME, and MODEL_DEPLOYMENT_NAME environment variables.");
                System.exit(1);
            }
    
            // Create the Agents client with Entra ID authentication
            AgentsClient agentsClient = new AgentsClientBuilder()
                    .credential(new DefaultAzureCredentialBuilder().build())
                    .endpoint(endpoint)
                    .buildAgentsClient();
    
            // Define Voice Live session settings
            String voiceLiveConfig = "{"
                    + "\"session\": {"
                    + "\"voice\": {"
                    + "\"name\": \"en-US-Ava:DragonHDLatestNeural\","
                    + "\"type\": \"azure-standard\","
                    + "\"temperature\": 0.8"
                    + "},"
                    + "\"input_audio_transcription\": {"
                    + "\"model\": \"azure-speech\""
                    + "},"
                    + "\"turn_detection\": {"
                    + "\"type\": \"azure_semantic_vad\","
                    + "\"end_of_utterance_detection\": {"
                    + "\"model\": \"semantic_detection_v1_multilingual\""
                    + "}"
                    + "},"
                    + "\"input_audio_noise_reduction\": {\"type\": \"azure_deep_noise_suppression\"},"
                    + "\"input_audio_echo_cancellation\": {\"type\": \"server_echo_cancellation\"}"
                    + "}"
                    + "}";
    
            // Chunk the config into metadata entries (512-char limit per value)
            Map<String, String> metadata = chunkConfig(voiceLiveConfig);
    
            // Create the agent with Voice Live configuration in metadata
            PromptAgentDefinition definition = new PromptAgentDefinition(model)
                    .setInstructions("You are a helpful assistant that answers general questions");
    
            AgentVersionDetails agent = agentsClient.createAgentVersion(agentName, definition, metadata, null);
            System.out.println("Agent created: " + agent.getName() + " (version " + agent.getVersion() + ")");
    
            // Verify Voice Live configuration was stored correctly
            AgentDetails retrieved = agentsClient.getAgent(agentName);
            Map<String, String> storedMetadata = retrieved.getVersions().getLatest().getMetadata();
            String storedConfig = reassembleConfig(storedMetadata);
    
            if (storedConfig != null && !storedConfig.isEmpty()) {
                System.out.println("\nVoice Live configuration:");
                System.out.println(storedConfig);
            } else {
                System.out.println("\nVoice Live configuration not found in agent metadata.");
            }
        }
        // </create_agent>
    
        // <chunk_config>
        /**
         * Splits a configuration JSON string into chunked metadata entries.
         * Each metadata value is limited to 512 characters.
         */
        static Map<String, String> chunkConfig(String configJson) {
            Map<String, String> metadata = new LinkedHashMap<>();
            metadata.put("microsoft.voice-live.configuration",
                    configJson.substring(0, Math.min(configJson.length(), METADATA_VALUE_LIMIT)));
    
            String remaining = configJson.length() > METADATA_VALUE_LIMIT
                    ? configJson.substring(METADATA_VALUE_LIMIT) : "";
            int chunkNum = 1;
            while (!remaining.isEmpty()) {
                String chunk = remaining.substring(0, Math.min(remaining.length(), METADATA_VALUE_LIMIT));
                metadata.put("microsoft.voice-live.configuration." + chunkNum, chunk);
                remaining = remaining.length() > METADATA_VALUE_LIMIT
                        ? remaining.substring(METADATA_VALUE_LIMIT) : "";
                chunkNum++;
            }
            return metadata;
        }
        // </chunk_config>
    
        // <reassemble_config>
        /**
         * Reassembles chunked Voice Live configuration from agent metadata.
         */
        static String reassembleConfig(Map<String, String> metadata) {
            if (metadata == null) {
                return "";
            }
            StringBuilder config = new StringBuilder();
            String base = metadata.get("microsoft.voice-live.configuration");
            if (base != null) {
                config.append(base);
            }
            int chunkNum = 1;
            while (metadata.containsKey("microsoft.voice-live.configuration." + chunkNum)) {
                config.append(metadata.get("microsoft.voice-live.configuration." + chunkNum));
                chunkNum++;
            }
            return config.toString();
        }
        // </reassemble_config>
    }
    
  2. Sign in to Azure with the following command:

    az login
    
  3. Build and run the agent creation script:

    mvn compile exec:java -Dexec.mainClass="CreateAgentWithVoiceLive" -q
    

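Because each agent metadata value is capped at 512 characters, the script stores the Voice Live configuration across multiple metadata entries. The chunking and reassembly helpers can be exercised in isolation; the following standalone sketch reproduces the same logic (the key names and the 512-character limit are taken from the sample above) and round-trips a configuration string that spans three chunks:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Standalone round-trip check for the metadata chunking used by the agent script. */
public class ChunkRoundTrip {
    static final int LIMIT = 512;
    static final String BASE_KEY = "microsoft.voice-live.configuration";

    static Map<String, String> chunk(String json) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put(BASE_KEY, json.substring(0, Math.min(json.length(), LIMIT)));
        String remaining = json.length() > LIMIT ? json.substring(LIMIT) : "";
        int chunkNum = 1;
        while (!remaining.isEmpty()) {
            metadata.put(BASE_KEY + "." + chunkNum,
                    remaining.substring(0, Math.min(remaining.length(), LIMIT)));
            remaining = remaining.length() > LIMIT ? remaining.substring(LIMIT) : "";
            chunkNum++;
        }
        return metadata;
    }

    static String reassemble(Map<String, String> metadata) {
        StringBuilder config = new StringBuilder();
        String base = metadata.get(BASE_KEY);
        if (base != null) {
            config.append(base);
        }
        int chunkNum = 1;
        while (metadata.containsKey(BASE_KEY + "." + chunkNum)) {
            config.append(metadata.get(BASE_KEY + "." + chunkNum));
            chunkNum++;
        }
        return config.toString();
    }

    public static void main(String[] args) {
        String config = "x".repeat(1300); // spans three chunks: 512 + 512 + 276
        Map<String, String> metadata = chunk(config);
        System.out.println(metadata.size());                     // 3
        System.out.println(reassemble(metadata).equals(config)); // true
    }
}
```

A configuration shorter than 512 characters produces a single `microsoft.voice-live.configuration` entry with no numbered suffixes.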
Talk with a voice agent

The sample code in this quickstart uses Microsoft Entra ID for authentication, because the Foundry agent integration currently supports only this authentication method.

The sample connects to Foundry Agent Service by passing AgentSessionConfig to startSession(...) using these fields:

  • agentName: The agent name to invoke.
  • projectName: The Foundry project containing the agent.
  • agentVersion: Optional pinned version for controlled rollouts. If omitted, the latest version is used.
  • conversationId: Optional conversation ID to continue prior conversation context.
  • foundryResourceOverride: Optional resource name when the agent is hosted on a different Foundry resource.
  • authenticationIdentityClientId: Optional managed identity client ID used with cross-resource agent connections.

Note

Agent mode in Voice Live doesn't support key-based authentication for agent invocation. Use Microsoft Entra ID (for example, AzureCliCredential) for agent access. Voice Live resource configuration might still include API keys for non-agent scenarios.

  1. Create the src/main/java/VoiceLiveWithAgent.java file with the following code:

    // Copyright (c) Microsoft Corporation. All rights reserved.
    // Licensed under the MIT License.
    
    import com.azure.ai.voicelive.VoiceLiveAsyncClient;
    import com.azure.ai.voicelive.VoiceLiveClientBuilder;
    import com.azure.ai.voicelive.VoiceLiveSessionAsyncClient;
    import com.azure.ai.voicelive.models.AgentSessionConfig;
    import com.azure.ai.voicelive.models.ClientEventConversationItemCreate;
    import com.azure.ai.voicelive.models.ClientEventResponseCancel;
    import com.azure.ai.voicelive.models.ClientEventResponseCreate;
    import com.azure.ai.voicelive.models.ClientEventSessionUpdate;
    import com.azure.ai.voicelive.models.ConversationRequestItem;
    import com.azure.ai.voicelive.models.InputAudioFormat;
    import com.azure.ai.voicelive.models.InputTextContentPart;
    import com.azure.ai.voicelive.models.InteractionModality;
    import com.azure.ai.voicelive.models.InterimResponseTrigger;
    import com.azure.ai.voicelive.models.LlmInterimResponseConfig;
    import com.azure.ai.voicelive.models.MessageContentPart;
    import com.azure.ai.voicelive.models.OutputAudioFormat;
    import com.azure.ai.voicelive.models.ServerEventType;
    import com.azure.ai.voicelive.models.SessionUpdate;
    import com.azure.ai.voicelive.models.SessionUpdateError;
    import com.azure.ai.voicelive.models.SessionUpdateResponseAudioDelta;
    import com.azure.ai.voicelive.models.SystemMessageItem;
    import com.azure.ai.voicelive.models.VoiceLiveSessionOptions;
    import com.azure.core.util.BinaryData;
    import com.azure.identity.AzureCliCredentialBuilder;
    
    import javax.sound.sampled.AudioFormat;
    import javax.sound.sampled.AudioSystem;
    import javax.sound.sampled.DataLine;
    import javax.sound.sampled.LineUnavailableException;
    import javax.sound.sampled.SourceDataLine;
    import javax.sound.sampled.TargetDataLine;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.time.LocalDateTime;
    import java.time.format.DateTimeFormatter;
    import java.util.Arrays;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.atomic.AtomicBoolean;
    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.logging.Level;
    import java.util.logging.Logger;
    
    /**
     * Voice assistant using Azure AI Voice Live SDK with Foundry Agent support.
     *
     * <p>This sample demonstrates:</p>
     * <ul>
     *   <li>Connecting to Voice Live with AgentSessionConfig</li>
     *   <li>Configuring interim responses to bridge latency gaps</li>
     *   <li>Proactive greeting message on session start</li>
     *   <li>Real-time audio capture and playback with barge-in support</li>
     *   <li>Conversation logging to a file</li>
     * </ul>
     *
     * <p>Required environment variables:</p>
     * <ul>
     *   <li>VOICELIVE_ENDPOINT - Voice Live service endpoint</li>
     *   <li>AGENT_NAME - Name of the Foundry agent</li>
     *   <li>PROJECT_NAME - Foundry project name (e.g., myproject)</li>
     * </ul>
     *
     * <p>Optional environment variables:</p>
     * <ul>
     *   <li>VOICE_NAME - Voice name (default: en-US-Ava:DragonHDLatestNeural)</li>
     *   <li>AGENT_VERSION - Specific agent version</li>
     *   <li>CONVERSATION_ID - Resume a previous conversation</li>
     *   <li>FOUNDRY_RESOURCE_OVERRIDE - Cross-resource Foundry endpoint</li>
     *   <li>AGENT_AUTHENTICATION_IDENTITY_CLIENT_ID - Managed identity client ID for cross-resource auth</li>
     * </ul>
     */
    // <all>
    public class VoiceLiveWithAgent {
    
        private static final Logger logger = Logger.getLogger(VoiceLiveWithAgent.class.getName());
    
        // Audio configuration: 24 kHz, 16-bit, mono PCM
        private static final float SAMPLE_RATE = 24000;
        private static final int SAMPLE_SIZE_BITS = 16;
        private static final int CHANNELS = 1;
        private static final int FRAME_SIZE = 2; // 16-bit mono = 2 bytes per frame
        private static final int BUFFER_SIZE = 4800; // frames; 4800 frames = 200 ms at 24 kHz
    
        // Conversation log
        private static final String LOG_FILENAME = "conversation_"
                + LocalDateTime.now().format(DateTimeFormatter.ofPattern("yyyyMMdd_HHmmss")) + ".log";
    
        // <audio_processor>
        /**
         * Manages real-time audio capture from the microphone and playback to the speakers.
         * Uses a blocking queue for audio buffering and supports barge-in (skip pending audio).
         */
        static class AudioProcessor {
            private final AudioFormat format;
            private TargetDataLine captureLine;
            private SourceDataLine playbackLine;
            private final BlockingQueue<byte[]> playbackQueue = new LinkedBlockingQueue<>();
            private final AtomicBoolean capturing = new AtomicBoolean(false);
            private final AtomicBoolean playing = new AtomicBoolean(false);
            private final AtomicInteger nextSeqNum = new AtomicInteger(0);
            private volatile int playbackBase = 0;
            private Thread captureThread;
            private Thread playbackThread;
            private final VoiceLiveSessionAsyncClient session;
    
            AudioProcessor(VoiceLiveSessionAsyncClient session) {
                this.session = session;
                this.format = new AudioFormat(SAMPLE_RATE, SAMPLE_SIZE_BITS, CHANNELS, true, false);
            }
    
            void startCapture() throws LineUnavailableException {
                if (capturing.get()) {
                    return;
                }
    
                DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);
                if (!AudioSystem.isLineSupported(info)) {
                    throw new LineUnavailableException("Microphone not available");
                }
    
                captureLine = (TargetDataLine) AudioSystem.getLine(info);
                captureLine.open(format, BUFFER_SIZE * FRAME_SIZE);
                captureLine.start();
                capturing.set(true);
    
                captureThread = new Thread(() -> {
                    byte[] buffer = new byte[BUFFER_SIZE * FRAME_SIZE];
                    while (capturing.get()) {
                        int bytesRead = captureLine.read(buffer, 0, buffer.length);
                        if (bytesRead > 0) {
                            byte[] audioChunk = Arrays.copyOf(buffer, bytesRead);
                            try {
                                session.sendInputAudio(BinaryData.fromBytes(audioChunk)).block();
                            } catch (Exception e) {
                                if (capturing.get()) {
                                    logger.warning("Audio send failed: " + e.getMessage());
                                }
                            }
                        }
                    }
                }, "audio-capture");
                captureThread.setDaemon(true);
                captureThread.start();
                logger.info("Started audio capture");
            }
    
            void startPlayback() throws LineUnavailableException {
                if (playing.get()) {
                    return;
                }
    
                DataLine.Info info = new DataLine.Info(SourceDataLine.class, format);
                if (!AudioSystem.isLineSupported(info)) {
                    throw new LineUnavailableException("Speakers not available");
                }
    
                playbackLine = (SourceDataLine) AudioSystem.getLine(info);
                playbackLine.open(format, BUFFER_SIZE * FRAME_SIZE);
                playbackLine.start();
                playing.set(true);
    
                playbackThread = new Thread(() -> {
                    while (playing.get()) {
                        try {
                            byte[] data = playbackQueue.take();
                            if (data.length == 0) {
                                // Poison pill to stop playback
                                break;
                            }
                            playbackLine.write(data, 0, data.length);
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                            break;
                        }
                    }
                }, "audio-playback");
                playbackThread.setDaemon(true);
                playbackThread.start();
                logger.info("Audio playback system ready");
            }
    
            void queueAudio(byte[] audioData) {
                int seqNum = nextSeqNum.getAndIncrement();
                if (seqNum >= playbackBase) {
                    playbackQueue.offer(audioData);
                }
            }
    
        void skipPendingAudio() {
            // Advance the gate past every chunk stamped so far, so that
            // late-arriving deltas from a cancelled response are dropped
            // by the sequence check in queueAudio, then discard the backlog.
            playbackBase = nextSeqNum.getAndIncrement();
            playbackQueue.clear();
            if (playbackLine != null) {
                playbackLine.flush();
            }
        }
    
            void shutdown() {
                capturing.set(false);
                playing.set(false);
    
                if (captureLine != null) {
                    captureLine.stop();
                    captureLine.close();
                    logger.info("Stopped audio capture");
                }
    
                skipPendingAudio();
                playbackQueue.offer(new byte[0]); // poison pill
                if (playbackLine != null) {
                    playbackLine.drain();
                    playbackLine.stop();
                    playbackLine.close();
                    logger.info("Stopped audio playback");
                }
                logger.info("Audio processor cleaned up");
            }
        }
        // </audio_processor>
    
        // <voice_assistant>
        /**
         * Voice assistant that connects to a Foundry Agent via the Voice Live service.
         * Handles session lifecycle, event processing, and audio I/O.
         */
        static class BasicVoiceAssistant {
            private final String endpoint;
            private final AgentSessionConfig agentConfig;
            private VoiceLiveSessionAsyncClient session;
            private AudioProcessor audioProcessor;
            private boolean sessionReady = false;
            private boolean greetingSent = false;
            private boolean activeResponse = false;
            private boolean responseApiDone = false;
    
            // <agent_config>
            BasicVoiceAssistant(String endpoint, String agentName, String projectName,
                                String agentVersion, String conversationId,
                                String foundryResourceOverride, String authIdentityClientId) {
                this.endpoint = endpoint;
    
                // Build the agent session configuration
                AgentSessionConfig config = new AgentSessionConfig(agentName, projectName);
                if (agentVersion != null && !agentVersion.isEmpty()) {
                    config.setAgentVersion(agentVersion);
                }
                if (conversationId != null && !conversationId.isEmpty()) {
                    config.setConversationId(conversationId);
                }
                if (foundryResourceOverride != null && !foundryResourceOverride.isEmpty()) {
                    config.setFoundryResourceOverride(foundryResourceOverride);
                    if (authIdentityClientId != null && !authIdentityClientId.isEmpty()) {
                        config.setAuthenticationIdentityClientId(authIdentityClientId);
                    }
                }
                this.agentConfig = config;
            }
            // </agent_config>
    
            // <start_session>
            void start() throws Exception {
                logger.info("Connecting to VoiceLive API with agent config...");
    
                // Create the Voice Live async client with Entra ID authentication
                VoiceLiveAsyncClient client = new VoiceLiveClientBuilder()
                        .endpoint(endpoint)
                        .credential(new AzureCliCredentialBuilder().build())
                        .buildAsyncClient();
    
                // Connect using AgentSessionConfig
                session = client.startSession(agentConfig).block();
                if (session == null) {
                    throw new RuntimeException("Failed to start Voice Live session");
                }
    
                try {
                    audioProcessor = new AudioProcessor(session);
                    setupSession();
                    audioProcessor.startPlayback();
    
                    logger.info("Voice assistant ready! Start speaking...");
                    System.out.println();
                    System.out.println("=".repeat(65));
                    System.out.println("šŸŽ¤ VOICE ASSISTANT READY");
                    System.out.println("Start speaking to begin conversation");
                    System.out.println("Press Ctrl+C to exit");
                    System.out.println("=".repeat(65));
                    System.out.println();
    
                    // Process events (blocking)
                    processEvents();
                } finally {
                    if (audioProcessor != null) {
                        audioProcessor.shutdown();
                    }
                    if (session != null) {
                        session.closeAsync().block();
                    }
                }
            }
            // </start_session>
    
            // <setup_session>
            private void setupSession() {
                logger.info("Setting up voice conversation session...");
    
                // Configure interim responses to bridge latency gaps during processing
                LlmInterimResponseConfig interimResponseConfig = new LlmInterimResponseConfig()
                        .setTriggers(Arrays.asList(
                                InterimResponseTrigger.TOOL,
                                InterimResponseTrigger.LATENCY))
                        .setLatencyThresholdMs(100)
                        .setInstructions("Create friendly interim responses indicating wait time due to "
                                + "ongoing processing, if any. Do not include in all responses! Do not "
                                + "say you don't have real-time access to information when calling tools!");
    
                // Create session configuration
                VoiceLiveSessionOptions sessionOptions = new VoiceLiveSessionOptions()
                        .setModalities(Arrays.asList(InteractionModality.TEXT, InteractionModality.AUDIO))
                        .setInputAudioFormat(InputAudioFormat.PCM16)
                        .setOutputAudioFormat(OutputAudioFormat.PCM16)
                        .setInterimResponse(BinaryData.fromObject(interimResponseConfig));
    
                // Send session update
                session.sendEvent(new ClientEventSessionUpdate(sessionOptions)).block();
                logger.info("Session configuration sent");
            }
            // </setup_session>
    
            // <process_events>
            private void processEvents() throws InterruptedException {
                CountDownLatch latch = new CountDownLatch(1);
    
                session.receiveEvents().subscribe(
                        event -> handleEvent(event),
                        error -> {
                            logger.log(Level.SEVERE, "Error processing events", error);
                            latch.countDown();
                        },
                        () -> {
                            logger.info("Event stream completed");
                            latch.countDown();
                        }
                );
    
                latch.await();
            }
            // </process_events>
    
            // <handle_events>
            private void handleEvent(SessionUpdate event) {
                ServerEventType type = event.getType();
                logger.fine("Received event: " + type);
    
                if (type == ServerEventType.SESSION_UPDATED) {
                    logger.info("Session updated and ready");
                    sessionReady = true;
                    String sessionId = extractField(event, "id");
                    writeLog(String.format("SessionID: %s\n", sessionId));
    
                    // Send a proactive greeting
                    if (!greetingSent) {
                        greetingSent = true;
                        sendProactiveGreeting();
                    }
    
                    // Start audio capture once session is ready
                    try {
                        audioProcessor.startCapture();
                    } catch (LineUnavailableException e) {
                        logger.log(Level.SEVERE, "Failed to start audio capture", e);
                    }
    
                } else if (type == ServerEventType.CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_COMPLETED) {
                    String transcript = extractField(event, "transcript");
                    System.out.println("šŸ‘¤ You said:\t" + transcript);
                    writeLog("User Input:\t" + transcript);
    
                } else if (type == ServerEventType.RESPONSE_AUDIO_TRANSCRIPT_DONE) {
                    String transcript = extractField(event, "transcript");
                    System.out.println("šŸ¤– Agent responded:\t" + transcript);
                    writeLog("Agent Audio Response:\t" + transcript);
    
                } else if (type == ServerEventType.INPUT_AUDIO_BUFFER_SPEECH_STARTED) {
                    logger.info("User started speaking - stopping playback");
                    System.out.println("šŸŽ¤ Listening...");
                    audioProcessor.skipPendingAudio();
    
                    // Cancel in-progress response for barge-in
                    if (activeResponse && !responseApiDone) {
                        try {
                            session.sendEvent(new ClientEventResponseCancel()).block();
                            logger.fine("Cancelled in-progress response due to barge-in");
                        } catch (Exception e) {
                            if (e.getMessage() != null && e.getMessage().toLowerCase().contains("no active response")) {
                                logger.fine("Cancel ignored - response already completed");
                            } else {
                                logger.warning("Cancel failed: " + e.getMessage());
                            }
                        }
                    }
    
                } else if (type == ServerEventType.INPUT_AUDIO_BUFFER_SPEECH_STOPPED) {
                    logger.info("User stopped speaking");
                    System.out.println("šŸ¤” Processing...");
    
                } else if (type == ServerEventType.RESPONSE_CREATED) {
                    logger.info("Assistant response created");
                    activeResponse = true;
                    responseApiDone = false;
    
                } else if (type == ServerEventType.RESPONSE_AUDIO_DELTA) {
                    logger.fine("Received audio delta");
                    SessionUpdateResponseAudioDelta audioDelta = (SessionUpdateResponseAudioDelta) event;
                    byte[] audioData = audioDelta.getDelta();
                    if (audioData != null && audioData.length > 0) {
                        audioProcessor.queueAudio(audioData);
                    }
    
                } else if (type == ServerEventType.RESPONSE_AUDIO_DONE) {
                    logger.info("Assistant finished speaking");
                    System.out.println("šŸŽ¤ Ready for next input...");
    
                } else if (type == ServerEventType.RESPONSE_DONE) {
                    logger.info("Response complete");
                    activeResponse = false;
                    responseApiDone = true;
    
                } else if (type == ServerEventType.ERROR) {
                    SessionUpdateError errorEvent = (SessionUpdateError) event;
                    String msg = errorEvent.getError().getMessage();
                    if (msg != null && msg.contains("Cancellation failed: no active response")) {
                        logger.fine("Benign cancellation error: " + msg);
                    } else {
                        logger.severe("VoiceLive error: " + msg);
                        System.out.println("Error: " + msg);
                    }
    
                } else {
                    logger.fine("Unhandled event type: " + type);
                }
            }
            // </handle_events>
    
            // <proactive_greeting>
            private void sendProactiveGreeting() {
                logger.info("Sending proactive greeting request");
                try {
                    // Create a system message to trigger greeting
                    SystemMessageItem greetingMessage = new SystemMessageItem(
                            Arrays.asList(new InputTextContentPart("Say something to welcome the user in English.")));
                    ClientEventConversationItemCreate createEvent = new ClientEventConversationItemCreate()
                            .setItem(greetingMessage);
                    session.sendEvent(createEvent).block();
    
                    // Request a response
                    session.sendEvent(new ClientEventResponseCreate()).block();
                } catch (Exception e) {
                    logger.log(Level.WARNING, "Failed to send proactive greeting", e);
                }
            }
            // </proactive_greeting>
    
            private void writeLog(String message) {
                try {
                    Path logDir = Paths.get("logs");
                    Files.createDirectories(logDir);
                    try (PrintWriter writer = new PrintWriter(
                            new FileWriter(logDir.resolve(LOG_FILENAME).toString(), true))) {
                        writer.println(message);
                    }
                } catch (IOException e) {
                    logger.warning("Failed to write conversation log: " + e.getMessage());
                }
            }
    
            /**
             * Extracts a string field value from a SessionUpdate event's JSON representation.
             */
            private String extractField(SessionUpdate event, String fieldName) {
                try {
                    String json = event.toJsonString();
                    // Simple extraction: find "fieldName":"value"
                    String key = "\"" + fieldName + "\":\"";
                    int start = json.indexOf(key);
                    if (start >= 0) {
                        start += key.length();
                        int end = json.indexOf("\"", start);
                        if (end >= 0) {
                            return json.substring(start, end);
                        }
                    }
                } catch (IOException ignored) { }
                return "";
            }
        }
        // </voice_assistant>
    
        // <main>
        public static void main(String[] args) {
            String endpoint = System.getenv("VOICELIVE_ENDPOINT");
            String agentName = System.getenv("AGENT_NAME");
            String projectName = System.getenv("PROJECT_NAME");
            String agentVersion = System.getenv("AGENT_VERSION");
            String conversationId = System.getenv("CONVERSATION_ID");
            String foundryResourceOverride = System.getenv("FOUNDRY_RESOURCE_OVERRIDE");
            String authIdentityClientId = System.getenv("AGENT_AUTHENTICATION_IDENTITY_CLIENT_ID");
    
            System.out.println("Environment variables:");
            System.out.println("VOICELIVE_ENDPOINT: " + endpoint);
            System.out.println("AGENT_NAME: " + agentName);
            System.out.println("PROJECT_NAME: " + projectName);
            System.out.println("AGENT_VERSION: " + agentVersion);
            System.out.println("CONVERSATION_ID: " + conversationId);
            System.out.println("FOUNDRY_RESOURCE_OVERRIDE: " + foundryResourceOverride);
    
            if (endpoint == null || endpoint.isEmpty()
                    || agentName == null || agentName.isEmpty()
                    || projectName == null || projectName.isEmpty()) {
                System.err.println("Set VOICELIVE_ENDPOINT, AGENT_NAME, and PROJECT_NAME environment variables.");
                System.exit(1);
            }
    
            // Verify audio devices
            checkAudioDevices();
    
            System.out.println("šŸŽ™ļø Basic Foundry Voice Agent with Azure VoiceLive SDK (Agent Mode)");
            System.out.println("=".repeat(65));
    
            BasicVoiceAssistant assistant = new BasicVoiceAssistant(
                    endpoint, agentName, projectName,
                    agentVersion, conversationId,
                    foundryResourceOverride, authIdentityClientId);
    
            // Handle graceful shutdown
            Runtime.getRuntime().addShutdownHook(new Thread(() -> {
                System.out.println("\nšŸ‘‹ Voice assistant shut down. Goodbye!");
            }));
    
            try {
                assistant.start();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                System.out.println("\nšŸ‘‹ Voice assistant shut down. Goodbye!");
            } catch (Exception e) {
                System.err.println("Fatal Error: " + e.getMessage());
                e.printStackTrace();
            }
        }
        // </main>
    
        // <check_audio>
        private static void checkAudioDevices() {
            AudioFormat format = new AudioFormat(SAMPLE_RATE, SAMPLE_SIZE_BITS, CHANNELS, true, false);
            DataLine.Info captureInfo = new DataLine.Info(TargetDataLine.class, format);
            DataLine.Info playbackInfo = new DataLine.Info(SourceDataLine.class, format);
    
            if (!AudioSystem.isLineSupported(captureInfo)) {
                System.err.println("āŒ No audio input devices found. Please check your microphone.");
                System.exit(1);
            }
            if (!AudioSystem.isLineSupported(playbackInfo)) {
                System.err.println("āŒ No audio output devices found. Please check your speakers.");
                System.exit(1);
            }
        }
        // </check_audio>
    }
    // </all>
    
  2. Sign in to Azure with the following command:

    az login
    
  3. Build and run the voice assistant:

    mvn compile exec:java -Dexec.mainClass="VoiceLiveWithAgent" -q
    
  4. You can start speaking with the agent and hear its responses. You can interrupt the model by speaking over it. Press Ctrl+C to quit the conversation.

Output

The output of the script is printed to the console. You see messages indicating the status of the connection, audio stream, and playback. The audio is played back through your speakers or headphones.

šŸŽ™ļø  Basic Foundry Voice Agent with Azure VoiceLive SDK (Agent Mode)
=================================================================

============================================================
šŸŽ¤ VOICE ASSISTANT READY
Start speaking to begin conversation
Press Ctrl+C to exit
============================================================

šŸŽ¤ Listening...
šŸ¤” Processing...
šŸ‘¤ You said:	Hello.
šŸŽ¤ Ready for next input...
šŸ¤– Agent responded:	Hello! I'm Tobi the agent. How can I assist you today?
šŸŽ¤ Listening...
šŸ¤” Processing...
šŸ‘¤ You said:	What are the opening hours of the Eiffel Tower?
šŸŽ¤ Ready for next input...
šŸ¤– Agent responded:	The Eiffel Tower's opening hours can vary depending on the season and any special events or maintenance. Generally, the Eiffel Tower is open every day of the year, with the following typical hours:

- Mid-June to early September: 9:00 AM to 12:45 AM (last elevator ride up at 12:00 AM)
- Rest of the year: 9:30 AM to 11:45 PM (last elevator ride up at 11:00 PM)

These times can sometimes change, so it's always best to check the official Eiffel Tower website or contact them directly for the most up-to-date information before your visit.

Would you like me to help you find the official website or any other details about visiting the Eiffel Tower?

šŸ‘‹ Voice assistant shut down. Goodbye!

The program uses Java's java.util.logging framework for technical logs, which are written to the console (stderr) by default. You can configure a logging properties file to redirect output to a file if needed.

Logger logger = Logger.getLogger(VoiceLiveWithAgent.class.getName());
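As an alternative to a properties file, you can attach a `FileHandler` to the logger programmatically. The following is a minimal sketch, not part of the quickstart sample; the file name `voicelive_technical.log` is an arbitrary example:

```java
import java.util.logging.FileHandler;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;

public class RedirectLogs {
    public static void main(String[] args) throws Exception {
        Logger logger = Logger.getLogger("VoiceLiveWithAgent");

        // Append technical logs to a file (the path here is an arbitrary example)
        FileHandler fileHandler = new FileHandler("voicelive_technical.log", true);
        fileHandler.setFormatter(new SimpleFormatter());
        logger.addHandler(fileHandler);

        // Stop the parent console handler from duplicating every record to stderr
        logger.setUseParentHandlers(false);

        logger.info("Logging redirected to file");
    }
}
```

`FileHandler` flushes after each record, so log lines appear in the file as they're emitted.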

The console output includes technical information about the connection to the Voice Live API, audio processing, and session events:

2026-02-10 18:40:19,183 INFO Using Azure token credential
2026-02-10 18:40:19,184 INFO Connecting to VoiceLive API with agent config...
2026-02-10 18:40:21,847 INFO AudioProcessor initialized with 24kHz PCM16 mono audio
2026-02-10 18:40:21,847 INFO Setting up voice conversation session...
2026-02-10 18:40:21,848 INFO Session configuration sent
2026-02-10 18:40:22,174 INFO Audio playback system ready
2026-02-10 18:40:22,174 INFO Voice assistant ready! Start speaking...
2026-02-10 18:40:22,384 INFO Session ready
2026-02-10 18:40:22,386 INFO Sending proactive greeting request
2026-02-10 18:40:22,419 INFO Started audio capture
2026-02-10 18:40:22,722 INFO šŸ¤– Assistant response created
2026-02-10 18:40:26,054 INFO šŸ¤– Assistant finished speaking
2026-02-10 18:40:26,074 INFO āœ… Response complete

In addition, a conversation log file is created in the logs folder with the name conversation_YYYYMMDD_HHmmss.log. This file contains session metadata and the conversation transcript, including user inputs and agent responses.
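The timestamped file name follows the conversation_YYYYMMDD_HHmmss.log convention, which can be produced with `DateTimeFormatter`. A small illustrative sketch (the class name is hypothetical):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class ConversationLogName {
    public static void main(String[] args) {
        // Matches the conversation_YYYYMMDD_HHmmss.log naming convention
        DateTimeFormatter stamp = DateTimeFormatter.ofPattern("yyyyMMdd_HHmmss");
        String fileName = "conversation_" + LocalDateTime.now().format(stamp) + ".log";
        System.out.println(fileName);
    }
}
```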

SessionID: sess_1m1zrSLJSPjJpzbEOyQpTL

User Input:	Hello.
Agent Audio Response:	Hello! I'm Tobi the agent. How can I assist you today?
User Input:	What are the opening hours of the Eiffel Tower?
Agent Audio Response:	The Eiffel Tower's opening hours can vary depending on the season...

Here are the key differences between the technical log and the conversation log:

| Aspect | Conversation log | Technical log |
|--------|------------------|---------------|
| Audience | Business users, content reviewers | Developers, IT operations |
| Content | What was said in conversations | How the system is working |
| Level | Application/conversation level | System/infrastructure level |
| Troubleshooting | "What did the agent say?" | "Why did the connection fail?" |

Example: If your agent wasn't responding, you'd check:

  • Technical (console) log → "WebSocket connection failed" or "Audio stream error"
  • Conversation log → "Did the user actually say anything?"

The two logs are complementary: use conversation logs for conversation analysis and testing, and technical logs for system diagnostics.

Technical log

Purpose: Technical debugging and system monitoring

Contents:

  • WebSocket connection events
  • Audio stream status
  • Error messages and stack traces
  • System-level events (session.created, response.done, etc.)
  • Network connectivity issues
  • Audio processing diagnostics

Format: Structured logging with timestamps, log levels, and technical details

Use Cases:

  • Debugging connection problems
  • Monitoring system performance
  • Troubleshooting audio issues
  • Developer/operations analysis

Conversation log

Purpose: Conversation transcript and user experience tracking

Contents:

  • Agent and project identification
  • Session configuration details
  • User transcripts: "Tell me a story", "Stop"
  • Agent responses: Full story text and follow-up responses
  • Conversation flow and interactions

Format: Plain text, human-readable conversation format

Use Cases:

  • Analyzing conversation quality
  • Reviewing what was actually said
  • Understanding user interactions and agent responses
  • Business/content analysis