Import data assets (preview)

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

Warning

Import data from external sources (preview) and Data Connections (preview) in Azure Machine Learning are deprecated and won't be available after September 30, 2026. Until then, you can continue to use these features without disruption. After that date, any workloads that depend on them will be disrupted.

Recommended action: Migrate external data imports to Microsoft Fabric and use Azure Machine Learning datastores to make data available in Azure Machine Learning.

In this article, you learn how to import data into the Azure Machine Learning platform from external sources. A successful data import automatically creates and registers an Azure Machine Learning data asset with the name you provide during the import. An Azure Machine Learning data asset resembles a web browser bookmark (favorites): instead of remembering long storage paths (URIs) that point to your most frequently used data, you can create a data asset and then access that asset by using a friendly name.

A data import creates a cache of the source data, along with metadata, for faster and more reliable data access in Azure Machine Learning training jobs. The cache avoids network and connection constraints, and the cached data is versioned to support reproducibility, which provides versioning capabilities for data imported from SQL Server sources. Additionally, the cached data provides data lineage for auditing tasks. A data import uses Azure Data Factory (ADF) pipelines behind the scenes, so you can avoid complex interactions with ADF. Azure Machine Learning also handles the ADF compute resource pool size, compute resource provisioning, and tear-down, which optimizes data transfer by determining the proper parallelization.

The transferred data is partitioned and securely stored as parquet files in Azure storage, which enables faster processing during training. ADF compute costs cover only the time used for the data transfers. Storage costs cover only the time needed to cache the data, because the cached data is a copy of the data imported from an external source, and Azure storage hosts that cache.

The caching feature involves upfront compute and storage costs. However, it pays for itself, and can save money, because it reduces recurring training compute costs compared to direct connections to external source data during training. Caching the data as parquet files makes training jobs faster and more reliable against connection timeouts for larger data sets, which leads to fewer reruns and fewer training failures.

You can import data from Amazon S3, Azure SQL, and Snowflake.

Important

This feature is currently in public preview. This preview version is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities.

For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

Prerequisites

To create and work with data assets, you need:

- An Azure subscription. If you don't have one, create a free account before you begin.
- An Azure Machine Learning workspace.
- The Azure Machine Learning CLI (the ml extension) or the Python SDK (the azure-ai-ml package).

Note

For a successful data import, verify that you installed the latest azure-ai-ml package (version 1.31.0 or later) for the SDK, and the ml extension (version 2.37.0 or later) for the CLI. Python 3.9 or later is required.

If you have an older SDK package or CLI extension, remove it and install the new one by using the code shown in the following tab section:

Code versions

az extension remove -n ml
az extension add -n ml --yes
az extension show -n ml # (the version value needs to be 2.37.0 or later)
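
For the Python SDK, a minimal upgrade sketch follows the same pattern; the package name and version floor come from the note earlier in this section:

pip uninstall azure-ai-ml
pip install azure-ai-ml
pip show azure-ai-ml # (the version value needs to be 1.31.0 or later)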

Import from an external database as an mltable data asset

Note

External databases include Snowflake and Azure SQL.

The following code sample imports data from an external database. The workspace connection that handles the import action supplies the external database source metadata. In this sample, the code imports data from a Snowflake resource, so the connection points to a Snowflake source. With a little modification, the connection can point to an Azure SQL database source or another supported database source. The asset type imported from an external database source is mltable.

Create a YAML file <file-name>.yml:

$schema: http://azureml/sdk-2-0/DataImport.json
# Supported connections include:
# Connection: azureml:<workspace_connection_name>
# Supported paths include:
# path: azureml://datastores/<data_store_name>/paths/<my_path>/${{name}}


type: mltable
name: <name>
source:
  type: database
  query: <query>
  connection: <connection>
path: <path>

Next, run the following command in the CLI:

> az ml data import -f <file-name>.yml
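
If you use the Python SDK instead, the following sketch submits the same database import. It uses the DataImport and Database classes from the azure-ai-ml package; the placeholder values (<name>, <connection>, <query>, <path>) take the same values as in the YAML file:

from azure.ai.ml import MLClient
from azure.ai.ml.data_transfer import Database
from azure.ai.ml.entities import DataImport
from azure.identity import DefaultAzureCredential

# Connect to the workspace; assumes a workspace config.json is available.
ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# The query runs against the database that the workspace connection
# points to (Snowflake or Azure SQL). The result is cached at path and
# registered as an mltable data asset under the given name.
data_import = DataImport(
    name="<name>",
    source=Database(connection="<connection>", query="<query>"),
    path="<path>",
)

ml_client.data.import_data(data_import=data_import)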

Import data from an external file system as a folder data asset

Note

An Amazon S3 data resource can serve as an external file system resource.

The connection that handles the data import action supplies the details of the external data source. In this sample, the connection defines an Amazon S3 bucket as the target, and it expects a valid path value. An asset imported from an external file system source has a type of uri_folder.

The next code sample imports data from an Amazon S3 resource.

Create a YAML file <file-name>.yml:

$schema: http://azureml/sdk-2-0/DataImport.json
# Supported connections include:
# Connection: azureml:<workspace_connection_name>
# Supported paths include:
# path: azureml://datastores/<data_store_name>/paths/<my_path>/${{name}}


type: uri_folder
name: <name>
source:
  type: file_system
  path: <path_on_source>
  connection: <connection>
path: <path>

Next, run this command in the CLI:

> az ml data import -f <file-name>.yml
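
The SDK counterpart is a short variation on the database sketch shown earlier; the FileSystem source class from azure-ai-ml replaces Database, and the placeholders match the YAML file:

from azure.ai.ml import MLClient
from azure.ai.ml.data_transfer import FileSystem
from azure.ai.ml.entities import DataImport
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# path names the folder inside the S3 bucket that the workspace
# connection targets; the registered asset type is uri_folder.
data_import = DataImport(
    name="<name>",
    source=FileSystem(connection="<connection>", path="<path_on_source>"),
    path="<path>",
)

ml_client.data.import_data(data_import=data_import)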

Check the import status of external data sources

The data import action is asynchronous, and it can take a long time. After you submit an import data action through the CLI or SDK, the Azure Machine Learning service might need several minutes to connect to the external data source. Then, the service starts the data import and handles data caching and registration. The time needed for a data import also depends on the size of the source data set.

The following example returns the status of the submitted data import activity. The command or method uses the data asset name as the input to determine the status of the data materialization.

> az ml data list-materialization-status --name <name>
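
In the Python SDK, a minimal equivalent uses show_materialization_status, part of the preview data operations in azure-ai-ml; treat the iteration below as a sketch:

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# Returns the materialization jobs recorded for the named data asset;
# each entry corresponds to one import run.
for job in ml_client.data.show_materialization_status(name="<name>"):
    print(job.name, job.status)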

Next steps