Note
Applies to:
Databricks SQL
Databricks Runtime
Important
This functionality is in Public Preview and is HIPAA compliant.
During the preview:
- The underlying language model can handle several languages, but this AI Function is tuned for English.
- See Features with limited regional availability for AI Functions region availability.
The ai_extract() function extracts structured data from text and documents according to a schema you provide. You can use simple field names for basic extraction, or define complex schemas with nested objects, arrays, type validation, and field descriptions for business documents like invoices, contracts, and financial filings.
The function accepts text or VARIANT output from other AI functions like ai_parse_document, enabling composable workflows for end-to-end document processing.
For a visual UI to validate and iterate on the results of ai_extract, see Information Extraction.
Requirements
Apache 2.0 license
The underlying models that might be used at this time are licensed under the Apache 2.0 License, Copyright © The Apache Software Foundation. Customers are responsible for ensuring compliance with applicable model licenses.
Databricks recommends reviewing these licenses to ensure compliance with any applicable terms. If models emerge in the future that perform better according to Databricks's internal benchmarks, Databricks might change the model (and the list of applicable licenses provided on this page).
The model powering this function is made available using Model Serving Foundation Model APIs. See Applicable model terms for information about which models are available on Databricks and the licenses and policies that govern the use of those models.
- This function is only available in some regions, see AI function availability.
- This function is not available on Azure Databricks SQL Classic.
- For pricing information, see the Databricks SQL pricing page.
- In Databricks Runtime 15.1 and above, this function is supported in Databricks notebooks, including notebooks that are run as a task in a Databricks workflow.
- For batch inference workloads, Databricks recommends Databricks Runtime 15.4 ML LTS for improved performance.
Syntax
Databricks recommends using version 2 of this function because it supports nested field extraction and descriptions.
Version 2 (recommended)
ai_extract(
content VARIANT | STRING,
schema STRING,
[options MAP<STRING, STRING>]
) RETURNS VARIANT
Version 1
ai_extract(
content STRING,
labels ARRAY<STRING>,
[options MAP<STRING, STRING>]
) RETURNS STRUCT
Arguments
Version 2 (recommended)
- content: A VARIANT or STRING expression. Accepts either:
  - Raw text as a STRING
  - A VARIANT produced by another AI function (such as ai_parse_document)
- schema: A STRING literal defining the JSON schema for extraction. The schema can be:
  - Simple schema: A JSON array of field names (assumed to be strings), for example ["vendor_name", "invoice_id", "total_amount"]
  - Advanced schema: A JSON object with type information, descriptions, and nested structures:
    - Supports string, integer, number, boolean, and enum types. Type validation is performed; invalid values result in an error. A maximum of 500 enum values is supported.
    - Supports nested objects using "type": "object" with "properties"
    - Supports arrays of primitives or objects using "type": "array" with "items"
    - An optional "description" field for each property guides extraction quality
- options: An optional MAP<STRING, STRING> containing configuration options:
  - version: Version switch to support migration ("1.0" for v1 behavior, "2.0" for v2 behavior). The default is based on the input types, falling back to "1.0".
  - instructions: Global description of the task and domain to improve extraction quality. Must be less than 20,000 characters.
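The options map can pin the behavior version and supply domain instructions in a single call. A minimal sketch; the input text and schema field names here are illustrative:

```sql
> SELECT ai_extract(
    'PO-7782 issued to Globex Ltd for $980.00',
    '["po_number", "supplier_name", "amount"]',
    MAP(
      'version', '2.0',
      'instructions', 'These are purchase orders. Amounts are in USD.'
    )
  );
```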
Version 1
- content: A STRING expression containing the raw text.
- labels: An ARRAY<STRING> literal. Each element is a type of entity to be extracted.
- options: An optional MAP<STRING, STRING> containing configuration options:
  - version: Version switch to support migration ("1.0" for v1 behavior, "2.0" for v2 behavior). The default is based on the input types, falling back to "1.0".
Returns
Version 2 (recommended)
Returns a VARIANT containing:
{
"response": { ... }, // Extracted data matching the provided schema
"error_message": null // null on success, or error message on failure
}
The response field contains the structured data extracted according to the schema:
- Field names and types match the schema definition
- Nested objects and arrays are preserved in the structure
- Fields may be null if not found
- Type validation is enforced for integer, number, boolean, and enum types
If content is NULL, the result is NULL.
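Because the result is a VARIANT, individual fields can be read with the colon path operator and cast as needed. A sketch of downstream filtering, assuming a hypothetical invoices table with a raw_text column:

```sql
> SELECT
    extracted:response:invoice_id::STRING   AS invoice_id,
    extracted:response:total_amount::STRING AS total_amount
  FROM (
    SELECT ai_extract(raw_text, '["invoice_id", "total_amount"]') AS extracted
    FROM invoices
  )
  -- Keep only rows where extraction succeeded
  WHERE extracted:error_message IS NULL;
```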
Version 1
Returns a STRUCT where each field corresponds to an entity type specified in labels. Each field contains a string representing the extracted entity. If the function finds more than one candidate for any entity type, it returns only one.
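Because v1 returns a STRUCT, extracted entities can be read directly with dot notation, for example:

```sql
> SELECT
    extracted.person,
    extracted.organization
  FROM (
    SELECT ai_extract(
      'John Doe lives in New York and works for Acme Corp.',
      array('person', 'location', 'organization')
    ) AS extracted
  );
```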
Examples
Version 2 (recommended)
Simple schema - field names only
> SELECT ai_extract(
'Invoice #12345 from Acme Corp for $1,250.00 dated 2024-01-15',
'["invoice_id", "vendor_name", "total_amount", "invoice_date"]'
);
{
"response": {
"invoice_id": "12345",
"vendor_name": "Acme Corp",
"total_amount": "1250.00",
"invoice_date": "2024-01-15"
},
"error_message": null
}
Advanced schema - with types and descriptions:
> SELECT ai_extract(
'Invoice #12345 from Acme Corp for $1,250.00 dated 2024-01-15',
'{
"invoice_id": {"type": "string", "description": "Unique invoice identifier"},
"vendor_name": {"type": "string", "description": "Legal business name"},
"total_amount": {"type": "number", "description": "Total invoice amount"},
"invoice_date": {"type": "string", "description": "Date in YYYY-MM-DD format"}
}'
);
{
"response": {
"invoice_id": "12345",
"vendor_name": "Acme Corp",
"total_amount": 1250.00,
"invoice_date": "2024-01-15"
},
"error_message": null
}
Nested objects and arrays:
> SELECT ai_extract(
'Invoice #12345 from Acme Corp
Line 1: Widget A, qty 10, $50.00 each
Line 2: Widget B, qty 5, $100.00 each
Subtotal: $1,000.00, Tax: $80.00, Total: $1,080.00',
'{
"invoice_header": {
"type": "object",
"properties": {
"invoice_id": {"type": "string"},
"vendor_name": {"type": "string"}
}
},
"line_items": {
"type": "array",
"description": "List of invoiced products",
"items": {
"type": "object",
"properties": {
"description": {"type": "string"},
"quantity": {"type": "integer"},
"unit_price": {"type": "number"}
}
}
},
"totals": {
"type": "object",
"properties": {
"subtotal": {"type": "number"},
"tax_amount": {"type": "number"},
"total_amount": {"type": "number"}
}
}
}'
);
{
"response": {
"invoice_header": {
"invoice_id": "12345",
"vendor_name": "Acme Corp"
},
"line_items": [
{"description": "Widget A", "quantity": 10, "unit_price": 50.00},
{"description": "Widget B", "quantity": 5, "unit_price": 100.00}
],
"totals": {
"subtotal": 1000.00,
"tax_amount": 80.00,
"total_amount": 1080.00
}
},
"error": null
}
Composability with ai_parse_document:
> WITH parsed_docs AS (
SELECT
path,
ai_parse_document(
content,
MAP('version', '2.0')
) AS parsed_content
FROM READ_FILES('/Volumes/finance/invoices/', format => 'binaryFile')
)
SELECT
path,
ai_extract(
parsed_content,
'["invoice_id", "vendor_name", "total_amount"]',
MAP('instructions', 'These are vendor invoices.')
) AS invoice_data
FROM parsed_docs;
Using enums:
> SELECT ai_extract(
'Invoice #12345 from Acme Corp, amount: $1,250.00 USD',
'{
"invoice_id": {"type": "string"},
"vendor_name": {"type": "string"},
"total_amount": {"type": "number"},
"currency": {
"type": "enum",
"labels": ["USD", "EUR", "GBP", "CAD", "AUD"],
"description": "Currency code"
},
"payment_terms": {"type": "string"}
}'
);
{
"response": {
"invoice_id": "12345",
"vendor_name": "Acme Corp",
"total_amount": 1250.00,
"currency": "USD",
"payment_terms": null
},
"error": null
}
Version 1
> SELECT ai_extract(
'John Doe lives in New York and works for Acme Corp.',
array('person', 'location', 'organization')
);
{"person": "John Doe", "location": "New York", "organization": "Acme Corp."}
> SELECT ai_extract(
'Send an email to jane.doe@example.com about the meeting at 10am.',
array('email', 'time')
);
{"email": "jane.doe@example.com", "time": "10am"}
Limitations
Version 2 (recommended)
- This function is not available on Azure Databricks SQL Classic.
- This function cannot be used with Views.
- The schema supports a maximum of 128 fields.
- Field names can contain up to 150 characters.
- Schemas support up to 7 levels of nesting for nested fields.
- Enum fields support a maximum of 500 values.
- Type validation is enforced for integer, number, boolean, and enum types. If a value does not match the specified type, the function returns an error.
- The maximum total context size is 128,000 tokens.
Version 1
- This function is not available on Azure Databricks SQL Classic.
- This function cannot be used with Views.
- If more than one candidate for an entity type is found in the content, only one value is returned.