ai_extract function

Applies to: Databricks SQL and Databricks Runtime

Important

This functionality is in Public Preview and is HIPAA compliant.

The ai_extract() function extracts structured data from text and documents according to a schema you provide. You can use simple field names for basic extraction, or define complex schemas with nested objects, arrays, type validation, and field descriptions for business documents like invoices, contracts, and financial filings.

The function accepts text or VARIANT output from other AI functions like ai_parse_document, enabling composable workflows for end-to-end document processing.

For a visual UI to validate and iterate on the results of ai_extract, see Information Extraction.

Requirements

Apache 2.0 license

The underlying models that might be used at this time are licensed under the Apache 2.0 License, Copyright © The Apache Software Foundation. Customers are responsible for ensuring compliance with applicable model licenses.

Databricks recommends reviewing these licenses to ensure compliance with any applicable terms. If models emerge in the future that perform better according to Databricks's internal benchmarks, Databricks might change the model (and the list of applicable licenses provided on this page).

The model powering this function is made available using Model Serving Foundation Model APIs. See Applicable model terms for information about which models are available on Databricks and the licenses and policies that govern the use of those models.

  • This function is only available in some regions. See AI function availability.
  • This function is not available on Azure Databricks SQL Classic.
  • For pricing details, see the Databricks SQL pricing page.
  • In Databricks Runtime 15.1 and above, this function is supported in Databricks notebooks, including notebooks that are run as a task in a Databricks workflow.
  • For improved performance with batch inference workloads, Databricks Runtime 15.4 LTS ML or above is recommended.

Syntax

Databricks recommends using version 2 of this function because it supports nested field extraction and descriptions.

ai_extract(
    content VARIANT | STRING,
    schema STRING,
    [options MAP<STRING, STRING>]
) RETURNS VARIANT

Version 1

ai_extract(
    content STRING,
    labels ARRAY<STRING>,
    [options MAP<STRING, STRING>]
) RETURNS STRUCT

Arguments

  • content: A VARIANT or STRING expression. Accepts either:

    • Raw text as a STRING
    • A VARIANT produced by another AI function (such as ai_parse_document)
  • schema: A STRING literal defining the JSON schema for extraction. The schema can be:

    • Simple schema: A JSON array of field names (assumed to be strings)
      ["vendor_name", "invoice_id", "total_amount"]
      
    • Advanced schema: A JSON object with type information, descriptions, and nested structures
      • Supports string, integer, number, boolean, and enum types. Type validation is enforced; invalid values result in an error. Enum fields support a maximum of 500 values.
      • Supports nested objects using "type": "object" with "properties"
      • Supports arrays of primitives or objects using "type": "array" with "items"
      • Optional "description" field for each property to guide extraction quality
  • options: An optional MAP<STRING, STRING> containing configuration options:

    • version: Version switch to support migration ("1.0" for v1 behavior, "2.0" for v2 behavior). The default is inferred from the input types; if the version cannot be determined, the function falls back to "1.0".
    • instructions: Global description of the task and domain to improve extraction quality. Must be less than 20,000 characters.
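
As a sketch of how the two options combine (the invoice text is illustrative), a single MAP can pin the version and supply domain instructions:

> SELECT ai_extract(
    'Invoice #12345 from Acme Corp for $1,250.00 dated 2024-01-15',
    '["invoice_id", "vendor_name", "total_amount"]',
    MAP(
      'version', '2.0',
      'instructions', 'These are vendor invoices; amounts are in USD.'
    )
  );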

Version 1

  • content: A STRING expression containing the raw text.

  • labels: An ARRAY<STRING> literal. Each element is a type of entity to be extracted.

  • options: An optional MAP<STRING, STRING> containing configuration options:

    • version: Version switch to support migration ("1.0" for v1 behavior, "2.0" for v2 behavior). The default is inferred from the input types; if the version cannot be determined, the function falls back to "1.0".
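
For example, during migration you can force v1 behavior explicitly rather than rely on the implicit default (a sketch; the input text is illustrative):

> SELECT ai_extract(
    'John Doe lives in New York.',
    array('person', 'location'),
    MAP('version', '1.0')
  );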

Returns

Returns a VARIANT containing:

{
  "response": { ... },    // Extracted data matching the provided schema
  "error_message": null   // null on success, or an error message on failure
}

The response field contains the structured data extracted according to the schema:

  • Field names and types match the schema definition
  • Nested objects and arrays are preserved in the structure
  • Fields may be null if not found
  • Type validation is enforced for integer, number, boolean, and enum types

If content is NULL, the result is NULL.
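
Because the result is a VARIANT, downstream queries can unpack it with the colon path syntax and filter out failed rows. The following sketch assumes a hypothetical table invoices_raw with a STRING column doc_text:

> WITH extracted AS (
    SELECT ai_extract(doc_text, '["invoice_id", "total_amount"]') AS result
    FROM invoices_raw
  )
  SELECT
    result:response.invoice_id::STRING   AS invoice_id,
    result:response.total_amount::STRING AS total_amount
  FROM extracted
  WHERE result:error_message IS NULL;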

Version 1

Returns a STRUCT where each field corresponds to an entity type specified in labels. Each field contains a string representing the extracted entity. If the function finds more than one candidate for any entity type, it returns only one.

Examples

Simple schema - field names only:

> SELECT ai_extract(
    'Invoice #12345 from Acme Corp for $1,250.00 dated 2024-01-15',
    '["invoice_id", "vendor_name", "total_amount", "invoice_date"]'
  );
 {
   "response": {
     "invoice_id": "12345",
     "vendor_name": "Acme Corp",
     "total_amount": "1250.00",
     "invoice_date": "2024-01-15"
   },
   "error_message": null
 }

Advanced schema - with types and descriptions:

> SELECT ai_extract(
    'Invoice #12345 from Acme Corp for $1,250.00 dated 2024-01-15',
    '{
      "invoice_id": {"type": "string", "description": "Unique invoice identifier"},
      "vendor_name": {"type": "string", "description": "Legal business name"},
      "total_amount": {"type": "number", "description": "Total invoice amount"},
      "invoice_date": {"type": "string", "description": "Date in YYYY-MM-DD format"}
    }'
  );
 {
   "response": {
     "invoice_id": "12345",
     "vendor_name": "Acme Corp",
     "total_amount": 1250.00,
     "invoice_date": "2024-01-15"
   },
   "error_message": null
 }

Nested objects and arrays:

> SELECT ai_extract(
    'Invoice #12345 from Acme Corp
     Line 1: Widget A, qty 10, $50.00 each
     Line 2: Widget B, qty 5, $100.00 each
     Subtotal: $1,000.00, Tax: $80.00, Total: $1,080.00',
    '{
      "invoice_header": {
        "type": "object",
        "properties": {
          "invoice_id": {"type": "string"},
          "vendor_name": {"type": "string"}
        }
      },
      "line_items": {
        "type": "array",
        "description": "List of invoiced products",
        "items": {
          "type": "object",
          "properties": {
            "description": {"type": "string"},
            "quantity": {"type": "integer"},
            "unit_price": {"type": "number"}
          }
        }
      },
      "totals": {
        "type": "object",
        "properties": {
          "subtotal": {"type": "number"},
          "tax_amount": {"type": "number"},
          "total_amount": {"type": "number"}
        }
      }
    }'
  );
 {
   "response": {
     "invoice_header": {
       "invoice_id": "12345",
       "vendor_name": "Acme Corp"
     },
     "line_items": [
       {"description": "Widget A", "quantity": 10, "unit_price": 50.00},
       {"description": "Widget B", "quantity": 5, "unit_price": 100.00}
     ],
     "totals": {
       "subtotal": 1000.00,
       "tax_amount": 80.00,
       "total_amount": 1080.00
     }
   },
   "error": null
 }

Composability with ai_parse_document:

> WITH parsed_docs AS (
    SELECT
      path,
      ai_parse_document(
        content,
        MAP('version', '2.0')
      ) AS parsed_content
    FROM READ_FILES('/Volumes/finance/invoices/', format => 'binaryFile')
  )
  SELECT
    path,
    ai_extract(
      parsed_content,
      '["invoice_id", "vendor_name", "total_amount"]',
      MAP('instructions', 'These are vendor invoices.')
    ) AS invoice_data
  FROM parsed_docs;

Using enums:

> SELECT ai_extract(
    'Invoice #12345 from Acme Corp, amount: $1,250.00 USD',
    '{
      "invoice_id": {"type": "string"},
      "vendor_name": {"type": "string"},
      "total_amount": {"type": "number"},
      "currency": {
        "type": "enum",
        "labels": ["USD", "EUR", "GBP", "CAD", "AUD"],
        "description": "Currency code"
      },
      "payment_terms": {"type": "string"}
    }'
  );
 {
   "response": {
     "invoice_id": "12345",
     "vendor_name": "Acme Corp",
     "total_amount": 1250.00,
     "currency": "USD",
     "payment_terms": null
   },
   "error": null
 }

Version 1

> SELECT ai_extract(
    'John Doe lives in New York and works for Acme Corp.',
    array('person', 'location', 'organization')
  );
 {"person": "John Doe", "location": "New York", "organization": "Acme Corp."}

> SELECT ai_extract(
    'Send an email to jane.doe@example.com about the meeting at 10am.',
    array('email', 'time')
  );
 {"email": "jane.doe@example.com", "time": "10am"}

Limitations

  • This function is not available on Azure Databricks SQL Classic.
  • This function cannot be used with views.
  • The schema supports a maximum of 128 fields.
  • Field names can contain up to 150 characters.
  • Schemas support up to 7 levels of nesting for nested fields.
  • Enum fields support a maximum of 500 values.
  • Type validation is enforced for integer, number, boolean, and enum types. If a value does not match the specified type, the function returns an error.
  • The maximum total context size is 128,000 tokens.

Version 1

  • This function is not available on Azure Databricks SQL Classic.
  • This function cannot be used with views.
  • If more than one candidate for an entity type is found in the content, only one value is returned.