The ai.extract function uses generative AI to scan input text and extract specific types of information designated by labels you choose (for example, locations or names), with only a single line of code.
Note
- This article covers using ai.extract with PySpark. To use ai.extract with pandas, see this article.
- See other AI functions in this overview article.
- Learn how to customize the configuration of AI functions.
Overview
The ai.extract function is available for Spark DataFrames. You must specify the name of an existing input column as a parameter, along with a list of entity types to extract from each row of text.
The function returns a new DataFrame, with a separate column for each specified entity type that contains extracted values for each input row.
Schema-driven extraction
aifunc.ExtractLabel supports JSON Schema definitions to enforce structured outputs for extracted data. Beyond basic types (string, number, integer, boolean), you can use the following schema constructs:
- Enums: Constrain values to a fixed set (for example, `"enum": ["midfielder", "striker", "defender"]`).
- Arrays: Define element schemas via `items` (for example, `"type": "array", "items": {"type": "string"}`).
- Objects with properties: Specify nested fields with `properties` and their types.
- Required fields: Mark mandatory fields with `required` to ensure they're always present in the output.
- No extra fields: Set `additionalProperties=false` to prevent the model from returning fields outside the defined schema.
- Nullable values: Express nullable types (for example, `type=["string", "null"]`) for optional data.
This enables reliable, typed extraction for downstream processing. When used in PySpark on Fabric, ai.extract executes as a distributed Spark transformation, leveraging the Fabric Spark cluster for scale and parallel processing across partitions.
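The schema constructs listed above are standard JSON Schema. As a concrete illustration, the following sketch (plain Python, no Spark required) builds a schema that uses each construct and spot-checks a sample document against it. The field names (`position`, `teams`, `contact`, `nickname`) are illustrative only, not part of the ai.extract API.

```python
# Minimal sketch of the JSON Schema constructs described above,
# expressed as a plain Python dict. Field names are illustrative.
schema = {
    "type": "object",
    "properties": {
        # Enum: constrain values to a fixed set.
        "position": {"type": "string", "enum": ["midfielder", "striker", "defender"]},
        # Array: element schema given via "items".
        "teams": {"type": "array", "items": {"type": "string"}},
        # Object with properties: nested fields and their types.
        "contact": {
            "type": "object",
            "properties": {"email": {"type": "string"}},
        },
        # Nullable value: a union type that includes "null".
        "nickname": {"type": ["string", "null"]},
    },
    # Required fields must always be present in the output.
    "required": ["position", "teams"],
    # No fields outside the defined schema.
    "additionalProperties": False,
}

# A document shaped the way the model might return it for this schema.
doc = {"position": "striker", "teams": ["Contoso FC"], "nickname": None}

# Spot-check the document against the schema by hand.
assert doc["position"] in schema["properties"]["position"]["enum"]
assert all(isinstance(t, str) for t in doc["teams"])
assert all(key in doc for key in schema["required"])
assert set(doc) <= set(schema["properties"])  # additionalProperties: false
```

When you pass equivalent constructs through aifunc.ExtractLabel, the service enforces them on the model's output, so downstream code can rely on the declared types.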
Syntax
```python
from synapse.ml.spark import aifunc
```
Note
The PySpark import path is `from synapse.ml.spark import aifunc`. For pandas DataFrames, use `from synapse.ml import aifunc` instead. See the pandas extract article for pandas-specific details.
```python
df.ai.extract(labels=["entity1", "entity2", "entity3"], input_col="input")
```
Parameters
| Name | Description |
|---|---|
| `labels`<br>Required | An array of strings that represents the set of entity types to extract from the text values in the input column. |
| `input_col`<br>Required | A string that contains the name of an existing column with input text values to scan for the custom entities. |
| `aifunc.ExtractLabel`<br>Optional | One or more label definitions describing the fields to extract. For more information, refer to the ExtractLabel Parameters table. |
| `error_col`<br>Optional | A string that contains the name of a new column to store any OpenAI errors that result from processing each input text row. If you don't set this parameter, a default name is generated for the error column. If an input row has no errors, the value in this column is null. |
ExtractLabel Parameters
| Name | Description |
|---|---|
| `label`<br>Required | A string that represents the entity to extract from the input text values. |
| `description`<br>Optional | A string that adds extra context for the AI model. It can include requirements, context, or instructions for the AI to consider while performing the extraction. |
| `max_items`<br>Optional | An int that specifies the maximum number of items to extract for this label. |
| `type`<br>Optional | The JSON Schema type for the extracted value. Supported types for this class include string, number, integer, boolean, object, and array. |
| `properties`<br>Optional | Additional JSON Schema properties for the type, provided as a dictionary. Commonly used keys include `items` (element schemas for arrays), `properties` (fields for objects), `enum` (constrain values to a fixed set), `required` (list of mandatory field names), and `additionalProperties` (set to false to prevent extra fields). Express nullable values with `type=["string", "null"]`. For the full list of supported schema constructs, see Structured Outputs: Supported schemas. Note: when `additionalProperties=false` is set, the model returns only the fields defined in the schema. |
| `raw_col`<br>Optional | A string that sets the column name for the raw LLM response. The raw response provides a list of dictionary pairs for every entity label, including "reason" and "extraction_text". |
Returns
The function returns a Spark DataFrame with a new column for each specified entity type. The column or columns contain the entities extracted for each row of input text. If no match is found, the result is null.
The default return type is a list of strings for each label. When max_items isn't specified, multiple matches are returned as a list. If you specify a different type in the aifunc.ExtractLabel configuration (for example, type="integer"), the output is a list of values of that type. If you specify max_items=1, a single-element list is produced for that label. The element type of each list follows the schema you provide.
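For instance, under these rules a label configured with type="integer" and max_items=1 comes back as a one-element list of integers. The following plain-Python sketch shows how such list values might be unwrapped after collecting results; the row contents here are made up for illustration, not produced by running ai.extract.

```python
# Hypothetical extracted row, shaped per the return rules above:
# each label maps to a list; max_items=1 labels hold a single element.
row = {
    "name": ["Kris Turner"],                              # max_items=1 -> one-element list
    "companies_worked_for": ["NYU Langone", "Contoso"],   # default: list of strings
    "year_of_experience": [7],                            # type="integer" -> list of ints
}

def first_or_none(values):
    """Unwrap a single-element extraction list; None when no match was found."""
    return values[0] if values else None

assert first_or_none(row["name"]) == "Kris Turner"
assert first_or_none(row["year_of_experience"]) == 7
assert first_or_none([]) is None  # no match -> null
```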
Example
```python
# This code uses AI. Always review output for mistakes.
df = spark.createDataFrame([
    ("MJ Lee lives in Tucson, AZ, and works as a software engineer for Contoso.",),
    ("Kris Turner, a nurse at NYU Langone, is a resident of Jersey City, New Jersey.",)
], ["descriptions"])

df_entities = df.ai.extract(labels=["name", "profession", "city"], input_col="descriptions")
display(df_entities)
```
This example code cell outputs a DataFrame with new name, profession, and city columns that contain the values extracted from each description.
Multimodal input
The ai.extract function supports file-based multimodal input. You can extract entities from images, PDFs, and text files by setting input_col_type="path". For more information about supported file types and setup, see Use multimodal input with AI functions.
```python
# This code uses AI. Always review output for mistakes.
extracted = custom_df.ai.extract(
    labels=[
        aifunc.ExtractLabel(
            "name",
            description="The full name of the candidate, first letter capitalized.",
            max_items=1,
        ),
        "companies_worked_for",
        aifunc.ExtractLabel(
            "year_of_experience",
            description="The total years of professional work experience the candidate has, excluding internships.",
            type="integer",
            max_items=1,
        ),
    ],
    input_col="file_path",
    input_col_type="path",
)
display(extracted)
```
Related content
- Detect sentiment with ai.analyze_sentiment.
- Categorize text with ai.classify.
- Generate vector embeddings with ai.embed.
- Fix grammar with ai.fix_grammar.
- Answer custom user prompts with ai.generate_response.
- Calculate similarity with ai.similarity.
- Summarize text with ai.summarize.
- Translate text with ai.translate.
- Learn more about the full set of AI functions.
- Customize the configuration of AI functions.
Did we miss a feature you need? Suggest it on the Fabric Ideas forum.