Share via


regexp_extract function

Applies to: check marked yes Databricks SQL check marked yes Databricks Runtime

Extracts the first string in str that matches the regexp expression and corresponds to the regex group index.

Syntax

regexp_extract(str, regexp [, idx] )

Arguments

  • str: A STRING expression to be matched.
  • regexp: A STRING expression with a matching pattern.
  • idx: An optional integral number expression greater or equal 0 with default 1.

Returns

A STRING.

If str, regexp, or idx is NULL, the result is NULL. If regexp does not match any part of str, the result is an empty string. If the overall pattern matches but the group specified by idx does not participate in the match, the result is an empty string.

The regexp string must be a Java regular expression. See Java regex patterns for the supported syntax.

When using literals, use raw-literal (r prefix) to avoid escape character pre-processing.

regexp may contain multiple groups. Capture groups are defined by enclosing part of the pattern in parentheses (...). idx indicates which regex group to extract. An idx of 0 means matching the entire regular expression. The default value of idx is 1, meaning the first capture group is returned when idx is omitted.

If idx is greater than the number of capture groups defined in regexp, Databricks raises an error.

Only the first match of regexp in str is considered. To extract all matches, use regexp_extract_all.

Common error conditions

Examples

Extract specific capture groups

> SELECT regexp_extract('100-200', '(\\d+)-(\\d+)', 1);
 100

> SELECT regexp_extract('100-200', '(\\d+)-(\\d+)', 2);
 200

Use idx 0 to return the entire match

> SELECT regexp_extract('Order: 100-200', '(\\d+)-(\\d+)', 0);
 100-200

Default idx is 1

When idx is omitted, the first capture group is returned.

> SELECT regexp_extract('abc123def', '([a-z]+)(\\d+)');
 abc

No match returns an empty string

> SELECT regexp_extract('hello world', '(\\d+)', 1);

NULL input returns NULL

> SELECT regexp_extract(NULL, '(\\d+)', 1);
 NULL

Use raw string literals to avoid double escaping

In a regular SQL string, backslashes must be doubled (\\d to mean \d). Use the r prefix for raw string literals to write patterns more naturally.

> SELECT regexp_extract('abc123', r'(\d+)', 1);
 123

Extract a domain from a URL

> SELECT regexp_extract('https://www.databricks.com/product', r'https?://([^/]+)', 1);
 www.databricks.com

Extract a component from a date string

> SELECT regexp_extract('Event on 2024-03-15', r'(\d{4})-(\d{2})-(\d{2})', 0);
 2024-03-15

> SELECT regexp_extract('Event on 2024-03-15', r'(\d{4})-(\d{2})-(\d{2})', 2);
 03

Optional capture group that does not participate in the match

When a group is optional and does not match, an empty string is returned for that group.

> SELECT regexp_extract('ac', r'(a)(b)?(c)', 2);

Invalid regex pattern

> SELECT regexp_extract('abc', '[invalid', 0);
  Error: INVALID_PARAMETER_VALUE.PATTERN