Train regression models with AutoML Python API

This example notebook shows how to train a regression model on Databricks using the AutoML Python API. Using the California housing dataset, you call automl.regress() to predict median house value, then use the best trial to run inference on a held-out test set.

Requirements

Databricks Runtime for Machine Learning 8.3 or above.

California housing dataset

This dataset was derived from the 1990 US census, using one row per census block group. The target variable is the median house value for California districts.

# Import the submodule explicitly; `import sklearn` alone may not load sklearn.datasets
import sklearn.datasets
input_pdf = sklearn.datasets.fetch_california_housing(as_frame=True)
display(input_pdf.frame)

Train/test split

from sklearn.model_selection import train_test_split

train_pdf, test_pdf = train_test_split(input_pdf.frame, test_size=0.01, random_state=42)
display(train_pdf)
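
With test_size=0.01, only about 1% of the rows are held out. The sketch below works through that arithmetic in plain Python. It assumes the dataset's 20,640 rows and assumes that a fractional test_size is rounded up to a whole number of test rows, with the remainder going to training. The helper name split_sizes is illustrative and is not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the rest goes to training.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

# 20,640 is the row count of the California housing dataset.
n_train, n_test = split_sizes(20640, 0.01)
print(n_train, n_test)
```

Under these assumptions the held-out set is small (a couple of hundred rows), so metrics computed on it will be noisier than with a larger split.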

Training

The following command starts an AutoML run. You must provide the column that the model should predict in the target_col argument.
When the run completes, you can follow the link to the best trial notebook to examine the training code. This notebook also includes a feature importance plot.

from databricks import automl
summary = automl.regress(train_pdf, target_col="MedHouseVal", timeout_minutes=30)

The following command displays information about the AutoML output.

help(summary)
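
help(summary) prints the full docstring. To pull out only a few fields, something like the following could work. It assumes the summary object exposes best_trial with model_path, notebook_url, and metrics attributes. Only model_path appears elsewhere on this page; treat the other two as assumptions to confirm against the help output. The SimpleNamespace object is a stand-in with placeholder values so that the sketch runs outside Databricks.

```python
from types import SimpleNamespace

def describe_best_trial(summary):
    """Collect a few best-trial fields into a dict; missing ones become None."""
    trial = summary.best_trial
    return {
        "model_path": getattr(trial, "model_path", None),
        "notebook_url": getattr(trial, "notebook_url", None),
        "metrics": getattr(trial, "metrics", None),
    }

# Stand-in object with placeholder values, only to show the shape of the result.
fake_summary = SimpleNamespace(
    best_trial=SimpleNamespace(
        model_path="<model-uri>",
        metrics={"example_metric": 0.0},
    )
)
print(describe_best_trial(fake_summary))
```

In the notebook, you would pass the real summary returned by automl.regress() instead of fake_summary.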

Iterate on the model

Explore the notebooks and experiments linked above.
If the metrics for the best trial notebook look good, skip directly to the inference section.
If you want to improve on the model generated by the best trial:
- Go to the notebook with the best trial and clone it.
- Edit the notebook as necessary to improve the model. For example, you might try different hyperparameters.
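- When you are satisfied with the model, note the URI where the artifact for the trained model is logged. Assign this URI to the model_uri variable in the inference section (Cmd 12 in the notebook), replacing the default value taken from summary.best_trial.model_path.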
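
A URI for a model logged to an MLflow run commonly takes the form runs:/<run_id>/<artifact_path>. The helper below is a hypothetical convenience for building one from a run ID. The default artifact path "model" is an assumption, so confirm it against the cloned notebook before relying on the result.

```python
def runs_model_uri(run_id, artifact_path="model"):
    """Build an MLflow 'runs:/' URI from a run ID and an artifact path."""
    run_id = run_id.strip()
    if not run_id:
        raise ValueError("run_id must be a non-empty string")
    return f"runs:/{run_id}/{artifact_path.strip('/')}"

# "<run-id>" is a placeholder; substitute the run ID shown in your notebook.
model_uri_example = runs_model_uri("<run-id>")
print(model_uri_example)
```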

Inference

You can use the model trained by AutoML to make predictions on new data. The examples below demonstrate how to make predictions on data in pandas DataFrames, or register the model as a Spark UDF for prediction on Spark DataFrames.

pandas DataFrame

model_uri = summary.best_trial.model_path
# model_uri = "<model-uri-from-generated-notebook>"

import mlflow

# Prepare test dataset
y_test = test_pdf["MedHouseVal"]
X_test = test_pdf.drop("MedHouseVal", axis=1)

# Run inference using the best model
model = mlflow.pyfunc.load_model(model_uri)
predictions = model.predict(X_test)
test_pdf["MedHouseVal_predicted"] = predictions
display(test_pdf)

Spark DataFrame

# Prepare the test dataset
test_df = spark.createDataFrame(test_pdf)
predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri)
display(test_df.withColumn("MedHouseVal_predicted", predict_udf()))
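
Calling predict_udf() with no arguments relies on the signature logged with the model to select the input columns by name. That matters here because test_pdf now also contains the target and the pandas predictions. If you prefer to pass the inputs explicitly, one option is to derive the feature list by exclusion. The helper name feature_columns is hypothetical; the column names are those of the scikit-learn California housing frame plus the prediction column added above.

```python
def feature_columns(all_columns, exclude):
    """Return the columns to feed the model, preserving their original order."""
    excluded = set(exclude)
    return [c for c in all_columns if c not in excluded]

columns = [
    "MedInc", "HouseAge", "AveRooms", "AveBedrms", "Population",
    "AveOccup", "Latitude", "Longitude", "MedHouseVal", "MedHouseVal_predicted",
]
feature_cols = feature_columns(columns, ["MedHouseVal", "MedHouseVal_predicted"])
print(feature_cols)

# In the notebook, the explicit call would then look like:
# test_df.withColumn("MedHouseVal_predicted", predict_udf(*feature_cols))
```

In the notebook, test_df.columns can replace the hard-coded columns list.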

Test

Use the final model to make predictions on the holdout test set to estimate how the model would perform in a production setting.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Prepare the dataset: pair predictions with actual values and reset to a 0-based index
y_pred = test_pdf["MedHouseVal_predicted"]
test = pd.DataFrame({"Predicted": y_pred, "Actual": y_test}).reset_index(drop=True)

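# Plot the first 50 predictions against the actual values
fig = plt.figure(figsize=(16, 8))
plt.plot(test[:50])
# Legend labels must follow the DataFrame column order: Predicted, then Actual
plt.legend(["Predicted", "Actual"])
# jointplot creates its own figure: a scatter plot with a fitted regression line
sns.jointplot(x="Actual", y="Predicted", data=test, kind="reg");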
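
The plots give a visual check. For a numeric summary on the same held-out rows, RMSE and R² can be computed directly. The sketch below uses plain Python so that it runs anywhere. The function name regression_metrics and the sample numbers are illustrative; they are not results from this notebook.

```python
import math

def regression_metrics(actual, predicted):
    """Return (rmse, r2) for two equal-length sequences of numbers."""
    actual, predicted = list(actual), list(predicted)
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    rmse = math.sqrt(ss_res / n)
    # R^2 is undefined when the actual values are all identical.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return rmse, r2

# Toy values only, to show the call shape.
rmse, r2 = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(rmse, r2)
```

In the notebook, the equivalent call would be regression_metrics(test["Actual"], test["Predicted"]).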

Register and deploy the model

You can register and deploy a model trained by AutoML like any other model in the MLflow Model Registry. See Log, load, and register MLflow models.

Troubleshooting: `No module named pandas.core.indexes.numeric`

When serving an AutoML-trained model with Mosaic AI Model Serving, you may see the error No module named pandas.core.indexes.numeric. This happens when the pandas version used by AutoML differs from the one in the model serving endpoint environment. To resolve:

- Download the add-pandas-dependency.py script. The script edits requirements.txt and conda.yaml for the logged model to pin pandas==1.5.3.
- Edit the script to include the run_id of the MLflow run where the model was logged.
- Re-register the model.
- Serve the new model version.
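
To make the pinning step concrete: after the edit, the model's requirements.txt should contain exactly one pandas line, pandas==1.5.3. The function below is an illustrative sketch of that change applied to the file's text. It is not the add-pandas-dependency.py script, and the name pin_requirement is hypothetical.

```python
import re

def pin_requirement(requirements_text, package, version):
    """Replace or append a 'package==version' line in requirements.txt text."""
    pinned = f"{package}=={version}"
    # Match the package name followed by end-of-line, a version specifier, or extras.
    pattern = re.compile(rf"^{re.escape(package)}\s*($|[=<>!~\[;])", re.IGNORECASE)
    lines, replaced = [], False
    for line in requirements_text.splitlines():
        if pattern.match(line.strip()):
            if not replaced:
                lines.append(pinned)
                replaced = True
            # Any further entries for the same package are dropped as duplicates.
        else:
            lines.append(line)
    if not replaced:
        lines.append(pinned)
    return "\n".join(lines) + "\n"

print(pin_requirement("mlflow\npandas<3\nscikit-learn\n", "pandas", "1.5.3"))
```

The sketch covers only requirements.txt; conda.yaml needs the same pin in its pip dependency list.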

Next steps

AutoML Python API reference.

Last updated on 2026-05-01