Simple Model Experiment Tracking with MLFlow and DVC
Combine two powerful open-source tools to improve your data science project workflow.
This newsletter is part of a series introducing open-source tools used in Machine Learning Operations (MLOps). Each installment will introduce new tools for a different part of the process, and at the end of the series we will combine everything into a cohesive MLOps project.
Spoiler alert! The graph above is what this article is all about. The repository for the article is linked below, so read on.
If you have decided to read this article, it means you want to learn more about experiment tracking.
In a data science project, experiment tracking ensures reproducibility by logging code, data, and parameters, which is important for verifying results and sharing findings. It also keeps your work organized: experiments are recorded systematically, which makes it easier to compare performance and select the best model.
The two tools we will use in this article are MLFlow and DVC. I have explained both tools in my previous newsletters; if you haven’t read them, you can check them out below.
With that in mind, let’s go to the main part of our story today.
Experiment Tracking with MLFlow and DVC
For this tutorial, we will use the open-source Pima Indians Diabetes dataset from UCI, hosted on Kaggle.
Let’s start by creating folders for your project in your favorite IDE. For this one, I am using Visual Studio Code. In your command prompt, run the following commands to set up the project structure:
mkdir my_project
cd my_project
mkdir data models scripts TEMP
Replace “my_project“ with your intended project name. Within the project folder, we create four subfolders: data, models, scripts, and TEMP. We will use all of them in our scripts.
Put the dataset from Kaggle in the data folder. I renamed the file to “data.csv”, so that is how I will refer to it in the scripts.
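Later, once the packages below are installed, you can quickly confirm that the file loads and has the expected target column. A minimal sketch; the shape check assumes the standard Pima Indians Diabetes CSV:
import pandas as pd

df = pd.read_csv('data/data.csv')
print(df.shape)             # the standard Pima dataset has 768 rows and 9 columns
print(df.columns.tolist())  # should include the 'Outcome' target column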
Next, we will set up a virtual environment. You can create it with the following command:
python -m venv myvenv
Change “myvenv” to whatever name you want for your virtual environment, then activate it. The activation command depends on your operating system:
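# Windows
myvenv\Scripts\activate

# macOS / Linux
source myvenv/bin/activate
With the environment active, install the following packages: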
pip install mlflow dvc scikit-learn pandas matplotlib joblib
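Optionally, you can pin the installed versions to a requirements file so the exact environment can be recreated later:
pip freeze > requirements.txt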
After installing all the packages, we initialize Git in our project folder.
git init
Then we also initialize DVC.
dvc init
We will use a local folder as the DVC remote that stores our tracked data. Set it up with the following commands:
mkdir TEMP/dvcstore
dvc remote add -d myremote TEMP/dvcstore
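You can confirm the remote was registered with:
dvc remote list
It should list myremote pointing at TEMP/dvcstore.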
Next, we need to create a .gitignore file, as our experiment will produce many artifacts, logs, and files. Create the file and add the entries below. Don’t forget to include your virtual environment folder; we also keep the TEMP folder out of Git, since it only serves as a local DVC store.
myvenv/
TEMP/
.env/
__pycache__/
*.py[cod]
.dvc/cache/
.dvc/tmp/
.dvc/.lock
mlruns/
.DS_Store
Thumbs.db
*.log
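At this point it is a good idea to make a first commit, since dvc init has already created its own config files that belong in Git. A minimal sketch of the usual flow:
git add .gitignore .dvc .dvcignore
git commit -m "Initialize project with Git and DVC"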
Then, create a file called training.py in the scripts folder and fill it with the following code.
import pandas as pd
from sklearn.model_selection import train_test_split
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import joblib
import os
# Step 1: Load and preprocess the data
df = pd.read_csv('data/data.csv')
X = df.drop('Outcome', axis=1)
y = df['Outcome']
# Step 2: Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 3: Save the processed data
X_train.to_csv('data/X_train.csv', index=False)
X_test.to_csv('data/X_test.csv', index=False)
y_train.to_csv('data/y_train.csv', index=False)
y_test.to_csv('data/y_test.csv', index=False)
# Step 4: Version the data with DVC
os.system('dvc add data/X_train.csv data/X_test.csv data/y_train.csv data/y_test.csv')
os.system('dvc push')
# Set the MLFlow experiment name
experiment_name = 'Diabetes Prediction Experiment'
run_name = 'diabetes_lr_run'
experiment = mlflow.get_experiment_by_name(experiment_name)
if experiment is not None:
    experiment_id = experiment.experiment_id
else:
    experiment_id = mlflow.create_experiment(experiment_name)
# Step 5: Start an MLFlow run and train the model
with mlflow.start_run(experiment_id=experiment_id, run_name=run_name) as run:
    model = LogisticRegression()
    model.fit(X_train, y_train)

    # Step 6: Predict and evaluate the model
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)

    # Step 7: Log parameters, metrics, and model
    mlflow.log_param("solver", "lbfgs")
    mlflow.log_metric("accuracy", accuracy)
    mlflow.sklearn.log_model(model, "model")

    # Step 8: Save and version the model
    joblib.dump(model, 'models/model.pkl')
    os.system('dvc add models/model.pkl')
    os.system('dvc push')

    # Step 9: Create and save the confusion matrix plot
    conf_matrix = confusion_matrix(y_test, predictions)
    disp = ConfusionMatrixDisplay(confusion_matrix=conf_matrix)
    disp.plot()
    plt.savefig('confusion_matrix.png')

    # Step 10: Log additional artifacts
    mlflow.log_artifact('confusion_matrix.png')
The code above runs through ten steps: loading and preprocessing the data, splitting and versioning the data with DVC, training the model, and tracking the experiment with MLFlow. MLFlow tracks the hyperparameters, metrics, and the model itself, while DVC versions the dataset and the model object.
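As a quick sanity check after a run, you can load the versioned model back and re-score the held-out split. A minimal sketch, assuming the file paths and column names used in the script above:
import joblib
import pandas as pd
from sklearn.metrics import accuracy_score

# Load the model and test split that the training script saved
model = joblib.load('models/model.pkl')
X_test = pd.read_csv('data/X_test.csv')
y_test = pd.read_csv('data/y_test.csv')['Outcome']

# Recompute the accuracy that was logged to MLFlow
print(accuracy_score(y_test, model.predict(X_test)))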
With the file ready, try running the experiment. You can run it with the following command.
python scripts/training.py
If the run succeeds, you can inspect the experiment tracking results in the MLFlow UI. To launch it, use the following command.
mlflow server
By default, the server is available at http://127.0.0.1:5000. If everything worked, you will see an experiment similar to the image below.
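If that port is already in use, you can pass a different host or port; these flags are part of the standard MLFlow CLI:
mlflow server --host 127.0.0.1 --port 8080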
Once everything looks good, you can push the project to your GitHub repository. I put mine in this repository.
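Pushing follows the usual Git flow. A minimal sketch, assuming you have already created an empty repository on GitHub, added it as origin, and that your default branch is main:
git add .
git commit -m "Track diabetes experiment with MLFlow and DVC"
git push -u origin main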
That’s all you need to do! With one simple script, you can track your experiment and version the file objects in a single run. Visit the MLFlow documentation to see what else you can track.
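For example, MLFlow can log scikit-learn parameters and metrics automatically; calling its autologging API once before model.fit() is enough. A minimal sketch:
import mlflow

mlflow.autolog()  # enables automatic logging for supported libraries, including scikit-learn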
I hope it helps!