Simple Model Experiment Tracking with MLFlow and DVC
Combine two powerful open-source tools to improve your data science project workflow.
This newsletter is part of a series introducing open-source tools used in Machine Learning Operations (MLOps). Each installment will introduce new tools for a different part of the process, and at the end of the series we will combine everything into a cohesive MLOps project.
Spoiler alert! The graph above is what this article is all about. The repository for the article is linked below, so read on.
If you have decided to read this article, it means you want to learn more about experiment tracking.
In a data science project, experiment tracking ensures reproducibility by logging code, data, and parameters, which is important for verifying results and sharing findings. It also keeps your work organized: experiments are recorded systematically, which makes it easier to compare performance and select the best model.
The two tools we will use in this article are MLFlow and DVC. I have explained both tools in my previous newsletters; if you haven’t read them, you can check them out below.
With that in mind, let’s go to the main part of our story today.
Experiment Tracking with MLFlow and DVC
For this tutorial, we will use the open-source Pima Indians Diabetes dataset from UCI, hosted on Kaggle.
Let’s start by creating folders for your project in your favorite IDE. For this one, I am using Visual Studio Code. In your command prompt, run the following commands to set up the project structure:
mkdir my_project
cd my_project
mkdir data models scripts TEMP
Replace “my_project“ with your intended project name. Within the project folder, we create four subfolders: data, models, scripts, and TEMP. We will use all of them in our scripts.
Put the dataset from Kaggle in the data folder. I renamed the file to “data.csv”, so that is how I will refer to it in the scripts.
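Later, once the packages below are installed, you can quickly confirm that the file loads and has the expected target column. A minimal sketch; the shape check assumes the standard Pima Indians Diabetes CSV:
import pandas as pd

df = pd.read_csv('data/data.csv')
print(df.shape)             # the standard Pima dataset has 768 rows and 9 columns
print(df.columns.tolist())  # should include the 'Outcome' target column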
Next, we will set up a virtual environment. You can create it with the following command:
python -m venv myvenv
Change “myvenv” to whatever name you want for your virtual environment, then activate it. The activation command depends on your operating system:
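# Windows
myvenv\Scripts\activate

# macOS / Linux
source myvenv/bin/activate
With the environment active, install the following packages: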
pip install mlflow dvc scikit-learn pandas matplotlib joblib
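Optionally, you can pin the installed versions to a requirements file so the exact environment can be recreated later:
pip freeze > requirements.txt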
After installing all the packages, we initialize Git in our project folder.
git init
Then we also initialize DVC.
dvc init
We will use a local folder as the DVC remote that stores our tracked data. Set it up with the following commands:
mkdir TEMP/dvcstore
dvc remote add -d myremote TEMP/dvcstore
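You can confirm the remote was registered with:
dvc remote list
It should list myremote pointing at TEMP/dvcstore.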
Next, we need to create a .gitignore file, as our experiment will produce many artifacts, logs, and files. Create the file and add the entries below. Don’t forget to include your virtual environment folder; we also keep the TEMP folder out of Git, since it only serves as a local DVC store.
myvenv/
TEMP/
.env/
__pycache__/
*.py[cod]
.dvc/cache/
.dvc/tmp/
.dvc/.lock
mlruns/
.DS_Store
Thumbs.db
*.log
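At this point it is a good idea to make a first commit, since dvc init has already created its own config files that belong in Git. A minimal sketch of the usual flow:
git add .gitignore .dvc .dvcignore
git commit -m "Initialize project with Git and DVC"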
Then, create a file called training.py in the scripts folder and fill it with the following code.
import pandas as pd
from sklearn.model_selection import train_test_split
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import joblib
import os
# Step 1: Load and preprocess the data
df = pd.read_csv('data/data.csv')
X = df.drop('Outcome', axis=1)
y = df['Outcome']
# Step 2: Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 3: Save the processed data
X_train.to_csv('data/X_train.csv', index=False)
X_test.to_csv('data/X_test.csv', index=False)
y_train.to_csv('data/y_train.csv', index=False)
y_test.to_csv('data/y_test.csv', index=False)
# Step 4: Version the data with DVC
os.system('dvc add data/X_train.csv data/X_test.csv data/y_train.csv data/y_test.csv')
os.system('dvc push')
# Set the MLFlow experiment name
experiment_name = 'Diabetes Prediction Experiment'
run_name = 'diabetes_lr_run'
experiment = mlflow.get_experiment_by_name(experiment_name)
if experiment is not None:
    experiment_id = experiment.experiment_id
else:
    experiment_id = mlflow.create_experiment(experiment_name)
# Step 5: Start an MLFlow run and train the model
with mlflow.start_run(experiment_id=experiment_id, run_name=run_name) as run:
    model = LogisticRegression()
    model.fit(X_train, y_train)

    # Step 6: Predict and evaluate the model
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)

    # Step 7: Log parameters, metrics, and model
    mlflow.log_param("solver", "lbfgs")
    mlflow.log_metric("accuracy", accuracy)
    mlflow.sklearn.log_model(model, "model")

    # Step 8: Save and version the model
    joblib.dump(model, 'models/model.pkl')
    os.system('dvc add models/model.pkl')
    os.system('dvc push')

    # Step 9: Create and save the confusion matrix plot
    conf_matrix = confusion_matrix(y_test, predictions)
    disp = ConfusionMatrixDisplay(confusion_matrix=conf_matrix)
    disp.plot()
    plt.savefig('confusion_matrix.png')

    # Step 10: Log additional artifacts
    mlflow.log_artifact('confusion_matrix.png')
The code above runs through ten steps: loading and preprocessing the data, splitting and versioning the data with DVC, training the model, and tracking the experiment with MLFlow. MLFlow tracks the hyperparameters, metrics, and the model itself, while DVC versions the dataset and the model object.
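As a quick sanity check after a run, you can load the versioned model back and re-score the held-out split. A minimal sketch, assuming the file paths and column names used in the script above:
import joblib
import pandas as pd
from sklearn.metrics import accuracy_score

# Load the model and test split that the training script saved
model = joblib.load('models/model.pkl')
X_test = pd.read_csv('data/X_test.csv')
y_test = pd.read_csv('data/y_test.csv')['Outcome']

# Recompute the accuracy that was logged to MLFlow
print(accuracy_score(y_test, model.predict(X_test)))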
With the file ready, try running the experiment. You can run it with the following command.
python scripts/training.py
If the run succeeds, you can inspect the experiment tracking results in the MLFlow UI. To launch it, use the following command.
mlflow server
By default, the server is available at http://127.0.0.1:5000. If everything worked, you will see an experiment similar to the image below.
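If that port is already in use, you can pass a different host or port; these flags are part of the standard MLFlow CLI:
mlflow server --host 127.0.0.1 --port 8080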
Once everything looks good, you can push the project to your GitHub repository. I put mine in this repository.
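Pushing follows the usual Git flow. A minimal sketch, assuming you have already created an empty repository on GitHub, added it as origin, and that your default branch is main:
git add .
git commit -m "Track diabetes experiment with MLFlow and DVC"
git push -u origin main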
That’s all you need to do! With one simple script, you can track your experiment and version the file objects in a single run. Visit the MLFlow documentation to see what else you can track.
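For example, MLFlow can log scikit-learn parameters and metrics automatically; calling its autologging API once before model.fit() is enough. A minimal sketch:
import mlflow

mlflow.autolog()  # enables automatic logging for supported libraries, including scikit-learn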
I hope it helps!