Understanding Data Leakage in Machine Learning - NBD Lite #12

Small thing with massive effect...

Sep 20, 2024

Have you ever trained a model and achieved perfect performance?

Like 100% accuracy, precision, and recall on the test set?

Would you be happy if that happened? Well, I would certainly be dumbfounded, because I wouldn’t believe it.

There is a saying that “All models are wrong, but some are useful.”

It means that no model can be perfect; if one appears to be, something is probably wrong.

One common cause is data leakage, which we will explore in more depth in the next section. Here is a summary of data leakage.

Data Leakage

Data Leakage is an event where the training dataset contains outside information that should not be there, that is, information the model would not have at prediction time.

Why should we be concerned about data leakage? There are a few things that could happen, including:

Overfitting to Training Data
Inflated Performance Metrics
Misleading Insights
Wasted Resources

We don’t want data leakage during the model training process.

There are two types of Data Leakage:

1. Target Leakage

Target Leakage occurs when the model is trained on data that contains target or feature information that would not be available at prediction time.

For example, let’s take a look at the table below.

We have training data for predicting fraud occurrence. However, there is a leak: a feature called Fraud Loss, which exists only after a fraud event has occurred.

The presence of the Fraud Loss feature is Target Leakage. The model would lean on it and overfit: it looks excellent in evaluation but fails in production, where the value is not yet known.

That’s why we need to exclude any feature that strongly affects the prediction but would not be available at prediction time.
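One rough way to spot a feature like Fraud Loss is to check how well it predicts the target all by itself. The sketch below uses made-up records and hypothetical column names (`amount`, `fraud_loss`, `is_fraud`), and the list of post-event columns is something we assume is known from domain knowledge. It is an illustration, not a general leakage detector.

```python
# Toy records with hypothetical column names; all values are made up.
records = [
    {"amount": 120.0, "fraud_loss": 0.0, "is_fraud": 0},
    {"amount": 980.0, "fraud_loss": 980.0, "is_fraud": 1},
    {"amount": 45.5, "fraud_loss": 0.0, "is_fraud": 0},
    {"amount": 300.0, "fraud_loss": 150.0, "is_fraud": 1},
    {"amount": 990.0, "fraud_loss": 0.0, "is_fraud": 0},
]


def rule_accuracy(rows, feature, target="is_fraud"):
    """Accuracy of the trivial rule 'predict fraud when feature > 0'."""
    hits = sum(int(row[feature] > 0) == row[target] for row in rows)
    return hits / len(rows)


# A single feature that reproduces the target perfectly is a red flag.
leaky_accuracy = rule_accuracy(records, "fraud_loss")
baseline_accuracy = rule_accuracy(records, "amount")

# Columns that exist only after the event (assumed known from domain knowledge).
POST_EVENT_COLUMNS = {"fraud_loss"}


def drop_post_event(rows, post_event=POST_EVENT_COLUMNS):
    """Return copies of the rows without the post-event columns."""
    return [{k: v for k, v in row.items() if k not in post_event} for row in rows]


clean_records = drop_post_event(records)
print(leaky_accuracy, baseline_accuracy)
```

A perfect score from one column does not prove leakage, but it is a good reason to ask when that value actually becomes available.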

2. Train-Test Contamination

Train-Test Contamination is an event where the test data “leaks” into the training data.

There are many situations where it could happen, including:

Data preprocessing steps for transformation (e.g., scaling or encoding) are applied before splitting the dataset.

Data transformations such as normalization learn parameters (for example, the mean and standard deviation) from the data they are fitted on. If they are fitted on the whole dataset, those parameters carry information from the test data into the training data. Thus, data leakage happens if the transformation is fitted before splitting the dataset. Instead, fit the transformer on the training data only and reuse it on the test data:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit only on the training data
X_train_scaled = scaler.fit_transform(X_train)

# Apply the transformation on the test data
X_test_scaled = scaler.transform(X_test)
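To make the effect concrete, here is a from-scratch sketch of standardization on made-up numbers. The helper names `fit_scaler` and `transform` are hypothetical and only mimic the idea behind StandardScaler (mean and population standard deviation); they are not scikit-learn's implementation.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn the mean and population standard deviation from the given values."""
    return mean(values), pstdev(values)


def transform(values, mu, sigma):
    """Standardize values with previously learned parameters."""
    return [(v - mu) / sigma for v in values]


# Made-up one-feature data: the test set holds an extreme value.
train = [1.0, 2.0, 3.0, 4.0]
test = [100.0]

# Leaky: parameters are learned from train and test together.
leaky_mu, leaky_sigma = fit_scaler(train + test)

# Correct: parameters are learned from the training data only.
clean_mu, clean_sigma = fit_scaler(train)

train_scaled = transform(train, clean_mu, clean_sigma)
test_scaled = transform(test, clean_mu, clean_sigma)

print(f"leaky mean={leaky_mu}, clean mean={clean_mu}")
```

The leaky mean is pulled far away from the training values by a point the model should never have seen. In scikit-learn, wrapping the scaler and the model in a Pipeline gives the same protection during cross-validation, because the scaler is refit on the training portion of each fold.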

Improper time-series handling, where future information is used to predict past events.

We can’t split time-series data the same way we split ordinary tabular data. Time-series data is ordered, and each data point is related to the ones before it.

If we split it randomly, there would usually be leakage, because data from the future ends up being used to predict the past, which should not happen.

Here is a simple Python implementation for splitting a time series. It assumes X and y are pandas objects already sorted by time.

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_index, test_index in tscv.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
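When a single holdout set is enough, the same idea can be sketched without cross-validation: sort by time and keep the most recent rows for testing. The function name `chronological_split`, the `date` key, and the sample rows below are assumptions made for illustration.

```python
from datetime import date, timedelta


def chronological_split(rows, test_fraction=0.2, time_key="date"):
    """Sort rows by time, then keep the most recent rows as the test set."""
    ordered = sorted(rows, key=lambda row: row[time_key])
    n_test = max(1, round(len(ordered) * test_fraction))
    cutoff = len(ordered) - n_test
    return ordered[:cutoff], ordered[cutoff:]


# Made-up daily observations, deliberately listed out of order.
start = date(2024, 1, 1)
rows = [
    {"date": start + timedelta(days=d), "value": d * 1.5}
    for d in (3, 0, 9, 5, 1, 7, 2, 8, 4, 6)
]

train_rows, test_rows = chronological_split(rows, test_fraction=0.2)

# Every training row must be older than every test row.
latest_train = max(row["date"] for row in train_rows)
earliest_test = min(row["date"] for row in test_rows)
print(latest_train, earliest_test)
```

Checking that the latest training date is earlier than the earliest test date is a cheap sanity test worth keeping in any time-series workflow.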

Overlapping samples between training and test sets.

With improper data splitting, the same samples can appear in both the training data and the test data.

This causes data leakage: data that should be unknown to the model is present during training, so the test score rewards memorization and hides overfitting.

To avoid that, we can use the standard train-test split from scikit-learn, which assigns each row to exactly one of the two sets. Note that it does not remove duplicate rows, so those should be dropped before splitting.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
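A split can still overlap if the dataset itself contains repeated rows. The sketch below, on made-up samples, drops exact duplicates before splitting and then verifies that no row ends up in both sets. The function name `deduplicated_split` is hypothetical, and rows are assumed to be hashable tuples.

```python
import random


def deduplicated_split(rows, test_fraction=0.2, seed=42):
    """Drop exact duplicate rows, shuffle, then split into train and test."""
    unique_rows = list(dict.fromkeys(rows))  # keeps the first occurrence of each row
    rng = random.Random(seed)
    rng.shuffle(unique_rows)
    n_test = max(1, round(len(unique_rows) * test_fraction))
    return unique_rows[n_test:], unique_rows[:n_test]


# Made-up samples as (feature_1, feature_2, label) tuples; two rows are repeated.
samples = [
    (5.1, 3.5, 0), (4.9, 3.0, 0), (6.2, 2.9, 1), (5.1, 3.5, 0),
    (5.9, 3.0, 1), (6.7, 3.1, 1), (4.9, 3.0, 0), (5.0, 3.4, 0),
    (6.0, 2.2, 1), (5.5, 2.4, 1), (4.6, 3.1, 0), (6.3, 2.5, 1),
]

train_rows, test_rows = deduplicated_split(samples)

# Rows present in both sets; an empty set means no overlap.
overlap = set(train_rows) & set(test_rows)
print(len(train_rows), len(test_rows), overlap)
```

With pandas, calling `DataFrame.drop_duplicates()` before `train_test_split` serves the same purpose.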

That’s all for today! I hope this helps you understand what data leakage is and when it occurs.

Are there any more things you would love to discuss? Let’s talk about it together!
