Do You Want to Develop an ML Model? Avoid These Common Data Pitfalls
Common data-related machine learning pitfalls that you need to know
Machine learning development seems easy: you just need the data, call .fit(), and you have a model. Well, I am going to tell you that you might have a model, but is it the best model? Or even, is it the right model for your problem?
Many things can go wrong when you train a model, and the consequences can sometimes be fatal. Remember when Amazon's facial recognition falsely matched members of Congress to criminal mugshots? Or when racial bias was found in a widely used American healthcare risk algorithm?
Machine learning has the potential to bring a competitive advantage to a company, but only if it is correctly assessed and implemented. After all, a slight mistake in the algorithm could jeopardize many things.
That’s why we need to learn about machine learning pitfalls and avoid unnecessary trouble. If you fall into these common data-related traps, your models might be inaccurate, biased, or simply a waste of time.
What are these pitfalls? Let’s get into it.
Pitfall #1: Data Leakage
Data is the heart of our machine learning model; without good data, it would not work well. In other words, bad data equals a bad model. That’s why we first need to avoid any data troubles that would cause problems in the long run.
Data leakage is one of the most common pitfalls in machine learning. It happens when your model accidentally gets access to information during training that wouldn't be available when it makes real-world predictions. This leads to high metric scores that, sadly, won't translate to success in the real world.

A common scenario is building a predictive model for stock prices. One mistake would be including features calculated after the point in time you're trying to predict, such as the day's closing price. In a real trading scenario, you wouldn't know the closing price beforehand, so the feature is unusable in production and the offline metrics become misleadingly optimistic.
Data leakage isn't always obvious. Often, it comes from subtle errors in how you process features or split your data into training and testing sets. It also requires domain knowledge to recognize that some features might not be available at prediction time.
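To make the subtle variant concrete, here is a minimal sketch of preprocessing leakage, where a scaler is fit on the full dataset before splitting (the synthetic data and the scikit-learn model are illustrative assumptions, not from a real project):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# LEAKY: the scaler computes its statistics over rows that will later
# land in the test set, so test-set information bleeds into training.
X_leaky = StandardScaler().fit_transform(X)

# CORRECT: split first, then fit the scaler on the training rows only.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)
model = LogisticRegression().fit(scaler.transform(X_train), y_train)
print(model.score(scaler.transform(X_test), y_test))
```

With simple standardization the effect is small, but with target encoding, imputation, or feature selection, fitting on the full dataset can inflate test scores substantially.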
Avoiding data leakage requires careful, detailed work. Here are some tips that might help you (a code sketch follows the list):
Time-Based Splits: If you're working with time-series data, only use past data to train a model that predicts future points, and understand the temporal structure of your data carefully.
Care during Feature Engineering: Be careful when creating features from the data. Avoid incorporating information the model wouldn't have access to during prediction.
Data Pipelines: If you have complex data preparation steps, use tools and practices that help track how data is transformed to prevent the inclusion of future information.
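Here is a minimal sketch combining the first and third tips: scikit-learn's TimeSeriesSplit validates only on data that comes after each training fold, and a Pipeline re-fits the preprocessing inside each fold so no validation statistics leak in (the synthetic data and the Ridge model are illustrative assumptions):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))           # e.g., lagged features known before the target
y = X[:, 0] * 2 + rng.normal(size=500)

# The pipeline fits the scaler on each training fold only.
pipe = make_pipeline(StandardScaler(), Ridge())

# Each fold trains on the past and validates on the future.
cv = TimeSeriesSplit(n_splits=5)
print(cross_val_score(pipe, X, y, cv=cv, scoring="r2"))
```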
Pitfall #2: Bad Quality Data
Machine learning algorithms are fundamentally data-driven. They learn by identifying patterns and relationships within the data we provide, which means the quality of the data directly determines the quality of our model.
The phrase "garbage in, garbage out" is especially true in machine learning model development. Here is why:
Misleading Patterns: Dirty or inconsistent data can lead your model to discover false or unreliable patterns. Imagine a customer purchase dataset containing many typos and formatting errors. Your model might incorrectly learn that product names with misspellings are less popular when that's simply a reflection of bad data entry.
Lack of Representativeness: If your dataset doesn't accurately mirror the real-world scenarios your model will encounter, the model performance will suffer. A facial recognition system trained only on brightly lit photos of individuals might struggle to recognize faces in dim lighting.
Insufficient Volume: In many cases, a small dataset won't have enough examples for the model to generalize well. Consider trying to teach a self-driving car with only a few hours of driving data; there's no way it would have seen enough situations to navigate safely.
Building a successful machine learning model often requires shifting focus from algorithm selection to fixing data quality (see the sketch after this list). The work could include:
Data Collection: Prioritize careful data collection, focusing on representativeness and eliminating errors as early as possible.
Cleaning and Preprocessing: Take time to remove outliers, handle missing values, and transform data into formats our algorithms can work with.
Active Learning: If acquiring clean data is hard, we could consider active learning techniques, where our model strategically helps us decide which new data points to label, maximizing the use of our resources.
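As a minimal sketch of the cleaning and preprocessing step with pandas (the column names, values, and outlier rule are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical raw purchase data with typical quality issues
df = pd.DataFrame({
    "product": ["Laptop", "laptop ", "LAPTOP", None, "Phone"],
    "price": [999.0, 1010.0, np.nan, 45000.0, 499.0],  # 45000.0 looks like an outlier
})

# Normalize inconsistent text entries (formatting errors, casing)
df["product"] = df["product"].str.strip().str.lower()

# Handle missing values: drop rows without a product,
# impute missing prices with the median
df = df.dropna(subset=["product"])
df["price"] = df["price"].fillna(df["price"].median())

# Remove extreme outliers with the interquartile range rule
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
print(df)
```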
Pitfall #3: Data Drift and Concept Drift
In machine learning, we often assume that the data our model is trained on is representative of the data it will see in the real world. However, the real world is dynamic. Trends change, behaviours evolve, and new patterns emerge. This is what we call changing data distributions, or data drift. There is another kind of drift called concept drift, which is a change in the relationship between inputs and outputs.
Data Drift
Data drift refers to changes in the distribution of input data that a model receives over time. This type of drift is dangerous because machine learning models assume that real-world data will follow the same distribution as the data they were trained on. When this assumption is violated by drift in the input data distribution, the model's performance can degrade.
Data drift can occur due to various factors, such as seasonal variations, changes in consumer behaviour, or evolving market conditions. Identifying data drift involves a few techniques, including (a code sketch follows the list):
Statistical Monitoring: Use statistical tests to monitor the distribution of input features over time. Methods like the Kolmogorov-Smirnov test, Chi-square test, or Kullback-Leibler divergence can help identify changes in data distributions.
Visualization: Use visualization techniques like histograms or density plots to compare the distributions of data over different periods. This can help in visually identifying shifts in the data.
Change Point Detection: Implement change point detection algorithms that can automatically identify points in time where the data distribution changes significantly.
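Here is a minimal sketch of statistical monitoring with the Kolmogorov-Smirnov test from SciPy (the significance threshold and the synthetic reference/production windows are illustrative assumptions):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)

# Reference window: a feature as seen at training time
reference = rng.normal(loc=0.0, scale=1.0, size=5000)

# Production window: the same feature after its mean has shifted
production = rng.normal(loc=0.5, scale=1.0, size=5000)

# Compare the two empirical distributions
statistic, p_value = ks_2samp(reference, production)

ALPHA = 0.01  # illustrative significance threshold
if p_value < ALPHA:
    print(f"Data drift detected (KS={statistic:.3f}, p={p_value:.2e})")
else:
    print("No significant drift detected")
```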
To mitigate data drift, you can implement various methodologies (an online-learning sketch follows the list):
Re-training Models: Regularly update the model with recent data to ensure it reflects the current data distribution. Re-training can be automated, triggered either when data drift is detected or on a fixed schedule.
Feature Engineering: Adjust or create new features that are more robust to the changes in the data distribution. This might involve normalizing data or designing features that capture long-term trends rather than short-term fluctuations.
Adaptive Models: Use models that are inherently adaptive to changes in input data, such as online learning algorithms that continuously update their parameters based on new data.
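As a minimal sketch of the adaptive-model idea, scikit-learn's SGDClassifier can be updated incrementally with partial_fit (the simulated drifting stream is an illustrative assumption):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(1)
model = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])  # all classes must be declared up front

# Simulate a stream whose input distribution gradually shifts
for step in range(100):
    shift = step * 0.02
    X_batch = rng.normal(loc=shift, size=(32, 3))
    y_batch = (X_batch[:, 0] > shift).astype(int)

    # Update the model on each new batch instead of retraining from scratch
    model.partial_fit(X_batch, y_batch, classes=classes)

print(model.coef_)
```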
Concept Drift
Concept Drift happens when the relationship between the input data and the target variable changes over time. This means that even if the distribution of the input data remains constant, the way it relates to the target changes.
Concept drift is a reflection of changes in the real-world situation—for example, a change in consumer preferences affecting the correlation between product features and customer satisfaction.
To detect concept drift, there are a few methodologies we can use (a monitoring sketch follows the list):
Performance Monitoring: Regularly evaluate the model's performance on new data. A sudden drop in performance metrics like accuracy or F1 score might indicate concept drift.
Drift Detection Algorithms: Implement specific drift detection methods such as the Drift Detection Method (DDM), Early Drift Detection Method (EDDM), or ADaptive WINdowing (ADWIN) algorithm to automatically detect changes in the relationship between input data and the target.
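Here is a minimal, hand-rolled sketch of the performance-monitoring idea: track rolling accuracy and flag a significant drop against a baseline window. The window size and threshold are illustrative assumptions, and this is deliberately simpler than DDM or ADWIN:

```python
from collections import deque

class AccuracyDriftMonitor:
    """Flags possible concept drift when rolling accuracy drops below a baseline."""

    def __init__(self, window=200, drop_threshold=0.10):
        self.window = deque(maxlen=window)  # 1 if the prediction was correct, else 0
        self.baseline = None                # accuracy over the first full window
        self.drop_threshold = drop_threshold

    def update(self, y_true, y_pred):
        self.window.append(int(y_true == y_pred))
        if len(self.window) < self.window.maxlen:
            return False                    # not enough observations yet
        accuracy = sum(self.window) / len(self.window)
        if self.baseline is None:
            self.baseline = accuracy        # freeze the reference level
            return False
        return accuracy < self.baseline - self.drop_threshold

# Usage: feed each (label, prediction) pair as predictions are verified
monitor = AccuracyDriftMonitor()
# if monitor.update(y_true, y_pred):
#     print("Possible concept drift: consider re-training the model")
```

Libraries such as river provide tested implementations of the DDM, EDDM, and ADWIN detectors mentioned above if you prefer not to roll your own.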
Conclusion
Those are the common data pitfalls you might encounter when developing a machine learning model:
Data Leakage
Bad Quality Data
Data and Concept Drift
I hope it helps.
Don’t forget to share the post if you feel it is useful.