The Fault May Lie in Your Data

Data problems that cause your model to underperform

Feb 14, 2021

Have you ever created a machine learning model and found that its metrics fell short of what you expected? Well, the problem might lie in your data.

Try asking these questions before you examine other root causes:

Do You Have Enough Data?

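Your machine learning model might underperform if you do not have enough data to train on. Too little data makes the model prone to overfitting (high variance), and a small sample is also more likely to be unrepresentative, which creates a model that only predicts well on the biased data it has seen.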
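One way to check whether data volume is the bottleneck is to train on progressively larger subsets and watch the held-out score. The sketch below does this on synthetic data with a toy threshold classifier. The data generator, the classifier, and every function name here are assumptions made up for illustration, so swap in your own data and model.

```python
import random
import statistics


def make_data(n_per_class, rng):
    """Synthetic 1-D data (an assumption): class 0 ~ N(0, 1), class 1 ~ N(2, 1)."""
    data = [(rng.gauss(0.0, 1.0), 0) for _ in range(n_per_class)]
    data += [(rng.gauss(2.0, 1.0), 1) for _ in range(n_per_class)]
    return data


def fit_threshold(train):
    """Toy model: put the decision threshold midway between the class means."""
    mean0 = statistics.fmean(x for x, y in train if y == 0)
    mean1 = statistics.fmean(x for x, y in train if y == 1)
    return (mean0 + mean1) / 2


def accuracy(threshold, data):
    """Share of points classified correctly when x > threshold means class 1."""
    return sum((x > threshold) == (y == 1) for x, y in data) / len(data)


def learning_curve(sizes, repeats=30, seed=0):
    """Map each per-class training size to (mean, std) of held-out accuracy."""
    rng = random.Random(seed)
    test = make_data(1000, rng)  # fixed held-out set, reused for every size
    curve = {}
    for n in sizes:
        scores = [accuracy(fit_threshold(make_data(n, rng)), test)
                  for _ in range(repeats)]
        curve[n] = (statistics.fmean(scores), statistics.pstdev(scores))
    return curve


curve = learning_curve([2, 20, 200])
for n, (mean, std) in curve.items():
    print(f"{n:>4} per class: accuracy {mean:.3f} +/- {std:.3f}")
```

If the score is still climbing and the spread is still wide at your largest size, more data is likely to help. If the curve has flattened, look at the other questions first.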

Do You Have Imbalanced Data?

Imbalanced data might skew your model toward a particular class. Some might say that you need to remove outliers or use an oversampling technique (such as SMOTE) to tackle this. I only somewhat agree with this approach.

Why? Because this is the pattern of the data. The imbalance itself is often what we find in the real world. That is why you should check whether the imbalance comes from the data itself or from some other circumstance, such as the way the data was collected.
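A quick way to tell the two cases apart is to compare the class shares overall with the class shares inside each slice of the data. The sketch below assumes each record is a dict. The "source" and "label" fields, the sample rows, and the helper names are all hypothetical.

```python
from collections import Counter, defaultdict


def class_shares(labels):
    """Return the fraction of samples that belongs to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def class_shares_by_group(rows, group_key, label_key):
    """Compute class shares separately for each value of group_key."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[group_key]].append(row[label_key])
    return {group: class_shares(labels) for group, labels in grouped.items()}


# Hypothetical records: two collection channels with very different class mixes.
rows = (
    [{"source": "web", "label": "fraud"}] * 2
    + [{"source": "web", "label": "ok"}] * 98
    + [{"source": "branch", "label": "fraud"}] * 10
    + [{"source": "branch", "label": "ok"}] * 10
)

overall = class_shares(row["label"] for row in rows)
by_source = class_shares_by_group(rows, "source", "label")

print(overall)    # fraud is 10% of all rows
print(by_source)  # 2% fraud on "web" but 50% on "branch"
```

When one slice looks very different from the rest, as "branch" does here, the imbalance may say more about how the data was gathered than about the real world.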

I would suggest using a more appropriate model, doing feature engineering, and trying class weighting.
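As a minimal sketch of the weighting idea, the code below gives each class a weight of n_samples / (n_classes * class_count), the same heuristic scikit-learn applies for class_weight="balanced". The function names and the toy labels are assumptions for illustration, and weighting the evaluation metric is only one of several places where weights can be applied.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


def weighted_accuracy(y_true, y_pred, weights):
    """Accuracy where each sample counts in proportion to its class weight."""
    total = sum(weights[t] for t in y_true)
    correct = sum(weights[t] for t, p in zip(y_true, y_pred) if t == p)
    return correct / total


# Toy labels (an assumption): 90 majority samples and 10 minority samples.
labels = ["ok"] * 90 + ["fraud"] * 10
weights = balanced_class_weights(labels)

# A useless model that always predicts the majority class.
always_ok = ["ok"] * len(labels)
plain = sum(t == p for t, p in zip(labels, always_ok)) / len(labels)
weighted = weighted_accuracy(labels, always_ok, weights)

print(weights)
print(f"plain accuracy: {plain:.2f}, weighted accuracy: {weighted:.2f}")
```

The always-majority model looks strong on plain accuracy but drops to chance once each class carries equal total weight, which is the kind of signal weighting is meant to expose.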

Are You Sure Your Data Quality Is Good?

When you get data from your sources, make sure the quality is there. I once had a case where the data I received was not usable: it had been collected in a messy way and did not even meet the minimum standard for a data science project.

Do You Have Clean Data?

Most of the time, your data is not clean at all. There might be missing values or inconsistent categories within a column. Clean the data as well as you can, and ask the people behind the data sources if you are not sure.
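Both problems are easy to surface before any modelling. The sketch below counts missing values per column and lists category spellings that collapse to the same value once case and surrounding whitespace are ignored. The set of tokens treated as missing, the sample rows, and the helper names are assumptions, so adjust them to your own data.

```python
from collections import defaultdict

# Strings treated as "missing" (an assumption; extend for your own sources).
MISSING_TOKENS = {"", "n/a", "na", "null", "-"}


def is_missing(value):
    """True for None and for strings that only stand in for a missing value."""
    if value is None:
        return True
    return isinstance(value, str) and value.strip().lower() in MISSING_TOKENS


def missing_counts(rows, columns):
    """Count missing values in each of the given columns."""
    return {col: sum(is_missing(row.get(col)) for row in rows) for col in columns}


def category_variants(rows, column):
    """Find raw spellings that normalize to the same category."""
    variants = defaultdict(set)
    for row in rows:
        value = row.get(column)
        if not is_missing(value):
            variants[str(value).strip().lower()].add(value)
    return {key: sorted(raw) for key, raw in variants.items() if len(raw) > 1}


# Hypothetical records with typical problems.
rows = [
    {"city": "Jakarta", "age": 31},
    {"city": "jakarta ", "age": None},
    {"city": "JAKARTA", "age": 45},
    {"city": "Bandung", "age": ""},
    {"city": "N/A", "age": 28},
    {"city": None, "age": 52},
]

missing = missing_counts(rows, ["city", "age"])
variants = category_variants(rows, "city")

print(missing)
print(variants)
```

A report like this also gives you concrete questions to take back to whoever owns the data, for example whether "N/A" means unknown or not applicable.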

Non-Brand Data
