The Hidden Mistakes in Machine Learning Models
A few subtle pitfalls that limit your model's performance.
🔥Reading Time: More than 3 Minutes🔥
🔥Benefit Time: A lot of Time🔥
Extra learning materials below. Don’t miss them👇👇
Machine Learning modeling seems easy. You only need the data and Python library, then BOOM! You end up with the model.
Well, I hate to tell you that you might end up with a “machine learning model”, but not the right one. It is probably a model that is unsuitable for production and lacks business value.
Programming makes it easier to develop a model, but many things can go wrong along the way, and you may have made some of these mistakes already. No worries. Whether you already know them or not, let’s explore the machine learning pitfalls I want you to be aware of.
What are they? So, let’s get into it!
Btw, in case you are missing some of the best FREE End-to-End MLOps Projects, you can visit them here💥👇
#1. Poor Data Quality
I think this is one of the most obvious pitfalls in machine learning modeling, yet many still make the mistake of training models on poor-quality data.
Poor data quality can take many forms, including data that is noisy, incomplete, or irrelevant:
Noisy data contains many errors or random variations, which can obscure the patterns the model needs to learn.
Incomplete data means some values are missing, often because of human error in collection rather than natural causes.
Lastly, irrelevant data has no meaningful relationship with the problem we are trying to solve.
Any of these data quality problems can lead to inaccurate models and degraded performance.
You can take a few actions to improve data quality, including thorough data cleaning: handling missing values, removing duplicates, and correcting inconsistencies. Also, align with the business to ensure the data you use is relevant to the problem.
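To make the cleaning steps above concrete, here is a minimal pandas sketch. The dataset, column names, and validity rules (ages between 0 and 100) are all hypothetical, just to illustrate handling missing values, removing duplicates, and correcting inconsistencies.

```python
import pandas as pd
import numpy as np

# Hypothetical raw data showing the three quality issues described above
raw = pd.DataFrame({
    "age":    [25, 25, np.nan, 130, 41],        # a missing value and an impossible age
    "salary": [50000, 50000, 62000, 58000, None],
    "city":   ["Jakarta", "Jakarta", "Bandung", "bandung", "Surabaya"],
})

clean = (
    raw
    .drop_duplicates()                                # remove exact duplicate rows
    .assign(city=lambda d: d["city"].str.title())     # correct inconsistent casing
    # treat out-of-range ages as missing (assumed valid range: 0-100)
    .assign(age=lambda d: d["age"].where(d["age"].between(0, 100)))
)
clean["age"] = clean["age"].fillna(clean["age"].median())  # impute missing ages
clean = clean.dropna(subset=["salary"])                    # drop rows missing a key field
```

Which fixes are appropriate (imputing vs dropping, what counts as “impossible”) depends on the problem, which is exactly why the business alignment above matters.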
#2. Ignoring the Business Context
Speaking of business alignment, modeling is often performed for the sake of building something advanced rather than providing business value.
Ultimately, we are using the model to solve the business problem. Focusing solely on technical aspects without considering the business implications can lead to technically sound models that are practically useless.
I know it is nice to explore what models we can develop and to experiment with complex techniques. We can still do that, but we must work closely with domain experts to ensure the model aligns with business needs.
I also want to reiterate the next point in business alignment: we need to validate our model’s impact through business metrics, not just technical metrics. Accuracy, precision, recall, and the like sound nice, but the business KPIs also need to take the spotlight here.
Regarding business KPIs, your model is only useful if the business uses its results. I have seen a model become obsolete because the business didn’t believe in our results, no matter how great the technical metrics were.
If the business doesn’t even want to use your model, you can’t validate your business KPIs via the ML model. What can you do in this case? In my opinion, it’s a much deeper problem within the business, and the leaders must act on it.
If you are a data leader or aim to become one and want to learn how to solve the problem above, I recommend you read the latest article, Explaining Why Data & Models Aren’t Always Right & Getting Leaders To Act On Them. In the meantime, I also have a nice article to read if you want to become a great data scientist.👇
#3. Overcomplicating the Model
Let’s be real. I know many readers would love to develop a fancy model and aim to use an advanced model for every use case.
I’m not saying that’s wrong, but many business problems can be solved with the simplest model, or even without ML at all.
I would refer to the philosophical principle Occam’s Razor, or the Law of Parsimony, which states that the simplest explanation is usually the best. The same holds in machine learning, where the simplest model that solves the problem is usually the best choice.
There is no exact definition of what makes a model simple or complex. However, as a rule of thumb, a Linear Regression model is deemed simple while Deep Learning is complex. If you want to know more about simple vs complex models, read my article: Are We Undervaluing Simple Models?.
The important thing is to start with simpler models and increase complexity only when necessary. If the performance already meets your requirements, stick with the simpler model. It also serves as your benchmark if you later decide to increase model complexity.
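The “simple model as a benchmark” rule can be written down as a small decision helper. This is only a sketch: the model names, scores, and the 0.02 minimum gain are hypothetical, and you would pick a margin that matters for your business.

```python
def choose_model(candidates, baseline_name, min_gain=0.02):
    """Prefer the simplest model; accept a more complex one only if it beats
    the current pick by at least `min_gain` on the chosen validation metric.
    `candidates` maps model name -> score, ordered simplest to most complex."""
    best_name, best_score = baseline_name, candidates[baseline_name]
    for name, score in candidates.items():
        if score >= best_score + min_gain:  # only a meaningful improvement counts
            best_name, best_score = name, score
    return best_name

# Hypothetical validation scores (e.g., ROC AUC)
scores = {"logistic_regression": 0.86, "random_forest": 0.87, "deep_net": 0.875}
picked = choose_model(scores, "logistic_regression")
# the 1-2 point gains don't justify the extra complexity, so the baseline wins
```

The point is that the complex model must earn its keep: without a margin like `min_gain`, any tiny metric bump would justify unbounded complexity.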
Regularly validate model performance on unseen data to ensure generalization. Your model’s performance can change over time, so monitoring is also important. If, during monitoring, your simple model can’t do the job anymore, it’s time to increase the model complexity.
#4. Avoiding Model Interpretability
In my experience, getting business people to trust our model requires explaining why the model produces a given output and how each prediction could impact the business.
Normally, many focus on achieving the highest possible predictive performance with complex models. While these models bring exceptional metric results, they become problematic if they cannot be explained well to the business.
Furthermore, technical explainability is not business explainability. You might think explainability is easily achieved with model-agnostic techniques such as SHAP or LIME.
These techniques can help you explain each feature’s contribution to the model’s prediction, but would the business understand the meaning of “Salary has a negative SHAP value, meaning it decreases the likelihood of a particular class being predicted”?
What does that mean to the business? Is it good or bad? Our task is to translate the technical explainability above to business phrases as well.
To do that, we need to discuss the results with the business and ensure our output is explained in a way the business would understand. The business needs to understand both the result and the features you used.
I assume that my business counterparts don’t understand technical explanations, so I learn how the business works and explain my results in a way they can digest. Explainability should bridge the gap between technical and business, so make sure you do not avoid it.
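One way to do that translation is a thin layer that turns SHAP-style contributions into plain sentences. This is only a sketch: the feature names, the "customer churn" outcome, and the phrasing are invented for illustration, and the input is just a `feature -> signed contribution` mapping rather than a real SHAP output.

```python
def to_business_language(contributions, outcome="customer churn"):
    """Translate model-agnostic feature contributions (feature -> signed value,
    e.g. SHAP values for one prediction) into plain statements the business
    can act on, ordered by impact."""
    statements = []
    for feature, value in sorted(contributions.items(), key=lambda kv: -abs(kv[1])):
        direction = "raises" if value > 0 else "lowers"
        statements.append(
            f"{feature.replace('_', ' ').title()} {direction} the predicted "
            f"chance of {outcome} (relative impact {abs(value):.2f})."
        )
    return statements

# Hypothetical SHAP-style contributions for a single prediction
explained = to_business_language(
    {"salary": -0.30, "tenure_months": -0.12, "support_tickets": 0.45}
)
```

The wording itself ("raises the chance of churn") should come out of the conversations with the business described above, not from the data team alone.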
#5. Neglecting Model Versioning and Maintenance
Having a model is not enough. Machine learning is a lifecycle that needs to be iterated over time to continually provide value to the business.
The problem I often see in machine learning projects is neglecting versioning and maintenance in favor of quick model deployment, without a standard for model iteration.
This is even more true in a fast-paced environment, as teams might prioritize rapid development and deployment over maintaining a rigorous version control system for the model.
However, neglecting them can lead to several problems, including challenges in model debugging, reproducing the results, and even compliance issues.
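As a minimal sketch of what model versioning captures, here is a toy registry that records the metadata needed for debugging and reproducibility. In practice you would use a dedicated tool (MLflow, DVC, or similar); the function name and registry structure here are invented for illustration.

```python
import hashlib
import datetime

def register_model_version(model_bytes, metrics, registry):
    """Record a serialized model artifact with enough metadata to find,
    reproduce, and debug it later: a content hash as the version id,
    its training metrics, and a creation timestamp."""
    version = hashlib.sha256(model_bytes).hexdigest()[:12]
    registry[version] = {
        "created_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "metrics": metrics,
        "size_bytes": len(model_bytes),
    }
    return version

# Hypothetical usage: two training runs produce two distinct versions
registry = {}
v1 = register_model_version(b"serialized-model-v1", {"auc": 0.87}, registry)
v2 = register_model_version(b"serialized-model-v2", {"auc": 0.89}, registry)
```

With even this much recorded, you can answer "which model produced these predictions, and how good was it at training time?", which is exactly what becomes impossible when versioning is neglected.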
The proper concepts for deploying and maintaining machine learning models fall within the MLOps field. If you want to learn more, you can visit my previous article.
As versioning and maintenance are important in machine learning development, neglecting them would be a big mistake.
That’s all for the mistakes you might make in machine learning development. I hope your Machine Learning journey is going strong🔥!
What do you think? Which topics would you like me to cover in more depth? Share and discuss them in the comments below.👇
Or by joining the chat👇
Articles to Read
Here are some of my latest articles you might have missed this week.
Building a Recommendation System with Hugging Face Transformers in KDnuggets
NumPy with Pandas for More Efficient Data Analysis in KDnuggets
Use XGBoost Like a Pro in Non-Brand Data
🤩Also, here are some of my favorite readings of the week.
How to Network as a Data Scientist by Haden Pelletier
Learning to Unlearn: Why Data Scientists and AI Practitioners Should Understand Machine Unlearning by Raul Vizcarra Chirinos
Build Multi-Index Advanced RAG Apps by
How to Write a Resume That Doesn’t Suck: Must-Know Rules and Two Secrets to Beat ATS in 2024 by
FREE Resources to Take
Don’t miss these resources for your data learning!
GenAI for Beginners by Microsoft
Practical Deep Learning by fast.ai
Cookiecutter Template for Python Data Science Projects by AWS
Mode Analytics SQL Tutorial by Mode.com
Personal Notes
Not much is happening in my personal life other than restarting my Baldur’s Gate 3 playthrough, but this time I am playing it along with my friends. They have already spent over 900 hours in the game, so they know it better than I do.
Additionally, I had a Dungeons & Dragons session last weekend. Amazingly, I managed to roll a double natural 20 with advantage, followed by a double natural 1 with advantage. When I calculated the probability, the chance of that happening is around 0.95%. What an event!
In addition, my data learning business for Indonesians, Berdata, is in discussions with a partner in the USA to implement Artificial Intelligence courses for Indonesian universities. I will tell you more about the prospect after everything is established.
Looking forward to them!