5 Don’ts and 5 Dos for Data Scientists
After collecting our data, as data people, we want to get on with our data analysis and create our perfect machine learning model.
We have all the creativity in the world and are limited only by our imagination, yet there is almost always something wrong with our data.
For that reason, here are my five personal don’ts and dos as a data scientist.
1. Dropping Data
Don’t: Drop data without proper analysis.
When we collect and investigate data, we are bound to find missing values or outliers.
I have often seen such data dropped without any further analysis. The usual justification is “There are only a few of them” or “They would not affect the machine learning model.” This line of thought is dangerous.
Do: Drop the data after careful analysis.
I am not saying that you can’t drop this data, only that you should do so after a thorough analysis.
This missing or outlier data might contain a vital pattern that would help answer our data science question.
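One simple check before dropping rows is to see whether the outcome looks different when a value is missing. The sketch below is only an illustration: the records, the column names `income` and `defaulted`, and the helper `missingness_report` are all made up for this example.

```python
# Hypothetical records: "income" is sometimes missing (None); "defaulted" is the target.
records = [
    {"income": 52000, "defaulted": 0},
    {"income": None, "defaulted": 1},
    {"income": 61000, "defaulted": 0},
    {"income": None, "defaulted": 1},
    {"income": 48000, "defaulted": 0},
    {"income": 75000, "defaulted": 1},
    {"income": None, "defaulted": 0},
    {"income": 39000, "defaulted": 0},
]


def target_rate(rows, target):
    """Share of rows where the target equals 1 (None if there are no rows)."""
    if not rows:
        return None
    return sum(row[target] for row in rows) / len(rows)


def missingness_report(rows, column, target):
    """Compare the target rate for rows with and without a value in `column`."""
    missing = [row for row in rows if row[column] is None]
    present = [row for row in rows if row[column] is not None]
    return {
        "missing_share": len(missing) / len(rows),
        "target_rate_when_missing": target_rate(missing, target),
        "target_rate_when_present": target_rate(present, target),
    }


report = missingness_report(records, "income", "defaulted")
print(report)
```

If the target rate differs a lot between the two groups, the missingness itself may carry information, and dropping those rows would throw it away. With this few rows the difference proves nothing; it is only a prompt to look closer.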
2. Relying on Accuracy
Don’t: Rely only on accuracy to measure “success.”
We have analysed our data and created a machine learning model. The model was then evaluated and showed 98% accuracy. You might think it is already a useful model, ready to be deployed. While higher accuracy could indicate a good model, relying solely on accuracy to measure the model’s success would be wrong.
Do: Measure “success” with other metrics as well.
Depending on the question we ask, accuracy might not be the best metric to represent the model, and this is often the case. Other metrics, such as Precision, Recall, F1, and Log Loss, might represent the model’s performance better.
For example, say we created a model to predict loan default. In the business case, we would not measure the model’s success by accuracy alone; we would first consider what is essential. Is correctly predicting the default cases more important or not? From there, we pick the metric that fits best.
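To see why accuracy alone can mislead, here is a minimal sketch with made-up numbers: 100 loans, of which 2 default, and a "model" that predicts no default for everyone. The function `classification_metrics` is written for this example, and returning 0.0 when precision or recall is undefined is a convention I chose, not a rule.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Assumption: report 0.0 when a ratio is undefined (no predicted or actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical data: 2 defaults (label 1) out of 100 loans.
y_true = [1] * 2 + [0] * 98
# A "model" that never predicts default.
y_pred = [0] * 100

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

The accuracy here is 98%, yet the recall is zero: not a single default is caught. If catching defaults is what the business cares about, recall (or F1) tells the real story.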
3. Cherry-Picking
Don’t: Select a subset of data to support your hypothesis.
You have designed your research properly and analysed your data, and the results show that your initial claim is wrong. Then you think, “Would it be better to select only the data that supports my hypothesis?” If you feel this way and proceed to select only the data supporting your claim, you are doing the wrong thing.
Do: Let the data speak as it is.
Let the data show you what patterns it has and work from there. It is sometimes hard to accept that our data does not show what we want, but selecting only the data you want to see leads to disastrous decisions.
This also applies to machine learning. Yes, selecting a subset of data might improve the accuracy, but be careful, because your data would no longer represent all cases.
For example, suppose that when you remove people under 25 years old from your data, the accuracy improves by 25%. It might look good, but your model would not represent anyone under 25 years old.
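Instead of removing a group to make the number look better, one option is to keep everyone and report the accuracy per group. The sketch below uses invented evaluation rows and an invented `age_band` split, purely to illustrate the idea.

```python
from collections import defaultdict

# Hypothetical evaluation rows: (age, true label, predicted label).
rows = [
    (22, 1, 0), (24, 0, 1), (23, 1, 1), (21, 0, 1),
    (30, 1, 1), (45, 0, 0), (52, 1, 1), (38, 0, 0),
    (61, 0, 0), (29, 1, 0),
]


def accuracy(rows):
    """Share of rows where the prediction matches the true label."""
    return sum(1 for _, truth, pred in rows if truth == pred) / len(rows)


def accuracy_by_group(rows, group_of):
    """Accuracy computed separately for each group returned by `group_of`."""
    groups = defaultdict(list)
    for row in rows:
        groups[group_of(row)].append(row)
    return {name: accuracy(members) for name, members in groups.items()}


def age_band(row):
    """Hypothetical split used only for this example."""
    return "under_25" if row[0] < 25 else "25_and_over"


overall = accuracy(rows)
by_group = accuracy_by_group(rows, age_band)
print("overall:", overall)
print("by group:", by_group)
```

Reporting it this way makes the weak spot visible. Dropping the under-25 rows would raise the headline accuracy, but the model would still be used on, and still fail for, people in that group.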
4. Causation is present
Don’t: Assume that correlation means causation without proper analysis.
We have the data, and we apply correlation analysis. The analysis shows a strong linear correlation between two variables. At this point, you think the relationship exists because one variable causes the other.
While that could be correct, as a data scientist you need to doubt it first, because correlation does not imply causation. Two variables can be correlated by chance, or because a third factor drives both of them.
Do: Find evidence to support a causation assumption.
A “Pearson” or “Spearman” correlation alone is not enough evidence. Write a proper research methodology and read the relevant literature to find that evidence.
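A small simulation can show how a strong correlation appears without any causal link. In the sketch below, everything is simulated and assumed: a hidden variable `z` drives both `x` and `y`, while `x` and `y` never influence each other. The helpers `pearson` and `residuals` are written by hand for this example.

```python
import random
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def residuals(values, control):
    """Remove the linear effect of `control` from `values` (simple least squares)."""
    mv, mc = mean(values), mean(control)
    slope = sum((c - mc) * (v - mv) for c, v in zip(control, values)) / sum(
        (c - mc) ** 2 for c in control
    )
    return [(v - mv) - slope * (c - mc) for c, v in zip(control, values)]


rng = random.Random(0)  # fixed seed so the run is repeatable
z = [rng.gauss(0, 1) for _ in range(2000)]   # hidden common cause
x = [zi + rng.gauss(0, 0.5) for zi in z]     # depends on z only
y = [zi + rng.gauss(0, 0.5) for zi in z]     # depends on z only

raw_corr = pearson(x, y)
# Correlation between x and y after removing what z explains in each of them.
partial_corr = pearson(residuals(x, z), residuals(y, z))

print("corr(x, y):", round(raw_corr, 3))
print("corr(x, y | z):", round(partial_corr, 3))
```

The raw correlation comes out strong, while the correlation after accounting for `z` sits close to zero. In real data the confounder is rarely known in advance, so a check like this is one piece of evidence at best, not proof of causation or its absence.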
5. Statistical Assumption
Don’t: Use a statistical method or machine learning model without knowing its assumptions.
I have often seen aspiring data scientists apply statistical methods (e.g., t-test, ANOVA, Pearson correlation) or machine learning models (e.g., Linear Regression, Random Forest, Boosting) without knowing their assumptions.
While the method might still be usable, the price for violating its assumptions is a less reliable result, which means the statistical analysis or the machine learning model could be misleading.
Do: Read the assumptions, but do not force the data to follow them.
You need to satisfy the assumptions if you want a trustworthy result. For example, an independent t-test assumes that the data in each group are normally distributed, that the observations are independent, and that the groups have equal variances (homoscedasticity). Violating any one of these assumptions makes the result less reliable.
In this case, you might be tempted to make the data fit the assumptions, but that is also not advisable. While you could try to transform the data toward a normal distribution (which is what happens most of the time), the transformed values no longer carry the original pattern. It is much better to try another method whose assumptions the data already fulfil, without any transformation or cherry-picking.
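As one illustration of "another method", a permutation test compares two groups without assuming normality. It is my choice of example, not the only option, and it still assumes the observations are independent. The data and the function `permutation_test` below are hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns an approximate p-value: the share of random relabelings whose
    absolute mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign observations to the two groups
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical skewed measurements, each group with one large outlier.
group_a = [1.2, 0.8, 1.5, 0.9, 1.1, 7.4, 1.0, 1.3]
group_b = [2.9, 3.4, 2.7, 12.5, 3.1, 3.8, 2.6, 3.3]

p_value = permutation_test(group_a, group_b)
print("approximate p-value:", p_value)
```

The data stay on their original scale and no transformation is needed. The p-value is an estimate that depends on the number of permutations and the seed, so treat small differences between runs as noise.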
Conclusion
Here I have shown you my five personal dos and don’ts as a data scientist. While others might argue that there are many more important things, for me, these five are the ones to remember as a data scientist.