Advice to remember when dealing with data
After collecting our data, as data scientists, we just want to get on with the analysis and build our perfect machine learning model. We have all the creativity in the world, limited only by what is in our heads, and yet there are some things that are simply wrong to do with our data.
For that reason, here are my five personal don'ts and dos as a data scientist.
1. Dropping Data
Don't: Drop data without proper analysis
When we collect and investigate our data, it is bound to contain missing values or outliers. What I often see aspiring data scientists do is drop this data without any further analysis. The usual justification is "It is only a few rows" or "The data would not affect the machine learning model at all." This line of thought is dangerous.
Do: Drop the data after careful analysis.
I am not saying that you can never drop this data, only that you should do so after a thorough analysis. Missing values or outliers might contain a vital pattern that helps answer our data science question.
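As an illustration, here is a minimal sketch of what such an analysis could look like before any row is dropped. The file name and the columns (income, default) are hypothetical stand-ins for your own data, not part of any real dataset.

```python
import pandas as pd

# Load the dataset; 'loan_data.csv' and the column names below are hypothetical.
df = pd.read_csv("loan_data.csv")

# How much is actually missing, per column?
missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share)

# Is the missingness related to another variable? If rows with a missing
# 'income' differ systematically (e.g., by default rate), dropping them
# would bias the remaining sample.
print(df.groupby(df["income"].isna())["default"].mean())

# Inspect potential outliers instead of deleting them outright.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]
print(outliers.describe())
```

If a check like this shows that the missingness is systematic (for example, missing income concentrated among defaulters), dropping those rows would silently bias the analysis.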
2. Relying on Accuracy
Don't: Rely only on accuracy to measure "success."
We have analyzed our data and built the machine learning model. We evaluate it and it shows 98% accuracy, so we conclude it is already useful and ready to be deployed. While it is true that higher accuracy can indicate a great model, relying solely on accuracy to measure the data or the model's success would be wrong.
Do: Measure "success" with other metrics as well.
Depending on the question we ask, accuracy might not be the best metric to represent the model; in fact, it often is not. There are many metrics that might represent our data better, such as precision, recall, F1, log loss, and many more.
For example, say we created a model to predict loan default. In the business case, we would not measure the success of the model on accuracy alone; we would first consider what is essential. Is predicting the default cases the more important outcome or not? From there, we pick which metric is best to use.
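As a small, fabricated illustration of why accuracy alone can mislead, the toy example below uses scikit-learn metrics on a heavily imbalanced loan-default sample; the numbers are made up purely to show the effect.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, log_loss)

# Toy imbalanced example: 98 loans repaid (0), 2 defaults (1).
y_true = np.array([0] * 98 + [1] * 2)

# A "model" that always predicts "no default" already scores 98% accuracy...
y_pred = np.zeros(100, dtype=int)
y_prob = np.full(100, 0.02)  # predicted probability of default

print("accuracy :", accuracy_score(y_true, y_pred))                    # 0.98
print("precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0
print("recall   :", recall_score(y_true, y_pred))                      # 0.0 - catches no defaults
print("f1       :", f1_score(y_true, y_pred))                          # 0.0
print("log loss :", log_loss(y_true, y_prob))
```

Here the 98% accuracy hides the fact that the model never catches a single default, which precision, recall, and F1 immediately expose.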
3. Cherry-Picking
Don't: Select a subset of the data to support your hypothesis.
You have set up proper research, analyzed your data, and it turns out your previous claim is wrong. Then you think, "Wouldn't it be better to select only the data I deem good for my hypothesis?" If you feel like this and proceed with selecting only the data that supports your claim, you are doing the wrong thing.
Do: Let the data speak for themselves.
Just let the data show you what patterns they have and work from there. It is sometimes hard to accept that our data do not explain what we want, but if you select only the data you want to see, it will lead to disastrous decisions.
This also applies to machine learning. Yes, selecting a subset of the data might improve the accuracy, but be careful, because then not all cases are represented by your data. For example, suppose that when you remove people under 25 years old from your data, accuracy improves by 25%. It might look good, but your model would then not represent anyone under 25 years old.
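One way to keep every group represented, sketched below with fabricated data, is to report the metric per subgroup instead of removing the group; the column names and age bins are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)

# Hypothetical evaluation frame: true labels, model predictions, and age.
eval_df = pd.DataFrame({
    "age":    rng.integers(18, 70, size=500),
    "y_true": rng.integers(0, 2, size=500),
    "y_pred": rng.integers(0, 2, size=500),
})

# Report accuracy per age group instead of silently dropping a group.
eval_df["age_group"] = pd.cut(eval_df["age"], bins=[17, 25, 40, 70],
                              labels=["<=25", "26-40", ">40"])
for name, group in eval_df.groupby("age_group", observed=True):
    print(name, accuracy_score(group["y_true"], group["y_pred"]))
```

If one group performs much worse, that is a problem to investigate, not a group to delete.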
4. Assuming Causation
Don’t: Assume that correlation means causation without proper analysis
We have our data and we apply correlation analysis. The analysis shows a strong linear correlation between two variables. You conclude that the relationship exists because one variable causes the other, or vice versa. While that could be correct, as a data scientist you need to start from doubt, because most of the time correlation does not mean causation. Often, things are correlated simply by chance or because of some third variable.
Do: Find evidence to support a causation assumption.
It takes more than a Pearson or Spearman correlation coefficient as evidence. Write a proper research methodology and read the literature to find that evidence.
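To see why the coefficient alone is not evidence, here is a small illustrative simulation with fabricated numbers, in the spirit of the classic ice-cream-and-drowning example: two variables that never influence each other still show a strong correlation because both depend on a third variable.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Both variables are driven by a common third factor (temperature),
# but neither causes the other.
temperature = rng.normal(25, 5, size=200)
ice_cream   = temperature * 2.0 + rng.normal(0, 2, size=200)
drownings   = temperature * 0.3 + rng.normal(0, 1, size=200)

r_pearson, p_pearson = stats.pearsonr(ice_cream, drownings)
r_spearman, p_spearman = stats.spearmanr(ice_cream, drownings)
print("Pearson r :", r_pearson, "p =", p_pearson)
print("Spearman r:", r_spearman, "p =", p_spearman)
# A strong, significant coefficient here says nothing about causation:
# the association exists only because both depend on temperature.
```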
5. Statistical Assumptions
Don't: Use a statistical test or machine learning model without knowing its assumptions
Many times I see aspiring data scientists applying statistical methods (e.g., t-test, ANOVA, Pearson correlation) or machine learning models (linear regression, random forest, boosting, etc.) without knowing their assumptions. The method may still run, but the price of violating the assumptions is a less reliable result, meaning the statistical analysis or the machine learning model can be misleading.
Do: Know the assumptions, but do not force the data to follow them.
You need to meet the assumptions if you want a trustworthy result. Take an example: an independent t-test requires the data to be normally distributed, independent, and homoscedastic. Violating one of these assumptions means the result is less reliable. In this case, you might think of making the data fit the assumptions, but that is not advisable either. While you could try to transform the data toward a normal distribution (which is what happens most of the time), the original pattern would be gone. It is much better to try another method whose assumptions the data can fulfill without any transformation or cherry-picking.
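As a rough sketch of what checking those assumptions could look like with SciPy (the samples here are simulated purely for illustration): test normality and equal variances first, and if they fail, reach for a method with weaker assumptions rather than forcing the data through a transformation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two hypothetical samples we want to compare with an independent t-test.
group_a = rng.normal(10, 2, size=40)
group_b = rng.exponential(10, size=40)   # clearly non-normal

# Check normality of each group (Shapiro-Wilk).
print("Shapiro A p-value:", stats.shapiro(group_a).pvalue)
print("Shapiro B p-value:", stats.shapiro(group_b).pvalue)

# Check homogeneity of variances (Levene's test).
print("Levene p-value   :", stats.levene(group_a, group_b).pvalue)

# If variances differ, Welch's t-test (equal_var=False) drops that assumption;
# if normality fails badly, a rank-based test such as Mann-Whitney U works
# without transforming the data.
print("Welch t p-value  :", stats.ttest_ind(group_a, group_b, equal_var=False).pvalue)
print("Mann-Whitney U p :", stats.mannwhitneyu(group_a, group_b).pvalue)
```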
Conclusion
Here I have shown you my five personal dos and don'ts as a data scientist. Others might argue that many other things matter more, but for me, these are the five things to remember as a Data Scientist.