Your Data Processing is Your Data Science Brand
Enabling your intellectual process as your brand
What could be a personal brand for a data scientist? Is it the projects they did? The personality they show on social media? Or the company they work for? In my opinion, everything I just mentioned could become your brand; it depends on how you wrap it up. However, I want to give my opinion on another aspect of the data scientist's personal brand: Data Processing.
What exactly is Data Processing? It is the stage where we process the raw data we have collected or acquired and transform it into “cleaner” data that is ready to use for any purpose (analysis, visualization, description, etc.). Essentially, Data Processing is the pillar of any decent result or product you get out of your dataset.
How could Data Processing become your personal brand? In this article, I will show you in detail why Data Processing is your own personal brand as a Data Scientist. Let’s get into it.
Data Processing is all about asking the right question
You have already collected or acquired the dataset in some way; what is next? How you process the data ultimately depends on what kind of question you want to answer, and that requires you to compose the right question to ask.
It might seem like a simple task. It is only asking a question, so what is so hard about it? Nothing could go wrong, right? Yet the same question can have a different meaning depending on the context, and composing a good question requires both business and technical understanding.
Imagine that your company has a fraud problem and asks your data science team to do something about it by developing a machine learning model to detect potential fraud. Sounds easy? You only need to pull the data from a database, process the data, develop the model, and BOOM! You have the machine learning model. Then the business user asks: does the model consider case “XYZ”? Does the fraud prediction cover type “ABC”? Have you already considered those cases?
This problem happens often: what you develop or process does not fit the problem it should solve, or it misses some important detail. Without the right information, your question becomes a different one and leads you to the wrong data processing.
What is unique to each person is how they address the “right question” problem. It is often about discussing with the business users what they want, but sometimes the users do not even know what they want until they get the product. In this case, you need to use your creativity to compose the right question based on the information you have and the research you do, and the data processing will depend on all of this activity. That creative process is your brand and something you should show.
Data handling is all about the intellectual process.
Which choices do you make in your data processing steps? Missing data removal or imputation? Outlier removal or transformation? Categorical data transformation? There is much more you could do when handling data in the data processing step. What differs from person to person is how they see the data itself, even when they share the same understanding of the question.
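To make these choices concrete, here is a minimal sketch of two of them: dropping versus imputing missing values, and capping an outlier instead of removing it. The income column, the function names, and the cap of 5000 are all made up for illustration; they are assumptions, not a recommended recipe.

```python
from statistics import median

# Hypothetical toy column: monthly income with gaps (None) and one extreme value.
incomes = [3200, 2900, None, 3100, 45000, None, 3000]


def drop_missing(values):
    """Choice A: discard the rows where the value is missing."""
    return [v for v in values if v is not None]


def impute_median(values):
    """Choice B: keep every row and fill gaps with the median of observed values."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def cap_outliers(values, upper):
    """Transform instead of remove: cap any value above a chosen threshold."""
    return [min(v, upper) for v in values]


dropped = drop_missing(incomes)             # 5 rows remain
imputed = impute_median(incomes)            # 7 rows remain, gaps filled
capped = cap_outliers(imputed, upper=5000)  # the threshold is an arbitrary assumption
```

Both routes are defensible. Which one you pick, and how you justify the threshold or the fill value, is exactly the kind of decision that differs from person to person.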
I remember that in one of my classes, I presented the same dataset to my students and told them to process the data according to the question I had stated. The result was exciting because I could see the differences in each student's thought process. One student decided to remove the data because, he said, it was only a few data points, while another insisted on keeping them because every data point is useful. The decision to keep or remove the data is based on each person's perception of what is “right,” and that perception varies from person to person.
I say data handling is an intellectual process because you need to research and understand the data context to decide what to do in the data handling part. As in the case I mentioned above, people take a different stand on each handling problem and explain why they did something in their own way, and that depends on their intellectual process.
To summarize, the choices you make during data handling are a personal brand because the activity here often varies from person to person. There might be a right way to remove the data or keep the outlier, but knowing it takes a lot of research and creativity. How you explain the reasons behind your data handling decisions, and how neatly you present them, is important.
Feature Engineering takes a creative process.
Another step in data processing is feature engineering, or creating features from existing features. This step is important because it could reveal previously unseen patterns or boost your predictive model's metrics. Why is feature engineering your personal brand? I have seen many data scientists do feature engineering in different ways, and it is really interesting to me because everybody has their own creative process. Feature engineering is like art: everybody has their own style.
While feature engineering seems simple in theory, the truth is that it is hard to do. Many cannot create meaningful features because they lack an understanding of the data or do not know what to do. It becomes harder when they need to change the data dimension and see the data from another perspective; e.g., the data is recorded from a customer point of view, but they need to transform it into a sales point of view.
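As a rough illustration of changing the data dimension, the sketch below re-aggregates customer-level transactions into one record per product. The transactions, the field names, and the derived features (revenue, unique buyers, revenue per buyer) are hypothetical; another person might derive entirely different features from the same rows.

```python
from collections import defaultdict

# Hypothetical transactions, recorded from the customer point of view.
transactions = [
    {"customer": "A", "product": "laptop", "amount": 1200.0},
    {"customer": "A", "product": "mouse", "amount": 25.0},
    {"customer": "B", "product": "mouse", "amount": 30.0},
    {"customer": "C", "product": "laptop", "amount": 1100.0},
    {"customer": "C", "product": "mouse", "amount": 25.0},
]


def to_sales_view(rows):
    """Re-aggregate customer-level rows into one feature record per product."""
    revenue = defaultdict(float)
    buyers = defaultdict(set)
    for row in rows:
        revenue[row["product"]] += row["amount"]
        buyers[row["product"]].add(row["customer"])
    return {
        product: {
            "revenue": revenue[product],
            "unique_buyers": len(buyers[product]),
            # Engineered feature: average spend per distinct buyer.
            "revenue_per_buyer": revenue[product] / len(buyers[product]),
        }
        for product in revenue
    }


sales_view = to_sales_view(transactions)
```

The same rows now describe products instead of customers, and features such as revenue per buyer only become visible after that change of perspective.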
Many data science competition wins come down to the feature engineering the winner came up with. Many of the features are things people would not have thought of before, but the winners managed to get them right. This is their personal brand: knowing which features to create and how to use the engineered features in their modeling.
This is something you could do as well: do feature engineering that is uniquely yours and come up with features that signify your personal brand.
Conclusion
A data scientist's personal brand could take many forms, but I want to argue that Data Processing is one of them. The way you process data is different for each person. The details where it differs include:
How you ask the Question
How you handle the data
How you do feature engineering
Each detail takes an intellectual and creative process that is uniquely yours, and as a Data Scientist, you should show it more as your own personal brand.