Hi all!
In a time when data is king, machine learning has become a valuable tool for many businesses. Creating a competitive advantage is no longer optional; it has become mandatory for many companies. That’s why many employers are willing to pay a lot of money to get the best data talent.
Many people start their machine learning journey but never get the most out of their models. That’s why we both want to share our experience in machine learning development.
We collaborated on this article because combining our experiences makes for a piece that should help everyone. So, let’s go!
1. Understand the Business Problem
The first thing we must ask when developing a machine learning model is: what problem is our model trying to solve?
When we put resources into something, we want to get the most out of them. We don’t want to play around with machine learning model development and end up with nothing.
In machine learning development, we need to understand the problem the business is facing and what exactly they need to solve. To get a better understanding, here are some best practices you could follow:
Define Clear Objectives: Write the objective down in concrete terms. Understanding the problem's scope creates the pathway for machine learning development.
Stakeholder Engagement: If you have someone you are working with, engage with all the stakeholders early and often as their insight ensures the solution aligns with business goals.
Research and Contextualize: Learn about the business you are working with and the context of the problem your machine learning model is meant to solve.
Set Success Criteria: Define what success looks like for your machine learning model, not just in technical metrics but also in business KPIs.
For example, suppose a retail business has a retention problem and wants to reduce customer churn. From this problem, we could work with stakeholders to identify the key factors leading to churn, research industry trends in customer loyalty, and set specific targets for churn reduction.
2. Data Quality Management
Garbage in, garbage out. This phrase rings true for any machine learning model: the quality of the data we feed in determines the quality of what we get out.
Machine learning models learn from data, so having quality data is the first step toward a valid model. If we neglect this aspect, we will never get the best model to solve the business problem.
That’s why evaluating our data source is important to achieve the best model. The critical part of this phase is assessing the data source to ensure that it is correct and reflects real-world conditions.
Assess completeness: Make sure the data have the necessary information.
Evaluate relevance: Check if the data is suitable for the analysis.
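As a quick sketch of this assessment (assuming the data is already loaded into a pandas DataFrame called df; the column name 'Col' is only illustrative), a few one-liners go a long way:
# Column types, non-null counts, and memory usage give a first view of completeness
df.info()

# Duplicate rows are a common source of inconsistency
print("Duplicate rows:", df.duplicated().sum())

# The distribution of a key column hints at relevance and coverage
print(df['Col'].value_counts())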
And to improve the data quality, here are some approaches to performing data cleaning:
Identify inconsistencies and correct errors: Standardize the values so every data row is consistent in quality.
def correct_data(row):
    # Standardize flagged values: a row marked 'Error_1' is re-labeled based on a threshold in a second column
    if row['Col'] == 'Error_1':
        return 'Error' if row['col2'] > 195 else 'OK'
    else:
        return row['Col']

df['Col'] = df.apply(correct_data, axis=1)
Address missing values: Handling missing data correctly, such as imputation or removal, depending on the context and the impact on the model.
# Check for missing values
print("Missing Values in DataFrame:")
print(df.isnull().sum())

# Impute missing values in a numeric column with its mean
df['Col'] = df['Col'].fillna(df['Col'].mean())
These tips also bridge to our next best practices.
3. Exploratory Data Analysis (EDA)
To understand the data we will feed into the model, we need to explore it. This is where Exploratory Data Analysis comes in. EDA is important in any machine learning development process, as it lays the groundwork for the subsequent steps.
EDA is an activity that helps us understand our data and summarize its main characteristics, often with the help of visual methods. Typical EDA activities include:
Statistical Summaries: Provide the data description using statistical summaries (mean, median, mode, standard deviation) to understand data distributions and variability.
Visual Analysis: Use methods like histograms, box plots, scatter plots, and heat maps to identify patterns, trends, and outliers in the data.
Correlation Analysis: Assess relationships between the variables that can help during feature selection and model development.
EDA is an iterative process that might need further exploration depending on our insight or the follow-up question. For example, we can perform descriptive analysis with the following code.
df.describe()
We can also do data visualization to see the relationship between variables with the Pair Plot.
import seaborn as sns
import matplotlib.pyplot as plt
sns.pairplot(df, hue='class')
plt.show()
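For the correlation analysis mentioned above, a minimal sketch like the following (assuming df has numeric columns worth comparing) can surface relationships to investigate further:
import seaborn as sns
import matplotlib.pyplot as plt

# Compute pairwise correlations on the numeric columns only
corr = df.select_dtypes(include='number').corr()

# Visualize the correlation matrix as a heat map
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
plt.show()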
4. Encoding the Categorical Variables
Once we understand our data, we need to make sure that the ML algorithms understand it too.
And this is quite a tricky task, as most data come with categories, which are string variables that segment the set.
So, what’s their main problem?
ML algorithms do not understand string variables, as they work best with numerical data. Therefore, it’s essential to translate categorical variables such as gender, size or colors into numerical representations suitable for the chosen algorithm.
The two most common encoding techniques include:
One-Hot Encoding: This method creates a new binary feature for each category, with the feature set to 1 for the corresponding category and 0 for the others. In Python this can be performed as follows:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Initialize the OneHotEncoder with dense output so we can build a DataFrame
# (on scikit-learn versions older than 1.2, use sparse=False instead of sparse_output=False)
onehot_encoder = OneHotEncoder(sparse_output=False)

# Fit and transform the categorical data
encoded_data = onehot_encoder.fit_transform(df)
encoded_df = pd.DataFrame(encoded_data, columns=onehot_encoder.get_feature_names_out())
print(encoded_df)
However, this method creates a new column for every category. When a feature has many possible values (like colors) or a natural order (like age brackets or income level), another approach to the data can be more practical.
Label Encoding: We assign a unique numerical value to each category.
from sklearn.preprocessing import LabelEncoder
# Applying Label Encoding to each column
df_encoded = df.copy()
label_encoder = LabelEncoder()
for column in df_encoded.columns:
    df_encoded[column] = label_encoder.fit_transform(df_encoded[column])
print(df_encoded)
Let's continue our discussion in the context of predicting customer churn.
To determine whether a customer will churn or not, we can employ a binary logistic regression model. This model requires that we encode the target variable, churned, as a binary variable, where:
1 represents customers who have churned
0 indicates those who have not churned.
This binary representation allows the model to treat the prediction task as a binary classification problem.
If we wish to include a categorical variable like income level as an input variable in our model, it's advantageous to use Label Encoding instead.
This approach assigns a unique integer to each category of income level, for instance low, medium and high, turning these ordinal categories into a format that our model can interpret and learn from.
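As a minimal sketch of that idea (the income_level column and its values are hypothetical, used only for illustration), we can make the ordering explicit instead of relying on the alphabetical assignment LabelEncoder would produce:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical example data
df_example = pd.DataFrame({'income_level': ['low', 'high', 'medium', 'low']})

# Spell out the category order so low < medium < high is preserved in the encoding
ordinal_encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
df_example['income_level_encoded'] = ordinal_encoder.fit_transform(df_example[['income_level']]).ravel()
print(df_example)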
You can find more techniques in the following article about encoding features.
5. Experimentation
Experimentation is crucial for identifying the optimal model for your machine learning project. It's often the most time-consuming step, requiring you to answer essential questions before diving in:
What aspects should we experiment with?
How do we handle negative results?
How can we streamline the iterative process?
Experimentation usually depends on the specific characteristics of your project. However, some common approaches are:
5.1 Choosing the Best Data Splitting Method
Split your data into training, validation, and testing sets.
The training set is used to train the model.
The validation set helps fine-tune hyperparameters.
The unseen testing set provides an unbiased assessment of the model’s generalizability.
Choose an appropriate splitting method like random sampling or stratified sampling depending on the data characteristics.
For the customer churn prediction problem, handling the dataset correctly is crucial, especially when dealing with imbalanced data—a common scenario where the number of churned customers is significantly less than the number of retained customers.
To account for this imbalance, keep in mind that simple random sampling might not preserve the proportion of churned customers across the training, validation, and testing sets, which could bias the model.
Thus, we should opt for stratified sampling, ensuring that each split maintains the same proportion of churned and non-churned customers as the original dataset.
from sklearn.model_selection import train_test_split
# Assuming `X` is your feature matrix and `y` is the target variable
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42, stratify=y)
# Further split X_temp and y_temp into validation and test sets
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)
5.2 Best Method Selection
Experiment with different algorithms, hyperparameters, and feature engineering techniques. Evaluate their performance using the defined metrics on the validation set.
Cross-validation: This technique further improves the evaluation process by repeatedly splitting the data into training and validation sets and averaging the performance across all splits.
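As a minimal sketch of cross-validation (assuming the stratified X_train and y_train from above, with features already encoded, and a simple logistic regression as the candidate model), we could do something like:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Candidate model; swap in any algorithm you are experimenting with
model = LogisticRegression(max_iter=1000)

# Stratified 5-fold cross-validation keeps the churn ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='roc_auc')

print("AUC-ROC per fold:", scores)
print("Mean AUC-ROC:", scores.mean())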
6. Documentation and Reproducibility
Ever revisit your own project and draw a complete blank?
This happens to everyone! That's why documentation is key.
Let’s imagine we have already finished and delivered our customer churn prediction model. Some months later, we are asked to improve its effectiveness, but we have already forgotten most of the project and there’s no documentation.
That’s the worst possible scenario, as we will have to reintroduce ourselves to our own project… from scratch! This is why documentation is so important.
It fosters clear understanding, simplifies collaboration, and ensures reproducibility.
Some documentation common good practices include:
Clear and concise code: Use meaningful variable names, comments, and version control systems like Git.
Detailed reports: Document data collection procedures, feature engineering steps, model architectures, hyperparameter settings, and performance metrics.
Model serialization: Save trained models in a format that allows reloading them later to make predictions on new data (see the sketch after this list).
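As a minimal sketch of model serialization (assuming a fitted scikit-learn estimator named model and the X_test split from earlier), joblib is a common choice:
import joblib

# Persist the trained model to disk alongside the project documentation
joblib.dump(model, 'churn_model.joblib')

# Later, or in another script, reload it to score new data
loaded_model = joblib.load('churn_model.joblib')
predictions = loaded_model.predict(X_test)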
If proper documentation is in place, the next time we go back to a previous project we will have:
Data: Source, feature engineering steps, details on each data point used.
Model: Model architecture, hyperparameter settings.
Training: Script with comments explaining each step, including data preparation.
Evaluation: Metrics chosen (AUC-ROC, churn rate) and their interpretation.
Conclusions
Following these best practices provides a solid foundation for successful machine learning projects.
By focusing on understanding the business problem, ensuring data quality, conducting thorough exploration, and iteratively experimenting, you can build robust and reliable models that deliver real value.
Remember that machine learning is an iterative journey of exploration and learning. While initial results may not be perfect, the process itself leads you towards the optimal solution.
And always document each step you perform, so your future self saves many hours of catching up.
Embrace this process, continuously refine your skills, and explore the exciting possibilities of this powerful technology!