Use XGBoost Like a Pro

There are many ways to improve your XGBoost experience. Let's see if you already know these tips.

Cornellius Yudha Wijaya ∙ Aug 20, 2024

XGBoost, or Extreme Gradient Boosting, is a machine learning algorithm under the Gradient Boosting framework. It builds an ensemble of decision trees sequentially, with each new tree optimized to correct the errors of the previous ones, yielding better predictions.

It’s a popular model because its performance is stellar compared to other models in practical applications. Besides its performance, XGBoost is also known for its fast training (parallel processing), regularization to prevent overfitting, and native handling of missing and categorical data.

I enjoy XGBoost and use it in my personal and professional life. As Bojan Tunguz said on X,

“XGBoost is all you need.”

Many fancy new models exist, especially with LLMs and generative AI becoming so popular. However, XGBoost can easily solve 90% of business problems.

While it’s easy to use, many people haven’t utilized the full capability of what XGBoost can do. That’s why I want to share my tips for improving your experience working with XGBoost.

💪 I assure you these tips will make you a pro at XGBoost.

So, let’s get into it!


Preparation

I assume you already have Python installed. We need the XGBoost package, along with Seaborn to load the example dataset. We can install them with the following command:

pip install xgboost seaborn pandas scikit-learn numpy

Then, we load the Titanic dataset for our example, using only a few columns and dropping rows with missing data.

import xgboost as xgb
import seaborn as sns
from sklearn.model_selection import train_test_split

# Load the Titanic dataset, keep a few numeric columns, and drop rows with missing values.
df = sns.load_dataset('titanic')
df = df[['survived', 'pclass', 'age', 'sibsp', 'fare']].dropna()

# Hold out 20% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('survived', axis=1), df['survived'], test_size=0.2, random_state=42
)
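As a quick sanity check, we can peek at the split sizes and the class balance before training; a minimal sketch:

# Check split sizes and class balance; the Titanic data is mildly
# imbalanced, which becomes relevant in the sample-weighting tip below.
print(X_train.shape, X_test.shape)
print(y_train.value_counts(normalize=True))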

With the packages and dataset ready, here are some pro tips for getting more out of XGBoost.


1. Use XGBoost's DMatrix for Efficient Data Handling

DMatrix is a data structure used by XGBoost internally. It’s designed to handle large datasets, allowing faster computation and better memory management. That’s why it’s preferable to use DMatrix for large-scale machine learning tasks.

To convert a DataFrame into a DMatrix, call xgb.DMatrix and pass in the Pandas DataFrame.

dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=X_train.columns.to_list())
dtest = xgb.DMatrix(X_test, label=y_test, feature_names=X_test.columns.to_list())
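To confirm the conversion worked, DMatrix exposes a few inspection helpers:

# Inspect the DMatrix we just built.
print(dtrain.num_row(), dtrain.num_col())   # number of rows and feature columns
print(dtrain.feature_names)                 # the feature names we passed in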

Then, training the model and getting predictions works similarly to when using a Pandas DataFrame.

params = {
    'objective': 'binary:logistic',  # binary classification, outputs probabilities
    'max_depth': 4,                  # limit tree depth to control overfitting
    'learning_rate': 0.1,
    'eval_metric': 'logloss'
}

# Train for 100 boosting rounds on the DMatrix.
model = xgb.train(params, dtrain, num_boost_round=100)

# Returns the predicted survival probability for each test row.
y_pred = model.predict(dtest)
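One thing to keep in mind: with the binary:logistic objective, predict returns probabilities rather than class labels. A minimal evaluation sketch, assuming a 0.5 decision threshold:

from sklearn.metrics import accuracy_score

# Convert predicted probabilities into 0/1 labels with a 0.5 threshold.
y_pred_labels = (y_pred > 0.5).astype(int)
print(accuracy_score(y_test, y_pred_labels))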

With DMatrix, we can also tweak some of its parameters.

First, we can control how missing data is represented. For example, we can tell XGBoost to treat -999 as the missing-value marker.

dtrain = xgb.DMatrix(X_train, label=y_train, missing=-999)
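The missing flag only matters if your data actually encodes missing values as -999. A minimal sketch of that setup, where the fillna step is an assumption about your pipeline rather than part of the snippet above:

# Assumption: the pipeline encodes NaN values as -999 before building the
# DMatrix, so missing=-999 tells XGBoost which cells to treat as missing.
X_train_filled = X_train.fillna(-999)
dtrain = xgb.DMatrix(X_train_filled, label=y_train, missing=-999)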

We can also assign a weight to each sample in the dataset, which is useful for imbalanced dataset problems.

# Down-weight the majority class (label 0) so the minority class has more influence.
weights = [0.1 if label == 0 else 1.0 for label in y_train]
dtrain = xgb.DMatrix(X_train, label=y_train, weight=weights)
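Rather than hard-coding 0.1, one option is to derive the weights from the class frequencies; a minimal balanced-weighting sketch (my variant, not from the snippet above):

# Weight each class inversely to its frequency so both classes
# contribute equally to the training loss.
counts = y_train.value_counts()
class_weight = {label: len(y_train) / (2 * n) for label, n in counts.items()}
weights = y_train.map(class_weight)
dtrain = xgb.DMatrix(X_train, label=y_train, weight=weights)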

Lastly, we can control how many CPU threads are used when building the DMatrix.

dtrain = xgb.DMatrix(X_train, label=y_train, nthread=4)

Using DMatrix, we can control our training process more finely and optimize model performance even better.
