Top 10 Lesser-Known Python Packages Every Data Scientist Should Know

Don't miss out on these packages.

Jun 06, 2024

Python has become almost synonymous with the data scientist skill set. Many first-time learners are even introduced to Python as their main language.

Data scientists use many common Python packages such as NumPy, pandas, and scikit-learn. But did you know there are many more useful packages beyond those?

In this newsletter, we will discuss 10 Python packages that you might not know but should.

Curious about it? Let’s get going.

1. Featuretools

Featuretools is a Python package that automates the feature engineering process, especially for temporal and relational datasets. It helps the user generate a large set of candidate features by combining and aggregating the columns that are already available.

Let’s try out the package to understand better. First, we need to install Featuretools.

pip install featuretools

Then, we load the sample datasets that ship with the package.

import featuretools as ft

data = ft.demo.load_mock_customer()
customers_df = data["customers"]
sessions_df = data["sessions"]
transactions_df = data["transactions"]

With all the datasets ready, we create a dictionary containing every DataFrame in our dataset. Each DataFrame is passed together with its index column and, if it has one, its time index column.

dataframes = {
    "customers": (customers_df, "customer_id"),
    "sessions": (sessions_df, "session_id", "session_start"),
    "transactions": (transactions_df, "transaction_id", "transaction_time"),
}

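Next, we need to specify how the DataFrames are related. When two DataFrames have a one-to-many relationship, we call the “one” DataFrame the “parent DataFrame”. Each relationship is written as a tuple of (parent DataFrame, parent column, child DataFrame, child column).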

relationships = [
    ("sessions", "session_id", "transactions", "session_id"),
    ("customers", "customer_id", "sessions", "customer_id"),
]
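Before passing relationships to Featuretools, it can help to confirm that each one really is one-to-many. The following is a minimal sketch in plain pandas, not part of the Featuretools API; the helper name is_one_to_many and the toy tables are assumptions made for illustration.

```python
import pandas as pd

def is_one_to_many(parent, parent_key, child, child_key):
    """Return True if parent_key is unique in parent and every
    child_key value in child refers to an existing parent row."""
    keys_unique = parent[parent_key].is_unique
    children_covered = child[child_key].isin(parent[parent_key]).all()
    return bool(keys_unique and children_covered)

# Toy tables (hypothetical data, shaped like the mock customer dataset)
sessions = pd.DataFrame({"session_id": [1, 2, 3], "customer_id": [10, 10, 20]})
transactions = pd.DataFrame(
    {"transaction_id": [100, 101, 102, 103], "session_id": [1, 1, 2, 3]}
)

# sessions is the parent of transactions: one session, many transactions
sessions_is_parent = is_one_to_many(sessions, "session_id", transactions, "session_id")
# The reverse direction fails because session_id repeats in transactions
transactions_is_parent = is_one_to_many(transactions, "session_id", sessions, "session_id")
print(sessions_is_parent, transactions_is_parent)
```

If the check fails for a pair you expected to be parent and child, the two DataFrames are probably listed in the wrong order in the tuple.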

With everything ready, we can generate the features automatically using the following code.

feature_matrix_customers, features_defs = ft.dfs(
    dataframes=dataframes,
    relationships=relationships,
    target_dataframe_name="customers",
)
feature_matrix_customers

The result is a feature matrix with one row per entry in the target DataFrame you choose (here, “customers”), with columns built by aggregating the related sessions and transactions. As you can see, the features include aggregations such as Count, Mode, Max, and many more. Most of them are basic statistics, but they can still be important for your modeling.
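To make these aggregation features less of a black box, here is a rough sketch of what features such as Count and Max correspond to, computed by hand with pandas on a small made-up table. The column names and values are assumptions for illustration; Featuretools generates and names its own features.

```python
import pandas as pd

# Hypothetical child table: several transactions per customer
transactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "amount": [20.0, 35.5, 10.0, 99.0, 1.0],
})

# One row per customer, one column per aggregation
manual_features = transactions.groupby("customer_id")["amount"].agg(
    n_transactions="count",
    max_amount="max",
    mean_amount="mean",
)
print(manual_features)
```

Featuretools applies the same idea across every related DataFrame and every suitable column, which is why the generated matrix is so much wider than the original table.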

2. Missingno

Missingno is a Python package that was developed specifically for missing data analysis and visualization. The package allows us to quickly get missing data information.

Let’s try it out. First, install the package.

pip install missingno

Once installed, let’s try the package APIs with an example dataset.

import pandas as pd
collisions = pd.read_csv("https://raw.githubusercontent.com/ResidentMario/missingno-data/master/nyc_collision_factors.csv")

For example, you can draw the missing data matrix, which shows at a glance where values are missing in each column. Here we plot a random sample of 250 rows.

import missingno as msno
msno.matrix(collisions.sample(250))

Or you can create the missing data correlation heatmap with the following code.

msno.heatmap(collisions)
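The heatmap is built on nullity correlation: how strongly the presence or absence of one column tracks that of another. Below is a minimal sketch of that idea in plain pandas, using a made-up DataFrame; missingno's own implementation may differ in its details.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "b" is missing exactly when "a" is missing,
# while "c" is missing exactly when "a" is present
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "b": [1.0, np.nan, 2.0, np.nan, 3.0, 4.0],
    "c": [np.nan, 1.0, np.nan, 2.0, np.nan, np.nan],
})

# Turn each cell into a 0/1 missing indicator, then correlate the columns
nullity = df.isnull().astype(int)
nullity_corr = nullity.corr()
print(nullity_corr.round(2))
```

A value near 1 means two columns tend to be missing together, and a value near -1 means one tends to be present when the other is missing.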

The package has more to offer, such as bar charts and dendrograms of missingness. Use them to gain more insight from your missing data.

3. Mlxtend

Mlxtend is a Python package for everyday data science tasks. The library is a collection of APIs that make data exploration and machine learning modeling easier.

To use the package, you need to install it first.

pip install mlxtend

Then, here is an example of decision regions plotted with Mlxtend.

from mlxtend.plotting import plot_decision_regions
from sklearn.datasets import load_iris
from sklearn.svm import SVC
import matplotlib.pyplot as plt

# Load dataset
iris = load_iris()
X = iris.data[:, :2]  # we only take the first two features.
y = iris.target

clf = SVC(kernel='linear')
clf.fit(X, y)

plot_decision_regions(X, y, clf=clf, legend=2)

plt.show()
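Conceptually, a decision-region plot predicts a class for every point on a dense grid and colors the grid by the result. The sketch below shows that idea with NumPy and a hand-written nearest-centroid rule. The predict function and the centroids are stand-ins invented for illustration, not a description of what Mlxtend does internally.

```python
import numpy as np

# Hypothetical class centroids in a two-feature space
centroids = np.array([[0.0, 0.0], [4.0, 4.0]])

def predict(points):
    """Assign each 2-D point to the index of its nearest centroid."""
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

# Build a grid covering the feature space
xx, yy = np.meshgrid(np.linspace(-1, 5, 61), np.linspace(-1, 5, 61))
grid_points = np.column_stack([xx.ravel(), yy.ravel()])

# Predict a class for every grid point and reshape back to the grid
regions = predict(grid_points).reshape(xx.shape)
print(regions.shape, np.unique(regions))
```

The resulting array could then be passed to a filled contour plot, for example plt.contourf(xx, yy, regions), to draw the colored regions.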

You can also use the package to build models, perform feature selection, evaluate models, and much more.
