Python Packages for Outlier Detection - #NBD Lite 31
These packages will give you an edge in your data analysis
One of the staple activities of our data analysis is identifying the outliers in our data.
An outlier can be defined as an extreme value, or a data point that does not follow the pattern of the rest of the data.
There are many techniques for detecting outliers. However, how useful a technique is depends on the data and on how you define an outlier.
This article will show my top three Python packages for detecting outliers.
I assume the reader has some knowledge about outliers and their impact on the data, so I will not explain them further.
Let’s get into it.
As a note, I am using the tips sample dataset from Seaborn for the examples.
import seaborn as sns
df = sns.load_dataset('tips')
1. PyOD
PyOD, or Python Outlier Detection, is a Python toolkit for detecting outliers in data. The package offers more than 30 outlier detection algorithms, ranging from classic methods to recent ones, which suggests it is well maintained.
Examples of the outlier detection models include:
Angle-Based Outlier Detection
Cluster-Based Local Outlier Factor
Principal Component Analysis Outlier Detection
Variational Auto Encoder
and many more. If you want to see all the available methods, you should visit the following page.
You can install the PyOD package using the following code.
pip install pyod
After installing the package, we can test one of the detection methods.
I will use only the Angle-Based Outlier Detection (ABOD) method here.
from pyod.models.abod import ABOD
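For intuition about what ABOD measures, here is a minimal sketch of the angle-based idea in plain Python. It is a simplified, brute-force version written for illustration, not PyOD's implementation; the function name angle_variance and the toy points are my own assumptions. The idea is that a point far away from the rest sees every other point in roughly the same direction, so the variance of its distance-weighted angles is small.

```python
from itertools import combinations
from statistics import pvariance


def angle_variance(points, i):
    """Variance of distance-weighted cosines between points[i] and every pair of other points."""
    ax, ay = points[i]
    others = [p for j, p in enumerate(points) if j != i]
    values = []
    for (bx, by), (cx, cy) in combinations(others, 2):
        abx, aby = bx - ax, by - ay
        acx, acy = cx - ax, cy - ay
        dot = abx * acx + aby * acy
        # Dividing by the squared lengths down-weights far-away pairs.
        # Assumes no duplicate points (a zero length would divide by zero).
        values.append(dot / ((abx**2 + aby**2) * (acx**2 + acy**2)))
    return pvariance(values)


# Hypothetical toy data: a small cluster plus one distant point.
toy_points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
variances = [angle_variance(toy_points, i) for i in range(len(toy_points))]

# The smallest variance marks the most outlying point.
most_outlying = min(range(len(toy_points)), key=lambda i: variances[i])
print(toy_points[most_outlying], variances[most_outlying])
```

A low variance means "more outlying" in this sketch. A library may flip the sign so that higher scores mean more anomalous, so check the score direction before comparing.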
To set up the ABOD model, we need to set the contamination parameter, which is the fraction of outliers we expect in our data.
If I set the contamination to 0.05, I want the model to flag 5% of our data as outliers. Let’s try it in code.
outliers_fraction = 0.05
abod_clf = ABOD(contamination=outliers_fraction)
abod_clf.fit(df[['total_bill', 'tip']])
We fit the model on the data in which we want to detect outliers. As with a classifier, we can then access the scores and labels, and use the fitted model to predict.
# Return the inlier (0) / outlier (1) label for each row
abod_clf.labels_
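To make the role of contamination concrete, here is a small sketch of how a fraction can be turned into labels. As far as I understand, PyOD thresholds the training scores at the (1 - contamination) percentile, but treat this as an approximation of the idea rather than the library's code; the helper names and the scores are made up for illustration.

```python
def percentile(values, q):
    """Linearly interpolated percentile of a list, with q between 0 and 100."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def label_by_contamination(scores, contamination):
    """Flag the scores above the (1 - contamination) percentile as outliers (1)."""
    threshold = percentile(scores, 100 * (1 - contamination))
    return [1 if s > threshold else 0 for s in scores]


# Hypothetical outlier scores: higher means more anomalous.
scores = [0.1, 0.2, 0.15, 0.3, 0.25, 0.12, 0.18, 0.22, 0.28, 5.0]

# With 10 scores and contamination=0.1, one point should be flagged.
labels = label_by_contamination(scores, contamination=0.1)
print(labels)
```

The takeaway is that contamination does not make the model smarter; it only decides where the cut-off on the scores lands.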
We store the result in the data frame so we can plot it.
df['ABOD_Clf'] = abod_clf.labels_
sns.scatterplot(data = df, x = 'total_bill', y = 'tip', hue = 'ABOD_Clf')
You can see that the package could help you directly identify which data points are outliers.
You could try another algorithm to detect the outlier from the data.
2. alibi-detect
The alibi-detect Python package is an open-source package that focuses on outlier, adversarial, and drift detection.
This package could be used for tabular and unstructured data such as images or text. If you are interested in outlier detection in image data, you could visit the example here.
However, in this article, I will focus on tabular data.
The alibi-detect package offers ten methods for outlier detection, which you can read here. Let’s try one of the methods with a dataset example. I would use the same data as the previous package.
For this example, I would use the Isolation Forest method.
from alibi_detect.od import IForest
od = IForest(
    threshold=0.,
    n_estimators=100
)
We set the threshold manually here; if you want it to be set automatically, the detector has an infer_threshold method. Next, we fit the model to our dataset.
od.fit(df[['total_bill', 'tip']])
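For intuition about what an isolation forest looks for, here is a toy sketch in plain Python. It is not how alibi-detect builds its trees (real implementations subsample the data and build each tree once); the function names and toy points are my own assumptions. The idea it shows is that a point far from the rest can be separated with fewer random splits.

```python
import random


def isolation_depth(points, target, rng, depth=0, max_depth=20):
    """Count the random axis-aligned splits needed to separate target from points."""
    if len(points) <= 1 or depth >= max_depth:
        return depth
    # Only split on features that still vary among the remaining points.
    dims = [d for d in range(len(target))
            if min(p[d] for p in points) < max(p[d] for p in points)]
    if not dims:
        return depth
    dim = rng.choice(dims)
    split = rng.uniform(min(p[dim] for p in points), max(p[dim] for p in points))
    # Keep the side of the split that contains the target and recurse.
    same_side = [p for p in points if (p[dim] < split) == (target[dim] < split)]
    return isolation_depth(same_side, target, rng, depth + 1, max_depth)


def average_depth(points, target, n_trees=200, seed=0):
    """Average isolation depth over many random splitting sequences."""
    rng = random.Random(seed)
    return sum(isolation_depth(points, target, rng) for _ in range(n_trees)) / n_trees


# Hypothetical toy data: a small cluster plus one distant point.
toy_points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
depths = [average_depth(toy_points, p) for p in toy_points]
for point, depth in zip(toy_points, depths):
    print(point, round(depth, 2))
```

The distant point should end up with the shortest average depth. An isolation forest turns that kind of depth into a score, and the threshold decides which scores count as outliers.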
After the model fitting, we need to make a prediction.
preds = od.predict(
    df[['total_bill', 'tip']],
    return_instance_score=True
)
With return_instance_score set to True, the result is a dictionary whose data entry contains both the instance score and the outlier label.
preds['data'].keys()
Then, we add the outlier labels to the data frame and use a scatter plot to visualize the findings.
df['IF_alibi'] = preds['data']['is_outlier']
sns.scatterplot(data = df, x = 'total_bill', y = 'tip', hue = 'IF_alibi')
The plot shows a stricter result than before: even a slight deviation from the center is treated as an outlier.
There are many algorithms you could try out from the alibi-detect. The algorithm overview is shown here.
3. PyNomaly
PyNomaly is a Python package that detects outliers based on LoOP (Local Outlier Probabilities).
LoOP is based on the Local Outlier Factor (LOF), but the scores are normalized to the range [0, 1], so they can be read as probabilities.
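To show where the [0, 1] range comes from, here is a brute-force sketch of the LoOP calculation as I understand it from the original LoOP formulation. It is written for illustration and is not PyNomaly's code; the function name loop_probabilities, its parameters, and the toy points are my own assumptions.

```python
import math


def loop_probabilities(points, n_neighbors=3, extent=3):
    """Brute-force Local Outlier Probabilities for a small list of points."""
    n = len(points)
    # Indices of the n_neighbors nearest neighbours of each point (excluding itself).
    neighbors = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(points[i], points[j]))
        neighbors.append(order[:n_neighbors])
    # Probabilistic set distance: extent times the root mean squared neighbour distance.
    pdist = [
        extent * math.sqrt(
            sum(math.dist(points[i], points[j]) ** 2 for j in neighbors[i]) / n_neighbors
        )
        for i in range(n)
    ]
    # PLOF compares a point's own distance with its neighbours' average.
    # Assumes no duplicated points (a zero average would divide by zero).
    plof = [
        pdist[i] / (sum(pdist[j] for j in neighbors[i]) / n_neighbors) - 1
        for i in range(n)
    ]
    # Normalise by the overall spread of PLOF, then squash with the error function.
    nplof = extent * math.sqrt(sum(v * v for v in plof) / n)
    return [max(0.0, math.erf(v / (nplof * math.sqrt(2)))) for v in plof]


# Hypothetical toy data: a small cluster plus one distant point.
toy_points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
toy_scores = loop_probabilities(toy_points)
for point, score in zip(toy_points, toy_scores):
    print(point, round(score, 3))
```

The error function together with the max against zero is what keeps every score between 0 and 1: points that look denser than their neighbours get 0, and only clear deviations approach 1.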
The application of PyNomaly is simple and intuitive, similar to the previous package. Let’s use the dataset example to experiment with outlier detection.
from PyNomaly import loop
m = loop.LocalOutlierProbability(df[['total_bill', 'tip']], use_numba=True, progress_bar=True).fit()
scores = m.local_outlier_probabilities
We can enable Numba here to speed things up if the dataset is large; otherwise, you can turn it off. Fitting produces the probabilities, which we then interpret ourselves.
Every data point gets a probability of being an outlier. We can decide by our own judgment which points count as outliers, for example, the data points with a probability higher than 0.5.
df['loop_score'] = scores
df['loop_label'] = df['loop_score'].apply(lambda x: 1 if x >0.5 else 0)
sns.scatterplot(data = df, x = 'total_bill', y = 'tip', hue = 'loop_label')
As we can see from the plot, the points with a probability higher than 0.5 sit at the most extreme values of the data.
That’s all about my top outlier detection Python packages!