5 Python Libraries That You Don’t Know About, But Should
Josep’s and Cornellius’s versions of lesser-known libraries
Hi there! Josep and Cornellius are back once again for our article collaboration! This time, we will present to you 5 Python libraries that you (maybe) don’t know but should.
Each of us will present our own version of the lesser-known Python libraries.
Let’s check it out.
🔥Josep’s Version
Python’s ecosystem has so many libraries that it can be challenging to select the right ones. Here are five libraries that I highly recommend for any data project.
DuckDB
Handling large amounts of data can be challenging, especially when performing analytical queries efficiently. Traditional databases may struggle with performance issues, and using complex big data frameworks can be overwhelming.
DuckDB is an in-process SQL OLAP (Online Analytical Processing) database management system for high-performance analytics. It's optimized for analytical workloads and can efficiently handle complex queries on large datasets.
One of its main features is the ability to run directly within your applications without requiring a separate server. This in-process nature ensures minimal latency and maximizes performance. It can be easily integrated with popular tools such as Pandas, R, and Jupyter notebooks.
DuckDB is a must-know tool for anyone involved in data analysis and manipulation. Its simplicity, performance, and flexibility let you bring its powerful querying capabilities into your existing workflows without much hassle.
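Here is a minimal sketch of that workflow, assuming duckdb and pandas are installed (the toy DataFrame is purely illustrative):

```python
import duckdb
import pandas as pd

# A toy DataFrame standing in for a large dataset.
sales = pd.DataFrame({
    "region": ["EU", "EU", "US", "US"],
    "amount": [120, 80, 200, 50],
})

# DuckDB can reference in-scope DataFrames by name in SQL,
# with no server and no explicit loading step.
result = duckdb.sql("""
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY total DESC
""").df()

print(result)
```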
H3 (Uber)
Working with location-based information can often be complex and challenging. Geographical segmentation typically depends on historical land divisions, such as zip codes, or on political boundaries, like countries or subregions. This usually implies irregular sizes and complex boundaries, which makes it difficult to handle real-time data or achieve high-resolution spatial data with consistent land areas and conditions.
This is where H3, an open-source hexagonal grid system developed by Uber, becomes essential. It provides a unique way of partitioning the globe into hexagonal cells, enabling efficient spatial indexing and querying. The hexagonal structure offers consistent area and distance metrics, making it ideal for applications requiring high-resolution spatial data.
It can be used for many geospatial analyses, from proximity searches to clustering and hotspot detection. By utilizing H3, you can simplify the handling of complex geospatial datasets, enhance the precision of spatial queries, and ultimately improve the performance of your geospatial applications.
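To give a feel for the API, here is a small sketch using the h3 package’s v4 function names (older releases expose the same operations under different names, such as geo_to_h3):

```python
import h3

# Index a point (latitude, longitude) into a hexagonal cell
# at resolution 9 -- roughly city-block-sized cells.
cell = h3.latlng_to_cell(37.7749, -122.4194, 9)

# All cells within one ring of the origin -- the building block
# of proximity searches.
neighbors = h3.grid_disk(cell, 1)

print(cell)
print(len(neighbors))  # 7: the cell itself plus its 6 neighbors
```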
YData Profiling
As a data professional, understanding and exploring your dataset is a critical first step. However, manually analyzing and summarizing data can be time-consuming (and boring!).
YData Profiling smooths out this process. It is an open-source Python library designed to automate data profiling, generating comprehensive reports that provide detailed insights into a dataset's structure, distributions, and relationships.
With YData Profiling, data professionals can quickly assess data quality, identify missing values, detect outliers, and understand data distributions without writing extensive code. The package is incredibly user-friendly, producing interactive HTML reports that visualize various statistics and correlations within the dataset.
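Generating a report takes only a few lines; this sketch uses a tiny made-up DataFrame and an illustrative output filename:

```python
import pandas as pd
from ydata_profiling import ProfileReport

# Any DataFrame works; this tiny one is a stand-in.
df = pd.DataFrame({
    "age": [25, 32, 47, None, 51],
    "income": [40000, 52000, 61000, 58000, None],
})

# Build the profile and export it as an interactive HTML report.
profile = ProfileReport(df, title="Demo Profiling Report")
profile.to_file("report.html")
```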
Poetry
Managing dependencies and packaging for Python projects can often be confusing and a big source of errors. Poetry tries to address this by streamlining the whole process.
It is an open-source Python dependency management and packaging tool designed to simplify the entire process of project setup, dependency resolution, and packaging. It handles everything from installation to publishing, ensuring that your project dependencies are consistent.
It uses a straightforward pyproject.toml file to manage dependencies, which makes it easier to maintain and understand compared to traditional requirements.txt files. With Poetry, you can easily create and manage virtual environments, ensuring that your project remains isolated and its dependencies do not interfere with other projects.
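For illustration, a minimal pyproject.toml might look like this (the project name, author, and dependency versions are placeholders):

```toml
[tool.poetry]
name = "demo-project"
version = "0.1.0"
description = "A toy project managed with Poetry"
authors = ["Your Name <you@example.com>"]

[tool.poetry.dependencies]
python = "^3.10"
pandas = "^2.0"
```

Running poetry add pandas records the dependency here and pins exact versions in poetry.lock, and poetry install reproduces the same environment on another machine.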
For any data professional looking to streamline their workflow, reduce complexity, and ensure reliable dependency management, Poetry is a must-know tool!
MLflow
Managing the end-to-end ML lifecycle can be daunting due to the complexity and variety of tools required. This is where MLflow comes in, offering a streamlined solution for tracking experiments, packaging code, and deploying models efficiently.
MLflow is an open-source platform designed to manage the complete ML lifecycle. It provides four main components:
The Tracking component lets you log and query experiments.
The Projects component standardizes your project format, enabling the encapsulation of code into reproducible runs.
The Models component provides a unified format for deploying diverse ML models.
The Registry component offers a centralized model store, facilitating model versioning and collaborative management.
MLflow integrates seamlessly with other libraries and tools, ensuring flexibility and scalability. Whether working locally or scaling up to a distributed environment, MLflow simplifies the operational aspects of ML, allowing us to focus more on building and improving models.
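As a minimal tracking sketch (the experiment name, parameter, and metric values are made up for illustration):

```python
import mlflow

# Group related runs under a named experiment.
mlflow.set_experiment("demo-experiment")

with mlflow.start_run():
    # Log hyperparameters and evaluation metrics for this run.
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_metric("rmse", 0.78)
```

Running mlflow ui afterwards lets you browse and compare the logged runs in the browser.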
🚀Cornellius’s Version
Python works well for many data projects. However, there are just so many libraries out there. Here is my version of the 5 libraries you should know.
BentoML
Machine learning is not helpful until it’s deployed and used in the business. However, the deployment process can be arduous, with too many requirements and clashing dependencies. This is where BentoML comes in.
BentoML is an open-source Python library that streamlines the deployment process and unifies model packaging to produce production-ready model-serving endpoints. It’s designed specifically for swift model deployment by data scientists: the hard details, such as config files, environments, and dependencies, are managed by the package.
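As a rough sketch of that workflow using the BentoML 1.x service API (the model name iris_clf is illustrative, and the exact API varies across versions):

```python
import bentoml
from bentoml.io import NumpyNdarray
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Train a toy model and save it to BentoML's local model store.
X, y = load_iris(return_X_y=True)
bentoml.sklearn.save_model("iris_clf", RandomForestClassifier().fit(X, y))

# Wrap the stored model in a service that exposes an HTTP endpoint.
runner = bentoml.sklearn.get("iris_clf:latest").to_runner()
svc = bentoml.Service("iris_classifier", runners=[runner])

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def classify(features):
    return runner.predict.run(features)
```

Saved as service.py, this can then be served locally with something like bentoml serve service.py:svc.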
It’s a must-know package for any data scientist who wants to deploy their model.
Funcy
As data people, we often repeat data-related activities on our datasets, such as merging, mapping, and debugging. We usually write small helper functions to perform these repeated actions. The Funcy package makes this even easier.
Funcy markets itself as a collection of fancy functional tools focused on practicality in Python. The package provides ready-made utilities for exactly the kind of data-processing chores mentioned above.
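A few of its helpers, sketched with made-up records:

```python
from funcy import merge, pluck, group_by

# merge: combine dicts, with later values winning -- handy for configs.
defaults = {"sep": ",", "encoding": "utf-8"}
overrides = {"encoding": "latin-1"}
config = merge(defaults, overrides)  # {'sep': ',', 'encoding': 'latin-1'}

records = [
    {"name": "Ana", "dept": "sales"},
    {"name": "Bo", "dept": "ops"},
    {"name": "Cy", "dept": "sales"},
]

# pluck: pull a single key out of a sequence of mappings.
names = pluck("name", records)  # ['Ana', 'Bo', 'Cy']

# group_by: bucket items by a key function.
by_dept = group_by(lambda r: r["dept"], records)
```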
It’s a simple package but useful for reducing the time we spend on data analysis.
Fairlearn
Machine learning models now power many automated processes, and their ethical implications have become more important than ever. We don’t want to exclude certain social groups as a result of our model’s predictions. This is where the concept of fairness comes in, and Fairlearn facilitates fairness in the machine learning process.
Fairness can be complex, as it depends on the use case, the output of our model, and the society in which the model exists. Fairlearn helps by providing resources about fairness in machine learning projects, along with fairness metrics to help detect and mitigate any unfairness.
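For instance, Fairlearn’s MetricFrame can break any metric down by a sensitive feature; the labels and groups below are toy values, purely to show the API shape:

```python
from fairlearn.metrics import MetricFrame
from sklearn.metrics import accuracy_score

# Toy ground truth, predictions, and a sensitive attribute.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
sex = ["F", "F", "F", "F", "M", "M", "M", "M"]

mf = MetricFrame(
    metrics=accuracy_score,
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=sex,
)

print(mf.by_group)      # accuracy per group
print(mf.difference())  # largest accuracy gap between groups
```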
Fairness in machine learning is still a complex problem, but Fairlearn can help us assess the situation and construct our machine learning models as fairly as possible.
Yellowbrick
In a typical machine learning project, we reach for the trusty Scikit-Learn package, as it’s easy to use and intuitive. However, selecting the best model can be trickier: the numbers alone might not tell us the whole story.
Yellowbrick extends the Scikit-Learn modelling process with visual diagnostic tools for model evaluation. It helps assess a model’s stability and predictive value while streamlining the workflow.
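A typical visualizer follows Scikit-Learn’s familiar fit/score pattern; this sketch uses a bundled dataset and a plain logistic regression, but any classifier works:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from yellowbrick.classifier import ROCAUC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Wrap the estimator in a visualizer: fit, score, then render the plot.
viz = ROCAUC(LogisticRegression(max_iter=5000))
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show()
```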
Use this package if you want a better way to select your model.
Cleanlab
Data is the heart of any data project. Without quality data, our machine learning model will certainly become a mess. Yet there are times when the dataset we get for a project is problematic.
Cleanlab is built for addressing data quality issues in your dataset. The package provides an API to automatically detect problems, such as label errors, in our dataset and help improve its quality. Cleaning the data with Cleanlab can improve your machine learning model’s performance.
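The typical workflow feeds out-of-sample predicted probabilities into Cleanlab’s label-issue detector; this sketch uses a bundled dataset and a simple classifier as stand-ins:

```python
from cleanlab.filter import find_label_issues
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = load_iris(return_X_y=True)

# Out-of-sample class probabilities via cross-validation,
# so the detector isn't fooled by overfitting.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=5, method="predict_proba"
)

# Indices of the examples most likely to be mislabeled,
# worst offenders first.
issues = find_label_issues(
    labels=y, pred_probs=pred_probs, return_indices_ranked_by="self_confidence"
)
print(issues[:10])
```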
If you have a problem with your dataset, Cleanlab could certainly help clean it.
Those are our recommended Python libraries for today; we hope they are useful.
We encourage you to give them a try and share your feedback with us!