21 Python Standard-Library Gems for Data Work You’re Probably Re-implementing
Boost Your Data Science Workflow with These Built-In Python Tools
As data scientists working in Python, we are always looking for ways to make our data workflows faster and cleaner. There are many ways to do that, but one option that is easy to overlook is the standard library that ships with Python itself.
In this article, we will explore 21 standard-library tools that can genuinely improve your data work.
Curious? Let’s get into it.
1. collections.Counter
A Counter is a dictionary subclass for counting hashable items. It tallies the occurrences of each element, making frequency counts (e.g., word counts or category counts) quick and easy.
This is useful in data analysis for summarizing categorical data or any list of items. It also provides convenient methods like most_common() to get the most frequent elements.
from collections import Counter
fruits = ["apple", "banana", "orange", "apple", "banana", "apple"]
counts = Counter(fruits)
print(counts)
print(counts.most_common(1))
2. collections.defaultdict
A defaultdict is a dictionary that provides a default value for missing keys, avoiding KeyError without explicit checks.
This is helpful for grouping or collecting data (e.g., organizing records by a key) because it automatically creates new entries. For example, you can gather lists of values for each key without manually checking if the key already exists.
from collections import defaultdict
data = [("New York City", "NY"), ("Albany", "NY"), ("Los Angeles", "CA")]
grouped = defaultdict(list)
for city, state in data:
    grouped[state].append(city)
print(grouped["NY"])
print(grouped["CA"])
print(grouped["TX"])
3. collections.namedtuple
namedtuple creates tuple-like objects with named fields, making code more readable by allowing access to elements by name instead of index.
This is useful for storing data records (such as rows or points) in a lightweight manner. It acts like a class but is as memory-efficient as a tuple.
from collections import namedtuple
Point = namedtuple('Point', ['x', 'y'])
p = Point(x=2, y=3)
print(p.x * p.y)
print(p)
4. dataclasses.dataclass
The dataclass decorator (Python 3.7+) automatically generates methods such as __init__, __repr__, and __eq__ for classes that primarily store data.
This reduces boilerplate code when defining simple classes to hold data, like configuration objects or records, and makes the code cleaner. It’s useful for data science when you want to bundle related data with minimal code.
from dataclasses import dataclass
@dataclass
class Student:
    name: str
    score: float
s = Student("Alice", 95.5)
print(s)
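When you need plain dictionaries later, for example to write JSON or feed rows into a DataFrame, dataclasses.asdict converts instances for you. A quick, self-contained sketch along the same lines as the Student example:
from dataclasses import dataclass, asdict
@dataclass
class Student:
    name: str
    score: float
records = [Student("Alice", 95.5), Student("Bob", 88.0)]
# asdict() turns each instance into a plain dict, handy for JSON or tabular tools
rows = [asdict(s) for s in records]
print(rows)  # [{'name': 'Alice', 'score': 95.5}, {'name': 'Bob', 'score': 88.0}]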
5. heapq.nlargest
Part of the heapq library, nlargest efficiently finds the top n elements without sorting the entire dataset.
This is useful for retrieving, for example, the highest values or top records based on a specific metric (such as the top 5 salaries or scores) in O(m log n) time, where m is the size of the dataset. It also accepts a key function for complex data, such as tuples or dictionaries.
import heapq
scores = [35, 87, 45, 92, 67, 76]
top3 = heapq.nlargest(3, scores)
print(top3)
players = [("Alice", 50), ("Bob", 80), ("Cathy", 65)]
top_player = heapq.nlargest(1, players, key=lambda x: x[1])
print(top_player)
6. bisect
The bisect library helps maintain sorted lists. It provides bisect.bisect_left/right to find insertion points and bisect.insort to insert while keeping order.
This avoids re-sorting after each insertion and improves efficiency for binary search tasks. For example, you can categorize numeric data into bins or keep a continuously sorted list of values, such as running medians or quantiles.
import bisect
nums = [10, 20, 40, 50]
pos = bisect.bisect(nums, 30)
print(pos)
bisect.insort(nums, 30)
print(nums)
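For the binning use case mentioned above, bisect can map a value to its bucket against a sorted list of cutoffs. A small sketch with made-up grade boundaries:
import bisect
breakpoints = [60, 70, 80, 90]      # sorted bin edges (illustrative cutoffs)
grades = ["F", "D", "C", "B", "A"]  # one label per bin
def grade(score):
    # bisect returns the index of the bin the score falls into
    return grades[bisect.bisect(breakpoints, score)]
print([grade(s) for s in [55, 68, 77, 89, 93]])  # ['F', 'D', 'C', 'B', 'A']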
7. itertools.groupby
itertools.groupby groups consecutive elements of an iterable that share the same key. It returns an iterator of (key, group) pairs. This can mimic basic group-by operations on sorted data (e.g., grouping records by category).
It’s useful for data cleaning tasks, such as grouping sequences of values or summarizing data from sorted logs.
Note: The data should be sorted by the key so that all identical keys are grouped together.
import itertools
words = ["apple", "car", "ape", "boat", "bus"]
for first_letter, group in itertools.groupby(sorted(words), key=lambda w: w[0]):
    print(first_letter, ":", list(group))
8. itertools.combinations
Generates all combinations (subsets) of a specified length from a given iterable. This is useful for exploring pairs or tuples of features, items, or parameters without repetition (order doesn’t matter).
For example, you can use it to test all pairs of variables for correlation analysis or to create feature interactions.
from itertools import combinations
items = ['a', 'b', 'c', 'd']
for combo in combinations(items, 2):
    print(combo)
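As an illustration of the correlation use case, you can pair combinations with statistics.correlation (Python 3.10+) to check every unique pair of columns; the columns below are made up for the example:
from itertools import combinations
from statistics import correlation  # Python 3.10+
data = {
    "age":    [23, 35, 47, 51, 62],
    "income": [30, 45, 60, 64, 70],
    "spend":  [20, 28, 33, 30, 41],
}
# Pearson's r for every unique pair of columns
for col_a, col_b in combinations(data, 2):
    print(col_a, "vs", col_b, round(correlation(data[col_a], data[col_b]), 2))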
9. itertools.chain
Concatenates multiple iterables into a single continuous sequence. This is useful for processing several datasets as if they were one (for example, iterating through multiple lists of records without copying them into a single list).
It’s memory-efficient because it evaluates lazily, yielding elements on the fly instead of building a new list. This is useful for combining results or reading multiple files sequentially.
from itertools import chain
list1 = [1, 2, 3]
list2 = [4, 5, 6]
for x in chain(list1, list2):
    print(x, end=" ")
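The related chain.from_iterable flattens an iterable of iterables, which is handy for streaming rows from several files as one sequence. A sketch with hypothetical file paths:
from itertools import chain
import csv
paths = ["data/jan.csv", "data/feb.csv", "data/mar.csv"]  # hypothetical files with the same columns
def rows(path):
    with open(path, newline="") as f:
        yield from csv.DictReader(f)
# One lazy stream of rows across all files, with no intermediate list
for row in chain.from_iterable(rows(p) for p in paths):
    print(row)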
10. re
The re library is a powerhouse for text parsing and cleaning. Regular expressions (regex) allow pattern matching and substitution in text data. For example, re.findall can extract all substrings that match a pattern, and re.sub can replace all occurrences of a pattern.
Regular expressions are very helpful in data cleaning, especially when dealing with complex string patterns.
import re
text = "Contact us at cat@kitten.com and dog@puppy.org"
emails = re.findall(r'\S+@\S+', text)
print(emails)
masked = re.sub(r'\d', '#', "Order 12345, Cost 67")
print(masked)
11. statistics
The statistics library provides basic statistical functions like mean, median, mode, and variance, all in pure Python.
It’s useful for quick analyses when you want to avoid importing larger libraries. For example, you can calculate the average or median of a list of numbers without needing NumPy.
It also handles edge cases such as multiple modes (via multimode()). This is helpful for basic summary statistics in scripts, or when writing code that shouldn’t require external dependencies.
import statistics
data = [1, 3, 5, 7]
print(statistics.mean(data))
print(statistics.median(data))
print(statistics.pstdev(data))
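The multiple-modes case mentioned above looks like this (multimode() requires Python 3.8+):
import statistics
grades = ["A", "B", "B", "C", "A"]
print(statistics.multimode(grades))  # every value tied for most frequent: ['A', 'B']
print(statistics.mode(grades))       # mode() returns only the first mode found: 'A'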
12. pathlib.Path
The pathlib library provides an object-oriented approach to filesystem paths, which is more intuitive and cross-platform than using strings and os.path.
The Path object has methods for common tasks like checking whether a file exists, creating directories, reading or writing files, and looping through directory contents. This makes file handling much easier in data science workflows, such as building paths to data files or output folders.
from pathlib import Path
p = Path("data/results.txt")
print(p.name)
print(p.parent)
print(p.suffix)
if p.exists():
    content = p.read_text()
    print(content[:100])
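Creating directories, writing files, and looping through a folder are just as direct; a short sketch with a made-up output folder:
from pathlib import Path
out_dir = Path("output/reports")            # illustrative path
out_dir.mkdir(parents=True, exist_ok=True)  # create it (and any parents) if missing
(out_dir / "summary.txt").write_text("rows processed: 1000\n")
for child in out_dir.iterdir():             # loop through directory contents
    print(child.name, child.stat().st_size, "bytes")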
13. glob.glob
The glob module finds files matching wildcard patterns, which is very useful for batch processing files (e.g., reading all CSVs in a directory).
You can use wildcards like * and ? to match filenames. This saves time over manually listing directories and filtering. For data science, you might use glob to load multiple data files or to select files of certain types.
import glob
csv_files = glob.glob("data/*.csv")
print(csv_files)
for fname in csv_files:
    print("Processing", fname)
14. zipfile.ZipFile
The zipfile library lets you work with ZIP archives, which is useful when data comes compressed. You can read from or write to ZIP files just like regular files, and even list their contents.
For example, if you have a ZIP file containing many CSVs, you can iterate through them without manually extracting each one. This streamlines automating data ingestion from compressed datasets.
import zipfile
with zipfile.ZipFile("data/archive.zip", 'r') as zipf:
    file_list = zipf.namelist()
    print("Files in archive:", file_list)
    with zipf.open(file_list[0]) as f:
        content = f.read().decode('utf-8')
        print(content[:50], "...")
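Writing an archive is just as easy; a sketch that bundles the CSVs from the glob example into a new ZIP (paths are illustrative):
import zipfile, glob
with zipfile.ZipFile("data/backup.zip", "w", compression=zipfile.ZIP_DEFLATED) as zipf:
    for fname in glob.glob("data/*.csv"):
        zipf.write(fname)                 # add each CSV to the archive
    print("Archived:", zipf.namelist())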
15. subprocess.run
The subprocess library allows you to run external commands from Python. This is ideal for workflow automation, such as executing shell scripts, invoking system tools, or even running R/Julia scripts as part of a pipeline.
Using subprocess.run provides the command’s output and exit status. This can eliminate manual steps in a data pipeline.
import subprocess
result = subprocess.run(["echo", "Hello World"], capture_output=True, text=True)
print("Exit code:", result.returncode)
print("Output:", result.stdout.strip())
ls_result = subprocess.run(["ls", "data"], capture_output=True, text=True)
files = ls_result.stdout.split()
print("Data folder contains:", files)
16. argparse.ArgumentParser
The argparse library helps you build user-friendly command-line interfaces for your Python scripts.
It automatically generates help messages and converts argument values to the types you declare. For example, you can add arguments for input/output file paths, model parameters, and more, making your script easy to reuse and incorporate into larger workflows.
import argparse
parser = argparse.ArgumentParser(description="Process a data file")
parser.add_argument("--file", help="Path to input data CSV", required=True)
parser.add_argument("--verbose", action="store_true", help="Increase output verbosity")
args = parser.parse_args(["--file", "data.csv", "--verbose"])
print("File:", args.file)
print("Verbose:", args.verbose)
17. concurrent.futures
This library provides a high-level approach to parallel processing, supporting either threads or processes. It’s invaluable for speeding up I/O-bound tasks (using ThreadPoolExecutor) or CPU-bound computations (using ProcessPoolExecutor).
For example, you can use executor.map to concurrently download multiple URLs or process chunks of data in parallel, utilizing multiple cores with minimal effort. This can significantly boost performance in parallel data workflows.
from concurrent.futures import ProcessPoolExecutor
import math
nums = [20000, 20001, 20002, 20003]
def count_digits(n):
    # CPU-bound work: count the digits of n!
    return len(str(math.factorial(n)))
if __name__ == "__main__":
    # Processes sidestep the GIL, so CPU-bound work actually runs in parallel
    with ProcessPoolExecutor() as executor:
        results = list(executor.map(count_digits, nums))
    print(results)
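For the I/O-bound side, ThreadPoolExecutor is the better fit because the threads spend most of their time waiting on the network; a sketch with placeholder URLs:
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen
urls = ["https://example.com/a.csv", "https://example.com/b.csv"]  # placeholder URLs
def fetch(url):
    with urlopen(url, timeout=10) as resp:
        return url, len(resp.read())
with ThreadPoolExecutor(max_workers=4) as executor:
    # Each download runs in its own thread while the others wait on I/O
    for url, size in executor.map(fetch, urls):
        print(url, size, "bytes")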
18. contextlib.suppress
A context manager that suppresses specified exceptions within its block. It provides a clean way to handle situations where certain errors are expected and can be safely ignored, and it reads better than a try/except block with a bare pass.
For example, if you want to delete a file without raising an error when it’s already gone, you can suppress FileNotFoundError. This keeps the code tidy and focused.
import contextlib, os
with contextlib.suppress(FileNotFoundError):
    os.remove("temp.csv")
print("Cleanup complete.")
19. logging
The logging library is essential for any non-trivial code. It enables you to record messages at different severity levels (DEBUG, INFO, WARNING, ERROR) and direct them to various outputs (console, file, etc.).
For data pipelines, logging is preferable to print statements because it allows you to easily adjust verbosity and keep persistent logs of runs. You can configure the format to include timestamps and other context, which helps in debugging and monitoring long-running jobs or scheduled tasks.
import logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s: %(message)s")
logging.info("Pipeline started.")
try:
    # Imagine some processing here
    raise ValueError("Sample error")
except Exception as e:
    logging.error(f"Error during processing: {e}")
logging.info("Pipeline finished.")
20. pdb
Python’s built-in debugger, pdb, enables step-by-step execution, stack inspection, and breakpoints, which are extremely useful for troubleshooting code.
Calling the built-in breakpoint() (available since Python 3.7) is the simplest way to drop into the debugger at a specific point in your script. In data science, you might use this to examine data at a particular pipeline stage or to figure out why a calculation is incorrect. It’s a valuable tool for interactive debugging that goes well beyond print statements.
for i in range(5):
    if i == 2:
        breakpoint()
    print(i)
21. pprint.pprint
The pretty-printer module pprint is great for debugging or logging complex data structures. It formats nested lists and dictionaries (or other structures) in a clear, indented style with line breaks.
This is especially helpful when working with JSON data or large dictionaries: the output is much easier to examine than the default one-line display, which helps you understand data structures quickly during data cleaning and exploration.
import pprint
data = {
    "name": "Alice",
    "age": 30,
    "hobbies": ["reading", "cycling", "chess"],
    "scores": {"math": 95, "science": 88, "literature": 92}
}
pprint.pprint(data, width=40)
Love this article? Comment and share it with your network!
If you're at a pivotal point in your career or sitting on skills you're unsure how to use, I offer 1:1 mentorship.
It's personal, flexible, and built around you.
For long-term mentorship, visit me here (you can even enjoy a 7-day free trial).