21 Python Standard-Library Gems for Data Work You’re Probably Re-implementing
Boost Your Data Science Workflow with These Built-In Python Tools
As data scientists working in Python, we are always looking for ways to make our data workflows faster and cleaner. There are many ways to do that, but one option that is easy to overlook is the standard library that ships with Python itself.
In this article, we will explore 21 standard-library tools that can genuinely improve your data work.
Curious? Let’s get into it.
1. collections.Counter
A Counter is a dictionary subclass for counting hashable items. It tallies the occurrences of each element, making frequency counts (e.g., word counts or category counts) quick and easy.
This is useful in data analysis for summarizing categorical data or any list of items. It also provides convenient methods like most_common() to get the most frequent elements.
from collections import Counter
fruits = ["apple", "banana", "orange", "apple", "banana", "apple"]
counts = Counter(fruits)
print(counts)
print(counts.most_common(1))
2. collections.defaultdict
A defaultdict is a dictionary that provides a default value for missing keys, avoiding KeyError without explicit checks.
This is helpful for grouping or collecting data (e.g., organizing records by a key) because it automatically creates new entries. For example, you can gather lists of values for each key without manually checking if the key already exists.
from collections import defaultdict
data = [("New York City", "NY"), ("Albany", "NY"), ("Los Angeles", "CA")]
grouped = defaultdict(list)
for city, state in data:
    grouped[state].append(city)
print(grouped["NY"])
print(grouped["CA"])
print(grouped["TX"])
3. collections.namedtuple
namedtuple creates tuple-like objects with named fields, making code more readable by allowing access to elements by name instead of index.
This is useful for storing data records (such as rows or points) in a lightweight manner. It acts like a class but is as memory-efficient as a tuple.
from collections import namedtuple
Point = namedtuple('Point', ['x', 'y'])
p = Point(x=2, y=3)
print(p.x * p.y)
print(p)
4. dataclasses.dataclass
The dataclass decorator (Python 3.7+) automatically generates methods such as __init__, __repr__, and __eq__ for classes that primarily store data.
This reduces boilerplate code when defining simple classes to hold data, like configuration objects or records, and makes the code cleaner. It’s useful for data science when you want to bundle related data with minimal code.
from dataclasses import dataclass
@dataclass
class Student:
    name: str
    score: float
s = Student("Alice", 95.5)
print(s)
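When you need plain dictionaries later, for example to write JSON or feed rows into a DataFrame, dataclasses.asdict converts instances for you. A quick, self-contained sketch along the same lines as the Student example:
from dataclasses import dataclass, asdict
@dataclass
class Student:
    name: str
    score: float
records = [Student("Alice", 95.5), Student("Bob", 88.0)]
# asdict() turns each instance into a plain dict, handy for JSON or tabular tools
rows = [asdict(s) for s in records]
print(rows)  # [{'name': 'Alice', 'score': 95.5}, {'name': 'Bob', 'score': 88.0}]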
5. heapq.nlargest
Part of the heapq library, nlargest efficiently finds the top n elements without sorting the entire dataset.
This is useful for retrieving, for example, the highest values or top records based on a specific metric (such as the top 5 salaries or scores) in O(m log n) time, where m is the size of the dataset. It also accepts a key function for complex data, such as tuples or dictionaries.
import heapq
scores = [35, 87, 45, 92, 67, 76]
top3 = heapq.nlargest(3, scores)
print(top3)
players = [("Alice", 50), ("Bob", 80), ("Cathy", 65)]
top_player = heapq.nlargest(1, players, key=lambda x: x[1])
print(top_player)
6. bisect
The bisect library helps maintain sorted lists. It provides bisect.bisect_left/right to find insertion points and bisect.insort to insert while keeping order.
This avoids re-sorting after each insertion and improves efficiency for binary search tasks. For example, you can categorize numeric data into bins or keep a continuously sorted list of values, such as running medians or quantiles.
import bisect
nums = [10, 20, 40, 50]
pos = bisect.bisect(nums, 30)
print(pos)
bisect.insort(nums, 30)
print(nums)
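For the binning use case mentioned above, bisect can map a value to its bucket against a sorted list of cutoffs. A small sketch with made-up grade boundaries:
import bisect
breakpoints = [60, 70, 80, 90]      # sorted bin edges (illustrative cutoffs)
grades = ["F", "D", "C", "B", "A"]  # one label per bin
def grade(score):
    # bisect returns the index of the bin the score falls into
    return grades[bisect.bisect(breakpoints, score)]
print([grade(s) for s in [55, 68, 77, 89, 93]])  # ['F', 'D', 'C', 'B', 'A']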
7. itertools.groupby
itertools.groupby groups consecutive elements of an iterable that share the same key. It returns an iterator of (key, group) pairs. This can mimic basic group-by operations on sorted data (e.g., grouping records by category).
It’s useful for data cleaning tasks, such as grouping sequences of values or summarizing data from sorted logs.
Note: The data should be sorted by the key so that all identical keys are grouped together.
import itertools
words = ["apple", "car", "ape", "boat", "bus"]
for first_letter, group in itertools.groupby(sorted(words), key=lambda w: w[0]):
    print(first_letter, ":", list(group))
8. itertools.combinations
Generates all combinations (subsets) of a specified length from a given iterable. This is useful for exploring pairs or tuples of features, items, or parameters without repetition (order doesn’t matter).
For example, you can use it to test all pairs of variables for correlation analysis or to create feature interactions.
from itertools import combinations
items = ['a', 'b', 'c', 'd']
for combo in combinations(items, 2):
    print(combo)
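As an illustration of the correlation use case, you can pair combinations with statistics.correlation (Python 3.10+) to check every unique pair of columns; the columns below are made up for the example:
from itertools import combinations
from statistics import correlation  # Python 3.10+
data = {
    "age":    [23, 35, 47, 51, 62],
    "income": [30, 45, 60, 64, 70],
    "spend":  [20, 28, 33, 30, 41],
}
# Pearson's r for every unique pair of columns
for col_a, col_b in combinations(data, 2):
    print(col_a, "vs", col_b, round(correlation(data[col_a], data[col_b]), 2))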
9. itertools.chain
Concatenates multiple iterables into a single continuous sequence. This is useful for processing several datasets as if they were one (for example, iterating through multiple lists of records without copying them into a single list).
It’s memory-efficient because it evaluates lazily, yielding elements on the fly instead of building a new list. This is useful for combining results or reading multiple files sequentially.
from itertools import chain
list1 = [1, 2, 3]
list2 = [4, 5, 6]
for x in chain(list1, list2):
    print(x, end=" ")
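The related chain.from_iterable flattens an iterable of iterables, which is handy for streaming rows from several files as one sequence. A sketch with hypothetical file paths:
from itertools import chain
import csv
paths = ["data/jan.csv", "data/feb.csv", "data/mar.csv"]  # hypothetical files with the same columns
def rows(path):
    with open(path, newline="") as f:
        yield from csv.DictReader(f)
# One lazy stream of rows across all files, with no intermediate list
for row in chain.from_iterable(rows(p) for p in paths):
    print(row)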
10. re
The re library is a powerhouse for text parsing and cleaning. Regular expressions (regex) allow pattern matching and substitution in text data. For example, re.findall can extract all substrings that match a pattern, and re.sub can replace all occurrences of a pattern.
Regular expressions are very helpful in data cleaning, especially when dealing with complex string patterns.
import re
text = "Contact us at cat@kitten.com and dog@puppy.org"
emails = re.findall(r'\S+@\S+', text)
print(emails)
masked = re.sub(r'\d', '#', "Order 12345, Cost 67")
print(masked)
11. statistics
The statistics library provides basic statistical functions like mean, median, mode, and variance, all in pure Python.
It’s useful for quick analyses when you want to avoid importing larger libraries. For example, you can calculate the average or median of a list of numbers without needing NumPy.
It also handles edge cases such as multiple modes (via multimode()). This is helpful for basic summary statistics in scripts, or when writing code that shouldn’t require external dependencies.
import statistics
data = [1, 3, 5, 7]
print(statistics.mean(data))
print(statistics.median(data))
print(statistics.pstdev(data))
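The multiple-modes case mentioned above looks like this (multimode() requires Python 3.8+):
import statistics
grades = ["A", "B", "B", "C", "A"]
print(statistics.multimode(grades))  # every value tied for most frequent: ['A', 'B']
print(statistics.mode(grades))       # mode() returns only the first mode found: 'A'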
12. pathlib.Path
The pathlib library provides an object-oriented approach to filesystem paths, which is more intuitive and cross-platform than using strings and os.path.
The Path object has methods for common tasks like checking whether a file exists, creating directories, reading or writing files, and looping through directory contents. This makes file handling much easier in data science workflows, such as building paths to data files or output folders.
from pathlib import Path
p = Path("data/results.txt")
print(p.name)
print(p.parent)
print(p.suffix)
if p.exists():
    content = p.read_text()
    print(content[:100])
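Creating directories, writing files, and looping through a folder are just as direct; a short sketch with a made-up output folder:
from pathlib import Path
out_dir = Path("output/reports")            # illustrative path
out_dir.mkdir(parents=True, exist_ok=True)  # create it (and any parents) if missing
(out_dir / "summary.txt").write_text("rows processed: 1000\n")
for child in out_dir.iterdir():             # loop through directory contents
    print(child.name, child.stat().st_size, "bytes")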
13. glob.glob
The glob module finds files matching wildcard patterns, which is very useful for batch processing files (e.g., reading all CSVs in a directory).
You can use wildcards like * and ? to match filenames. This saves time over manually listing directories and filtering. For data science, you might use glob to load multiple data files or to select files of certain types.
import glob
csv_files = glob.glob("data/*.csv")
print(csv_files)
for fname in csv_files:
    print("Processing", fname)
14. zipfile.ZipFile
The zipfile library lets you work with ZIP archives, which is useful when data comes compressed. You can read from or write to ZIP files just like regular files, and even list their contents.
For example, if you have a ZIP file containing many CSVs, you can iterate through them without manually extracting each one. This streamlines automating data ingestion from compressed datasets.
import zipfile
with zipfile.ZipFile("data/archive.zip", 'r') as zipf:
    file_list = zipf.namelist()
    print("Files in archive:", file_list)
    with zipf.open(file_list[0]) as f:
        content = f.read().decode('utf-8')
        print(content[:50], "...")
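Writing an archive is just as easy; a sketch that bundles the CSVs from the glob example into a new ZIP (paths are illustrative):
import zipfile, glob
with zipfile.ZipFile("data/backup.zip", "w", compression=zipfile.ZIP_DEFLATED) as zipf:
    for fname in glob.glob("data/*.csv"):
        zipf.write(fname)                 # add each CSV to the archive
    print("Archived:", zipf.namelist())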
15. subprocess.run
The subprocess library allows you to run external commands from Python. This is ideal for workflow automation, such as executing shell scripts, invoking system tools, or even running R/Julia scripts as part of a pipeline.
Using subprocess.run provides the command’s output and exit status. This can eliminate manual steps in a data pipeline.
import subprocess
result = subprocess.run(["echo", "Hello World"], capture_output=True, text=True)
print("Exit code:", result.returncode)
print("Output:", result.stdout.strip())
ls_result = subprocess.run(["ls", "data"], capture_output=True, text=True)
files = ls_result.stdout.split()
print("Data folder contains:", files)
16. argparse.ArgumentParser
The argparse library helps you build user-friendly command-line interfaces for your Python scripts.
It automatically generates help messages and converts argument values to the types you declare. For example, you can add arguments for input/output file paths, model parameters, and more, making your script easy to reuse and incorporate into larger workflows.
import argparse
parser = argparse.ArgumentParser(description="Process a data file")
parser.add_argument("--file", help="Path to input data CSV", required=True)
parser.add_argument("--verbose", action="store_true", help="Increase output verbosity")
args = parser.parse_args(["--file", "data.csv", "--verbose"])
print("File:", args.file)
print("Verbose:", args.verbose)
17. concurrent.futures
This library provides a high-level approach to parallel processing, supporting either threads or processes. It’s invaluable for speeding up I/O-bound tasks (using ThreadPoolExecutor) or CPU-bound computations (using ProcessPoolExecutor).
For example, you can use executor.map to concurrently download multiple URLs or process chunks of data in parallel, utilizing multiple cores with minimal effort. This can significantly boost performance in parallel data workflows.
from concurrent.futures import ProcessPoolExecutor
import math
nums = [20000, 20001, 20002, 20003]
def count_digits(n):
    # CPU-bound work: count the digits of n!
    return len(str(math.factorial(n)))
if __name__ == "__main__":
    # Processes sidestep the GIL, so CPU-bound work actually runs in parallel
    with ProcessPoolExecutor() as executor:
        results = list(executor.map(count_digits, nums))
    print(results)
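For the I/O-bound side, ThreadPoolExecutor is the better fit because the threads spend most of their time waiting on the network; a sketch with placeholder URLs:
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen
urls = ["https://example.com/a.csv", "https://example.com/b.csv"]  # placeholder URLs
def fetch(url):
    with urlopen(url, timeout=10) as resp:
        return url, len(resp.read())
with ThreadPoolExecutor(max_workers=4) as executor:
    # Each download runs in its own thread while the others wait on I/O
    for url, size in executor.map(fetch, urls):
        print(url, size, "bytes")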
18. contextlib.suppress
A context manager that suppresses specified exceptions within its block. It provides a clean way to handle situations where certain errors are expected and can be safely ignored, and it reads better than a try/except block with a bare pass.
For example, if you want to delete a file without raising an error when it’s already gone, you can suppress FileNotFoundError. This keeps the code tidy and focused.
import contextlib, os
with contextlib.suppress(FileNotFoundError):
    os.remove("temp.csv")
print("Cleanup complete.")
19. logging
The logging library is essential for any non-trivial code. It enables you to record messages at different severity levels (DEBUG, INFO, WARNING, ERROR) and direct them to various outputs (console, file, etc.).
For data pipelines, logging is preferable to print statements because it allows you to easily adjust verbosity and keep persistent logs of runs. You can configure the format to include timestamps and other context, which helps in debugging and monitoring long-running jobs or scheduled tasks.
import logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s: %(message)s")
logging.info("Pipeline started.")
try:
    # Imagine some processing here
    raise ValueError("Sample error")
except Exception as e:
    logging.error(f"Error during processing: {e}")
logging.info("Pipeline finished.")
20. pdb
Python’s built-in debugger, pdb, enables step-by-step execution, stack inspection, and breakpoints, which are extremely useful for troubleshooting code.
Calling the built-in breakpoint() (available since Python 3.7) is the simplest way to drop into the debugger at a specific point in your script. In data science, you might use this to examine data at a particular pipeline stage or to figure out why a calculation is incorrect. It’s a valuable tool for interactive debugging that goes well beyond print statements.
for i in range(5):
    if i == 2:
        breakpoint()
    print(i)
21. pprint.pprint
The pretty-printer module pprint is great for debugging or logging complex data structures. It formats nested lists and dictionaries (or other structures) in a clear, indented style with line breaks.
This is especially helpful when working with JSON data or large dictionaries: the output is much easier to examine than the default one-line display, which helps you understand data structures quickly during data cleaning and exploration.
import pprint
data = {
    "name": "Alice",
    "age": 30,
    "hobbies": ["reading", "cycling", "chess"],
    "scores": {"math": 95, "science": 88, "literature": 92}
}
pprint.pprint(data, width=40)
Love this article? Comment and share it with your network!
If you're at a pivotal point in your career or sitting on skills you're unsure how to use, I offer 1:1 mentorship.
It's personal, flexible, and built around you.
For long-term mentorship, visit me here (you can even enjoy a 7-day free trial).