20 Pandas One-Liners That Can Save You Hours of Work
Simple code you should not miss
As data professionals who work with Python, most of us use Pandas in our everyday work.
We might have learned the basics, but did you know that a few one-liners could dramatically improve your workflow?
In this article, we will explore 20 Pandas one-liners that will save you hours of work!
Curious about it? Let’s get into it.
Referral Recommendation
If you want a crash course in turning healthcare questions into publishable, data-driven findings, subscribe to Paras Karmacharya’s newsletter. An NIH-funded physician-scientist, Paras shows how he pairs classic clinical research methods with AI tools, covering protocol design, reproducible data pipelines, and peer-review-ready writing tips in bite-sized, actionable notes.
It’s the perfect bridge between your data-science skills and real-world medical research. Check it out—and support my work—through this referral link.
1. Read Only What You Need
df = pd.read_csv("big.csv", usecols=["id", "total", "date"], dtype_backend="pyarrow")
How it works:
usecols skips parsing unnecessary columns at the I/O layer, reducing memory overhead.
dtype_backend="pyarrow" backs the columns with Apache Arrow's columnar arrays, which are more compact than the default NumPy-backed dtypes (e.g., strings are stored as UTF-8 buffers, not Python objects).
Why efficient:
Avoids loading gigabytes of unused data into memory.
Arrow’s columnar format accelerates downstream operations like filtering/aggregation.
Use case:
Initial exploration of large datasets (e.g., 10M+ rows) where only specific columns are relevant.
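A minimal sketch of the pattern (big.csv and the column names are just the placeholders from the one-liner above; pandas 2.0+ with pyarrow installed is assumed):
import pandas as pd
# Compare memory with and without column pruning ("big.csv" is a placeholder file).
full = pd.read_csv("big.csv")
slim = pd.read_csv("big.csv", usecols=["id", "total", "date"], dtype_backend="pyarrow")
print(full.memory_usage(deep=True).sum() / 1_048_576, "MB")
print(slim.memory_usage(deep=True).sum() / 1_048_576, "MB")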
2. Automatically Optimize Column Types
df = df.convert_dtypes(dtype_backend="pyarrow")
How it works:
Infers optimal nullable dtypes (e.g., Int64 instead of float64 for integer columns that contain NaNs).
Uses Arrow-backed types like string[pyarrow] for faster string operations.
Why efficient:
Reduces memory by ~30-50% compared to object/float64 defaults.
Enables vectorized operations on Arrow strings (e.g., .str.contains() runs at C++ speed).
Use case:
Cleaning raw data from CSVs/Excel where Pandas misinfers dtypes as objects.
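A minimal sketch with made-up values showing the dtype change (pandas 2.0+ with pyarrow assumed):
import pandas as pd
# Mixed columns that pandas typically reads as float64/object.
df = pd.DataFrame({
    "user_id": [1, 2, None],        # float64 because of the missing value
    "name": ["ann", "bob", None],   # object
})
print(df.dtypes)
df = df.convert_dtypes(dtype_backend="pyarrow")
print(df.dtypes)  # e.g., int64[pyarrow], string[pyarrow]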
3. Full Data Summary in One Line
df.describe(include="all", datetime_is_numeric=True)
How it works:
include="all"
forces stats for all dtypes: counts/unique/top/freq for objects, percentiles for numerics.datetime_is_numeric=True
treats datetime columns as numeric, showing min/max/mean.
Why efficient:
Replaces manual .info(), .mean(), and .value_counts() calls.
Exposes data issues (e.g., unexpected unique counts, skewed distributions).
Use case:
First-step EDA to identify outliers, missing values, or dtype mismatches.
4. Get Memory Usage in MB
df.memory_usage(deep=True).sum() / 1_048_576
How it works:
deep=True calculates the actual memory used by object columns (e.g., strings), not just pointer overhead.
Dividing by 1_048_576 (1024^2) converts bytes to MB.
Why efficient:
Identifies memory-hungry columns (e.g., 10,000 unique strings stored as objects).
Critical for optimizing before processing on RAM-constrained machines.
Use case:
Debugging MemoryError issues or tuning Dask/Modin workflows.
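A minimal sketch with made-up data, showing the per-column breakdown as well as the total:
import pandas as pd
df = pd.DataFrame({
    "id": range(100_000),
    "comment": ["free-text feedback"] * 100_000,  # Python string objects
})
print(df.memory_usage(deep=True))                              # bytes per column
print(round(df.memory_usage(deep=True).sum() / 1_048_576, 2), "MB")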
5. SQL-Like Filtering with Query
high_apac = df.query("sales > 1000 & region == 'APAC'")
How it works:
Translates the query string into an optimized abstract syntax tree (AST).
Uses the numexpr library (when installed) for fast, multithreaded evaluation.
Why efficient:
Avoids the intermediate boolean arrays created by (df.sales > 1000) & (df.region == 'APAC').
Clean syntax for multi-condition filters.
Use case:
Interactive filtering in Jupyter, especially with dynamic query strings.
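A minimal sketch with made-up data, with the equivalent boolean-mask version shown for comparison:
import pandas as pd
df = pd.DataFrame({
    "sales": [500, 1500, 2500],
    "region": ["EMEA", "APAC", "APAC"],
})
high_apac = df.query("sales > 1000 & region == 'APAC'")
same_result = df[(df.sales > 1000) & (df.region == "APAC")]  # builds two temporary masks
print(high_apac)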
6. Top-N Without Sorting Everything
top10 = df.nlargest(10, "profit")
How it works:
Uses a partitioning algorithm (like Quickselect) to find the top-N rows in O(N) time vs. O(N log N) for full sort.
Why efficient:
No need to sort the entire DataFrame when you only care about extremes.
Works seamlessly with multi-indexed DataFrames.
Use case:
Identifying top customers/products without expensive full sorts.
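A minimal sketch with made-up data; the sort-based version is included only to confirm the result is the same:
import pandas as pd
df = pd.DataFrame({
    "customer": ["a", "b", "c", "d"],
    "profit": [120, 980, 45, 560],
})
top2 = df.nlargest(2, "profit")                                   # partial selection
top2_sorted = df.sort_values("profit", ascending=False).head(2)   # full sort, same rows
print(top2)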
7. Column Math with eval
df["revenue"] = df.eval("quantity * price")
How it works:
Parses the expression into a vectorized operation using Pandas’ internal engine.
Avoids temporary variables (e.g., no intermediate df["quantity"] * df["price"] stored in memory).
Why efficient:
Faster for complex expressions (e.g., (a + b) / (c - d)).
Reduces memory overhead during chained calculations.
Use case:
Feature engineering with multi-column dependencies.
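A minimal sketch with made-up data; the second eval call shows assignment inside the expression string:
import pandas as pd
df = pd.DataFrame({"quantity": [2, 5, 1], "price": [9.99, 3.50, 120.0]})
df["revenue"] = df.eval("quantity * price")
df = df.eval("discounted = revenue * 0.9")   # assignment inside the expression
print(df)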
8. Add Column with assign and lambda
df = df.assign(gm=lambda x: x.gross - x.cogs)
How it works:
assign creates a new DataFrame with the added column.
The lambda receives the current DataFrame as input (x), enabling dynamic references (e.g., to columns created earlier in the same call).
Why efficient:
Supports method chaining (e.g., pd.read_csv(...).assign(...).groupby(...)).
No side effects (the original DataFrame remains unchanged).
Use case:
Building reproducible pipelines with immutable DataFrames.
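A minimal sketch with made-up data; note how the second lambda can reference the gm column created just before it:
import pandas as pd
df = pd.DataFrame({"gross": [100, 250], "cogs": [60, 140]})
out = df.assign(
    gm=lambda x: x.gross - x.cogs,
    gm_pct=lambda x: x.gm / x.gross,   # refers to the gm column defined above
)
print(out)
print(df.columns.tolist())             # the original frame is unchanged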
9. Compute Share Within Group
df["sales_share"] = df.groupby("product")["sales"].transform(lambda x: x / x.sum())
How it works:
transform broadcasts the group-wise result back to the original rows, matching the index.
It behaves like a SQL window function partitioned by product.
Why efficient:
Avoids merging group summaries back into the original DataFrame.
Keeps the original row order and index intact.
Use case:
Calculating contribution percentages within categories (e.g., product sales as % of total category sales).
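A minimal sketch with made-up data; the second assignment is an equivalent form that passes the built-in "sum" instead of a lambda:
import pandas as pd
df = pd.DataFrame({
    "product": ["a", "a", "b", "b"],
    "sales": [10, 30, 5, 15],
})
df["sales_share"] = df.groupby("product")["sales"].transform(lambda x: x / x.sum())
df["sales_share_alt"] = df["sales"] / df.groupby("product")["sales"].transform("sum")
print(df)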
10. Group and Aggregate Multiple Metrics
summary = df.groupby("region", as_index=False).agg(
total=("sales", "sum"), margin_avg=("margin", "mean")
)
How it works:
Named aggregations explicitly map output columns to (column, function) pairs, which is clearer than {"sales": ["sum", "mean"]}.
as_index=False prevents region from becoming the index, making the output easier to merge.
Why efficient:
Computes all aggregates in a single pass over the data.
Output is a tidy DataFrame, not a multi-indexed one.
Use case:
Creating summary tables for dashboards/reports.
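A minimal sketch with made-up data:
import pandas as pd
df = pd.DataFrame({
    "region": ["APAC", "APAC", "EMEA"],
    "sales": [100, 200, 150],
    "margin": [0.20, 0.30, 0.25],
})
summary = df.groupby("region", as_index=False).agg(
    total=("sales", "sum"), margin_avg=("margin", "mean")
)
print(summary)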
11. Reshape from Wide to Long
long = df.melt(id_vars="id", var_name="metric", value_name="value")
How it works:
Unpivots columns into rows (e.g., columns jan_sales and feb_sales become rows with metric="jan_sales" and the corresponding value).
Preserves id_vars as identifier columns on every row.
Why efficient:
Simpler than manually stacking DataFrames.
Required for tools like seaborn that expect long-form data.
Use case:
Preparing time series or panel data for visualization.
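A minimal sketch with made-up monthly columns:
import pandas as pd
df = pd.DataFrame({
    "id": [1, 2],
    "jan_sales": [100, 80],
    "feb_sales": [120, 90],
})
long = df.melt(id_vars="id", var_name="metric", value_name="value")
print(long)   # four rows: one per (id, month) combination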
12. Unnest List Columns
flat = df.explode("tags")
How it works:
Duplicates parent rows, one per element in the list-like tags column.
Handles lists, tuples, or Series within cells.
Why efficient:
Faster than apply(pd.Series) + stack for nested data.
Preserves non-list columns efficiently.
Use case:
Flattening JSON-like data (e.g., order line items stored as lists).
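A minimal sketch with made-up order data:
import pandas as pd
df = pd.DataFrame({
    "order_id": [1, 2],
    "tags": [["gift", "rush"], ["rush"]],
})
flat = df.explode("tags")
print(flat)   # order 1 now occupies two rows, one per tag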
13. Pivot with Aggregation and Fill
pivot = df.pivot_table(
index="date", columns="region", values="sales", aggfunc="sum", fill_value=0
)
How it works:
Groups by date and region, then computes sum(sales) for each combination.
fill_value=0 replaces missing combinations (e.g., no sales in a region on a date) with 0.
Why efficient:
Combines grouping, reshaping, and imputation in one step.
Uses efficient Cython routines for cross-tabulation.
Use case:
Creating heatmap-ready matrices for time/region analyses.
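A minimal sketch with made-up data; note the 0 filled in for the region/date combination with no sales:
import pandas as pd
df = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "region": ["APAC", "EMEA", "APAC"],
    "sales": [100, 150, 200],
})
pivot = df.pivot_table(
    index="date", columns="region", values="sales", aggfunc="sum", fill_value=0
)
print(pivot)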
14. Drop Mostly-Empty Columns
df = df.dropna(axis=1, thresh=int(len(df) * 0.8))
How it works:
thresh requires a column to have at least len(df) * 0.8 non-NaN values (80% of rows) to be kept.
Columns missing more than 20% of their values are dropped.
Why efficient:
Automates a common data-cleaning task.
Prevents models from breaking on columns with too many NaNs.
Use case:
Automated preprocessing in ML pipelines.
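A minimal sketch with made-up data: the column that is mostly empty gets dropped, the full column survives:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    "keep": [1, 2, 3, 4, 5],
    "mostly_empty": [np.nan, np.nan, np.nan, np.nan, 5],
})
df = df.dropna(axis=1, thresh=int(len(df) * 0.8))  # needs >= 4 non-NaN values here
print(df.columns.tolist())                          # ['keep']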
15. Parse and Convert Timezones
df["date"] = pd.to_datetime(df.date).dt.tz_localize("UTC").dt.tz_convert("Asia/Jakarta")
How it works:
to_datetime parses strings/epochs into timezone-naive timestamps.
tz_localize("UTC") attaches the UTC timezone to those naive timestamps.
tz_convert("Asia/Jakarta") shifts them to the target timezone.
Why efficient:
Vectorized operations avoid Python-level loops.
Critical for time zone-aware aggregations.
Use case:
Analyzing global event logs across regions.
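A minimal sketch with made-up timestamps (assumed to have been recorded in UTC):
import pandas as pd
df = pd.DataFrame({"date": ["2024-01-01 00:00:00", "2024-06-01 12:30:00"]})
df["date"] = (
    pd.to_datetime(df.date)            # timezone-naive timestamps
    .dt.tz_localize("UTC")             # declare them as UTC
    .dt.tz_convert("Asia/Jakarta")     # shift to UTC+7
)
print(df)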
16. Validate Emails with Regex
df["email_ok"] = df.email.str.contains(r"^[\w\.-]+@[\w\.-]+\.\w+$", na=False)
How it works:
str.contains applies the regex to each element of the email column.
na=False treats NaN emails as invalid (avoiding NaN values in the boolean result).
Why efficient:
Vectorized regex via Pandas’ string extension (faster than apply).
Identifies malformed emails for data cleaning.
Use case:
Validating user-provided email addresses in registration data.
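A minimal sketch with made-up addresses:
import pandas as pd
df = pd.DataFrame({"email": ["a.user@example.com", "not-an-email", None]})
df["email_ok"] = df.email.str.contains(r"^[\w\.-]+@[\w\.-]+\.\w+$", na=False)
print(df)   # True, False, False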
17. Resample and Fill Missing Dates
daily = df.set_index("timestamp").asfreq("D").ffill()
How it works:
asfreq("D")
ensures a row exists for every calendar day.ffill()
propagates last valid observation forward to fill gaps.
Why efficient:
Faster than manual reindexing with pd.date_range() + merge.
Handles irregular time series (e.g., sensor data with gaps).
Use case:
Preparing data for forecasting models that require regular intervals.
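A minimal sketch with made-up readings that skip two days:
import pandas as pd
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05"]),
    "value": [10, 12, 20],
})
daily = df.set_index("timestamp").asfreq("D").ffill()
print(daily)   # Jan 3 and Jan 4 appear, carrying the value 12 forward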
18. Year-over-Year Percentage Change
yoy = df["sales"].pct_change(12).mul(100).round(2)
How it works:
pct_change(12) computes the percentage change between the current value and the one 12 periods earlier (e.g., monthly data → year over year).
Assumes the data is sorted by time and has no gaps.
Why efficient:
Vectorized operation avoids manual window shifting.
Built-in handling of division by zero (returns inf or NaN).
Use case:
Financial reporting or trend analysis.
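A minimal sketch with two years of made-up monthly data:
import pandas as pd
idx = pd.date_range("2022-01-01", periods=24, freq="MS")
df = pd.DataFrame({"sales": range(100, 124)}, index=idx)
yoy = df["sales"].pct_change(12).mul(100).round(2)
print(yoy.tail(3))   # the first 12 values are NaN; later ones compare to a year earlier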
19. Enable Copy-on-Write Mode
pd.options.mode.copy_on_write = True
How it works:
Defers DataFrame copies until modifications occur (lazy evaluation).
Part of Pandas 2.0+ optimizations for memory efficiency.
Why efficient:
Reduces peak memory usage during chained operations (e.g., df1 = df[df.a > 0]; df1["b"] = ...).
Prevents accidental mutations of parent DataFrames.
Use case:
Large-scale data processing where memory is constrained.
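A minimal sketch with made-up data showing the behavior once the option is enabled (copy-on-write is planned to become the default in pandas 3.0):
import pandas as pd
pd.options.mode.copy_on_write = True
df = pd.DataFrame({"a": [-1, 2, 3], "b": [10, 20, 30]})
df1 = df[df.a > 0]
df1["b"] = 0       # no SettingWithCopyWarning, and df itself stays untouched
print(df)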
20. Visual Summary with Styling
df.style.background_gradient()
How it works:
Generates a Styler object with HTML/CSS formatting for Jupyter.
background_gradient applies a colormap column by column, shading numeric values by magnitude.
Why efficient:
No need to write matplotlib/seaborn plotting code for quick visual checks (background_gradient does use matplotlib colormaps under the hood).
Styler integrates with Pandas’ display logic.
Use case:
Quickly spotting outliers/patterns during exploratory analysis.
Love this article? Comment and share it with your network!
If you're at a pivotal point in your career or sitting on skills you're unsure how to use, I offer 1:1 mentorship.
It's personal, flexible, and built around you.
For long-term mentorship, visit me here (you can even enjoy a 7-day free trial).