20 Pandas One-Liners That Can Save You Hours of Work
Simple code you should not miss
As data professionals who work with Python, most of us use Pandas in our everyday work.
We might have learned the basics, but did you know that a few one-liners could dramatically improve your workflow?
In this article, we will explore 20 Pandas one-liners that will save you hours of work!
Curious about it? Let’s get into it.
Referral Recommendation
If you want a crash course in turning healthcare questions into publishable, data-driven findings, subscribe to Paras Karmacharya’s newsletter. An NIH-funded physician-scientist, Paras shows how he pairs classic clinical research methods with AI tools, covering protocol design, reproducible data pipelines, and peer-review-ready writing tips in bite-sized, actionable notes.
It’s the perfect bridge between your data-science skills and real-world medical research. Check it out—and support my work—through this referral link.
1. Read Only What You Need
df = pd.read_csv("big.csv", usecols=["id", "total", "date"], dtype_backend="pyarrow")
How it works:
usecols skips parsing unnecessary columns at the I/O layer, reducing memory overhead.
dtype_backend="pyarrow" backs the columns with Apache Arrow's columnar arrays, which are more compact than the default NumPy-backed dtypes (e.g., strings are stored as UTF-8 buffers, not Python objects).
Why efficient:
Avoids loading gigabytes of unused data into memory.
Arrow’s columnar format accelerates downstream operations like filtering/aggregation.
Use case:
Initial exploration of large datasets (e.g., 10M+ rows) where only specific columns are relevant.
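A minimal sketch of the pattern (big.csv and the column names are just the placeholders from the one-liner above; pandas 2.0+ with pyarrow installed is assumed):
import pandas as pd
# Compare memory with and without column pruning ("big.csv" is a placeholder file).
full = pd.read_csv("big.csv")
slim = pd.read_csv("big.csv", usecols=["id", "total", "date"], dtype_backend="pyarrow")
print(full.memory_usage(deep=True).sum() / 1_048_576, "MB")
print(slim.memory_usage(deep=True).sum() / 1_048_576, "MB")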
2. Automatically Optimize Column Types
df = df.convert_dtypes(dtype_backend="pyarrow")
How it works:
Infers optimal nullable dtypes (e.g., Int64 instead of float64 for integer columns that contain NaNs).
Uses Arrow-backed types like string[pyarrow] for faster string operations.
Why efficient:
Reduces memory by ~30-50% compared to object/float64 defaults.
Enables vectorized operations on Arrow strings (e.g., .str.contains() runs at C++ speed).
Use case:
Cleaning raw data from CSVs/Excel where Pandas misinfers dtypes as objects.
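A minimal sketch with made-up values showing the dtype change (pandas 2.0+ with pyarrow assumed):
import pandas as pd
# Mixed columns that pandas typically reads as float64/object.
df = pd.DataFrame({
    "user_id": [1, 2, None],        # float64 because of the missing value
    "name": ["ann", "bob", None],   # object
})
print(df.dtypes)
df = df.convert_dtypes(dtype_backend="pyarrow")
print(df.dtypes)  # e.g., int64[pyarrow], string[pyarrow]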
3. Full Data Summary in One Line
df.describe(include="all", datetime_is_numeric=True)
How it works:
include="all"
forces stats for all dtypes: counts/unique/top/freq for objects, percentiles for numerics.datetime_is_numeric=True
treats datetime columns as numeric, showing min/max/mean.
Why efficient:
Replaces manual .info(), .mean(), and .value_counts() calls.
Exposes data issues (e.g., unexpected unique counts, skewed distributions).
Use case:
First-step EDA to identify outliers, missing values, or dtype mismatches.
4. Get Memory Usage in MB
df.memory_usage(deep=True).sum() / 1_048_576
How it works:
deep=True calculates the actual memory used by object columns (e.g., strings), not just pointer overhead.
Dividing by 1_048_576 (1024^2) converts bytes to MB.
Why efficient:
Identifies memory-hungry columns (e.g., 10,000 unique strings stored as objects).
Critical for optimizing before processing on RAM-constrained machines.
Use case:
Debugging MemoryError issues or tuning Dask/Modin workflows.
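A minimal sketch with made-up data, showing the per-column breakdown as well as the total:
import pandas as pd
df = pd.DataFrame({
    "id": range(100_000),
    "comment": ["free-text feedback"] * 100_000,  # Python string objects
})
print(df.memory_usage(deep=True))                              # bytes per column
print(round(df.memory_usage(deep=True).sum() / 1_048_576, 2), "MB")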
5. SQL-Like Filtering with Query
high_apac = df.query("sales > 1000 & region == 'APAC'")
How it works:
Translates the query string into an optimized abstract syntax tree (AST).
Uses the numexpr library (when installed) for fast, multithreaded evaluation.
Why efficient:
Avoids the intermediate boolean arrays created by (df.sales > 1000) & (df.region == 'APAC').
Clean syntax for multi-condition filters.
Use case:
Interactive filtering in Jupyter, especially with dynamic query strings.
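A minimal sketch with made-up data, with the equivalent boolean-mask version shown for comparison:
import pandas as pd
df = pd.DataFrame({
    "sales": [500, 1500, 2500],
    "region": ["EMEA", "APAC", "APAC"],
})
high_apac = df.query("sales > 1000 & region == 'APAC'")
same_result = df[(df.sales > 1000) & (df.region == "APAC")]  # builds two temporary masks
print(high_apac)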
6. Top-N Without Sorting Everything
top10 = df.nlargest(10, "profit")
How it works:
Uses a partitioning algorithm (like Quickselect) to find the top-N rows in O(N) time vs. O(N log N) for full sort.
Why efficient:
No need to sort the entire DataFrame when you only care about extremes.
Works seamlessly with multi-indexed DataFrames.
Use case:
Identifying top customers/products without expensive full sorts.
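A minimal sketch with made-up data; the sort-based version is included only to confirm the result is the same:
import pandas as pd
df = pd.DataFrame({
    "customer": ["a", "b", "c", "d"],
    "profit": [120, 980, 45, 560],
})
top2 = df.nlargest(2, "profit")                                   # partial selection
top2_sorted = df.sort_values("profit", ascending=False).head(2)   # full sort, same rows
print(top2)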
7. Column Math with eval
df["revenue"] = df.eval("quantity * price")
How it works:
Parses the expression into a vectorized operation using Pandas’ internal engine.
Avoids temporary variables (e.g., no intermediate df["quantity"] * df["price"] stored in memory).
Why efficient:
Faster for complex expressions (e.g., (a + b) / (c - d)).
Reduces memory overhead during chained calculations.
Use case:
Feature engineering with multi-column dependencies.
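A minimal sketch with made-up data; the second eval call shows assignment inside the expression string:
import pandas as pd
df = pd.DataFrame({"quantity": [2, 5, 1], "price": [9.99, 3.50, 120.0]})
df["revenue"] = df.eval("quantity * price")
df = df.eval("discounted = revenue * 0.9")   # assignment inside the expression
print(df)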
8. Add Column with assign and lambda
df = df.assign(gm=lambda x: x.gross - x.cogs)
How it works:
assign creates a new DataFrame with the added column.
The lambda receives the current DataFrame as input (x), enabling dynamic references (e.g., to columns created earlier in the same call).
Why efficient:
Supports method chaining (e.g., pd.read_csv(...).assign(...).groupby(...)).
No side effects (the original DataFrame remains unchanged).
Use case:
Building reproducible pipelines with immutable DataFrames.
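A minimal sketch with made-up data; note how the second lambda can reference the gm column created just before it:
import pandas as pd
df = pd.DataFrame({"gross": [100, 250], "cogs": [60, 140]})
out = df.assign(
    gm=lambda x: x.gross - x.cogs,
    gm_pct=lambda x: x.gm / x.gross,   # refers to the gm column defined above
)
print(out)
print(df.columns.tolist())             # the original frame is unchanged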
9. Compute Share Within Group
df["sales_share"] = df.groupby("product")["sales"].transform(lambda x: x / x.sum())
How it works:
transform broadcasts the group-wise result back to the original rows, matching the index.
It behaves like a SQL window function partitioned by product.
Why efficient:
Avoids merging group summaries back into the original DataFrame.
Keeps the original row order and index intact.
Use case:
Calculating contribution percentages within categories (e.g., product sales as % of total category sales).
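A minimal sketch with made-up data; the second assignment is an equivalent form that passes the built-in "sum" instead of a lambda:
import pandas as pd
df = pd.DataFrame({
    "product": ["a", "a", "b", "b"],
    "sales": [10, 30, 5, 15],
})
df["sales_share"] = df.groupby("product")["sales"].transform(lambda x: x / x.sum())
df["sales_share_alt"] = df["sales"] / df.groupby("product")["sales"].transform("sum")
print(df)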
10. Group and Aggregate Multiple Metrics
summary = df.groupby("region", as_index=False).agg(
total=("sales", "sum"), margin_avg=("margin", "mean")
)
How it works:
Named aggregations explicitly map output columns to (column, function) pairs, which is clearer than {"sales": ["sum", "mean"]}.
as_index=False prevents region from becoming the index, making the output easier to merge.
Why efficient:
Computes all aggregates in a single pass over the data.
Output is a tidy DataFrame, not a multi-indexed one.
Use case:
Creating summary tables for dashboards/reports.
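A minimal sketch with made-up data:
import pandas as pd
df = pd.DataFrame({
    "region": ["APAC", "APAC", "EMEA"],
    "sales": [100, 200, 150],
    "margin": [0.20, 0.30, 0.25],
})
summary = df.groupby("region", as_index=False).agg(
    total=("sales", "sum"), margin_avg=("margin", "mean")
)
print(summary)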
11. Reshape from Wide to Long
long = df.melt(id_vars="id", var_name="metric", value_name="value")
How it works:
Unpivots columns into rows (e.g., columns jan_sales and feb_sales become rows with metric="jan_sales" and the corresponding value).
Preserves id_vars as identifier columns on every row.
Why efficient:
Simpler than manually stacking DataFrames.
Required for tools like seaborn that expect long-form data.
Use case:
Preparing time series or panel data for visualization.
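A minimal sketch with made-up monthly columns:
import pandas as pd
df = pd.DataFrame({
    "id": [1, 2],
    "jan_sales": [100, 80],
    "feb_sales": [120, 90],
})
long = df.melt(id_vars="id", var_name="metric", value_name="value")
print(long)   # four rows: one per (id, month) combination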
12. Unnest List Columns
flat = df.explode("tags")
How it works:
Duplicates parent rows, one per element in the list-like tags column.
Handles lists, tuples, or Series within cells.
Why efficient:
Faster than apply(pd.Series) + stack for nested data.
Preserves non-list columns efficiently.
Use case:
Flattening JSON-like data (e.g., order line items stored as lists).
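A minimal sketch with made-up order data:
import pandas as pd
df = pd.DataFrame({
    "order_id": [1, 2],
    "tags": [["gift", "rush"], ["rush"]],
})
flat = df.explode("tags")
print(flat)   # order 1 now occupies two rows, one per tag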
13. Pivot with Aggregation and Fill
pivot = df.pivot_table(
index="date", columns="region", values="sales", aggfunc="sum", fill_value=0
)
How it works:
Groups by date and region, then computes sum(sales) for each combination.
fill_value=0 replaces missing combinations (e.g., no sales in a region on a date) with 0.
Why efficient:
Combines grouping, reshaping, and imputation in one step.
Uses efficient Cython routines for cross-tabulation.
Use case:
Creating heatmap-ready matrices for time/region analyses.
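A minimal sketch with made-up data; note the 0 filled in for the region/date combination with no sales:
import pandas as pd
df = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "region": ["APAC", "EMEA", "APAC"],
    "sales": [100, 150, 200],
})
pivot = df.pivot_table(
    index="date", columns="region", values="sales", aggfunc="sum", fill_value=0
)
print(pivot)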
14. Drop Mostly-Empty Columns
df = df.dropna(axis=1, thresh=int(len(df) * 0.8))
How it works:
thresh requires a column to have at least len(df) * 0.8 non-NaN values (80% of rows) to be kept.
Columns missing more than 20% of their values are dropped.
Why efficient:
Automates a common data-cleaning task.
Prevents models from breaking on columns with too many NaNs.
Use case:
Automated preprocessing in ML pipelines.
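A minimal sketch with made-up data: the column that is mostly empty gets dropped, the full column survives:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    "keep": [1, 2, 3, 4, 5],
    "mostly_empty": [np.nan, np.nan, np.nan, np.nan, 5],
})
df = df.dropna(axis=1, thresh=int(len(df) * 0.8))  # needs >= 4 non-NaN values here
print(df.columns.tolist())                          # ['keep']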
15. Parse and Convert Timezones
df["date"] = pd.to_datetime(df.date).dt.tz_localize("UTC").dt.tz_convert("Asia/Jakarta")
How it works:
to_datetime parses strings/epochs into timezone-naive timestamps.
tz_localize("UTC") attaches the UTC timezone to those naive timestamps.
tz_convert("Asia/Jakarta") shifts them to the target timezone.
Why efficient:
Vectorized operations avoid Python-level loops.
Critical for time zone-aware aggregations.
Use case:
Analyzing global event logs across regions.
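A minimal sketch with made-up timestamps (assumed to have been recorded in UTC):
import pandas as pd
df = pd.DataFrame({"date": ["2024-01-01 00:00:00", "2024-06-01 12:30:00"]})
df["date"] = (
    pd.to_datetime(df.date)            # timezone-naive timestamps
    .dt.tz_localize("UTC")             # declare them as UTC
    .dt.tz_convert("Asia/Jakarta")     # shift to UTC+7
)
print(df)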
16. Validate Emails with Regex
df["email_ok"] = df.email.str.contains(r"^[\w\.-]+@[\w\.-]+\.\w+$", na=False)
How it works:
str.contains applies the regex to each element of the email column.
na=False treats NaN emails as invalid (avoiding NaN values in the boolean result).
Why efficient:
Vectorized regex via Pandas’ string extension (faster than apply).
Identifies malformed emails for data cleaning.
Use case:
Validating user-provided email addresses in registration data.
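A minimal sketch with made-up addresses:
import pandas as pd
df = pd.DataFrame({"email": ["a.user@example.com", "not-an-email", None]})
df["email_ok"] = df.email.str.contains(r"^[\w\.-]+@[\w\.-]+\.\w+$", na=False)
print(df)   # True, False, False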
17. Resample and Fill Missing Dates
daily = df.set_index("timestamp").asfreq("D").ffill()
How it works:
asfreq("D")
ensures a row exists for every calendar day.ffill()
propagates last valid observation forward to fill gaps.
Why efficient:
Faster than manual reindexing with pd.date_range() + merge.
Handles irregular time series (e.g., sensor data with gaps).
Use case:
Preparing data for forecasting models that require regular intervals.
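A minimal sketch with made-up readings that skip two days:
import pandas as pd
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05"]),
    "value": [10, 12, 20],
})
daily = df.set_index("timestamp").asfreq("D").ffill()
print(daily)   # Jan 3 and Jan 4 appear, carrying the value 12 forward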
18. Year-over-Year Percentage Change
yoy = df["sales"].pct_change(12).mul(100).round(2)
How it works:
pct_change(12) computes the percentage change between the current value and the one 12 periods earlier (e.g., monthly data → year over year).
Assumes the data is sorted by time and has no gaps.
Why efficient:
Vectorized operation avoids manual window shifting.
Built-in handling of division by zero (returns inf or NaN).
Use case:
Financial reporting or trend analysis.
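A minimal sketch with two years of made-up monthly data:
import pandas as pd
idx = pd.date_range("2022-01-01", periods=24, freq="MS")
df = pd.DataFrame({"sales": range(100, 124)}, index=idx)
yoy = df["sales"].pct_change(12).mul(100).round(2)
print(yoy.tail(3))   # the first 12 values are NaN; later ones compare to a year earlier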
19. Enable Copy-on-Write Mode
pd.options.mode.copy_on_write = True
How it works:
Defers DataFrame copies until modifications occur (lazy evaluation).
Part of Pandas 2.0+ optimizations for memory efficiency.
Why efficient:
Reduces peak memory usage during chained operations (e.g., df1 = df[df.a > 0]; df1["b"] = ...).
Prevents accidental mutations of parent DataFrames.
Use case:
Large-scale data processing where memory is constrained.
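A minimal sketch with made-up data showing the behavior once the option is enabled (copy-on-write is planned to become the default in pandas 3.0):
import pandas as pd
pd.options.mode.copy_on_write = True
df = pd.DataFrame({"a": [-1, 2, 3], "b": [10, 20, 30]})
df1 = df[df.a > 0]
df1["b"] = 0       # no SettingWithCopyWarning, and df itself stays untouched
print(df)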
20. Visual Summary with Styling
df.style.background_gradient()
How it works:
Generates a Styler object with HTML/CSS formatting for Jupyter.
background_gradient applies a colormap column by column, shading numeric values by magnitude.
Why efficient:
No need to write matplotlib/seaborn plotting code for quick visual checks (background_gradient does use matplotlib colormaps under the hood).
Styler integrates with Pandas’ display logic.
Use case:
Quickly spotting outliers/patterns during exploratory analysis.
Love this article? Comment and share it with your network!
If you're at a pivotal point in your career or sitting on skills you're unsure how to use, I offer 1:1 mentorship.
It's personal, flexible, and built around you.
For long-term mentorship, visit me here (you can even enjoy a 7-day free trial).