A Data Science Code Collection to Improve Your Skills
Some of my favorite code snippets that will be useful in your work
In the fast-paced world of data science, writing efficient Python code is an essential skill.
As working data scientists, we often juggle complex workflows, switching between libraries such as Pandas and NumPy for data wrangling, SQL for querying, and regex for data cleaning and feature extraction.
These four tools come up constantly, so knowing them in more depth always pays off.
In this article, I will share a collection of code snippets that will sharpen your skills with Pandas, NumPy, SQL, and regex.
Let’s get into it!
Referral Recommendation
If you want a crash course in turning healthcare questions into publishable, data-driven findings, subscribe to Paras Karmacharya’s newsletter. An NIH-funded physician-scientist, Paras shows how he pairs classic clinical research methods with AI tools, covering protocol design, reproducible data pipelines, and peer-review-ready writing tips in bite-sized, actionable notes.
It’s the perfect bridge between your data-science skills and real-world medical research. Check it out—and support my work—through this referral link.
💡Pandas Code Collection
Read & Process in Chunks with Parallel I/O
import pandas as pd

# Process a 10GB CSV in 1M-row chunks, aggregating on the fly
agg = []
for chunk in pd.read_csv('big.csv', usecols=['user_id', 'purchase_amt'], chunksize=10**6):
    agg.append(chunk.groupby('user_id')['purchase_amt'].sum())
result = pd.concat(agg).groupby(level=0).sum()
How it works:
usecols skips parsing unnecessary columns at the I/O layer, reducing memory overhead, while chunksize streams data in manageable batches to avoid OOM errors.
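The loop above streams sequentially; to get the parallel processing the heading hints at, one option is to hand each chunk to a process pool. A minimal sketch, assuming the same big.csv layout (the CSV is still read in the parent process; only the per-chunk aggregation runs in parallel):
import pandas as pd
from concurrent.futures import ProcessPoolExecutor

def agg_chunk(chunk):
    # Each chunk is aggregated independently, so chunks can be processed in parallel
    return chunk.groupby('user_id')['purchase_amt'].sum()

if __name__ == '__main__':
    reader = pd.read_csv('big.csv', usecols=['user_id', 'purchase_amt'], chunksize=10**6)
    with ProcessPoolExecutor(max_workers=4) as pool:
        partials = list(pool.map(agg_chunk, reader))
    result = pd.concat(partials).groupby(level=0).sum()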
On-the-fly Expression Evaluation
# Compute a complex formula without creating temp columns (assumes columns named col1, col2, col3)
df['score'] = pd.eval("0.3 * df.col1 + 0.7 * (df.col2 ** 2) / df.col3", engine='numexpr')
How it works:
pd.eval with the numexpr engine compiles the expression into fast C loops, avoiding Python-level overhead and intermediate DataFrame allocations.
Pipeline-Style Cleaning with .pipe()
def drop_low_freq(df, min_count):
    freqs = df['category'].value_counts()
    return df[df['category'].isin(freqs[freqs >= min_count].index)]

cleaned = (
    pd.read_csv('data.csv', dtype={'colA': 'Int64'})
    .pipe(drop_low_freq, min_count=100)
    .assign(flag=lambda d: d['colB'] > d['colB'].quantile(0.9))
)
How it works:
.pipe() passes the intermediate DataFrame directly into your custom function, preserving readability and enabling chaining.
Fast Dtype Conversion for Thousands of Columns
# Downcast all floats and ints in one go
for col in df.select_dtypes(include=['float', 'int']).columns:
    df[col] = pd.to_numeric(df[col], downcast='integer' if df[col].dtype.kind == 'i' else 'float')
How it works:
downcast picks the smallest numeric subtype that fits, slashing memory usage on wide tables.
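To confirm the savings, a quick before-and-after check with memory_usage works; a minimal sketch wrapped around the same loop (df is whatever wide frame you are downcasting):
before = df.memory_usage(deep=True).sum()
for col in df.select_dtypes(include=['float', 'int']).columns:
    df[col] = pd.to_numeric(df[col], downcast='integer' if df[col].dtype.kind == 'i' else 'float')
after = df.memory_usage(deep=True).sum()
print(f"memory: {int(before):,} -> {int(after):,} bytes")  # float64/int64 columns shrink where values allow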
SQL-Style Joins with Multiple Keys & Indicators
merged = pd.merge(
    df_logs, df_users,
    left_on=['uid', 'session'],
    right_on=['user_id', 'sess_id'],
    how='inner',
    indicator='_merge_flag'
)
How it works:
indicator (set here to the column name '_merge_flag') adds a column marking the source of each row, which helps debug mismatches immediately.
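With how='inner' the flag is always 'both', so for auditing mismatches an outer join is the useful variant. A minimal sketch assuming the same two frames:
# Outer join keeps unmatched rows, so the indicator column reveals where each row came from
audit = pd.merge(
    df_logs, df_users,
    left_on=['uid', 'session'], right_on=['user_id', 'sess_id'],
    how='outer', indicator='_merge_flag'
)
print(audit['_merge_flag'].value_counts())  # counts of 'left_only', 'right_only', 'both'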
Time Zone–Aware Resampling
df['ts'] = pd.to_datetime(df['ts']).dt.tz_localize('UTC').dt.tz_convert('America/New_York')
hourly = df.set_index('ts').resample('H').agg({'value':'mean'})
How it works:
tz_localize / tz_convert handle DST and offsets correctly; setting the timestamp as the index lets you resample at any calendar frequency.
Rapid Multi-Index Summaries
idx = pd.MultiIndex.from_product([['A','B'], ['X','Y']], names=['grp','sub'])
df2 = df.set_index(['grp','sub']).reindex(idx, fill_value=0).groupby(level=0).sum()
How it works:
reindex ensures all combinations exist, fill_value avoids missing groups, and groupby(level=0) aggregates at the top level.
Explode List-Like Columns into Rows
df = df.assign(tags=df['tags'].str.split(',')).explode('tags')
How it works:
.str.split turns strings into lists; .explode then creates one row per list element, ideal for tag analysis.
Stratified Sampling by Group
sampled = df.groupby('category', group_keys=False).apply(lambda x: x.sample(frac=0.1))
How it works:
Ensures each category contributes 10% of its rows, preserving group proportions.
Merge with Tolerance for Fuzzy Joins
pd.merge_asof(
    df1.sort_values('time'),
    df2.sort_values('time'),
    on='time',
    tolerance=pd.Timedelta('5s'),
    direction='nearest'
)
How it works:
Joins rows whose time values are within 5 seconds, great for aligning sensor logs.
IntervalIndex for Binning Numeric Data
bins = pd.interval_range(start=0, end=100, freq=10)
df['range'] = pd.cut(df['score'], bins)
How it works:
pd.cut assigns each score to a labeled interval, using an IntervalIndex for efficient lookups.
Rolling Apply of Custom Functions
df['volatility'] = df['returns'].rolling(20).apply(lambda x: np.std(x) * np.sqrt(252))
How it works:
Runs a custom annualized volatility function over a 20-day window without materializing all windows.
Categorical Ordering for Time Periods
order = ['Q1','Q2','Q3','Q4']
df['quarter'] = pd.Categorical(df['quarter'], categories=order, ordered=True)
df.sort_values('quarter')
How it works:
Defines a logical order for quarters so sorting respects business chronology, not alphabetical.
📌NumPy Code Collection
Sliding-Window View Without Copies
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# 1D rolling window of length 5
windows = sliding_window_view(arr, 5)
How it works:
Uses array strides to produce overlapping “views” on the original data, no new memory allocation for each window.
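As a quick sketch of how the view is typically consumed (assuming arr is a 1-D float array), reducing along the last axis gives a rolling statistic without any copies:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

arr = np.arange(10, dtype=float)
windows = sliding_window_view(arr, 5)   # shape (6, 5), zero-copy view
rolling_mean = windows.mean(axis=-1)    # rolling mean over length-5 windows
print(rolling_mean)                     # [2. 3. 4. 5. 6. 7.]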
Masked Arrays for Invalid Data
# Mask negatives and compute mean ignoring them
masked = np.ma.masked_less(arr, 0)
mean_nonneg = masked.mean()
How it works:
np.ma.masked_less creates a masked array where the invalid entries are excluded from computations.
Custom Random Choice with Weights & No Replacement
def weighted_choice(a, weights, k):
    rng = np.random.default_rng()
    return rng.choice(a, size=k, replace=False, p=weights/weights.sum())

sample = weighted_choice(user_ids, user_weights, k=100)
How it works:
default_rng provides a modern generator API; p= uses normalized weights to sample without replacement.
Block-Matrix Multiplication via einsum
# Multiply a stack of matrices A (n×m×p) with B (n×p×q) in one call
C = np.einsum('ijk,ikl->ijl', A, B)
How it works:
einsum parses the subscript notation to fuse loops and minimize temporaries, yielding both clarity and speed.
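For this particular contraction, the batched matrix product operator gives the same result, which makes a handy sanity check; a small sketch with random stacks:
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3, 5))   # n x m x p
B = rng.standard_normal((4, 5, 2))   # n x p x q
C = np.einsum('ijk,ikl->ijl', A, B)
assert np.allclose(C, A @ B)         # '@' also broadcasts over the leading batch axis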
Memory-Mapped Arrays for Huge Data
mmap = np.memmap('large.dat', dtype='float32', mode='r', shape=(10000, 10000))
mean = mmap.mean(axis=0)
How it works:
memmap loads only needed pages from disk on demand, enabling out-of-core operations without full reads.
JIT-Accelerated Ufuncs with Numba
from numba import vectorize, float64

@vectorize([float64(float64, float64)])
def fast_add(a, b):
    return a + b

C = fast_add(A, B)
How it works:
Numba’s vectorize decorator compiles elementwise operations to native code, drastically speeding up custom loops.
Efficient Quantile Computation
# Approximate median for large arrays
median_approx = np.percentile(arr, 50, method='nearest')  # keyword was 'interpolation=' before NumPy 1.22
How it works:
method='nearest' (formerly interpolation=) returns the closest actual data point instead of interpolating between neighbors; np.percentile relies on partial sorting (np.partition) under the hood, so it stays fast on large arrays without a full sort.
Boolean Masked Assignment Across Axes
# Set entire rows to NaN where a per-row condition holds (arr must be a float array)
mask = arr.max(axis=1) < threshold
arr[mask, :] = np.nan
How it works:
Creates a 1D boolean mask per row, then assigns across all columns for matching rows.
Fast Pairwise Euclidean Distances
# Given data matrix X (n×d), compute pairwise distance matrix
sq = np.sum(X**2, axis=1, keepdims=True)
dist2 = sq + sq.T - 2 * X.dot(X.T)
dist = np.sqrt(np.maximum(dist2, 0))
How it works:
Reformulates (x−y)² = x² + y² − 2xy to leverage fast matrix multiplication and avoid Python loops.
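If SciPy is available, cdist makes a convenient cross-check that the algebraic shortcut above is correct; a small sketch with random data:
import numpy as np
from scipy.spatial.distance import cdist  # assumes SciPy is installed

X = np.random.default_rng(1).standard_normal((100, 8))
sq = np.sum(X**2, axis=1, keepdims=True)
dist = np.sqrt(np.maximum(sq + sq.T - 2 * X.dot(X.T), 0))
assert np.allclose(dist, cdist(X, X), atol=1e-6)  # matches the loop-free formula above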
Generate Cartesian Product Efficiently
# Given vectors a, b, c, produce all combinations
grid = np.stack(np.meshgrid(a, b, c, indexing='ij'), -1).reshape(-1, 3)
How it works:
meshgrid produces coordinate matrices; stacking and reshaping yields a compact (|a|·|b|·|c|)×3 array.
Vectorized Conditional Count per Row
# Count values >0 per row
counts = (arr > 0).sum(axis=1)
How it works:
The boolean array (arr > 0) is treated as 1/0; summing across the axis gives per-row counts without Python loops.
Memory-Efficient Unique Elements with Inverse
uniques, inv = np.unique(big_arr, return_inverse=True)
How it works:
return_inverse provides indices mapping each element to its unique value, useful for integer encoding.
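A tiny sketch of the round trip, which is exactly how integer (label) encoding works:
import numpy as np

big_arr = np.array(['b', 'a', 'b', 'c', 'a'])
uniques, inv = np.unique(big_arr, return_inverse=True)
print(uniques)  # ['a' 'b' 'c']
print(inv)      # [1 0 1 2 0] -- compact integer codes
assert (uniques[inv] == big_arr).all()  # the original array is recoverable from the codes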
Broadcasted Polynomial Features
# Given features X (n×d), compute X, X^2, and pairwise products X_i*X_j
X1 = X[:, :, None] * X[:, None, :]
poly = np.concatenate([X, X**2, X1.reshape(X.shape[0], -1)], axis=1)
How it works:
Uses broadcasting to compute all pairwise products and flattens into polynomial feature vectors in one shot.
🚀SQL Code Collection
JSON Extraction & Aggregation (Postgres)
SELECT
    user_id,
    COUNT(*) FILTER (WHERE payload->>'event' = 'click') AS clicks,
    SUM((payload->>'value')::numeric) AS total_value
FROM events
GROUP BY user_id;
How it works:
payload->>'key' extracts JSON text fields; FILTER and ::numeric let you aggregate and cast inline.
LATERAL Join for Top-N per Group
SELECT u.user_id, t.*
FROM users u
CROSS JOIN LATERAL (
SELECT * FROM transactions t
WHERE t.user_id = u.user_id
ORDER BY t.amount DESC
LIMIT 3
) AS t;
How it works:
LATERAL lets each user row feed into the subquery, returning that user’s top 3 transactions without extra grouping.
GROUPING SETS for Multi-Level Totals
SELECT region, product, SUM(sales) AS total
FROM sales
GROUP BY GROUPING SETS (
(region, product),
(region),
(product),
()
);
How it works:
GROUPING SETS computes the specified groupings plus subtotals and grand totals in a single pass.
Approximate Distinct Counts with HyperLogLog
CREATE EXTENSION IF NOT EXISTS hll;
SELECT hll_cardinality(hll_add_agg(hll_hash_text(user_id))) AS approx_users
FROM page_views;
How it works:
HyperLogLog hashes input into fixed-size registers, trading minimal accuracy for massive scale on cardinality estimates.
Materialized Views with Concurrent Refresh (Postgres)
CREATE MATERIALIZED VIEW mv_sales AS
SELECT date_trunc('day', created_at) AS day, SUM(amount) AS total
FROM orders
GROUP BY day;
-- later, refresh without blocking readers (requires a unique index on the view):
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_sales;
How it works:
Materialized views cache heavy aggregates; CONCURRENTLY lets you refresh without blocking reads, though the underlying query is still recomputed in full.
Partition Pruning for Faster Scans
-- Assumes the parent table was created with PARTITION BY RANGE (log_date)
CREATE TABLE logs_y2025 PARTITION OF logs FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');
-- Queries filtering on log_date automatically hit only the relevant partitions
How it works:
Partitioned tables let the engine skip irrelevant data blocks, reducing IO.
UTF-8 Pattern Search in SQL Server
SELECT *
FROM docs
WHERE CONVERT(nvarchar(max), content) LIKE N'%résumé%';
How it works:
The N'…' prefix makes the literal Unicode in SQL Server, and casting content to nvarchar keeps the comparison Unicode-aware, so accented text matches accurately.
UPSERT (Insert or Update) with ON CONFLICT (Postgres)
INSERT INTO users (id, email, last_login)
VALUES (123, 'a@example.com', NOW())
ON CONFLICT (id)
DO UPDATE SET
email = EXCLUDED.email,
last_login = NOW();
How it works:
ON CONFLICT (id) catches duplicate-key violations on id and lets you update the specified columns atomically.
JSONB Array Expansion with jsonb_array_elements
SELECT order_id, item->>'product_id' AS product_id
FROM orders,
jsonb_array_elements(order_items) AS item;
How it works:
jsonb_array_elements unnests a JSONB array into rows; the comma (implicit lateral) join pairs each element with its parent row.
Full-Text Search with GIN Index (Postgres)
CREATE INDEX idx_docs_fts ON documents
USING GIN (to_tsvector('english', content));
SELECT id, ts_headline(content, q) AS snippet
FROM documents, to_tsquery('analytics & data') q
WHERE to_tsvector('english', content) @@ q;
How it works:
A GIN index on to_tsvector speeds up @@ (match) queries; ts_headline highlights the matches in the output.
Dynamic Pivot for Crosstab Reports
-- Build and execute dynamic pivot for monthly sales by region
DECLARE @cols NVARCHAR(MAX), @query NVARCHAR(MAX);
SELECT @cols = STRING_AGG(QUOTENAME(month), ',')
FROM (SELECT DISTINCT FORMAT(order_date, 'yyyy-MM') AS month FROM sales) x;
SET @query = '
SELECT region, ' + @cols + '
FROM (
SELECT region, FORMAT(order_date, ''yyyy-MM'') AS month, amount
FROM sales
) src
PIVOT (
SUM(amount) FOR month IN (' + @cols + ')
) pvt;';
EXEC sp_executesql @query;
How it works:
Builds a comma-separated list of month columns dynamically, then pivots sums across those columns.
Rolling Difference with Window Frame
SELECT
order_date,
sales,
sales - LAG(sales, 1) OVER (ORDER BY order_date) AS delta
FROM daily_sales;
How it works:
LAG(..., 1) fetches the previous row’s value; the subtraction computes the day-over-day change in one pass.
🔍Regex Code Collection
Parse Key-Value Pairs from Logs
pattern = r'(?P<key>\w+)=\[(?P<values>[^\]]+)\]'
How it works:
Named groups extract both the key and its bracketed values in one pass.
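A minimal usage sketch (the log line is invented for illustration):
import re

pattern = r'(?P<key>\w+)=\[(?P<values>[^\]]+)\]'
line = 'user=[alice] roles=[admin,editor] region=[eu-west]'
pairs = {m.group('key'): m.group('values').split(',') for m in re.finditer(pattern, line)}
print(pairs)  # {'user': ['alice'], 'roles': ['admin', 'editor'], 'region': ['eu-west']}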
Extract Domain & Path from URLs
pattern = r'^(?:https?://)?([^/]+)(/.*)?$'
How it works:
The non-capturing (?:…) group skips the protocol; groups 1 and 2 split host and path without extra parsing.
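A quick sketch with a made-up URL:
import re

pattern = r'^(?:https?://)?([^/]+)(/.*)?$'
m = re.match(pattern, 'https://example.com/docs/intro?tab=1')
print(m.group(1))  # example.com
print(m.group(2))  # /docs/intro?tab=1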
Cleanup Nested HTML Tags
pattern = r'<(?P<tag>\w+)[^>]*>(.*?)</(?P=tag)>'
How it works:
The (?P=tag) backreference ensures you only match a closing tag that corresponds to the opening tag.
Conditional Replacement with Callback
import re

def repl(m):
    return m.group(1).upper()

text = re.sub(r'item:(\w+)', repl, text)
How it works:
Passing a function to re.sub lets you dynamically compute the replacement text for each match.
Time-Stamped Log Segmentation
pattern = r'(?ms)^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (?P<msg>.*?)(?=^\[|\Z)'
How it works:
(?ms) sets MULTILINE and DOTALL; the lookahead (?=^\[|\Z) ends each message at the next timestamp or at end-of-text.
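A short sketch on an invented two-entry log shows how multi-line messages stay attached to their timestamp:
import re

pattern = r'(?ms)^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (?P<msg>.*?)(?=^\[|\Z)'
log = (
    '[2025-01-01 09:00:00] service started\n'
    'traceback line 1\n'
    '[2025-01-01 09:05:00] request handled\n'
)
for m in re.finditer(pattern, log):
    print(m.group('ts'), '->', m.group('msg').strip())  # the first message keeps its traceback line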
Code Block Extraction from Markdown
pattern = r'(?s)```(?P<lang>\w+)\n(?P<code>.*?)```'
How it works:
Captures fenced code blocks by language and content; the (?s) flag lets the lazy .*? span newlines and stop at the next fence.
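A usage sketch on invented Markdown (note the (?s) flag added above so the code body may span lines):
import re

pattern = r'(?s)```(?P<lang>\w+)\n(?P<code>.*?)```'
md = "Intro\n```python\nprint('hi')\n```\nprose\n```sql\nSELECT 1;\n```"
for m in re.finditer(pattern, md):
    print(m.group('lang'), '->', m.group('code').strip())
# python -> print('hi')
# sql -> SELECT 1;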
Demarcate XML Sections Without Parsing
pattern = r'(?s)<item>.*?</item>'
How it works:
The single-line flag s makes . match newlines, so you grab entire <item>…</item> blocks in one go.
Match Balanced Parentheses (One Level)
pattern = r"\((?:[^()]+|\([^()]*\))*\)"
How it works:
Uses a non-capturing group with alternation to allow either non-parenthesis characters or a single nested pair, matching balanced (...) blocks one level deep.
Extract HTML Tag Attributes
pattern = r'(?P<tag><(\w+)([^>]*)>)|(?P<attr>\b\w+=(?:"[^"]*"|\'[^\']*\'))'
How it works:
Captures the full tag and, separately, each key="value" or key='value' attribute via named groups.
Find Duplicate Words, Case-Insensitive
pattern = r"(?i)\b(\w+)\s+\1\b"
How it works:
(?i) ignores case; the \1 backreference ensures the same word appears twice in a row.
Split on Multiple Delimiters
pattern = r"[,\s;|]+"
How it works:
The character class [,\s;|] matches commas, whitespace, semicolons, or pipes; the + collapses consecutive delimiters into a single split.
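A one-line usage sketch:
import re

print(re.split(r"[,\s;|]+", "a, b;c | d  e"))  # ['a', 'b', 'c', 'd', 'e']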
Validate IPv4 Addresses
pattern = r"\b(?:(?:25[0-5]|2[0-4]\d|[01]?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d?\d)\b"
How it works:
Ensures each octet is 0–255 by matching 25[0-5], 2[0-4]\d, or [01]?\d?\d (0–199), repeated four times with dots.
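A validation sketch with re.fullmatch (the \b anchors can be dropped because fullmatch already pins both ends):
import re

ip_re = re.compile(r"(?:(?:25[0-5]|2[0-4]\d|[01]?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d?\d)")
print(bool(ip_re.fullmatch('192.168.0.255')))  # True
print(bool(ip_re.fullmatch('256.1.1.1')))      # False -- 256 is out of range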
Named Group Verbose Mode
pattern = re.compile(r"""
^ # start of string
(?P<year>\d{4})[-/] # year and separator
(?P<month>0[1-9]|1[0-2])[-/]
(?P<day>0[1-9]|[12]\d|3[01]) # day
$ # end of string
""", re.VERBOSE)
How it works:
re.VERBOSE lets you annotate each part with comments and whitespace for clarity.
Atomic Group to Prevent Backtracking
pattern = r"(?>(?:\w+\.)+)\w+@\w+\.\w+"
How it works:
The atomic group (?>(?:\w+\.)+) locks in the dot-separated segments, preventing the engine from backtracking into them and guarding against pathological slowdowns on adversarial input. Python’s re module supports atomic groups from version 3.11; the third-party regex package supports them as well.
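A minimal usage sketch (atomic groups need Python 3.11+ in the standard re module; the sample address is made up):
import re  # requires Python 3.11+ for atomic groups

pattern = re.compile(r"(?>(?:\w+\.)+)\w+@\w+\.\w+")
print(pattern.search('contact: first.last.dev@example.com').group())  # first.last.dev@example.com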
Love this article? Comment and share it with your network!
If you're at a pivotal point in your career or sitting on skills you're unsure how to use, I offer 1:1 mentorship.
It's personal, flexible, and built around you.
For long-term mentorship, visit me here (you can even enjoy a 7-day free trial).