A Data Science Code Collection to Improve Your Skills
Some of my favorite code snippets that will be useful in your work
In the fast-paced world of data science, writing efficient Python code is an essential skill.
As working data scientists, we often juggle complex workflows, switching between libraries such as Pandas and NumPy for data wrangling, SQL for querying, and regex for data cleaning and feature extraction.
These four tools come up constantly, so knowing them in more depth always pays off.
In this article, I will share a collection of code snippets that will sharpen your skills with Pandas, NumPy, SQL, and regex.
Let’s get into it!
Referral Recommendation
If you want a crash course in turning healthcare questions into publishable, data-driven findings, subscribe to Paras Karmacharya’s newsletter. An NIH-funded physician-scientist, Paras shows how he pairs classic clinical research methods with AI tools, covering protocol design, reproducible data pipelines, and peer-review-ready writing tips in bite-sized, actionable notes.
It’s the perfect bridge between your data-science skills and real-world medical research. Check it out—and support my work—through this referral link.
💡Pandas Code Collection
Read & Process in Chunks with Parallel I/O
import pandas as pd

# Process a 10GB CSV in 1M-row chunks, aggregating on the fly
agg = []
for chunk in pd.read_csv('big.csv', usecols=['user_id', 'purchase_amt'], chunksize=10**6):
    agg.append(chunk.groupby('user_id')['purchase_amt'].sum())
result = pd.concat(agg).groupby(level=0).sum()
How it works:
usecols skips parsing unnecessary columns at the I/O layer, reducing memory overhead, while chunksize streams data in manageable batches to avoid OOM errors.
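The loop above streams sequentially; to get the parallel processing the heading hints at, one option is to hand each chunk to a process pool. A minimal sketch, assuming the same big.csv layout (the CSV is still read in the parent process; only the per-chunk aggregation runs in parallel):
import pandas as pd
from concurrent.futures import ProcessPoolExecutor

def agg_chunk(chunk):
    # Each chunk is aggregated independently, so chunks can be processed in parallel
    return chunk.groupby('user_id')['purchase_amt'].sum()

if __name__ == '__main__':
    reader = pd.read_csv('big.csv', usecols=['user_id', 'purchase_amt'], chunksize=10**6)
    with ProcessPoolExecutor(max_workers=4) as pool:
        partials = list(pool.map(agg_chunk, reader))
    result = pd.concat(partials).groupby(level=0).sum()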
On-the-fly Expression Evaluation
# Compute a complex formula without creating temp columns (assumes columns named col1, col2, col3)
df['score'] = pd.eval("0.3 * df.col1 + 0.7 * (df.col2 ** 2) / df.col3", engine='numexpr')
How it works:
pd.eval with the numexpr engine compiles the expression into fast C loops, avoiding Python-level overhead and intermediate DataFrame allocations.
Pipeline-Style Cleaning with .pipe()
def drop_low_freq(df, min_count):
    freqs = df['category'].value_counts()
    return df[df['category'].isin(freqs[freqs >= min_count].index)]

cleaned = (
    pd.read_csv('data.csv', dtype={'colA': 'Int64'})
    .pipe(drop_low_freq, min_count=100)
    .assign(flag=lambda d: d['colB'] > d['colB'].quantile(0.9))
)
How it works:
.pipe() passes the intermediate DataFrame directly into your custom function, preserving readability and enabling chaining.
Fast Dtype Conversion for Thousands of Columns
# Downcast all floats and ints in one go
for col in df.select_dtypes(include=['float', 'int']).columns:
    df[col] = pd.to_numeric(df[col], downcast='integer' if df[col].dtype.kind == 'i' else 'float')
How it works:
downcast picks the smallest numeric subtype that fits, slashing memory usage on wide tables.
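To confirm the savings, a quick before-and-after check with memory_usage works; a minimal sketch wrapped around the same loop (df is whatever wide frame you are downcasting):
before = df.memory_usage(deep=True).sum()
for col in df.select_dtypes(include=['float', 'int']).columns:
    df[col] = pd.to_numeric(df[col], downcast='integer' if df[col].dtype.kind == 'i' else 'float')
after = df.memory_usage(deep=True).sum()
print(f"memory: {int(before):,} -> {int(after):,} bytes")  # float64/int64 columns shrink where values allow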
SQL-Style Joins with Multiple Keys & Indicators
merged = pd.merge(
    df_logs, df_users,
    left_on=['uid', 'session'],
    right_on=['user_id', 'sess_id'],
    how='inner',
    indicator='_merge_flag'
)
How it works:
indicator (set here to the column name '_merge_flag') adds a column marking the source of each row, which helps debug mismatches immediately.
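With how='inner' the flag is always 'both', so for auditing mismatches an outer join is the useful variant. A minimal sketch assuming the same two frames:
# Outer join keeps unmatched rows, so the indicator column reveals where each row came from
audit = pd.merge(
    df_logs, df_users,
    left_on=['uid', 'session'], right_on=['user_id', 'sess_id'],
    how='outer', indicator='_merge_flag'
)
print(audit['_merge_flag'].value_counts())  # counts of 'left_only', 'right_only', 'both'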
Time Zone–Aware Resampling
df['ts'] = pd.to_datetime(df['ts']).dt.tz_localize('UTC').dt.tz_convert('America/New_York')
hourly = df.set_index('ts').resample('H').agg({'value':'mean'})
How it works:
tz_localize / tz_convert handle DST and offsets correctly; setting the timestamp as the index lets you resample at any calendar frequency.
Rapid Multi-Index Summaries
idx = pd.MultiIndex.from_product([['A','B'], ['X','Y']], names=['grp','sub'])
df2 = df.set_index(['grp','sub']).reindex(idx, fill_value=0).groupby(level=0).sum()
How it works:
reindex ensures all combinations exist, fill_value avoids missing groups, and groupby(level=0) aggregates at the top level.
Explode List-Like Columns into Rows
df = df.assign(tags=df['tags'].str.split(',')).explode('tags')
How it works:
.str.split turns strings into lists; .explode then creates one row per list element, ideal for tag analysis.
Stratified Sampling by Group
sampled = df.groupby('category', group_keys=False).apply(lambda x: x.sample(frac=0.1))
How it works:
Ensures each category contributes 10% of its rows, preserving group proportions.
Merge with Tolerance for Fuzzy Joins
pd.merge_asof(
    df1.sort_values('time'),
    df2.sort_values('time'),
    on='time',
    tolerance=pd.Timedelta('5s'),
    direction='nearest'
)
How it works:
Joins rows whose time values are within 5 seconds, great for aligning sensor logs.
IntervalIndex for Binning Numeric Data
bins = pd.interval_range(start=0, end=100, freq=10)
df['range'] = pd.cut(df['score'], bins)
How it works:
pd.cut assigns each score to a labeled interval, using an IntervalIndex for efficient lookups.
Rolling Apply of Custom Functions
df['volatility'] = df['returns'].rolling(20).apply(lambda x: np.std(x) * np.sqrt(252))
How it works:
Runs a custom annualized volatility function over a 20-day window without materializing all windows.
Categorical Ordering for Time Periods
order = ['Q1','Q2','Q3','Q4']
df['quarter'] = pd.Categorical(df['quarter'], categories=order, ordered=True)
df.sort_values('quarter')
How it works:
Defines a logical order for quarters so sorting respects business chronology, not alphabetical.
📌NumPy Code Collection
Sliding-Window View Without Copies
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# 1D rolling window of length 5
windows = sliding_window_view(arr, 5)
How it works:
Uses array strides to produce overlapping “views” on the original data, no new memory allocation for each window.
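As a quick sketch of how the view is typically consumed (assuming arr is a 1-D float array), reducing along the last axis gives a rolling statistic without any copies:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

arr = np.arange(10, dtype=float)
windows = sliding_window_view(arr, 5)   # shape (6, 5), zero-copy view
rolling_mean = windows.mean(axis=-1)    # rolling mean over length-5 windows
print(rolling_mean)                     # [2. 3. 4. 5. 6. 7.]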
Masked Arrays for Invalid Data
# Mask negatives and compute mean ignoring them
masked = np.ma.masked_less(arr, 0)
mean_nonneg = masked.mean()
How it works:
np.ma.masked_less creates a masked array where the invalid entries are excluded from computations.
Custom Random Choice with Weights & No Replacement
def weighted_choice(a, weights, k):
    rng = np.random.default_rng()
    return rng.choice(a, size=k, replace=False, p=weights/weights.sum())

sample = weighted_choice(user_ids, user_weights, k=100)
How it works:
default_rng provides a modern generator API; p= uses normalized weights to sample without replacement.
Block-Matrix Multiplication via einsum
# Multiply a stack of matrices A (n×m×p) with B (n×p×q) in one call
C = np.einsum('ijk,ikl->ijl', A, B)
How it works:
einsum parses the subscript notation to fuse loops and minimize temporaries, yielding both clarity and speed.
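For this particular contraction, the batched matrix product operator gives the same result, which makes a handy sanity check; a small sketch with random stacks:
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3, 5))   # n x m x p
B = rng.standard_normal((4, 5, 2))   # n x p x q
C = np.einsum('ijk,ikl->ijl', A, B)
assert np.allclose(C, A @ B)         # '@' also broadcasts over the leading batch axis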
Memory-Mapped Arrays for Huge Data
mmap = np.memmap('large.dat', dtype='float32', mode='r', shape=(10000, 10000))
mean = mmap.mean(axis=0)
How it works:
memmap loads only needed pages from disk on demand, enabling out-of-core operations without full reads.
JIT-Accelerated Ufuncs with Numba
from numba import vectorize, float64

@vectorize([float64(float64, float64)])
def fast_add(a, b):
    return a + b

C = fast_add(A, B)
How it works:
Numba’s vectorize decorator compiles elementwise operations to native code, drastically speeding up custom loops.
Efficient Quantile Computation
# Approximate median for large arrays
median_approx = np.percentile(arr, 50, method='nearest')  # keyword was 'interpolation=' before NumPy 1.22
How it works:
method='nearest' (formerly interpolation=) returns the closest actual data point instead of interpolating between neighbors; np.percentile relies on partial sorting (np.partition) under the hood, so it stays fast on large arrays without a full sort.
Boolean Masked Assignment Across Axes
# Set entire rows to NaN where a per-row condition holds (arr must be a float array)
mask = arr.max(axis=1) < threshold
arr[mask, :] = np.nan
How it works:
Creates a 1D boolean mask per row, then assigns across all columns for matching rows.
Fast Pairwise Euclidean Distances
# Given data matrix X (n×d), compute pairwise distance matrix
sq = np.sum(X**2, axis=1, keepdims=True)
dist2 = sq + sq.T - 2 * X.dot(X.T)
dist = np.sqrt(np.maximum(dist2, 0))
How it works:
Reformulates (x−y)² = x² + y² − 2xy to leverage fast matrix multiplication and avoid Python loops.
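If SciPy is available, cdist makes a convenient cross-check that the algebraic shortcut above is correct; a small sketch with random data:
import numpy as np
from scipy.spatial.distance import cdist  # assumes SciPy is installed

X = np.random.default_rng(1).standard_normal((100, 8))
sq = np.sum(X**2, axis=1, keepdims=True)
dist = np.sqrt(np.maximum(sq + sq.T - 2 * X.dot(X.T), 0))
assert np.allclose(dist, cdist(X, X), atol=1e-6)  # matches the loop-free formula above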
Generate Cartesian Product Efficiently
# Given vectors a, b, c, produce all combinations
grid = np.stack(np.meshgrid(a, b, c, indexing='ij'), -1).reshape(-1, 3)
How it works:
meshgrid produces coordinate matrices; stacking and reshaping yields a compact (|a|·|b|·|c|)×3 array.
Vectorized Conditional Count per Row
# Count values >0 per row
counts = (arr > 0).sum(axis=1)
How it works:
The boolean array (arr > 0) is treated as 1/0; summing across the axis gives per-row counts without Python loops.
Memory-Efficient Unique Elements with Inverse
uniques, inv = np.unique(big_arr, return_inverse=True)
How it works:
return_inverse provides indices mapping each element to its unique value, useful for integer encoding.
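A tiny sketch of the round trip, which is exactly how integer (label) encoding works:
import numpy as np

big_arr = np.array(['b', 'a', 'b', 'c', 'a'])
uniques, inv = np.unique(big_arr, return_inverse=True)
print(uniques)  # ['a' 'b' 'c']
print(inv)      # [1 0 1 2 0] -- compact integer codes
assert (uniques[inv] == big_arr).all()  # the original array is recoverable from the codes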
Broadcasted Polynomial Features
# Given features X (n×d), compute X, X^2, and pairwise products X_i*X_j
X1 = X[:, :, None] * X[:, None, :]
poly = np.concatenate([X, X**2, X1.reshape(X.shape[0], -1)], axis=1)
How it works:
Uses broadcasting to compute all pairwise products and flattens into polynomial feature vectors in one shot.
🚀SQL Code Collection
JSON Extraction & Aggregation (Postgres)
SELECT
    user_id,
    COUNT(*) FILTER (WHERE payload->>'event' = 'click') AS clicks,
    SUM((payload->>'value')::numeric) AS total_value
FROM events
GROUP BY user_id;
How it works:
payload->>'key' extracts JSON text fields; FILTER and ::numeric let you aggregate and cast inline.
LATERAL Join for Top-N per Group
SELECT u.user_id, t.*
FROM users u
CROSS JOIN LATERAL (
SELECT * FROM transactions t
WHERE t.user_id = u.user_id
ORDER BY t.amount DESC
LIMIT 3
) AS t;
How it works:
LATERAL lets each user row feed into the subquery, returning that user’s top 3 transactions without extra grouping.
GROUPING SETS for Multi-Level Totals
SELECT region, product, SUM(sales) AS total
FROM sales
GROUP BY GROUPING SETS (
(region, product),
(region),
(product),
()
);
How it works:
GROUPING SETS computes the specified groupings plus subtotals and grand totals in a single pass.
Approximate Distinct Counts with HyperLogLog
CREATE EXTENSION IF NOT EXISTS hll;
SELECT hll_cardinality(hll_add_agg(hll_hash_text(user_id))) AS approx_users
FROM page_views;
How it works:
HyperLogLog hashes input into fixed-size registers, trading minimal accuracy for massive scale on cardinality estimates.
Materialized Views with Concurrent Refresh (Postgres)
CREATE MATERIALIZED VIEW mv_sales AS
SELECT date_trunc('day', created_at) AS day, SUM(amount) AS total
FROM orders
GROUP BY day;
-- later, refresh without blocking readers (requires a unique index on the view):
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_sales;
How it works:
Materialized views cache heavy aggregates; CONCURRENTLY lets you refresh without blocking reads, though the underlying query is still recomputed in full.
Partition Pruning for Faster Scans
-- Assumes the parent table was created with PARTITION BY RANGE (log_date)
CREATE TABLE logs_y2025 PARTITION OF logs FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');
-- Queries filtering on log_date automatically hit only the relevant partitions
How it works:
Partitioned tables let the engine skip irrelevant data blocks, reducing IO.
UTF-8 Pattern Search in SQL Server
SELECT *
FROM docs
WHERE CONVERT(nvarchar(max), content) LIKE N'%résumé%';
How it works:
The N'…' prefix makes the literal Unicode in SQL Server, and casting content to nvarchar keeps the comparison Unicode-aware, so accented text matches accurately.
UPSERT (Insert or Update) with ON CONFLICT (Postgres)
INSERT INTO users (id, email, last_login)
VALUES (123, 'a@example.com', NOW())
ON CONFLICT (id)
DO UPDATE SET
email = EXCLUDED.email,
last_login = NOW();
How it works:
ON CONFLICT (id) catches duplicate-key violations on id and lets you update the specified columns atomically.
JSONB Array Expansion with jsonb_array_elements
SELECT order_id, item->>'product_id' AS product_id
FROM orders,
jsonb_array_elements(order_items) AS item;
How it works:
jsonb_array_elements unnests a JSONB array into rows; the comma (implicit lateral) join pairs each element with its parent row.
Full-Text Search with GIN Index (Postgres)
CREATE INDEX idx_docs_fts ON documents
USING GIN (to_tsvector('english', content));
SELECT id, ts_headline(content, q) AS snippet
FROM documents, to_tsquery('analytics & data') q
WHERE to_tsvector('english', content) @@ q;
How it works:
A GIN index on to_tsvector speeds up @@ (match) queries; ts_headline highlights the matches in the output.
Dynamic Pivot for Crosstab Reports
-- Build and execute dynamic pivot for monthly sales by region
DECLARE @cols NVARCHAR(MAX), @query NVARCHAR(MAX);
SELECT @cols = STRING_AGG(QUOTENAME(month), ',')
FROM (SELECT DISTINCT FORMAT(order_date, 'yyyy-MM') AS month FROM sales) x;
SET @query = '
SELECT region, ' + @cols + '
FROM (
SELECT region, FORMAT(order_date, ''yyyy-MM'') AS month, amount
FROM sales
) src
PIVOT (
SUM(amount) FOR month IN (' + @cols + ')
) pvt;';
EXEC sp_executesql @query;
How it works:
Builds a comma-separated list of month columns dynamically, then pivots sums across those columns.
Rolling Difference with Window Frame
SELECT
order_date,
sales,
sales - LAG(sales, 1) OVER (ORDER BY order_date) AS delta
FROM daily_sales;
How it works:
LAG(..., 1) fetches the previous row’s value; the subtraction computes the day-over-day change in one pass.
🔍Regex Code Collection
Parse Key-Value Pairs from Logs
pattern = r'(?P<key>\w+)=\[(?P<values>[^\]]+)\]'
How it works:
Named groups extract both the key and its bracketed values in one pass.
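A minimal usage sketch (the log line is invented for illustration):
import re

pattern = r'(?P<key>\w+)=\[(?P<values>[^\]]+)\]'
line = 'user=[alice] roles=[admin,editor] region=[eu-west]'
pairs = {m.group('key'): m.group('values').split(',') for m in re.finditer(pattern, line)}
print(pairs)  # {'user': ['alice'], 'roles': ['admin', 'editor'], 'region': ['eu-west']}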
Extract Domain & Path from URLs
pattern = r'^(?:https?://)?([^/]+)(/.*)?$'
How it works:
The non-capturing (?:…) group skips the protocol; groups 1 and 2 split host and path without extra parsing.
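A quick sketch with a made-up URL:
import re

pattern = r'^(?:https?://)?([^/]+)(/.*)?$'
m = re.match(pattern, 'https://example.com/docs/intro?tab=1')
print(m.group(1))  # example.com
print(m.group(2))  # /docs/intro?tab=1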
Cleanup Nested HTML Tags
pattern = r'<(?P<tag>\w+)[^>]*>(.*?)</(?P=tag)>'
How it works:
The (?P=tag) backreference ensures you only match a closing tag that corresponds to the opening tag.
Conditional Replacement with Callback
import re

def repl(m):
    return m.group(1).upper()

text = re.sub(r'item:(\w+)', repl, text)
How it works:
Passing a function to re.sub lets you dynamically compute the replacement text for each match.
Time-Stamped Log Segmentation
pattern = r'(?ms)^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (?P<msg>.*?)(?=^\[|\Z)'
How it works:
(?ms) sets MULTILINE and DOTALL; the lookahead (?=^\[|\Z) ends each message at the next timestamp or at end-of-text.
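A short sketch on an invented two-entry log shows how multi-line messages stay attached to their timestamp:
import re

pattern = r'(?ms)^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (?P<msg>.*?)(?=^\[|\Z)'
log = (
    '[2025-01-01 09:00:00] service started\n'
    'traceback line 1\n'
    '[2025-01-01 09:05:00] request handled\n'
)
for m in re.finditer(pattern, log):
    print(m.group('ts'), '->', m.group('msg').strip())  # the first message keeps its traceback line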
Code Block Extraction from Markdown
pattern = r'(?s)```(?P<lang>\w+)\n(?P<code>.*?)```'
How it works:
Captures fenced code blocks by language and content; the (?s) flag lets the lazy .*? span newlines and stop at the next fence.
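A usage sketch on invented Markdown (note the (?s) flag added above so the code body may span lines):
import re

pattern = r'(?s)```(?P<lang>\w+)\n(?P<code>.*?)```'
md = "Intro\n```python\nprint('hi')\n```\nprose\n```sql\nSELECT 1;\n```"
for m in re.finditer(pattern, md):
    print(m.group('lang'), '->', m.group('code').strip())
# python -> print('hi')
# sql -> SELECT 1;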
Demarcate XML Sections Without Parsing
pattern = r'(?s)<item>.*?</item>'
How it works:
The single-line flag s makes . match newlines, so you grab entire <item>…</item> blocks in one go.
Match Balanced Parentheses (One Level)
pattern = r"\((?:[^()]+|\([^()]*\))*\)"
How it works:
Uses a non-capturing group with alternation to allow either non-parenthesis characters or a single nested pair, matching balanced (...) blocks one level deep.
Extract HTML Tag Attributes
pattern = r'(?P<tag><(\w+)([^>]*)>)|(?P<attr>\b\w+=(?:"[^"]*"|\'[^\']*\'))'
How it works:
Captures the full tag and, separately, each key="value" or key='value' attribute via named groups.
Find Duplicate Words, Case-Insensitive
pattern = r"(?i)\b(\w+)\s+\1\b"
How it works:
(?i) ignores case; the \1 backreference ensures the same word appears twice in a row.
Split on Multiple Delimiters
pattern = r"[,\s;|]+"
How it works:
The character class [,\s;|] matches commas, whitespace, semicolons, or pipes; the + collapses consecutive delimiters into a single split.
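A one-line usage sketch:
import re

print(re.split(r"[,\s;|]+", "a, b;c | d  e"))  # ['a', 'b', 'c', 'd', 'e']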
Validate IPv4 Addresses
pattern = r"\b(?:(?:25[0-5]|2[0-4]\d|[01]?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d?\d)\b"
How it works:
Ensures each octet is 0–255 by matching 25[0-5], 2[0-4]\d, or [01]?\d?\d (0–199), repeated four times with dots.
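A validation sketch with re.fullmatch (the \b anchors can be dropped because fullmatch already pins both ends):
import re

ip_re = re.compile(r"(?:(?:25[0-5]|2[0-4]\d|[01]?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d?\d)")
print(bool(ip_re.fullmatch('192.168.0.255')))  # True
print(bool(ip_re.fullmatch('256.1.1.1')))      # False -- 256 is out of range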
Named Group Verbose Mode
pattern = re.compile(r"""
^ # start of string
(?P<year>\d{4})[-/] # year and separator
(?P<month>0[1-9]|1[0-2])[-/]
(?P<day>0[1-9]|[12]\d|3[01]) # day
$ # end of string
""", re.VERBOSE)
How it works:
re.VERBOSE lets you annotate each part with comments and whitespace for clarity.
Atomic Group to Prevent Backtracking
pattern = r"(?>(?:\w+\.)+)\w+@\w+\.\w+"
How it works:
The atomic group (?>(?:\w+\.)+) locks in the dot-separated segments, preventing the engine from backtracking into them and guarding against pathological slowdowns on adversarial input. Python’s re module supports atomic groups from version 3.11; the third-party regex package supports them as well.
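A minimal usage sketch (atomic groups need Python 3.11+ in the standard re module; the sample address is made up):
import re  # requires Python 3.11+ for atomic groups

pattern = re.compile(r"(?>(?:\w+\.)+)\w+@\w+\.\w+")
print(pattern.search('contact: first.last.dev@example.com').group())  # first.last.dev@example.com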
Love this article? Comment and share it with your network!
If you're at a pivotal point in your career or sitting on skills you're unsure how to use, I offer 1:1 mentorship.
It's personal, flexible, and built around you.
For long-term mentorship, visit me here (you can even enjoy a 7-day free trial).