Non-Brand Data

Stock API: What it is and Why it Matters

Cornellius Yudha Wijaya — Sun, 26 Jul 2026 16:32:54 GMT

The hard part of using a stock API is not getting the data. It is choosing the provider, and you live with that choice long after the integration is done.

This is why I want to walk through what a stock API is, the two routes to its data, and how I decide whether a provider is worth building on. This article is for two types of readers: the quantitative investor who wants clean history and reliable backtests, and the AI developer who wants an agent that can reach the markets on its own.

One note before I start. This is not a ranking of every provider, and it is not a buyer’s guide. It is the way I think through the decision, with Alpha Vantage as the running example, because it offers both a REST API and a mature MCP server over the same data.

Curious about it? Let’s get into it.


Why does the choice matter more than the code?

Connecting to a stock API is usually simple: get a key, call an endpoint, and parse the response. Because that part is so easy, it is tempting to treat the choice of provider as a detail you can settle later.

In my experience, that is backwards. The real problems rarely show up in the integration code. For example, a free tier capped at a few requests a day stalls a research session halfway through an idea, and a feed with no tool interface leaves you hand-building the adapter an agent needs before it can call anything.

None of these are coding problems. Each one traces back to the provider you chose. That is why I spend more time choosing a provider than integrating one. Before judging one, though, it helps to be clear on what a stock API actually gives you.

What is a stock API?

A stock API is an HTTP service that returns structured market data, usually as JSON or CSV. You send parameters such as a ticker, date range, and interval, and receive data in a predictable format.

Unlike a finance website, an API is built for repeatable access. The same request can run across hundreds or thousands of tickers while returning the same structure each time. That consistency makes it useful for backtests, dashboards, and other automated workflows.

There are two common callers. Traditional code calls the endpoint and parses the response. An AI agent can instead call the API through a tool such as an MCP server. The data source is the same; the access layer is different.

How do you reach the data: REST requests and the MCP layer?

There are two ways to reach the data. The first is the REST request, the route a traditional integration uses.

Providers build it two ways: some give each kind of data its own URL, while Alpha Vantage keeps a single URL and lets a function parameter choose the operation. A daily price request looks like this:

GET https://www.alphavantage.co/query
    ?function=TIME_SERIES_DAILY
    &symbol=IBM
    &apikey=YOUR_KEY

Only the function value changes from call to call. Swap TIME_SERIES_DAILY for another function and the same URL returns a live quote or a company overview instead. I like this design because one endpoint is simpler to document and maintain.
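As a rough illustration of that single-endpoint design, the sketch below builds the request URL for three functions. TIME_SERIES_DAILY comes from the request above; GLOBAL_QUOTE and OVERVIEW are the names I assume for the live quote and the company overview, so check the provider's documentation before relying on them. build_url is a hypothetical helper, not part of any SDK.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.alphavantage.co/query"


def build_url(function: str, symbol: str, api_key: str = "YOUR_KEY") -> str:
    """Return the full request URL; only `function` changes between calls."""
    query = urlencode({"function": function, "symbol": symbol, "apikey": api_key})
    return f"{BASE_URL}?{query}"


# Function names other than TIME_SERIES_DAILY are assumptions for illustration.
for function in ("TIME_SERIES_DAILY", "GLOBAL_QUOTE", "OVERVIEW"):
    print(build_url(function, "IBM"))
```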

The second route is built for the agent, and it uses the Model Context Protocol (MCP), an open standard. The provider runs a server that lists its tools. The agent reads that list with tools/list, then calls the one it needs with tools/call, filling in the arguments from the conversation. The server runs the query and returns the result.

The exchange looks like this:

// 1. Agent asks what the server can do
{"method": "tools/list"}

// 2. Server responds with a typed catalog
{
  "tools": [{
    "name": "get_daily_time_series",
    "description": "Daily OHLCV for a listed equity",
    "inputSchema": {
      "type": "object",
      "properties": {
        "symbol": {"type": "string"},
        "outputsize": {"type": "string", "enum": ["compact", "full"]}
      },
      "required": ["symbol"]
    }
  }]
}

// 3. Agent calls the tool with arguments the model filled in
{
  "method": "tools/call",
  "params": {
    "name": "get_daily_time_series",
    "arguments": {"symbol": "IBM", "outputsize": "compact"}
  }
}
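To make step 3 concrete, here is a minimal sketch of what a client might do with the catalog entry before sending tools/call: check the arguments against the tool's inputSchema, then wrap the call in a JSON-RPC 2.0 envelope, which is the message framing MCP uses. The check_arguments and build_call helpers are hypothetical, and the validation covers only required fields and enum values, not the full JSON Schema.

```python
import json

# Catalog entry copied from the tools/list response shown above.
TOOL = {
    "name": "get_daily_time_series",
    "inputSchema": {
        "type": "object",
        "properties": {
            "symbol": {"type": "string"},
            "outputsize": {"type": "string", "enum": ["compact", "full"]},
        },
        "required": ["symbol"],
    },
}


def check_arguments(tool: dict, arguments: dict) -> None:
    """Raise ValueError if a required field is missing or an enum is violated."""
    schema = tool["inputSchema"]
    for name in schema.get("required", []):
        if name not in arguments:
            raise ValueError(f"missing required argument: {name}")
    for name, value in arguments.items():
        spec = schema["properties"].get(name)
        if spec is None:
            raise ValueError(f"unknown argument: {name}")
        if "enum" in spec and value not in spec["enum"]:
            raise ValueError(f"{name} must be one of {spec['enum']}")


def build_call(tool: dict, arguments: dict, request_id: int = 1) -> str:
    """Return a tools/call request as JSON-RPC 2.0 text."""
    check_arguments(tool, arguments)
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool["name"], "arguments": arguments},
    })


print(build_call(TOOL, {"symbol": "IBM", "outputsize": "compact"}))
```

Rejecting a bad argument on the client side costs nothing, whereas sending it costs a round trip and possibly a rate-limited request.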

Alpha Vantage runs an official MCP server with more than 100 tools across nine categories. Each maps to a single REST function, so the same request returns the same data whichever route you take:

Core stocks: quotes and time series
Options: real-time chains with Greeks
Fundamentals: statements, overviews, and earnings
Forex: spot and historical exchange rates
Cryptocurrencies: digital asset prices and time series
Commodities: oil, metals, and agricultural prices
Economic indicators: GDP, CPI, and treasury yields
Technical indicators: more than 50 tools from SMA to MACD
Alpha Intelligence: news sentiment, earnings transcripts, and insider transactions

It also sits in Anthropic’s Claude Connectors directory. That means it installs as a tool source across editors and chat apps like Claude and Cursor, instead of being tied to one app.

Both routes return the same data. The only difference is who makes the call:

Figure 1. Differences between REST request and MCP layer

A good MCP server now matters as much as a good REST API.

It used to be enough to check that the REST docs were solid. Now I also ask whether the provider runs an MCP server and whether its tools match the REST API.

An agent calling a typed tool makes fewer mistakes and burns fewer tokens than one making raw HTTP calls from hand-written prompts. Both routes pull the same data, so the next question is what data is on offer.

Which categories of market data matter?

Market data divides into five broad categories, and most full-featured APIs cover a version of each. Know which one your project requires, because that is where a provider’s limits show first.

Quotes and real-time prices are the simplest: what a stock trades at right now. Free tiers usually delay that by about fifteen minutes, which is fine for some jobs and useless for others. A live feed costs more, because it is licensed and needs a broker or exchange agreement that most retail APIs skip.
Next comes intraday and historical time series: the open, high, low, close, and volume at intervals from a minute to a month. This is the backbone of backtesting, and the one I check hardest, because a strategy is only as reliable as its history.
Fundamentals cover the financial statements and valuation ratios behind screeners and valuation models. They matter the moment a strategy depends on more than price.
Then there are technical indicators like moving averages and RSI. Alpha Vantage alone exposes more than fifty as ready-made tools. Providers differ on delivery: some compute the indicator on the server, so you request it like a quote, while others hand you raw prices to compute yourself with TA-Lib or pandas-ta.
Last is news, macro, and alternative data: headlines, sentiment scores, and economic releases. It is far less standardized than price data, so it is usually the first thing a smaller provider drops.
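For the compute-it-yourself case, a simple moving average needs no library at all. The sketch below is plain Python over a list of closing prices. The prices are made up, and a real pipeline would more likely use TA-Lib or pandas-ta as mentioned above.

```python
def simple_moving_average(closes: list[float], window: int) -> list[float]:
    """Return the SMA for each position that has a full window behind it."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    running_total = 0.0
    for index, price in enumerate(closes):
        running_total += price
        if index >= window:
            running_total -= closes[index - window]  # drop the oldest price
        if index >= window - 1:
            averages.append(running_total / window)
    return averages


closes = [10.0, 11.0, 12.0, 13.0, 14.0]  # invented prices
print(simple_moving_average(closes, window=3))  # [11.0, 12.0, 13.0]
```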

Work out what you need, and the right provider becomes clearer.

How do you call a stock API the right way?

A reliable traditional integration can still be concise. Here is one:

import requests

API_KEY = "YOUR_KEY"
BASE_URL = "https://www.alphavantage.co/query"

def get_daily_prices(symbol: str) -> dict:
    params = {
        "function": "TIME_SERIES_DAILY",
        "symbol": symbol,
        "apikey": API_KEY,
        "outputsize": "compact",
    }
    # timeout keeps a stalled connection from hanging the caller
    response = requests.get(BASE_URL, params=params, timeout=10)
    response.raise_for_status()
    data = response.json()

    # Problems can arrive inside a successful HTTP response, so check the body too
    if "Error Message" in data:
        raise ValueError(f"Bad request: {data['Error Message']}")
    if "Note" in data:
        raise RuntimeError("Rate limit hit, back off and retry")

    return data["Time Series (Daily)"]

Three checks matter here: Error Message catches bad requests, Note catches rate limits, and timeout prevents stalled requests.
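The RuntimeError raised on a rate limit tells the caller to back off and retry. One way to act on that, sketched under my own assumptions, is a small wrapper with exponential backoff. with_backoff is a hypothetical helper, and the attempt count and delays are illustrative, not provider guidance.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(
    call: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on RuntimeError with delays of 1s, 2s, 4s, ..."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except RuntimeError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the failure
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

With the function above, a call would look like with_backoff(lambda: get_daily_prices("IBM")). A ValueError from a bad request is deliberately not retried, because sending the same bad request again will not fix it.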

An agent needs the same safeguards, but they move into the MCP server. The server should handle errors and rate limits clearly, and cache repeated requests when appropriate. That is the difference between an MCP server that exists and one that is production-ready.
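Caching is the least obvious of those three. As a sketch only, a server-side cache can be as small as a dictionary keyed by tool name and arguments, with an expiry time per entry. TTLCache and the 60-second default are my own illustration, not how any particular MCP server is built.

```python
import json
import time
from typing import Any, Callable


class TTLCache:
    """Cache tool results briefly so repeated calls skip the upstream API."""

    def __init__(self, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, Any]] = {}

    def get_or_fetch(self, tool: str, arguments: dict,
                     fetch: Callable[[], Any]) -> Any:
        # sort_keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} share one key.
        key = tool + ":" + json.dumps(arguments, sort_keys=True)
        entry = self._entries.get(key)
        now = self._clock()
        if entry is not None and now < entry[0]:
            return entry[1]  # still fresh, no upstream call
        value = fetch()
        self._entries[key] = (now + self._ttl, value)
        return value
```

An agent often asks for the same series several times within one conversation, so even a short expiry can save a meaningful share of a small request budget.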

What does production-ready actually mean?

Production-ready comes down to six things: coverage, data quality, latency, rate limits, documentation and licensing, and AI-readiness.

Coverage means whether the provider supports the tickers, exchanges, and asset classes your strategy actually trades. Broad US coverage, for example, says little about another market.
Data quality is less visible. Adjusted prices matter because splits can distort raw series. Historical coverage matters too: missing delisted companies introduce survivorship bias, while gaps in thinly traded tickers can distort a backtest.
Latency depends on the strategy. End-of-day data may be enough for daily rebalancing, but not for market making. The question is whether the feed is fast enough for the decision it supports.
Rate limits determine how much data you can realistically pull. A low per-minute or daily cap can stop a job before it finishes, while bulk endpoints can reduce the number of calls needed.
Documentation should explain adjustments, coverage, and edge cases. Licensing matters just as much: personal and commercial use often come with different terms.
AI-readiness adds another layer. I look for an MCP server, how much of the REST API it exposes, OAuth support, and compatibility with tools such as Claude, Cursor, and VS Code. Without MCP support, teams need to build and maintain the adapter themselves.
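The split problem in the data-quality check is easy to show with numbers. In the sketch below, a 2-for-1 split halves the raw close overnight, which looks like a 50 percent crash until earlier prices are divided by the split ratio. The prices, the dates, and the adjust_for_split helper are invented for illustration, and real adjusted series usually account for dividends as well.

```python
def adjust_for_split(closes: list[tuple[str, float]],
                     split_date: str, ratio: float) -> list[tuple[str, float]]:
    """Divide every close before `split_date` by `ratio` (2.0 for a 2-for-1 split)."""
    # ISO dates (YYYY-MM-DD) compare correctly as plain strings.
    return [
        (day, close / ratio if day < split_date else close)
        for day, close in closes
    ]


raw = [
    ("2024-03-01", 200.0),
    ("2024-03-04", 204.0),
    ("2024-03-05", 102.0),  # first trading day after a 2-for-1 split
]
adjusted = adjust_for_split(raw, split_date="2024-03-05", ratio=2.0)

raw_return = raw[2][1] / raw[1][1] - 1                 # -0.5: a fake crash
adjusted_return = adjusted[2][1] / adjusted[1][1] - 1  # 0.0: nothing happened
print(raw_return, adjusted_return)
```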

Those six checks narrow the provider choice quickly.

How do you choose a stock data API?

When choosing a provider, I use five questions:

What data do I need: quotes, historical prices, fundamentals, indicators, news, or a mix?
How fresh must it be: real-time, delayed, or end-of-day?
What is my request volume?
Is the use personal or commercial?
Will an AI agent consume the data?

The last question changes the requirements. If an agent is involved, MCP support becomes part of the evaluation.

Before committing, test the free tier with your actual tickers and expected request volume. For agent use cases, test the MCP tools separately from the REST API. A good API does not automatically mean a good MCP implementation.
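Expected request volume is worth turning into a number before the test. The arithmetic below is a sketch with placeholder limits. Substitute the caps from the plan you are evaluating.

```python
import math


def pull_time(tickers: int, calls_per_ticker: int,
              per_minute_cap: int, daily_cap: int) -> tuple[int, int]:
    """Return (minutes of pulling per day, days needed) for one full refresh."""
    total_calls = tickers * calls_per_ticker
    days = math.ceil(total_calls / daily_cap)
    minutes = math.ceil(min(total_calls, daily_cap) / per_minute_cap)
    return minutes, days


# Placeholder figures: 500 tickers, one call each, 5 calls/min, 25 calls/day.
print(pull_time(500, 1, per_minute_cap=5, daily_cap=25))  # (5, 20)
```

If the answer is twenty days for a universe you want refreshed daily, the free tier has already told you what you needed to know.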

Applying these questions to the provider used throughout this article shows where it fits.

Where does Alpha Vantage fit?

Since I have leaned on Alpha Vantage throughout, it is fair to say where it fits and where it does not. As a best-overall pick, its home is quantitative research and agentic AI work, and because its REST API and MCP server draw on the same data, a research team and an AI team can share one provider instead of maintaining two integrations that drift apart.

That shared foundation is the one thing I would want an AI developer to take away. One integration serves both the backtests and the agents. The historical series a notebook pulls today is the same data an agent reaches through a tool call next week, when a user asks in plain language instead of writing a query.

It fits less well for ultra-low-latency trading, and that limit is worth stating plainly. Nothing returned through a shared REST or MCP layer will match a co-located feed. A strategy built on microsecond execution needs a co-located or institutional feed like Bloomberg, or a direct exchange line. No amount of MCP support closes that gap.

The takeaway

The main lesson is simple: the provider creates most of the long-term constraints, not the integration.

Before building, check the full contract around the data: coverage, quality, rate limits, licensing, pricing, and AI support. A REST client can be rewritten. An MCP adapter can be replaced. Swapping the provider underneath a working system is much harder.

That is why I would rather spend more time evaluating the provider upfront than discover its limits after the backtests, dashboards, or agents already depend on it.

For me, that is the real definition of production-ready: not just an API that works, but a provider whose limits are clear enough to build around.

I would be interested to hear what you check before committing to a market data provider, especially the checks that have caught problems early.

Manager Memo: Does Your Team Still Need SQL when AI writes it?

Cornellius Yudha Wijaya — Sun, 19 Jul 2026 06:54:09 GMT

The short answer

Building a Minimal Agent Harness in Python

Cornellius Yudha Wijaya — Thu, 16 Jul 2026 03:29:41 GMT

An agent harness is the code around an agent. It starts the work, waits for the answer, and handles failure, but it does not decide what the answer should be.

Imagine that your application asks an agent to summarize a document. On a normal run, the agent returns some text and the application continues. But the agent might crash. It might also wait forever for another service.

A normal Python function call leaves the application waiting at that line until the function returns. If the agent never returns, the code after the call never runs. The application therefore needs a way to stay in control while the agent works.

This is why this article will build that idea before looking at any files. Once the roles are clear, each part will become a small piece of Python code.

Curious about it? Let’s get into it.


The idea before the code

The first decision is to run the application and the agent as two separate programs. A running program is called a process. The operating system gives each process its own memory and can stop one without automatically stopping the other.

The application stays in the first process. It starts a second process when it has work for the agent. We call the second process the worker because its job is to perform one piece of agent work.

The first process now has another role. It starts the worker and watches for the result. We call it the supervisor because it can stop a failed worker and start a replacement. Worker and supervisor are names for these roles. They are not special Python features.

One request now follows this path:

1. The application asks the harness for an answer.

2. The harness starts a worker and sends it the request.

3. The worker gives the request to the agent code.

4. The worker sends the answer back to the harness.

5. If the answer takes too long, the harness stops the worker and reports an error.

This time limit is called a timeout. A timeout is the longest period the harness is willing to wait. The harness can enforce it because the harness is still running outside the worker.

The two-process design keeps the application in control. It also creates the next problem: the two processes no longer share the same memory. They need a way to exchange a request and a reply.

How the two processes exchange a message

When two pieces of code run in the same process, one function can pass a Python value directly to another. A dictionary is one kind of Python value. It stores named fields such as id and message. Separate processes cannot share that dictionary as a live value.

The operating system can connect the processes with a pipe. A pipe is a channel for moving data from one program to another. The harness writes at one end. The worker reads at the other end and uses a second connection for its reply.

The harness keeps track of the request. The worker runs the agent. Messages move between the processes through pipes.

A pipe carries bytes, which are the raw units a program uses for data. Both programs therefore need a message format that Python can turn into bytes and back again. This example uses JSON, a common text format for named fields and values.

The example puts one complete JSON message on each line. This format is called JSON Lines. The line break marks the end of one message, so the reader knows when it has received the complete request.

A request contains an id and a message. The id is a label for the request. The worker copies that label into its reply. The ok field says whether the reply contains a value or an error.

request {"id": "1", "message": "hello"}
reply {"id": "1", "ok": true, "value": "Agent received: hello"}
error {"id": "1", "ok": false, "error": "agent failed"}
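The framing rule, one JSON object per line, fits in two small functions. This sketch is mine rather than part of the four files built later, and encode_message and decode_message are hypothetical names.

```python
import json


def encode_message(message: dict) -> bytes:
    """Serialize one message as a single JSON line ending in a newline."""
    # json.dumps escapes newlines inside strings, so the only raw newline
    # in the output is the one that marks the end of the message.
    return (json.dumps(message) + "\n").encode("utf-8")


def decode_message(line: bytes) -> dict:
    """Parse one complete line back into a dictionary."""
    return json.loads(line.decode("utf-8"))


request = {"id": "1", "message": "hello\nworld"}
wire = encode_message(request)
assert wire.count(b"\n") == 1          # exactly one line on the wire
assert decode_message(wire) == request  # the embedded newline survives
```

The message text here contains a line break on purpose. It shows why the format is safe: the break inside the string is escaped, so the reader still sees exactly one line.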

In this beginner version, one worker handles only one request. Matching the id is still useful because it makes the message format clear. It also prepares the design for a later version that keeps workers alive.

We now have the full idea: two processes and a small message between them. Only now do we need to give the Python pieces filenames.

The four Python files

The example separates the design into four files. Each file has one job.

agent.py contains the code that produces an answer.
worker.py receives one request and calls the agent.
harness.py starts the worker and enforces the timeout.
main.py represents the application that wants an answer.

The filenames do not define the design. They make the roles easier to see in a small example. A larger application may organize the code differently while keeping the same separation.

We will build the files in the order that a request uses them. The agent behavior comes first because it is ordinary Python code.

1. The agent produces an answer

In Python, a class groups related behavior. The Agent class in this example has one method named reply(). A method is a function that belongs to a class. reply() receives a message and returns a short response.

Keeping the class this small matters. You can call reply() in a basic test without starting the rest of the system. You can also change the response logic without changing how the process is managed.

# agent.py
class Agent:
    def reply(self, message: str) -> str:
        return f"Agent received: {message}"

For now, reply() only shows which message it received. A real agent could call a model or use a tool here. The harness does not need to know how that answer is produced.

The class accepts a normal Python value. The worker is the piece that turns an incoming JSON message into that method call.

2. The worker connects the pipe to the agent

A command-line Python program has two standard text streams. Standard input is where text enters the program. Standard output is where the program writes its normal result.

The harness connects both streams to pipes when it starts the worker. The worker reads one line from standard input and converts the JSON text into a Python dictionary. It takes the message from that dictionary and passes it to Agent.reply().

If reply() returns normally, the worker creates a successful JSON reply. If Python reports an error, the except block receives that exception and creates an error reply instead. An exception is Python’s way of stopping the current operation and reporting what went wrong.

After building the reply, the worker writes one JSON line to standard output. Python may hold output briefly before sending it. flush() makes sure this line is sent at once. The worker then reaches the end of the file and exits.

# worker.py
import json
import sys

from agent import Agent

agent = Agent()
request = json.loads(sys.stdin.readline())

try:
    value = agent.reply(request["message"])
    reply = {
        "id": request["id"],
        "ok": True,
        "value": value,
    }
except Exception as exc:
    reply = {
        "id": request["id"],
        "ok": False,
        "error": str(exc),
    }

sys.stdout.write(json.dumps(reply) + "\n")
sys.stdout.flush()

Standard output is reserved. The harness expects every line on standard output to contain JSON. A normal print() would add an unexpected line and break the message format. Send logs to standard error, which is a separate stream for diagnostic messages.
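In practice that means passing file=sys.stderr to every diagnostic print in the worker. A short sketch, with a hypothetical log helper:

```python
import sys


def log(text: str) -> None:
    """Write a diagnostic line to standard error, leaving standard output clean."""
    print(f"[worker] {text}", file=sys.stderr, flush=True)


log("request received")             # goes to stderr, never seen by readline()
sys.stdout.write('{"ok": true}\n')  # the only line the harness will parse
sys.stdout.flush()
```

Because the harness shown later connects only standard input and standard output to pipes, the worker's standard error is inherited, and these lines appear in the terminal that started the application.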

The exception path works only when Python gives control back to the worker. An endless loop does not do that. Code that never returns cannot create its own error reply, so the harness must handle that case from outside the worker.

3. The harness controls the worker

The harness runs in the application process. For each request, it starts one worker and sends one JSON line. It then waits for one reply.

While it waits, the harness keeps the timeout. If the reply arrives in time, the harness returns the value to the application. If the time limit passes, it ends the worker process and reports an error.

The harness waits outside the worker. It can stop the worker when the timeout expires.

The example uses asyncio because starting a process and reading from a pipe both involve waiting. A function written with async def can pause at await until an operation finishes. During that pause, asyncio can run another part of the program.

asyncio.subprocess.PIPE tells Python to connect the worker’s standard input and standard output to the harness. Those connections give ask() a place to send the request and receive the reply.

With those ideas in place, the harness code is easier to read. The ask() function follows the same request path shown earlier.

# harness.py
import asyncio
import json
import sys
import uuid


async def ask(message: str, timeout: float = 5.0) -> str:
    process = await asyncio.create_subprocess_exec(
        sys.executable,
        "worker.py",
        stdin=asyncio.subprocess.PIPE,
        stdout=asyncio.subprocess.PIPE,
    )

    request = {
        "id": uuid.uuid4().hex,
        "message": message,
    }
    data = (json.dumps(request) + "\n").encode()

    process.stdin.write(data)
    await process.stdin.drain()

    try:
        line = await asyncio.wait_for(
            process.stdout.readline(),
            timeout,
        )
    except asyncio.TimeoutError:  # same class as TimeoutError on Python 3.11+
        process.kill()
        await process.wait()
        raise RuntimeError("agent timed out")

    await process.wait()
    if not line:
        raise RuntimeError("worker exited without a reply")

    reply = json.loads(line)
    if reply["id"] != request["id"]:
        raise RuntimeError("received the wrong reply")
    if not reply["ok"]:
        raise RuntimeError(reply["error"])

    return reply["value"]

Reading ask() from top to bottom

Start a new Python process that runs worker.py.
Create a request and turn it into one JSON line.
Write the line to the worker through standard input.
Wait for one reply until the timeout expires.
Check the reply and return the value or raise an error.

Starting the worker. asyncio.create_subprocess_exec() asks the operating system to start another program. sys.executable selects the same Python installation that is running the harness. The next argument tells Python to run worker.py.

Building the request. uuid.uuid4().hex creates a text id that is very unlikely to repeat. json.dumps() turns the request dictionary into JSON text. encode() changes that text into bytes for the pipe.

Sending and reading. drain() waits until Python has handed the request bytes to the pipe. readline() waits for one complete reply line. json.loads() changes the reply back into a Python dictionary.

Enforcing the timeout. asyncio.wait_for() stops waiting when the time limit passes and raises TimeoutError. Stopping the wait does not stop the worker. process.kill() is the line that ends it.

Cleaning up. process.wait() confirms that the worker has ended. The remaining checks reject a missing reply or the wrong id. RuntimeError passes those failures back to the application as normal Python exceptions.

The harness now has everything it needs to control one request. The last file shows what the application sees.

4. The application asks for an answer

main.py stands in for the rest of the application. It does not open pipes or create JSON. It calls ask() with a message and a time limit, then prints the returned answer.

# main.py
import asyncio

from harness import ask


async def main() -> None:
    answer = await ask("hello", timeout=2.0)
    print(answer)


asyncio.run(main())

Keep the four Python files in the same folder. Run python main.py from that folder. The normal result is Agent received: hello. The worker then closes because its one request is complete.

You can also test the failure path. Import time in agent.py, add time.sleep(10) at the start of Agent.reply(), and set the timeout in main.py to 0.5 seconds. The harness should end the worker and raise a RuntimeError that says agent timed out.

That test is more than a demonstration of the error message. It shows that the application can recover without the agent’s help. The timeout works because it belongs to the process that can still act.
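If you would rather see the kill in isolation, the sketch below reproduces it without the other three files. It starts a child Python process that only sleeps, waits for a line of output, and ends the child if the wait runs out. run_with_timeout is a hypothetical helper, and the inline -c script stands in for a stuck worker.

```python
import asyncio
import sys


async def run_with_timeout(seconds_to_sleep: float, timeout: float) -> str:
    """Start a child that sleeps, and kill it if no output arrives in time."""
    child_code = f"import time; time.sleep({seconds_to_sleep}); print('done')"
    process = await asyncio.create_subprocess_exec(
        sys.executable, "-c", child_code,
        stdout=asyncio.subprocess.PIPE,
    )
    try:
        line = await asyncio.wait_for(process.stdout.readline(), timeout)
    except asyncio.TimeoutError:
        process.kill()        # stopping the wait did not stop the child
        await process.wait()  # confirm the child has ended
        return f"killed, exit code {process.returncode}"
    await process.wait()
    return line.decode().strip()


print(asyncio.run(run_with_timeout(0.1, timeout=5.0)))   # child finishes in time
print(asyncio.run(run_with_timeout(10.0, timeout=0.5)))  # child is killed
```

The second call returns after roughly half a second, not ten. The exit code it reports depends on the operating system.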

What changes in a larger system

The one-request version is enough to explain the boundary. A production service may need to reuse workers and handle more than one request at a time. Those changes require more tracking, but they do not change the basic roles.

Reusing workers

Starting Python for every request adds delay. A longer-running service may keep several workers ready. This fixed group of workers is called a pool.

Each worker should still handle one request at a time. If four workers are ready, the service can run four requests. A fifth request waits or receives a busy response until a worker is free.
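The waiting behaviour can be sketched with an asyncio.Semaphore that hands out four slots. This is a simplification under my own assumptions: it caps how many requests run at once but does not keep workers alive between requests, and fake_ask stands in for the real ask() so the example runs on its own.

```python
import asyncio

MAX_WORKERS = 4
running = 0
peak = 0


async def fake_ask(message: str) -> str:
    """Stand-in for ask(): pretends the worker needs a moment to answer."""
    await asyncio.sleep(0.05)
    return f"Agent received: {message}"


async def limited_ask(slots: asyncio.Semaphore, message: str) -> str:
    global running, peak
    async with slots:  # a fifth caller waits here until a slot is free
        running += 1
        peak = max(peak, running)
        try:
            return await fake_ask(message)
        finally:
            running -= 1


async def main() -> list[str]:
    slots = asyncio.Semaphore(MAX_WORKERS)
    return await asyncio.gather(
        *(limited_ask(slots, f"request {n}") for n in range(10))
    )


answers = asyncio.run(main())
print(len(answers), "answers, at most", peak, "running at once")
```

To return a busy response instead of waiting, the caller could check slots.locked() before entering and raise an error when no slot is free.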

Recovering after a failure

A worker that crashes should not return to the pool. Replace it with a new process because you cannot know which in-memory values remain usable. The request that was running should receive an error instead of waiting forever.

A retry means sending the same request again after a failure. A retry can help when the cause was brief, but it needs a limit. Stop after a small number of attempts so one bad request cannot create an endless crash loop.

A reused worker also needs a state. State means the stage the worker is in now. It may be starting or ready. While a request runs it is busy. Once stopping begins it must not accept more work.

A reused worker moves through clear states. Its current state decides whether it can accept a request.
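One way to keep those rules explicit is an enum plus a table of allowed moves. The state names follow the paragraph above. The transition table and the PoolWorker class are my own sketch, not a fixed design.

```python
from enum import Enum


class State(Enum):
    STARTING = "starting"
    READY = "ready"
    BUSY = "busy"
    STOPPING = "stopping"


# Which moves are legal from each state (an assumption for this sketch).
ALLOWED = {
    State.STARTING: {State.READY, State.STOPPING},
    State.READY: {State.BUSY, State.STOPPING},
    State.BUSY: {State.READY, State.STOPPING},
    State.STOPPING: set(),
}


class PoolWorker:
    def __init__(self) -> None:
        self.state = State.STARTING

    def move_to(self, new_state: State) -> None:
        """Change state, refusing any move the table does not list."""
        if new_state not in ALLOWED[self.state]:
            raise RuntimeError(
                f"cannot go from {self.state.value} to {new_state.value}"
            )
        self.state = new_state

    def can_accept_request(self) -> bool:
        return self.state is State.READY
```

Putting the rules in a table means an illegal move, such as handing work to a stopping worker, fails loudly at the point where it is attempted.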

Permissions and untrusted code

A separate process isolates memory. It is not a security barrier. By default the worker may still read the application’s files and use its network access.

For an agent you trust, give the worker only the settings it needs. Run it with a user account that has limited file and network permissions. This reduces what a mistake can affect.
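Limiting settings can start with the environment. By default a child process inherits every environment variable of its parent, including keys it has no use for. The sketch below passes an explicit allowlist instead. The variable names and the allowlist are invented, and the set a child Python needs in order to start varies by platform.

```python
import os
import subprocess
import sys
from typing import Optional

os.environ["BILLING_API_KEY"] = "secret"  # pretend the application holds this

# Hypothetical allowlist: just enough for a child Python to start.
KEEP = ("PATH", "HOME", "LANG", "LD_LIBRARY_PATH", "SYSTEMROOT")
child_env = {name: os.environ[name] for name in KEEP if name in os.environ}

child_code = "import os; print(os.environ.get('BILLING_API_KEY'))"


def run_child(env: Optional[dict]) -> str:
    """Run a tiny child script and return what it printed."""
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        capture_output=True, text=True, check=True, env=env,
    )
    return result.stdout.strip()


print(run_child(None))       # inherits everything: secret
print(run_child(child_env))  # sees only the allowlist: None
```

asyncio.create_subprocess_exec() accepts the same env argument, so the harness can apply the allowlist where it starts the worker.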

For code you do not trust, use a container or another restricted environment. A container can limit which files the program sees and which network connections it may open. Those limits come from the container settings rather than from the Python harness.

These production controls are easier to add after the one-request design is clear. The same request format can cross into a container or a reused worker.

Conclusion

An agent harness is the control code around agent behavior. It keeps the application able to respond when agent work crashes or takes too long. The separate worker process gives the harness that control.

The Python files follow the design rather than defining it. agent.py contains the behavior. worker.py connects that behavior to messages. harness.py owns the process and timeout. main.py only asks for an answer.

Start with one worker and one request until that flow is reliable. Add pools and retries only when the simpler version is clear. Keep the timeout in the process that can stop the worker.

Why Valid SQL Returns Wrong Numbers

Cornellius Yudha Wijaya — Thu, 09 Jul 2026 14:37:27 GMT

A query that runs is not necessarily right. Running means the database understood you, nothing more. The syntax can be clean, and the number can still be wrong.

I have reviewed a lot of SQL, and most wrong numbers come from three places:

A join that inflates a total,
A filter that drops rows you meant to keep, and
A metric that two people define differently.

All three pass without errors, which is exactly why they get accepted as correct.

An agent makes all three more frequent. It writes and runs queries quickly, so more numbers come out with less checking behind each one. Typing the query is now the cheap step; confirming that the number is right still depends on your judgment.

Below are examples, including the query and the fix, on a small dataset you can try yourself:

SQL Fiddle

Well, curious about it? Let’s get into it.


1. Joins that inflate a total

You are asked for total revenue. You have orders and items, so you join them and add up the amount.

SELECT SUM(o.amount) AS revenue
FROM orders o
JOIN order_items i ON i.order_id = o.order_id; -- 1140

It returns 1140. The real total is 740, and 1140 is not obviously wrong, so it goes into the report.

A join repeats each row once for every match on the other side. Order 1 has three items, so the join turns it into three rows, each carrying the order’s amount of 100. Sum the column, and you add that 100 three times. Order 3, with two items, gets counted twice. Eight orders become eleven rows, and the amount for each order is added once per item.

Revenue lives at the order level, so add it there without the items:

SELECT SUM(amount) AS revenue FROM orders; -- 740

If you need the items in a single query, roll them up into one row per order first. Either way, count the rows before and after any join. Eight orders should not become eleven rows unless you meant them to.
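Here is that roll-up on a tiny invented dataset, run through Python's built-in sqlite3 module so it needs no database server. The three orders and their amounts are mine, not the fiddle's, so the totals differ from the 740 and 1140 above, but the inflation works the same way.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount INTEGER);
    CREATE TABLE order_items (order_id INTEGER, sku TEXT);
    INSERT INTO orders VALUES (1, 100), (2, 50), (3, 200);
    INSERT INTO order_items VALUES
        (1, 'a'), (1, 'b'), (1, 'c'),   -- order 1 has three items
        (2, 'd'),
        (3, 'e'), (3, 'f');             -- order 3 has two items
""")


def scalar(sql: str) -> int:
    """Run a query that returns a single value."""
    return db.execute(sql).fetchone()[0]


rows_before = scalar("SELECT COUNT(*) FROM orders")
rows_after = scalar(
    "SELECT COUNT(*) FROM orders o JOIN order_items i ON i.order_id = o.order_id"
)
inflated = scalar(
    "SELECT SUM(o.amount) FROM orders o "
    "JOIN order_items i ON i.order_id = o.order_id"
)
# Roll items up to one row per order first, then join: the grain is preserved.
rolled_up = scalar("""
    SELECT SUM(o.amount)
    FROM orders o
    JOIN (SELECT order_id, COUNT(*) AS item_count
          FROM order_items GROUP BY order_id) i
      ON i.order_id = o.order_id
""")

print(rows_before, rows_after)  # 3 6
print(inflated, rolled_up)      # 750 350
```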

Ask: Does the total change when you remove the join? If it does, the join is moving the number, and someone should be able to say why. Here it pushed the number too high.

The next bug does the opposite and is easier to miss.

2. Filters that drop rows you meant to keep

There are eight orders, and one is canceled, so “not canceled” should be seven. This returns six:

SELECT COUNT(*) FROM orders
WHERE order_status <> 'canceled'; -- 6

The missing row is the order with no status. When comparing anything to NULL, SQL returns neither true nor false but unknown, and WHERE keeps only what is true. So NULL <> ‘canceled’ is unknown; the row fails the test, and it drops out. You never chose to exclude it.

Dates break the same way, and they are harder to spot. Store timestamps in UTC, filter by a local date, and orders near midnight land on the wrong day. A filter for “all of January” in local time loses the orders that UTC has already pushed into February, and nothing warns you.
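The midnight problem is easier to believe once you see a single timestamp land on two different dates. The sketch uses a fixed UTC-5 offset as a stand-in for a local zone, and the timestamp is invented.

```python
from datetime import datetime, timedelta, timezone

LOCAL = timezone(timedelta(hours=-5))  # assumed local zone: fixed UTC-5

# An order placed at 20:00 local time on 31 January...
local_time = datetime(2026, 1, 31, 20, 0, tzinfo=LOCAL)
# ...is stored in UTC, where it is already 1 February.
stored_utc = local_time.astimezone(timezone.utc)

print(local_time.date())  # 2026-01-31
print(stored_utc.date())  # 2026-02-01

# A January filter applied to the stored value drops the order.
in_january_as_stored = stored_utc.month == 1                 # False
in_january_local = stored_utc.astimezone(LOCAL).month == 1   # True
print(in_january_as_stored, in_january_local)
```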

The fix is to decide what happens to the missing value yourself, instead of letting the comparison decide:

SELECT COUNT(*) FROM orders
WHERE order_status IS DISTINCT FROM 'canceled'; -- 7

IS DISTINCT FROM compares NULL as a real value, so the row stays. For dates, filter in the zone where the data is stored. In both cases, look at how many rows you kept and dropped before you trust the count.
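The same behaviour is easy to reproduce with Python's built-in sqlite3 module. The four rows are invented. Because IS DISTINCT FROM is not available in every database or version, the sketch spells the intent out with an explicit IS NULL check, which keeps the same rows.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_status TEXT);
    INSERT INTO orders VALUES
        (1, 'delivered'), (2, 'canceled'), (3, 'delivered'), (4, NULL);
""")


def count(where: str) -> int:
    """Count the rows that pass the given WHERE condition."""
    return db.execute(f"SELECT COUNT(*) FROM orders WHERE {where}").fetchone()[0]


silent_drop = count("order_status <> 'canceled'")
explicit = count("order_status IS NULL OR order_status <> 'canceled'")

print(silent_drop)  # 2: the NULL row vanished
print(explicit)     # 3: the NULL row is kept on purpose
```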

Ask: What did the filter keep and drop, and was the missing case handled intentionally?

The first two bugs were caused by SQL doing something you did not expect. The last one is stranger. The SQL does exactly what each person asked for, and they still disagree.

3. Metrics that two people define differently

Two analysts run the same data and report two numbers: 60 percent and 20 percent. Both wrote correct SQL.

“Repeat purchase rate” sounds precise, and it is not. It hides three choices, and the two analysts made them differently. The first counts any customer with two or more orders, across all customers:

SELECT ROUND(100.0 * COUNT(*) FILTER (WHERE orders_count >= 2)
 / COUNT(*), 0) AS repeat_rate_pct
FROM (SELECT customer_id, COUNT(*) AS orders_count
 FROM orders GROUP BY customer_id) c; -- 60

Three of the five customers have a second order, so 60 percent. The other analyst set a higher bar: two delivered orders inside 90 days, over everyone who ordered.

WITH delivered AS (
 SELECT customer_id, order_ts
 FROM orders WHERE order_status = 'delivered'
),
repeat_customers AS (
 SELECT d1.customer_id
 FROM delivered d1
 JOIN delivered d2
 ON d1.customer_id = d2.customer_id
 AND d2.order_ts > d1.order_ts
 AND d2.order_ts <= d1.order_ts + INTERVAL '90 days'
 GROUP BY d1.customer_id
)
SELECT ROUND(100.0 * COUNT(DISTINCT customer_id)
 / (SELECT COUNT(DISTINCT customer_id) FROM orders), 0) AS repeat_rate_pct
FROM repeat_customers; -- 20

Now one customer qualifies. Of the three with two orders, one had a canceled second order, one ordered again more than 90 days later, and one had two delivered orders close together. So 20 percent.

The data and the skills are identical, yet the two results differ by a factor of three, and neither analyst made an error. The gap comes from three choices that nobody wrote down: who counts as a customer, what counts as a purchase, and which time window applies. A tool cannot resolve these, because they are business decisions, not database ones.
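
The same split can be reproduced outside SQL. This is a sketch on hypothetical orders that I made up to match the description above: five customers, three of them with a second order. It is not the article's actual table, and the function names are mine.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical orders: (customer_id, order_date, order_status).
orders = [
    ("A", date(2026, 1, 5), "delivered"),
    ("A", date(2026, 1, 20), "delivered"),
    ("B", date(2026, 1, 10), "delivered"),
    ("B", date(2026, 1, 25), "canceled"),
    ("C", date(2026, 1, 3), "delivered"),
    ("C", date(2026, 6, 1), "delivered"),
    ("D", date(2026, 2, 2), "delivered"),
    ("E", date(2026, 3, 9), "delivered"),
]

by_customer = defaultdict(list)
for customer_id, order_date, status in orders:
    by_customer[customer_id].append((order_date, status))


def rate_any_two_orders(customers):
    """Definition 1: 2+ orders of any status, over all customers."""
    repeat = sum(1 for rows in customers.values() if len(rows) >= 2)
    return round(100 * repeat / len(customers))


def rate_two_delivered_in_window(customers, window=timedelta(days=90)):
    """Definition 2: two delivered orders at most `window` apart."""
    repeat = 0
    for rows in customers.values():
        delivered = sorted(d for d, status in rows if status == "delivered")
        # Neighbouring delivered orders are the closest pairs, so it is
        # enough to check each order against the next one.
        pairs = zip(delivered, delivered[1:])
        if any(first < second <= first + window for first, second in pairs):
            repeat += 1
    return round(100 * repeat / len(customers))


print(rate_any_two_orders(by_customer))           # 60
print(rate_two_delivered_in_window(by_customer))  # 20
```

Each docstring is the one-sentence definition. Writing those two sentences side by side exposes the disagreement before anyone compares numbers.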

The fix is simple but effective: write the metric as a single sentence that includes its denominator and window before touching SQL. If two people write different sentences, that is the real disagreement, and you found it early.

Ask: What is the definition, in one sentence, including who counts, over what window, and what is counted?

What to check before you trust a number

Most wrong numbers fail in one of three places, so those are the three to check.

The first is the grain: what one row represents, and whether a join changed it. The second is the filter: which rows it kept, which it dropped, and how it handled the missing values. The third is the definition: what the metric actually means, written as one sentence that two people would agree on.
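
The grain check is the easiest of the three to automate. Below is a minimal sketch with sqlite3 on two hypothetical tables, orders and order_items. The names and values are assumptions for illustration. It compares the row count and a total before and after the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_total REAL);
CREATE TABLE order_items (order_id INTEGER, sku TEXT);
INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
INSERT INTO order_items VALUES (1, 'a'), (1, 'b'), (2, 'c');
""")

JOINED = "FROM orders o JOIN order_items i ON i.order_id = o.order_id"


def scalar(sql):
    """Run a query that returns one value and return that value."""
    return conn.execute(sql).fetchone()[0]


rows_before = scalar("SELECT COUNT(*) FROM orders")
rows_after = scalar("SELECT COUNT(*) " + JOINED)
revenue_before = scalar("SELECT SUM(order_total) FROM orders")
revenue_after = scalar("SELECT SUM(o.order_total) " + JOINED)

# One row per order became one row per item, so order 1 is counted twice.
grain_changed = rows_after != rows_before
print(rows_before, rows_after, revenue_before, revenue_after, grain_changed)
# 2 3 150.0 250.0 True
```

If the row count changes after a join that was meant only to add columns, the grain changed, and any sum taken after it is suspect.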

The point is not to write SQL faster now that it is cheap to produce. It is to know where a query that looks correct can still fail, and to check those places before the number reaches a decision. An agent returns results faster than you could by hand, but it will not tell you when a result is wrong.

Don’t forget that we will have our certified SQL courses soon! Meanwhile, you can also visit our previous SQL written courses.

My book is out: Python Data Analysis, Fourth Edition

Cornellius Yudha Wijaya — Sat, 04 Jul 2026 10:12:57 GMT

Python Data Analysis, Fourth Edition, which I wrote with my co-author, is out today. You can get it here:

Book Link

Before I describe what’s in it, I want to write down what the year of working on it was like.

The work

My co-author and I started this over a year ago. I’ve worked in data for years and I know the material, so I expected the writing to be straightforward. It wasn’t.

Knowing something and explaining it in a book are different jobs. Every example has to run. Every explanation has to make sense to someone reading it for the first time. And you have to keep doing that, chapter after chapter, for months. This book has 17 of them.

Writing with anxiety

I’ll be honest about this part. There were stretches of the year when my anxiety was high. Some days, opening the manuscript felt harder than it should have.

What worked was not waiting to feel better before writing. On bad days I wrote a paragraph, sometimes a bad one, and fixed it later. The manuscript kept moving, and imperfect pages slowly became a book.

If you’re building something while dealing with your own anxiety, this is the only advice I have: keep the work moving, even in small pieces.

The collaboration

The other thing that carried this book was working with a co-author. When one of us was stuck, the other pushed forward. When one of us wrote something unclear, the other caught it. We each brought different expertise, and the book shows it.

What’s in the book

The full title is Python Data Analysis: Master Python Analytics with Machine Learning, Deep Learning, GenAI, LLMs, and Data Engineering.

Modern data analysis goes beyond cleaning and visualizing data. Practitioners today build scalable pipelines, apply machine learning, work with text and image data, and use Generative AI and LLMs. Rather than focus on a single library, the book covers the whole workflow:

Foundations: Python libraries, NumPy, pandas, statistics, linear algebra, and visualization
Data work: retrieving, processing, storing, and cleaning messy data
Time-series analysis, forecasting, and signal processing
Supervised and unsupervised learning: regression, classification, clustering, dimensionality reduction, and anomaly detection
Ensemble methods, neural networks, and deep learning
Text and image analytics, including NLP and sentiment analysis
LLMs and Generative AI
Scaling with Dask, Modin, Ray, and PySpark

By the end, you can build end-to-end data analysis pipelines and apply modern data science and AI techniques to real problems.

It’s written for data analysts, data scientists, business analysts, statisticians, and students. You’ll need basic math and working knowledge of Python.

Thank you

Get the book here:

Book Link

If you read it, tell me what you think. I read every reply and comment.

Fix it or throw it away: deciding what to do with an AI output

Cornellius Yudha Wijaya — Tue, 30 Jun 2026 15:33:57 GMT

An AI tool writes a summary of last quarter’s sales. Most numbers match the source data, but one sentence says a marketing campaign “drove” the revenue lift, and nothing supports that. Fixing it takes thirty seconds: soften the wording or cut the sentence.

Now picture the same summary, just as fluent, except it used the wrong table the whole time. It used gross revenue when the question was about net. Every figure agrees with the others, and every figure is wrong.

Both drafts look the same on screen. One needs a light edit. The other needs to be deleted. Telling those two cases apart, fast, is the skill. The decision is always one of three: use, revise, or reject. What matters is the test underneath it, because that test is where most reviewers go wrong.

The three options

Every output you review ends in one of three actions.

Use. It’s good enough for its next step, even if some polish is left.
Revise. The foundation is sound, but there are fixable flaws worth correcting.
Reject. The foundation itself is broken, so fixing it means rebuilding from scratch.

It’s three actions, not a score. A score tells you how good something looks, not what to do with it. A draft can look like a nine and still belong in the trash, because the one wrong part is what everything else rests on.

So the question is never “how good is this?” It’s: can the problem be fixed, or does it sink the whole thing?

Small flaws and broken foundations

Here is the rule the rest of this article depends on.

Revise when the foundation is sound but the work is incomplete. Reject when the foundation itself is wrong.

A small flaw is one you can fix without touching the core logic, data, or reasoning. Typos, a missing citation when you know the source, a wrong chart label, a caveat that doesn’t change the conclusion, awkward variable names. None of these threaten what’s underneath. You fix them and move on.

A broken foundation invalidates the whole output. The wrong dataset or metric. A source that doesn’t say what’s claimed, or doesn’t exist. Reasoning you can’t trace. Code that runs the wrong algorithm or hardcodes a credential. An answer to a different question than the one asked. You can’t patch these. Every line below the flaw inherits it.

The trap: how bad a flaw is has little to do with how big the fix looks. A broken foundation can be a one-character change, like swapping gross for net. A small flaw can mean rewriting three paragraphs. People reach for reject when the fix looks expensive and use when it looks cheap. That’s backwards. The question isn’t how much text changes. It’s whether the wrong part holds up everything else.

Two checks before you score anything

You can triage most outputs in under a minute with two questions, before any detailed review.

First, is there an automatic reject condition? Invented citations, confidential or out-of-scope data, a policy violation, a badly wrong headline number. If yes, stop and reject. No point scoring the writing on a draft that already failed.

Second, can you verify it? Can you re-run the code, recompute the figure, open every cited source, trace the retrieved context? If you can’t check a claim that matters, don’t approve it because it reads well. These models produce text that sounds right, which is not the same as being right.

Only drafts that pass both checks are worth a full use-versus-revise read.

The same logic in code:

if has_critical_violation(output):
    decision = "Reject"
elif not is_verifiable(output):
    decision = "Reject"
else:
    decision = assess(output)   # now decide Use vs Revise
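
The snippet above leaves the three helper functions undefined. Here is one way to fill them in so the gate runs end to end. The Output record and its field names are hypothetical. In practice each helper would wrap whatever checks your team already has.

```python
from dataclasses import dataclass, field


@dataclass
class Output:
    """Hypothetical review record; the field names are illustrative."""
    violations: list = field(default_factory=list)         # e.g. invented citation
    unverified_claims: list = field(default_factory=list)  # claims you could not check
    fixable_flaws: list = field(default_factory=list)      # typos, missing caveats


def has_critical_violation(output):
    return bool(output.violations)


def is_verifiable(output):
    return not output.unverified_claims


def assess(output):
    # Reached only after both gates pass, so the foundation is sound.
    return "Revise" if output.fixable_flaws else "Use"


def triage(output):
    if has_critical_violation(output):
        return "Reject"
    if not is_verifiable(output):
        return "Reject"
    return assess(output)


print(triage(Output(fixable_flaws=["wrong chart label"])))   # Revise
print(triage(Output(violations=["invented citation"],
                    fixable_flaws=["wrong chart label"])))   # Reject
```

The second call is the case that matters. The fixable flaw is still there, but the gate never reaches it, because the draft failed before any scoring began.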

What “good enough” means

Use doesn’t mean perfect. It means good enough for the next step, and that bar moves with where the output is going. The same draft can be a use as a private note and a revise as a summary you send to a client. Code that’s fine in a scratch notebook needs real review before production. The bar is highest when an output triggers an action on its own, like sending a notification, because no one stands between it and the result.

So name the next step before you decide. An output is a use only when no remaining problem is big enough to block that step.

Triage by type of output

The line between a small flaw and a broken foundation looks different by output type. Concrete triggers help.

Data analysis. Use it when the dataset and definitions are right, the numbers reconcile, and the logic matches the question. Revise when a causal claim overreaches or a caveat is missing, since that’s fixable with a note while the core data and formulas stay sound. Reject when the metric, grain, or population is wrong, or the question was misread, because every conclusion is then suspect. Wrong data, population, or metric is always a reject.

Code generation. Use it when it runs, passes tests, follows your conventions, and solves the right problem. Revise when it works but needs error handling, better names, or more test coverage. Reject when it uses the wrong algorithm, violates security, hardcodes secrets, or won’t run. “Looks correct” isn’t enough; run it before you trust it.

Retrieved answers. Use it when every factual claim maps to a retrieved passage and the system says “I don’t know” when evidence is thin. Revise when the right documents were pulled but the wording is off or one relevant passage was missed. Reject when it invents facts, cites wrong or fabricated sources, or answers off-context. Keep retrieval and generation separate: if the context is wrong, reject the answer and fix the retrieval.

Dashboards and BI. Use it when the numbers match approved sources, periods and KPIs are consistent, and the interpretation is careful. Revise when the figures are right but the narrative misses a comparison or footnote. Reject when a headline number can’t be traced or filters mix fiscal periods. One wrong headline number spoils the report.

The pattern holds across all four. Style and polish are revise work. Broken substance is reject work. When the wrong part is in the foundation, the data, the retrieval, the algorithm, or the question itself, the answer is reject, no matter how clean the surface looks.

The mistake to watch: editing what you should reject

The expensive mistake isn’t rejecting too much. It’s revising what you should have rejected.

Each round of edits makes a flawed output look better. The writing tightens, the code reads cleaner, and all of it polishes a wrong foundation into something more convincing and harder to catch. A run of reasonable fixes can keep a big error in place and make it look authoritative.

If 90% of a report is good but the core assumption is wrong, reject it. If 10% of otherwise-sound code is missing, revise it. The share that changes is a distraction. What matters is whether you’re polishing something true or dressing up something false.

A simple check: by the third edit on the same output, stop and ask whether you’re fixing flaws or hiding one.

Conclusion

The framework is small on purpose: three outcomes and one test underneath them. When an AI output lands on your desk, don’t grade it. Ask whether the thing that’s wrong sits in the foundation or on the surface.

Surface flaws are revise work. A broken foundation is a reject, however polished it looks, and no amount of editing will fix it. Run the two gates first, name the next step before you decide, and keep a short trail so the same problem doesn’t come back.

Stop asking “how good is this output?” Ask “can the problem be fixed, or does it sink the whole thing?” Use, revise, and reject are the three answers, and getting the question right is most of the skill.

Why Most GenAI Workflows Need a Review Loop

Cornellius Yudha Wijaya — Wed, 24 Jun 2026 15:09:43 GMT

Most teams spend their time improving the prompt. They reword the instructions, add examples, or switch to a stronger model, and the output usually improves. But there is a limit to this. A prompt only produces one attempt. The model writes a draft and stops. It does not tell you whether the draft is correct, and it does not improve on its own the next time you run it.

In an earlier article, I argued that reliability depends on the entire workflow, not just the final answer, and that you should check the workflow at each step rather than only the end result. Evaluation is useful, but it only measures how good an output is. It does not improve the output. To improve the output, you need a review loop: a repeatable process that checks the result, identifies what went wrong, fixes it, and records what you learned so the next run improves. The rest of this article explains the idea and then builds one in Python that you can run today.

In short, evaluation measures how good an output is. A review loop is the process that improves it.

A prompt cannot check its own work

A single prompt is one pass through the model. It cannot check its own work, it does not remember the previous run, and it has no rule for deciding when the task is complete. In practice, the model is finished when it stops producing text, which is not the same as being correct.

Anthropic describes a more reliable setup called the evaluator-optimizer pattern. One model writes the draft, and a second model grades it against clear criteria. The draft is then revised until it meets those criteria. Anthropic recommends this when you have clear criteria and when revising the output actually improves it, which covers most real work. Its Claude Code guidance makes a related point: if the system cannot check its work against something concrete, such as a test, a schema, or a second reviewer, then deciding whether the work is done becomes a matter of opinion, and a person has to check everything by hand.

OpenAI gives similar advice. Reliable quality comes from clear goals, test datasets, automated graders, and ongoing evaluation, rather than from a single well-written prompt. Reliability is something you build a process around, not something you reach through wording alone.

Common ways the first draft goes wrong

A draft can read well and still be wrong. This happens because of how the models are trained. Below are three common problems. The model cannot detect any of them on its own, which is why an outside check is needed.

Models tend to guess. They are trained to produce text that sounds plausible, not text that has been verified. TruthfulQA found that models often repeat common misconceptions. OpenAI’s 2025 research on hallucinations offers one reason: most training rewards confident answers more than admitting uncertainty, so the model learns to guess rather than say it does not know.
Models can overlook the middle of a long input. Even with large context windows, they do not treat every part of the input equally. The Lost in the Middle research found that information in the middle of a long input is often underused, while information at the start and end gets more weight. For summaries and retrieval-based systems, an important detail can be present in the input but missing from the output.
Models tend to agree with the user. Research on sycophancy found that several assistants often match the user’s stated opinion, and that people sometimes prefer an agreeable answer to a correct one. This matters at work, where prompts often suggest the answer the writer expects. If you ask whether the data supports your plan, the model is more likely to agree than to disagree.

Takeaway: These are normal limits of how the models work. A review loop is there to catch the mistakes the model cannot catch on its own.

The four-stage review loop

The review loop has four stages. It applies to ordinary documents as well as to code.

Figure 1. The review loop. Output that passes goes straight to capture; output that needs work is revised and run again, and each recorded failure raises the baseline for the next run.

Generate. Produce the first draft, however your workflow does it: a prompt chain, retrieval, or tool calls.
Review. Check the draft with code first, then with a second model, and send the hard cases to a person. The claims, the structure, the process that produced it, and safety are all fair game.
Revise. Fix the problems found in the review. You can fix the output itself, or fix the workflow that produced it. Re-running with feedback fixes the current draft. Fixing the instruction or the retrieval step also improves future drafts, which has a larger effect over time.
Capture. This stage is often skipped, but it is what makes the loop worth running. Record each failure in a reusable form, such as a test case, an evaluation example, a rule, or a guardrail. OpenAI recommends expanding your evaluation set using real production data. Anthropic describes the shift from repeatedly fixing the same problem to a system in which each failure becomes a test that prevents it from recurring.

Takeaway: If you find a failure but do not record it, the same problem can return on a later run.

Building it: clean data from a document

Here is the loop applied to a common task: pulling fields out of a messy invoice into validated JSON. The output looks fine at a glance, which is exactly when a review loop earns its place. The code is provider-agnostic, so replace call_model with your own client and install jsonschema.

Start with the input. This is the kind of text a model reads easily but gets subtly wrong:

INVOICE    Acme Industrial Supply
Inv #: AC-10827      Date: 03/14/2025
Bill to: Northwind Trading
 
  2 x Steel bracket (SB-12) ............ 24.00
 10 x Hex bolt M8 @ 3.50 ea ........... 35.00
  1 x Safety gloves, pair .............. 12.75
 
Subtotal: 71.75
Tax (8%): 5.74
TOTAL DUE: 77.49

The setup is your model call and a few imports:

"""Structured extraction with a review loop (provider-agnostic).
Needs the standard library plus jsonschema (pip install jsonschema).
Replace call_model with your own client (OpenAI, Anthropic, ...).
"""
import json
import re
from datetime import date
from jsonschema import validate, ValidationError
 
 
def call_model(prompt: str) -> str:
    """Send the prompt to your model and return its text reply."""
    raise NotImplementedError

A schema fixes the shape, so structure and types are checked for free:

# The shape we require. The schema checks structure and types for free.
SCHEMA = {
    "type": "object",
    "required": ["invoice_number", "date", "vendor", "customer",
                 "line_items", "subtotal", "tax", "total"],
    "properties": {
        "invoice_number": {"type": "string"},
        "date": {"type": "string"},
        "vendor": {"type": "string"},
        "customer": {"type": "string"},
        "line_items": {"type": "array", "items": {
            "type": "object",
            "required": ["description", "quantity",
                         "unit_price", "amount"],
            "properties": {
                "description": {"type": "string"},
                "quantity": {"type": "number"},
                "unit_price": {"type": "number"},
                "amount": {"type": "number"},
            },
        }},
        "subtotal": {"type": "number"},
        "tax": {"type": "number"},
        "total": {"type": "number"},
    },
}

Deterministic checks run first because they are cheap and exact. They catch the arithmetic and the invented numbers that a fluent answer hides:

# Deterministic checks: cheap, exact, and run first. These catch the
# arithmetic and made-up numbers that a fluent draft hides.
def near(a, b, tol=0.01):
    return abs(a - b) <= tol
 
 
def check_extraction(data, source):
    issues = []
    for item in data["line_items"]:
        if not near(item["quantity"] * item["unit_price"], item["amount"]):
            issues.append("line '%s': qty x price != amount"
                          % item["description"])
    if not near(sum(i["amount"] for i in data["line_items"]),
                data["subtotal"]):
        issues.append("line amounts do not sum to subtotal")
    if not near(data["subtotal"] + data["tax"], data["total"]):
        issues.append("subtotal + tax != total")
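    for field in ("invoice_number", "total"):
        value = data[field]
        # Print numbers with two decimals, as the invoice does, so a
        # total of 77.5 is looked up as "77.50" rather than "77.5".
        text = value if isinstance(value, str) else "%.2f" % value
        if text not in source:
            issues.append("%s '%s' is not in the document"
                          % (field, text))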
    try:
        date.fromisoformat(data["date"])
    except ValueError:
        issues.append("date '%s' is not ISO format" % data["date"])
    return issues

Some mistakes are not arithmetic. A second model call reviews the extraction against the source and catches things like swapped names:

# Independent reviewer: a second model call catches what code cannot,
# such as a vendor and customer that have been swapped.
JUDGE = (
    "You verify an invoice extraction against the SOURCE.\n"
    "Reply ONLY with JSON: {\"passed\": bool, \"issues\": [string]}.\n"
    "Check that names, descriptions, and roles match the document."
)
 
 
def judge_extraction(source, data):
    prompt = (JUDGE + "\n\nSOURCE:\n" + source
              + "\n\nEXTRACTION:\n" + json.dumps(data, indent=2))
    return json.loads(call_model(prompt))
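
The judge is itself a model, so its reply can be malformed too. A defensive parser is one option. The sketch below is an assumption of mine, not part of the pipeline above, and parse_verdict is a hypothetical helper name. It treats any reply it cannot read as a failed review rather than a pass.

```python
import json


def parse_verdict(reply):
    """Parse a judge reply; anything malformed counts as a failed review."""
    try:
        verdict = json.loads(reply)
    except json.JSONDecodeError:
        return {"passed": False, "issues": ["judge reply was not valid JSON"]}

    if not isinstance(verdict, dict) or not isinstance(verdict.get("passed"), bool):
        return {"passed": False, "issues": ["judge reply has no boolean 'passed'"]}

    raw_issues = verdict.get("issues")
    issues = [str(item) for item in raw_issues] if isinstance(raw_issues, list) else []

    # A failure with no stated reason must still block acceptance.
    if not verdict["passed"] and not issues:
        issues = ["judge failed the draft without giving a reason"]
    return {"passed": verdict["passed"], "issues": issues}


print(parse_verdict('{"passed": true, "issues": []}'))
# {'passed': True, 'issues': []}
print(parse_verdict("Sure! Here is my review.")["passed"])
# False
```

Failing closed matters here. A reviewer that crashes or rambles should never turn into an accepted draft.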

Capture writes every failure to a file you can replay later:

# Capture: every failure becomes a row you can replay later as a test.
def capture(source, data, issues, path="eval_set.jsonl"):
    row = {"source": source, "output": data, "issues": issues}
    with open(path, "a") as handle:
        handle.write(json.dumps(row) + "\n")

The loop puts it together: generate, run the cheap checks, then the judge, capture failures, revise, and stop after a few tries:

# The loop: generate, review (code, then judge), capture, revise, repeat.
EXTRACT = (
    "Extract this invoice as JSON matching the agreed schema. "
    "Use ISO dates (YYYY-MM-DD) and copy every number exactly.\n\n"
    "INVOICE:\n"
)
 
 
def extract_invoice(source, max_tries=3):
    prompt = EXTRACT + source
    for attempt in range(1, max_tries + 1):
        print("Attempt %d" % attempt)
        try:
            data = json.loads(call_model(prompt))       # generate
            validate(instance=data, schema=SCHEMA)
            issues = check_extraction(data, source)      # review: code
        except (json.JSONDecodeError, ValidationError) as error:
            data, issues = {}, ["invalid output: " + str(error)[:50]]
        print("  code checks:", "FAIL" if issues else "PASS")
 
        if not issues:                                   # review: judge
            verdict = judge_extraction(source, data)
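            # A failed verdict with an empty issue list must still block.
            issues = [] if verdict["passed"] else (
                verdict["issues"] or ["judge failed without details"])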
            print("  judge:", "FAIL" if issues else "PASS")
 
        if not issues:
            print("  accepted")
            return data
 
        for problem in issues:
            print("    - " + problem)
        capture(source, data, issues)                    # capture
        prompt = (EXTRACT + source + "\n\nYour last answer had:\n"
                  + "\n".join("- " + p for p in issues)
                  + "\n\nReturn corrected JSON.")         # revise
 
    raise RuntimeError("still failing; escalate to a human")

Running it on the invoice above produces the trace below. The first draft is fluent but wrong in two different ways:

Attempt 1
  code checks: FAIL
    - subtotal + tax != total
    - total '77.94' is not in the document
Attempt 2
  code checks: PASS
  judge: FAIL
    - vendor/customer swapped: seller is Acme Industrial Supply
Attempt 3
  code checks: PASS
  judge: PASS
  accepted

Attempt 1 fails the cheap checks: the total does not match subtotal plus tax, and 77.94 never appears in the document. Attempt 2 fixes the math, but the judge notices the vendor and customer are swapped. Attempt 3 passes both and is accepted. These are the failure modes from earlier in action: a confident wrong number, then a confident wrong label, both caught before release. Two failures were written to eval_set.jsonl on the way.

Those captured failures are not just logs. Replay them as tests, and a bug you have already fixed cannot quietly come back:

# regression.py: replay captured failures; they should pass now.
import json
from rloop import extract_invoice, check_extraction
 
 
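def test_known_failures_now_pass():
    # Read every case first: extract_invoice appends to eval_set.jsonl
    # when a draft fails, so the file must not be open while it runs.
    with open("eval_set.jsonl") as handle:
        cases = [json.loads(line) for line in handle if line.strip()]
    for case in cases:
        fixed = extract_invoice(case["source"])
        assert check_extraction(fixed, case["source"]) == []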

Three kinds of review

That example used all three kinds of review, and most real systems need them together.

Figure 2. The three kinds of review. Code catches structural errors cheaply, an independent reviewer catches reasoning the author missed, and a person has the final say on high-stakes decisions.

Code handled everything with a clear answer: the schema, the arithmetic, the date format, and the check that numbers came from the source. It is fast and exact, and it caught the bad total. Anthropic, OpenAI, and Microsoft all treat these checks as the base of a good pipeline.

An independent reviewer, a separate model call with its own rubric, caught the swapped names that code cannot judge. Self-Refine, CRITIC, and Reflexion all report real gains from this kind of separate critique. One caution from my earlier article: a model reviewer has its own biases, so check it against human judgment from time to time rather than trusting it on its own.

A person with authority is the last step, for high-stakes or subjective cases. In the code, that is the RuntimeError raised at the end: when the loop cannot pass on its own, it stops and hands off instead of shipping. NIST’s Generative AI Profile recommends that pre-deployment testing reach the people who decide what gets released.

Takeaway: One layer of review is rarely enough. Using code, an independent reviewer, and a person together covers more types of error.

What repeatability actually means

It is worth being clear about what repeatability means here. It does not mean the model produces the same text every time, because it is not deterministic. It means the quality stays within a predictable range even as the wording changes.

The loop does not remove that variation, but it keeps the results within known limits. In practice, this lets you describe the worst case in advance. The schema check will not allow malformed output, the arithmetic check will not allow a total that does not add up, and the grounding check will not allow a number that is not in the source. The wording can vary; those guarantees do not.
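
A small sketch of what that means in practice. The totals come from the invoice used earlier in this article. The three drafts and the within_guarantees helper are hypothetical, and only two of the guarantees are checked to keep it short.

```python
# Totals block copied from the invoice earlier in this article.
SOURCE = "Subtotal: 71.75\nTax (8%): 5.74\nTOTAL DUE: 77.49"


def within_guarantees(draft, source, tol=0.01):
    """True when the draft's total adds up and appears in the source."""
    adds_up = abs(draft["subtotal"] + draft["tax"] - draft["total"]) <= tol
    grounded = "%.2f" % draft["total"] in source
    return adds_up and grounded


# Two hypothetical drafts with different wording but the same numbers,
# and a third whose total has two digits transposed.
draft_a = {"vendor": "Acme Industrial Supply",
           "subtotal": 71.75, "tax": 5.74, "total": 77.49}
draft_b = {"vendor": "ACME Industrial Supply Co.",
           "subtotal": 71.75, "tax": 5.74, "total": 77.49}
draft_c = {"vendor": "Acme Industrial Supply",
           "subtotal": 71.75, "tax": 5.74, "total": 77.94}

results = [within_guarantees(d, SOURCE) for d in (draft_a, draft_b, draft_c)]
print(results)  # [True, True, False]
```

The first two drafts differ in wording and both pass. The third reads just as well and is blocked.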

Review is becoming a compliance requirement

There is also a practical reason beyond quality. Review and oversight are increasingly expected. NIST’s Generative AI Profile asks teams to verify sources before and after launch, monitor for issues that require human attention, and keep records. Microsoft’s code of conduct for its AI services requires ongoing testing, feedback channels, and human oversight, and does not permit significant decisions without it. The EU AI Act requires that people be told when they are interacting with AI or reading AI-generated content.

Several well-known cases show what can happen without review. Each one failed in a different way:

CNET. A review of its AI-written finance articles found that 41 of 77 needed corrections. The articles were published because no one checked them first.
Air Canada. A tribunal held the airline responsible after its chatbot described a refund policy that did not exist, and rejected the argument that the chatbot was a separate entity.
Avianca. Lawyers were sanctioned for submitting a court filing that cited cases that did not exist.

In each case, the writing looked fine. What was missing was a review step before the work was released.

Conclusion

The main point is straightforward. Evaluation tells you how good an output is. A review loop is how you improve it. The four stages, the three kinds of review, and the governance requirements are all parts of that process.

As the example shows, the loop is mostly ordinary code: a schema, a few exact checks, a second opinion, and a record of past failures. The useful change is to spend less time refining the prompt and more time on the surrounding process: what gets checked, who can reject the output, what counts as a failure, when you rerun it, and how each failure becomes a permanent check. With those pieces in place, generative AI can move from producing one-off drafts to supporting a process you can rely on.

Manager Memo: Reviewing GenAI Output Before It Reaches Stakeholders

Cornellius Yudha Wijaya — Sat, 20 Jun 2026 03:21:50 GMT

The output looks polished. That is the problem. Polished prose hides weak evidence, and the approving manager, not the model, owns what happens next.

You receive a briefing note, a slide deck, or a client-facing summary. Your team tells you AI helped write it. It reads cleanly: the structure is logical, the tone is confident, the citations look authoritative. The temptation is to skim it, nod, and forward it on.

Resist that. Fluent prose is not evidence of accurate content. Generative models are optimized to produce text that sounds right, not text that is right, and the gap between the two is exactly where stakeholder-facing work goes wrong.

This playbook gives you a structured way to close that gap: a three-layer review gate, a use/revise/reject decision rule, the failure modes worth memorizing, and templates you can adopt this week.

The core principle, drawn from where every major governance framework converges: treat stakeholder-facing GenAI output as a high-variance draft, not as finished work.

What Most GenAI Evaluation Workflows Get Wrong

Cornellius Yudha Wijaya — Sun, 14 Jun 2026 05:51:53 GMT

I keep seeing the same mistake. A team hooks up a model, eyeballs a handful of outputs, decides it looks fine, and ships. Evaluation only enters the picture after something breaks in production, and by then the team is firefighting, bolting on test cases one incident at a time, always a step behind.

The issue is not that they forgot to evaluate. It is that they put the eval in the wrong place. They scored the final answer without looking at what produced it. But in any system that retrieves documents, routes intents, calls tools, or chains reasoning steps, the answer is the last place a bug shows up—and almost never where it starts. A bad retrieval, a misrouted query, or a wrong tool argument can all produce something that reads beautifully and is completely wrong.

Evaluation belongs inside the workflow, not after it. GenAI evals should look less like one big pass/fail test and more like layered quality gates.

This is not just my opinion. Official guidance from OpenAI, Anthropic, Google Cloud, Microsoft Foundry, LangSmith, and NIST points the same way: evaluate routing, retrieval, tool use, grounding, safety, and drift, not just the final paragraph.

Why the Final Answer Is Usually Not the Root Cause

When something goes wrong, the reflex is to blame the model. But in most real-world setups, the model is just one piece of a longer pipeline. There are at least five places where things can quietly go sideways before the answer even gets generated:

• Prompt construction and intent routing

• Document retrieval and context assembly

• Tool selection and parameter extraction

• Agent handoff and orchestration logic

• Post-processing and output formatting

Every major platform now says the same thing. OpenAI recommends scoped tests at every stage. Microsoft says you need to evaluate each step, not just the final output. Google separates response evaluation from trajectory evaluation. Ragas breaks RAG scoring into retrieval metrics and generation metrics. The consensus is clear.

Outcome-only evaluation is a scoreboard. Workflow-native evaluation is an instrument panel.

End-to-end success rates still matter as the top-level KPI—but they are terrible debugging tools. Anthropic puts it well: agent mistakes compound, so you need to inspect every intermediate step to figure out whether the failure came from the model, the tooling, the harness, or the eval itself.

Figure 1: How output-only and workflow-native evaluation compare across key dimensions.

Where Evaluation Actually Belongs

The most useful eval setups start by breaking the system into stages and writing checks that match what each stage is supposed to do. Here are the five checkpoints that matter:

• Input & Routing — Did we understand the user’s intent? Did we catch injection attempts? Did we route to the right handler?

• Retrieval & Context — Did we pull the right documents? Are they relevant? Is the context complete enough to answer the question?

• Tool Use & Planning — Did the agent pick the right tool? Are the arguments valid? Did it follow a reasonable path to get there?

• Generation — Is the answer correct, complete, and grounded in the evidence? Is it safe? Does it know when to say “I don’t know”?

• Production Monitoring — How fast is it? How much does it cost? Is quality drifting? Are users flagging problems?

Figure 2: Each stage has its own eval checkpoint, with a feedback loop from production back into the test data.

If this sounds like a lot of work, it does not have to be, at least not on day one. The practical move is risk-weighted decomposition. Pick the two or three user journeys that matter most, write clear success criteria for each, and start with a small curated dataset. Anthropic, OpenAI, and LangSmith all recommend beginning with 10-20 high-quality examples and growing from real failures. That keeps things lightweight while still making problems visible.

Picking Metrics That Match How Things Actually Break

The golden rule: use the cheapest reliable check first. Start with deterministic graders: exact match, schema validation, regex, policy checks, citation lookups. Only bring in LLM judges when you need to score something fuzzy like helpfulness or completeness.
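A minimal sketch of that ordering, assuming the system returns JSON with a hypothetical `answer` field and uses bracketed citation markers such as `[1]`. The `llm_judge` callable is a placeholder for whatever judge you use; none of these names come from a specific framework.

```python
import json
import re
from typing import Callable

CITATION_PATTERN = re.compile(r"\[\d+\]")  # assumed citation style, e.g. "[1]"


def deterministic_checks(raw_output: str) -> list[str]:
    """Return failure reasons; an empty list means every cheap check passed."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]

    if not isinstance(payload, dict) or "answer" not in payload:
        return ["missing required field: answer"]

    failures = []
    answer = payload["answer"]
    if not isinstance(answer, str) or not answer.strip():
        failures.append("answer is empty")
    elif not CITATION_PATTERN.search(answer):
        failures.append("answer contains no citation marker")
    return failures


def grade(raw_output: str, llm_judge: Callable[[str], bool]) -> dict:
    """Run the cheap checks first; call the expensive judge only if they pass."""
    failures = deterministic_checks(raw_output)
    if failures:
        return {"passed": False, "stage": "deterministic", "reasons": failures}
    verdict = llm_judge(raw_output)
    return {"passed": verdict, "stage": "llm_judge", "reasons": []}
```

The point of the design is that a malformed output never reaches the judge, so the judge's cost and noise are spent only on outputs that are already structurally valid.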

What to track depends on your system type:

RAG systems: context precision, context recall, grounding, faithfulness. Track retrieval and generation as separate scores.
Agent systems: tool selection accuracy, argument validity, task completion rate, and trajectory precision and recall.
Any system: correctness, completeness, safety, abstention quality. Keep these as separate dimensions—one overall number hides too much.
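To make "track retrieval as a separate score" concrete, here is a simplified, set-based sketch of context precision and recall. It assumes you have labeled which document IDs are relevant for each question. The IDs below are made up, and libraries such as Ragas compute more elaborate versions of these metrics than this illustration does.

```python
def context_precision(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Share of retrieved chunks that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant_ids)
    return hits / len(retrieved_ids)


def context_recall(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Share of relevant chunks that the retriever managed to surface."""
    if not relevant_ids:
        return 1.0  # nothing to find, so nothing was missed
    found = relevant_ids.intersection(retrieved_ids)
    return len(found) / len(relevant_ids)


# Hypothetical retrieval result for one question.
retrieved = ["policy_12", "policy_40", "faq_3", "blog_9"]
relevant = {"policy_12", "policy_40", "policy_77"}

retrieval_scores = {
    "context_precision": context_precision(retrieved, relevant),  # 2 of 4 retrieved
    "context_recall": context_recall(retrieved, relevant),        # 2 of 3 relevant
}
```

Keeping these two numbers apart from any generation score tells you whether a bad answer started with noisy context, missing context, or neither.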

A warning about LLM-as-judge: yes, LLM judges can scale evaluation and hit human-level agreement on some tasks. But research (MT-Bench, the “Large Language Models are not Fair Evaluators” paper) shows they are prone to position bias, verbosity bias, and self-enhancement bias. Just swapping the order of two responses can flip the ranking. The fix is not to avoid LLM judges; it is to calibrate them. Use pairwise comparisons, pass/fail rubrics with clear criteria, and periodically check agreement against human labels. Targeted human calibration beats full human scoring.
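One inexpensive calibration step is to run each pairwise comparison twice with the order swapped and only trust verdicts that survive the swap. The sketch below assumes a hypothetical judge callable that returns "first" or "second"; it is not tied to any particular judge API.

```python
from typing import Callable

# Hypothetical judge signature: (question, first_answer, second_answer) -> "first" or "second".
Judge = Callable[[str, str, str], str]


def order_robust_preference(question: str, answer_a: str, answer_b: str, judge: Judge) -> str:
    """Ask the judge twice with the answers swapped; keep only consistent verdicts."""
    forward = judge(question, answer_a, answer_b)   # A is shown first
    backward = judge(question, answer_b, answer_a)  # B is shown first

    forward_winner = "A" if forward == "first" else "B"
    backward_winner = "B" if backward == "first" else "A"

    if forward_winner == backward_winner:
        return forward_winner
    return "inconsistent"  # the verdict flipped with the order; send to human review
```

The share of "inconsistent" results across a dataset is itself a useful number: it is a rough estimate of how much position bias the judge carries on your task.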

A Six-Step Playbook for CI and Production

Evals that live only in notebooks are evals that rot. Here is how to make them part of your development and deployment loop:

Instrument traces early. From the first real prototype, capture everything: input, system prompt, routing decision, retrieved docs, tool calls, results, final answer, latency, cost, feedback. Without traces, you can evaluate only the output, not the workflow.
Start with a small, sharp dataset. 10-20 curated examples per critical path. Include happy paths, edge cases, known failure modes, and adversarial inputs. Quality beats quantity here.
Layer your graders. Deterministic checks first (schema validation, regex, policy filters). LLM judges second (for nuanced criteria like completeness). Sampled human review third (for calibration and high-stakes cases).
Wire evals into CI/CD. When prompts, retrievers, tool schemas, or orchestration logic change, run component-level and end-to-end evals automatically. Fail the build if pass rates drop.
Add online evaluation after deploy. Score a sample of production traces for quality, safety, and cost. Route low scores and negative feedback into annotation queues. Feed confirmed failures back into your offline dataset.
Schedule security scans. Red-team prompts, poisoned-context tests, and injection probes should run on a regular cadence, not just at deploy time.
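As an illustration of the CI step, here is a minimal gate that compares the current pass rate with a stored baseline and returns a non-zero exit code on regression. The function names, the 2-point tolerance, and the sample results are assumptions made for this sketch.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of eval cases that passed."""
    return sum(results) / len(results) if results else 0.0


def ci_gate(current: list[bool], baseline_rate: float, tolerance: float = 0.02) -> int:
    """Return a process exit code: 0 if the pass rate held, 1 if it regressed."""
    rate = pass_rate(current)
    if rate + tolerance < baseline_rate:
        print(f"FAIL: pass rate {rate:.2%} is below baseline {baseline_rate:.2%}")
        return 1
    print(f"OK: pass rate {rate:.2%} (baseline {baseline_rate:.2%})")
    return 0


# Illustrative results; in practice these come from running the eval suite.
exit_code = ci_gate([True] * 18 + [False] * 2, baseline_rate=0.90)
```

In a pipeline, the script would end by passing `exit_code` to `sys.exit`, so the build fails whenever the gate returns 1.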

This is the pattern you will find across LangSmith, MLflow, Microsoft Foundry, and Promptfoo. It is where evaluation stops being a one-off check and becomes a continuous part of how the system runs.

Figure 3: A framework-agnostic Python function showing layered evaluation across all pipeline stages.

What Happens When Evals Only Check the Surface

Air Canada’s chatbot (Moffatt v. Air Canada). A customer asked about bereavement fares and got incorrect guidance from the airline’s chatbot. Air Canada tried to distance itself from the bot’s answer, but the tribunal was not having it—the chatbot was part of the company’s website, and the airline was responsible for what it said.

This was not a hallucination problem. It was a missing-controls problem. A workflow-level eval would have checked chatbot responses against the actual policy documents, flagged contradictions, required evidence for refund-related answers, and routed anything ambiguous to a human. A style-level review of the output would have missed all of this—the answer probably read just fine.

Fabricated case law (Mata v. Avianca). Lawyers submitted court filings with fake case citations generated by ChatGPT. The court sanctioned them, ordered letters of explanation to the judges who were falsely cited, and imposed a $5,000 fine.

The output sounded professional. The problem is that nobody checked whether the cited cases actually existed. For any system that conducts research or cites sources, evaluation needs to cover source resolution, citation validity, and handling missing evidence. A confident answer with fabricated support should count as a failure, even if it scores well on fluency.

Conclusion

Better prompts and better models help. But what actually makes GenAI systems reliable is treating evaluation as part of the workflow, not something you bolt on at the end. That means checkpoints before generation, around tool use, before delivery, and after deployment through tracing and feedback.

The playbook: pick the two or three user journeys that matter most. Define what success looks like. Build 10-20 test cases. Use cheap graders first. Wire evals into CI. Sample production traces. The tools already exist. The question is whether your team uses them.

The remaining question is simple: does your team treat evaluation as something the system does all the time, or something someone does to the system after the fact?

From Prompt to Reliable Output: A Practical GenAI Evaluation Workflow

Cornellius Yudha Wijaya — Sun, 31 May 2026 07:47:25 GMT

Prompt engineering is rarely enough to prepare a GenAI system for production.

While a single prompt can generate a good output in initial testing, deploying it across hundreds of real-world inputs exposes failures like missing details, incorrect facts, and formatting errors.

To build a production-ready system, you must transition from prompt optimization to systematic evaluation.

A prompt defines your request; an evaluation workflow defines your standard and proves whether the system meets it.

Here is a practical workflow to evaluate and improve GenAI applications.

1. The Limits of Initial Testing

It is easy to mistake a fluent LLM output for a correct one.

If a response uses a professional tone and has no grammatical errors, we assume it is accurate.

However, LLMs are non-deterministic, and a prompt that works for one test case can fail on another.

In production, incorrect outputs carry significant risks—whether it is a fabricated metric in an executive report or an incorrect policy in a customer support tool.

Rather than testing under ideal conditions, developers must identify where a system fails under real-world inputs.

2. Why Prompting Alone is Insufficient

A well-crafted prompt is not a testing framework.

Relying entirely on prompts to ensure quality is insufficient for three reasons:

Input Variability: Real-world queries are often messy, incomplete, or poorly formatted.
Model Variability: Even at temperature 0.0, models can generate slightly different outputs for the same input.
System Dependencies: In complex architectures like Retrieval-Augmented Generation (RAG), the LLM is only one component. A prompt cannot correct a failure in the retrieval step.

A polished output does not guarantee a reliable system.

Instead of relying solely on prompt adjustments, you need a structured evaluation workflow.

3. The 7-Step Evaluation Workflow

To ensure predictable GenAI behavior, teams should implement a repeatable seven-step evaluation workflow.

Step 1: Define the task in operational terms.
Step 2: Build a representative evaluation set.
Step 3: Break evaluation into specific dimensions.
Step 4: Choose the right grading method.
Step 5: Log failures and classify the errors.
Step 6: Modify one variable at a time.
Step 7: Define production thresholds.

Step 1: Define the task in operational terms

You must translate vague requirements into objective criteria. For example, instead of asking for “a good summary,” define the exact parameters:

KPI Commentary: Accurate metrics, no causal claims unless explicitly backed by the data source, a concise tone, and a maximum of 150 words.
SQL Explainer: Explains joins and filters correctly in plain language, linking them to a specific KPI.
Customer RAG: Answers using only the provided context, cites sources, and states “I do not know” if context is missing.

Identify the specific tasks, target audience, constraints, and failures that would render an output unusable.
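Once criteria are written this way, some of them can be checked by code. The sketch below covers two of the KPI commentary criteria: the 150-word limit and causal language. The phrase list is an assumption for illustration and would need tuning for a real domain.

```python
import re

MAX_WORDS = 150  # word limit from the KPI commentary criteria
# Assumed list of causal phrases; extend it for your own domain.
CAUSAL_PHRASES = ("because", "due to", "driven by", "caused by", "as a result of")


def check_kpi_commentary(text: str) -> dict:
    """Turn the written criteria into checks that can run on every output."""
    word_count = len(re.findall(r"\S+", text))
    lowered = text.lower()
    causal_hits = [phrase for phrase in CAUSAL_PHRASES if phrase in lowered]
    return {
        "within_word_limit": word_count <= MAX_WORDS,
        "word_count": word_count,
        # Causal language is not banned; it is flagged so a reviewer can
        # confirm that the data source actually supports it.
        "causal_phrases_to_verify": causal_hits,
    }
```

Criteria such as "accurate metrics" still need a reference value or a reviewer, but the mechanical ones no longer depend on someone remembering to count words.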

Step 2: Build a representative evaluation set

Start with a small evaluation set of 10 to 30 examples. A massive evaluation set is difficult to manage early in development. The set must reflect real-world inputs rather than just ideal cases. It should contain:

Standard inputs: Common queries to test baseline functionality.
Complex/Ambiguous inputs: Requests with mixed sentiments or multi-step instructions.
Edge cases: Inputs with missing context or specific formatting constraints.
High-risk inputs: Scenarios where errors have significant business or legal impacts.

For example, an initial evaluation set for a customer-facing assistant might look like the one below:

Step 3: Break evaluation into specific dimensions

Avoid grading outputs with a single overall score, as it blends separate failure modes. Instead, assess performance across specific dimensions:

Select the dimensions relevant to your application. A RAG system focuses on Grounding and Completeness, while a code generation tool focuses on Task Success and Format Fidelity.
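To see how a single overall score blends failure modes, here is a small sketch with made-up rubric scores on a 1-5 scale. The dimension names follow the ones mentioned above; the numbers are purely illustrative.

```python
from statistics import mean

# Hypothetical per-example scores on a 1-5 scale for three dimensions.
scores = [
    {"grounding": 5, "completeness": 4, "format_fidelity": 5},
    {"grounding": 2, "completeness": 5, "format_fidelity": 5},
    {"grounding": 2, "completeness": 5, "format_fidelity": 5},
]

# One blended number across all examples and dimensions.
overall = mean(mean(row.values()) for row in scores)

# The same data, kept as one average per dimension.
by_dimension = {
    dimension: mean(row[dimension] for row in scores)
    for dimension in scores[0]
}
```

Here the blended score comes out a little above 4.2, which looks healthy, while the per-dimension view shows grounding averaging 3 and failing on two of the three examples.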

Step 4: Choose the right grading method

Use the simplest method that provides accurate results. You can grade outputs using four primary approaches:

Rule-Based Checks: Programmatic, deterministic, and highly reliable. Ideal for formatting and constraints (e.g., verifying JSON schema or character counts).
Reference-Based Checks: Used when there is a ground-truth answer (e.g., comparing classification labels or verifying generated SQL output against a reference database query).
LLM-as-a-Judge: Used for semantic or stylistic dimensions like tone and factual consistency at scale. These require a strict grading rubric and few-shot examples to maintain consistency.
Human Review: Recommended for highly sensitive or high-impact tasks. Spot-checks by domain experts are also used to calibrate and validate automated LLM judges.

Step 5: Log failures and classify the errors

When a test case fails, identify the root cause before changing variables. Failure in a GenAI system does not always stem from the prompt.

Classify errors into specific categories:

Prompt Issue: Instructions are vague or contain conflicting constraints.
Retrieval Issue: The context provided to the model is incomplete, irrelevant, or outdated.
Data Issue: The underlying data source contains incorrect or corrupted information.
Model Issue: The model ignores instructions or generates incorrect claims despite correct prompts and context.
Requirements Issue: The operational criteria for the task were poorly defined.

For example, a retrieval failure cannot be fixed by editing the prompt, and a data quality issue cannot be resolved by upgrading the model. Fix the issue at its source.
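A failure log does not need heavy tooling. The sketch below uses the five categories above as an enum and tallies a hypothetical set of labeled failures to find the primary error type; the case IDs and labels are invented.

```python
from collections import Counter
from enum import Enum


class ErrorType(Enum):
    PROMPT = "prompt"
    RETRIEVAL = "retrieval"
    DATA = "data"
    MODEL = "model"
    REQUIREMENTS = "requirements"


# Hypothetical failure log: each failed case is labeled after a root-cause review.
failure_log = [
    {"case_id": "c-03", "error_type": ErrorType.RETRIEVAL},
    {"case_id": "c-07", "error_type": ErrorType.RETRIEVAL},
    {"case_id": "c-11", "error_type": ErrorType.PROMPT},
    {"case_id": "c-15", "error_type": ErrorType.RETRIEVAL},
]

counts = Counter(entry["error_type"] for entry in failure_log)
primary_error, primary_count = counts.most_common(1)[0]
```

In this made-up log, three of the four failures trace back to retrieval, so the next change should target the retriever rather than the prompt.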

Step 6: Modify one variable at a time

When optimizing the system, change only one variable at a time to isolate what improves or degrades performance.

Follow this process:

Establish a baseline: Run your current evaluation set.
Identify failures: Audit failed cases to determine the primary error type.
Isolate a single change: Modify exactly one parameter (e.g., update a prompt rule, adjust chunk size, or change the model temperature).
Rerun and compare: Run the evaluation set again and compare results against the baseline to verify improvement.
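The "rerun and compare" step is easier to trust when it works per case rather than only on the headline pass rate. A minimal sketch, assuming each run is stored as a mapping from a case ID to pass/fail (the IDs and results are hypothetical):

```python
def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Compare per-case pass/fail results from two runs of the same eval set."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("the two runs share no eval cases")
    return {
        "baseline_pass_rate": sum(baseline[c] for c in shared) / len(shared),
        "candidate_pass_rate": sum(candidate[c] for c in shared) / len(shared),
        # Cases that failed before the change and pass after it.
        "fixed": [c for c in shared if not baseline[c] and candidate[c]],
        # Cases that passed before the change and fail after it.
        "regressed": [c for c in shared if baseline[c] and not candidate[c]],
    }


baseline_run = {"c-01": True, "c-02": False, "c-03": False, "c-04": True}
candidate_run = {"c-01": True, "c-02": True, "c-03": False, "c-04": False}
report = compare_runs(baseline_run, candidate_run)
```

In this example both runs pass half the cases, yet the change fixed one case and broke another, which the pass rate alone would hide.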

Step 7: Define production thresholds

Establish clear metric thresholds to determine if the system is ready to deploy. For subjective dimensions, a standard 1-to-5 rubric is useful:

5 — Accurate, fully grounded in context, and correctly formatted.
4 — High quality; minor stylistic issues but safe to deploy without review.
3 — Generally correct; minor phrasing or formatting issues requiring human oversight.
2 — Significant gaps, unsupported claims, or ignored constraints.
1 — Incorrect, contains major errors, or is structurally broken.

Set your deployment thresholds based on risk. A low-risk internal tool might require a minimum average score of 3.5, whereas a high-risk or external application may require a minimum of 4.5 on all core dimensions and a strict 5.0 for factual grounding.
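Those thresholds can be written down as a small gate. The sketch below reads "minimum average score" as the average of the per-dimension averages and assumes the grounding dimension is stored under the key `grounding`; both are interpretations made for illustration.

```python
from statistics import mean


def ready_to_deploy(scores: dict[str, list[int]], risk: str) -> bool:
    """Apply the deployment thresholds to per-dimension rubric scores (1-5)."""
    averages = {dimension: mean(values) for dimension, values in scores.items()}
    if risk == "low":
        # Low-risk internal tool: overall average of at least 3.5.
        return mean(averages.values()) >= 3.5
    if risk == "high":
        # High-risk or external: at least 4.5 on every dimension and 5.0 on grounding.
        return (
            all(avg >= 4.5 for avg in averages.values())
            and averages.get("grounding", 0) >= 5
        )
    raise ValueError(f"unknown risk level: {risk!r}")


# Hypothetical rubric scores for three evaluated outputs.
rubric_scores = {
    "grounding": [5, 4, 5],
    "completeness": [5, 4, 5],
    "tone": [4, 5, 5],
}
```

With these scores the system clears the low-risk bar but not the high-risk one, because a single grounding score of 4 breaks the strict 5.0 requirement.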

4. Case Study: Evaluation in Practice

Consider a GenAI assistant designed to summarize raw analyst notes for stakeholders.

Input Data (Analyst Notes):

* Metrics: Active customers at 12,400 (up 8% QoQ, down 3% YoY due to a seasonal promotion). Revenue at $4.2M (met target of $4.1M, driven by enterprise renewals).
* Churn: Rose to 4.2% in March (up from 3.5% in January). CS team suspects a competitor release.
* Next Actions: CS team to contact high-risk renewals; Product team to release an update in June.

Generated Output (With Errors):

An LLM generates the following summary:

“Q1 was a stellar quarter for the business. Active customers reached 12,400, showing strong growth. Total revenue reached $4.2M, beating our target of $4.1M due to an incredibly successful seasonal marketing campaign. Although churn rose to 4.2%, our proactive CS team has already contacted all high-risk accounts to guarantee renewal.”

A structured assessment reveals key discrepancies:

Evaluating these distinct dimensions allows the team to identify exactly where the model failed. The prompt can then be updated with constraints requiring objective reporting and preventing the model from inferring causality.

5. Why Technical Teams Must Own Evaluation

Evaluating software and data models is a fundamental engineering discipline. Teams already possess the core mental models required:

Understanding that single test cases are not representative of overall performance.
Distinguishing between individual qualitative examples and statistical evidence.
Applying structured metrics like precision, recall, and error distributions to assess behavior.

Deploying GenAI requires applying these same disciplines to unstructured outputs.

Reliable production systems are built by designing, testing, and verifying performance systematically rather than focusing solely on creative prompt writing.

Conclusion

Prompting is a starting point, but systematic evaluation is what makes a system production-ready. By defining tasks operationally, building representative evaluation sets, assessing performance across distinct dimensions, and optimizing variables individually, developers can build dependable GenAI applications.

Best MCP Servers for Stock Market Data

Cornellius Yudha Wijaya — Thu, 21 May 2026 15:58:20 GMT

AI agents are becoming more useful for financial research, but they are only as reliable as the data they can access.

A model can summarize earnings, compare stocks, or assist in market monitoring. But without easy access to structured financial data, its answers go stale. Developers end up writing custom API wrappers or relying on the model to interpret raw data services, and neither is ideal when the market information has to be accurate.

This is where Model Context Protocol (MCP) becomes useful.

An MCP server provides an AI agent with a consistent framework for finding tools, calling data services, and interacting reliably with external systems. In stock market scenarios, this lets an agent access data such as fundamentals uniformly through a structured interface.

However, not every market data provider is equally suited to MCP-based workflows. Some offer broad asset coverage and agent-native integrations, while others excel at historical data, institutional-grade fundamentals, or enterprise financial analytics. The right choice depends on what you want to build.

In this article, we compare the best MCP servers for stock market data, focusing on five providers:

Alpha Vantage
Nasdaq Data Link
Tiingo
Intrinio
FactSet

Curious about it? Let’s get into it.

1. Alpha Vantage (Best Overall)

Alpha Vantage is the strongest overall choice for developers who want to connect LLMs and AI agents to stock market data through MCP. Its official MCP server gives agents a structured way to discover financial tools, retrieve the right data, and work with market information more reliably than a custom set of hardcoded API calls.

This is crucial because financial agents need more than simple price lookups. They may require access to historical OHLCV data, performance comparisons among multiple tickers, or the integration of market information with macroeconomic data. Alpha Vantage distinguishes itself as one of the few providers capable of supporting this wide range of workflows from a single data source. Its extensive coverage — including stocks, ETFs, funds, indices, options, forex, commodities, fundamentals, technicals, and economic indicators — ensures it remains a leading choice.

Alpha Vantage stands out for market data that is well-suited to serious financial applications and is returned as structured information that agents can reason over. The MCP layer enhances usability by exposing tools that an agent can inspect and invoke directly.

The Alpha Vantage MCP server is easy to implement and integrates with popular agent environments like Claude, Cursor, VS Code, ChatGPT, and others. Developers can connect remotely via a single MCP URL or run locally with uvx, making it perfect for research agents, coding tools, financial dashboards, and prototypes with minimal setup.

Alpha Vantage is ideal for MCP workflows that go beyond raw data retrieval to analysis. Users can ask an agent to retrieve NVDA’s recent prices, calculate RSI, summarize trends, compare quarterly movements across AI stocks, or combine fundamentals and technicals for a market brief. This versatility makes Alpha Vantage the best default for stock market MCP integrations.

Quickstart Example

For a remote MCP connection:

https://mcp.alphavantage.co/mcp?apikey=YOUR_API_KEY

For a local MCP connection:

uvx marketdata-mcp-server YOUR_API_KEY

Example Cursor configuration:

{
  "mcpServers": {
    "alphavantage": {
      "url": "https://mcp.alphavantage.co/mcp?apikey=YOUR_API_KEY"
    }
  }
}

Then, you can ask your agent something like:

Use Alpha Vantage MCP to get NVDA daily price data for the past month,
calculate RSI, and summarize the trend.

Best for

Developers and teams that want the most balanced MCP server for financial agents: broad market coverage, strong analytical flexibility, straightforward setup, and enough data depth to support both simple market questions and more serious research workflows.

2. Nasdaq Data Link (Best for Research-Grade Financial Datasets)

Nasdaq Data Link is a strong choice for teams that want an MCP-based financial agent to work with deeper, more research-oriented datasets, rather than focusing only on everyday price lookups. It is more compelling when the workflow depends on structured historical datasets, specialized financial information, and a richer research context.

This makes it especially useful for financial agents designed to support investment research, economic analysis, strategy exploration, or data-heavy market investigations. Instead of limiting the agent to asking, “What happened to this ticker today?”, Nasdaq Data Link is better suited to questions that require more structured context, such as comparing market behavior across time, analyzing broader economic relationships, or working with datasets that go beyond standard price and indicator endpoints.

The trade-off is that Nasdaq Data Link is not as immediately straightforward for general-purpose MCP stock workflows as Alpha Vantage. Its value is highest when the team already knows the type of dataset it wants to expose to the agent and is building around more deliberate research use cases. For lighter stock analysis, Alpha Vantage will often be faster to adopt. For dataset-centered financial intelligence, Nasdaq Data Link becomes much more attractive.

Quickstart Example

A fitting example prompt would be:

Use the Nasdaq Data Link MCP integration to retrieve a relevant historical
financial dataset, summarize the major trend changes over time, and highlight
periods that deserve closer investigation.

Best for

Developers, analysts, and research teams that want MCP-based financial agents to work with deeper, dataset-oriented market analysis rather than only quotes, indicators, or lightweight stock lookups.

3. Tiingo (Best for Market Data and News-Aware Analysis)

Tiingo is a strong option for developers seeking an MCP-based financial workflow that feels more like a research assistant than a simple market quote tool. It fits agents that need to connect stock market data with richer context on what is happening in the market.

This is especially useful because many financial questions can’t be answered by price data alone. A stock can move sharply due to earnings, guidance, sector sentiment, or broader news. In those cases, an AI agent is more helpful when it can pair structured market data with context that explains what may be driving the movement, rather than just reporting price changes. Research also highlights the value of combining numerical data with textual information, such as news, when building market intelligence systems.

Tiingo ranks well for market analysis workflows that require both data retrieval and narrative context. An MCP-connected agent using Tiingo can focus on questions like recent stock movements, company summaries, historical trend comparisons, or creating comprehensive ticker briefings, rather than viewing the market only as a numerical dataset.

Tiingo’s main strength isn’t covering all financial use cases but providing a useful middle ground: more contextual than basic market-data APIs and more approachable than institutional platforms. It suits teams building MCP workflows in which agents move from reporting “what happened?” to analyzing “what happened, and what context should we consider?”

Quickstart Example

A representative prompt can still show the use case clearly:

Use the Tiingo MCP integration to review TSLA’s recent price movement,
surface any relevant market or company context, and summarize the key
points in a short investor-style briefing.

Best for

Developers and small teams building financial agents that need to combine stock market data retrieval with richer interpretive context, especially for market briefings, watchlist assistants, and news-aware stock analysis workflows.

4. Intrinio (Best for Fundamentals-Driven Financial Agents)

Intrinio is best suited for developers building MCP-based financial agents that delve deeper into company fundamentals, financial statements, valuation context, and business performance analysis. It is better positioned for agents that need to explain a company rather than merely describe its stock movement.

This matters because many useful financial workflows start with questions not centered on price. Investors, analysts, and business users often want to know about revenue growth, margin changes, debt levels, or how companies compare on profitability and valuation. An MCP-connected agent is more valuable when it can retrieve structured company data and craft a clear financial narrative.

Intrinio earns its place in this ranking for professional, fundamentals-based data, not just market prices. It’s valuable for AI workflows that support deliberate analysis, such as comparing companies, reviewing business quality, identifying financial strengths, and preparing structured research similar to early analyst work.

This makes Intrinio especially appealing for teams that want to build agents for equity research, corporate analysis, and data-backed investment workflows. The trade-off is that it may be more than necessary for a lightweight ticker-monitoring agent. But when the job is to reason about companies with more financial depth, Intrinio deserves its place high in the ranking.

Quickstart Example

A representative prompt could be:

Use the Intrinio MCP integration to analyze MSFT’s latest financial performance,
summarize revenue growth, operating margin, cash flow trends, and key balance
sheet observations, then explain what stands out for an equity research brief.

Best for

Developers and teams building financial agents for company fundamentals, equity research, earnings analysis, valuation support, and more structured business-performance summaries.

5. FactSet (Best for Enterprise Financial Intelligence)

FactSet is the top choice for organizations wanting enterprise-grade financial agents based on institutional data, research workflows, and decision-support tools. Unlike other providers, it focuses on powering AI in serious financial environments like investment banking, wealth management, and institutional research.

FactSet also stands out because of how naturally it fits the broader movement toward AI agents embedded in financial workflows. In February 2026, Anthropic announced new enterprise AI plug-ins developed with partners including FactSet, aimed at work in investment banking, wealth management, and other professional domains. That does not make FactSet a lightweight public MCP server in the same way Alpha Vantage is positioned, but it does strengthen its place in this article as the premium option for teams thinking about agentic finance at an institutional level.

For an MCP-based stock market article, FactSet works best as the enterprise intelligence pick: the provider for teams that care about combining financial data access with mature workflow context, internal research processes, and high-stakes analysis. An agent connected to FactSet-style capabilities could support more advanced tasks such as summarizing company developments, reviewing investment opportunities, generating analyst-style briefs, or helping build pitch materials that rely on trusted financial information rather than generic web retrieval.

The trade-off is that FactSet is not the most approachable choice for individual developers or smaller prototypes. Its value becomes clearest when the application is part of a broader professional research stack, and the user needs institutional depth, workflow integration, and enterprise credibility. For those use cases, FactSet deserves the final place in this ranking — not because it is the easiest option, but because it is one of the most powerful when the objective is serious financial intelligence.

Quickstart Example

Because FactSet’s public AI positioning is more enterprise- and partner-led than self-serve, I would use an example prompt rather than a generic public MCP connection snippet.

Use the FactSet-connected financial agent to review a target company,
summarize recent business developments, highlight major financial risks,
and prepare an executive-style research brief for an investment discussion.

Best for

Enterprise teams, investment professionals, and financial institutions that want AI agents built around institutional-grade research workflows, professional market intelligence, and high-value financial analysis.

Conclusion

The best MCP server for stock market data depends on the kind of financial agent you want to build:

Alpha Vantage: Best overall for broad market coverage and flexible agent workflows
Nasdaq Data Link: Best for research-grade financial datasets
Tiingo: Best for market data with stronger context for briefings and analysis
Intrinio: Best for fundamentals-driven company research
FactSet: Best for enterprise financial intelligence

That’s all for now. I hope it helps!

The GenAI Skill Data Professionals Need Most: Evaluation

Cornellius Yudha Wijaya — Thu, 14 May 2026 17:01:21 GMT

GenAI has made output generation cheap.

A decent prompt can already produce summaries, classifications, SQL explanations, insight drafts, documentation, and stakeholder-facing text.

The harder part is deciding:

Is the output correct?
Is it grounded in the input?
Is it consistent across realistic cases?
Would I trust it inside an actual workflow?

That is the opening for data professionals. Evaluation turns GenAI from a demo into something that can be used responsibly in real work.

OpenAI’s current eval guidance makes the same distinction: the useful kind of eval is not a public benchmark. It is a task-specific test for the application you are building.

By the end of this article, you should have a clearer view of what evaluation means in practice, what to test for, and why data professionals are well-positioned to own this skill.

Curious about it? Let’s get into it.

GenAI output is easy. Trusting it is not.

Prompting gets attention because it is visible.

Evaluation matters more because it decides whether the work holds up.

Most professionals first experience GenAI as a productivity tool: ask, receive, refine. That makes it tempting to treat a fluent output as a good output.

But GenAI systems are variable.

The same system can behave differently across prompts, data slices, and edge cases.

In professional settings, the cost of a wrong answer is higher than the cost of a slightly worse prompt.

That is why evaluation matters.

What does evaluation actually mean in real work?

Evaluation here does not mean comparing GPT-5.5 against Claude on a benchmark.

It means testing a GenAI workflow against the task’s standards.

The standard is to separate broad model benchmarks from specific evaluations you design for your own application. Google’s evaluation documentation also frames this as a test-driven process: define the task, prepare evaluation data, choose the quality criteria, and inspect results.

Consider workflows that data professionals might evaluate:

A GenAI assistant that summarizes KPI movements
A RAG system answering questions from internal policy documents
A classifier that tags customer feedback or support tickets
A model that explains SQL output to business users

These are not abstract research problems. They are real tasks that need real testing.

The four questions every GenAI workflow should answer

The heart of evaluation is not a long taxonomy of metrics. It comes down to four questions:

A. Did it complete the task?

This asks whether the GenAI application actually performed the task it was given, such as classifying the ticket into one of the allowed categories or summarizing the table instead of restating it.

In technical terms, this is often measured through deterministic checks: regex matching for expected formats, JSON schema validation, or exact-match accuracy against a predefined label set.

Exact Match = 1 if output matches ground truth, else 0

This is the simplest evaluation dimension, but also the most often skipped.
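As a minimal sketch of these deterministic checks, the snippet below shows exact match, JSON validation, and a regex format check using only the standard library. The label set, the "label" field name, and the TCK-00123 ticket ID format are assumptions made up for illustration, not part of any real system.

```python
import json
import re

# Hypothetical label set for a support-ticket classifier.
ALLOWED_LABELS = {"billing", "bug", "feature_request", "other"}


def exact_match(output: str, truth: str) -> int:
    """Return 1 if the normalized output equals the ground truth, else 0."""
    return int(output.strip().lower() == truth.strip().lower())


def is_valid_ticket_json(raw: str) -> bool:
    """Check that the output parses as JSON and carries an allowed label."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict):
        return False
    label = parsed.get("label")
    return isinstance(label, str) and label in ALLOWED_LABELS


def matches_ticket_id_format(text: str) -> bool:
    """Regex check for a hypothetical ticket ID format such as TCK-00123."""
    return re.fullmatch(r"TCK-\d{5}", text) is not None
```

Checks like these are cheap to run on every output, which is why they are a sensible first layer before any semantic scoring.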

B. Is it correct and grounded?

Did it stay faithful to the source data, retrieved context, or provided evidence?

This is especially important for RAG systems, analytical summaries, and policy assistants.

Basic text overlap metrics such as ROUGE or BLEU are not sufficient here. They check surface-level word similarity but miss meaning.

A stronger approach is to compare the output and source in the embedding space. The standard metric for that is cosine similarity:

Cosine Similarity(A, B) = (A · B) / (||A|| × ||B||)

A high cosine similarity means the generated text is semantically close to the source. A low score may indicate the model drifted or fabricated information.
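To make the formula concrete, here is a small sketch in plain Python. The three vectors are made-up stand-ins for embeddings. In practice they would come from whichever embedding model you use, which is outside the scope of this example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Compute (A . B) / (||A|| * ||B||) for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: a source passage, a faithful summary, an off-topic output.
source_vec = [0.2, 0.7, 0.1]
summary_vec = [0.25, 0.65, 0.05]
unrelated_vec = [0.9, -0.1, 0.4]

faithful_score = cosine_similarity(source_vec, summary_vec)
drifted_score = cosine_similarity(source_vec, unrelated_vec)
```

With these illustrative vectors, the faithful summary scores much closer to 1 than the off-topic output. The threshold for "too low" is something you would need to calibrate on your own labeled examples.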

For RAG systems specifically, you also need to check whether the retrieval step actually found the right documents before the model started generating. That is where Recall@K matters:

Recall@K = |Relevant Documents ∩ Top-K Retrieved| / |Relevant Documents|

If the retriever misses the relevant source, even a perfect generator will produce the wrong answer.
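A short sketch of Recall@K, assuming you already have a labeled set of relevant document IDs per question and the retriever's ranked output. The document IDs below are hypothetical.

```python
def recall_at_k(relevant: set[str], retrieved: list[str], k: int) -> float:
    """Share of relevant documents that appear in the top-k retrieved list."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    top_k = set(retrieved[:k])
    return len(relevant & top_k) / len(relevant)


# Hypothetical example: two policy documents answer the question.
relevant_docs = {"policy_leave", "policy_travel"}
retrieved_docs = ["policy_travel", "faq_it", "policy_leave", "policy_expense"]

recall_at_2 = recall_at_k(relevant_docs, retrieved_docs, k=2)
recall_at_3 = recall_at_k(relevant_docs, retrieved_docs, k=3)
```

Here only one of the two relevant documents is in the top 2, while both are in the top 3, so the score depends on how many documents you actually pass to the generator.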

These are only a few examples of generative AI evaluation metrics.

C. Is the quality consistent across realistic cases?

A single good example proves very little.

The system should be tested against various cases, such as:

Easy cases
Ambiguous cases
Incomplete inputs
Edge cases
Cases where the correct response is “not enough information.”

You already know not to trust a model based on a single clean validation sample.

GenAI deserves the same discipline. That means building automated test suites that run every time the prompt, model version, or pipeline changes.

D. How does it fail?

This is the most important professional instinct.

A workflow that fails clearly is easier to manage than one that sounds confident while being wrong.

A good evaluation should reveal things such as:

Unsupported claims
Fabricated numbers
Hidden assumptions
Formatting failures
Misleading simplifications

This is where logging and tracing tools become important. Capturing inputs and outputs systematically lets you identify failure patterns instead of guessing.

The job is not to prove that the system works. The job is to learn where it fails before someone depends on it.

An example: evaluating a KPI commentary assistant

To make this concrete, consider a workflow that many data teams will recognize.

A GenAI assistant receives a monthly KPI table and writes a short business commentary.

The expected output should:

Identify that revenue declined
Mention lower acquisition and higher churn
Avoid claiming causality unless supported
Keep the commentary concise and business-appropriate

Then you evaluate the commentary against each of those expectations.

This gives you something concrete to recognize in your own work. It also separates your evaluation from a generic evaluation.
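The expectations above can be sketched as a binary rubric. Everything in this snippet is an assumption for illustration: the KPI values are hypothetical, the keyword lists are a crude stand-in for a proper judge, and the number check simply flags any figure in the commentary that does not appear in the table.

```python
import re

# Hypothetical KPI values; the real table would come from your reporting layer.
kpi_table = {
    "revenue_change_pct": -8.0,
    "new_customers_change_pct": -12.0,
    "churn_rate_pct": 4.5,
}

CAUSAL_PHRASES = ("because", "caused by", "due to", "driven by")
DECLINE_WORDS = ("declin", "fell", "drop", "decreas")


def unsupported_numbers(commentary: str, table: dict[str, float]) -> list[float]:
    """Return numbers in the commentary whose magnitude is absent from the table."""
    allowed = {abs(value) for value in table.values()}
    found = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", commentary)]
    return [n for n in found if n not in allowed]


def score_commentary(commentary: str, table: dict[str, float]) -> dict[str, int]:
    """Score one commentary with 0/1 checks that mirror the expected output."""
    text = commentary.lower()
    mentions_decline = "revenue" in text and any(w in text for w in DECLINE_WORDS)
    mentions_drivers = ("acquisition" in text or "new customers" in text) and "churn" in text
    return {
        "mentions_revenue_decline": int(mentions_decline),
        "mentions_acquisition_and_churn": int(mentions_drivers),
        "no_causal_claim": int(not any(p in text for p in CAUSAL_PHRASES)),
        "numbers_grounded": int(not unsupported_numbers(commentary, table)),
        "concise": int(len(commentary.split()) <= 60),
    }


good_commentary = "Revenue declined 8% this month. New customers fell 12% and churn rose to 4.5%."
bad_commentary = "Revenue dropped 15% because of weak marketing."

good_scores = score_commentary(good_commentary, kpi_table)
bad_scores = score_commentary(bad_commentary, kpi_table)
```

The first commentary passes every check. The second fails on a fabricated number, an unsupported causal claim, and the missing acquisition and churn context. Keyword rules like these miss paraphrases, so in real use I would treat them as a first filter, not a final verdict.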

A simple evaluation loop for data professionals

You do not need a full evaluation framework to start.

A five-step loop is enough:

1. Define the job clearly. What should the GenAI system actually do?

2. Collect representative test cases. Not just the clean ones. Include strange, messy, and borderline cases.

3. Write a simple scoring rubric. For structured outputs, use exact-match checks. For subjective quality, use a binary (0/1) rubric or a short LLM-as-a-judge prompt.

4. Compare outputs across prompts, models, or versions against the same cases. Eval guidance emphasizes using evals to test and iterate rather than relying on ad hoc inspection.

5. Record recurring failures. The failure pattern matters more than one overall score.

The evaluation stack reflects this general pattern as well: evaluation datasets, rubric-based measures, deterministic metrics where suitable, and custom checks for task-specific requirements.
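Here is one way the loop could look in code, as a rough sketch and not a framework. The test cases, tags, and the keyword_classifier function are hypothetical. In a real setup, the system under test would call your prompt or model instead.

```python
from collections import Counter
from typing import Callable

# Step 2: representative cases, including one where the right answer is to abstain.
TEST_CASES = [
    {"input": "I was charged twice this month", "expected": "billing", "tag": "easy"},
    {"input": "app crashes when i upload", "expected": "bug", "tag": "easy"},
    {"input": "???", "expected": "not_enough_information", "tag": "incomplete"},
]


def keyword_classifier(text: str) -> str:
    """Stand-in for one prompt or model version; a real system would call an LLM."""
    lowered = text.lower()
    if "charged" in lowered:
        return "billing"
    if "crash" in lowered:
        return "bug"
    return "other"


def run_eval(system: Callable[[str], str], cases: list[dict]) -> dict:
    """Steps 3 to 5: score with exact match and count failures per case tag."""
    failures: Counter = Counter()
    passed = 0
    for case in cases:
        if system(case["input"]) == case["expected"]:
            passed += 1
        else:
            failures[case["tag"]] += 1
    return {"pass_rate": passed / len(cases), "failures_by_tag": dict(failures)}


report = run_eval(keyword_classifier, TEST_CASES)
```

Running the same cases against each new prompt or model version gives you comparable reports. In this toy run, the overall pass rate hides the more useful fact that every failure sits in the "incomplete" cases.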

Why data professionals are well-positioned to own this

Evaluation is not foreign to data professionals.

It draws on habits we already use:

Defining success criteria
Building representative samples
Distinguishing anecdotes from evidence
Checking edge cases
Analyzing errors instead of celebrating one good result
Deciding whether an output is fit for use

Data professionals already have much of the mindset evaluation requires.

The shift is applying that discipline to generative systems, not only predictive models.

That is why evaluation may become one of the most valuable GenAI-adjacent skills for analysts, data scientists, analytics engineers, and ML practitioners.

Conclusion

Evaluation is what turns GenAI from something impressive into something dependable. For data professionals, that matters more than learning a clever prompt pattern or chasing the newest model release.

The real advantage comes from knowing how to define good output, test it against realistic cases, trace where it fails, and decide whether it is ready to support actual work.

That habit is already close to how strong data professionals think: establish the standard, examine the evidence, and avoid trusting a result just because it looks convincing. GenAI simply gives that discipline a new place to matter. As these systems move deeper into analysis, reporting, search, and decision support, the professionals who can evaluate them well will be the ones who make them genuinely useful.

The ML Skills That Still Matter in 2026

Cornellius Yudha Wijaya — Thu, 30 Apr 2026 02:01:17 GMT

A lot of data professionals are trying to decide what to learn next.

That is understandable.

The field feels crowded now. There is classical machine learning. Deep learning. Generative AI. RAG. Agents. Evaluation. MLOps. Analytics engineering. Data products. Every few months, a new tool or workflow becomes the thing everyone talks about.

So the question narrows to this:

Which machine learning skills are still worth learning in 2026?

The answer is not “learn every model.”
It is also not “ML is dead because GenAI can do everything.”

A better answer is this:

The ML skills that still matter are the skills that help you turn data into reliable decisions.

That includes framing the problem, checking the data, building baselines, choosing metrics, validating properly, analyzing errors, and monitoring what happens after deployment.

These skills matter because AI adoption is no longer experimental. Research shows that generative AI adoption has moved quickly into the mainstream, with reported population adoption reaching 53% within three years. But adoption does not automatically mean reliability. The more AI systems enter daily workflows, the more important it becomes to know whether the output can be trusted.

That is where machine learning fundamentals still matter.

By the end of this article, you should have a clearer answer to three questions:

Which ML skills are still worth learning?
Why do they still matter in a GenAI-heavy world?
What should you actually practice if you want to stay useful as a data professional?

Curious about it? Let’s get into it.

1. Turn vague business problems into ML problems

The first ML skill that still matters is problem framing.

Most weak ML projects do not fail because the model was not advanced enough. They fail because the problem was unclear from the beginning.

Someone says:
“Can we predict churn?”

That sounds like a machine learning problem. But it is not specific enough yet. Before modeling, you need to define, for example:

What is the prediction target? Churn within the next 30 days

What is the prediction unit? One customer account

When is the prediction made? Every Monday morning

What data is available at that time? Activity, billing, support, and product usage up to Sunday night

What action will follow the prediction? Retention team contacts high-risk accounts

What does success mean? Lower churn rate, higher retained revenue, better prioritization

Without these definitions, the model may still produce a score. But the score may not be useful.

This is the first practical lesson: A machine learning problem is not defined by the model you use. It is defined by the decision you want to improve.

A good ML practitioner knows how to translate a messy business request into something testable. That means asking whether we are predicting, ranking, classifying, detecting, recommending, or forecasting. You need to identify the exact target variable, the relevant time window, the information available at prediction time, the end-user of the result, and exactly what they will do differently because of the model.

This skill still matters in 2026 because GenAI tools can help you write code faster, but they do not automatically define the right problem for you. If the target is wrong, the rest of the pipeline is wasted effort.

2. Audit the data before trusting the model

The second skill is data auditing. This is more than “clean the data.”

Cleaning data usually means handling missing values, fixing formats, removing duplicates, and standardizing columns. Data auditing goes deeper. It asks whether the data is actually valid for the problem. For an ML project, you need to inspect at least five things:

Label quality: Is the target variable correct and consistently defined?

Data availability: Would these features be available when the prediction is made?

Leakage: Does the dataset contain future information?

Sampling bias: Does the training data represent the population where the model will be used?

Stability: Are the patterns likely to hold over time?

Data leakage is especially important. Scikit-learn’s documentation warns that leakage can produce overly optimistic performance because information from the test set or future data accidentally enters the training process.

A simple example:

You are building a churn model. Your dataset includes a column called cancellation_reason. The model performs extremely well.

But there is a problem. That column only exists after the customer has already churned. The model is not learning early churn signals. It is reading the answer key.

This happens more often than people admit. It can appear in many forms:

Credit risk: using a manual review result that is produced after the application is submitted

Fraud detection: using an investigation outcome as an input feature

This is why data auditing remains a core ML skill. A data professional should be able to look at a dataset and ask: “Would I know this information at the moment I need to make the prediction?” If the answer is no, the feature probably should not be there.
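That question can be turned into a simple automated check if you record when each feature becomes known. The sketch below assumes such metadata exists. The feature names and timestamps are hypothetical.

```python
from datetime import datetime

# Hypothetical metadata: when each feature's value becomes known for a customer.
feature_known_at = {
    "logins_last_30d": datetime(2026, 3, 1, 23, 59),
    "open_support_tickets": datetime(2026, 3, 1, 23, 59),
    "cancellation_reason": datetime(2026, 3, 20, 10, 0),  # only recorded after churn
}

# The moment the score is produced.
prediction_time = datetime(2026, 3, 2, 8, 0)


def find_leaky_features(known_at: dict[str, datetime], cutoff: datetime) -> list[str]:
    """Return features whose values are only known after the prediction cutoff."""
    return sorted(name for name, ts in known_at.items() if ts > cutoff)


leaky = find_leaky_features(feature_known_at, prediction_time)
```

In this example only cancellation_reason is flagged. The hard part in practice is not the comparison but getting trustworthy availability timestamps for each column.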

3. Build simple baselines before complex models

The third skill is baseline building. A baseline is the simplest model or rule that gives you a reference point. It could be a majority-class classifier, a simple business rule, logistic regression, a decision tree, a moving average, a keyword-based classifier, or a simple ranking score.

Baselines are not boring. They are protection against unnecessary complexity.

Google’s Rules of Machine Learning still emphasize starting with simple models and robust infrastructure before moving into more complex ML systems. That advice has aged well because the easiest mistake in modern AI work is to overbuild too early.

Suppose you are building a lead scoring model. A simple baseline might be: Score leads higher if they visited the pricing page, opened two emails, and work at a company above a certain size.

Then you compare that with a logistic regression model. Then maybe a gradient boosting model. Only after that should you consider something more complex.

The practical question is not: “Can I use a more advanced model?”

The better question is: “Does the more advanced model produce enough improvement to justify the added complexity?”

That complexity includes harder debugging, higher maintenance costs, more difficult explanations, more monitoring requirements, more fragile deployments, and more stakeholder confusion.

In many real business settings, the best model is not the most advanced model. It is the simplest model that performs well enough and can be trusted by the people who need to use it.

4. Choose metrics based on the cost of mistakes

The fourth skill is metric selection. This is where many ML projects become misleading. People often ask: “What is the model accuracy?”

But accuracy is not always the right metric. For example, imagine a fraud detection dataset where only 1% of transactions are fraudulent. A model that predicts “not fraud” for everything can be 99% accurate.

That sounds excellent. It is also useless.
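The arithmetic is easy to verify with a toy dataset. This sketch builds 1,000 hypothetical transactions with a 1% fraud rate and scores a "model" that never predicts fraud.

```python
# 1,000 hypothetical transactions, 1% fraudulent (1 = fraud, 0 = not fraud).
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000  # a "model" that always predicts "not fraud"


def accuracy(actual: list[int], predicted: list[int]) -> float:
    """Share of predictions that match the actual label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def recall(actual: list[int], predicted: list[int]) -> float:
    """Share of actual positives that the model caught."""
    true_pos = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    actual_pos = sum(actual)
    return true_pos / actual_pos if actual_pos else 0.0


model_accuracy = accuracy(y_true, y_pred)  # 0.99
model_recall = recall(y_true, y_pred)      # 0.0
```

The accuracy is 99%, yet recall is zero: the model catches none of the ten fraudulent transactions.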

Metric choice depends on the type of mistake you care about.

For instance:

Fraud detection: recall, precision, and false positive rate

Medical screening: recall and false negatives

Spam detection: precision and false positives

Credit risk: calibration, precision, recall, and expected loss

Recommendation systems: ranking metrics, conversion, and retention

This is one of the most useful ML skills for real work: You need to connect the metric to the decision.

If the business can only contact 500 customers per week, then overall accuracy may not matter much. What matters is whether the top 500 predicted customers are actually worth contacting.

In that case, you may care about precision at K, lift in the top decile, expected retained revenue, conversion from intervention, or the cost per saved customer.
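As an illustration of precision at K, the sketch below ranks customers by predicted score and checks how many of the top K were actually worth contacting. The scores and outcomes are made up, and K is 3 instead of 500 to keep the example readable.

```python
def precision_at_k(scores: list[float], outcomes: list[int], k: int) -> float:
    """Share of positive outcomes among the k highest-scored items."""
    if k <= 0 or not scores:
        raise ValueError("k must be positive and scores must not be empty")
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    top_k = ranked[:k]
    return sum(outcome for _, outcome in top_k) / len(top_k)


# Hypothetical churn scores and whether each customer actually churned.
scores = [0.91, 0.85, 0.40, 0.77, 0.10, 0.66]
churned = [1, 0, 0, 1, 0, 1]

p_at_3 = precision_at_k(scores, churned, k=3)
base_rate = sum(churned) / len(churned)
lift_at_3 = p_at_3 / base_rate
```

Two of the top three customers churned, against a base rate of one in two, so the lift in this toy example is about 1.33. That number speaks directly to the capacity constraint, which overall accuracy does not.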

The right metric depends on the operating reality. This is why ML is not only a technical exercise. A model is evaluated inside a business process. If the metric ignores that process, the evaluation is incomplete.

5. Validate without fooling yourself

The fifth skill is validation. This is where you test whether the model can generalize.

A model can perform well on the data it has seen. That does not mean it will perform well on new data. This is why train/test splits, cross-validation, time-based validation, and holdout sets still matter.

Scikit-learn’s cross-validation documentation highlights the importance of evaluating estimator performance properly and avoiding pitfalls that make performance estimates unreliable.

The important idea is simple: Your validation setup should imitate the real situation where the model will be used.

For random customer classification, a random train/test split might be fine. For time-dependent problems, it may not be. If you are forecasting demand, predicting churn, detecting fraud, or scoring leads over time, you often need time-based validation. That means training on the past and testing on the future.

For example, you might train on data from January to March and validate on April. Then train on January to April and validate on May, and finally train on January to May and validate on June.

This is closer to the real world. You do not get to train in June to predict April. Validation should respect time.
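The expanding-window scheme described above can be sketched in a few lines. The helper name and the month labels are my own for illustration.

```python
from typing import Iterator


def expanding_window_splits(
    periods: list[str], min_train: int
) -> Iterator[tuple[list[str], str]]:
    """Yield (train_periods, validation_period) pairs that never look ahead."""
    for i in range(min_train, len(periods)):
        yield periods[:i], periods[i]


months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
splits = list(expanding_window_splits(months, min_train=3))

for train, validate in splits:
    print(f"train on {train[0]}-{train[-1]}, validate on {validate}")
```

Every validation month comes strictly after its training months. If you work with scikit-learn, its TimeSeriesSplit class offers a similar expanding-window split on row indices.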

A strong data professional knows how to ask:

Was the test set kept separate?
Was the preprocessing fitted only on the training data?
Was the split random when it should have been time-based?
Were duplicate users or records split across train and test?
Was the test set reused too many times during model selection?
Does performance hold across segments?

This skill matters because bad validation creates false confidence. And false confidence is dangerous.

6. Analyze errors, not just scores

The sixth skill is error analysis. A single score is not enough.

A model with 88% accuracy may still fail badly for the most important cases. A forecasting model with a good average error may still miss peak demand. A churn model may perform well overall but fail for enterprise customers. A document classifier may work for clean English documents but fail for short, messy, multilingual text.

This is why error analysis matters. After you evaluate the model, you should inspect where it fails.

A practical error analysis table might look like this:

New users: Does the model fail when history is limited?

High-value customers: Does it work for the accounts that matter most?

Geography: Does performance differ by country or region?

Product category: Does it fail on long-tail categories?

Time period: Does it degrade during holidays or campaigns?

Input quality: Does messy or incomplete data hurt performance?

Minority class: Does the model ignore rare but important cases?

This is where ML work becomes diagnostic. You stop asking only: “Is the model good?” You start asking: “Where is the model useful, where is it weak, and where should we not trust it?”
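A minimal sketch of that diagnostic habit is to break one overall score into per-segment scores. The records and segment names below are hypothetical.

```python
from collections import defaultdict

# Hypothetical scored records: actual outcome and model prediction per customer.
records = [
    {"segment": "new_user", "actual": 1, "predicted": 0},
    {"segment": "new_user", "actual": 0, "predicted": 0},
    {"segment": "enterprise", "actual": 1, "predicted": 1},
    {"segment": "enterprise", "actual": 1, "predicted": 0},
    {"segment": "smb", "actual": 0, "predicted": 0},
    {"segment": "smb", "actual": 1, "predicted": 1},
]


def accuracy_by_segment(rows: list[dict]) -> dict[str, float]:
    """Compute accuracy separately for each segment."""
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for row in rows:
        totals[row["segment"]] += 1
        correct[row["segment"]] += int(row["actual"] == row["predicted"])
    return {segment: correct[segment] / totals[segment] for segment in totals}


overall = sum(r["actual"] == r["predicted"] for r in records) / len(records)
per_segment = accuracy_by_segment(records)
```

The overall accuracy here is about 67%, but the breakdown shows the model is perfect on one segment and no better than a coin flip on the other two. With real data you would also want enough rows per segment for the numbers to be meaningful.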

That is a much better standard. This skill also connects directly to GenAI evaluation. When evaluating an LLM workflow, the same habit applies. You do not only ask whether the average output looks good. You ask which prompts fail, which user intents fail, which document types fail, which languages fail, which edge cases create hallucination, and which outputs need human review.

The models changed, but the evaluation habit still transfers.

7. Understand monitoring after deployment

The seventh skill is model monitoring. A model is not finished when it is deployed. It enters a changing environment.

Customer behavior changes. Product features change. Fraud patterns change. Marketing channels change. Economic conditions change. Data pipelines change. Even column definitions can change.

Google Cloud’s model monitoring documentation discusses feature skew and drift detection for deployed models, including changes in categorical and numerical input features.

For most data professionals, the key idea is not complicated: A model can become worse even if the code does not change.

That means you need to monitor:

Data freshness: Did the latest batch arrive?

Input distribution: Are feature values changing?

Prediction distribution: Are model scores suddenly higher or lower?

Segment performance: Is one customer group degrading faster?

Business outcome: Is the model still improving the intended metric?

Pipeline health: Are transformations still working correctly?

Human feedback: Are users overriding or ignoring the model?

This is one reason ML skills remain useful in 2026. Many AI systems fail quietly. They do not always break in obvious ways. They drift. They degrade. They become misaligned with the current workflow. Monitoring is how you notice before the damage becomes expensive.
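One common way to quantify a shift in an input or prediction distribution is the Population Stability Index (PSI). It is my choice for this sketch, not something the documentation above prescribes. The bin counts are hypothetical, and any alert threshold is something you would need to set for your own use case.

```python
import math


def population_stability_index(
    expected: list[int], actual: list[int], eps: float = 1e-6
) -> float:
    """PSI over matching bins: sum((a - e) * ln(a / e)) on bin proportions."""
    if len(expected) != len(actual):
        raise ValueError("both distributions must use the same bins")
    expected_total = sum(expected)
    actual_total = sum(actual)
    psi = 0.0
    for e_count, a_count in zip(expected, actual):
        # Floor the proportions at eps so empty bins do not break the logarithm.
        e = max(e_count / expected_total, eps)
        a = max(a_count / actual_total, eps)
        psi += (a - e) * math.log(a / e)
    return psi


# Hypothetical counts of model scores per bin: training period vs. this week.
training_bins = [100, 300, 400, 200]
this_week_bins = [250, 350, 300, 100]

psi_same = population_stability_index(training_bins, training_bins)
psi_shifted = population_stability_index(training_bins, this_week_bins)
```

Identical distributions give a PSI of zero, and the value grows as the two distributions move apart. Tracking it per feature and for the prediction scores is one lightweight way to notice drift before the business metric moves.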

8. Explain the model in business terms

The eighth skill is communication. But not generic communication. The useful skill is being able to explain the model in terms of decisions, trade-offs, and risk.

A stakeholder does not only need to know that the model has 0.82 AUC. They need to know what that means.

For example: “The model is useful for ranking customers by churn risk, but it should not be used as an automatic cancellation prediction. It works better for customers with at least three months of activity history. For new customers, the signal is weaker.”

That explanation is much more useful than a metric alone. A good ML explanation should include:

Purpose: what decision the model supports
Scope: where it should and should not be used
Performance: how well it works and compared to what
Failure modes: where it performs poorly
Trade-offs: what happens if we optimize for precision vs recall
Action: what users should do with the output
Monitoring: what needs to be checked after deployment

This is especially important as AI systems become more embedded in business workflows. NIST’s AI Risk Management Framework emphasizes test, evaluation, verification, and validation across the AI lifecycle. That kind of thinking is not only for regulators or governance teams. It is also practical for data professionals who need to explain when a model is reliable enough to use.

The best ML people are not only model builders. They are translators between model behavior and business action.

So what should you learn in 2026?

If you are learning ML now, do not structure your learning around a long list of algorithms. Structure it around the workflow. Here is a better learning path.

Skill | What to learn | Practice

Problem framing | Prediction target, decision point, action, success metric | A one-page ML problem definition

Data audit | Leakage, label quality, missingness, sampling, availability | A data quality checklist

Baseline modeling | Rules, logistic regression, trees, simple benchmarks | A baseline comparison table

Metric selection | Precision, recall, F1, AUC, calibration, ranking metrics | A metric justification note

Validation | Train/test split, cross-validation, time-based split | A validation design

Error analysis | Segment performance, false positives, false negatives | An error analysis report

Monitoring | Drift, skew, data freshness, prediction distribution | A monitoring checklist

Communication | Trade-offs, limitations, recommended use | A stakeholder summary

This is the part many courses skip. They teach the model before the workflow.

But in real work, the workflow is what makes the model useful.

The skills that still matter

So, what ML skills still matter in 2026? These ones:

Framing a vague business problem into a clear ML task
Auditing data before trusting it
Detecting leakage and bad labels
Building simple baselines
Choosing metrics based on business cost
Validating models correctly
Analyzing errors by segment
Monitoring model behavior after deployment
Explaining limitations clearly

These skills are not trendy. But they are durable. They matter for classical ML. They matter for GenAI evaluation. They matter for RAG systems. They matter for AI products.

They matter whenever someone asks: “Can we trust this output enough to use it?”

That is the real reason ML still matters. Not because every data professional needs to train models from scratch. Not because classical ML is competing with GenAI. But because ML teaches the discipline behind reliable AI work.

In 2026, the most valuable data professionals will not be the ones who chase every new tool. They will be the ones who can build, evaluate, question, and explain AI systems clearly.

That is still machine learning.

And it still matters.

What Real SQL Work Taught Me About Being a Data Scientist

Cornellius Yudha Wijaya — Sat, 28 Mar 2026 15:07:48 GMT

Image by Ideogram.ai

"Real SQL work taught me that trustworthy definitions matter more than flashy queries."

I did not start by taking SQL seriously

Early in my career, I did not see SQL as central to being a data scientist. Most of my learning was built around Python, and the classes and bootcamps I joined reinforced that view. Python felt like the real language of data science. SQL felt useful, but distant.

So I did not reject it. I simply did not get enough exposure to it.

That distinction matters. When your early learning path is dominated by notebooks, models, and Python libraries, it is easy to assume that the real work starts once the data is already in front of you. In that worldview, SQL looks like preparation work. Helpful, yes. Foundational, no.

Real work changed that view gradually.

The more I worked in corporate settings, the clearer it became that many projects do not begin with modeling, dashboards, or machine learning. They begin with a more basic set of questions: Is the data available? Is the definition correct? Can the result be trusted enough for someone to act on it?

Work forced the lesson

What changed my view of SQL was not one dramatic moment. It was the accumulation of projects. Again and again, the work pulled me toward the same reality: before anything becomes an analysis, a model, or a recommendation, someone has to make sure the data is available, correctly defined, and usable.

That is where SQL kept appearing.

Sometimes the request looked simple. A business team needed a report. Sometimes the request sounded more strategic. A project needed insight to inform a decision. Sometimes the work moved beyond a single analysis into the project's production life. In each case, SQL mattered not only for retrieving data, but for deciding whether the project itself rested on a solid foundation.

The difficult conversations were often not about syntax at all. They were about meaning. What exactly should count as a sale? Which time window should be used? Which source should be treated as the source of truth? If two tables produce different answers, which one reflects the real business process?

That was the point where SQL stopped feeling like a supporting skill and became infrastructure.

What real SQL work actually looked like

The lesson became clearer through a few recurring types of work. None of them were glamorous. They were simply the places where SQL kept proving its value.

Ad-hoc reporting and insight requests that looked simple but hid messy logic and scattered data.
Metric definition work, where the challenge was deciding what should count before writing the query.
Combining multiple data sources without destroying the business meaning of the result.
Preparing the right data for downstream analysis and modeling in Python.

1. Ad-hoc reporting taught me that simple requests are rarely simple

A lot of real SQL work starts with a seemingly harmless request. The business needs a report. Someone wants a quick performance update. A team asks for insight before a meeting. On paper, it sounds like a straightforward query.

In practice, it rarely is.

Sometimes the data is not available in one place. Sometimes it lives across several sources that were never designed to fit together neatly. Sometimes the logic needed to answer the question is more complicated than the request suggests. And often the timeline is short, so you do not have the luxury of slowly wading through the data.

That changed how I think about SQL skills. In real reporting work, the challenge is not just writing something that runs. The challenge is moving from a vague business question to a reliable answer under real constraints. That takes judgment, prioritization, and a clear sense of what the output needs to mean.

Useful SQL work is often less glamorous than people expect. It is not always about elegant tricks. Very often, it is about getting the right answer quickly enough to matter, without breaking the logic behind it.

2. Metric definition matters more than query complexity

If there is one area where real SQL work changed me the most, it is the definition of metrics.

In theory, a metric looks clean. In practice, even something as familiar as a sales number can go wrong depending on the time scope, exclusions, business rules, and source tables. A number can look precise and still be misleading if two teams are working from different assumptions or if one table captures the event differently from another.

That is why some SQL problems cannot be solved by clever syntax alone. You can write a technically correct query and still produce the wrong business answer.

The real work is often more basic and more demanding at the same time:

deciding what should count
deciding what should be excluded
choosing which table reflects the operational truth
making sure the result matches the way the business actually works

This is where collaboration becomes essential. There are many situations where the data exists, but understanding it requires discussion with business users who know the process behind the records. Without that alignment, a query may return rows but not the truth.

Over time, I started to see that some of the most dangerous problems in data work are not computational. They are definitional. A wrong definition can quietly damage a project, mislead stakeholders, or erode trust in the team long before anyone notices the issue.

3. Combining data sources is harder than it looks

Another lesson real SQL work taught me is that combining information from multiple sources without losing meaning is much harder than it first appears.

From the outside, joins can look like a purely technical step. In practice, they can become one of the most delicate parts of a project. Sometimes a clean primary key does not exist. Sometimes the relationship is not direct. Sometimes aggregation is needed before two datasets can even be compared. And sometimes each source reflects a slightly different view of the same business concept.

That creates several risks at once: duplicate rows, dropped records, timing mismatches, and numbers that appear structurally valid but are conceptually incorrect.

This is why SQL work often requires more collaboration than people expect. To combine sources responsibly, you frequently need validation from multiple stakeholders. The challenge is not merely to make the query run. The challenge is to preserve validity.

For me, this was one of the clearest moments where SQL became inseparable from business understanding. Good SQL was not just about retrieval. It was about preserving meaning as it moved across systems.
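The duplicate-row risk is easy to reproduce. The sketch below uses Python's built-in sqlite3 module with two hypothetical tables, orders and shipments, where one order ships in two parcels. It is an illustration of the failure mode, not a query from any real project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, amount REAL);
    CREATE TABLE shipments (shipment_id INTEGER, order_id INTEGER);
    INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
    -- Order 1 shipped in two parcels, so it appears twice in shipments.
    INSERT INTO shipments VALUES (10, 1), (11, 1), (12, 2);
""")

# The number the business expects: total sales across orders.
true_total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# A naive join repeats order 1 once per shipment and inflates the total.
naive_total = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders AS o
    JOIN shipments AS s ON s.order_id = o.order_id
""").fetchone()[0]

# Reducing shipments to one row per order before joining preserves the meaning.
safe_total = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders AS o
    JOIN (SELECT DISTINCT order_id FROM shipments) AS s
      ON s.order_id = o.order_id
""").fetchone()[0]

print(true_total, naive_total, safe_total)
```

Both joins run without error, but only one returns the right sales figure. The naive query reports 250 where the true total is 150, which is exactly the kind of number that looks structurally valid and is conceptually wrong.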

4. Even Python-heavy data science often begins with SQL

Because my early learning path emphasized Python, I initially imagined that most serious data-science work would begin there. In reality, SQL was often necessary before I could even start proper work in Python.

If the data lived in a SQL database, then SQL was the gatekeeper. It was how I extracted the relevant population, selected the appropriate time window, assembled the required columns, and checked whether the data were suitable for the task ahead. Whether the next step was exploratory analysis, feature preparation, modeling, or evaluation, SQL was often the first step.

That changed how I think about the relationship between SQL and data science. SQL is not simply what happens before the interesting work. Very often, it is part of the interesting work.

If the population is wrong, the feature set is incomplete, or the definition is unstable, the downstream Python work inherits that weakness. In that sense, SQL does not sit beneath data science. It sits inside it.

What I value in SQL work now

Real work also changed how I evaluate SQL skills in others and in myself.

I still care about writing cleaner, more efficient queries, especially as data grows larger and execution speed matters. But that is no longer the first thing I look for.

What I value first is this:

1. Correctness. The wrong data can quietly damage an entire project.
2. Stakeholder trust. Data work only becomes valuable when other people believe the result is dependable.
3. Maintainability. Many projects do not end after a single request, so someone has to live with the logic later.

A strong SQL practitioner, in my view, is not simply someone who knows a large amount of syntax. It is someone who understands the data definition, knows how to acquire the data in the most reliable way, and can produce logic that remains useful beyond the moment it was written.

What I would tell aspiring data scientists now

If your learning path has focused mostly on Python, I would say this clearly: do not treat SQL as optional.

You do not need to memorize every feature of the language before doing meaningful work. Documentation exists, and syntax can be learned as needed. But you do need to understand why SQL matters. It matters because data projects depend on access to the right data, under the right definitions, with logic that can withstand real business use.

That is the part I wish I had understood earlier. SQL is not important because it looks technical. It is important because it sits close to the truth conditions of data work. It is where data availability is tested. It is where definitions get challenged. It is where numbers either become trustworthy or fall apart.

For me, that has become one of the clearest professional lessons of real data work. SQL is not the opposite of data science, nor is it a lower-level skill beneath it. In many organizations, SQL is one of the foundations that allows data science to be useful at all.

And if there is one line I would leave readers with, it is this: real SQL work taught me that trustworthy definitions matter more than flashy queries.

If you are learning SQL now, learn it through real use cases. Learn it through reporting, metric definition, source validation, and the kind of business questions that force you to care about correctness.


Best Stock Market data API in the AI Agent era

Cornellius Yudha Wijaya — Sat, 14 Mar 2026 07:49:27 GMT

Photo by Nicholas Cappello on Unsplash

The stock market data API landscape is changing. In the past, developers mostly evaluated providers on familiar dimensions: coverage, latency, pricing, documentation, and reliability. Those criteria still matter, but the rise of LLM-powered copilots, autonomous research workflows, and multi-agent financial systems has introduced a new requirement: how easily can a data provider plug into agentic software?

In that environment, the strongest providers are not just the ones with broad datasets. They are the ones that expose clean, structured interfaces that AI agents can query, reason over, and combine with downstream tools for analysis, monitoring, and decision support. Some vendors are already leaning into this shift with MCP servers and LLM-oriented resources. Others remain stronger as enterprise data backbones than as explicitly AI-native platforms.

In this article, we will explore the best stock market data APIs in the AI agent era. Curious about it?

Let’s get into it.


Alpha Vantage

Overview

Alpha Vantage is a widely used financial data platform that provides real-time and historical market data APIs for equities, options, forex, cryptocurrencies, and macroeconomic indicators. The platform is designed to support both individual developers and professional trading systems through a simple, developer-friendly interface and a large catalog of market datasets.

A distinguishing feature of Alpha Vantage is the breadth of its data coverage. The platform delivers real-time and historical financial market data through programmatic APIs and spreadsheet integrations, enabling developers to build trading dashboards, quantitative research pipelines, and automated trading tools on top of a unified data interface.

The API also provides a rich library of built-in analytics—including technical indicators and fundamental datasets—allowing users to retrieve both raw market data and higher-level financial signals without implementing complex calculations themselves. In practice, this makes Alpha Vantage a flexible backbone for applications ranging from educational projects and fintech prototypes to production trading systems and investment research platforms.

What makes it valuable in the AI Agent era?

Alpha Vantage has become particularly relevant in the emerging ecosystem of LLM-powered financial tools and autonomous AI agents, largely because it provides structured market data in formats that are easy for agents and models to access, reason over, and integrate into automated workflows.

1. Native integration with AI agent ecosystems via MCP
Alpha Vantage provides an official Model Context Protocol (MCP) server, enabling large language models and agent-based applications to directly access financial data through standardized tools. The MCP server allows AI assistants and development environments to query real-time and historical stock market data programmatically, turning the API into a plug-and-play data source for agentic systems.

2. Compatibility with multi-agent financial research systems
Modern agentic trading frameworks increasingly rely on structured financial APIs like Alpha Vantage as data sources. For example, the open-source TradingAgents framework simulates a professional trading firm using multiple LLM-powered agents (fundamental analysts, technical analysts, sentiment analysts, traders, and risk managers) that collaborate to analyze equities and make decisions. The system uses the Alpha Vantage API as its core data backbone.

3. Documentation and developer assets optimized for machine consumption
Another advantage in the LLM era is the structure and accessibility of Alpha Vantage’s developer resources. The platform provides comprehensive API documentation, examples, and community libraries across many programming languages, making it straightforward for both humans and AI coding agents to integrate financial data pipelines. Because LLM-powered development tools rely heavily on structured documentation, well-defined API endpoints, and example code, this ecosystem of docs, SDKs, and README files makes Alpha Vantage particularly easy for AI systems to learn and use.
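To show what "structured" means in practice, here is a minimal sketch of the REST side in Python. The endpoint and the field names ("Time Series (Daily)", "4. close") reflect my understanding of Alpha Vantage's public documentation, so verify them against the current docs. The helper names are my own, and the sample payload is hand-written with made-up prices, not real market data.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.alphavantage.co/query"

def build_daily_url(symbol, api_key):
    """Build the request URL for daily bars (hypothetical helper)."""
    params = {"function": "TIME_SERIES_DAILY", "symbol": symbol, "apikey": api_key}
    return f"{BASE_URL}?{urlencode(params)}"

def parse_daily_closes(payload):
    """Turn a daily time-series payload into [(date, close)], oldest first."""
    series = payload.get("Time Series (Daily)")
    if series is None:
        # Assumption: error or throttling responses are JSON without this key.
        raise ValueError(f"no daily series in response, got keys: {list(payload)}")
    return sorted((day, float(bar["4. close"])) for day, bar in series.items())

# Hand-written sample shaped like a response; the numbers are invented.
sample = {
    "Meta Data": {"2. Symbol": "IBM"},
    "Time Series (Daily)": {
        "2026-03-13": {"1. open": "102.0", "4. close": "103.0"},
        "2026-03-12": {"1. open": "100.0", "4. close": "101.5"},
    },
}

url = build_daily_url("IBM", "demo")
closes = parse_daily_closes(sample)
print(url)
print(closes)
```

An agent-facing tool is essentially this parser with a schema around it: a predictable input, a predictable output, and an explicit failure when the series is missing.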

In short

Alpha Vantage’s combination of structured financial APIs, an MCP interface for AI agents, and extensive developer documentation positions it as a data infrastructure layer for the emerging generation of AI-powered trading tools, research agents, and autonomous financial analysis systems.

Tradier

Overview

Tradier is a brokerage-focused API platform that combines market data, account access, and trading functionality. Its public API supports real-time, delayed, and historical market data through both request/response endpoints and streaming interfaces, while also exposing brokerage capabilities such as account information, positions, orders, watchlists, and trade execution.

A key differentiator is that Tradier is not just a data API. It is part of a brokerage stack. That means developers can use it not only to retrieve quotes, options chains, time-and-sales data, and historical pricing, but also to connect agentic workflows directly to trading and portfolio actions. Tradier also supports HTTP and WebSocket streaming, which is useful when building systems that need fast updates rather than purely batch-style analysis.

Tradier’s market-data positioning is more U.S.-brokerage-centric than broad all-asset-class research platforms. Real-time data is available to Tradier Brokerage account holders for U.S. stocks and options, and delayed data follows the standard 15-minute model for non-real-time access. That makes Tradier particularly compelling for execution-oriented applications rather than for the widest possible global dataset footprint.

What makes it valuable in the AI Agent era?

1. MCP and LLM-oriented documentation
Tradier is unusually forward-leaning in how it adapts its docs for the LLM era. Its documentation includes llms.txt, dedicated LLM resources, and a Tradier MCP section. Tradier’s own MCP documentation says users can access market data, account details, documentation, and even place trades from within connected AI tools. That makes Tradier one of the few providers publicly bridging financial APIs and conversational interfaces in a first-class way.

2. Strong fit for execution-capable agents
Many financial APIs stop at data retrieval. Tradier goes further by combining data access with brokerage actions such as order placement, account history, positions, and balances. In the AI agent era, that matters because the most interesting systems are often not just research agents but action-taking agents. Tradier is therefore especially relevant for developers building guarded execution workflows, trading copilots, or semi-autonomous assistants that need both read and act capabilities.

3. Streaming interfaces for real-time agent loops
Tradier supports both HTTP and WebSocket streaming for market and account data. That is important for agent architectures that continuously monitor events, react to intraday changes, or trigger downstream workflows when market conditions shift. In practical terms, Tradier is better suited than batch-only APIs for event-driven agents that need live context rather than periodic polling alone.

In short

Tradier is one of the strongest options for AI agents that need to move beyond analysis into brokerage-connected workflows. It may not be the broadest general-purpose research API, but for U.S.-market, execution-aware agents, Tradier’s mix of market data, account endpoints, streaming support, and MCP/LLM resources makes it highly relevant.

Xignite

Overview

Xignite is an enterprise financial data platform centered on cloud-delivered APIs and market-data management. Its catalog covers stock quotes, ETFs and mutual funds, foreign exchange, futures and options, indices and benchmarks, fixed income and rates, company fundamentals, reference data, earnings, and news. The company also emphasizes broad upstream sourcing, stating that its data comes from more than 250 providers, alongside curated in-house datasets.

Xignite’s public positioning is less “developer hobbyist API” and more enterprise-grade market data infrastructure. It highlights unlimited-usage pricing, flexible commercial packaging by asset class, call frequency, and region, and delivery models that include real-time, historical, and reference data. Its developer materials also show a broad set of products for delayed quotes, real-time quotes, historical data, streaming, alerts, IPOs, and company information.

That means Xignite is best understood as a data platform for institutions and mature fintech products rather than as a lightweight API-first experimentation layer. For many teams, that is a feature, not a drawback. In an AI stack, the most valuable data provider is often the one that can reliably serve as the normalized source behind internal models, orchestration layers, and production analytics systems. This last point is an inference from Xignite’s product positioning toward scalable enterprise delivery and market-data management.

What makes it valuable in the AI Agent era?

1. Enterprise-grade breadth for multi-source agent pipelines
AI agents become more useful when they can combine quotes, fundamentals, benchmarks, reference data, and news into a single reasoning loop. Xignite’s catalog is strong on this dimension. Because it covers a wide range of asset classes and reference datasets, it can act as the structured data layer beneath enterprise financial copilots and internal analyst tools.

2. Strong fit for organizations building their own orchestration layer
Unlike Alpha Vantage or EODHD, Xignite’s public materials emphasize APIs, coverage, and market-data management rather than agent-specific packaging. In practice, that makes it attractive for organizations that want to build their own AI architecture on top of a robust enterprise data backbone instead of depending on vendor-supplied MCP experiences. That is an inference from Xignite’s public positioning around cloud APIs, data management, and unlimited-usage commercial structure.

3. Flexible delivery for production-scale systems
Xignite supports multiple delivery modes across real-time, delayed, historical, and streaming-style services, and it explicitly markets itself for demanding display applications, backtesting, alerts, and application integration. That flexibility matters in AI systems because not all components need the same data path: one model might need historical fundamentals, another might need event-driven market updates, and a third might need reference data normalization.

In short

Xignite is not the most visibly AI-marketed provider in this group, but it is a serious contender for enterprise AI finance stacks. If your goal is to build a proprietary agent platform on top of large-scale, normalized market-data services, Xignite’s breadth and infrastructure orientation make it more compelling than its relative lack of public AI branding might suggest.

EOD Historical Data

Overview

EOD Historical Data, now commonly presented as EODHD, offers a broad financial data platform spanning fundamentals, historical end-of-day prices, live and real-time feeds, intraday data, U.S. options, financial news, stock screeners, technical indicators, and exchange/reference datasets. On its homepage, the company positions itself as a “one-stop shop” for 30+ years of historical, fundamental, and real-time data across global markets, with coverage figures including 60 stock exchanges and 150,000 tickers.

One of EODHD’s strengths is that it sits between lightweight developer tools and more professional research infrastructure. It offers structured JSON and CSV responses, coding libraries, spreadsheet add-ons, and a broad menu of market datasets without being limited to only one narrow workflow. It also exposes precomputed technical indicators through API endpoints rather than requiring users to calculate everything from raw time series.

This combination makes EODHD particularly attractive for builders who want reasonably broad market-data coverage and analytics features in a format that remains accessible to smaller teams, solo developers, and applied AI prototypes.

What makes it valuable in the AI Agent era?

1. Official MCP support for agent integration
EODHD provides an official MCP server for financial data and explicitly documents how to connect it to ChatGPT, Claude, and custom AI agents. The company describes this as a way for AI agents and LLMs to access real-time and historical financial data directly through MCP, making EODHD one of the most clearly AI-oriented data providers alongside Alpha Vantage and Tradier.

2. An official ChatGPT-oriented financial assistant
Beyond MCP, EODHD also offers an official Financial Assistant for ChatGPT, which it describes as an AI that can generate code for EODHD APIs and provide finance insights grounded in real data and news. That does not just signal marketing interest in AI; it suggests the company is actively shaping its product and developer experience around LLM-driven usage patterns.

3. Strong structured outputs plus higher-level analytics
EODHD’s AI relevance is also practical. It provides structured JSON/CSV outputs, extensive API documentation, libraries, and technical-indicator endpoints that already package financial signals into machine-usable form. For agentic systems, that reduces the burden of transforming raw market data before it can be used in screening, summarization, ranking, or recommendation workflows.

In short

EODHD is one of the strongest all-around options for the AI agent era. It combines broad market coverage with precomputed indicators, developer-friendly structured data, an official MCP server, and a ChatGPT-oriented assistant. For teams that want something more AI-forward than classic enterprise vendors but broader than a narrow single-purpose API, EODHD is a very strong choice.

QuoteMedia

Overview

QuoteMedia is a long-standing market-data provider focused on real-time and historical data, news, analytics, and financial information for brokerages, websites, trading systems, and investor-facing products. Its Request APIs and OnDemand services are built around cloud-based access to market data, while its streaming products emphasize tick-by-tick delivery, low latency, and enterprise-grade reliability. QuoteMedia also highlights broad operational scale, including 110+ global exchanges, 200+ data APIs, 99.99% uptime, and 100+ news providers.

A notable strength of QuoteMedia is delivery flexibility. Its platform spans REST-style OnDemand APIs, WebSocket and other streaming interfaces, and SFTP-based file services for bulk delivery. It also supports JSON, XML, CSV, option-chain data, company profiles, historical time series, filings, and custom calculations. That makes QuoteMedia less of a single API product and more of a market-data delivery platform.

QuoteMedia’s public positioning is similar to Xignite in one important way: it is more infrastructure-oriented than explicitly LLM-oriented. In other words, its clearest strengths are reliability, breadth, delivery options, and integration into financial products, not public MCP or agent-marketing. That is an inference from the official materials reviewed.

What makes it valuable in the AI Agent era?

1. Low-latency data for real-time agent monitoring
QuoteMedia’s streaming stack is designed for real-time or delayed tick-by-tick data, normalized for ease of use and optimized for single-digit millisecond performance. For AI systems that monitor live markets, score signals, or trigger alerts and workflows off intraday movement, that kind of delivery profile is highly relevant.

2. Multiple delivery modes for different agent architectures
Modern AI finance stacks are not monolithic. Some components work best with REST requests, others with streams, and others with bulk files for offline training or evaluation. QuoteMedia supports cloud REST APIs, streaming APIs, and SFTP/file services, which makes it well suited to organizations building layered pipelines that combine real-time agent behavior with batch analytics and historical model development.

3. Strong fit as a production data layer
QuoteMedia offers market data, news, analytics, company profiles, option chains, filings, and historical data in structured formats such as JSON, XML, and CSV. That breadth makes it a useful foundation for internal copilots, research dashboards, summarization systems, and client-facing financial applications where the “AI” layer is built on top of the data platform rather than bundled by the vendor itself.

In short

QuoteMedia is a strong candidate for teams that care more about production-grade delivery and integration flexibility than about whether the vendor has already branded itself around AI agents. In the AI agent era, that still matters a lot: a reliable, low-latency, multi-format market-data backbone can be more valuable than flashy AI positioning if you are building your own orchestration layer.

Conclusion

If the goal is to find the most AI-ready providers, Alpha Vantage, Tradier, and EODHD stand out because they already offer MCP or LLM-oriented support. Alpha Vantage is particularly strong for AI-native research tools, Tradier is strong for brokerage-connected agents, and EODHD is a strong general-purpose choice.

If the goal is enterprise-grade infrastructure for proprietary AI systems, Xignite and QuoteMedia remain highly relevant. They may be less visibly AI-marketed, but they are strong as scalable market data backbones.

So in the AI agent era, the best stock market data API depends on what you are building. For AI-native financial research, Alpha Vantage has a strong edge. For execution-oriented agents, Tradier stands out. For broad AI-enabled workflows, EODHD is highly competitive. For enterprise infrastructure, Xignite and QuoteMedia are still important players.

7 SQL Use Cases Every Data Professional Should Know

Cornellius Yudha Wijaya — Sat, 07 Mar 2026 12:19:16 GMT

A lot of people learn SQL in a frustrating way.

They start with SELECT, FROM, WHERE, GROUP BY, maybe a few joins, and if they stay long enough, a window function or two. They can write queries. They can pass the exercises. But when they face a real business question, they still freeze.

That usually happens because they learned SQL as a list of clauses instead of a way to think.

In real work, SQL is rarely about showing that you remember syntax. It is about recognizing what kind of problem a business question becomes once it hits the data. For example:

Is this a reporting problem?
A funnel problem?
A cohort problem?
A segmentation problem?
A QA problem?

The moment you can recognize that, SQL becomes much less intimidating and much more useful. That is the shift that matters.

The people who get genuinely strong at SQL are usually not the people who memorize the most functions. They are the people who can look at a business question and quickly understand what kind of data transformation it needs.

So instead of thinking about SQL as “a language I should know,” I think it is more useful to think about it as a toolkit for a handful of recurring jobs.

Here are seven of the most important ones. Let’s get into it.


1. KPI reporting

When teams want to know what is happening in the business, they usually start with some version of a KPI question. Revenue by month. Daily active users. Orders by country. Average order value. Churn rate by plan. Refund rate by product. These are not flashy questions, but they are the foundation of most reporting work.

This is where SQL starts becoming practical. You are not trying to prove how advanced you are. You are trying to turn raw data into something clear enough for another person to act on.

That means defining the metric carefully, filtering the right time window, grouping at the right level, and returning a result that is readable. The technical tools are simple, but the judgment behind them matters a lot.

A lot of people underestimate this kind of SQL because it feels too basic. I think that is a mistake. A team with weak KPI logic usually ends up with weak everything else.

A simple example is monthly revenue by product category:

SELECT
    DATE_TRUNC('month', order_date) AS order_month,
    product_category,
    SUM(revenue) AS total_revenue
FROM orders
WHERE order_date >= DATE '2026-01-01'
GROUP BY 1, 2
ORDER BY 1, 3 DESC;

This is a basic grouped summary, but that is exactly why it matters. A lot of useful SQL is just good filtering, clean aggregation, and returning a table that another person can use.
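If you want to run the idea locally, here is a minimal sketch using Python's built-in sqlite3 module. SQLite has no DATE_TRUNC, so I swap in strftime('%Y-%m', ...) to get the month bucket. The rows are toy data I invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, order_date TEXT, product_category TEXT, revenue REAL);
INSERT INTO orders VALUES
  (1, '2025-12-30', 'books', 20.0),   -- outside the reporting window
  (2, '2026-01-05', 'books', 30.0),
  (3, '2026-01-18', 'games', 50.0),
  (4, '2026-01-25', 'books', 15.0),
  (5, '2026-02-02', 'games', 40.0);
""")

# Same shape as the query above: filter the window, bucket by month, aggregate.
monthly = conn.execute("""
    SELECT strftime('%Y-%m', order_date) AS order_month,
           product_category,
           SUM(revenue) AS total_revenue
    FROM orders
    WHERE order_date >= '2026-01-01'
    GROUP BY 1, 2
    ORDER BY 1, 3 DESC
""").fetchall()

for row in monthly:
    print(row)
```

The December order drops out because of the WHERE clause, which is exactly the kind of boundary worth checking by eye on a small sample before trusting the number on a dashboard.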

2. Funnel analysis

The second major use case is figuring out where people drop off.

This is where SQL starts feeling very close to product and growth work. A funnel question usually sounds like this: how many users started onboarding, how many completed profile setup, how many created their first project, and how many upgraded? In ecommerce, the same question shows up as view product, add to cart, begin checkout, and pay.

What makes funnel analysis valuable is that it shows where interest turns into friction.

A lot of the time, the problem is not “traffic is low.” The problem is that the path breaks at one specific step. SQL helps you see that step clearly. It lets you move from a vague sense that “conversion feels weak” to a more precise question like “why do so many users disappear between signup and first action?”

A simple event-based funnel might look like this:

SELECT
    step_name,
    COUNT(DISTINCT user_id) AS users_at_step
FROM onboarding_events
WHERE event_date >= DATE '2026-03-01'
GROUP BY 1
ORDER BY
    CASE step_name
        WHEN 'signup' THEN 1
        WHEN 'verify_email' THEN 2
        WHEN 'create_project' THEN 3
        WHEN 'first_active_use' THEN 4
    END;

This is not the most advanced funnel query in the world, but it already gives you a clearer conversation. Instead of saying “activation is weak,” you can ask, “Why do so many users disappear between verification and first project creation?”
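Counts per step are only half of the picture; the drop-off shows up when each step is divided by the one before it. Here is a minimal sketch with sqlite3 and invented data (10 signups, 8 verifications, 4 projects, 3 active users), where the step order is an assumption I hard-code in a Python list.

```python
import sqlite3

STEPS = ["signup", "verify_email", "create_project", "first_active_use"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE onboarding_events (user_id INTEGER, step_name TEXT, event_date TEXT)")

# Toy data: users 1..n reach each step.
reached = {"signup": 10, "verify_email": 8, "create_project": 4, "first_active_use": 3}
for step, n in reached.items():
    conn.executemany(
        "INSERT INTO onboarding_events VALUES (?, ?, '2026-03-02')",
        [(user_id, step) for user_id in range(1, n + 1)],
    )
# A duplicate event, to show why COUNT(DISTINCT user_id) matters.
conn.execute("INSERT INTO onboarding_events VALUES (1, 'signup', '2026-03-02')")

counts = dict(conn.execute("""
    SELECT step_name, COUNT(DISTINCT user_id)
    FROM onboarding_events
    WHERE event_date >= '2026-03-01'
    GROUP BY step_name
""").fetchall())

# Step-to-step conversion: users at this step / users at the previous step.
funnel = []
previous = None
for step in STEPS:
    users = counts.get(step, 0)
    rate = None if not previous else round(users / previous, 3)
    funnel.append((step, users, rate))
    previous = users

for row in funnel:
    print(row)
```

In this toy data the weakest transition is verify_email to create_project at 0.5, which is the step the follow-up question should target.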

Once you can answer that, the conversation gets much more useful.

3. Cohort retention analysis

This is one of the most important SQL use cases because it forces better thinking.

A cohort retention analysis groups users by a shared starting point, then checks whether they come back in later periods. That sounds simple, but it is one of those areas where small definition choices change the whole story. What puts a user into a cohort? What counts as a return? What does a week mean? Should a user count once per week or every time they generate an event?

That is why good retention work is not mainly about writing SQL. It is about locking the logic before the SQL ever begins.

This is also where SQL becomes more than a reporting language. It becomes a way of expressing lifecycle behavior. Once you can build a trustworthy retention table, you can stop asking “are users coming back?” in a vague way and start asking “which users are sticking, when do they drop, and what changed across cohorts?”

That is one of the reasons I like this use case so much. It pushes people past syntax into actual analytical design.

A very small example of the logic looks like this:

WITH user_cohort AS (
    SELECT
        user_id,
        DATE_TRUNC('week', MIN(login_date)) AS cohort_week
    FROM logins
    GROUP BY 1
),
user_activity AS (
    SELECT
        l.user_id,
        DATE_TRUNC('week', l.login_date) AS activity_week
    FROM logins l
    GROUP BY 1, 2
)
SELECT
    c.cohort_week,
    a.activity_week,
    COUNT(DISTINCT a.user_id) AS active_users
FROM user_cohort c
JOIN user_activity a
  ON c.user_id = a.user_id
GROUP BY 1, 2
ORDER BY 1, 2;

This is only the skeleton, not the full retention table. But even here, you can already see the shape: assign the cohort, map later activity, then aggregate by period.
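To show one way the skeleton can grow into rates, here is a minimal sketch in Python with sqlite3. It assumes Monday-start weeks and counts each user once per week. Because SQLite has no DATE_TRUNC, the week start is computed with date(login_date, 'weekday 0', '-6 days'). The logins are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logins (user_id INTEGER, login_date TEXT);
INSERT INTO logins VALUES
  (1, '2026-03-02'), (1, '2026-03-10'), (1, '2026-03-17'),
  (2, '2026-03-04'), (2, '2026-03-05'), (2, '2026-03-12'),
  (3, '2026-03-06'),
  (4, '2026-03-09'), (4, '2026-03-18');
""")

rows = conn.execute("""
    WITH weekly AS (              -- one row per user per active week
        SELECT DISTINCT user_id,
               date(login_date, 'weekday 0', '-6 days') AS activity_week
        FROM logins
    ),
    user_cohort AS (              -- first active week defines the cohort
        SELECT user_id, MIN(activity_week) AS cohort_week
        FROM weekly
        GROUP BY user_id
    ),
    cohort_size AS (
        SELECT cohort_week, COUNT(*) AS size
        FROM user_cohort
        GROUP BY cohort_week
    )
    SELECT c.cohort_week,
           CAST(ROUND((julianday(w.activity_week) - julianday(c.cohort_week)) / 7) AS INTEGER)
               AS week_number,
           COUNT(DISTINCT w.user_id) AS active_users,
           s.size
    FROM user_cohort c
    JOIN weekly w ON w.user_id = c.user_id
    JOIN cohort_size s ON s.cohort_week = c.cohort_week
    GROUP BY c.cohort_week, week_number, s.size
    ORDER BY c.cohort_week, week_number
""").fetchall()

# Retention rate = active users in week N / cohort size.
retention = [(cohort, week, round(active / size, 3)) for cohort, week, active, size in rows]
for row in retention:
    print(row)
```

Week 0 is always 1.0 here because the cohort is defined by first login, which is a useful sanity check: if week 0 is not 100%, the cohort and activity definitions disagree.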

You can check the deep dive of this use case here:

4. Segmentation

Once you know the overall number, the next question is almost always: who exactly is driving it?

That is segmentation.

Averages are useful, but they hide a lot. SQL becomes much more powerful once you stop treating all users as one group and start cutting the data into meaningful slices. That might mean country, plan, acquisition channel, device type, power users versus casual users, or first purchase month.

And in practice, this is where a lot of strong SQL users separate themselves. They stop producing one big average and start showing where the business behaves differently across groups.

A simple segmentation example might be conversion rate by acquisition channel:

SELECT
    acquisition_channel,
    COUNT(DISTINCT user_id) AS users,
    SUM(CASE WHEN converted = 1 THEN 1 ELSE 0 END) AS converted_users,
    ROUND(
        1.0 * SUM(CASE WHEN converted = 1 THEN 1 ELSE 0 END)
        / COUNT(DISTINCT user_id),
        3
    ) AS conversion_rate
FROM user_conversion_summary
GROUP BY 1
ORDER BY conversion_rate DESC;

This is where SQL starts feeling strategic. You stop asking, “Is conversion improving?” and start asking, “Is conversion improving for the users we actually care about?”

5. Experiment analysis

If you work near product or growth teams, SQL becomes very important the moment experiments show up.

Before anyone talks about significance, lift, or confidence intervals, someone still has to build the dataset properly. Who was in the control group? Who was in the treatment group? Who converted? Over what window? Were there logging issues? Did the assignment logic work as expected?

A lot of that early work is SQL.

And this matters more than people think, because if the experiment table is wrong, everything that comes after it is already compromised. If the assignment table is joined incorrectly, if the outcome window is inconsistent, or if duplicated rows quietly inflate conversions, the eventual statistical discussion becomes much less meaningful.

So even though experiment analysis sounds advanced, a lot of it still comes down to careful SQL habits and clean dataset construction.

A simple experiment summary might look like this:

SELECT
    variant,
    COUNT(DISTINCT user_id) AS users,
    SUM(CASE WHEN purchased = 1 THEN 1 ELSE 0 END) AS purchasers,
    ROUND(
        1.0 * SUM(CASE WHEN purchased = 1 THEN 1 ELSE 0 END)
        / COUNT(DISTINCT user_id),
        3
    ) AS purchase_rate
FROM experiment_user_summary
WHERE experiment_name = 'checkout_redesign_v1'
GROUP BY 1
ORDER BY 1;

That is not the full experiment analysis, but it is the foundation.
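The duplicated-rows risk mentioned earlier is easy to reproduce. Here is a minimal sketch with sqlite3 and two hypothetical tables, assignments and purchases, where one treatment user bought three times. Joining the raw purchase rows and counting them inflates the purchaser count; reducing purchases to one row per user first does not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE assignments (user_id INTEGER, variant TEXT);
CREATE TABLE purchases (user_id INTEGER, order_id INTEGER);
INSERT INTO assignments VALUES (1, 'control'), (2, 'control'), (3, 'treatment'), (4, 'treatment');
INSERT INTO purchases VALUES (1, 100), (3, 101), (3, 102), (3, 103);
""")

# Naive: the join fans out to one row per order, so orders get counted as purchasers.
naive = conn.execute("""
    SELECT a.variant,
           COUNT(DISTINCT a.user_id) AS users,
           COUNT(p.order_id) AS purchasers
    FROM assignments a
    LEFT JOIN purchases p ON p.user_id = a.user_id
    GROUP BY a.variant
    ORDER BY a.variant
""").fetchall()

# Safer: collapse purchases to one row per user before joining.
safe = conn.execute("""
    WITH purchasers AS (SELECT DISTINCT user_id FROM purchases)
    SELECT a.variant,
           COUNT(*) AS users,
           COUNT(p.user_id) AS purchasers
    FROM assignments a
    LEFT JOIN purchasers p ON p.user_id = a.user_id
    GROUP BY a.variant
    ORDER BY a.variant
""").fetchall()

print("naive:", naive)
print("safe: ", safe)
```

The naive version reports 3 purchasers out of 2 treatment users, an impossible rate of 150%. Real cases are rarely that obvious, which is why a rate above 100% or a row count that grows after a join deserves a look before any significance test.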

6. Data quality and QA checks

This is one of the least glamorous SQL use cases, and one of the most valuable.

A huge amount of trust in data work comes from catching bad structure early. Duplicate rows. Missing keys. Broken joins. Sudden changes in counts. Tables that stopped updating. Records that should be impossible but somehow exist anyway.

SQL is excellent for this kind of work because it is good at isolating patterns, comparing counts, checking coverage, and surfacing anomalies before they become reporting problems.

This is also one of the places where data professionals become more mature in practice. They stop using SQL only to answer the question they were asked, and they start using SQL to challenge whether the dataset itself deserves trust.

That is a very different mindset.

Once you develop it, your work usually becomes much more reliable.

For example, if you want to check for duplicate order IDs:

SELECT
    order_id,
    COUNT(*) AS row_count
FROM orders
GROUP BY 1
HAVING COUNT(*) > 1
ORDER BY row_count DESC;

This is basic, but incredibly useful.
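The same habit extends to missing keys and broken joins. Here is a minimal sketch that runs a few checks as a small suite with sqlite3. The tables, check names, and data are hypothetical; each check is written to return a count of offending rows, so zero means pass.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER);
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
INSERT INTO customers VALUES (1), (2);
INSERT INTO orders VALUES
  (10, 1), (11, 2),
  (11, 2),       -- duplicate order_id
  (12, 3),       -- customer 3 does not exist
  (13, NULL);    -- missing key
""")

# Each query counts rows that violate an expectation.
checks = {
    "duplicate_order_ids": """
        SELECT COUNT(*) FROM (
            SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1
        )
    """,
    "null_customer_ids": "SELECT COUNT(*) FROM orders WHERE customer_id IS NULL",
    "orphan_orders": """
        SELECT COUNT(*)
        FROM orders o
        LEFT JOIN customers c ON c.customer_id = o.customer_id
        WHERE o.customer_id IS NOT NULL AND c.customer_id IS NULL
    """,
}

results = {name: conn.execute(sql).fetchone()[0] for name, sql in checks.items()}
failed = [name for name, bad_rows in results.items() if bad_rows > 0]

print(results)
print("failed checks:", failed)
```

Keeping the checks in one dictionary makes it cheap to add a new one each time a data problem surprises you.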

7. Operational monitoring

The last use case is the one that makes SQL feel closest to the day-to-day operating layer of a business.

Sometimes the question is not “what happened this quarter?” Sometimes the question is “did the pipeline run?”, “are transactions missing?”, “did yesterday’s volume collapse?”, or “did a critical table stop refreshing?”

At that point, SQL is not just helping with analysis. It is helping keep the system honest.

This kind of work often lives somewhere between analytics, operations, and data engineering. You are comparing expected versus actual counts, checking daily or weekly movement, and trying to spot problems before somebody else finds them in a broken dashboard or an angry meeting.

If you only think of SQL as a tool for reports, you miss how often it becomes part of the business’s operational nervous system.

A simple monitoring query might compare day-over-day order counts:

SELECT
    order_date,
    COUNT(*) AS orders_today,
    LAG(COUNT(*)) OVER (ORDER BY order_date) AS orders_yesterday
FROM orders
GROUP BY 1
ORDER BY 1;

This is where window functions become especially useful. They let you compare each row to related rows while keeping the row-level result visible, which is exactly the kind of thing you want for trend and monitoring work.
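One caveat worth sketching: LAG compares each row with the previous row, which is only "yesterday" when every date has at least one order. If a day has no rows at all, it simply does not appear. Below is a minimal Python sketch that takes daily counts from SQL and then walks the calendar so an empty day shows up as zero. The 50% threshold and the flag_drops helper are my own illustrative choices.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT)")

# Toy data: nothing at all was loaded for 2026-03-03.
order_id = 0
for day, n in [("2026-03-01", 100), ("2026-03-02", 96), ("2026-03-04", 40)]:
    for _ in range(n):
        order_id += 1
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, day))

daily = dict(conn.execute(
    "SELECT order_date, COUNT(*) FROM orders GROUP BY order_date"
).fetchall())

def flag_drops(daily, start, end, max_drop=0.5):
    """Walk every calendar day; flag days that fall more than max_drop vs the day before."""
    alerts = []
    previous = None
    day = start
    while day <= end:
        count = daily.get(day.isoformat(), 0)   # a missing day counts as zero
        if previous and count < previous * (1 - max_drop):
            alerts.append((day.isoformat(), previous, count))
        previous = count
        day += timedelta(days=1)
    return alerts

alerts = flag_drops(daily, date(2026, 3, 1), date(2026, 3, 4))
print(alerts)
```

A row-based comparison would pair 2026-03-04 with 2026-03-02 and never mention the empty day, while the calendar walk reports the collapse on 2026-03-03 itself.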


The bigger point

If you look across all seven use cases, the pattern is pretty clear.

SQL is rarely valuable for its syntax in isolation.

It is valuable because the same small set of ideas keeps getting reused across real work.

That is why strong SQL users usually do not sound like they are reciting functions. They sound like they understand data shape.

That is a much better goal than “learn more SQL syntax.”

Where to go next

If you are still early, I would not try to learn every advanced clause in one sitting.

I would focus on connecting SQL to actual problems.

That is exactly why I built the SQL track into the NBD Focus Map. The point is not to learn SQL randomly. The point is to see how the pieces fit together and start shipping small, useful work with them.

Start here

If you want the broader path, start with the Focus Map:

Focus Map

If you want the full paid system, use:

Vault: https://www.nb-data.com/p/nbd-reading-vault-paid-guided-paths
Template Index: https://www.nb-data.com/p/template-pack-index-paid
Subscriber Benefits: https://www.nb-data.com/p/subscriber-benefits

Cohort Retention in SQL

Cornellius Yudha Wijaya — Fri, 06 Mar 2026 18:39:19 GMT

Most retention tables are not wrong because the SQL is complicated.

They are wrong because the definitions are loose.

Someone says, “Let’s look at retention,” a query gets written, a heatmap shows up in a dashboard, and suddenly everyone is talking about Week 1 and Month 1 as if those numbers are objective facts. They usually are not. They are the result of choices. What counts as the start of a user’s journey? What counts as a return? What exactly is a week? What timezone are we using? Are we measuring one user once per period, or accidentally counting heavy users multiple times?

That is the real work in cohort retention. Not the division. Not the pivot table. The real work is deciding what story the table is allowed to tell.

At its core, cohort analysis is simple. You group users by a shared starting point, then measure what those users do in later periods. That is the common backbone behind most cohort SQL tutorials and warehouse implementations.

What makes it tricky is that small choices can change the story enough to change the decision.

So in this piece, I want to show you how I think about cohort retention in SQL when I want something that is not just presentable, but actually trustworthy. We will walk through a small sample dataset, turn it into a retention table step by step, and discuss the parts that often go wrong: cohort definition, return-event design, week boundaries, duplicate activity, partial cohorts, and interpretation.

Start with the question, not the query

Before touching SQL, I like to ask one uncomfortable question:

What exactly do I want this retention table to help me decide?

That question matters because different cohort definitions answer different business questions.

If I group users by the week they signed up, I am usually asking something about onboarding, activation, or acquisition quality. I want to know whether new users are sticking around after entering the funnel.

If I group users by the week they first did something meaningful, I am asking something slightly different. I am saying that signup is not the real beginning of value. Maybe the real beginning is the first login, the first purchase, the first report built, or the first document uploaded. In that case, I am less interested in the funnel entry and more interested in what happens once a user actually starts using the product.

Both are valid. But they are not interchangeable.

The same holds for the return event. If I define retention as “any page view,” my table might look reassuring while hiding the fact that users are not doing anything meaningful. If I define retention as “purchase,” the metric might be more valuable but also much sparser. There is no universally correct event. There are only events that are more or less aligned with the value loop you care about.
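A small sketch shows how much the return-event choice can move the number. The cohort, the event names, and the retention helper below are all hypothetical. The point is only that the same week of activity produces two very different rates.

```python
# Hypothetical cohort of four users and their events in the following week.
cohort = {"u1", "u2", "u3", "u4"}
week1_events = [
    ("u1", "page_view"),
    ("u1", "purchase"),
    ("u2", "page_view"),
    ("u3", "page_view"),
    ("u3", "page_view"),  # repeated activity still counts the user once
]


def retention(events, cohort_users, qualifying_events):
    """Share of the cohort with at least one qualifying event."""
    returned = {
        user
        for user, name in events
        if name in qualifying_events and user in cohort_users
    }
    return len(returned) / len(cohort_users)


any_activity = retention(week1_events, cohort, {"page_view", "purchase"})
purchase_only = retention(week1_events, cohort, {"purchase"})

print(f"any activity:  {any_activity:.0%}")
print(f"purchase only: {purchase_only:.0%}")
```

With this made-up data, three of four users count as retained under "any activity" and only one of four under "purchase". Neither number is wrong. They answer different questions.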

Then there is the time bucket. This is the part people often treat as neutral, even though it really isn’t. A daily retention table tells a different story than a weekly one. A weekly table tells a different story than a monthly one. And even the idea of a “week” is less fixed than people think. BigQuery, for example, distinguishes between WEEK, WEEK(WEEKDAY) such as WEEK(MONDAY), and ISOWEEK, and those choices affect how dates are grouped and how period differences are calculated.
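Here is a rough sketch of why the week definition matters, written in plain Python rather than in any particular warehouse dialect. The week_start and weeks_between helpers are my own illustrations, not BigQuery functions. They mimic the difference between a Sunday-start week (BigQuery's default WEEK) and a Monday-start week (ISOWEEK).

```python
from datetime import date, timedelta

MONDAY, SUNDAY = 0, 6  # Python's weekday() convention: Monday=0 ... Sunday=6


def week_start(d: date, first_weekday: int) -> date:
    """Return the first day of the week that contains d."""
    offset = (d.weekday() - first_weekday) % 7
    return d - timedelta(days=offset)


def weeks_between(start: date, end: date, first_weekday: int) -> int:
    """Number of week boundaries crossed between start and end."""
    delta = week_start(end, first_weekday) - week_start(start, first_weekday)
    return delta.days // 7


first_login = date(2026, 3, 7)   # a Saturday
next_login = date(2026, 3, 8)    # the Sunday right after

sunday_weeks = weeks_between(first_login, next_login, SUNDAY)
monday_weeks = weeks_between(first_login, next_login, MONDAY)

print("Sunday-start weeks apart:", sunday_weeks)
print("Monday-start weeks apart:", monday_weeks)
```

The two logins are one day apart. With Sunday-start weeks they land in different weeks, so the second login looks like a Week 1 return. With Monday-start weeks they land in the same week, so it is still Week 0. Same users, same events, different retention table.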

That is why I think of cohort retention as a design problem before I think of it as a SQL problem.

The version we’re building here

To make this concrete, let’s keep the example small and explicit.

In this walkthrough:

A user’s cohort is the week of their first login
Retention means they performed a login in a later week
The table uses calendar weeks
Each user should count at most once per week

That last condition matters a lot. If a user logs in ten times in the same week, they are still one retained user for that week. Retention is about whether someone came back in the period, not how noisy their event stream was.
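The four rules above can be sketched as a single query. This is a minimal sketch, not the article's own sample data or final query. It assumes a hypothetical events table with user_id, event_name, and event_date columns, uses SQLite through Python's sqlite3 module, and treats a calendar week as starting on Monday, which is my assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical events; u4 logs in twice on the same day, u3 only views a page later.
conn.executescript("""
    CREATE TABLE events (user_id TEXT, event_name TEXT, event_date TEXT);
    INSERT INTO events VALUES
        ('u1', 'login', '2026-01-05'), ('u1', 'login', '2026-01-06'),
        ('u1', 'login', '2026-01-13'), ('u1', 'login', '2026-01-20'),
        ('u2', 'login', '2026-01-07'), ('u2', 'login', '2026-01-14'),
        ('u3', 'login', '2026-01-08'), ('u3', 'page_view', '2026-01-15'),
        ('u4', 'login', '2026-01-12'), ('u4', 'login', '2026-01-12'),
        ('u4', 'login', '2026-01-19');
""")

QUERY = """
WITH logins AS (
    -- One row per user per week: DISTINCT removes repeat logins.
    -- 'weekday 0' moves to the next Sunday, '-6 days' steps back to Monday.
    SELECT DISTINCT
        user_id,
        date(event_date, 'weekday 0', '-6 days') AS activity_week
    FROM events
    WHERE event_name = 'login'
),
cohorts AS (
    -- Cohort = week of the user's first login.
    SELECT user_id, MIN(activity_week) AS cohort_week
    FROM logins
    GROUP BY user_id
),
cohort_sizes AS (
    SELECT cohort_week, COUNT(*) AS cohort_size
    FROM cohorts
    GROUP BY cohort_week
),
activity AS (
    SELECT
        c.cohort_week,
        CAST((julianday(l.activity_week) - julianday(c.cohort_week)) / 7
             AS INTEGER) AS week_number,
        COUNT(DISTINCT l.user_id) AS active_users
    FROM logins AS l
    JOIN cohorts AS c ON c.user_id = l.user_id
    GROUP BY c.cohort_week, week_number
)
SELECT
    a.cohort_week,
    a.week_number,
    a.active_users,
    s.cohort_size,
    ROUND(1.0 * a.active_users / s.cohort_size, 2) AS retention
FROM activity AS a
JOIN cohort_sizes AS s ON s.cohort_week = a.cohort_week
ORDER BY a.cohort_week, a.week_number
"""

retention_rows = conn.execute(QUERY).fetchall()
for row in retention_rows:
    print(row)
```

Two details are worth noticing. The DISTINCT in the first step is what keeps a heavy user from counting more than once per week. And the page view from u3 never reaches the table, because only login events qualify as a return in this version.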

Sample data

Here is a tiny events table we can use end-to-end.

Template Pack Index (Paid)

Cornellius Yudha Wijaya — Sun, 01 Mar 2026 15:43:57 GMT

This is the paid asset library for Non-Brand Data.

Each template pack is designed to help you ship faster, not just read more. Pick one pack based on your current goal, use it for 7–14 days, and finish with a visible artifact: a notebook, SQL analysis, write-up, repo, or project page.

NBD Reading Vault (Paid): Guided Paths + Mini-Projects

Cornellius Yudha Wijaya — Tue, 24 Feb 2026 12:45:05 GMT

Start here if you feel overwhelmed

If you are a paid member and do not know what to read next, do not browse the archive.

Choose one track below, follow the reading order, and ship one mini-project in 2–4 weeks.

This page is designed to help you move from reading to output.

Do this before browsing the archive:

Pick one track only.
Read the first post in that …

✨Subscriber Benefits

Cornellius Yudha Wijaya — Sun, 22 Feb 2026 13:26:51 GMT

Non-Brand Data Subscriber Benefits

This is the up-to-date summary of what you get as a free reader, paid member, or founding member.

Non-Brand Data is built around one simple idea:

Learn with structure. Practice with intention. Ship something useful.

Full version

Free subscribers

Start here: NBD Focus Map (Free PDF)

What you get:

NBD Focus Map
Public posts and public archive
The best free posts in order
Subscriber chat and comment threads
Free practical resources, including the AI Evaluation Checklist for Data Professionals
Occasional free guides, templates, and learning notes

Best for:

Readers who want a structured starting point before deciding whether to go deeper.

Use this tier if you want to pick one track, follow a simple cadence, and start shipping small artifacts from what you learn.

Paid members

Paid members get the structured side of Non-Brand Data.

What you get:

Member-only deep-dive posts
Full archive access
NBD Reading Vault
Template Pack Index
Monthly or periodic template updates
Guided learning paths across SQL, Python + ML, and GenAI / RAG
Practical assets such as checklists, worksheets, rubrics, and project templates

Best for:

Readers who do not want to browse randomly and want a clearer path from learning to practice.

Use this tier if you want to follow guided paths, reuse templates, and turn the archive into a practical learning system.

What paid members should use first

If you feel lost in the archive, start with the NBD Reading Vault.

The Vault helps you:

pick one track,
follow the reading order,
commit to a 2–4 week sprint,
and ship one mini-project.

If you already know what you want to build, go to the Template Pack Index.

The Template Pack Index gives you reusable assets such as:

Sprint Pack
SQL Work Pack
ML Experiment Pack
AI Evaluation Checklist

Founding members

Founding members get everything in Paid, plus priority feedback and one annual review call.

The annual review call is for one project, portfolio entry, write-up, or learning artifact.

We focus on what to change so the work becomes clearer, stronger, and closer to hiring-manager-ready or stakeholder-ready quality.

What you get:

Everything in Paid
Priority feedback
One annual 30-minute review call
Project, portfolio, or artifact review

One-time purchases (optional)

These are separate from subscriptions. Buy once and reuse anytime.

Portfolio Rubric Toolkit
Data Science Resume Template (FREE)
Python Packages to Learn Data Science (e-book) (FREE)

How to redeem your benefits

All subscribers

Focus Map
Start Here

Paid members

Access member-only posts by logging in with the email you used to subscribe
Template packs are delivered by email and will also be collected in one place as the vault grows
The member vault reading list will be added here once it is published

Founding members

Reply to any email with the subject: Founding review
Include:

a link to your project/repo/write-up
what you want feedback on

I will send the booking link, and we will schedule the 30 minutes.