
Building the AI-Ready Data Platform

series: The Agentic Data Stack

  1. Data Is the Bottleneck in the Agentic Era
  2. Building the AI-Ready Data Platform

This is Part 2 of a two-part series. In Part 1, I argued that the primary bottleneck in the agentic era isn’t model quality — it’s data execution. Here, I’ll lay out what to actually build.


Part 1 identified the bottleneck. The harder question is what to build.

The feedback loop is clear: agents consume data, make decisions, generate more data, and that data feeds the next round. If the data layer can’t keep up, the loop stalls. But “keep up” is vague. What does the data platform actually need to become?

Think of deploying an agent like onboarding a high-capability new hire. In a well-run org, they get documentation, ownership maps, process runbooks, and colleagues who know the caveats — so they make good decisions quickly. In a chaotic org, they make avoidable mistakes for months despite being talented.

Agents are the same. The model is the talent. The data platform is the onboarding system. Without metadata, lineage, freshness guarantees, and clear data contracts, the agent is capable but disoriented — it has the intelligence, not the context.

The answer isn’t a new tool or a bigger cluster. It’s an architectural commitment — one that most organizations already have the building blocks for but haven’t assembled with agents in mind.

What agent builders hit first

If you’re building agents, you’ve probably hit these walls — and you may not have recognized them as data problems.

Your agent works in the notebook but breaks in production. You assume it’s the prompt, so you spend days tuning it. The real issue: the production data is stale, incomplete, or structured differently than the sample data you developed against.

You can’t figure out why the agent made a bad decision. Was it the model? The retrieval? The chain-of-thought? You dig through traces for hours. The root cause turns out to be a table that was silently updated upstream — a renamed column, a changed filter, a broken pipeline — and nobody told you.

You find five tables that look like they contain what you need. No descriptions, no lineage, no way to tell which one is canonical. You pick one. It’s wrong. The agent produces plausible outputs that are silently incorrect.

Your eval suite passes on Tuesday and fails on Thursday. The agent didn’t change. The data did.

These aren’t intelligence-layer problems. They’re data-layer problems that surface through the intelligence layer. And this is the dynamic most teams miss: the data layer and the intelligence layer aren’t independent stacks — they’re symbiotic. The intelligence layer depends on the data layer for context, freshness, and correctness. The data layer now depends on the intelligence layer too, because agents generate decision data that flows back into the platform. Each layer’s quality constrains the other.

When the conversation about agents focuses only on models, prompts, and orchestration, it treats the intelligence layer as self-contained. It isn’t. You can build agents without talking to your data team. You just can’t build agents that work reliably in production without them. Every failure mode I just described — the broken eval, the silent schema change, the wrong table — is a data platform failure that no amount of prompt engineering will fix.

What “AI-ready” actually means

“AI-ready” has become a marketing term. Vendors slap it on everything from CSV uploads to data catalogs. So let me be specific about what I mean.

An AI-ready data platform is one where:

  • Every dataset is discoverable — an agent (or a human) can find the right table without tribal knowledge.
  • Every table is described — not just column names and types, but semantics, business meaning, freshness guarantees, and known caveats.
  • Every record has lineage — you can trace a value back through the transformations that produced it.
  • Every access is governed — permissions, audit trails, and retention policies are enforced at the platform level, not bolted on after the fact.
  • Every pipeline has SLAs — freshness, completeness, and cost are measured and enforced, not hoped for.
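These five properties can be made machine-readable. Here’s a minimal sketch of a dataset descriptor an agent could consume at query time — the field names and the `booking_events` example are my own illustration, not a standard:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class DatasetDescriptor:
    """Machine-readable dataset description for agent (and human) consumers.

    Field names are illustrative, not a standard.
    """
    name: str                   # discoverable: canonical name in the catalog
    description: str            # described: semantics and business meaning
    upstream: list[str]         # lineage: sources and pipelines it derives from
    owner_role: str             # governed: who is accountable for it
    freshness_sla_minutes: int  # SLA: maximum acceptable staleness
    caveats: list[str] = field(default_factory=list)  # known gotchas

booking_events = DatasetDescriptor(
    name="analytics.booking_events",
    description="All confirmed bookings, deduplicated by reservation id.",
    upstream=["raw.booking_stream", "pipelines.dedupe_bookings"],
    owner_role="revenue-analytics",
    freshness_sla_minutes=5,
    caveats=["Undercounts mobile bookings before 2025-Q3 (tracking bug)."],
)
```

The point isn’t the exact shape — it’s that every field is something an agent can check programmatically before trusting the table.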

None of this is new. Data engineers have been advocating for these properties for years. What’s new is that agents make the consequences of not having them immediate and operational — the machine-speed failure mode Part 1 described. The margin for “good enough” just got a lot thinner.

Why Lakehouse

If you’re building a data platform for the agentic era, the lakehouse architecture is the most natural foundation. Not because it’s trendy, but because its design constraints align with what agents need.

A lakehouse unifies the storage layer (open file formats like Parquet and Iceberg on object storage) with the compute layer (SQL engines, Spark, Python, ML frameworks) — all operating on the same data, with a shared catalog and governance model.

This matters for agent workloads because:

One copy of truth. Agents can’t afford ambiguity about which table is canonical. In a traditional setup — data lake for raw storage, warehouse for curated analytics, feature store for ML — the same business entity exists in three places with three different freshness guarantees and three different schemas. An agent asking “what’s the current cancellation rate?” shouldn’t have to know which system to query. A lakehouse with a unified catalog eliminates this fragmentation.

Both batch and real-time. Agent workloads are inherently hybrid. They need historical context (batch) and current state (streaming) in the same query. Consider the Shiba Resorts example from Part 1: a pricing agent needs last quarter’s seasonal trends and today’s booking velocity. A fraud agent needs historical chargeback patterns and the transaction happening right now. Lakehouse architectures with streaming ingestion and incremental materialization handle this natively. Traditional architectures force you to stitch together batch and streaming pipelines with different SLAs, different schemas, and different failure modes.

Open formats, no lock-in. Agents interact with data through multiple channels — SQL queries, Python scripts, API calls, vector embeddings. Open table formats (Delta Lake, Apache Iceberg) mean any compute engine can read the data without export, conversion, or proprietary connectors. This is critical when your agent orchestration framework, your eval pipeline, and your business logic all need access to the same underlying tables.

Governance at the storage layer. In a lakehouse, access control, audit logging, and lineage tracking are properties of the platform — not of the individual tool sitting on top. A pricing agent querying a table inherits the same governance as a human analyst running SQL. This is vastly simpler than enforcing governance across a sprawl of warehouses, lakes, and feature stores with separate permission models.

Cost-efficient at agent scale. Agent workloads generate high data volumes (decision traces, validation logs, telemetry) that need to be retained for auditability but aren’t queried constantly. Lakehouse storage on object stores is significantly cheaper than warehouse storage for this long-tail data. You can keep everything without choosing between auditability and budget.

The lakehouse isn’t the only viable architecture — a well-governed warehouse with a strong semantic layer can approximate many of these properties. But for organizations building greenfield or modernizing existing platforms, the lakehouse most naturally absorbs the new demands agents create: unified access, hybrid latency, open interop, and platform-level governance.

The three pillars of an AI-ready platform

Architecture is necessary but not sufficient. A lakehouse with no metadata is just a data lake with better marketing. What makes the platform AI-ready is what you build on top of the architecture.

1. Governance as infrastructure

In the dashboard era, governance was often treated as a compliance exercise — something you bolted on to satisfy auditors. In the agentic era, governance is the immune system.

Every agent action — every query, every decision, every tool call — needs to be traceable. Not because a regulator might ask, but because you need to debug the system when something goes wrong. When Shiba Resorts’ pricing agent slashes a luxury suite rate to $12 because it consumed a test row from an unvalidated staging table, the postmortem needs to answer: What data did it see? What was the freshness of that data? What permissions did it have? What other agents consumed the output?

This requires:

  • Fine-grained access control enforced at the platform level. Agents should inherit the permissions of the role they represent, not operate with superuser access.
  • Audit trails that are automatic and immutable. Every read and write by an agent should be logged without the agent (or its developer) having to opt in.
  • Data lineage that spans the full chain — from raw ingestion through transformation to agent consumption and back to agent-generated output.
  • Retention and compliance policies that account for agent-generated data, which may have different regulatory treatment than human-generated data.
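To make “automatic and immutable” concrete: the audit record should be written as a side effect of going through the platform’s access path, not as something the agent developer opts into. A minimal sketch, with hypothetical names and a fake reader standing in for a real query engine:

```python
import json
import time
from typing import Any, Callable

AUDIT_LOG: list[str] = []  # stand-in for an append-only audit store

def governed_read(agent_id: str, role: str, table: str,
                  reader: Callable[[str], Any]) -> Any:
    """Wrap every table read so the audit record is written regardless
    of whether the agent developer remembered to log anything."""
    record = {"ts": time.time(), "agent": agent_id,
              "role": role, "table": table, "action": "read"}
    AUDIT_LOG.append(json.dumps(record))  # logged before data is returned
    return reader(table)

# Fake reader for the sketch; a real one would hit the lakehouse.
rows = governed_read("pricing-agent-7", "analyst",
                     "analytics.booking_events",
                     reader=lambda t: [{"rate": 189}])
```

In a real platform this lives in the storage or catalog layer, so the guarantee holds for every consumer — agent or human.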

If you’re an agent developer wondering why debugging a bad decision takes hours — this is why. Without platform-level governance, you’re reconstructing the data state manually after the fact. Governance isn’t the tax you pay for running agents. It’s the trust layer that makes running agents possible — and the debugging layer that makes them fixable.

2. Metadata-rich tables

This is where most platforms fall short — and where the payoff for agents is highest.

A table with good column names and correct data types is a start. But an agent needs more than a schema to use a table correctly. It needs to know:

  • What the table represents — not “user_events” but “all first-party ChatGPT web traffic events, excluding logged-out users, deduplicated by session.”
  • How it was derived — what pipeline produces it, what upstream sources feed it, what transformations are applied, what assumptions are baked in.
  • When it’s fresh — not “it gets updated” but “the daily refresh completes by 06:00 UTC with a 99.5% SLA, with a streaming supplement that’s at most 5 minutes behind.”
  • What the caveats are — “this table undercounts mobile sessions before 2025-Q3 due to a tracking bug that was fixed in v2.4.”
  • How others use it — historical query patterns, common joins, known gotchas.

Without this context, agents guess. They pick the wrong table. They join on the wrong key. They produce results that look plausible but are silently wrong — exactly the failure mode Part 1 described. And when a schema changes upstream, there’s no contract to break and no alert to fire. The agent builder finds out when the outputs stop making sense.

A common response here is “RAG solves this.” It doesn’t — not alone. RAG is a retrieval mechanism. It can surface relevant context at query time, but it can’t retrieve metadata that doesn’t exist. If your tables have no descriptions, no lineage, and no freshness guarantees, there’s nothing for RAG to retrieve. The context layer has to be built before retrieval can work.

Can agent builders work around this? Partially. You can build schema validation, freshness checks, and table discovery into each agent individually. But that’s every agent team reinventing the same guardrails — redundant work that the platform should provide once, for all consumers.

Rich metadata isn’t documentation for humans who might read it someday. It’s the context layer that agents consume at query time to make correct decisions. Every hour spent enriching table metadata is an hour saved debugging agent failures in production.

3. Platform as product

The final shift is organizational. An AI-ready data platform must be treated as a product — with SLAs, observability, cost governance, and consumer feedback loops.

When the consumers were humans running dashboards, a pipeline breaking at 3 AM could wait until morning. When the consumers are agents making decisions around the clock, a pipeline break at 3 AM is an operational incident.

There’s also a technical reality that most teams underestimate: traditional data pipelines weren’t designed to handle agentic telemetry. A batch ETL pipeline built for hourly dashboard refreshes can’t absorb the output of a fleet of agents making thousands of decisions per hour. The streaming infrastructure to handle this throughput exists — Kafka, Flink, and their equivalents have been production-ready for years. But most organizations’ deployed architectures haven’t been extended to treat agent outputs as first-class data products.

Part 1 made the distinction: agent telemetry is decision data, not event data. Here’s why that distinction matters architecturally. Every trace includes the input context, the tool calls, the retrieval results, the output, and the validation decision. It’s structurally richer than a log line, higher stakes than a clickstream event, and it feeds back into the system that produced it.
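One way to see why a decision trace is structurally richer than a log line is to sketch the record shape. Field names are my own, not a standard:

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class DecisionTrace:
    """One agent decision, captured end to end: inputs, intermediate
    steps, output, and the validation verdict. Unlike a log line, it
    feeds back into the platform that produced it."""
    agent_id: str
    input_context: dict
    tool_calls: list[dict] = field(default_factory=list)
    retrieval_results: list[dict] = field(default_factory=list)
    output: dict = field(default_factory=dict)
    validation: str = "pending"  # e.g. "passed", "failed", "escalated"

trace = DecisionTrace(
    agent_id="pricing-agent-7",
    input_context={"room": "luxury-suite", "date": "2026-02-03"},
    tool_calls=[{"tool": "query", "table": "analytics.booking_events"}],
    retrieval_results=[{"doc": "pricing-policy-v3"}],
    output={"rate": 412},
    validation="passed",
)
payload = json.dumps(asdict(trace))  # what lands in the telemetry pipeline
```

Multiply this by thousands of decisions per hour per agent and the pipeline requirements below follow directly.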

The pipeline that ingests it needs to support near-real-time writes, schema flexibility for heterogeneous agent outputs, and retention policies that satisfy auditability without blowing up storage costs. If your eval suite breaks on Thursday because the underlying data shifted — that’s a pipeline freshness problem, not a model problem.

This means:

  • Freshness SLAs that are defined, measured, and enforced — not aspirational.
  • Cost governance that attributes spend to agent workloads, so you can optimize the fleet, not just the infrastructure.
  • Observability that covers the full loop — data ingestion, transformation, agent consumption, decision output, and feedback.
  • Consumer contracts — schemas, freshness guarantees, and breaking-change policies — that the platform team commits to and agent developers can rely on. When a schema changes, the agent developer should know before their agent does.
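“The agent developer should know before their agent does” implies a contract check that runs on the producer side, before a schema change ships. A minimal sketch — the contract format is illustrative:

```python
def breaking_changes(contract: dict[str, str],
                     proposed: dict[str, str]) -> list[str]:
    """Compare a proposed schema against the published consumer contract.
    Removed or retyped columns break consumers; additions do not."""
    problems = []
    for col, typ in contract.items():
        if col not in proposed:
            problems.append(f"removed column: {col}")
        elif proposed[col] != typ:
            problems.append(f"type change: {col} {typ} -> {proposed[col]}")
    return problems

contract = {"booking_id": "string", "rate": "double", "ts": "timestamp"}
proposed = {"booking_id": "string", "rate": "decimal(10,2)",
            "ts": "timestamp", "channel": "string"}  # type change + new column
issues = breaking_changes(contract, proposed)
# A non-empty list blocks the producer's deploy until consumers are notified.
```

Wired into the producer’s CI, this turns the silent schema change from the failure modes above into a loud, pre-deploy conversation.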

The data platform is no longer a back-office utility. It’s the runtime your agents depend on.

Case study: OpenAI’s in-house data agent

This isn’t just theory. In January 2026, OpenAI published a detailed account of building an internal data agent — one that lets employees across Engineering, Finance, and Research go from question to insight in minutes using natural language.

As of that writing, their data platform serves 3,500+ internal users across 70,000+ datasets and 600+ petabytes. The agent is powered by their most capable model. But the most revealing part of the post isn’t the model — it’s the six layers of context they had to build before the agent could work.

Layer 1: Table usage. Schema metadata, column types, table lineage, and historical query patterns — so the agent knows which tables exist, how they relate, and how they’re typically queried.

Layer 2: Human annotations. Curated descriptions written by domain experts capturing intent, semantics, business meaning, and caveats that aren’t inferrable from schemas alone.

Layer 3: Code enrichment. By crawling the codebase that produces each table, the agent understands how data is derived — uniqueness constraints, update frequency, scope exclusions, granularity. As they put it: “Meaning lives in code. Schemas and query history describe a table’s shape and usage, but its true meaning lives in the code that produces it.”

Layer 4: Institutional knowledge. Integration with Slack, Google Docs, and Notion to access launch context, incident history, internal codenames, and canonical metric definitions — with access control, PII redaction, and retention policies applied at ingestion, not as an afterthought.

Layer 5: Memory. Corrections and learnings from previous interactions — saved and reused so the agent doesn’t repeat the same mistakes.

Layer 6: Runtime context. Live queries to the data warehouse, Airflow, and Spark when static context is insufficient or stale.

These six layers map directly to the pillars I described above. Layers 1-3 are metadata-rich tables — and notably, Layers 2 and 3 required real headcount and process investment (domain experts writing annotations, code analysis tooling), not just turning on a feature flag. Layer 4 is institutional governance and documentation. Layers 5-6 are platform-as-product — continuous improvement and real-time observability.

The punchline from their own team: without these layers, the agent produced wrong results — “vastly misestimating user counts or misinterpreting internal terminology.” With them, it handles complex, multi-step analyses that would otherwise take days.

OpenAI didn’t build a better model to fix their data agent. They built a better data platform.

You might think this is a scale problem — that 70,000 datasets and 600 petabytes demand layers that smaller organizations don’t need. But the pattern holds at any size. Even 500 tables with no descriptions, no lineage, and no freshness guarantees will trip up an agent. The number of layers you need scales with your data estate. The architectural pattern doesn’t change.

The real moat

Zoom out. In the next few years, most enterprises will have access to capable foundation models — whether commercial APIs, open-source alternatives, or fine-tuned variants. The cost will drop. The orchestration frameworks will mature. The intelligence layer, for the majority of enterprise use cases, will commoditize.

What won’t commoditize is your data platform. Your schemas, your lineage, your metadata, your governance model, your freshness guarantees — these are specific to your business, your domain, and your operational requirements. No vendor will build them for you. No model upgrade will bypass them. OpenAI — a company that builds the best models in the world — proved this when their own data agent only worked after investing in six layers of platform context.

This is what the symbiotic relationship between the intelligence layer and the data layer actually demands. Better models don’t compensate for a weak data platform. And a strong data platform without capable agents is just an expensive warehouse. The two layers succeed together or stall together.

The moat isn’t the model. It’s the platform underneath it.


If you’re thinking about what your data platform needs to look like for agent workloads — or you’re already building it — I’d like to hear about it. Find me @gcdaii.

