Work in Progress

Inspector Agents

A working prototype of automated AI compute verification. We trained a small GPT on a laptop, ran real training workloads on H100s, and asked Claude to read the logs and tell us what was running inside.

Summary
  • Jason and Jasmine are running the Verifier Challenge: an open competition to build the best AI verifiers.
  • AI verification is the problem of confirming what has been computed inside an AI datacenter, such as through reading its hardware records ("logs").
  • It's the prerequisite for almost any kind of AI rule we'd want: an AI lab proving it's keeping its own safety commitments, a regulator enforcing the law, the US and China checking each other on any kind of future deal.
  • Like some problems before it — self-driving, image recognition, reading sealed scrolls — building good verifiers is orphaned: too applied for academia, too unprofitable for VCs, too technical for most philanthropy.
  • Before running the competition, we built the smallest possible working version ourselves, on a laptop and some rented GPUs — and, to our knowledge, the first.
  • Jason trained a small, 3.2M-parameter GPT end-to-end on his MacBook, capturing a log of every training step, and then built a small automated AI verifier (a lightly scaffolded Claude agent) to read those logs.
  • Jasmine ran an analogous testbed on two H100 SXM hosts on RunPod, in different US regions: an eight-phase schedule of training, inference, fine-tuning, and idle workloads. Three of the phases lie about what they're actually doing — training labeled as inference, the larger model labeled as the smaller one, fine-tuning labeled as a fresh pretraining run — with ground-truth labels released alongside the prover's claims.
  • We've open-sourced all of the above: github.com/jasonhausenloy/inspector-agents and huggingface.co/datasets/jasminexli/verifier-challenge-traces, CC-BY-4.0.

Introduction

In what now seems like a lifetime ago, Sam Altman testified to Congress that he wanted a regime to oversee the development of AI modelled after the IAEA (the International Atomic Energy Agency, the body that tries to verify nuclear-non-proliferation deals). Enforced, perhaps, by on-the-ground inspectors.

What, indeed, would these inspectors be doing? Like everyone else in AI governance at the time, we were fuzzy on the details — there wasn't exactly an enriched uranium stockpile to test for — and fumbling in the dark.

Trump and Xi at Mar-a-Lago, 2017. White House photo, public domain.

These days, the vibes have shifted away from multilateral fora (UN? UN who?). But a bilateral treaty between the US and China may still be a legacy-defining project for President Trump (a "Deal of The Century", if you will, for the dealmaker-in-chief).

And how would such a deal be enforced? As the old Reagan saying goes, "trust but verify."

This faces a few problems: your inspectors would need to be accurate and thorough enough to catch violations (datacenter logs are huge), they'd need to be trusted by all parties (would the Chinese really trust the Swiss?), and they'd need to preserve AI-relevant intellectual property.

Like everyone, we were not AI-pilled enough. One way, perhaps the best way, to verify a deal on AI is to use the AIs themselves.


Challenges

Stanley, the Stanford VW Touareg that won the 2005 DARPA Grand Challenge, in the Mojave. Wikipedia, CC BY-SA.

In 2004, the military's innovation arm, DARPA, launched its first Grand Challenge: build a self-driving car capable of crossing 142 miles of the Mojave Desert, with a $1m prize. No team finished, so DARPA ran it again, with a $2m prize. Stanford's "Stanley", a modified Volkswagen Touareg, finished the course in just under seven hours.

A Waymo parked outside Jason's SF block, 2026.

Sebastian Thrun, who led the team, went on to start Google's self-driving car division. Now we ride Waymos across SF.

In 2009, then-Princeton professor Fei-Fei Li launched ImageNet: a hand-labelled dataset of 14m images, alongside a competition asking entrants to classify images into 1,000 categories. In 2012, a graduate student named Alex Krizhevsky entered AlexNet, a deep CNN trained on two consumer GPUs, and won by a margin of more than 10 percentage points. The deep learning revolution that followed produced the modern AI industry we know today.

AlexNet's architecture (Krizhevsky, Sutskever, Hinton, 2012). Wikipedia, CC BY-SA.

In 2023, Nat Friedman, Daniel Gross, and a small team launched the Vesuvius Challenge: a $1 million prize for anyone who could read the inside of a Herculaneum scroll — papyrus carbonized by Mount Vesuvius in 79 AD, sealed for 1,944 years, too fragile to unroll. Within 10 months, a 21-year-old named Luke Farritor recovered the first word: πορφύρας, "purple." Similar techniques have since recovered substantial passages from previously unreadable scrolls.

A carbonized Herculaneum scroll, the kind the Vesuvius Challenge is unrolling. Wikimedia Commons, public domain.

Like the problems before it — self-driving, image recognition, reading sealed scrolls — building good verifiers is orphaned: too applied for academia, too unprofitable for VCs, too technical for most philanthropy. Funding pre-paradigmatic fields is hard, and the field of AI verification doesn't yet exist as a discipline: it spans agent security, cryptography, hardware, standards, and policy. To succeed, verification requires buy-in across rivals; a verifier built, closed-source, by one US lab (or, worse, the NSA) is not a verifier any Chinese ministry will accept, and vice versa.


The Verifier Challenge

Leaderboard mockup (# / team / score / approach): public scoring on held-out adversarial logs. Approach column open: cryptographic, agent-based, hybrid.

We (Jason and Jasmine) are building the "Verifier Challenge", an open competition to build the best system for telling, just from a datacenter's logs, what AI is being trained inside.

Solving AI verification is a prerequisite to nearly every form of AI governance: for AI labs to prove to their employees and to regulators that they're respecting their safety standards, for the US and China to verify compliance with any kind of AI deal, even — as a basic check — for security teams to catch a rogue AI.

Before running the competition, we wanted to do some basic prototyping of an automated inspector agent ourselves. This post briefly runs over what we did, what we learned, and the next steps.


Our prototype

What we wanted to verify

We picked two properties of compute usage to monitor:

  1. Training versus inference — is the chip training a new model, or serving an existing one?
  2. Model size — is the model being trained the size that was declared?

We picked these because both are foundational to any serious agreement on compute; the former is needed to institute FLOP caps on training compute, and the latter for model-size caps. Eval-gating — the proposal that a lab can only proceed to its next training run conditional on passing a set of evaluations — also rests on a verifier.

The inspector agent

We use Claude Sonnet 4.6 and Opus 4.7 with custom system and user prompts, accessed through the Anthropic API. The system prompt is frozen and cached for an hour, the log payload is cached for five minutes, and the agent's output is structured JSON. Each call sends the agent a captured trace and the protocol schema, and the agent emits a verdict together with a stated confidence and a list of cited signals. We describe the agent's actual verdicts on real data further down.
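
For concreteness, here is roughly what one verifier call looks like, as a minimal sketch assuming the Anthropic Python SDK. The model id, file paths, and output field names are illustrative stand-ins rather than the exact ones in the repo, and the one-hour system-prompt cache needs the SDK's extended cache-TTL option, which we omit here.

import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

system_prompt = open("prompts/verifier_system.md").read()   # hypothetical path
schema = open("logs/schema.md").read()
trace = open("logs/run_000/training_log.jsonl").read()      # hypothetical path

response = client.messages.create(
    model="claude-sonnet-4-6",   # placeholder model id
    max_tokens=2048,
    system=[{
        "type": "text",
        "text": system_prompt,
        "cache_control": {"type": "ephemeral"},   # cache the frozen system prompt
    }],
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": schema + "\n\n" + trace,
             "cache_control": {"type": "ephemeral"}},        # cache the log payload too
            {"type": "text",
             "text": "Return JSON only: {\"verdict\": ..., \"confidence\": ..., \"cited_signals\": [...]}"},
        ],
    }],
)

# Assumes the model followed the instruction and returned bare JSON.
verdict = json.loads(response.content[0].text)
print(verdict["verdict"], verdict["confidence"], verdict["cited_signals"])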

Laptop-scale testbed (Jason)

The point of this half of the prototype was to verify the basic loop end-to-end at small scale — train a model, capture a structured log of every step and every inference token, hand the log to the agent, and have the agent classify what was running — before scaling to a real datacenter. On a laptop every memory counter is readable and every kernel is mine, so this is the comfortable half of the experiment.

I trained a 3.2-million-parameter character-level GPT — four transformer blocks, four heads, 256 hidden dimensions — on Tiny Shakespeare (a 1 MB corpus of the plays — yes, "something is rotten in the state of Denmark" is in there), on Apple Silicon's MPS backend (Apple's GPU compute backend, the M-series equivalent of CUDA).

Spec: 4 layers · 4 heads · 256 hidden · seq 128 · vocab 65 · MPS · ~5 min wall clock for 9k steps.

It took ~5 minutes of wall-clock training for the model to be able to tell that ROMEO is followed by a colon! Then I ran inference on the trained checkpoint and captured another few minutes of token-by-token output.
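
As a sanity check on the headline number, the parameter count falls out of that spec directly. A back-of-envelope tally, assuming the usual GPT block layout with a 4× MLP and ignoring biases:

# Rough parameter count for the laptop GPT from the spec above (no biases).
n_layer, d, vocab, seq = 4, 256, 65, 128
attn = 4 * d * d              # q, k, v, and output projections
mlp = 2 * d * (4 * d)         # d -> 4d -> d
ln = 2 * (2 * d)              # two LayerNorms per block (weight + bias)
per_block = attn + mlp + ln
total = vocab * d + seq * d + n_layer * per_block + 2 * d   # embeddings + blocks + final LN
print(f"{total / 1e6:.2f}M parameters")   # ~3.20M, matching the 3.2M figure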

The log from this training run recorded, for every step, the parameters shown in the example record below.

I did a full capture of this data, which was about 320 MB on disk: roughly 9,000 hash-chained training records, 240 inference records, 100 raw-tensor snapshots if you want the actual matrices, four full model checkpoints, and one Chrome profiler trace. The log format is at logs/schema.md; everything is also written up in more depth, with actual numbers, at jason.ml/everything.

One training-step record (one step out of 9,000):

{
  "step": 200,
  "loss": 2.31,
  "grad_norm": 1.17,
  "step_time_seconds": 0.06,
  "tokens_per_second": 68260.3,
  "learning_rate": 0.0003,
  "forward_ms": 15.749,
  "backward_ms": 38.542,
  "optimizer_ms": 4.744,
  "telemetry": { "process_rss_mb": 414.28, "mps_allocated_mb": 51.9, "system_load_1m": 7.0 },
  "grad_norms_per_group": { "embed": 0.80, "blocks": 0.86, "head": 0.0, "ln": 0.02 },
  "weight_norms_per_group": { "embed": 5.12, "blocks": 57.81, "head": 0.0, "ln": 16.58 },
  "optimizer_state_norms": { "adam_m_norm": 0.17, "adam_v_norm": 0.0089, "tracked_params": 52 },
  "flops": 78863400960,
  "op_type": "training",
  "model_hash_prefix": "sha256:546a946947ee7117",
  "prev_record_hash": "6cc01ed1fb49062f671b04acd121bdb8a0bf8150fd9f5f7f3ec29ebcecac48ae",
  "record_id": "trec_000200",
  "window_start": "2026-04-29T05:20:29Z",
  "window_end": "2026-04-29T05:20:29Z"
}

Training loss fell from 4.41 to 1.17 over the 9,000-step laptop run (~11 min wall clock on MPS).
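
The prev_record_hash field is what makes the records tamper-evident: each record commits to the one before it, so truncating or editing the middle of the run breaks the chain. A minimal sketch of the check, assuming each hash is the SHA-256 of the previous raw JSON line; the authoritative hashing scheme is whatever logs/schema.md specifies.

import hashlib
import json

def verify_chain(path):
    prev_line = None
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            if prev_line is not None:
                expected = hashlib.sha256(prev_line.rstrip("\n").encode()).hexdigest()
                if rec.get("prev_record_hash") != expected:
                    return False, rec.get("record_id")
            prev_line = line
    return True, None

ok, broken_at = verify_chain("logs/run_000/training_log.jsonl")  # hypothetical path
print("chain intact" if ok else f"chain broken at {broken_at}")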

H100 testbed (Jasmine)

The H100 testbed is a sanity check on whether the laptop approach extends to a real datacenter setting. The laptop run is comfortable, since every counter is readable and every kernel is ours, but a real datacenter is the opposite: the chip is shared, the access is mediated, and the counters you most want to read are usually the ones the cloud provider does not expose. The point of this half of the experiment is to test whether the same agent-on-logs approach holds up on the kind of telemetry an actual inspector would have to work with.

Why this is hard: most of the work was getting telemetry that real labs already collect but do not share: DCGM at 10 Hz, per-rank NCCL logs, NIC counters. The combination of streams at this resolution — nvsmi, DCGM, per-rank NCCL, and NIC counters, with ground-truth adversarial labels alongside the prover's claims — is uncommon in public GPU-workload datasets.

I rented two H100 SXM hosts on RunPod, in different US regions, and ran a fixed schedule on each: four honest workload phases, one idle baseline, and three adversarial phases.

The three adversarial phases each run a real workload and lie about what it is. Each lie is picked to attack a different governance property.

8-phase schedule (tinyP · smallP · infer · tinyFT · idle · P1 · P2 · P3): 4 workload phases + 1 idle + 3 adversarial, 145 min per host.

The first adversarial phase runs a training workload and labels it as inference. This is the lie that defeats a training-FLOP cap, because if a training run is reported as inference, the declared training compute stays under whatever cap is in place. The signal that gives this lie away is NCCL: training requires the two GPUs to synchronize gradients on every step, and the per-rank NCCL log records each of those calls, while inference never issues them.
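
A rough sketch of that check, assuming the per-rank logs were captured with NCCL_DEBUG=INFO so collective launches show up as text lines; the exact line format varies across NCCL versions, so we just look for the collective names.

import gzip

COLLECTIVES = ("AllReduce", "ReduceScatter", "AllGather", "Broadcast")

def collective_calls(path):
    count = 0
    with gzip.open(path, "rt", errors="replace") as f:
        for line in f:
            if any(name in line for name in COLLECTIVES):
                count += 1
    return count

# A phase that claims to be inference but logs thousands of collectives is lying.
n = collective_calls("phases/adversarial_p1/nccl_rank0.log.gz")  # hypothetical path
print(f"claimed inference, observed {n} collective calls")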

The second adversarial phase runs the larger model but labels the workload as the smaller one. This is the lie that defeats a model-size cap. In our captures, the larger model occupies about 18.8 GB of HBM in steady state and the smaller one about 2 GB, so there is no plausible way to fit the larger model into the smaller's memory footprint without changing the workload itself.
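
The corresponding check is mechanical: take the steady-state HBM footprint of the suspect phase and compare it against the honest capture of the model it claims to be. A sketch, assuming the column order shown in the nvsmi.csv excerpts further down; the paths and the 2× threshold are illustrative.

import csv
import statistics

def steady_state_mem_mib(path):
    vals = []
    with open(path) as f:
        for row in csv.reader(f):
            try:
                vals.append(float(row[3]))    # mem (MiB) column
            except (ValueError, IndexError):
                continue                       # skip header or malformed rows
    return statistics.median(vals)

claimed = steady_state_mem_mib("phases/adversarial_p2/nvsmi.csv")           # hypothetical path
reference = steady_state_mem_mib("phases/honest_pretrain_tiny/nvsmi.csv")   # hypothetical path
print("suspicious" if claimed > 2 * reference else "consistent with claim")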

The third adversarial phase fine-tunes an existing checkpoint but labels the run as a fresh pretraining job. This is the lie that hides an undeclared fine-tune inside a declared training window — a harder property than the first two, included here as a stretch case. Fine-tuning in this setup freezes the embedding layer and uses a learning rate two orders of magnitude below pretraining, and both choices leave visible signatures in the loss and gradient traces.
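
Both signatures are simple to test for when per-step records in the style of the laptop schema are available. A sketch, with field names borrowed from that schema rather than guaranteed to match the H100 capture:

import json

def looks_like_finetune(records, declared_pretrain_lr=3e-4):
    embed_grads = [r["grad_norms_per_group"]["embed"] for r in records]
    lrs = [r["learning_rate"] for r in records]
    frozen_embed = max(embed_grads) < 1e-8          # frozen embeddings: zero grads all run
    tiny_lr = max(lrs) < declared_pretrain_lr / 10  # LR far below a pretraining schedule
    return frozen_embed and tiny_lr

records = [json.loads(l) for l in open("phases/adversarial_p3/train_log.jsonl")]  # hypothetical path
print(looks_like_finetune(records))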

Ground-truth labels live in workload_labels.jsonl alongside the prover's claimed labels. The whole capture is a label-flip benchmark: any verifier can be scored against it.

By the numbers: 34,072 rows · 1.33 GB on disk · 2 hosts × 8 phases · captured under CC-BY-4.0 at huggingface.co/datasets/jasminexli/verifier-challenge-traces.
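
Scoring is just a comparison against that file. A sketch, with the field names ("phase", "true_workload") as stand-ins; the dataset card has the real ones.

import json

labels = [json.loads(line) for line in open("workload_labels.jsonl")]

def score(verdicts):
    """verdicts: dict mapping phase name -> the verifier's predicted workload."""
    correct = sum(1 for row in labels
                  if verdicts.get(row["phase"]) == row["true_workload"])
    return correct / len(labels)

print(score({"adversarial_p1": "training"}))  # toy example; real verdicts come from the agent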

For each phase, on each host, we captured six streams. nvsmi.csv is nvidia-smi polled once a second, recording the basic dashboard counters every cloud provider exposes: power, utilization, memory used, clock speed, temperature. dcgm.csv is NVIDIA's data-center GPU manager sampled ten times per second, a deeper layer that most cloud providers do not enable by default; we installed the daemon ourselves. A couple of DCGM's most useful counters — NVLink traffic and tensor-core activity — require a privileged container, which RunPod does not grant on community pods, so we ran without those and noted the gap in the dataset card.

The four other streams are NCCL logs per rank (every collective communication call between the two GPUs, gzipped), /proc/net/dev polled at 1 Hz as a NIC-level cross-check, model checkpoints saved at the end of training phases, and a provenance.json file with host metadata, GPU UUIDs, driver and NCCL version, and a wall-clock skew estimate so the two hosts can be temporally aligned.
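
The first of those streams takes almost no machinery. A sketch of a 1 Hz nvidia-smi poller that produces rows like the excerpts below; the field list and output path are illustrative, and the real capture scripts live in the repo.

import subprocess
import time

QUERY = "timestamp,index,utilization.gpu,memory.used,power.draw,temperature.gpu"

with open("nvsmi.csv", "a") as out:
    while True:
        rows = subprocess.run(
            ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        out.write(rows)      # one line per GPU
        out.flush()
        time.sleep(1.0)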

The combined resolution is the part that is unusual in public datasets. People who run real workloads tend not to share this much, and people who write about verification tend not to run real workloads. None of the streams is individually conclusive, but together they leave very little room for a prover to lie.

Two excerpts from the actual capture, both gpt-small on the same checkpoint:

phases/honest_pretrain_small/nvsmi.csv · steady-state training · both GPUs
timestamp                       gpu  util%  mem (MiB)  power (W)  temp (°C)
2026/04/29 07:20:20.991          0    100      18787      578.66       54
2026/04/29 07:20:20.992          1    100      18787      568.88       55
2026/04/29 07:20:21.992          0    100      18787      575.94       54
2026/04/29 07:20:21.993          1    100      18787      567.14       55
2026/04/29 07:20:22.993          0    100      18787      576.77       53
2026/04/29 07:20:22.994          1    100      18787      566.75       55

phases/honest_inference_small/nvsmi.csv · steady-state inference · both GPUs
timestamp                       gpu  util%  mem (MiB)  power (W)  temp (°C)
2026/04/29 07:48:52.395          0     55       2079      174.42       34
2026/04/29 07:48:52.395          1     54       2079      171.82       35
2026/04/29 07:48:53.396          0     55       2079      174.87       34
2026/04/29 07:48:53.396          1     53       2079      171.75       36
2026/04/29 07:48:54.397          0     55       2079      174.37       34
2026/04/29 07:48:54.397          1     54       2079      173.09       36

Both excerpts are running the same gpt-small model on the same H100 pair, so the difference between the two blocks is entirely the workload signature, not the hardware. Training pins both GPUs at 100% utilization and around 575 W of power draw, while inference holds steady at about 55% utilization and 174 W. The memory footprint differs by roughly an order of magnitude (18.8 GB versus 2.1 GB), and the temperature gap directly reflects the power gap. None of these differences are surprising once they are visible — the point of the testbed is that they are recoverable from telemetry the prover cannot fully control without changing the workload itself.
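
Those gaps are wide enough that even a naive baseline can exploit them. As a strawman (not our verifier), here is a threshold rule read straight off the two excerpts above, assuming their column order:

import csv

def classify_phase(nvsmi_path, util_threshold=90.0):
    utils = []
    with open(nvsmi_path) as f:
        for row in csv.reader(f):
            try:
                utils.append(float(row[2]))   # util% column
            except (ValueError, IndexError):
                continue                       # skip header or malformed rows
    mean_util = sum(utils) / len(utils)
    return "training" if mean_util > util_threshold else "inference"

print(classify_phase("phases/honest_pretrain_small/nvsmi.csv"))    # -> training
print(classify_phase("phases/honest_inference_small/nvsmi.csv"))   # -> inference

A rule this crude obviously misses the subtler lies (the fine-tune relabel, for one), which is why the agent reads all six streams rather than a single threshold.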

Across the six streams, the signatures we relied on were the following: per-rank NCCL collectives (present on every training step, absent in inference), steady-state HBM footprint (model size), utilization, power, and temperature levels (workload intensity), loss and gradient traces (fresh pretraining versus fine-tuning), and /proc/net/dev as a NIC-level cross-check on the NCCL logs.

Results

We started by running the agent on the laptop traces. Across the seven canonical test cases — one clean training trace, one clean inference trace, and five adversarial mutations designed to fool a naive classifier — the agent returned the correct verdict on all seven, at roughly $0.15 per call.

Most-cited signals: FLOP-per-token ratio (training is ~3× higher than inference) · presence/absence of a _real_training block · memory profile shape (training pre-allocates and stays flat; inference grows with the KV cache) · weight drift across the run.
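
The ~3× FLOP-per-token ratio is just the standard forward/backward accounting, and it squares with the flops field in the example training record above (roughly 6 × params × tokens in the step). A quick check, where the batch size is inferred from the record rather than stated in it:

# Training costs roughly 6N FLOPs per token (2N forward + ~4N backward);
# inference costs roughly 2N (forward only). Hence the ~3x gap the agent cites.
N = 3.2e6                                   # laptop GPT parameters
tokens_per_step = 32 * 128                  # assumed batch 32 x seq 128 = 4,096 tokens
train_per_token, infer_per_token = 6 * N, 2 * N
print(train_per_token / infer_per_token)            # 3.0
print(f"{train_per_token * tokens_per_step:.3g}")   # ~7.9e10, close to the record's 78,863,400,960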

The red-team loop we ran alongside the verifier is more interesting than the verdicts themselves. Claude plays both sides: in each round, a red-team agent looks at the schema, the existing evasions, and the last ten rounds, and proposes a new attack — a function that mutates a clean log into a lying one that should still pass the verifier. The mutate function runs in a sandbox, the verifier classifies the result, and the round is appended to a public JSONL.
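
The loop itself is a small amount of glue. A skeleton with the agent calls stubbed out; function and file names here are illustrative, and the real loop is in the repo.

import json

def red_team_round(round_idx, schema, history, propose_attack, run_sandboxed, verify):
    attack = propose_attack(schema, history[-10:])        # Claude writes a mutate() over a clean log
    mutated_log = run_sandboxed(attack["mutate_source"])  # executed in a sandbox, never trusted
    verdict = verify(mutated_log)                         # the inspector agent classifies the result
    record = {
        "round": round_idx,
        "attack": attack["description"],
        "verdict": verdict,
        "attack_succeeded": verdict == "clean",           # the lie got past the verifier
    }
    with open("redteam_rounds.jsonl", "a") as f:          # the public round log
        f.write(json.dumps(record) + "\n")
    history.append(record)
    return record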

We ran this loop for sixty rounds. Fifteen attacks succeeded, and each one was closed by a protocol patch the next round. The patch list turns out to be the more interesting artifact, since every dimension of the schema needed an explicit uniqueness or stability constraint before the agent could be relied on (truncation, model-hash drift, op-type relabelling, hash-and-config rotation, token under-reporting, cross-job fingerprint pollution), and by the end the protocol had grown from four implicit rules to nineteen.

Patch trajectory: rules in the protocol over the 60-round red-team loop, growing from 4 to 19.

We're currently running the inspector agent end-to-end on the H100 capture and will update results soon. The dataset ships with ground-truth labels — workload_labels.jsonl carries both the truth and the prover's claim for every phase, and our verifier can be easily scored against it.

Current limitations


Our next steps

First, we want to focus on harder compute-usage properties to verify.

Properties × signals (what we'd want each verifier to read off first):

property               load-bearing signal
FLOP cap               power × time vs Kaplan floor
model size             tensor-core util, mem footprint
undeclared finetune    NCCL bursts during "inference"
served identity        weight-checkpoint hash at load

Other things on the list, grouped by area:

Improving the verifier:

Improving the data:

Building the challenge:

Protocols:


Support us

AI is the most consequential technology of our time, and we still cannot verify the most basic facts about how it's built. 

We want to build the adversarially robust, privacy-preserving verifier that different parties will trust.

AI verification has the same shape as DARPA's Grand Challenge, ImageNet, and the Vesuvius Challenge. The problem is real, and the solution path is unclear but not technically impossible. We believe that setting a public challenge will materially accelerate progress in the field.

At the same time, we want to push forward with novel research and building of our own: standards-setting and our automated inspection infrastructure.

Our phased plan
  • Phase 1: Foundations (May–June; $50k for compute, travel).
    • Collaborations. Speak with RAND, MIRI, the State Department, DARPA, and Chinese researchers (e.g. at ISE 2026 in Singapore) on useful verification properties. Approach cloud providers, NVIDIA, and the Frontier Model Forum about sponsoring compute for participants and our logging standard.
    • Build a small datacenter-scale MVP. Rent many nodes, run mixed training and inference workloads, build the infrastructure to capture full logs.
  • Phase 2: Build (June–September; $70k for compute, talent, salaries). Determine the competition structure: lock in testbed design for three-to-five challenge types (e.g. training-run attestation, training-vs-inference classification, undeclared-workload detection), specify the scoring function and adversarial budget, and send it to frontier AI security researchers for review.
  • Phase 3: Launch (October; $50k–$500k for prize, marketing, judging, review). Public Verifier Challenge launch.

We are currently seeking funding, mentorship, and sponsorship from compute providers and labs.