Work in Progress
Inspector Agents
A working prototype of automated AI compute verification. We trained a small GPT on a laptop, ran real training workloads on H100s, and asked Claude to read the logs and tell us what was running inside.
Summary
- Jason and Jasmine are running the Verifier Challenge: an open competition to build the best AI verifiers.
- AI verification is the problem of confirming what has been computed inside an AI datacenter, such as through reading its hardware records ("logs").
- It's the prerequisite for almost any kind of AI rule we'd want: an AI lab proving it's keeping its own safety commitments, a regulator enforcing the law, the US and China checking each other on any kind of future deal.
- Like some problems before it — self-driving, image recognition, reading sealed scrolls — building good verifiers is orphaned: too applied for academia, too unprofitable for VCs, too technical for most philanthropy.
- Before running the competition, we built the smallest possible working version ourselves, on a laptop and some rented GPUs — and, to our knowledge, the first.
- Jason trained a small, 3.2M-parameter GPT end-to-end on his MacBook, capturing each training step, and then built a small automated AI verifier (a lightly scaffolded Claude agent) to read them.
- Jasmine ran an analogous testbed on two H100 SXM hosts on RunPod, in different US regions: an eight-phase schedule of training, inference, fine-tuning, and idle workloads. Three of the phases lie about what they're actually doing — training labeled as inference, the larger model labeled as the smaller one, fine-tuning labeled as a fresh pretraining run — with ground-truth labels released alongside the prover's claims.
- We've open-sourced all of the above: github.com/jasonhausenloy/inspector-agents and huggingface.co/datasets/jasminexli/verifier-challenge-traces, CC-BY-4.0.
Introduction
In what now seems like a lifetime ago, Sam Altman testified to Congress that he wanted a regime to oversee the development of AI modelled after the IAEA (the International Atomic Energy Agency, the body that tries to verify nuclear-non-proliferation deals). Enforced, perhaps, by on-the-ground inspectors.
What, indeed, would these inspectors be doing? Like everyone else in AI governance at the time, we were fuzzy on the details — there wasn't exactly an enriched uranium stockpile to test for — and fumbling in the dark.
Trump and Xi at Mar-a-Lago, 2017. White House photo, public domain.

These days, the vibes have shifted away from multilateral fora (UN? UN who?). But a bilateral treaty between the US and China may still be a legacy-defining project for President Trump (a "Deal of The Century", if you will, for the dealmaker-in-chief).
And how would such a deal be enforced? As the old Reagan saying goes, "trust but verify."
This faces a few problems: an inspector would need to be accurate and thorough enough to catch violations (datacenter logs are huge), trusted by all parties (would the Chinese really trust the Swiss?), and able to preserve AI-relevant intellectual property.
Like everyone, we were not AI-pilled enough. One way, perhaps the best way, to verify a deal on AI is to use the AIs themselves.
Challenges
Stanley in the Mojave, 2005. Wikipedia, CC BY-SA.

In 2004, the military's innovation arm, DARPA, launched its first Grand Challenge: build a self-driving car capable of crossing 142 miles of the Mojave Desert, with a $1m prize. No team finished, so DARPA ran it again in 2005, with a $2m prize. Stanford's "Stanley", a modified Volkswagen Touareg, finished the course in just under seven hours.
A Waymo parked outside Jason's SF block!

Sebastian Thrun, who led the team, later went on to start Google's self-driving car division. Now we ride Waymos across SF.
In 2009, then-Princeton professor Fei-Fei Li launched ImageNet: a hand-labelled dataset of 14m images, alongside a competition asking entrants to classify images into 1,000 categories. In 2012, a graduate student named Alex Krizhevsky entered AlexNet, a deep CNN trained on two consumer GPUs, and won by a margin of more than 10 percentage points. The deep learning revolution that followed produced the modern AI industry we know today.

AlexNet's architecture (Krizhevsky, Sutskever, Hinton, 2012). Wikipedia, CC BY-SA.
In 2023, Nat Friedman, Daniel Gross, and a small team launched the Vesuvius Challenge: a $1 million prize for anyone who could read the inside of a Herculaneum scroll — papyrus carbonized by Mount Vesuvius in 79 AD, sealed for 1,944 years, too fragile to unroll. Within 10 months, a 21-year-old named Luke Farritor recovered the first word: πορφύρας, "purple." Similar techniques have since recovered text from more of the previously unreadable scrolls.

A carbonized Herculaneum scroll, the kind the Vesuvius Challenge is unrolling. Wikimedia Commons, public domain.
Like the problems before it — self-driving, image recognition, reading sealed scrolls — building good verifiers is orphaned: too applied for academia, too unprofitable for VCs, too technical for most philanthropy. Funding pre-paradigmatic fields is hard, and the field of AI verification doesn't yet exist as such: it spans agent security, cryptography, hardware, standards, and policy. To succeed, verification requires buy-in across rivals; a verifier built closed-source by one US lab (or, worse, the NSA) is not a verifier any Chinese ministry will accept, and vice versa.
The Verifier Challenge
Leaderboard (mockup):

| # | team | score | approach |
|---|------|-------|----------|
| 1 | —    | —     | —        |
| 2 | —    | —     | —        |
| 3 | —    | —     | —        |
| … | …    | …     | …        |

Public scoring on held-out adversarial logs. The approach column is open: cryptographic, agent-based, hybrid.

We (Jason and Jasmine) are building the "Verifier Challenge", an open competition to build the best system for telling, just from a datacenter's logs, what AI is being trained inside.
Solving AI verification is a prerequisite to ~every form of AI governance: for AI labs to prove to their employees and to regulators that they're respecting their safety standards, for the US and China to verify compliance with any kind of AI deal, even — as a basic check — for security teams to catch a rogue AI.
Before we run a competition, we wanted to do some basic prototyping of an automated inspector agent ourselves. This post briefly runs over some of what we did, what we learned, and the next steps.
Our prototype
What we wanted to verify
We picked two properties of compute usage to monitor:
- Training versus inference — is the chip training a new model, or serving an existing one?
- Model size — is the model being trained the size that was declared?
We picked these because both are foundational to any serious agreement on compute; the former is needed to institute FLOP caps on training compute, and the latter for model-size caps. Eval-gating — the proposal that a lab can only proceed to its next training run conditional on passing a set of evaluations — also rests on a verifier.
The inspector agent
We use Claude Sonnet 4.6 and Opus-4.7 with custom system and user prompts, accessed through the Anthropic API. The system prompt is frozen and cached for an hour, the log payload is cached for five minutes, and the agent's output is structured JSON. Each call sends the agent a captured trace and the protocol schema, and the agent emits a verdict together with a stated confidence and a list of cited signals. We describe the agent's actual verdicts on real data further down.
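As an illustration of that output contract, here is a minimal validator for the verdict JSON. The field names (verdict, confidence, cited_signals) are hypothetical stand-ins based on the description above, not the repo's actual schema:

```python
import json

# Hypothetical verdict shape: these field names are illustrative stand-ins,
# not the repo's actual schema.
REQUIRED_FIELDS = {"verdict", "confidence", "cited_signals"}
VERDICTS = {"training", "inference", "fine-tuning", "idle"}

def parse_verdict(raw: str) -> dict:
    """Parse and sanity-check the agent's structured-JSON output."""
    v = json.loads(raw)
    missing = REQUIRED_FIELDS - v.keys()
    if missing:
        raise ValueError(f"verdict missing fields: {sorted(missing)}")
    if v["verdict"] not in VERDICTS:
        raise ValueError(f"unknown verdict: {v['verdict']!r}")
    if not 0.0 <= float(v["confidence"]) <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    return v
```

Validating the structured output before trusting it matters here more than usual: a verifier that silently accepts a malformed verdict is itself an attack surface.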
Laptop-scale testbed (Jason)
The point of this half of the prototype was to verify the basic loop end-to-end at small scale — train a model, capture a structured log of every step and every inference token, hand the log to the agent, and have the agent classify what was running — before scaling to a real datacenter. On a laptop every memory counter is readable and every kernel is mine, so this is the comfortable half of the experiment.
I trained a 3.2-million-parameter character-level GPT — four transformer blocks, four heads, 256 hidden dimensions — on Tiny Shakespeare (a 1 MB corpus of the plays — yes, "something is rotten in the state of Denmark" is in there), on Apple Silicon's MPS backend (Apple's GPU compute backend, the M-series equivalent of CUDA).

Spec: 4 layers · 4 heads · 256 hidden · seq 128 · vocab 65 · MPS · ~5 min wall clock for 9k steps.

It took ~5 minutes of wall-clock training time for the model to learn that ROMEO is followed by a colon! Then I ran inference on the trained checkpoint and captured another few minutes of token-by-token output.
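As a sanity check on the quoted size, a back-of-envelope parameter count, assuming a standard GPT block with a 4x MLP and an untied output head (the real implementation may differ in the small terms):

```python
def gpt_param_count(n_layer=4, d_model=256, vocab=65, seq_len=128):
    """Back-of-envelope parameter count for the laptop GPT.
    Assumes a standard GPT block; small terms (biases, tying) may differ."""
    embeddings = vocab * d_model + seq_len * d_model   # token + position tables
    attn = 4 * d_model ** 2                            # Q, K, V, output projections
    mlp = 8 * d_model ** 2                             # d -> 4d -> d
    layernorms = 4 * d_model                           # 2 LNs x (scale + bias)
    head = d_model * vocab + 2 * d_model               # output head + final LN
    return embeddings + n_layer * (attn + mlp + layernorms) + head

print(f"~{gpt_param_count() / 1e6:.1f}M parameters")  # prints ~3.2M parameters
```

The per-block term (12·d² for attention plus MLP) dominates everything else, which is why model-size verification keys off the block count and hidden dimension.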
The log from this training run recorded the following parameters:
- Per training step: loss; gradient norms (global and per parameter group); Adam optimiser state norms.
- Phase timings (data load, forward, backward, optimiser); tokens per second; learning rate.
- System telemetry: process RSS, MPS allocated and driver memory, CPU%, system load, network I/O.
- Every fifty steps: full model fingerprint, top-five-largest-tensor min/max/mean/std summaries, and a per-parameter-group weight-norm snapshot.
- A SHA-256 hash chained to the previous record. Reorder a record and the chain breaks. Edit one and the chain breaks.
- Per inference token: the full softmax distribution at the moment of sampling, the attention pattern at every layer and every head, the KL divergence from the previous step's distribution at the same position, the latency, the KV-cache bytes.
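The chain property in the hash bullet above can be replayed by any reader of the log. A minimal checker, assuming the hash covers the previous record's canonical JSON (sorted keys, compact separators); the authoritative scheme lives in logs/schema.md:

```python
import hashlib
import json

def verify_chain(records):
    """Replay the SHA-256 chain over a list of decoded log records.
    Assumes the canonicalization is sorted-key compact JSON; the real
    scheme is specified in logs/schema.md. Returns the record_id of
    the first broken link, or None if the chain is intact."""
    for prev, curr in zip(records, records[1:]):
        canonical = json.dumps(prev, sort_keys=True, separators=(",", ":")).encode()
        if curr.get("prev_record_hash") != hashlib.sha256(canonical).hexdigest():
            return curr.get("record_id")  # reordering or editing breaks here
    return None
```

Reordering two records, or editing any field of any record, changes the canonical bytes and snaps the chain at the next link.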
I did a full capture of this data, which was about 320 MB on disk: roughly 9,000 hash-chained training records, 240 inference records, 100 raw-tensor snapshots if you want the actual matrices, four full model checkpoints, and one Chrome profiler trace. The log format is at logs/schema.md; everything is also written up at jason.ml/everything in more depth, with actual numbers.

One training-step record (one step out of 9,000):

```json
{
  "step": 200,
  "loss": 2.31,
  "grad_norm": 1.17,
  "step_time_seconds": 0.06,
  "tokens_per_second": 68260.3,
  "learning_rate": 0.0003,
  "forward_ms": 15.749,
  "backward_ms": 38.542,
  "optimizer_ms": 4.744,
  "telemetry": {
    "process_rss_mb": 414.28,
    "mps_allocated_mb": 51.9,
    "system_load_1m": 7.0
  },
  "grad_norms_per_group": {
    "embed": 0.80,
    "blocks": 0.86,
    "head": 0.0,
    "ln": 0.02
  },
  "weight_norms_per_group": {
    "embed": 5.12,
    "blocks": 57.81,
    "head": 0.0,
    "ln": 16.58
  },
  "optimizer_state_norms": {
    "adam_m_norm": 0.17,
    "adam_v_norm": 0.0089,
    "tracked_params": 52
  },
  "flops": 78863400960,
  "op_type": "training",
  "model_hash_prefix": "sha256:546a946947ee7117",
  "prev_record_hash": "6cc01ed1fb49062f671b04acd121bdb8a0bf8150fd9f5f7f3ec29ebcecac48ae",
  "record_id": "trec_000200",
  "window_start": "2026-04-29T05:20:29Z",
  "window_end": "2026-04-29T05:20:29Z"
}
```

Training loss: loss over the 9,000-step laptop run, ~11 min wall clock on MPS.
H100 testbed (Jasmine)
The H100 testbed is a sanity check on whether the laptop approach extends to a real datacenter setting. The laptop run is comfortable, since every counter is readable and every kernel is ours, but a real datacenter is the opposite: the chip is shared, the access is mediated, and the counters you most want to read are usually the ones the cloud provider does not expose. The point of this half of the experiment is to test whether the same agent-on-logs approach holds up on the kind of telemetry an actual inspector would have to work with.

Why this is hard: most of the work was getting telemetry that real labs already collect but do not share: DCGM at 10 Hz, per-rank NCCL logs, NIC counters. The combination of streams at this resolution — nvsmi, DCGM, per-rank NCCL, and NIC counters, with ground-truth adversarial labels alongside the prover's claims — is uncommon in public GPU-workload datasets.
I rented two H100 SXM hosts on RunPod, in different US regions, and ran a fixed schedule on each: four workload phases, one idle baseline, and three adversarial phases:

- 30-minute pretraining run on TinyStories with a small GPT (gpt-tiny, ~10M parameters).
- 30-minute pretraining run on TinyStories with the larger model (gpt-small, the GPT-2 architecture, ~124M parameters).
- 15-minute inference run loading the larger model's checkpoint.
- 20-minute fine-tune of gpt-tiny on WikiText.
- 5-minute idle phase, as a baseline.
The three adversarial phases each run a real workload and lie about what it is. Each lie is picked to attack a different governance property.

8-phase schedule: 4 workload phases (green) + 1 idle (neutral) + 3 adversarial (red), 145 min per host.
The first adversarial phase runs a training workload and labels it as inference. This is the lie that defeats a training-FLOP cap, because if a training run is reported as inference, the declared training compute stays under whatever cap is in place. The signal that gives this lie away is NCCL: training requires the two GPUs to synchronize gradients on every step, and the per-rank NCCL log records each of those calls, while inference never issues them.
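That NCCL signal is mechanically checkable. A sketch, assuming NCCL_DEBUG=INFO-style log lines that name each collective (exact formats vary across NCCL versions):

```python
import gzip  # the capture's per-rank NCCL logs are gzipped

# Collectives that training uses to synchronize gradients across ranks.
GRAD_SYNC = ("AllReduce", "ReduceScatter", "AllGather")

def count_grad_syncs(lines):
    """Count gradient-synchronization collectives in per-rank NCCL log
    lines. Assumes NCCL_DEBUG=INFO-style lines that name the collective;
    real formats vary by NCCL version."""
    counts = dict.fromkeys(GRAD_SYNC, 0)
    for line in lines:
        for coll in GRAD_SYNC:
            if coll in line:
                counts[coll] += 1
    return counts

def contradicts_inference_claim(counts):
    # A phase claiming "inference" should issue ~zero of these.
    return sum(counts.values()) > 0

# Against the actual capture, stream a per-rank log:
# counts = count_grad_syncs(gzip.open("rank0.nccl.log.gz", "rt"))
```

A training phase emits one gradient sync per step per rank, so the count is not a faint signal: a 30-minute run at a few steps per second leaves thousands of lines that an "inference" label cannot explain.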
The second adversarial phase runs the larger model but labels the workload as the smaller one. This is the lie that defeats a model-size cap. In our captures, the larger model occupies about 18.8 GB of HBM in steady state and the smaller one about 2 GB, so there is no plausible way to fit the larger model into the smaller's memory footprint without changing the workload itself.
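A sketch of the corresponding memory check. The slack factor is an uncalibrated guess of ours, and the expected footprint comes from a reference run of the declared workload:

```python
def size_claim_suspicious(expected_mem_mib, observed_mem_mib, slack=3.0):
    """Flag a phase whose steady-state HBM is far from what the declared
    model should occupy, in either direction. expected_mem_mib comes
    from a reference run of the declared workload; slack (our guess)
    absorbs activation and allocator noise."""
    ratio = observed_mem_mib / expected_mem_mib
    return ratio > slack or ratio < 1 / slack

# Our capture's numbers: gpt-tiny should occupy ~2 GB, but the
# adversarial phase shows 18.8 GB while claiming gpt-tiny, off by ~9x.
print(size_claim_suspicious(2048, 18787))  # → True
```

The check is two-sided on purpose: memory far above the declared footprint suggests a bigger model, and memory far below suggests the declared workload never actually ran.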
The third adversarial phase fine-tunes an existing checkpoint but labels the run as a fresh pretraining job. This is the lie that hides an undeclared fine-tune inside a declared training window — a harder property than the first two, included here as a stretch case. Fine-tuning in this setup freezes the embedding layer and uses a learning rate two orders of magnitude below pretraining, and both choices leave visible signatures in the loss and gradient traces.
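Both signatures are easy to test for in a structured trace. A heuristic sketch that borrows the laptop log schema's field names (grad_norms_per_group, learning_rate) for illustration; the thresholds are ours, not calibrated:

```python
def looks_like_finetune(records, pretrain_lr=3e-4):
    """Heuristic for the two fine-tuning signatures: a frozen embedding
    layer (its gradient norm pinned at zero across the run) and a
    learning rate far below the pretraining value. Field names follow
    the laptop log schema; thresholds are illustrative guesses."""
    embed_frozen = all(r["grad_norms_per_group"]["embed"] == 0.0 for r in records)
    lr_tiny = all(r["learning_rate"] < pretrain_lr / 10 for r in records)
    return embed_frozen and lr_tiny
```

A fresh pretraining run fails both tests from step one: its embedding gradients are nonzero and its learning rate sits at the full pretraining value.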
Ground-truth labels live in workload_labels.jsonl alongside the prover's claimed labels. The whole capture is a label-flip benchmark: any verifier can be scored against it.

By the numbers: 34,072 rows · 1.33 GB on disk · 2 hosts × 8 phases · released under CC-BY-4.0 at huggingface.co/datasets/jasminexli/verifier-challenge-traces.
For each phase, on each host, we captured six streams. nvsmi.csv is nvidia-smi polled once a second, recording the basic dashboard counters every cloud provider exposes: power, utilization, memory used, clock speed, temperature. dcgm.csv is NVIDIA's data-center GPU manager sampled at ten samples per second, a deeper layer that most cloud providers do not enable by default and which we installed the daemon for ourselves. A couple of DCGM's most useful counters — NVLink traffic and tensor-core activity — require a privileged container, which RunPod does not grant on community pods, so we ran without those and noted the gap in the dataset card.
The four other streams are NCCL logs per rank (every collective communication call between the two GPUs, gzipped), /proc/net/dev polled at 1 Hz as a NIC-level cross-check, model checkpoints saved at the end of training phases, and a provenance.json file with host metadata, GPU UUIDs, driver and NCCL version, and a wall-clock skew estimate so the two hosts can be temporally aligned.
The combined resolution is the part that is unusual in public datasets. People who run real workloads tend not to share this much, and people who write about verification tend not to run real workloads. None of the streams is individually conclusive, but together they leave very little room for a prover to lie.
Two excerpts from the actual capture, both gpt-small on the same checkpoint:
phases/honest_pretrain_small/nvsmi.csv · steady-state training · both GPUs

| timestamp | gpu | util% | mem (MiB) | power (W) | temp (°C) |
|---|---|---|---|---|---|
| 2026/04/29 07:20:20.991 | 0 | 100 | 18787 | 578.66 | 54 |
| 2026/04/29 07:20:20.992 | 1 | 100 | 18787 | 568.88 | 55 |
| 2026/04/29 07:20:21.992 | 0 | 100 | 18787 | 575.94 | 54 |
| 2026/04/29 07:20:21.993 | 1 | 100 | 18787 | 567.14 | 55 |
| 2026/04/29 07:20:22.993 | 0 | 100 | 18787 | 576.77 | 53 |
| 2026/04/29 07:20:22.994 | 1 | 100 | 18787 | 566.75 | 55 |

phases/honest_inference_small/nvsmi.csv · steady-state inference · both GPUs

| timestamp | gpu | util% | mem (MiB) | power (W) | temp (°C) |
|---|---|---|---|---|---|
| 2026/04/29 07:48:52.395 | 0 | 55 | 2079 | 174.42 | 34 |
| 2026/04/29 07:48:52.395 | 1 | 54 | 2079 | 171.82 | 35 |
| 2026/04/29 07:48:53.396 | 0 | 55 | 2079 | 174.87 | 34 |
| 2026/04/29 07:48:53.396 | 1 | 53 | 2079 | 171.75 | 36 |
| 2026/04/29 07:48:54.397 | 0 | 55 | 2079 | 174.37 | 34 |
| 2026/04/29 07:48:54.397 | 1 | 54 | 2079 | 173.09 | 36 |
Both excerpts are running the same gpt-small model on the same H100 pair, so the difference between the two blocks is entirely the workload signature, not the hardware. Training pins both GPUs at 100% utilization and around 575 W of power draw, while inference holds steady at about 55% utilization and 174 W. The memory footprint differs by roughly an order of magnitude (18.8 GB versus 2.1 GB), and the temperature gap directly reflects the power gap. None of these differences are surprising once they are visible — the point of the testbed is that they are recoverable from telemetry the prover cannot fully control without changing the workload itself.
Across the six streams, the signatures we relied on were the following:
- Steady-state power and utilization plateaus differ by roughly 3× between training and inference on the same model.
- HBM footprint scales with model size, so a 124M-parameter model cannot hide in the footprint of a 10M-parameter one.
- NCCL collectives fire on every training step, while inference does not issue them at all.
- Idle phases sit at a power floor around 75 W per GPU, well below any active workload.
- For cross-checking, PCIe TX/RX from DCGM and NIC counters from /proc/net/dev agree with the NCCL log's claims about traffic, so a tampered NCCL log is detectable from the orthogonal streams.
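A verifier does not need a frontier model to exploit the plateaus in the first bullet. A naive baseline classifier, with thresholds read directly off this specific capture (and valid only for it), looks like:

```python
def classify_from_nvsmi(util_pct, power_w):
    """Naive baseline over the steady-state signatures above. Thresholds
    are read off this particular two-host H100 capture (idle ~75 W,
    inference ~55% / ~174 W, training ~100% / ~575 W) and will not
    transfer to other hardware, models, or batch sizes."""
    if power_w < 120:
        return "idle"  # the ~75 W floor sits well below any active phase
    if util_pct > 90 and power_w > 400:
        return "training"
    return "inference"
```

This baseline is a floor, not the method: the agent's value is in catching lies that survive the obvious thresholds, such as a workload throttled to mimic another phase's power profile.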
Results
We started by running the agent on the laptop traces. Across the seven canonical test cases — one clean training trace, one clean inference trace, and five adversarial mutations designed to fool a naive classifier — the agent returned the correct verdict on all seven, at roughly $0.15 per call.

Most-cited signals: FLOP-per-token ratio (training is ~3× higher than inference) · presence/absence of a _real_training block · memory profile shape (training pre-allocates and stays flat; inference grows with the KV cache) · weight drift across the run.
The red-team loop we ran alongside the verifier is more interesting than the verdicts themselves. Claude plays both sides: in each round, a red-team agent looks at the schema, the existing evasions, and the last ten rounds, and proposes a new attack — a function that mutates a clean log into a lying one that should still pass the verifier. The mutate function runs in a sandbox, the verifier classifies the result, and the round is appended to a public JSONL.
We ran this loop for sixty rounds. Fifteen attacks succeeded, and each one was closed by a protocol patch the next round. The patch list turns out to be the more interesting artifact, since every dimension of the schema needed an explicit uniqueness or stability constraint before the agent could be relied on (truncation, model-hash drift, op-type relabelling, hash-and-config rotation, token under-reporting, cross-job fingerprint pollution), and by the end the protocol had grown from four implicit rules to nineteen.

Patch trajectory: rules in the protocol over the 60-round red-team loop, 4 → 19.
We're currently running the inspector agent end-to-end on the H100 capture and will update results soon. The dataset ships with ground-truth labels — workload_labels.jsonl carries both the truth and the prover's claim for every phase, and our verifier can be easily scored against it.
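Scoring against the labels can be as simple as the following sketch. The field names (phase, truth, claimed) are our guesses at the JSONL schema, so check the dataset card before relying on them:

```python
import json

def score_verifier(label_lines, verdicts):
    """Score per-phase verdicts against workload_labels.jsonl. Assumes
    each line carries a phase name, the ground truth, and the prover's
    claim; the actual field names are documented in the dataset card."""
    correct = total = lies = caught = 0
    for line in label_lines:
        row = json.loads(line)
        total += 1
        hit = verdicts.get(row["phase"]) == row["truth"]
        correct += hit
        if row["claimed"] != row["truth"]:  # an adversarial phase
            lies += 1
            caught += hit
    return {"accuracy": correct / total,
            "lie_detection": caught / lies if lies else None}
```

Reporting lie-detection separately from overall accuracy matters: a verifier that rubber-stamps every claim scores 5/8 on this benchmark's phases while catching zero lies.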
Current limitations
- Multi-node fabric. Single-host on the laptop, two hosts on the testbed. Real frontier training is hundreds of nodes with distinct topology. The hard part of verification is the fabric, not the chip. Note: cross-node verification needs per-rank NCCL plus per-link bandwidth; NVLink is intra-node only, and the inter-node fabric (InfiniBand, RoCE) is the harder verification target.
- DCGM counters that need privileged access. We ran on RunPod community pods, which don't grant the privileged container (or CAP_SYS_ADMIN) that DCGM needs to read NVLink TX/RX, TENSOR_ACTIVE, and DRAM_ACTIVE. Those three counters carry signals — gradient-sync bandwidth, mixed-precision usage, memory throughput — that several of the harder verification properties depend on. We worked around the gap and noted it in the dataset card, but the H100 results above use only counters readable from an unprivileged container.
- Hardware attestation. Nothing the prover writes into the log is cryptographically tied to the silicon that produced it. Without TEEs (Trusted Execution Environments — chips that can sign attestations of what they ran) or signed hardware attestations, a sufficiently sophisticated adversary can forge any log we read. Note: TEE-signed attestations exist on Intel SGX, AMD SEV-SNP, and (since 2024) H100 Confidential Computing; none were used in this prototype.
- Sovereign trust. We did initial tests using Opus-4.7 as the inspector agent, but a truly deployable inspector will need to run on an open-source narrow model that all parties can audit. Our next step here is to fine-tune our own open-weights model as an automated inspector agent.
- Compound attacks. Each red-team round mutates one dimension; we did not stress two attack classes at once. Note: the realistic threat model is op_type relabel + token under-reporting + fingerprint rotation simultaneously; none of the 60 red-team rounds tested combinations.
- n=2 hosts on the testbed. Real cross-host variance is unknown.
- Train-vs-inference is suspiciously easy. The five-stream H100 telemetry separates training from inference almost trivially. The harder properties (FLOP cap, undeclared fine-tune, identity) are the real test of the framework.
Our next steps
First, we want to focus on harder compute-usage properties to verify.

Properties × signals — what we'd want each verifier to read off first:

| property | load-bearing signal |
|---|---|
| FLOP cap | power × time vs Kaplan floor |
| model size | tensor-core util, mem footprint |
| undeclared finetune | NCCL bursts during "inference" |
| served identity | weight-checkpoint hash at load |
- [medium] FLOP-budget compliance. Did this run consume more compute than the declared cap? FLOPs (floating-point operations, the standard unit of compute spent) are the primitive any training-compute cap is denominated in. Note: there are two independent floors on declared compute — the Kaplan estimate (6 × params × tokens) and integrated power × time × FLOPs/W from hardware counters; honest transformer runs agree within ~10%.
- [medium] Model identity compliance. Is the trained model the one you said you were going to train? This is stronger than the served-model identity check below, because the trained checkpoint has to match a declared training run — its data, procedure, and resulting weights — not just a declared architecture.
- [hard] Undeclared fine-tuning or post-training. Labs could load existing checkpoints and run undeclared fine-tuning or post-training. The hard part is distinguishing this from benign inference, because RL sampling — which is post-training — has the same telemetry shape as serving an existing model. The canonical defence is "we were not training, we were sampling from the existing model", and it is hard to refute from datacenter-access telemetry alone. This is probably the most realistic abuse pattern.
- [easy] Is the served model the declared model? Inference-side identity is the easy version: you can fingerprint the weights at load time, then verify subsequent inference uses those weights.
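The two floors from the FLOP-budget item above can be written down directly. flops_per_joule is left as an explicit calibration input, since any fixed number would be hardware- and precision-specific:

```python
def kaplan_floor(params, tokens):
    """Analytic estimate of training compute: C ≈ 6 * N * D."""
    return 6 * params * tokens

def energy_ceiling(avg_power_w, seconds, flops_per_joule):
    """Most compute the measured energy could physically have delivered.
    flops_per_joule must be calibrated per hardware and precision;
    treat any fixed number here as an assumption."""
    return avg_power_w * seconds * flops_per_joule

def declaration_consistent(params, tokens, avg_power_w, seconds, flops_per_joule):
    # A declared (params, tokens) pair whose Kaplan floor exceeds what
    # the measured energy could deliver is lying about one of the two.
    return kaplan_floor(params, tokens) <= energy_ceiling(
        avg_power_w, seconds, flops_per_joule)
```

The value of having two floors is that they fail differently: Kaplan is computed from the prover's claims, the energy ceiling from counters the prover cannot cheaply fake, and an honest run keeps them consistent.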
Other things on the list, grouped by area:
Improving the verifier:
- [hard] A mechanistic write-up of which signals the agent is actually using to make its determination, head-by-head. Right now we know the cited signals, but we don't yet have a reliable map from signal to weight.
- [hard] Realistic-subset verifier. We want to run on only the realistic subset of logs an inspector would see — input/output to the rack, with proprietary internal telemetry redacted. This is the version that matters for actual deployment, and the version we have not yet built.
- [research] An open-source verifier model fine-tuned on the corpus and published on Hugging Face. Right now the verifier is a Sonnet wrapper, but we want to train an open-source model that would actually be trusted by a sovereign auditor for deployment on-prem.
Improving the data:
- [medium] Large-scale data collection of real production-shaped traces: we want to publish a clean baseline of production traces gathered from many inference and training runs, or convince labs to share their logs with us.
- [medium] Full multi-node implementation with privileged GPU access. We are currently bottlenecked on funding for compute and on access to privileged GPUs. On the latter: community-tier cloud GPU offerings (e.g. RunPod's community pods) do not expose privileged DCGM counters, including NVLink TX/RX, tensor-core activity, and DRAM activity. Several of the harder verification properties (FLOP-budget compliance via integrated power × FLOPs/W, gradient-sync verification, mixed-precision attestation) depend on these counters being readable. We will look into RunPod Secure Cloud, Lambda, Crusoe, an on-prem cluster, or a sponsored slice of a frontier provider's hardware. Note: AWS does not expose DCGM; GCP partially does. RunPod Secure Cloud, Lambda, and Crusoe grant privileged container access on request; community-tier RunPod, which we used, does not. Verification testbeds need raw-access providers.
Building the challenge:
- [easy] Identifying a list of compute-usage properties we cannot yet detect but want to verify. This has not been comprehensively done before, and would be useful both as a theoretical contribution and as a guiding lodestar for verifier implementation progress.
- [easy] A Fermi estimate of the hardware requirements to scale this approach to a production-scale datacenter.
- [medium] A stronger adversary model (red team). We'd like to develop stronger models for what real attempted evasions in compute-usage logs would look like from adversaries. Note: the standard log-based intrusion-detection evasion taxonomy applies (padding, substitution, fragmentation, encoding); the 60-round loop ported the first two.
- [medium] Challenge prompts, leaderboard, evaluation harness — the mechanical surface that turns "an idea for a competition" into "a competition you can actually enter."
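The Fermi-estimate item above can be seeded with our own capture rate. A deliberately naive linear scaling, using only the dataset numbers quoted earlier (the host count and duration for the frontier-scale run are hypothetical placeholders):

```python
# Naive linear scaling of our own capture rate; ignores compression,
# per-cluster (vs per-host) streams, and smarter sampling choices.
capture_gb = 1.33              # our H100 dataset size, 2 hosts
schedule_hours = 145 / 60      # 145-minute schedule per host
gb_per_host_hour = capture_gb / (2 * schedule_hours)   # ~0.28 GB/host-hour

hosts, days = 10_000, 30       # hypothetical frontier-scale run
total_tb = gb_per_host_hour * hosts * days * 24 / 1000
print(f"~{total_tb:,.0f} TB of telemetry")  # on the order of 2 PB
```

Even this crude number is informative: petabytes per month is well within what a datacenter already moves, but far beyond what an agent can read in one context window, which is why the challenge needs a scoring harness and not just a prompt.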
Protocols:
- [research] A first-cut networking protocol for between-chip communication that an inspector could reasonably observe without reading inside any one chip.
Support us
AI is the most consequential technology of our time, and we still cannot verify the most basic facts about how it's built.
We want to build the adversarially robust, privacy-preserving verifier that different parties will trust.
AI verification has the same shape as DARPA's Grand Challenge, ImageNet, and the Vesuvius Challenge. The problem is real and the solution path is unclear, but not technically impossible. We believe that setting a public challenge will move the progress of the field materially.
Simultaneously, we want to forge forward with novel research and building ourselves, through standards-setting and our automated inspection infrastructure.
Our phased plan:
- Phase 1: Foundations (May–June; $50k for compute, travel).
- Collaborations. Speak with RAND, MIRI, the State Department, DARPA, and Chinese researchers (e.g. at ISE 2026 in Singapore) on useful verification properties. Approach cloud providers, NVIDIA, and the Frontier Model Forum about sponsoring compute for participants and our logging standard.
- Build a small datacenter-scale MVP. Rent many nodes, run mixed training and inference workloads, build the infrastructure to capture full logs.
- Phase 2: Build (June–September; $70k for compute, talent, salaries). Determine the competition structure: lock in testbed design for three-to-five challenge types (e.g. training-run attestation, training-vs-inference classification, undeclared-workload detection), specify the scoring function and adversarial budget, and send to frontier AI security researchers to review.
- Phase 3: Launch (October; $50k–$500k for prize, marketing, judging, review). Public Verifier Challenge launch.
We are currently seeking funding, mentorship, and sponsorship from compute providers and labs.