Can frontier LLM coding agents reproduce the published numerical results of psychological-science papers — from the data alone, or with the authors' code?
Verifying computational reproducibility (confirming that reported results can be regenerated from the original data and code) is a critical but labor-intensive part of the scientific process. This benchmark systematically evaluates how far state-of-the-art LLM coding agents can automate that check. We take articles published in Psychological Science in 2025 that passed the journal's human-led STAR reproducibility check and compare agent performance under two conditions: manuscript + data only (Code-Free) and manuscript + data + the authors' code (Code-Assisted). Each agent's outputs are scored against the published results, and every discrepancy is classified into predefined error categories.
Research question. To what extent can LLM agents correctly reproduce the numerical results of published papers, and how is this influenced by the availability of the original analysis code?
Primary indicator: target-level reproduction rate — the fraction of targets whose every required value matches. Bars show the target rate; the right column shows value-level rate and the Wilson 95% CI. Scored under the precision-aligned rule (a value counts if it is within tolerance or rounds to the figure the paper reports).
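As a rough illustration of how the two quantities in this caption can be computed, the sketch below derives a target-level rate from per-value match flags and attaches a Wilson 95% interval. The function names, the input layout, and the example counts are assumptions made for illustration; they are not taken from the benchmark's scoring code.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def target_counts(targets: dict[str, list[bool]]) -> tuple[int, int]:
    """Return (reproduced targets, total targets).

    A target counts as reproduced only if every one of its required
    values matched; a target with no values is treated as not reproduced.
    """
    reproduced = sum(1 for flags in targets.values() if flags and all(flags))
    return reproduced, len(targets)


# Hypothetical per-value match flags for three targets.
example = {"t1": [True, True], "t2": [True, False], "t3": [True]}
k, n = target_counts(example)
low, high = wilson_interval(k, n)
print(f"target rate = {k / n:.3f}, Wilson 95% CI = [{low:.3f}, {high:.3f}]")
```

The Wilson interval is used here because it stays inside [0, 1] and behaves sensibly for small samples and for rates near 0 or 1, where the simple normal approximation does not.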
| Agent | Condition | Task acc | Value acc | Tool calls | Time | Output tok | Cost / run | Failures |
|---|---|---|---|---|---|---|---|---|
| gpt-5.5 | code-assisted | 0.733 | 0.855 | 54 | 12.2 min | 29.1k | n/a | 0% |
| claude-opus-4.6 | code-assisted | 0.699 | 0.837 | 30 | 7.2 min | 27.1k | $1.92 | 0% |
| claude-opus-4.6 | code-free | 0.413 | 0.597 | 42 | 17.7 min | 45.8k | $2.97 | 0% |
| gpt-5.5 | code-free | 0.395 | 0.560 | 56 | 11.8 min | 30.2k | n/a | 0% |
Time = median wall-clock per run; tool calls and output tokens are medians. Output tokens are comparable across agents; total tokens and USD cost are not — the Claude wrapper bills cached context as near-zero input (≈ $1.9–3.0 / run; ~$425 for all 70 Claude runs), while the Codex CLI reports neither a USD cost nor a comparable token total. All 140 runs completed (0 operational failures).
Target-level reproduction rate per frozen paper bundle, pooled across both agents and both conditions (precision-aligned rule). Reproducibility varies widely with the paper's analytical complexity.
An artifact-only handoff pipeline. Stage 2 (Reproduce) runs in a sandbox that never contains the gold values; Stage 3 (Evaluate) reads only the files Stage 2 wrote. The units of analysis are the target value and the target (a group of values reported together).
In the Code-Free condition, the authors' code/ directory is removed. The agent must derive the entire analysis from the manuscript, supplement, data and codebook, writing original analysis code.
In the Code-Assisted condition, the authors' code/ directory is mounted. Only non-substantive edits (paths, output capture, environment) are allowed before the first complete numeric output; an AST diff audit flags any substantive change.
Once a complete numeric output exists, the agent must not tune filters, contrasts, models or rounding to fit expected values. A tool-call trace audit enforces this rule and feeds a sensitivity subset.
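To make the idea of an AST diff audit concrete, here is a minimal sketch that compares two versions of a script after masking path-like string literals, so that a path fix passes while a changed filter threshold is flagged. It assumes the script is written in Python purely for illustration (the authors' code may be in another language), and the `PATH_HINTS` heuristic and function names are hypothetical; this is not the benchmark's actual audit.

```python
import ast

# Assumed heuristic: a string literal is "path-like" if it contains a
# separator or a common data-file extension.
PATH_HINTS = ("/", "\\", ".csv", ".tsv", ".sav", ".rds")


class _MaskPaths(ast.NodeTransformer):
    """Replace path-like string constants with a fixed placeholder."""

    def visit_Constant(self, node: ast.Constant) -> ast.Constant:
        if isinstance(node.value, str) and any(h in node.value for h in PATH_HINTS):
            node.value = "<PATH>"
        return node


def normalized_dump(source: str) -> str:
    """Dump the syntax tree without line/column info; comments are already
    dropped by the parser, so whitespace and comment edits are invisible."""
    tree = _MaskPaths().visit(ast.parse(source))
    return ast.dump(tree, include_attributes=False)


def has_substantive_change(original: str, edited: str) -> bool:
    """True if the two sources differ in anything other than masked paths,
    comments, or formatting."""
    return normalized_dump(original) != normalized_dump(edited)


original = 'df = load("/home/author/data/study1.csv")\ndf = df[df["rt"] > 200]\n'
path_fix = '# sandbox path\ndf = load("data/study1.csv")\ndf = df[df["rt"] > 200]\n'
new_filter = 'df = load("data/study1.csv")\ndf = df[df["rt"] > 250]\n'

print(has_substantive_change(original, path_fix))    # path-only edit
print(has_substantive_change(original, new_filter))  # threshold changed
```

A structural comparison like this is deliberately conservative: anything it cannot recognise as a path, comment, or formatting change is reported, leaving the final judgment to a reviewer.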
Agents evaluated: claude-opus-4.6 (highest thinking) and gpt-5.5 (xhigh reasoning), each driven through an identical wrapper and standardized prompt. Conditions are run in the fixed order Code-Free → Code-Assisted so no code-assisted state can leak into a code-free workspace.
Reproduction accuracy is a dichotomous measure per target: a target reproduces iff every required value matches its registered comparison rule (round-to-reported-precision, tolerance, integer, exact-p, or one-sided inequality) and carries provenance linking it to the command that produced it. Figures get a tolerated match when their underlying data are not numerically available.
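The sketch below shows one way such per-value comparison rules and the all-or-nothing target rule could be expressed. The rule names (`round`, `tolerance`, `integer`, `less_than`), the record fields, and the exact semantics of each rule are assumptions made for illustration; the exact-p rule is omitted because its definition is not spelled out here.

```python
from decimal import Decimal, ROUND_HALF_UP


def matches_rounded(reported: str, computed: float) -> bool:
    """True if `computed`, rounded to the number of decimals in the
    reported string, equals the reported value (half-up rounding assumed)."""
    places = max(0, -Decimal(reported).as_tuple().exponent)
    quantum = Decimal(1).scaleb(-places)  # e.g. 2 places -> Decimal("0.01")
    rounded = Decimal(repr(computed)).quantize(quantum, rounding=ROUND_HALF_UP)
    return rounded == Decimal(reported)


def value_matches(rule: str, reported: str, computed: float, tol: float = 0.0) -> bool:
    """Apply one (hypothetical) registered comparison rule to a single value."""
    if rule == "round":
        return matches_rounded(reported, computed)
    if rule == "tolerance":
        return abs(computed - float(reported)) <= tol
    if rule == "integer":
        return float(computed).is_integer() and int(computed) == int(reported)
    if rule == "less_than":  # one-sided inequality, e.g. "p < .001"
        return computed < float(reported)
    raise ValueError(f"unknown rule: {rule}")


def target_reproduces(values: list[dict]) -> bool:
    """A target reproduces only if every value matches its rule and
    carries a non-empty provenance entry."""
    return bool(values) and all(
        bool(v.get("provenance"))
        and value_matches(v["rule"], v["reported"], v["computed"], v.get("tol", 0.0))
        for v in values
    )


target = [
    {"rule": "round", "reported": "0.45", "computed": 0.4549, "provenance": "run.log:12"},
    {"rule": "less_than", "reported": "0.001", "computed": 0.0004, "provenance": "run.log:13"},
]
print(target_reproduces(target))
```

Keeping the reported value as a string preserves its stated precision, so "0.450" and "0.45" can be held to different rounding standards.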
When a value does not reproduce, we ask why. Each failed target is compared method-against-method — the author's original analysis versus the agent's reproduction, cited line-by-line — and typed into the four preregistered categories below. A model drafts the diagnosis; two independent human reviewers adjudicate and resolve disagreements, with Cohen's κ reported.
Errors in data cleaning, filtering, transformation or aggregation — the inputs to the model were wrong.
Errors in how a variable is constructed/operationalized from the raw data.
Errors in selecting or specifying the statistical model (e.g. fixed- vs random-effects, contrasts, estimator).
Numerical precision or algorithmic-implementation errors in a correctly specified model.
An emergent category beyond the preregistered four: the original engine was unavailable and reimplemented locally — the model matches but the implementation (e.g. a macro replaced by a bootstrap) differs.
Output missing/unscorable, or required source material absent from the packet — kept separate from the causal types.
⏳ Error-type and severity distributions are withheld until the two-reviewer adjudication and inter-rater reliability (κ ≥ 0.70) are complete.
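For readers unfamiliar with the reliability statistic, the following minimal sketch computes Cohen's κ for two reviewers who each assign one category per failed target. The category labels in the example are hypothetical placeholders, and the function is an illustration rather than the project's adjudication code.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa: agreement between two raters corrected for chance.

    Returns NaN when chance agreement is 1 (both raters used a single,
    identical label), where kappa is undefined.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of the two raters' marginal proportions.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    if expected == 1.0:
        return float("nan")
    return (observed - expected) / (1 - expected)


# Hypothetical labels for six failed targets.
reviewer_1 = ["data", "data", "model", "model", "variable", "numeric"]
reviewer_2 = ["data", "data", "model", "variable", "variable", "numeric"]
kappa = cohens_kappa(reviewer_1, reviewer_2)
print(f"kappa = {kappa:.3f}, meets 0.70 threshold: {kappa >= 0.70}")
```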
16 computationally reproducible Psychological Science 2025 articles (STAR-confirmed), decomposed into 35 self-contained analysis tasks. Conclusions are descriptive of this benchmark; cross-domain generalization is out of scope.
@misc{llm_comprepro_2026,
title = {Evaluating Large Language Models for Automated
Computational Reproducibility Checks},
author = {TODO — author list},
year = {2026},
note = {Preprint \& data release forthcoming.
Preregistration: AsPredicted}
}