Evaluating Large Language Models for Automated Computational Reproducibility Checks

Can frontier LLM coding agents reproduce the published numerical results of psychological-science papers — from the data alone, or with the authors' code?

Preregistered · AsPredicted | Psychological Science 2025 | STAR-confirmed papers | 16 papers · 35 tasks · 140 runs

Verifying computational reproducibility (ensuring that reported results can be regenerated from the original data and code) is a critical but labor-intensive part of the scientific process. This benchmark systematically evaluates how far state-of-the-art LLM coding agents can automate that check. We take articles published in Psychological Science in 2025 that passed the journal's human-led STAR reproducibility check and compare agent performance under two conditions: manuscript + data only (Code-Free) and manuscript + data + the authors' code (Code-Assisted). Each agent's outputs are scored against the published results, and every discrepancy is typed into predefined error categories.

Research question. To what extent can LLM agents correctly reproduce the numerical results of published papers, and how is this influenced by the availability of the original analysis code?

Authors — to be added · affiliations · correspondence TODO

Reproduction accuracy is final & deterministic. Failure-mode diagnosis is under independent human adjudication.
16 · 2025 papers (frozen)
35 · task packets
140 · agent runs
2 agents · 2 conditions
2,537 gold values · 762 targets
01

Headline findings

0.560 · targets reproduced overall (value-level 0.712)
+30pp · Code-Assisted over Code-Free (the largest single factor)
0.733 · best cell: GPT-5.5, Code-Assisted (just ahead of Claude, 0.699)
5 · failure-mode types (4 preregistered + ESE, new)
02

Leaderboard

Primary indicator: target-level reproduction rate — the fraction of targets whose every required value matches. Bars show the target rate; the right column shows value-level rate and the Wilson 95% CI. Scored under the precision-aligned rule (a value counts if it is within tolerance or rounds to the figure the paper reports).

Agent · Condition                 Target rate   Value rate   Wilson 95% CI
gpt-5.5 · code-assisted           0.733         0.855        .699–.764
claude-opus-4.6 · code-assisted   0.699         0.837        .665–.732
claude-opus-4.6 · code-free       0.413         0.597        .377–.449
gpt-5.5 · code-free               0.395         0.560        .359–.431
all cells · pooled                0.560         0.712        .542–.578
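
The Wilson interval shown next to each rate can be recomputed from a success count and a denominator. The sketch below is illustrative only: `wilson_interval` is a hypothetical helper name, and the counts (522 of 712) are assumed values chosen to give a rate close to 0.733, since per-cell denominators are not stated in this section.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical counts (assumption): 522 of 712 targets reproduced.
low, high = wilson_interval(522, 712)
print(f"rate={522 / 712:.3f}  CI={low:.3f} to {high:.3f}")
```

Unlike the simpler normal-approximation interval, the Wilson interval is pulled slightly toward 0.5 and stays inside [0, 1], which matters for cells with rates near the extremes.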

Full metrics · per agent × condition, medians per run

Agent             Condition       Task acc   Value acc   Tool calls   Time       Output tok   Cost / run   Failures
gpt-5.5           code-assisted   0.733      0.855       54           12.2 min   29.1k        n/a          0%
claude-opus-4.6   code-assisted   0.699      0.837       30           7.2 min    27.1k        $1.92        0%
claude-opus-4.6   code-free       0.413      0.597       42           17.7 min   45.8k        $2.97        0%
gpt-5.5           code-free       0.395      0.560       56           11.8 min   30.2k        n/a          0%

Time = median wall-clock per run; tool calls and output tokens are medians. Output tokens are comparable across agents; total tokens and USD cost are not — the Claude wrapper bills cached context as near-zero input (≈ $1.9–3.0 / run; ~$425 for all 70 Claude runs), while the Codex CLI reports neither a USD cost nor a comparable token total. All 140 runs completed (0 operational failures).

Scoring sensitivity. Under the strict canonical rule (exact tolerance band, no precision rounding) overall target rate is 0.506 and GPT-5.5 Code-Assisted falls to 0.574 — below Claude (0.699), which is essentially unchanged. The relative tolerance is, for ~55% of values, tighter than the precision the paper itself reports, which disproportionately penalizes the more precise agent; we therefore feature the precision-aligned rule and report the strict rule as a documented sensitivity. The Code-Assisted ordering is rule-sensitive.
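
To make the difference between the two rules concrete, here is a minimal sketch under stated assumptions: the 1% relative tolerance, the half-up rounding mode, and all function names are hypothetical and are not taken from the benchmark's scoring code. It shows a value that rounds to the printed figure yet falls outside a relative tolerance band that is narrower than the paper's reported precision.

```python
from decimal import Decimal, ROUND_HALF_UP


def within_tolerance(reported: str, reproduced: float, rel_tol: float) -> bool:
    """Strict rule: the reproduced value lies inside a relative tolerance band."""
    ref = float(reported)
    return abs(reproduced - ref) <= rel_tol * abs(ref)


def rounds_to_reported(reported: str, reproduced: float) -> bool:
    """True if the reproduced value rounds to the figure as printed in the paper."""
    ref = Decimal(reported)
    # Smallest unit of the printed figure, e.g. "0.43" -> 0.01.
    quantum = Decimal(1).scaleb(ref.as_tuple().exponent)
    rounded = Decimal(repr(reproduced)).quantize(quantum, rounding=ROUND_HALF_UP)
    return rounded == ref


def precision_aligned_match(reported: str, reproduced: float, rel_tol: float) -> bool:
    """Precision-aligned rule: within tolerance OR rounds to the reported figure."""
    return within_tolerance(reported, reproduced, rel_tol) or rounds_to_reported(reported, reproduced)


# Paper prints 0.43; the agent computes 0.4349. REL_TOL is an assumed value.
REL_TOL = 0.01
strict_ok = within_tolerance("0.43", 0.4349, REL_TOL)
aligned_ok = precision_aligned_match("0.43", 0.4349, REL_TOL)
print(strict_ok, aligned_ok)
```

In this example the tolerance band (±0.0043) is narrower than the half-unit of the last printed digit (±0.005), so the strict rule rejects a value that the paper's own rounding cannot distinguish from the reported one.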

By evidentiary tier · target rate

Confirmatory (1,896)   0.562
Secondary (440)        0.636
Exploratory (512)      0.486
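
As a quick consistency check, the tier rates can be combined into a pooled rate. This sketch assumes the parenthesized counts are the number of scored target instances in each tier (an interpretation, not stated on the page) and weights each tier rate accordingly.

```python
# Tier -> (count, target rate), copied from the figures above.
# Treating the counts as scored target instances is an assumption.
tiers = {
    "Confirmatory": (1896, 0.562),
    "Secondary": (440, 0.636),
    "Exploratory": (512, 0.486),
}

total = sum(count for count, _ in tiers.values())
pooled = sum(count * rate for count, rate in tiers.values()) / total
print(total, round(pooled, 3))
```

Under that assumption the weighted average agrees with the pooled target rate reported on the leaderboard to three decimals.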

Robustness subsets · target rate

Strict-Docker (containerized)     0.667
Single-pass-clean (90 runs)       0.578
Code-assisted-clean (64 runs)     0.717
03

By paper

Target-level reproduction rate per frozen paper bundle, pooled across both agents and both conditions (precision-aligned rule). Reproducibility varies widely with the paper's analytical complexity.

Aday 2025             0.868
Yousif 2025           0.787
Otten 2025            0.771
Cho 2025              0.750
Navon 2025            0.727
Ditlmann 2025         0.720
Zhang 2025            0.712
Woolley 2025          0.693
Charlesworth 2025     0.558
Geiger 2025           0.542
Lopukhina 2025        0.480
Alister 2025          0.412
Willroth 2025         0.362
VanDeCalseyde 2025    0.319
Garagnani 2025        0.303
Liang 2025            0.302
04

Three-phase workflow

An artifact-only handoff pipeline. Phase 2 (Reproduce) runs in a sandbox that never contains the gold values; Phase 3 (Evaluate) reads only the files Phase 2 wrote. The units of analysis are the individual target value and the target (a group of values reported together).

materials → packets → scoring

P1 Prepare: Freeze 16 papers (SHA-256), extract gold targets, build 35 blinded task packets. (strips gold · audit · answers)
P2 Reproduce: 2 agents × 2 conditions run one packet per run in a sandbox → results + provenance. (no gold · single-pass rule)
P3 Evaluate: Score vs gold, audit leakage, draft discrepancy diagnosis, route to human adjudication. (reads only Phase-2 artifacts)

Handoffs (one-way): P1 → P2 task packet (blinded); P2 → P3 results · provenance · figures. P2 is the sandboxed phase. Re-running P1 on the same materials reproduces the same hashes.
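
One way to obtain a freeze that reproduces the same hashes on every run is to hash each file and then hash a sorted manifest of those digests. The sketch below assumes that design; the function names, the manifest layout, and the toy files are all hypothetical and do not describe the benchmark's actual Prepare step.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path) -> str:
    """SHA-256 of one file, read in chunks so large data files are fine."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def bundle_manifest(root: Path) -> dict[str, str]:
    """Map each relative POSIX path to its digest, in sorted path order."""
    files = sorted(p for p in root.rglob("*") if p.is_file())
    return {p.relative_to(root).as_posix(): file_sha256(p) for p in files}


def bundle_digest(root: Path) -> str:
    """Single digest for the whole bundle, independent of filesystem listing order."""
    lines = [f"{sha}  {name}\n" for name, sha in bundle_manifest(root).items()]
    return hashlib.sha256("".join(lines).encode("utf-8")).hexdigest()


# Toy bundle (hypothetical file names) to show determinism and change detection.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "data").mkdir()
    (root / "data" / "study1.csv").write_text("id,score\n1,0.5\n", encoding="utf-8")
    (root / "manuscript.txt").write_text("placeholder\n", encoding="utf-8")
    first = bundle_digest(root)
    second = bundle_digest(root)
    (root / "data" / "study1.csv").write_text("id,score\n1,0.6\n", encoding="utf-8")
    changed = bundle_digest(root)

print(first == second, first != changed)
```

Sorting paths and using relative POSIX names keeps the digest stable across machines; timestamps and permissions are deliberately excluded in this sketch.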
05

Task setting

Code-Free

The author code/ is removed. The agent must derive the entire analysis from the manuscript, supplement, data and codebook, writing original analysis code.

Code-Assisted

The author code/ is mounted. Only non-substantive edits (paths, output capture, environment) are allowed before the first complete numeric output; an AST diff audit flags any substantive change.
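
The idea behind an AST diff audit can be illustrated with Python's standard `ast` module: parse both versions, mask literals that look like file paths, and compare the resulting trees. This is a rough sketch under assumptions, not the benchmark's audit. The path heuristic and all names are hypothetical, and Python's parser stands in for whatever parser matches the authors' language.

```python
import ast

# Assumed heuristic: strings with a path separator or a data-file suffix are "paths".
DATA_SUFFIXES = (".csv", ".tsv", ".txt", ".json", ".sav", ".rds")


def _looks_like_path(text: str) -> bool:
    return "/" in text or "\\" in text or text.lower().endswith(DATA_SUFFIXES)


class _MaskPaths(ast.NodeTransformer):
    """Replace path-like string literals with a fixed placeholder."""

    def visit_Constant(self, node: ast.Constant) -> ast.AST:
        if isinstance(node.value, str) and _looks_like_path(node.value):
            node.value = "<path>"
        return node


def normalized_dump(source: str) -> str:
    """Structural dump that ignores comments, layout, and path literals."""
    tree = _MaskPaths().visit(ast.parse(source))
    return ast.dump(tree)  # line/column attributes are excluded by default


def is_substantive_change(original: str, edited: str) -> bool:
    return normalized_dump(original) != normalized_dump(edited)


original = 'df = load("data/raw.csv")\nfit = model(df, "y ~ x", k=3)\n'
path_only = 'df = load("/sandbox/data/raw.csv")  # remapped\nfit = model(df, "y ~ x", k=3)\n'
new_model = 'df = load("data/raw.csv")\nfit = model(df, "y ~ x + z", k=3)\n'

print(is_substantive_change(original, path_only))  # path edit only
print(is_substantive_change(original, new_model))  # formula changed
```

A production audit would need a more careful allow-list (for example, output-capture calls and environment setup), since a blanket path mask can hide a swap to a different data file.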

Single-pass rule

Once a complete numeric output exists, the agent must not tune filters, contrasts, models or rounding to fit expected values. A tool-call trace audit enforces it and feeds a sensitivity subset.
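
A trace audit of this kind could, in its simplest form, look for edits to analysis files that occur after the first complete numeric output. The sketch below assumes a hypothetical trace schema (`ToolCall` with `kind`, `path`, and `complete_output` fields); the real trace format and the definition of a complete output are not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    index: int                     # position in the run's tool-call trace
    kind: str                      # hypothetical labels: "edit", "run", "read"
    path: str = ""                 # file touched by an edit, if any
    complete_output: bool = False  # True when a run emitted all required values


def single_pass_violations(
    trace: list[ToolCall],
    analysis_suffixes: tuple[str, ...] = (".py", ".R", ".r"),
) -> list[ToolCall]:
    """Edits to analysis files made after the first complete numeric output."""
    first_complete = next(
        (call.index for call in trace if call.kind == "run" and call.complete_output),
        None,
    )
    if first_complete is None:
        return []  # no complete output yet, so the rule has not started to apply
    return [
        call
        for call in trace
        if call.index > first_complete
        and call.kind == "edit"
        and call.path.endswith(analysis_suffixes)
    ]


trace = [
    ToolCall(0, "read", "manuscript.txt"),
    ToolCall(1, "edit", "analysis.py"),
    ToolCall(2, "run", complete_output=True),
    ToolCall(3, "edit", "analysis.py"),  # tuning after a complete output
    ToolCall(4, "edit", "results.json"),  # writing results is not an analysis edit
]
violations = single_pass_violations(trace)
print([call.index for call in violations])
```

Runs with an empty violation list would then form the single-pass-clean sensitivity subset.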

Run matrix

papers / task packets     16 / 35
agents × conditions       2 × 2
runs (all completed)      140
forbidden-file leakage    0 / 140

Agents evaluated: claude-opus-4.6 (highest thinking) and gpt-5.5 (xhigh reasoning), each driven through an identical wrapper and standardized prompt. Conditions are run in the fixed order Code-Free → Code-Assisted so no code-assisted state can leak into a code-free workspace.
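
The run matrix and the fixed condition order can be enumerated in a few lines. The packet identifiers below are hypothetical placeholders, and applying the Code-Free → Code-Assisted order within each agent and packet pair is an assumption about how the ordering is scoped.

```python
from itertools import product

AGENTS = ("claude-opus-4.6", "gpt-5.5")
CONDITIONS = ("code-free", "code-assisted")  # fixed order: code-free runs first
PACKETS = [f"packet-{i:02d}" for i in range(1, 36)]  # hypothetical IDs for 35 packets

# For each agent and packet, schedule code-free before code-assisted.
runs = [
    (agent, packet, condition)
    for agent, packet in product(AGENTS, PACKETS)
    for condition in CONDITIONS
]

print(len(runs))
print(runs[:2])
```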

06

Scoring

Primary rule

Reproduction accuracy is a dichotomous measure per target: a target reproduces iff every required value matches its registered comparison rule (round-to-reported-precision, tolerance, integer, exact-p, or one-sided inequality) and carries provenance linking it to the command that produced it. Figures get a tolerated match when their underlying data are not numerically available.
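
The all-or-nothing target rule explains why the target-level rate sits below the value-level rate. A minimal sketch, assuming a hypothetical `ValueResult` record and assuming the value-level rate counts comparison-rule matches only:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ValueResult:
    target_id: str
    matched: bool               # passed its registered comparison rule
    provenance: Optional[str]   # command that produced the value, if recorded


def target_reproduces(values: list[ValueResult]) -> bool:
    """A target reproduces only if every required value matches and has provenance."""
    return bool(values) and all(v.matched and bool(v.provenance) for v in values)


def rates(results: list[ValueResult]) -> tuple[float, float]:
    """Return (target-level rate, value-level rate)."""
    by_target: dict[str, list[ValueResult]] = {}
    for r in results:
        by_target.setdefault(r.target_id, []).append(r)
    target_rate = sum(target_reproduces(v) for v in by_target.values()) / len(by_target)
    value_rate = sum(r.matched for r in results) / len(results)
    return target_rate, value_rate


# Toy data (hypothetical): three targets, five values.
results = [
    ValueResult("T1", True, "Rscript analysis.R"),
    ValueResult("T1", True, "Rscript analysis.R"),
    ValueResult("T2", True, "python fit.py"),
    ValueResult("T2", False, "python fit.py"),  # one miss fails the whole target
    ValueResult("T3", True, None),               # match without provenance fails
]
target_rate, value_rate = rates(results)
print(round(target_rate, 3), round(value_rate, 3))
```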

Secondary & sensitivity

  • value-match, sign agreement, significance crossing, log relative error
  • subsets: strict-Docker, by compute backend, single-pass-clean, code-assisted-clean
  • strict vs precision-aligned tolerance (see leaderboard note)
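
The secondary indicators can be sketched as small functions. The exact definitions used by the benchmark are not given here, so the following are common conventions adopted as assumptions: sign agreement compares signs directly, significance crossing uses a hypothetical alpha of 0.05, and log relative error is taken as the negative base-10 log of the relative error (roughly the number of agreeing significant digits).

```python
import math


def _sign(x: float) -> int:
    return (x > 0) - (x < 0)


def sign_agreement(reported: float, reproduced: float) -> bool:
    """True when both estimates point in the same direction."""
    return _sign(reported) == _sign(reproduced)


def significance_crossing(p_reported: float, p_reproduced: float, alpha: float = 0.05) -> bool:
    """True when the two p-values fall on opposite sides of alpha."""
    return (p_reported < alpha) != (p_reproduced < alpha)


def log_relative_error(reported: float, reproduced: float) -> float:
    """-log10(|reproduced - reported| / |reported|); larger means closer agreement."""
    if reported == 0:
        raise ValueError("relative error is undefined for a reported value of 0")
    rel = abs(reproduced - reported) / abs(reported)
    return math.inf if rel == 0 else -math.log10(rel)


print(sign_agreement(0.31, -0.02))
print(significance_crossing(0.04, 0.06))
print(round(log_relative_error(2.0, 2.02), 3))
```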
07

Discrepancy diagnosis

When a value does not reproduce, we ask why. Each failed target is compared method against method (the authors' original analysis versus the agent's reproduction, cited line by line) and typed into one of the four preregistered categories below or the emergent ESE category. A model drafts the diagnosis; two independent human reviewers adjudicate and resolve disagreements, and Cohen's κ is reported.

DPE

Data-processing error

Errors in data cleaning, filtering, transformation or aggregation — the inputs to the model were wrong.

VDE

Variable-definition error

Errors in how a variable is constructed/operationalized from the raw data.

MSE

Model-specification error

Errors in selecting or specifying the statistical model (e.g. fixed- vs random-effects, contrasts, estimator).

CE

Computational error

Numerical precision or algorithmic-implementation errors in a correctly specified model.

ESE

Engine-substitution error (discovered)

An emergent category beyond the preregistered four: the original engine was unavailable and was reimplemented locally, so the model matches but the implementation (e.g. a macro replaced by a bootstrap) differs.

+

Process & materials

Output missing/unscorable, or required source material absent from the packet — kept separate from the causal types.

⏳ Error-type and severity distributions are withheld until the two-reviewer adjudication and inter-rater reliability (κ ≥ 0.70) are complete.
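
For reference, Cohen's κ for two reviewers can be computed from their paired labels as observed agreement corrected for chance agreement. The labels below are invented purely for illustration and say nothing about the pending adjudication; `cohens_kappa` is a hypothetical helper name.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented labels for six failed targets (illustration only).
reviewer_1 = ["DPE", "DPE", "MSE", "CE", "VDE", "MSE"]
reviewer_2 = ["DPE", "VDE", "MSE", "CE", "VDE", "MSE"]

kappa = cohens_kappa(reviewer_1, reviewer_2)
print(round(kappa, 3), kappa >= 0.70)
```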

08

About

Scope

16 computationally reproducible Psychological Science 2025 articles (STAR-confirmed), decomposed into 35 self-contained analysis tasks. Conclusions are descriptive of this benchmark; cross-domain generalization is out of scope.

Known limitations

  • Failure-mode results pending human κ adjudication.
  • One attempt per cell (retry on operational failure only); variance not characterized.
  • Original values are visible in the manuscript; provenance + single-pass + diff audits mitigate but cannot fully exclude copying.
  • Code-Assisted agent ordering is sensitive to the tolerance rule (see leaderboard).

Citation

@misc{llm_comprepro_2026,
  title  = {Evaluating Large Language Models for Automated
            Computational Reproducibility Checks},
  author = {TODO — author list},
  year   = {2026},
  note   = {Preprint \& data release forthcoming.
            Preregistration: AsPredicted}
}