Evaluating LLMs for Automated Computational Reproducibility Checks

01

Headline findings

0.560

targets reproduced overall
value-level 0.712

+30pp

Code-Assisted over Code-Free
the largest single factor

0.733

best cell — GPT-5.5, Code-Assisted
just ahead of Claude (0.699)

5

failure-mode types
4 preregistered + ESE (new)

02

Leaderboard

Primary indicator: target-level reproduction rate — the fraction of targets whose every required value matches. Bars show the target rate; the right column shows value-level rate and the Wilson 95% CI. Scored under the precision-aligned rule (a value counts if it is within tolerance or rounds to the figure the paper reports).
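The Wilson interval named above can be computed in a few lines. This is a minimal sketch, not the benchmark's own scoring code; the function name and the counts in the example (522 of 712 targets) are illustrative assumptions and do not come from the released data.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Hypothetical counts for one agent x condition cell (assumption, not source data).
low, high = wilson_interval(522, 712)
print(f"rate {522 / 712:.3f}, 95% CI {low:.3f} to {high:.3f}")
```

Unlike the simple normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly for rates near the boundaries.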

Agent	Condition	Target rate	Value rate	Wilson 95% CI
gpt-5.5	code-assisted	0.733	0.855	0.699–0.764
claude-opus-4.6	code-assisted	0.699	0.837	0.665–0.732
claude-opus-4.6	code-free	0.413	0.597	0.377–0.449
gpt-5.5	code-free	0.395	0.560	0.359–0.431
all cells	pooled	0.560	0.712	0.542–0.578

Full metrics · per agent × condition, medians per run

Agent	Condition	Task acc	Value acc	Tool calls	Time	Output tok	Cost / run	Failures
gpt-5.5	code-assisted	0.733	0.855	54	12.2 min	29.1k	n/a	0%
claude-opus-4.6	code-assisted	0.699	0.837	30	7.2 min	27.1k	$1.92	0%
claude-opus-4.6	code-free	0.413	0.597	42	17.7 min	45.8k	$2.97	0%
gpt-5.5	code-free	0.395	0.560	56	11.8 min	30.2k	n/a	0%

Time = median wall-clock per run; tool calls and output tokens are medians. Output tokens are comparable across agents; total tokens and USD cost are not — the Claude wrapper bills cached context as near-zero input (≈ $1.9–3.0 / run; ~$425 for all 70 Claude runs), while the Codex CLI reports neither a USD cost nor a comparable token total. All 140 runs completed (0 operational failures).

Scoring sensitivity. Under the strict canonical rule (exact tolerance band, no precision rounding) overall target rate is 0.506 and GPT-5.5 Code-Assisted falls to 0.574 — below Claude (0.699), which is essentially unchanged. The relative tolerance is, for ~55% of values, tighter than the precision the paper itself reports, which disproportionately penalizes the more precise agent; we therefore feature the precision-aligned rule and report the strict rule as a documented sensitivity. The Code-Assisted ordering is rule-sensitive.
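The difference between the two rules can be illustrated with a small sketch. It assumes a relative tolerance of 0.5% and half-up rounding, both of which are hypothetical choices made here for illustration; the registered tolerance and rounding convention are not stated in this summary.

```python
from decimal import ROUND_HALF_UP, Decimal


def within_tolerance(reproduced: float, reported: str, rel_tol: float) -> bool:
    """Strict rule: the reproduced value lies inside a relative tolerance band."""
    target = float(reported)
    return abs(reproduced - target) <= rel_tol * abs(target)


def rounds_to_reported(reproduced: float, reported: str) -> bool:
    """True if the value, rounded to the paper's reported precision, equals the reported figure."""
    reported_dec = Decimal(reported)  # keeps the reported number of decimals
    rounded = Decimal(repr(reproduced)).quantize(reported_dec, rounding=ROUND_HALF_UP)
    return rounded == reported_dec


def precision_aligned(reproduced: float, reported: str, rel_tol: float) -> bool:
    """Precision-aligned rule: within tolerance OR rounds to the reported figure."""
    return within_tolerance(reproduced, reported, rel_tol) or rounds_to_reported(reproduced, reported)


# The paper reports "0.43"; the agent obtains 0.4342 (illustrative numbers).
REL_TOL = 0.005  # hypothetical tolerance
print(within_tolerance(0.4342, "0.43", REL_TOL))   # strict rule
print(precision_aligned(0.4342, "0.43", REL_TOL))  # precision-aligned rule
```

In this example the tolerance band (0.43 ± 0.00215) is narrower than the rounding interval implied by two reported decimals, so a value that rounds to the printed figure still fails the strict rule. That is the situation the sensitivity note describes.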

By evidentiary tier · target rate

Confirmatory (1,896)	0.562
Secondary (440)	0.636
Exploratory (512)	0.486

Robustness subsets · target rate

Strict-Docker (containerized)	0.667
Single-pass-clean (90 runs)	0.578
Code-assisted-clean (64 runs)	0.717

03

By paper

Target-level reproduction rate per frozen paper bundle, pooled across both agents and both conditions (precision-aligned rule). Reproducibility varies widely with the paper's analytical complexity.

Paper	Target rate
Aday 2025	0.868
Yousif 2025	0.787
Otten 2025	0.771
Cho 2025	0.750
Navon 2025	0.727
Ditlmann 2025	0.720
Zhang 2025	0.712
Woolley 2025	0.693
Charlesworth 2025	0.558
Geiger 2025	0.542
Lopukhina 2025	0.480
Alister 2025	0.412
Willroth 2025	0.362
VanDeCalseyde 2025	0.319
Garagnani 2025	0.303
Liang 2025	0.302

04

Three-phase workflow

An artifact-only handoff pipeline. Phase 2 (Reproduce) runs in a sandbox that never contains the gold values; Phase 3 (Evaluate) reads only the files Phase 2 wrote. The units of analysis are the target value and the target (a group of values reported together).

[Workflow diagram. Legend: artifact handoff (one-way); sandboxed phase. Note: re-running P1 on the same materials reproduces the same hashes.]
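One way to check that a re-run reproduces the same hashes is to fingerprint every handed-off file. The sketch below is an assumption about how such a manifest could be built (SHA-256 over file contents, keyed by relative path); the function names are hypothetical and the actual pipeline may hash differently.

```python
import hashlib
from pathlib import Path


def artifact_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its bytes."""
    manifest = {}
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def manifest_digest(manifest: dict[str, str]) -> str:
    """Single fingerprint for a whole bundle, independent of traversal order."""
    h = hashlib.sha256()
    for rel_path, digest in sorted(manifest.items()):
        h.update(f"{rel_path}\0{digest}\n".encode("utf-8"))
    return h.hexdigest()
```

Two bundles with identical file contents and relative paths yield the same digest, regardless of timestamps or the directory they live in, which makes the comparison between runs a single string equality.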

05

Task setting

Code-Free

The author code/ is removed. The agent must derive the entire analysis from the manuscript, supplement, data and codebook, writing original analysis code.

Code-Assisted

The author code/ is mounted. Only non-substantive edits (paths, output capture, environment) are allowed before the first complete numeric output; an AST diff audit flags any substantive change.
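The idea behind an AST diff audit can be sketched as follows. This is a simplified illustration that assumes the author code is Python (the benchmark's bundles may use R or other languages, which would need a different parser) and that treats every string-literal change as a path edit; the helper names are hypothetical.

```python
import ast


class _MaskStrings(ast.NodeTransformer):
    """Replace every string literal with a placeholder so path edits compare equal."""

    def visit_Constant(self, node: ast.Constant) -> ast.AST:
        if isinstance(node.value, str):
            return ast.copy_location(ast.Constant(value="<str>"), node)
        return node


def normalized_dump(source: str) -> str:
    """Dump the AST without comments, formatting, or string-literal contents."""
    tree = _MaskStrings().visit(ast.parse(source))
    return ast.dump(tree)


def is_substantive_change(original: str, edited: str) -> bool:
    """Flag an edit whose normalized AST differs from the original's."""
    return normalized_dump(original) != normalized_dump(edited)
```

Comments and whitespace never reach the AST, so they are ignored automatically. Masking all strings is deliberately crude: it would also hide a changed column name, and it would flag an added print statement used for output capture, so a real audit would need a narrower allow-list plus human review of the flagged edits.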

Single-pass rule

Once a complete numeric output exists, the agent must not tune filters, contrasts, models or rounding to fit expected values. A tool-call trace audit enforces it and feeds a sensitivity subset.
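A trace audit of this kind might look like the sketch below. The event schema (`step`, `action`, `path`) and the action labels are hypothetical, since the wrapper's real trace format is not described here; the logic only shows the core check, namely that no analysis file is edited after the first complete numeric output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceEvent:
    step: int     # position in the tool-call trace
    action: str   # hypothetical labels: "edit", "run", "complete_output"
    path: str     # file touched by the call


def single_pass_violations(
    trace: list[TraceEvent],
    analysis_suffixes: tuple[str, ...] = (".py", ".R"),
) -> list[TraceEvent]:
    """Return edits to analysis files made after the first complete output."""
    ordered = sorted(trace, key=lambda e: e.step)
    first_output = next((e.step for e in ordered if e.action == "complete_output"), None)
    if first_output is None:
        return []  # no complete output yet, so the rule has not started to apply
    return [
        e for e in ordered
        if e.step > first_output and e.action == "edit" and e.path.endswith(analysis_suffixes)
    ]
```

A run with an empty violation list would qualify for a single-pass-clean subset under this sketch; any flagged event would still need a human look, because an edit after the first output can be harmless (for example, fixing an output path).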

Run matrix

papers / task packets	16 / 35
agents × conditions	2 × 2
runs (all completed)	140
forbidden-file leakage	0 / 140

Agents evaluated: claude-opus-4.6 (highest thinking) and gpt-5.5 (xhigh reasoning), each driven through an identical wrapper and standardized prompt. Conditions are run in the fixed order Code-Free → Code-Assisted so no code-assisted state can leak into a code-free workspace.
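The run matrix arithmetic (35 packets × 2 agents × 2 conditions = 140 runs) and the fixed condition order can be written out as a short sketch. The packet identifiers are placeholders invented for illustration.

```python
from itertools import product

AGENTS = ("claude-opus-4.6", "gpt-5.5")
CONDITIONS = ("code-free", "code-assisted")  # fixed execution order
N_PACKETS = 35


def run_matrix(packets: list[str]) -> list[tuple[str, str, str]]:
    """Enumerate (packet, agent, condition) runs, code-free always first."""
    return [
        (packet, agent, condition)
        for packet, agent in product(packets, AGENTS)
        for condition in CONDITIONS
    ]


packets = [f"packet-{i:02d}" for i in range(1, N_PACKETS + 1)]  # hypothetical IDs
runs = run_matrix(packets)
print(len(runs))
```

Keeping the condition as the innermost loop means each packet and agent pair sees its code-free run scheduled before its code-assisted run.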

06

Scoring

Primary rule

Reproduction accuracy is a dichotomous measure per target: a target reproduces iff every required value matches its registered comparison rule (round-to-reported-precision, tolerance, integer, exact-p, or one-sided inequality) and carries provenance linking it to the command that produced it. Figures get a tolerated match when their underlying data are not numerically available.
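The all-or-nothing aggregation from values to targets can be sketched as below. The record layout is hypothetical, and the sketch assumes that the value-level rate also requires provenance, which this summary does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValueResult:
    target_id: str
    matched: bool          # value satisfied its registered comparison rule
    has_provenance: bool   # value is linked to the command that produced it


def target_reproduces(values: list[ValueResult]) -> bool:
    """A target reproduces only if every required value matches with provenance."""
    return bool(values) and all(v.matched and v.has_provenance for v in values)


def rates(results: list[ValueResult]) -> tuple[float, float]:
    """Return (target-level rate, value-level rate)."""
    if not results:
        raise ValueError("no results to score")
    by_target: dict[str, list[ValueResult]] = {}
    for r in results:
        by_target.setdefault(r.target_id, []).append(r)
    target_rate = sum(target_reproduces(v) for v in by_target.values()) / len(by_target)
    value_rate = sum(r.matched and r.has_provenance for r in results) / len(results)
    return target_rate, value_rate
```

With two targets of three values each, where one target has a single mismatched value, the target-level rate is 1/2 while the value-level rate is 5/6. This is why the value-level figures on the leaderboard sit above the target-level ones.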

Secondary & sensitivity

value-match, sign agreement, significance crossing, log relative error
subsets: strict-Docker, by compute backend, single-pass-clean, code-assisted-clean
strict vs precision-aligned tolerance (see leaderboard note)
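The secondary indicators listed above could be computed along these lines. The definitions are assumptions: the significance threshold of 0.05 is a conventional default, and "log relative error" is taken in the numerical-accuracy sense of −log10 of the relative error; the benchmark's registered definitions may differ.

```python
from math import inf, log10


def sign_agreement(reported: float, reproduced: float) -> bool:
    """True if both estimates have the same sign (zero only agrees with zero)."""
    if reported == 0 or reproduced == 0:
        return reported == reproduced
    return (reported > 0) == (reproduced > 0)


def significance_crossing(p_reported: float, p_reproduced: float, alpha: float = 0.05) -> bool:
    """True if the two p-values fall on opposite sides of alpha."""
    return (p_reported < alpha) != (p_reproduced < alpha)


def log_relative_error(reported: float, reproduced: float) -> float:
    """-log10(|reproduced - reported| / |reported|): roughly the agreeing significant digits."""
    if reported == 0:
        raise ValueError("relative error is undefined for a reported value of 0")
    if reproduced == reported:
        return inf
    return -log10(abs(reproduced - reported) / abs(reported))
```

These measures grade a mismatch rather than just recording it: a value can fail the match rule while still agreeing in sign and significance, which is a much milder failure than a flipped effect.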

07

Discrepancy diagnosis

When a value does not reproduce, we ask why. Each failed target is compared method-against-method — the author's original analysis versus the agent's reproduction, cited line-by-line — and typed into the four preregistered categories below. A model drafts the diagnosis; two independent human reviewers adjudicate and resolve disagreements, with Cohen's κ reported.

DPE

Data-processing error

Errors in data cleaning, filtering, transformation or aggregation — the inputs to the model were wrong.

VDE

Variable-definition error

Errors in how a variable is constructed/operationalized from the raw data.

MSE

Model-specification error

Errors in selecting or specifying the statistical model (e.g. fixed- vs random-effects, contrasts, estimator).

CE

Computational error

Numerical precision or algorithmic-implementation errors in a correctly specified model.

ESE

Engine-substitution error (discovered)

An emergent category beyond the preregistered four: the original engine was unavailable, so the analysis was reimplemented locally. The model matches, but the implementation (e.g. a macro replaced by a bootstrap) differs.

+

Process & materials

Output missing/unscorable, or required source material absent from the packet — kept separate from the causal types.

⏳ Error-type and severity distributions are withheld until the two-reviewer adjudication and inter-rater reliability (κ ≥ 0.70) are complete.
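For reference, Cohen's κ between two reviewers can be computed as in the sketch below. The labels in the example are made up for illustration and are not adjudication results.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters assigning one category to each item."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal proportions, summed over labels.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for four failed targets.
reviewer_1 = ["DPE", "DPE", "MSE", "MSE"]
reviewer_2 = ["DPE", "MSE", "MSE", "MSE"]
print(cohens_kappa(reviewer_1, reviewer_2))
```

In this toy example the reviewers agree on three of four items, but half of that agreement is expected by chance, so κ comes out at 0.5, below the 0.70 threshold stated above.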

08

About

Scope

16 computationally reproducible Psychological Science 2025 articles (STAR-confirmed), decomposed into 35 self-contained analysis tasks. Conclusions are descriptive of this benchmark; cross-domain generalization is out of scope.

Known limitations

Failure-mode results pending human κ adjudication.
One attempt per cell (retry on operational failure only); variance not characterized.
Original values are visible in the manuscript; provenance + single-pass + diff audits mitigate but cannot fully exclude copying.
Code-Assisted agent ordering is sensitive to the tolerance rule (see leaderboard).

Citation

@misc{llm_comprepro_2026,
  title  = {Evaluating Large Language Models for Automated
            Computational Reproducibility Checks},
  author = {TODO — author list},
  year   = {2026},
  note   = {Preprint \& data release forthcoming.
            Preregistration: AsPredicted}
}