← Devansh Shukla — all projects

perturbbench

Memorization probe for LLM benchmarks via meaning-preserving perturbation plus controls.

a Summary

Asks whether a model's headline benchmark accuracy is partly memorization: when items are rewritten in a meaning-preserving way, does accuracy drop more on the public benchmark than on a fresh, difficulty-matched control set? Three perturbation operators are implemented for GSM8K-style word problems: an identity null, numeric resampling that recomputes the gold answer by re-evaluating the problem's own calculator chain, and entity renaming.

The decision rule flags memorization only when the bootstrap 95% CI of the gap (benchmark drop − control drop) excludes zero. The harness is validated end-to-end on three simulated models with known ground truth (memorized, clean, paraphrase-brittle) over 300 GSM8K items; all three are classified correctly. A real model plugs in via a one-method is_correct(item) interface.
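The decision rule and the plug-in interface described above can be sketched as follows. Only `is_correct(item)` and the `gap_lo` / `gap_hi` keys come from the project; the `Model` protocol, the `SeenOnlyModel` toy class, the `flags_memorization` name, and the dict-shaped item with a `"question"` key are assumptions for illustration.

```python
from typing import Protocol


class Model(Protocol):
    """Anything with a one-method is_correct(item) interface."""
    def is_correct(self, item: dict) -> bool: ...


class SeenOnlyModel:
    """Hypothetical toy model: right only on question text it has seen verbatim."""

    def __init__(self, seen_questions):
        self.seen = set(seen_questions)

    def is_correct(self, item: dict) -> bool:
        # Assumes items are dicts with a "question" field (not confirmed by the source).
        return item["question"] in self.seen


def flags_memorization(result: dict) -> bool:
    """Flag only when the 95% CI [gap_lo, gap_hi] excludes zero."""
    return not (result["gap_lo"] <= 0.0 <= result["gap_hi"])


# Intervals taken from the results table of this project.
print(flags_memorization({"gap_lo": 0.189, "gap_hi": 0.331}))   # True
print(flags_memorization({"gap_lo": -0.064, "gap_hi": 0.115}))  # False
```

The rule is written literally as "excludes zero"; a one-sided variant that requires `gap_lo > 0` would be a stricter reading and is not something the source specifies.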

b Results

Key results of perturbbench
measurement | value | note
gap (memorized model) | 0.260, CI [0.189, 0.331] | numeric_resample; correctly flagged
gap (paraphrase-brittle model) | 0.023, CI [−0.064, 0.115] | correctly NOT flagged; this is the case a control-free probe would misclassify
identity-null drop | 0.000, McNemar p = 1.0 | no harness artifact (n = 150)
perturbation coverage | 127/150 items | numeric_resample; McNemar p = 2.5e-08 on the memorized model
Figure (perturbbench_drop_gap.png): Benchmark vs control accuracy drop for the three simulated scenarios; only the memorized model shows a CI-separated gap.

caveat: No real LLM is evaluated. The results validate the harness on simulated models with known memorization; the real-model runner is an interface stub.
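The McNemar p-values in the table compare per-item correctness before and after perturbation. The source does not say which variant of the test the project uses, so the sketch below assumes the exact (binomial) two-sided form and a hypothetical function name, `mcnemar_exact`, using only the standard library.

```python
from math import comb


def mcnemar_exact(orig_correct, pert_correct):
    """Exact two-sided McNemar p-value for paired boolean outcomes.

    Only discordant pairs matter: b counts items that went correct -> wrong,
    c counts items that went wrong -> correct. Under the null hypothesis the
    smaller count follows Binomial(b + c, 0.5).
    """
    b = sum(1 for o, p in zip(orig_correct, pert_correct) if o and not p)
    c = sum(1 for o, p in zip(orig_correct, pert_correct) if p and not o)
    n = b + c
    if n == 0:
        return 1.0  # no item changed, so there is no evidence of any effect
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# An identity perturbation changes nothing, so every pair is concordant.
outcomes = [True, False, True, True]
print(mcnemar_exact(outcomes, outcomes))  # 1.0
```

With zero discordant pairs the function returns 1.0, which is the behaviour an identity null should produce when the harness itself introduces no change.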

c Code

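The gap estimator, from perturbbench/experiment.py: a bootstrap that resamples benchmark and control items separately while keeping each item's original and perturbed outcomes paired. Full source and tests are on GitHub; the walkthrough notebook reproduces the results table above in Colab.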

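import numpy as np

def drop_gap(model, benchmark_items, control_items, perturb_fn, seed=0, n_boot=2000):
    # _eval_pairs is defined elsewhere in the package (not shown here); it is
    # expected to return per-item score arrays for original and perturbed items.
    bo, bp = _eval_pairs(model, benchmark_items, perturb_fn, seed)
    co, cp = _eval_pairs(model, control_items, perturb_fn, seed)
    # Point estimates: accuracy lost to perturbation on each set.
    bench_drop = float(bo.mean() - bp.mean())
    ctrl_drop = float(co.mean() - cp.mean())
    rng = np.random.default_rng(seed + 1)
    gaps = np.empty(n_boot)
    for i in range(n_boot):
        # Resample item indices within each set; the same indices select the
        # original and perturbed scores, so each item's pair stays together.
        bi = rng.integers(0, len(bo), size=len(bo))
        ci = rng.integers(0, len(co), size=len(co))
        gaps[i] = (bo[bi].mean() - bp[bi].mean()) - (co[ci].mean() - cp[ci].mean())
    # Percentile 95% CI of the gap (benchmark drop minus control drop).
    return {"bench_drop": bench_drop, "ctrl_drop": ctrl_drop,
            "gap": bench_drop - ctrl_drop,
            "gap_lo": float(np.percentile(gaps, 2.5)),
            "gap_hi": float(np.percentile(gaps, 97.5))}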
stack: Python, numpy, pandas, scipy, matplotlib
tests: 21 pytest tests; GitHub Actions CI (ruff + pytest)
notebook: notebooks/01_walkthrough.ipynb on Colab