← Devansh Shukla — all projects

contamination-scanner

From-scratch MinHash/LSH auditor measuring benchmark inflation from train-eval leakage.

a Summary

Quantifies how much of a text benchmark's accuracy is an artifact of train/eval contamination rather than generalization. MinHash signatures and banded LSH are implemented from scratch (plus a TF-IDF cosine matcher for cases MinHash misses), and the matcher is validated on 200 seeded duplicate pairs across five controlled edit types before any claim is made.

Known leakage is then injected into an AG News classification task (6,000 documents; 4 contamination levels × 3 seeds) and naive versus leakage-corrected accuracy is reported. A third experiment (36 retrains) shows dedup-first pruning dominates random pruning. Ships a data-audit CLI with dedup and leakage subcommands.

b Results

Key results of contamination-scanner
measurementvaluenote
naive vs corrected accuracy0.900 vs 0.870at 30% injected contamination — a 3-point illusory gain; gap exactly 0.000 at zero leakage
detector on injected leakagerecall 1.00, precision 0.985–0.997the accuracy gap exceeds the false-positive rate
matcher validationprecision 1.0 at all 9 thresholdshonest failure reported: both matchers 0.0 recall on heavy paraphrase
pruning at 30% data removal−0.6 vs −1.0 pointsdedup-first vs random ordering (means over 3 seeds)
Naive vs leakage-corrected accuracy as injected contamination grows 0 to 30%, with the shaded illusory-accuracy wedge and seed error bars.
contamination_naive_vs_corrected.pngNaive vs leakage-corrected accuracy as injected contamination grows 0 to 30%, with the shaded illusory-accuracy wedge and seed error bars.
Matcher validation: precision/recall/F1 across Jaccard thresholds, and per-edit-type recall for MinHash vs TF-IDF including the paraphrase failure.
contamination_matcher_validation.pngMatcher validation: precision/recall/F1 across Jaccard thresholds, and per-edit-type recall for MinHash vs TF-IDF including the paraphrase failure.

caveatInjected near-exact lexical duplicates on one dataset with a linear classifier; paraphrased contamination would need an embedding-based matcher.

c Code

the MinHash core — vectorized universal hashing over a Mersenne prime — data_audit/minhash.py. Full source and tests are on GitHub; the walkthrough notebook reproduces the results table above in Colab.

class MinHasher:
    def __init__(self, num_perm=128, seed=0):
        rng = np.random.default_rng(seed)
        self.num_perm = num_perm
        self.a = rng.integers(1, _PRIME, size=num_perm, dtype=np.uint64)
        self.b = rng.integers(0, _PRIME, size=num_perm, dtype=np.uint64)

    def signature(self, shingles):
        if not shingles:
            return np.full(self.num_perm, _PRIME, dtype=np.uint64)
        h = np.array([_base_hash(s) for s in shingles], dtype=np.uint64)
        hashed = (self.a[:, None] * h[None, :] + self.b[:, None]) % _PRIME
        return hashed.min(axis=1)

def jaccard_estimate(sig1, sig2):
    return float(np.mean(sig1 == sig2))
stack
Python, NumPy, pandas, scikit-learn (classifier/TF-IDF only), CLI entry point
tests
17 pytest tests - GitHub Actions CI (ruff + pytest)
notebook
notebooks/01_walkthrough.ipynb on Colab