Devansh Shukla

AI Engineer & ML researcher — measurement and evaluation of ML systems, from benchmarks to model internals.

b. 2005 · BSc CS '26 · Gravity · GitHub

01 About

I work on what a single reported number hides. The thread through my work is measurement: whether a benchmark score survives contamination and memorization checks, whether a learning algorithm's gradients match an independent oracle, and what the internal geometry of a model says that its accuracy cannot. My current manuscript studies the last of these in small open language models. By day I build LLM-backed systems in production; the engineering keeps the research honest about what such measurements can do once deployed.

born
2005, India
education
BSc (Hons) Computer Science, K.R. Mangalam University — graduating 2026
certificates
IBM — Big Data Engineer (Career Education Program, Oct 2025) · Microsoft Certified: Azure Data Fundamentals (Aug 2025 · code wNNpp-2FMV at verify.certiport.com)

02 Research

D. Shukla. Clustered, Overlapped, or Fractured? The Local-Neighbourhood Geometry of Truth Across Small Multilingual Language Models. Manuscript, 2026. arXiv — soon full results & verification

  • 7 models · 4 families0.5B–14B
  • 8 languages3 truth concepts
  • 2000-sample bootstrap CIspre-registered 14B extension
  • controlslabel-permutation · length · random-weight · quantization

Linear probes read whether a statement is true from a language model's hidden states at ≈0.97–1.00 accuracy. This paper asks what that number hides. It takes the k* distribution (Kotyan, Ueda & Vargas, IEEE TNNLS 2025) — a distortion-free local-neighbourhood statistic — and applies it to the hidden states of autoregressive LLMs for the first time. At matched probe accuracy, the local shape of truth differs by model family (Fractured → Clustered); it changes with scale in a layer- and concept-dependent way; and it is fragile across languages, measured by a new Geometric Reliability Coefficient (GRC). Two controlled negative results show this local geometry does not reliably predict probe transfer: the global truth direction travels across models and languages; the local neighbourhood does not follow.

Headline numbers of the manuscript
measurementvaluenote
linear probe accuracy0.97–1.00near-perfect decodability, all 7 models — the number the paper looks beneath
k* skew across families−0.51 … +1.29full Fractured → Clustered range at matched ≈0.99 probe accuracy
GRC across languages0.267 → 0.096geometric reliability falls with scale while transfer rises
cross-model probe transfer0.996 vs 0.970learned linear vs rigid orthogonal-Procrustes alignment
Probe transfer accuracy across model pairs under a learned linear alignment versus a rigid orthogonal-Procrustes alignment.
fig5_transfer_vs_gap.pngCross-model probe transfer: flat near 1.0 under a learned linear alignment; degrades under rigid orthogonal-Procrustes alignment.
Cross-lingual transfer rising with model scale while the Geometric Reliability Coefficient falls, across language pairs.
fig8_dissociation.pngCross-lingual dissociation: transfer rises with scale while GRC falls; script and token length, not local geometry, predict which pairs transfer.
instrument — kstar.py (68 lines)

The cleaned, comment-free core instrument, numerically verified against the paper pipeline. Four segments, explained between the code instead of inside it.

1 · Nearest-different-class rank kstar_rank

For each sample, rank all others by distance and count how many same-class neighbours appear before the first different-class one. Supports euclidean / cosine / cityblock / chebyshev; the result depends only on neighbour ranks, not absolute distances.

import numpy as np

def kstar_rank(X, y, metric='euclidean', standardize=False):
    X = np.asarray(X, dtype=np.float64)
    y = np.asarray(y)
    if standardize:
        X = (X - X.mean(0, keepdims=True)) / (X.std(0, keepdims=True) + 1e-08)
    if metric == 'cosine':
        Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
        D = 1.0 - Xn @ Xn.T
    elif metric in ('cityblock', 'l1'):
        D = np.abs(X[:, None, :] - X[None, :, :]).sum(-1)
    elif metric in ('chebyshev', 'linf'):
        D = np.abs(X[:, None, :] - X[None, :, :]).max(-1)
    else:
        sq = (X * X).sum(1)
        D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0))
    np.fill_diagonal(D, np.inf)
    order = np.argsort(D, axis=1)
    diff = y[order] != y[:, None]
    kstar = diff.argmax(axis=1)
    has_diff = diff.any(axis=1)
    same_counts = (y[:, None] == y[None, :]).sum(1) - 1
    return np.where(has_diff, kstar, same_counts).astype(int)

2 · Class-size normalisation and typology kstar_normalized, typology_by_class

Divide by the sample's own class cardinality, then label each class by the Fisher–Pearson skewness of its normalised distribution — Fractured (γ > 0.5), Overlapped (|γ| ≤ 0.5), Clustered (γ < −0.5), cut-offs unchanged from the original method.

def _skew_pop(v):
    v = np.asarray(v, dtype=np.float64)
    sd = v.std()
    if sd == 0:
        return 0.0
    return float(((v - v.mean()) ** 3).mean() / sd ** 3)

def kstar_normalized(raw, y):
    y = np.asarray(y)
    out = np.empty(len(raw), dtype=np.float64)
    for c in np.unique(y):
        m = y == c
        out[m] = raw[m] / max(int(m.sum()), 1)
    return out

def typology_by_class(norm_kstar, y):
    y = np.asarray(y)
    res = {}
    for c in np.unique(y):
        v = norm_kstar[y == c]
        g = _skew_pop(v)
        label = 'Fractured' if g > 0.5 else 'Clustered' if g < -0.5 else 'Overlapped'
        res[int(c)] = {'label': label, 'skewness': g, 'mean': float(v.mean()), 'std': float(v.std()), 'n': int(len(v))}
    return res

3 · Uncertainty gamma_bootstrap_ci

Third moments are noisy at small n, so every reported skewness carries a 2000-resample bootstrap 95% CI.

def gamma_bootstrap_ci(norm_kstar, y, cls, n_boot=2000, seed=0):
    v = np.asarray(norm_kstar)[np.asarray(y) == cls]
    rng = np.random.default_rng(seed)
    n = len(v)
    boots = [_skew_pop(v[rng.integers(0, n, n)]) for _ in range(n_boot)]
    return (float(np.percentile(boots, 2.5)), float(np.percentile(boots, 97.5)))

4 · Cross-lingual stability grc_spearman

GRC = mean pairwise Spearman correlation of per-item normalised k* across languages — rank correlation because k* is itself a rank.

def grc_spearman(norm_kstar_by_lang):
    from scipy.stats import spearmanr
    langs = list(norm_kstar_by_lang)
    vals, pairs = ([], [])
    for a in range(len(langs)):
        for b in range(a + 1, len(langs)):
            r, _ = spearmanr(norm_kstar_by_lang[langs[a]], norm_kstar_by_lang[langs[b]])
            r = 0.0 if np.isnan(r) else float(r)
            vals.append(r)
            pairs.append((langs[a], langs[b], round(r, 3)))
    return {'grc': float(np.mean(vals)) if vals else 0.0, 'pairs': pairs}

03 Projects

Five self-contained studies sharing one question: when can you trust a reported number? Each entry opens a full page with the results, figures, and code.

  • Memorization probe for LLM benchmarks via meaning-preserving perturbation plus controls.

    producedflags a 60%-memorized model (gap 0.260, CI [0.189, 0.331]) and correctly clears a paraphrase-brittle clean one

  • From-scratch MinHash/LSH auditor measuring benchmark inflation from train-eval leakage.

    producednaive 0.900 vs corrected 0.870 accuracy at 30% injected leakage; detector recall 1.00, precision ≥ 0.985

  • Leakage-tested link-prediction bake-off: zero-parameter heuristics versus a trained GCN.

    producedclustering coefficient predicts the winner (ρ = 0.955); a heuristic matches the GCN on 16/20 graphs

  • NumPy bake-off of TD return estimators under delayed reward.

    producedoptimal λ shifts 0.8 → 1.0 as reward delay grows; delay-aware λ beats TD(0) with disjoint CIs (1,600 runs)

  • Reverse-mode autodiff engine from scratch, gradient-verified against PyTorch.

    producedgradients match PyTorch to ~3.5e-15 over 500 expressions; trains two-moons to 0.985 with its own Adam

04 Experience

Work experience, most recent first
datescompanyrole
Jun 2026 – presentGravityAI Engineer (Junior)
Feb 2026 – May 2026GravityAI Engineer Intern
Nov 2025 – Jan 2026Vibe AIFull-Stack Intern

05 Contact

email
devansh.shukla.official@gmail.com
github
github.com/Devansh-hello