Research — Devansh Shukla

a What the paper does

Linear probes read whether a statement is true from a language model's hidden states at ≈0.97–1.00 accuracy. The paper asks what that number hides. It takes the k* distribution (Kotyan, Ueda & Vargas, IEEE TNNLS 2025) — a distortion-free local-neighbourhood statistic that labels a class Clustered, Overlapped, or Fractured from the skewness of its nearest-different-class rank — and applies it to the hidden states of autoregressive LLMs for the first time.

Findings: at matched probe accuracy the local shape of truth differs by model family (Fractured → Clustered on identical inputs); it changes with scale in a layer- and concept-dependent way; and it is fragile across languages, measured by a new Geometric Reliability Coefficient (GRC). Two controlled negative results show this local geometry does not reliably predict probe transfer — the global truth direction travels across models and languages; the local neighbourhood does not follow. Both negatives are reported as such.

7 models · 4 families: Qwen2.5 0.5B–14B · Mistral · Phi-3 · OLMo-2
8 languages: en es de fr ru zh ar hi · 3 truth concepts
2000-sample bootstrap CIs · pre-registered 14B extension
controls: label-permutation · length · random-weight · quantization

b Results

Main result (geo_num concept, read at each model's best-probe layer). Truth is decodable everywhere; the local typology is not the same everywhere.

Per-model probe accuracy and k* typology
model	probe	k* skew γ [95% CI]	typology	GRC
Qwen2.5-0.5B	0.972	+1.04 [+0.78, +1.32]	Fractured	0.267
Qwen2.5-1.5B	0.997	+0.38 [+0.14, +0.59]	Overlapped	0.216
Qwen2.5-3B	0.993	+0.17 [−0.02, +0.37]	Overlapped	0.204
Qwen2.5-7B	1.000	−0.51 [−0.81, −0.22]	Clustered	0.096
Qwen2.5-14B	0.998	−0.08 [−0.26, +0.10]	Overlapped	—
Mistral-7B	0.995	+0.46	Overlapped	0.256
Phi-3-mini	0.997	+0.29 [+0.10, +0.50]	Overlapped	0.160
OLMo-2-7B	0.982	+1.29 [+0.97, +1.63]	Fractured	0.218

notes: CIs are 2000-resample bootstrap. Several CIs straddle a ±0.5 boundary, so the robust claim is the family-and-scale ordering, not any single categorical label. Mistral's CI and 14B's GRC were not recomputed in the scale-extension run.
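The boundary caveat can be checked mechanically. The sketch below is not part of kstar.py: robust_label is a hypothetical helper that labels the point estimate and flags it whenever either end of the CI falls in a different band. The rows are copied from the table above; Mistral-7B is left out because no CI is reported for it.

```python
def band(g):
    # Cut-offs as stated on this page: Fractured > 0.5, Clustered < -0.5.
    return "Fractured" if g > 0.5 else "Clustered" if g < -0.5 else "Overlapped"

def robust_label(gamma, lo, hi):
    """Return (label, stable). stable is True when both CI ends share the label.

    band() is monotone in gamma, so agreeing end points imply the whole
    interval sits inside one band.
    """
    label = band(gamma)
    stable = band(lo) == label and band(hi) == label
    return label, stable

# (gamma, CI low, CI high) as printed in the per-model table.
rows = {
    "Qwen2.5-0.5B": (1.04, 0.78, 1.32),
    "Qwen2.5-1.5B": (0.38, 0.14, 0.59),
    "Qwen2.5-3B": (0.17, -0.02, 0.37),
    "Qwen2.5-7B": (-0.51, -0.81, -0.22),
    "Qwen2.5-14B": (-0.08, -0.26, 0.10),
    "Phi-3-mini": (0.29, 0.10, 0.50),
    "OLMo-2-7B": (1.29, 0.97, 1.63),
}

for name, (g, lo, hi) in rows.items():
    label, stable = robust_label(g, lo, hi)
    note = "stable" if stable else "CI crosses a cut-off"
    print(f"{name:<13} {label:<10} {note}")
```

On the printed values this flags Qwen2.5-1.5B and Qwen2.5-7B. Phi-3-mini's upper bound sits exactly on +0.5 at two decimals, so its status depends on rounding.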

Scale × concept: no single curve describes truth's local geometry. At the probe-peak layer geo_num reverses at 14B; at one matched depth (0.6) the same ladder is monotone (+0.83 → −0.84) — the reversal is a layer-selection effect the paper flags rather than hides.

k* skew across the Qwen ladder per concept
concept	0.5B	1.5B	3B	7B	14B
geo_num (cities + larger_than)	+1.04	+0.38	+0.17	−0.51	−0.08
entity (companies + sp_en_trans)	+2.62	+0.99	+1.00	+1.27	+0.69
neg_cities (negated facts)	+1.19	+0.17	−0.48	+0.08	−3.31

Transfer — the two honest negatives. Cross-model: a learned linear remap recovers the truth direction regardless of local geometry; under a rigid alignment the degradation co-occurs with the k* gap but is outlier-driven and concentrated on the smallest target model. Cross-lingual: token length and script predict transfer; the local geometry does not.

Transfer results
test	value	note
cross-model, learned-linear (ridge)	mean 0.996	k* gap does not predict it (r = +0.19, two-sided Mantel p = 0.51)
cross-model, rigid (Procrustes, 64-d PCA)	mean 0.970, worst 0.82	Pearson r = −0.34 (two-sided Mantel p = 0.029) — but Spearman ρ ≈ 0; six of eight worst pairs share the 0.5B target
cross-lingual, 8×8 per model	off-diagonal 0.58 → 0.89	rises with scale while GRC falls 0.267 → 0.096
per-pair predictor, pooled	k* gap r = +0.05	null; token-length gap r = −0.55…−0.68 in every model is the consistent predictor
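For readers unfamiliar with the Mantel p-values in the table, here is a minimal sketch of one common form of the test. It assumes two symmetric model-by-model matrices (for example a |k* skew gap| matrix and a symmetrised transfer matrix). The paper's exact statistic, permutation count, and handling of directed source-to-target pairs are not stated on this page, so the function name mantel_two_sided, n_perm=999, and the synthetic demo matrices are all assumptions.

```python
import numpy as np

def mantel_two_sided(A, B, n_perm=999, seed=0):
    """Permutation Mantel test between two symmetric n x n matrices."""
    A = np.asarray(A, dtype=np.float64)
    B = np.asarray(B, dtype=np.float64)
    n = A.shape[0]
    iu = np.triu_indices(n, k=1)               # each unordered pair once
    b = B[iu]
    r_obs = float(np.corrcoef(A[iu], b)[0, 1])
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_perm):
        p = rng.permutation(n)                 # relabel items: rows and columns together
        r = np.corrcoef(A[np.ix_(p, p)][iu], b)[0, 1]
        if abs(r) >= abs(r_obs) - 1e-12:       # two-sided: compare magnitudes
            hits += 1
    # The +1 terms count the observed labelling as one permutation.
    return r_obs, (hits + 1) / (n_perm + 1)

# Synthetic demo only: 7 items, B is A plus small symmetric noise.
rng = np.random.default_rng(1)
pts = rng.normal(size=(7, 2))
A = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1))
noise = rng.normal(0.0, 0.05, A.shape)
B = A + (noise + noise.T) / 2
np.fill_diagonal(B, 0.0)

r, p = mantel_two_sided(A, B)
print(f"r = {r:+.2f}, two-sided Mantel p = {p:.3f}")
```

Permuting whole items rather than individual pairs is what distinguishes this from an ordinary correlation test: the 21 pairs among 7 models are not independent, since each model appears in six of them.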

Controls (the result survives them; the instrument is validated before any LLM number is trusted).

Controls and validation
control	value	conclusion
label permutation (200 shuffles, 0.5B)	probe 0.975 → 0.499 ± 0.028	decodability needs true labels (p < 0.001)
k* skew vs random-label null	+1.04 vs +1.89	real labels ≈3 SD less fractured than chance
distance metric	euclidean / cosine / cityblock / chebyshev	typology unchanged under all four
layer consistency (0.5B)	24/25 layers	not a cherry-picked layer
quantization (Qwen-7B)	bf16 −0.516 vs 4-bit −0.506	4-bit extraction does not flip the typology
raw neighbour overlap M10 (21 pairs)	0.35–0.48 ≫ 0.025 floor	family divergence does not reflect unrelated representation spaces (r vs k* gap = −0.67, p = 0.005: a partial, not clean, dissociation)
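The label-permutation rows follow a generic recipe: recompute a statistic under shuffled labels and express the real value in null standard deviations. The sketch below shows that recipe with a stand-in statistic (a class-mean gap on synthetic 1-D data), not the probe accuracy or k* skew used in the paper. permutation_null and mean_gap are hypothetical names; n_perm=200 mirrors the 200 shuffles quoted in the table.

```python
import numpy as np

def permutation_null(stat_fn, X, y, n_perm=200, seed=0):
    """Compare stat_fn(X, y) with its distribution under shuffled labels."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    observed = float(stat_fn(X, y))
    # Shuffling y keeps class sizes fixed and breaks the X-to-label link.
    null = np.array(
        [stat_fn(X, rng.permutation(y)) for _ in range(n_perm)],
        dtype=np.float64,
    )
    z = (observed - null.mean()) / (null.std() + 1e-12)
    return {
        "observed": observed,
        "null_mean": float(null.mean()),
        "null_sd": float(null.std()),
        "z": float(z),
    }

def mean_gap(X, y):
    # Stand-in statistic: difference between the two class means.
    return X[y == 1].mean() - X[y == 0].mean()

# Synthetic data only: two unit-variance Gaussians two units apart.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(2.0, 1.0, 100)])
y = np.array([0] * 100 + [1] * 100)

out = permutation_null(mean_gap, X, y)
print({k: round(v, 3) for k, v in out.items()})
```

Swapping mean_gap for a function that returns a class skewness would give a comparison of the same shape as the "+1.04 vs +1.89" row, under the assumption that the paper's null was built the same way.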

Figure fig1_scale_and_family.png: Scale de-fractures truth on geo_num (0.5B→7B) while an entity concept stays Fractured; at ~7B, four families at matched ≈0.99 probe accuracy span the full Fractured→Clustered range.

Figure fig9_scale_wolfram.png: Full 0.5B–14B ladder with 95% bootstrap CIs (the probe-peak curve reverses at 14B; at matched depth it is monotone) and label-agnostic neighbour overlap M10 against the k* skew gap.

Figure fig4_per_layer.png: Per-layer probe accuracy and k* skewness, all seven models: decodability is broad across depth while the typology varies with it, which is why the paper reports depth curves and a competent-layer ordering, not one label.

Figure fig6_crosslingual.png: (a) Per-pair |k* skew gap| vs cross-lingual transfer: uncorrelated (r = +0.05). (b) Model-level GRC vs mean transfer (Spearman −0.54): the least language-stable geometry belongs to the best-transferring model.

Figure fig7_crosslingual_heatmaps.png: Per-model 8×8 transfer matrices (rows train, columns test), raw vs affine-calibrated: the within-Latin block stays warm, non-Latin columns cool, the matrix warms with scale, and calibration barely changes it.

c The instrument — kstar.py

The complete measurement core (68 lines, comment-free), numerically verified against the paper pipeline; it is the file this page lets you download. It comes in four segments, and the notes between them stand in for inline comments.

1 · Nearest-different-class rank: kstar_rank

For each sample, rank all others by distance and count how many same-class neighbours appear before the first different-class one. Supports euclidean / cosine / cityblock / chebyshev; the result depends only on neighbour ranks, not absolute distances.

import numpy as np

def kstar_rank(X, y, metric='euclidean', standardize=False):
    X = np.asarray(X, dtype=np.float64)
    y = np.asarray(y)
    if standardize:
        X = (X - X.mean(0, keepdims=True)) / (X.std(0, keepdims=True) + 1e-08)
    if metric == 'cosine':
        Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
        D = 1.0 - Xn @ Xn.T
    elif metric in ('cityblock', 'l1'):
        D = np.abs(X[:, None, :] - X[None, :, :]).sum(-1)
    elif metric in ('chebyshev', 'linf'):
        D = np.abs(X[:, None, :] - X[None, :, :]).max(-1)
    else:
        sq = (X * X).sum(1)
        D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0))
    np.fill_diagonal(D, np.inf)
    order = np.argsort(D, axis=1)
    diff = y[order] != y[:, None]
    kstar = diff.argmax(axis=1)
    has_diff = diff.any(axis=1)
    same_counts = (y[:, None] == y[None, :]).sum(1) - 1
    return np.where(has_diff, kstar, same_counts).astype(int)
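To make the counting rule concrete, here is a hand-checkable toy. nearest_diff_rank is a hypothetical, euclidean-only restatement written for this example, not the downloadable function. It assumes every sample has at least one different-class point, so it omits the has_diff fallback above.

```python
import numpy as np

def nearest_diff_rank(X, y):
    # Condensed euclidean-only restatement of the counting rule, for illustration.
    X = np.asarray(X, dtype=np.float64)
    y = np.asarray(y)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(D, np.inf)       # a sample is never its own neighbour
    order = np.argsort(D, axis=1)     # neighbour indices, nearest first
    diff = y[order] != y[:, None]     # True where the neighbour's class differs
    return diff.argmax(axis=1)        # index of first True = same-class count before it

# Two well-separated groups on a line: every sample sees all of its own
# class before any point of the other class.
separated = nearest_diff_rank([[0], [1], [2], [10], [11]], [0, 0, 0, 1, 1])
print(separated)      # -> [2 2 2 1 1]

# Alternating classes: the nearest neighbour always belongs to the other class.
interleaved = nearest_diff_rank([[0], [1], [2], [3]], [0, 1, 0, 1])
print(interleaved)    # -> [0 0 0 0]
```

In the separated case each value equals its class size minus one, the largest rank a sample can reach; in the interleaved case every rank is zero.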

2 · Class-size normalisation and typology: kstar_normalized, typology_by_class

Divide by the sample's own class cardinality, then label each class by the Fisher–Pearson skewness of its normalised distribution — Fractured (γ > 0.5), Overlapped (|γ| ≤ 0.5), Clustered (γ < −0.5), cut-offs unchanged from the original method.

def _skew_pop(v):
    v = np.asarray(v, dtype=np.float64)
    sd = v.std()
    if sd == 0:
        return 0.0
    return float(((v - v.mean()) ** 3).mean() / sd ** 3)

def kstar_normalized(raw, y):
    y = np.asarray(y)
    out = np.empty(len(raw), dtype=np.float64)
    for c in np.unique(y):
        m = y == c
        out[m] = raw[m] / max(int(m.sum()), 1)
    return out

def typology_by_class(norm_kstar, y):
    y = np.asarray(y)
    res = {}
    for c in np.unique(y):
        v = norm_kstar[y == c]
        g = _skew_pop(v)
        label = 'Fractured' if g > 0.5 else 'Clustered' if g < -0.5 else 'Overlapped'
        res[int(c)] = {'label': label, 'skewness': g, 'mean': float(v.mean()), 'std': float(v.std()), 'n': int(len(v))}
    return res
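A few tiny vectors show how the sign of the skewness maps onto the three labels. This is an illustrative sketch with made-up values standing in for a class's normalised k* distribution; skew_pop and band repeat the formulas above under new names so the snippet runs on its own.

```python
import numpy as np

def skew_pop(v):
    # Population (biased) Fisher-Pearson skewness: m3 / m2 ** 1.5.
    v = np.asarray(v, dtype=np.float64)
    sd = v.std()
    return 0.0 if sd == 0 else float(((v - v.mean()) ** 3).mean() / sd ** 3)

def band(g):
    return "Fractured" if g > 0.5 else "Clustered" if g < -0.5 else "Overlapped"

# Made-up stand-ins for one class's normalised k* values.
examples = {
    "mostly low, one high": [0.0, 0.0, 0.0, 1.0],   # long right tail
    "mostly high, one low": [1.0, 1.0, 1.0, 0.0],   # long left tail
    "symmetric": [0.0, 0.5, 1.0],
}

for name, v in examples.items():
    g = skew_pop(v)
    print(f"{name:<21} skew {g:+.3f} -> {band(g)}")
# mostly low, one high  skew +1.155 -> Fractured
# mostly high, one low  skew -1.155 -> Clustered
# symmetric             skew +0.000 -> Overlapped
```

One reading of the first case: most samples meet a different-class neighbour almost immediately and only a few sit deep inside their own class, which is the right-skewed shape the method calls Fractured.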

3 · Uncertainty: gamma_bootstrap_ci

Third moments are noisy at small n, so every reported skewness carries a 2000-resample bootstrap 95% CI.

def gamma_bootstrap_ci(norm_kstar, y, cls, n_boot=2000, seed=0):
    v = np.asarray(norm_kstar)[np.asarray(y) == cls]
    rng = np.random.default_rng(seed)
    n = len(v)
    boots = [_skew_pop(v[rng.integers(0, n, n)]) for _ in range(n_boot)]
    return (float(np.percentile(boots, 2.5)), float(np.percentile(boots, 97.5)))

4 · Cross-lingual stability: grc_spearman

GRC = mean pairwise Spearman correlation of per-item normalised k* across languages — rank correlation because k* is itself a rank.

def grc_spearman(norm_kstar_by_lang):
    from scipy.stats import spearmanr
    langs = list(norm_kstar_by_lang)
    vals, pairs = ([], [])
    for a in range(len(langs)):
        for b in range(a + 1, len(langs)):
            r, _ = spearmanr(norm_kstar_by_lang[langs[a]], norm_kstar_by_lang[langs[b]])
            r = 0.0 if np.isnan(r) else float(r)
            vals.append(r)
            pairs.append((langs[a], langs[b], round(r, 3)))
    return {'grc': float(np.mean(vals)) if vals else 0.0, 'pairs': pairs}
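A three-language toy shows how the mean over pairs behaves. The sketch below is a numpy-only restatement so it runs without scipy: avg_ranks, spearman, and grc_toy are hypothetical names, and the per-item values are made up. Ties get average ranks, which is also scipy's default ranking convention.

```python
import numpy as np
from itertools import combinations

def avg_ranks(v):
    # 1-based ranks; tied values share the mean of the ranks they occupy.
    v = np.asarray(v, dtype=np.float64)
    order = np.argsort(v, kind="stable")
    ranks = np.empty(len(v), dtype=np.float64)
    ranks[order] = np.arange(1, len(v) + 1)
    for val in np.unique(v):
        m = v == val
        ranks[m] = ranks[m].mean()
    return ranks

def spearman(a, b):
    ra, rb = avg_ranks(a), avg_ranks(b)
    if ra.std() == 0 or rb.std() == 0:
        return 0.0   # mirrors the NaN -> 0.0 convention in grc_spearman
    return float(np.corrcoef(ra, rb)[0, 1])

def grc_toy(by_lang):
    rs = [spearman(by_lang[a], by_lang[b]) for a, b in combinations(by_lang, 2)]
    return float(np.mean(rs)) if rs else 0.0

# Made-up per-item normalised k* values for the same four items.
toy = {
    "en": [0.1, 0.2, 0.3, 0.4],
    "es": [0.2, 0.4, 0.6, 0.8],   # same item ordering as en
    "hi": [0.4, 0.3, 0.2, 0.1],   # reversed ordering
}
print(round(grc_toy(toy), 3))   # -> -0.333
```

The en/es pair contributes +1 and the two pairs involving hi contribute −1 each, so one language with a reversed ordering is enough to pull the mean below zero.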

d Verify it yourself

The instrument is validated on synthetic data with known geometry before any LLM number is trusted. The check below builds three 2-D fixtures — interleaved tight sub-blobs (locally pure → Clustered), two moderately separated Gaussians (Overlapped, the boundary case by construction), and fully interpenetrating Gaussians (Fractured) — and labels them with the exact code above. Needs only numpy and scipy:

pip install numpy scipy
python synthetic_check.py   # with kstar.py in the same folder

import numpy as np
from kstar import kstar_rank, kstar_normalized, typology_by_class

N = 148

def make(kind, rng):
    if kind == "clustered":
        c = np.array([[i * 2.0, 0] for i in range(8)])
        a = np.concatenate([rng.normal(c[i], 0.35, (N // 4, 2)) for i in [0, 2, 4, 6]])
        b = np.concatenate([rng.normal(c[i], 0.35, (N // 4, 2)) for i in [1, 3, 5, 7]])
    elif kind == "overlapped":
        a = rng.normal([0, 0], 1.0, (N, 2))
        b = rng.normal([3, 0], 1.0, (N, 2))
    else:
        a = rng.normal([0, 0], 1.0, (N, 2))
        b = rng.normal([0, 0], 1.0, (N, 2))
    return np.vstack([a, b]), np.array([1] * len(a) + [0] * len(b))

def band(g):
    return "Fractured" if g > 0.5 else "Clustered" if g < -0.5 else "Overlapped"

for kind in ["clustered", "overlapped", "fractured"]:
    skews = []
    for seed in range(10):
        X, y = make(kind, np.random.default_rng(seed))
        t = typology_by_class(kstar_normalized(kstar_rank(X, y), y), y)
        skews.append(np.mean([t[c]["skewness"] for c in t]))
    med = float(np.median(skews))
    print(f"{kind:<11} median skew over 10 draws {med:+.2f}  ->  {band(med)}")

Output (deterministic):

clustered   median skew over 10 draws -2.02  ->  Clustered
overlapped  median skew over 10 draws +0.45  ->  Overlapped
fractured   median skew over 10 draws +1.97  ->  Fractured

e Reproducibility & status

status: Manuscript, 2026 — under submission preparation; arXiv link will replace the badge above.
available now: kstar.py (the instrument) and synthetic_check.py (its validation), both runnable as-is.
with the paper: The full extraction-and-analysis pipeline (activation caching, probes, transfer matrices, bootstrap and Mantel tests), per-run result JSONs, and the timestamped pre-registration of the 14B extension.
honesty: The two transfer results are negatives and are reported as negatives; the rigid-alignment association is additionally flagged as outlier-driven (Spearman ρ ≈ 0) and entangled with target capacity.