← Devansh Shukla — all projects

credit-assignment-bakeoff

NumPy bake-off of TD return estimators under delayed reward.

a Summary

Asks whether the best return estimator for value prediction changes measurably as reward delivery is delayed. TD(0), n-step TD, TD(λ) with eligibility traces, and GAE are implemented from scratch in NumPy, with the unifying identities (λ=0 ≡ TD(0), λ=1 ≡ Monte Carlo, λ-return ≡ n-step mixture, GAE ≡ λ-return − V) encoded as pytest tests.
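As a rough illustration of what such identity tests can look like, the sketch below checks three of them on random data: λ=0 against one-step TD targets, λ=1 against discounted Monte Carlo returns, and an intermediate λ against the explicit n-step mixture. The helper names `n_step_return` and `mixture_of_n_step` are hypothetical and not taken from the repository; `lambda_return` is restated so the snippet runs on its own, and `values` is assumed to have one more entry than `rewards`.

```python
import numpy as np

def lambda_return(rewards, values, gamma, lam):
    """Backward lambda-return recursion; values has len(rewards) + 1 entries."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    T = len(rewards)
    out = np.zeros(T)
    g = values[T]  # bootstrap from the value after the last reward
    for t in reversed(range(T)):
        g = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * g)
        out[t] = g
    return out

def n_step_return(rewards, values, gamma, t, n):
    """n-step return from time t: n discounted rewards plus a bootstrap."""
    discounts = gamma ** np.arange(n)
    return discounts @ rewards[t:t + n] + gamma ** n * values[t + n]

def mixture_of_n_step(rewards, values, gamma, lam):
    """Lambda-return written as an explicit geometric mixture of n-step returns."""
    T = len(rewards)
    out = np.zeros(T)
    for t in range(T):
        horizon = T - t
        for n in range(1, horizon):
            weight = (1 - lam) * lam ** (n - 1)
            out[t] += weight * n_step_return(rewards, values, gamma, t, n)
        # the longest return absorbs all remaining weight
        out[t] += lam ** (horizon - 1) * n_step_return(rewards, values, gamma, t, horizon)
    return out

rng = np.random.default_rng(0)
T, gamma = 6, 0.9
rewards = rng.normal(size=T)
values = rng.normal(size=T + 1)
values[-1] = 0.0  # treat the final state as terminal

# lambda = 0 should give the one-step TD targets
td0_targets = rewards + gamma * values[1:]

# lambda = 1 should give discounted Monte Carlo returns
mc_returns = np.zeros(T)
g = 0.0
for t in reversed(range(T)):
    g = rewards[t] + gamma * g
    mc_returns[t] = g

assert np.allclose(lambda_return(rewards, values, gamma, 0.0), td0_targets)
assert np.allclose(lambda_return(rewards, values, gamma, 1.0), mc_returns)
assert np.allclose(lambda_return(rewards, values, gamma, 0.7),
                   mixture_of_n_step(rewards, values, gamma, 0.7))
```

The mixture form is quadratic in episode length, which is one reason to prefer the backward recursion in practice and keep the mixture only as a test oracle.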

Correctness is validated by reproducing the Sutton & Barto 19-state random-walk figure and checking the gridworld learner against closed-form values from a linear solve. The main experiment sweeps 5 λ × 4 delays × 4 step sizes × 20 seeds = 1,600 runs with α tuned per cell.
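The closed-form check amounts to solving the Bellman equation (I − γP)v = r directly. The sketch below applies that idea to a 19-state random walk rather than the project's gridworld, whose layout is not described here. The reward convention (−1 on exiting left, +1 on exiting right, undiscounted) and the function name `random_walk_true_values` are assumptions made for illustration.

```python
import numpy as np

def random_walk_true_values(n_states=19, gamma=1.0):
    """Solve (I - gamma * P) v = r for a symmetric random walk.

    Assumed setup: move left or right with probability 0.5 each,
    reward -1 when stepping off the left end, +1 off the right end.
    """
    P = np.zeros((n_states, n_states))  # transitions among non-terminal states
    r = np.zeros(n_states)              # expected immediate reward per state
    for s in range(n_states):
        if s > 0:
            P[s, s - 1] = 0.5
        else:
            r[s] += 0.5 * -1.0          # left exit
        if s < n_states - 1:
            P[s, s + 1] = 0.5
        else:
            r[s] += 0.5 * 1.0           # right exit
    return np.linalg.solve(np.eye(n_states) - gamma * P, r)

true_values = random_walk_true_values()

def rms_error(estimate, target):
    """Root-mean-square error across states, the metric used for comparison."""
    return float(np.sqrt(np.mean((np.asarray(estimate) - target) ** 2)))
```

Under these assumptions the solution is linear in the state index, running from −0.9 to 0.9, which gives a learner an exact target to be scored against.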

b Results

Key results of credit-assignment-bakeoff

| measurement | value | note |
| --- | --- | --- |
| optimal λ vs delay | 0.8 → 1.0 | shifts toward Monte Carlo as delay grows (delay 0 → 8) |
| delay-aware λ vs TD(0), RMS | 0.118 vs 0.196 · 0.152 vs 0.234 · 0.132 vs 0.231 | delays 2/4/8; 95% bootstrap CIs disjoint at all three |
| TD(0) under delay | 0.116 → 0.231 | error roughly doubles from delay 0 to 8 |
| validation | n = 4 min RMS 0.0854 | reproduces the classic random-walk shape (intermediate n wins) |
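For readers unfamiliar with the "disjoint bootstrap CI" criterion, here is a minimal sketch of a percentile bootstrap over per-seed RMS scores. It is one common way to build such intervals and may differ from the project's exact procedure. The function name `bootstrap_ci` is hypothetical and the per-seed numbers are synthetic, not the project's data.

```python
import numpy as np

def bootstrap_ci(samples, n_boot=10_000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `samples`."""
    samples = np.asarray(samples, dtype=float)
    rng = np.random.default_rng(seed)
    # resample seeds with replacement, one row per bootstrap replicate
    idx = rng.integers(0, len(samples), size=(n_boot, len(samples)))
    means = samples[idx].mean(axis=1)
    lo, hi = np.quantile(means, [(1 - level) / 2, (1 + level) / 2])
    return float(lo), float(hi)

# Synthetic per-seed RMS errors for two estimators (illustrative only).
rng = np.random.default_rng(1)
rms_a = rng.normal(0.10, 0.01, size=20)
rms_b = rng.normal(0.20, 0.01, size=20)

ci_a = bootstrap_ci(rms_a)
ci_b = bootstrap_ci(rms_b)

# Two intervals are disjoint when one ends before the other begins.
disjoint = ci_a[1] < ci_b[0] or ci_b[1] < ci_a[0]
```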

cab_phase_diagram.png: Lambda-by-delay heatmap of RMS at best step size; stars mark the per-delay winner migrating from 0.8 toward 1.0.

cab_crossover.png: Delay-aware lambda vs TD(0) with bootstrap CI bands; the bands separate from delay 2 onward.

caveat: Prediction-only on tiny tabular environments; a delay-blind λ = 1 baseline already captures most of the benefit.

c Code

The single backward recursion that every estimator reduces to, from cab/returns.py. Full source and tests are on GitHub; the walkthrough notebook reproduces the results table above in Colab.

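import numpy as np

def lambda_return(rewards, values, gamma, lam):
    # rewards has T entries; values has T + 1, where values[T] is the
    # bootstrap value after the final reward (0 for a terminal state).
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    T = len(rewards)
    out = np.zeros(T)
    g = values[T]
    # Sweep backward so each step reuses the lambda-return of its successor.
    for t in reversed(range(T)):
        # Blend the one-step bootstrap (weight 1 - lam) with the longer return.
        g = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * g)
        out[t] = g
    return out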
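
The GAE identity from the summary (GAE ≡ λ-return − V) can be checked against this recursion. The sketch below is an illustration under the same array convention (`values` one entry longer than `rewards`). The `gae` function is a hypothetical standalone version, not necessarily the repository's, and `lambda_return` is repeated so the snippet runs on its own.

```python
import numpy as np

def lambda_return(rewards, values, gamma, lam):
    """Same backward recursion as above, repeated for self-containment."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    T = len(rewards)
    out = np.zeros(T)
    g = values[T]
    for t in reversed(range(T)):
        g = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * g)
        out[t] = g
    return out

def gae(rewards, values, gamma, lam):
    """Generalized advantage estimates via discounted sums of TD errors."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    T = len(rewards)
    deltas = rewards + gamma * values[1:] - values[:-1]  # one-step TD errors
    adv = np.zeros(T)
    a = 0.0
    for t in reversed(range(T)):
        a = deltas[t] + gamma * lam * a
        adv[t] = a
    return adv

rng = np.random.default_rng(0)
rewards = rng.normal(size=8)
values = rng.normal(size=9)
gamma, lam = 0.95, 0.8

advantages = gae(rewards, values, gamma, lam)
targets = lambda_return(rewards, values, gamma, lam)

# GAE should equal the lambda-return minus the value baseline at each step.
assert np.allclose(advantages, targets - values[:-1])
```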

stack: Python, NumPy, pandas, matplotlib, joblib, SciPy
tests: 15 pytest tests + result-level regression tests · GitHub Actions CI
notebook: notebooks/01_walkthrough.ipynb on Colab