// components/act3b.jsx
// Act 3b — Misconception: we're not imitating the winner.
//
// Big idea: in self-play, MCTS search is what produces improved policies.
// Boards from LOST games are just as valuable as boards from won games —
// both have π* targets that are better than what the net predicted.
// Contrast with NFSP (AlphaStar), where no cheap lookahead exists and the
// training signal reduces to "imitate the winner."

/**
 * Act 3b scene component. Renders a two-column contrast:
 *   left  — the "reinforce the winner" (NFSP-style) reading, which gets
 *           struck through;
 *   right — the AlphaGo approach: every position's MCTS policy π* is a
 *           training target, win or lose;
 * plus a bottom footnote on NFSP/AlphaStar.
 *
 * External dependencies (defined elsewhere in the app): useSprite, clamp,
 * SectionLabel, GoBoard, PolicyBars, plus CSS vars (--mono, --serif, --ink,
 * --ink-soft, --bg, --accent-mcts, --accent-mcts-bg).
 *
 * `localTime` from useSprite() drives all animation — presumably seconds
 * since this act became active (TODO confirm against useSprite).
 */
function Act3b_NotImitation() {
  const { localTime: lt } = useSprite();

  // Phased timing (~9s total). Each window below is the fade-in ramp
  // (start .. start + duration) of the matching clamp() call; elements
  // hold at full opacity after their ramp completes.
  // 0.0 - 0.5  title fades in
  // 0.4 - 1.1  subtitle
  // 1.3 - 2.1  WRONG column reveals (winner circled, loser discarded)
  // 3.5 - 4.2  strike-through the wrong column
  // 4.5 - 5.4  RIGHT column reveals (both games feed π* targets)
  // 6.8 - 7.5  NFSP footnote

  const wrongIn = clamp((lt - 1.3) / 0.8, 0, 1);
  const strike = clamp((lt - 3.5) / 0.7, 0, 1);
  const rightIn = clamp((lt - 4.5) / 0.9, 0, 1);
  const footIn = clamp((lt - 6.8) / 0.7, 0, 1);

  // Two scripted self-play games on a 3×3 board. Each entry is the
  // cumulative stone list after moves 1..4 (B = black/agent, W = white),
  // so rendering the four entries side by side reads as a game sequence.
  const gameA = [ // Agent wins
    [{x:1,y:1,c:'B'}],
    [{x:1,y:1,c:'B'},{x:0,y:2,c:'W'}],
    [{x:1,y:1,c:'B'},{x:0,y:2,c:'W'},{x:2,y:0,c:'B'}],
    [{x:1,y:1,c:'B'},{x:0,y:2,c:'W'},{x:2,y:0,c:'B'},{x:0,y:0,c:'W'}],
  ];
  const gameB = [ // Agent loses
    [{x:1,y:1,c:'B'}],
    [{x:1,y:1,c:'B'},{x:2,y:2,c:'W'}],
    [{x:1,y:1,c:'B'},{x:2,y:2,c:'W'},{x:0,y:1,c:'B'}],
    [{x:1,y:1,c:'B'},{x:2,y:2,c:'W'},{x:0,y:1,c:'B'},{x:1,y:0,c:'W'}],
  ];

  // Mini board stack: up to 4 overlapping boards suggesting a sequence.
  // Props:
  //   x, y         — SVG-space position of the label/stack anchor
  //   games        — array of stone lists (one per successive position)
  //   outcome      — label text; a 'W' badge is shown iff it contains 'WON',
  //                  otherwise 'L'
  //   outcomeColor — fill color for the round outcome badge
  //   treatment    — 'use' | 'discard' | 'use-all'; only 'discard' changes
  //                  rendering (grays out + dims the boards)
  //   opacity      — whole-group opacity, for fade-ins
  const BoardStack = ({ x, y, games, outcome, outcomeColor,
                       treatment /* 'use' | 'discard' | 'use-all' */,
                       opacity = 1 }) => {
    const boardSize = 52;
    const offset = 14; // horizontal step between stacked boards
    return (
      <g style={{ opacity }}>
        {/* Game label */}
        <text x={x} y={y - 18} fontFamily="var(--mono)" fontSize={10.5}
              fill="var(--ink-soft)" letterSpacing="0.05em">
          {outcome}
        </text>
        {/* Stack of boards — HTML inside SVG via foreignObject so GoBoard
            (an HTML/React component) can be positioned absolutely */}
        <foreignObject x={x} y={y} width={boardSize + offset * 3 + 10} height={boardSize + 8}>
          <div xmlns="http://www.w3.org/1999/xhtml"
               style={{ position: 'relative', height: boardSize }}>
            {games.slice(0, 4).map((stones, i) => (
              <div key={i} style={{
                position: 'absolute',
                left: i * offset, top: i * 2,
                filter: treatment === 'discard' ? 'grayscale(1) opacity(0.35)' : 'none',
                transition: 'filter 400ms',
              }}>
                <GoBoard n={3} size={boardSize} stones={stones} />
              </div>
            ))}
          </div>
        </foreignObject>
        {/* Outcome badge */}
        <g transform={`translate(${x + boardSize + offset * 3 + 16}, ${y + boardSize / 2})`}>
          <circle r={12} fill={outcomeColor} opacity={0.9} />
          <text y={3} textAnchor="middle" fontFamily="var(--mono)"
                fontSize={11} fill="var(--bg)" fontWeight={600}>
            {outcome.includes('WON') ? 'W' : 'L'}
          </text>
        </g>
      </g>
    );
  };

  return (
    <>
      {/* Header */}
      <div style={{ position: 'absolute', left: 200, top: 70, opacity: clamp(lt / 0.5, 0, 1) }}>
        <SectionLabel num="06" title="We're not imitating the winner" />
      </div>

      <div style={{ position: 'absolute', left: 200, top: 142, maxWidth: 1120,
                    opacity: clamp((lt - 0.4) / 0.7, 0, 1) }}>
        <div style={{ fontFamily: 'var(--serif)', fontSize: 17, lineHeight: 1.45,
                      color: 'var(--ink)', fontWeight: 400 }}>
          The goal of training isn't to copy winning trajectories. It's to distill the{' '}
          <em>MCTS-improved policy</em> π* into the network — and MCTS improves the
          policy <em>regardless of who wins the game</em>.
        </div>
      </div>

      {/* Two columns */}
      <div style={{ position: 'absolute', left: 200, top: 250,
                    display: 'flex', gap: 48, alignItems: 'flex-start' }}>

        {/* ─── WRONG column ─── */}
        <div style={{
          width: 470,
          opacity: wrongIn,
          position: 'relative', // anchor for the absolutely-positioned strike overlay
        }}>
          <div style={{
            fontFamily: 'var(--mono)', fontSize: 10.5,
            color: '#b23a2a', letterSpacing: '0.12em',
            textTransform: 'uppercase', marginBottom: 6,
          }}>
            ✗ Neural Fictitious Self-Play
          </div>
          <div style={{
            fontFamily: 'var(--serif)', fontSize: 15, lineHeight: 1.45,
            color: 'var(--ink)', marginBottom: 26,
          }}>
            "The winner played good moves. Keep those, discard the loser's."
          </div>

          <svg width={470} height={180} style={{ display: 'block' }}>
            <BoardStack x={0} y={20}
                        games={gameA}
                        outcome="GAME A — AGENT WON"
                        outcomeColor="#2d7a52"
                        treatment="use" />
            <BoardStack x={0} y={120}
                        games={gameB}
                        outcome="GAME B — AGENT LOST"
                        outcomeColor="#6a5d4a"
                        treatment="discard" />

            {/* Arrows into a bin — fade in slightly after the column itself */}
            <g opacity={clamp((wrongIn - 0.3) / 0.7, 0, 1)}>
              <text x={330} y={50} fontFamily="var(--mono)" fontSize={10}
                    fill="var(--ink-soft)">→ train</text>
              <text x={330} y={148} fontFamily="var(--mono)" fontSize={10}
                    fill="var(--ink-soft)" opacity={0.55}>→ 🗑 discard</text>
            </g>
          </svg>

          {/* Strike-through overlay — sized to cover just the column's
              content (header + subtitle + svg), so it doesn't leak into
              the NFSP footnote below. Both diagonals grow with `strike`. */}
          {strike > 0 && (
            <svg style={{
              position: 'absolute', inset: 0,
              pointerEvents: 'none',
            }} width={470} height={250}>
              <line x1={0} y1={60}
                    x2={470 * strike} y2={60 + 180 * strike}
                    stroke="#b23a2a" strokeWidth={2.5}
                    strokeLinecap="round" opacity={0.75} />
              <line x1={470 * strike} y1={60}
                    x2={0} y2={60 + 180 * strike}
                    stroke="#b23a2a" strokeWidth={2.5}
                    strokeLinecap="round" opacity={0.75} />
            </svg>
          )}
        </div>

        {/* ─── RIGHT column ─── */}
        <div style={{
          width: 540,
          opacity: rightIn,
        }}>
          <div style={{
            fontFamily: 'var(--mono)', fontSize: 10.5,
            color: 'var(--accent-mcts)', letterSpacing: '0.12em',
            textTransform: 'uppercase', marginBottom: 6,
          }}>
            ✓ What AlphaGo actually does
          </div>
          <div style={{
            fontFamily: 'var(--serif)', fontSize: 15, lineHeight: 1.45,
            color: 'var(--ink)', marginBottom: 26,
          }}>
            Every board (won <em>and</em> lost) has a π* from MCTS that's{' '}
            <em>better than the network's own prior</em>. Train on all of them.
          </div>

          <svg width={540} height={180} style={{ display: 'block' }}>
            <BoardStack x={0} y={20}
                        games={gameA}
                        outcome="GAME A — AGENT WON"
                        outcomeColor="#2d7a52"
                        treatment="use-all" />
            <BoardStack x={0} y={120}
                        games={gameB}
                        outcome="GAME B — AGENT LOST"
                        outcomeColor="#6a5d4a"
                        treatment="use-all" />

            {/* Converging arrows → π*: both game rows feed the same target box */}
            <g opacity={clamp((rightIn - 0.3) / 0.7, 0, 1)}>
              <path d="M 305 50 Q 360 90 400 108"
                    stroke="var(--accent-mcts)" strokeWidth={1.4}
                    fill="none" strokeLinecap="round" />
              <path d="M 305 148 Q 360 128 400 112"
                    stroke="var(--accent-mcts)" strokeWidth={1.4}
                    fill="none" strokeLinecap="round" />

              {/* π* bars (target) */}
              <g transform="translate(410, 85)">
                <rect x={-6} y={-8} width={110} height={52} rx={6}
                      fill="var(--accent-mcts-bg)"
                      stroke="var(--accent-mcts)" strokeWidth={0.8} />
                <foreignObject x={0} y={-4} width={100} height={42}>
                  <div xmlns="http://www.w3.org/1999/xhtml">
                    <PolicyBars values={[0.05, 0.08, 0.62, 0.20, 0.05]}
                                width={100} height={36}
                                color="var(--accent-mcts)" />
                  </div>
                </foreignObject>
                <text x={50} y={52} textAnchor="middle"
                      fontFamily="var(--mono)" fontSize={10.5}
                      fill="var(--accent-mcts)" fontWeight={500}>
                  π*   (target)
                </text>
              </g>
            </g>
          </svg>

          <div style={{
            marginTop: 10,
            fontFamily: 'var(--serif)', fontSize: 13, fontStyle: 'italic',
            color: 'var(--ink-soft)', lineHeight: 1.45,
            maxWidth: 450,
          }}>
            The game outcome z trains the <em>value head</em> — but the policy head
            trains on π* from every position, win or lose.
          </div>
        </div>
      </div>

      {/* NFSP footnote */}
      {footIn > 0 && (
        <div style={{
          position: 'absolute', left: 200, bottom: 70, right: 80,
          opacity: footIn,
          padding: '12px 16px 14px',
          borderTop: '1px solid rgba(31,26,20,0.12)',
          display: 'flex', gap: 24, alignItems: 'baseline',
        }}>
          <div style={{
            fontFamily: 'var(--mono)', fontSize: 10.5,
            color: 'var(--ink-soft)', letterSpacing: '0.1em',
            textTransform: 'uppercase', flexShrink: 0,
          }}>
            aside — NFSP
          </div>
          <div style={{
            fontFamily: 'var(--serif)', fontSize: 13.5, lineHeight: 1.55,
            color: 'var(--ink)', maxWidth: 900,
          }}>
            The "reinforce the winner" approach is called{' '}
            <strong>neural fictitious self-play</strong> — it's what you fall back on when
            lookahead search isn't tractable. It powers agents like{' '}
            <strong>AlphaStar</strong> in StarCraft II, where the branching factor and
            real-time constraints rule out MCTS.
          </div>
        </div>
      )}
    </>
  );
}

// Expose the act component globally for the non-module script loader.
window.Act3b_NotImitation = Act3b_NotImitation;
