Alex Shapiro
Jan 14 2026
Things I Tried Today
Worked
- Using ground truth optimal actions to measure the relative advantage of optimal vs. suboptimal actions (first sketch after this list)
- Supervised learning from RL rollouts (only works when the env exposes a ground truth optimal action, but it is highly effective and learns in far fewer epochs; second sketch after this list)
- Tracking loss components to make sure no single component dominates the others (third sketch after this list)
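
A minimal sketch of the advantage-gap measurement, assuming a hypothetical `agent.advantages(state)` that returns per-action advantage estimates and an `env.optimal_action(state)` ground truth oracle; both names are placeholders, not a real API:

```python
import numpy as np

def advantage_gap(agent, env, states):
    """Mean gap between the advantage of the ground-truth optimal action
    and the best suboptimal action, over a batch of states."""
    gaps = []
    for s in states:
        adv = np.asarray(agent.advantages(s))   # assumed: one A(s, a) per action
        a_star = env.optimal_action(s)          # assumed ground-truth oracle
        best_subopt = max(v for i, v in enumerate(adv) if i != a_star)
        gaps.append(adv[a_star] - best_subopt)
    return float(np.mean(gaps))
```

A positive mean gap means the policy already ranks the optimal action above every alternative, so this doubles as a cheap sanity check during training.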
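For the supervised pass over RL rollouts, this is roughly what I mean, sketched in PyTorch under the assumption that `policy` maps a batch of states to action logits and `env.optimal_action(s)` is the same hypothetical oracle as above:

```python
import torch
import torch.nn.functional as F

def supervised_from_rollouts(policy, optimizer, rollout_states, env, epochs=3):
    """Behavior-cloning-style pass: label states collected from RL rollouts
    with the env's ground-truth optimal action, then train with cross-entropy."""
    states = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s in rollout_states])
    labels = torch.tensor([env.optimal_action(s) for s in rollout_states])
    for _ in range(epochs):
        logits = policy(states)                 # assumed shape: (batch, num_actions)
        loss = F.cross_entropy(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return loss.item()
```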
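The loss-component tracking is just bookkeeping; a sketch of the check I have in mind, where the 0.9 dominance threshold is an arbitrary choice:

```python
def check_loss_balance(components, dominance_ratio=0.9):
    """components: dict mapping loss name -> scalar value for one step.
    Returns each component's share of the total and warns if any single
    component accounts for more than `dominance_ratio` of it."""
    total = sum(abs(v) for v in components.values()) or 1e-8
    shares = {name: abs(v) / total for name, v in components.items()}
    for name, share in shares.items():
        if share > dominance_ratio:
            print(f"warning: {name} is {share:.0%} of the total loss")
    return shares
```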
Did Not Work
- Using fancy reward structures to widen the advantage gap between optimal and suboptimal actions
- RL in a structurally imbalanced environment
Meh
- Modifying environments to create even odds of winning