Alex Shapiro
Jan 14 2026
Things I Tried Today
Worked
- Using ground truth optimal actions to measure the relative advantage of optimal vs. suboptimal actions (first sketch after this list)
- Supervised learning from RL rollouts (only works when the env exposes a ground truth optimal action, but it is highly effective and learns in far fewer epochs; second sketch after this list)
- Tracking loss components to make sure no single component dominates the others (third sketch after this list)
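
A minimal sketch of the advantage-gap measurement, assuming a hypothetical `agent.advantages(state)` that returns per-action advantage estimates and an `env.optimal_action(state)` ground truth oracle; both names are placeholders, not a real API:

```python
import numpy as np

def advantage_gap(agent, env, states):
    """Mean gap between the advantage of the ground-truth optimal action
    and the best suboptimal action, over a batch of states."""
    gaps = []
    for s in states:
        adv = np.asarray(agent.advantages(s))   # assumed: one A(s, a) per action
        a_star = env.optimal_action(s)          # assumed ground-truth oracle
        best_subopt = max(v for i, v in enumerate(adv) if i != a_star)
        gaps.append(adv[a_star] - best_subopt)
    return float(np.mean(gaps))
```

A positive mean gap means the policy already ranks the optimal action above every alternative, so this doubles as a cheap sanity check during training.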
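For the supervised pass over RL rollouts, this is roughly what I mean, sketched in PyTorch under the assumption that `policy` maps a batch of states to action logits and `env.optimal_action(s)` is the same hypothetical oracle as above:

```python
import torch
import torch.nn.functional as F

def supervised_from_rollouts(policy, optimizer, rollout_states, env, epochs=3):
    """Behavior-cloning-style pass: label states collected from RL rollouts
    with the env's ground-truth optimal action, then train with cross-entropy."""
    states = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s in rollout_states])
    labels = torch.tensor([env.optimal_action(s) for s in rollout_states])
    for _ in range(epochs):
        logits = policy(states)                 # assumed shape: (batch, num_actions)
        loss = F.cross_entropy(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return loss.item()
```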
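The loss-component tracking is just bookkeeping; a sketch of the check I have in mind, where the 0.9 dominance threshold is an arbitrary choice:

```python
def check_loss_balance(components, dominance_ratio=0.9):
    """components: dict mapping loss name -> scalar value for one step.
    Returns each component's share of the total and warns if any single
    component accounts for more than `dominance_ratio` of it."""
    total = sum(abs(v) for v in components.values()) or 1e-8
    shares = {name: abs(v) / total for name, v in components.items()}
    for name, share in shares.items():
        if share > dominance_ratio:
            print(f"warning: {name} is {share:.0%} of the total loss")
    return shares
```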
Did Not Work
- Using fancy reward structures to widen the advantage gap between optimal and suboptimal actions
- RL in a structurally imbalanced environment
Meh
- Modifying environments to create even odds of winning