Predicting Transformer Training Instability
Researchers introduce RKSP, a method to predict transformer training divergence from a single forward pass, and KSS, a technique to actively prevent it, saving compute and enabling higher learning rates.
Feb 28 at 1:10 PM · 3 min read



