🏢 University of Minnesota, Twin Cities
Unraveling the Gradient Descent Dynamics of Transformers
AI Theory
Optimization
This paper shows that sufficiently large embedding dimensions, combined with appropriate initialization, guarantee the convergence of gradient descent in Transformer training, and that Gaussian attention yields a more benign optimization landscape than Softmax attention.
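To make the Softmax-versus-Gaussian contrast concrete, here is a minimal NumPy sketch of the two attention mechanisms being compared. This is an illustrative form only: it assumes "Gaussian attention" means replacing the softmax over dot-product scores with an unnormalized Gaussian kernel on query-key distances; the paper's exact parameterization may differ.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard scaled dot-product attention with softmax-normalized weights."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # Numerically stable softmax over each query's scores.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def gaussian_attention(Q, K, V):
    """Gaussian-kernel attention: weights from squared query-key distances.

    Illustrative sketch only; no softmax normalization is applied.
    """
    sq_dists = ((Q[:, None, :] - K[None, :, :]) ** 2).sum(axis=-1)
    weights = np.exp(-sq_dists)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4, 8))  # 4 tokens, embedding dim 8
out_soft = softmax_attention(Q, K, V)
out_gauss = gaussian_attention(Q, K, V)
print(out_soft.shape, out_gauss.shape)
```

The key structural difference is that the Gaussian kernel depends on the distance between queries and keys rather than a normalized inner product, which is what gives it a different (and, per the paper, more favorable) loss landscape under gradient descent.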