
🏢 University of Minnesota, Twin Cities

Unraveling the Gradient Descent Dynamics of Transformers
1273 words · 6 mins
AI Theory · Optimization
This paper shows that a sufficiently large embedding dimension and appropriate initialization guarantee convergence of gradient descent in Transformer training, and that Gaussian attention yields a more benign optimization landscape than Softmax attention.
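To make the contrast concrete, here is a minimal sketch of the two attention mechanisms being compared. It assumes "Gaussian attention" means replacing the softmax over scaled dot products with a Gaussian kernel of the query–key distance; the function names, the bandwidth parameter `sigma`, and the tiny random inputs are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard scaled dot-product attention: softmax-normalized scores.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def gaussian_attention(Q, K, V, sigma=1.0):
    # Gaussian-kernel attention (assumed form): weights decay with the
    # squared query-key distance instead of a softmax over dot products.
    sq_dists = ((Q[:, None, :] - K[None, :, :]) ** 2).sum(axis=-1)
    weights = np.exp(-sq_dists / (2.0 * sigma**2))
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 4, 8))  # 4 tokens, embedding dim 8
out_soft = softmax_attention(Q, K, V)
out_gauss = gaussian_attention(Q, K, V)
print(out_soft.shape, out_gauss.shape)  # (4, 8) (4, 8)
```

The Gaussian variant's weights are smooth, unnormalized functions of distance, which is the kind of structural difference the paper's landscape analysis exploits.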