🏢 University of Minnesota, Twin Cities
Unraveling the Gradient Descent Dynamics of Transformers
AI Theory
Optimization
This paper shows that sufficiently large embedding dimensions, combined with appropriate initialization, guarantee the convergence of gradient descent in Transformer training, and that Gaussian attention yields a more benign optimization landscape than Softmax attention.
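To make the Softmax-versus-Gaussian contrast concrete, here is a minimal NumPy sketch of the two attention mechanisms being compared. This is an illustrative form only: it assumes "Gaussian attention" means replacing the softmax over dot-product scores with an unnormalized Gaussian kernel on query-key distances; the paper's exact parameterization may differ.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard scaled dot-product attention with softmax-normalized weights."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # Numerically stable softmax over each query's scores.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def gaussian_attention(Q, K, V):
    """Gaussian-kernel attention: weights from squared query-key distances.

    Illustrative sketch only; no softmax normalization is applied.
    """
    sq_dists = ((Q[:, None, :] - K[None, :, :]) ** 2).sum(axis=-1)
    weights = np.exp(-sq_dists)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4, 8))  # 4 tokens, embedding dim 8
out_soft = softmax_attention(Q, K, V)
out_gauss = gaussian_attention(Q, K, V)
print(out_soft.shape, out_gauss.shape)
```

The key structural difference is that the Gaussian kernel depends on the distance between queries and keys rather than a normalized inner product, which is what gives it a different (and, per the paper, more favorable) loss landscape under gradient descent.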