Sparse maximal update parameterization: A holistic approach to sparse training dynamics
·3095 words·15 mins
AI Generated
Machine Learning
Deep Learning
🏒 Cerebras Systems
SΞΌPar, a novel parameterization technique, stabilizes sparse neural network training, slashing tuning costs and boosting performance, especially at high sparsity levels.
Normalization Layer Per-Example Gradients are Sufficient to Predict Gradient Noise Scale in Transformers
·2502 words·12 mins
Natural Language Processing
Large Language Models
🏒 Cerebras Systems
By integrating per-example gradient-norm calculations into the backward pass of LayerNorm layers, this research enables efficient and accurate gradient noise scale estimation in Transformer…