🏢 Cerebras Systems

Sparse maximal update parameterization: A holistic approach to sparse training dynamics
·3095 words·15 mins
AI Generated · Machine Learning · Deep Learning · 🏢 Cerebras Systems
SµPar stabilizes sparse neural network training via a novel parameterization technique, slashing tuning costs and boosting performance, especially at high sparsity levels.
Normalization Layer Per-Example Gradients are Sufficient to Predict Gradient Noise Scale in Transformers
·2502 words·12 mins
Natural Language Processing · Large Language Models · 🏢 Cerebras Systems
By integrating per-example gradient-norm calculations into the backward pass of LayerNorm layers, this research enables efficient and accurate gradient noise scale estimation in Transformers.