🏢 Cerebras Systems

Sparse maximal update parameterization: A holistic approach to sparse training dynamics
·3095 words·15 mins
AI Generated · Machine Learning · Deep Learning · 🏢 Cerebras Systems
SµPar stabilizes sparse neural network training via a novel parameterization technique, slashing tuning costs and boosting performance, especially at high sparsity levels.
Normalization Layer Per-Example Gradients are Sufficient to Predict Gradient Noise Scale in Transformers
·2502 words·12 mins
Natural Language Processing · Large Language Models · 🏢 Cerebras Systems
By integrating per-example gradient-norm calculations into the backward pass of LayerNorm layers, this research enables efficient and accurate gradient noise scale estimation in Transformers.