Optimization

µP²: Effective Sharpness Aware Minimization Requires Layerwise Perturbation Scaling
·9260 words·44 mins
AI Generated AI Theory Optimization 🏒 University of Tübingen
µP²: Layerwise perturbation scaling in SAM enables hyperparameter transfer and improved generalization in large models.
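As a reminder of the mechanism SAM builds on, here is a minimal sketch of one SAM update with a per-layer perturbation scale; the scaling factors and per-tensor normalization are illustrative assumptions, not the paper's exact µP² rule.

```python
# Illustrative SAM step with layerwise perturbation scaling (hypothetical scales,
# per-tensor normalization; not the exact µP² prescription).
import torch

def sam_step(model, loss_fn, data, target, base_rho=0.05, layer_scales=None, lr=0.1):
    params = [p for p in model.parameters() if p.requires_grad]
    scales = layer_scales or [1.0] * len(params)  # assumed per-layer factors

    # 1) Gradient at the current weights.
    loss_fn(model(data), target).backward()

    # 2) Ascend to the layerwise-scaled adversarial point w + e.
    eps = []
    with torch.no_grad():
        for p, s in zip(params, scales):
            e = base_rho * s * p.grad / (p.grad.norm() + 1e-12)
            p.add_(e)
            eps.append(e)
    model.zero_grad()

    # 3) Gradient at the perturbed point, then undo the perturbation and descend.
    loss_fn(model(data), target).backward()
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)            # restore the original weights
            p.sub_(lr * p.grad)  # SGD step using the SAM gradient
    model.zero_grad()
```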
Why Warmup the Learning Rate? Underlying Mechanisms and Improvements
·7149 words·34 mins
AI Generated AI Theory Optimization 🏒 University of Maryland
Learning rate warmup improves deep-learning performance by allowing the use of larger learning rates, pushing networks into better-conditioned regions of the loss landscape.
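For context, the schedule being analyzed typically looks like a linear ramp; the horizon and peak value below are placeholder numbers, not the paper's settings.

```python
def warmup_lr(step, warmup_steps=1000, peak_lr=3e-4):
    """Linear warmup: ramp the learning rate from ~0 to peak_lr over warmup_steps."""
    return peak_lr * min(1.0, (step + 1) / warmup_steps)
```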
Why Transformers Need Adam: A Hessian Perspective
·2407 words·12 mins
AI Theory Optimization 🏒 Chinese University of Hong Kong, Shenzhen, China
Adam’s superiority over SGD in Transformer training is explained by the ‘block heterogeneity’ of the Hessian matrix, highlighting the need for adaptive learning rates.
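As background for why adaptivity helps with such heterogeneity, Adam rescales every coordinate's step by a running estimate of gradient magnitude, so parameter blocks with very different curvature still receive comparably sized steps; a minimal single-step sketch (standard Adam, not code from the paper):

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: each coordinate's step is normalized by its own
    gradient-magnitude estimate, unlike SGD's uniform step size."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)   # bias-corrected second moment
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```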
Why the Metric Backbone Preserves Community Structure
·2073 words·10 mins
AI Theory Optimization 🏒 EPFL
Metric backbone graph sparsification surprisingly preserves community structure, offering an efficient and robust method for analyzing large networks.
Why Do We Need Weight Decay in Modern Deep Learning?
·3285 words·16 mins
AI Theory Optimization 🏒 EPFL
Weight decay’s role in modern deep learning is surprisingly multifaceted: it shapes optimization dynamics rather than acting solely as a regularizer, improving both generalization and training stability.
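The optimization-side effect alluded to here is easiest to see in the plain update rule, where the decay multiplicatively shrinks the weight norm every step and thereby changes the effective step size; a minimal sketch (generic SGD with decoupled decay, not the paper's experiments):

```python
def sgd_weight_decay(w, grad, lr=0.1, wd=5e-4):
    """SGD with weight decay: shrink the weights by (1 - lr*wd), then take a gradient step."""
    return (1.0 - lr * wd) * w - lr * grad
```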
Where Do Large Learning Rates Lead Us?
·5231 words·25 mins
AI Generated AI Theory Optimization 🏒 Constructor University
Unlocking optimal neural network training: A narrow range of initially high learning rates, slightly above the convergence threshold, consistently yields superior generalization after fine-tuning.
When Is Inductive Inference Possible?
·1470 words·7 mins
AI Theory Optimization 🏒 Princeton University
This paper provides a tight characterization of inductive inference, proving it’s possible if and only if the hypothesis class is a countable union of online learnable classes, resolving a long-standi…
What type of inference is planning?
·1424 words·7 mins
AI Theory Optimization 🏒 Google DeepMind
Planning is redefined as a distinct inference type within a variational framework, enabling efficient approximate planning in complex environments.
What does guidance do? A fine-grained analysis in a simple setting
·3498 words·17 mins
AI Theory Optimization 🏒 Duke University
Diffusion guidance, a common generative modeling technique, is shown to not sample from its intended distribution; instead, it heavily biases samples towards the boundary of the conditional distributi…
Warm-starting Push-Relabel
·1936 words·10 mins
AI Theory Optimization 🏒 UC Berkeley
This research introduces the first theoretical guarantees for warm-starting the celebrated Push-Relabel network flow algorithm, improving its speed using a predicted flow, while maintaining worst-case…
Variance estimation in compound decision theory under boundedness
·323 words·2 mins
AI Theory Optimization 🏒 University of Chicago
Unlocking the optimal variance estimation rate in compound decision theory under bounded means, this paper reveals a surprising (log log n / log n)² rate and introduces a rate-optimal cumulant-based est…
Validating Climate Models with Spherical Convolutional Wasserstein Distance
·2133 words·11 mins
AI Theory Optimization 🏒 University of Illinois Urbana-Champaign
Researchers developed Spherical Convolutional Wasserstein Distance (SCWD) to more accurately validate climate models by considering spatial variability and local distributional differences.
User-Creator Feature Polarization in Recommender Systems with Dual Influence
·2172 words·11 mins
AI Theory Optimization 🏒 Harvard University
Recommender systems, when influenced by both users and creators, inevitably polarize; however, prioritizing efficiency through methods like top-k truncation can surprisingly enhance diversity.
Unveiling User Satisfaction and Creator Productivity Trade-Offs in Recommendation Platforms
·1440 words·7 mins
AI Theory Optimization 🏒 University of Virginia
Recommendation algorithms on UGC platforms face a critical trade-off: prioritizing user satisfaction reduces creator engagement, jeopardizing long-term content diversity. This research introduces a ga…
Unrolled denoising networks provably learn to perform optimal Bayesian inference
·2411 words·12 mins
AI Generated AI Theory Optimization 🏒 Harvard University
Unrolled neural networks, trained via gradient descent, provably achieve optimal Bayesian inference for compressed sensing, surpassing prior-aware counterparts.
Unraveling the Gradient Descent Dynamics of Transformers
·1273 words·6 mins
AI Theory Optimization 🏒 University of Minnesota, Twin Cities
This paper reveals how large embedding dimensions and appropriate initialization guarantee convergence in Transformer training, highlighting Gaussian attention’s superior landscape over Softmax.
Universality of AdaGrad Stepsizes for Stochastic Optimization: Inexact Oracle, Acceleration and Variance Reduction
·1717 words·9 mins
AI Theory Optimization 🏒 CISPA
Adaptive gradient methods using AdaGrad stepsizes achieve optimal convergence rates for convex composite optimization problems, handling inexact oracles, acceleration, and variance reduction without n…
Universal Online Convex Optimization with $1$ Projection per Round
·373 words·2 mins
Machine Learning Optimization 🏒 Nanjing University
This paper introduces a novel universal online convex optimization algorithm needing only one projection per round, achieving optimal regret bounds for various function types, including general convex…
Understanding the Gains from Repeated Self-Distillation
·2009 words·10 mins
Machine Learning Optimization 🏒 University of Washington
Repeated self-distillation significantly reduces excess risk in linear regression, achieving up to a factor-of-d improvement over single-step methods.
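A small, hypothetical sketch of what repeated self-distillation means in the linear/ridge setting: each round refits on the previous model's predictions instead of the original labels (the setup and hyperparameters are illustrative, not the paper's construction):

```python
import numpy as np

def ridge(X, y, lam=1.0):
    """Closed-form ridge regression estimator."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def repeated_self_distillation(X, y, rounds=3, lam=1.0):
    """Refit ridge regression on the previous round's predictions (soft labels)."""
    targets, w = y, None
    for _ in range(rounds):
        w = ridge(X, targets, lam)
        targets = X @ w  # soft labels for the next round
    return w
```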
Ultrafast classical phylogenetic method beats large protein language models on variant effect prediction
·2536 words·12 mins
AI Generated AI Theory Optimization 🏒 UC Berkeley
A revolutionary ultrafast phylogenetic method outperforms protein language models in variant effect prediction by efficiently estimating amino acid substitution rates from massive datasets.