Optimizers & Schedulers¶
Optimus-DL leverages PyTorch's standard optimizers and provides additional customized implementations and configurations in the optimus_dl.modules.optim and optimus_dl.modules.lr_scheduler packages. The framework is designed to easily integrate with different optimization strategies through the Hydra configuration system.
For a complete list of available optimizers and their configurations, see the Optimizer API Reference.
Key Optimizers¶
While any PyTorch optimizer can be used, Optimus-DL provides convenient configs and wrappers for the following:
adamw: The AdamW optimizer, a standard choice for training transformer models, with options for weight decay and other hyperparameters. It is the default optimizer for most training recipes.muon: The Muon optimizer, a momentum-based optimizer with Newton-Schulz iterations for improved convergence.soap: The SOAP (Shampoo with Adam in the Preconditioner subspace) optimizer, a second-order optimizer that uses Shampoo preconditioning combined with Adam updates for improved training efficiency.
Learning Rate Schedulers¶
Optimus-DL provides a set of learning rate schedulers tailored for large-scale language model training.
wsd(Warmup-Stable-Decay): A popular scheduler for pre-training that consists of three phases:- Warmup: Linear increase from
base_lr / init_div_factortobase_lr. - Stable: Constant learning rate at
base_lr. - Decay (Cooldown): Final decay to a target LR. Supports multiple shapes: linear, cosine, exponential, and piecewise linear.
- Warmup: Linear increase from
cosine_annealing: Standard cosine annealing with optional warmup.linear_warmup: Simple linear warmup followed by a constant or decaying learning rate.
WSD Configuration Example¶
lr_scheduler:
_name: wsd
warmup_steps: 2000
fract_decay: 0.1
decay_type: cosine
final_lr_factor: 0.1
Optimizer Configuration¶
Optimizers are configured through the Hydra configuration system. Each optimizer has its own set of hyperparameters that can be tuned for your specific use case.