Configuration Guide

Optimus-DL uses Hydra for its configuration system, enabling flexible, hierarchical, and composable setups. All training configurations are located in the configs/ directory.
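
Based on the components referenced throughout this guide, the configs/ directory might be laid out roughly as follows (an illustrative sketch; the actual tree may differ):

configs/
  train_llama.yaml          # Main training config
  defaults.yaml             # Shared defaults
  model/
    llama2.yaml
    gpt2.yaml
  lr_scheduler/
    wsd.yaml
  criterion/
    cross_entropy.yaml
  loggers/
    basic.yaml
  optimization/
    amp/
      bfloat16.yaml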

Core Concepts

1. Hierarchical Structure

Configurations are built in layers. A main training config (e.g., train_llama.yaml) specifies defaults for different components like the model, optimizer, and data.

Example from train_llama.yaml:

defaults:
  - defaults.yaml
  - lr_scheduler: wsd
  - model: llama2
  - criterion: cross_entropy
  - loggers: basic
  - optimization/amp: bfloat16
  - _self_

Each item in the defaults list points to another YAML file, allowing you to mix and match components easily.
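
For example, selecting lr_scheduler: wsd loads configs/lr_scheduler/wsd.yaml, which Hydra merges under the lr_scheduler key by default. A hypothetical sketch of such a file (the field names are illustrative, not taken from the repository):

# configs/lr_scheduler/wsd.yaml (illustrative; the real fields may differ)
_name: wsd
warmup_steps: 1000
decay_steps: 2000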

2. The args Section

We use a dedicated args section to hold high-level variables that are reused throughout the configuration. It is the single source of truth for important parameters such as batch size, sequence length, and vocabulary size.

args:
  name: llama-finetune
  batch_size: 64
  seq_len: 1024
  vocab_size: 32000

These values are then referenced in other parts of the config using OmegaConf's interpolation syntax (${...}).

3. Interpolation

Interpolation is key to keeping the configuration DRY (Don't Repeat Yourself): define a value once in args, then reference it wherever it is needed.

model:
  vocab_size: ${args.vocab_size} # From args

data:
  train_datasets:
    transform:
      _name: flat_batcher
      batch_size: ${args.batch_size} # From args
      seq_len: ${args.seq_len}       # From args

You can also evaluate simple expressions with the eval resolver:

args:
  global_batch_size: 128
  num_devices: 8
  # Per-device batch size, derived from the two values above
  per_device_batch_size: ${eval:"int(${args.global_batch_size} / ${args.num_devices})"}
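
The same pattern works for any derived quantity. For instance (an illustrative sketch, not taken from the shipped configs), you could keep the warmup length proportional to the total iteration count:

args:
  iterations: 10000
  # Warm up for 5% of training; derived once, reused wherever needed
  warmup_steps: ${eval:"int(${args.iterations} * 0.05)"}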

Key Configuration Sections

model

This section defines the model architecture. The _name key determines which model to build (e.g., llama2, gpt2). Other parameters are specific to the model, such as the number of layers, hidden size, and number of attention heads.

model:
  _name: llama2
  vocab_size: ${args.vocab_size}
  n_layer: 12
  n_head: 12
  hidden_dim: 768
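
Swapping architectures is then just a matter of choosing a different file from the model group (as in the command-line example at the end of this guide). A hypothetical configs/model/gpt2.yaml might look like this; the parameter names mirror the llama2 example above and may not match the real file:

# configs/model/gpt2.yaml (hypothetical; check the repository for the real file)
_name: gpt2
vocab_size: ${args.vocab_size}
n_layer: 12
n_head: 12
hidden_dim: 768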

data

This section defines the entire data pipeline, including training and evaluation datasets. It typically contains:

- train_datasets and eval_datasets: define the data sources and transforms.
- scratch: a reusable space for defining complex transform chains that can be referenced via interpolation.

data:
  scratch:
    # Define a reusable transform chain
    my_transform:
      _name: compose
      transforms:
        - _name: tokenize
          tokenizer_config:
            _name: tiktoken
            name: gpt2
        - _name: chunk_tokens
          max_seq_len: ${args.seq_len}
        - _name: shuffle
          buffer_size: 8096
        - _name: flat_batcher
          batch_size: ${args.batch_size}
          seq_len: ${args.seq_len}
        - _name: prefetch
        - _name: to_device

  train_datasets:
    source:
      _name: loop
      inner:
        _name: preset_slimpajama6b
        split: train
    transform: ${data.scratch.my_transform} # Reference the chain
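
Evaluation datasets can reference the same chain, which guarantees that training and evaluation batches are built identically. A sketch, assuming the preset also provides a validation split:

data:
  eval_datasets:
    source:
      _name: preset_slimpajama6b
      split: validation  # Assumed split name; check the preset
    # Reuse the exact chain defined in data.scratch
    transform: ${data.scratch.my_transform}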

optimization

This section controls the optimization process, including the optimizer, learning rate scheduler, and gradient clipping.

# Optimization settings
optimization:
  iterations: ${args.iterations}
  acc_steps: 1          # Gradient accumulation steps
  clip_grad_norm: 5.0   # Gradient clipping norm

  optimizer:
    _name: adamw
    lr: 5e-4            # Base learning rate
    weight_decay: 1e-1
    betas: [0.9, 0.99]  # Adam beta parameters
    eps: 1e-8           # Adam epsilon
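
Note that acc_steps multiplies the effective batch size: with a per-step batch of 64 and acc_steps: 4, each optimizer update sees 256 sequences. If you want that number stated explicitly, the eval resolver from earlier can derive it (an illustrative pattern, not from the shipped configs):

args:
  batch_size: 64
  acc_steps: 4
  # Sequences seen per optimizer update = batch_size * acc_steps
  effective_batch_size: ${eval:"${args.batch_size} * ${args.acc_steps}"}

optimization:
  acc_steps: ${args.acc_steps}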

Command-Line Overrides

One of Hydra's most powerful features is the ability to override any configuration value from the command line.

# Override the learning rate and batch size
python scripts/train.py \
  optimization.optimizer.lr=0.01 \
  args.batch_size=32

# Swap out the entire model for GPT-2
python scripts/train.py model=gpt2

This makes experimentation fast: any parameter can be changed without editing the underlying configuration files.
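
Assuming scripts/train.py uses Hydra's standard entry point, two more override features are worth knowing: --multirun launches one run per value in a comma-separated sweep, and a leading + adds a key that does not exist in the base config.

# Sweep over three learning rates (one run per value)
python scripts/train.py --multirun optimization.optimizer.lr=1e-4,3e-4,1e-3

# Add a new key that is absent from the base config
python scripts/train.py +args.notes=baseline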