Data Pipelines

Data handling in Optimus-DL is designed to be highly flexible and modular, allowing for complex data processing pipelines to be constructed from reusable components. The core components are located in optimus_dl.modules.data.

The key components are:

  • Sources: Yield raw data items, such as lines from a text file or examples from a Hugging Face dataset.
  • Transforms: A chain of operations applied to the data, such as tokenization, chunking, shuffling, and batching.

For detailed information, see the Data API Reference.
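To make the Source → Transforms structure concrete, here is a minimal sketch in plain Python generators. All names here (`text_source`, `tokenize`, `batch`) are hypothetical illustrations of the pattern, not Optimus-DL's actual API.

```python
# Illustrative only: a source feeding a chain of transforms, each a generator.
# These names are hypothetical and do not reflect Optimus-DL's real API.

def text_source(lines):
    """Source: yields raw data items (here, lines of text)."""
    for line in lines:
        yield line

def tokenize(items, vocab):
    """Transform: map each text item to a list of token ids (0 = unknown)."""
    for item in items:
        yield [vocab.get(tok, 0) for tok in item.split()]

def batch(items, batch_size):
    """Transform: group consecutive items into batches of at most batch_size."""
    buf = []
    for item in items:
        buf.append(item)
        if len(buf) == batch_size:
            yield buf
            buf = []
    if buf:
        yield buf

vocab = {"hello": 1, "world": 2}
batches = list(batch(tokenize(text_source(["hello world", "world"]), vocab), batch_size=2))
# batches == [[[1, 2], [2]]]: one batch holding both tokenized items
```

Because each stage is a generator, items stream through the chain lazily, which is the property that lets such pipelines scale to datasets that do not fit in memory.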

Core Components

  • datasets: Contains various dataset implementations, including tokenized datasets, as well as utilities for handling different data formats.
  • presets: Provides predefined dataset configurations for common use cases.
  • transforms: Includes a wide range of data transformations, from tokenization to batching and device placement.

Pre-tokenized Datasets & Strategies

The TokenizedDataset is a high-performance dataset that streams tokens from memory-mapped numpy shards. It supports pluggable Sampling Strategies to control how documents are traversed:

  • document: Yields full documents as they appear in the dataset.
  • concat_random: Treats the entire dataset as a single concatenated stream of tokens, splits it into fixed-size chunks, and yields them in a globally random order. This is highly efficient for training as it ensures constant sequence lengths and full data utilization.
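The concat_random traversal can be sketched in a few lines of NumPy, under the assumption that the tokenized shard is exposed as a flat array of token ids (for example via `np.memmap`). The function name and signature below are illustrative, not Optimus-DL's implementation.

```python
import numpy as np

def concat_random_chunks(tokens, chunk_size, seed=0):
    """Treat `tokens` as one concatenated stream, split it into fixed-size
    chunks, and yield the chunks in a globally shuffled order. Any trailing
    remainder shorter than chunk_size is dropped."""
    n_chunks = len(tokens) // chunk_size
    order = np.random.default_rng(seed).permutation(n_chunks)
    for i in order:
        yield tokens[i * chunk_size:(i + 1) * chunk_size]

tokens = np.arange(10)  # stand-in for a memory-mapped token shard
chunks = list(concat_random_chunks(tokens, chunk_size=4))
# Every chunk has constant length 4; only the 2-token remainder is unused.
```

This is why the strategy is efficient for training: every yielded sequence has exactly the same length (no padding), and apart from the sub-chunk remainder every token in the dataset is used.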

Efficient Batching

Optimus-DL provides specialized transforms for efficient token batching:

  • flat_batcher: Concatenates multiple variable-length sequences into a single flat tensor, accompanied by sequence length metadata. This avoids padding overhead and is compatible with FlashAttention and other kernel-level optimizations.
  • basic_batcher: A standard batcher that pads sequences to a fixed length; it can also operate in a flat mode that packs variable-length sequences into a single tensor.
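The idea behind flat batching can be sketched with NumPy: concatenate the sequences into one flat array and record the boundaries as cumulative lengths (the `cu_seqlens` convention used by FlashAttention's variable-length kernels). The helper below is a hedged illustration of the technique, not flat_batcher's actual code.

```python
import numpy as np

def flat_batch(seqs):
    """Concatenate variable-length sequences into one flat array and return
    cumulative sequence lengths marking where each sequence begins/ends."""
    flat = np.concatenate(seqs)
    cu_seqlens = np.cumsum([0] + [len(s) for s in seqs])
    return flat, cu_seqlens

flat, cu = flat_batch([np.array([1, 2, 3]), np.array([4]), np.array([5, 6])])
# flat == [1 2 3 4 5 6], cu_seqlens == [0 3 4 6]; sequence i occupies
# flat[cu[i]:cu[i+1]], so no padding tokens are ever materialized.
```

The cumulative-length metadata is all an attention kernel needs to keep sequences from attending across their boundaries, which is why this layout composes well with FlashAttention-style kernel-level optimizations.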