data_builder
optimus_dl.recipe.train.builders.data_builder
Data builder mixin for building data pipelines.
DataBuilder
Builder class for constructing training and evaluation data pipelines.
Manages the creation of DataPipeline objects, ensuring correct distributed
sharding and iterator behavior (e.g., infinite loop for training, resettable
for evaluation).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `cfg` | `DataBuilderConfig` | Builder configuration. | required |
| `data_config` | `DataConfig` | Configuration for datasets and transforms. | required |
Source code in optimus_dl/recipe/train/builders/data_builder.py
build_eval_data(collective, **kwargs)
Build evaluation data pipelines.
Constructs a dictionary of pipelines for multiple evaluation datasets.
Uses LoaderIterResettable to allow repeated iteration over the same
validation sets.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `collective` | `Collective` | Distributed collective. | required |
| `**kwargs` | `Any` | Additional arguments. | `{}` |
Returns:
| Type | Description |
|---|---|
| `dict[str, EvalDataPipeline \| None]` | Dictionary mapping dataset names to DataPipelines. |
build_train_data(collective, **kwargs)
Build the training data pipeline.
Automatically injects rank and world_size for sharding. The resulting loader is configured to restart automatically on StopIteration, creating an infinite stream.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `collective` | `Collective` | Distributed collective for sharding info. | required |
| `**kwargs` | | Additional arguments passed to dataset builders. | `{}` |
Returns:
| Type | Description |
|---|---|
| `DataPipeline \| None` | A DataPipeline containing the dataset and loader. |
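The two behaviors described above, per-rank sharding and an infinite training stream, can be illustrated with a minimal, library-independent sketch. It does not use optimus_dl: the helpers `shard` and `infinite_stream` are hypothetical names, and strided sharding (`rank::world_size`) is an assumption chosen for illustration, since the sharding strategy actually used is not stated here.

```python
from itertools import islice
from typing import Iterator, Sequence


def shard(samples: Sequence[int], rank: int, world_size: int) -> list[int]:
    """Hypothetical helper: keep every world_size-th sample, offset by rank."""
    return list(samples[rank::world_size])


def infinite_stream(samples: Sequence[int]) -> Iterator[int]:
    """Hypothetical helper: restart from the beginning after each full pass."""
    if not samples:
        raise ValueError("cannot loop over an empty shard")
    while True:
        yield from samples


dataset = list(range(10))

# Each rank sees a disjoint slice of the dataset.
rank0 = shard(dataset, rank=0, world_size=4)  # [0, 4, 8]
rank1 = shard(dataset, rank=1, world_size=4)  # [1, 5, 9]

# The training stream never raises StopIteration; it wraps around instead.
first_seven = list(islice(infinite_stream(rank0), 7))  # [0, 4, 8, 0, 4, 8, 0]
```

Because the stream never ends, a training loop built on it would typically stop on a step budget rather than on loader exhaustion.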
DataBuilderConfig
dataclass
LoaderIterResettable
Bases: Loader
A Loader that resets its iterator every time `__iter__` is called.
This is essential for evaluation loops, where the same dataloader is iterated multiple times.
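To see why resetting on `__iter__` matters, the sketch below compares a bare iterator, which is exhausted after one pass, with a wrapper that returns a fresh iterator on every call. `ResettableLoader` is a hypothetical stand-in written in plain Python; it does not subclass the real `Loader` and is not the implementation of `LoaderIterResettable`.

```python
from typing import Iterable, Iterator


class ResettableLoader:
    """Hypothetical stand-in: returns a fresh iterator on every __iter__ call."""

    def __init__(self, source: Iterable[int]) -> None:
        self._source = source

    def __iter__(self) -> Iterator[int]:
        # Start again from the first element each time iteration begins.
        return iter(self._source)


# A bare iterator can only be consumed once.
one_shot = iter([1, 2, 3])
first_pass = list(one_shot)
second_pass = list(one_shot)  # already exhausted, so this is empty

# The resettable wrapper yields the same batches on every evaluation round.
loader = ResettableLoader([1, 2, 3])
eval_round_1 = list(loader)
eval_round_2 = list(loader)
```

With a one-shot iterator, the second evaluation round would silently run over zero batches; resetting on `__iter__` avoids that failure mode.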