huggingface
optimus_dl.modules.data.datasets.huggingface
HuggingFaceDataset
Bases: BaseDataset
Dataset wrapper for Hugging Face Hub datasets.
This class integrates with the Hugging Face datasets library, supporting:
- Streaming: Automatically enables streaming for efficient loading of large datasets without downloading everything.
- Distributed Sharding: Uses `split_dataset_by_node` to ensure each rank sees a unique portion of the data.
- Checkpointing: Tracks current position to allow resuming from the middle of a stream.
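The sharding idea can be illustrated without the `datasets` library. The following is a minimal pure-Python sketch, not this class's implementation: `shard_for_rank` is a hypothetical helper, and the round-robin assignment (each rank keeps one item out of every `world_size`) is an assumption about how a single stream might be divided among ranks.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def shard_for_rank(stream: Iterable[T], rank: int, world_size: int) -> Iterator[T]:
    """Yield every world_size-th item, starting at offset `rank` (hypothetical helper)."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    return islice(stream, rank, None, world_size)


# Two ranks reading the same ten-item stream see disjoint portions.
shards = [list(shard_for_rank(range(10), rank, 2)) for rank in range(2)]
```

Together the two shards cover the whole stream exactly once, which is the property the distributed sharding bullet describes.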
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| cfg | | Hugging Face dataset configuration. | required |
| rank | int | Distributed rank. | required |
| world_size | int | Total number of ranks. | required |
Source code in optimus_dl/modules/data/datasets/huggingface.py
get_state()
Return the current position and configuration for checkpointing.
next()
reset(initial_state=None)
Initialize or restore the dataset stream.
Configures streaming, performs distributed sharding, and skips to the saved position if restoring from a checkpoint.
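The interplay of `get_state()`, `reset()`, and `next()` can be sketched with a toy class. This is an illustration under stated assumptions, not the actual `HuggingFaceDataset` code: `ToyStreamDataset` and the `"position"` state key are hypothetical, and the sketch saves only the position, whereas the documented `get_state()` also returns configuration.

```python
from typing import Any, Iterable, Iterator, Optional


class ToyStreamDataset:
    """Hypothetical stand-in that mimics the get_state/reset/next shape."""

    def __init__(self, source: Iterable[Any]) -> None:
        self._source = source  # must be re-iterable, e.g. a list
        self._iterator: Iterator[Any] = iter(())
        self.position = 0
        self.reset()

    def get_state(self) -> dict:
        # Only the position is saved in this sketch.
        return {"position": self.position}

    def reset(self, initial_state: Optional[dict] = None) -> None:
        # Start a fresh pass over the source.
        self._iterator = iter(self._source)
        self.position = 0
        if initial_state is not None:
            # Skip the items that were consumed before the checkpoint.
            for _ in range(initial_state["position"]):
                next(self._iterator)
            self.position = initial_state["position"]

    def next(self) -> Any:
        item = next(self._iterator)
        self.position += 1
        return item


first = ToyStreamDataset(["a", "b", "c", "d"])
first.next()
first.next()
checkpoint = first.get_state()

# A new instance restored from the checkpoint continues at the third item.
restored = ToyStreamDataset(["a", "b", "c", "d"])
restored.reset(checkpoint)
next_item = restored.next()
```

Skipping forward item by item is simple but costs time proportional to the saved position, which is the usual trade-off when resuming a stream that has no random access.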
HuggingFaceDatasetConfig
dataclass
Bases: RegistryConfigStrict
Configuration for Hugging Face datasets.
Attributes:
| Name | Type | Description |
|---|---|---|
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset_load_kwargs | dict | | '???' |
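The default `'???'` appears to be the marker used by OmegaConf-style configs for a mandatory value, so `dataset_load_kwargs` most likely has to be supplied explicitly. Assuming the dictionary is forwarded to `datasets.load_dataset` (the name suggests this, but the page does not confirm it), a value might look like the sketch below. The dataset name and configuration are illustrative only.

```python
# Hypothetical value for dataset_load_kwargs.
# The keys mirror keyword arguments of datasets.load_dataset (an assumption
# about how this config field is consumed).
dataset_load_kwargs = {
    "path": "wikitext",           # dataset repository on the Hub (illustrative)
    "name": "wikitext-2-raw-v1",  # dataset configuration (illustrative)
    "split": "train",             # which split to read
}
```

No `streaming` key is included because the class description states that streaming is enabled automatically.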