# sharder

`optimus_dl.recipe.pretokenize.sharder`

Handles writing tokenized documents into size-limited shards on disk and creating the final index file.
## Sharder
Manages the creation of sharded dataset files.
Accumulates tokenized documents in memory until a size threshold is reached,
then flushes them to disk as numpy arrays (.npy). Also tracks metadata
for each shard to generate a global index file.
Features:
- Buffering: Minimizes disk I/O by batching writes.
- Size-based Splitting: Creates shards of approx. equal size (e.g., 512MB).
- Metadata Tracking: Records token counts and file paths for the index.
- Checkpoint Support: Can serialize internal state to resume processing.
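The buffering and size-based splitting described above can be sketched as follows. This is a minimal illustration, not the library's actual implementation: the class name `MiniSharder`, the shard file naming, the default 512 MB cap, and the metadata field names are all assumptions.

```python
import numpy as np
from pathlib import Path


class MiniSharder:
    """Illustrative buffering sharder (names and defaults are assumptions)."""

    def __init__(self, out_dir, max_shard_bytes=512 * 1024 * 1024, dtype=np.uint16):
        self.out_dir = Path(out_dir)
        self.out_dir.mkdir(parents=True, exist_ok=True)
        self.max_shard_bytes = max_shard_bytes
        self.dtype = np.dtype(dtype)
        self.buffer = []     # flat token buffer, flushed as one .npy array
        self.doc_lens = []   # per-document lengths for the _lens file
        self.shard_idx = 0
        self.shards = []     # metadata accumulated for the global index

    def add(self, doc_tokens):
        """Buffer one document; flush first if it would exceed the size cap."""
        projected = (len(self.buffer) + len(doc_tokens)) * self.dtype.itemsize
        flushed = False
        if self.buffer and projected > self.max_shard_bytes:
            self.flush()
            flushed = True
        self.buffer.extend(doc_tokens)
        self.doc_lens.append(len(doc_tokens))
        return flushed

    def flush(self):
        """Write the buffered tokens and document lengths to disk."""
        if not self.buffer:
            return
        stem = f"shard_{self.shard_idx:05d}"
        np.save(self.out_dir / f"{stem}.npy",
                np.asarray(self.buffer, dtype=self.dtype))
        np.save(self.out_dir / f"{stem}_lens.npy",
                np.asarray(self.doc_lens, dtype=np.int64))
        self.shards.append({"file": f"{stem}.npy", "num_tokens": len(self.buffer)})
        self.buffer, self.doc_lens = [], []
        self.shard_idx += 1
```

Checking the projected size *before* appending keeps every shard at or under the cap; flushing only when the buffer is non-empty avoids writing empty shard files.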
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `output_config` | `OutputConfig` | Output directory and naming configuration. | required |
| `proc_config` | `ProcessingConfig` | Processing settings (shard size, dtype). | required |
Source code in optimus_dl/recipe/pretokenize/sharder.py
### add(doc_tokens)
Add a tokenized document to the current shard.
If adding the document exceeds the maximum shard size, the current shard is flushed to disk first.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `doc_tokens` | `list[int]` | A list of integers representing the tokenized document. | required |
Returns:

| Type | Description |
|---|---|
| `bool` | `True` if a shard was flushed, `False` otherwise. |
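The boolean return value lets callers react to shard rollovers, for example to checkpoint after each completed shard. A hypothetical driver loop under that assumption (`sharder` and `docs` are stand-ins, not names from the library):

```python
def write_dataset(sharder, docs):
    """Feed token lists to a sharder; count how many shards were flushed."""
    flushes = 0
    for doc in docs:
        if sharder.add(doc):  # True means a shard was just written to disk
            flushes += 1
    return flushes
```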
### finalize(final_config)
Flush remaining data and write the global index file.
The index file (index.json) contains metadata for all shards and the
processing configuration, enabling the dataset to be loaded later.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `final_config` | `dict[str, Any]` | Configuration dictionary to embed in the index. | required |
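A sketch of what writing such an index file could look like. The field names (`shards`, `total_tokens`, `config`) are illustrative assumptions; the real `index.json` schema may differ.

```python
import json
from pathlib import Path


def write_index(out_dir, shards, final_config):
    """Write a global index.json from per-shard metadata (schema assumed)."""
    index = {
        "shards": shards,  # per-shard metadata collected during flushing
        "total_tokens": sum(s["num_tokens"] for s in shards),
        "config": final_config,  # embedded processing configuration
    }
    path = Path(out_dir) / "index.json"
    path.write_text(json.dumps(index, indent=2))
    return path
```

Embedding the processing configuration in the index means a loader can later verify it interprets the shards with the same dtype and settings used at write time.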
### flush()
Write the current accumulated tokens to a new shard file.
Saves two files:

- `name_XXXXX.npy`: the flat token array.
- `name_XXXXX_lens.npy`: array of document lengths for reconstruction.
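Given those two arrays, per-document token sequences can be recovered by splitting the flat array at the cumulative lengths. A small sketch (the function name is hypothetical):

```python
import numpy as np


def split_documents(tokens, lens):
    """Recover per-document arrays from a flat shard and its lengths array."""
    # np.cumsum(lens)[:-1] gives the boundaries between documents.
    return np.split(tokens, np.cumsum(lens)[:-1])
```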
### get_state()
Returns the sharder's current state for checkpointing.
### load_state(state)
Restores the sharder's state from a checkpoint.
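Together, `get_state()` and `load_state()` allow processing to resume mid-shard. A minimal round-trip sketch; the class name and state fields here are assumptions, not the real Sharder's state dict:

```python
class CheckpointableSharder:
    """Illustrative state round trip for resumable sharding (fields assumed)."""

    def __init__(self):
        self.buffer, self.doc_lens, self.shard_idx = [], [], 0

    def get_state(self):
        # Copy mutable members so later writes don't corrupt the checkpoint.
        return {"buffer": list(self.buffer),
                "doc_lens": list(self.doc_lens),
                "shard_idx": self.shard_idx}

    def load_state(self, state):
        self.buffer = list(state["buffer"])
        self.doc_lens = list(state["doc_lens"])
        self.shard_idx = state["shard_idx"]
```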