chunk_tokens
optimus_dl.modules.data.transforms.chunk_tokens
ChunkTransform
Bases: BaseTransform
Transform that splits variable-length documents into fixed-size chunks.
Useful when datasets yield full documents that are longer than the desired training sequence length.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| cfg | ChunkTransformConfig | Chunking configuration. | required |
Source code in optimus_dl/modules/data/transforms/chunk_tokens.py
ChunkTransformConfig
dataclass
Bases: RegistryConfigStrict
Configuration for chunking token sequences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| max_seq_len | int | | '???' |
| add_one_for_shift | bool | | True |
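The table gives no descriptions for these fields, so the following is only a sketch of one plausible reading. It assumes that `add_one_for_shift=True` makes each chunk one token longer than `max_seq_len`, so that inputs and next-token targets of length `max_seq_len` can be cut from the same chunk. `ChunkConfigSketch`, `effective_chunk_len`, and `split_inputs_targets` are hypothetical names used for illustration; they are not part of optimus_dl.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ChunkConfigSketch:
    """Hypothetical stand-in for ChunkTransformConfig (not the library class)."""

    max_seq_len: int                # no default here; the real config marks it '???'
    add_one_for_shift: bool = True  # default taken from the table above


def effective_chunk_len(cfg: ChunkConfigSketch) -> int:
    # ASSUMPTION: the extra token exists so a chunk can be shifted by one
    # position to form (inputs, targets) pairs of length max_seq_len.
    return cfg.max_seq_len + (1 if cfg.add_one_for_shift else 0)


def split_inputs_targets(chunk: List[int]) -> Tuple[List[int], List[int]]:
    # Standard next-token shift: targets are the inputs moved left by one.
    return chunk[:-1], chunk[1:]


cfg = ChunkConfigSketch(max_seq_len=4)
chunk = list(range(effective_chunk_len(cfg)))
inputs, targets = split_inputs_targets(chunk)
```

Under this assumption both `inputs` and `targets` end up with exactly `max_seq_len` tokens.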
ChunkTransformNode
Bases: BaseNode
Internal node for performing sequence chunking.
Maintains a buffer of tokens from the source node and yields segments of
length max_seq_len.
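The buffering behaviour described above can be sketched roughly as follows. This is a minimal, standalone illustration and not the optimus_dl implementation: `ChunkNodeSketch` and `drain` are hypothetical names, the source is assumed to be any iterable of token lists, and emitting a shorter chunk at the end of a document is an assumption about how a remainder might be handled.

```python
from typing import Iterable, List


class ChunkNodeSketch:
    """Hypothetical buffered chunker; not the library's ChunkTransformNode."""

    def __init__(self, source: Iterable[List[int]], chunk_len: int):
        self._source = iter(source)
        self._chunk_len = chunk_len
        self._buffer: List[int] = []

    def next(self) -> List[int]:
        # Refill from the source while the buffer is empty (skips empty documents).
        # StopIteration from the source propagates and ends the stream.
        while not self._buffer:
            self._buffer = list(next(self._source))
        chunk = self._buffer[: self._chunk_len]
        self._buffer = self._buffer[self._chunk_len :]
        # ASSUMPTION: a document tail shorter than chunk_len is emitted as-is.
        return chunk


def drain(node: ChunkNodeSketch) -> List[List[int]]:
    """Collect every chunk until the source is exhausted."""
    chunks = []
    while True:
        try:
            chunks.append(node.next())
        except StopIteration:
            return chunks


docs = [[1, 2, 3, 4, 5], [], [6, 7]]
chunks = drain(ChunkNodeSketch(docs, chunk_len=2))
```

In this sketch the five-token document produces two full chunks and a one-token tail before the next document is pulled into the buffer.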
get_state()
Collect current buffer and source state for checkpointing.
next()
Yield the next chunk of tokens, refilling the buffer if empty.
reset(initial_state=None)
Restore the buffer and source node state.
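To illustrate why both the buffer and the source position belong in a checkpoint, here is a hedged sketch of a `get_state()` / `reset(initial_state)` round trip. `CheckpointableChunker` is a hypothetical class, the source is simplified to an in-memory list tracked by index, and the state dictionary keys (`buffer`, `doc_index`) are assumptions rather than the library's actual state layout.

```python
from typing import Any, Dict, List, Optional


class CheckpointableChunker:
    """Hypothetical chunker with checkpointable state; not the library node."""

    def __init__(self, docs: List[List[int]], chunk_len: int):
        self._docs = docs
        self._chunk_len = chunk_len
        self.reset()

    def reset(self, initial_state: Optional[Dict[str, Any]] = None) -> None:
        # Restore the buffer and source position, or start from scratch.
        state = initial_state or {"buffer": [], "doc_index": 0}
        self._buffer = list(state["buffer"])
        self._doc_index = state["doc_index"]

    def get_state(self) -> Dict[str, Any]:
        # Copy the buffer so later calls to next() do not mutate the snapshot.
        return {"buffer": list(self._buffer), "doc_index": self._doc_index}

    def next(self) -> List[int]:
        while not self._buffer:
            if self._doc_index >= len(self._docs):
                raise StopIteration
            self._buffer = list(self._docs[self._doc_index])
            self._doc_index += 1
        chunk = self._buffer[: self._chunk_len]
        self._buffer = self._buffer[self._chunk_len :]
        return chunk


docs = [[1, 2, 3], [4, 5, 6, 7]]
original = CheckpointableChunker(docs, chunk_len=2)
first = original.next()
state = original.get_state()   # taken mid-document, with tokens still buffered

resumed = CheckpointableChunker(docs, chunk_len=2)
resumed.reset(state)           # continues where `original` left off
```

Saving only the source position would lose the tokens still sitting in the buffer, which is why the sketch stores both.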