# tokenize

`optimus_dl.modules.data.transforms.tokenize`

## TokenizeTransform

Bases: `BaseTransform`

Transform that converts raw text strings into sequences of token IDs.
Uses the registry system to instantiate a tokenizer and applies it to the
input data. Supports parallel mapping via `torchdata.nodes.ParallelMapper`.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `cfg` | `TokenizeTransformConfig` | Tokenization configuration. | *required* |
Source code in optimus_dl/modules/data/transforms/tokenize.py
### build(source)

Wrap the source node with a parallel mapper using the tokenizer function.

Source code in optimus_dl/modules/data/transforms/tokenize.py
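The wrapping described above can be sketched with a toy stand-in. This is not the library's implementation: the real transform applies its tokenizer per item through `torchdata.nodes.ParallelMapper`, while the sketch below uses a plain sequential generator, and the `tokenize_sample`, `build`, and `toy_tok` names (plus the `"text"`/`"input_ids"` keys) are illustrative assumptions.

```python
from typing import Callable, Iterable, Iterator, List

def tokenize_sample(tokenizer: Callable[[str], List[int]], sample: dict) -> dict:
    # Map one raw-text sample to token IDs; the real transform applies a
    # function like this per item via torchdata.nodes.ParallelMapper.
    return {**sample, "input_ids": tokenizer(sample["text"])}

def build(source: Iterable[dict], tokenizer: Callable[[str], List[int]]) -> Iterator[dict]:
    # Sequential stand-in for wrapping the source node with a parallel mapper.
    return (tokenize_sample(tokenizer, s) for s in source)

# Toy whitespace "tokenizer" standing in for a registry-built one.
toy_tok = lambda text: [len(word) for word in text.split()]
out = list(build([{"text": "hello world"}], toy_tok))
# out[0] now carries both the raw text and its token IDs.
```

In the real node graph, swapping the sequential generator for `ParallelMapper` changes only how the map function is scheduled, not the per-sample logic.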
## TokenizeTransformConfig

`dataclass`

Bases: `RegistryConfigStrict`

Configuration for text tokenization.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `tokenizer_config` | `Any` |  | `'???'` |
| `debug_samples` | `int` |  | `0` |
| `worker_cfg` | `MapperConfig` | Config with process-based parallelism by default. | `<dynamic>` |
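A minimal sketch of what this dataclass could look like, under stated assumptions: the `'???'` default suggests an OmegaConf-style "required" sentinel, and the fields shown for `MapperConfig` (`num_workers`, `method`) are hypothetical, chosen only to illustrate the documented "process-based parallelism by default" behavior.

```python
from dataclasses import dataclass, field
from typing import Any

MISSING = "???"  # OmegaConf-style sentinel for a required field (assumption)

@dataclass
class MapperConfig:
    # Hypothetical fields: process-based parallelism by default.
    num_workers: int = 2
    method: str = "process"

@dataclass
class TokenizeTransformConfig:
    # Required: the registry config for the tokenizer to instantiate.
    tokenizer_config: Any = MISSING
    # Number of samples to handle in debug mode (assumed meaning of the field).
    debug_samples: int = 0
    # Mutable default must go through default_factory, hence "<dynamic>" above.
    worker_cfg: MapperConfig = field(default_factory=MapperConfig)
```

The `<dynamic>` default in the table corresponds to the `default_factory`: each config instance gets its own fresh `MapperConfig` rather than a shared mutable default.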