config
optimus_dl.recipe.pretokenize.config
Configuration for data preparation recipe.
DataPrepConfig
dataclass
DataPrepConfig(dataset: optimus_dl.recipe.pretokenize.config.DatasetConfig =
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `dataset` | `DatasetConfig` | `DatasetConfig(repo_id: str = '???', split: str = 'train', config_name: str \| None = None, cache_dir: str \| None = None, file_pattern: str \| None = None)` | `<dynamic>` |
| `processing` | `ProcessingConfig` | `ProcessingConfig(shard_size_mb: int = 512, shuffle_buffer_size: int = 10000, text_column: str = 'text', seed: int = 42, dtype: str = 'uint16', num_proc: int = 1)` | `<dynamic>` |
| `output` | `OutputConfig` | `OutputConfig(dir: str = '???', name: str = 'dataset')` | `<dynamic>` |
| `tokenizer` | `Any` | | `'???'` |
Source code in optimus_dl/recipe/pretokenize/config.py
DatasetConfig
dataclass
DatasetConfig(repo_id: str = '???', split: str = 'train', config_name: str | None = None, cache_dir: str | None = None, file_pattern: str | None = None)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `repo_id` | `str` | | `'???'` |
| `split` | `str` | | `'train'` |
| `config_name` | `str \| None` | | `None` |
| `cache_dir` | `str \| None` | | `None` |
| `file_pattern` | `str \| None` | | `None` |
Source code in optimus_dl/recipe/pretokenize/config.py
OutputConfig
dataclass
ProcessingConfig
dataclass
ProcessingConfig(shard_size_mb: int = 512, shuffle_buffer_size: int = 10000, text_column: str = 'text', seed: int = 42, dtype: str = 'uint16', num_proc: int = 1)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `shard_size_mb` | `int` | | `512` |
| `shuffle_buffer_size` | `int` | | `10000` |
| `text_column` | `str` | | `'text'` |
| `seed` | `int` | | `42` |
| `dtype` | `str` | | `'uint16'` |
| `num_proc` | `int` | | `1` |
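
The `dtype` and `shard_size_mb` defaults interact: a 16-bit unsigned integer occupies 2 bytes and can hold token IDs only up to 65535. The sketch below works through that arithmetic using a hypothetical stand-in for `ProcessingConfig`. It assumes that a shard is a flat array of token IDs, that 1 MB means 1024² bytes, and that `'uint32'` would be an accepted alternative; none of these is confirmed by this page. The helper names are illustrative only.

```python
from dataclasses import dataclass

# dtype name -> (bytes per token, largest storable token ID).
# Only 'uint16' appears in the documentation; 'uint32' is an assumed
# alternative for larger vocabularies.
DTYPE_INFO = {
    "uint16": (2, 2**16 - 1),
    "uint32": (4, 2**32 - 1),
}


# Hypothetical stand-in mirroring the documented ProcessingConfig signature.
@dataclass
class ProcessingConfig:
    shard_size_mb: int = 512
    shuffle_buffer_size: int = 10000
    text_column: str = "text"
    seed: int = 42
    dtype: str = "uint16"
    num_proc: int = 1


def tokens_per_shard(cfg: ProcessingConfig) -> int:
    """Approximate token capacity of one shard, assuming 1 MB = 1024**2 bytes."""
    itemsize, _ = DTYPE_INFO[cfg.dtype]
    return cfg.shard_size_mb * 1024**2 // itemsize


def dtype_fits_vocab(cfg: ProcessingConfig, vocab_size: int) -> bool:
    """True if the largest token ID (vocab_size - 1) fits in cfg.dtype."""
    _, max_id = DTYPE_INFO[cfg.dtype]
    return vocab_size - 1 <= max_id


cfg = ProcessingConfig()
print(tokens_per_shard(cfg))          # 268435456
print(dtype_fits_vocab(cfg, 50257))   # True
print(dtype_fits_vocab(cfg, 128256))  # False
```

Under these assumptions, the defaults give roughly 268 million tokens per shard, and a tokenizer with more than 65536 entries would need a wider `dtype`.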