config

optimus_dl.recipe.pretokenize.config

Configuration for the data preparation recipe.

DataPrepConfig dataclass

DataPrepConfig(dataset: optimus_dl.recipe.pretokenize.config.DatasetConfig = <dynamic>, processing: optimus_dl.recipe.pretokenize.config.ProcessingConfig = <dynamic>, output: optimus_dl.recipe.pretokenize.config.OutputConfig = <dynamic>, tokenizer: Any = '???')

Parameters:

Name        Type              Default    Description
dataset     DatasetConfig     <dynamic>  DatasetConfig(repo_id: str = '???', split: str = 'train', config_name: str | None = None, cache_dir: str | None = None, file_pattern: str | None = None)
processing  ProcessingConfig  <dynamic>  ProcessingConfig(shard_size_mb: int = 512, shuffle_buffer_size: int = 10000, text_column: str = 'text', seed: int = 42, dtype: str = 'uint16', num_proc: int = 1)
output      OutputConfig      <dynamic>  OutputConfig(dir: str = '???', name: str = 'dataset')
tokenizer   Any               '???'

Source code in optimus_dl/recipe/pretokenize/config.py
@dataclass
class DataPrepConfig:
    dataset: DatasetConfig = field(default_factory=DatasetConfig)
    processing: ProcessingConfig = field(default_factory=ProcessingConfig)
    output: OutputConfig = field(default_factory=OutputConfig)
    tokenizer: Any = MISSING

DatasetConfig dataclass

DatasetConfig(repo_id: str = '???', split: str = 'train', config_name: str | None = None, cache_dir: str | None = None, file_pattern: str | None = None)

Parameters:

Name          Type        Default  Description
repo_id       str         '???'
split         str         'train'
config_name   str | None  None
cache_dir     str | None  None
file_pattern  str | None  None

Source code in optimus_dl/recipe/pretokenize/config.py
@dataclass
class DatasetConfig:
    repo_id: str = MISSING
    split: str = "train"
    config_name: str | None = None
    cache_dir: str | None = None
    file_pattern: str | None = None  # To filter files if needed
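
The source comment says only that `file_pattern` exists "to filter files if needed"; the matching rules are not documented here. As a rough sketch, assuming glob-style matching against repository file paths (an assumption, not confirmed behavior), the filter might look like this. The `filter_files` helper and the file names are hypothetical.

```python
from fnmatch import fnmatchcase


def filter_files(files, file_pattern=None):
    """Keep files matching a glob-style pattern; None keeps everything.

    Hypothetical helper: the recipe's real matching semantics may differ.
    """
    if file_pattern is None:
        return list(files)
    # fnmatchcase is case-sensitive on every platform, unlike fnmatch.
    return [f for f in files if fnmatchcase(f, file_pattern)]


files = [
    "data/train-00000.parquet",
    "data/train-00001.parquet",
    "data/test-00000.parquet",
    "README.md",
]
selected = filter_files(files, "data/train-*.parquet")
unfiltered = filter_files(files)
print(selected)
```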

OutputConfig dataclass

OutputConfig(dir: str = '???', name: str = 'dataset')

Parameters:

Name  Type  Default    Description
dir   str   '???'
name  str   'dataset'

Source code in optimus_dl/recipe/pretokenize/config.py
@dataclass
class OutputConfig:
    dir: str = MISSING
    name: str = "dataset"  # Prefix for shards?

ProcessingConfig dataclass

ProcessingConfig(shard_size_mb: int = 512, shuffle_buffer_size: int = 10000, text_column: str = 'text', seed: int = 42, dtype: str = 'uint16', num_proc: int = 1)

Parameters:

Name                 Type  Default   Description
shard_size_mb        int   512
shuffle_buffer_size  int   10000
text_column          str   'text'
seed                 int   42
dtype                str   'uint16'
num_proc             int   1

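
This page does not describe how `shuffle_buffer_size` and `seed` are used. A common technique with these two parameters is buffered (approximate) shuffling of a stream: keep a fixed-size buffer, emit a random element whenever it is full, and seed the random generator so the order is reproducible. The sketch below illustrates that general technique under the assumption that the recipe does something similar; `buffered_shuffle` is a hypothetical name.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def buffered_shuffle(items: Iterable[T], buffer_size: int, seed: int) -> Iterator[T]:
    """Approximately shuffle a stream using a bounded buffer (illustrative only)."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)  # private generator, so the order depends only on seed
    buffer: list[T] = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and reuse its slot for the new item.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Stream exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


docs = [f"doc-{i}" for i in range(20)]
first = list(buffered_shuffle(docs, buffer_size=5, seed=42))
second = list(buffered_shuffle(docs, buffer_size=5, seed=42))
print(first == second)
```

A larger buffer mixes the stream more thoroughly at the cost of memory; a buffer at least as large as the stream gives a full shuffle.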
Source code in optimus_dl/recipe/pretokenize/config.py
@dataclass
class ProcessingConfig:
    shard_size_mb: int = 512
    shuffle_buffer_size: int = 10000
    text_column: str = "text"
    seed: int = 42
    dtype: str = "uint16"  # uint16 or uint32
    num_proc: int = 1
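
The `dtype` comment lists `uint16` and `uint32` as the options. A `uint16` can hold token IDs up to 65535, so it only fits vocabularies of at most 65536 entries; larger vocabularies need `uint32`, which doubles the bytes per token. The sketch below works through that arithmetic along with a rough tokens-per-shard estimate. It assumes that `shard_size_mb` counts raw token bytes and that 1 MB means 2**20 bytes; neither is stated on this page, and both helper names are hypothetical.

```python
DTYPE_BYTES = {"uint16": 2, "uint32": 4}


def smallest_dtype(vocab_size: int) -> str:
    """Pick the narrower dtype that can hold every token ID in [0, vocab_size)."""
    if vocab_size < 1:
        raise ValueError("vocab_size must be positive")
    if vocab_size <= 2**16:
        return "uint16"
    if vocab_size <= 2**32:
        return "uint32"
    raise ValueError("vocab_size does not fit in uint32")


def tokens_per_shard(shard_size_mb: int, dtype: str) -> int:
    """Estimate tokens per shard, assuming 1 MB = 2**20 bytes of raw token data."""
    return shard_size_mb * 1024 * 1024 // DTYPE_BYTES[dtype]


small_vocab_dtype = smallest_dtype(50_257)
large_vocab_dtype = smallest_dtype(128_256)
default_shard_tokens = tokens_per_shard(512, "uint16")
uint32_shard_tokens = tokens_per_shard(512, "uint32")
print(small_vocab_dtype, large_vocab_dtype, default_shard_tokens, uint32_shard_tokens)
```

Under these assumptions, the default 512 MB shard holds about 268 million `uint16` tokens, or half that many with `uint32`.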