base
optimus_dl.modules.data.transforms.base
Base transform classes for the data pipeline.
This module defines the base classes for data transforms, which are components that process data as it flows through the pipeline. Transforms can be chained together to create complex data processing pipelines.
BaseTransform
Base class for all data transforms.
All data transforms in Optimus-DL should inherit from this class. Transforms take a data source (BaseNode) and return a new BaseNode that applies the transformation. Transforms can be chained together using CompositeTransform.
Subclasses should implement:
- build(): Apply the transform to a data source and return a new node.
Example
    @register_transform("tokenize", TokenizeConfig)
    class TokenizeTransform(BaseTransform):
        def __init__(self, cfg: TokenizeConfig, **kwargs):
            super().__init__(**kwargs)
            self.tokenizer = build_tokenizer(cfg.tokenizer_config)

        def build(self, source: BaseNode) -> BaseNode:
            def tokenize_fn(item):
                return {"input_ids": self.tokenizer.encode(item["text"])}

            return source.map(tokenize_fn)
Source code in optimus_dl/modules/data/transforms/base.py
__init__(*args, **kwargs)
Initialize the transform.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `*args` | | Positional arguments (typically unused, for compatibility). | `()` |
| `**kwargs` | | Keyword arguments passed from the data builder. | `{}` |
build(source)
Apply the transform to a data source.
This method takes a data source node and returns a new node that applies the transformation. The transformation is applied lazily as data flows through the pipeline.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `source` | `BaseNode` | The data source node to transform. | required |
Returns:
| Type | Description |
|---|---|
| `BaseNode` | A new BaseNode that applies the transformation. |
Raises:
| Type | Description |
|---|---|
| `NotImplementedError` | Must be implemented by subclasses. |
Example
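
The lazy behavior described above can be illustrated with a minimal, self-contained sketch. It does not import Optimus-DL: `SketchNode` and `SketchTransform` are hypothetical stand-ins for `BaseNode` and `BaseTransform`, and `UppercaseTransform` is an invented example transform. The only assumption carried over from the documented API is that a node offers a `map` method, as used in the `TokenizeTransform` example.

```python
from typing import Any, Callable, Iterator


class SketchNode:
    """Hypothetical stand-in for BaseNode: a lazily evaluated iterable."""

    def __init__(self, make_iter: Callable[[], Iterator[Any]]):
        self._make_iter = make_iter

    def __iter__(self) -> Iterator[Any]:
        return self._make_iter()

    def map(self, fn: Callable[[Any], Any]) -> "SketchNode":
        # Return a new node; fn runs only when that node is iterated.
        return SketchNode(lambda: (fn(item) for item in self))


class SketchTransform:
    """Hypothetical stand-in for BaseTransform."""

    def __init__(self, *args, **kwargs):
        pass

    def build(self, source: SketchNode) -> SketchNode:
        raise NotImplementedError


class UppercaseTransform(SketchTransform):
    """Invented example transform that upper-cases the "text" field."""

    def __init__(self, calls: list, **kwargs):
        super().__init__(**kwargs)
        self.calls = calls  # records each item as it is processed

    def build(self, source: SketchNode) -> SketchNode:
        def upper_fn(item):
            self.calls.append(item["text"])
            return {"text": item["text"].upper()}

        return source.map(upper_fn)


def run_demo():
    calls = []
    source = SketchNode(lambda: iter([{"text": "a"}, {"text": "b"}]))
    node = UppercaseTransform(calls).build(source)
    calls_before = list(calls)  # snapshot taken before any iteration
    items = list(node)          # iterating pulls data through the transform
    return calls_before, items, calls
```

In this sketch, calling `build()` only wires up a new node. The mapped function runs once the returned node is iterated, which is the sense in which the transformation is applied lazily.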
MapperConfig
dataclass
Configuration for map operations in data transforms.
This configuration is used by transforms that apply map operations to data. It controls parallelism, ordering, and batching behavior.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `num_workers` | `int` | | `4` |
| `in_order` | `bool` | | `True` |
| `method` | `str` | | `'thread'` |
| `snapshot_frequency` | `int` | | `128` |
| `prebatch` | `int` | | `32` |
ProcessMapperConfig
dataclass
Bases: MapperConfig
Config with process-based parallelism by default.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `method` | `str` | | `'process'` |
ThreadedMapperConfig
dataclass
Bases: MapperConfig
Config with thread-based parallelism by default.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `method` | `str` | | `'thread'` |
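
The relationship between the three config classes can be sketched with plain dataclasses. The classes below are local re-declarations that mirror only the field names, types, and defaults listed in the tables on this page. They are not the library source, and the real classes may carry additional behavior or validation.

```python
from dataclasses import asdict, dataclass


@dataclass
class MapperConfig:
    # Fields and defaults copied from the documented parameter table.
    num_workers: int = 4
    in_order: bool = True
    method: str = "thread"
    snapshot_frequency: int = 128
    prebatch: int = 32


@dataclass
class ProcessMapperConfig(MapperConfig):
    # Overrides only the default parallelism method.
    method: str = "process"


@dataclass
class ThreadedMapperConfig(MapperConfig):
    # Keeps the thread default, stated explicitly.
    method: str = "thread"


# Subclasses inherit every other default; individual fields can still be set.
process_cfg = ProcessMapperConfig(num_workers=8)
thread_cfg = ThreadedMapperConfig()
process_cfg_dict = asdict(process_cfg)
```

Under these assumptions, choosing `ProcessMapperConfig` or `ThreadedMapperConfig` changes only the default value of `method`, while `num_workers`, `in_order`, `snapshot_frequency`, and `prebatch` keep the `MapperConfig` defaults unless overridden.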