# char

`optimus_dl.modules.tokenizer.implementations.char`

## CharTokenizer

Bases: `BaseTokenizer`

Simple byte-level UTF-8 tokenizer.

Converts text to raw UTF-8 bytes and adds optional BOS/EOS tokens. Detokenization skips the special token IDs and decodes the remainder as UTF-8.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `CharTokenizerConfig` | Character tokenizer configuration. | *required* |
Source code in `optimus_dl/modules/tokenizer/implementations/char.py`
### bos_token_id *property*

BOS token ID from config.

### eos_token_id *property*

EOS token ID from config.

### vocab_size *property*

Vocabulary size including BOS/EOS tokens.
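The arithmetic behind this property, assuming the defaults listed in `CharTokenizerConfig` below (this is an illustrative sketch, not the library's code):

```python
# A byte-level vocabulary has one ID per byte value (0-255), plus the
# two special tokens appended past the byte range.
base_vocab = 256   # byte values 0-255
num_special = 2    # BOS (256) and EOS (257)

print(base_vocab + num_special)  # 258
```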
### decode(ids)

Filter out special IDs and decode bytes to UTF-8.

Source code in `optimus_dl/modules/tokenizer/implementations/char.py`
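The decode step described above can be sketched in plain Python (a hedged re-creation, not the library's source; the special IDs 256/257 are taken from the `CharTokenizerConfig` defaults listed further down):

```python
BOS_ID, EOS_ID = 256, 257  # defaults from CharTokenizerConfig below

def decode(ids: list[int]) -> str:
    """Drop the special token IDs, then decode the remaining bytes as UTF-8."""
    byte_ids = bytes(i for i in ids if i not in (BOS_ID, EOS_ID))
    return byte_ids.decode("utf-8")

print(decode([256, 104, 101, 108, 108, 111, 257]))  # "hello"
```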
### encode(text)

Convert text to UTF-8 bytes and add special tokens.

Source code in `optimus_dl/modules/tokenizer/implementations/char.py`
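A minimal sketch of the encoding behavior described above, again assuming the default BOS/EOS IDs from `CharTokenizerConfig` (illustrative only; the actual method may expose flags to omit the special tokens):

```python
BOS_ID, EOS_ID = 256, 257  # defaults from CharTokenizerConfig below

def encode(text: str) -> list[int]:
    """Each UTF-8 byte becomes one token ID (0-255), wrapped in BOS/EOS."""
    return [BOS_ID] + list(text.encode("utf-8")) + [EOS_ID]

print(encode("hi"))  # [256, 104, 105, 257]
print(encode("é"))   # a multi-byte character yields two byte tokens: [256, 195, 169, 257]
```

Note that because tokenization is byte-level rather than character-level, non-ASCII characters expand to multiple tokens.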
## CharTokenizerConfig *dataclass*

Bases: `BaseTokenizerConfig`

Configuration for the character/byte-level tokenizer.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `vocab_size` | `int` | Base vocabulary size (one ID per byte value). | `256` |
| `bos_token_id` | `int` | Beginning-of-sequence token ID. | `256` |
| `eos_token_id` | `int` | End-of-sequence token ID. | `257` |
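For illustration, an equivalent dataclass with these defaults could look like the following (a hypothetical re-creation; the real class lives in `optimus_dl.modules.tokenizer.implementations.char`):

```python
from dataclasses import dataclass

@dataclass
class CharTokenizerConfig:
    vocab_size: int = 256     # byte values 0-255
    bos_token_id: int = 256   # first ID past the byte range
    eos_token_id: int = 257

cfg = CharTokenizerConfig()
# Full vocabulary as reported by the tokenizer's vocab_size property:
# 256 byte IDs plus BOS and EOS.
print(max(cfg.bos_token_id, cfg.eos_token_id) + 1)  # 258
```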