huggingface
optimus_dl.modules.tokenizer.implementations.huggingface
¶
HFTokenizer
¶
Bases: BaseTokenizer
Wrapper for Hugging Face AutoTokenizer.
Integrates standard Hub tokenizers into the framework. It handles:
- Pretrained Loading: Automatically downloads and caches tokenizers.
- Special Tokens: Manages BOS/EOS injection based on config.
- Chat Templates: Supports generating formatted conversation strings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
config
|
HFTokenizerConfig
|
Hugging Face tokenizer configuration. |
required |
Source code in optimus_dl/modules/tokenizer/implementations/huggingface.py
bos_token_id
property
¶
BOS ID from Hub tokenizer.
eos_token_id
property
¶
EOS ID from Hub tokenizer.
vocab_size
property
¶
Vocabulary size from Hub tokenizer.
apply_chat_template(conversation, tokenize=True, add_generation_prompt=True)
¶
Apply the Hub tokenizer's chat template to a conversation.
Source code in optimus_dl/modules/tokenizer/implementations/huggingface.py
decode(ids)
¶
encode(text)
¶
Convert text to IDs using the Hub tokenizer.
Source code in optimus_dl/modules/tokenizer/implementations/huggingface.py
save_pretrained(save_directory)
¶
HFTokenizerConfig
dataclass
¶
Bases: BaseTokenizerConfig
Configuration for Hugging Face tokenizers.
Attributes:
| Name | Type | Description |
|---|
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
name
|
str
|
|
'gpt2'
|
trust_remote_code
|
bool
|
|
False
|