base
optimus_dl.modules.tokenizer.base
BaseTokenizer
Bases: ABC
Abstract base class for all tokenizers.
Defines the standard interface for encoding strings to token IDs and decoding IDs back to text.
Attributes:
| Name | Type | Description |
|---|---|---|
| config | | Configuration object for the tokenizer. |
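
As a rough illustration of the interface described above, the sketch below defines a stand-in abstract class with the documented members (`encode`, `decode`, `vocab_size`, `bos_token_id`, `eos_token_id`) and a toy character-level subclass. `TokenizerSketch` and `CharTokenizer` are hypothetical names that do not come from `optimus_dl`; the real `BaseTokenizer` constructor, type signatures, and defaults may differ.

```python
from abc import ABC, abstractmethod


class TokenizerSketch(ABC):
    """Hypothetical stand-in mirroring the documented BaseTokenizer interface."""

    def __init__(self, config=None):
        # Assumption: the config object is simply stored on the instance.
        self.config = config

    @property
    def bos_token_id(self):
        # Assumption: None signals that there is no BOS token.
        return None

    @property
    def eos_token_id(self):
        # Assumption: None signals that there is no EOS token.
        return None

    @property
    @abstractmethod
    def vocab_size(self) -> int: ...

    @abstractmethod
    def encode(self, text: str) -> list[int]: ...

    @abstractmethod
    def decode(self, ids: list[int]) -> str: ...


class CharTokenizer(TokenizerSketch):
    """Toy tokenizer: one token per character of a fixed alphabet."""

    def __init__(self, alphabet: str, config=None):
        super().__init__(config)
        self._id_to_char = list(alphabet)
        self._char_to_id = {ch: i for i, ch in enumerate(alphabet)}

    @property
    def vocab_size(self) -> int:
        return len(self._id_to_char)

    def encode(self, text: str) -> list[int]:
        # Map each character to its index in the alphabet.
        return [self._char_to_id[ch] for ch in text]

    def decode(self, ids: list[int]) -> str:
        # Map each index back to its character and join.
        return "".join(self._id_to_char[i] for i in ids)


tok = CharTokenizer("abc ")
ids = tok.encode("a cab")
text = tok.decode(ids)
```

In this toy case, decoding the encoded IDs returns the original string. Real subword tokenizers do not always guarantee an exact round trip.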
Source code in optimus_dl/modules/tokenizer/base.py
bos_token_id
property
ID of the Beginning-of-Sequence token, if any.
eos_token_id
property
ID of the End-of-Sequence token, if any.
vocab_size
abstractmethod
property
Total size of the tokenizer's vocabulary.
apply_chat_template(conversation, tokenize=True, add_generation_prompt=True)
Apply a chat template (e.g., Llama-2-chat) to a conversation history.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| conversation | list[dict[str, str]] | List of messages (e.g., [{"role": "user", "content": "..."}]). | required |
| tokenize | bool | Whether to return token IDs (True) or the raw string (False). | True |
| add_generation_prompt | bool | Whether to append the assistant's response prefix. | True |
Returns:
| Type | Description |
|---|---|
| str \| list[int] | Formatted string or list of token IDs. |
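
To make the parameters concrete, here is a minimal sketch of what a function with this signature might do. The `<|role|>` marker format, the function name `apply_chat_template_sketch`, and the extra `encode` argument are assumptions made for illustration; the template an `optimus_dl` tokenizer actually applies is not shown in this reference.

```python
from typing import Callable, Optional


def apply_chat_template_sketch(
    conversation: list[dict[str, str]],
    tokenize: bool = True,
    add_generation_prompt: bool = True,
    encode: Optional[Callable[[str], list[int]]] = None,
):
    """Hypothetical illustration of the documented parameters."""
    # Render each message as "<|role|>\ncontent\n" (a made-up format).
    parts = [f"<|{m['role']}|>\n{m['content']}\n" for m in conversation]
    if add_generation_prompt:
        # Append the assistant prefix so a model would continue as the assistant.
        parts.append("<|assistant|>\n")
    prompt = "".join(parts)
    if not tokenize:
        return prompt
    if encode is None:
        raise ValueError("tokenize=True requires an encode function")
    return encode(prompt)


messages = [{"role": "user", "content": "Hi"}]
as_text = apply_chat_template_sketch(messages, tokenize=False)
as_ids = apply_chat_template_sketch(
    messages, encode=lambda s: [ord(ch) for ch in s]
)
```

With `tokenize=False` the result is the formatted string; with `tokenize=True` the same string is passed through the encoder, which matches the `str | list[int]` return type.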
decode(ids)
abstractmethod
encode(text)
abstractmethod
save_pretrained(save_directory)
Save tokenizer configuration to a directory.
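
One plausible way to persist a configuration to a directory is sketched below. This is not the library's implementation: the function name `save_pretrained_sketch`, the file name `tokenizer_config.json`, and the JSON format are all assumptions chosen for the example.

```python
import json
import tempfile
from pathlib import Path


def save_pretrained_sketch(config: dict, save_directory: str) -> Path:
    """Hypothetical: write a config dict as JSON into save_directory."""
    out_dir = Path(save_directory)
    # Create the target directory if it does not exist yet.
    out_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: the file name is invented for this example.
    out_path = out_dir / "tokenizer_config.json"
    out_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return out_path


with tempfile.TemporaryDirectory() as tmp:
    path = save_pretrained_sketch({"type": "char", "alphabet": "abc "}, tmp)
    # Read the file back to confirm the configuration survived the round trip.
    restored = json.loads(path.read_text(encoding="utf-8"))
```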