tiktoken
optimus_dl.modules.tokenizer.implementations.tiktoken
TiktokenConfig
dataclass
Bases: BaseTokenizerConfig
Configuration for Tiktoken tokenizers.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | | `'gpt2'` |
Source code in optimus_dl/modules/tokenizer/implementations/tiktoken.py
TiktokenTokenizer
Bases: BaseTokenizer
Wrapper for OpenAI's tiktoken library.
Provides fast Byte-Pair Encoding (BPE) tokenization for GPT-style models.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `TiktokenConfig` | Tiktoken tokenizer configuration. | required |
Source code in optimus_dl/modules/tokenizer/implementations/tiktoken.py
bos_token_id
property
End-of-text (EOT) token ID, used as the beginning-of-sequence (BOS) token (tiktoken default).
eos_token_id
property
End-of-text (EOT) token ID, used as the end-of-sequence (EOS) token.
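Because `bos_token_id` and `eos_token_id` both return the EOT ID, a sequence wrapped with both boundaries starts and ends with the same token. The sketch below illustrates this; `add_boundaries` is a hypothetical helper that is not part of this module, and the inner token IDs are placeholders. The constant 50256 is the `<|endoftext|>` ID in tiktoken's `gpt2` encoding.

```python
GPT2_EOT_ID = 50256  # <|endoftext|> in tiktoken's "gpt2" encoding


def add_boundaries(ids, bos_id, eos_id):
    """Hypothetical helper: return a new list with BOS prepended and EOS appended."""
    return [bos_id, *ids, eos_id]


# Placeholder token IDs; real ones would come from encode(text).
wrapped = add_boundaries([10, 20, 30], bos_id=GPT2_EOT_ID, eos_id=GPT2_EOT_ID)
print(wrapped)  # [50256, 10, 20, 30, 50256]
```

Code that needs to tell sequence starts from sequence ends therefore cannot rely on the token ID alone.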
vocab_size
property
Total number of tokens in the encoding.
decode(ids)
encode(text)
Convert text to IDs, allowing all special tokens.
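The members documented here map closely onto tiktoken's `Encoding` API (`eot_token`, `n_vocab`, `encode`, `decode`). The following is a minimal sketch under that assumption, not the source of `TiktokenTokenizer`. `SketchConfig` and `SketchTokenizer` are hypothetical names, and the optional `encoding` argument exists only so the sketch can be exercised with a stand-in object.

```python
from dataclasses import dataclass


@dataclass
class SketchConfig:
    # Hypothetical stand-in for TiktokenConfig; only `name` is documented.
    name: str = "gpt2"


class SketchTokenizer:
    """Illustrative stand-in for TiktokenTokenizer, not the library's code."""

    def __init__(self, config, encoding=None):
        if encoding is None:
            # Imported lazily so the class definition does not require tiktoken.
            import tiktoken

            encoding = tiktoken.get_encoding(config.name)
        self.encoding = encoding

    @property
    def bos_token_id(self):
        # tiktoken exposes a single end-of-text ID; reuse it as BOS.
        return self.encoding.eot_token

    @property
    def eos_token_id(self):
        return self.encoding.eot_token

    @property
    def vocab_size(self):
        return self.encoding.n_vocab

    def encode(self, text):
        # allowed_special="all" lets special-token text such as
        # "<|endoftext|>" map to its special ID instead of raising ValueError.
        return self.encoding.encode(text, allowed_special="all")

    def decode(self, ids):
        return self.encoding.decode(ids)
```

With tiktoken installed, `SketchTokenizer(SketchConfig())` would load the `gpt2` encoding. Passing a different `name`, for example `"cl100k_base"`, selects another tiktoken encoding.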