# inline_tokens

`optimus_dl.modules.tokenizer.implementations.inline_tokens`

## InlineTokensTokenizer

Bases: `BaseTokenizer`

Inline sequence tokenizer based on an explicitly provided list of tokens.

Uses regex-based greedy longest-match parsing to tokenize arbitrary strings without whitespace assumptions, handling unknown text chunks according to the configured strategy.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `InlineTokensTokenizerConfig` | Tokenizer configuration containing the vocabulary and UNK strategy. | *required* |
Source code in optimus_dl/modules/tokenizer/implementations/inline_tokens.py
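The greedy longest-match approach can be sketched as follows. This is a minimal illustration of the technique, not the actual `optimus_dl` implementation: sorting tokens by length before building the regex alternation makes Python's regex engine prefer the longest matching token at each position, and any text between matches is an unknown chunk (here handled by raising, one plausible UNK strategy).

```python
import re


def build_pattern(tokens: list[str]) -> "re.Pattern[str]":
    # Longest tokens first, so the alternation matches greedily.
    ordered = sorted(tokens, key=len, reverse=True)
    return re.compile("|".join(re.escape(t) for t in ordered))


def tokenize(text: str, tokens: list[str]) -> list[str]:
    pattern = build_pattern(tokens)
    out: list[str] = []
    pos = 0
    for m in pattern.finditer(text):
        if m.start() != pos:
            # Unknown chunk between matches; the real tokenizer would
            # apply the configured UNK strategy here.
            raise ValueError(f"unknown chunk: {text[pos:m.start()]!r}")
        out.append(m.group())
        pos = m.end()
    if pos != len(text):
        raise ValueError(f"unknown chunk: {text[pos:]!r}")
    return out


print(tokenize("abba", ["a", "b", "ab", "ba"]))  # → ['ab', 'ba']
```

Note that because no whitespace is assumed, `"abba"` splits cleanly into `["ab", "ba"]` rather than failing on the missing separators.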
### `bos_token_id` *property*

BOS token ID from config.

### `eos_token_id` *property*

EOS token ID from config.

### `vocab_size` *property*

Vocabulary size including BOS/EOS/UNK tokens.
### `decode(ids)`

Filter out special IDs and decode back to a string.
### `encode(text)`

Convert text into token IDs and add special tokens using regex.
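A round trip through `encode`/`decode` can be sketched as below. The ID layout, special-token IDs, and the assumption that input arrives pre-split are all hypothetical, not the actual `optimus_dl` API; the sketch only shows the two documented behaviors: `encode` wraps the sequence in special tokens, and `decode` filters special IDs back out.

```python
# Hypothetical vocabulary and ID layout (assumptions for illustration).
TOKENS = ["hello", "world", " "]
BOS_ID, EOS_ID, UNK_ID = 0, 1, 2
TOKEN_TO_ID = {t: i + 3 for i, t in enumerate(TOKENS)}
ID_TO_TOKEN = {i: t for t, i in TOKEN_TO_ID.items()}
SPECIAL_IDS = {BOS_ID, EOS_ID, UNK_ID}


def encode(parts: list[str]) -> list[int]:
    # Map each token to its ID and wrap the sequence with BOS/EOS.
    return [BOS_ID] + [TOKEN_TO_ID.get(p, UNK_ID) for p in parts] + [EOS_ID]


def decode(ids: list[int]) -> str:
    # Filter out special IDs and join the remaining tokens.
    return "".join(ID_TO_TOKEN[i] for i in ids if i not in SPECIAL_IDS)


ids = encode(["hello", " ", "world"])
print(decode(ids))  # → hello world
```

This also illustrates why `vocab_size` counts the BOS/EOS/UNK tokens: they occupy ID slots alongside the user-provided tokens.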
## InlineTokensTokenizerConfig *dataclass*

Bases: `BaseTokenizerConfig`

Configuration for the explicit-token-list tokenizer.
Attributes:

| Name | Type | Description |
|---|---|---|
| `bos_token` | `list[str]` | Beginning-of-Sequence token. |
| `eos_token` | `list[str]` | End-of-Sequence token. |
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `tokens` | `list[str]` | Explicit list of tokens forming the vocabulary. | `'???'` |
| `unk_strategy` | `UnkStrategy` | Strategy for handling unknown text chunks. | `<UnkStrategy.RAISE: 'raise'>` |
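A minimal sketch of the two configuration knobs, under stated assumptions: the enum is reduced to the one member the table documents (`RAISE`), and the `'???'` sentinel (conventionally "required, no default" in OmegaConf-style configs) is approximated by making `tokens` mandatory. This is not the real `InlineTokensTokenizerConfig`.

```python
from dataclasses import dataclass
from enum import Enum


class UnkStrategy(Enum):
    # Only the documented member; the real enum may define others.
    RAISE = "raise"


@dataclass
class InlineTokensTokenizerConfig:
    tokens: list[str]  # '???' in the real config marks this as required
    unk_strategy: UnkStrategy = UnkStrategy.RAISE


cfg = InlineTokensTokenizerConfig(tokens=["a", "b", "ab"])
print(cfg.unk_strategy.value)  # → raise
```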