Lexer and vocabulary¶
The lexer and vocabulary are configured via lexer.yaml; the file format and top-level keys are described in Configuration. Below are the API references for the classes that consume that configuration.
UnifiedLexer ¶
UnifiedLexer(
vocab_config: VocabConfig,
number_policy: Optional[NumberPolicy] = None,
strict: bool = True,
include_base_vocab: bool = True,
)
Bases: AbstractPreProcessor
Unified regex-based lexer for tokenizing input strings.
This lexer converts raw input strings into token sequences according to the vocabulary configuration. It follows the longest-match principle for reserved tokens and supports configurable number tokenization.
Examples:
>>> from calt.io.vocabulary import VocabConfig
>>> from calt.io.preprocessor import UnifiedLexer, NumberPolicy
>>>
>>> vocab_config = VocabConfig([], {}).from_config("config/vocab.yaml")
>>> number_policy = NumberPolicy(sign=False, digit_group=1)
>>> lexer = UnifiedLexer(vocab_config, number_policy=number_policy)
>>>
>>> tokens = lexer.tokenize("C-50*x1^2 + 3.14")
>>> # Returns: ["C-50", "*", "x1", "^", "2", "+", "3", ".", "1", "4"]
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| vocab_config | VocabConfig | Vocabulary configuration. | required |
| number_policy | Optional[NumberPolicy] | Policy for tokenizing numbers. If None, uses default. | None |
| strict | bool | If True, raise error on unknown characters. If False, emit | True |
| include_base_vocab | bool | Whether to include base vocabulary tokens. | True |
Source code in src/calt/io/preprocessor/lexer.py
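The longest-match principle mentioned above can be illustrated with a small standalone sketch. This is not the code in lexer.py: the function name `longest_match_tokenize` and the `reserved` list are hypothetical, and the sketch assumes whitespace is skipped and unknown characters always raise. It relies on the fact that Python's `re` alternation tries alternatives left to right, so sorting reserved tokens by descending length makes the longest candidate win.

```python
import re


def longest_match_tokenize(text: str, reserved: list[str]) -> list[str]:
    """Hypothetical helper: split text into reserved tokens, longest first."""
    # Alternation in `re` is ordered, not longest-match, so put longer
    # tokens first to keep "x10" from being read as "x1" followed by "0".
    ordered = sorted(reserved, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(tok) for tok in ordered))

    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():  # assumption: whitespace separates tokens
            pos += 1
            continue
        match = pattern.match(text, pos)
        if match is None or match.end() == pos:
            # Assumption: behave like a strict lexer and reject the input.
            raise ValueError(f"unknown character {text[pos]!r} at position {pos}")
        tokens.append(match.group())
        pos = match.end()
    return tokens


reserved = ["x1", "x10", "^", "+", "2"]
print(longest_match_tokenize("x1^2 + x10", reserved))
# ['x1', '^', '2', '+', 'x10']
```

Sorting by length is only one way to get longest-match behavior from an ordered alternation; the actual implementation may build its pattern differently.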
NumberPolicy dataclass ¶
NumberPolicy(
sign: bool = False, digit_group: int = 0, allow_float: bool = True
)
Policy for tokenizing numbers.
Attributes:
| Name | Type | Description |
|---|---|---|
| sign | bool | How to handle the sign. True (attach) makes the sign part of the number token; False (separate) emits the sign as its own token. |
| digit_group | int | Digit grouping. 0 = no split; d>=1 = split every d digits. |
| allow_float | bool | Whether to allow floating point numbers. |
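To make the `sign` and `digit_group` attributes concrete, here is a minimal sketch of how such a policy could split an integer literal. The helper `split_integer` is hypothetical and is not part of calt. It assumes digits are grouped starting from the left and that an attached sign is prefixed to the first group; the library may make different choices on both points.

```python
def split_integer(literal: str, sign: bool = False, digit_group: int = 0) -> list[str]:
    """Hypothetical helper: split a signed integer literal into tokens."""
    sign_char = ""
    digits = literal
    if literal and literal[0] in "+-":
        sign_char, digits = literal[0], literal[1:]
    if not digits.isdigit():
        raise ValueError(f"not an integer literal: {literal!r}")

    if digit_group <= 0:
        # 0 means "no split": the digits stay together as one token.
        groups = [digits]
    else:
        # Assumption: chunks of `digit_group` digits are taken from the left.
        groups = [digits[i:i + digit_group] for i in range(0, len(digits), digit_group)]

    if not sign_char:
        return groups
    if sign:
        # Attach: the sign becomes part of the first number token (assumption).
        return [sign_char + groups[0]] + groups[1:]
    # Separate: the sign is emitted as its own token.
    return [sign_char] + groups


print(split_integer("-50"))                  # ['-', '50']
print(split_integer("-50", sign=True))       # ['-50']
print(split_integer("-50", digit_group=1))   # ['-', '5', '0']
print(split_integer("12345", digit_group=2)) # ['12', '34', '5']
```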
VocabConfig ¶
VocabConfig(
vocab: list[str],
special_tokens: dict[str, str],
include_base_vocab=True,
include_base_special_tokens=True,
)
Source code in src/calt/io/vocabulary/config.py