
Lexer and vocabulary

The lexer and vocabulary are configured via lexer.yaml; the file format and top-level keys are described in Configuration. Below are the API references for the classes that consume that configuration.

UnifiedLexer

UnifiedLexer(
    vocab_config: VocabConfig,
    number_policy: Optional[NumberPolicy] = None,
    strict: bool = True,
    include_base_vocab: bool = True,
)

Bases: AbstractPreProcessor

Unified regex-based lexer for tokenizing input strings.

This lexer converts raw input strings into token sequences according to the vocabulary configuration. It follows the longest-match principle for reserved tokens and supports configurable number tokenization.

Examples:

>>> from calt.io.vocabulary import VocabConfig
>>> from calt.io.preprocessor import UnifiedLexer, NumberPolicy
>>>
>>> vocab_config = VocabConfig([], {}).from_config("config/vocab.yaml")
>>> number_policy = NumberPolicy(sign=False, digit_group=1)
>>> lexer = UnifiedLexer(vocab_config, number_policy=number_policy)
>>>
>>> tokens = lexer.tokenize("C-50*x1^2 + 3.14")
>>> # Returns: ["C-50", "*", "x1", "^", "2", "+", "3", ".", "1", "4"]
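
The longest-match principle and the strict flag can be illustrated with a small standalone sketch. This is not the calt implementation: the helper names (build_pattern, sketch_tokenize), the reserved-token list, and the whitespace handling are assumptions made for illustration only.

```python
import re


def build_pattern(reserved):
    # Hypothetical helper: sort reserved tokens longest first, so that the
    # regex alternation tries "x10" before "x1" and "x1" before "x".
    ordered = sorted(reserved, key=len, reverse=True)
    return re.compile("|".join(re.escape(tok) for tok in ordered))


def sketch_tokenize(text, reserved, strict=True, unk="<unk>"):
    # Hypothetical stand-in for a lexer: whitespace is skipped (an assumption).
    pattern = build_pattern(reserved)
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        match = pattern.match(text, pos)
        if match is not None and match.end() > pos:
            tokens.append(match.group())
            pos = match.end()
        elif strict:
            # strict=True: unknown characters are an error.
            raise ValueError(f"Unknown character {text[pos]!r} at position {pos}")
        else:
            # strict=False: emit the unknown token and move on.
            tokens.append(unk)
            pos += 1
    return tokens


reserved = ["x", "x1", "x10", "+", "*", "^"]
print(sketch_tokenize("x10 * x1 + x", reserved))
# ['x10', '*', 'x1', '+', 'x']
print(sketch_tokenize("x1 ? x", reserved, strict=False))
# ['x1', '<unk>', 'x']
```

Sorting by length before building the alternation matters because Python's regex alternation takes the first alternative that matches, not the longest one.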

Parameters:

vocab_config (VocabConfig): Vocabulary configuration. Required.
number_policy (Optional[NumberPolicy]): Policy for tokenizing numbers. If None, a default NumberPolicy() is used. Default: None.
strict (bool): If True, raise an error on unknown characters. If False, emit <unk>. Default: True.
include_base_vocab (bool): Whether to include base vocabulary tokens. Default: True.

Source code in src/calt/io/preprocessor/lexer.py
def __init__(
    self,
    vocab_config: VocabConfig,
    number_policy: Optional[NumberPolicy] = None,
    strict: bool = True,
    include_base_vocab: bool = True,
):
    """Initialize the unified lexer.

    Args:
        vocab_config: Vocabulary configuration.
        number_policy: Policy for tokenizing numbers. If None, uses default.
        strict: If True, raise error on unknown characters. If False, emit <unk>.
        include_base_vocab: Whether to include base vocabulary tokens.
    """
    # Initialize pre-processor base class
    super().__init__()

    self.number_policy = number_policy or NumberPolicy()
    self.strict = strict
    self.include_base_vocab = include_base_vocab

    # Extend vocab_config with required tokens based on number_policy
    self.vocab_config = self._extend_vocab_for_number_policy(vocab_config)

    # Build reserved tokens
    self._build_reserved_tokens()

    # Build regex patterns
    self._build_patterns()

NumberPolicy dataclass

NumberPolicy(
    sign: bool = False, digit_group: int = 0, allow_float: bool = True
)

Policy for tokenizing numbers.

Attributes:

sign (bool): How to handle the sign. True (attach) makes the sign part of the number token; False (separate) emits the sign as its own token.
digit_group (int): Digit grouping. 0 = no split; d >= 1 = split every d digits.
allow_float (bool): Whether to allow floating-point numbers.
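
To make the three attributes concrete, here is a minimal standalone sketch of how such a policy could split a single number literal. It does not use or reproduce calt's code: SketchNumberPolicy and split_number are hypothetical names, and several behaviors are assumptions (digits are grouped from the left, an attached sign is joined to the first group, and a float raises an error when allow_float is False).

```python
import re
from dataclasses import dataclass


@dataclass
class SketchNumberPolicy:
    # Hypothetical mirror of the documented fields and defaults.
    sign: bool = False
    digit_group: int = 0
    allow_float: bool = True


def split_number(text, policy):
    # Parse an optional sign, an integer part, and an optional fraction.
    m = re.fullmatch(r"([+-]?)(\d+)(?:\.(\d+))?", text)
    if m is None:
        raise ValueError(f"Not a number literal: {text!r}")
    sign, integer, fraction = m.groups()
    if fraction is not None and not policy.allow_float:
        # Assumption: floats are rejected rather than split.
        raise ValueError(f"Floats are not allowed: {text!r}")

    def chunks(digits):
        # digit_group == 0 keeps the digits together; d >= 1 splits every
        # d digits, counting from the left (an assumption).
        d = policy.digit_group
        if d <= 0:
            return [digits]
        return [digits[i:i + d] for i in range(0, len(digits), d)]

    tokens = chunks(integer)
    if sign:
        if policy.sign:
            tokens[0] = sign + tokens[0]  # attach the sign to the first group
        else:
            tokens.insert(0, sign)        # emit the sign as its own token
    if fraction is not None:
        tokens.append(".")
        tokens.extend(chunks(fraction))
    return tokens


print(split_number("3.14", SketchNumberPolicy(sign=False, digit_group=1)))
# ['3', '.', '1', '4']
print(split_number("-1234", SketchNumberPolicy(sign=True, digit_group=2)))
# ['-12', '34']
print(split_number("-50", SketchNumberPolicy()))
# ['-', '50']
```

The first call reproduces the "3.14" portion of the UnifiedLexer example above; the other two outputs follow only from the assumptions stated in the sketch.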

VocabConfig

VocabConfig(
    vocab: list[str],
    special_tokens: dict[str, str],
    include_base_vocab=True,
    include_base_special_tokens=True,
)
Source code in src/calt/io/vocabulary/config.py
def __init__(
    self,
    vocab: list[str],
    special_tokens: dict[str, str],
    include_base_vocab=True,
    include_base_special_tokens=True,
):
    self.vocab = vocab
    self.special_tokens = special_tokens

    if include_base_vocab:
        self.vocab = BASE_VOCAB + self.vocab
    if include_base_special_tokens:
        self.special_tokens = BASE_SPECIAL_TOKENS | self.special_tokens
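
The two merge operations in the constructor have different semantics, which the following sketch shows with made-up values. The contents of BASE_VOCAB and BASE_SPECIAL_TOKENS here are placeholders, not the library's actual defaults, and the dict union operator `|` requires Python 3.9 or later.

```python
# Placeholder base values for illustration only (not calt's real defaults).
BASE_VOCAB = ["+", "*"]
BASE_SPECIAL_TOKENS = {"pad_token": "<pad>", "unk_token": "<unk>"}

user_vocab = ["x0", "x1", "+"]
user_special = {"unk_token": "[UNK]", "sep_token": "<sep>"}

# List concatenation: base tokens come first, and duplicates are NOT removed.
merged_vocab = BASE_VOCAB + user_vocab
print(merged_vocab)
# ['+', '*', 'x0', 'x1', '+']

# Dict union: on a key conflict the right-hand operand (the user's value) wins.
merged_special = BASE_SPECIAL_TOKENS | user_special
print(merged_special)
# {'pad_token': '<pad>', 'unk_token': '[UNK]', 'sep_token': '<sep>'}
```

In other words, user-supplied special tokens override base entries with the same key, while user-supplied vocabulary entries are appended after the base vocabulary as written.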