Data Loader
Utilities to prepare training/evaluation datasets, tokenizers, and data collators. They convert symbolic expressions (polynomials/integers) into internal token sequences and build batches suitable for training.
Entry point
Create dataset, tokenizer and data-collator objects.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`train_dataset_path` | `str` | Path to the file that stores the "training" samples. | required |
`test_dataset_path` | `str` | Path to the file that stores the "evaluation" samples. | required |
`field` | `str` | Finite-field identifier (`"QQ"`, `"ZZ"`, or `"GF<p>"` for a finite field of prime order p). | required |
`num_variables` | `int` | Maximum number of symbolic variables \(x_1, \dots, x_n\) that can appear in a polynomial. | required |
`max_degree` | `int` | Maximum total degree allowed for any monomial term. | required |
`max_coeff` | `int` | Maximum absolute value of the coefficients appearing in the data. | required |
`max_length` | `int` | Hard upper bound on the token sequence length; longer sequences are right-truncated. Defaults to 512. | `512` |
`processor_name` | `str` | Name of the processor used to convert symbolic expressions into internal token IDs. Defaults to `'polynomial'`. | `'polynomial'` |
`vocab_path` | `str \| None` | Path to the vocabulary configuration file. If None, a default vocabulary is generated from the `field`, `max_degree`, and `max_coeff` parameters. Defaults to None. | `None` |
`num_train_samples` | `int \| None` | Maximum number of training samples to load. If None or -1, all available training samples are loaded. Defaults to None. | `None` |
`num_test_samples` | `int \| None` | Maximum number of test samples to load. If None or -1, all available test samples are loaded. Defaults to None. | `None` |
Returns:

Type | Description |
---|---|
`tuple[dict[str, StandardDataset], PreTrainedTokenizerFast, StandardDataCollator]` | A tuple of (1) a dict mapping split names to `StandardDataset` objects, (2) the configured tokenizer, and (3) the `StandardDataCollator`. |
Source code in src/calt/data_loader/data_loader.py
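A minimal usage sketch, assuming the entry point is imported as `prepare_data` (a stand-in name; this page does not show the exported symbol). The paths, field, and split key are illustrative:

```python
# Hypothetical usage sketch: `prepare_data` stands in for the entry point
# documented above; its real exported name is not shown on this page.
from torch.utils.data import DataLoader

datasets, tokenizer, collator = prepare_data(   # name assumed
    train_dataset_path="data/train.txt",        # illustrative paths
    test_dataset_path="data/test.txt",
    field="GF7",                                # finite field of order 7
    num_variables=2,
    max_degree=10,
    max_coeff=100,
    max_length=512,
)

# The returned objects plug directly into a PyTorch DataLoader:
train_loader = DataLoader(
    datasets["train"],                          # split key assumed
    batch_size=32,
    collate_fn=collator,                        # see StandardDataCollator below
)
```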
Dataset and collator
StandardDataset

Bases: Dataset
Source code in src/calt/data_loader/utils/data_collator.py
load_file (classmethod)
load_file(
data_path: str,
preprocessor: AbstractPreprocessor,
max_samples: int | None = None,
) -> StandardDataset
Load data from a file and create a StandardDataset instance.
This method maintains backward compatibility with the previous file-based initialization.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`data_path` | `str` | Path to the data file. | required |
`preprocessor` | `AbstractPreprocessor` | Preprocessor instance. | required |
`max_samples` | `int \| None` | Maximum number of samples to load. Use -1 or None to load all samples. Defaults to None. | `None` |
Returns:

Name | Type | Description |
---|---|---|
`StandardDataset` | `StandardDataset` | Loaded dataset instance. |
Source code in src/calt/data_loader/utils/data_collator.py
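A short sketch of calling the classmethod directly, assuming an `AbstractPreprocessor` instance named `preprocessor` already exists (its construction is not shown on this page):

```python
# Sketch: cap the number of loaded samples; construction of `preprocessor`
# (an AbstractPreprocessor subclass instance) is assumed, not shown here.
dataset = StandardDataset.load_file(
    data_path="data/train.txt",   # illustrative path
    preprocessor=preprocessor,
    max_samples=10_000,           # None or -1 loads every sample
)
```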
__getitem__
__getitem__(idx: int) -> dict[str, str]
Get dataset item and convert to internal representation.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`idx` | `int` | Index of the item to retrieve. | required |
Returns:

Type | Description |
---|---|
`dict[str, str]` | A mapping with keys `input` and `target` holding the internal token representations. |
Source code in src/calt/data_loader/utils/data_collator.py
StandardDataCollator

__call__
__call__(batch)
Collate a batch of data samples.
If a tokenizer is provided, it tokenizes the `input` and `target` attributes. Other attributes starting with `target_` are prefixed with `decoder_` and padded.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`batch` | `list[dict[str, Any]]` | Mini-batch samples. | required |

Returns:

Type | Description |
---|---|
`dict[str, torch.Tensor \| list[str]]` | Batched tensors and/or lists. |
Source code in src/calt/data_loader/utils/data_collator.py
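Iterating a `DataLoader` built with this collator makes the output contract visible. The snippet assumes `datasets` and `collator` from the entry-point sketch above, plus a `"test"` split key:

```python
from torch.utils.data import DataLoader

loader = DataLoader(datasets["test"], batch_size=8, collate_fn=collator)
batch = next(iter(loader))
# Tokenized fields arrive as padded tensors, raw fields as lists of strings.
for key, value in batch.items():
    print(key, type(value))
```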
Preprocessing (expression → internal tokens)
Bases: AbstractPreprocessor
Convert SageMath-style expressions to/from internal token representation.
Example (to_internal): `"2*x1^2*x0 + 5*x0 - 3"` -> `"C2 E1 E2 C5 E1 E0 C-3 E0 E0"` (for `num_vars=2`)

Example (to_original): `"C2 E2 E1 C5 E1 E0 C-3 E0 E0"` -> `"2*x0^2*x1 + 5*x0 - 3"`

The internal representation uses:
- `C{n}` tokens for coefficients (e.g., `C2`, `C-3`)
- `E{n}` tokens for exponents (e.g., `E1`, `E2`, `E0`)

Each term is represented as a coefficient token followed by one exponent token per variable.
Source code in src/calt/data_loader/utils/preprocessor.py
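The layout is easy to reproduce by hand. The helper below is an illustrative re-implementation of the token scheme, not the library's code:

```python
# Illustrative re-implementation of the token layout (not the library code).
# Each term is (coefficient, (exponent_of_x0, exponent_of_x1, ...)).
def terms_to_tokens(terms: list[tuple[int, tuple[int, ...]]]) -> str:
    tokens = []
    for coeff, exponents in terms:
        tokens.append(f"C{coeff}")                 # coefficient token
        tokens.extend(f"E{e}" for e in exponents)  # one exponent per variable
    return " ".join(tokens)

# 2*x1^2*x0 + 5*x0 - 3 with num_vars=2:
print(terms_to_tokens([(2, (1, 2)), (5, (1, 0)), (-3, (0, 0))]))
# -> C2 E1 E2 C5 E1 E0 C-3 E0 E0
```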
encode
encode(text: str) -> str
Process a symbolic text into internal token representation.
If the text contains the '|' separator character, each part is processed separately and the parts are joined with the `[SEP]` token.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`text` | `str` | Input symbolic text to process. | required |

Returns:

Name | Type | Description |
---|---|---|
`str` | `str` | String in the internal token representation. |
Source code in src/calt/data_loader/utils/preprocessor.py
decode
decode(tokens: str) -> str
Convert an internal token string back to a symbolic polynomial expression.
Source code in src/calt/data_loader/utils/preprocessor.py
Bases: AbstractPreprocessor

Convert an integer string to/from internal token representation.

Input format examples (to_internal):
- "12345"
- "123|45|678"

Output format examples (from_internal):
- "C1 C2 C3 C4 C5"
- "C1 C2 C3 [SEP] C4 C5 [SEP] C6 C7 C8"

The internal representation uses `C{n}` tokens for digits. Parts separated by '|' are converted individually and joined by `[SEP]`.

Note: `num_variables`, `max_degree`, and `max_coeff` are inherited but not directly used.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`max_coeff` | `int` | The maximum digit value (typically 9). Passed to the superclass but primarily used for validation context. | `9` |
Source code in src/calt/data_loader/utils/preprocessor.py
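The digit mapping is equally mechanical; this illustrative sketch (not the library's code) reproduces the documented behavior, including the '|' to `[SEP]` handling and the error token:

```python
# Illustrative sketch of the documented digit tokenization (not library code).
def int_string_to_tokens(text: str) -> str:
    parts = text.split("|")
    if not all(part.isdigit() for part in parts):
        return "[ERROR_FORMAT]"   # documented error marker
    return " [SEP] ".join(
        " ".join(f"C{digit}" for digit in part) for part in parts
    )

print(int_string_to_tokens("123|45|678"))
# -> C1 C2 C3 [SEP] C4 C5 [SEP] C6 C7 C8
```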
encode
encode(text: str) -> str
Process an integer string into internal token representation.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`text` | `str` | Input string representing one or more integers separated by '\|'. | required |

Returns:

Name | Type | Description |
---|---|---|
`str` | `str` | Internal token representation (e.g., "C1 C2 [SEP] C3 C4"), or "[ERROR_FORMAT]" if any part is not a valid integer string. |
Source code in src/calt/data_loader/utils/preprocessor.py
decode
decode(tokens: str) -> str
Convert an internal token representation back to an integer string.
Source code in src/calt/data_loader/utils/preprocessor.py
Tokenizer
Build or load a tokenizer for polynomial expressions and configure the vocabulary.
Create or load a tokenizer for polynomial expressions.
If a `vocab_config` is provided, it builds a tokenizer from the config. Otherwise, it creates a new tokenizer based on the provided parameters.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`field` | `str` | Field specification (`"QQ"`/`"ZZ"` for rational/integer coefficients, or `"GF<p>"` for a finite field). Used if `vocab_config` is None. | `'GF'` |
`max_coeff` | `int` | Maximum absolute value for coefficients. Used if `vocab_config` is None. | `100` |
`max_degree` | `int` | Maximum degree for any variable. Used if `vocab_config` is None. | `10` |
`max_length` | `int` | Maximum sequence length the tokenizer will process. | `512` |
`vocab_config` | `Optional[VocabConfig]` | Optional dictionary with "vocab" and "special_vocab" entries. | `None` |
Returns:

Name | Type | Description |
---|---|---|
`PreTrainedTokenizerFast` | `PreTrainedTokenizerFast` | A pre-configured HuggingFace tokenizer for polynomial expressions. |
Source code in src/calt/data_loader/utils/tokenizer.py
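A usage sketch, with `build_tokenizer` as a stand-in name for the builder documented above (the exported symbol is not shown on this page):

```python
# Hypothetical usage sketch: `build_tokenizer` stands in for the builder.
tokenizer = build_tokenizer(   # name assumed
    field="GF7",
    max_coeff=100,
    max_degree=10,
    max_length=512,
)

# A PreTrainedTokenizerFast operates on the whitespace-separated
# internal tokens:
enc = tokenizer("C2 E1 E2 C5 E1 E0 C-3 E0 E0", return_tensors="pt")
print(enc["input_ids"].shape)
```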
Visualization utilities (optional)
Quickly render visual diffs between predictions and references.
Render "gold" vs. "pred" with strikethrough on mistakes in "pred".
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`gold` | `Expr \| str` | Ground-truth expression. If a string, it is parsed as a token sequence (e.g., "C1 E1 E1 C-3 E0 E7") by the expression parser described below. | required |
`pred` | `Expr \| str` | Model-predicted expression. If a string, it is parsed as a token sequence in the same way. | required |
`var_order` | `Sequence[Symbol] \| None` | Variable ordering (important for more than two variables). Inferred if None. Also passed to the expression parser. | `None` |
Source code in src/calt/data_loader/utils/comparison_vis.py
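A sketch of the intended call pattern, with `render_comparison` as a stand-in for the render function (its exported name is not shown on this page):

```python
# Hypothetical call sketch; `render_comparison` stands in for the renderer.
render_comparison(
    gold="C1 E1 E1 C-3 E0 E7",  # token sequences are parsed automatically
    pred="C1 E1 E1 C-3 E0 E6",  # differing parts are struck through
    var_order=None,             # inferred when omitted
)
```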
Load evaluation results from a JSON file.
The JSON file should contain a list of objects with "generated" and "reference" keys.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`file_path` | `str` | Path to the JSON file. | required |

Returns:

Type | Description |
---|---|
`tuple[list[str], list[str]]` | A tuple containing two lists: the generated texts and the reference texts. |
Source code in src/calt/data_loader/utils/comparison_vis.py
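A stdlib-only sketch of the documented behavior, illustrating the expected file shape (this is not the library's implementation):

```python
# Illustrative stdlib-only sketch of the documented loading behavior.
import json

def load_results(file_path: str) -> tuple[list[str], list[str]]:
    with open(file_path) as f:
        records = json.load(f)  # a list of {"generated": ..., "reference": ...}
    generated = [r["generated"] for r in records]
    reference = [r["reference"] for r in records]
    return generated, reference
```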
Convert a math expression string or token sequence to a SymPy polynomial.
This function handles:
1. Standard mathematical notation (e.g., "4*x0 + 4*x1").
2. SageMath-style power notation (e.g., "3*x0^2 + 3*x0").
3. Internal token format (e.g., "C4 E1 E0 C4 E0 E1").
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`text` | `str` | The mathematical expression or token sequence to parse. | required |
`var_names` | `Sequence[str \| Symbol] \| None` | Variable names. Primarily used for the token-sequence format to ensure the correct number of variables. For expression strings, variables are inferred, but providing them ensures they are treated as symbols. | `None` |
Returns:

Type | Description |
---|---|
`sympy.Expr` | A SymPy expression for the polynomial. |
Source code in src/calt/data_loader/utils/comparison_vis.py
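The token-format branch can be reproduced in a few lines. The sketch below (not the library's parser) handles only the internal token format, with a fixed variable count:

```python
# Illustrative sketch of the token-format branch only (not the library parser).
import sympy

def tokens_to_sympy(tokens: str, num_vars: int) -> sympy.Expr:
    symbols = sympy.symbols(f"x0:{num_vars}")  # x0, x1, ...
    items = tokens.split()
    step = 1 + num_vars                        # one C token plus one E per variable
    expr = sympy.Integer(0)
    for i in range(0, len(items), step):
        term = sympy.Integer(int(items[i][1:]))        # strip leading "C"
        for var, tok in zip(symbols, items[i + 1:i + step]):
            term *= var ** int(tok[1:])                # strip leading "E"
        expr += term
    return expr

print(tokens_to_sympy("C4 E1 E0 C4 E0 E1", num_vars=2))  # -> 4*x0 + 4*x1
```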