Configuration
Example tasks in calt/examples/* share a common configuration pattern based on
three YAML files:
- configs/data.yaml – controls dataset generation via DatasetPipeline.
- configs/train.yaml – main control file for the model, training loop, and WandB logging via ModelPipeline and TrainerPipeline (references lexer.yaml in its data block).
- configs/lexer.yaml – controls tokenisation and vocabulary via IOPipeline; its path is set in train.yaml’s data.lexer_config.
All three are loaded with OmegaConf.load and passed around as omegaconf.DictConfig
objects, so they support dot-style access (e.g. cfg.data, cfg.model, cfg.train).
WandB configuration can be included in cfg.train.wandb or passed separately as cfg.wandb.
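The two WandB locations can be reconciled with a small lookup. The sketch below uses plain dictionaries in place of DictConfig objects, and `resolve_wandb_config` is a hypothetical helper, not part of calt. The lookup order (nested `train.wandb` first, then top-level `wandb`) is an assumption made for illustration.

```python
from typing import Any, Mapping, Optional


def resolve_wandb_config(cfg: Mapping[str, Any]) -> Optional[Mapping[str, Any]]:
    """Return the WandB block from cfg, or None if neither location defines one.

    Hypothetical helper: the lookup order is an assumption, not documented behaviour.
    """
    train_block = cfg.get("train") or {}
    nested = train_block.get("wandb")
    if nested is not None:
        return nested  # corresponds to cfg.train.wandb
    return cfg.get("wandb")  # corresponds to cfg.wandb, or None if absent


nested_cfg = {"train": {"seed": 42, "wandb": {"project": "calt"}}}
separate_cfg = {"train": {"seed": 42}, "wandb": {"project": "calt"}}

print(resolve_wandb_config(nested_cfg))    # {'project': 'calt'}
print(resolve_wandb_config(separate_cfg))  # {'project': 'calt'}
```

Both layouts yield the same block, so downstream code only needs to handle one shape.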
data.yaml – dataset generation (DatasetPipeline)
Example tasks under calt/examples/* use configs/data.yaml to drive dataset generation
through DatasetPipeline. Typical usage:

    from omegaconf import OmegaConf
    from calt.dataset import DatasetPipeline

    # Load the dataset configuration.
    cfg = OmegaConf.load("configs/data.yaml")

    # my_instance_generator is the task-specific instance generator you define.
    pipeline = DatasetPipeline.from_config(
        cfg.dataset,
        instance_generator=my_instance_generator,
        statistics_calculator=None,
    )
    pipeline.run()
The dataset block in data.yaml controls all dataset-generation behaviour. Example:

    dataset:
      save_dir: "./data"
      num_train_samples: 100000
      num_test_samples: 1000
      batch_size: 10000
      n_jobs: 4
      root_seed: 42
      verbose: true
      backend: "sagemath"
      save_text: true
      save_json: false
dataset — Passed to DatasetPipeline.from_config
| Name | Description |
|---|---|
| save_dir | Base directory where all splits (train/test/…) are written. |
| num_train_samples | Number of training samples to generate (size of the "train" split). |
| num_test_samples | Number of test samples to generate (size of the "test" split). |
| batch_size | Batch size passed to DatasetGenerator.run for efficient multiprocessing. |
| n_jobs | Number of worker processes used by the generator (backend="multiprocessing"). |
| root_seed | Global seed used to derive per-sample seeds in the backend. |
| verbose | Whether the generator prints progress information. |
| backend | Which implementation to use: "sagemath" → SageMath backend, "sympy" → SymPy backend. |
| save_text | Whether to write human-readable .txt files (e.g. train_raw.txt, test_raw.txt). |
| save_json | Whether to write .jsonl files preserving the nested Python structure. |
Under the hood, DatasetPipeline resolves the appropriate DatasetGenerator for the chosen backend and uses the common DatasetWriter; both are configured from the same options.
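To make the interaction of num_train_samples, batch_size, and root_seed concrete, here is a minimal sketch of batching and per-sample seeding. It does not reproduce calt's DatasetGenerator: `iter_batches` and `sample_seed` are hypothetical helpers, and deriving a seed as `root_seed + sample_index` is an assumption chosen only for illustration.

```python
from typing import Iterator, List, Tuple


def iter_batches(num_samples: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, stop) index ranges covering num_samples in chunks of batch_size."""
    for start in range(0, num_samples, batch_size):
        yield start, min(start + batch_size, num_samples)


def sample_seed(root_seed: int, sample_index: int) -> int:
    """Derive a per-sample seed. Assumption: a simple offset from root_seed."""
    return root_seed + sample_index


# Values taken from the example data.yaml above.
batches: List[Tuple[int, int]] = list(iter_batches(num_samples=100_000, batch_size=10_000))
print(len(batches))        # 10
print(batches[-1])         # (90000, 100000)
print(sample_seed(42, 7))  # 49
```

Because each seed in this sketch depends only on the root seed and the sample index, the generated samples would not change with n_jobs or batch_size.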
train.yaml – model and training (ModelPipeline, TrainerPipeline)
Example tasks under calt/examples/* use configs/train.yaml as the main control file for experiments. It is loaded with OmegaConf.load and passed to IOPipeline, ModelPipeline, and TrainerPipeline. Typical usage:

    from omegaconf import OmegaConf
    from calt.io import IOPipeline
    from calt.models import ModelPipeline
    from calt.trainer import TrainerPipeline

    cfg = OmegaConf.load("configs/train.yaml")

    # Build the tokenizer, datasets, and collator from the data block.
    io_dict = IOPipeline.from_config(cfg.data).build()
    # Build the model and trainer from their own blocks.
    model = ModelPipeline.from_io_dict(cfg.model, io_dict).build()
    trainer_pipeline = TrainerPipeline.from_io_dict(cfg.train, model, io_dict).build()

    trainer_pipeline.train()
    success_rate = trainer_pipeline.evaluate_and_save_generation()
    print(f"Success rate: {100 * success_rate:.1f}%")
The top-level blocks in train.yaml are data, model, and train. WandB settings go either in a wandb block nested under train or in a separate top-level wandb block. Example (with wandb nested under train):

    data:
      train_dataset_path: ./data/train_raw.txt
      test_dataset_path: ./data/test_raw.txt
      lexer_config: ./configs/lexer.yaml
      num_train_samples: -1
      num_test_samples: -1
      validate_train_tokens: true
      validate_test_tokens: true
      display_samples: 5
    model:
      model_type: generic
      num_encoder_layers: 6
      num_encoder_heads: 8
      num_decoder_layers: 6
      num_decoder_heads: 8
      d_model: 512
      encoder_ffn_dim: 2048
      decoder_ffn_dim: 2048
      max_sequence_length: 512
    train:
      save_dir: ./results
      num_train_epochs: 100
      learning_rate: 0.0001
      weight_decay: 0.01
      warmup_ratio: 0.1
      batch_size: 16
      test_batch_size: 16
      seed: 42
      wandb:
        project: calt
        group: gf17_addition
        name: gf17_addition
data — Passed to IOPipeline.from_config
| Name | Description |
|---|---|
| lexer_config | Path to lexer.yaml (required). |
| train_dataset_path | Path to training raw text file. |
| test_dataset_path | Path to test raw text file. |
| num_train_samples | Number of samples to load for training (-1 = all). |
| num_test_samples | Number of samples to load for evaluation (-1 = all). |
| validate_train_tokens | Whether to validate that all training tokens are in vocab (default false). |
| validate_test_tokens | Whether to validate test tokens (default true). |
| display_samples | Number of sample lines to print when loading (0 to disable). |
| use_jsonl | If true, load from JSONL instead of raw text. |
| use_pickle | If true, load from pickle. |
| train_dataset_jsonl, test_dataset_jsonl | Paths when using JSONL. |
| train_dataset_pickle, test_dataset_pickle | Paths when using pickle. |
| dataset_load_preprocessor | Optional preprocessor for custom loading. |
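The sample-limit and token-validation options can be pictured with the following sketch. It is not IOPipeline's implementation: `limit_samples` and `unknown_tokens` are hypothetical helpers, and both the "first N lines" selection and the whitespace-separated token format are assumptions.

```python
from typing import List, Sequence, Set


def limit_samples(samples: Sequence[str], num_samples: int) -> List[str]:
    """Keep the first num_samples entries; -1 keeps everything."""
    if num_samples == -1:
        return list(samples)
    return list(samples[:num_samples])


def unknown_tokens(samples: Sequence[str], vocab: Set[str]) -> Set[str]:
    """Collect whitespace-separated tokens that are missing from vocab."""
    return {tok for line in samples for tok in line.split() if tok not in vocab}


lines = ["1 + 2", "3 + 4", "5 * x"]
vocab = {"1", "2", "3", "4", "5", "+", "*"}

print(len(limit_samples(lines, -1)))  # 3
print(limit_samples(lines, 2))        # ['1 + 2', '3 + 4']
print(unknown_tokens(lines, vocab))   # {'x'}
```

A validation step in this style would report an out-of-vocabulary token before training starts rather than partway through it.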
model — Passed to ModelPipeline.from_io_dict
| Name | Description |
|---|---|
| model_type | Model architecture (e.g. generic, bart). |
| num_encoder_layers, num_encoder_heads | Encoder depth and attention heads. |
| num_decoder_layers, num_decoder_heads | Decoder depth and attention heads. |
| d_model | Hidden size (embedding dimension). |
| encoder_ffn_dim, decoder_ffn_dim | Feed-forward dimension in encoder/decoder. |
| max_sequence_length | Maximum sequence length. |
train — Converted to TrainingArguments by TrainerPipeline
| Name | Description |
|---|---|
| save_dir | Output directory for checkpoints and logs. Passed to HuggingFace TrainingArguments.output_dir. If omitted, output_dir is used; if both are missing, defaults to "./tmp". |
| output_dir | Alias for save_dir. Used when save_dir is not set. |
| num_train_epochs | Number of training epochs. |
| learning_rate | Learning rate for the optimizer. |
| weight_decay | Weight decay (L2 penalty) coefficient. |
| warmup_ratio | Fraction of training steps used for a linear warmup from 0 to learning_rate (0 to 1). |
| batch_size | Total training batch size. With multiple GPUs, this is divided by the number of devices and passed as per_device_train_batch_size. |
| test_batch_size | Total evaluation batch size. Likewise divided by the number of devices and passed as per_device_eval_batch_size. |
| lr_scheduler_type | Learning rate schedule. Use "linear" or "constant". Defaults to "linear" if not set. |
| max_grad_norm | Maximum gradient norm for clipping. |
| optimizer | Optimizer name (passed to HuggingFace optim). |
| num_workers | Number of DataLoader worker processes (dataloader_num_workers). |
| dataloader_pin_memory | DataLoader pin_memory option. Defaults to true. |
| eval_strategy | When to run evaluation (e.g. "steps", "epoch"). Defaults to "steps". |
| eval_steps | Run evaluation every this many steps when eval_strategy is "steps". Defaults to 1000. |
| save_strategy | When to save checkpoints (e.g. "steps", "epoch"). Defaults to "steps". |
| save_steps | Save a checkpoint every this many steps when save_strategy is "steps". Defaults to 1000. |
| save_total_limit | Maximum number of checkpoints to keep; older ones are removed. Defaults to 1. |
| save_safetensors | If true, save model weights in safetensors format. Defaults to false. |
| label_names | List of label keys used by the Trainer. Defaults to ["labels"]. |
| logging_strategy | When to log (e.g. "steps", "epoch"). Defaults to "steps". |
| logging_steps | Log every this many steps when logging_strategy is "steps". Defaults to 50. |
| seed | Random seed for reproducibility. |
| remove_unused_columns | Whether to drop dataset columns not used by the model. Defaults to false. |
| disable_tqdm | Whether to disable the progress bar. Defaults to true. |
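Two of the rules above, the save_dir/output_dir fallback and the division of the batch size across devices, can be sketched as follows. These are hypothetical helpers rather than TrainerPipeline code; the rounding behaviour (integer division with a floor of one sample per device) is an assumption, since the table does not say what happens when the batch size is not divisible by the device count.

```python
from typing import Any, Mapping


def resolve_output_dir(train_cfg: Mapping[str, Any]) -> str:
    """save_dir wins, then output_dir, then the "./tmp" default."""
    return train_cfg.get("save_dir") or train_cfg.get("output_dir") or "./tmp"


def per_device_batch_size(batch_size: int, num_devices: int) -> int:
    """Split a batch size across devices.

    Assumption: integer division, at least one sample per device, and a
    single device assumed when none is reported.
    """
    return max(1, batch_size // max(1, num_devices))


print(resolve_output_dir({"save_dir": "./results"}))  # ./results
print(resolve_output_dir({"output_dir": "./out"}))    # ./out
print(resolve_output_dir({}))                         # ./tmp
print(per_device_batch_size(16, 4))                   # 4
```

With the example train.yaml (batch_size: 16), four GPUs would each receive a value of 4 under these assumptions.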
train.wandb (or top-level wandb) — Used by TrainerPipeline for Weights & Biases
| Name | Description |
|---|---|
| project | Project name on WandB. |
| group | Logical experiment group (e.g. task name). |
| name | Run identifier shown in the UI. |
| tags | Optional list of tags. |
| no_wandb | If true, disable WandB logging. |
The trainer pipeline sets the corresponding environment variables and ensures TrainingArguments.report_to includes "wandb" when WandB is configured.
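As a rough illustration of that mapping, the sketch below turns a wandb block into environment variables and a report_to list. The `wandb_settings` helper is hypothetical. The variable names (WANDB_PROJECT, WANDB_RUN_GROUP, WANDB_NAME, WANDB_TAGS) are ones the wandb client recognises, but which of them TrainerPipeline actually sets is an assumption here, as is returning an empty report_to list when no_wandb is true.

```python
from typing import Any, Dict, List, Mapping, Tuple


def wandb_settings(wandb_cfg: Mapping[str, Any]) -> Tuple[Dict[str, str], List[str]]:
    """Return (environment variables, report_to list) for a wandb block.

    Assumption: keys map one-to-one onto the wandb client's WANDB_* variables.
    """
    if wandb_cfg.get("no_wandb", False):
        return {}, []  # logging disabled: nothing to export, nothing to report to

    env: Dict[str, str] = {}
    if "project" in wandb_cfg:
        env["WANDB_PROJECT"] = str(wandb_cfg["project"])
    if "group" in wandb_cfg:
        env["WANDB_RUN_GROUP"] = str(wandb_cfg["group"])
    if "name" in wandb_cfg:
        env["WANDB_NAME"] = str(wandb_cfg["name"])
    if wandb_cfg.get("tags"):
        # wandb reads tags as a comma-separated string
        env["WANDB_TAGS"] = ",".join(str(tag) for tag in wandb_cfg["tags"])
    return env, ["wandb"]


env, report_to = wandb_settings(
    {"project": "calt", "group": "gf17_addition", "name": "gf17_addition"}
)
print(env["WANDB_PROJECT"])  # calt
print(report_to)             # ['wandb']
```

To apply the result, one would call `os.environ.update(env)` before constructing the trainer.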
lexer.yaml – IO and vocabulary (IOPipeline)
The data block in train.yaml points to lexer.yaml via data.lexer_config. That file controls tokenisation and vocabulary and is loaded by IOPipeline.from_config. Typical usage:

    from omegaconf import OmegaConf
    from calt.io import IOPipeline

    cfg = OmegaConf.load("configs/train.yaml")
    io_pipeline = IOPipeline.from_config(cfg.data)  # loads lexer_config from cfg.data.lexer_config
The top-level keys in lexer.yaml control vocabulary and number tokenisation. Example:

    vocab:
      range:
        numbers: ["", 0, 16]
      misc: ["+", "*", "^", "(", ")", "|", "-"]
      special_tokens: {}
      flags:
        include_base_vocab: true
        include_base_special_tokens: true
    number_policy:
      attach_sign: true
      digit_group: 0
      allow_float: false
    strict: true
    include_base_vocab: true
vocab — Passed to VocabConfig.from_config
| Name | Description |
|---|---|
| range | Dict of arbitrary key → [prefix, min, max] (inclusive). Each entry expands to tokens prefix+str(i) for i in min..max. For example, numbers: ["", 0, 16], coefficients: ["C", -50, 50], exponents: ["E", 0, 20], variables: ["x", 0, 2]. |
| misc | List of extra tokens (e.g. ["+", "=", ","]). |
| special_tokens | Dict of special token names. Base special tokens are defined in code. |
| flags | Optional. include_base_vocab, include_base_special_tokens (both default true). |
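The range expansion rule (prefix + str(i) for every i from min to max, inclusive) can be written out directly. The `expand_ranges` helper below is a hypothetical stand-in for illustration, not VocabConfig itself.

```python
from typing import Dict, List, Sequence


def expand_ranges(ranges: Dict[str, Sequence]) -> List[str]:
    """Expand each [prefix, min, max] entry into prefix+str(i), bounds inclusive."""
    tokens: List[str] = []
    for prefix, low, high in ranges.values():
        tokens.extend(f"{prefix}{i}" for i in range(low, high + 1))
    return tokens


numbers = expand_ranges({"numbers": ["", 0, 16]})
print(len(numbers))             # 17
print(numbers[0], numbers[-1])  # 0 16

coefficients = expand_ranges({"coefficients": ["C", -50, 50]})
print(len(coefficients))                  # 101
print(coefficients[0], coefficients[-1])  # C-50 C50
```

Note that an inclusive range from 0 to 16 yields 17 tokens, which matches arithmetic over GF(17) in the example configuration.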
number_policy — Builds NumberPolicy for UnifiedLexer
| Name | Description |
|---|---|
| attach_sign | bool (default true). true = sign is part of the number token; false = sign is a separate token. |
| digit_group | int (default 0). 0 = no digit grouping; d ≥ 1 = split number into tokens of d digits. |
| allow_float | bool (default true). Whether to allow decimal numbers (adds "." to vocab if needed). |
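A small sketch may help show how attach_sign and digit_group combine for integers. It is not NumberPolicy's implementation: `tokenize_integer` is hypothetical, and two details are assumptions, namely that digits are grouped from left to right and that an attached sign joins the first group.

```python
from typing import List


def tokenize_integer(text: str, attach_sign: bool = True, digit_group: int = 0) -> List[str]:
    """Split an integer literal into tokens according to the two policy options."""
    sign, digits = "", text
    if text[:1] in ("+", "-"):
        sign, digits = text[0], text[1:]
    if not digits.isdigit():
        raise ValueError(f"not an integer literal: {text!r}")

    if digit_group >= 1:
        # Assumption: chunks of digit_group digits, taken from the left.
        chunks = [digits[i:i + digit_group] for i in range(0, len(digits), digit_group)]
    else:
        chunks = [digits]  # 0 means no grouping

    if not sign:
        return chunks
    if attach_sign:
        chunks[0] = sign + chunks[0]  # assumption: the sign joins the first chunk
        return chunks
    return [sign] + chunks


print(tokenize_integer("-12345"))                     # ['-12345']
print(tokenize_integer("-12345", attach_sign=False))  # ['-', '12345']
print(tokenize_integer("-12345", digit_group=2))      # ['-12', '34', '5']
```

Grouping keeps the vocabulary small for large numbers at the cost of longer token sequences.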
strict, include_base_vocab — Passed to UnifiedLexer
| Name | Description |
|---|---|
| strict | bool (default true). If true, raise an error on unknown characters; if false, emit the unknown token (e.g. <unk>) and continue. |
| include_base_vocab | bool (default true). If true, add built-in tokens (separators, operators +, -, *, brackets, etc.) to the lexer’s reserved set; if false, only tokens from vocab are used. |
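The difference between strict and non-strict handling can be illustrated with a deliberately simplified, character-level lexer. `lex_chars` is a hypothetical helper: UnifiedLexer handles multi-character tokens and number policies, which this sketch omits.

```python
from typing import List, Set


def lex_chars(text: str, vocab: Set[str], strict: bool = True,
              unk_token: str = "<unk>") -> List[str]:
    """Tokenise text one character at a time, skipping whitespace."""
    tokens: List[str] = []
    for ch in text:
        if ch.isspace():
            continue
        if ch in vocab:
            tokens.append(ch)
        elif strict:
            raise ValueError(f"unknown character: {ch!r}")
        else:
            tokens.append(unk_token)  # non-strict: record the gap and continue
    return tokens


vocab = {"1", "2", "+"}
print(lex_chars("1 + 2", vocab))                # ['1', '+', '2']
print(lex_chars("1 ? 2", vocab, strict=False))  # ['1', '<unk>', '2']

try:
    lex_chars("1 ? 2", vocab)  # strict is the default
except ValueError as err:
    print(err)                 # unknown character: '?'
```

Strict mode surfaces vocabulary gaps immediately, while non-strict mode trades that safety for the ability to process imperfect data.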
Under the hood, IOPipeline instantiates UnifiedLexer and VocabConfig from this configuration, then builds a HuggingFace-compatible tokenizer, tokenised datasets, and a StandardDataCollator. See Lexer and vocabulary for the API of UnifiedLexer, NumberPolicy, and VocabConfig.