Configuration

Example tasks in calt/examples/* share a common configuration pattern based on three YAML files:

  • configs/data.yaml – controls dataset generation via DatasetPipeline.
  • configs/train.yaml – main control file for the model, training loop, and WandB logging via ModelPipeline and TrainerPipeline (references lexer.yaml in its data block).
  • configs/lexer.yaml – controls tokenisation and vocabulary via IOPipeline; path set in train.yaml’s data.lexer_config.

All three are loaded with OmegaConf.load and passed around as omegaconf.DictConfig objects, so they support dot-style access (e.g. cfg.data, cfg.model, cfg.train). WandB configuration can be included in cfg.train.wandb or passed separately as cfg.wandb.

data.yaml – dataset generation (DatasetPipeline)

Example tasks under calt/examples/* use configs/data.yaml to drive dataset generation through DatasetPipeline. Typical usage:

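from omegaconf import OmegaConf
from calt.dataset import DatasetPipeline

# Load the dataset-generation settings.
cfg = OmegaConf.load("configs/data.yaml")

# my_instance_generator is a user-defined callable supplied by the task (not shown here).
pipeline = DatasetPipeline.from_config(
    cfg.dataset,
    instance_generator=my_instance_generator,
    statistics_calculator=None,  # no statistics calculator in this example
)
pipeline.run()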

The dataset block in data.yaml controls all dataset-generation behaviour. Example:

dataset:
  save_dir: "./data"
  num_train_samples: 100000
  num_test_samples: 1000
  batch_size: 10000
  n_jobs: 4
  root_seed: 42
  verbose: true
  backend: "sagemath"
  save_text: true
  save_json: false

dataset – passed to DatasetPipeline.from_config

  • save_dir – Base directory where all splits (train/test/…) are written.
  • num_train_samples – Number of training samples to generate (size of the "train" split).
  • num_test_samples – Number of test samples to generate (size of the "test" split).
  • batch_size – Batch size passed to DatasetGenerator.run for efficient multiprocessing.
  • n_jobs – Number of worker processes used by the generator (backend="multiprocessing").
  • root_seed – Global seed used to derive per-sample seeds in the backend.
  • verbose – Whether the generator prints progress information.
  • backend – Which implementation to use: "sagemath" (SageMath backend) or "sympy" (SymPy backend).
  • save_text – Whether to write human-readable .txt files (text_raw.txt, etc.).
  • save_json – Whether to write .jsonl files preserving the nested Python structure.

Under the hood, DatasetPipeline resolves the appropriate DatasetGenerator for the chosen backend and uses the common DatasetWriter; both are configured from the same options.
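
To make the interplay of num_train_samples, batch_size, and root_seed concrete, here is a minimal sketch of how a generator could cut a split into batches and derive one reproducible seed per sample. The helper names (iter_batches, derive_sample_seed) and the SHA-256 derivation are assumptions made for illustration; the actual backends may do this differently.

```python
import hashlib
from typing import Iterator


def iter_batches(num_samples: int, batch_size: int) -> Iterator[range]:
    """Yield index ranges of at most batch_size samples (hypothetical helper)."""
    for start in range(0, num_samples, batch_size):
        yield range(start, min(start + batch_size, num_samples))


def derive_sample_seed(root_seed: int, split: str, index: int) -> int:
    """Derive a deterministic 32-bit seed from (root_seed, split, index).

    Assumption: hashing is used only for illustration; the real backend
    may derive per-sample seeds in another way.
    """
    digest = hashlib.sha256(f"{root_seed}:{split}:{index}".encode()).digest()
    return int.from_bytes(digest[:4], "big")


# Values taken from the example data.yaml above.
batches = list(iter_batches(num_samples=100_000, batch_size=10_000))
print(len(batches))  # 10
```

Because each seed depends only on the root seed, the split name, and the sample index, the same sample is reproduced regardless of which worker process generates it.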

train.yaml – model and training (ModelPipeline, TrainerPipeline)

Example tasks under calt/examples/* use configs/train.yaml as the main control file for experiments. It is loaded with OmegaConf.load and passed to IOPipeline, ModelPipeline, and TrainerPipeline. Typical usage:

from omegaconf import OmegaConf
from calt.io import IOPipeline
from calt.models import ModelPipeline
from calt.trainer import TrainerPipeline

cfg = OmegaConf.load("configs/train.yaml")

io_dict = IOPipeline.from_config(cfg.data).build()
model = ModelPipeline.from_io_dict(cfg.model, io_dict).build()
trainer_pipeline = TrainerPipeline.from_io_dict(cfg.train, model, io_dict).build()

trainer_pipeline.train()
success_rate = trainer_pipeline.evaluate_and_save_generation()
print(f"Success rate: {100 * success_rate:.1f}%")

The top-level blocks in train.yaml are data, model, and train, plus an optional wandb block that can either be nested under train or given as a separate top-level block. Example:

data:
  train_dataset_path: ./data/train_raw.txt
  test_dataset_path: ./data/test_raw.txt
  lexer_config: ./configs/lexer.yaml
  num_train_samples: -1
  num_test_samples: -1
  validate_train_tokens: true
  validate_test_tokens: true
  display_samples: 5

model:
  model_type: generic
  num_encoder_layers: 6
  num_encoder_heads: 8
  num_decoder_layers: 6
  num_decoder_heads: 8
  d_model: 512
  encoder_ffn_dim: 2048
  decoder_ffn_dim: 2048
  max_sequence_length: 512

train:
  save_dir: ./results
  num_train_epochs: 100
  learning_rate: 0.0001
  weight_decay: 0.01
  warmup_ratio: 0.1
  batch_size: 16
  test_batch_size: 16
  seed: 42
  wandb:
    project: calt
    group: gf17_addition
    name: gf17_addition

data – passed to IOPipeline.from_config

  • lexer_config – Path to lexer.yaml (required).
  • train_dataset_path – Path to training raw text file.
  • test_dataset_path – Path to test raw text file.
  • num_train_samples – Number of samples to load for training (-1 = all).
  • num_test_samples – Number of samples to load for evaluation (-1 = all).
  • validate_train_tokens – Whether to validate that all training tokens are in vocab (default false).
  • validate_test_tokens – Whether to validate test tokens (default true).
  • display_samples – Number of sample lines to print when loading (0 to disable).
  • use_jsonl – If true, load from JSONL instead of raw text.
  • use_pickle – If true, load from pickle.
  • train_dataset_jsonl, test_dataset_jsonl – Paths when using JSONL.
  • train_dataset_pickle, test_dataset_pickle – Paths when using pickle.
  • dataset_load_preprocessor – Optional preprocessor for custom loading.
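
The -1 convention for num_train_samples and num_test_samples can be sketched as follows. take_samples is a hypothetical helper, and the sketch assumes one sample per line and that the first N samples are kept; it is not the IOPipeline implementation.

```python
from typing import List


def take_samples(lines: List[str], num_samples: int) -> List[str]:
    """Return the first num_samples lines, or all of them when num_samples is -1."""
    if num_samples == -1:
        return list(lines)
    if num_samples < 0:
        raise ValueError("num_samples must be -1 (all) or a non-negative integer")
    return lines[:num_samples]


raw = ["sample a", "sample b", "sample c"]
print(take_samples(raw, -1))  # all three samples
print(take_samples(raw, 2))   # the first two samples
```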

model – passed to ModelPipeline.from_io_dict

  • model_type – Model architecture (e.g. generic, bart).
  • num_encoder_layers, num_encoder_heads – Encoder depth and attention heads.
  • num_decoder_layers, num_decoder_heads – Decoder depth and attention heads.
  • d_model – Hidden size (embedding dimension).
  • encoder_ffn_dim, decoder_ffn_dim – Feed-forward dimension in encoder/decoder.
  • max_sequence_length – Maximum sequence length.

train – converted to TrainingArguments by TrainerPipeline

  • save_dir – Output directory for checkpoints and logs. Passed to HuggingFace TrainingArguments.output_dir. If omitted, output_dir is used; if both are missing, defaults to "./tmp".
  • output_dir – Alias for save_dir. Used when save_dir is not set.
  • num_train_epochs – Number of training epochs.
  • learning_rate – Learning rate for the optimizer.
  • weight_decay – Weight decay (L2 penalty) coefficient.
  • warmup_ratio – Fraction of training steps (between 0 and 1) used for a linear warmup from 0 to learning_rate.
  • batch_size – Total training batch size. With multiple GPUs, this is divided by the number of devices and passed as per_device_train_batch_size.
  • test_batch_size – Evaluation batch size. Handled the same way and passed as per_device_eval_batch_size.
  • lr_scheduler_type – Learning rate schedule. Use "linear" or "constant". Defaults to "linear" if not set.
  • max_grad_norm – Maximum gradient norm for clipping.
  • optimizer – Optimizer name (passed to HuggingFace optim).
  • num_workers – Number of DataLoader worker processes (dataloader_num_workers).
  • dataloader_pin_memory – DataLoader pin_memory option. Defaults to true.
  • eval_strategy – When to run evaluation (e.g. "steps", "epoch"). Defaults to "steps".
  • eval_steps – Run evaluation every this many steps when eval_strategy is "steps". Defaults to 1000.
  • save_strategy – When to save checkpoints (e.g. "steps", "epoch"). Defaults to "steps".
  • save_steps – Save a checkpoint every this many steps when save_strategy is "steps". Defaults to 1000.
  • save_total_limit – Maximum number of checkpoints to keep; older ones are removed. Defaults to 1.
  • save_safetensors – If true, save model weights in safetensors format. Defaults to false.
  • label_names – List of label keys used by the Trainer. Defaults to ["labels"].
  • logging_strategy – When to log (e.g. "steps", "epoch"). Defaults to "steps".
  • logging_steps – Log every this many steps when logging_strategy is "steps". Defaults to 50.
  • seed – Random seed for reproducibility.
  • remove_unused_columns – Whether to drop dataset columns not used by the model. Defaults to false.
  • disable_tqdm – Whether to disable the progress bar. Defaults to true.
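
Two of the rules above, the save_dir / output_dir fallback and the division of batch_size across devices, can be sketched in a few lines. The function names are hypothetical, and the rounding behaviour (floor division with a minimum of 1) is an assumption rather than something the configuration reference specifies.

```python
from typing import Any, Mapping


def resolve_output_dir(train_cfg: Mapping[str, Any]) -> str:
    """Pick save_dir, then output_dir, then "./tmp" (missing or empty values count as unset)."""
    return train_cfg.get("save_dir") or train_cfg.get("output_dir") or "./tmp"


def per_device_batch_size(batch_size: int, num_devices: int) -> int:
    """Split a configured batch size across devices.

    Assumption: floor division with a minimum of 1 per device.
    """
    return max(1, batch_size // max(1, num_devices))


train_cfg = {"output_dir": "./runs/exp1", "batch_size": 16}
print(resolve_output_dir(train_cfg))                      # ./runs/exp1
print(per_device_batch_size(train_cfg["batch_size"], 4))  # 4
```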

train.wandb (or top-level wandb) – used by TrainerPipeline for Weights & Biases

  • project – Project name on WandB.
  • group – Logical experiment group (e.g. task name).
  • name – Run identifier shown in the UI.
  • tags – Optional list of tags.
  • no_wandb – If true, disable WandB logging.

The trainer pipeline sets the corresponding environment variables and ensures TrainingArguments.report_to includes "wandb" when WandB is configured.
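
As a rough illustration of that step, the sketch below picks the WandB block from either location and maps its keys to environment variables. WANDB_PROJECT, WANDB_RUN_GROUP, WANDB_NAME, and WANDB_TAGS are standard Weights & Biases variables, but which of them TrainerPipeline actually sets, and which location takes precedence when both are present, are assumptions here. The helper names are hypothetical.

```python
from typing import Any, Dict, Mapping, Optional

# Config key -> W&B environment variable (assumed mapping, for illustration).
_ENV_KEYS = {"project": "WANDB_PROJECT", "group": "WANDB_RUN_GROUP", "name": "WANDB_NAME"}


def select_wandb_cfg(cfg: Mapping[str, Any]) -> Optional[Mapping[str, Any]]:
    """Prefer train.wandb, fall back to top-level wandb (precedence is an assumption)."""
    nested = cfg.get("train", {}).get("wandb")
    return nested if nested is not None else cfg.get("wandb")


def wandb_env(wandb_cfg: Optional[Mapping[str, Any]]) -> Dict[str, str]:
    """Return the environment variables to set; an empty dict means logging is off."""
    if not wandb_cfg or wandb_cfg.get("no_wandb", False):
        return {}
    env = {var: str(wandb_cfg[key]) for key, var in _ENV_KEYS.items() if key in wandb_cfg}
    if wandb_cfg.get("tags"):
        env["WANDB_TAGS"] = ",".join(wandb_cfg["tags"])  # W&B expects a comma-separated list
    return env


cfg = {"train": {"wandb": {"project": "calt", "group": "gf17_addition", "name": "gf17_addition"}}}
env = wandb_env(select_wandb_cfg(cfg))
report_to = ["wandb"] if env else []
print(env)
print(report_to)
```

Returning a plain dict instead of writing to os.environ directly keeps the sketch free of side effects; a caller could apply it with os.environ.update(env).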

lexer.yaml – IO and vocabulary (IOPipeline)

The data block in train.yaml points to lexer.yaml via data.lexer_config. That file controls tokenisation and vocabulary and is loaded by IOPipeline.from_config. Typical usage:

from omegaconf import OmegaConf
from calt.io import IOPipeline

cfg = OmegaConf.load("configs/train.yaml")
io_pipeline = IOPipeline.from_config(cfg.data)  # loads lexer_config from cfg.data.lexer_config

The top-level keys in lexer.yaml control vocabulary and number tokenisation. Example:

vocab:
  range:
    numbers: ["", 0, 16]
  misc: ["+", "*", "^", "(", ")", "|", "-"]
  special_tokens: {}
  flags:
    include_base_vocab: true
    include_base_special_tokens: true

number_policy:
  attach_sign: true
  digit_group: 0
  allow_float: false

strict: true
include_base_vocab: true

vocab – passed to VocabConfig.from_config

  • range – Dict mapping an arbitrary key to [prefix, min, max] (bounds inclusive). Each entry expands to tokens prefix+str(i) for i in min..max. For example, numbers: ["", 0, 16], coefficients: ["C", -50, 50], exponents: ["E", 0, 20], variables: ["x", 0, 2].
  • misc – List of extra tokens (e.g. ["+", "=", ","]).
  • special_tokens – Dict of special token names. Base special tokens are defined in code.
  • flags – Optional. include_base_vocab, include_base_special_tokens (both default true).
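
The expansion rule for range entries is simple enough to write down directly. The sketch below follows the description above (prefix + str(i) for every i from min to max inclusive); expand_range is a hypothetical name, not a function exported by the library.

```python
from typing import List, Sequence, Union

RangeSpec = Sequence[Union[str, int]]  # [prefix, min, max]


def expand_range(spec: RangeSpec) -> List[str]:
    """Expand [prefix, min, max] into prefix+str(i) for i in min..max (inclusive)."""
    prefix, lo, hi = spec
    return [f"{prefix}{i}" for i in range(int(lo), int(hi) + 1)]


ranges = {"numbers": ["", 0, 16], "variables": ["x", 0, 2]}
tokens = [tok for spec in ranges.values() for tok in expand_range(spec)]
print(tokens[:3], tokens[-3:])  # ['0', '1', '2'] ['x0', 'x1', 'x2']
```

With the example lexer.yaml, numbers: ["", 0, 16] therefore contributes the 17 tokens "0" through "16".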

number_policy – builds NumberPolicy for UnifiedLexer

  • attach_sign – bool (default true). true = sign is part of the number token; false = sign is a separate token.
  • digit_group – int (default 0). 0 = no digit grouping; d ≥ 1 = split number into tokens of d digits.
  • allow_float – bool (default true). Whether to allow decimal numbers (adds "." to vocab if needed).
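
The effect of attach_sign and digit_group on a single integer literal can be pictured with a small standalone function. This is only a sketch: split_number is a hypothetical name, and two details are assumptions not stated above, namely that digits are grouped from the left (so the last chunk may be shorter) and that an attached sign is glued to the first chunk.

```python
from typing import List


def split_number(text: str, attach_sign: bool = True, digit_group: int = 0) -> List[str]:
    """Tokenise one integer literal such as "-12345" (illustrative only)."""
    if text and text[0] in "+-":
        sign, digits = text[0], text[1:]
    else:
        sign, digits = "", text
    if not digits.isdigit():
        raise ValueError(f"not an integer literal: {text!r}")

    if digit_group <= 0:
        chunks = [digits]  # 0 = no digit grouping
    else:
        # Assumption: chunks are cut from the left; the last one may be shorter.
        chunks = [digits[i:i + digit_group] for i in range(0, len(digits), digit_group)]

    if not sign:
        return chunks
    if attach_sign:
        # Assumption: the sign is attached to the first chunk.
        return [sign + chunks[0]] + chunks[1:]
    return [sign] + chunks


print(split_number("-12345"))                                   # ['-12345']
print(split_number("-12345", attach_sign=False))                # ['-', '12345']
print(split_number("-12345", attach_sign=False, digit_group=3))  # ['-', '123', '45']
```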

strict, include_base_vocab – passed to UnifiedLexer

  • strict – bool (default true). If true, raise an error on unknown characters; if false, emit the unknown token (e.g. <unk>) and continue.
  • include_base_vocab – bool (default true). If true, add built-in tokens (separators, operators +, -, *, brackets, etc.) to the lexer’s reserved set; if false, only tokens from vocab are used.
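
The difference between strict and non-strict handling can be illustrated with a simplified stand-in. The sketch below works on pre-split tokens rather than raw characters, and check_tokens is a hypothetical helper, so it shows the behaviour described above rather than how UnifiedLexer is implemented.

```python
from typing import Iterable, List, Set


def check_tokens(
    tokens: Iterable[str],
    vocab: Set[str],
    strict: bool = True,
    unk_token: str = "<unk>",
) -> List[str]:
    """Pass known tokens through; raise on unknown ones (strict) or replace them."""
    out = []
    for tok in tokens:
        if tok in vocab:
            out.append(tok)
        elif strict:
            raise ValueError(f"unknown token: {tok!r}")
        else:
            out.append(unk_token)
    return out


vocab = {"1", "2", "+"}
print(check_tokens(["1", "+", "?"], vocab, strict=False))  # ['1', '+', '<unk>']
```

With strict=True the same input raises a ValueError, which is usually preferable while a vocabulary is still being designed, since silent <unk> tokens can hide configuration mistakes.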

Under the hood, IOPipeline instantiates UnifiedLexer and VocabConfig from this configuration, then builds a HuggingFace-compatible tokenizer, tokenised datasets, and a StandardDataCollator. See Lexer and vocabulary for the API of UnifiedLexer, NumberPolicy, and VocabConfig.