Overview¶
A convenient extension of the HuggingFace Trainer and utility helpers for training and
evaluation. It streamlines device placement, metrics computation, and generation result
saving.
- Model pipeline: builds the model from configuration; cfg.model and supported model types.
- Configuration: data.yaml, lexer.yaml, and train.yaml.
Trainer ¶
Trainer(*args, **kwargs)
Bases: Trainer
Extension of the HuggingFace transformers.Trainer class.
The trainer adds task-specific helpers that simplify training generative
Transformer models. It accepts all the usual HuggingFace Trainer keyword arguments
and introduces no new parameters, so the constructor arguments are forwarded verbatim.
Source code in src/calt/trainer/trainer.py
evaluate ¶
evaluate(eval_dataset=None, ignore_keys=None, metric_key_prefix='eval')
Override evaluate to also save generation results during training.
This method is called during training evaluation steps and after training. It runs the standard evaluation and then saves generation results.
Source code in src/calt/trainer/trainer.py
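The override pattern described above can be sketched without the real library. This is a minimal illustration, not calt's actual implementation: `BaseTrainer` is a hypothetical stand-in for `transformers.Trainer`, the metric values are placeholders, and merging the accuracy into the returned metrics under the key `eval_generation_accuracy` is an assumption.

```python
class BaseTrainer:
    """Hypothetical stand-in for transformers.Trainer (not the real class)."""

    def evaluate(self, eval_dataset=None, ignore_keys=None, metric_key_prefix="eval"):
        # The real Trainer would run its evaluation loop here.
        return {f"{metric_key_prefix}_loss": 0.25}


class GenerationSavingTrainer(BaseTrainer):
    """Runs the standard evaluation, then saves generation results."""

    def __init__(self):
        self.saved_steps = []  # records each generation-saving call
        self.global_step = 0   # hypothetical step counter

    def evaluate_and_save_generation(self, max_length=512, step=None):
        # Placeholder: the real helper generates, decodes, and writes a file.
        self.saved_steps.append(self.global_step if step is None else step)
        return 1.0

    def evaluate(self, eval_dataset=None, ignore_keys=None, metric_key_prefix="eval"):
        # 1) Standard evaluation, unchanged.
        metrics = super().evaluate(eval_dataset, ignore_keys, metric_key_prefix)
        # 2) Extra generation pass (assumed here to be merged into the metrics).
        accuracy = self.evaluate_and_save_generation(step=self.global_step)
        metrics[f"{metric_key_prefix}_generation_accuracy"] = accuracy
        return metrics


trainer = GenerationSavingTrainer()
trainer.global_step = 100
metrics = trainer.evaluate()
```

Calling `super().evaluate(...)` first keeps the standard metrics intact, so callbacks and logging that depend on them are unaffected by the extra generation pass.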
evaluate_and_save_generation ¶
evaluate_and_save_generation(max_length: int = 512, step: int | None = None)
Run greedy/beam-search generation on the evaluation set.
The helper decodes the model outputs into strings, stores the results in
eval_results.json inside the trainer's output directory and finally computes
exact-match accuracy between the generated and reference sequences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| max_length | int | Maximum generation length. Defaults to 512. | 512 |
| step | int \| None | Current training step number. If None, tries to get from self.state. | None |
Returns:
| Type | Description |
|---|---|
| float | Exact-match accuracy in the [0, 1] interval. |
Source code in src/calt/trainer/trainer.py
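The post-generation steps described above (store decoded results, then compute exact-match accuracy) can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the helper names, the record layout inside `eval_results.json`, and the whitespace stripping before comparison are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def exact_match_accuracy(generated, references):
    """Fraction of pairs whose generated string equals the reference."""
    if len(generated) != len(references):
        raise ValueError("generated and references must have the same length")
    if not references:
        return 0.0
    # Assumption: surrounding whitespace is ignored when comparing.
    matches = sum(g.strip() == r.strip() for g, r in zip(generated, references))
    return matches / len(references)


def save_generation_results(output_dir, generated, references, step=None):
    """Write the decoded pairs to eval_results.json and return the accuracy."""
    # Hypothetical record layout; the real file's schema may differ.
    records = [
        {"generated": g, "reference": r, "step": step}
        for g, r in zip(generated, references)
    ]
    out_path = Path(output_dir) / "eval_results.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(records, indent=2), encoding="utf-8")
    return exact_match_accuracy(generated, references)


# Usage: one of the two generated strings matches its reference.
with tempfile.TemporaryDirectory() as tmp:
    acc = save_generation_results(tmp, ["x + 1", "2*y"], ["x + 1", "3*y"], step=10)
```

Exact match is a strict metric: two strings that denote the same expression but differ in formatting count as a miss.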
TrainerPipeline¶
The main entry point for building a trainer from config is TrainerPipeline. Use TrainerPipeline.from_io_dict(cfg.train, model, io_dict) then .build() to obtain a pipeline; call .train() to run training and .evaluate_and_save_generation() for evaluation.
Pipeline for creating trainers from configuration.
Similar to IOPipeline, this class provides a simple interface for creating trainer instances from config files. It automatically selects the appropriate TrainerLoader based on the config.
Examples:
>>> from omegaconf import OmegaConf
>>> from calt.trainer import TrainerPipeline
>>>
>>> cfg = OmegaConf.load("config/train.yaml")
>>> model = ... # Get model from ModelPipeline
>>> tokenizer = ... # Get tokenizer from IOPipeline
>>> train_dataset = ... # Get from IOPipeline
>>> eval_dataset = ... # Get from IOPipeline
>>> data_collator = ... # Get from IOPipeline
>>>
>>> trainer_pipeline = TrainerPipeline(
...     cfg.train,
...     model=model,
...     tokenizer=tokenizer,
...     train_dataset=train_dataset,
...     eval_dataset=eval_dataset,
...     data_collator=data_collator,
... )
>>> trainer = trainer_pipeline.build()
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | DictConfig | Training configuration from cfg.train (OmegaConf). | required |
| model | PreTrainedModel \| None | Model instance. | None |
| tokenizer | PreTrainedTokenizerFast \| None | Tokenizer instance. | None |
| train_dataset | Dataset \| None | Training dataset. | None |
| eval_dataset | Dataset \| None | Evaluation dataset. | None |
| data_collator | StandardDataCollator \| None | Data collator. | None |
Source code in src/calt/trainer/pipeline.py
build ¶
build() -> TrainerPipeline
Build the trainer from configuration.
Returns:
| Type | Description |
|---|---|
| TrainerPipeline | Returns self for method chaining. |
Source code in src/calt/trainer/pipeline.py
train ¶
train(resume_from_checkpoint: str | bool | None = None) -> None
Train the model.
This method calls trainer.train(). If resume_from_checkpoint is provided, training will resume from the specified checkpoint.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| resume_from_checkpoint | str \| bool \| None | If True, resume from the latest checkpoint in output_dir. If str, resume from the specified checkpoint path. If None, start training from scratch. | None |
Source code in src/calt/trainer/pipeline.py
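The three accepted values of `resume_from_checkpoint` can be illustrated with a small resolver. This is a sketch only: `resolve_checkpoint` is a hypothetical helper, it assumes checkpoints follow the HuggingFace `checkpoint-<step>` directory naming, and treating `False` like `None` as well as raising `FileNotFoundError` when nothing is found are choices made for this example rather than documented behavior.

```python
import re
from pathlib import Path

_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def resolve_checkpoint(resume_from_checkpoint, output_dir):
    """Map the accepted values onto a checkpoint path, or None for a fresh run."""
    if resume_from_checkpoint is None or resume_from_checkpoint is False:
        return None  # start training from scratch
    if isinstance(resume_from_checkpoint, str):
        return resume_from_checkpoint  # explicit path, used as given
    # True: pick the checkpoint-<step> directory with the highest step number.
    candidates = []
    for child in Path(output_dir).iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        if child.is_dir() and match:
            candidates.append((int(match.group(1)), child))
    if not candidates:
        raise FileNotFoundError(f"no checkpoint-<step> directory in {output_dir}")
    latest_step, latest_dir = max(candidates, key=lambda item: item[0])
    return str(latest_dir)
```

Comparing the step numbers as integers matters: a plain string sort would rank `checkpoint-900` above `checkpoint-1000`.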
save_model ¶
save_model(output_dir: str | None = None) -> None
Save the model and tokenizer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| output_dir | str \| None | Directory to save the model and tokenizer. If None, uses trainer's output_dir. | None |
Source code in src/calt/trainer/pipeline.py
evaluate_and_save_generation ¶
evaluate_and_save_generation(max_length: int = 512) -> float
Evaluate and save generation results.
Source code in src/calt/trainer/pipeline.py
from_io_dict classmethod ¶
from_io_dict(
    config: DictConfig,
    model: PreTrainedModel,
    io_dict: dict,
    wandb_config: Optional[DictConfig] = None,
) -> TrainerPipeline
Create a TrainerPipeline from a dict returned by IOPipeline.build().
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | DictConfig | Training configuration (cfg.train). May contain wandb config as config.wandb. | required |
| model | PreTrainedModel | Model instance from ModelPipeline. | required |
| io_dict | dict | IOPipeline.build() result. | required |
| wandb_config | Optional[DictConfig] | Optional wandb configuration block. If None, tries to get from config.wandb. | None |
Source code in src/calt/trainer/pipeline.py
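The `wandb_config` fallback described in the table can be pictured as below. This is a minimal sketch, not the library's implementation: `resolve_wandb_config` is a hypothetical helper, plain dicts stand in for OmegaConf's `DictConfig`, and the `project` key is only an illustrative value.

```python
def resolve_wandb_config(config, wandb_config=None):
    """Return the explicit wandb block, else config's 'wandb' entry, else None."""
    if wandb_config is not None:
        return wandb_config  # an explicit argument takes precedence
    # Plain dicts stand in for DictConfig here; both provide .get().
    return config.get("wandb")


# Hypothetical training config with an embedded wandb block.
train_cfg = {"output_dir": "./results", "wandb": {"project": "demo"}}

from_config = resolve_wandb_config(train_cfg)
explicit = resolve_wandb_config(train_cfg, {"project": "override"})
absent = resolve_wandb_config({"output_dir": "./results"})
```

Keeping the explicit argument first lets a script override logging settings without editing the training configuration file.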
resume_from_checkpoint classmethod ¶
resume_from_checkpoint(
    save_dir: str, resume_from_checkpoint: bool = True
) -> TrainerPipeline
Resume training from a saved checkpoint directory.
This method loads train.yaml from save_dir, reconstructs IOPipeline, ModelPipeline, and TrainerPipeline, and optionally loads the saved model and tokenizer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| save_dir | str | Directory containing train.yaml, model/, and tokenizer/. | required |
| resume_from_checkpoint | bool | If True, load saved model and tokenizer. If False, create new model. | True |
Returns:
| Type | Description |
|---|---|
| TrainerPipeline | TrainerPipeline instance ready for training continuation. |
Examples:
>>> from calt.trainer import TrainerPipeline
>>>
>>> # Load from checkpoint and continue training
>>> trainer_pipeline = TrainerPipeline.resume_from_checkpoint("./results")
>>> trainer_pipeline.build()
>>> trainer_pipeline.train() # Continue training
Source code in src/calt/trainer/pipeline.py
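Since `save_dir` is expected to contain `train.yaml`, `model/`, and `tokenizer/`, it can help to check the layout before resuming. The following is a small standalone sketch; `missing_checkpoint_entries` is a hypothetical helper and not part of calt, and the assumption that `model/` and `tokenizer/` are only needed when saved weights are loaded is inferred from the parameter description.

```python
from pathlib import Path


def missing_checkpoint_entries(save_dir, need_weights=True):
    """List the expected entries that are absent from save_dir."""
    root = Path(save_dir)
    missing = []
    if not (root / "train.yaml").is_file():
        missing.append("train.yaml")
    if need_weights:
        # Assumed: model/ and tokenizer/ only matter when saved weights are loaded.
        for name in ("model", "tokenizer"):
            if not (root / name).is_dir():
                missing.append(f"{name}/")
    return missing
```

An empty list means the directory has the documented layout; a non-empty list names what to restore before calling `TrainerPipeline.resume_from_checkpoint`.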
Pipelines and configuration¶
High-level example scripts (under calt/examples/*) use class-based pipelines to keep
configuration and wiring simple:
- calt.io.IOPipeline: builds datasets, tokenizer, and collator.
- calt.models.ModelPipeline: builds the model from configuration.
- calt.trainer.TrainerPipeline: builds the HuggingFace Trainer.
A typical training script looks like:
from omegaconf import OmegaConf
from calt.io import IOPipeline
from calt.models import ModelPipeline
from calt.trainer import TrainerPipeline
cfg = OmegaConf.load("configs/train.yaml")
io_dict = IOPipeline.from_config(cfg.data).build()
model = ModelPipeline.from_io_dict(cfg.model, io_dict).build()
# wandb_config is optional - if not provided, TrainerPipeline will try to get it from cfg.train.wandb
trainer_pipeline = TrainerPipeline.from_io_dict(cfg.train, model, io_dict).build()
trainer_pipeline.train()
success_rate = trainer_pipeline.evaluate_and_save_generation()
print(f"Success rate: {100 * success_rate:.1f}%")
Resuming training from checkpoint¶
You can resume training from a saved checkpoint using TrainerPipeline.resume_from_checkpoint:
from calt.trainer import TrainerPipeline
# Resume training from a saved directory
trainer_pipeline = TrainerPipeline.resume_from_checkpoint(
    save_dir="./results/my_experiment",
    resume_from_checkpoint=True,  # Load saved model weights
)
trainer_pipeline.build()
trainer_pipeline.train() # Continue training
For details on the three configuration files used in these examples
(data.yaml, lexer.yaml, train.yaml), see Configuration.