Model pipeline

ModelPipeline builds a sequence-to-sequence model from the model block of your config and the tokenizer produced by IOPipeline. It is used after IOPipeline.build() and before TrainerPipeline.

  • Overview — how the three pipelines (IO, Model, Trainer) fit together.
  • Configuration — the model block in train.yaml and its keys.

ModelPipeline

Use ModelPipeline.from_io_dict(cfg.model, io_dict) to create a pipeline from the result of IOPipeline.from_config(cfg.data).build(). The tokenizer is taken from io_dict["tokenizer"]. Call .build() to obtain the PreTrainedModel instance.

Pipeline for creating models from configuration using ModelRegistry.

Similar to IOPipeline, this class provides a simple interface for creating model instances from config files. It uses ModelRegistry internally to handle model creation.

Examples:

>>> from omegaconf import OmegaConf
>>> from calt.models import ModelPipeline
>>>
>>> cfg = OmegaConf.load("config/train.yaml")
>>> tokenizer = ...  # Get tokenizer from IOPipeline
>>>
>>> model_pipeline = ModelPipeline(cfg.model, tokenizer)
>>> model = model_pipeline.build()

Parameters:

Name         Type                            Description                                      Default
calt_config  DictConfig                      Model configuration from cfg.model (OmegaConf).  required
tokenizer    PreTrainedTokenizerFast | None  Tokenizer instance (required for some models).   None

Source code in src/calt/models/pipeline.py
def __init__(
    self,
    calt_config: DictConfig,
    tokenizer: Optional[PreTrainedTokenizerFast] = None,
):
    """Initialize the model pipeline.

    Args:
        calt_config (DictConfig): Model configuration from cfg.model (OmegaConf).
        tokenizer (PreTrainedTokenizerFast | None): Tokenizer instance (required for some models).
    """
    self.calt_config = calt_config
    self.tokenizer = tokenizer
    self.model: Optional[PreTrainedModel] = None
    self._registry = ModelRegistry()

from_io_dict classmethod

from_io_dict(calt_config: DictConfig, io_dict: dict) -> ModelPipeline

Create a ModelPipeline using the result dict from IOPipeline.build().

Parameters:

Name         Type        Description                                                                               Default
calt_config  DictConfig  Model configuration (cfg.model).                                                          required
io_dict      dict        Result dict from IOPipeline.build(), expected to contain at least the "tokenizer" entry.  required

Source code in src/calt/models/pipeline.py
@classmethod
def from_io_dict(
    cls,
    calt_config: DictConfig,
    io_dict: dict,
) -> "ModelPipeline":
    """Create a ModelPipeline using the result dict from IOPipeline.build().

    Args:
        calt_config: Model configuration (cfg.model).
        io_dict: Result dict from ``IOPipeline.build()``, expected to contain
            at least the ``"tokenizer"`` entry.
    """
    return cls(
        calt_config=calt_config,
        tokenizer=io_dict["tokenizer"],
    )
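
Because from_io_dict reads io_dict["tokenizer"] directly, a dict without that key fails with a bare KeyError. The following is a minimal sketch of a pre-flight check that produces a more descriptive message. It uses only the standard library; require_keys is a hypothetical helper that is not part of calt, and the dicts below merely stand in for the result of IOPipeline.build().

```python
def require_keys(io_dict: dict, required: tuple[str, ...] = ("tokenizer",)) -> None:
    """Raise a descriptive KeyError if any required entry is missing.

    Hypothetical helper for illustration; not part of calt.
    """
    missing = [key for key in required if key not in io_dict]
    if missing:
        raise KeyError(
            f"io_dict is missing {missing}; available keys: {sorted(io_dict)}"
        )


# Stand-in for the dict returned by IOPipeline.build(); object() is a placeholder
# for the real tokenizer.
io_dict = {"tokenizer": object()}
require_keys(io_dict)  # passes silently

try:
    require_keys({})  # no "tokenizer" entry
except KeyError as err:
    message = str(err)  # mentions the missing "tokenizer" key
```

Calling such a check right before ModelPipeline.from_io_dict(cfg.model, io_dict) would surface a malformed io_dict earlier and with more context.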

build

build() -> PreTrainedModel

Build the model from configuration using ModelRegistry.

Returns:

Name             Type             Description
PreTrainedModel  PreTrainedModel  Model instance.

Source code in src/calt/models/pipeline.py
def build(self) -> PreTrainedModel:
    """Build the model from configuration using ModelRegistry.

    Returns:
        PreTrainedModel: Model instance.
    """
    # Use ModelRegistry to create the model
    self.model = self._registry.create_from_config(
        model_config=self.calt_config,
        tokenizer=self.tokenizer,
    )
    return self.model

Supported model types

Models are created via an internal ModelRegistry. The following types are registered by default:

model_type                  Description
generic, transformer, calt  CALT generic Transformer (encoder–decoder).
bart                        HuggingFace BART for conditional generation.

Set model_type in the model block of train.yaml (e.g. model_type: generic). Other keys in the model block (e.g. num_encoder_layers, d_model, max_sequence_length) are documented under Configuration — model.
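
A typo in model_type only surfaces when the registry rejects it at build time. The sketch below checks the value up front against the types listed in the table above. It is an illustration under stated assumptions: a plain dict stands in for the OmegaConf model block, the numeric values are placeholders, and check_model_type is a hypothetical helper rather than a calt function. The lower-casing mirrors what ModelRegistry.create_from_config does before its lookup.

```python
# Types registered by default, as listed in the table above.
DEFAULT_MODEL_TYPES = {"generic", "transformer", "calt", "bart"}


def check_model_type(model_block: dict) -> str:
    """Return the normalized model_type or raise ValueError.

    Hypothetical helper for illustration; not part of calt.
    """
    if "model_type" not in model_block:
        raise ValueError("model block has no model_type key")
    name = str(model_block["model_type"]).lower()
    if name not in DEFAULT_MODEL_TYPES:
        raise ValueError(
            f"Unsupported model type: {name}. "
            f"Supported types: {sorted(DEFAULT_MODEL_TYPES)}"
        )
    return name


# Plain-dict stand-in for the model block of train.yaml (placeholder values).
model_block = {
    "model_type": "Generic",
    "num_encoder_layers": 6,
    "d_model": 512,
    "max_sequence_length": 512,
}
normalized = check_model_type(model_block)  # "generic"
```

A check like this only knows the default types; if you register custom model types, the allowed set would have to include them as well.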

ModelRegistry

To create a model without using the pipeline (e.g. with a custom config), you can use the registry or helpers from calt.models: ModelRegistry, get_model_from_config. See the API reference below.

Registry for creating model instances based on model type.

This class provides a unified interface for creating different types of models. Models can be registered and retrieved by name or inferred from config.

Examples:

>>> # Create model with explicit name and config
>>> from calt.models import ModelRegistry
>>> from calt.models.generic import TransformerConfig
>>> registry = ModelRegistry()
>>> config = TransformerConfig(vocab_size=1000, d_model=128)
>>> model = registry.create("transformer", config)
>>>
>>> # Create model from config only (model_type inferred from config)
>>> model = registry.create(model_config=config)
>>>
>>> # Register custom model (CustomModel and CustomModelConfig are your own
>>> # PreTrainedModel and PretrainedConfig subclasses)
>>> registry.register("custom_model", CustomModel, CustomModelConfig)

Source code in src/calt/models/base.py
def __init__(self):
    """Initialize the registry with default model types."""
    self._registry: dict[
        str, tuple[Type[PreTrainedModel], Type[PretrainedConfig]]
    ] = {}
    self._config_mappings: dict[str, Callable] = {}
    self._register_defaults()

create_from_config

create_from_config(
    model_config: DictConfig,
    tokenizer: Optional[PreTrainedTokenizerFast] = None,
    model_name: Optional[str] = None,
) -> PreTrainedModel

Create a model instance from OmegaConf config (cfg.model).

This method automatically converts the unified config format to model-specific configs using registered config mapping functions.

Parameters:

Name          Type                            Description                                                                      Default
model_config  DictConfig                      Model configuration from cfg.model (OmegaConf).                                  required
tokenizer     PreTrainedTokenizerFast | None  Tokenizer instance (required for some models like BART).                         None
model_name    str | None                      Name of the model type. If None, will be inferred from model_config.model_type.  None

Returns:

Name             Type             Description
PreTrainedModel  PreTrainedModel  Model instance.

Raises:

Type        Description
ValueError  If model_name is not supported or cannot be inferred from config.

Examples:

>>> # Create model from OmegaConf config
>>> from omegaconf import OmegaConf
>>> from calt.models import ModelRegistry
>>> cfg = OmegaConf.load("config/train.yaml")
>>> tokenizer = ...  # Get tokenizer from IOPipeline
>>> registry = ModelRegistry()
>>> model = registry.create_from_config(cfg.model, tokenizer)

Source code in src/calt/models/base.py
def create_from_config(
    self,
    model_config: DictConfig,
    tokenizer: Optional[PreTrainedTokenizerFast] = None,
    model_name: Optional[str] = None,
) -> PreTrainedModel:
    """Create a model instance from OmegaConf config (cfg.model).

    This method automatically converts the unified config format to model-specific configs
    using registered config mapping functions.

    Args:
        model_config (DictConfig): Model configuration from cfg.model (OmegaConf).
        tokenizer (PreTrainedTokenizerFast | None): Tokenizer instance (required for some models like BART).
        model_name (str | None): Name of the model type. If None, will be inferred from model_config.model_type.

    Returns:
        PreTrainedModel: Model instance.

    Raises:
        ValueError: If model_name is not supported or cannot be inferred from config.

    Examples:
        >>> # Create model from OmegaConf config
        >>> from omegaconf import OmegaConf
        >>> cfg = OmegaConf.load("config/train.yaml")
        >>> registry = ModelRegistry()
        >>> model = registry.create_from_config(cfg.model, tokenizer)
    """
    # Determine model name
    if model_name is None:
        if hasattr(model_config, "model_type"):
            model_name = model_config.model_type
        else:
            raise ValueError(
                "Cannot infer model_name from model_config. "
                "Please provide model_name explicitly or set model_config.model_type."
            )

    model_name = model_name.lower() if isinstance(model_name, str) else model_name

    # Get model class from registry
    if model_name not in self._registry:
        supported_types = list(self._registry.keys())
        raise ValueError(
            f"Unsupported model type: {model_name}. "
            f"Supported types: {supported_types}"
        )

    model_class, config_class = self._registry[model_name]

    # Get config mapping function
    if model_name not in self._config_mappings:
        raise ValueError(
            f"No config mapping registered for model type: {model_name}. "
            f"Please register a config mapping using register_config_mapping()."
        )

    mapping_func = self._config_mappings[model_name]

    # Convert OmegaConf config to model-specific config
    converted_config = mapping_func(model_config, tokenizer)

    # Create and return model
    return model_class(config=converted_config)
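
The control flow of create_from_config can be reproduced in a few lines. The sketch below is a simplified stand-in that uses only the standard library: MiniRegistry, ToyConfig, and ToyModel are hypothetical names, a plain dict replaces the DictConfig, and no real model is built. It shows the order of the three checks: resolve the name, look up the model class, then look up the mapping function.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class ToyConfig:
    """Hypothetical stand-in for a model-specific config."""
    d_model: int


class ToyModel:
    """Hypothetical stand-in for a model class."""

    def __init__(self, config: ToyConfig):
        self.config = config


class MiniRegistry:
    """Simplified stand-in for ModelRegistry (illustration only)."""

    def __init__(self) -> None:
        self._registry: dict[str, tuple[type, type]] = {}
        self._config_mappings: dict[str, Callable[[dict, Any], Any]] = {}

    def create_from_config(
        self,
        model_config: dict,
        tokenizer: Any = None,
        model_name: Optional[str] = None,
    ) -> Any:
        # 1. Resolve the name: explicit argument first, then the config.
        if model_name is None:
            if "model_type" not in model_config:
                raise ValueError("Cannot infer model_name from model_config.")
            model_name = model_config["model_type"]
        model_name = model_name.lower()
        # 2. The model class must be registered.
        if model_name not in self._registry:
            raise ValueError(f"Unsupported model type: {model_name}.")
        model_class, _config_class = self._registry[model_name]
        # 3. A mapping function must be registered as well.
        if model_name not in self._config_mappings:
            raise ValueError(
                f"No config mapping registered for model type: {model_name}."
            )
        converted = self._config_mappings[model_name](model_config, tokenizer)
        return model_class(config=converted)


registry = MiniRegistry()
registry._registry["toy"] = (ToyModel, ToyConfig)
registry._config_mappings["toy"] = lambda cfg, tok: ToyConfig(d_model=cfg["d_model"])

# The lookup is case-insensitive because the name is lower-cased first.
model = registry.create_from_config({"model_type": "TOY", "d_model": 64})
print(model.config)  # ToyConfig(d_model=64)
```

One consequence visible in the source above: registering a class with register() alone is not enough for this code path, since a type without a config mapping is rejected with a ValueError.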

list_models

list_models() -> list[str]

List all registered model types.

Returns:

Type       Description
list[str]  List of registered model type names.

Source code in src/calt/models/base.py
def list_models(self) -> list[str]:
    """List all registered model types.

    Returns:
        list[str]: List of registered model type names.
    """
    return list(self._registry.keys())

register

register(
    model_name: str,
    model_class: Type[PreTrainedModel],
    config_class: Type[PretrainedConfig],
)

Register a model class with the registry.

Parameters:

Name          Type                    Description                        Default
model_name    str                     Name to register the model under.  required
model_class   Type[PreTrainedModel]   Model class to register.           required
config_class  Type[PretrainedConfig]  Config class for the model.        required

Source code in src/calt/models/base.py
def register(
    self,
    model_name: str,
    model_class: Type[PreTrainedModel],
    config_class: Type[PretrainedConfig],
):
    """Register a model class with the registry.

    Args:
        model_name (str): Name to register the model under.
        model_class (Type[PreTrainedModel]): Model class to register.
        config_class (Type[PretrainedConfig]): Config class for the model.
    """
    self._registry[model_name.lower()] = (model_class, config_class)
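
Since register lower-cases the name and stores the entry with a plain dictionary assignment, names are effectively case-insensitive, and registering an existing name replaces the earlier entry without a warning. A minimal sketch of that behavior, using a stand-in dictionary and hypothetical placeholder classes instead of the real registry:

```python
# Hypothetical placeholder classes; real entries would be PreTrainedModel
# and PretrainedConfig subclasses.
class FirstModel: ...
class FirstConfig: ...
class SecondModel: ...
class SecondConfig: ...


registry: dict[str, tuple[type, type]] = {}


def register(model_name: str, model_class: type, config_class: type) -> None:
    # Same normalization and assignment as ModelRegistry.register.
    registry[model_name.lower()] = (model_class, config_class)


register("Custom_Model", FirstModel, FirstConfig)
register("custom_model", SecondModel, SecondConfig)  # overwrites the entry above

print(list(registry))  # ['custom_model']
print(registry["custom_model"][0].__name__)  # SecondModel
```

If accidental overwrites are a concern, checking list_models() before calling register is one way to detect a name clash.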

register_config_mapping

register_config_mapping(
    model_name: str,
    mapping_func: Callable[
        [DictConfig, Optional[PreTrainedTokenizerFast]], PretrainedConfig
    ],
)

Register a config mapping function for converting OmegaConf to model config.

Parameters:

Name          Type      Description                                                                                       Default
model_name    str       Name of the model type.                                                                           required
mapping_func  Callable  Function that takes (model_config: DictConfig, tokenizer: Optional) and returns PretrainedConfig.  required

Source code in src/calt/models/base.py
def register_config_mapping(
    self,
    model_name: str,
    mapping_func: Callable[
        [DictConfig, Optional[PreTrainedTokenizerFast]], PretrainedConfig
    ],
):
    """Register a config mapping function for converting OmegaConf to model config.

    Args:
        model_name (str): Name of the model type.
        mapping_func (Callable): Function that takes (model_config: DictConfig, tokenizer: Optional)
            and returns PretrainedConfig.
    """
    self._config_mappings[model_name.lower()] = mapping_func
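
A mapping function receives the model block and the tokenizer and returns a model-specific config. The sketch below shows the shape of such a function using stand-ins so that it runs without calt or transformers: CustomModelConfig is a hypothetical dataclass in place of a PretrainedConfig subclass, a SimpleNamespace replaces the DictConfig, and a list replaces the tokenizer. The field names and default values are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Optional


@dataclass
class CustomModelConfig:
    """Hypothetical stand-in for a PretrainedConfig subclass."""
    vocab_size: int
    d_model: int
    num_encoder_layers: int


def custom_mapping(
    model_config: Any, tokenizer: Optional[Any] = None
) -> CustomModelConfig:
    """Convert the unified model block into a model-specific config."""
    if tokenizer is None:
        raise ValueError("custom_model needs a tokenizer to determine vocab_size")
    return CustomModelConfig(
        vocab_size=len(tokenizer),  # vocabulary size taken from the tokenizer
        d_model=getattr(model_config, "d_model", 256),  # assumed default
        num_encoder_layers=getattr(model_config, "num_encoder_layers", 4),  # assumed default
    )


# Stand-ins for cfg.model and the tokenizer from IOPipeline.
model_block = SimpleNamespace(model_type="custom_model", d_model=128)
fake_tokenizer = ["<pad>", "<s>", "</s>", "x", "+"]

config = custom_mapping(model_block, fake_tokenizer)
print(config)  # CustomModelConfig(vocab_size=5, d_model=128, num_encoder_layers=4)
```

With calt, a function of this shape would be passed to register_config_mapping("custom_model", custom_mapping) in addition to the register call for the model and config classes, so that create_from_config can resolve both.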