Model pipeline¶

ModelPipeline builds a sequence-to-sequence model from the model block of your config and the tokenizer produced by IOPipeline. It is used after IOPipeline.build() and before TrainerPipeline.

Overview — how the three pipelines (IO, Model, Trainer) fit together.
Configuration — the model block in train.yaml and its keys.

ModelPipeline¶

Use ModelPipeline.from_io_dict(cfg.model, io_dict) to create a pipeline from the result of IOPipeline.from_config(cfg.data).build(). The tokenizer is taken from io_dict["tokenizer"]. Call .build() to obtain the PreTrainedModel instance.
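
The hand-off described above can be wrapped in one small helper. This is a minimal sketch: `build_model` is a hypothetical name, not part of calt, and the two pipeline classes are passed in as arguments so the snippet does not assume an import path for IOPipeline. It relies only on the calls named above (`from_config`, `build`, `from_io_dict`) and assumes `cfg` exposes `data` and `model` blocks.

```python
from typing import Any


def build_model(cfg: Any, io_pipeline_cls: Any, model_pipeline_cls: Any) -> Any:
    """Chain IOPipeline and ModelPipeline (hypothetical helper, not part of calt)."""
    # Step 1: build the IO artifacts; the result dict carries the tokenizer.
    io_dict = io_pipeline_cls.from_config(cfg.data).build()
    # Step 2: hand the model block and the IO result to the model pipeline.
    model_pipeline = model_pipeline_cls.from_io_dict(cfg.model, io_dict)
    # Step 3: build() returns the model instance.
    return model_pipeline.build()
```

With calt installed, this would be called as `build_model(cfg, IOPipeline, ModelPipeline)`, and the returned model is then available for the TrainerPipeline step.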

Pipeline for creating models from configuration using ModelRegistry.

Similar to IOPipeline, this class provides a simple interface for creating model instances from config files. It uses ModelRegistry internally to handle model creation.

Examples:

>>> from omegaconf import OmegaConf
>>> from calt.models import ModelPipeline
>>>
>>> cfg = OmegaConf.load("config/train.yaml")
>>> tokenizer = ...  # Get tokenizer from IOPipeline
>>>
>>> model_pipeline = ModelPipeline(cfg.model, tokenizer)
>>> model = model_pipeline.build()

Parameters:

Name	Type	Description	Default
`calt_config`	`DictConfig`	Model configuration from cfg.model (OmegaConf).	required
`tokenizer`	`PreTrainedTokenizerFast | None`	Tokenizer instance (required for some models, such as BART).	`None`

Source code in src/calt/models/pipeline.py

def __init__(
    self,
    calt_config: DictConfig,
    tokenizer: Optional[PreTrainedTokenizerFast] = None,
):
    """Initialize the model pipeline.

    Args:
        calt_config (DictConfig): Model configuration from cfg.model (OmegaConf).
        tokenizer (PreTrainedTokenizerFast | None): Tokenizer instance (required for some models).
    """
    self.calt_config = calt_config
    self.tokenizer = tokenizer
    self.model: Optional[PreTrainedModel] = None
    self._registry = ModelRegistry()

from_io_dict `classmethod` ¶

from_io_dict(calt_config: DictConfig, io_dict: dict) -> ModelPipeline

Create a ModelPipeline using the result dict from IOPipeline.build().

Parameters:

Name	Type	Description	Default
`calt_config`	`DictConfig`	Model configuration (cfg.model).	required
`io_dict`	`dict`	Result dict from `IOPipeline.build()`, expected to contain at least the `"tokenizer"` entry.	required

Source code in src/calt/models/pipeline.py

@classmethod
def from_io_dict(
    cls,
    calt_config: DictConfig,
    io_dict: dict,
) -> "ModelPipeline":
    """Create a ModelPipeline using the result dict from IOPipeline.build().

    Args:
        calt_config: Model configuration (cfg.model).
        io_dict: Result dict from ``IOPipeline.build()``, expected to contain
            at least the ``"tokenizer"`` entry.
    """
    return cls(
        calt_config=calt_config,
        tokenizer=io_dict["tokenizer"],
    )
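
Because `from_io_dict` indexes `io_dict["tokenizer"]` directly, a dict without that key fails with a bare `KeyError`. One way to get a more descriptive failure is a pre-check like the sketch below. `require_tokenizer` is a hypothetical helper, not part of calt, and a plain dict stands in for the IOPipeline result.

```python
from typing import Any


def require_tokenizer(io_dict: dict) -> Any:
    """Return io_dict["tokenizer"], or raise KeyError listing the keys that are present."""
    if "tokenizer" not in io_dict:
        raise KeyError(
            f"io_dict has no 'tokenizer' entry; found keys: {sorted(io_dict)}"
        )
    return io_dict["tokenizer"]


# A stand-in result dict; the real one comes from IOPipeline.build().
io_dict = {"tokenizer": object()}
tokenizer = require_tokenizer(io_dict)
```

Calling the helper before `ModelPipeline.from_io_dict(cfg.model, io_dict)` surfaces a misconfigured IO step early.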

build ¶

build() -> PreTrainedModel

Build the model from configuration using ModelRegistry.

Returns:

Name	Type	Description
`PreTrainedModel`	`PreTrainedModel`	Model instance.

Source code in src/calt/models/pipeline.py

def build(self) -> PreTrainedModel:
    """Build the model from configuration using ModelRegistry.

    Returns:
        PreTrainedModel: Model instance.
    """
    # Use ModelRegistry to create the model
    self.model = self._registry.create_from_config(
        model_config=self.calt_config,
        tokenizer=self.tokenizer,
    )
    return self.model

Supported model types¶

Models are created via an internal ModelRegistry. The following types are registered by default:

`model_type`	Description
`generic`, `transformer`, `calt`	CALT generic Transformer (encoder–decoder).
`bart`	HuggingFace BART for conditional generation.

Set model_type in the model block of train.yaml (e.g. model_type: generic). Other keys in the model block (e.g. num_encoder_layers, d_model, max_sequence_length) are documented under Configuration — model.
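
As a rough illustration, the model block can be pictured as the mapping below, written as a Python dict rather than YAML. The key names come from the text above; the values are placeholders chosen for the example, not calt defaults. `resolve_model_type` is a hypothetical helper that looks the name up in a table copied from this page, lower-casing it first as the registry does.

```python
# Descriptions copied from the table of default model types.
DEFAULT_MODEL_TYPES = {
    "generic": "CALT generic Transformer (encoder-decoder)",
    "transformer": "CALT generic Transformer (encoder-decoder)",
    "calt": "CALT generic Transformer (encoder-decoder)",
    "bart": "HuggingFace BART for conditional generation",
}

# Stand-in for the model block of train.yaml; values are illustrative only.
model_block = {
    "model_type": "generic",
    "num_encoder_layers": 6,
    "d_model": 512,
    "max_sequence_length": 512,
}


def resolve_model_type(block: dict) -> str:
    """Return the description of the model selected by block["model_type"]."""
    name = str(block["model_type"]).lower()
    if name not in DEFAULT_MODEL_TYPES:
        raise ValueError(
            f"Unsupported model type: {name}. "
            f"Supported types: {sorted(DEFAULT_MODEL_TYPES)}"
        )
    return DEFAULT_MODEL_TYPES[name]


print(resolve_model_type(model_block))
```

The three names `generic`, `transformer`, and `calt` resolve to the same description, which reflects that they are aliases for one model.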

ModelRegistry¶

To create a model without using the pipeline (e.g. with a custom config), you can use the registry or helpers from calt.models: ModelRegistry, get_model_from_config. See the API reference below.

Registry for creating model instances based on model type.

This class provides a unified interface for creating different types of models. Models can be registered and retrieved by name or inferred from config.

Examples:

>>> from calt.models import ModelRegistry
>>> from calt.models.generic import TransformerConfig
>>>
>>> # Create model with explicit name and config
>>> registry = ModelRegistry()
>>> config = TransformerConfig(vocab_size=1000, d_model=128)
>>> model = registry.create("transformer", config)
>>>
>>> # Create model from config only (model_type inferred from config)
>>> model = registry.create(model_config=config)
>>>
>>> # Register a custom model (CustomModel and CustomModelConfig are user-defined classes)
>>> registry.register("custom_model", CustomModel, CustomModelConfig)

Source code in src/calt/models/base.py

def __init__(self):
    """Initialize the registry with default model types."""
    self._registry: dict[
        str, tuple[Type[PreTrainedModel], Type[PretrainedConfig]]
    ] = {}
    self._config_mappings: dict[str, Callable] = {}
    self._register_defaults()

create_from_config ¶

create_from_config(
    model_config: DictConfig,
    tokenizer: Optional[PreTrainedTokenizerFast] = None,
    model_name: Optional[str] = None,
) -> PreTrainedModel

Create a model instance from OmegaConf config (cfg.model).

This method automatically converts the unified config format to model-specific configs using registered config mapping functions.

Parameters:

Name	Type	Description	Default
`model_config`	`DictConfig`	Model configuration from cfg.model (OmegaConf).	required
`tokenizer`	`PreTrainedTokenizerFast | None`	Tokenizer instance (required for some models like BART).	`None`
`model_name`	`str | None`	Name of the model type. If None, will be inferred from model_config.model_type.	`None`

Returns:

Name	Type	Description
`PreTrainedModel`	`PreTrainedModel`	Model instance.

Raises:

Type	Description
`ValueError`	If model_name is not supported or cannot be inferred from config.

Examples:

>>> # Create model from OmegaConf config
>>> from omegaconf import OmegaConf
>>> from calt.models import ModelRegistry
>>> cfg = OmegaConf.load("config/train.yaml")
>>> tokenizer = ...  # Get tokenizer from IOPipeline
>>> registry = ModelRegistry()
>>> model = registry.create_from_config(cfg.model, tokenizer)

Source code in src/calt/models/base.py

def create_from_config(
    self,
    model_config: DictConfig,
    tokenizer: Optional[PreTrainedTokenizerFast] = None,
    model_name: Optional[str] = None,
) -> PreTrainedModel:
    """Create a model instance from OmegaConf config (cfg.model).

    This method automatically converts the unified config format to model-specific configs
    using registered config mapping functions.

    Args:
        model_config (DictConfig): Model configuration from cfg.model (OmegaConf).
        tokenizer (PreTrainedTokenizerFast | None): Tokenizer instance (required for some models like BART).
        model_name (str | None): Name of the model type. If None, will be inferred from model_config.model_type.

    Returns:
        PreTrainedModel: Model instance.

    Raises:
        ValueError: If model_name is not supported or cannot be inferred from config.

    Examples:
        >>> # Create model from OmegaConf config
        >>> from omegaconf import OmegaConf
        >>> cfg = OmegaConf.load("config/train.yaml")
        >>> registry = ModelRegistry()
        >>> model = registry.create_from_config(cfg.model, tokenizer)
    """
    # Determine model name
    if model_name is None:
        if hasattr(model_config, "model_type"):
            model_name = model_config.model_type
        else:
            raise ValueError(
                "Cannot infer model_name from model_config. "
                "Please provide model_name explicitly or set model_config.model_type."
            )

    model_name = model_name.lower() if isinstance(model_name, str) else model_name

    # Get model class from registry
    if model_name not in self._registry:
        supported_types = list(self._registry.keys())
        raise ValueError(
            f"Unsupported model type: {model_name}. "
            f"Supported types: {supported_types}"
        )

    model_class, config_class = self._registry[model_name]

    # Get config mapping function
    if model_name not in self._config_mappings:
        raise ValueError(
            f"No config mapping registered for model type: {model_name}. "
            f"Please register a config mapping using register_config_mapping()."
        )

    mapping_func = self._config_mappings[model_name]

    # Convert OmegaConf config to model-specific config
    converted_config = mapping_func(model_config, tokenizer)

    # Create and return model
    return model_class(config=converted_config)
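
The lookup order in the listing above (resolve the name, lower-case it, find the model class, find the mapping function, convert, construct) can be exercised without calt installed. The sketch below is a stripped-down stand-in, not the real ModelRegistry: `MiniRegistry`, `ToyModel`, and `ToyConfig` are hypothetical names, and `SimpleNamespace` plays the role of the OmegaConf model block.

```python
from types import SimpleNamespace
from typing import Any, Callable, Optional


class ToyConfig:
    """Hypothetical model-specific config."""

    def __init__(self, d_model: int) -> None:
        self.d_model = d_model


class ToyModel:
    """Hypothetical model that only stores its config."""

    def __init__(self, config: ToyConfig) -> None:
        self.config = config


class MiniRegistry:
    """Stand-in that mirrors the steps of create_from_config."""

    def __init__(self) -> None:
        self._registry: dict[str, tuple[type, type]] = {}
        self._config_mappings: dict[str, Callable[[Any, Any], Any]] = {}

    def register(self, model_name: str, model_class: type, config_class: type) -> None:
        self._registry[model_name.lower()] = (model_class, config_class)

    def register_config_mapping(self, model_name: str, mapping_func: Callable) -> None:
        self._config_mappings[model_name.lower()] = mapping_func

    def create_from_config(
        self, model_config: Any, tokenizer: Any = None, model_name: Optional[str] = None
    ) -> Any:
        # 1. Fall back to model_config.model_type when no name is given.
        if model_name is None:
            if not hasattr(model_config, "model_type"):
                raise ValueError("Cannot infer model_name from model_config.")
            model_name = model_config.model_type
        model_name = model_name.lower()
        # 2. The model class must be registered under that name.
        if model_name not in self._registry:
            raise ValueError(f"Unsupported model type: {model_name}.")
        model_class, _config_class = self._registry[model_name]
        # 3. A config mapping must be registered as well.
        if model_name not in self._config_mappings:
            raise ValueError(f"No config mapping registered for model type: {model_name}.")
        # 4. Convert the unified config, then construct the model.
        converted = self._config_mappings[model_name](model_config, tokenizer)
        return model_class(config=converted)


registry = MiniRegistry()
registry.register("Toy", ToyModel, ToyConfig)
registry.register_config_mapping("toy", lambda cfg, tok: ToyConfig(d_model=cfg.d_model))
model = registry.create_from_config(SimpleNamespace(model_type="TOY", d_model=64))
print(type(model).__name__, model.config.d_model)
```

Note that names are matched case-insensitively, and that registering a model class alone is not enough: without a mapping function the call stops at step 3.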

list_models ¶

list_models() -> list[str]

List all registered model types.

Returns:

Type	Description
`list[str]`	List of registered model type names.

Source code in src/calt/models/base.py

def list_models(self) -> list[str]:
    """List all registered model types.

    Returns:
        list[str]: List of registered model type names.
    """
    return list(self._registry.keys())

register ¶

register(
    model_name: str,
    model_class: Type[PreTrainedModel],
    config_class: Type[PretrainedConfig],
)

Register a model class with the registry.

Parameters:

Name	Type	Description	Default
`model_name`	`str`	Name to register the model under.	required
`model_class`	`Type[PreTrainedModel]`	Model class to register.	required
`config_class`	`Type[PretrainedConfig]`	Config class for the model.	required

Source code in src/calt/models/base.py

def register(
    self,
    model_name: str,
    model_class: Type[PreTrainedModel],
    config_class: Type[PretrainedConfig],
):
    """Register a model class with the registry.

    Args:
        model_name (str): Name to register the model under.
        model_class (Type[PreTrainedModel]): Model class to register.
        config_class (Type[PretrainedConfig]): Config class for the model.
    """
    self._registry[model_name.lower()] = (model_class, config_class)

register_config_mapping ¶

register_config_mapping(
    model_name: str,
    mapping_func: Callable[
        [DictConfig, Optional[PreTrainedTokenizerFast]], PretrainedConfig
    ],
)

Register a config mapping function for converting OmegaConf to model config.

Parameters:

Name	Type	Description	Default
`model_name`	`str`	Name of the model type.	required
`mapping_func`	`Callable`	Function that takes (model_config: DictConfig, tokenizer: Optional) and returns PretrainedConfig.	required

Source code in src/calt/models/base.py

def register_config_mapping(
    self,
    model_name: str,
    mapping_func: Callable[
        [DictConfig, Optional[PreTrainedTokenizerFast]], PretrainedConfig
    ],
):
    """Register a config mapping function for converting OmegaConf to model config.

    Args:
        model_name (str): Name of the model type.
        mapping_func (Callable): Function that takes (model_config: DictConfig, tokenizer: Optional)
            and returns PretrainedConfig.
    """
    self._config_mappings[model_name.lower()] = mapping_func
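
A mapping function receives the model block and the optional tokenizer and returns the model-specific config. The sketch below shows the shape of such a function under stated assumptions: `CustomModelConfig` is a hypothetical dataclass standing in for a `PretrainedConfig` subclass, `custom_config_mapping` is an illustrative name, the vocabulary size is assumed to come from `len(tokenizer)`, and `SimpleNamespace` stands in for the OmegaConf block.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Optional


@dataclass
class CustomModelConfig:
    """Hypothetical config; a real one would subclass PretrainedConfig."""

    vocab_size: int
    d_model: int


def custom_config_mapping(model_config: Any, tokenizer: Optional[Any] = None) -> CustomModelConfig:
    """Convert the unified model block into the model-specific config."""
    # Assumption: this model sizes its vocabulary from the tokenizer.
    if tokenizer is None:
        raise ValueError("custom_model needs a tokenizer to size its vocabulary")
    return CustomModelConfig(
        vocab_size=len(tokenizer),
        d_model=model_config.d_model,
    )


# Stand-ins: a list plays the tokenizer, SimpleNamespace plays cfg.model.
config = custom_config_mapping(SimpleNamespace(d_model=128), tokenizer=["<pad>", "a", "b"])
print(config)
```

Such a function would be attached with `registry.register_config_mapping("custom_model", custom_config_mapping)`, alongside a matching `registry.register(...)` call, since `create_from_config` requires both.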
