HuggingFace Transformers

Integration with HuggingFace Transformers for loading pre-trained language models, tokenizers, and building transformer-based text encoders. These components are used by neural models such as cross-encoders, SPLADE, and ColBERT.

Models

Wrappers around HuggingFace model classes. These configurations define which pre-trained model to use and how it should be loaded.

XPM Config xpmir.text.huggingface.base.HFMaskedLanguageModel(*, config)

Bases: HFModel

config: ConfigT

HuggingFace model configuration

XPM Config xpmir.text.huggingface.base.HFModel(*, config)

Bases: Module, Generic[ConfigT]

Base class for HuggingFace transformer models

Model structure is created during __initialize__ from the config when available. Pretrained weights can be loaded via init tasks such as HFModelInitFromID or HFFromCheckpoint.

config: ConfigT

HuggingFace model configuration

Init tasks

Tasks that handle model weight loading at experiment submit time (from a HuggingFace model ID or a local checkpoint).

XPM Config xpmir.text.huggingface.base.HFModelInitFromID(*, model, fabric)

Bases: HFModelInitBase

Load pretrained weights from a HuggingFace Hub model ID.

Uses model.config.hf_id to resolve the model.

model: xpmir.text.huggingface.base.HFModel
fabric: xpm_torch.configuration.FabricConfiguration

The fabric configuration to use for initialization. When set, model creation runs inside fabric.init_module() so that parameters are allocated directly on the target device and dtype. See https://lightning.ai/docs/fabric/stable/advanced/model_init.html

XPM Config xpmir.text.huggingface.base.HFFromCheckpoint(*, model, fabric, checkpoint)

Bases: HFModelInitBase

Load from a local checkpoint.

Uses model.config.hf_id for the architecture config, then loads weights from checkpoint.

model: xpmir.text.huggingface.base.HFModel
fabric: xpm_torch.configuration.FabricConfiguration

The fabric configuration to use for initialization. When set, model creation runs inside fabric.init_module() so that parameters are allocated directly on the target device and dtype. See https://lightning.ai/docs/fabric/stable/advanced/model_init.html

checkpoint: path

The checkpoint path to load weights from

Tokenizers

HuggingFace tokenizer wrappers, with variants for different output formats (token IDs, strings, lists).

XPM Config xpmir.text.huggingface.tokenizers.HFTokenizer(*, model_id, max_length)

Bases: Config, Initializable

This is the main tokenizer class

model_id: str

The HuggingFace ID of the tokenizer

max_length: int

Maximum length for the tokenizer (can be overridden by the model). The default value can be taken from the HuggingFace config.

XPM Config xpmir.text.huggingface.tokenizers.HFTokenizerBase(*, tokenizer)

Bases: TokenizerBase[TokenizerInput, TokenizedTexts]

Base class for all Hugging-Face tokenizers

tokenizer: xpmir.text.huggingface.tokenizers.HFTokenizer

The HuggingFace tokenizer

XPM Config xpmir.text.huggingface.tokenizers.HFListTokenizer(*, tokenizer, separate_index)

Bases: HFTokenizerBase[List[List[str]]]

Processes lists of texts, separating the texts of each list with a separator token

tokenizer: xpmir.text.huggingface.tokenizers.HFTokenizer

The HuggingFace tokenizer

separate_index: bool = 0

Use a tuple until this index
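
To illustrate the idea of separating several texts with a separator token, here is a minimal pure-Python sketch. It is not the xpmir implementation: `join_with_separator` and `SEP_TOKEN` are hypothetical names introduced for this example, the real class delegates to the HuggingFace tokenizer, and the effect of `separate_index` is not modelled here.

```python
from typing import List

# Hypothetical separator; a real tokenizer exposes its own (BERT uses "[SEP]").
SEP_TOKEN = "[SEP]"


def join_with_separator(texts: List[str], sep_token: str = SEP_TOKEN) -> str:
    """Join the texts of one input into a single string, separated by sep_token."""
    return f" {sep_token} ".join(texts)


# Each inner list is one input made of several texts
batch = [["what is xpmir", "a neural IR library"], ["single text"]]
joined = [join_with_separator(texts) for texts in batch]
print(joined)
```

An input with a single text is left unchanged, since there is nothing to separate.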

XPM Config xpmir.text.huggingface.tokenizers.HFStringTokenizer(*, tokenizer)

Bases: HFTokenizerBase[str | List[str] | List[Tuple[str, str]]]

Process list of texts

tokenizer: xpmir.text.huggingface.tokenizers.HFTokenizer

The HuggingFace tokenizer

XPM Config xpmir.text.huggingface.tokenizers.HFTokenizerAdapter(*, tokenizer, converter)

Bases: HFTokenizerBase[TokenizerInput]

Process list of texts

tokenizer: xpmir.text.huggingface.tokenizers.HFTokenizer

The HuggingFace tokenizer

converter: xpmir.utils.convert.Converter[TokenizerInput, Union[str, List[str], List[Tuple[str, str]]]]
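
The adapter pattern suggested by the `converter` field can be sketched as follows: a converter first maps an arbitrary input type to one of the forms a string tokenizer accepts (here, a list of string pairs), and the result is then tokenized. Everything in this sketch is hypothetical (`QueryDocument`, `to_pairs`, `AdapterSketch`, `whitespace_tokenize`); a plain callable stands in for `xpmir.utils.convert.Converter`, and a whitespace splitter stands in for the HuggingFace tokenizer.

```python
from dataclasses import dataclass
from typing import Callable, Generic, List, Tuple, TypeVar

InputT = TypeVar("InputT")
Pairs = List[Tuple[str, str]]


@dataclass
class QueryDocument:
    """Hypothetical input record, used only for this illustration."""

    query: str
    document: str


def to_pairs(records: List[QueryDocument]) -> Pairs:
    """Hypothetical converter: records -> (query, document) string pairs."""
    return [(r.query, r.document) for r in records]


def whitespace_tokenize(pairs: Pairs) -> List[List[str]]:
    """Toy stand-in for a real tokenizer: whitespace split around a separator."""
    return [q.split() + ["[SEP]"] + d.split() for q, d in pairs]


class AdapterSketch(Generic[InputT]):
    """Applies a converter, then hands the converted input to a tokenize function."""

    def __init__(
        self,
        converter: Callable[[InputT], Pairs],
        tokenize: Callable[[Pairs], List[List[str]]],
    ):
        self.converter = converter
        self.tokenize = tokenize

    def __call__(self, inputs: InputT) -> List[List[str]]:
        return self.tokenize(self.converter(inputs))


adapter = AdapterSketch(to_pairs, whitespace_tokenize)
tokens = adapter([QueryDocument("neural ir", "xpmir is a library")])
print(tokens)
```

Keeping the conversion separate from the tokenization means the same tokenizer configuration can serve several input types, each with its own converter.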

Encoders

Encoders that produce text representations from HuggingFace models. These implement the TokensEncoder interface.

XPM Config xpmir.text.huggingface.encoders.HFEncoderBase(*, model)

Bases: Module

Base HuggingFace encoder

model: xpmir.text.huggingface.base.HFModel

A Hugging-Face model

XPM Config xpmir.text.huggingface.encoders.HFTokensEncoder(*, model)

Bases: HFEncoderBase, TokenizedEncoder

HuggingFace-based tokenized encoder

model: xpmir.text.huggingface.base.HFModel

A Hugging-Face model

XPM Config xpmir.text.huggingface.encoders.HFCLSEncoder(*, model)

Bases: HFEncoderBase, TokenizedEncoder

Encodes a text using the [CLS] token

model: xpmir.text.huggingface.base.HFModel

A Hugging-Face model
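
As an illustration of [CLS] pooling, the sketch below keeps only the first-position vector of each sequence. It uses nested Python lists so that it runs without any dependency; with tensors of shape (batch, sequence length, hidden size), the same operation is usually written as a slice on position 0. The function name `cls_pool` is hypothetical, and the assumption that [CLS] sits at position 0 holds for BERT-style inputs but not necessarily for every model.

```python
from typing import List

Vector = List[float]


def cls_pool(last_hidden_state: List[List[Vector]]) -> List[Vector]:
    """Return the first-position vector of every sequence in the batch.

    last_hidden_state is indexed as [batch][position][hidden_dim]; for
    BERT-style inputs, position 0 holds the [CLS] token.
    """
    return [sequence[0] for sequence in last_hidden_state]


# Toy batch: 2 sequences, 3 positions each, hidden size 2
hidden = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[1.0, 1.1], [1.2, 1.3], [1.4, 1.5]],
]
pooled = cls_pool(hidden)
print(pooled)
```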

XPM Config xpmir.text.huggingface.encoders.OneHotHuggingFaceEncoder(*, model_id, maxlen)

Bases: TextEncoder

An encoder that maps each text to a binary (0/1) vector over the vocabulary: an entry is 1 if the text contains the corresponding token, and 0 otherwise
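
A minimal sketch of this kind of binary encoding, assuming a toy vocabulary and whitespace tokenization. `VOCAB` and `one_hot_encode` are hypothetical names; the actual class relies on the HuggingFace tokenizer identified by `model_id`.

```python
from typing import Dict, List

# Hypothetical toy vocabulary; a real one would come from the HuggingFace tokenizer.
VOCAB: Dict[str, int] = {"[PAD]": 0, "neural": 1, "ranking": 2, "model": 3, "sparse": 4}


def one_hot_encode(text: str, vocab: Dict[str, int] = VOCAB) -> List[int]:
    """Return a vocabulary-sized 0/1 vector marking which tokens occur in text."""
    vector = [0] * len(vocab)
    for token in text.lower().split():  # toy whitespace tokenization
        index = vocab.get(token)
        if index is not None:  # out-of-vocabulary tokens are ignored in this sketch
            vector[index] = 1
    return vector


encoded = one_hot_encode("Neural ranking neural model")
print(encoded)
```

Repeated tokens do not change the result: the vector records presence, not counts.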

model_id: str = bert-base-uncased

Model ID from huggingface

maxlen: int

Max length for texts

version: int = 2 (constant)

XPM Config xpmir.text.huggingface.encoders.SentenceTransformerTextEncoder(*, model_id)

Bases: TextEncoder

A Sentence Transformers text encoder

model_id: str = sentence-transformers/all-MiniLM-L6-v2

Training hooks

Hooks that modify encoder behaviour during training (e.g. selecting intermediate layers).

XPM Config xpmir.text.huggingface.encoders.LayerSelector(*, re_layer, transformer, pick_layers, select_embeddings, select_feed_forward)

Bases: ParametersIterator

This class can be used to pick some of the transformer layers

re_layer: str = (?:encoder|transformer)\.layer\.(\d+)\.
transformer: xpmir.text.huggingface.base.HFModel

The model for which layers are selected

pick_layers: int = 0

Number of layers to pick, counting from the first processing layer. The value can be negative (e.g. -1 means all layers up to, but excluding, the last one); 0 means no layer.

select_embeddings: bool = False

Whether to pick the embeddings layer

select_feed_forward: bool = False

Whether to pick the feed forward of Transformer layers
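
The sketch below shows how the default `re_layer` pattern extracts a layer index from a parameter name, and one possible reading of `pick_layers`. It is an illustration under stated assumptions, not the xpmir implementation: `layer_index` and `selected_layers` are hypothetical helpers, the use of `re.search` (rather than matching from the start of the name) is an assumption, and the interpretation of `pick_layers` follows the field description only.

```python
import re
from typing import List, Optional

# Default pattern taken from the re_layer field
RE_LAYER = re.compile(r"(?:encoder|transformer)\.layer\.(\d+)\.")


def layer_index(param_name: str) -> Optional[int]:
    """Return the transformer layer index encoded in a parameter name, if any."""
    match = RE_LAYER.search(param_name)
    return int(match.group(1)) if match else None


def selected_layers(num_layers: int, pick_layers: int) -> List[int]:
    """Assumed reading of pick_layers: n > 0 keeps the first n layers,
    n < 0 keeps all layers except the last |n|, and 0 keeps none."""
    if pick_layers == 0:
        return []
    return list(range(num_layers))[:pick_layers]


names = [
    "embeddings.word_embeddings.weight",            # no layer index
    "encoder.layer.0.attention.self.query.weight",  # BERT-style name
    "encoder.layer.11.output.dense.bias",
    "transformer.layer.3.ffn.lin1.weight",          # DistilBERT-style name
]
indices = [layer_index(name) for name in names]
print(indices)
print(selected_layers(12, 2), selected_layers(4, -1))
```

Parameters whose name does not match the pattern, such as the embeddings, have no layer index; in the class above they are governed by separate flags such as `select_embeddings`.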