HuggingFace Transformers

Base

Model architectures and default parameters are specified using an HFModelConfig.

XPM Configxpmir.text.huggingface.base.HFModelConfig[source]

Bases: Config, ABC

Submit type: xpmir.text.huggingface.base.HFModelConfig

Base class for all HuggingFace model configurations

XPM Configxpmir.text.huggingface.base.HFModelConfigFromId(*, model_id)[source]

Bases: HFModelConfig

Submit type: xpmir.text.huggingface.base.HFModelConfigFromId

model_id: str

HuggingFace Model ID
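
For illustration, a configuration pointing to a model on the HuggingFace Hub can be declared as follows (a minimal sketch, assuming the usual experimaestro pattern of instantiating configurations with keyword arguments; the model ID is an arbitrary example):

    from xpmir.text.huggingface.base import HFModelConfigFromId

    # Configuration that identifies the model by its HuggingFace Hub ID
    model_config = HFModelConfigFromId(model_id="bert-base-uncased")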

Models

Models follow the HuggingFace hierarchy

XPM Configxpmir.text.huggingface.base.HFMaskedLanguageModel(*, config)[source]

Bases: HFModel

Submit type: xpmir.text.huggingface.base.HFMaskedLanguageModel

config: xpmir.text.huggingface.base.HFModelConfig

The HuggingFace model configuration

XPM Configxpmir.text.huggingface.base.HFModel(*, config)[source]

Bases: Module

Submit type: xpmir.text.huggingface.base.HFModel

Base transformer class from HuggingFace

The config specifies the architecture

config: xpmir.text.huggingface.base.HFModelConfig

The HuggingFace model configuration
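
Putting the two together, a masked language model can be declared from such a configuration (a minimal sketch under the same assumptions as above):

    from xpmir.text.huggingface.base import (
        HFMaskedLanguageModel,
        HFModelConfigFromId,
    )

    # The configuration specifies the architecture; the model wraps the
    # corresponding HuggingFace module
    mlm = HFMaskedLanguageModel(
        config=HFModelConfigFromId(model_id="bert-base-uncased")
    )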

Tokenizers

XPM Configxpmir.text.huggingface.tokenizers.HFTokenizer(*, model_id, max_length)[source]

Bases: Config, Initializable

Submit type: xpmir.text.huggingface.tokenizers.HFTokenizer

This is the main tokenizer class

model_id: str

The HuggingFace tokenizer ID

max_length: int = 4096

Maximum length for the tokenizer (can be overridden)
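
For example, a tokenizer configuration might look like this (sketch; the model ID and maximum length are illustrative values):

    from xpmir.text.huggingface.tokenizers import HFTokenizer

    # Tokenizer identified by its HuggingFace ID, truncating at 512 tokens
    tokenizer = HFTokenizer(model_id="bert-base-uncased", max_length=512)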

XPM Configxpmir.text.huggingface.tokenizers.HFTokenizerBase(*, tokenizer)[source]

Bases: TokenizerBase[TokenizerInput, TokenizedTexts]

Submit type: xpmir.text.huggingface.tokenizers.HFTokenizerBase

Base class for all Hugging-Face tokenizers

tokenizer: xpmir.text.huggingface.tokenizers.HFTokenizer

The HuggingFace tokenizer

XPM Configxpmir.text.huggingface.tokenizers.HFListTokenizer(*, tokenizer)[source]

Bases: HFTokenizerBase[List[List[str]]]

Submit type: xpmir.text.huggingface.tokenizers.HFListTokenizer

Processes a list of texts by separating them with a separator token

tokenizer: xpmir.text.huggingface.tokenizers.HFTokenizer

The HuggingFace tokenizer

XPM Configxpmir.text.huggingface.tokenizers.HFStringTokenizer(*, tokenizer)[source]

Bases: HFTokenizerBase[Union[List[str], List[Tuple[str, str]]]]

Submit type: xpmir.text.huggingface.tokenizers.HFStringTokenizer

Processes a list of texts (or text pairs)

tokenizer: xpmir.text.huggingface.tokenizers.HFTokenizer

The HuggingFace tokenizer
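
For instance, plain strings (or string pairs) would go through an HFStringTokenizer, while lists of texts would go through an HFListTokenizer, both wrapping the same HFTokenizer (sketch; the model ID is illustrative):

    from xpmir.text.huggingface.tokenizers import (
        HFListTokenizer,
        HFStringTokenizer,
        HFTokenizer,
    )

    hf_tokenizer = HFTokenizer(model_id="bert-base-uncased")

    # Tokenizes plain texts or (text, text) pairs
    string_tokenizer = HFStringTokenizer(tokenizer=hf_tokenizer)

    # Tokenizes lists of texts, joined with a separator token
    list_tokenizer = HFListTokenizer(tokenizer=hf_tokenizer)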

XPM Configxpmir.text.huggingface.tokenizers.HFTokenizerAdapter(*, tokenizer, converter)[source]

Bases: HFTokenizerBase[TokenizerInput]

Submit type: xpmir.text.huggingface.tokenizers.HFTokenizerAdapter

Processes inputs by first converting them into texts (or text pairs) with the converter

tokenizer: xpmir.text.huggingface.tokenizers.HFTokenizer

The HuggingFace tokenizer

converter: xpmir.utils.convert.Converter[TokenizerInput, Union[List[str], List[Tuple[str, str]]]]

Encoders

XPM Configxpmir.text.huggingface.encoders.HFEncoderBase(*, model)[source]

Bases: Module

Submit type: xpmir.text.huggingface.encoders.HFEncoderBase

Base HuggingFace encoder

model: xpmir.text.huggingface.base.HFModel

A Hugging-Face model

classmethod from_pretrained_id(model_id: str)[source]

Returns a new encoder

Parameters:

model_id – The HuggingFace Hub ID

Returns:

A HuggingFace-based encoder
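
For example, from_pretrained_id builds an encoder directly from a Hub ID (sketch, assuming the class method is called on a concrete subclass such as HFTokensEncoder):

    from xpmir.text.huggingface.encoders import HFTokensEncoder

    # Builds the underlying HFModel configuration from the Hub ID
    encoder = HFTokensEncoder.from_pretrained_id("bert-base-uncased")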

XPM Configxpmir.text.huggingface.encoders.HFTokensEncoder(*, model)[source]

Bases: HFEncoderBase, TokenizedEncoder[TokenizedTexts, TokensRepresentationOutput]

Submit type: xpmir.text.huggingface.encoders.HFTokensEncoder

HuggingFace-based tokens encoder (one representation per token)

model: xpmir.text.huggingface.base.HFModel

A Hugging-Face model

XPM Configxpmir.text.huggingface.encoders.HFCLSEncoder(*, model, maxlen)[source]

Bases: HFEncoderBase, TokenizedEncoder[TokenizedTexts, TextsRepresentationOutput]

Submit type: xpmir.text.huggingface.encoders.HFCLSEncoder

Encodes a text using the [CLS] token

model: xpmir.text.huggingface.base.HFModel

A Hugging-Face model

maxlen: int

Maximum length of the text to be encoded
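
A [CLS]-based text encoder can also be assembled explicitly from a model configuration (sketch; the model ID and maximum length are illustrative):

    from xpmir.text.huggingface.base import HFModel, HFModelConfigFromId
    from xpmir.text.huggingface.encoders import HFCLSEncoder

    # One vector per text, taken from the [CLS] token; texts are truncated
    # to 256 tokens
    cls_encoder = HFCLSEncoder(
        model=HFModel(config=HFModelConfigFromId(model_id="bert-base-uncased")),
        maxlen=256,
    )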

Legacy

The old HuggingFace wrappers are listed below for reference, but should not be used for future development.

XPM Configxpmir.text.huggingface.BaseTransformer(*, model_id, trainable, layer, dropout)[source]

Bases: Encoder

Submit type: xpmir.text.huggingface.BaseTransformer

Base transformer class from HuggingFace

model_id: str = bert-base-uncased

Model ID from huggingface

trainable: bool

Whether BERT parameters should be trained

layer: int = 0

Layer to use (0 is the last, -1 to use them all)

dropout: float = 0

(deprecated) Define a dropout for all the layers

Encoders

XPM Configxpmir.text.huggingface.TransformerEncoder(*, model_id, trainable, layer, dropout, maxlen)[source]

Bases: BaseTransformer, TextEncoder, DistributableModel

Submit type: xpmir.text.huggingface.TransformerEncoder

Encodes using the [CLS] token

model_id: str = bert-base-uncased

Model ID from huggingface

trainable: bool

Whether BERT parameters should be trained

layer: int = 0

Layer to use (0 is the last, -1 to use them all)

dropout: float = 0

(deprecated) Define a dropout for all the layers

maxlen: int
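
For reference only (this is a legacy class), such an encoder was typically declared as follows (sketch; the parameter values are illustrative):

    from xpmir.text.huggingface import TransformerEncoder

    # Legacy [CLS]-based encoder; prefer HFCLSEncoder in new code
    encoder = TransformerEncoder(
        model_id="bert-base-uncased",
        trainable=True,
        maxlen=512,
    )
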
XPM Configxpmir.text.huggingface.TransformerTokensEncoder(*, model_id, trainable, layer, dropout)[source]

Bases: BaseTransformer, TokensEncoder

Submit type: xpmir.text.huggingface.TransformerTokensEncoder

A tokens encoder based on HuggingFace

model_id: str = bert-base-uncased

Model ID from huggingface

trainable: bool

Whether BERT parameters should be trained

layer: int = 0

Layer to use (0 is the last, -1 to use them all)

dropout: float = 0

(deprecated) Define a dropout for all the layers

XPM Configxpmir.text.huggingface.TransformerTextEncoderAdapter(*, encoder, maxlen)[source]

Bases: TextEncoder, DistributableModel

Submit type: xpmir.text.huggingface.TransformerTextEncoderAdapter

encoder: xpmir.text.huggingface.TransformerEncoder

maxlen: int

XPM Configxpmir.text.huggingface.DualTransformerEncoder(*, model_id, trainable, layer, dropout, maxlen)[source]

Bases: BaseTransformer, DualTextEncoder

Submit type: xpmir.text.huggingface.DualTransformerEncoder

Encodes the (query, document pair) using the [CLS] token

model_id: str = bert-base-uncased

Model ID from huggingface

trainable: bool

Whether BERT parameters should be trained

layer: int = 0

Layer to use (0 is the last, -1 to use them all)

dropout: float = 0

(deprecated) Define a dropout for all the layers

maxlen: int

Maximum length of the query-document pair (in tokens), or None if using the transformer limit

version: int = 2 (constant)
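
A (query, document) cross-encoding could be declared along these lines (legacy sketch; the parameter values are illustrative):

    from xpmir.text.huggingface import DualTransformerEncoder

    # Encodes "[CLS] query [SEP] document [SEP]" and uses the [CLS] representation
    dual_encoder = DualTransformerEncoder(
        model_id="bert-base-uncased",
        trainable=True,
        maxlen=512,
    )
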
XPM Configxpmir.text.huggingface.SentenceTransformerTextEncoder(*, model_id)[source]

Bases: TextEncoder

Submit type: xpmir.text.huggingface.SentenceTransformerTextEncoder

A Sentence Transformers text encoder

model_id: str = sentence-transformers/all-MiniLM-L6-v2

XPM Configxpmir.text.huggingface.OneHotHuggingFaceEncoder(*, model_id, maxlen)[source]

Bases: TextEncoder

Submit type: xpmir.text.huggingface.OneHotHuggingFaceEncoder

An encoder that maps a text to a 0/1 vector over the vocabulary: a 1 indicates that the token appears in the text, a 0 that it does not

model_id: str = bert-base-uncased

Model ID from huggingface

maxlen: int

Max length for texts

version: int = 2 (constant)

XPM Configxpmir.text.huggingface.DualDuoBertTransformerEncoder(*, model_id, trainable, layer, dropout, maxlen_query, maxlen_doc)[source]

Bases: BaseTransformer, TripletTextEncoder

Submit type: xpmir.text.huggingface.DualDuoBertTransformerEncoder

Vector encoding of a (query, document, document) triplet

The encoded sequence looks like: [CLS] query [SEP] doc1 [SEP] doc2 [SEP]

model_id: str = bert-base-uncased

Model ID from huggingface

trainable: bool

Whether BERT parameters should be trained

layer: int = 0

Layer to use (0 is the last, -1 to use them all)

dropout: float = 0

(deprecated) Define a dropout for all the layers

maxlen_query: int = 64

Maximum length (in tokens) for the query

maxlen_doc: int = 224

Maximum length (in tokens) for each of the two documents
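
A duoBERT-style triplet encoder could be set up similarly (legacy sketch; the length limits are the documented defaults):

    from xpmir.text.huggingface import DualDuoBertTransformerEncoder

    # Encodes "[CLS] query [SEP] doc1 [SEP] doc2 [SEP]"
    duo_encoder = DualDuoBertTransformerEncoder(
        model_id="bert-base-uncased",
        trainable=True,
        maxlen_query=64,
        maxlen_doc=224,
    )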

XPM Configxpmir.text.huggingface.TransformerVocab(*, model_id, trainable, layer, dropout)[source]

Bases: TransformerTokensEncoder

Submit type: xpmir.text.huggingface.TransformerVocab

Old tokens encoder

model_id: str = bert-base-uncased

Model ID from huggingface

trainable: bool

Whether BERT parameters should be trained

layer: int = 0

Layer to use (0 is the last, -1 to use them all)

dropout: float = 0

(deprecated) Define a dropout for all the layers

XPM Configxpmir.text.huggingface.TransformerTokensEncoderWithMLMOutput(*, model_id, trainable, layer, dropout)[source]

Bases: TransformerTokensEncoder

Submit type: xpmir.text.huggingface.TransformerTokensEncoderWithMLMOutput

Transformer that outputs logits over the vocabulary

model_id: str = bert-base-uncased

Model ID from huggingface

trainable: bool

Whether BERT parameters should be trained

layer: int = 0

Layer to use (0 is the last, -1 to use them all)

dropout: float = 0

(deprecated) Define a dropout for all the layers

Tokenizers

XPM Configxpmir.text.huggingface.HuggingfaceTokenizer(*, model_id, maxlen)[source]

Bases: OneHotHuggingFaceEncoder

Submit type: xpmir.text.huggingface.HuggingfaceTokenizer

The old one-hot encoder

model_id: str = bert-base-uncased

Model ID from huggingface

maxlen: int

Max length for texts

version: int = 2 (constant)

Masked-LM

XPM Configxpmir.text.huggingface.MLMEncoder(*, model_id, trainable, layer, dropout, maxlen, mlm_probability)[source]

Bases: BaseTransformer, DistributableModel

Submit type: xpmir.text.huggingface.MLMEncoder

Implementation of the encoder for the Masked Language Modeling task

model_id: str = bert-base-uncased

Model ID from huggingface

trainable: bool

Whether BERT parameters should be trained

layer: int = 0

Layer to use (0 is the last, -1 to use them all)

dropout: float = 0

(deprecated) Define a dropout for all the layers

maxlen: int

mlm_probability: float = 0.2

Probability to mask tokens
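
A masked language modeling encoder masking 15% of the tokens could be configured as follows (legacy sketch; the parameter values are illustrative):

    from xpmir.text.huggingface import MLMEncoder

    # Each token is masked with probability 0.15 for the MLM objective
    mlm_encoder = MLMEncoder(
        model_id="bert-base-uncased",
        trainable=True,
        maxlen=512,
        mlm_probability=0.15,
    )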

Hooks

XPM Configxpmir.text.huggingface.LayerSelector(*, re_layer, transformer, pick_layers, select_embeddings, select_feed_forward)[source]

Bases: ParametersIterator

Submit type: xpmir.text.huggingface.LayerSelector

This class can be used to pick some of the transformer layers

re_layer: str = (?:encoder|transformer)\.layer\.(\d+)\.

transformer: xpmir.text.huggingface.BaseTransformer

The model for which layers are selected

pick_layers: int = 0

Number of layers to pick, counting from the first processing layer (can be negative, e.g. -1 means all layers but the last; 0 means no layer)

select_embeddings: bool = False

Whether to pick the embeddings layer

select_feed_forward: bool = False

Whether to pick the feed-forward part of the Transformer layers
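
As an illustration, selecting only the embeddings and the first two transformer layers of a legacy encoder might look like this (sketch; the parameter values are illustrative):

    from xpmir.text.huggingface import LayerSelector, TransformerEncoder

    encoder = TransformerEncoder(
        model_id="bert-base-uncased", trainable=True, maxlen=512
    )

    # Picks the embeddings layer and the first two transformer layers
    selector = LayerSelector(
        transformer=encoder,
        pick_layers=2,
        select_embeddings=True,
    )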