HuggingFace Transformers
Base
Model architectures and default parameters are specified using an HFModelConfig.
- XPM Config xpmir.text.huggingface.base.HFModelConfig [source]
Bases: Config, ABC
Submit type: xpmir.text.huggingface.base.HFModelConfig
Base class for all HuggingFace model configurations
- XPM Config xpmir.text.huggingface.base.HFModelConfigFromId(*, model_id) [source]
Bases: HFModelConfig
Submit type: xpmir.text.huggingface.base.HFModelConfigFromId
- model_id: str
HuggingFace Model ID
Models
Models follow the HuggingFace hierarchy
- XPM Config xpmir.text.huggingface.base.HFMaskedLanguageModel(*, config) [source]
Bases: HFModel
Submit type: xpmir.text.huggingface.base.HFMaskedLanguageModel
- config: xpmir.text.huggingface.base.HFModelConfig
The HuggingFace model configuration
- XPM Config xpmir.text.huggingface.base.HFModel(*, config) [source]
Bases: Module
Submit type: xpmir.text.huggingface.base.HFModel
Base transformer class from HuggingFace
The config specifies the architecture
- config: xpmir.text.huggingface.base.HFModelConfig
The HuggingFace model configuration
Tokenizers
- XPM Config xpmir.text.huggingface.tokenizers.HFTokenizer(*, model_id, max_length) [source]
Bases: Config, Initializable
Submit type: xpmir.text.huggingface.tokenizers.HFTokenizer
This is the main tokenizer class
- model_id: str
The tokenizer HuggingFace ID
- max_length: int = 4096
Maximum length for the tokenizer (can be overridden by the model)
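As an illustration of how a `max_length` bound behaves, here is a toy sketch (plain Python, not the xpmir implementation) of truncating and padding a batch of token ID sequences:

```python
def batch_encode(batches_of_ids, max_length=8, pad_id=0):
    """Truncate each sequence to max_length, then pad to the batch maximum,
    returning (token ids, attention mask) as nested lists."""
    truncated = [ids[:max_length] for ids in batches_of_ids]
    width = max(len(ids) for ids in truncated)
    ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return ids, mask

ids, mask = batch_encode([[5, 6, 7], [8, 9]], max_length=8)
print(ids)   # → [[5, 6, 7], [8, 9, 0]]
print(mask)  # → [[1, 1, 1], [1, 1, 0]]
```

Real HuggingFace tokenizers implement this (and more) via their padding and truncation options; this sketch only shows why a maximum length is part of the tokenizer configuration.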
- XPM Config xpmir.text.huggingface.tokenizers.HFTokenizerBase(*, tokenizer) [source]
Bases: TokenizerBase[TokenizerInput, TokenizedTexts]
Submit type: xpmir.text.huggingface.tokenizers.HFTokenizerBase
Base class for all HuggingFace tokenizers
- tokenizer: xpmir.text.huggingface.tokenizers.HFTokenizer
The HuggingFace tokenizer
- XPM Config xpmir.text.huggingface.tokenizers.HFListTokenizer(*, tokenizer, separate_index) [source]
Bases: HFTokenizerBase[List[List[str]]]
Submit type: xpmir.text.huggingface.tokenizers.HFListTokenizer
Processes lists of texts by separating them with a separator token
- tokenizer: xpmir.text.huggingface.tokenizers.HFTokenizer
The HuggingFace tokenizer
- separate_index: int = 0
Use a tuple until this index
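The separator-based processing can be sketched as follows (an illustration of the idea only, not the actual implementation; `[SEP]` is the usual BERT separator token, assumed here):

```python
def join_with_separator(texts, sep="[SEP]"):
    """Concatenate a list of texts, inserting a separator token between them."""
    return f" {sep} ".join(texts)

print(join_with_separator(["a query", "a document"]))
# → a query [SEP] a document
```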
- XPM Config xpmir.text.huggingface.tokenizers.HFStringTokenizer(*, tokenizer) [source]
Bases: HFTokenizerBase[Union[List[str], List[Tuple[str, str]]]]
Submit type: xpmir.text.huggingface.tokenizers.HFStringTokenizer
Processes a list of texts (or of text pairs)
- tokenizer: xpmir.text.huggingface.tokenizers.HFTokenizer
The HuggingFace tokenizer
- XPM Config xpmir.text.huggingface.tokenizers.HFTokenizerAdapter(*, tokenizer, converter) [source]
Bases: HFTokenizerBase[TokenizerInput]
Submit type: xpmir.text.huggingface.tokenizers.HFTokenizerAdapter
Processes a list of texts
- tokenizer: xpmir.text.huggingface.tokenizers.HFTokenizer
The HuggingFace tokenizer
- converter: xpmir.utils.convert.Converter[TokenizerInput, Union[List[str], List[Tuple[str, str]]]]
Converts the tokenizer input into texts or text pairs
Encoders
- XPM Config xpmir.text.huggingface.encoders.HFEncoderBase(*, model) [source]
Bases: Module
Submit type: xpmir.text.huggingface.encoders.HFEncoderBase
Base HuggingFace encoder
- model: xpmir.text.huggingface.base.HFModel
A Hugging-Face model
- XPM Config xpmir.text.huggingface.encoders.HFTokensEncoder(*, model) [source]
Bases: HFEncoderBase, TokenizedEncoder[TokenizedTexts, TokensRepresentationOutput]
Submit type: xpmir.text.huggingface.encoders.HFTokensEncoder
HuggingFace-based tokens encoder
- model: xpmir.text.huggingface.base.HFModel
A Hugging-Face model
- XPM Config xpmir.text.huggingface.encoders.HFCLSEncoder(*, model) [source]
Bases: HFEncoderBase, TokenizedEncoder[TokenizedTexts, TextsRepresentationOutput]
Submit type: xpmir.text.huggingface.encoders.HFCLSEncoder
Encodes a text using the [CLS] token
- model: xpmir.text.huggingface.base.HFModel
A Hugging-Face model
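[CLS] pooling keeps the first token's vector of each sequence as the representation of the whole text. A minimal sketch over nested lists (the actual implementation operates on model output tensors):

```python
def cls_pool(hidden_states):
    """hidden_states: [batch, seq_len, dim] as nested lists.
    Returns the first-token ([CLS]) vector of each sequence."""
    return [sequence[0] for sequence in hidden_states]

batch = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # sequence 1
    [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]],  # sequence 2
]
print(cls_pool(batch))  # → [[0.1, 0.2], [0.7, 0.8]]
```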
Legacy
The old HuggingFace wrappers are listed below for reference; they should not be used for new development.
- XPM Config xpmir.text.huggingface.BaseTransformer(*, model_id, trainable, layer, dropout) [source]
Bases: Encoder
Submit type: xpmir.text.huggingface.BaseTransformer
Base transformer class from HuggingFace
- model_id: str = bert-base-uncased
Model ID from huggingface
- trainable: bool
Whether BERT parameters should be trained
- layer: int = 0
Layer to use (0 is the last, -1 to use them all)
- dropout: float = 0
(deprecated) Define a dropout for all the layers
Encoders
- XPM Config xpmir.text.huggingface.TransformerEncoder(*, model_id, trainable, layer, dropout, maxlen) [source]
Bases: BaseTransformer, TextEncoder, DistributableModel
Submit type: xpmir.text.huggingface.TransformerEncoder
Encodes using the [CLS] token
- model_id: str = bert-base-uncased
Model ID from huggingface
- trainable: bool
Whether BERT parameters should be trained
- layer: int = 0
Layer to use (0 is the last, -1 to use them all)
- dropout: float = 0
(deprecated) Define a dropout for all the layers
- maxlen: int
Maximum length (in tokens) of the input
- XPM Config xpmir.text.huggingface.TransformerTokensEncoder(*, model_id, trainable, layer, dropout) [source]
Bases: BaseTransformer, TokensEncoder
Submit type: xpmir.text.huggingface.TransformerTokensEncoder
A tokens encoder based on HuggingFace
- model_id: str = bert-base-uncased
Model ID from huggingface
- trainable: bool
Whether BERT parameters should be trained
- layer: int = 0
Layer to use (0 is the last, -1 to use them all)
- dropout: float = 0
(deprecated) Define a dropout for all the layers
- XPM Config xpmir.text.huggingface.TransformerTextEncoderAdapter(*, encoder, maxlen) [source]
Bases: TextEncoder, DistributableModel
Submit type: xpmir.text.huggingface.TransformerTextEncoderAdapter
- maxlen: int
- XPM Config xpmir.text.huggingface.DualTransformerEncoder(*, model_id, trainable, layer, dropout, maxlen) [source]
Bases: BaseTransformer, DualTextEncoder
Submit type: xpmir.text.huggingface.DualTransformerEncoder
Encodes the (query, document) pair using the [CLS] token
maxlen: Maximum length (in tokens) of the query-document pair, or None to use the transformer limit
- model_id: str = bert-base-uncased
Model ID from huggingface
- trainable: bool
Whether BERT parameters should be trained
- layer: int = 0
Layer to use (0 is the last, -1 to use them all)
- dropout: float = 0
(deprecated) Define a dropout for all the layers
- maxlen: int
- version: int = 2 (constant)
- XPM Config xpmir.text.huggingface.SentenceTransformerTextEncoder(*, model_id) [source]
Bases: TextEncoder
Submit type: xpmir.text.huggingface.SentenceTransformerTextEncoder
A Sentence Transformers text encoder
- model_id: str = sentence-transformers/all-MiniLM-L6-v2
- XPM Config xpmir.text.huggingface.OneHotHuggingFaceEncoder(*, model_id, maxlen) [source]
Bases: TextEncoder
Submit type: xpmir.text.huggingface.OneHotHuggingFaceEncoder
An encoder that maps a text to a 0/1 vector: a component is 1 if the text contains the corresponding token, and 0 otherwise
- model_id: str = bert-base-uncased
Model ID from huggingface
- maxlen: int
Max length for texts
- version: int = 2 (constant)
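The 0/1 encoding can be sketched over token IDs (illustration only; the real encoder derives the vocabulary from the HuggingFace tokenizer):

```python
def one_hot_encode(token_ids, vocab_size):
    """Return a vocab_size-dimensional vector with a 1 at every
    token ID that occurs in the text, and 0 elsewhere."""
    vector = [0] * vocab_size
    for tok in token_ids:
        vector[tok] = 1
    return vector

print(one_hot_encode([1, 3, 3], vocab_size=5))  # → [0, 1, 0, 1, 0]
```

Note that repeated tokens are not counted: the vector records presence only.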
- XPM Config xpmir.text.huggingface.DualDuoBertTransformerEncoder(*, model_id, trainable, layer, dropout, maxlen_query, maxlen_doc) [source]
Bases: BaseTransformer, TripletTextEncoder
Submit type: xpmir.text.huggingface.DualDuoBertTransformerEncoder
Vector encoding of a (query, document, document) triplet
The input is laid out as: [CLS] query [SEP] doc1 [SEP] doc2 [SEP]
- model_id: str = bert-base-uncased
Model ID from huggingface
- trainable: bool
Whether BERT parameters should be trained
- layer: int = 0
Layer to use (0 is the last, -1 to use them all)
- dropout: float = 0
(deprecated) Define a dropout for all the layers
- maxlen_query: int = 64
Maximum length for the query
- maxlen_doc: int = 224
Maximum length for each of the two documents
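The layout above can be sketched as follows (an illustration of per-segment truncation with the documented defaults, not the actual implementation; special tokens are shown as strings for readability):

```python
def duo_input(query_ids, doc1_ids, doc2_ids, maxlen_query=64, maxlen_doc=224,
              cls="[CLS]", sep="[SEP]"):
    """Assemble [CLS] query [SEP] doc1 [SEP] doc2 [SEP],
    truncating each segment to its own maximum length."""
    return ([cls] + query_ids[:maxlen_query] + [sep]
            + doc1_ids[:maxlen_doc] + [sep]
            + doc2_ids[:maxlen_doc] + [sep])

tokens = duo_input(["q1", "q2"], ["d1"], ["e1", "e2"], maxlen_query=2, maxlen_doc=1)
print(tokens)
# → ['[CLS]', 'q1', 'q2', '[SEP]', 'd1', '[SEP]', 'e1', '[SEP]']
```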
- XPM Config xpmir.text.huggingface.TransformerVocab(*, model_id, trainable, layer, dropout) [source]
Bases: TransformerTokensEncoder
Submit type: xpmir.text.huggingface.TransformerVocab
Old tokens encoder
- model_id: str = bert-base-uncased
Model ID from huggingface
- trainable: bool
Whether BERT parameters should be trained
- layer: int = 0
Layer to use (0 is the last, -1 to use them all)
- dropout: float = 0
(deprecated) Define a dropout for all the layers
- XPM Config xpmir.text.huggingface.TransformerTokensEncoderWithMLMOutput(*, model_id, trainable, layer, dropout) [source]
Bases: TransformerTokensEncoder
Submit type: xpmir.text.huggingface.TransformerTokensEncoderWithMLMOutput
Transformer that outputs logits over the vocabulary
- model_id: str = bert-base-uncased
Model ID from huggingface
- trainable: bool
Whether BERT parameters should be trained
- layer: int = 0
Layer to use (0 is the last, -1 to use them all)
- dropout: float = 0
(deprecated) Define a dropout for all the layers
Tokenizers
- XPM Config xpmir.text.huggingface.HuggingfaceTokenizer(*, model_id, maxlen) [source]
Bases: OneHotHuggingFaceEncoder
Submit type: xpmir.text.huggingface.HuggingfaceTokenizer
The old encoder for one-hot encoding
- model_id: str = bert-base-uncased
Model ID from huggingface
- maxlen: int
Max length for texts
- version: int = 2 (constant)
Masked-LM
- XPM Config xpmir.text.huggingface.MLMEncoder(*, model_id, trainable, layer, dropout, maxlen, mlm_probability) [source]
Bases: BaseTransformer, DistributableModel
Submit type: xpmir.text.huggingface.MLMEncoder
Implementation of the encoder for the Masked Language Modeling task
- model_id: str = bert-base-uncased
Model ID from huggingface
- trainable: bool
Whether BERT parameters should be trained
- layer: int = 0
Layer to use (0 is the last, -1 to use them all)
- dropout: float = 0
(deprecated) Define a dropout for all the layers
- maxlen: int
- mlm_probability: float = 0.2
Probability to mask tokens
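Random masking with `mlm_probability` can be sketched as follows (a toy illustration, not the actual implementation; the mask ID 103 is BERT's `[MASK]` token, assumed here, and `-100` is the conventional ignore index for the loss):

```python
import random

def mask_tokens(token_ids, mlm_probability=0.2, mask_id=103, seed=0):
    """Replace each token by mask_id with probability mlm_probability.
    Labels keep the original ID at masked positions, -100 elsewhere."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in token_ids:
        if rng.random() < mlm_probability:
            masked.append(mask_id)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(-100)
    return masked, labels

masked, labels = mask_tokens([10, 11, 12, 13], mlm_probability=0.5)
```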
Hooks
- XPM Config xpmir.text.huggingface.LayerSelector(*, re_layer, transformer, pick_layers, select_embeddings, select_feed_forward) [source]
Bases: ParametersIterator
Submit type: xpmir.text.huggingface.LayerSelector
This class can be used to pick some of the transformer layers
- re_layer: str = (?:encoder|transformer)\.layer\.(\d+)\.
- transformer: xpmir.text.huggingface.BaseTransformer
The model for which layers are selected
- pick_layers: int = 0
Number of layers to pick, counting from the first processing layer (can be negative: e.g. -1 selects all but the last layer; 0 selects no layer)
- select_embeddings: bool = False
Whether to pick the embeddings layer
- select_feed_forward: bool = False
Whether to pick the feed forward of Transformer layers
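The default `re_layer` pattern extracts the layer index from transformer parameter names; for instance (the parameter names below follow the usual BERT naming and are shown for illustration only):

```python
import re

# The default re_layer pattern from the LayerSelector config
RE_LAYER = re.compile(r"(?:encoder|transformer)\.layer\.(\d+)\.")

names = [
    "embeddings.word_embeddings.weight",
    "encoder.layer.0.attention.self.query.weight",
    "encoder.layer.11.output.dense.bias",
]
for name in names:
    match = RE_LAYER.search(name)
    print(name, "->", int(match.group(1)) if match else None)
# embeddings.word_embeddings.weight -> None
# encoder.layer.0.attention.self.query.weight -> 0
# encoder.layer.11.output.dense.bias -> 11
```

Parameters whose names do not match the pattern (such as the embeddings above) are not layer parameters; those are covered by `select_embeddings` instead.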