Huggingface Transformers
Base
Model architectures and default parameters are specified using an HFModelConfig.
- XPM Config xpmir.text.huggingface.base.HFModelConfig
Bases: Config, ABC
Submit type: xpmir.text.huggingface.base.HFModelConfig
Base class for all HuggingFace model configurations
- XPM Config xpmir.text.huggingface.base.HFModelConfigFromId(*, model_id)
Bases: HFModelConfig
Submit type: xpmir.text.huggingface.base.HFModelConfigFromId
- model_id: str
HuggingFace Model ID
Models
Models follow the HuggingFace hierarchy
- XPM Config xpmir.text.huggingface.base.HFMaskedLanguageModel(*, config)
Bases: HFModel
Submit type: xpmir.text.huggingface.base.HFMaskedLanguageModel
- config: xpmir.text.huggingface.base.HFModelConfig
Model configuration (specifies the architecture)
- XPM Config xpmir.text.huggingface.base.HFModel(*, config)
Bases: Module
Submit type: xpmir.text.huggingface.base.HFModel
Base transformer class from HuggingFace
The config specifies the architecture
- config: xpmir.text.huggingface.base.HFModelConfig
Model configuration (specifies the architecture)
Tokenizers
- XPM Config xpmir.text.huggingface.tokenizers.HFTokenizer(*, model_id, max_length)
Bases: Config, Initializable
Submit type: xpmir.text.huggingface.tokenizers.HFTokenizer
This is the main tokenizer class
- model_id: str
The tokenizer HuggingFace ID
- max_length: int = 4096
Maximum length for the tokenizer (can be overridden by the model)
- XPM Config xpmir.text.huggingface.tokenizers.HFTokenizerBase(*, tokenizer)
Bases: TokenizerBase[TokenizerInput, TokenizedTexts]
Submit type: xpmir.text.huggingface.tokenizers.HFTokenizerBase
Base class for all HuggingFace tokenizers
- tokenizer: xpmir.text.huggingface.tokenizers.HFTokenizer
The HuggingFace tokenizer
- XPM Config xpmir.text.huggingface.tokenizers.HFListTokenizer(*, tokenizer, separate_index)
Bases: HFTokenizerBase[List[List[str]]]
Submit type: xpmir.text.huggingface.tokenizers.HFListTokenizer
Processes lists of texts, separating them with a separator token
- tokenizer: xpmir.text.huggingface.tokenizers.HFTokenizer
The HuggingFace tokenizer
- separate_index: bool = 0
Use a tuple until this index
- XPM Config xpmir.text.huggingface.tokenizers.HFStringTokenizer(*, tokenizer)
Bases: HFTokenizerBase[Union[List[str], List[Tuple[str, str]]]]
Submit type: xpmir.text.huggingface.tokenizers.HFStringTokenizer
Process list of texts
- tokenizer: xpmir.text.huggingface.tokenizers.HFTokenizer
The HuggingFace tokenizer
- XPM Config xpmir.text.huggingface.tokenizers.HFTokenizerAdapter(*, tokenizer, converter)
Bases: HFTokenizerBase[TokenizerInput]
Submit type: xpmir.text.huggingface.tokenizers.HFTokenizerAdapter
Process list of texts
- tokenizer: xpmir.text.huggingface.tokenizers.HFTokenizer
The HuggingFace tokenizer
- converter: xpmir.utils.convert.Converter[TokenizerInput, Union[List[str], List[Tuple[str, str]]]]
Encoders
- XPM Config xpmir.text.huggingface.encoders.HFEncoderBase(*, model)
Bases: Module
Submit type: xpmir.text.huggingface.encoders.HFEncoderBase
Base HuggingFace encoder
- model: xpmir.text.huggingface.base.HFModel
A Hugging-Face model
- XPM Config xpmir.text.huggingface.encoders.HFTokensEncoder(*, model)
Bases: HFEncoderBase, TokenizedEncoder[TokenizedTexts, TokensRepresentationOutput]
Submit type: xpmir.text.huggingface.encoders.HFTokensEncoder
HuggingFace-based tokenized
- model: xpmir.text.huggingface.base.HFModel
A Hugging-Face model
- XPM Config xpmir.text.huggingface.encoders.HFCLSEncoder(*, model)
Bases: HFEncoderBase, TokenizedEncoder[TokenizedTexts, TextsRepresentationOutput]
Submit type: xpmir.text.huggingface.encoders.HFCLSEncoder
Encodes a text using the [CLS] token
- model: xpmir.text.huggingface.base.HFModel
A Hugging-Face model
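As a rough illustration of what [CLS] pooling means (this is not the xpmir implementation), the sketch below keeps the first vector of each sequence. It uses plain Python lists in place of tensors and assumes the [CLS] token sits at position 0, as in BERT-style models; `cls_pool` is a hypothetical helper name.

```python
from typing import List

Vector = List[float]


def cls_pool(hidden_states: List[List[Vector]]) -> List[Vector]:
    """Return the first token vector of each sequence.

    hidden_states is indexed as [batch][token][dimension]; the [CLS] token
    is assumed to be at position 0.
    """
    return [sequence[0] for sequence in hidden_states]


# Two sequences of three tokens each, with 2-dimensional token vectors
batch = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[1.0, 1.1], [1.2, 1.3], [1.4, 1.5]],
]
pooled = cls_pool(batch)  # one vector per input sequence
```

By contrast, a tokens encoder would keep every token vector instead of only the first one.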
Legacy
The old huggingface wrappers are listed below for reference, but should not be used for future development.
- XPM Config xpmir.text.huggingface.BaseTransformer(*, model_id, trainable, layer, dropout)
Bases: Encoder
Submit type: xpmir.text.huggingface.BaseTransformer
Base transformer class from HuggingFace
- model_id: str = bert-base-uncased
Model ID from huggingface
- trainable: bool
Whether BERT parameters should be trained
- layer: int = 0
Layer to use (0 is the last, -1 to use them all)
- dropout: float = 0
(deprecated) Define a dropout for all the layers
Encoders
- XPM Config xpmir.text.huggingface.TransformerEncoder(*, model_id, trainable, layer, dropout, maxlen)
Bases: BaseTransformer, TextEncoder, DistributableModel
Submit type: xpmir.text.huggingface.TransformerEncoder
Encodes using the [CLS] token
- model_id: str = bert-base-uncased
Model ID from huggingface
- trainable: bool
Whether BERT parameters should be trained
- layer: int = 0
Layer to use (0 is the last, -1 to use them all)
- dropout: float = 0
(deprecated) Define a dropout for all the layers
- maxlen: int
- XPM Config xpmir.text.huggingface.TransformerTokensEncoder(*, model_id, trainable, layer, dropout)
Bases: BaseTransformer, TokensEncoder
Submit type: xpmir.text.huggingface.TransformerTokensEncoder
A tokens encoder based on HuggingFace
- model_id: str = bert-base-uncased
Model ID from huggingface
- trainable: bool
Whether BERT parameters should be trained
- layer: int = 0
Layer to use (0 is the last, -1 to use them all)
- dropout: float = 0
(deprecated) Define a dropout for all the layers
- XPM Config xpmir.text.huggingface.TransformerTextEncoderAdapter(*, encoder, maxlen)
Bases: TextEncoder, DistributableModel
Submit type: xpmir.text.huggingface.TransformerTextEncoderAdapter
- maxlen: int
- XPM Config xpmir.text.huggingface.DualTransformerEncoder(*, model_id, trainable, layer, dropout, maxlen)
Bases: BaseTransformer, DualTextEncoder
Submit type: xpmir.text.huggingface.DualTransformerEncoder
Encodes the (query, document) pair using the [CLS] token
maxlen: Maximum length of the (query, document) pair in tokens, or None to use the transformer limit
- model_id: str = bert-base-uncased
Model ID from huggingface
- trainable: bool
Whether BERT parameters should be trained
- layer: int = 0
Layer to use (0 is the last, -1 to use them all)
- dropout: float = 0
(deprecated) Define a dropout for all the layers
- maxlen: int
- version: int = 2 (constant)
- XPM Config xpmir.text.huggingface.SentenceTransformerTextEncoder(*, model_id)
Bases: TextEncoder
Submit type: xpmir.text.huggingface.SentenceTransformerTextEncoder
A Sentence Transformers text encoder
- model_id: str = sentence-transformers/all-MiniLM-L6-v2
- XPM Config xpmir.text.huggingface.OneHotHuggingFaceEncoder(*, model_id, maxlen)
Bases: TextEncoder
Submit type: xpmir.text.huggingface.OneHotHuggingFaceEncoder
A tokenizer which encodes a text as a vector of 0s and 1s: 1 means the text contains the token, 0 otherwise
- model_id: str = bert-base-uncased
Model ID from huggingface
- maxlen: int
Max length for texts
- version: int = 2 (constant)
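The encoding described above can be sketched with a toy vocabulary. This is only an illustration under stated assumptions: the vocabulary, the whitespace tokenization and the `multi_hot` helper are hypothetical stand-ins for the HuggingFace tokenizer the class actually wraps.

```python
from typing import Dict, List


def multi_hot(token_ids: List[int], vocab_size: int) -> List[int]:
    """Return a vector with 1 at each token ID present in the text, 0 elsewhere."""
    vector = [0] * vocab_size
    for token_id in token_ids:
        vector[token_id] = 1
    return vector


# Toy vocabulary standing in for a HuggingFace tokenizer vocabulary (assumption)
vocab: Dict[str, int] = {"the": 0, "cat": 1, "sat": 2, "dog": 3, "mat": 4}
maxlen = 3  # the text is truncated to this many tokens, mirroring `maxlen`

tokens = "the cat sat the mat".split()[:maxlen]
vector = multi_hot([vocab[t] for t in tokens], len(vocab))
```

Because of the truncation, "mat" falls outside the first three tokens and its entry stays at 0.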
- XPM Config xpmir.text.huggingface.DualDuoBertTransformerEncoder(*, model_id, trainable, layer, dropout, maxlen_query, maxlen_doc)
Bases: BaseTransformer, TripletTextEncoder
Submit type: xpmir.text.huggingface.DualDuoBertTransformerEncoder
Vector encoding of a (query, document, document) triplet
The input is formatted as: [CLS] query [SEP] doc1 [SEP] doc2 [SEP]
- model_id: str = bert-base-uncased
Model ID from huggingface
- trainable: bool
Whether BERT parameters should be trained
- layer: int = 0
Layer to use (0 is the last, -1 to use them all)
- dropout: float = 0
(deprecated) Define a dropout for all the layers
- maxlen_query: int = 64
Maximum length for the query, the first document and the second one
- maxlen_doc: int = 224
Maximum length for the query, the first document and the second one
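The triplet layout can be sketched as follows. This is a minimal illustration, assuming each segment is truncated independently and that the lengths count tokens without the special tokens; `build_duo_input` is a hypothetical helper, and whitespace splitting stands in for the real tokenizer.

```python
from typing import List


def build_duo_input(
    query: List[str],
    doc1: List[str],
    doc2: List[str],
    maxlen_query: int = 64,
    maxlen_doc: int = 224,
) -> List[str]:
    """Assemble [CLS] query [SEP] doc1 [SEP] doc2 [SEP] from token lists."""
    return (
        ["[CLS]"]
        + query[:maxlen_query]  # truncate the query
        + ["[SEP]"]
        + doc1[:maxlen_doc]  # truncate the first document
        + ["[SEP]"]
        + doc2[:maxlen_doc]  # truncate the second document
        + ["[SEP]"]
    )


sequence = build_duo_input(
    "what is bm25".split(),
    "bm25 is a ranking function".split(),
    "bert is a language model".split(),
    maxlen_query=2,
    maxlen_doc=3,
)
```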
- XPM Config xpmir.text.huggingface.TransformerVocab(*, model_id, trainable, layer, dropout)
Bases: TransformerTokensEncoder
Submit type: xpmir.text.huggingface.TransformerVocab
Old tokens encoder
- model_id: str = bert-base-uncased
Model ID from huggingface
- trainable: bool
Whether BERT parameters should be trained
- layer: int = 0
Layer to use (0 is the last, -1 to use them all)
- dropout: float = 0
(deprecated) Define a dropout for all the layers
- XPM Config xpmir.text.huggingface.TransformerTokensEncoderWithMLMOutput(*, model_id, trainable, layer, dropout)
Bases: TransformerTokensEncoder
Submit type: xpmir.text.huggingface.TransformerTokensEncoderWithMLMOutput
Transformer that outputs logits over the vocabulary
- model_id: str = bert-base-uncased
Model ID from huggingface
- trainable: bool
Whether BERT parameters should be trained
- layer: int = 0
Layer to use (0 is the last, -1 to use them all)
- dropout: float = 0
(deprecated) Define a dropout for all the layers
Tokenizers
- XPM Config xpmir.text.huggingface.OneHotHuggingFaceEncoder(*, model_id, maxlen)
Bases: TextEncoder
Submit type: xpmir.text.huggingface.OneHotHuggingFaceEncoder
A tokenizer which encodes a text as a vector of 0s and 1s: 1 means the text contains the token, 0 otherwise
- model_id: str = bert-base-uncased
Model ID from huggingface
- maxlen: int
Max length for texts
- version: int = 2 (constant)
- XPM Config xpmir.text.huggingface.HuggingfaceTokenizer(*, model_id, maxlen)
Bases: OneHotHuggingFaceEncoder
Submit type: xpmir.text.huggingface.HuggingfaceTokenizer
The old one-hot encoder
- model_id: str = bert-base-uncased
Model ID from huggingface
- maxlen: int
Max length for texts
- version: int = 2 (constant)
Masked-LM
- XPM Config xpmir.text.huggingface.MLMEncoder(*, model_id, trainable, layer, dropout, maxlen, mlm_probability)
Bases: BaseTransformer, DistributableModel
Submit type: xpmir.text.huggingface.MLMEncoder
Implementation of the encoder for the Masked Language Modeling task
- model_id: str = bert-base-uncased
Model ID from huggingface
- trainable: bool
Whether BERT parameters should be trained
- layer: int = 0
Layer to use (0 is the last, -1 to use them all)
- dropout: float = 0
(deprecated) Define a dropout for all the layers
- maxlen: int
- mlm_probability: float = 0.2
Probability to mask tokens
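To make the role of mlm_probability concrete, here is a simplified sketch of token masking. It is an assumption-laden illustration, not the MLMEncoder code: `mask_tokens` and `SPECIAL_TOKENS` are hypothetical names, and the sketch always substitutes [MASK], whereas the usual BERT recipe also leaves some selected tokens unchanged or swaps them for random ones.

```python
import random
from typing import List, Set, Tuple

# Tokens that are never masked in this sketch (assumption)
SPECIAL_TOKENS: Set[str] = {"[CLS]", "[SEP]", "[PAD]"}


def mask_tokens(
    tokens: List[str], mlm_probability: float, rng: random.Random
) -> Tuple[List[str], List[int]]:
    """Replace each non-special token by [MASK] with probability mlm_probability.

    Returns the masked sequence and the positions that were masked.
    """
    masked: List[str] = []
    positions: List[int] = []
    for position, token in enumerate(tokens):
        if token not in SPECIAL_TOKENS and rng.random() < mlm_probability:
            masked.append("[MASK]")
            positions.append(position)
        else:
            masked.append(token)
    return masked, positions


tokens = ["[CLS]", "neural", "ranking", "models", "[SEP]"]
# Seeded generator so that the masking is reproducible
masked, positions = mask_tokens(tokens, mlm_probability=0.2, rng=random.Random(0))
```

The masked positions are the ones on which a masked language modeling loss would typically be computed.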
Hooks
- XPM Config xpmir.text.huggingface.LayerSelector(*, re_layer, transformer, pick_layers, select_embeddings, select_feed_forward)
Bases: ParametersIterator
Submit type: xpmir.text.huggingface.LayerSelector
This class can be used to pick some of the transformer layers
- re_layer: str = (?:encoder|transformer)\.layer\.(\d+)\.
- transformer: xpmir.text.huggingface.BaseTransformer
The model for which layers are selected
- pick_layers: int = 0
Number of layers to pick, counting from the first processing layer. Can be negative (e.g. -1 picks every layer except the last one); 0 picks no layer
- select_embeddings: bool = False
Whether to pick the embeddings layer
- select_feed_forward: bool = False
Whether to pick the feed forward of Transformer layers
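The default re_layer pattern extracts the layer index from a parameter name. The sketch below shows one plausible reading of pick_layers on top of it; the `select_parameters` helper, the explicit `num_layers` argument and the example parameter names are assumptions made for illustration, not the LayerSelector implementation.

```python
import re
from typing import List

# Default value of `re_layer` from the entry above
RE_LAYER = re.compile(r"(?:encoder|transformer)\.layer\.(\d+)\.")


def select_parameters(names: List[str], num_layers: int, pick_layers: int) -> List[str]:
    """Return the parameter names belonging to the picked transformer layers.

    Assumed semantics: a positive pick_layers keeps layers [0, pick_layers),
    a negative one keeps layers [0, num_layers + pick_layers), 0 keeps none.
    """
    limit = pick_layers if pick_layers >= 0 else num_layers + pick_layers
    selected = []
    for name in names:
        match = RE_LAYER.search(name)
        # Names without a layer index (e.g. embeddings) are never selected here
        if match and int(match.group(1)) < limit:
            selected.append(name)
    return selected


names = [
    "embeddings.word_embeddings.weight",
    "encoder.layer.0.attention.self.query.weight",
    "encoder.layer.1.output.dense.weight",
    "encoder.layer.2.output.dense.weight",
]
first_two = select_parameters(names, num_layers=3, pick_layers=2)
all_but_last = select_parameters(names, num_layers=3, pick_layers=-1)
```

Embedding and feed-forward parameters are left out of this sketch; in the class they are controlled separately by select_embeddings and select_feed_forward.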