Hugging Face Transformers
- XPM Config: xpmir.text.huggingface.BaseTransformer(*, model_id, trainable, layer, dropout) [source]
Bases: Encoder
Base transformer class from Hugging Face
- model_id: str = bert-base-uncased
Model ID from Hugging Face
- trainable: bool
Whether BERT parameters should be trained
- layer: int = 0
Layer to use (0 is the last, -1 to use them all)
- dropout: float = 0
(deprecated) Sets the dropout for all the layers
Encoders
- XPM Config: xpmir.text.huggingface.TransformerEncoder(*, model_id, trainable, layer, dropout, maxlen) [source]
Bases: BaseTransformer, TextEncoder, DistributableModel
Encodes a text using the [CLS] token
- model_id: str = bert-base-uncased
Model ID from Hugging Face
- trainable: bool
Whether BERT parameters should be trained
- layer: int = 0
Layer to use (0 is the last, -1 to use them all)
- dropout: float = 0
(deprecated) Sets the dropout for all the layers
- maxlen: int
Maximum length of the input (in tokens), or None to use the transformer limit
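For illustration, a minimal configuration sketch, assuming the usual experimaestro keyword-argument construction of XPM Config objects (the parameter values here are arbitrary, not recommendations):

```python
from xpmir.text.huggingface import TransformerEncoder

# Encode texts with the [CLS] token of the last layer, truncating
# inputs to 256 tokens and fine-tuning the transformer weights.
encoder = TransformerEncoder(
    model_id="bert-base-uncased",  # default shown above
    trainable=True,                # fine-tune the BERT parameters
    layer=0,                       # 0 = last layer (see above)
    maxlen=256,                    # illustrative truncation length
)
```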
- XPM Config: xpmir.text.huggingface.TransformerTokensEncoder(*, model_id, trainable, layer, dropout) [source]
Bases: BaseTransformer, TokensEncoder
A tokens encoder based on Hugging Face transformers
- model_id: str = bert-base-uncased
Model ID from Hugging Face
- trainable: bool
Whether BERT parameters should be trained
- layer: int = 0
Layer to use (0 is the last, -1 to use them all)
- dropout: float = 0
(deprecated) Sets the dropout for all the layers
- XPM Config: xpmir.text.huggingface.TransformerTextEncoderAdapter(*, encoder, maxlen) [source]
Bases: TextEncoder, DistributableModel
Adapter that wraps a text encoder, overriding its maximum input length
- maxlen: int
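A hypothetical usage sketch: wrapping an existing encoder to enforce a shorter input length. This assumes the `encoder` argument accepts any TextEncoder (per the signature above) and that maxlen may be left unset on the wrapped encoder:

```python
from xpmir.text.huggingface import (
    TransformerEncoder,
    TransformerTextEncoderAdapter,
)

base = TransformerEncoder(model_id="bert-base-uncased", trainable=False)
# Reuse the same encoder, but cap inputs at 64 tokens
short_encoder = TransformerTextEncoderAdapter(encoder=base, maxlen=64)
```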
- XPM Config: xpmir.text.huggingface.DualTransformerEncoder(*, model_id, trainable, layer, dropout, maxlen) [source]
Bases: BaseTransformer, DualTextEncoder
Encodes a (query, document) pair using the [CLS] token
- model_id: str = bert-base-uncased
Model ID from Hugging Face
- trainable: bool
Whether BERT parameters should be trained
- layer: int = 0
Layer to use (0 is the last, -1 to use them all)
- dropout: float = 0
(deprecated) Sets the dropout for all the layers
- maxlen: int
Maximum length of the (query, document) pair in tokens, or None to use the transformer limit
- version: int = 2 (constant)
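A configuration sketch for cross-encoder style (query, document) scoring, following the keyword signature above (values are illustrative):

```python
from xpmir.text.huggingface import DualTransformerEncoder

pair_encoder = DualTransformerEncoder(
    model_id="bert-base-uncased",
    trainable=True,
    maxlen=512,  # cap on the joint query+document length, in tokens
)
```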
- XPM Config: xpmir.text.huggingface.SentenceTransformerTextEncoder(*, model_id) [source]
Bases: TextEncoder
A Sentence Transformers text encoder
- model_id: str = sentence-transformers/all-MiniLM-L6-v2
Model ID from Hugging Face
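For reference, the underlying model can be queried directly with the sentence-transformers library; this is a sketch of what the config presumably wraps, not the xpmir code path:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(["a first sentence", "a second sentence"])
print(embeddings.shape)  # (2, 384): this model produces 384-dim vectors
```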
- XPM Config: xpmir.text.huggingface.OneHotHuggingFaceEncoder(*, model_id, maxlen) [source]
Bases: TextEncoder
Encodes a text as a 0/1 vector over the tokenizer vocabulary: a component is 1 if the text contains the corresponding token, and 0 otherwise
- model_id: str = bert-base-uncased
Model ID from Hugging Face
- maxlen: int
Max length for texts
- version: int = 2 (constant)
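The 0/1 semantics can be sketched with the plain transformers tokenizer; this is an illustration of the encoding, not the xpmir implementation:

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
ids = tokenizer("hello world", add_special_tokens=False)["input_ids"]

vector = torch.zeros(tokenizer.vocab_size)
vector[ids] = 1.0  # 1 for every token present in the text, 0 elsewhere
```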
- XPM Config: xpmir.text.huggingface.DualDuoBertTransformerEncoder(*, model_id, trainable, layer, dropout, maxlen_query, maxlen_doc) [source]
Bases: BaseTransformer, TripletTextEncoder
Vector encoding of a (query, document, document) triplet
The input is laid out as: [CLS] query [SEP] doc1 [SEP] doc2 [SEP]
- model_id: str = bert-base-uncased
Model ID from Hugging Face
- trainable: bool
Whether BERT parameters should be trained
- layer: int = 0
Layer to use (0 is the last, -1 to use them all)
- dropout: float = 0
(deprecated) Sets the dropout for all the layers
- maxlen_query: int = 64
Maximum length (in tokens) for the query
- maxlen_doc: int = 224
Maximum length (in tokens) for each of the two documents
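The layout above can be reproduced with a plain Hugging Face tokenizer; this is an illustrative sketch of the input construction, not the xpmir code:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

def token_ids(text, limit):
    # Tokenize without special tokens and truncate to the budget
    return tok(text, add_special_tokens=False)["input_ids"][:limit]

q = token_ids("what is information retrieval", 64)
d1 = token_ids("first candidate document ...", 224)
d2 = token_ids("second candidate document ...", 224)

input_ids = (
    [tok.cls_token_id] + q + [tok.sep_token_id]
    + d1 + [tok.sep_token_id] + d2 + [tok.sep_token_id]
)
```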
- XPM Config: xpmir.text.huggingface.TransformerVocab(*, model_id, trainable, layer, dropout) [source]
Bases: TransformerTokensEncoder
(deprecated) Old name for the tokens encoder; use TransformerTokensEncoder instead
- model_id: str = bert-base-uncased
Model ID from Hugging Face
- trainable: bool
Whether BERT parameters should be trained
- layer: int = 0
Layer to use (0 is the last, -1 to use them all)
- dropout: float = 0
(deprecated) Sets the dropout for all the layers
- XPM Config: xpmir.text.huggingface.TransformerTokensEncoderWithMLMOutput(*, model_id, trainable, layer, dropout) [source]
Bases: TransformerTokensEncoder
Tokens encoder that outputs logits over the vocabulary
- model_id: str = bert-base-uncased
Model ID from Hugging Face
- trainable: bool
Whether BERT parameters should be trained
- layer: int = 0
Layer to use (0 is the last, -1 to use them all)
- dropout: float = 0
(deprecated) Sets the dropout for all the layers
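What "logits over the vocabulary" means can be sketched with the plain transformers API; the class above presumably wraps an equivalent masked-LM head:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tok("Paris is the [MASK] of France.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, seq_len, vocab_size)
```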
Tokenizers
- xpmir.text.huggingface.OneHotHuggingFaceEncoder: see the entry under Encoders above
Masked-LM
- XPM Config: xpmir.text.huggingface.MLMEncoder(*, model_id, trainable, layer, dropout, maxlen, mlm_probability) [source]
Bases: BaseTransformer, DistributableModel
Encoder for the Masked Language Modeling task
- model_id: str = bert-base-uncased
Model ID from Hugging Face
- trainable: bool
Whether BERT parameters should be trained
- layer: int = 0
Layer to use (0 is the last, -1 to use them all)
- dropout: float = 0
(deprecated) Sets the dropout for all the layers
- maxlen: int
Maximum length of the input (in tokens)
- mlm_probability: float = 0.2
Probability of masking each token
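The masking step can be sketched with the standard transformers collator; it is an assumption that the encoder above performs equivalent masking internally:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(
    tokenizer=tok, mlm=True, mlm_probability=0.2
)
batch = collator([tok("a masked language modeling example")])
# On average 20% of the tokens are selected for masking; labels keep
# the original ids at selected positions and are -100 everywhere else.
```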
Hooks
- XPM Config: xpmir.text.huggingface.LayerSelector(*, re_layer, transformer, pick_layers, select_embeddings, select_feed_forward) [source]
Bases: ParametersIterator
Picks a subset of the transformer layers (e.g. to control which parameters are trained)
- re_layer: str = (?:encoder|transformer)\.layer\.(\d+)\.
Regular expression matching layer parameter names; the group captures the layer index
- transformer: xpmir.text.huggingface.BaseTransformer
The model for which layers are selected
- pick_layers: int = 0
Number of layers to pick, counting from the first processing layer (0 picks no layer; negative values count from the end, e.g. -1 picks every layer but the last)
- select_embeddings: bool = False
Whether to pick the embeddings layer
- select_feed_forward: bool = False
Whether to pick the feed-forward sublayers of the Transformer layers
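A short sketch of what the default re_layer pattern matches against typical Hugging Face parameter names:

```python
import re

pattern = re.compile(r"(?:encoder|transformer)\.layer\.(\d+)\.")

# A typical BERT parameter name
name = "encoder.layer.11.attention.self.query.weight"
match = pattern.search(name)
if match:
    print(int(match.group(1)))  # -> 11, the layer index
```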