Text representation
The text module provides the building blocks for converting raw text into
numerical representations – from tokenisation to contextual embeddings and
document-level vectors. These encoders are used by the models described in the
Neural models section and can be composed freely.
Base encoders
Abstract interfaces for encoders and token-level encoders.
Tokenizers
Tokenizers split text into token sequences and manage the vocabulary mapping.
- XPM Config xpmir.text.tokenizers.Tokenizer
Bases: Config
Represents a vocabulary and a tokenization method
Deprecated: Use TokenizerBase instead
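To make "token sequences" and "vocabulary mapping" concrete, here is a minimal standalone sketch in plain Python. It does not use xpmir: the class ToyWhitespaceTokenizer, its methods, and the [UNK] convention are all hypothetical names chosen for illustration, and real tokenizers typically use subword units rather than whitespace splitting.

```python
from typing import Dict, List


class ToyWhitespaceTokenizer:
    """Hypothetical tokenizer (not part of xpmir): whitespace split + vocabulary lookup."""

    UNK = "[UNK]"

    def __init__(self, corpus: List[str]):
        # Reserve id 0 for unknown tokens, then assign ids in order of first appearance
        self.vocab: Dict[str, int] = {self.UNK: 0}
        for text in corpus:
            for token in self.tokenize(text):
                self.vocab.setdefault(token, len(self.vocab))

    def tokenize(self, text: str) -> List[str]:
        # Split the lowercased text into a token sequence
        return text.lower().split()

    def encode(self, text: str) -> List[int]:
        # Map each token to its id, falling back to the [UNK] id
        return [self.vocab.get(token, 0) for token in self.tokenize(text)]


tokenizer = ToyWhitespaceTokenizer(["neural ranking models", "ranking with text encoders"])
ids = tokenizer.encode("neural text ranking unknownword")
print(ids)
```

The last token is absent from the toy corpus, so it falls back to the reserved unknown id.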
Text encoders
Encoders that map a text string (or a pair of texts) to a dense representation.
- XPM Config xpmir.text.encoders.TextEncoderBase
Bases: Module, Generic[InputType, EncoderOutput]
Base class for all text encoders
- XPM Config xpmir.text.encoders.TextEncoder
Bases: TextEncoderBase
Encodes a text into a vector
Deprecated since version 1.3: Use TextEncoderBase directly
- XPM Config xpmir.text.encoders.DualTextEncoder
Bases: TextEncoderBase
Encodes a pair of texts into a vector
Deprecated since version 1.3: Use TextEncoderBase directly
- XPM Config xpmir.text.encoders.TripletTextEncoder
Bases: TextEncoderBase
Encodes a triplet of texts into a vector
Deprecated since version 1.3: Use TextEncoderBase directly
This is used in models such as DuoBERT, where (query, positive, negative) triplets are encoded.
- XPM Config xpmir.text.encoders.TokenizedTextEncoderBase
Bases: TextEncoderBase
Tokenizer-based encoders
- XPM Config xpmir.text.encoders.TokenizedTextEncoder(*, tokenizer, encoder)
Bases: TokenizedTextEncoderBase, Generic[InputType, EncoderOutput, TokenizerOutput]
Encodes a tokenizer input into a vector
This chains two objects: a tokenizer that segments the text, and an encoder
that returns a representation of the tokens in a vector space.
- tokenizer: xpmir.text.tokenizers.TokenizerBase[InputType, TokenizerOutput]
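The tokenizer-then-encoder composition can be sketched without xpmir as follows. This is only an illustration of the pattern under stated assumptions: ToyPipeline, toy_tokenizer and toy_encoder are hypothetical names, the type variables merely mirror the generic parameters listed above, and the "encoder" computes two hand-made features instead of learned embeddings.

```python
from typing import Callable, Generic, List, TypeVar

InputType = TypeVar("InputType")
TokenizerOutput = TypeVar("TokenizerOutput")
EncoderOutput = TypeVar("EncoderOutput")


class ToyPipeline(Generic[InputType, TokenizerOutput, EncoderOutput]):
    """Hypothetical stand-in for a tokenizer -> encoder composition (not xpmir code)."""

    def __init__(
        self,
        tokenizer: Callable[[List[InputType]], TokenizerOutput],
        encoder: Callable[[TokenizerOutput], EncoderOutput],
    ):
        self.tokenizer = tokenizer
        self.encoder = encoder

    def __call__(self, texts: List[InputType]) -> EncoderOutput:
        # Step 1: segment the inputs; step 2: map the tokens to vectors
        return self.encoder(self.tokenizer(texts))


def toy_tokenizer(texts: List[str]) -> List[List[str]]:
    return [text.lower().split() for text in texts]


def toy_encoder(batch: List[List[str]]) -> List[List[float]]:
    # Two illustrative features per text: token count and mean token length
    return [
        [float(len(tokens)), sum(len(t) for t in tokens) / max(len(tokens), 1)]
        for tokens in batch
    ]


pipeline = ToyPipeline(tokenizer=toy_tokenizer, encoder=toy_encoder)
vectors = pipeline(["Dense retrieval", "a b c"])
print(vectors)
```

Keeping the two stages as separate objects means either one can be swapped independently, provided the tokenizer's output type matches what the encoder expects.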
Adapters
Adapters transform or aggregate encoder outputs.
- XPM Config xpmir.text.adapters.MeanTextEncoder(*, encoder)
Bases: TokenizedTextEncoderBase
Returns the mean of the word embeddings
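As an illustration of what averaging word embeddings involves, the following pure-Python sketch computes a mean over token vectors. It is not the xpmir implementation: the function mean_pool and the mask argument are hypothetical, and whether MeanTextEncoder excludes padding positions in this way is an assumption.

```python
from typing import List


def mean_pool(embeddings: List[List[float]], mask: List[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1 (hypothetical helper)."""
    dim = len(embeddings[0])
    count = sum(mask)
    if count == 0:
        # No real tokens: return a zero vector rather than dividing by zero
        return [0.0] * dim
    totals = [0.0] * dim
    for vector, keep in zip(embeddings, mask):
        if keep:
            for i, value in enumerate(vector):
                totals[i] += value
    return [total / count for total in totals]


# Three token positions, the last one being padding (assumed convention)
token_vectors = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
attention_mask = [1, 1, 0]
pooled = mean_pool(token_vectors, attention_mask)
print(pooled)
```

Dividing by the number of unmasked tokens, rather than the padded sequence length, keeps short texts from being pulled towards the zero vector.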
- XPM Config xpmir.text.adapters.TopicTextConverter
Bases: Converter[TextRecord, str]
Extracts the text from a topic