Text Representation

The text module groups the classes and configurations that compute text representations: word embeddings, contextual word embeddings, and document embeddings.

XPM Config xpmir.text.encoders.Encoder

Bases: Module, EasyLogger, ABC

Submit type: xpmir.text.encoders.Encoder

Base class for all word and text encoders

XPM Config xpmir.text.encoders.TokensEncoder

Bases: Tokenizer, Encoder

Submit type: xpmir.text.encoders.TokensEncoder

(deprecated) Represents a text as a sequence of token representations

forward(tokenized: TokenizedTexts)

Returns embeddings for the tokenized texts.

tokenized: tokenized texts


XPM Config xpmir.text.tokenizers.Tokenizer

Bases: Config

Submit type: xpmir.text.tokenizers.Tokenizer

Represents a vocabulary and a tokenization method

Deprecated: Use TokenizerBase instead

batch_tokenize(texts: List[str], batch_first=True, maxlen=None, mask=False) → TokenizedTexts

Returns tokenized texts

Parameters: mask – Whether a mask should be computed

id2tok(idx: int) → str

Converts an integer id to a token

lexicon_size() → int

Returns the number of items in the lexicon

tok2id(tok: str) → int

Converts a token to an integer id
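
The relationship between the four methods above can be illustrated with a small stand-alone sketch. It does not use xpmir: ToyWhitespaceTokenizer, its padding and unknown tokens, and the plain (ids, masks) return value are all assumptions made for illustration, whereas the real batch_tokenize returns a TokenizedTexts object and also accepts batch_first.

```python
from typing import Dict, List, Optional, Tuple


class ToyWhitespaceTokenizer:
    """Hypothetical stand-in for the Tokenizer contract; not an xpmir class."""

    PAD = "<pad>"  # assumed padding token, id 0
    UNK = "<unk>"  # assumed unknown token, id 1

    def __init__(self, vocabulary: List[str]):
        # Reserve the first two ids for the padding and unknown tokens
        self._id2tok: List[str] = [self.PAD, self.UNK] + list(vocabulary)
        self._tok2id: Dict[str, int] = {t: i for i, t in enumerate(self._id2tok)}

    def tok2id(self, tok: str) -> int:
        """Converts a token to an integer id (unknown tokens map to UNK)."""
        return self._tok2id.get(tok, self._tok2id[self.UNK])

    def id2tok(self, idx: int) -> str:
        """Converts an integer id to a token."""
        return self._id2tok[idx]

    def lexicon_size(self) -> int:
        """Returns the number of items in the lexicon, special tokens included."""
        return len(self._id2tok)

    def batch_tokenize(
        self, texts: List[str], maxlen: Optional[int] = None, mask: bool = False
    ) -> Tuple[List[List[int]], Optional[List[List[int]]]]:
        """Returns padded id rows and, if requested, a 0/1 mask of real tokens."""
        rows = [[self.tok2id(t) for t in text.lower().split()][:maxlen] for text in texts]
        width = max((len(r) for r in rows), default=0)
        ids = [r + [0] * (width - len(r)) for r in rows]
        masks = [[1] * len(r) + [0] * (width - len(r)) for r in rows] if mask else None
        return ids, masks


tokenizer = ToyWhitespaceTokenizer(["neural", "ranking", "models"])
ids, masks = tokenizer.batch_tokenize(["neural ranking", "models"], mask=True)
```

In this sketch the mask marks which positions hold real tokens rather than padding, which is the usual reason for requesting one.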

XPM Config xpmir.text.tokenizers.TokenizerBase

Bases: Config, Initializable, Generic[TokenizerInput, TokenizerOutput], ABC

Submit type: xpmir.text.tokenizers.TokenizerBase

Base tokenizer

Text Encoders

XPM Config xpmir.text.encoders.TextEncoderBase

Bases: Encoder, Generic[InputType, EncoderOutput]

Submit type: xpmir.text.encoders.TextEncoderBase

Base class for all text encoders

XPM Config xpmir.text.encoders.TextEncoder

Bases: TextEncoderBase[str, torch.Tensor]

Submit type: xpmir.text.encoders.TextEncoder

Encodes a text into a vector

Deprecated since version 1.3: Use TextEncoderBase directly

XPM Config xpmir.text.encoders.DualTextEncoder

Bases: TextEncoderBase[Tuple[str, str], torch.Tensor]

Submit type: xpmir.text.encoders.DualTextEncoder

Encodes a pair of texts into a vector

Deprecated since version 1.3: Use TextEncoderBase directly

XPM Config xpmir.text.encoders.TripletTextEncoder

Bases: TextEncoderBase[Tuple[str, str, str], torch.Tensor]

Submit type: xpmir.text.encoders.TripletTextEncoder

Encodes a triplet of texts into a vector

Deprecated since version 1.3: Use TextEncoderBase directly

This is used in models such as DuoBERT, which encode (query, positive, negative) triplets.

XPM Config xpmir.text.encoders.TokenizedTextEncoderBase

Bases: TextEncoderBase[InputType, EncoderOutput]

Submit type: xpmir.text.encoders.TokenizedTextEncoderBase

XPM Config xpmir.text.encoders.TokenizedEncoder

Bases: Encoder, Generic[EncoderOutput, TokenizerOutput]

Submit type: xpmir.text.encoders.TokenizedEncoder

Encodes a tokenized text into a vector

Tokenizer-based encoders

XPM Config xpmir.text.encoders.TokenizedTextEncoder(*, tokenizer, encoder)

Bases: TokenizedTextEncoderBase[InputType, EncoderOutput], Generic[InputType, EncoderOutput, TokenizerOutput]

Submit type: xpmir.text.encoders.TokenizedTextEncoder

Encodes a tokenizer input into a vector

This pipelines two objects:

  1. A tokenizer that segments the text;

  2. An encoder that returns a representation of the tokens in a vector space.

tokenizer: xpmir.text.tokenizers.TokenizerBase[InputType, TokenizerOutput]
encoder: xpmir.text.encoders.TokenizedEncoder[TokenizerOutput, EncoderOutput]
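
The two-stage composition described above can be sketched without xpmir. Everything in this example is hypothetical (PipelineSketch, whitespace_tokenizer, length_encoder): it only shows how the tokenizer's output type becomes the encoder's input type, and it does not reproduce the actual configuration classes or their parameters.

```python
from typing import Callable, Generic, List, TypeVar

InputType = TypeVar("InputType")
TokenizerOutput = TypeVar("TokenizerOutput")
EncoderOutput = TypeVar("EncoderOutput")


class PipelineSketch(Generic[InputType, TokenizerOutput, EncoderOutput]):
    """Hypothetical tokenizer-then-encoder pipeline; not the xpmir class."""

    def __init__(
        self,
        tokenizer: Callable[[List[InputType]], TokenizerOutput],
        encoder: Callable[[TokenizerOutput], EncoderOutput],
    ):
        self.tokenizer = tokenizer
        self.encoder = encoder

    def __call__(self, texts: List[InputType]) -> EncoderOutput:
        # Step 1: segment the inputs; step 2: map the tokens to vectors
        return self.encoder(self.tokenizer(texts))


def whitespace_tokenizer(texts: List[str]) -> List[List[str]]:
    """Toy tokenizer: lowercases and splits on whitespace."""
    return [text.lower().split() for text in texts]


def length_encoder(tokens: List[List[str]]) -> List[List[float]]:
    """Toy encoder: one 2-d vector per text (token count, mean token length)."""
    return [
        [float(len(ts)), sum(len(t) for t in ts) / len(ts) if ts else 0.0]
        for ts in tokens
    ]


pipeline = PipelineSketch(whitespace_tokenizer, length_encoder)
vectors = pipeline(["Dense retrieval", "BM25"])
```

Keeping the tokenizer and the encoder as separate objects means either one can be swapped independently, provided the intermediate type still matches.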


XPM Config xpmir.text.adapters.MeanTextEncoder(*, encoder)

Bases: TokenizedTextEncoderBase[InputType, RepresentationOutput]

Submit type: xpmir.text.adapters.MeanTextEncoder

Returns the mean of the word embeddings

encoder: xpmir.text.encoders.TokenizedTextEncoderBase[InputType, xpmir.text.encoders.RepresentationOutput]
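
Mean pooling over word embeddings can be sketched in plain Python as follows. The function name masked_mean is hypothetical, and excluding padded positions through a mask is an assumption of this sketch: the documentation above only states that the mean of the word embeddings is returned.

```python
from typing import List


def masked_mean(
    embeddings: List[List[List[float]]], mask: List[List[int]]
) -> List[List[float]]:
    """Averages token vectors per text, skipping positions where mask is 0.

    embeddings has shape (batch, tokens, dim); mask has shape (batch, tokens).
    """
    pooled = []
    for token_vectors, token_mask in zip(embeddings, mask):
        dim = len(token_vectors[0]) if token_vectors else 0
        kept = [v for v, m in zip(token_vectors, token_mask) if m]
        if not kept:
            # No real token: fall back to a zero vector (assumed convention)
            pooled.append([0.0] * dim)
            continue
        pooled.append([sum(v[d] for v in kept) / len(kept) for d in range(dim)])
    return pooled


batch = [
    [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]],   # last position is padding
    [[2.0, 2.0], [4.0, 6.0], [6.0, 10.0]],  # no padding
]
mask = [[1, 1, 0], [1, 1, 1]]
means = masked_mean(batch, mask)
```

Without the mask, the padded position in the first text would pull its mean toward zero.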

XPM Config xpmir.text.adapters.TopicTextConverter

Bases: Converter[Record, str]

Submit type: xpmir.text.adapters.TopicTextConverter

Extracts the text from a topic