Text Representation

The text module groups classes and configurations that compute a representation of text – this includes word embeddings as well as contextual word embeddings and document embeddings.

XPM Config xpmir.text.encoders.Tokenizer

Bases: Config

Represents a vocabulary and a tokenization method

batch_tokenize(texts: List[str], batch_first=True, maxlen=None, mask=False) → TokenizedTexts

Returns tokenized texts

Parameters:

mask – Whether a mask should be computed

id2tok(idx: int) → str

Converts an integer id to a token

lexicon_size() → int

Returns the number of items in the lexicon

tok2id(tok: str) → int

Converts a token to an integer id
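
As an illustration, here is a self-contained sketch that mirrors this contract without subclassing the actual Config (the class name, the toy lexicon, and the convention of reserving id 0 for padding are assumptions, not part of xpmir):

from typing import Dict, List

class ToyTokenizer:
    """Sketch of the Tokenizer contract: a fixed lexicon plus id/token
    conversions (not the actual xpmir class)."""

    def __init__(self, words: List[str]):
        # Reserve id 0 for padding (an assumption, common in practice)
        self._tok2id: Dict[str, int] = {w: i + 1 for i, w in enumerate(words)}
        self._id2tok = {i: w for w, i in self._tok2id.items()}

    def tok2id(self, tok: str) -> int:
        """Converts a token to an integer id"""
        return self._tok2id[tok]

    def id2tok(self, idx: int) -> str:
        """Converts an integer id to a token"""
        return self._id2tok[idx]

    def lexicon_size(self) -> int:
        """Returns the number of items in the lexicon"""
        return len(self._tok2id)

tokenizer = ToyTokenizer(["hello", "world"])
assert tokenizer.tok2id("world") == 2
assert tokenizer.id2tok(2) == "world"
assert tokenizer.lexicon_size() == 2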

XPM Config xpmir.text.encoders.TokensEncoder

Bases: Tokenizer, Encoder

Represents a text as a sequence of token representations

forward(tok_texts: TokenizedTexts)

Returns embeddings for the tokenized texts.

Parameters:

tok_texts – tokenized texts
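
A minimal torch sketch of such an encoder (not the actual xpmir class; it assumes the token ids from TokenizedTexts come as a (batch, length) integer tensor, with id 0 used for padding):

import torch
import torch.nn as nn

class ToyTokensEncoder(nn.Module):
    """Sketch of the TokensEncoder contract: token ids in, one embedding
    per token out."""

    def __init__(self, vocab_size: int, dim: int = 64):
        super().__init__()
        # vocab_size counts the padding id as well (assumption)
        self.embeddings = nn.Embedding(vocab_size, dim, padding_idx=0)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # ids: (batch, length) integer tensor, e.g. extracted from the
        # TokenizedTexts returned by batch_tokenize (whose exact field
        # layout is not documented on this page)
        return self.embeddings(ids)  # (batch, length, dim)

encoder = ToyTokensEncoder(vocab_size=3)
out = encoder(torch.tensor([[1, 2, 0]]))  # one text, padded to length 3
assert out.shape == (1, 3, 64)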

XPM Config xpmir.text.encoders.Encoder

Bases: Module, EasyLogger

Base class for all word and text encoders

XPM Config xpmir.text.encoders.MeanTextEncoder(*, encoder)

Bases: TextEncoder

Returns the mean of the word embeddings

encoder: xpmir.text.encoders.TokensEncoder
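
The pooling itself amounts to a mask-aware average over the token axis; a minimal torch sketch of that computation (the mask handling over padded batches is an assumption about how the averaging is done, not code from xpmir):

import torch

# Token embeddings for a batch: (batch, length, dim), with padded positions
embeddings = torch.randn(2, 4, 8)
# 1 marks real tokens, 0 marks padding (the second text has 2 real tokens)
mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]], dtype=torch.float)

# Zero out padded positions, then divide by the number of real tokens
summed = (embeddings * mask.unsqueeze(-1)).sum(dim=1)  # (batch, dim)
mean = summed / mask.sum(dim=1, keepdim=True)          # (batch, dim)
assert mean.shape == (2, 8)
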
XPM Config xpmir.text.encoders.TripletTextEncoder

Bases: Encoder

The generic class for triplet encoders: query-document-document

This encoding is used in models such as DuoBERT, which compute whether one (query, document) pair is preferred to another.
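
Only the (query, document, document) input format follows from the description above; the scoring below is a deliberately naive stand-in (token overlap) for a DuoBERT-style model, kept runnable for illustration:

from typing import List, Tuple

import torch

def toy_preference(triples: List[Tuple[str, str, str]]) -> torch.Tensor:
    """Hypothetical sketch: for each (query, document, document) triple,
    score whether the first document is preferred to the second."""
    scores = []
    for query, doc_a, doc_b in triples:
        q = set(query.lower().split())
        overlap_a = len(q & set(doc_a.lower().split()))
        overlap_b = len(q & set(doc_b.lower().split()))
        scores.append(float(overlap_a - overlap_b))
    return torch.tensor(scores)  # > 0: the first document is preferred

triples = [("neural ranking", "Neural ranking models.", "Capital of France.")]
assert toy_preference(triples)[0] > 0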

XPM Config xpmir.text.encoders.TextEncoder

Bases: Encoder

Vectorial representation of a text - can be dense or sparse

forward(texts: List[str]) → torch.Tensor

Returns a matrix encoding the provided texts
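
To make the contract concrete, here is a self-contained sketch (not an xpmir class) that produces one dense row per text via a hashed bag of words:

from typing import List

import torch

class HashedBowEncoder:
    """Sketch of the TextEncoder contract: List[str] in, a (batch, dim)
    matrix out, one row per text."""

    def __init__(self, dim: int = 32):
        self.dim = dim

    def forward(self, texts: List[str]) -> torch.Tensor:
        """Returns a matrix encoding the provided texts"""
        out = torch.zeros(len(texts), self.dim)
        for i, text in enumerate(texts):
            for token in text.lower().split():
                out[i, hash(token) % self.dim] += 1.0
        return out

encoder = HashedBowEncoder()
matrix = encoder.forward(["a first text", "a second, longer text"])
assert matrix.shape == (2, 32)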

XPM Config xpmir.text.encoders.DualTextEncoder

Bases: Encoder

Vectorial representation for a pair of texts

This is used, for instance, with BERT models that represent the (query, document) couple jointly

forward(texts: List[Tuple[str, str]]) → torch.Tensor

Computes the representation of a list of pairs of texts
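
The same hashed bag-of-words trick can sketch this contract too (not the actual class; a real model such as BERT would attend jointly over both texts rather than hash their concatenation):

from typing import List, Tuple

import torch

class ToyDualEncoder:
    """Sketch of the DualTextEncoder contract: one joint representation
    per (query, document) pair."""

    def __init__(self, dim: int = 32):
        self.dim = dim

    def forward(self, texts: List[Tuple[str, str]]) -> torch.Tensor:
        """Computes the representation of a list of pairs of texts"""
        out = torch.zeros(len(texts), self.dim)
        for i, (query, document) in enumerate(texts):
            # Hash tokens of the concatenated pair, purely for illustration
            for token in f"{query} {document}".lower().split():
                out[i, hash(token) % self.dim] += 1.0
        return out

encoder = ToyDualEncoder()
pairs = [("what is bm25", "BM25 is a bag-of-words ranking function.")]
assert encoder.forward(pairs).shape == (1, 32)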