vectorisers.huggingface.HuggingFaceVectoriser

vectorisers.huggingface.HuggingFaceVectoriser(
    model_name,
    device=None,
    model_revision='main',
    tokenizer_kwargs=None,
    model_kwargs=None,
)

A general-purpose wrapper class around Hugging Face Transformers models that generates text embeddings.

Attributes

Name Type Description
model_name str The name of the Hugging Face model to use.
tokenizer transformers.PreTrainedTokenizer The tokenizer for the specified model.
model transformers.PreTrainedModel The Hugging Face model instance.
device torch.device The device (CPU or GPU) on which the model is loaded.
tokenizer_kwargs dict Additional keyword arguments passed to the tokenizer.
model_kwargs dict Additional keyword arguments passed to the model.
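
A minimal construction sketch, assuming the import path matches the dotted name in this page's heading (in an installed package it may sit under a parent package). `VectoriserSettings` and `build_vectoriser` are hypothetical helpers introduced only for illustration, and the model name is an arbitrary example rather than a recommendation from this documentation.

```python
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class VectoriserSettings:
    """Hypothetical container mirroring the constructor arguments and defaults above."""

    model_name: str
    device: Any = None                      # left as None, as in the signature
    model_revision: str = "main"
    tokenizer_kwargs: Optional[dict] = None
    model_kwargs: Optional[dict] = None


def build_vectoriser(settings: VectoriserSettings):
    """Create a HuggingFaceVectoriser from the settings.

    The import is deferred so this module loads even where the package
    is not installed. The module path is an assumption taken from the
    heading of this page.
    """
    from vectorisers.huggingface import HuggingFaceVectoriser

    return HuggingFaceVectoriser(**asdict(settings))


# Example model name chosen for illustration only.
settings = VectoriserSettings(model_name="sentence-transformers/all-MiniLM-L6-v2")
```

Calling `build_vectoriser(settings)` would then load the tokenizer and model, which typically requires the package, its dependencies, and access to the model files.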

Methods

Name Description
transform Transforms one or more input texts into embeddings using the Hugging Face model.

transform

vectorisers.huggingface.HuggingFaceVectoriser.transform(texts)

Transforms one or more input texts into embeddings using the Hugging Face model.

Parameters

Name Type Description Default
texts (str, list[str]) The input text(s) to embed. Can be a single string or a list of strings. required
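
Because `texts` may be a single string or a list of strings, callers that wrap `transform` often normalise the input first. The helper below is a hypothetical sketch of that normalisation, not the library's own code; whether the class does this internally is not stated on this page.

```python
from typing import List, Union


def as_text_list(texts: Union[str, List[str]]) -> List[str]:
    """Return the input as a list of strings (hypothetical helper).

    A bare string becomes a one-element list; a list is checked and copied.
    """
    if isinstance(texts, str):
        return [texts]
    if isinstance(texts, list) and all(isinstance(t, str) for t in texts):
        return list(texts)
    raise TypeError("texts must be a str or a list of str")
```

With this in place, `as_text_list("hello")` and `as_text_list(["hello"])` produce the same one-element list, so downstream code only has to handle the list case.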

Returns

Name Type Description
np_ndarray numpy.ndarray A 2D array of embeddings, where each row corresponds to one input text.
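
The one-row-per-text shape implies that per-token model outputs are reduced to a single vector for each input. This page does not say which reduction the class uses; masked mean pooling is one common choice, sketched here in plain Python under that assumption. The function names are hypothetical, and a real implementation would more likely use vectorised NumPy or PyTorch operations.

```python
from typing import List, Sequence


def masked_mean_pool(
    token_vectors: Sequence[Sequence[float]],
    attention_mask: Sequence[int],
) -> List[float]:
    """Average the token vectors whose mask entry is non-zero."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def pool_batch(batch_vectors, batch_masks) -> List[List[float]]:
    """Pool every text in the batch, giving one row per input text."""
    return [masked_mean_pool(v, m) for v, m in zip(batch_vectors, batch_masks)]
```

Padding positions (mask value 0) are excluded from the average, so texts of different lengths in the same batch are pooled consistently.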

Raises

Name Type Description
VectorisationError If tokenization, model inference, or embedding extraction fails.
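
A sketch of handling the documented exception at the call site. The `VectorisationError` class defined here is a local stand-in so the example is self-contained; in real code it should be imported from the library, and its import location is not shown on this page. `transform_or_none` is a hypothetical helper.

```python
import logging

logger = logging.getLogger(__name__)


class VectorisationError(Exception):
    """Local stand-in for the library's exception (assumption: import the real one)."""


def transform_or_none(vectoriser, texts):
    """Call vectoriser.transform and return None if vectorisation fails.

    Only VectorisationError is caught; other exceptions propagate unchanged.
    """
    try:
        return vectoriser.transform(texts)
    except VectorisationError as exc:
        logger.warning("Embedding failed: %s", exc)
        return None
```

Returning `None` is only one possible policy; re-raising with added context or skipping the affected batch may suit a pipeline better.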