meerqat.ir.embedding module#

Script to embed dataset and Knowledge Base prior to search.

Usage: embedding.py <dataset> <config> [--disable_caching --kb=<path> --output=<path>]

Positional arguments:
  1. <dataset> Path to the dataset

  2. <config> Path to the JSON configuration file (passed as kwargs)

Options:
--disable_caching

Disables Dataset caching (not useful when using save_to_disk); see datasets.set_caching_enabled()

--kb=<path>

Path to the KB that can be mapped from the passages

--output=<path>

Optionally save the resulting dataset there instead of overwriting the input dataset.
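
An invocation might look like the following (paths are hypothetical; the JSON configuration presumably supplies the keyword arguments of dataset_embed described below):

    python -m meerqat.ir.embedding data/passages experiments/embedding/config.json --kb=data/kb --output=data/passages_embedded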

meerqat.ir.embedding.get_face_inputs(batch, n_faces=4, face_dim=512, bbox_dim=7)[source]#

Formats pre-computed face features into fixed-size tensors, similarly to PreComputedImageFeatures.get_face_inputs

Parameters:
  • batch (dict) –

  • n_faces (int, optional) –

  • face_dim (int, optional) –

  • bbox_dim (int, optional) –

Returns:

face_inputs – {

  • face: Tensor(batch_size, 1, n_faces, face_dim)

  • bbox: Tensor(batch_size, 1, n_faces, bbox_dim)

  • attention_mask: Tensor(batch_size, 1, n_faces)

}

Return type:

dict[str, Tensor]
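
A minimal sketch of the padding and masking scheme implied by these shapes (not the actual implementation; the batch keys face_embedding and face_box are assumptions):

    # Sketch only: pad/truncate each item's faces to n_faces and mask the padding.
    import torch

    def pad_faces(batch, n_faces=4, face_dim=512, bbox_dim=7):
        batch_size = len(batch["face_embedding"])  # hypothetical key
        face = torch.zeros(batch_size, 1, n_faces, face_dim)
        bbox = torch.zeros(batch_size, 1, n_faces, bbox_dim)
        attention_mask = torch.zeros(batch_size, 1, n_faces, dtype=torch.long)
        for i, (faces, boxes) in enumerate(zip(batch["face_embedding"], batch["face_box"])):
            if faces is None or len(faces) == 0:
                continue  # no detected face: the whole slot stays masked
            k = min(len(faces), n_faces)
            face[i, 0, :k] = torch.as_tensor(faces[:k])
            bbox[i, 0, :k] = torch.as_tensor(boxes[:k])
            attention_mask[i, 0, :k] = 1
        return {"face": face, "bbox": bbox, "attention_mask": attention_mask}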

meerqat.ir.embedding.get_image_inputs(batch, image_kwargs)[source]#

Formats pre-computed full-image features into fixed-size tensors, similarly to PreComputedImageFeatures.get_image_inputs

Parameters:
  • batch (dict) –

  • image_kwargs (dict) – keys are used to index batch to get precomputed features.

Returns:

image_inputs – one key per image feature (the same as image_kwargs) {

  • input: Tensor(batch_size, 1, ?)

  • attention_mask: Tensor(batch_size, 1). None of the images are masked.

}

Return type:

dict[str, dict[str, Tensor]]
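
A sketch of the corresponding gathering step, under the same caveats (feature names such as "clip-RN50" are only examples, not taken from the source):

    # Sketch only: each image_kwargs key selects a precomputed feature column of the batch.
    import torch

    def gather_image_inputs(batch, image_kwargs):
        image_inputs = {}
        for name in image_kwargs:  # e.g. {"clip-RN50": {}, "imagenet-RN50": {}}
            features = torch.as_tensor(batch[name]).unsqueeze(1)  # (batch_size, 1, feature_dim)
            attention_mask = torch.ones(features.shape[:2], dtype=torch.long)  # none of the images are masked
            image_inputs[name] = {"input": features, "attention_mask": attention_mask}
        return image_inputs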

meerqat.ir.embedding.map_passage_to_kb(batch, kb, features)[source]#
Parameters:
  • batch (dict) – Should be a batch from the passages KB. Should be able to map to the KB using the ‘index’ key

  • kb (Dataset) – Should be a dataset with pre-computed features

  • features (List[str]) – each feature in features is used to index kb and is then added to the batch
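
For illustration, a hypothetical batched Dataset.map call in the same spirit (paths and feature names are assumptions, and the wrapper below is not the module's own function):

    from datasets import load_from_disk

    passages = load_from_disk("path/to/passages")  # hypothetical paths
    kb = load_from_disk("path/to/kb")

    def add_kb_features(batch, kb, features):
        subset = kb.select(batch["index"])  # the 'index' key maps each passage to its KB article
        for feature in features:
            batch[feature] = subset[feature]
        return batch

    passages = passages.map(
        add_kb_features,
        batched=True,
        fn_kwargs={"kb": kb, "features": ["face_embedding", "face_box"]},  # hypothetical feature names
    )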

meerqat.ir.embedding.expand_query(batch, key='passage', kb=None, run=None, tokenizer=None, qe_predictions_key=None, doc_name_key='wikidata_label')[source]#
meerqat.ir.embedding.is_multimodal(model)[source]#
meerqat.ir.embedding.get_inputs(batch, model, tokenizer, tokenization_kwargs={}, key='passage', kb=None, run=None, qe_predictions_key=None)[source]#

Tokenizes input text and optionally gathers image features from the kb depending on model.

Parameters:
  • batch (dict) –

  • model (nn.Module) – If it’s an ECAEncoder or IntermediateLinearFusion instance, will gather image features to take as input (from the kb if kb is not None)

  • tokenizer (PreTrainedTokenizer) –

  • tokenization_kwargs (dict, optional) – To be passed to tokenizer

  • key (str, optional) – Used to index the batch to get the text

  • kb (Dataset, optional) – Should hold image features and be mappable from batch[‘index’]
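
A simplified sketch of this dispatch (assumed behaviour; the image-related key names are assumptions, and in the real function the image features may come from the kb mapped via batch[‘index’]):

    # Sketch only: text is always tokenized; image features are gathered for multimodal models.
    def build_inputs(batch, model, tokenizer, tokenization_kwargs={}, key="passage", image_kwargs={}):
        inputs = dict(tokenizer(batch[key], **tokenization_kwargs))
        if is_multimodal(model):  # e.g. an ECAEncoder or IntermediateLinearFusion instance
            inputs["face_inputs"] = get_face_inputs(batch)                   # assumed key name
            inputs["image_inputs"] = get_image_inputs(batch, image_kwargs)   # assumed key name
        return inputs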

meerqat.ir.embedding.embed(batch, model, tokenizer, tokenization_kwargs={}, key='passage', save_as='text_embedding', output_key=None, forward_kwargs={}, layers=None, kb=None, call=None, run=None, qe_predictions_key=None)[source]#
Parameters:
  • batch – see get_inputs

  • model – see get_inputs

  • tokenizer – see get_inputs

  • tokenization_kwargs – see get_inputs

  • key – see get_inputs

  • kb – see get_inputs

  • save_as (str, optional) – key to save the resulting embedding in batch

  • output_key (str or int, optional) – if model outputs a dict, list, or tuple, used to get THE output Tensor you want

  • forward_kwargs (dict, optional) – passed to model.forward

  • layers (list[int], optional) – if not None, expects the output to be a List[Tensor], with each Tensor shaped (batch_size, sequence_length, hidden_size). In this case, the representation of the first token (DPR-like) is saved under {save_as}_layer_{layer}, for each layer

  • call (str, optional) – Name of the method to call on model. By default, the model should be callable and is called.

  • run (Run, optional) – used to expand query with results of visual search
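
A hypothetical usage sketch with a DPR-style context encoder (the model name, tokenization settings, and output_key are assumptions, not the values used by the authors):

    from datasets import load_from_disk
    from transformers import AutoTokenizer, DPRContextEncoder

    dataset = load_from_disk("path/to/passages")  # hypothetical path
    model = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base").eval()
    tokenizer = AutoTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

    dataset = dataset.map(
        embed,
        batched=True,
        fn_kwargs=dict(
            model=model,
            tokenizer=tokenizer,
            tokenization_kwargs=dict(padding=True, truncation=True, return_tensors="pt"),
            key="passage",
            save_as="text_embedding",
            output_key="pooler_output",  # select the Tensor from the model's output
        ),
    )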

meerqat.ir.embedding.dataset_embed(dataset_path, map_kwargs={}, output_path=None, keep_columns=None, run=None, qe_predictions=None, qe_predictions_key=None, **fn_kwargs)[source]#

Loads the dataset from dataset_path, maps it through embed, and saves the result to output_path
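
A programmatic equivalent of the command-line entry point might look like this (paths and keyword values are assumptions; model and tokenizer are instantiated as in the sketch above):

    dataset_embed(
        "path/to/passages",
        output_path="path/to/passages_embedded",
        map_kwargs=dict(batch_size=64),
        # remaining keyword arguments are forwarded to embed:
        model=model,
        tokenizer=tokenizer,
        key="passage",
        save_as="text_embedding",
    )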