meerqat.ir.embedding module#

Script to embed dataset and Knowledge Base prior to search.

Usage: embedding.py <dataset> <config> [--disable_caching --kb=<path> --output=<path>]

Positional arguments:
  1. <dataset> Path to the dataset

  2. <config> Path to the JSON configuration file (passed as kwargs)

Options:
--disable_caching

Disables Dataset caching (not useful when using save_to_disk); see datasets.set_caching_enabled()

--kb=<path>

Path to the KB that can be mapped from the passages

--output=<path>

Optionally save the resulting dataset there instead of overwriting the input dataset.
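
An invocation might look like the following (paths are hypothetical; the JSON configuration presumably supplies the keyword arguments of dataset_embed described below):

    python -m meerqat.ir.embedding data/passages experiments/embedding/config.json --kb=data/kb --output=data/passages_embedded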

meerqat.ir.embedding.get_face_inputs(batch, n_faces=4, face_dim=512, bbox_dim=7)[source]#

Formats pre-computed face features into fixed-size tensors, similarly to PreComputedImageFeatures.get_face_inputs

Parameters:
  • batch (dict) –

  • n_faces (int, optional) –

  • face_dim (int, optional) –

  • bbox_dim (int, optional) –

Returns:

face_inputs – {

  • face: Tensor(batch_size, 1, n_faces, face_dim)

  • bbox: Tensor(batch_size, 1, n_faces, bbox_dim)

  • attention_mask: Tensor(batch_size, 1, n_faces)

}

Return type:

dict[str, Tensor]
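
A minimal sketch of the padding and masking scheme implied by these shapes (not the actual implementation; the batch keys face_embedding and face_box are assumptions):

    # Sketch only: pad/truncate each item's faces to n_faces and mask the padding.
    import torch

    def pad_faces(batch, n_faces=4, face_dim=512, bbox_dim=7):
        batch_size = len(batch["face_embedding"])  # hypothetical key
        face = torch.zeros(batch_size, 1, n_faces, face_dim)
        bbox = torch.zeros(batch_size, 1, n_faces, bbox_dim)
        attention_mask = torch.zeros(batch_size, 1, n_faces, dtype=torch.long)
        for i, (faces, boxes) in enumerate(zip(batch["face_embedding"], batch["face_box"])):
            if faces is None or len(faces) == 0:
                continue  # no detected face: the whole slot stays masked
            k = min(len(faces), n_faces)
            face[i, 0, :k] = torch.as_tensor(faces[:k])
            bbox[i, 0, :k] = torch.as_tensor(boxes[:k])
            attention_mask[i, 0, :k] = 1
        return {"face": face, "bbox": bbox, "attention_mask": attention_mask}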

meerqat.ir.embedding.get_image_inputs(batch, image_kwargs)[source]#

Formats pre-computed full-image features into fixed-size tensors, similarly to PreComputedImageFeatures.get_image_inputs

Parameters:
  • batch (dict) –

  • image_kwargs (dict) – keys are used to index batch to get precomputed features.

Returns:

image_inputs – one key per image feature (the same as image_kwargs) {

  • input: Tensor(batch_size, 1, ?)

  • attention_mask: Tensor(batch_size, 1). None of the images are masked.

}

Return type:

dict[str, dict[str, Tensor]]
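
A sketch of the corresponding gathering step, under the same caveats (feature names such as "clip-RN50" are only examples, not taken from the source):

    # Sketch only: each image_kwargs key selects a precomputed feature column of the batch.
    import torch

    def gather_image_inputs(batch, image_kwargs):
        image_inputs = {}
        for name in image_kwargs:  # e.g. {"clip-RN50": {}, "imagenet-RN50": {}}
            features = torch.as_tensor(batch[name]).unsqueeze(1)  # (batch_size, 1, feature_dim)
            attention_mask = torch.ones(features.shape[:2], dtype=torch.long)  # none of the images are masked
            image_inputs[name] = {"input": features, "attention_mask": attention_mask}
        return image_inputs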

meerqat.ir.embedding.map_passage_to_kb(batch, kb, features)[source]#
Parameters:
  • batch (dict) – Should be a batch from the passages KB. Should be able to map to the KB using the ‘index’ key

  • kb (Dataset) – Should be a dataset with pre-computed features

  • features (List[str]) – each feature in features is used to index kb and is then added to the batch
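
For illustration, a hypothetical batched Dataset.map call in the same spirit (paths and feature names are assumptions, and the wrapper below is not the module's own function):

    from datasets import load_from_disk

    passages = load_from_disk("path/to/passages")  # hypothetical paths
    kb = load_from_disk("path/to/kb")

    def add_kb_features(batch, kb, features):
        subset = kb.select(batch["index"])  # the 'index' key maps each passage to its KB article
        for feature in features:
            batch[feature] = subset[feature]
        return batch

    passages = passages.map(
        add_kb_features,
        batched=True,
        fn_kwargs={"kb": kb, "features": ["face_embedding", "face_box"]},  # hypothetical feature names
    )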

meerqat.ir.embedding.expand_query(batch, key='passage', kb=None, run=None, tokenizer=None, qe_predictions_key=None, doc_name_key='wikidata_label')[source]#
meerqat.ir.embedding.is_multimodal(model)[source]#
meerqat.ir.embedding.get_inputs(batch, model, tokenizer, tokenization_kwargs={}, key='passage', kb=None, run=None, qe_predictions_key=None)[source]#

Tokenizes input text and optionally gathers image features from the kb depending on model.

Parameters:
  • batch (dict) –

  • model (nn.Module) – If it’s an ECAEncoder or IntermediateLinearFusion instance, will gather image features to take as input (from the kb if kb is not None)

  • tokenizer (PreTrainedTokenizer) –

  • tokenization_kwargs (dict, optional) – To be passed to tokenizer

  • key (str, optional) – Used to index the batch to get the text

  • kb (Dataset, optional) – Should hold image features and be mappable from batch[‘index’]
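
A simplified sketch of this dispatch (assumed behaviour; the image-related key names are assumptions, and in the real function the image features may come from the kb mapped via batch[‘index’]):

    # Sketch only: text is always tokenized; image features are gathered for multimodal models.
    def build_inputs(batch, model, tokenizer, tokenization_kwargs={}, key="passage", image_kwargs={}):
        inputs = dict(tokenizer(batch[key], **tokenization_kwargs))
        if is_multimodal(model):  # e.g. an ECAEncoder or IntermediateLinearFusion instance
            inputs["face_inputs"] = get_face_inputs(batch)                   # assumed key name
            inputs["image_inputs"] = get_image_inputs(batch, image_kwargs)   # assumed key name
        return inputs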

meerqat.ir.embedding.embed(batch, model, tokenizer, tokenization_kwargs={}, key='passage', save_as='text_embedding', output_key=None, forward_kwargs={}, layers=None, kb=None, call=None, run=None, qe_predictions_key=None)[source]#
Parameters:
  • batch – see get_inputs

  • model – see get_inputs

  • tokenizer – see get_inputs

  • tokenization_kwargs – see get_inputs

  • key – see get_inputs

  • kb – see get_inputs

  • save_as (str, optional) – key to save the resulting embedding in batch

  • output_key (str or int, optional) – if model outputs a dict, list, or tuple, used to get THE output Tensor you want

  • forward_kwargs (dict, optional) – passed to model.forward

  • layers (list[int], optional) – if not None, expects the output to be a List[Tensor], with each Tensor shaped (batch_size, sequence_length, hidden_size). In this case, the representation of the first token (DPR-like) is saved under {save_as}_layer_{layer}, for each layer

  • call (str, optional) – Name of the method to call on model. By default, the model should be callable and is called.

  • run (Run, optional) – used to expand query with results of visual search
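
A hypothetical usage sketch with a DPR-style context encoder (the model name, tokenization settings, and output_key are assumptions, not the values used by the authors):

    from datasets import load_from_disk
    from transformers import AutoTokenizer, DPRContextEncoder

    dataset = load_from_disk("path/to/passages")  # hypothetical path
    model = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base").eval()
    tokenizer = AutoTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

    dataset = dataset.map(
        embed,
        batched=True,
        fn_kwargs=dict(
            model=model,
            tokenizer=tokenizer,
            tokenization_kwargs=dict(padding=True, truncation=True, return_tensors="pt"),
            key="passage",
            save_as="text_embedding",
            output_key="pooler_output",  # select the Tensor from the model's output
        ),
    )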

meerqat.ir.embedding.dataset_embed(dataset_path, map_kwargs={}, output_path=None, keep_columns=None, run=None, qe_predictions=None, qe_predictions_key=None, **fn_kwargs)[source]#

Loads the dataset from dataset_path, maps it through embed, and saves the result to output_path
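
A programmatic equivalent of the command-line entry point might look like this (paths and keyword values are assumptions; model and tokenizer are instantiated as in the sketch above):

    dataset_embed(
        "path/to/passages",
        output_path="path/to/passages_embedded",
        map_kwargs=dict(batch_size=64),
        # remaining keyword arguments are forwarded to embed:
        model=model,
        tokenizer=tokenizer,
        key="passage",
        save_as="text_embedding",
    )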