meerqat.ir.embedding module#
Script to embed dataset and Knowledge Base prior to search.
Usage: embedding.py <dataset> <config> [--disable_caching --kb=<path> --output=<path>]
- Positional arguments:
<dataset> Path to the dataset
<config> Path to the JSON configuration file (passed as kwargs)
- Options:
- --disable_caching
Disables Dataset caching (has no effect when using save_to_disk); see datasets.set_caching_enabled()
- --kb=<path>
Path to the KB that can be mapped from the passages
- --output=<path>
Optionally save the resulting dataset there instead of overwriting the input dataset.
- meerqat.ir.embedding.get_face_inputs(batch, n_faces=4, face_dim=512, bbox_dim=7)[source]#
Formats pre-computed face features into fixed-size tensors, similarly to PreComputedImageFeatures.get_face_inputs
- Parameters:
batch (dict) –
n_faces (int, optional) –
face_dim (int, optional) –
bbox_dim (int, optional) –
- Returns:
face_inputs –
- {
face: Tensor(batch_size, 1, n_faces, face_dim),
bbox: Tensor(batch_size, 1, n_faces, bbox_dim),
attention_mask: Tensor(batch_size, 1, n_faces)
}
- Return type:
dict[str, Tensor]
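The padding-and-masking logic described above can be sketched in plain Python (nested lists stand in for torch Tensors; the key names "face_embedding" and "face_box" are assumptions, not necessarily the ones used by the module):

```python
def get_face_inputs_sketch(batch, n_faces=4, face_dim=512, bbox_dim=7):
    """Pad per-image face features to exactly n_faces entries and build a mask.

    Hypothetical input keys: batch["face_embedding"] and batch["face_box"],
    each a list (one entry per image) of per-face feature lists, or None
    when no face was detected.
    """
    face, bbox, attention_mask = [], [], []
    for embeddings, boxes in zip(batch["face_embedding"], batch["face_box"]):
        if embeddings is None:  # image with no detected face: fully padded/masked
            embeddings, boxes = [], []
        embeddings, boxes = embeddings[:n_faces], boxes[:n_faces]
        k = len(embeddings)
        pad = n_faces - k
        # the extra axis of size 1 matches the (batch_size, 1, n_faces, ...) layout
        face.append([embeddings + [[0.0] * face_dim] * pad])
        bbox.append([boxes + [[0.0] * bbox_dim] * pad])
        attention_mask.append([[1] * k + [0] * pad])
    return {"face": face, "bbox": bbox, "attention_mask": attention_mask}
```

A minimal sketch, not the module's implementation: real faces beyond n_faces are truncated, missing ones are zero-padded, and the attention mask marks which slots hold real detections.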
- meerqat.ir.embedding.get_image_inputs(batch, image_kwargs)[source]#
Formats pre-computed full-image features into fixed-size tensors, similarly to PreComputedImageFeatures.get_image_inputs
- Parameters:
batch (dict) –
image_kwargs (dict) – keys are used to index batch to get precomputed features.
- Returns:
image_inputs – one key per image feature (the same keys as image_kwargs)
{
input: Tensor(batch_size, 1, ?),
attention_mask: Tensor(batch_size, 1) (none of the images are masked)
}
- Return type:
dict[str, dict[str,Tensor]]
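The per-feature gathering can be sketched as follows (plain lists stand in for torch Tensors; "clip" in the usage example is a hypothetical feature name):

```python
def get_image_inputs_sketch(batch, image_kwargs):
    """Build one {input, attention_mask} sub-dict per pre-computed image feature.

    image_kwargs maps feature names to their (unused here) per-feature options;
    its keys are used to index the batch, mirroring the documented behaviour.
    """
    image_inputs = {}
    for name in image_kwargs:
        features = batch[name]  # nested lists, conceptually (batch_size, feature_dim)
        image_inputs[name] = {
            "input": [[f] for f in features],           # (batch_size, 1, feature_dim)
            "attention_mask": [[1] for _ in features],  # none of the images are masked
        }
    return image_inputs
```

Since every passage has exactly one full image, the mask is all ones; the size-1 axis keeps the layout consistent with the multi-face case.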
- meerqat.ir.embedding.map_passage_to_kb(batch, kb, features)[source]#
- Parameters:
batch (dict) – A batch from the passages dataset; must be mappable to the KB via its 'index' key
kb (Dataset) – Should be a dataset with pre-computed features
features (List[str]) – Each feature name is used to index kb; the corresponding values are added to the batch
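The mapping amounts to a gather over KB rows, which can be sketched like this (kb is modeled as a list of dicts; indexing a row of a datasets.Dataset is analogous, and "image_embedding" is a hypothetical feature name):

```python
def map_passage_to_kb_sketch(batch, kb, features):
    """Copy pre-computed features from the KB rows referenced by batch['index']."""
    for feature in features:
        batch[feature] = [kb[i][feature] for i in batch["index"]]
    return batch
```

This is how passages, which store only an 'index' into the KB, pick up the image features computed once per KB article.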
- meerqat.ir.embedding.expand_query(batch, key='passage', kb=None, run=None, tokenizer=None, qe_predictions_key=None, doc_name_key='wikidata_label')[source]#
- meerqat.ir.embedding.get_inputs(batch, model, tokenizer, tokenization_kwargs={}, key='passage', kb=None, run=None, qe_predictions_key=None)[source]#
Tokenizes input text and optionally gathers image features from the kb depending on model.
- Parameters:
batch (dict) –
model (nn.Module) – If it is an ECAEncoder or IntermediateLinearFusion instance, image features will be gathered as additional input (from kb if kb is not None)
tokenizer (PreTrainedTokenizer) –
tokenization_kwargs (dict, optional) – To be passed to tokenizer
key (str, optional) – Used to index the batch to get the text
kb (Dataset, optional) – Should hold image features and be mappable from batch['index']
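The control flow can be sketched as below. The flag `model_is_multimodal` stands in for the documented isinstance check on ECAEncoder/IntermediateLinearFusion, and `tokenizer` is any callable returning a dict (as a Hugging Face PreTrainedTokenizer does); both are simplifications, not the module's actual signature:

```python
def get_inputs_sketch(batch, model_is_multimodal, tokenizer,
                      tokenization_kwargs=None, key="passage"):
    """Tokenize batch[key]; for multimodal models, also pass image features."""
    inputs = dict(tokenizer(batch[key], **(tokenization_kwargs or {})))
    if model_is_multimodal:
        # a multimodal encoder would additionally receive face/image features,
        # gathered from the KB via batch["index"] (see map_passage_to_kb)
        inputs["face_inputs"] = batch.get("face_inputs")
        inputs["image_inputs"] = batch.get("image_inputs")
    return inputs
```

For a text-only model the result is just the tokenizer output, so the same embedding loop serves both kinds of encoders.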
- meerqat.ir.embedding.embed(batch, model, tokenizer, tokenization_kwargs={}, key='passage', save_as='text_embedding', output_key=None, forward_kwargs={}, layers=None, kb=None, call=None, run=None, qe_predictions_key=None)[source]#
- Parameters:
batch – see get_inputs
model – see get_inputs
tokenizer – see get_inputs
tokenization_kwargs – see get_inputs
key – see get_inputs
kb – see get_inputs
save_as (str, optional) – key to save the resulting embedding in batch
output_key (str or int, optional) – if the model outputs a dict, list, or tuple, used to select the single output Tensor you want
forward_kwargs (dict, optional) – passed to model.forward
layers (list[int], optional) – if not None, expects the output to be a List[Tensor], each shaped (batch_size, sequence_length, hidden_size). In that case, the representation of the first token (DPR-like) is saved in {save_as}_layer_{layer} for each layer
call (str, optional) – Name of the method to call on model. By default, the model should be callable and is called.
run (Run, optional) – used to expand query with results of visual search
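The per-layer saving described for the layers argument can be sketched as follows (nested lists stand in for torch Tensors; `save_layer_embeddings` is a hypothetical helper name, not a function of the module):

```python
def save_layer_embeddings(batch, hidden_states, save_as="text_embedding", layers=(0,)):
    """Store the first-token (DPR-like) vector of each requested layer in the batch.

    hidden_states: per-layer list of (batch_size, seq_len, hidden_size) nested lists,
    as returned by a transformer with output_hidden_states enabled.
    """
    for layer in layers:
        batch[f"{save_as}_layer_{layer}"] = [seq[0] for seq in hidden_states[layer]]
    return batch
```

Keeping only the first-token representation matches the DPR convention of using the [CLS] position as the passage embedding.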