meerqat.ir.search module

Script and classes to search. Built upon datasets (itself wrapping FAISS and ElasticSearch).

Usage: search.py <dataset> <config> [--k=<k> --disable_caching --metrics=<path>]

Positional arguments:

<dataset> Path to the dataset
<config> Path to the JSON configuration file (passed as kwargs)

Options:

--k=<k>: Hyperparameter to search for the k nearest neighbors [default: 100].
--disable_caching: Disables Dataset caching (unnecessary when using save_to_disk); see datasets.set_caching_enabled().
--metrics=<path>: Path to the directory to save the results of the run and evaluation
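
As a rough illustration of the <config> argument, the sketch below writes a JSON configuration file using only parameter names documented on this page (kb_kwargs, index_kwargs, column, string_factory, reference_kb_path, metrics_kwargs). Every path, the index identifier, the column name and the string_factory value are hypothetical, and the nesting is an assumption drawn from the parameter descriptions of Searcher and KnowledgeBase, not a verified configuration.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical configuration: every path and name below is a placeholder.
config = {
    "kb_kwargs": {
        "data/passages": {  # assumption: each key identifies a KB
            "index_kwargs": {
                "dense_index": {  # hypothetical index identifier
                    "column": "passage_embedding",  # hypothetical column name
                    "string_factory": "L2norm,Flat",  # illustrative FAISS factory string
                }
            }
        }
    },
    "reference_kb_path": "data/passages",
    "metrics_kwargs": {"metrics": ["mrr@100", "precision@1"]},
}


def write_config(config, directory):
    """Serialize the configuration to <directory>/config.json and return its path."""
    path = Path(directory) / "config.json"
    path.write_text(json.dumps(config, indent=2))
    return path


with tempfile.TemporaryDirectory() as tmp:
    config_path = write_config(config, tmp)
    # Reading the file back shows that the structure survives the JSON round trip.
    reloaded = json.loads(config_path.read_text())
```

Such a file would then be given as the second positional argument, for example search.py <dataset> config.json --k=100.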

meerqat.ir.search.L2norm(queries): Normalizes each query to unit norm. Expects a batch of vectors of the same dimension.
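
The normalization described for L2norm can be sketched in plain Python as follows. This is not the module's implementation (which presumably operates on arrays); the function name l2_normalize and the choice to leave all-zero vectors unchanged are assumptions made for the example.

```python
import math


def l2_normalize(queries):
    """Return a copy of the batch in which every vector has unit L2 norm."""
    if not queries:
        return []
    dim = len(queries[0])
    normalized = []
    for vector in queries:
        if len(vector) != dim:
            raise ValueError("all vectors must have the same dimension")
        norm = math.sqrt(sum(x * x for x in vector))
        # All-zero vectors are left unchanged (a choice made for this sketch).
        normalized.append([x / norm for x in vector] if norm > 0 else list(vector))
    return normalized


batch = [[3.0, 4.0], [0.0, 2.0]]
unit_batch = l2_normalize(batch)
```

After normalization, the inner product between two vectors equals their cosine similarity.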

class meerqat.ir.search.IndexKind(value)

Bases: Enum

An enumeration.

FAISS = 0

ES = 1

PYSERINI = 2

class meerqat.ir.search.Index(key, kind=IndexKind.FAISS, do_L2norm=False)

Bases: object

Dataclass to hold information about an index (either FaissIndex or ESIndex)

Parameters:

key (str) – Associated key in the dataset where the queries are stored
kind (IndexKind, optional) –
do_L2norm (bool, optional) – Whether to apply L2norm to the queries

Notes

It is difficult to create a class hierarchy (e.g. separate FaissIndex and ESIndex classes) because public methods such as search_batch are defined in Dataset and take the index name as input.
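
The following stand-in mirrors the documented signatures of IndexKind and Index to show how the two fit together. It is a re-declaration for illustration only, not the module's source, and the key names used at the bottom are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class IndexKind(Enum):
    # Same members and values as listed in the reference above.
    FAISS = 0
    ES = 1
    PYSERINI = 2


@dataclass
class Index:
    key: str                           # dataset key where the queries are stored
    kind: IndexKind = IndexKind.FAISS  # which backend serves the index
    do_L2norm: bool = False            # whether queries are L2-normalized first


# Hypothetical keys: a dense index with normalized queries and a sparse one.
dense = Index(key="question_embedding", do_L2norm=True)
sparse = Index(key="input", kind=IndexKind.ES)
```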

class meerqat.ir.search.KnowledgeBase(kb_path=None, index_mapping_path=None, many2one=None, index_kwargs={}, es_client=None, load_dataset=True)

Bases: object

A KB can be indexed by several indexes.

Parameters:

kb_path (str, optional) – Path to the Dataset holding the KB
index_mapping_path (str, optional) – Path to the JSON file mapping each KB article to its corresponding passage indices
many2one (str, optional) – Strategy to apply in case of a many2one mapping (e.g. multiple passages to one article). Choose from {‘max’}. Has no effect if index_mapping_path is None. By default, the mapping is assumed to be one2many (e.g. one article to multiple passages), so results are overwritten in iteration order if that is not the case.
index_kwargs (dict, optional) – Each key identifies an Index and each value is passed to add_or_load_index
es_client (Elasticsearch, optional) –
load_dataset (bool, optional) – Whether to load the dataset. Skipping the load is useful for hyperparameter search if you want to use pre-computed results (see ir.hp)
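
To make the ‘max’ strategy of many2one more concrete, here is a minimal sketch of how passage-level results could be reduced to article-level results. The function name, the data layout and the assumption that higher scores are better are all hypothetical; the module's actual implementation may differ.

```python
def reduce_many2one_max(scores, indices, passage_to_article):
    """Keep, for every article, the best score among its retrieved passages."""
    best = {}
    for score, passage_index in zip(scores, indices):
        article = passage_to_article[passage_index]
        # Without a strategy, a later passage would simply overwrite an earlier
        # one; with 'max', only a strictly better score replaces it.
        if article not in best or score > best[article]:
            best[article] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    article_scores = [score for _, score in ranked]
    article_indices = [article for article, _ in ranked]
    return article_scores, article_indices


# Passages 10 and 11 belong to article 0, passage 20 to article 1.
article_scores, article_indices = reduce_many2one_max(
    scores=[0.9, 0.7, 0.8],
    indices=[10, 11, 20],
    passage_to_article={10: 0, 11: 0, 20: 1},
)
```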

pyserini_search_batch(index_name, queries, k=100, threads=10)

search_batch(index_name, queries, k=100): Pre-processes queries according to the index before calling self.dataset.search_batch.

search_batch_if_not_None(index_name, queries, k=100): Filters out queries that are None and runs search_batch for the rest.
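
The filtering behaviour of search_batch_if_not_None can be sketched as below. The helper takes the search function as an argument so that the example is self-contained; fake_search_batch is a toy stand-in, and returning empty lists for None queries is an assumption, since the reference does not say what is returned for them.

```python
def search_batch_if_not_none(search_batch, queries, k=100):
    """Run search_batch on non-None queries and re-align results with the input."""
    kept_positions = [i for i, query in enumerate(queries) if query is not None]
    kept_queries = [queries[i] for i in kept_positions]
    # Assumption: queries that are None get empty result lists.
    all_scores = [[] for _ in queries]
    all_indices = [[] for _ in queries]
    if kept_queries:
        scores, indices = search_batch(kept_queries, k)
        for position, query_scores, query_indices in zip(kept_positions, scores, indices):
            all_scores[position] = query_scores
            all_indices[position] = query_indices
    return all_scores, all_indices


def fake_search_batch(queries, k):
    """Toy search: scores each query by its length and returns at most 2 hits."""
    n_hits = min(k, 2)
    scores = [[float(len(query))] * n_hits for query in queries]
    indices = [list(range(n_hits)) for _ in queries]
    return scores, indices


scores, indices = search_batch_if_not_none(fake_search_batch, ["a", None, "abc"], k=2)
```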

add_or_load_index(column=None, index_name=None, kind=None, key=None, **index_kwarg)

Calls either add_or_load_elasticsearch_index or add_or_load_faiss_index according to kind. If column is None, it does not actually add the index. This is useful for hyperparameter search if you want to use pre-computed results (see ir.hp).

Parameters:

column (str) – Name/key of the column that holds the pre-computed embeddings.
index_name (str, optional) – Index identifier. Defaults to column
kind (IndexKind, optional) –
**index_kwarg – Passed to add_or_load_elasticsearch_index or add_or_load_faiss_index

add_or_load_faiss_index(column, index_name=None, load=False, save_path=None, string_factory=None, device=None, **kwargs)

Parameters:

column – see add_or_load_index
index_name – see add_or_load_index
load (bool, optional) – Whether to call load_faiss_index (True) or add_faiss_index (False)
save_path (str, optional) – Path where the index is saved using self.dataset.save_faiss_index. By default, the index is not saved.
string_factory (str, optional) – see Dataset.add_faiss_index and facebookresearch/faiss
device (int, optional) – see Dataset.add_faiss_index
**kwargs – Passed to load_faiss_index or add_faiss_index

Returns:

do_L2norm – Inferred from string_factory. See Index.

Return type:

bool
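
The reference does not state how do_L2norm is inferred from string_factory. One plausible rule, shown here purely as an assumption, is to check whether the FAISS factory string contains the L2norm transform (as in "L2norm,Flat"). The function name infer_do_l2norm is hypothetical.

```python
def infer_do_l2norm(string_factory):
    """Guess whether queries should be L2-normalized from a FAISS factory string.

    Assumed rule: normalize exactly when the factory string mentions the
    'L2norm' transform. This is an illustration, not the library's code.
    """
    return string_factory is not None and "L2norm" in string_factory


examples = {
    "L2norm,Flat": infer_do_l2norm("L2norm,Flat"),
    "Flat": infer_do_l2norm("Flat"),
    "None": infer_do_l2norm(None),
}
```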

add_or_load_pyserini_index(column=None, index_name=None, save_path=None, k1=0.9, b=0.4)

Parameters:

column (placeholder) –
index_name (str) –
save_path (str) –
k1 (float) – BM25 k1 parameter. (Default from pyserini)
b (float) – BM25 b parameter. (Default from pyserini)
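
To show what k1 and b control, the sketch below computes the contribution of a single query term under a textbook BM25 formula. The exact variant used by pyserini (through Lucene) may differ in details such as the IDF smoothing, so treat this as an illustration of the parameters' roles only. The function name and the statistics in the examples are hypothetical.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=0.9, b=0.4):
    """Score of one query term in one document under a textbook BM25 variant.

    k1 controls how quickly repeated occurrences of the term saturate;
    b controls how strongly long documents are penalized (0 disables it).
    """
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)


# Same term frequency, different document lengths: the shorter document wins.
short_doc = bm25_term_score(tf=2, doc_len=50, avg_doc_len=100, n_docs=1000, doc_freq=10)
long_doc = bm25_term_score(tf=2, doc_len=400, avg_doc_len=100, n_docs=1000, doc_freq=10)

# With b=0 the document length no longer matters.
no_norm_short = bm25_term_score(2, 50, 100, 1000, 10, b=0.0)
no_norm_long = bm25_term_score(2, 400, 100, 1000, 10, b=0.0)
```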

add_or_load_elasticsearch_index(column, index_name=None, load=False, **kwargs)

When loading, it also checks the settings and updates them if needed (using put_settings).

Parameters:

column – see add_or_load_index
index_name – see add_or_load_index
load (bool, optional) – Whether to load_elasticsearch_index or add_elasticsearch_index
**kwargs – Passed to load_elasticsearch_index or add_elasticsearch_index

class meerqat.ir.search.Searcher(kb_kwargs, k=100, reference_kb_path=None, reference_key='passage', qrels=None, request_timeout=1000, es_client_kwargs={}, fusion_kwargs={}, metrics_kwargs={}, do_fusion=None, qnonrels=None)

Bases: object

Aggregates several KnowledgeBases (KBs). Searches through a dataset using all the indexes of all KnowledgeBases. Fuses the results of searches over multiple indexes and computes metrics.

Parameters:

kb_kwargs (dict) – Each key identifies a KB and each value is passed to KnowledgeBase
k (int, optional) – Searches for the top-k results
reference_kb_path (str, optional) – Path to the Dataset that holds the reference KB, used to evaluate the results. If it is one of self.kbs, it will only get loaded once. Defaults to evaluating only from the provided qrels (not recommended).
reference_key (str, optional) – Used to get the reference field in the KB. Defaults to ‘passage’.
qrels (str, optional) – Path to the qrels JSON file. Defaults to start looking for relevant documents from scratch in self.reference_kb. At least one of {reference_kb_path, qrels} should be provided.
request_timeout (int, optional) – Timeout for Elasticsearch
es_client_kwargs (dict, optional) – Passed to Elasticsearch
fusion_kwargs (dict, optional) – Passed to Fusion (see fuse)
metrics_kwargs (dict, optional) – Passed to ranx.compare. Defaults to {"metrics": ["mrr@100", "precision@1", "precision@20", "hit_rate@20"]}
do_fusion (bool, optional) – Whether to fuse the results of the indexes. Defaults to True if there are multiple indexes.
qnonrels (str, optional) – Path to a JSON collection of irrelevant documents. Used as a cache to make search faster. Defaults to look for all results.
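
The default metrics named in metrics_kwargs are computed by ranx in the module. To clarify what they measure, the sketch below re-implements simplified versions of mrr@k, precision@k and hit_rate@k in plain Python over dictionaries shaped like qrels (query to relevant documents) and a run (query to scored documents). It is a stand-in for illustration; ranx may treat ties and other edge cases differently, and the toy data is hypothetical.

```python
def top_k(run_for_query, k):
    """Document ids of one query, sorted by decreasing score and cut at k."""
    return sorted(run_for_query, key=run_for_query.get, reverse=True)[:k]


def mrr_at_k(qrels, run, k):
    """Mean over queries of 1/rank of the first relevant document (0 if none)."""
    total = 0.0
    for qid, relevant in qrels.items():
        for rank, doc in enumerate(top_k(run.get(qid, {}), k), start=1):
            if relevant.get(doc, 0) > 0:
                total += 1.0 / rank
                break
    return total / len(qrels)


def precision_at_k(qrels, run, k):
    """Mean over queries of the fraction of the top k that is relevant."""
    total = 0.0
    for qid, relevant in qrels.items():
        hits = sum(1 for doc in top_k(run.get(qid, {}), k) if relevant.get(doc, 0) > 0)
        total += hits / k
    return total / len(qrels)


def hit_rate_at_k(qrels, run, k):
    """Fraction of queries with at least one relevant document in the top k."""
    hits = 0
    for qid, relevant in qrels.items():
        if any(relevant.get(doc, 0) > 0 for doc in top_k(run.get(qid, {}), k)):
            hits += 1
    return hits / len(qrels)


# Toy data: q1 finds its relevant document at rank 1, q2 at rank 2.
qrels = {"q1": {"d1": 1}, "q2": {"d5": 1}}
run = {"q1": {"d1": 0.9, "d2": 0.5}, "q2": {"d4": 0.8, "d5": 0.7, "d6": 0.1}}

mrr = mrr_at_k(qrels, run, 100)
p_at_1 = precision_at_k(qrels, run, 1)
hit_rate = hit_rate_at_k(qrels, run, 20)
```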

meerqat.ir.search.dataset_search(dataset, k=100, metric_save_path=None, map_kwargs={}, **kwargs)

Instantiates a Searcher, maps the dataset through it, then computes and saves metrics.

Parameters:

dataset (Dataset) –
k (int, optional) – see Searcher
metric_save_path (str, optional) – Path to the directory in which to save the resulting qrels, runs and metrics of eval_dataset. By default, nothing is saved.
map_kwargs (dict, optional) – Passed to self.dataset.map
**kwargs – Passed to Searcher