meerqat.ir.search module
Script and classes to search. Built upon datasets (itself wrapping FAISS and ElasticSearch).
Usage: search.py <dataset> <config> [--k=<k> --disable_caching --metrics=<path>]
- Positional arguments:
<dataset> Path to the dataset
<config> Path to the JSON configuration file (passed as kwargs)
- Options:
- --k=<k>
Hyperparameter to search for the k nearest neighbors [default: 100].
- --disable_caching
Disables Dataset caching (useless when using save_to_disk), see datasets.set_caching_enabled()
- --metrics=<path>
Path to the directory to save the results of the run and evaluation
- meerqat.ir.search.L2norm(queries)
Normalizes each query to unit norm. Expects a batch of vectors of the same dimension.
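As an illustration of what unit-norm scaling means, here is a minimal, dependency-free sketch. It is not the library's implementation: the helper name `l2_normalize` and the handling of all-zero vectors are assumptions made for this example.

```python
import math


def l2_normalize(queries):
    """Scale each vector of a batch to unit Euclidean norm (illustrative sketch)."""
    normalized = []
    for vector in queries:
        norm = math.sqrt(sum(x * x for x in vector))
        # Assumption: an all-zero vector is returned unchanged to avoid dividing by zero.
        if norm > 0:
            normalized.append([x / norm for x in vector])
        else:
            normalized.append(list(vector))
    return normalized


batch = l2_normalize([[3.0, 4.0], [0.0, 2.0]])
```

Once vectors have unit norm, their inner product equals their cosine similarity, which is why this step is typically paired with an inner-product index.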
- class meerqat.ir.search.IndexKind(value)
Bases: Enum
An enumeration.
- FAISS = 0
- ES = 1
- PYSERINI = 2
- class meerqat.ir.search.Index(key, kind=IndexKind.FAISS, do_L2norm=False)
Bases: object
Dataclass to hold information about an index (either FaissIndex or ESIndex)
- Parameters:
key (str) – Associated key in the dataset where the queries are stored
kind (IndexKind, optional) –
do_L2norm (bool, optional) – Whether to apply L2norm to the queries
Notes
Difficult to create a hierarchy like FaissIndex and ESIndex since public methods, such as search_batch, are defined in Dataset and take as input the index name.
- class meerqat.ir.search.KnowledgeBase(kb_path=None, index_mapping_path=None, many2one=None, index_kwargs={}, es_client=None, load_dataset=True)
Bases: object
A KB can be indexed by several indexes.
- Parameters:
kb_path (str, optional) – Path to the Dataset holding the KB
index_mapping_path (str, optional) – Path to the JSON file mapping each KB article to its corresponding passage indices
many2one (str, optional) – Strategy to apply in case of a many2one mapping (e.g. multiple passages to one article). Choose from {‘max’}. Has no effect if index_mapping_path is None. By default, the mapping is assumed to be one2many (e.g. one article to multiple passages); if it is not, results are overwritten in iteration order.
index_kwargs (dict, optional) – Each key identifies an Index and each value is passed to add_or_load_index
es_client (Elasticsearch, optional) –
load_dataset (bool, optional) – This is useful for hyperparameter search if you want to use pre-computed results (see ir.hp)
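The many2one ‘max’ strategy mentioned above can be pictured with the following sketch. It is an assumption-laden illustration, not the library code: the helper `aggregate_max`, the `passage2article` dictionary, and the convention that higher scores are better are all hypothetical.

```python
def aggregate_max(scores, indices, passage2article):
    """Keep, for each article, the best score among its retrieved passages."""
    best = {}
    for score, passage_index in zip(scores, indices):
        article = passage2article[passage_index]
        # Assumption: a higher score means a better match.
        if article not in best or score > best[article]:
            best[article] = score
    # Re-rank articles by decreasing score.
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [score for _, score in ranked], [article for article, _ in ranked]


# Passages 10 and 11 belong to article 0, passage 20 to article 1.
article_scores, article_indices = aggregate_max(
    [0.9, 0.5, 0.7], [10, 11, 20], {10: 0, 11: 0, 20: 1}
)
```

Without such a strategy, a later passage of the same article would simply overwrite an earlier one, which is the iteration-order behavior the parameter description warns about.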
- search_batch(index_name, queries, k=100)
Pre-process queries according to index before computing self.dataset.search_batch
- search_batch_if_not_None(index_name, queries, k=100)
Filters out queries that are None and runs search_batch for the rest.
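The filtering pattern can be sketched as follows. This is a standalone illustration rather than the method itself: the function takes a hypothetical `search_batch` callable instead of an index name, and returning empty lists for None queries is an assumption.

```python
def search_batch_if_not_none(search_batch, queries, k=100):
    """Run search_batch only on non-None queries and keep results aligned with the input."""
    kept_positions = [i for i, query in enumerate(queries) if query is not None]
    kept_queries = [queries[i] for i in kept_positions]
    # Assumption: a None query yields empty results.
    scores_batch = [[] for _ in queries]
    indices_batch = [[] for _ in queries]
    if kept_queries:
        kept_scores, kept_indices = search_batch(kept_queries, k)
        for position, scores, indices in zip(kept_positions, kept_scores, kept_indices):
            scores_batch[position] = scores
            indices_batch[position] = indices
    return scores_batch, indices_batch


def fake_search_batch(queries, k):
    """Stand-in for a real index: returns k dummy hits per query."""
    return [[1.0] * k for _ in queries], [list(range(k)) for _ in queries]


scores, indices = search_batch_if_not_none(fake_search_batch, ["a", None, "b"], k=2)
```

Keeping the output the same length as the input means the results can be written back to the dataset row by row.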
- add_or_load_index(column=None, index_name=None, kind=None, key=None, **index_kwarg)
Calls either add_or_load_elasticsearch_index or add_or_load_faiss_index depending on kind. If column is None, the index is not actually added, which is useful for hyperparameter search if you want to use pre-computed results (see ir.hp).
- Parameters:
column (str) – Name/key of the column that holds the pre-computed embeddings.
index_name (str, optional) – Index identifier. Defaults to column.
kind (IndexKind, optional) –
**index_kwarg – Passed to add_or_load_elasticsearch_index or add_or_load_faiss_index
- add_or_load_faiss_index(column, index_name=None, load=False, save_path=None, string_factory=None, device=None, **kwargs)
- Parameters:
column – see add_or_load_index
index_name – see add_or_load_index
load (bool, optional) – Whether to load_faiss_index or add_faiss_index
save_path (str, optional) – Save index using self.dataset.save_faiss_index. Defaults not to save.
string_factory (str, optional) – see Dataset.add_faiss_index and facebookresearch/faiss
device (int, optional) – see Dataset.add_faiss_index
**kwargs – Passed to load_faiss_index or add_faiss_index
- Returns:
do_L2norm – Inferred from string_factory. See Index.
- Return type:
bool
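The documentation does not say how do_L2norm is inferred from string_factory. One plausible reading, shown here purely as an assumption, is that it checks for the `L2norm` preprocessing token that FAISS index-factory strings accept; the helper name `infer_do_l2norm` is hypothetical.

```python
def infer_do_l2norm(string_factory):
    """Guess whether queries need L2 normalization from a FAISS factory string (sketch)."""
    # Assumption: the presence of the "L2norm" token signals that indexed vectors
    # were normalized, so queries should be normalized the same way.
    return string_factory is not None and "L2norm" in string_factory


needs_norm = infer_do_l2norm("L2norm,Flat")
```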
- add_or_load_pyserini_index(column=None, index_name=None, save_path=None, k1=0.9, b=0.4)
- Parameters:
column (placeholder) –
index_name (str) –
save_path (str) –
k1 (float) – BM25 k1 parameter. (Default from pyserini)
b (float) – BM25 b parameter. (Default from pyserini)
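To show what k1 and b control, here is a sketch of the textbook BM25 score for a single query term. The exact variant used by the underlying search engine may differ; the function name and the idea of passing a precomputed `idf` are assumptions for this example.

```python
def bm25_term_score(tf, doc_len, avg_doc_len, idf, k1=0.9, b=0.4):
    """Textbook BM25 contribution of one term to one document (illustrative sketch)."""
    # b controls how strongly the score is normalized by document length.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences of the term saturate.
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * length_norm)


average_doc_score = bm25_term_score(tf=1, doc_len=100, avg_doc_len=100, idf=2.0)
long_doc_score = bm25_term_score(tf=1, doc_len=400, avg_doc_len=100, idf=2.0)
```

With b = 0 document length has no effect, while a larger k1 lets repeated occurrences keep adding to the score for longer.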
- add_or_load_elasticsearch_index(column, index_name=None, load=False, **kwargs)
When loading, it also checks the settings and updates them if needed (using put_settings).
- Parameters:
column – see add_or_load_index
index_name – see add_or_load_index
load (bool, optional) – Whether to load_elasticsearch_index or add_elasticsearch_index
**kwargs – Passed to load_elasticsearch_index or add_elasticsearch_index
- class meerqat.ir.search.Searcher(kb_kwargs, k=100, reference_kb_path=None, reference_key='passage', qrels=None, request_timeout=1000, es_client_kwargs={}, fusion_kwargs={}, metrics_kwargs={}, do_fusion=None, qnonrels=None)
Bases: object
Aggregates several KnowledgeBases (KBs). Searches through a dataset using all the indexes of all KnowledgeBases. Fuses the results of searches over multiple indexes and computes metrics.
- Parameters:
kb_kwargs (dict) – Each key identifies a KB and each value is passed to KnowledgeBase
k (int, optional) – Searches for the top-k results
reference_kb_path (str, optional) – Path to the Dataset that holds the reference KB, used to evaluate the results. If it is one of self.kbs, it will only get loaded once. Defaults to evaluating only from the provided qrels (not recommended).
reference_key (str, optional) – Used to get the reference field in the KB. Defaults to ‘passage’.
qrels (str, optional) – Path to the qrels JSON file. Defaults to looking for relevant documents from scratch in self.reference_kb. At least one of {reference_kb_path, qrels} should be provided.
request_timeout (int, optional) – Timeout for Elasticsearch
es_client_kwargs (dict, optional) – Passed to Elasticsearch
fusion_kwargs (dict, optional) – Passed to Fusion (see fuse)
metrics_kwargs (dict, optional) – Passed to ranx.compare. Defaults to {"metrics": ["mrr@100", "precision@1", "precision@20", "hit_rate@20"]}
do_fusion (bool, optional) – Whether to fuse results of the indexes. Defaults to True if there are multiple indexes.
qnonrels (str, optional) – Path towards a JSON collection of irrelevant documents. Used as cache to make search faster. Defaults to look for all results.
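For readers unfamiliar with the default metrics, the sketch below computes them by hand on a toy run. It follows the usual definitions and is not a substitute for ranx, whose conventions may differ in details; the `run` and `qrels` layouts and all helper names are assumptions for this example.

```python
def reciprocal_rank(ranked_ids, relevant_ids, k):
    """1 / rank of the first relevant document within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the k top slots filled with relevant documents."""
    return sum(doc_id in relevant_ids for doc_id in ranked_ids[:k]) / k


def hit_rate_at_k(ranked_ids, relevant_ids, k):
    """1 if at least one relevant document appears in the top k, else 0."""
    return float(any(doc_id in relevant_ids for doc_id in ranked_ids[:k]))


def mean_over_queries(metric, run, qrels, k):
    """Average a per-query metric over all queries of the run."""
    return sum(metric(run[q], qrels.get(q, set()), k) for q in run) / len(run)


# Toy data: ranked document ids per query, and the set of relevant ids per query.
run = {"q1": ["d3", "d1", "d2"], "q2": ["d5", "d4"]}
qrels = {"q1": {"d1"}, "q2": {"d9"}}

mrr_100 = mean_over_queries(reciprocal_rank, run, qrels, 100)
precision_1 = mean_over_queries(precision_at_k, run, qrels, 1)
hit_rate_20 = mean_over_queries(hit_rate_at_k, run, qrels, 20)
```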
- meerqat.ir.search.dataset_search(dataset, k=100, metric_save_path=None, map_kwargs={}, **kwargs)
Instantiates a Searcher, maps the dataset through it, then computes and saves metrics.
- Parameters:
dataset (Dataset) –
k (int, optional) – see Searcher
metric_save_path (str, optional) – Path to the directory in which to save the qrels, runs and metrics of eval_dataset. Defaults not to save.
map_kwargs (dict, optional) – Passed to self.dataset.map
**kwargs – Passed to Searcher
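Putting the parameters above together, a JSON configuration file might be generated as in the sketch below. Only the parameter names come from this page; the nesting, the use of a path as the KB key, and every path, column and index name are hypothetical and should be checked against a working configuration.

```python
import json

# Hypothetical configuration; all paths and names are placeholders.
config = {
    "kb_kwargs": {
        # Assumption: each key identifies one KB, each value goes to KnowledgeBase.
        "data/passages": {
            "index_kwargs": {
                # Each key identifies an Index, each value goes to add_or_load_index.
                "dense": {
                    "column": "embedding",
                    "key": "query_embedding",
                    "string_factory": "L2norm,Flat",
                    "load": False,
                }
            }
        }
    },
    "reference_kb_path": "data/passages",
    "reference_key": "passage",
    "metrics_kwargs": {
        "metrics": ["mrr@100", "precision@1", "precision@20", "hit_rate@20"]
    },
}

config_text = json.dumps(config, indent=2)
```

The resulting text could then be saved to a file and given as the <config> positional argument of search.py.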