meerqat.data.loading module
Usages
passages
Segments Wikipedia articles (from the kilt_wikipedia dataset) into passages (e.g. paragraphs). Current options (passed in a JSON file) are:
prepend_title: whether to prepend the title at the beginning of each passage like “<title> [SEP] <passage>”
special_fields: removes the title, section titles (“Section::::”) and bullet points (“BULLET::::”)
uniform: each passage is n tokens long, without overlap; the text is tokenized with a transformers tokenizer
uniform_sents: each article is first segmented into sentences using spaCy; sentences are then grouped into passages such that each passage holds at most n tokens (spaCy tokens here, not transformers tokens as above)
map
Makes a JSON file out of a dataset column for quick indexing by string key.
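The idea of turning a column into a string-keyed lookup can be sketched as follows. This is an illustration only, not the module's code: the helper name `column_to_index` is hypothetical, and its `one2many` parameter (collecting every row index for a repeated value) is just one plausible reading, since this page does not document what the `--inverse` and `--one2many` flags do.

```python
import json
from typing import Dict, Iterable, List, Union


def column_to_index(values: Iterable, one2many: bool = False) -> Dict[str, Union[int, List[int]]]:
    """Map each value of a column to the row index (or indices) where it occurs.

    Hypothetical helper: the behavior of `one2many` is an assumption.
    """
    mapping: Dict[str, Union[int, List[int]]] = {}
    for index, value in enumerate(values):
        key = str(value)  # JSON object keys are always strings
        if one2many:
            # Keep every row index at which the value appears.
            mapping.setdefault(key, []).append(index)
        else:
            # Later rows overwrite earlier ones when a value is repeated.
            mapping[key] = index
    return mapping


column = ["Q1", "Q2", "Q1"]
print(json.dumps(column_to_index(column, one2many=True)))
```

Once written to disk, such a file can be loaded with `json.load` and queried by key without scanning the dataset.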
sentences
Used in the Inverse Cloze Task (ICT) to segment the text of a dataset into a list of sentences via spaCy.
For docopt
Usage:
loading.py passages <input> <output> [<config> --disable_caching]
loading.py map <dataset> <key> <output> [--inverse --one2many --disable_caching]
loading.py sentences <dataset>
- Options:
- --disable_caching
Disables Dataset caching (useless when using save_to_disk), see datasets.set_caching_enabled()
Functions
- meerqat.data.loading.answer_preprocess(answer)
Adapted from the datasets SQuAD metric. Lower-cases the text and removes punctuation, articles and extra whitespace.
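For reference, SQuAD-style answer normalization can be sketched as below. The function name `normalize_answer` and the exact steps follow the usual SQuAD evaluation recipe and are an assumption here, not a copy of this module's source.

```python
import re
import string


def normalize_answer(answer: str) -> str:
    """SQuAD-style normalization (sketch): lower-case, then drop punctuation,
    English articles and extra whitespace, in that order."""
    text = answer.lower()
    # Remove ASCII punctuation characters.
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    # Replace the articles "a", "an" and "the" (whole words only) with a space.
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    # Collapse runs of whitespace and strip both ends.
    return " ".join(text.split())


print(normalize_answer("The  Eiffel Tower!"))  # eiffel tower
```

Normalizing both the predicted and the reference answer this way makes string comparison insensitive to case, punctuation and articles.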
- meerqat.data.loading.map_kilt_triviaqa()
As instructed by huggingface/datasets
- meerqat.data.loading.remove_special_fields(paragraphs)
N.B. this code puts a lot of trust in the KILT pre-processing (facebookresearch/KILT) and simply removes the title (1st paragraph), section titles (“Section::::”) and bullet points (“BULLET::::”)
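A minimal sketch of that filtering follows. It assumes the markers appear at the start of a paragraph, as in KILT-formatted articles; `strip_special_fields` is a hypothetical name and the actual function may differ in details.

```python
from typing import List


def strip_special_fields(paragraphs: List[str]) -> List[str]:
    """Drop the title (first paragraph), section titles and bullet points."""
    kept = []
    # paragraphs[0] is assumed to be the article title, so it is skipped.
    for paragraph in paragraphs[1:]:
        # Assumption: KILT marks these fields with a prefix on the paragraph.
        if paragraph.startswith(("Section::::", "BULLET::::")):
            continue
        kept.append(paragraph)
    return kept


article = [
    "Paris\n",
    "Paris is the capital of France.\n",
    "Section::::History.\n",
    "BULLET::::- An example item\n",
    "It was founded long ago.\n",
]
print(strip_special_fields(article))
```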
- meerqat.data.loading.paragraphs_preprocess(paragraphs, method=None, **kwargs)
- Parameters:
paragraphs (List[str]) – List of paragraphs to preprocess
method (str, optional) – type of pre-processing, defaults to None (i.e. identity function)
**kwargs – Additional arguments are passed to the appropriate pre-processing function
- Returns:
paragraphs
- Return type:
List[str]
- meerqat.data.loading.uniform_passages(paragraphs, tokenizer, n=100, title=None)
- Parameters:
paragraphs (List[str]) – List of pre-processed paragraphs to split into passages
tokenizer (PreTrainedTokenizer) –
n (int, optional) – Number of tokens in each passage (excluding the title). Defaults to 100.
title (str, optional) – Title to prepend at the beginning of each passage like “<title> [SEP] <passage>”. Defaults to None (only “<passage>”).
- Returns:
passages – Each passage is pre-processed by the tokenizer (e.g. lower-cased, added space between punctuation marks, etc.)
- Return type:
List[str]
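The fixed-size, non-overlapping split can be sketched as below. This is an illustration under stated assumptions, not the module's implementation: `split_uniform` is a hypothetical name, paragraphs are joined with a single space, and `WhitespaceTokenizer` is a toy stand-in that exposes only the `tokenize`, `convert_tokens_to_string` and `sep_token` members that a transformers PreTrainedTokenizer also provides.

```python
from typing import List, Optional


class WhitespaceTokenizer:
    """Toy stand-in for a transformers tokenizer (hypothetical, for the demo only)."""

    sep_token = "[SEP]"

    def tokenize(self, text: str) -> List[str]:
        # Lower-cases like an uncased model would; splits on whitespace only.
        return text.lower().split()

    def convert_tokens_to_string(self, tokens: List[str]) -> str:
        return " ".join(tokens)


def split_uniform(paragraphs: List[str], tokenizer, n: int = 100,
                  title: Optional[str] = None) -> List[str]:
    """Cut the article into consecutive chunks of at most n tokens, without overlap."""
    # Assumption: the article is tokenized as one text, so a passage
    # may span a paragraph boundary.
    tokens = tokenizer.tokenize(" ".join(paragraphs))
    passages = []
    for start in range(0, len(tokens), n):
        # The text is rebuilt from tokens, so it reflects the tokenizer's
        # own normalization (here: lower-casing).
        passage = tokenizer.convert_tokens_to_string(tokens[start:start + n])
        if title is not None:
            passage = f"{title} {tokenizer.sep_token} {passage}"
        passages.append(passage)
    return passages


print(split_uniform(["One two three", "four five"], WhitespaceTokenizer(), n=2))
```

The last passage may hold fewer than n tokens, and the title is not counted against n.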
- meerqat.data.loading.uniform_passages_of_sentences(paragraphs, model, n=100, title=None, sep_token='[SEP]')
N.B. unlike uniform_passages, which is based on a transformers PreTrainedTokenizer, here we are able to recover the unprocessed text corresponding to the tokens, so the output text is unchanged (e.g. not lower-cased); only the whitespace between sentences is lost (it is always set to ‘ ’)
- Parameters:
paragraphs (List[str]) – List of pre-processed paragraphs to split into passages
model (spacy model) –
n (int, optional) – Maximum number of tokens in each passage (excluding the title). A passage can actually hold more tokens than this if it consists of a single sentence longer than n. Defaults to 100.
title (str, optional) – Title to prepend at the beginning of each passage like “<title> [SEP] <passage>”. Defaults to None (only “<passage>”).
sep_token (str, optional) – Separates the title from the passage (no effect if title is None). Defaults to ‘[SEP]’.
- Returns:
passages
- Return type:
List[str]
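The grouping step can be sketched as a greedy packing of sentences. To stay self-contained, the sketch takes sentences that are already segmented and counts tokens by whitespace; the function described above instead segments and counts with a spaCy model. The name `group_sentences` and the `count_tokens` parameter are hypothetical.

```python
from typing import Callable, List, Optional


def group_sentences(sentences: List[str], n: int = 100, title: Optional[str] = None,
                    sep_token: str = "[SEP]",
                    count_tokens: Callable[[str], int] = lambda s: len(s.split())) -> List[str]:
    """Greedily pack consecutive sentences into passages of at most n tokens."""
    passages = []
    current: List[str] = []
    current_len = 0
    for sentence in sentences:
        length = count_tokens(sentence)
        # Close the current passage if this sentence would exceed the budget.
        if current and current_len + length > n:
            passages.append(" ".join(current))
            current, current_len = [], 0
        # A sentence longer than n still forms a passage on its own.
        current.append(sentence)
        current_len += length
    if current:
        passages.append(" ".join(current))
    if title is not None:
        passages = [f"{title} {sep_token} {passage}" for passage in passages]
    return passages


print(group_sentences(["a b c.", "d e.", "f g h i j k."], n=5))
```

Sentences are joined with a single space, which is why the original whitespace between them is lost while the text inside each sentence is preserved.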
- meerqat.data.loading.make_passages(paragraphs, method=None, preprocessing_method=None, preprocessing_kwargs={}, **kwargs)
- Parameters:
paragraphs (List[str]) – List of paragraphs to preprocess
method (str, optional) – How to split the text into passages; defaults to keeping the original paragraphs
- meerqat.data.loading.make_passage_item(item, index, passage_dict, prepend_title=False, **kwargs)