meerqat.data.loading module#

Usages#

passages#

Segments Wikipedia articles (from the kilt_wikipedia dataset) into passages (e.g. paragraphs). The current options (passed in a JSON config file) are:

  • prepend_title: whether to prepend the title at the beginning of each passage like “<title> [SEP] <passage>”

  • special_fields: removes the title, section titles (“Section::::”) and bullet-points (“BULLET::::”)

  • uniform: each passage holds n tokens, without overlap, tokenized with a transformers tokenizer

  • uniform_sents: each article is first segmented into sentences using spaCy.

    Then sentences are grouped into passages such that each passage holds a maximum of n tokens (spaCy tokens here, not transformers tokens like above)
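The grouping behind uniform_sents can be sketched without spaCy: here each sentence is simply a list of tokens (standing in for a spaCy Span), and group_sentences is an illustrative name, not the actual meerqat function.

```python
def group_sentences(sentences, n=100):
    """Group sentences into passages of at most n tokens each.

    A sentence longer than n tokens still forms its own passage,
    matching the behaviour documented for uniform_passages_of_sentences.
    """
    passages, current, length = [], [], 0
    for sent in sentences:
        # flush the current passage if this sentence would overflow it
        if current and length + len(sent) > n:
            passages.append(current)
            current, length = [], 0
        current.extend(sent)
        length += len(sent)
    if current:
        passages.append(current)
    return passages

# three sentences of 60, 50 and 40 tokens -> passages of 60 and 90 tokens
print([len(p) for p in group_sentences([["a"] * 60, ["b"] * 50, ["c"] * 40], n=100)])
# [60, 90]
```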

map#

Makes a JSON file out of a dataset column for quick (string-keyed) indexing.
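A minimal sketch of such a mapping: make_json_index is an illustrative name, and the one2many flag mirrors the --one2many option of make_mapping (the real implementation builds the mapping row by row via make_mapping).

```python
import json

def make_json_index(dataset, key, one2many=False):
    """Map each value of dataset[key] to its row index.

    String keys survive the JSON round-trip, which is what makes the
    file usable for quick string indexing. With one2many, a value maps
    to the list of all row indices where it occurs.
    """
    mapping = {}
    for i, row in enumerate(dataset):
        if one2many:
            mapping.setdefault(row[key], []).append(i)
        else:
            mapping[row[key]] = i
    return mapping

dataset = [{"wikipedia_title": "Paris"}, {"wikipedia_title": "London"}]
print(json.dumps(make_json_index(dataset, "wikipedia_title")))
# {"Paris": 0, "London": 1}
```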

sentences#

Used in the Inverse Cloze Task (ICT) to segment the text of a dataset into a list of sentences via spaCy.

For docopt#

Usage:
loading.py passages <input> <output> [<config> --disable_caching]
loading.py map <dataset> <key> <output> [--inverse --one2many --disable_caching]
loading.py sentences <dataset>

Options:
--disable_caching

Disables Dataset caching (has no effect when using save_to_disk); see datasets.set_caching_enabled()

Functions#

meerqat.data.loading.verbose_load_from_disk(dataset_path)[source]#
meerqat.data.loading.save_image(image, output_path)[source]#
meerqat.data.loading.load_image(file_name)[source]#
meerqat.data.loading.load_image_batch(file_names, pool=None)[source]#
meerqat.data.loading.load_faces(image, root_face_path, max_n_faces=None)[source]#
meerqat.data.loading.remove_articles(text)[source]#
meerqat.data.loading.white_space_fix(text)[source]#
meerqat.data.loading.remove_punc(text)[source]#
meerqat.data.loading.answer_preprocess(answer)[source]#

Adapted from the datasets SQuAD metric. Lower-cases text and removes punctuation, articles and extra whitespace.
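The SQuAD-style normalization can be sketched as follows; answer_preprocess_sketch is an illustrative name and the actual implementation (composed from the remove_articles, white_space_fix and remove_punc helpers above) may differ in detail.

```python
import re
import string

def answer_preprocess_sketch(answer):
    """Lower-case, then strip punctuation, English articles and extra
    whitespace, mirroring the SQuAD metric's answer normalization."""
    text = answer.lower()
    # remove punctuation
    text = "".join(ch for ch in text if ch not in string.punctuation)
    # remove the articles "a", "an", "the"
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    # collapse any whitespace runs
    return " ".join(text.split())

print(answer_preprocess_sketch("The Eiffel Tower!"))
# eiffel tower
```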

meerqat.data.loading.get_class_from_name(class_name)[source]#
meerqat.data.loading.get_pretrained(class_name, pretrained_model_name_or_path, **kwargs)[source]#
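get_class_from_name presumably resolves a class from its name so that get_pretrained can call its from_pretrained method; a plausible sketch using only the standard library (the exact lookup rules in meerqat may differ, e.g. it may search the transformers namespace directly):

```python
import importlib

def get_class_from_name_sketch(class_name):
    """Resolve a dotted path like 'transformers.BertModel' to the class
    object by importing the module and fetching the attribute."""
    module_name, name = class_name.rsplit(".", 1)
    return getattr(importlib.import_module(module_name), name)

cls = get_class_from_name_sketch("collections.OrderedDict")
print(cls.__name__)
# OrderedDict
```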
meerqat.data.loading.map_kilt_triviaqa()[source]#

As instructed by huggingface/datasets

meerqat.data.loading.make_mapping(value, index, mapping, inverse=False, one2many=False)[source]#
meerqat.data.loading.make_mapping_dataset(dataset_path, key, save_name, **kwargs)[source]#
meerqat.data.loading.remove_special_fields(paragraphs)[source]#

N.B. this code puts a lot of trust in the KILT pre-processing (facebookresearch/KILT) and simply removes the title (first paragraph), section titles (“Section::::”) and bullet-points (“BULLET::::”)
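A minimal sketch of that filtering (the function name is illustrative; the real function operates on KILT's paragraph structure):

```python
def remove_special_fields_sketch(paragraphs):
    """Drop the title (first paragraph) and any paragraph carrying
    KILT markup for section titles or bullet-points."""
    return [
        p for p in paragraphs[1:]
        if not p.startswith(("Section::::", "BULLET::::"))
    ]

paragraphs = [
    "Paris\n",                    # title paragraph, always dropped
    "Section::::History.\n",      # section title markup
    "Founded on the Seine.\n",    # regular text, kept
    "BULLET::::- a list item\n",  # bullet-point markup
]
print(remove_special_fields_sketch(paragraphs))
# ['Founded on the Seine.\n']
```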

meerqat.data.loading.paragraphs_preprocess(paragraphs, method=None, **kwargs)[source]#
Parameters:
  • paragraphs (List[str]) – List of paragraphs to preprocess

  • method (str, optional) – type of pre-processing, defaults to None (i.e. identity function)

  • **kwargs – Additional arguments are passed to the appropriate pre-processing function

Returns:

paragraphs

Return type:

List[str]

meerqat.data.loading.uniform_passages(paragraphs, tokenizer, n=100, title=None)[source]#
Parameters:
  • paragraphs (List[str]) – List of pre-processed paragraphs to split into passages

  • tokenizer (PreTrainedTokenizer) –

  • n (int, optional) – Number of tokens in each passage (excluding the title). Defaults to 100

  • title (str, optional) – To prepend at the beginning of each passage like “<title> [SEP] <passage>”. Defaults to None (only “<passage>”)

Returns:

passages – Each passage is pre-processed by the tokenizer (e.g. lower-cased, spaces added around punctuation marks, etc.)

Return type:

List[str]
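A sketch of the n-token chunking, with whitespace tokens standing in for a transformers PreTrainedTokenizer (the real function encodes and decodes with the tokenizer, which is why the output is pre-processed as noted above; the function name here is illustrative):

```python
def uniform_passages_sketch(paragraphs, n=100, title=None):
    """Concatenate paragraphs into one token stream and cut it into
    consecutive n-token passages without overlap, optionally prefixing
    each passage with '<title> [SEP] '."""
    tokens = " ".join(paragraphs).split()
    passages = []
    for start in range(0, len(tokens), n):
        passage = " ".join(tokens[start:start + n])
        if title is not None:
            passage = f"{title} [SEP] {passage}"
        passages.append(passage)
    return passages

print(uniform_passages_sketch(["one two three four five"], n=2, title="Doc"))
# ['Doc [SEP] one two', 'Doc [SEP] three four', 'Doc [SEP] five']
```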

meerqat.data.loading.uniform_passages_of_sentences(paragraphs, model, n=100, title=None, sep_token='[SEP]')[source]#

N.B. unlike uniform_passages, which is based on a transformers PreTrainedTokenizer, here we are able to recover the un-processed text corresponding to the tokens, so the output text is not changed (e.g. not lower-cased); only the whitespace between sentences is lost (it is always set to ' ')

Parameters:
  • paragraphs (List[str]) – List of pre-processed paragraphs to split into passages

  • model (spacy model) –

  • n (int, optional) – Maximum number of tokens in each passage (excluding the title). There can actually be more tokens than this if the passage is a single sentence (with more than n tokens). Defaults to 100

  • title (str, optional) – To prepend at the beginning of each passage like “<title> [SEP] <passage>”. Defaults to None (only “<passage>”)

  • sep_token (str, optional) – To separate title and passage (no effect if title is None). Defaults to ‘[SEP]’

Returns:

passages

Return type:

List[str]

meerqat.data.loading.make_passages(paragraphs, method=None, preprocessing_method=None, preprocessing_kwargs={}, **kwargs)[source]#
Parameters:
  • paragraphs (List[str]) – List of paragraphs to preprocess

  • method (str, optional) – How to split the text into passages; defaults to keeping the original paragraphs

meerqat.data.loading.make_passage_item(item, index, passage_dict, prepend_title=False, **kwargs)[source]#
meerqat.data.loading.make_passage_dataset(input_path, output_path, sentencizer=False, **kwargs)[source]#

Runs through the dataset and creates a new passage dataset from the paragraphs, saving the index and the reversed index in both, respectively

meerqat.data.loading.make_sentences_item(item, model)[source]#
meerqat.data.loading.make_sentences_dataset(dataset_path)[source]#
meerqat.data.loading.load_pretrained_in_kwargs(kwargs)[source]#

Recursively loads pre-trained models/tokenizer in kwargs using get_pretrained
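A plausible sketch of that recursion, assuming nested dicts carrying a pretrained_model_name_or_path key are the ones to load (an assumption about the trigger; load stands in for get_pretrained, and the function name is illustrative):

```python
def load_in_kwargs_sketch(kwargs, load):
    """Walk a possibly-nested kwargs dict and replace every dict that
    holds 'pretrained_model_name_or_path' with the result of load(**d);
    other dicts are recursed into, other values are kept as-is."""
    out = {}
    for key, value in kwargs.items():
        if isinstance(value, dict):
            if "pretrained_model_name_or_path" in value:
                out[key] = load(**value)
            else:
                out[key] = load_in_kwargs_sketch(value, load)
        else:
            out[key] = value
    return out

# a fake loader in place of get_pretrained, to show the substitution
def fake_load(**kw):
    return "loaded:" + kw["pretrained_model_name_or_path"]

print(load_in_kwargs_sketch(
    {"model": {"pretrained_model_name_or_path": "bert-base-uncased"}, "n": 3},
    fake_load,
))
# {'model': 'loaded:bert-base-uncased', 'n': 3}
```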