meerqat.data.wiki module#

Usages#

Gathers data about entities mentioned in questions via Wikidata, Wikimedia Commons SPARQL services and Wikimedia REST API.

You should run all of these in this order to get the whole cake:

data entities#

input/output: entities.json (output of kilt2vqa.py count_entities)

Queries many different attributes for all entities in the questions.

Also sets a ‘reference image’ for the entity using Wikidata properties, in the following order of preference:
  • P18 ‘image’ (it is roughly equivalent to the infobox image in Wikipedia articles)

  • P154 ‘logo image’

  • P41 ‘flag image’

  • P94 ‘coat of arms image’

  • P2425 ‘service ribbon image’
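The order of preference above can be sketched as a simple first-match lookup. This is only an illustration: the function name `pick_reference_image` and the entity layout (a dict mapping property IDs to lists of file names) are assumptions, not the module's actual data structure.

```python
# Order of preference copied from the list above.
IMAGE_PROPERTIES = ["P18", "P154", "P41", "P94", "P2425"]


def pick_reference_image(entity):
    """Return the first available image following the order of preference.

    `entity` is a hypothetical dict mapping property IDs to lists of file names.
    Returns None if the entity has none of the properties.
    """
    for prop in IMAGE_PROPERTIES:
        values = entity.get(prop)
        if values:
            return values[0]
    return None


# Hypothetical entity with both a logo (P154) and an image (P18).
entity = {"P154": ["Some logo.svg"], "P18": ["Some portrait.jpg"]}
reference = pick_reference_image(entity)  # P18 wins over P154
```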

data feminine#

input: entities.json output: feminine_labels.json

Gets feminine labels for classes and occupations of these entities.

data superclasses#

input: entities.json output: <n>_superclasses.json

Gets the superclasses of the entities’ classes, up to n levels (defaults to ‘all’, i.e. up to the root).

Depictions (optional)#

We found that heuristics/images based on depictions were not very discriminative.

commons sparql depicts#

input/output: entities.json

Finds all images in Commons that depict the entities.

commons sparql depicted#

input: entities.json output: depictions.json

Finds all entities depicted in the images gathered in the previous step.

data depicted#

input: entities.json, depictions.json output: entities.json

Gathers the same data as in wiki.py data entities <subset> for all entities depicted in any of the depictions. Then applies a heuristic to tell whether an image depicts the entity prominently or not: the depiction is prominent if the entity is the only one of its class, e.g.:

  • pic of Barack Obama and Joe Biden -> not prominent

  • pic of Barack Obama and the Eiffel Tower -> prominent

Note this heuristic is not used in commons heuristics
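A minimal sketch of the prominence heuristic described above, assuming each entity has a single class. The function name `prominent_entities` and the `instance_of` mapping are hypothetical, and readable labels are used instead of QIDs for clarity.

```python
from collections import Counter


def prominent_entities(depicted, instance_of):
    """Return the depicted entities that are the only one of their class.

    `depicted` lists the entities tagged in one image; `instance_of` maps
    each entity to its class (hypothetical layout, one class per entity).
    """
    class_counts = Counter(instance_of[entity] for entity in depicted)
    return {entity for entity in depicted if class_counts[instance_of[entity]] == 1}


instance_of = {
    "Barack Obama": "human",
    "Joe Biden": "human",
    "Eiffel Tower": "tower",
}

# Two humans in the same picture: neither is prominent.
two_humans = prominent_entities(["Barack Obama", "Joe Biden"], instance_of)
# One human and one tower: both are the only one of their class.
human_and_tower = prominent_entities(["Barack Obama", "Eiffel Tower"], instance_of)
```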

filter#

input/output: entities.json

Filters entities w.r.t. their class/nature/“instance of” and date of death; see the wiki.py docstring for option usage (TODO share concrete_entities/abstract_entities). Entities with a ‘sex or gender’ (P21) or ‘occupation’ (P106) are also kept by default.

Note that this deletes data, so back it up first if you’re unsure about the filter.

commons rest#

input/output: entities.json

Gathers images and subcategories recursively from the entity root commons-category

Unless you have a very small dataset, you should probably set --max_images=0 to query only categories and use wikidump.py to gather images from those. --max_categories defaults to 100.

commons heuristics#

input/output: entities.json

Run wikidump.py first to gather images. Computes heuristics for the image (control with <heuristic>, defaults to all):

  • categories: the entity label should be included in all of the image categories

  • description: the entity label should be included in the image description

  • title: the entity label should be included in the image title/file name

  • depictions: the image should be tagged as depicting the entity (gathered in commons sparql depicts)
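The four heuristics above could be sketched as follows. The image layout (a dict with `title`, `description`, `categories` and `depicts` keys), the function name `image_heuristics` and the case-insensitive matching are all assumptions made for illustration; the actual image_heuristic function may differ.

```python
def image_heuristics(label, qid, image):
    """Compute the four heuristics for one image (hypothetical data layout)."""
    label = label.lower()  # case-insensitive matching is an assumption
    categories = image.get("categories", [])
    return {
        # the label must appear in every category (and there must be at least one)
        "categories": bool(categories) and all(label in c.lower() for c in categories),
        "description": label in image.get("description", "").lower(),
        "title": label in image.get("title", "").lower(),
        # the image must be tagged as depicting the entity
        "depictions": qid in image.get("depicts", []),
    }


image = {
    "title": "File:Eiffel Tower at night.jpg",
    "description": "The Eiffel Tower seen from across the river",
    "categories": ["Eiffel Tower at night", "Night in Paris"],
    "depicts": ["Q243"],
}
scores = image_heuristics("Eiffel Tower", "Q243", image)
```

Here the categories heuristic fails because one of the two categories does not contain the label, while the other three heuristics pass.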

For docopt#

Usage:
wiki.py data entities <subset> [--skip=<attribute>]
wiki.py data feminine <subset>
wiki.py data depicted <subset>
wiki.py data superclasses <subset> [--n=<n>]
wiki.py commons sparql depicts <subset>
wiki.py commons sparql depicted <subset>
wiki.py commons rest <subset> [--max_images=<max_images> --max_categories=<max_categories>]
wiki.py commons heuristics <subset> [<heuristic>…]
wiki.py filter <subset> [--superclass=<level> --positive --negative --deceased=<year> <classes_to_exclude>…]

Options:

--n=<n>

Maximum level of superclasses. Defaults to all superclasses.

--max_images=<n>

Maximum number of images to query per entity/root category. Set to 0 if you only want to query categories [default: 1000].

--max_categories=<n>

Maximum number of categories to query per entity/root category [default: 100].

<heuristic>…

Heuristic to compute for the image, one of {“categories”, “description”, “depictions”, “title”}. Defaults to all valid heuristics (listed above).

--superclass=<level>

Level of superclasses in the filter, int or “all” (defaults to None i.e. filter only classes)

--positive

Keep only classes in “concrete_entities” + entities with gender (P21) or occupation (P106). Applied before negative_filter.

--negative

Keep only classes that are not in “abstract_entities”. Applied after positive_filter

--deceased=<year>

Remove humans (Q5) that are alive or deceased after <year> (might avoid trouble with GDPR)

<classes_to_exclude>…

Additional classes to exclude in the negative_filter (e.g. “Q5 Q82794”). Note that you can use this option even without --negative, i.e. specifying your own “abstract_entities”.

Functions#

meerqat.data.wiki.file_name_to_thumbnail(file_name, image_width=None)[source]#

Gets the upload.wikimedia.org URL from the image file_name, using the desired thumbnail width.

Parameters:
  • file_name (str) – file name/title (without the “File:” prefix)

  • image_width (int, optional) – desired thumbnail width in pixels for the image URL. Defaults to full-size.
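As an illustration of how such a URL can be built, the sketch below relies on the MD5-based hashed directory layout used by Wikimedia upload paths. Whether file_name_to_thumbnail works exactly this way is an assumption, `thumbnail_url` is a hypothetical name, and the sketch ignores the extra prefixes and extensions that non-raster formats (e.g. TIFF, SVG) receive.

```python
import hashlib
from urllib.parse import quote

BASE = "https://upload.wikimedia.org/wikipedia/commons"


def thumbnail_url(file_name, image_width=None):
    """Build an upload.wikimedia.org URL from a file name (without "File:")."""
    name = file_name.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    # Files are sharded by the first one and first two hex digits of the MD5 hash.
    shard = f"{digest[0]}/{digest[:2]}"
    encoded = quote(name)
    if image_width is None:
        return f"{BASE}/{shard}/{encoded}"  # full-size image
    return f"{BASE}/thumb/{shard}/{encoded}/{image_width}px-{encoded}"


full_size = thumbnail_url("Foo bar.jpg")
thumbnail = thumbnail_url("Foo bar.jpg", image_width=300)
```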

meerqat.data.wiki.thumbnail_to_file_name(url, original=True)[source]#

Handles thumbnails and special file paths

If original (default), returns the original file name, i.e. with the original extension and without any size specification introduced in file_name_to_thumbnail

e.g. “https://upload.wikimedia.org/wikipedia/commons/thumb/9/93/foo.tif/lossy-page1-469px-foo.tif.jpg” -> “foo.tif”

Otherwise, the file name is returned as processed in file_name_to_thumbnail, i.e. with size specification etc.

e.g. “https://upload.wikimedia.org/wikipedia/commons/thumb/9/93/foo.tif/lossy-page1-469px-foo.tif.jpg” -> “lossy-page1-469px-foo.tif.jpg”

This is of course irrelevant if the URL is not a thumbnail.
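The two behaviours can be sketched with plain URL parsing. This is a simplified illustration covering thumbnail and full-size upload URLs only (not special file paths), and `file_name_from_url` is a hypothetical name.

```python
from urllib.parse import unquote, urlparse


def file_name_from_url(url, original=True):
    """Extract a file name from an upload.wikimedia.org URL."""
    segments = unquote(urlparse(url).path).split("/")
    if original and "thumb" in segments:
        # In a thumbnail URL, the original file name is the second-to-last segment.
        return segments[-2]
    # Full-size URL, or the processed thumbnail name was requested.
    return segments[-1]


url = "https://upload.wikimedia.org/wikipedia/commons/thumb/9/93/foo.tif/lossy-page1-469px-foo.tif.jpg"
original_name = file_name_from_url(url)
processed_name = file_name_from_url(url, original=False)
```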

meerqat.data.wiki.bytes2dict(b)[source]#
meerqat.data.wiki.get_license(image)[source]#

Get license short-name, upper-cased. Returns empty-string (‘’) if unavailable

meerqat.data.wiki.license_score(image)[source]#

Gets the license value, normalizes it and returns its score (in LICENSES)

meerqat.data.wiki.query_sparql_entities(query, endpoint, wikidata_ids, prefix='wd:', n=100, return_format='json', description=None)[source]#

Queries query % entities in batches of n (defaults to 100), where entities is n QIDs from wikidata_ids, space-separated and prefixed by prefix (should be ‘wd:’ for Wikidata entities and ‘sdc:’ for Commons entities)

Returns query results
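The batching step can be sketched without any network access. The helper name `batch_queries` and the example SPARQL query are assumptions for illustration; only the `query % entities` formatting, the prefix and the batch size come from the description above.

```python
def batch_queries(query, wikidata_ids, prefix="wd:", n=100):
    """Yield one formatted query per batch of at most n QIDs."""
    for start in range(0, len(wikidata_ids), n):
        batch = wikidata_ids[start:start + n]
        # QIDs are prefixed and space-separated, then substituted for %s.
        entities = " ".join(prefix + qid for qid in batch)
        yield query % entities


# Hypothetical query with a single %s placeholder for the entities.
QUERY = "SELECT ?entity ?image WHERE { VALUES ?entity { %s } ?entity wdt:P18 ?image }"
queries = list(batch_queries(QUERY, ["Q76", "Q6279", "Q243"], n=2))
```

With three QIDs and n=2, this produces two queries; each would then be sent to the endpoint and the results merged.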

meerqat.data.wiki.update_from_data(entities, skip=None)[source]#

Updates entities with info queried from Wikidata

meerqat.data.wiki.set_reference_images(entities)[source]#

Set a reference image using RESERVED_IMAGES as order of preference if the entity has any available

meerqat.data.wiki.update_from_commons_sparql(entities)[source]#
meerqat.data.wiki.query_depicted_entities(depictions)[source]#
meerqat.data.wiki.depiction_instanceof_heuristic(depictions, entities)[source]#
meerqat.data.wiki.keep_prominent_depictions(entities)[source]#
meerqat.data.wiki.request(query, session, tries=0, max_tries=2)[source]#

GET query via requests, handles exceptions and returns None if something went wrong
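The `tries`/`max_tries` parameters suggest a retry loop. A minimal sketch under that assumption is given below: the retry-by-recursion logic and the broad `except Exception` are guesses rather than the module's actual behaviour, and `FlakySession`/`FakeResponse` are hypothetical stand-ins for `requests.Session` and its response so the example runs offline.

```python
def get_or_none(query, session, tries=0, max_tries=2):
    """GET `query`, retrying on failure; return None once max_tries is reached."""
    if tries >= max_tries:
        return None
    try:
        response = session.get(query)
        response.raise_for_status()
    except Exception:  # the real function may catch narrower exceptions
        return get_or_none(query, session, tries + 1, max_tries)
    return response


class FakeResponse:
    """Hypothetical response whose status check always passes."""

    def raise_for_status(self):
        pass


class FlakySession:
    """Hypothetical session that fails `failures` times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures

    def get(self, url):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("simulated network error")
        return FakeResponse()


recovered = get_or_none("https://example.org", FlakySession(failures=1))
gave_up = get_or_none("https://example.org", FlakySession(failures=5))
```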

meerqat.data.wiki.query_commons_subcategories(category, categories, images, max_images=1000, max_categories=100, n_queried_categories=0)[source]#

Query all commons subcategories (and optionally images) from a root category recursively

Parameters:
  • category (str) – Root category

  • categories (dict) – {str: bool}, True if the category has been processed

  • images (dict) – {str: dict}, Key is the file title, gathers data about the image, see query_image

  • max_images (int, optional) – Maximum number of images to query per entity/root category. Set to 0 if you only want to query categories (images dict will be left empty). Defaults to 1000.

  • max_categories (int, optional) – Maximum number of categories to query per entity/root category. Enforced via n_queried_categories if max_images > 0, and via len(categories) otherwise. Defaults to 100.

  • n_queried_categories (int, optional) – Keeps track of the number of queried categories in order to enforce max_categories. Should be equal to the number of True values in categories. Defaults to 0.

Returns:

categories, images – Same as input, hopefully enriched with new data

Return type:

dict
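The recursion and the max_categories limit can be illustrated on an in-memory category tree. This sketch only covers the categories-only case (limit enforced via len(categories)); `walk_categories`, `get_subcategories` and `TREE` are hypothetical and replace the actual API calls.

```python
def walk_categories(category, categories, get_subcategories, max_categories=100):
    """Depth-first walk of subcategories, stopping at max_categories.

    `get_subcategories` is a hypothetical callable standing in for the API query.
    """
    categories[category] = True  # mark the category as processed
    for sub in get_subcategories(category):
        if len(categories) >= max_categories:
            break
        if sub not in categories:  # also protects against cycles
            walk_categories(sub, categories, get_subcategories, max_categories)
    return categories


# Toy category graph with a cycle (D links back to A).
TREE = {"A": ["B", "C"], "B": ["D"], "C": [], "D": ["A"]}


def lookup(category):
    return TREE.get(category, [])


everything = walk_categories("A", {}, lookup)
limited = walk_categories("A", {}, lookup, max_categories=3)
```

With the limit set to 3, the walk visits A, B and D depth-first and stops before reaching C.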

meerqat.data.wiki.query_image(title, session)[source]#
meerqat.data.wiki.save_image(url, session)[source]#
meerqat.data.wiki.update_from_commons_rest(entities, max_images=1000, max_categories=100)[source]#
meerqat.data.wiki.special_path_to_file_name(special_path)[source]#

Splits the URL, adds the “File:” prefix and replaces underscores with spaces
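A minimal sketch of these three steps, assuming the file name is the last path segment of the URL. The function name `special_path_to_title` and the example URL are illustrative, and percent-decoding is deliberately left out since the description does not mention it.

```python
def special_path_to_title(special_path):
    """Turn a special file path URL into a "File:" title."""
    name = special_path.rsplit("/", 1)[-1]  # keep the last path segment
    return "File:" + name.replace("_", " ")


title = special_path_to_title(
    "http://commons.wikimedia.org/wiki/Special:FilePath/Eiffel_Tower.jpg"
)
```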

meerqat.data.wiki.image_heuristic(entities, heuristics={'categories', 'depictions', 'description', 'title'})[source]#
meerqat.data.wiki.exclude_classes(entities, classes_to_exclude, superclasses={})[source]#
meerqat.data.wiki.keep_classes(entities, classes_to_keep, superclasses={}, attributes_to_keep={'gender', 'occupation'})[source]#
meerqat.data.wiki.iso2year(iso)[source]#

Handles negative dates
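Negative (BCE) years cannot be parsed by Python's `datetime`, which is presumably why a dedicated helper exists. The sketch below shows one way to extract the year from an ISO-like timestamp with an optional sign; `iso_to_year` is a hypothetical name and the actual iso2year may be implemented differently.

```python
def iso_to_year(iso):
    """Extract the year from an ISO-like date, e.g. "-0044-03-15T00:00:00Z"."""
    sign = -1 if iso.startswith("-") else 1
    # Drop the optional sign, then keep the digits before the first hyphen.
    year_digits = iso.lstrip("+-").split("-")[0]
    return sign * int(year_digits)


bce_year = iso_to_year("-0044-03-15T00:00:00Z")
ce_year = iso_to_year("1961-08-04T00:00:00Z")
```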

meerqat.data.wiki.remove_alive_humans(entities, year_threshold=inf)[source]#
meerqat.data.wiki.query_superclasses(entities, wikidata_superclasses_query, n_levels=None)[source]#
meerqat.data.wiki.uri_to_qid(uri)[source]#
meerqat.data.wiki.uris_to_qids(uris)[source]#
meerqat.data.wiki.query_feminine_labels(entities)[source]#
meerqat.data.wiki.stats(entities)[source]#

Simply counts the number of fields for every entity

meerqat.data.wiki.print_stats(entities)[source]#