meerqat.data.wiki module#
Usages#
Gathers data about entities mentioned in questions via Wikidata, Wikimedia Commons SPARQL services and Wikimedia REST API.
You should run all of these in this order to get the whole cake:
data entities#
input/output: entities.json (output of kilt2vqa.py count_entities)
queries many different attributes for all entities in the questions
- Also sets a ‘reference image’ for the entity using Wikidata properties in the following order of preference:
P18 ‘image’ (it is roughly equivalent to the infobox image in Wikipedia articles)
P154 ‘logo image’
P41 ‘flag image’
P94 ‘coat of arms image’
P2425 ‘service ribbon image’
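The order of preference above can be sketched as follows. The entity layout used here (property IDs mapped directly to an image file name) and the name `pick_reference_image` are assumptions made for illustration, not necessarily how entities.json or the module store them.

```python
# Wikidata properties in the order of preference listed above
IMAGE_PROPERTIES = ["P18", "P154", "P41", "P94", "P2425"]


def pick_reference_image(entity):
    """Return the first available image following IMAGE_PROPERTIES, or None.

    `entity` is a hypothetical dict mapping property IDs to an image file name.
    """
    for prop in IMAGE_PROPERTIES:
        image = entity.get(prop)
        if image:
            return image
    return None


# Hypothetical entity with a flag (P41) and a logo (P154) but no P18 image
entity = {"P41": "Some flag.svg", "P154": "Some logo.svg"}
print(pick_reference_image(entity))  # prints "Some logo.svg": P154 comes before P41
```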
data feminine#
input: entities.json
output: feminine_labels.json
gets feminine labels for classes and occupations of these entities
data superclasses#
input: entities.json
output: <n>_superclasses.json
gets the superclasses of the entities’ classes, up to n levels (defaults to ‘all’, i.e. up to the root)
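As an illustration of what "up to n levels" means, here is a minimal sketch that climbs a class hierarchy level by level. The `parents` mapping and the function name are hypothetical; the module itself queries Wikidata rather than an in-memory dict.

```python
def superclasses_up_to(classes, parents, n=None):
    """Collect superclasses of `classes`, going up at most `n` levels.

    `parents` is a hypothetical dict {class: [direct superclasses]}.
    n=None stands for 'all', i.e. climb until the root is reached.
    """
    found = set()
    frontier = set(classes)
    level = 0
    while frontier and (n is None or level < n):
        next_frontier = set()
        for cls in frontier:
            for parent in parents.get(cls, []):
                if parent not in found:  # also guards against cycles
                    found.add(parent)
                    next_frontier.add(parent)
        frontier = next_frontier
        level += 1
    return found


# Toy hierarchy, not real Wikidata classes
parents = {"city": ["settlement"], "settlement": ["place"], "place": ["entity"]}
print(sorted(superclasses_up_to(["city"], parents, n=1)))  # ['settlement']
print(sorted(superclasses_up_to(["city"], parents)))       # ['entity', 'place', 'settlement']
```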
Depictions (optional)#
we found that heuristics/images based on depictions were not that discriminative
commons sparql depicts#
input/output: entities.json
Find all images in Commons that depict the entities
commons sparql depicted#
input: entities.json
output: depictions.json
Find all entities depicted in the images gathered in the previous step
data depicted#
input: entities.json, depictions.json
output: entities.json
Gathers the same data as in wiki.py data entities <subset> for all entities depicted in any of the depictions
Then apply a heuristic to tell whether an image depicts the entity prominently or not:
the depiction is prominent if the entity is the only one of its class, e.g.:
pic of Barack Obama and Joe Biden -> not prominent
pic of Barack Obama and the Eiffel Tower -> prominent
Note this heuristic is not used in commons heuristics
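The prominence heuristic can be sketched as below, under the assumption that "only one of its class" means no other depicted entity shares a class with the entity. The `classes` mapping and the function name are hypothetical.

```python
def is_prominent(entity_id, depicted_ids, classes):
    """True if no other depicted entity shares a class with `entity_id`.

    `classes` is a hypothetical dict {entity ID: set of class IDs}.
    """
    entity_classes = classes.get(entity_id, set())
    for other_id in depicted_ids:
        if other_id == entity_id:
            continue
        if entity_classes & classes.get(other_id, set()):
            return False
    return True


classes = {
    "Barack Obama": {"human"},
    "Joe Biden": {"human"},
    "Eiffel Tower": {"tower"},
}
print(is_prominent("Barack Obama", ["Barack Obama", "Joe Biden"], classes))     # False
print(is_prominent("Barack Obama", ["Barack Obama", "Eiffel Tower"], classes))  # True
```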
filter#
input/output: entities.json
Filters entities w.r.t. their class/nature/“instance of” and date of death, see the wiki.py docstring for option usage (TODO share concrete_entities/abstract_entities)
Also entities with a ‘sex or gender’ (P21) or ‘occupation’ (P106) are kept by default.
Note this deletes data so maybe save it if you’re unsure about the filter.
commons rest#
input/output: entities.json
Gathers images and subcategories recursively from the entity’s root commons-category.
Unless you have a very small dataset, you should probably set --max_images=0 to query only categories and use wikidump.py to gather images from those.
--max_categories defaults to 100.
commons heuristics#
input/output: entities.json
Run wikidump.py first to gather images.
Compute heuristics for the image (control with <heuristic>, defaults to all):
categories: the entity label should be included in all of the image’s categories
description: the entity label should be included in the image description
title: the entity label should be included in the image title/file name
depictions: the image should be tagged as depicting the entity (gathered in commons sparql depicts)
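The three text-based heuristics above can be sketched as follows. The image layout (`categories`, `description`, `title` keys), the case-insensitive matching, and the choice to fail the categories heuristic when an image has no category are all assumptions made for this example.

```python
def label_heuristics(label, image):
    """Evaluate the categories, description and title heuristics for one image.

    `image` is a hypothetical dict with "categories" (list of str),
    "description" (str) and "title" (str).
    """
    label = label.lower()
    categories = image.get("categories", [])
    return {
        # the label must appear in every category (and there must be at least one)
        "categories": bool(categories) and all(label in c.lower() for c in categories),
        "description": label in image.get("description", "").lower(),
        "title": label in image.get("title", "").lower(),
    }


image = {
    "categories": ["Eiffel Tower at night", "Views of the Eiffel Tower"],
    "description": "A tower in Paris",
    "title": "File:Eiffel Tower 2010.jpg",
}
print(label_heuristics("Eiffel Tower", image))
# {'categories': True, 'description': False, 'title': True}
```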
For docopt#
Usage:
wiki.py data entities <subset> [--skip=<attribute>]
wiki.py data feminine <subset>
wiki.py data depicted <subset>
wiki.py data superclasses <subset> [--n=<n>]
wiki.py commons sparql depicts <subset>
wiki.py commons sparql depicted <subset>
wiki.py commons rest <subset> [--max_images=<max_images> --max_categories=<max_categories>]
wiki.py commons heuristics <subset> [<heuristic>...]
wiki.py filter <subset> [--superclass=<level> --positive --negative --deceased=<year> <classes_to_exclude>...]
Options:
- --n=<n>
Maximum level of superclasses. Defaults to all superclasses
- --max_images=<n>
Maximum number of images to query per entity/root category. Set to 0 if you only want to query categories [default: 1000].
- --max_categories=<n>
Maximum number of categories to query per entity/root category [default: 100].
- <heuristic>...
Heuristic to compute for the image, one of {“categories”, “description”, “depictions”, “title”}. Defaults to all valid heuristics (listed above)
- --superclass=<level>
Level of superclasses in the filter, int or “all” (defaults to None i.e. filter only classes)
- --positive
Keep only classes in “concrete_entities” + entities with gender (P21) or occupation (P106). Applied before negative_filter.
- --negative
Keep only classes that are not in “abstract_entities”. Applied after positive_filter
- --deceased=<year>
Remove humans (Q5) that are alive or deceased after <year> (might avoid trouble with GDPR)
- <classes_to_exclude>...
Additional classes to exclude in the negative_filter (e.g. “Q5 Q82794”). Note that you can use this option even without --negative, i.e. specifying your own “abstract_entities”
Functions#
- meerqat.data.wiki.file_name_to_thumbnail(file_name, image_width=None)[source]#
get upload.wikimedia.org url from image file_name using the desired thumbnail width
- Parameters:
file_name (str) – file name/title (without the “File:” prefix)
image_width (int, optional) – desired thumbnail width in pixels for the image url. Defaults to full-size
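For orientation, here is a minimal sketch of how such a URL can be built, assuming the usual Wikimedia layout where the directory is derived from the MD5 hash of the underscore-separated file name. The name `sketch_thumbnail_url` is hypothetical, and the sketch ignores special cases (e.g. TIFF files, which get an extra prefix and extension in their thumbnail names), so it may differ from what `file_name_to_thumbnail` actually does.

```python
import hashlib
from urllib.parse import quote

BASE = "https://upload.wikimedia.org/wikipedia/commons"


def sketch_thumbnail_url(file_name, image_width=None):
    """Build an upload.wikimedia.org URL for `file_name` (without "File:" prefix)."""
    name = file_name.replace(" ", "_")
    # assumption: hashed directory = first one and first two hex digits of the MD5
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    hashed_dir = f"{digest[0]}/{digest[:2]}"
    quoted = quote(name)
    if image_width is None:
        # full-size image
        return f"{BASE}/{hashed_dir}/{quoted}"
    return f"{BASE}/thumb/{hashed_dir}/{quoted}/{image_width}px-{quoted}"


print(sketch_thumbnail_url("Some image.jpg", image_width=300))
print(sketch_thumbnail_url("Some image.jpg"))
```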
- meerqat.data.wiki.thumbnail_to_file_name(url, original=True)[source]#
Handles thumbnails and special file paths
If original (default), returns the original file name, i.e. with the original extension and without any size specification introduced in file_name_to_thumbnail, e.g. “https://upload.wikimedia.org/wikipedia/commons/thumb/9/93/foo.tif/lossy-page1-469px-foo.tif.jpg” -> “foo.tif”
Otherwise, the file name is returned as processed in file_name_to_thumbnail, i.e. with size specification etc., e.g. “https://upload.wikimedia.org/wikipedia/commons/thumb/9/93/foo.tif/lossy-page1-469px-foo.tif.jpg” -> “lossy-page1-469px-foo.tif.jpg”
The distinction is irrelevant if the url is not a thumbnail
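The two behaviours can be reproduced with a short sketch that relies only on the position of path segments in a thumbnail URL. The function name is hypothetical, and special file paths, which the real function also handles, are not covered here.

```python
from urllib.parse import urlsplit


def sketch_file_name(url, original=True):
    """Extract a file name from an upload.wikimedia.org URL."""
    segments = urlsplit(url).path.split("/")
    if original and "thumb" in segments:
        # in a thumbnail URL, the original file name is the second-to-last segment
        return segments[-2]
    # otherwise keep the last segment as is (thumbnail name or full-size name)
    return segments[-1]


url = "https://upload.wikimedia.org/wikipedia/commons/thumb/9/93/foo.tif/lossy-page1-469px-foo.tif.jpg"
print(sketch_file_name(url))                  # foo.tif
print(sketch_file_name(url, original=False))  # lossy-page1-469px-foo.tif.jpg
```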
- meerqat.data.wiki.get_license(image)[source]#
Get license short-name, upper-cased. Returns empty-string (‘’) if unavailable
- meerqat.data.wiki.license_score(image)[source]#
Gets the license value, normalizes it and returns its score (in LICENSES)
- meerqat.data.wiki.query_sparql_entities(query, endpoint, wikidata_ids, prefix='wd:', n=100, return_format='json', description=None)[source]#
Queries query % entities by batches of n (defaults to 100), where entities is a space-separated string of n QIDs taken from wikidata_ids, each prefixed by prefix (should be ‘wd:’ for Wikidata entities and ‘sdc:’ for Commons entities)
Returns query results
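The batching and string formatting described above can be sketched without any network access. The helper name `batch_queries` and the query text are hypothetical; the real function additionally sends each query to `endpoint` and collects the results.

```python
def batch_queries(query, wikidata_ids, prefix="wd:", n=100):
    """Yield one formatted query per batch of at most n IDs."""
    for start in range(0, len(wikidata_ids), n):
        batch = wikidata_ids[start:start + n]
        # space-separated, prefixed QIDs, e.g. "wd:Q1 wd:Q2"
        entities = " ".join(prefix + qid for qid in batch)
        yield query % entities


# Hypothetical query template with a single %s placeholder
query = "SELECT ?entity WHERE { VALUES ?entity { %s } }"
for formatted in batch_queries(query, ["Q1", "Q2", "Q3"], n=2):
    print(formatted)
# SELECT ?entity WHERE { VALUES ?entity { wd:Q1 wd:Q2 } }
# SELECT ?entity WHERE { VALUES ?entity { wd:Q3 } }
```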
- meerqat.data.wiki.update_from_data(entities, skip=None)[source]#
Updates entities with info queried from Wikidata
- meerqat.data.wiki.set_reference_images(entities)[source]#
Set a reference image using RESERVED_IMAGES as order of preference if the entity has any available
- meerqat.data.wiki.request(query, session, tries=0, max_tries=2)[source]#
GET query via requests, handles exceptions and returns None if something went wrong
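A possible shape for this behaviour is sketched below. The retry logic is inferred from the `tries` and `max_tries` parameters and is an assumption; the sketch also catches a broad `Exception` and uses a stand-in session class so that it runs without the requests library.

```python
class FlakySession:
    """Stand-in for a requests.Session: fails a given number of times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures

    def get(self, url):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("simulated network failure")
        return f"response for {url}"


def sketch_request(query, session, tries=0, max_tries=2):
    """GET `query`, retrying on failure; returns None once max_tries is reached."""
    if tries >= max_tries:
        return None
    try:
        return session.get(query)
    except Exception:  # broad catch, for the sake of a dependency-free sketch
        return sketch_request(query, session, tries + 1, max_tries)


print(sketch_request("https://example.org", FlakySession(failures=1)))  # response for https://example.org
print(sketch_request("https://example.org", FlakySession(failures=5)))  # None
```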
- meerqat.data.wiki.query_commons_subcategories(category, categories, images, max_images=1000, max_categories=100, n_queried_categories=0)[source]#
Query all commons subcategories (and optionally images) from a root category recursively
- Parameters:
category (str) – Root category
categories (dict) – {str: bool}, True if the category has been processed
images (dict) – {str: dict}, Key is the file title, gathers data about the image, see query_image
max_images (int, optional) – Maximum number of images to query per entity/root category. Set to 0 if you only want to query categories (images dict will be left empty). Defaults to 1000
max_categories (int, optional) – Maximum number of categories to query per entity/root category. Enforced via n_queried_categories if max_images > 0, and via len(categories) otherwise. Defaults to 100
n_queried_categories (int, optional) – Keeps track of the number of queried categories in order to enforce max_categories. Should be equal to the number of True in categories. Defaults to 0
- Returns:
categories, images – Same as input, hopefully enriched with new data
- Return type:
dict
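The recursion can be illustrated on a toy category tree. `TOY_COMMONS` is a hypothetical in-memory stand-in for the Commons API, and the limit is enforced here simply by counting processed categories, which simplifies the two enforcement modes described for max_categories.

```python
# Hypothetical stand-in for the Commons API: category -> (subcategories, files)
TOY_COMMONS = {
    "Category:Root": (["Category:A", "Category:B"], ["File:root.jpg"]),
    "Category:A": (["Category:C"], ["File:a1.jpg", "File:a2.jpg"]),
    "Category:B": ([], ["File:b.jpg"]),
    "Category:C": ([], ["File:c.jpg"]),
}


def crawl(category, categories, images, max_images=1000, max_categories=100):
    """Depth-first crawl from `category`, filling `categories` and `images` in place."""
    if categories.get(category):
        return categories, images  # already processed
    if sum(categories.values()) >= max_categories:
        categories.setdefault(category, False)  # known but left unprocessed
        return categories, images
    subcategories, files = TOY_COMMONS.get(category, ([], []))
    for title in files:
        if len(images) >= max_images:
            break
        images.setdefault(title, {})  # the real code stores image data here
    categories[category] = True
    for subcategory in subcategories:
        categories.setdefault(subcategory, False)
    for subcategory in subcategories:
        crawl(subcategory, categories, images, max_images, max_categories)
    return categories, images


categories, images = crawl("Category:Root", {}, {}, max_images=0)
print(len(categories), len(images))  # 4 0
```

With `max_images=0` the images dict stays empty while all categories are still visited, which mirrors the recommended usage for larger datasets.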
- meerqat.data.wiki.special_path_to_file_name(special_path)[source]#
split url, add “File:” prefix and replace underscores with spaces
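The three steps named above amount to the following sketch. The function name and the example URL are hypothetical, and any extra handling the real function may do (such as percent-decoding) is left out.

```python
def sketch_special_path_to_file_name(special_path):
    """Keep the last URL segment, add the "File:" prefix, turn underscores into spaces."""
    last_segment = special_path.split("/")[-1]
    return "File:" + last_segment.replace("_", " ")


url = "http://commons.wikimedia.org/wiki/Special:FilePath/Some_image.jpg"
print(sketch_special_path_to_file_name(url))  # File:Some image.jpg
```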
- meerqat.data.wiki.image_heuristic(entities, heuristics={'categories', 'depictions', 'description', 'title'})[source]#