meerqat.data.wikidump module#

Input/output: entities.json

Parses the dump (which should be downloaded first; TODO: add instructions), gathers images, and assigns them to the relevant entity given its Commons categories (retrieved via wiki.py commons rest). Note that the wikicode is parsed very lazily and might need a second run depending on your application, e.g. templates are not expanded…

Usage: wikidump.py <subset>
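End to end, the script reads entities.json, scans the dump, and writes the enriched entities back to the same file. A minimal sketch of that flow; the directory layout is hypothetical, and whether process_articles mutates entities in place or returns an updated mapping is an assumption (treated as in-place here):

    import json
    from pathlib import Path

    from meerqat.data.wikidump import process_articles

    # Hypothetical locations; the actual script derives paths from <subset>.
    subset_dir = Path("data/meerqat_wikipedia")
    dump_path = subset_dir / "dump.xml"

    with open(subset_dir / "entities.json") as f:
        entities = json.load(f)

    # Scan the dump and attach candidate images to each entity,
    # matched via its Commons categories.
    process_articles(dump_path, entities)

    # entities.json is both input and output, per the module docstring.
    with open(subset_dir / "entities.json", "w") as f:
        json.dump(entities, f)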

meerqat.data.wikidump.parse_file(path)[source]#
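No docstring is given; presumably this wraps xml.etree.ElementTree parsing of a single dump file. A minimal equivalent, offered as an assumption rather than the actual implementation:

    import xml.etree.ElementTree as ET

    def parse_file(path):
        # Parse one XML dump file and return its root element,
        # from which the <page> elements can then be searched.
        return ET.parse(path).getroot()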
meerqat.data.wikidump.find(element, tag, namespace={'mw': 'http://www.mediawiki.org/xml/export-0.10/'})[source]#

Tests whether element is None before calling ET.Element.find on it, so that chained lookups on missing elements do not raise.
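A minimal sketch of such a None-safe wrapper; the default namespace mapping is taken from the signature above, and the assumption is that callers pass the prefixed tag (e.g. 'mw:revision') so the mapping can resolve it:

    import xml.etree.ElementTree as ET

    MW_NAMESPACE = {'mw': 'http://www.mediawiki.org/xml/export-0.10/'}

    def find(element, tag, namespace=MW_NAMESPACE):
        # Guard against a missing parent: plain ET.Element.find would
        # raise AttributeError if element were None.
        if element is None:
            return None
        return element.find(tag, namespace)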

meerqat.data.wikidump.find_text(element, tag, namespace={'mw': 'http://www.mediawiki.org/xml/export-0.10/'})[source]#

Returns result.text if the found element (result) is not None.
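This sits straightforwardly on top of find above; a sketch matching the docstring:

    def find_text(element, tag, namespace=MW_NAMESPACE):
        # Chain on the None-safe find() and unwrap the text content, if any.
        result = find(element, tag, namespace)
        return result.text if result is not None else None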

meerqat.data.wikidump.get_field(wikitext, image, field)[source]#
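The signature suggests extracting a named parameter (e.g. a caption or alt text) for a given image out of parsed wikicode. A rough sketch assuming mwparserfromhell, whose Wikicode objects the module docstring's "wikicode is parsed very lazily" note hints at; the matching logic here is illustrative only, not the actual implementation:

    def get_field(wikitext, image, field):
        # wikitext: a mwparserfromhell Wikicode object (assumed)
        # image: the image's file name; field: parameter name, e.g. "alt"
        for template in wikitext.ifilter_templates():
            if image in str(template) and template.has(field):
                # .get returns a Parameter; its .value is wikicode
                return str(template.get(field).value).strip()
        return None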
meerqat.data.wikidump.process_article(article, entities, entity_categories)[source]#
meerqat.data.wikidump.process_articles(dump_path, entities)[source]#
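From the signatures, process_articles presumably builds the category-to-entity index consumed by process_article and then walks the dump. A sketch under an assumed entities.json structure (a mapping from QID to an entity dict holding a list of Commons categories; the key names are assumptions):

    def process_articles(dump_path, entities):
        # Invert the entity -> categories mapping so an article's
        # categories can be looked up directly.
        entity_categories = {}
        for qid, entity in entities.items():
            for category in entity.get("categories", []):
                entity_categories[category] = qid

        root = parse_file(dump_path)
        # MediaWiki dumps wrap each article in a <page> element.
        ns = {'mw': 'http://www.mediawiki.org/xml/export-0.10/'}
        for article in root.iterfind('mw:page', ns):
            process_article(article, entities, entity_categories)
        return entities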