meerqat.data.wit module#
WIT for MICT#
- Generates the WIT subset for the Multimodal Inverse Cloze Task as described in the ECIR-2023 paper:
  - english-only subset
  - images paired with the sections
  - filtering out images with irrelevant formats (e.g. .svg) or not downloaded (e.g. you got a 404)
  - splitting in train/validation/test without overlap between the articles
  - splitting sections in sentences (meerqat.data.loading sentences)
  - removing sections with a single sentence (DIY after; see the sketch below)
  - resizing images to a maximum height or width of 512 pixels using meerqat.image.resize (DIY after)
- You should end up with:
  - 877,635 pairs in train
  - 48,271 pairs in validation
  - 48,815 pairs in test
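The two “DIY after” steps are not performed by the script itself. Below is a minimal sketch of the single-sentence filter, assuming the ICT output was saved to disk with Hugging Face datasets and that the split sentences are stored in a “sentences” column (the paths and the column name are assumptions, adjust them to your setup):

from datasets import load_from_disk

# Hypothetical paths and column name; adapt to your own ICT output.
dataset = load_from_disk("data/wit_ict")

# Drop sections that were split into a single sentence only.
dataset = dataset.filter(lambda item: len(item["sentences"]) > 1)

dataset.save_to_disk("data/wit_ict_multi_sentence")

The images can then be resized (maximum height or width of 512 pixels) with meerqat.image.resize; see that module’s documentation for its exact interface.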
What you should have first#
Downloaded from google-research-datasets/wit (if you happen to have access to Jean Zay, it is already available at $DSDIR/WIT in the right format):
$ tree WIT
WIT/
├── train
│ ├── 00
│ │ ├── 000004379cfea6d71f7c47180c2163ee40887b7b23798535435d9b2c0065cea5.png
│ │ ├── 000004528fa952ab9e2212ff7c749dfb1f28eb0fae2f45bec768e3ba72265420.jpg
│ │ ├── ...
│ │ └── 00ffff77789c938b5c2ce004d09246d1d54ef5d325d831adf3611413794d757f.jpg
│ ├── 01
│ ├── ...
│ └── ff
├── train_images.tsv
├── wit_v1.train.all-00000-of-00010.tsv
├── wit_v1.train.all-00001-of-00010.tsv
├── wit_v1.train.all-00002-of-00010.tsv
├── wit_v1.train.all-00003-of-00010.tsv
├── wit_v1.train.all-00004-of-00010.tsv
├── wit_v1.train.all-00005-of-00010.tsv
├── wit_v1.train.all-00006-of-00010.tsv
├── wit_v1.train.all-00007-of-00010.tsv
├── wit_v1.train.all-00008-of-00010.tsv
└── wit_v1.train.all-00009-of-00010.tsv
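To sanity-check the download, you can load the ten metadata shards into a single table, for instance with pandas. This is only a sketch, assuming WIT/ sits in the current directory:

from pathlib import Path

import pandas as pd

root = Path("WIT")  # adjust to where you stored WIT

# The training metadata is sharded into ten tab-separated files.
shards = sorted(root.glob("wit_v1.train.all-*-of-00010.tsv"))
metadata = pd.concat(
    (pd.read_csv(shard, sep="\t") for shard in shards),
    ignore_index=True,
)
print(f"{len(shards)} shards, {len(metadata):,} rows")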
Instructions for train_images.tsv#
The images from WIT are stored in the “train” directory with the following naming convention: “train/<xy>/<hash>.<ext>”, where:
- <hash> is the SHA256 hash of the image’s URL
- <xy> are the first two characters of the hash (which means there are 256 subfolders named “00” to “ff”)
- <ext> is the extension of the image.
The file “train_images.tsv” contains all the URLs of the images along with their download status (“True” if the image could be downloaded, “False” otherwise) and the corresponding path.
Once you’ve done this mapping, you should add it yourself to the dataset.
Sample from “train_images.tsv”:
url downloaded path
http://upload.wikimedia.org/wikipedia/ca/d/d4/Trobadores.jpeg True train/95/953feec3651efda25c166841ec8c0cd8d2064bf59f668c8dcb62dc823963a385.jpg
http://upload.wikimedia.org/wikipedia/commons/0/00/%2703-%2705_Pontiac_Montana_Taxi.jpg True train/35/35bcbf0f09424126932707a702b152fac7ebd9c932a877a3f2515d9fe67bb44d.jpg
http://upload.wikimedia.org/wikipedia/commons/0/00/%2755_Singer_4ADT_Roadster_%28Hudson%29.JPG True train/dd/dd10ea054385d8fac82a7bca15202434b7ce0facb01519021980ba07c5e6f626.jpg
http://upload.wikimedia.org/wikipedia/commons/0/00/%2768_Chevrolet_Biscayne_Coupe_%28Centropolis_Laval_%2710%29.jpg True train/44/44a11a487b09c8118e1066491880ad7045513379b5c16cdc9460321db113ad2d.jpg
http://upload.wikimedia.org/wikipedia/commons/0/00/%2783_Buick_Century_Sedan.JPG False HTTP Error 404: Not Found
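For reference, here is a hedged sketch of how the expected path can be reconstructed from a URL and how to keep only the rows that were actually downloaded. The column names (“url”, “downloaded”, “path”) are taken from the sample header above, and the hash is assumed to be the SHA256 hex digest of the raw URL string; note that the extension on disk does not always match the one in the URL (e.g. “.jpeg” saved as “.jpg” in the first sample row), so “train_images.tsv” remains the ground truth for the mapping:

import hashlib

import pandas as pd

def url_to_subfolder_and_hash(url: str) -> tuple[str, str]:
    """Return (<xy>, <hash>) for an image URL, assuming the hash is the
    SHA256 hex digest of the raw (UTF-8 encoded) URL string."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return digest[:2], digest

# Build a url -> path mapping from the rows that were actually downloaded.
images = pd.read_csv("WIT/train_images.tsv", sep="\t")
downloaded = images[images["downloaded"].astype(str) == "True"]
url_to_path = dict(zip(downloaded["url"], downloaded["path"]))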
Docopt#
Usage:
    wit.py ict <root_path> <output_path> [--split]
    wit.py caption <root_path> <output_path> [--split --dedup]
- Options:
  - --split: Whether to split in train/dev/test sets
  - --dedup: Whether to de-duplicate identical caption-image pairs
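For example, generating the ICT pairs with the train/validation/test split could look like this (assuming the module can be run with python -m; otherwise invoke wit.py directly, and adapt the paths to your setup):

$ python -m meerqat.data.wit ict WIT data/wit_ict --split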