In the LINDA project (Using Language to Interpret Nonstructured Data), we conducted a metadata analysis of training datasets used in automatic captioning. We provide the metadata of 66 image, video, and audio captioning training datasets, which can be used to find suitable datasets and for other research purposes.

This file contains a list of captioning training datasets alongside information about their compilation principles (e.g., source data, number of captions), as well as a codebook describing the variables. For details regarding the selection criteria for the initial list, please see Hekanaho, Hirvonen & Virtanen (forthcoming). The metadata file is available for research purposes and may be amended (updated versions to follow). Please cite the following article if the metadata is used for research.
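As an illustration of how the metadata might be used to find suitable datasets, the sketch below filters the overview table by modality and language. The file name (linda_captioning_dataset_metadata.csv), the file format, and the exact column names are assumptions made for this example; please consult the codebook for the actual variable names.

    import pandas as pd

    # Hypothetical file name and format; the metadata may be distributed
    # as CSV or a spreadsheet -- see the codebook for the actual variables.
    metadata = pd.read_csv("linda_captioning_dataset_metadata.csv")

    # Keep only audio captioning datasets with English captions.
    # Column names here follow the overview table (Dataset name, Modality, Language)
    # and are assumptions for illustration.
    audio_english = metadata[
        (metadata["Modality"] == "audio") & (metadata["Language"] == "English")
    ]

    print(audio_english[["Dataset name", "Modality", "Language"]])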

Citation: Hekanaho, L., Hirvonen, M. & Virtanen, T. (forthcoming). Language-based machine perception: Linguistic perspectives on the compilation of captioning datasets. Digital Scholarship in the Humanities.

Contact:
Laura Hekanaho laura.hekanaho@helsinki.fi
Maija Hirvonen maija.hirvonen@tuni.fi
Tuomas Virtanen tuomas.virtanen@tuni.fi

Overview of datasets

# Dataset name Modality Language