
WIT : Wikipedia-based Image Text Dataset

Wikipedia-based Image Text (WIT) Dataset is a large multimodal multilingual dataset. WIT is composed of a curated set of 37.6 million entity rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal machine learning models.

Key Advantages

A few unique advantages of WIT:

  • The largest publicly available multimodal dataset (at the time of writing) by number of image-text examples.
  • A massively multilingual dataset, the first of its kind, covering 108 languages.
  • The first image-text dataset with page-level metadata and contextual information.
  • A diverse collection of concepts and real-world entities.
  • Challenging real-world test sets.

You can learn more about the WIT dataset from our arXiv paper.

Latest Updates

2021 April: We are happy to share that our paper was accepted at the SIGIR Conference. From the ACM site, you can find our paper, slides, and presentation.

2021 September: The WIT Image-Text Competition is live on Kaggle. Our collaborators from Wikimedia Research blogged about this and have made available the raw pixels and ResNet-50 embeddings for the images in this set. Here is our Google AI blog post.

2022 April: We are happy to share that the WIT paper and dataset were awarded the Wikimedia Foundation's Research Award of the Year (tweet 1, tweet 2). We are deeply honored and thank you for the recognition.

2022 May: We have released the WIT validation set and test set. Please see the data page for download links.

2022 Oct: The Authoring Tools for Multimedia Content proposal was accepted at TREC 2023.

2023 Apr: AToMiC accepted at SIGIR 2023.

2023 Apr: WikiWeb2M Dataset released.

2023 May: Accepted submissions at WikiWorkshop 2023.

  • WikiWeb2M: A Page-Level Multimodal Wikipedia Dataset (pdf, arXiv)
  • Building Authoring Tools for Multimedia Content with Human-in-the-loop Relevance Annotations (pdf)
  • Characterizing Image Accessibility on Wikipedia across Languages (pdf)

WIT Example

Wikipedia Page

For example, let's take the Wikipedia page for Half Dome, Yosemite in CA.

WIT Wikipedia Half Dome Image

From the Wikipedia page for Half Dome : Photo by DAVID ILIFF. License: CC BY-SA 3.0

Wikipedia Page with Annotations of what we can extract

From this page, we highlight the various key pieces of data that we can extract: images, their respective text snippets, and some contextual metadata.

WIT Half Dome Page with Annotations

By extracting and filtering these carefully, we get a clean, high quality image-text example that can be used in multimodal modeling.
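As a toy illustration of this kind of extraction, the sketch below pulls (image, alt-text) pairs out of a page's HTML using Python's standard-library parser. This is only a minimal example of the idea; the real WIT pipeline works from Wikipedia dumps and applies far richer extraction (reference and attribution texts, section context) and quality filtering.

```python
from html.parser import HTMLParser

class ImageAltExtractor(HTMLParser):
    """Collect (src, alt) pairs for every <img> tag that has both attributes."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            a = dict(attrs)
            if a.get("src") and a.get("alt"):
                self.pairs.append((a["src"], a["alt"]))

# A tiny stand-in for a Wikipedia page; real pages carry much more context.
page = '<p>Half Dome</p><img src="/half_dome.jpg" alt="Half Dome at sunset">'
parser = ImageAltExtractor()
parser.feed(page)
print(parser.pairs)  # → [('/half_dome.jpg', 'Half Dome at sunset')]
```

Alt text is only one of several text sources WIT associates with each image; the same pattern extends to captions and surrounding section text.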

Motivation

Multimodal visio-linguistic models rely on rich datasets to learn to model the relationship between images and texts. Large image-text datasets can significantly improve performance, as shown by recent works. Furthermore, the lack of language coverage in existing datasets (which are mostly English-only) also impedes research in the multilingual multimodal space – we consider this a lost opportunity given the potential shown in leveraging images (as a language-agnostic medium) to help improve our multilingual textual understanding.

To address these challenges and advance research on multilingual, multimodal learning we created the Wikipedia-based Image Text (WIT) Dataset. WIT is created by extracting multiple different texts associated with an image (e.g., as shown in the above image) from Wikipedia articles and Wikimedia image links. This was accompanied by rigorous filtering to only retain high quality image-text sets.

The resulting dataset contains over 37.6 million image-text sets – making WIT the largest multimodal dataset (publicly available at the time of this writing) with unparalleled multilingual coverage – with 12K+ examples in each of 108 languages (53 languages have 100K+ image-text pairs).

WIT: Dataset Numbers

| Type          | Train  | Val    | Test   | Total / Unique |
| ------------- | ------ | ------ | ------ | -------------- |
| Rows / Tuples | 37.13M | 261.8K | 210.7K | 37.6M          |
| Unique Images | 11.4M  | 58K    | 57K    | 11.5M          |
| Ref. Text     | 16.9M  | 150K   | 104K   | 17.2M / 16.7M  |
| Attr. Text    | 34.8M  | 193K   | 200K   | 35.2M / 10.9M  |
| Alt Text      | 5.3M   | 29K    | 29K    | 5.4M / 5.3M    |
| Context Texts | -      | -      | -      | 119.8M         |
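As a quick sanity check, the split sizes in the table add up to the stated totals (figures taken from the table; unique-image counts can overlap across splits, so their sum is an upper bound):

```python
# Split figures from the dataset-numbers table.
rows = 37.13e6 + 261.8e3 + 210.7e3    # train + val + test rows, ≈ 37.6M
images = 11.4e6 + 58e3 + 57e3         # per-split unique images, ≈ 11.5M

print(rows, images)
```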

WIT: Image-Text Stats by Language

| Image-Text   | # Langs | Uniq. Images  | # Langs |
| ------------ | ------- | ------------- | ------- |
| total > 1M   | 9       | images > 1M   | 6       |
| total > 500K | 10      | images > 500K | 12      |
| total > 100K | 36      | images > 100K | 35      |
| total > 50K  | 15      | images > 50K  | 17      |
| total > 14K  | 38      | images > 13K  | 38      |

Get WIT

We believe that such a powerful, diverse dataset will aid researchers in building better multimodal, multilingual models and in identifying better learning and representation techniques, ultimately improving machine learning models on real-world visio-linguistic tasks.

WIT Dataset is now available for download. Please check the data page.
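The download comes as tab-separated files. Below is a minimal sketch of loading and filtering such a file by language; the column names here (`language`, `image_url`, `caption_reference_description`) follow the dataset's field descriptions but should be checked against the data page, and an inline string stands in for a real shard (the actual files are gzipped TSVs you would open with `gzip.open`).

```python
import csv
import io

# Inline stand-in for a WIT TSV shard; real shards are .tsv.gz files listed
# on the data page, with many more columns than shown here.
sample_tsv = (
    "language\timage_url\tcaption_reference_description\n"
    "en\thttp://example.com/half_dome.jpg\tHalf Dome from the valley\n"
    "de\thttp://example.com/half_dome.jpg\tHalf Dome vom Tal aus\n"
)

def load_wit_rows(text, language=None):
    """Parse a WIT TSV shard and optionally keep only one language code."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = list(reader)
    if language:
        rows = [r for r in rows if r["language"] == language]
    return rows

rows = load_wit_rows(sample_tsv, language="en")
print(len(rows), rows[0]["caption_reference_description"])
# → 1 Half Dome from the valley
```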

Citing WIT

If you use the WIT dataset, you can cite our work as follows.

@inproceedings{10.1145/3404835.3463257,
author = {Srinivasan, Krishna and Raman, Karthik and Chen, Jiecao and Bendersky, Michael and Najork, Marc},
title = {WIT: Wikipedia-Based Image Text Dataset for Multimodal Multilingual Machine Learning},
year = {2021},
isbn = {9781450380379},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3404835.3463257},
doi = {10.1145/3404835.3463257},
booktitle = {Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},
pages = {2443–2449},
numpages = {7},
keywords = {dataset, multimodal, machine learning, wikipedia, multilingual, image-text retrieval, neural networks},
location = {Virtual Event, Canada},
series = {SIGIR '21}
}

License

This data is available under the Creative Commons Attribution-ShareAlike 3.0 Unported license.

Projects using WIT

MURAL (Multimodal, Multitask Retrieval Across Languages): paper accepted at EMNLP 2021.

Contact

For any questions, please contact wit-dataset@google.com. For questions to the first author, Krishna, please see their personal page, krishna2.com, for contact information.

If the WIT dataset is useful to you, please do write to us about it. Be it a blog post, a research project, or a paper, we are delighted to learn about it.
