Mini-Gemini:

Mining the Potential of Multi-modality Vision Language Models

The Chinese University of Hong Kong

Updates: Mini-Gemini is coming! We release the paper, code, data, models, and demo for Mini-Gemini.

Abstract

In this work, we introduce Mini-Gemini, a simple and effective framework enhancing multi-modality Vision Language Models (VLMs). Despite the advancements in VLMs facilitating basic visual dialog and reasoning, a performance gap persists compared to advanced models like GPT-4 and Gemini. We try to narrow the gap by mining the potential of VLMs for better performance and an any-to-any workflow from three aspects, i.e., high-resolution visual tokens, high-quality data, and VLM-guided generation. To enhance visual tokens, we propose to utilize an additional visual encoder for high-resolution refinement without increasing the visual token count. We further construct a high-quality dataset that promotes precise image comprehension and reasoning-based generation, expanding the operational scope of current VLMs. In general, Mini-Gemini further mines the potential of VLMs and empowers the current framework with image understanding, reasoning, and generation simultaneously. Mini-Gemini supports a series of dense and MoE Large Language Models (LLMs) from 2B to 34B. It is demonstrated to achieve leading performance in several zero-shot benchmarks and even surpasses developed private models.



Model

The framework of Mini-Gemini is conceptually simple: dual vision encoders provide low-resolution visual embeddings and high-resolution candidates; patch info mining conducts patch-level mining between high-resolution regions and low-resolution visual queries; and an LLM marries text with images for comprehension and generation at the same time.
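A minimal sketch of the patch info mining step described above, in plain numpy: each low-resolution visual token acts as a query that attends over the high-resolution sub-patches of its own region, so the refined output keeps the original token count. This is an illustrative assumption of the mechanism (the function name, shapes, and the residual connection are ours), not the repository's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def patch_info_mining(low_res, high_res):
    """Hypothetical sketch of patch-level mining.

    low_res:  (N, C)    one low-resolution visual token per region (queries)
    high_res: (N, M, C) M high-resolution sub-patches per region (keys/values)
    Returns refined tokens of shape (N, C); the visual token count is unchanged.
    """
    C = low_res.shape[-1]
    # each region's token attends only to its own high-resolution sub-patches
    scores = np.einsum('nc,nmc->nm', low_res, high_res) / np.sqrt(C)
    attn = softmax(scores, axis=-1)                  # (N, M)
    mined = np.einsum('nm,nmc->nc', attn, high_res)  # (N, C)
    # residual refinement: enrich the low-res token with mined detail
    return low_res + mined
```

Because the attention is restricted to each region's own sub-patches, the cost grows with the number of sub-patches per region rather than with a global high-resolution token sequence, which is how the framework avoids increasing the visual token count fed to the LLM.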

BibTeX


@article{li2024minigemini,
  title={Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models},
  author={Li, Yanwei and Zhang, Yuechen and Wang, Chengyao and Zhong, Zhisheng and Chen, Yixin and Chu, Ruihang and Liu, Shaoteng and Jia, Jiaya},
  journal={arXiv preprint arXiv:2403.18814},
  year={2024}
}
  

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Examples








