BuboGPT:

Enabling Visual Grounding in Multi-Modal LLMs


Bytedance Inc.   *Equal Contribution   +Project Lead

BuboGPT is an advanced Large Language Model (LLM) that incorporates multi-modal inputs including text, image and audio, with a unique ability to ground its responses in visual objects. It demonstrates remarkable chat abilities in understanding arbitrary image-audio data, whether aligned or unaligned.

Bubo owls are well known for having strong vision and hearing abilities that help them thrive.

Abstract

LLMs have demonstrated remarkable abilities in interacting with humans through language, especially with the use of instruction-following data. Recent advances in LLMs, such as MiniGPT-4, LLaVA, and X-LLM, further extend their capabilities by incorporating multi-modal inputs, including image, video, and speech. Despite being effective at generating precise and detailed language descriptions of a given modality signal, these LLMs give up the ability to ground specific parts of the input, constructing only a coarse-grained mapping. However, an explicit and informative correspondence between text and the other modalities would not only improve the user experience but also help expand the application scenarios of multi-modal LLMs.

  1. BuboGPT Architecture. We build a multi-modal LLM, BuboGPT, for multi-modal understanding of image, audio and text by learning a common semantic space, and further explore the fine-grained relations between different visual objects and different modalities.
  2. Multi-Modal Instruction Data. We construct a high-quality multi-modal instruction-tuning dataset including fine-grained audio descriptions and cross-modal sound localization, and introduce both positive and negative image-audio pairs for semantic matching to facilitate cross-modal understanding.

BuboGPT Architecture

As shown in the figure, we perform joint multi-modal understanding and chatting for text, vision and audio, achieved by learning a shared representation space that aligns well with the pre-trained Vicuna. We also build an off-the-shelf visual grounding pipeline to explore the fine-grained relations between visual objects and the other modalities.

The framework of BuboGPT.
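The shared-space alignment above can be sketched as follows. This is a toy illustration only: the dimensions, the `qformer_stub` cross-attention, and the random weights are assumptions for exposition, not the released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not the released model's sizes).
D_ENC = 768    # modality encoder feature dim (e.g. image or audio encoder)
N_QUERY = 32   # number of learnable Q-Former-style query tokens
D_LLM = 4096   # Vicuna hidden size

def qformer_stub(features: np.ndarray, queries: np.ndarray) -> np.ndarray:
    """Toy cross-attention: queries attend over encoder features and
    return one summary vector per query (a stand-in for the Q-Former)."""
    scores = queries @ features.T                  # (N_QUERY, T)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over time steps
    return weights @ features                      # (N_QUERY, D_ENC)

# A linear projection maps Q-Former outputs into the LLM embedding space.
W_proj = rng.normal(scale=0.02, size=(D_ENC, D_LLM))

features = rng.normal(size=(50, D_ENC))   # e.g. 50 patch/frame features
queries = rng.normal(size=(N_QUERY, D_ENC))

# The projected queries act as soft prompts prepended to the text tokens.
llm_tokens = qformer_stub(features, queries) @ W_proj
print(llm_tokens.shape)   # (32, 4096)
```

Each modality thus contributes a fixed-length sequence of "soft tokens" living in the same space as Vicuna's word embeddings, which is what lets a frozen LLM consume image and audio inputs.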

BuboGPT: Training Procedure

BuboGPT connects a separate Q-Former for each modality to the pre-trained large language model Vicuna through a simple projection matrix. We consider a two-stage instruction-tuning procedure:

  • Stage 1: Single-Modal Pre-training. We train the corresponding modality Q-Former and linear projection layer on a large amount of modality-text paired data.
  • Stage 2: Multi-Modal Instruction Tuning. We curate a high-quality multi-modal instruction-following dataset to fine-tune only the linear projection layer:
    • Image-Text: We employ two previously published datasets from MiniGPT-4 and LLaVA for visual instruction tuning.
    • Audio-Text: We build a series of expressive and descriptive data based on the Clotho dataset to facilitate this process.
    • Audio-Image-Text: We build <audio, image, text> pairs as a triple-modality instruction-tuning dataset based on the VGGSS dataset, and further introduce a negative set to enhance our model.
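The two-stage schedule above can be summarized as a freeze plan: the LLM stays frozen throughout, the per-modality Q-Formers are trained in stage 1, and only the projection is updated in stage 2. The group names below are illustrative labels, not the repository's actual configuration flags.

```python
# Which parameter groups are trainable in each stage (a simplified sketch;
# the group names are hypothetical, not the repository's actual flags).
STAGES = {
    "stage1_single_modal_pretrain": {
        "modality_qformer": True,    # trained per modality on paired data
        "linear_projection": True,
        "vicuna_llm": False,         # the LLM stays frozen throughout
    },
    "stage2_multimodal_instruct": {
        "modality_qformer": False,
        "linear_projection": True,   # only the projection is fine-tuned
        "vicuna_llm": False,
    },
}

def trainable_params(stage: str) -> list[str]:
    """Return the names of parameter groups updated in the given stage."""
    return [name for name, on in STAGES[stage].items() if on]

print(trainable_params("stage2_multimodal_instruct"))  # ['linear_projection']
```

Keeping stage 2 down to a single projection layer keeps instruction tuning cheap while the heavy lifting of modality alignment is done once in stage 1.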


Examples on Fine-grained Visual Understanding

We first consider using a single image as input for fine-grained visual understanding with grounding. As the examples show, the model can accurately associate textual words or phrases with image regions across scenarios of varying complexity.
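At a high level, grounding links entity phrases in the model's answer to detected regions in the image. The sketch below is a deliberately simplified, hypothetical matcher using string similarity; the detector output and the matching rule are stand-ins for the off-the-shelf pipeline, not its actual components.

```python
from difflib import SequenceMatcher

# Hypothetical detector output: (label, bounding box as x1, y1, x2, y2).
detections = [
    ("black dog", (12, 40, 210, 300)),
    ("red frisbee", (150, 60, 230, 120)),
]

def ground(phrase: str, dets, threshold: float = 0.5):
    """Return the box whose label best matches the phrase, if close enough."""
    best = max(dets, key=lambda d: SequenceMatcher(None, phrase, d[0]).ratio())
    score = SequenceMatcher(None, phrase, best[0]).ratio()
    return best[1] if score >= threshold else None

print(ground("a black dog", detections))  # (12, 40, 210, 300)
```

A real pipeline would match phrases to regions in an embedding space rather than by surface strings, but the interface is the same: phrase in, box (or nothing) out.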


Examples on Audio Understanding

When a single audio clip is provided for audio understanding, BuboGPT gives informative descriptions covering nearly all of the acoustic events, even those fragments too short for humans to notice; see the examples for details.


Examples on Aligned Audio-Image Understanding

We show that BuboGPT can perform sound localization when a matched audio-image pair is provided, a clear demonstration of aligned audio-image understanding; see the examples for details.


Examples on Arbitrary Audio-Image Understanding

BuboGPT can also tell whether an image and an audio clip are relevant to each other and generate high-quality responses for arbitrary audio-image understanding; see the examples for details.
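One simple way to picture this relevance judgment is as a similarity check between the image and audio representations in the shared semantic space. The thresholded cosine-similarity rule and the toy vectors below are illustrative assumptions, not BuboGPT's actual decision mechanism.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def relevant(img_emb: np.ndarray, aud_emb: np.ndarray,
             threshold: float = 0.5) -> bool:
    """Treat the pair as matched if the embeddings are similar enough
    (a hypothetical proxy for the model's relevance judgment)."""
    return cosine(img_emb, aud_emb) >= threshold

# Toy embeddings (made up for illustration).
img = np.array([1.0, 0.0, 0.2])
matched_audio = np.array([0.9, 0.1, 0.3])     # e.g. dog photo + barking
unmatched_audio = np.array([-0.2, 1.0, 0.0])  # e.g. dog photo + sirens

print(relevant(img, matched_audio), relevant(img, unmatched_audio))
# True False
```

When the pair is judged unmatched, the model can describe each modality separately instead of forcing a spurious joint interpretation, which is what the negative image-audio pairs in the instruction data train it to do.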

BibTeX


  @article{zhao2023bubogpt,
    author      = {Yang Zhao and Zhijie Lin and Daquan Zhou and Zilong Huang and Jiashi Feng and Bingyi Kang},
    title       = {BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs},
    journal     = {arXiv preprint arXiv:2307.08581},
    year        = {2023}
  }
  