CREMA: Multimodal Compositional Video Reasoning
via Efficient Modular Adaptation and Fusion

University of North Carolina, Chapel Hill
Figure 1: We present CREMA, an efficient and modular modality-fusion framework. We utilize a single multi-modal Q-Former with a set of lightweight modality-specific adapters, allowing the model to incorporate video frames, optical flow, 3D point clouds, audio, and other modalities.

Despite impressive advances in multimodal compositional reasoning, existing approaches remain limited in flexibility and efficiency: they process a fixed set of input modalities and require updating a large number of model parameters. This paper tackles these critical challenges and proposes CREMA, an efficient and modular modality-fusion framework that can inject any new modality into video reasoning.

We first augment multiple informative modalities (such as optical flow, 3D point clouds, and audio) from given videos without extra human annotation by leveraging existing pre-trained models. Next, we introduce a query transformer with multiple parameter-efficient modules, one associated with each accessible modality. It projects diverse modality features into the LLM token embedding space, allowing the model to integrate different data types for response generation. Furthermore, we propose a fusion module designed to compress multimodal queries, maintaining computational efficiency in the LLM while combining additional modalities. We validate our method on video-3D, video-audio, and video-language reasoning tasks and achieve better or equivalent performance compared with strong multimodal LLMs, including BLIP-2, 3D-LLM, and SeViLA, while using 96% fewer trainable parameters. We provide extensive analyses of CREMA, including the impact of each modality on reasoning domains, the design of the fusion module, and example visualizations.
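As a concrete illustration of the modality-augmentation step, the sketch below shows how one extra modality, such as optical flow, could be pre-extracted from consecutive video frames with an off-the-shelf pre-trained estimator and no extra human annotation. The choice of torchvision's RAFT model and the tensor shapes are assumptions made for this sketch, not necessarily the estimators used by CREMA.

# Sketch: deriving optical flow from consecutive video frames with a
# pre-trained estimator (torchvision's RAFT, used here only as an example).
import torch
from torchvision.models.optical_flow import Raft_Large_Weights, raft_large

weights = Raft_Large_Weights.DEFAULT
transforms = weights.transforms()
model = raft_large(weights=weights, progress=False).eval()

# Two batches of consecutive frames (N, 3, H, W); H and W must be divisible by 8.
frame_t = torch.randint(0, 256, (2, 3, 360, 640), dtype=torch.uint8)
frame_t1 = torch.randint(0, 256, (2, 3, 360, 640), dtype=torch.uint8)
frame_t, frame_t1 = transforms(frame_t, frame_t1)

with torch.no_grad():
    flow_predictions = model(frame_t, frame_t1)  # list of iterative refinements
flow = flow_predictions[-1]                      # (N, 2, H, W) dense flow field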

Method

Figure 2: Overview of the CREMA method. The multimodal encoders, Q-Former, and LLM are kept frozen throughout. For each modality input, we extract tokens using a corresponding modality-specific adaptation module. Then, we can employ the optional fusion module to blend and compact the obtained tokens. In the end, the LLM leverages the multimodal or modality-fusion tokens, which contain rich representations of the different input modalities, to generate responses.
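To make the modular design concrete, below is a minimal, hypothetical PyTorch sketch of the adaptation and fusion path: a lightweight per-modality adapter maps frozen encoder features to a fixed number of query tokens in the LLM embedding space, and an optional fusion module compresses the concatenated tokens to a fixed budget before they are fed to the frozen LLM. All module names, dimensions, and the attention-based fusion strategy here are illustrative assumptions, not the released CREMA implementation.

# Minimal sketch (PyTorch) of modular adaptation and fusion.
# Module names, dimensions, and the fusion strategy are assumptions.
import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    """Lightweight trainable adapter for one modality.

    Maps frozen-encoder features to a fixed number of query tokens in the
    LLM embedding space. Only these adapters (and the fusion module) would
    be trained; the encoders, Q-Former, and LLM stay frozen.
    """
    def __init__(self, feat_dim: int, llm_dim: int, num_queries: int = 32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        self.attn = nn.MultiheadAttention(llm_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(feat_dim, llm_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, feat_dim) features from a frozen modality encoder
        kv = self.proj(feats)                                    # (B, N, llm_dim)
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        tokens, _ = self.attn(q, kv, kv)                         # (B, num_queries, llm_dim)
        return tokens

class TokenFusion(nn.Module):
    """Optional fusion module: compresses concatenated multimodal query tokens
    to a fixed budget so LLM cost does not grow with the number of modalities."""
    def __init__(self, llm_dim: int, num_fused: int = 32):
        super().__init__()
        self.fused_queries = nn.Parameter(torch.randn(num_fused, llm_dim) * 0.02)
        self.attn = nn.MultiheadAttention(llm_dim, num_heads=8, batch_first=True)

    def forward(self, token_sets: list[torch.Tensor]) -> torch.Tensor:
        all_tokens = torch.cat(token_sets, dim=1)                # (B, sum of queries, llm_dim)
        q = self.fused_queries.unsqueeze(0).expand(all_tokens.size(0), -1, -1)
        fused, _ = self.attn(q, all_tokens, all_tokens)          # (B, num_fused, llm_dim)
        return fused

if __name__ == "__main__":
    B, llm_dim = 2, 768
    adapters = {
        "video": ModalityAdapter(feat_dim=1408, llm_dim=llm_dim),  # e.g. ViT features
        "flow":  ModalityAdapter(feat_dim=1408, llm_dim=llm_dim),
        "pcd":   ModalityAdapter(feat_dim=1024, llm_dim=llm_dim),  # e.g. 3D features
    }
    feats = {
        "video": torch.randn(B, 257, 1408),
        "flow":  torch.randn(B, 257, 1408),
        "pcd":   torch.randn(B, 512, 1024),
    }
    tokens = [adapters[m](feats[m]) for m in feats]
    fused = TokenFusion(llm_dim)(tokens)   # fed to the frozen LLM as soft prompt tokens
    print(fused.shape)                     # torch.Size([2, 32, 768])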

Results

Fine-tuning Results on Video-3D Reasoning Task (SQA3D)

Our method is significantly more efficient, yet it outperforms publicly available, strong MLLM baselines on 3D-associated video reasoning.

Fine-tuning Results on Video-Audio Reasoning Task (MUSIC-AVQA)

The CREMA method achieves superior video-audio reasoning ability.

Fine-tuning Results on Video Reasoning Task (NExT-QA)

The CREMA method achieves superior performance compared with strong vision-language reasoning methods on the NExT-QA dataset.

Zero-shot Evaluation on Video-3D and Video-Audio Reasoning Tasks

The CREMA method also achieves superior zero-shot performance on compositional video reasoning.

Visualization Examples


Figure 3: Qualitative examples for multimodal compositional video reasoning from SQA3D (Left) and MUSIC-AVQA (Right). The correct predictions are marked by green check marks.

Beyond the numerical comparison of integrating different sets of modalities into our CREMA method, we also examine the model's generated responses for different types of input examples. In Figure 3 Left, CREMA with 3D point cloud inputs (P) fails to find the chair and instead responds with the color of the wall, brown, since 2D scene image features are incorporated into the 3D point cloud features. CREMA with video (V) and with V+P also predicts an inaccurate chair color, black. However, with the assistance of depth information, the model captures objects accurately and finds the designated chair. Similarly, in Figure 3 Right, optical flow inputs help locate the musicians and their playing poses, so our CREMA method can tell that the middle instrument is not the one being played at the beginning; the sound instead comes from the left.

BibTeX

@article{yu2024crema,
  author    = {Shoubin Yu and Jaehong Yoon and Mohit Bansal},
  title     = {CREMA: Multimodal Compositional Video Reasoning via Efficient Modular Adaptation and Fusion},
  journal   = {arXiv preprint},
  year      = {2024},
}