Instruction-Grounded Visual Projectors for Continual Learning of Generative Vision-Language Models

1 School of Computer Science and Engineering, Chung-Ang University, South Korea
2 School of Computer Science, University of Birmingham, United Kingdom
* Corresponding Author
Accepted to ICCV 2025
Instruction-grounded mixture of visual projectors with expert selection conditioned on instruction semantics

An overview of Mixture-of-Visual Projectors (MVP). MVP aligns visual representations with instruction semantics through instruction-grounded projector selection. Multiple projectors act as experts for distinct instruction contexts, while expert recommendation and pruning ensure efficient knowledge reuse and mitigate interference in continual learning.

Abstract

Continual learning enables pre-trained generative vision-language models (VLMs) to incorporate knowledge from new tasks without retraining on data from previous ones. Recent methods update a visual projector to translate visual information for new tasks, connecting pre-trained vision encoders with large language models. However, such adjustments may cause the models to prioritize visual inputs over language instructions, particularly when learning tasks with repetitive types of textual instructions. To address the neglect of language instructions, we propose a novel framework that grounds the translation of visual information on instructions for language models. We introduce a mixture of visual projectors, each serving as a specialized visual-to-language translation expert based on the given instruction context to adapt to new tasks. To avoid using experts for irrelevant instruction contexts, we propose an expert recommendation strategy that reuses experts for tasks similar to those previously learned. Additionally, we introduce expert pruning to alleviate interference from experts that were cumulatively activated in previous tasks. Extensive experiments on diverse vision-language tasks demonstrate that our method outperforms existing continual learning approaches by generating instruction-following responses.

Motivation of MVP:
In continual learning for vision–language models, visual projectors trained on sequential tasks often overfit to repetitive instruction patterns, causing them to ignore task-specific textual guidance. This imbalance between visual and linguistic representations hinders instruction-aware response generation and results in cumulative interference across tasks.

Method of MVP:
To address this, MVP introduces a mixture of instruction-grounded visual projectors that adaptively route visual inputs based on the semantics of each instruction. An expert recommendation strategy identifies and reuses prior experts associated with semantically related tasks, while expert pruning suppresses experts whose cumulative activation would interfere with future tasks. Adaptive Knowledge Aggregation between the pre-trained and newly learned projectors maintains zero-shot generalization while ensuring instruction consistency.
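The routing idea above can be illustrated with a minimal numerical sketch. Note this is an illustrative toy, not the paper's implementation: the dimensions, linear experts, the pooled instruction embedding, and the boolean `active` mask standing in for expert pruning are all assumptions for the example. A router scores each projector expert from the instruction embedding, pruned experts receive zero routing weight, and the visual tokens are projected by the gated mixture of experts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumed for illustration only).
D_VIS, D_LLM, D_TXT, N_EXPERTS = 8, 16, 8, 4

# Each expert is a linear visual-to-language projector mapping D_VIS -> D_LLM.
experts = [rng.normal(size=(D_VIS, D_LLM)) * 0.1 for _ in range(N_EXPERTS)]
# The router maps a pooled instruction embedding to per-expert logits.
W_router = rng.normal(size=(D_TXT, N_EXPERTS)) * 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def project(visual_feats, instr_emb, active=None):
    """Route visual features through projector experts conditioned on the
    instruction embedding. `active` is an optional boolean mask standing in
    for expert pruning: pruned experts get zero routing weight."""
    logits = instr_emb @ W_router
    if active is not None:
        logits = np.where(active, logits, -np.inf)  # pruned experts never fire
    gates = softmax(logits)  # instruction-grounded gating weights
    # Gated mixture of expert projections over all visual tokens.
    mixed = sum(g * (visual_feats @ W) for g, W in zip(gates, experts))
    return mixed, gates

visual = rng.normal(size=(5, D_VIS))  # 5 visual tokens from the vision encoder
instr = rng.normal(size=(D_TXT,))     # pooled instruction embedding

out, gates = project(visual, instr)
pruned_out, pruned_gates = project(
    visual, instr, active=np.array([True, True, False, False])
)
print(out.shape, gates.round(3), pruned_gates.round(3))
```

Because the gates depend on the instruction embedding rather than the image alone, two different instructions over the same image activate different projector mixtures, which is the instruction-grounding property the method relies on.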