cjwbw / internlm-xcomposer

Advanced text-image comprehension and composition based on InternLM

InternLM-XComposer

InternLM-XComposer is a vision-language large model (VLLM) based on InternLM for advanced text-image comprehension and composition. InternLM-XComposer has several appealing properties:

  • Interleaved Text-Image Composition: InternLM-XComposer can generate coherent, contextual articles that integrate images, providing a more engaging and immersive reading experience. The interleaved text-image composition is implemented in the following steps:

    1. Text Generation: It crafts long-form text based on human-provided instructions.
    2. Image Spotting and Captioning: It pinpoints optimal locations for image placement and furnishes image descriptions.
    3. Image Retrieval and Selection: It selects image candidates and identifies the image that optimally complements the content.
  • Comprehension with Rich Multilingual Knowledge: The text-image comprehension is empowered by training on extensive multi-modal multilingual concepts with carefully crafted strategies, resulting in a deep understanding of visual content.

  • Strong performance: It consistently achieves state-of-the-art results across various benchmarks for vision-language large models, including MME Benchmark (English), MMBench (English), Seed-Bench (English), CCBench (Chinese), and MMBench-CN (Chinese).
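
The three composition steps listed above (text generation, image spotting and captioning, image retrieval and selection) can be read as a simple pipeline. The following is only a minimal sketch of that control flow, not the project's actual implementation: `ImageSlot`, `compose_article`, and the four callables passed into it are hypothetical names introduced for illustration, and the demo stubs stand in for the model and the image retriever.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class ImageSlot:
    """A hypothetical record for one image placement."""
    paragraph_index: int  # the image goes right after this paragraph
    caption: str          # description of the image wanted at this spot
    image_id: str = ""    # filled in by the selection step


def compose_article(
    instruction: str,
    generate_text: Callable[[str], List[str]],
    spot_and_caption: Callable[[List[str]], List[ImageSlot]],
    retrieve_candidates: Callable[[str], Sequence[str]],
    select_image: Callable[[List[str], ImageSlot, Sequence[str]], str],
) -> List[str]:
    # Step 1: long-form text from the human-provided instruction.
    paragraphs = generate_text(instruction)

    # Step 2: decide where images belong and describe each one.
    slots = spot_and_caption(paragraphs)

    # Step 3: fetch candidates per caption, then pick the best fit.
    for slot in slots:
        candidates = retrieve_candidates(slot.caption)
        if candidates:
            slot.image_id = select_image(paragraphs, slot, candidates)

    # Interleave the chosen images with the paragraphs.
    chosen: Dict[int, ImageSlot] = {
        s.paragraph_index: s for s in slots if s.image_id
    }
    article: List[str] = []
    for index, paragraph in enumerate(paragraphs):
        article.append(paragraph)
        if index in chosen:
            article.append(f"[image: {chosen[index].image_id}]")
    return article


# Demo stubs (assumptions): a real system would call the model here.
def demo_generate(instruction: str) -> List[str]:
    return [f"An introduction to {instruction}.", "Some details.", "A conclusion."]


def demo_spot(paragraphs: List[str]) -> List[ImageSlot]:
    return [ImageSlot(paragraph_index=0, caption="overview photo")]


def demo_retrieve(caption: str) -> Sequence[str]:
    return ["img_001", "img_002"]


def demo_select(paragraphs: List[str], slot: ImageSlot, candidates: Sequence[str]) -> str:
    return candidates[0]  # placeholder policy: take the first candidate


demo_article = compose_article(
    "giant pandas", demo_generate, demo_spot, demo_retrieve, demo_select
)
```

Passing the four stages in as callables keeps the sketch independent of any particular model or retrieval backend; each stub can be swapped for a real component without changing the control flow.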

We release the InternLM-XComposer series in two versions:

  • InternLM-XComposer-VL: The pretrained VLLM with InternLM as the initialization of the LLM, achieving strong performance on various multimodal benchmarks, e.g., MME Benchmark, MMBench, Seed-Bench, CCBench, and MMBench-CN.
  • InternLM-XComposer: The finetuned VLLM for interleaved text-image composition and for use as an LLM-based AI assistant.

Citation

If you find our paper and code useful in your research, please consider giving us a star :star: and a citation :pencil: :)

@misc{zhang2023internlmxcomposer,
      title={InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition}, 
      author={Pan Zhang and Xiaoyi Dong and Bin Wang and Yuhang Cao and Chao Xu and Linke Ouyang and Zhiyuan Zhao and Shuangrui Ding and Songyang Zhang and Haodong Duan and Hang Yan and Xinyue Zhang and Wei Li and Jingwen Li and Kai Chen and Conghui He and Xingcheng Zhang and Yu Qiao and Dahua Lin and Jiaqi Wang},
      year={2023},
      eprint={2309.15112},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}