lucataco/interactiveomni-8b

A unified omni-modal model that can simultaneously receive inputs such as images, audio, text, and video, and directly generate coherent text and speech

InteractiveOmni-8B

InteractiveOmni-8B is an 8 billion parameter model that can see images, watch videos, listen to audio, and read text—then respond with both written text and natural speech. Think of it as a conversation partner that remembers what you talked about, even across multiple turns of dialogue.

What makes this interesting

Most models that work with multiple types of media need you to load different components for each task. InteractiveOmni-8B does it all in one unified model. You can show it a video with sound, ask questions about what’s happening, and get responses that draw from everything it perceived.

The model is particularly good at multi-turn conversations. It can remember images and context from earlier in the conversation and reference them later—something that trips up many other models. In benchmarks testing this “long-term memory” across conversation turns, the 8 billion parameter version performs comparably to models like Gemini 2.5 Flash.

How it works

The model combines several components into one system:

  • A vision encoder for processing images and video frames
  • An audio encoder (based on Whisper) for understanding speech and sounds
  • A language model for understanding and generating text
  • A speech decoder for generating natural-sounding voice responses

When you give it mixed inputs—say, a video file with audio, along with a text question—it processes all of these together and generates a response that can include both text and speech.
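To make that flow concrete, here is a minimal, purely illustrative sketch in Python. The function names and stub bodies are invented for this example and are not the actual InteractiveOmni code; they only show how the encoders, the language model, and the speech decoder hand data to one another.

```python
from typing import List, Tuple


def vision_encoder(frames: List[bytes]) -> List[str]:
    # Stand-in for the vision encoder: maps images or video frames to visual tokens
    return [f"<image_token_{i}>" for i in range(len(frames))]


def audio_encoder(waveform: bytes) -> List[str]:
    # Stand-in for the Whisper-based audio encoder: maps audio to audio tokens
    return ["<audio_token_0>", "<audio_token_1>"]


def language_model(tokens: List[str]) -> str:
    # Stand-in for the language model: attends over the interleaved multimodal sequence
    return "A dog is barking at a passing car."


def speech_decoder(text: str) -> bytes:
    # Stand-in for the speech decoder: turns the text reply into a speech waveform
    return b"<wav bytes>"


def respond(frames: List[bytes], waveform: bytes, question: str) -> Tuple[str, bytes]:
    # All modalities are encoded into one token sequence before the language model sees them
    tokens = vision_encoder(frames) + audio_encoder(waveform) + question.split()
    reply_text = language_model(tokens)
    reply_audio = speech_decoder(reply_text)
    return reply_text, reply_audio
```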

What you can do with it

Multi-turn conversations: Have back-and-forth exchanges where the model remembers what images, videos, or audio you showed it earlier. Ask follow-up questions that reference previous parts of the conversation.

Video understanding: Show the model videos and ask questions about what’s happening, including both visual and audio elements.

Audio analysis: Give it audio clips and get detailed understanding of speech content, environmental sounds, or other acoustic information.

Image comprehension: Process still images alongside other inputs for comprehensive understanding.

Speech generation: Get responses not just as text but as natural-sounding speech output.
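
Putting a few of these together, here is a hypothetical multi-turn exchange. The message format below is invented for illustration only and is not the model's actual input schema; the point is that the follow-up turn relies on the image sent earlier instead of re-attaching it.

```python
conversation = [
    {"role": "user", "content": [
        {"type": "image", "url": "https://example.com/kitchen.jpg"},  # hypothetical URL
        {"type": "text", "text": "What's on the counter?"},
    ]},
    {"role": "assistant", "content": [
        {"type": "text", "text": "A cutting board with chopped onions and a chef's knife."},
    ]},
    {"role": "user", "content": [
        # No image attached here: the model answers from its memory of the first turn
        {"type": "text", "text": "What dish do you think is being prepared?"},
    ]},
]
```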

Performance notes

Despite having only 8 billion parameters, this model punches above its weight. The smaller 4 billion parameter version retains about 97% of this model’s performance at half the size, which tells you how efficiently the architecture was designed.

In tests comparing similarly sized models on image, audio, video, and speech tasks, InteractiveOmni-8B achieves leading results. It particularly excels in scenarios requiring memory of past conversation context.

Try it yourself

You can experiment with InteractiveOmni-8B on the Replicate Playground at replicate.com/playground.
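
If you would rather call it from code, here is a sketch using the Replicate Python client. The input parameter names below ("prompt", "video") are placeholders, since the exact schema isn't shown here; check the model's API page for the real names before running it.

```python
# pip install replicate, and set REPLICATE_API_TOKEN in your environment
import replicate

output = replicate.run(
    "lucataco/interactiveomni-8b",
    input={
        # Parameter names are placeholders; see the model's API tab for the actual schema
        "prompt": "What is happening in this clip, and what sounds can you hear?",
        "video": open("clip.mp4", "rb"),
    },
)
print(output)
```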