lucataco/gemma-4-31b-it

Gemma 4 31B Instruct - Google's open-weight VLM (image + text in, text out)


Gemma 4 31B Instruct

A multimodal large language model from Google’s Gemma 4 family. With 31 billion parameters and image understanding built in, Gemma 4 31B Instruct can answer questions about pictures, write long-form text, follow complex instructions, and hold extended conversations. See the Hugging Face model card for the full upstream details. The model accepts an optional image alongside your prompt and returns a single text response.
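
A minimal text-only call through the Replicate Python client might look like this — a sketch, assuming the replicate package is installed and REPLICATE_API_TOKEN is set in your environment; the input names match the Inputs section below:

    import replicate

    # Text-only chat: leave the image input unset and pass just a prompt.
    output = replicate.run(
        "lucataco/gemma-4-31b-it",
        input={"prompt": "Explain the difference between a list and a tuple in Python."},
    )

    # Per the Output section, the model returns a single string.
    print(output)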

Capabilities

  • Vision question answering — upload an image and ask about what’s in it; the model describes scenes, reads text in pictures, identifies objects, and answers visual questions.
  • Image captioning and description — get short captions or detailed paragraph-length descriptions of any image.
  • General chat and instruction following — works as a strong text-only assistant when no image is provided: writing, summarization, reasoning, code explanation, translation.
  • Long-form generation — supports up to 2048 new tokens per response, suitable for articles, essays, structured analyses, and detailed answers.
  • Multilingual — Gemma 4 inherits broad language coverage from its training data and can understand and respond in many languages.

Inputs

  • prompt — your question, instruction, or message to the model. Required.
  • image — an optional image (URL or upload). When provided, the model treats your prompt as a question or instruction about that image. Leave blank for text-only chat.
  • system_prompt — an optional system instruction that sets the assistant’s persona, tone, or constraints. Leave blank for default behavior.
  • max_new_tokens — the maximum number of tokens the model will generate. Default is 256. Increase for long-form answers (up to 2048).
  • temperature — controls randomness. Default is 0.7.
      • 0.0 — fully deterministic; same input always produces the same output. Use for factual queries, OCR, and reproducible tasks.
      • 0.7 — balanced default for natural conversation and creative work.
      • 1.0+ — more diverse and creative, with a higher chance of going off-topic.
  • top_p — nucleus sampling cutoff. Default is 0.95. Lower values (e.g. 0.8) tighten the output to only the most likely tokens; higher values allow more variety. Most users should leave this at the default.
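
Putting these inputs together, a vision call with explicit sampling settings might look like the following sketch (the image URL is a placeholder; parameter names follow the list above):

    import replicate

    output = replicate.run(
        "lucataco/gemma-4-31b-it",
        input={
            "prompt": "Describe the colors and lighting in this scene.",
            "image": "https://example.com/sunset.jpg",  # placeholder URL
            "system_prompt": "You are a concise photography critic.",
            "max_new_tokens": 1024,  # raised from the 256 default for a longer answer
            "temperature": 0.7,      # balanced default
            "top_p": 0.95,           # default nucleus sampling cutoff
        },
    )
    print(output)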

Output

The model returns a single string of generated text — the assistant’s response to your prompt (and image, if provided).

Use cases

  • Visual assistant — ask questions about screenshots, photos, diagrams, charts, and documents.
  • Content generation — drafting emails, blog posts, marketing copy, summaries, and structured documents.
  • Education and tutoring — explaining concepts, walking through problem-solving steps, answering homework-style questions with optional diagrams.
  • Accessibility — generating image descriptions and alt-text for visually impaired users.
  • Research and analysis — extracting information from screenshots, OCR-style reading of in-image text, and reasoning over visual content.
  • Conversational agent — building chatbots that handle both text and image inputs without switching models.

Tips

  • For factual or OCR-style tasks where you want the same answer every time, set temperature to 0.
  • When asking detailed questions about an image, give the model a specific prompt (“Describe the colors and lighting in this scene” beats “Describe this”) for richer output.
  • If responses get cut off, raise max_new_tokens — the default of 256 is conservative.
  • Use system_prompt to lock in a persona or output format (“You are a precise technical writer. Always answer in bullet points.”) rather than re-stating it in every prompt; a sketch combining this with temperature 0 follows these tips.
  • For purely text-only chat, leave the image input blank — the model still performs as a strong language-only assistant.
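
As an illustration of the first and fourth tips, a reproducible OCR-style call might combine temperature 0 with a format-locking system prompt. This is a sketch; the image URL is a placeholder:

    import replicate

    output = replicate.run(
        "lucataco/gemma-4-31b-it",
        input={
            "prompt": "Transcribe all text visible in this image, top to bottom.",
            "image": "https://example.com/receipt.png",  # placeholder URL
            "system_prompt": "You are a precise transcriber. Output only the transcribed text.",
            "temperature": 0.0,  # deterministic: same image and prompt yield the same transcript
        },
    )
    print(output)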

Limitations

  • The model is quantized to 4-bit (bitsandbytes nf4) to fit on a single L40S GPU. Quality is very close to the BF16 reference for most prompts but may diverge slightly on long-context reasoning or rare tokens. A sketch of this kind of loading configuration follows this list.
  • Vision capability is competent but not as strong as specialized vision-language models on tasks like fine-grained object counting, complex document layout parsing, or precise spatial reasoning.
  • Like all LLMs, it can produce confident but incorrect answers (hallucinations), especially for niche factual queries — verify important outputs.
  • Long generations consume KV cache memory; very long prompts combined with max_new_tokens=2048 may slow throughput.
  • The model is governed by Google’s Gemma license — review the use restrictions before deploying in commercial or sensitive contexts.
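
For reference, 4-bit nf4 quantization of the kind described in the first limitation is typically configured through transformers and bitsandbytes roughly as follows. This is a plausible sketch, not this deployment's actual loading code; the checkpoint id and AutoModel class are assumptions:

    import torch
    from transformers import AutoModelForImageTextToText, BitsAndBytesConfig

    # nf4 4-bit weights with bf16 compute, matching the quantization described above.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model = AutoModelForImageTextToText.from_pretrained(
        "google/gemma-4-31b-it",  # assumed upstream checkpoint id
        quantization_config=bnb_config,
        device_map="auto",
    )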