stspanho/spectacles-yolov7-trainer

Train a yolov7 model for the Snap Spectacles in one click

Public

6 runs

Run stspanho/spectacles-yolov7-trainer with an API

Use one of our client libraries to get started quickly. Clicking on a library will take you to the Playground tab where you can tweak different inputs, see the results, and copy the corresponding code to use in your own project.
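As a rough illustration of a client-library call, the sketch below assembles an input dictionary and passes it to the Replicate Python client. It makes several assumptions: the `replicate` package is installed, `REPLICATE_API_TOKEN` is set, the zip URL is a placeholder, and the bare `owner/name` reference is accepted (some client versions may need a `:version` suffix, which this page does not list). `build_input` and `run_training` are hypothetical helper names, not part of any library.

```python
import os

# Model reference as shown on this page; a ":<version>" suffix may be required.
MODEL = "stspanho/spectacles-yolov7-trainer"


def build_input(images_zip_url, classes="coffee cup", **overrides):
    """Assemble the input dict; omitted fields fall back to the model defaults."""
    if not images_zip_url:
        raise ValueError("images_zip_url is required")
    payload = {"images_zip_url": images_zip_url, "classes": classes}
    payload.update(overrides)
    return payload


def run_training(payload, model=MODEL):
    """Start a run through the Replicate Python client (needs network access)."""
    import replicate  # third-party package: pip install replicate

    if not os.environ.get("REPLICATE_API_TOKEN"):
        raise RuntimeError("set REPLICATE_API_TOKEN before calling the API")
    # Per the output schema, the result refers to a single URI.
    return replicate.run(model, input=payload)


payload = build_input(
    "https://example.com/spectacles_frames.zip",  # hypothetical placeholder URL
    classes="banana, apple",
    epochs=100,
)
print(payload)
```

Calling `run_training(payload)` is left out of the sketch on purpose, since it starts a real training run.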

Input schema

The fields you can use to run this model with an API. If you don't give a value for a field, its default value will be used.

Field	Type	Default value	Description
images_zip_url	string		URL to a .zip of real Spectacles frames (flat folder of images).
classes	string	coffee cup	Comma-separated class list. Must line up with the objects described in synthetic_prompts and the objects visible in the real Spectacles frames, e.g. "coffee cup" or "banana, apple".
synthetic_prompts	string	first-person POV wide-angle snapshot from smart glasses, a white ceramic coffee cup with steam on a wooden kitchen counter, eye-level, soft morning daylight from a window; head-tilted-down POV wide-angle smart-glasses view, a takeaway coffee cup with a brown lid on a cafe table with coffee rings, warm tungsten overhead light; first-person 45-degree downward POV snapshot from smart glasses, an espresso cup on a saucer on a marble counter, slight motion blur, cafe ambient light; near top-down POV from head-mounted smart glasses, a latte mug with foam art on a wooden desk next to a laptop, cool daylight; first-person POV looking down at an angle, a stainless steel travel mug on an office desk cluttered with notebooks and pens, mixed fluorescent + window light; overhead first-person POV wide-angle snapshot looking straight down, a paper coffee cup with a corrugated sleeve on a cafe table, surface filling most of the frame; Dutch-angle first-person smart-glasses view, a glass mug of black coffee on a glass coffee table in a living room, low golden-hour light; eye-level POV snapshot captured while walking, a takeaway coffee cup held loosely out of frame, blurred kitchen counter background	Semicolon-separated FLUX.1-schnell prompts. Each prompt should describe ONE scene as if seen through Spectacles: include (1) a first-person/POV camera angle (eye-level, head-tilted-down, top-down...), (2) the target object with concrete visual detail, (3) the surface it sits on, and (4) the lighting. Prompts are round-robined across `synthetic_count` frames with unique seeds for variety. The default is a set of Spectacles-POV coffee-cup scenes — replace every entry to retarget the synthetic data. To train on more than one class, mix prompts for each class across the list (roughly synthetic_count / len(prompts) per scene).
synthetic_count	integer	100 Min: 0 Max: 1000	Number of synthetic frames to generate (0 = skip Flux entirely).
epochs	integer	200 Min: 1 Max: 1000	Number of training epochs (the schema provides no description).
batch_size	integer	64 Min: 1 Max: 128	Training batch size (the schema provides no description).
img_size	integer	224 Allowed: 224, 320, 416, 512, 640	Training + export image size (must be a multiple of 32). 224 is Snap's SnapML recipe for Spectacles; 320/416/512/640 give higher accuracy at higher on-device cost.
sam_score_threshold	number	0.5 Min: 0 Max: 1	Minimum SAM 3 detection confidence for an annotation to be kept. Lower (e.g. 0.3) -> more boxes per image but noisier labels; higher (e.g. 0.7) -> fewer, cleaner labels but you may drop images entirely. If a run errors with 'SAM 3 produced no detections above threshold', lower this. Tune by enabling include_dataset and inspecting the .txt labels.
include_dataset	boolean	False	If true, bundle the SAM 3-annotated train/val dataset into the output zip.
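The way `classes` and `synthetic_prompts` are split, and how prompts are round-robined across `synthetic_count` frames, can be sketched as follows. This is an illustration of the behaviour the descriptions imply, not the model's actual code: the function names are hypothetical, and the seed scheme (`base_seed + i`) is an assumption, since the page only says that seeds are unique.

```python
from collections import Counter


def parse_classes(classes):
    """Split the comma-separated class list, dropping surrounding whitespace."""
    return [c.strip() for c in classes.split(",") if c.strip()]


def plan_synthetic_frames(synthetic_prompts, synthetic_count, base_seed=0):
    """Return one (prompt, seed) pair per synthetic frame.

    Prompts are assigned round-robin; the seed scheme is an assumption.
    """
    if synthetic_count == 0:
        return []  # 0 means synthetic generation is skipped entirely
    prompts = [p.strip() for p in synthetic_prompts.split(";") if p.strip()]
    if not prompts:
        raise ValueError("at least one prompt is needed when synthetic_count > 0")
    return [
        (prompts[i % len(prompts)], base_seed + i)
        for i in range(synthetic_count)
    ]


plan = plan_synthetic_frames("scene a; scene b; scene c", 7)
per_prompt = Counter(prompt for prompt, _ in plan)
print(parse_classes("banana, apple"))
print(dict(per_prompt))
```

With 3 prompts and 7 frames, each prompt gets 2 or 3 frames, which matches the "roughly synthetic_count / len(prompts) per scene" rule in the description.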

{
  "type": "object",
  "title": "Input",
  "required": [
    "images_zip_url"
  ],
  "properties": {
    "epochs": {
      "type": "integer",
      "title": "Epochs",
      "default": 200,
      "maximum": 1000,
      "minimum": 1,
      "x-order": 4
    },
    "classes": {
      "type": "string",
      "title": "Classes",
      "default": "coffee cup",
      "x-order": 1,
      "description": "Comma-separated class list. Must line up with the objects described in synthetic_prompts and the objects visible in the real Spectacles frames, e.g. \"coffee cup\" or \"banana, apple\"."
    },
    "img_size": {
      "enum": [
        224,
        320,
        416,
        512,
        640
      ],
      "type": "integer",
      "title": "img_size",
      "description": "Training + export image size (must be a multiple of 32). 224 is Snap's SnapML recipe for Spectacles; 320/416/512/640 give higher accuracy at higher on-device cost.",
      "default": 224,
      "x-order": 6
    },
    "batch_size": {
      "type": "integer",
      "title": "Batch Size",
      "default": 64,
      "maximum": 128,
      "minimum": 1,
      "x-order": 5
    },
    "images_zip_url": {
      "type": "string",
      "title": "Images Zip Url",
      "x-order": 0,
      "description": "URL to a .zip of real Spectacles frames (flat folder of images)."
    },
    "include_dataset": {
      "type": "boolean",
      "title": "Include Dataset",
      "default": false,
      "x-order": 8,
      "description": "If true, bundle the SAM 3-annotated train/val dataset into the output zip."
    },
    "synthetic_count": {
      "type": "integer",
      "title": "Synthetic Count",
      "default": 100,
      "maximum": 1000,
      "minimum": 0,
      "x-order": 3,
      "description": "Number of synthetic frames to generate (0 = skip Flux entirely)."
    },
    "synthetic_prompts": {
      "type": "string",
      "title": "Synthetic Prompts",
      "default": "first-person POV wide-angle snapshot from smart glasses, a white ceramic coffee cup with steam on a wooden kitchen counter, eye-level, soft morning daylight from a window; head-tilted-down POV wide-angle smart-glasses view, a takeaway coffee cup with a brown lid on a cafe table with coffee rings, warm tungsten overhead light; first-person 45-degree downward POV snapshot from smart glasses, an espresso cup on a saucer on a marble counter, slight motion blur, cafe ambient light; near top-down POV from head-mounted smart glasses, a latte mug with foam art on a wooden desk next to a laptop, cool daylight; first-person POV looking down at an angle, a stainless steel travel mug on an office desk cluttered with notebooks and pens, mixed fluorescent + window light; overhead first-person POV wide-angle snapshot looking straight down, a paper coffee cup with a corrugated sleeve on a cafe table, surface filling most of the frame; Dutch-angle first-person smart-glasses view, a glass mug of black coffee on a glass coffee table in a living room, low golden-hour light; eye-level POV snapshot captured while walking, a takeaway coffee cup held loosely out of frame, blurred kitchen counter background",
      "x-order": 2,
      "description": "Semicolon-separated FLUX.1-schnell prompts. Each prompt should describe ONE scene as if seen through Spectacles: include (1) a first-person/POV camera angle (eye-level, head-tilted-down, top-down...), (2) the target object with concrete visual detail, (3) the surface it sits on, and (4) the lighting. Prompts are round-robined across `synthetic_count` frames with unique seeds for variety. The default is a set of Spectacles-POV coffee-cup scenes \u2014 replace every entry to retarget the synthetic data. To train on more than one class, mix prompts for each class across the list (roughly synthetic_count / len(prompts) per scene)."
    },
    "sam_score_threshold": {
      "type": "number",
      "title": "Sam Score Threshold",
      "default": 0.5,
      "maximum": 1,
      "minimum": 0,
      "x-order": 7,
      "description": "Minimum SAM 3 detection confidence for an annotation to be kept. Lower (e.g. 0.3) -> more boxes per image but noisier labels; higher (e.g. 0.7) -> fewer, cleaner labels but you may drop images entirely. If a run errors with 'SAM 3 produced no detections above threshold', lower this. Tune by enabling include_dataset and inspecting the .txt labels."
    }
  }
}
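A client can check an input against the constraints in this schema before submitting it. The sketch below hard-codes the defaults, ranges, and `img_size` enum copied from the schema above and uses only the standard library. `resolve_input` is a hypothetical helper; the long `synthetic_prompts` default is deliberately not duplicated, so that field is left for the server to fill in when omitted.

```python
# (type, default, minimum, maximum), copied from the schema above.
NUMERIC_FIELDS = {
    "synthetic_count": (int, 100, 0, 1000),
    "epochs": (int, 200, 1, 1000),
    "batch_size": (int, 64, 1, 128),
    "sam_score_threshold": (float, 0.5, 0, 1),
}
IMG_SIZES = (224, 320, 416, 512, 640)
OTHER_DEFAULTS = {"classes": "coffee cup", "img_size": 224, "include_dataset": False}


def resolve_input(user_input):
    """Fill in defaults and check ranges; raises on invalid input."""
    if not user_input.get("images_zip_url"):
        raise ValueError("images_zip_url is required")

    resolved = dict(OTHER_DEFAULTS)
    for name, (_, default, _, _) in NUMERIC_FIELDS.items():
        resolved[name] = default
    resolved.update(user_input)  # user values override defaults

    for name, (kind, _, low, high) in NUMERIC_FIELDS.items():
        value = resolved[name]
        allowed = (int,) if kind is int else (int, float)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, allowed):
            raise TypeError(f"{name} must be of type {kind.__name__}")
        if not low <= value <= high:
            raise ValueError(f"{name} must be between {low} and {high}")

    if resolved["img_size"] not in IMG_SIZES:
        raise ValueError(f"img_size must be one of {IMG_SIZES}")
    return resolved


resolved = resolve_input({"images_zip_url": "https://example.com/frames.zip"})
print(resolved["epochs"], resolved["img_size"])
```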

Output schema

The shape of the response you’ll get when you run this model with an API.

Schema

{
  "type": "string",
  "title": "Output",
  "format": "uri"
}
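Since the output is a single URI, a client would typically download it and, when `include_dataset` was enabled, look at the bundled labels to tune `sam_score_threshold`. The sketch below is a minimal version of that, under stated assumptions: the URI points at a zip archive, and the `.txt` labels use the usual YOLO layout of five whitespace-separated fields per box (`class x_center y_center width height`). The archive's internal folder layout is not documented here, so the code simply scans every `.txt` entry. Both function names are hypothetical.

```python
import shutil
import urllib.request
import zipfile
from collections import Counter
from pathlib import Path


def download_output(output_uri, dest):
    """Stream the output URI to a local file and return its path."""
    dest = Path(dest)
    with urllib.request.urlopen(output_uri) as response, dest.open("wb") as fh:
        shutil.copyfileobj(response, fh)
    return dest


def count_label_boxes(zip_path):
    """Count boxes per class index across all .txt entries in the zip.

    Assumes YOLO-style lines: an integer class id followed by four numbers.
    Lines that do not match that shape are skipped.
    """
    counts = Counter()
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.endswith(".txt"):
                continue
            text = archive.read(name).decode("utf-8", errors="replace")
            for line in text.splitlines():
                fields = line.split()
                if len(fields) == 5 and fields[0].isdigit():
                    counts[int(fields[0])] += 1
    return counts
```

A class with very few boxes after a run suggests lowering `sam_score_threshold`; many implausible boxes suggests raising it.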