stspanho/spectacles-yolov7-trainer
Train a YOLOv7 model for Snap Spectacles in one click
Run stspanho/spectacles-yolov7-trainer with an API
Use one of our client libraries to get started quickly.
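As a minimal sketch, assuming the official `replicate` Python client is installed and `REPLICATE_API_TOKEN` is set in the environment, a run might look like the following. The `build_input` and `run_trainer` helpers and the example URL are hypothetical names used for illustration. Depending on the client version, you may need to pin a model version (`owner/name:version`), and the returned value may be a URL string or a file-like object.

```python
def build_input(images_zip_url, **overrides):
    """Assemble the input payload; omitted fields fall back to the model defaults."""
    if not images_zip_url:
        raise ValueError("images_zip_url is required")
    payload = {"images_zip_url": images_zip_url}
    payload.update(overrides)
    return payload


def run_trainer(images_zip_url, **overrides):
    """Start a run and return whatever the client returns for the output."""
    # Imported lazily so build_input can be used without the client installed.
    import replicate

    return replicate.run(
        "stspanho/spectacles-yolov7-trainer",
        input=build_input(images_zip_url, **overrides),
    )


# Example call (hypothetical URL; replace with a real zip of Spectacles frames):
# output = run_trainer(
#     "https://example.com/spectacles_frames.zip",
#     classes="banana, apple",
#     synthetic_count=0,
# )
```

Only `images_zip_url` is passed explicitly here; every other field is left to its default unless overridden.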
Input schema
The fields you can use to run this model with an API. If you don't give a value for a field, its default value will be used.
| Field | Type | Default value | Description |
|---|---|---|---|
| images_zip_url | string | (required) | URL to a .zip of real Spectacles frames (flat folder of images). |
| classes | string | coffee cup | Comma-separated class list. Must line up with the objects described in synthetic_prompts and the objects visible in the real Spectacles frames, e.g. "coffee cup" or "banana, apple". |
| synthetic_prompts | string | first-person POV wide-angle snapshot from smart glasses, a white ceramic coffee cup with steam on a wooden kitchen counter, eye-level, soft morning daylight from a window; head-tilted-down POV wide-angle smart-glasses view, a takeaway coffee cup with a brown lid on a cafe table with coffee rings, warm tungsten overhead light; first-person 45-degree downward POV snapshot from smart glasses, an espresso cup on a saucer on a marble counter, slight motion blur, cafe ambient light; near top-down POV from head-mounted smart glasses, a latte mug with foam art on a wooden desk next to a laptop, cool daylight; first-person POV looking down at an angle, a stainless steel travel mug on an office desk cluttered with notebooks and pens, mixed fluorescent + window light; overhead first-person POV wide-angle snapshot looking straight down, a paper coffee cup with a corrugated sleeve on a cafe table, surface filling most of the frame; Dutch-angle first-person smart-glasses view, a glass mug of black coffee on a glass coffee table in a living room, low golden-hour light; eye-level POV snapshot captured while walking, a takeaway coffee cup held loosely out of frame, blurred kitchen counter background | Semicolon-separated FLUX.1-schnell prompts. Each prompt should describe ONE scene as if seen through Spectacles: include (1) a first-person/POV camera angle (eye-level, head-tilted-down, top-down...), (2) the target object with concrete visual detail, (3) the surface it sits on, and (4) the lighting. Prompts are round-robined across `synthetic_count` frames with unique seeds for variety. The default is a set of Spectacles-POV coffee-cup scenes; replace every entry to retarget the synthetic data. To train on more than one class, mix prompts for each class across the list (roughly synthetic_count / len(prompts) per scene). |
| synthetic_count | integer | 100 (Min: 0, Max: 1000) | Number of synthetic frames to generate (0 = skip Flux entirely). |
| epochs | integer | 200 (Min: 1, Max: 1000) | (no description provided) |
| batch_size | integer | 64 (Min: 1, Max: 128) | (no description provided) |
| img_size | integer | 224 (one of 224, 320, 416, 512, 640) | Training + export image size (must be a multiple of 32). 224 is Snap's SnapML recipe for Spectacles; 320/416/512/640 give higher accuracy at higher on-device cost. |
| sam_score_threshold | number | 0.5 (Min: 0, Max: 1) | Minimum SAM 3 detection confidence for an annotation to be kept. Lower (e.g. 0.3) -> more boxes per image but noisier labels; higher (e.g. 0.7) -> fewer, cleaner labels but you may drop images entirely. If a run errors with 'SAM 3 produced no detections above threshold', lower this. Tune by enabling include_dataset and inspecting the .txt labels. |
| include_dataset | boolean | False | If true, bundle the SAM 3-annotated train/val dataset into the output zip. |
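The round-robin behaviour described for `synthetic_prompts` can be sketched as follows. This is an illustration only, not the model's actual code: the `assign_prompts` helper is hypothetical, and deriving each seed as `base_seed + i` is an assumption (the source only says seeds are unique).

```python
def assign_prompts(synthetic_prompts, synthetic_count, base_seed=0):
    """Return (prompt, seed) pairs, cycling through the prompts in order."""
    # Split on semicolons and drop empty entries such as a trailing ";".
    prompts = [p.strip() for p in synthetic_prompts.split(";") if p.strip()]
    if synthetic_count > 0 and not prompts:
        raise ValueError("at least one prompt is needed when synthetic_count > 0")
    # Frame i gets prompt i mod len(prompts) and its own seed (assumed scheme).
    return [
        (prompts[i % len(prompts)], base_seed + i)
        for i in range(synthetic_count)
    ]


plan = assign_prompts("cup on a desk; cup on a counter; cup on a table", 8)
for prompt, seed in plan:
    print(seed, prompt)
```

Under this scheme, the default of 100 frames spread over the 8 default prompts would give 12 or 13 frames per scene, which matches the "roughly synthetic_count / len(prompts) per scene" guidance.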
```json
{
  "type": "object",
  "title": "Input",
  "required": [
    "images_zip_url"
  ],
  "properties": {
    "epochs": {
      "type": "integer",
      "title": "Epochs",
      "default": 200,
      "maximum": 1000,
      "minimum": 1,
      "x-order": 4
    },
    "classes": {
      "type": "string",
      "title": "Classes",
      "default": "coffee cup",
      "x-order": 1,
      "description": "Comma-separated class list. Must line up with the objects described in synthetic_prompts and the objects visible in the real Spectacles frames, e.g. \"coffee cup\" or \"banana, apple\"."
    },
    "img_size": {
      "enum": [
        224,
        320,
        416,
        512,
        640
      ],
      "type": "integer",
      "title": "img_size",
      "description": "Training + export image size (must be a multiple of 32). 224 is Snap's SnapML recipe for Spectacles; 320/416/512/640 give higher accuracy at higher on-device cost.",
      "default": 224,
      "x-order": 6
    },
    "batch_size": {
      "type": "integer",
      "title": "Batch Size",
      "default": 64,
      "maximum": 128,
      "minimum": 1,
      "x-order": 5
    },
    "images_zip_url": {
      "type": "string",
      "title": "Images Zip Url",
      "x-order": 0,
      "description": "URL to a .zip of real Spectacles frames (flat folder of images)."
    },
    "include_dataset": {
      "type": "boolean",
      "title": "Include Dataset",
      "default": false,
      "x-order": 8,
      "description": "If true, bundle the SAM 3-annotated train/val dataset into the output zip."
    },
    "synthetic_count": {
      "type": "integer",
      "title": "Synthetic Count",
      "default": 100,
      "maximum": 1000,
      "minimum": 0,
      "x-order": 3,
      "description": "Number of synthetic frames to generate (0 = skip Flux entirely)."
    },
    "synthetic_prompts": {
      "type": "string",
      "title": "Synthetic Prompts",
      "default": "first-person POV wide-angle snapshot from smart glasses, a white ceramic coffee cup with steam on a wooden kitchen counter, eye-level, soft morning daylight from a window; head-tilted-down POV wide-angle smart-glasses view, a takeaway coffee cup with a brown lid on a cafe table with coffee rings, warm tungsten overhead light; first-person 45-degree downward POV snapshot from smart glasses, an espresso cup on a saucer on a marble counter, slight motion blur, cafe ambient light; near top-down POV from head-mounted smart glasses, a latte mug with foam art on a wooden desk next to a laptop, cool daylight; first-person POV looking down at an angle, a stainless steel travel mug on an office desk cluttered with notebooks and pens, mixed fluorescent + window light; overhead first-person POV wide-angle snapshot looking straight down, a paper coffee cup with a corrugated sleeve on a cafe table, surface filling most of the frame; Dutch-angle first-person smart-glasses view, a glass mug of black coffee on a glass coffee table in a living room, low golden-hour light; eye-level POV snapshot captured while walking, a takeaway coffee cup held loosely out of frame, blurred kitchen counter background",
      "x-order": 2,
      "description": "Semicolon-separated FLUX.1-schnell prompts. Each prompt should describe ONE scene as if seen through Spectacles: include (1) a first-person/POV camera angle (eye-level, head-tilted-down, top-down...), (2) the target object with concrete visual detail, (3) the surface it sits on, and (4) the lighting. Prompts are round-robined across `synthetic_count` frames with unique seeds for variety. The default is a set of Spectacles-POV coffee-cup scenes \u2014 replace every entry to retarget the synthetic data. To train on more than one class, mix prompts for each class across the list (roughly synthetic_count / len(prompts) per scene)."
    },
    "sam_score_threshold": {
      "type": "number",
      "title": "Sam Score Threshold",
      "default": 0.5,
      "maximum": 1,
      "minimum": 0,
      "x-order": 7,
      "description": "Minimum SAM 3 detection confidence for an annotation to be kept. Lower (e.g. 0.3) -> more boxes per image but noisier labels; higher (e.g. 0.7) -> fewer, cleaner labels but you may drop images entirely. If a run errors with 'SAM 3 produced no detections above threshold', lower this. Tune by enabling include_dataset and inspecting the .txt labels."
    }
  }
}
```
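The numeric bounds and the `img_size` enum in the schema above can be checked client-side before submitting a run. The following is a minimal sketch, not part of the model: `validate_input` is a hypothetical helper, the constants are copied from the schema, and type checking is deliberately left out.

```python
# Constraints copied from the input schema above.
ALLOWED_IMG_SIZES = {224, 320, 416, 512, 640}
RANGES = {
    "epochs": (1, 1000),
    "batch_size": (1, 128),
    "synthetic_count": (0, 1000),
    "sam_score_threshold": (0, 1),
}


def validate_input(payload):
    """Return a list of problems found; an empty list means no issue was detected."""
    problems = []
    if not payload.get("images_zip_url"):
        problems.append("images_zip_url is required")
    # Only fields that are present are checked; missing ones use the defaults.
    for field, (low, high) in RANGES.items():
        if field in payload and not low <= payload[field] <= high:
            problems.append(f"{field} must be between {low} and {high}")
    if "img_size" in payload and payload["img_size"] not in ALLOWED_IMG_SIZES:
        problems.append("img_size must be one of 224, 320, 416, 512, 640")
    return problems


print(validate_input({"images_zip_url": "https://example.com/a.zip", "img_size": 256}))
```

Note that the enum is stricter than the "multiple of 32" wording in the description: 256 is a multiple of 32 but is not among the allowed values.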
Output schema
The shape of the response you’ll get when you run this model with an API.
```json
{
  "type": "string",
  "title": "Output",
  "format": "uri"
}
```
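Since the output is a single URI string (the output zip mentioned in the input descriptions), saving it locally can be sketched with the standard library alone. The helper names and the `output.zip` fallback are hypothetical, and the sketch assumes the URI is directly downloadable without extra authentication.

```python
import os
import urllib.request
from urllib.parse import urlparse


def output_filename(output_url, default="output.zip"):
    """Derive a local filename from the output URL, with a fallback name."""
    # Use only the path component so query strings do not end up in the name.
    name = os.path.basename(urlparse(str(output_url)).path)
    return name or default


def download_output(output_url, dest_dir="."):
    """Download the output file into dest_dir and return the local path."""
    path = os.path.join(dest_dir, output_filename(output_url))
    urllib.request.urlretrieve(str(output_url), path)
    return path
```

Wrapping the URL in `str()` keeps the sketch usable if a client library returns a URL-like object rather than a plain string.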