cjwbw/openpsg

Panoptic Scene Graph Generation

Run time and cost

Predictions run on Nvidia T4 GPU hardware. Predictions typically complete within 4 minutes. The predict time for this model varies significantly based on the inputs.

Readme

Panoptic Scene Graph Generation
Jingkang Yang, Yi Zhe Ang, Zujin Guo, Kaiyang Zhou, Wayne Zhang, Ziwei Liu
S-Lab, Nanyang Technological University & SenseTime Research

What is the PSG Task?

The Panoptic Scene Graph Generation (PSG) Task aims to interpret a complex scene image with a scene graph representation, with each node in the scene graph grounded by its pixel-accurate segmentation mask in the image.

To promote comprehensive scene understanding, we take into account all the content in the image, including "things" and "stuff", to generate the scene graph.

Figure: The PSG task generates a scene graph that is grounded by its panoptic segmentation.
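
As a rough illustration of the task output described above, the sketch below models a panoptic scene graph as a list of segments (nodes with pixel masks) plus subject-predicate-object relations between them. The class names (`Segment`, `Relation`, `PanopticSceneGraph`) and the toy 2x4 image are assumptions made for this example; they are not the OpenPSG data format or API.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

Pixel = Tuple[int, int]  # (row, col) coordinate in the image


@dataclass
class Segment:
    """One scene graph node: a category label grounded by a pixel mask."""
    label: str
    is_thing: bool       # True for "things", False for "stuff" (background)
    pixels: Set[Pixel]   # the segmentation mask as a set of pixel coordinates


@dataclass
class Relation:
    """A directed edge: segments[subject] --predicate--> segments[object]."""
    subject: int
    predicate: str
    object: int


@dataclass
class PanopticSceneGraph:
    segments: List[Segment]
    relations: List[Relation]

    def masks_are_disjoint(self) -> bool:
        """In a panoptic segmentation, no pixel belongs to two segments."""
        seen: Set[Pixel] = set()
        for seg in self.segments:
            if seen & seg.pixels:
                return False
            seen |= seg.pixels
        return True

    def triplets(self) -> List[Tuple[str, str, str]]:
        """Return relations as readable (subject, predicate, object) labels."""
        return [
            (self.segments[r.subject].label, r.predicate, self.segments[r.object].label)
            for r in self.relations
        ]


# Toy 2x4 image: a person and sky in the top row, grass in the bottom row.
graph = PanopticSceneGraph(
    segments=[
        Segment("person", True, {(0, 0), (0, 1)}),
        Segment("sky", False, {(0, 2), (0, 3)}),
        Segment("grass", False, {(1, 0), (1, 1), (1, 2), (1, 3)}),
    ],
    relations=[Relation(subject=0, predicate="standing on", object=2)],
)

print(graph.triplets())
```

Note that the "stuff" segment (grass) takes part in a relation just like a "thing" does, which is the point of taking all image content into account.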

PSG addresses many SGG problems

We believe that the biggest problem of classic scene graph generation (SGG) comes from noisy datasets.
Classic SGG datasets ground objects with bounding boxes, which inevitably causes a number of issues:
- Coarse localization: bounding boxes cannot reach pixel-level accuracy.
- Inability to ground comprehensively: bounding boxes cannot ground backgrounds.
- Tendency to provide trivial information: because bounding box annotation leaves annotators too much freedom, current datasets often capture trivial objects such as heads, which form trivial relations such as person-has-head.
- Duplicate groundings: the same object can be grounded by multiple separate bounding boxes.

The PSG dataset addresses all of the problems above by grounding objects with panoptic segmentation, using an appropriate granularity of object categories (adopted from COCO).
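
To make the coarse-localization point concrete, here is a small self-contained sketch on a hypothetical 5x5 grid. The helper functions and the toy masks are assumptions for illustration only and are not taken from the PSG codebase. It shows that a thin diagonal object fills only a fraction of its bounding box, and that two disjoint masks can share exactly the same box.

```python
def bounding_box(mask):
    """Tightest axis-aligned box (row0, col0, row1, col1) around a pixel mask."""
    rows = [r for r, _ in mask]
    cols = [c for _, c in mask]
    return min(rows), min(cols), max(rows), max(cols)


def box_area(box):
    """Number of pixels inside an inclusive (row0, col0, row1, col1) box."""
    r0, c0, r1, c1 = box
    return (r1 - r0 + 1) * (c1 - c0 + 1)


# A thin diagonal object on a 5x5 grid, stored as a set of (row, col) pixels.
diagonal = {(i, i) for i in range(5)}
box = bounding_box(diagonal)

# Fraction of the box that actually belongs to the object.
fill_ratio = len(diagonal) / box_area(box)

# A second object on the opposite diagonal, skipping the shared center pixel,
# so the two masks are disjoint but have the same bounding box.
other = {(i, 4 - i) for i in range(5) if i != 2}

print(box, fill_ratio)
print(bounding_box(other) == box, diagonal & other)
```

In this toy case the box covers 25 pixels while the object covers 5, and the box alone cannot tell the two objects apart, whereas their masks can.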

The PSG dataset contains 49k images that overlap between COCO and Visual Genome. In a nutshell, we asked annotators to annotate relations based on COCO panoptic segmentations, i.e., relations are mask-to-mask.

Figure: Comparison between the classic VG-150 and PSG.

Clear Predicate Definition

We also find that previous SGG datasets unfortunately neglect a good definition of predicates.
To better formulate the PSG task, we carefully define 56 predicates for the PSG dataset.
We try hard to avoid trivial or duplicated relations, and find that the 56 designed predicates are enough to cover the entire PSG dataset (or common everyday scenarios).

| Type | Predicates |
| --- | --- |
| Positional Relations (6) | over, in front of, beside, on, in, attached to |
| Common Object-Object Relations (5) | hanging from, on the back of, falling off, going down, painted on |
| Common Actions (31) | walking on, running on, crossing, standing on, lying on, sitting on, leaning on, flying over, jumping over, jumping from, wearing, holding, carrying, looking at, guiding, kissing, eating, drinking, feeding, biting, catching, picking (grabbing), playing with, chasing, climbing, cleaning (washing, brushing), playing, touching, pushing, pulling, opening |
| Human Actions (4) | cooking, talking to, throwing (tossing), slicing |
| Actions in Traffic Scene (4) | driving, riding, parked on, driving on |
| Actions in Sports Scene (3) | about to hit, kicking, swinging |
| Interaction between Background (3) | entering, exiting, enclosing (surrounding, wrapping in) |
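
The predicate taxonomy above can be written down as a simple lookup structure. The following is a minimal sketch: the dictionary name, the helper function, and the exact label strings (taken from the table with the parenthetical synonyms dropped) are assumptions for illustration and may not match the label strings used in the released dataset files.

```python
# Predicate groups transcribed from the table above (synonyms in parentheses omitted).
PREDICATES_BY_TYPE = {
    "Positional Relations": [
        "over", "in front of", "beside", "on", "in", "attached to",
    ],
    "Common Object-Object Relations": [
        "hanging from", "on the back of", "falling off", "going down", "painted on",
    ],
    "Common Actions": [
        "walking on", "running on", "crossing", "standing on", "lying on",
        "sitting on", "leaning on", "flying over", "jumping over", "jumping from",
        "wearing", "holding", "carrying", "looking at", "guiding", "kissing",
        "eating", "drinking", "feeding", "biting", "catching", "picking",
        "playing with", "chasing", "climbing", "cleaning", "playing", "touching",
        "pushing", "pulling", "opening",
    ],
    "Human Actions": ["cooking", "talking to", "throwing", "slicing"],
    "Actions in Traffic Scene": ["driving", "riding", "parked on", "driving on"],
    "Actions in Sports Scene": ["about to hit", "kicking", "swinging"],
    "Interaction between Background": ["entering", "exiting", "enclosing"],
}

# Flatten the groups into one list of all predicates.
ALL_PREDICATES = [p for group in PREDICATES_BY_TYPE.values() for p in group]


def predicate_type(predicate):
    """Return the group name a predicate belongs to, or None if it is unknown."""
    for type_name, group in PREDICATES_BY_TYPE.items():
        if predicate in group:
            return type_name
    return None


# The group sizes should add up to the 56 predicates stated in the text.
print(len(ALL_PREDICATES), len(set(ALL_PREDICATES)))
print(predicate_type("riding"))
```

A lookup like this is handy for sanity-checking predicted triplets, for example rejecting any predicate string that falls outside the 56 defined ones.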

Acknowledgements

OpenPSG is developed based on MMDetection. Most of the two-stage SGG implementations draw on MMSceneGraph and Scene-Graph-Benchmark.pytorch.
We sincerely appreciate the efforts of the developers from the previous codebases.

Citation

If you find our repository useful for your research, please consider citing our paper:

@inproceedings{yang2022psg,
    author = {Yang, Jingkang and Ang, Yi Zhe and Guo, Zujin and Zhou, Kaiyang and Zhang, Wayne and Liu, Ziwei},
    title = {Panoptic Scene Graph Generation},
    booktitle = {ECCV},
    year = {2022}
}
