Multiple Object Tracking and Segmentation in Complex Environments

Four challenges: long videos, occluded objects, diverse motion, and open-world settings

October 00th, 8:00 am (TBD), ECCV 2022 Workshop

Overview

Abstract

Multiple object tracking and segmentation aims to localize and associate objects of interest over time, and serves as a fundamental technology in many practical applications, such as visual surveillance, public security, video analysis, and human-computer interaction.

Today's computer vision systems achieve strong performance in simple scenes, such as those in the MOT and DAVIS datasets, but they are not as robust as the human vision system, especially in complex environments.

To advance the performance of current vision systems in complex environments, our workshop explores four settings for multiple object tracking and segmentation: (a) long videos, (b) occluded objects, (c) diverse motion, and (d) open-world scenarios.

The four challenges are:
  • 4th Large-scale Video Object Segmentation Challenge: Long Video Track
  • 2nd Occluded Video Instance Segmentation Challenge
  • 1st Multiple People Tracking in Group Dance Challenge
  • 2nd Open-World Video Object Detection and Segmentation Challenge

Challenges

VIS: Long Video
Video Instance Segmentation extends instance segmentation from the image domain to the video domain. This problem aims at simultaneous detection, segmentation, and tracking of object instances in videos. We extend VIS with long videos for validation and testing, consisting of:
  • 71 additional long videos in validation
  • 259 additional unique video instances with average duration of 49.8s
  • 9304 additional high-quality instance masks

The additional long videos (L) are separately evaluated from previous short videos. We use average precision (AP_L) at different intersection-over-union (IoU) thresholds and average recall (AR_L) as our evaluation metrics. The IoU in video instance segmentation is the sum of intersection area over the sum of union area across the video. For more details about the dataset, please refer to our paper or website.
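To make the video-level IoU above concrete, here is a minimal sketch of how it could be computed for one predicted and one ground-truth instance, assuming per-frame binary masks stored as NumPy arrays (the function name and data layout are illustrative, not the official evaluation code):

```python
import numpy as np

def video_iou(pred_masks, gt_masks):
    # Sum intersection and union areas over all frames, then take the ratio,
    # matching the video-level IoU described above. Frames where an instance
    # is absent are represented by all-zero masks.
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(pred_masks, gt_masks))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(pred_masks, gt_masks))
    return float(inter) / float(union) if union > 0 else 0.0
```

Summing areas across frames before dividing means a track must overlap the ground truth consistently over time, not just in a few frames, to score a high IoU.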

Dataset Download
Evaluation Server
OVIS
Occluded Video Instance Segmentation (OVIS) is a new large-scale benchmark dataset designed with the philosophy of perceiving object occlusions in videos, which reveals the complexity and diversity of real-world scenes. OVIS consists of:
  • 901 videos with severe object occlusions
  • 25 commonly seen semantic categories
  • 5,223 unique instances with average duration of 10.05s
  • 296k high-quality instance masks

We use average precision (AP) at different intersection-over-union (IoU) thresholds and average recall (AR) as our evaluation metrics. The IoU in video instance segmentation is the sum of intersection area over the sum of union area across the video. For more details about the dataset, please refer to our paper or website.

Dataset Download
Evaluation Server
DanceTrack
DanceTrack is a multi-human tracking dataset with two emphasized properties: (1) uniform appearance: the humans have highly similar, almost indistinguishable appearances; (2) diverse motion: the humans follow complicated motion patterns, and their relative positions change frequently. DanceTrack consists of:
  • 100 group-dance videos: 40 training videos, 25 validation videos, and 35 test videos
  • 990 unique instances with average duration of 52.9s
  • 877k high-quality bounding boxes

We use Higher Order Tracking Accuracy (HOTA) as the main metric, AssA and IDF1 to measure association performance, and DetA and MOTA to measure detection quality. For more details about the dataset, please refer to our paper or website.
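As a rough orientation on how these numbers relate (a sketch following the HOTA paper, not the benchmark's official evaluation code; the function name is illustrative): at each localization threshold, HOTA is the geometric mean of detection accuracy and association accuracy, and the reported score averages over thresholds.

```python
import numpy as np

def hota_from_components(det_a, ass_a):
    # Illustrative only: at each localization threshold alpha, HOTA is the
    # geometric mean of detection accuracy (DetA) and association accuracy
    # (AssA); the reported HOTA averages these values over the thresholds.
    det_a = np.asarray(det_a, dtype=float)
    ass_a = np.asarray(ass_a, dtype=float)
    hota_per_alpha = np.sqrt(det_a * ass_a)
    return float(hota_per_alpha.mean())
```

The geometric mean is what makes HOTA balanced: a tracker cannot compensate for poor association with strong detection alone, or vice versa.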

Dataset Download
Evaluation Server
UVO
The Unidentified Video Objects (UVO) benchmark aims at developing computer vision models that can detect and segment all objects that appear in images or videos, regardless of whether their semantic categories are known or unknown. Highlights of UVO:
  • high-quality instance masks annotated at 30 fps on 1,024 YouTube videos and at 1 fps on 10,337 videos from the Kinetics dataset
  • annotating ALL objects in each video, 13.5 objects per video on average
  • 57% of objects are not covered by COCO categories

We use average precision (AP) at different intersection-over-union (IoU) thresholds and average recall (AR) as our evaluation metrics. The IoU in video instance segmentation is the sum of intersection area over the sum of union area across the video. For more details about the dataset, please refer to our paper or website.

Dataset Download
Evaluation Server
Competition Schedule
Competition Date
Competition Phase 1 (submission of validation results opens) July 01, 2022 (11:59PM Pacific Time)
Competition Phase 2 (submission of test results opens) September 01, 2022 (11:59PM Pacific Time)
Deadline for Submitting the Final Predictions October 01, 2022 (11:59PM Pacific Time)
Decisions to Participants October 05, 2022 (11:59PM Pacific Time)

Workshop

Invited Speakers
Workshop Schedule

October 00th, 8:00 - 11:20 am, 2:30 - 5:00 pm (TBD)

Time Speaker Topic
8:00-8:10 am Organizers Welcome
8:10-8:40 am Invited speaker 1 Topic 1
8:40-9:10 am Long video: 3 winning teams
9:10-9:40 am Invited speaker 2 Topic 2
9:40-10:10 am Occluded object: 3 winning teams
10:10-10:20 am Organizers Break
10:20-10:50 am Invited speaker 3 Topic 3
10:50-11:20 am Diverse motion: 3 winning teams
2:30-2:40 pm Organizers Welcome
2:40-3:10 pm Invited speaker 4 Topic 4
3:10-3:40 pm Open-world image: 3 winning teams
3:40-3:50 pm Organizers Break
3:50-4:20 pm Invited speaker 5 Topic 5
4:20-4:50 pm Open-world video: 3 winning teams
4:50-5:00 pm Organizers Closing

Organizers