Multiple Object Tracking and Segmentation in Complex Environments

Four challenges in long video, occluded object, diverse motion and open-world

October 24th, 9:00 am (UTC+3), ECCV 2022 Online Workshop

News

[October 22] Technical reports from the top teams in all four challenges are now available! Thanks to the teams for sharing.

[July 11] The UVO Challenge is open today! [Dataset Download] [Evaluation Server Image] [Evaluation Server Video]

[July 10] The YouTubeVIS: Long Video Challenge is open today! [Dataset Download] [Evaluation Server]

[July 5] The OVIS Challenge is open today! [Dataset Download] [Evaluation Server]

[July 4] The DanceTrack Challenge is open today! [Dataset Download] [Evaluation Server]

[July 3] Competition Phase 1 is postponed to July 11, 2022 (00:01 am UTC). We apologize for the delay.

Overview

Abstract

Multiple object tracking and segmentation aims to localize and associate objects of interest over time, and serves as a fundamental technology in many practical applications, such as visual surveillance, public security, video analysis, and human-computer interaction.

Computer vision systems today achieve strong performance in simple tracking and segmentation scenes, such as the MOT and DAVIS datasets, but are not as robust as the human visual system, especially in complex environments.

To advance the performance of current vision systems in complex environments, our workshop explores four settings of multiple object tracking and segmentation: (a) long video, (b) occluded object, (c) diverse motion, and (d) open-world.

The four challenges are:
  • 4th YouTubeVIS and Long Video Instance Segmentation Challenge
  • 2nd Occluded Video Instance Segmentation Challenge
  • 1st Multiple People Tracking in Group Dance Challenge
  • 2nd Open-World Video Object Detection and Segmentation Challenge

Challenges

YouTubeVIS: Long Video
Video Instance Segmentation (VIS) extends the instance segmentation task from the image domain to the video domain, aiming at simultaneous detection, segmentation, and tracking of object instances in videos. We extend VIS with additional long videos for validation and testing, consisting of:
  • 141 additional long videos: 71 for validation, 70 for testing
  • 259 additional unique video instances with an average duration of 49.8s
  • 9304 additional high-quality instance masks

The additional long videos (L) are evaluated separately from the previous short videos. We use average precision (AP_L) at different intersection-over-union (IoU) thresholds and average recall (AR_L) as our evaluation metrics. In video instance segmentation, the IoU is the sum of intersection areas over the sum of union areas across the video. For more details about the dataset, please refer to our paper or website.
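The video-level IoU described above can be sketched as follows. This is a minimal illustration, not the official evaluation code; it assumes binary per-frame masks stored as NumPy boolean arrays, with an all-zero mask on frames where a track is absent (the function name and mask representation are our assumptions):

```python
import numpy as np

def video_iou(pred_masks, gt_masks):
    """Video-level IoU between a predicted and a ground-truth track.

    pred_masks, gt_masks: lists of HxW boolean arrays, one per frame.
    Intersections and unions are summed over all frames *before* dividing,
    so a track must overlap well throughout the video to score high.
    """
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(pred_masks, gt_masks))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(pred_masks, gt_masks))
    return inter / union if union > 0 else 0.0
```

Because the division happens once per video rather than per frame, a prediction that drifts off an object in later frames is penalized even if its early frames match perfectly.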

Dataset Download
Evaluation Server
OVIS
Occluded Video Instance Segmentation (OVIS) is a new large-scale benchmark dataset designed for perceiving object occlusions in videos, which reveal the complexity and diversity of real-world scenes. OVIS consists of:
  • 901 videos with severe object occlusions
  • 25 commonly seen semantic categories
  • 5,223 unique instances with an average duration of 10.05s
  • 296k high-quality instance masks

We use average precision (AP) at different intersection-over-union (IoU) thresholds and average recall (AR) as our evaluation metrics. In video instance segmentation, the IoU is the sum of intersection areas over the sum of union areas across the video. For more details about the dataset, please refer to our paper or website.

Dataset Download
Evaluation Server
DanceTrack
DanceTrack is a multi-human tracking dataset with two emphasized properties: (1) uniform appearance: humans have highly similar, nearly indistinguishable appearances; (2) diverse motion: humans move in complex patterns and their relative positions change frequently. DanceTrack consists of:
  • 100 videos of group dance: 40 training videos, 25 validation videos, and 35 test videos
  • 990 unique instances with an average duration of 52.9s
  • 877k high-quality bounding boxes

We use Higher Order Tracking Accuracy (HOTA) as the main metric, AssA and IDF1 to measure association performance, and DetA and MOTA to measure detection quality. For more details about the dataset, please refer to our paper or website.
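HOTA and AssA are computed with official evaluation toolkits and are too involved to reproduce here. As a simpler illustration of the detection-quality side, the classic MOTA score can be written directly from aggregate counts; the count arguments below are placeholders for illustration, since in practice they come from per-frame box matching against the ground truth:

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """CLEAR-MOT accuracy: MOTA = 1 - (FN + FP + IDSW) / GT.

    num_gt is the total number of ground-truth objects summed over all
    frames. MOTA can be negative when errors outnumber ground-truth
    objects, and is undefined when there are no ground-truth objects.
    """
    if num_gt == 0:
        raise ValueError("MOTA is undefined without ground-truth objects")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt
```

Note that MOTA weights identity switches the same as detection errors, which is one motivation for reporting HOTA and IDF1 alongside it on DanceTrack, where association is the hard part.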

Dataset Download
Evaluation Server
UVO
The Unidentified Video Objects (UVO) benchmark aims at developing computer vision models that can detect and segment all objects that appear in images or videos regardless of their semantic concepts, known or unknown. Highlights of UVO include:
  • high-quality instance masks annotated at 30 fps on 1,024 YouTube videos and at 1 fps on 10,337 videos from the Kinetics dataset
  • ALL objects in each video annotated, 13.5 objects per video on average
  • 57% of objects not covered by COCO categories

We use average precision (AP) at different intersection-over-union (IoU) thresholds and average recall (AR) as our evaluation metrics. In video instance segmentation, the IoU is the sum of intersection areas over the sum of union areas across the video. For more details about the dataset, please refer to our paper or website.

Dataset Download
Evaluation Server Image
Evaluation Server Video
Competition Schedule
Competition Phase 1 (submission of validation results opens): July 11, 2022 (00:01 am UTC)
Competition Phase 2 (submission of test results opens): September 01, 2022 (00:01 am UTC)
Deadline for submitting final predictions: October 01, 2022 (11:59 pm UTC)
Decisions to participants: October 05, 2022 (11:59 pm UTC)
Top Teams

(* equal contribution)

YouTubeVIS: Long Video
  • 1st place, team IIG: Yong Liu1,2, Jixiang Sun1, Yitong Wang2, Cong Wei1, Yansong Tang1, Yujiu Yang1 (1Tsinghua Shenzhen International Graduate School, Tsinghua University; 2ByteDance Inc.). Technical report: IIG
  • 2nd place, team ByteVIS: Junfeng Wu1, Yi Jiang2, Qihao Liu3, Xiang Bai1, Song Bai2 (1Huazhong University of Science and Technology; 2ByteDance; 3Johns Hopkins University). Technical report: ByteVIS

OVIS
  • 1st place, team BeyondSOTA: Fengliang Qi, Jing Xian, Zhuang Li, Bo Yan, Yuchen Hu, Hongbin Wang (Ant Group). Technical report: BeyondSOTA
  • 2nd place, team IIG: Yong Liu1,2, Jixiang Sun1, Yitong Wang2, Cong Wei1, Yansong Tang1, Yujiu Yang1 (1Tsinghua Shenzhen International Graduate School, Tsinghua University; 2ByteDance Inc.). Technical report: IIG

DanceTrack
  • 1st place, team MOTRv2: Yuang Zhang1,2, Tiancai Wang1, Weiyao Lin2, Xiangyu Zhang1 (1MEGVII Technology; 2Shanghai Jiao Tong University). Technical report: MOTRv2
  • 2nd place, team C-BIoU: Fan Yang, Shigeyuki Odashima, Shoichi Masui, Shan Jiang (Fujitsu Research). Technical report: C-BIoU
  • 2nd place, team mt_iot: Feng Yan, Zhiheng Li, Weixin Luo, Zequn Jie, Fan Liang, Xiaolin Wei, Lin Ma (Meituan). Technical report: mt_iot
  • 3rd place, team DLUT_IIAU: Guangxin Han1, Mingzhan Yang1, Yanxin Liu1, Shiyu Zhu2, Yuzhuo Han2, Xu Jia1, Huchuan Lu1 (1Dalian University of Technology; 2Honor Device Co., Ltd). Technical report: DLUT_IIAU

UVO
  • 1st place, team TAL-BUPT: Jiajun Zhang*1, Boyu Chen*2, Zhilong Ji2, Jinfeng Bai2, Zonghai Hu1 (1Beijing University of Posts and Telecommunications; 2Tomorrow Advancing Life). Technical report: TAL-BUPT

Workshop

Invited Speakers
Workshop Schedule

October 24th, 9:00 am - 1:00 pm (UTC+3)

Time Speaker Topic
9:00-9:10 am Organizers Welcome
9:10-9:40 am Invited speaker 1 Recognizing objects over a long time and in a large vocabulary
9:40-10:10 am YouTubeVIS: Long Video winning teams Solutions for the 4th YouTubeVIS and Long Video Instance Segmentation Challenge
10:10-10:40 am Invited speaker 2 Learning Robust Multiple Object Tracking and Segmentation
10:40-11:10 am OVIS winning teams Solutions for the 2nd Occluded Video Instance Segmentation Challenge
11:10-11:20 am Organizers Break
11:20-11:50 am DanceTrack winning teams Solutions for the 1st Multiple People Tracking in Group Dance Challenge
11:50-12:20 pm UVO winning teams Solutions for the 2nd Open-World Video Object Detection and Segmentation Challenge
12:20-1:00 pm Organizers Closing

Organizers