Video-Text Compliance


VTC contains 7920 samples, each consisting of a video paired with a text instruction and a compliance/non-compliance label. The dataset has over 1.2 million frames. Our data collection approach allows the dataset to be automatically augmented from a set of core videos. To address growing concerns about data privacy, we followed privacy-preserving safeguards when generating the VTC dataset.

Dataset Metadata

Format:
License: CDLA-Sharing
Domain: Video Classification
Number of Records: 7920 video samples (1.2 million frames)
Size:

Example Records

carry_bag_P1000344_iter006.mp4 0 open_predetermined_suitcase_calmly
carry_bag_P1000344_iter007.mp4 0 precisely_place_the_appropriate_box
carry_bag_P1000344_iter005.mp4 0 push_accessible_cart
carry_bag_P1000344_iter004.mp4 0 open_the_applicable_bag_at_once
carry_bag_P1000344_iter000.mp4 0 carry_the_specified_box
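
The example records above appear to be whitespace-separated triples: a video file name, an integer label, and an instruction whose words are joined by underscores. The following is a minimal sketch of reading such a line, assuming that column order holds for the whole dataset. The names `VTCRecord`, `parse_record`, and `instruction_words` are hypothetical and not part of any official VTC tooling, and the page does not state which label value denotes compliance, so the label is kept as a plain integer.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VTCRecord:
    """One annotation line (hypothetical container, not an official VTC type)."""
    video_file: str   # e.g. "carry_bag_P1000344_iter000.mp4"
    label: int        # meaning of 0 vs. 1 is not stated on this page
    instruction: str  # underscore-joined instruction text


def parse_record(line: str) -> VTCRecord:
    """Parse a line of the assumed form '<video_file> <label> <instruction>'."""
    parts = line.split()
    if len(parts) != 3:
        raise ValueError(f"expected 3 whitespace-separated fields, got {len(parts)}")
    video_file, label, instruction = parts
    return VTCRecord(video_file, int(label), instruction)


def instruction_words(record: VTCRecord) -> List[str]:
    """Split the underscore-joined instruction into its individual words."""
    return record.instruction.split("_")


if __name__ == "__main__":
    rec = parse_record("carry_bag_P1000344_iter000.mp4 0 carry_the_specified_box")
    print(rec.video_file, rec.label, " ".join(instruction_words(rec)))
```

Keeping the label as an integer rather than a boolean avoids committing to an interpretation that the page does not confirm.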


    author    = {Jaiswal, Mayoore and Liu, Frank and Jagannathan, Anupama and Gattiker, Anne and Hwang, Inseok and Lee, Jinho and Tong, Matthew and Dureja, Sahil and Shah, Soham and Hofstee, Peter and Chen, Valerie and Paul, Suvadip and Feris, Rogerio},
    title     = {Video-Text Compliance: Activity Verification Based on Natural Language Instructions},
    booktitle = {The IEEE International Conference on Computer Vision (ICCV) Workshops},
    month     = {Oct},
    year      = {2019}