-
SciVid: Cross-Domain Evaluation of Video Models in Scientific Applications
Authors:
Yana Hasson,
Pauline Luc,
Liliane Momeni,
Maks Ovsjanikov,
Guillaume Le Moing,
Alina Kuznetsova,
Ira Ktena,
Jennifer J. Sun,
Skanda Koppula,
Dilara Gokay,
Joseph Heyward,
Etienne Pot,
Andrew Zisserman
Abstract:
In recent years, there has been a proliferation of spatiotemporal foundation models in different scientific disciplines. While promising, these models are often domain-specific and are only assessed within the particular applications for which they are designed. Given that many tasks can be represented as video modeling problems, video foundation models (ViFMs) hold considerable promise as general-purpose domain-agnostic approaches. However, it is not known whether the knowledge acquired on large-scale but potentially out-of-domain data can be effectively transferred across diverse scientific disciplines, and if a single, pretrained ViFM can be competitive with domain-specific baselines. To address this, we introduce SciVid, a comprehensive benchmark comprising five *Sci*entific *Vid*eo tasks, across medical computer vision, animal behavior, and weather forecasting. We adapt six leading ViFMs to SciVid using simple trainable readout modules, establishing strong baselines and demonstrating the potential for effective transfer learning. Specifically, we show that state-of-the-art results can be obtained in several applications by leveraging the general-purpose representations from ViFM backbones. Furthermore, our results reveal the limitations of existing ViFMs, and highlight opportunities for the development of generalizable models for high-impact scientific applications. We release our code at https://github.com/google-deepmind/scivid to facilitate further research in the development of ViFMs.
Submitted 4 July, 2025;
originally announced July 2025.
-
Discrete Audio Tokens: More Than a Survey!
Authors:
Pooneh Mousavi,
Gallil Maimon,
Adel Moumen,
Darius Petermann,
Jiatong Shi,
Haibin Wu,
Haici Yang,
Anastasia Kuznetsova,
Artem Ploujnikov,
Ricard Marxer,
Bhuvana Ramabhadran,
Benjamin Elizalde,
Loren Lugosch,
Jinyu Li,
Cem Subakan,
Phil Woodland,
Minje Kim,
Hung-yi Lee,
Shinji Watanabe,
Yossi Adi,
Mirco Ravanelli
Abstract:
Discrete audio tokens are compact representations that aim to preserve perceptual quality, phonetic content, and speaker characteristics while enabling efficient storage and inference, as well as competitive performance across diverse downstream tasks. They provide a practical alternative to continuous features, enabling the integration of speech and audio into modern large language models (LLMs). As interest in token-based audio processing grows, various tokenization methods have emerged, and several surveys have reviewed the latest progress in the field. However, existing studies often focus on specific domains or tasks and lack a unified comparison across various benchmarks. This paper presents a systematic review and benchmark of discrete audio tokenizers, covering three domains: speech, music, and general audio. We propose a taxonomy of tokenization approaches based on encoder-decoder, quantization techniques, training paradigm, streamability, and application domains. We evaluate tokenizers on multiple benchmarks for reconstruction, downstream performance, and acoustic language modeling, and analyze trade-offs through controlled ablation studies. Our findings highlight key limitations, practical considerations, and open challenges, providing insight and guidance for future research in this rapidly evolving area. For more information, including our main results and tokenizer database, please refer to our website: https://poonehmousavi.github.io/dates-website/.
Submitted 16 June, 2025; v1 submitted 11 June, 2025;
originally announced June 2025.
-
Generative Data Augmentation Challenge: Zero-Shot Speech Synthesis for Personalized Speech Enhancement
Authors:
Jae-Sung Bae,
Anastasia Kuznetsova,
Dinesh Manocha,
John Hershey,
Trausti Kristjansson,
Minje Kim
Abstract:
This paper presents a new challenge that calls for zero-shot text-to-speech (TTS) systems to augment speech data for the downstream task, personalized speech enhancement (PSE), as part of the Generative Data Augmentation workshop at ICASSP 2025. Collecting high-quality personalized data is challenging due to privacy concerns and technical difficulties in recording audio from the test scene. To address these issues, synthetic data generation using generative models has gained significant attention. In this challenge, participants are tasked first with building zero-shot TTS systems to augment personalized data. Subsequently, PSE systems are asked to be trained with this augmented personalized dataset. Through this challenge, we aim to investigate how the quality of augmented data generated by zero-shot TTS models affects PSE model performance. We also provide baseline experiments using open-source zero-shot TTS models to encourage participation and benchmark advancements. Our baseline code implementation and checkpoints are available online.
Submitted 22 January, 2025;
originally announced January 2025.
-
Testing MediaPipe Holistic for Linguistic Analysis of Nonmanual Markers in Sign Languages
Authors:
Anna Kuznetsova,
Vadim Kimmelman
Abstract:
Advances in Deep Learning have made possible reliable landmark tracking of human bodies and faces that can be used for a variety of tasks. We test a recent Computer Vision solution, MediaPipe Holistic (MPH), to find out if its tracking of the facial features is reliable enough for a linguistic analysis of data from sign languages, and compare it to an older solution (OpenFace, OF). We use an existing data set of sentences in Kazakh-Russian Sign Language and a newly created small data set of videos with head tilts and eyebrow movements. We find that MPH does not perform well enough for linguistic analysis of eyebrow movement - but in a different way from OF, which is also performing poorly without correction. We reiterate a previous proposal to train additional correction models to overcome these limitations.
Submitted 25 March, 2024; v1 submitted 15 March, 2024;
originally announced March 2024.
-
OCTDL: Optical Coherence Tomography Dataset for Image-Based Deep Learning Methods
Authors:
Mikhail Kulyabin,
Aleksei Zhdanov,
Anastasia Nikiforova,
Andrey Stepichev,
Anna Kuznetsova,
Mikhail Ronkin,
Vasilii Borisov,
Alexander Bogachev,
Sergey Korotkich,
Paul A Constable,
Andreas Maier
Abstract:
Optical coherence tomography (OCT) is a non-invasive imaging technique with extensive clinical applications in ophthalmology. OCT enables the visualization of the retinal layers, playing a vital role in the early detection and monitoring of retinal diseases. OCT uses the principle of light wave interference to create detailed images of the retinal microstructures, making it a valuable tool for diagnosing ocular conditions. This work presents an open-access OCT dataset (OCTDL) comprising over 2000 OCT images labeled according to disease group and retinal pathology. The dataset consists of OCT records of patients with Age-related Macular Degeneration (AMD), Diabetic Macular Edema (DME), Epiretinal Membrane (ERM), Retinal Artery Occlusion (RAO), Retinal Vein Occlusion (RVO), and Vitreomacular Interface Disease (VID). The images were acquired with an Optovue Avanti RTVue XR using raster scanning protocols with dynamic scan length and image resolution. Each retinal b-scan was acquired by centering on the fovea and interpreted and cataloged by an experienced retinal specialist. In this work, we applied Deep Learning classification techniques to this new open-access dataset.
Submitted 1 October, 2024; v1 submitted 13 December, 2023;
originally announced December 2023.
-
Beyond SOT: Tracking Multiple Generic Objects at Once
Authors:
Christoph Mayer,
Martin Danelljan,
Ming-Hsuan Yang,
Vittorio Ferrari,
Luc Van Gool,
Alina Kuznetsova
Abstract:
Generic Object Tracking (GOT) is the problem of tracking target objects, specified by bounding boxes in the first frame of a video. While the task has received much attention in the last decades, researchers have almost exclusively focused on the single object setting. Multi-object GOT benefits from a wider applicability, rendering it more attractive in real-world applications. We attribute the lack of research interest into this problem to the absence of suitable benchmarks. In this work, we introduce a new large-scale GOT benchmark, LaGOT, containing multiple annotated target objects per sequence. Our benchmark allows users to tackle key remaining challenges in GOT, aiming to increase robustness and reduce computation through joint tracking of multiple objects simultaneously. In addition, we propose a transformer-based GOT tracker baseline capable of joint processing of multiple objects through shared computation. Our approach achieves a 4x faster run-time in case of 10 concurrent objects compared to tracking each object independently and outperforms existing single object trackers on our new benchmark. In addition, our approach achieves highly competitive results on single-object GOT datasets, setting a new state of the art on TrackingNet with a success rate AUC of 84.4%. Our benchmark, code, and trained models will be made publicly available.
Submitted 25 February, 2024; v1 submitted 22 December, 2022;
originally announced December 2022.
-
The Potential of Neural Speech Synthesis-based Data Augmentation for Personalized Speech Enhancement
Authors:
Anastasia Kuznetsova,
Aswin Sivaraman,
Minje Kim
Abstract:
With the advances in deep learning, speech enhancement systems benefited from large neural network architectures and achieved state-of-the-art quality. However, speaker-agnostic methods are not always desirable, both in terms of quality and their complexity, when they are to be used in a resource-constrained environment. One promising way is personalized speech enhancement (PSE), which is a smaller and easier speech enhancement problem for small models to solve, because it focuses on a particular test-time user. To achieve the personalization goal, while dealing with the typical lack of personal data, we investigate the effect of data augmentation based on neural speech synthesis (NSS). In the proposed method, we show that the quality of the NSS system's synthetic data matters, and if they are good enough the augmented dataset can be used to improve the PSE system that outperforms the speaker-agnostic baseline. The proposed PSE systems show significant complexity reduction while preserving the enhancement quality.
Submitted 14 November, 2022;
originally announced November 2022.
-
Relationships between patenting trends and research activity for green energy technologies
Authors:
Regina Tuganova,
Anna Permyakova,
Anna Kuznetsova,
Karina Rakhmanova,
Natalia Monzul,
Roman Uvarov,
Elizaveta Kovtun,
Semen Budennyy
Abstract:
Green technology is viewed as a means of creating a sustainable society and a catalyst for sustainable development by the global community. It is responsible for both the potential reduction of production waste and the reduction of carbon footprint and CO2 emissions. However, alongside the growing popularity of green technologies, there is an emerging skepticism about their contribution to solving environmental challenges. This article focuses on three areas of eco-innovation in green technology: renewable energy, hydrogen power, and decarbonization. Our main goal is to analyze the relationship between publication activity and the number of patented research results, thus shedding light on the real-world applicability of scientific outcomes. We used several bibliometric methods for analyzing global publication and patent activity, applied to the Scopus citation database and the European Patent Office's patent database. Our results show that the advancement of research in all three areas of eco-innovation does not automatically lead to an increase in the number of patents. We offer possible reasons for this dependency based on observations of worldwide tendencies in the green innovation sphere.
Submitted 18 October, 2022;
originally announced October 2022.
-
Curriculum optimization for low-resource speech recognition
Authors:
Anastasia Kuznetsova,
Anurag Kumar,
Jennifer Drexler Fox,
Francis Tyers
Abstract:
Modern end-to-end speech recognition models show astonishing results in transcribing audio signals into written text. However, conventional data feeding pipelines may be sub-optimal for low-resource speech recognition, which remains a challenging task. We propose an automated curriculum learning approach to optimize the sequence of training examples based on both the progress of the model while training and prior knowledge about the difficulty of the training examples. We introduce a new difficulty measure called compression ratio that can be used as a scoring function for raw audio in various noise conditions. The proposed method improves speech recognition Word Error Rate performance by up to 33% relative over the baseline system.
Submitted 17 February, 2022;
originally announced February 2022.
-
ConsumerCheck: A Software for Analysis of Sensory and Consumer Data
Authors:
Oliver Tomic,
Alexandra Kuznetsova,
Per Bruun Brockhoff,
Thomas Graff,
Tormod Næs
Abstract:
ConsumerCheck is an open source data analysis software tailored for analysis of sensory and consumer data. Since some of the implemented methods are generic, such as PCA, PLSR and PCR, other data from other domains may also be analysed with ConsumerCheck. The software comes with a graphical user interface and as such provides non-statisticians and users without programming skills free access to a number of widely used analysis methods within the field of sensory and consumer science. Computational results are presented in plots that are easily generated from the tree-controls within the graphical user interfaces. Since the construction of conjoint analysis models is not always straightforward, ConsumerCheck provides three previously defined model structures of different complexity. ConsumerCheck is an ongoing research project and the objective is to implement further statistical methods over time.
Submitted 11 January, 2022;
originally announced January 2022.
-
Factors of Influence for Transfer Learning across Diverse Appearance Domains and Task Types
Authors:
Thomas Mensink,
Jasper Uijlings,
Alina Kuznetsova,
Michael Gygli,
Vittorio Ferrari
Abstract:
Transfer learning enables knowledge learned on a source task to be re-used to help learn a target task. A simple form of transfer learning is common in current state-of-the-art computer vision models, i.e. pre-training a model for image classification on the ILSVRC dataset, and then fine-tuning it on any target task. However, previous systematic studies of transfer learning have been limited and the circumstances in which it is expected to work are not fully understood. In this paper we carry out an extensive experimental exploration of transfer learning across vastly different image domains (consumer photos, autonomous driving, aerial imagery, underwater, indoor scenes, synthetic, close-ups) and task types (semantic segmentation, object detection, depth estimation, keypoint detection). Importantly, these are all complex, structured output task types relevant to modern computer vision applications. In total we carry out over 2000 transfer learning experiments, including many where the source and target come from different image domains, task types, or both. We systematically analyze these experiments to understand the impact of image domain, task type, and dataset size on transfer learning performance. Our study leads to several insights and concrete recommendations: (1) for most tasks there exists a source which significantly outperforms ILSVRC'12 pre-training; (2) the image domain is the most important factor for achieving positive transfer; (3) the source dataset should *include* the image domain of the target dataset to achieve best results; (4) at the same time, we observe only small negative effects when the image domain of the source task is much broader than that of the target; (5) transfer across task types can be beneficial, but its success is heavily dependent on both the source and target task types.
Submitted 20 November, 2021; v1 submitted 24 March, 2021;
originally announced March 2021.
-
A bandit approach to curriculum generation for automatic speech recognition
Authors:
Anastasia Kuznetsova,
Anurag Kumar,
Francis M. Tyers
Abstract:
The Automated Speech Recognition (ASR) task has been a challenging domain especially for low data scenarios with few audio examples. This is the main problem in training ASR systems on the data from low-resource or marginalized languages. In this paper we present an approach to mitigate the lack of training data by employing Automated Curriculum Learning in combination with an adversarial bandit approach inspired by Reinforcement learning. The goal of the approach is to optimize the training sequence of mini-batches ranked by the level of difficulty and compare the ASR performance metrics against the random training sequence and discrete curriculum. We test our approach on a truly low-resource language and show that the bandit framework has a good improvement over the baseline transfer-learning model.
Submitted 6 February, 2021;
originally announced February 2021.
-
Efficient video annotation with visual interpolation and frame selection guidance
Authors:
A. Kuznetsova,
A. Talati,
Y. Luo,
K. Simmons,
V. Ferrari
Abstract:
We introduce a unified framework for generic video annotation with bounding boxes. Video annotation is a longstanding problem, as it is a tedious and time-consuming process. We tackle two important challenges of video annotation: (1) automatic temporal interpolation and extrapolation of bounding boxes provided by a human annotator on a subset of all frames, and (2) automatic selection of frames to annotate manually. Our contribution is two-fold: first, we propose a model that has both interpolating and extrapolating capabilities; second, we propose a guiding mechanism that sequentially generates suggestions for what frame to annotate next, based on the annotations made previously. We extensively evaluate our approach on several challenging datasets in simulation and demonstrate a reduction in terms of the number of manual bounding boxes drawn by 60% over linear interpolation and by 35% over an off-the-shelf tracker. Moreover, we also show 10% annotation time improvement over a state-of-the-art method for video annotation with bounding boxes [25]. Finally, we run human annotation experiments and provide extensive analysis of the results, showing that our approach reduces actual measured annotation time by 50% compared to commonly used linear interpolation.
Submitted 23 December, 2020;
originally announced December 2020.
-
A Machine-Synesthetic Approach To DDoS Network Attack Detection
Authors:
Yuri Monakhov,
Oleg Nikitin,
Anna Kuznetsova,
Alexey Kharlamov,
Alexandr Amochkin
Abstract:
In the authors' opinion, anomaly detection systems (ADS) seem to be the most promising direction in attack detection, because these systems can detect, among others, unknown (zero-day) attacks. To detect anomalies, the authors propose to use machine synesthesia. In this case, machine synesthesia is understood as an interface that allows using image classification algorithms in the problem of detecting network anomalies, making it possible to use non-specialized image detection methods that have recently been widely and actively developed. The proposed approach is that the network traffic data is "projected" into the image. The experimental results show that the proposed method performs well at detecting attacks. On a large sample, the value of the complex efficiency indicator reaches 97%.
Submitted 22 March, 2019; v1 submitted 13 January, 2019;
originally announced January 2019.
-
The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale
Authors:
Alina Kuznetsova,
Hassan Rom,
Neil Alldrin,
Jasper Uijlings,
Ivan Krasin,
Jordi Pont-Tuset,
Shahab Kamali,
Stefan Popov,
Matteo Malloci,
Alexander Kolesnikov,
Tom Duerig,
Vittorio Ferrari
Abstract:
We present Open Images V4, a dataset of 9.2M images with unified annotations for image classification, object detection and visual relationship detection. The images have a Creative Commons Attribution license that allows to share and adapt the material, and they have been collected from Flickr without a predefined list of class names or tags, leading to natural class statistics and avoiding an initial design bias. Open Images V4 offers large scale across several dimensions: 30.1M image-level labels for 19.8k concepts, 15.4M bounding boxes for 600 object classes, and 375k visual relationship annotations involving 57 classes. For object detection in particular, we provide 15x more bounding boxes than the next largest datasets (15.4M boxes on 1.9M images). The images often show complex scenes with several objects (8 annotated objects per image on average). We annotated visual relationships between them, which support visual relationship detection, an emerging task that requires structured reasoning. We provide in-depth comprehensive statistics about the dataset, we validate the quality of the annotations, we study how the performance of several modern models evolves with increasing amounts of training data, and we demonstrate two applications made possible by having unified annotations of multiple types coexisting in the same images. We hope that the scale, quality, and variety of Open Images V4 will foster further research and innovation even beyond the areas of image classification, object detection, and visual relationship detection.
Submitted 21 February, 2020; v1 submitted 2 November, 2018;
originally announced November 2018.
-
Analysis Of Congestion Control In Data Channels With Frequent Frame Loss
Authors:
Yuri Monakhov,
Anna Kuznetsova
Abstract:
Development of optimal control procedures for congested networks is a key factor in maintaining efficient network utilization. The absence of congestion control mechanism or its failure can lead to the lack of availability for certain network segments, and in severe cases -- for the entire network. The paper presents an analytical model describing the operation of the TCP Reno congestion control algorithm in terms of differential calculus and queuing systems. The purpose of this research is to explore the possibilities and ways of increasing the virtual channel capacity utilization efficiency in a lossy environment.
Submitted 10 October, 2018;
originally announced October 2018.
-
Detecting Visual Relationships Using Box Attention
Authors:
Alexander Kolesnikov,
Alina Kuznetsova,
Christoph H. Lampert,
Vittorio Ferrari
Abstract:
We propose a new model for detecting visual relationships, such as "person riding motorcycle" or "bottle on table". This task is an important step towards comprehensive structured image understanding, going beyond detecting individual objects. Our main novelty is a Box Attention mechanism that allows to model pairwise interactions between objects using standard object detection pipelines. The resulting model is conceptually clean, expressive and relies on well-justified training and prediction procedures. Moreover, unlike previously proposed approaches, our model does not introduce any additional complex components or hyperparameters on top of those already required by the underlying detection model. We conduct an experimental evaluation on three challenging datasets, V-COCO, Visual Relationships and Open Images, demonstrating strong quantitative and qualitative results.
Submitted 2 May, 2019; v1 submitted 5 July, 2018;
originally announced July 2018.