-
MS4UI: A Dataset for Multi-modal Summarization of User Interface Instructional Videos
Authors:
Yuan Zang,
Hao Tan,
Seunghyun Yoon,
Franck Dernoncourt,
Jiuxiang Gu,
Kushal Kafle,
Chen Sun,
Trung Bui
Abstract:
We study multi-modal summarization for instructional videos, whose goal is to provide users an efficient way to learn skills in the form of text instructions and key video frames. We observe that existing benchmarks focus on generic semantic-level video summarization, and are not suitable for providing step-by-step executable instructions and illustrations, both of which are crucial for instructio…
▽ More
We study multi-modal summarization for instructional videos, whose goal is to provide users an efficient way to learn skills in the form of text instructions and key video frames. We observe that existing benchmarks focus on generic semantic-level video summarization, and are not suitable for providing step-by-step executable instructions and illustrations, both of which are crucial for instructional videos. We propose a novel benchmark for user interface (UI) instructional video summarization to fill the gap. We collect a dataset of 2,413 UI instructional videos, which spans over 167 hours. These videos are manually annotated for video segmentation, text summarization, and video summarization, which enable the comprehensive evaluations for concise and executable video summarization. We conduct extensive experiments on our collected MS4UI dataset, which suggest that state-of-the-art multi-modal summarization methods struggle on UI video summarization, and highlight the importance of new methods for UI instructional video summarization.
△ Less
Submitted 14 June, 2025;
originally announced June 2025.
-
MAGNET: Augmenting Generative Decoders with Representation Learning and Infilling Capabilities
Authors:
Savya Khosla,
Aditi Tiwari,
Kushal Kafle,
Simon Jenni,
Handong Zhao,
John Collomosse,
Jing Shi
Abstract:
While originally designed for unidirectional generative modeling, decoder-only large language models (LLMs) are increasingly being adapted for bidirectional modeling. However, unidirectional and bidirectional models are typically trained separately with distinct objectives (generation and representation learning). This separation overlooks the opportunity for developing a more versatile language m…
▽ More
While originally designed for unidirectional generative modeling, decoder-only large language models (LLMs) are increasingly being adapted for bidirectional modeling. However, unidirectional and bidirectional models are typically trained separately with distinct objectives (generation and representation learning). This separation overlooks the opportunity for developing a more versatile language model and for these objectives to complement each other. In this work, we propose MAGNET, a method for adapting decoder-only LLMs to generate robust representations and infill missing text spans. MAGNET employs three self-supervised training objectives and introduces an attention mechanism that combines bidirectional and causal attention, enabling unified training across all objectives. Our results demonstrate that LLMs adapted with MAGNET (1) surpass strong text encoders on token-level and sentence-level representation learning tasks, (2) generate contextually appropriate text infills by leveraging past and future contexts, (3) perform open-ended text generation without excessive repetition of words or phrases, and (4) preserve the knowledge and reasoning capability gained by the LLM during pretraining.
△ Less
Submitted 13 February, 2025; v1 submitted 15 January, 2025;
originally announced January 2025.
-
Revisiting Multi-Modal LLM Evaluation
Authors:
Jian Lu,
Shikhar Srivastava,
Junyu Chen,
Robik Shrestha,
Manoj Acharya,
Kushal Kafle,
Christopher Kanan
Abstract:
With the advent of multi-modal large language models (MLLMs), datasets used for visual question answering (VQA) and referring expression comprehension have seen a resurgence. However, the most popular datasets used to evaluate MLLMs are some of the earliest ones created, and they have many known problems, including extreme bias, spurious correlations, and an inability to permit fine-grained analys…
▽ More
With the advent of multi-modal large language models (MLLMs), datasets used for visual question answering (VQA) and referring expression comprehension have seen a resurgence. However, the most popular datasets used to evaluate MLLMs are some of the earliest ones created, and they have many known problems, including extreme bias, spurious correlations, and an inability to permit fine-grained analysis. In this paper, we pioneer evaluating recent MLLMs (LLaVA 1.5, LLaVA-NeXT, BLIP2, InstructBLIP, GPT-4V, and GPT-4o) on datasets designed to address weaknesses in earlier ones. We assess three VQA datasets: 1) TDIUC, which permits fine-grained analysis on 12 question types; 2) TallyQA, which has simple and complex counting questions; and 3) DVQA, which requires optical character recognition for chart understanding. We also study VQDv1, a dataset that requires identifying all image regions that satisfy a given query. Our experiments reveal the weaknesses of many MLLMs that have not previously been reported. Our code is integrated into the widely used LAVIS framework for MLLM evaluation, enabling the rapid assessment of future MLLMs. Project webpage: https://kevinlujian.github.io/MLLM_Evaluations/
△ Less
Submitted 9 August, 2024;
originally announced August 2024.
-
They're All Doctors: Synthesizing Diverse Counterfactuals to Mitigate Associative Bias
Authors:
Salma Abdel Magid,
Jui-Hsien Wang,
Kushal Kafle,
Hanspeter Pfister
Abstract:
Vision Language Models (VLMs) such as CLIP are powerful models; however they can exhibit unwanted biases, making them less safe when deployed directly in applications such as text-to-image, text-to-video retrievals, reverse search, or classification tasks. In this work, we propose a novel framework to generate synthetic counterfactual images to create a diverse and balanced dataset that can be use…
▽ More
Vision Language Models (VLMs) such as CLIP are powerful models; however they can exhibit unwanted biases, making them less safe when deployed directly in applications such as text-to-image, text-to-video retrievals, reverse search, or classification tasks. In this work, we propose a novel framework to generate synthetic counterfactual images to create a diverse and balanced dataset that can be used to fine-tune CLIP. Given a set of diverse synthetic base images from text-to-image models, we leverage off-the-shelf segmentation and inpainting models to place humans with diverse visual appearances in context. We show that CLIP trained on such datasets learns to disentangle the human appearance from the context of an image, i.e., what makes a doctor is not correlated to the person's visual appearance, like skin color or body type, but to the context, such as background, the attire they are wearing, or the objects they are holding. We demonstrate that our fine-tuned CLIP model, $CF_α$, improves key fairness metrics such as MaxSkew, MinSkew, and NDKL by 40-66\% for image retrieval tasks, while still achieving similar levels of performance in downstream tasks. We show that, by design, our model retains maximal compatibility with the original CLIP models, and can be easily controlled to support different accuracy versus fairness trade-offs in a plug-n-play fashion.
△ Less
Submitted 17 June, 2024;
originally announced June 2024.
-
FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication
Authors:
Eric Slyman,
Stefan Lee,
Scott Cohen,
Kushal Kafle
Abstract:
Recent dataset deduplication techniques have demonstrated that content-aware dataset pruning can dramatically reduce the cost of training Vision-Language Pretrained (VLP) models without significant performance losses compared to training on the original dataset. These results have been based on pruning commonly used image-caption datasets collected from the web -- datasets that are known to harbor…
▽ More
Recent dataset deduplication techniques have demonstrated that content-aware dataset pruning can dramatically reduce the cost of training Vision-Language Pretrained (VLP) models without significant performance losses compared to training on the original dataset. These results have been based on pruning commonly used image-caption datasets collected from the web -- datasets that are known to harbor harmful social biases that may then be codified in trained models. In this work, we evaluate how deduplication affects the prevalence of these biases in the resulting trained models and introduce an easy-to-implement modification to the recent SemDeDup algorithm that can reduce the negative effects that we observe. When examining CLIP-style models trained on deduplicated variants of LAION-400M, we find our proposed FairDeDup algorithm consistently leads to improved fairness metrics over SemDeDup on the FairFace and FACET datasets while maintaining zero-shot performance on CLIP benchmarks.
△ Less
Submitted 24 April, 2024;
originally announced April 2024.
-
FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction
Authors:
Hang Hua,
Jing Shi,
Kushal Kafle,
Simon Jenni,
Daoan Zhang,
John Collomosse,
Scott Cohen,
Jiebo Luo
Abstract:
Recent progress in large-scale pre-training has led to the development of advanced vision-language models (VLMs) with remarkable proficiency in comprehending and generating multimodal content. Despite the impressive ability to perform complex reasoning for VLMs, current models often struggle to effectively and precisely capture the compositional information on both the image and text sides. To add…
▽ More
Recent progress in large-scale pre-training has led to the development of advanced vision-language models (VLMs) with remarkable proficiency in comprehending and generating multimodal content. Despite the impressive ability to perform complex reasoning for VLMs, current models often struggle to effectively and precisely capture the compositional information on both the image and text sides. To address this, we propose FineMatch, a new aspect-based fine-grained text and image matching benchmark, focusing on text and image mismatch detection and correction. This benchmark introduces a novel task for boosting and evaluating the VLMs' compositionality for aspect-based fine-grained text and image matching. In this task, models are required to identify mismatched aspect phrases within a caption, determine the aspect's class, and propose corrections for an image-text pair that may contain between 0 and 3 mismatches. To evaluate the models' performance on this new task, we propose a new evaluation metric named ITM-IoU for which our experiments show a high correlation to human evaluation. In addition, we also provide a comprehensive experimental analysis of existing mainstream VLMs, including fully supervised learning and in-context learning settings. We have found that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches. Moreover, models (e.g., GPT-4V, Gemini Pro Vision) with strong abilities to perform multimodal in-context learning are not as skilled at fine-grained compositional image and text matching analysis. With FineMatch, we are able to build a system for text-to-image generation hallucination detection and correction.
△ Less
Submitted 19 July, 2024; v1 submitted 22 April, 2024;
originally announced April 2024.
-
SCoRD: Subject-Conditional Relation Detection with Text-Augmented Data
Authors:
Ziyan Yang,
Kushal Kafle,
Zhe Lin,
Scott Cohen,
Zhihong Ding,
Vicente Ordonez
Abstract:
We propose Subject-Conditional Relation Detection SCoRD, where conditioned on an input subject, the goal is to predict all its relations to other objects in a scene along with their locations. Based on the Open Images dataset, we propose a challenging OIv6-SCoRD benchmark such that the training and testing splits have a distribution shift in terms of the occurrence statistics of $\langle$subject,…
▽ More
We propose Subject-Conditional Relation Detection SCoRD, where conditioned on an input subject, the goal is to predict all its relations to other objects in a scene along with their locations. Based on the Open Images dataset, we propose a challenging OIv6-SCoRD benchmark such that the training and testing splits have a distribution shift in terms of the occurrence statistics of $\langle$subject, relation, object$\rangle$ triplets. To solve this problem, we propose an auto-regressive model that given a subject, it predicts its relations, objects, and object locations by casting this output as a sequence of tokens. First, we show that previous scene-graph prediction methods fail to produce as exhaustive an enumeration of relation-object pairs when conditioned on a subject on this benchmark. Particularly, we obtain a recall@3 of 83.8% for our relation-object predictions compared to the 49.75% obtained by a recent scene graph detector. Then, we show improved generalization on both relation-object and object-box predictions by leveraging during training relation-object pairs obtained automatically from textual captions and for which no object-box annotations are available. Particularly, for $\langle$subject, relation, object$\rangle$ triplets for which no object locations are available during training, we are able to obtain a recall@3 of 33.80% for relation-object pairs and 26.75% for their box locations.
△ Less
Submitted 4 January, 2024; v1 submitted 24 August, 2023;
originally announced August 2023.
-
Helion: Enabling Natural Testing of Smart Homes
Authors:
Prianka Mandal,
Sunil Manandhar,
Kaushal Kafle,
Kevin Moran,
Denys Poshyvanyk,
Adwait Nadkarni
Abstract:
Prior work has developed numerous systems that test the security and safety of smart homes. For these systems to be applicable in practice, it is necessary to test them with realistic scenarios that represent the use of the smart home, i.e., home automation, in the wild. This demo paper presents the technical details and usage of Helion, a system that uses n-gram language modeling to learn the reg…
▽ More
Prior work has developed numerous systems that test the security and safety of smart homes. For these systems to be applicable in practice, it is necessary to test them with realistic scenarios that represent the use of the smart home, i.e., home automation, in the wild. This demo paper presents the technical details and usage of Helion, a system that uses n-gram language modeling to learn the regularities in user-driven programs, i.e., routines developed for the smart home, and predicts natural scenarios of home automation, i.e., event sequences that reflect realistic home automation usage. We demonstrate the HelionHA platform, developed by integrating Helion with the popular Home Assistant smart home platform. HelionHA allows an end-to-end exploration of Helion's scenarios by executing them as test cases with real and virtual smart home devices.
△ Less
Submitted 13 August, 2023;
originally announced August 2023.
-
MASC: A Tool for Mutation-Based Evaluation of Static Crypto-API Misuse Detectors
Authors:
Amit Seal Ami,
Syed Yusuf Ahmed,
Radowan Mahmud Redoy,
Nathan Cooper,
Kaushal Kafle,
Kevin Moran,
Denys Poshyvanyk,
Adwait Nadkarni
Abstract:
While software engineers are optimistically adopting crypto-API misuse detectors (or crypto-detectors) in their software development cycles, this momentum must be accompanied by a rigorous understanding of crypto-detectors' effectiveness at finding crypto-API misuses in practice. This demo paper presents the technical details and usage scenarios of our tool, namely Mutation Analysis for evaluating…
▽ More
While software engineers are optimistically adopting crypto-API misuse detectors (or crypto-detectors) in their software development cycles, this momentum must be accompanied by a rigorous understanding of crypto-detectors' effectiveness at finding crypto-API misuses in practice. This demo paper presents the technical details and usage scenarios of our tool, namely Mutation Analysis for evaluating Static Crypto-API misuse detectors (MASC). We developed $12$ generalizable, usage based mutation operators and three mutation scopes, namely Main Scope, Similarity Scope, and Exhaustive Scope, which can be used to expressively instantiate compilable variants of the crypto-API misuse cases. Using MASC, we evaluated nine major crypto-detectors, and discovered $19$ unique, undocumented flaws. We designed MASC to be configurable and user-friendly; a user can configure the parameters to change the nature of generated mutations. Furthermore, MASC comes with both Command Line Interface and Web-based front-end, making it practical for users of different levels of expertise.
△ Less
Submitted 13 August, 2023; v1 submitted 4 August, 2023;
originally announced August 2023.
-
Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations
Authors:
Ziyan Yang,
Kushal Kafle,
Franck Dernoncourt,
Vicente Ordonez
Abstract:
We propose a margin-based loss for tuning joint vision-language models so that their gradient-based explanations are consistent with region-level annotations provided by humans for relatively smaller grounding datasets. We refer to this objective as Attention Mask Consistency (AMC) and demonstrate that it produces superior visual grounding results than previous methods that rely on using vision-la…
▽ More
We propose a margin-based loss for tuning joint vision-language models so that their gradient-based explanations are consistent with region-level annotations provided by humans for relatively smaller grounding datasets. We refer to this objective as Attention Mask Consistency (AMC) and demonstrate that it produces superior visual grounding results than previous methods that rely on using vision-language models to score the outputs of object detectors. Particularly, a model trained with AMC on top of standard vision-language modeling objectives obtains a state-of-the-art accuracy of 86.49% in the Flickr30k visual grounding benchmark, an absolute improvement of 5.38% when compared to the best previous model trained under the same level of supervision. Our approach also performs exceedingly well on established benchmarks for referring expression comprehension where it obtains 80.34% accuracy in the easy test of RefCOCO+, and 64.55% in the difficult split. AMC is effective, easy to implement, and is general as it can be adopted by any vision-language model, and can use any type of region annotations.
△ Less
Submitted 6 January, 2024; v1 submitted 30 June, 2022;
originally announced June 2022.
-
OccamNets: Mitigating Dataset Bias by Favoring Simpler Hypotheses
Authors:
Robik Shrestha,
Kushal Kafle,
Christopher Kanan
Abstract:
Dataset bias and spurious correlations can significantly impair generalization in deep neural networks. Many prior efforts have addressed this problem using either alternative loss functions or sampling strategies that focus on rare patterns. We propose a new direction: modifying the network architecture to impose inductive biases that make the network robust to dataset bias. Specifically, we prop…
▽ More
Dataset bias and spurious correlations can significantly impair generalization in deep neural networks. Many prior efforts have addressed this problem using either alternative loss functions or sampling strategies that focus on rare patterns. We propose a new direction: modifying the network architecture to impose inductive biases that make the network robust to dataset bias. Specifically, we propose OccamNets, which are biased to favor simpler solutions by design. OccamNets have two inductive biases. First, they are biased to use as little network depth as needed for an individual example. Second, they are biased toward using fewer image locations for prediction. While OccamNets are biased toward simpler hypotheses, they can learn more complex hypotheses if necessary. In experiments, OccamNets outperform or rival state-of-the-art methods run on architectures that do not incorporate these inductive biases. Furthermore, we demonstrate that when the state-of-the-art debiasing methods are combined with OccamNets results further improve.
△ Less
Submitted 14 April, 2024; v1 submitted 5 April, 2022;
originally announced April 2022.
-
Towards Practical Integrity in the Smart Home with HomeEndorser
Authors:
Kaushal Kafle,
Kirti Jagtap,
Mansoor Ahmed-Rengers,
Trent Jaeger,
Adwait Nadkarni
Abstract:
Home automation in modern smart home platforms is often facilitated using trigger-action routines. While such routines enable flexible automation, they also lead to an instance of the integrity problem in these systems: untrusted third-parties may use platform APIs to modify the abstract home objects (AHOs) that privileged, high-integrity devices such as security cameras rely on (i.e., as triggers…
▽ More
Home automation in modern smart home platforms is often facilitated using trigger-action routines. While such routines enable flexible automation, they also lead to an instance of the integrity problem in these systems: untrusted third-parties may use platform APIs to modify the abstract home objects (AHOs) that privileged, high-integrity devices such as security cameras rely on (i.e., as triggers), thereby transitively attacking them. As most accesses to AHOs are legitimate, removing the permissions or applying naive information flow controls would not only fail to prevent these problems, but also break useful functionality. Therefore, this paper proposes the alternate approach of home abstraction endorsement, which endorses a proposed change to an AHO by correlating it with certain specific, preceding, environmental changes. We present the HomeEndorser framework, which provides a policy model for specifying endorsement policies for AHOs as changes in device states, relative to their location, and a platform-based reference monitor for mediating all API requests to change AHOs against those device states. We evaluate HomeEndorser on the HomeAssistant platform, finding that we can derive over 1000 policy rules for HomeEndorser to endorse changes to 6 key AHOs, preventing malice and accidents for less than 10% overhead for endorsement check microbenchmarks, and with no false alarms under realistic usage scenarios. In doing so, HomeEndorser lays the first steps towards providing a practical foundation for ensuring that API-induced changes to abstract home objects correlate with the physical realities of the user's environment.
△ Less
Submitted 10 September, 2021;
originally announced September 2021.
-
Why Crypto-detectors Fail: A Systematic Evaluation of Cryptographic Misuse Detection Techniques
Authors:
Amit Seal Ami,
Nathan Cooper,
Kaushal Kafle,
Kevin Moran,
Denys Poshyvanyk,
Adwait Nadkarni
Abstract:
The correct use of cryptography is central to ensuring data security in modern software systems. Hence, several academic and commercial static analysis tools have been developed for detecting and mitigating crypto-API misuse. While developers are optimistically adopting these crypto-API misuse detectors (or crypto-detectors) in their software development cycles, this momentum must be accompanied b…
▽ More
The correct use of cryptography is central to ensuring data security in modern software systems. Hence, several academic and commercial static analysis tools have been developed for detecting and mitigating crypto-API misuse. While developers are optimistically adopting these crypto-API misuse detectors (or crypto-detectors) in their software development cycles, this momentum must be accompanied by a rigorous understanding of their effectiveness at finding crypto-API misuse in practice. This paper presents the MASC framework, which enables a systematic and data-driven evaluation of crypto-detectors using mutation testing. We ground MASC in a comprehensive view of the problem space by developing a data-driven taxonomy of existing crypto-API misuse, containing $105$ misuse cases organized among nine semantic clusters. We develop $12$ generalizable usage-based mutation operators and three mutation scopes that can expressively instantiate thousands of compilable variants of the misuse cases for thoroughly evaluating crypto-detectors. Using MASC, we evaluate nine major crypto-detectors and discover $19$ unique, undocumented flaws that severely impact the ability of crypto-detectors to discover misuses in practice. We conclude with a discussion on the diverse perspectives that influence the design of crypto-detectors and future directions towards building security-focused crypto-detectors by design.
△ Less
Submitted 24 July, 2022; v1 submitted 14 July, 2021;
originally announced July 2021.
-
Learning to Predict Visual Attributes in the Wild
Authors:
Khoi Pham,
Kushal Kafle,
Zhe Lin,
Zhihong Ding,
Scott Cohen,
Quan Tran,
Abhinav Shrivastava
Abstract:
Visual attributes constitute a large portion of information contained in a scene. Objects can be described using a wide variety of attributes which portray their visual appearance (color, texture), geometry (shape, size, posture), and other intrinsic properties (state, action). Existing work is mostly limited to study of attribute prediction in specific domains. In this paper, we introduce a large…
▽ More
Visual attributes constitute a large portion of information contained in a scene. Objects can be described using a wide variety of attributes which portray their visual appearance (color, texture), geometry (shape, size, posture), and other intrinsic properties (state, action). Existing work is mostly limited to study of attribute prediction in specific domains. In this paper, we introduce a large-scale in-the-wild visual attribute prediction dataset consisting of over 927K attribute annotations for over 260K object instances. Formally, object attribute prediction is a multi-label classification problem where all attributes that apply to an object must be predicted. Our dataset poses significant challenges to existing methods due to large number of attributes, label sparsity, data imbalance, and object occlusion. To this end, we propose several techniques that systematically tackle these challenges, including a base model that utilizes both low- and high-level CNN features with multi-hop attention, reweighting and resampling techniques, a novel negative label expansion scheme, and a novel supervised attribute-aware contrastive learning algorithm. Using these techniques, we achieve near 3.7 mAP and 5.7 overall F1 points improvement over the current state of the art. Further details about the VAW dataset can be found at http://vawdataset.com/.
△ Less
Submitted 17 June, 2021;
originally announced June 2021.
-
Are Bias Mitigation Techniques for Deep Learning Effective?
Authors:
Robik Shrestha,
Kushal Kafle,
Christopher Kanan
Abstract:
A critical problem in deep learning is that systems learn inappropriate biases, resulting in their inability to perform well on minority groups. This has led to the creation of multiple algorithms that endeavor to mitigate bias. However, it is not clear how effective these methods are. This is because study protocols differ among papers, systems are tested on datasets that fail to test many forms…
▽ More
A critical problem in deep learning is that systems learn inappropriate biases, resulting in their inability to perform well on minority groups. This has led to the creation of multiple algorithms that endeavor to mitigate bias. However, it is not clear how effective these methods are. This is because study protocols differ among papers, systems are tested on datasets that fail to test many forms of bias, and systems have access to hidden knowledge or are tuned specifically to the test set. To address this, we introduce an improved evaluation protocol, sensible metrics, and a new dataset, which enables us to ask and answer critical questions about bias mitigation algorithms. We evaluate seven state-of-the-art algorithms using the same network architecture and hyperparameter selection policy across three benchmark datasets. We introduce a new dataset called Biased MNIST that enables assessment of robustness to multiple bias sources. We use Biased MNIST and a visual question answering (VQA) benchmark to assess robustness to hidden biases. Rather than only tuning to the test set distribution, we study robustness across different tuning distributions, which is critical because for many applications the test distribution may not be known during development. We find that algorithms exploit hidden biases, are unable to scale to multiple forms of bias, and are highly sensitive to the choice of tuning set. Based on our findings, we implore the community to adopt more rigorous assessment of future bias mitigation methods. All data, code, and results are publicly available at: https://github.com/erobic/bias-mitigators.
△ Less
Submitted 23 April, 2024; v1 submitted 31 March, 2021;
originally announced April 2021.
-
Systematic Mutation-based Evaluation of the Soundness of Security-focused Android Static Analysis Techniques
Authors:
Amit Seal Ami,
Kaushal Kafle,
Kevin Moran,
Adwait Nadkarni,
Denys Poshyvanyk
Abstract:
Mobile application security has been a major area of focus for security research over the course of the last decade. Numerous application analysis tools have been proposed in response to malicious, curious, or vulnerable apps. However, existing tools, and specifically, static analysis tools, trade soundness of the analysis for precision and performance and are hence soundy. Unfortunately, the spec…
▽ More
Mobile application security has been a major area of focus for security research over the course of the last decade. Numerous application analysis tools have been proposed in response to malicious, curious, or vulnerable apps. However, existing tools, and specifically, static analysis tools, trade soundness of the analysis for precision and performance and are hence soundy. Unfortunately, the specific unsound choices or flaws in the design of these tools is often not known or well-documented, leading to misplaced confidence among researchers, developers, and users. This paper describes the Mutation-based Soundness Evaluation ($μ$SE) framework, which systematically evaluates Android static analysis tools to discover, document, and fix flaws, by leveraging the well-founded practice of mutation analysis. We implemented $μ$SE and applied it to a set of prominent Android static analysis tools that detect private data leaks in apps. In a study conducted previously, we used $μ$SE to discover $13$ previously undocumented flaws in FlowDroid, one of the most prominent data leak detectors for Android apps. Moreover, we discovered that flaws also propagated to other tools that build upon the design or implementation of FlowDroid or its components. This paper substantially extends our $μ$SE framework and offers an new in-depth analysis of two more major tools in our 2020 study, we find $12$ new, undocumented flaws and demonstrate that all $25$ flaws are found in more than one tool, regardless of any inheritance-relation among the tools. Our results motivate the need for systematic discovery and documentation of unsound choices in soundy tools and demonstrate the opportunities in leveraging mutation testing in achieving this goal.
△ Less
Submitted 17 July, 2021; v1 submitted 12 February, 2021;
originally announced February 2021.
-
$μ$SE: Mutation-based Evaluation of Security-focused Static Analysis Tools for Android
Authors:
Amit Seal Ami,
Kaushal Kafle,
Kevin Moran,
Adwait Nadkarni,
Denys Poshyvanyk
Abstract:
This demo paper presents the technical details and usage scenarios of $μ$SE: a mutation-based tool for evaluating security-focused static analysis tools for Android. Mutation testing is generally used by software practitioners to assess the robustness of a given test-suite. However, we leverage this technique to systematically evaluate static analysis tools and uncover and document soundness issue…
▽ More
This demo paper presents the technical details and usage scenarios of $μ$SE: a mutation-based tool for evaluating security-focused static analysis tools for Android. Mutation testing is generally used by software practitioners to assess the robustness of a given test-suite. However, we leverage this technique to systematically evaluate static analysis tools and uncover and document soundness issues. $μ$SE's analysis has found 25 previously undocumented flaws in static data leak detection tools for Android. $μ$SE offers four mutation schemes, namely Reachability, Complex-reachability, TaintSink, and ScopeSink, which determine the locations of seeded mutants. Furthermore, the user can extend $μ$SE by customizing the API calls targeted by the mutation analysis. $μ$SE is also practical, as it makes use of filtering techniques based on compilation and execution criteria that reduces the number of ineffective mutations.
△ Less
Submitted 12 February, 2021;
originally announced February 2021.
-
On the Value of Out-of-Distribution Testing: An Example of Goodhart's Law
Authors:
Damien Teney,
Kushal Kafle,
Robik Shrestha,
Ehsan Abbasnejad,
Christopher Kanan,
Anton van den Hengel
Abstract:
Out-of-distribution (OOD) testing is increasingly popular for evaluating a machine learning system's ability to generalize beyond the biases of a training set. OOD benchmarks are designed to present a different joint distribution of data and labels between training and test time. VQA-CP has become the standard OOD benchmark for visual question answering, but we discovered three troubling practices…
▽ More
Out-of-distribution (OOD) testing is increasingly popular for evaluating a machine learning system's ability to generalize beyond the biases of a training set. OOD benchmarks are designed to present a different joint distribution of data and labels between training and test time. VQA-CP has become the standard OOD benchmark for visual question answering, but we discovered three troubling practices in its current use. First, most published methods rely on explicit knowledge of the construction of the OOD splits. They often rely on ``inverting'' the distribution of labels, e.g. answering mostly 'yes' when the common training answer is 'no'. Second, the OOD test set is used for model selection. Third, a model's in-domain performance is assessed after retraining it on in-domain splits (VQA v2) that exhibit a more balanced distribution of labels. These three practices defeat the objective of evaluating generalization, and put into question the value of methods specifically designed for this dataset. We show that embarrassingly-simple methods, including one that generates answers at random, surpass the state of the art on some question types. We provide short- and long-term solutions to avoid these pitfalls and realize the benefits of OOD evaluation.
△ Less
Submitted 19 May, 2020;
originally announced May 2020.
-
Do We Need Fully Connected Output Layers in Convolutional Networks?
Authors:
Zhongchao Qian,
Tyler L. Hayes,
Kushal Kafle,
Christopher Kanan
Abstract:
Traditionally, deep convolutional neural networks consist of a series of convolutional and pooling layers followed by one or more fully connected (FC) layers to perform the final classification. While this design has been successful, for datasets with a large number of categories, the fully connected layers often account for a large percentage of the network's parameters. For applications with mem…
▽ More
Traditionally, deep convolutional neural networks consist of a series of convolutional and pooling layers followed by one or more fully connected (FC) layers to perform the final classification. While this design has been successful, for datasets with a large number of categories, the fully connected layers often account for a large percentage of the network's parameters. For applications with memory constraints, such as mobile devices and embedded platforms, this is not ideal. Recently, a family of architectures that involve replacing the learned fully connected output layer with a fixed layer has been proposed as a way to achieve better efficiency. In this paper we examine this idea further and demonstrate that fixed classifiers offer no additional benefit compared to simply removing the output layer along with its parameters. We further demonstrate that the typical approach of having a fully connected final output layer is inefficient in terms of parameter count. We are able to achieve comparable performance to a traditionally learned fully connected classification output layer on the ImageNet-1K, CIFAR-100, Stanford Cars-196, and Oxford Flowers-102 datasets, while not having a fully connected output layer at all.
△ Less
Submitted 28 April, 2020; v1 submitted 28 April, 2020;
originally announced April 2020.
-
Visual Grounding Methods for VQA are Working for the Wrong Reasons!
Authors:
Robik Shrestha,
Kushal Kafle,
Christopher Kanan
Abstract:
Existing Visual Question Answering (VQA) methods tend to exploit dataset biases and spurious statistical correlations, instead of producing right answers for the right reasons. To address this issue, recent bias mitigation methods for VQA propose to incorporate visual cues (e.g., human attention maps) to better ground the VQA models, showcasing impressive gains. However, we show that the performan…
▽ More
Existing Visual Question Answering (VQA) methods tend to exploit dataset biases and spurious statistical correlations, instead of producing right answers for the right reasons. To address this issue, recent bias mitigation methods for VQA propose to incorporate visual cues (e.g., human attention maps) to better ground the VQA models, showcasing impressive gains. However, we show that the performance improvements are not a result of improved visual grounding, but a regularization effect which prevents over-fitting to linguistic priors. For instance, we find that it is not actually necessary to provide proper, human-based cues; random, insensible cues also result in similar improvements. Based on this observation, we propose a simpler regularization scheme that does not require any external annotations and yet achieves near state-of-the-art performance on VQA-CPv2.
△ Less
Submitted 23 April, 2024; v1 submitted 12 April, 2020;
originally announced April 2020.
-
REMIND Your Neural Network to Prevent Catastrophic Forgetting
Authors:
Tyler L. Hayes,
Kushal Kafle,
Robik Shrestha,
Manoj Acharya,
Christopher Kanan
Abstract:
People learn throughout life. However, incrementally updating conventional neural networks leads to catastrophic forgetting. A common remedy is replay, which is inspired by how the brain consolidates memory. Replay involves fine-tuning a network on a mixture of new and old instances. While there is neuroscientific evidence that the brain replays compressed memories, existing methods for convolutio…
▽ More
People learn throughout life. However, incrementally updating conventional neural networks leads to catastrophic forgetting. A common remedy is replay, which is inspired by how the brain consolidates memory. Replay involves fine-tuning a network on a mixture of new and old instances. While there is neuroscientific evidence that the brain replays compressed memories, existing methods for convolutional networks replay raw images. Here, we propose REMIND, a brain-inspired approach that enables efficient replay with compressed representations. REMIND is trained in an online manner, meaning it learns one example at a time, which is closer to how humans learn. Under the same constraints, REMIND outperforms other methods for incremental class learning on the ImageNet ILSVRC-2012 dataset. We probe REMIND's robustness to data ordering schemes known to induce catastrophic forgetting. We demonstrate REMIND's generality by pioneering online learning for Visual Question Answering (VQA).
△ Less
Submitted 13 July, 2020; v1 submitted 6 October, 2019;
originally announced October 2019.
-
Answering Questions about Data Visualizations using Efficient Bimodal Fusion
Authors:
Kushal Kafle,
Robik Shrestha,
Brian Price,
Scott Cohen,
Christopher Kanan
Abstract:
Chart question answering (CQA) is a newly proposed visual question answering (VQA) task where an algorithm must answer questions about data visualizations, e.g. bar charts, pie charts, and line graphs. CQA requires capabilities that natural-image VQA algorithms lack: fine-grained measurements, optical character recognition, and handling out-of-vocabulary words in both questions and answers. Withou…
▽ More
Chart question answering (CQA) is a newly proposed visual question answering (VQA) task where an algorithm must answer questions about data visualizations, e.g. bar charts, pie charts, and line graphs. CQA requires capabilities that natural-image VQA algorithms lack: fine-grained measurements, optical character recognition, and handling out-of-vocabulary words in both questions and answers. Without modifications, state-of-the-art VQA algorithms perform poorly on this task. Here, we propose a novel CQA algorithm called parallel recurrent fusion of image and language (PReFIL). PReFIL first learns bimodal embeddings by fusing question and image features and then intelligently aggregates these learned embeddings to answer the given question. Despite its simplicity, PReFIL greatly surpasses state-of-the art systems and human baselines on both the FigureQA and DVQA datasets. Additionally, we demonstrate that PReFIL can be used to reconstruct tables by asking a series of questions about a chart.
△ Less
Submitted 22 July, 2020; v1 submitted 5 August, 2019;
originally announced August 2019.
-
Helion: Enabling a Natural Perspective of Home Automation
Authors:
Sunil Manandhar,
Kevin Moran,
Kaushal Kafle,
Ruhao Tang,
Denys Poshyvanyk,
Adwait Nadkarni
Abstract:
Security researchers have recently discovered significant security and safety issues related to home automation and developed approaches to address them. Such approaches often face design and evaluation challenges which arise from their restricted perspective of home automation that is bounded by the IoT apps they analyze. The challenges of past work can be overcome by relying on a deeper understa…
▽ More
Security researchers have recently discovered significant security and safety issues related to home automation and developed approaches to address them. Such approaches often face design and evaluation challenges which arise from their restricted perspective of home automation that is bounded by the IoT apps they analyze. The challenges of past work can be overcome by relying on a deeper understanding of realistic home automation usage. More specifically, the availability of natural home automation scenarios, i.e., sequences of home automation events that may realistically occur in an end-user's home, could help security researchers design better security/safety systems. This paper presents Helion, a framework for building a natural perspective of home automation. Helion identifies the regularities in user-driven home automation, i.e., from user-driven routines that are increasingly being created by users through intuitive platform UIs. Our intuition for designing Helion is that smart home event sequences created by users exhibit an inherent set of semantic patterns, or naturalness that can be modeled and used to generate valid and useful scenarios. To evaluate our approach, we first empirically demonstrate that this naturalness hypothesis holds, with a corpus of 30,518 home automation events, constructed from 273 routines collected from 40 users. We then demonstrate that the scenarios generated by Helion are reasonable and valid from an end-user perspective, through an evaluation with 16 external evaluators. We further show the usefulness of Helion's scenarios by generating 17 home security/safety policies with significantly less effort than existing approaches. We conclude by discussing key takeaways and future research challenges enabled by Helion's natural perspective of home automation.
△ Less
Submitted 28 June, 2019;
originally announced July 2019.
-
Challenges and Prospects in Vision and Language Research
Authors:
Kushal Kafle,
Robik Shrestha,
Christopher Kanan
Abstract:
Language grounded image understanding tasks have often been proposed as a method for evaluating progress in artificial intelligence. Ideally, these tasks should test a plethora of capabilities that integrate computer vision, reasoning, and natural language understanding. However, rather than behaving as visual Turing tests, recent studies have demonstrated state-of-the-art systems are achieving go…
▽ More
Language grounded image understanding tasks have often been proposed as a method for evaluating progress in artificial intelligence. Ideally, these tasks should test a plethora of capabilities that integrate computer vision, reasoning, and natural language understanding. However, rather than behaving as visual Turing tests, recent studies have demonstrated state-of-the-art systems are achieving good performance through flaws in datasets and evaluation procedures. We review the current state of affairs and outline a path forward.
△ Less
Submitted 24 May, 2019; v1 submitted 19 April, 2019;
originally announced April 2019.
-
Answer Them All! Toward Universal Visual Question Answering Models
Authors:
Robik Shrestha,
Kushal Kafle,
Christopher Kanan
Abstract:
Visual Question Answering (VQA) research is split into two camps: the first focuses on VQA datasets that require natural image understanding and the second focuses on synthetic datasets that test reasoning. A good VQA algorithm should be capable of both, but only a few VQA algorithms are tested in this manner. We compare five state-of-the-art VQA algorithms across eight VQA datasets covering both…
▽ More
Visual Question Answering (VQA) research is split into two camps: the first focuses on VQA datasets that require natural image understanding and the second focuses on synthetic datasets that test reasoning. A good VQA algorithm should be capable of both, but only a few VQA algorithms are tested in this manner. We compare five state-of-the-art VQA algorithms across eight VQA datasets covering both domains. To make the comparison fair, all of the models are standardized as much as possible, e.g., they use the same visual features, answer vocabularies, etc. We find that methods do not generalize across the two domains. To address this problem, we propose a new VQA algorithm that rivals or exceeds the state-of-the-art for both domains.
△ Less
Submitted 5 April, 2019; v1 submitted 1 March, 2019;
originally announced March 2019.
-
A Study of Data Store-based Home Automation
Authors:
Kaushal Kafle,
Kevin Moran,
Sunil Manandhar,
Adwait Nadkarni,
Denys Poshyvanyk
Abstract:
Home automation platforms provide a new level of convenience by enabling consumers to automate various aspects of physical objects in their homes. While the convenience is beneficial, security flaws in the platforms or integrated third-party products can have serious consequences for the integrity of a user's physical environment. In this paper we perform a systematic security evaluation of two po…
▽ More
Home automation platforms provide a new level of convenience by enabling consumers to automate various aspects of physical objects in their homes. While the convenience is beneficial, security flaws in the platforms or integrated third-party products can have serious consequences for the integrity of a user's physical environment. In this paper we perform a systematic security evaluation of two popular smart home platforms, Google's Nest platform and Philips Hue, that implement home automation "routines" (i.e., trigger-action programs involving apps and devices) via manipulation of state variables in a centralized data store. Our semi-automated analysis examines, among other things, platform access control enforcement, the rigor of non-system enforcement procedures, and the potential for misuse of routines. This analysis results in ten key findings with serious security implications. For instance, we demonstrate the potential for the misuse of smart home routines in the Nest platform to perform a lateral privilege escalation, illustrate how Nest's product review system is ineffective at preventing multiple stages of this attack that it examines, and demonstrate how emerging platforms may fail to provide even bare-minimum security by allowing apps to arbitrarily add/remove other apps from the user's smart home. Our findings draw attention to the unique security challenges of platforms that execute routines via centralized data stores and highlight the importance of enforcing security by design in emerging home automation platforms.
△ Less
Submitted 4 December, 2018;
originally announced December 2018.
-
TallyQA: Answering Complex Counting Questions
Authors:
Manoj Acharya,
Kushal Kafle,
Christopher Kanan
Abstract:
Most counting questions in visual question answering (VQA) datasets are simple and require no more than object detection. Here, we study algorithms for complex counting questions that involve relationships between objects, attribute identification, reasoning, and more. To do this, we created TallyQA, the world's largest dataset for open-ended counting. We propose a new algorithm for counting that…
▽ More
Most counting questions in visual question answering (VQA) datasets are simple and require no more than object detection. Here, we study algorithms for complex counting questions that involve relationships between objects, attribute identification, reasoning, and more. To do this, we created TallyQA, the world's largest dataset for open-ended counting. We propose a new algorithm for counting that uses relation networks with region proposals. Our method lets relation networks be efficiently used with high-resolution imagery. It yields state-of-the-art results compared to baseline and recent systems on both TallyQA and the HowMany-QA benchmark.
△ Less
Submitted 31 October, 2018; v1 submitted 29 October, 2018;
originally announced October 2018.
-
Discovering Flaws in Security-Focused Static Analysis Tools for Android using Systematic Mutation
Authors:
Richard Bonett,
Kaushal Kafle,
Kevin Moran,
Adwait Nadkarni,
Denys Poshyvanyk
Abstract:
Mobile application security has been one of the major areas of security research in the last decade. Numerous application analysis tools have been proposed in response to malicious, curious, or vulnerable apps. However, existing tools, and specifically, static analysis tools, trade soundness of the analysis for precision and performance, and are hence soundy. Unfortunately, the specific unsound ch…
▽ More
Mobile application security has been one of the major areas of security research in the last decade. Numerous application analysis tools have been proposed in response to malicious, curious, or vulnerable apps. However, existing tools, and specifically, static analysis tools, trade soundness of the analysis for precision and performance, and are hence soundy. Unfortunately, the specific unsound choices or flaws in the design of these tools are often not known or well-documented, leading to a misplaced confidence among researchers, developers, and users. This paper proposes the Mutation-based soundness evaluation ($μ$SE) framework, which systematically evaluates Android static analysis tools to discover, document, and fix, flaws, by leveraging the well-founded practice of mutation analysis. We implement $μ$SE as a semi-automated framework, and apply it to a set of prominent Android static analysis tools that detect private data leaks in apps. As the result of an in-depth analysis of one of the major tools, we discover 13 undocumented flaws. More importantly, we discover that all 13 flaws propagate to tools that inherit the flawed tool. We successfully fix one of the flaws in cooperation with the tool developers. Our results motivate the urgent need for systematic discovery and documentation of unsound choices in soundy tools, and demonstrate the opportunities in leveraging mutation testing in achieving this goal.
△ Less
Submitted 27 June, 2018; v1 submitted 25 June, 2018;
originally announced June 2018.
-
DVQA: Understanding Data Visualizations via Question Answering
Authors:
Kushal Kafle,
Brian Price,
Scott Cohen,
Christopher Kanan
Abstract:
Bar charts are an effective way to convey numeric information, but today's algorithms cannot parse them. Existing methods fail when faced with even minor variations in appearance. Here, we present DVQA, a dataset that tests many aspects of bar chart understanding in a question answering framework. Unlike visual question answering (VQA), DVQA requires processing words and answers that are unique to…
▽ More
Bar charts are an effective way to convey numeric information, but today's algorithms cannot parse them. Existing methods fail when faced with even minor variations in appearance. Here, we present DVQA, a dataset that tests many aspects of bar chart understanding in a question answering framework. Unlike visual question answering (VQA), DVQA requires processing words and answers that are unique to a particular bar chart. State-of-the-art VQA algorithms perform poorly on DVQA, and we propose two strong baselines that perform considerably better. Our work will enable algorithms to automatically extract numeric and semantic information from vast quantities of bar charts found in scientific publications, Internet articles, business reports, and many other areas.
△ Less
Submitted 29 March, 2018; v1 submitted 24 January, 2018;
originally announced January 2018.
-
An Analysis of Visual Question Answering Algorithms
Authors:
Kushal Kafle,
Christopher Kanan
Abstract:
In visual question answering (VQA), an algorithm must answer text-based questions about images. While multiple datasets for VQA have been created since late 2014, they all have flaws in both their content and the way algorithms are evaluated on them. As a result, evaluation scores are inflated and predominantly determined by answering easier questions, making it difficult to compare different meth…
▽ More
In visual question answering (VQA), an algorithm must answer text-based questions about images. While multiple datasets for VQA have been created since late 2014, they all have flaws in both their content and the way algorithms are evaluated on them. As a result, evaluation scores are inflated and predominantly determined by answering easier questions, making it difficult to compare different methods. In this paper, we analyze existing VQA algorithms using a new dataset. It contains over 1.6 million questions organized into 12 different categories. We also introduce questions that are meaningless for a given image to force a VQA system to reason about image content. We propose new evaluation schemes that compensate for over-represented question-types and make it easier to study the strengths and weaknesses of algorithms. We analyze the performance of both baseline and state-of-the-art VQA models, including multi-modal compact bilinear pooling (MCB), neural module networks, and recurrent answering units. Our experiments establish how attention helps certain categories more than others, determine which models work better than others, and explain how simple models (e.g. MLP) can surpass more complex models (MCB) by simply learning to answer large, easy question categories.
△ Less
Submitted 13 September, 2017; v1 submitted 28 March, 2017;
originally announced March 2017.
-
Visual Question Answering: Datasets, Algorithms, and Future Challenges
Authors:
Kushal Kafle,
Christopher Kanan
Abstract:
Visual Question Answering (VQA) is a recent problem in computer vision and natural language processing that has garnered a large amount of interest from the deep learning, computer vision, and natural language processing communities. In VQA, an algorithm needs to answer text-based questions about images. Since the release of the first VQA dataset in 2014, additional datasets have been released and…
▽ More
Visual Question Answering (VQA) is a recent problem in computer vision and natural language processing that has garnered a large amount of interest from the deep learning, computer vision, and natural language processing communities. In VQA, an algorithm needs to answer text-based questions about images. Since the release of the first VQA dataset in 2014, additional datasets have been released and many algorithms have been proposed. In this review, we critically examine the current state of VQA in terms of problem formulation, existing datasets, evaluation metrics, and algorithms. In particular, we discuss the limitations of current datasets with regard to their ability to properly train and assess VQA algorithms. We then exhaustively review existing algorithms for VQA. Finally, we discuss possible future directions for VQA and image understanding research.
△ Less
Submitted 14 June, 2017; v1 submitted 5 October, 2016;
originally announced October 2016.