-
GameChat: Multi-LLM Dialogue for Safe, Agile, and Socially Optimal Multi-Agent Navigation in Constrained Environments
Authors:
Vagul Mahadevan,
Shangtong Zhang,
Rohan Chandra
Abstract:
Safe, agile, and socially compliant multi-robot navigation in cluttered and constrained environments remains a critical challenge. This is especially difficult with self-interested agents in decentralized settings, where there is no central authority to resolve conflicts induced by spatial symmetry. We address this challenge by proposing a novel approach, GameChat, which facilitates safe, agile, and deadlock-free navigation for both cooperative and self-interested agents. Key to our approach is the use of natural language communication to resolve conflicts, enabling agents to prioritize more urgent tasks and break spatial symmetry in a socially optimal manner. Our algorithm ensures subgame perfect equilibrium, preventing agents from deviating from agreed-upon behaviors and supporting cooperation. Furthermore, we guarantee safety through control barrier functions and preserve agility by minimizing disruptions to agents' planned trajectories. We evaluate GameChat in simulated environments with doorways and intersections. The results show that even in the worst case, GameChat reduces the time for all agents to reach their goals by over 35% from a naive baseline and by over 20% from SMG-CBF in the intersection scenario, while doubling the rate of ensuring the agent with a higher priority task reaches the goal first, from 50% (equivalent to random chance) to a 100% perfect performance at maximizing social welfare.
Submitted 15 March, 2025;
originally announced March 2025.
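The GameChat abstract above states that safety is guaranteed through control barrier functions (CBFs) but does not give the formulation. As a rough illustration of the general idea only, the sketch below filters a nominal control for a one-dimensional single integrator so that a position bound is never crossed. The dynamics, the barrier `h(x) = x_max - x`, the gain `alpha`, and all function names are assumptions made for this example, not the paper's method.

```python
def cbf_filter(u_nominal: float, x: float, x_max: float, alpha: float = 1.0) -> float:
    """Return the control closest to u_nominal that keeps h(x) = x_max - x >= 0.

    Assumes single-integrator dynamics x' = u, so h' = -u and the barrier
    condition h' >= -alpha * h reduces to u <= alpha * (x_max - x).
    """
    u_upper = alpha * (x_max - x)
    return min(u_nominal, u_upper)


def simulate(x0: float, u_nominal: float, x_max: float,
             dt: float = 0.01, steps: int = 1000) -> float:
    """Integrate x' = u with forward Euler, applying the CBF filter at every step."""
    x = x0
    for _ in range(steps):
        x += dt * cbf_filter(u_nominal, x, x_max)
    return x


if __name__ == "__main__":
    # The nominal control pushes toward the wall at x_max = 1; the filter slows it down.
    print(simulate(x0=0.0, u_nominal=2.0, x_max=1.0))
```

The filter leaves the nominal control untouched whenever it already satisfies the barrier condition, which mirrors the abstract's stated goal of minimizing disruptions to planned trajectories.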
-
DocKD: Knowledge Distillation from LLMs for Open-World Document Understanding Models
Authors:
Sungnyun Kim,
Haofu Liao,
Srikar Appalaraju,
Peng Tang,
Zhuowen Tu,
Ravi Kumar Satzoda,
R. Manmatha,
Vijay Mahadevan,
Stefano Soatto
Abstract:
Visual document understanding (VDU) is a challenging task that involves understanding documents across various modalities (text and image) and layouts (forms, tables, etc.). This study aims to enhance the generalizability of small VDU models by distilling knowledge from LLMs. We identify that directly prompting LLMs often fails to generate informative and useful data. In response, we present a new framework (called DocKD) that enriches the data generation process by integrating external document knowledge. Specifically, we provide an LLM with various document elements like key-value pairs, layouts, and descriptions, to elicit open-ended answers. Our experiments show that DocKD produces high-quality document annotations and surpasses the direct knowledge distillation approach that does not leverage external document knowledge. Moreover, student VDU models trained solely with DocKD-generated data are not only comparable to those trained with human-annotated data on in-domain tasks but also significantly outperform them on out-of-domain tasks.
Submitted 3 October, 2024;
originally announced October 2024.
-
RiskSEA: A Scalable Graph Embedding for Detecting On-chain Fraudulent Activities on the Ethereum Blockchain
Authors:
Ayush Agarwal,
Lv Lu,
Arjun Maheswaran,
Varsha Mahadevan,
Bhaskar Krishnamachari
Abstract:
Like any other useful technology, cryptocurrencies are sometimes used for criminal activities. While transactions are recorded on the blockchain, there exists a need for a more rapid and scalable method to detect addresses associated with fraudulent activities. We present RiskSEA, a scalable risk scoring system capable of effectively handling the dynamic nature of large-scale blockchain transaction graphs. The risk scoring system, which we implement for Ethereum, consists of (1) a scalable approach to generating node2vec embeddings for the entire set of addresses to capture the graph topology, (2) transaction-based features to capture the transactional behavioral pattern of an address, and (3) a classifier model that combines the node2vec embeddings and behavioral features to generate a risk score for each address. Because efficiently generating node2vec embeddings for large-scale and dynamically evolving blockchain transaction graphs is challenging, we present two novel approaches for generating node2vec embeddings and scaling them to the entire set of blockchain addresses: (1) node2vec embedding propagation and (2) dynamic node2vec embedding. We present a comprehensive analysis of the proposed approaches. Our experiments show that combining behavioral and node2vec features boosts classification performance significantly, and that the dynamic node2vec embeddings perform better than the node2vec propagated embeddings.
Submitted 2 October, 2024;
originally announced October 2024.
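The RiskSEA abstract above builds on node2vec embeddings, which are learned from biased second-order random walks over the graph. The sketch below shows only the standard walk-sampling step, not the propagation or dynamic variants the abstract proposes. It assumes an undirected, unweighted graph stored as a dictionary of neighbor sets; the function names and the toy graph are hypothetical.

```python
import random


def node2vec_step(graph, prev, curr, p=1.0, q=1.0, rng=random):
    """Sample the next node of a second-order node2vec walk.

    Unnormalized weights: 1/p to return to prev, 1 for a neighbor shared
    with prev, and 1/q for a node farther away from prev.
    """
    neighbors = sorted(graph[curr])
    weights = []
    for nxt in neighbors:
        if nxt == prev:
            weights.append(1.0 / p)
        elif nxt in graph[prev]:
            weights.append(1.0)
        else:
            weights.append(1.0 / q)
    return rng.choices(neighbors, weights=weights, k=1)[0]


def node2vec_walk(graph, start, length, p=1.0, q=1.0, seed=0):
    """Generate one walk of at most `length` nodes starting at `start`."""
    rng = random.Random(seed)
    walk = [start]
    while len(walk) < length:
        curr = walk[-1]
        if not graph[curr]:
            break  # dead end: no neighbors to move to
        if len(walk) == 1:
            # The first step has no previous node, so sample uniformly.
            walk.append(rng.choice(sorted(graph[curr])))
        else:
            walk.append(node2vec_step(graph, walk[-2], curr, p, q, rng))
    return walk


if __name__ == "__main__":
    toy_graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
    print(node2vec_walk(toy_graph, "a", 5, p=0.5, q=2.0))
```

In the usual pipeline, many such walks are fed to a skip-gram model to produce the embeddings; that training step is omitted here.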
-
Enhancing Vision-Language Pre-training with Rich Supervisions
Authors:
Yuan Gao,
Kunyu Shi,
Pengkai Zhu,
Edouard Belval,
Oren Nuriel,
Srikar Appalaraju,
Shabnam Ghadar,
Vijay Mahadevan,
Zhuowen Tu,
Stefano Soatto
Abstract:
We propose Strongly Supervised pre-training with ScreenShots (S4) - a novel pre-training paradigm for Vision-Language Models using data from large-scale web screenshot rendering. Using web screenshots unlocks a treasure trove of visual and textual cues that are not present when using image-text pairs. In S4, we leverage the inherent tree-structured hierarchy of HTML elements and their spatial localization to carefully design 10 pre-training tasks with large-scale annotated data. These tasks resemble downstream tasks across different domains, and the annotations are cheap to obtain. We demonstrate that, compared to current screenshot pre-training objectives, our pre-training method significantly enhances the performance of an image-to-text model on nine varied and popular downstream tasks - up to 76.1% improvement on Table Detection, and at least 1% on Widget Captioning.
Submitted 12 March, 2025; v1 submitted 5 March, 2024;
originally announced March 2024.
-
DEED: Dynamic Early Exit on Decoder for Accelerating Encoder-Decoder Transformer Models
Authors:
Peng Tang,
Pengkai Zhu,
Tian Li,
Srikar Appalaraju,
Vijay Mahadevan,
R. Manmatha
Abstract:
Encoder-decoder transformer models have achieved great success on various vision-language (VL) tasks, but they suffer from high inference latency. Typically, the decoder takes up most of the latency because of the auto-regressive decoding. To accelerate the inference, we propose an approach of performing Dynamic Early Exit on Decoder (DEED). We build a multi-exit encoder-decoder transformer model which is trained with deep supervision so that each of its decoder layers is capable of generating plausible predictions. In addition, we leverage simple yet practical techniques, including shared generation head and adaptation modules, to keep accuracy when exiting at shallow decoder layers. Based on the multi-exit model, we perform step-level dynamic early exit during inference, where the model may decide to use fewer decoder layers based on its confidence of the current layer at each individual decoding step. Considering different number of decoder layers may be used at different decoding steps, we compute deeper-layer decoder features of previous decoding steps just-in-time, which ensures the features from different decoding steps are semantically aligned. We evaluate our approach with two state-of-the-art encoder-decoder transformer models on various VL tasks. We show our approach can reduce overall inference latency by 30%-60% with comparable or even higher accuracy compared to baselines.
Submitted 14 November, 2023;
originally announced November 2023.
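The DEED abstract above describes step-level dynamic early exit, where the model may stop at a shallow decoder layer if it is confident enough at the current decoding step. A minimal sketch of that decision rule follows. It assumes each decoder layer already yields a probability distribution over the vocabulary, that confidence is the maximum probability, and that a fixed `threshold` triggers the exit; the abstract specifies none of these details, and the just-in-time recomputation of deeper-layer features is not modeled.

```python
def early_exit_step(layer_probs, threshold=0.9):
    """Pick a token for one decoding step, exiting at the first confident layer.

    layer_probs: list ordered from shallowest to deepest decoder layer; each
    item is a list of probabilities over the vocabulary.
    Returns (token_id, exit_layer_index).
    """
    if not layer_probs:
        raise ValueError("layer_probs must contain at least one layer")
    last = len(layer_probs) - 1
    for depth, probs in enumerate(layer_probs):
        confidence = max(probs)
        # Exit when confident, or fall back to the deepest layer.
        if confidence >= threshold or depth == last:
            return probs.index(confidence), depth


if __name__ == "__main__":
    step_outputs = [[0.5, 0.5], [0.05, 0.95], [0.1, 0.9]]
    print(early_exit_step(step_outputs, threshold=0.9))
```

A lower threshold exits earlier and saves more decoder layers per step, at the risk of accepting less reliable predictions.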
-
Multiple-Question Multiple-Answer Text-VQA
Authors:
Peng Tang,
Srikar Appalaraju,
R. Manmatha,
Yusheng Xie,
Vijay Mahadevan
Abstract:
We present Multiple-Question Multiple-Answer (MQMA), a novel approach to text-VQA in encoder-decoder transformer models. The text-VQA task requires a model to answer a question by understanding multi-modal content: text (typically from OCR) and an associated image. To the best of our knowledge, almost all previous approaches for text-VQA process a single question and its associated content to predict a single answer. To answer multiple questions about the same image, each question and the content are fed into the model separately, so the content is processed multiple times. In contrast, our proposed MQMA approach takes multiple questions and content as input at the encoder and predicts multiple answers at the decoder in an auto-regressive manner at the same time. We make several novel architectural modifications to standard encoder-decoder transformers to support MQMA. We also propose a novel MQMA denoising pre-training task which is designed to teach the model to align and delineate multiple questions and content with associated answers. The MQMA pre-trained model achieves state-of-the-art results on multiple text-VQA datasets, each with strong baselines. Specifically, it achieves absolute improvements over the previous state-of-the-art approaches on OCR-VQA (+2.5%), TextVQA (+1.4%), ST-VQA (+0.6%), and DocVQA (+1.1%).
Submitted 14 November, 2023;
originally announced November 2023.
-
DocTr: Document Transformer for Structured Information Extraction in Documents
Authors:
Haofu Liao,
Aruni RoyChowdhury,
Weijian Li,
Ankan Bansal,
Yuting Zhang,
Zhuowen Tu,
Ravi Kumar Satzoda,
R. Manmatha,
Vijay Mahadevan
Abstract:
We present a new formulation for structured information extraction (SIE) from visually rich documents. It aims to address the limitations of existing IOB tagging or graph-based formulations, which are either overly reliant on the correct ordering of input text or struggle with decoding a complex graph. Instead, motivated by anchor-based object detectors in vision, we represent an entity as an anchor word and a bounding box, and represent entity linking as the association between anchor words. This is more robust to text ordering, and maintains a compact graph for entity linking. The formulation motivates us to introduce 1) a DOCument TRansformer (DocTr) that aims at detecting and associating entity bounding boxes in visually rich documents, and 2) a simple pre-training strategy that helps learn entity detection in the context of language. Evaluations on three SIE benchmarks show the effectiveness of the proposed formulation, and the overall approach outperforms existing solutions.
Submitted 15 July, 2023;
originally announced July 2023.
-
PolyFormer: Referring Image Segmentation as Sequential Polygon Generation
Authors:
Jiang Liu,
Hui Ding,
Zhaowei Cai,
Yuting Zhang,
Ravi Kumar Satzoda,
Vijay Mahadevan,
R. Manmatha
Abstract:
In this work, instead of directly predicting the pixel-level segmentation masks, the problem of referring image segmentation is formulated as sequential polygon generation, and the predicted polygons can be later converted into segmentation masks. This is enabled by a new sequence-to-sequence framework, Polygon Transformer (PolyFormer), which takes a sequence of image patches and text query tokens as input, and outputs a sequence of polygon vertices autoregressively. For more accurate geometric localization, we propose a regression-based decoder, which predicts the precise floating-point coordinates directly, without any coordinate quantization error. In the experiments, PolyFormer outperforms the prior art by a clear margin, e.g., 5.40% and 4.52% absolute improvements on the challenging RefCOCO+ and RefCOCOg datasets. It also shows strong generalization ability when evaluated on the referring video segmentation task without fine-tuning, e.g., achieving competitive 61.5% J&F on the Ref-DAVIS17 dataset.
Submitted 27 March, 2023; v1 submitted 14 February, 2023;
originally announced February 2023.
-
Customizable Adaptive Regularization Techniques for B-Spline Modeling
Authors:
David Lenz,
Raine Yeh,
Vijay Mahadevan,
Iulian Grindeanu,
Tom Peterka
Abstract:
B-spline models are a powerful way to represent scientific data sets with a functional approximation. However, these models can suffer from spurious oscillations when the data to be approximated are not uniformly distributed. Model regularization (i.e., smoothing) has traditionally been used to minimize these oscillations; unfortunately, it is sometimes impossible to sufficiently remove unwanted artifacts without smoothing away key features of the data set. In this article, we present a method of model regularization that preserves significant features of a data set while minimizing artificial oscillations. Our method varies the strength of a smoothing parameter throughout the domain automatically, removing artifacts in poorly-constrained regions while leaving other regions unchanged. The proposed method selectively incorporates regularization terms based on first and second derivatives to maintain model accuracy while minimizing numerical artifacts. The behavior of our method is validated on a collection of two- and three-dimensional data sets produced by scientific simulations. In addition, a key tuning parameter is highlighted and the effects of this parameter are presented in detail. This paper is an extension of our previous conference paper at the 2022 International Conference on Computational Science (ICCS) [Lenz et al. 2022].
Submitted 3 January, 2023;
originally announced January 2023.
-
Parallel Domain Decomposition techniques applied to Multivariate Functional Approximation of discrete data
Authors:
Vijay S. Mahadevan,
David Lenz,
Iulian Grindeanu,
Thomas Peterka
Abstract:
Compactly expressing large-scale datasets through Multivariate Functional Approximations (MFA) can be critically important for analysis and visualization to drive scientific discovery. Tackling such problems requires scalable data partitioning approaches to compute MFA representations in amenable wall clock times. We introduce a fully parallel scheme to reduce the total work per task in combination with an overlapping additive Schwarz-based iterative scheme to compute MFA with a tensor expansion of B-spline bases, while preserving full degree continuity across subdomain boundaries. While previous work on MFA has proven effective, the computational complexity of encoding large datasets on a single process can be severely prohibitive. Parallel algorithms for generating reconstructions from the MFA have had to rely on post-processing techniques to blend discontinuities across subdomain boundaries. In contrast, a robust constrained minimization infrastructure to impose higher-order continuity directly on the MFA representation is presented here. We demonstrate the effectiveness of the parallel approach with domain decomposition solvers, to minimize the subdomain error residuals of the decoded MFA, and more specifically to recover continuity across non-matching boundaries at scale. An analysis of the presented scheme for analytical and scientific datasets in one, two, and three dimensions is presented. Extensive strong and weak scalability results are also demonstrated for large-scale datasets to evaluate the parallel speedup of the MPI-based algorithm implementation on leadership computing machines.
Submitted 12 October, 2022;
originally announced October 2022.
-
MATrIX -- Modality-Aware Transformer for Information eXtraction
Authors:
Thomas Delteil,
Edouard Belval,
Lei Chen,
Luis Goncalves,
Vijay Mahadevan
Abstract:
We present MATrIX - a Modality-Aware Transformer for Information eXtraction in the Visual Document Understanding (VDU) domain. VDU covers information extraction from visually rich documents such as forms, invoices, receipts, tables, graphs, presentations, or advertisements. In these, text semantics and visual information supplement each other to provide a global understanding of the document. MATrIX is pre-trained in an unsupervised way with specifically designed tasks that require the use of multi-modal information (spatial, visual, or textual). We consider the spatial and text modalities all at once in a single token set. To make the attention more flexible, we use a learned modality-aware relative bias in the attention mechanism to modulate the attention between the tokens of different modalities. We evaluate MATrIX on 3 different datasets each with strong baselines.
Submitted 17 May, 2022;
originally announced May 2022.
-
Towards Differential Relational Privacy and its use in Question Answering
Authors:
Simone Bombari,
Alessandro Achille,
Zijian Wang,
Yu-Xiang Wang,
Yusheng Xie,
Kunwar Yashraj Singh,
Srikar Appalaraju,
Vijay Mahadevan,
Stefano Soatto
Abstract:
Memorization of the relation between entities in a dataset can lead to privacy issues when using a trained model for question answering. We introduce Relational Memorization (RM) to understand, quantify and control this phenomenon. While bounding general memorization can have detrimental effects on the performance of a trained model, bounding RM does not prevent effective learning. The difference is most pronounced when the data distribution is long-tailed, with many queries having only few training examples: Impeding general memorization prevents effective learning, while impeding only relational memorization still allows learning general properties of the underlying concepts. We formalize the notion of Relational Privacy (RP) and, inspired by Differential Privacy (DP), we provide a possible definition of Differential Relational Privacy (DrP). These notions can be used to describe and compute bounds on the amount of RM in a trained model. We illustrate Relational Privacy concepts in experiments with large-scale models for Question Answering.
Submitted 30 March, 2022;
originally announced March 2022.
-
Adaptive Regularization of B-Spline Models for Scientific Data
Authors:
David Lenz,
Raine Yeh,
Vijay Mahadevan,
Iulian Grindeanu,
Tom Peterka
Abstract:
B-spline models are a powerful way to represent scientific data sets with a functional approximation. However, these models can suffer from spurious oscillations when the data to be approximated are not uniformly distributed. Model regularization (i.e., smoothing) has traditionally been used to minimize these oscillations; unfortunately, it is sometimes impossible to sufficiently remove unwanted artifacts without smoothing away key features of the data set. In this article, we present a method of model regularization that preserves significant features of a data set while minimizing artificial oscillations. Our method varies the strength of a smoothing parameter throughout the domain automatically, removing artifacts in poorly-constrained regions while leaving other regions unchanged. The behavior of our method is validated on a collection of two- and three-dimensional data sets produced by scientific simulations.
Submitted 23 March, 2022;
originally announced March 2022.
-
Contrastive Neighborhood Alignment
Authors:
Pengkai Zhu,
Zhaowei Cai,
Yuanjun Xiong,
Zhuowen Tu,
Luis Goncalves,
Vijay Mahadevan,
Stefano Soatto
Abstract:
We present Contrastive Neighborhood Alignment (CNA), a manifold learning approach to maintain the topology of learned features whereby data points that are mapped to nearby representations by the source (teacher) model are also mapped to neighbors by the target (student) model. The target model aims to mimic the local structure of the source representation space using a contrastive loss. CNA is an unsupervised learning algorithm that does not require ground-truth labels for the individual samples. CNA is illustrated in three scenarios: manifold learning, where the model maintains the local topology of the original data in a dimension-reduced space; model distillation, where a small student model is trained to mimic a larger teacher; and legacy model update, where an older model is replaced by a more powerful one. Experiments show that CNA is able to capture the manifold in a high-dimensional space and improves performance compared to the competing methods in their domains.
Submitted 5 January, 2022;
originally announced January 2022.
-
Visual Relationship Detection Using Part-and-Sum Transformers with Composite Queries
Authors:
Qi Dong,
Zhuowen Tu,
Haofu Liao,
Yuting Zhang,
Vijay Mahadevan,
Stefano Soatto
Abstract:
Computer vision applications such as visual relationship detection and human object interaction can be formulated as a composite (structured) set detection problem in which both the parts (subject, object, and predicate) and the sum (triplet as a whole) are to be detected in a hierarchical fashion. In this paper, we present a new approach, denoted Part-and-Sum detection Transformer (PST), to perform end-to-end visual composite set detection. Different from existing Transformers in which queries are at a single level, we simultaneously model the joint part and sum hypotheses/interactions with composite queries and attention modules. We explicitly incorporate sum queries to enable better modeling of the part-and-sum relations that are absent in the standard Transformers. Our approach also uses novel tensor-based part queries and vector-based sum queries, and models their joint interaction. We report experiments on two vision tasks, visual relationship detection and human object interaction and demonstrate that PST achieves state of the art results among single-stage models, while nearly matching the results of custom designed two-stage models.
Submitted 19 August, 2021; v1 submitted 5 May, 2021;
originally announced May 2021.
-
Fourier-Informed Knot Placement Schemes for B-Spline Approximation
Authors:
David Lenz,
Oana Marin,
Vijay Mahadevan,
Raine Yeh,
Tom Peterka
Abstract:
Fitting B-splines to discrete data is especially challenging when the given data contain noise, jumps, or corners. Here, we describe how periodic data sets with these features can be efficiently and robustly approximated with B-splines by analyzing the Fourier spectrum of the data. Our method uses a collection of spectral filters to produce different indicator functions that guide effective knot placement. In particular, we describe how spectral filters can be used to compute high-order derivatives, smoothed versions of noisy data, and the locations of jump discontinuities. Our knot placement method can combine one or more of these indicators to place knots that align with the qualitative features of the data, leading to accurate B-spline approximations without needing many knots. The method we introduce is direct and does not require any intermediate B-spline fitting before choosing the final knot vector. Aside from a fast Fourier transform to transfer to and from Fourier space, the method runs in linear time with very little communication. The method is applied to several test cases in one and two dimensions, including data sets with jump discontinuities and noise. These tests show that the method can fit discontinuous data without spurious oscillations and remains accurate in the presence of noise.
Submitted 7 December, 2020;
originally announced December 2020.
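The knot-placement abstract above notes that spectral filters can be used to compute derivatives of periodic data in Fourier space. The sketch below illustrates just that one ingredient, a first derivative obtained by multiplying each Fourier coefficient by `i*k`. It is not the paper's knot placement method: it assumes data sampled uniformly on `[0, 2*pi)` with an even number of points, and it uses a naive O(n^2) discrete Fourier transform in place of the fast Fourier transform the abstract mentions, purely to stay dependency-free. All function names are hypothetical.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive discrete Fourier transform (O(n^2)); stands in for an FFT."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(v * cmath.exp(sign * 2j * math.pi * j * k / n)
                    for j, v in enumerate(values))
        out.append(total / n if inverse else total)
    return out


def spectral_derivative(samples):
    """First derivative of 2*pi-periodic data sampled at x_j = 2*pi*j/n."""
    n = len(samples)
    if n % 2 != 0:
        raise ValueError("this sketch assumes an even number of samples")
    coeffs = dft(samples)
    scaled = []
    for k, c in enumerate(coeffs):
        # Map DFT index to a signed wavenumber; zero the Nyquist mode.
        wavenumber = k if k < n // 2 else k - n
        if k == n // 2:
            wavenumber = 0
        scaled.append(1j * wavenumber * c)
    return [z.real for z in dft(scaled, inverse=True)]


if __name__ == "__main__":
    n = 16
    xs = [2 * math.pi * j / n for j in range(n)]
    approx = spectral_derivative([math.sin(x) for x in xs])
    print(max(abs(a - math.cos(x)) for a, x in zip(approx, xs)))
```

For smooth periodic data such as `sin(x)`, the result matches the analytic derivative to near machine precision; near jumps or corners the raw spectral derivative oscillates, which is the kind of situation the filters described in the abstract are meant to address.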
-
LayoutTransformer: Layout Generation and Completion with Self-attention
Authors:
Kamal Gupta,
Justin Lazarow,
Alessandro Achille,
Larry Davis,
Vijay Mahadevan,
Abhinav Shrivastava
Abstract:
We address the problem of scene layout generation for diverse domains such as images, mobile applications, documents, and 3D objects. Most complex scenes, natural or human-designed, can be expressed as a meaningful arrangement of simpler compositional graphical primitives. Generating a new layout or extending an existing layout requires understanding the relationships between these primitives. To do this, we propose LayoutTransformer, a novel framework that leverages self-attention to learn contextual relationships between layout elements and generate novel layouts in a given domain. Our framework allows us to generate a new layout either from an empty set or from an initial seed set of primitives, and can easily scale to support an arbitrary number of primitives per layout. Furthermore, our analyses show that the model is able to automatically capture the semantic properties of the primitives. We propose simple improvements in both the representation of layout primitives and the training methods to demonstrate competitive performance in very diverse data domains such as object bounding boxes in natural images (COCO bounding box), documents (PubLayNet), mobile applications (RICO dataset), and 3D shapes (Part-Net). Code and other materials will be made available at https://kampta.github.io/layout.
Submitted 30 September, 2021; v1 submitted 25 June, 2020;
originally announced June 2020.
-
Toward Understanding Catastrophic Forgetting in Continual Learning
Authors:
Cuong V. Nguyen,
Alessandro Achille,
Michael Lam,
Tal Hassner,
Vijay Mahadevan,
Stefano Soatto
Abstract:
We study the relationship between catastrophic forgetting and properties of task sequences. In particular, given a sequence of tasks, we would like to understand which properties of this sequence influence the error rates of continual learning algorithms trained on the sequence. To this end, we propose a new procedure that makes use of recent developments in task space modeling as well as correlation analysis to specify and analyze the properties we are interested in. As an application, we apply our procedure to study two properties of a task sequence: (1) total complexity and (2) sequential heterogeneity. We show that error rates are strongly and positively correlated to a task sequence's total complexity for some state-of-the-art algorithms. We also show that, surprisingly, the error rates have no or even negative correlations in some cases to sequential heterogeneity. Our findings suggest directions for improving continual learning benchmarks and methods.
Submitted 2 August, 2019;
originally announced August 2019.
-
Deep Convolutional Neural Networks for Eigenvalue Problems in Mechanics
Authors:
David Finol,
Yan Lu,
Vijay Mahadevan,
Ankit Srivastava
Abstract:
We show that deep convolutional neural networks (CNN) can massively outperform traditional densely-connected neural networks (both deep and shallow) in predicting eigenvalue problems in mechanics. In this sense, we strike out in a new direction in mechanics computations with strongly predictive NNs whose success depends not only on architectures being deep, but also on their being fundamentally different from those widely used to date. We consider a model problem: predicting the eigenvalues of 1-D and 2-D phononic crystals. For the 1-D case, the optimal CNN architecture reaches 98% accuracy on unseen data when trained with just 20,000 samples, compared to 85% accuracy even with 100,000 samples for the typical network of choice in mechanics research. We show that, with relatively high data-efficiency, CNNs have the capability to generalize well and automatically learn deep symmetry operations, easily extending to higher dimensions and our 2-D case. Most importantly, we show how CNNs can naturally represent mechanical material tensors, with their convolution kernels serving as local receptive fields, which is a natural representation of mechanical response. The strategies proposed are applicable to other mechanics problems and may, in the future, be used to sidestep cumbersome algorithms with purely data-driven approaches based upon modern deep architectures.
Submitted 17 July, 2018; v1 submitted 17 January, 2018;
originally announced January 2018.