-
DeepSORT-Driven Visual Tracking Approach for Gesture Recognition in Interactive Systems
Authors:
Tong Zhang,
Fenghua Shao,
Runsheng Zhang,
Yifan Zhuang,
Liuqingqing Yang
Abstract:
Based on the DeepSORT algorithm, this study explores the application of visual tracking technology in intelligent human-computer interaction, especially in the field of gesture recognition and tracking. With the rapid development of artificial intelligence and deep learning technology, visual-based interaction has gradually replaced traditional input devices and become an important way for intelligent systems to interact with users. The DeepSORT algorithm can achieve accurate target tracking in dynamic environments by combining Kalman filters and deep learning feature extraction methods. It is especially suitable for complex scenes with multi-target tracking and fast movements. This study experimentally verifies the superior performance of DeepSORT in gesture recognition and tracking. It can accurately capture and track the user's gesture trajectory and is superior to traditional tracking methods in terms of real-time performance and accuracy. In addition, this study also combines gesture recognition experiments to evaluate the recognition ability and feedback response of the DeepSORT algorithm under different gestures (such as sliding, clicking, and zooming). The experimental results show that DeepSORT can not only effectively deal with target occlusion and motion blur but also track stably in a multi-target environment, achieving a smooth user interaction experience. Finally, this paper looks forward to the future development direction of intelligent human-computer interaction systems based on visual tracking and proposes future research focuses such as algorithm optimization, data fusion, and multimodal interaction in order to promote a more intelligent and personalized interactive experience.
Keywords: DeepSORT, visual tracking, gesture recognition, human-computer interaction
Submitted 11 May, 2025;
originally announced May 2025.
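Note on the Kalman filter component mentioned in the abstract above: the following is a minimal sketch of the predict/update cycle for a single coordinate under a constant-velocity motion model. It is not the paper's implementation. The function names (`make_cv_model`, `predict`, `update`, `track`) and the noise values are assumptions chosen for illustration, and a full DeepSORT tracker also uses a larger bounding-box state, appearance features, and data association, none of which are shown here.

```python
import numpy as np

def make_cv_model(dt=1.0, process_var=1e-2, meas_var=1.0):
    """Constant-velocity model for one coordinate: state = [position, velocity].
    The noise variances are illustrative assumptions, not values from the paper."""
    F = np.array([[1.0, dt], [0.0, 1.0]])  # state transition
    H = np.array([[1.0, 0.0]])             # only the position is observed
    Q = process_var * np.eye(2)            # process noise covariance
    R = np.array([[meas_var]])             # measurement noise covariance
    return F, H, Q, R

def predict(x, P, F, Q):
    """Propagate the state estimate and its covariance one time step."""
    return F @ x, F @ P @ F.T + Q

def update(x_pred, P_pred, z, H, R):
    """Correct the prediction with a new measurement z."""
    y = z - H @ x_pred                      # innovation
    S = H @ P_pred @ H.T + R                # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)     # Kalman gain
    x = x_pred + K @ y
    P = (np.eye(len(x_pred)) - K @ H) @ P_pred
    return x, P

def track(measurements, dt=1.0):
    """Filter a sequence of scalar position measurements."""
    F, H, Q, R = make_cv_model(dt)
    x = np.array([measurements[0], 0.0])    # start at first measurement, zero velocity
    P = np.eye(2) * 10.0                    # large initial uncertainty
    estimates = []
    for z in measurements[1:]:
        x, P = predict(x, P, F, Q)
        x, P = update(x, P, np.array([z]), H, R)
        estimates.append(x.copy())
    return np.array(estimates)
```

For a hand moving at a steady speed, `track` returns one `[position, velocity]` estimate per frame after the first, and the velocity estimate settles near the true speed.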
-
Can LLM feedback enhance review quality? A randomized study of 20K reviews at ICLR 2025
Authors:
Nitya Thakkar,
Mert Yuksekgonul,
Jake Silberg,
Animesh Garg,
Nanyun Peng,
Fei Sha,
Rose Yu,
Carl Vondrick,
James Zou
Abstract:
Peer review at AI conferences is stressed by rapidly rising submission volumes, leading to deteriorating review quality and increased author dissatisfaction. To address these issues, we developed Review Feedback Agent, a system leveraging multiple large language models (LLMs) to improve review clarity and actionability by providing automated feedback on vague comments, content misunderstandings, and unprofessional remarks to reviewers. Implemented at ICLR 2025 as a large randomized control study, our system provided optional feedback to more than 20,000 randomly selected reviews. To ensure high-quality feedback for reviewers at this scale, we also developed a suite of automated reliability tests powered by LLMs that acted as guardrails to ensure feedback quality, with feedback only being sent to reviewers if it passed all the tests. The results show that 27% of reviewers who received feedback updated their reviews, and over 12,000 feedback suggestions from the agent were incorporated by those reviewers. This suggests that many reviewers found the AI-generated feedback sufficiently helpful to merit updating their reviews. Incorporating AI feedback led to significantly longer reviews (an average increase of 80 words among those who updated after receiving feedback) and more informative reviews, as evaluated by blinded researchers. Moreover, reviewers who were selected to receive AI feedback were also more engaged during paper rebuttals, as seen in longer author-reviewer discussions. This work demonstrates that carefully designed LLM-generated review feedback can enhance peer review quality by making reviews more specific and actionable while increasing engagement between reviewers and authors. The Review Feedback Agent is publicly available at https://github.com/zou-group/review_feedback_agent.
Submitted 13 April, 2025;
originally announced April 2025.
-
Automated UI Interface Generation via Diffusion Models: Enhancing Personalization and Efficiency
Authors:
Yifei Duan,
Liuqingqing Yang,
Tong Zhang,
Zhijun Song,
Fenghua Shao
Abstract:
This study proposes a UI interface generation method based on a diffusion model, aiming to achieve high-quality, diversified, and personalized interface design through generative artificial intelligence technology. The diffusion model generates interfaces through a step-by-step denoising process. By combining the conditional generation mechanism, design optimization module, and user feedback mechanism, the model can generate a UI interface that meets the requirements based on multimodal inputs such as text descriptions and sketches provided by users. In the study, a complete experimental evaluation framework was designed, and mainstream generation models (such as GAN, VAE, and DALL-E) were selected for comparative experiments. The generation results were quantitatively analyzed using indicators such as PSNR, SSIM, and FID. The results show that the model proposed in this study is superior to other models in terms of generation quality and user satisfaction, especially in terms of logical clarity of information transmission and visual aesthetics. The ablation experiment further verifies the key role of conditional generation and design optimization modules in improving interface quality. This study provides a new technical path for UI design automation and lays the foundation for the intelligent and personalized development of human-computer interaction interfaces. In the future, the application potential of the model in virtual reality, game design, and other fields will be further explored.
Submitted 26 March, 2025;
originally announced March 2025.
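Note on the PSNR indicator mentioned in the abstract above: a minimal sketch of how peak signal-to-noise ratio is commonly computed between a reference image and a generated one, using PSNR = 10 log10(MAX^2 / MSE). The function name `psnr` and the default 8-bit peak value of 255 are assumptions for illustration. The abstract does not describe the paper's exact evaluation code.

```python
import numpy as np

def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape.
    max_value is the largest possible pixel value (255 assumed for 8-bit images)."""
    reference = np.asarray(reference, dtype=np.float64)
    generated = np.asarray(generated, dtype=np.float64)
    if reference.shape != generated.shape:
        raise ValueError("images must have the same shape")
    mse = np.mean((reference - generated) ** 2)  # mean squared error per pixel
    if mse == 0.0:
        return float("inf")                      # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

Higher values mean the generated image is closer to the reference. PSNR compares pixels only, which is presumably why the abstract pairs it with SSIM and FID.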
-
Bayesian Teaching Enables Probabilistic Reasoning in Large Language Models
Authors:
Linlu Qiu,
Fei Sha,
Kelsey Allen,
Yoon Kim,
Tal Linzen,
Sjoerd van Steenkiste
Abstract:
Artificial intelligence systems based on large language models (LLMs) are increasingly used as agents that interact with users and with the world. To do so successfully, LLMs need to construct internal representations of the world and form probabilistic beliefs about those representations. To provide a user with personalized recommendations, for example, the LLM needs to gradually infer the user's preferences, over the course of multiple interactions. To evaluate whether contemporary LLMs are able to do so, we use the Bayesian inference framework from probability theory, which lays out the optimal way to update an agent's beliefs as it receives new information. We first show that the LLMs do not update their beliefs as expected from the Bayesian framework, and that consequently their predictions do not improve as expected as more information becomes available, even less so than we find is the case for humans. To address this issue, we teach the LLMs to reason in a Bayesian manner by training them to mimic the predictions of an optimal Bayesian model. We find that this approach not only significantly improves the LLM's performance on the particular recommendation task it is trained on, but also enables generalization to other tasks. This suggests that this method endows the LLM with broader Bayesian reasoning skills. More generally, our results indicate that LLMs can learn about reasoning strategies effectively and generalize those skills to new domains, which in part explains LLMs' empirical success.
Submitted 21 March, 2025;
originally announced March 2025.
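Note on the Bayesian belief update referenced in the abstract above: a minimal sketch of Bayes' rule over a small discrete set of hypotheses about a user's preference. The two hypotheses and the likelihood values in the usage note are hypothetical, and the function names `bayes_update` and `sequential_update` are assumptions for illustration. This shows the kind of update an optimal Bayesian model performs, not the paper's training procedure.

```python
import numpy as np

def bayes_update(prior, likelihood):
    """Return the posterior over hypotheses: posterior is proportional to prior * likelihood."""
    unnormalized = np.asarray(prior, dtype=float) * np.asarray(likelihood, dtype=float)
    total = unnormalized.sum()
    if total == 0.0:
        raise ValueError("observation has zero probability under every hypothesis")
    return unnormalized / total

def sequential_update(prior, likelihoods):
    """Apply bayes_update once per observation, carrying the posterior forward as the next prior."""
    belief = np.asarray(prior, dtype=float)
    for likelihood in likelihoods:
        belief = bayes_update(belief, likelihood)
    return belief
```

As a usage example, suppose hypothesis 0 is "the user prefers cheaper options" and hypothesis 1 is "the user prefers faster options", with a uniform prior `[0.5, 0.5]`. If an observed choice has probability 0.8 under hypothesis 0 and 0.3 under hypothesis 1, `bayes_update([0.5, 0.5], [0.8, 0.3])` shifts the belief toward hypothesis 0, and repeating the same observation with `sequential_update` shifts it further.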
-
Graders should cheat: privileged information enables expert-level automated evaluations
Authors:
Jin Peng Zhou,
Sébastien M. R. Arnold,
Nan Ding,
Kilian Q. Weinberger,
Nan Hua,
Fei Sha
Abstract:
Auto-evaluating language models (LMs), i.e., using a grader LM to evaluate the candidate LM, is an appealing way to accelerate the evaluation process and reduce the cost associated with it. But this presents a paradox: how can we trust the grader LM, which is presumably weaker than the candidate LM, to assess problems that are beyond the frontier of the capabilities of either model or both? For instance, today's LMs struggle on graduate-level physics and Olympiad-level math, making them unreliable graders in these domains.
We show that providing privileged information -- such as ground-truth solutions or problem-specific guidelines -- improves automated evaluations on such frontier problems. This approach offers two key advantages. First, it expands the range of problems where LM graders apply. Specifically, weaker models can now rate the predictions of stronger models. Second, privileged information can be used to devise easier variations of challenging problems, which improves the separability of different LMs on tasks where their performance is generally low. With this approach, general-purpose LM graders match the state-of-the-art performance on RewardBench, surpassing almost all the specially-tuned models. LM graders also outperform individual human raters on Vibe-Eval, and approach human expert graders on Olympiad-level math problems.
Submitted 15 February, 2025;
originally announced February 2025.
-
AutoCBT: An Autonomous Multi-agent Framework for Cognitive Behavioral Therapy in Psychological Counseling
Authors:
Ancheng Xu,
Di Yang,
Renhao Li,
Jingwei Zhu,
Minghuan Tan,
Min Yang,
Wanxin Qiu,
Mingchen Ma,
Haihong Wu,
Bingyu Li,
Feng Sha,
Chengming Li,
Xiping Hu,
Qiang Qu,
Derek F. Wong,
Ruifeng Xu
Abstract:
Traditional in-person psychological counseling remains primarily niche, often chosen by individuals with psychological issues, while online automated counseling offers a potential solution for those hesitant to seek help due to feelings of shame. Cognitive Behavioral Therapy (CBT) is an essential and widely used approach in psychological counseling. The advent of large language models (LLMs) and agent technology enables automatic CBT diagnosis and treatment. However, current LLM-based CBT systems use agents with a fixed structure, limiting their self-optimization capabilities, or providing hollow, unhelpful suggestions due to redundant response patterns. In this work, we utilize Quora-like and YiXinLi single-round consultation models to build a general agent framework that generates high-quality responses for single-turn psychological consultation scenarios. We use a bilingual dataset to evaluate the quality of single-response consultations generated by each framework. Then, we incorporate dynamic routing and supervisory mechanisms inspired by real psychological counseling to construct a CBT-oriented autonomous multi-agent framework, demonstrating its general applicability. Experimental results indicate that AutoCBT can provide higher-quality automated psychological counseling services.
Submitted 16 January, 2025;
originally announced January 2025.
-
Computer Vision-Driven Gesture Recognition: Toward Natural and Intuitive Human-Computer
Authors:
Fenghua Shao,
Tong Zhang,
Shang Gao,
Qi Sun,
Liuqingqing Yang
Abstract:
This study mainly explores the application of natural gesture recognition based on computer vision in human-computer interaction, aiming to improve the fluency and naturalness of human-computer interaction through gesture recognition technology. In the fields of virtual reality, augmented reality and smart home, traditional input methods have gradually failed to meet the needs of users for interactive experience. As an intuitive and convenient interaction method, gestures have received more and more attention. This paper proposes a gesture recognition method based on a three-dimensional hand skeleton model. By simulating the three-dimensional spatial distribution of hand joints, a simplified hand skeleton structure is constructed. By connecting the palm and each finger joint, a dynamic and static gesture model of the hand is formed, which further improves the accuracy and efficiency of gesture recognition. Experimental results show that this method can effectively recognize various gestures and maintain high recognition accuracy and real-time response capabilities in different environments. In addition, combined with multimodal technologies such as eye tracking, the intelligence level of the gesture recognition system can be further improved, bringing a richer and more intuitive user experience. In the future, with the continuous development of computer vision, deep learning and multimodal interaction technology, natural interaction based on gestures will play an important role in a wider range of application scenarios and promote revolutionary progress in human-computer interaction.
Submitted 24 December, 2024;
originally announced December 2024.
-
Statistical Downscaling via High-Dimensional Distribution Matching with Generative Models
Authors:
Zhong Yi Wan,
Ignacio Lopez-Gomez,
Robert Carver,
Tapio Schneider,
John Anderson,
Fei Sha,
Leonardo Zepeda-Núñez
Abstract:
Statistical downscaling is a technique used in climate modeling to increase the resolution of climate simulations. High-resolution climate information is essential for various high-impact applications, including natural hazard risk assessment. However, simulating climate at high resolution is intractable. Thus, climate simulations are often conducted at a coarse scale and then downscaled to the desired resolution. Existing downscaling techniques are either simulation-based methods with high computational costs, or statistical approaches with limitations in accuracy or application specificity. We introduce Generative Bias Correction and Super-Resolution (GenBCSR), a two-stage probabilistic framework for statistical downscaling that overcomes the limitations of previous methods. GenBCSR employs two transformations to match high-dimensional distributions at different resolutions: (i) the first stage, bias correction, aligns the distributions at coarse scale, (ii) the second stage, statistical super-resolution, lifts the corrected coarse distribution by introducing fine-grained details. Each stage is instantiated by a state-of-the-art generative model, resulting in an efficient and effective computational pipeline for the well-studied distribution matching problem. By framing the downscaling problem as distribution matching, GenBCSR relaxes the constraints of supervised learning, which requires samples to be aligned. Despite not requiring such correspondence, we show that GenBCSR surpasses standard approaches in predictive accuracy of critical impact variables, particularly in predicting the tails (99th percentile) of composite indexes composed of interacting variables, achieving up to a 4-5 fold error reduction.
Submitted 10 December, 2024;
originally announced December 2024.
-
Diff4Steer: Steerable Diffusion Prior for Generative Music Retrieval with Semantic Guidance
Authors:
Xuchan Bao,
Judith Yue Li,
Zhong Yi Wan,
Kun Su,
Timo Denk,
Joonseok Lee,
Dima Kuzmin,
Fei Sha
Abstract:
Modern music retrieval systems often rely on fixed representations of user preferences, limiting their ability to capture users' diverse and uncertain retrieval needs. To address this limitation, we introduce Diff4Steer, a novel generative retrieval framework that employs lightweight diffusion models to synthesize diverse seed embeddings from user queries that represent potential directions for music exploration. Unlike deterministic methods that map user query to a single point in embedding space, Diff4Steer provides a statistical prior on the target modality (audio) for retrieval, effectively capturing the uncertainty and multi-faceted nature of user preferences. Furthermore, Diff4Steer can be steered by image or text inputs, enabling more flexible and controllable music discovery combined with nearest neighbor search. Our framework outperforms deterministic regression methods and LLM-based generative retrieval baseline in terms of retrieval and ranking metrics, demonstrating its effectiveness in capturing user preferences, leading to more diverse and relevant recommendations. Listening examples are available at tinyurl.com/diff4steer.
Submitted 5 December, 2024;
originally announced December 2024.
-
MICAS: Multi-grained In-Context Adaptive Sampling for 3D Point Cloud Processing
Authors:
Feifei Shao,
Ping Liu,
Zhao Wang,
Yawei Luo,
Hongwei Wang,
Jun Xiao
Abstract:
Point cloud processing (PCP) encompasses tasks like reconstruction, denoising, registration, and segmentation, each often requiring specialized models to address unique task characteristics. While in-context learning (ICL) has shown promise across tasks by using a single model with task-specific demonstration prompts, its application to PCP reveals significant limitations. We identify inter-task and intra-task sensitivity issues in current ICL methods for PCP, which we attribute to inflexible sampling strategies lacking context adaptation at the point and prompt levels. To address these challenges, we propose MICAS, an advanced ICL framework featuring a multi-grained adaptive sampling mechanism tailored for PCP. MICAS introduces two core components: task-adaptive point sampling, which leverages inter-task cues for point-level sampling, and query-specific prompt sampling, which selects optimal prompts per query to mitigate intra-task sensitivity. To our knowledge, this is the first approach to introduce adaptive sampling tailored to the unique requirements of point clouds within an ICL framework. Extensive experiments show that MICAS not only efficiently handles various PCP tasks but also significantly outperforms existing methods. Notably, it achieves a remarkable 4.1% improvement in the part segmentation task and delivers consistent gains across various PCP applications.
Submitted 27 November, 2024; v1 submitted 25 November, 2024;
originally announced November 2024.
-
Optimizing Gesture Recognition for Seamless UI Interaction Using Convolutional Neural Networks
Authors:
Qi Sun,
Tong Zhang,
Shang Gao,
Liuqingqing Yang,
Fenghua Shao
Abstract:
This study introduces an advanced gesture recognition and user interface (UI) interaction system powered by deep learning, highlighting its transformative impact on UI design and functionality. By utilizing optimized convolutional neural networks (CNNs), the system achieves high-precision gesture recognition, significantly improving user interactions with digital interfaces. The process begins with preprocessing collected gesture images to meet CNN input requirements, followed by sophisticated feature extraction and classification techniques. To address class imbalance, we employ Focal Loss as the loss function, ensuring robust model performance across diverse gesture types. Experimental results demonstrate notable improvements in model metrics, with the Area Under the Curve (AUC) and Recall metrics improving as we transition from simpler models like VGG16 to more advanced ones such as DenseNet. Our enhanced model achieves strong AUC and Recall values, outperforming standard benchmarks. Notably, the system's ability to support real-time and efficient gesture recognition paves the way for a new era in UI design, where intuitive user gestures can be seamlessly integrated into everyday technology use, reducing the learning curve and enhancing user satisfaction. The implications of this development extend beyond technical performance to fundamentally reshape user-technology interactions, underscoring the critical role of gesture-based interfaces in the next generation of UI development. Such advancements promise to significantly enhance smart life experiences, positioning gesture recognition as a key driver in the evolution of user-centric interfaces.
Submitted 23 November, 2024;
originally announced November 2024.
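Note on the Focal Loss mentioned in the abstract above: a minimal NumPy sketch of the standard multi-class focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), where p_t is the predicted probability of the true class. The function name `focal_loss`, the default `gamma=2.0`, and the optional per-class `alpha` weights are assumptions for illustration. The abstract does not state the hyperparameters the authors used.

```python
import numpy as np

def focal_loss(probs, labels, gamma=2.0, alpha=None):
    """Mean focal loss over a batch.
    probs:  (n_samples, n_classes) predicted class probabilities.
    labels: (n_samples,) integer class indices.
    gamma:  focusing parameter; gamma = 0 recovers plain cross-entropy.
    alpha:  optional (n_classes,) per-class weights for class imbalance."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    p_t = probs[np.arange(len(labels)), labels]   # probability assigned to the true class
    p_t = np.clip(p_t, 1e-12, 1.0)                # avoid log(0)
    loss = -((1.0 - p_t) ** gamma) * np.log(p_t)  # down-weight well-classified samples
    if alpha is not None:
        loss = np.asarray(alpha, dtype=float)[labels] * loss
    return float(loss.mean())
```

The factor (1 - p_t)^gamma shrinks the contribution of samples the model already classifies confidently, so training focuses on hard examples, which are often the under-represented gesture classes.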
-
No-Reference Point Cloud Quality Assessment via Graph Convolutional Network
Authors:
Wu Chen,
Qiuping Jiang,
Wei Zhou,
Feng Shao,
Guangtao Zhai,
Weisi Lin
Abstract:
Three-dimensional (3D) point cloud, as an emerging visual media format, is increasingly favored by consumers as it can provide more realistic visual information than two-dimensional (2D) data. Similar to 2D plane images and videos, point clouds inevitably suffer from quality degradation and information loss through multimedia communication systems. Therefore, automatic point cloud quality assessment (PCQA) is of critical importance. In this work, we propose a novel no-reference PCQA method by using a graph convolutional network (GCN) to characterize the mutual dependencies of multi-view 2D projected image contents. The proposed GCN-based PCQA (GC-PCQA) method contains three modules, i.e., multi-view projection, graph construction, and GCN-based quality prediction. First, multi-view projection is performed on the test point cloud to obtain a set of horizontally and vertically projected images. Then, a perception-consistent graph is constructed based on the spatial relations among different projected images. Finally, reasoning on the constructed graph is performed by GCN to characterize the mutual dependencies and interactions between different projected images, and aggregate feature information of multi-view projected images for final quality prediction. Experimental results on two publicly available benchmark databases show that our proposed GC-PCQA achieves better performance than state-of-the-art quality assessment metrics. The code will be available at: https://github.com/chenwuwq/GC-PCQA.
Submitted 12 November, 2024;
originally announced November 2024.
-
Dynamical-generative downscaling of climate model ensembles
Authors:
Ignacio Lopez-Gomez,
Zhong Yi Wan,
Leonardo Zepeda-Núñez,
Tapio Schneider,
John Anderson,
Fei Sha
Abstract:
Regional high-resolution climate projections are crucial for many applications, such as agriculture, hydrology, and natural hazard risk assessment. Dynamical downscaling, the state-of-the-art method to produce localized future climate information, involves running a regional climate model (RCM) driven by an Earth System Model (ESM), but it is too computationally expensive to apply to large climate projection ensembles. We propose a novel approach combining dynamical downscaling with generative artificial intelligence to reduce the cost and improve the uncertainty estimates of downscaled climate projections. In our framework, an RCM dynamically downscales ESM output to an intermediate resolution, followed by a generative diffusion model that further refines the resolution to the target scale. This approach leverages the generalizability of physics-based models and the sampling efficiency of diffusion models, enabling the downscaling of large multi-model ensembles. We evaluate our method against dynamically-downscaled climate projections from the CMIP6 ensemble. Our results demonstrate its ability to provide more accurate uncertainty bounds on future regional climate than alternatives such as dynamical downscaling of smaller ensembles, or traditional empirical statistical downscaling methods. We also show that dynamical-generative downscaling results in significantly lower errors than bias correction and spatial disaggregation (BCSD), and captures more accurately the spectra and multivariate correlations of meteorological fields. These characteristics make the dynamical-generative framework a flexible, accurate, and efficient way to downscale large ensembles of climate projections, currently out of reach for pure dynamical downscaling.
Submitted 2 October, 2024;
originally announced October 2024.
-
Generative AI for fast and accurate statistical computation of fluids
Authors:
Roberto Molinaro,
Samuel Lanthaler,
Bogdan Raonić,
Tobias Rohner,
Victor Armegioiu,
Stephan Simonis,
Dana Grund,
Yannick Ramic,
Zhong Yi Wan,
Fei Sha,
Siddhartha Mishra,
Leonardo Zepeda-Núñez
Abstract:
We present a generative AI algorithm for addressing the pressing task of fast, accurate, and robust statistical computation of three-dimensional turbulent fluid flows. Our algorithm, termed as GenCFD, is based on an end-to-end conditional score-based diffusion model. Through extensive numerical experimentation with a set of challenging fluid flows, we demonstrate that GenCFD provides an accurate approximation of relevant statistical quantities of interest while also efficiently generating high-quality realistic samples of turbulent fluid flows and ensuring excellent spectral resolution. In contrast, ensembles of deterministic ML algorithms, trained to minimize mean square errors, regress to the mean flow. We present rigorous theoretical results uncovering the surprising mechanisms through which diffusion models accurately generate fluid flows. These mechanisms are illustrated with solvable toy models that exhibit the mathematically relevant features of turbulent fluid flows while being amenable to explicit analytical formulae. Our codes are publicly available at https://github.com/camlab-ethz/GenCFD.
Submitted 2 February, 2025; v1 submitted 26 September, 2024;
originally announced September 2024.
-
Rational-WENO: A lightweight, physically-consistent three-point weighted essentially non-oscillatory scheme
Authors:
Shantanu Shahane,
Sheide Chammas,
Deniz A. Bezgin,
Aaron B. Buhendwa,
Steffen J. Schmidt,
Nikolaus A. Adams,
Spencer H. Bryngelson,
Yi-Fan Chen,
Qing Wang,
Fei Sha,
Leonardo Zepeda-Núñez
Abstract:
Conventional WENO3 methods are known to be highly dissipative at lower resolutions, introducing significant errors in the pre-asymptotic regime. In this paper, we employ a rational neural network to accurately estimate the local smoothness of the solution, dynamically adapting the stencil weights based on local solution features. As rational neural networks can represent fast transitions between smooth and sharp regimes, this approach achieves a granular reconstruction with significantly reduced dissipation, improving the accuracy of the simulation. The network is trained offline on a carefully chosen dataset of analytical functions, bypassing the need for differentiable solvers. We also propose a robust model selection criterion based on estimates of the interpolation's convergence order on a set of test functions, which correlates better with the model performance in downstream tasks. We demonstrate the effectiveness of our approach on several one-, two-, and three-dimensional fluid flow problems: our scheme generalizes across grid resolutions while handling smooth and discontinuous solutions. In most cases, our rational network-based scheme achieves higher accuracy than conventional WENO3 with the same stencil size, and in a few of them, it achieves accuracy comparable to WENO5, which uses a larger stencil.
Submitted 13 September, 2024;
originally announced September 2024.
-
A probabilistic framework for learning non-intrusive corrections to long-time climate simulations from short-time training data
Authors:
Benedikt Barthel Sorensen,
Leonardo Zepeda-Núñez,
Ignacio Lopez-Gomez,
Zhong Yi Wan,
Rob Carver,
Fei Sha,
Themistoklis Sapsis
Abstract:
Chaotic systems, such as turbulent flows, are ubiquitous in science and engineering. However, their study remains a challenge due to the large range of scales, and the strong interaction with other, often not fully understood, physics. As a consequence, the spatiotemporal resolution required for accurate simulation of these systems is typically computationally infeasible, particularly for applications of long-term risk assessment, such as the quantification of extreme weather risk due to climate change. While data-driven modeling offers some promise of alleviating these obstacles, the scarcity of high-quality simulations results in limited available data to train such models, which is often compounded by the lack of stability for long-horizon simulations. As such, the computational, algorithmic, and data restrictions generally imply that the probability of rare extreme events is not accurately captured. In this work we present a general strategy for training neural network models to non-intrusively correct under-resolved long-time simulations of chaotic systems. The approach is based on training a post-processing correction operator on under-resolved simulations nudged towards a high-fidelity reference. This enables us to learn the dynamics of the underlying system directly, which allows us to use very little training data, even when the statistics thereof are far from converged. Additionally, through the use of probabilistic network architectures we are able to leverage the uncertainty due to the limited training data to further improve extrapolation capabilities. We apply our framework to severely under-resolved simulations of quasi-geostrophic flow and demonstrate its ability to accurately predict the anisotropic statistics over time horizons more than 30 times longer than the data seen in training.
Submitted 22 November, 2024; v1 submitted 2 August, 2024;
originally announced August 2024.
-
DySLIM: Dynamics Stable Learning by Invariant Measure for Chaotic Systems
Authors:
Yair Schiff,
Zhong Yi Wan,
Jeffrey B. Parker,
Stephan Hoyer,
Volodymyr Kuleshov,
Fei Sha,
Leonardo Zepeda-Núñez
Abstract:
Learning dynamics from dissipative chaotic systems is notoriously difficult due to their inherent instability, as formalized by their positive Lyapunov exponents, which exponentially amplify errors in the learned dynamics. However, many of these systems exhibit ergodicity and an attractor: a compact and highly complex manifold, to which trajectories converge in finite-time, that supports an invariant measure, i.e., a probability distribution that is invariant under the action of the dynamics, which dictates the long-term statistical behavior of the system. In this work, we leverage this structure to propose a new framework that targets learning the invariant measure as well as the dynamics, in contrast with typical methods that only target the misfit between trajectories, which often leads to divergence as the trajectories' length increases. We use our framework to propose a tractable and sample efficient objective that can be used with any existing learning objectives. Our Dynamics Stable Learning by Invariant Measure (DySLIM) objective enables model training that achieves better point-wise tracking and long-term statistical accuracy relative to other learning objectives. By targeting the distribution with a scalable regularization term, we hope that this approach can be extended to more complex systems exhibiting slowly-variant distributions, such as weather and climate models.
Submitted 5 June, 2024; v1 submitted 6 February, 2024;
originally announced February 2024.
-
NODLINK: An Online System for Fine-Grained APT Attack Detection and Investigation
Authors:
Shaofei Li,
Feng Dong,
Xusheng Xiao,
Haoyu Wang,
Fei Shao,
Jiedong Chen,
Yao Guo,
Xiangqun Chen,
Ding Li
Abstract:
Advanced Persistent Threats (APT) attacks have plagued modern enterprises, causing significant financial losses. To counter these attacks, researchers propose techniques that capture the complex and stealthy scenarios of APT attacks by using provenance graphs to model system entities and their dependencies. Particularly, to accelerate attack detection and reduce financial losses, online provenance-based detection systems that detect and investigate APT attacks under the constraints of timeliness and limited resources are in dire need. Unfortunately, existing online systems usually sacrifice detection granularity to reduce computational complexity and produce provenance graphs with more than 100,000 nodes, posing challenges for security admins to interpret the detection results. In this paper, we design and implement NodLink, the first online detection system that maintains high detection accuracy without sacrificing detection granularity. Our insight is that the APT attack detection process in online provenance-based detection systems can be modeled as a Steiner Tree Problem (STP), which has efficient online approximation algorithms that recover concise attack-related provenance graphs with a theoretically bounded error. To utilize STP approximation algorithm frameworks for APT attack detection, we propose a novel design of in-memory cache, an efficient attack screening method, and a new STP approximation algorithm that is more efficient than the conventional one in APT attack detection while maintaining the same complexity. We evaluate NodLink in a production environment. The open-world experiment shows that NodLink outperforms two state-of-the-art (SOTA) online provenance analysis systems by achieving magnitudes higher detection and investigation accuracy while having the same or higher throughput.
Submitted 4 November, 2023;
originally announced November 2023.
-
A Systematic Comparison of Syllogistic Reasoning in Humans and Language Models
Authors:
Tiwalayo Eisape,
MH Tessler,
Ishita Dasgupta,
Fei Sha,
Sjoerd van Steenkiste,
Tal Linzen
Abstract:
A central component of rational behavior is logical inference: the process of determining which conclusions follow from a set of premises. Psychologists have documented several ways in which humans' inferences deviate from the rules of logic. Do language models, which are trained on text generated by humans, replicate such human biases, or are they able to overcome them? Focusing on the case of syllogisms -- inferences from two simple premises -- we show that, within the PaLM2 family of transformer language models, larger models are more logical than smaller ones, and also more logical than humans. At the same time, even the largest models make systematic errors, some of which mirror human reasoning biases: they show sensitivity to the (irrelevant) ordering of the variables in the syllogism, and draw confident but incorrect inferences from particular syllogisms (syllogistic fallacies). Overall, we find that language models often mimic the human biases included in their training data, but are able to overcome them in some cases.
Submitted 11 April, 2024; v1 submitted 1 November, 2023;
originally announced November 2023.
-
The Impact of Depth on Compositional Generalization in Transformer Language Models
Authors:
Jackson Petty,
Sjoerd van Steenkiste,
Ishita Dasgupta,
Fei Sha,
Dan Garrette,
Tal Linzen
Abstract:
To process novel sentences, language models (LMs) must generalize compositionally -- combine familiar elements in new ways. What aspects of a model's structure promote compositional generalization? Focusing on transformers, we test the hypothesis, motivated by theoretical and empirical work, that deeper transformers generalize more compositionally. Simply adding layers increases the total number of parameters; to address this confound between depth and size, we construct three classes of models which trade off depth for width such that the total number of parameters is kept constant (41M, 134M and 374M parameters). We pretrain all models as LMs and fine-tune them on tasks that test for compositional generalization. We report three main conclusions: (1) after fine-tuning, deeper models generalize more compositionally than shallower models do, but the benefit of additional layers diminishes rapidly; (2) within each family, deeper models show better language modeling performance, but returns are similarly diminishing; (3) the benefits of depth for compositional generalization cannot be attributed solely to better performance on language modeling. Because model latency is approximately linear in the number of layers, these results lead us to the recommendation that, with a given total parameter budget, transformers can be made shallower than is typical without sacrificing performance.
Submitted 10 April, 2024; v1 submitted 30 October, 2023;
originally announced October 2023.
-
WeatherBench 2: A benchmark for the next generation of data-driven global weather models
Authors:
Stephan Rasp,
Stephan Hoyer,
Alexander Merose,
Ian Langmore,
Peter Battaglia,
Tyler Russel,
Alvaro Sanchez-Gonzalez,
Vivian Yang,
Rob Carver,
Shreya Agrawal,
Matthew Chantry,
Zied Ben Bouallegue,
Peter Dueben,
Carla Bromberg,
Jared Sisk,
Luke Barrington,
Aaron Bell,
Fei Sha
Abstract:
WeatherBench 2 is an update to the global, medium-range (1-14 day) weather forecasting benchmark proposed by Rasp et al. (2020), designed with the aim to accelerate progress in data-driven weather modeling. WeatherBench 2 consists of an open-source evaluation framework, publicly available training, ground truth and baseline data as well as a continuously updated website with the latest metrics and state-of-the-art models: https://sites.research.google/weatherbench. This paper describes the design principles of the evaluation framework and presents results for current state-of-the-art physical and data-driven weather models. The metrics are based on established practices for evaluating weather forecasts at leading operational weather centers. We define a set of headline scores to provide an overview of model performance. In addition, we also discuss caveats in the current evaluation setup and challenges for the future of data-driven weather forecasting.
Submitted 26 January, 2024; v1 submitted 29 August, 2023;
originally announced August 2023.
-
Fine-grained Text-Video Retrieval with Frozen Image Encoders
Authors:
Zuozhuo Dai,
Fangtao Shao,
Qingkun Su,
Zilong Dong,
Siyu Zhu
Abstract:
State-of-the-art text-video retrieval (TVR) methods typically utilize CLIP and cosine similarity for efficient retrieval. Meanwhile, cross attention methods, which employ a transformer decoder to compute attention between each text query and all frames in a video, offer a more comprehensive interaction between text and videos. However, these methods lack important fine-grained spatial information as they directly compute attention between text and video-level tokens. To address this issue, we propose CrossTVR, a two-stage text-video retrieval architecture. In the first stage, we leverage existing TVR methods with cosine similarity network for efficient text/video candidate selection. In the second stage, we propose a novel decoupled video text cross attention module to capture fine-grained multimodal information in spatial and temporal dimensions. Additionally, we employ the frozen CLIP model strategy in fine-grained retrieval, enabling scalability to larger pre-trained vision models like ViT-G, resulting in improved retrieval performance. Experiments on text video retrieval datasets demonstrate the effectiveness and scalability of our proposed CrossTVR compared to state-of-the-art approaches.
Submitted 13 July, 2023;
originally announced July 2023.
-
SEEDS: Emulation of Weather Forecast Ensembles with Diffusion Models
Authors:
Lizao Li,
Rob Carver,
Ignacio Lopez-Gomez,
Fei Sha,
John Anderson
Abstract:
Uncertainty quantification is crucial to decision-making. A prominent example is probabilistic forecasting in numerical weather prediction. The dominant approach to representing uncertainty in weather forecasting is to generate an ensemble of forecasts. This is done by running many physics-based simulations under different conditions, which is a computationally costly process. We propose to amortize the computational cost by emulating these forecasts with deep generative diffusion models learned from historical data. The learned models are highly scalable with respect to high-performance computing accelerators and can sample hundreds to tens of thousands of realistic weather forecasts at low cost. When designed to emulate operational ensemble forecasts, the generated ones are similar to physics-based ensembles in important statistical properties and predictive skill. When designed to correct biases present in the operational forecasting system, the generated ensembles show improved probabilistic forecast metrics. They are more reliable and forecast probabilities of extreme weather events more accurately. While this work demonstrates the utility of the methodology by focusing on weather forecasting, the generative artificial intelligence methodology can be extended for uncertainty quantification in climate modeling, where we believe the generation of very large ensembles of climate projections will play an increasingly important role in climate risk assessment.
Submitted 8 October, 2023; v1 submitted 24 June, 2023;
originally announced June 2023.
-
Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories
Authors:
Thomas Mensink,
Jasper Uijlings,
Lluis Castrejon,
Arushi Goel,
Felipe Cadar,
Howard Zhou,
Fei Sha,
André Araujo,
Vittorio Ferrari
Abstract:
We propose Encyclopedic-VQA, a large scale visual question answering (VQA) dataset featuring visual questions about detailed properties of fine-grained categories and instances. It contains 221k unique question+answer pairs each matched with (up to) 5 images, resulting in a total of 1M VQA samples. Moreover, our dataset comes with a controlled knowledge base derived from Wikipedia, marking the evidence to support each answer. Empirically, we show that our dataset poses a hard challenge for large vision+language models as they perform poorly on our dataset: PaLI [14] is state-of-the-art on OK-VQA [37], yet it only achieves 13.0% accuracy on our dataset. Moreover, we experimentally show that progress on answering our encyclopedic questions can be achieved by augmenting large models with a mechanism that retrieves relevant information from the knowledge base. An oracle experiment with perfect retrieval achieves 87.0% accuracy on the single-hop portion of our dataset, and an automatic retrieval-augmented prototype yields 48.8%. We believe that our dataset enables future research on retrieval-augmented vision+language models. It is available at https://github.com/google-research/google-research/tree/master/encyclopedic_vqa .
Submitted 24 July, 2023; v1 submitted 15 June, 2023;
originally announced June 2023.
-
User-defined Event Sampling and Uncertainty Quantification in Diffusion Models for Physical Dynamical Systems
Authors:
Marc Finzi,
Anudhyan Boral,
Andrew Gordon Wilson,
Fei Sha,
Leonardo Zepeda-Núñez
Abstract:
Diffusion models are a class of probabilistic generative models that have been widely used as a prior for image processing tasks like text conditional generation and inpainting. We demonstrate that these models can be adapted to make predictions and provide uncertainty quantification for chaotic dynamical systems. In these applications, diffusion models can implicitly represent knowledge about outliers and extreme events; however, querying that knowledge through conditional sampling or measuring probabilities is surprisingly difficult. Existing methods for conditional sampling at inference time seek mainly to enforce the constraints, which is insufficient to match the statistics of the distribution or compute the probability of the chosen events. To achieve these ends, optimally one would use the conditional score function, but its computation is typically intractable. In this work, we develop a probabilistic approximation scheme for the conditional score function which provably converges to the true distribution as the noise level decreases. With this scheme we are able to sample conditionally on nonlinear user-defined events at inference time, and match data statistics even when sampling from the tails of the distribution.
Submitted 12 June, 2023;
originally announced June 2023.
-
Neural Ideal Large Eddy Simulation: Modeling Turbulence with Neural Stochastic Differential Equations
Authors:
Anudhyan Boral,
Zhong Yi Wan,
Leonardo Zepeda-Núñez,
James Lottes,
Qing Wang,
Yi-fan Chen,
John Roberts Anderson,
Fei Sha
Abstract:
We introduce a data-driven learning framework that assimilates two powerful ideas: ideal large eddy simulation (LES) from turbulence closure modeling and neural stochastic differential equations (SDE) for stochastic modeling. The ideal LES models the LES flow by treating each full-order trajectory as a random realization of the underlying dynamics; as such, the effect of small scales is marginalized to obtain the deterministic evolution of the LES state. However, ideal LES is analytically intractable. In our work, we use a latent neural SDE to model the evolution of the stochastic process and an encoder-decoder pair for transforming between the latent space and the desired ideal flow field. This stands in sharp contrast to other types of neural parameterization of closure models where each trajectory is treated as a deterministic realization of the dynamics. We show the effectiveness of our approach (niLES - neural ideal LES) on a challenging chaotic dynamical system: Kolmogorov flow at a Reynolds number of 20,000. Compared to competing methods, our method can handle non-uniform geometries using unstructured meshes seamlessly. In particular, niLES leads to trajectories with more accurate statistics and enhances stability, particularly for long-horizon rollouts.
Submitted 1 June, 2023;
originally announced June 2023.
-
Debias Coarsely, Sample Conditionally: Statistical Downscaling through Optimal Transport and Probabilistic Diffusion Models
Authors:
Zhong Yi Wan,
Ricardo Baptista,
Yi-fan Chen,
John Anderson,
Anudhyan Boral,
Fei Sha,
Leonardo Zepeda-Núñez
Abstract:
We introduce a two-stage probabilistic framework for statistical downscaling using unpaired data. Statistical downscaling seeks a probabilistic map to transform low-resolution data from a biased coarse-grained numerical scheme to high-resolution data that is consistent with a high-fidelity scheme. Our framework tackles the problem by composing two transformations: (i) a debiasing step via an optimal transport map, and (ii) an upsampling step achieved by a probabilistic diffusion model with a posteriori conditional sampling. This approach characterizes a conditional distribution without needing paired data, and faithfully recovers relevant physical statistics from biased samples. We demonstrate the utility of the proposed approach on one- and two-dimensional fluid flow problems, which are representative of the core difficulties present in numerical simulations of weather and climate. Our method produces realistic high-resolution outputs from low-resolution inputs, by upsampling resolutions of 8x and 16x. Moreover, our procedure correctly matches the statistics of physical quantities, even when the low-frequency content of the inputs and outputs does not match, a crucial but difficult-to-satisfy assumption needed by current state-of-the-art alternatives. Code for this work is available at: https://github.com/google-research/swirl-dynamics/tree/main/swirl_dynamics/projects/probabilistic_diffusion.
Submitted 30 October, 2023; v1 submitted 24 May, 2023;
originally announced May 2023.
-
Counterfactual Co-occurring Learning for Bias Mitigation in Weakly-supervised Object Localization
Authors:
Feifei Shao,
Yawei Luo,
Lei Chen,
Ping Liu,
Wei Yang,
Yi Yang,
Jun Xiao
Abstract:
Contemporary weakly-supervised object localization (WSOL) methods have primarily focused on addressing the challenge of localizing the most discriminative region while largely overlooking the relatively less explored issue of biased activation -- incorrectly spotlighting co-occurring background with the foreground feature. In this paper, we conduct a thorough causal analysis to investigate the origins of biased activation. Based on our analysis, we attribute this phenomenon to the presence of co-occurring background confounders. Building upon this profound insight, we introduce a pioneering paradigm known as Counterfactual Co-occurring Learning (CCL), meticulously engendering counterfactual representations by adeptly disentangling the foreground from the co-occurring background elements. Furthermore, we propose an innovative network architecture known as Counterfactual-CAM. This architecture seamlessly incorporates a perturbation mechanism for counterfactual representations into the vanilla CAM-based model. By training the WSOL model with these perturbed representations, we guide the model to prioritize the consistent foreground content while concurrently reducing the influence of distracting co-occurring backgrounds. To the best of our knowledge, this study represents the initial exploration of this research direction. Our extensive experiments conducted across multiple benchmarks validate the effectiveness of the proposed Counterfactual-CAM in mitigating biased activation.
Submitted 9 March, 2024; v1 submitted 24 May, 2023;
originally announced May 2023.
-
V2Meow: Meowing to the Visual Beat via Video-to-Music Generation
Authors:
Kun Su,
Judith Yue Li,
Qingqing Huang,
Dima Kuzmin,
Joonseok Lee,
Chris Donahue,
Fei Sha,
Aren Jansen,
Yu Wang,
Mauro Verzetti,
Timo I. Denk
Abstract:
Video-to-music generation demands both a temporally localized high-quality listening experience and globally aligned video-acoustic signatures. While recent music generation models excel at the former through advanced audio codecs, the exploration of video-acoustic signatures has been confined to specific visual scenarios. In contrast, our research confronts the challenge of learning globally aligned signatures between video and music directly from paired music and videos, without explicitly modeling domain-specific rhythmic or semantic relationships. We propose V2Meow, a video-to-music generation system capable of producing high-quality music audio for a diverse range of video input types using a multi-stage autoregressive model. Trained on 5k hours of music audio clips paired with video frames mined from in-the-wild music videos, V2Meow is competitive with previous domain-specific models when evaluated in a zero-shot manner. It synthesizes high-fidelity music audio waveforms solely by conditioning on pre-trained general-purpose visual features extracted from video frames, with optional style control via text prompts. Through both qualitative and quantitative evaluations, we demonstrate that our model outperforms various existing music generation systems in terms of visual-audio correspondence and audio quality. Music samples are available at tinyurl.com/v2meow.
Submitted 22 February, 2024; v1 submitted 11 May, 2023;
originally announced May 2023.
-
Policy-Induced Self-Supervision Improves Representation Finetuning in Visual RL
Authors:
Sébastien M. R. Arnold,
Fei Sha
Abstract:
We study how to transfer representations pretrained on source tasks to target tasks in visual percept based RL. We analyze two popular approaches: freezing or finetuning the pretrained representations. Empirical studies on a set of popular tasks reveal several properties of pretrained representations. First, finetuning is required even when pretrained representations perfectly capture the information required to solve the target task. Second, finetuned representations improve learnability and are more robust to noise. Third, pretrained bottom layers are task-agnostic and readily transferable to new tasks, while top layers encode task-specific information and require adaptation. Building on these insights, we propose a self-supervised objective that clusters representations according to the policy they induce, as opposed to traditional representation similarity measures which are policy-agnostic (e.g. Euclidean norm, cosine similarity). Together with freezing the bottom layers, this objective results in significantly better representation than frozen, finetuned, and self-supervised alternatives on a wide range of benchmarks.
Submitted 12 February, 2023;
originally announced February 2023.
-
Pre-computed memory or on-the-fly encoding? A hybrid approach to retrieval augmentation makes the most of your compute
Authors:
Michiel de Jong,
Yury Zemlyanskiy,
Nicholas FitzGerald,
Joshua Ainslie,
Sumit Sanghai,
Fei Sha,
William Cohen
Abstract:
Retrieval-augmented language models such as Fusion-in-Decoder are powerful, setting the state of the art on a variety of knowledge-intensive tasks. However, they are also expensive, due to the need to encode a large number of retrieved passages. Some work avoids this cost by pre-encoding a text corpus into a memory and retrieving dense representations directly. However, pre-encoding memory incurs a severe quality penalty as the memory representations are not conditioned on the current input. We propose LUMEN, a hybrid between these two extremes, pre-computing the majority of the retrieval representation and completing the encoding on the fly using a live encoder that is conditioned on the question and fine-tuned for the task. We show that LUMEN significantly outperforms pure memory on multiple question-answering tasks while being much cheaper than FiD, and outperforms both for any given compute budget. Moreover, the advantage of LUMEN over FiD increases with model size.
Submitted 2 June, 2023; v1 submitted 25 January, 2023;
originally announced January 2023.
-
Evolve Smoothly, Fit Consistently: Learning Smooth Latent Dynamics For Advection-Dominated Systems
Authors:
Zhong Yi Wan,
Leonardo Zepeda-Núñez,
Anudhyan Boral,
Fei Sha
Abstract:
We present a data-driven, space-time continuous framework to learn surrogate models for complex physical systems described by advection-dominated partial differential equations. Those systems have slow-decaying Kolmogorov n-width that hinders standard methods, including reduced order modeling, from producing high-fidelity simulations at low cost. In this work, we construct hypernetwork-based latent dynamical models directly on the parameter space of a compact representation network. We leverage the expressive power of the network and a specially designed consistency-inducing regularization to obtain latent trajectories that are both low-dimensional and smooth. These properties render our surrogate models highly efficient at inference time. We show the efficacy of our framework by learning models that generate accurate multi-step rollout predictions at much faster inference speed compared to competitors, for several challenging examples.
Submitted 6 February, 2023; v1 submitted 24 January, 2023;
originally announced January 2023.
-
Towards Robust Video Instance Segmentation with Temporal-Aware Transformer
Authors:
Zhenghao Zhang,
Fangtao Shao,
Zuozhuo Dai,
Siyu Zhu
Abstract:
Most existing transformer-based video instance segmentation methods extract per-frame features independently, which makes it challenging to handle appearance deformation. In this paper, we observe that temporal information is important as well, and we propose TAFormer to aggregate spatio-temporal features in both the transformer encoder and decoder. Specifically, in the transformer encoder, we propose a novel spatio-temporal joint multi-scale deformable attention module which dynamically integrates the spatial and temporal information to obtain enriched spatio-temporal features. In the transformer decoder, we introduce a temporal self-attention module to enhance the frame-level box queries with the temporal relation. Moreover, TAFormer adopts an instance-level contrastive loss to increase the discriminability of instance query embeddings, which reduces the tracking error caused by visually similar instances. Experimental results show that TAFormer effectively leverages the spatial and temporal information to obtain context-aware feature representation and outperforms state-of-the-art methods.
Submitted 20 January, 2023;
originally announced January 2023.
-
Knowledge-guided Causal Intervention for Weakly-supervised Object Localization
Authors:
Feifei Shao,
Yawei Luo,
Fei Gao,
Yi Yang,
Jun Xiao
Abstract:
Previous weakly-supervised object localization (WSOL) methods aim to expand activation map discriminative areas to cover the whole objects, yet neglect two inherent challenges when relying solely on image-level labels. First, the "entangled context" issue arises from object-context co-occurrence (e.g., fish and water), making the model inspection hard to distinguish object boundaries clearly. Second, the "C-L dilemma" issue results from the information decay caused by the pooling layers, which struggle to retain both the semantic information for precise classification and those essential details for accurate localization, leading to a trade-off in performance. In this paper, we propose a knowledge-guided causal intervention method, dubbed KG-CI-CAM, to address these two under-explored issues in one go. More specifically, we tackle the co-occurrence context confounder problem via causal intervention, which explores the causalities among image features, contexts, and categories to eliminate the biased object-context entanglement in the class activation maps. Based on the disentangled object feature, we introduce a multi-source knowledge guidance framework to strike a balance between absorbing classification knowledge and localization knowledge during model training. Extensive experiments conducted on several benchmark datasets demonstrate the effectiveness of KG-CI-CAM in learning distinct object boundaries amidst confounding contexts and mitigating the dilemma between classification and localization performance.
Submitted 12 March, 2024; v1 submitted 3 January, 2023;
originally announced January 2023.
-
FiDO: Fusion-in-Decoder optimized for stronger performance and faster inference
Authors:
Michiel de Jong,
Yury Zemlyanskiy,
Joshua Ainslie,
Nicholas FitzGerald,
Sumit Sanghai,
Fei Sha,
William Cohen
Abstract:
Fusion-in-Decoder (FiD) is a powerful retrieval-augmented language model that sets the state-of-the-art on many knowledge-intensive NLP tasks. However, the architecture used for FiD was chosen by making minimal modifications to a standard T5 model, which our analysis shows to be highly suboptimal for a retrieval-augmented model. In particular, FiD allocates the bulk of FLOPs to the encoder, while the majority of inference time results from memory bandwidth constraints in the decoder. We propose two simple changes to the FiD architecture to alleviate memory bandwidth constraints, and speed up inference by 7x. This allows us to use a much larger decoder at modest cost. We denote FiD with the above modifications as FiDO, and show that it strongly improves performance over existing FiD models for a wide range of inference budgets. For example, FiDO-Large-XXL performs faster inference than FiD-Base and achieves better performance than FiD-Large.
Submitted 2 June, 2023; v1 submitted 15 December, 2022;
originally announced December 2022.
-
Generate-and-Retrieve: use your predictions to improve retrieval for semantic parsing
Authors:
Yury Zemlyanskiy,
Michiel de Jong,
Joshua Ainslie,
Panupong Pasupat,
Peter Shaw,
Linlu Qiu,
Sumit Sanghai,
Fei Sha
Abstract:
A common recent approach to semantic parsing augments sequence-to-sequence models by retrieving and appending a set of training samples, called exemplars. The effectiveness of this recipe is limited by the ability to retrieve informative exemplars that help produce the correct parse, which is especially challenging in low-resource settings. Existing retrieval is commonly based on similarity of query and exemplar inputs. We propose GandR, a retrieval procedure that retrieves exemplars for which outputs are also similar. GandR first generates a preliminary prediction with input-based retrieval. Then, it retrieves exemplars with outputs similar to the preliminary prediction which are used to generate a final prediction. GandR sets the state of the art on multiple low-resource semantic parsing tasks.
Submitted 29 September, 2022;
originally announced September 2022.
-
Quality Evaluation of Arbitrary Style Transfer: Subjective Study and Objective Metric
Authors:
Hangwei Chen,
Feng Shao,
Xiongli Chai,
Yuese Gu,
Qiuping Jiang,
Xiangchao Meng,
Yo-Sung Ho
Abstract:
Arbitrary neural style transfer is a vital topic with great research value and wide industrial application, which strives to render the structure of one image using the style of another. Recent research has devoted great effort to the task of arbitrary style transfer (AST) to improve stylization quality. However, there has been very little exploration of the quality evaluation of AST images, even though it can potentially guide the design of different algorithms. In this paper, we first construct a new AST image quality assessment database (AST-IQAD), which consists of 150 content-style image pairs and the corresponding 1200 stylized images produced by eight typical AST algorithms. Then, a subjective study is conducted on our AST-IQAD database, which obtains the subjective rating scores of all stylized images on three subjective evaluations, i.e., content preservation (CP), style resemblance (SR), and overall vision (OV). To quantitatively measure the quality of AST images, we propose a new sparse representation-based method, which computes the quality according to the sparse feature similarity. Experimental results on our AST-IQAD have demonstrated the superiority of the proposed method. The dataset and source code will be released at https://github.com/Hangwei-Chen/AST-IQAD-SRQE
Submitted 29 January, 2023; v1 submitted 1 August, 2022;
originally announced August 2022.
-
ALMA: Hierarchical Learning for Composite Multi-Agent Tasks
Authors:
Shariq Iqbal,
Robby Costales,
Fei Sha
Abstract:
Despite significant progress on multi-agent reinforcement learning (MARL) in recent years, coordination in complex domains remains a challenge. Work in MARL often focuses on solving tasks where agents interact with all other agents and entities in the environment; however, we observe that real-world tasks are often composed of several isolated instances of local agent interactions (subtasks), and each agent can meaningfully focus on one subtask to the exclusion of all else in the environment. In these composite tasks, successful policies can often be decomposed into two levels of decision-making: agents are allocated to specific subtasks and each agent acts productively towards their assigned subtask alone. This decomposed decision making provides a strong structural inductive bias, significantly reduces agent observation spaces, and encourages subtask-specific policies to be reused and composed during training, as opposed to treating each new composition of subtasks as unique. We introduce ALMA, a general learning method for taking advantage of these structured tasks. ALMA simultaneously learns a high-level subtask allocation policy and low-level agent policies. We demonstrate that ALMA learns sophisticated coordination behavior in a number of challenging environments, outperforming strong baselines. ALMA's modularity also enables it to better generalize to new environment configurations. Finally, we find that while ALMA can integrate separately trained allocation and action policies, the best performance is obtained only by training all components jointly. Our code is available at https://github.com/shariqiqbal2810/ALMA
Submitted 25 September, 2022; v1 submitted 27 May, 2022;
originally announced May 2022.
-
Evaluating the Impact of Model Scale for Compositional Generalization in Semantic Parsing
Authors:
Linlu Qiu,
Peter Shaw,
Panupong Pasupat,
Tianze Shi,
Jonathan Herzig,
Emily Pitler,
Fei Sha,
Kristina Toutanova
Abstract:
Despite their strong performance on many tasks, pre-trained language models have been shown to struggle on out-of-distribution compositional generalization. Meanwhile, recent work has shown considerable improvements on many NLP tasks from model scaling. Can scaling up model size also improve compositional generalization in semantic parsing? We evaluate encoder-decoder models up to 11B parameters and decoder-only models up to 540B parameters, and compare model scaling curves for three different methods for applying a pre-trained language model to a new task: fine-tuning all parameters, prompt tuning, and in-context learning. We observe that fine-tuning generally has flat or negative scaling curves on out-of-distribution compositional generalization in semantic parsing evaluations. In-context learning has positive scaling curves, but is generally outperformed by much smaller fine-tuned models. Prompt-tuning can outperform fine-tuning, suggesting further potential improvements from scaling as it exhibits a more positive scaling curve. Additionally, we identify several error trends that vary with model scale. For example, larger models are generally better at modeling the syntax of the output space, but are also more prone to certain types of overfitting. Overall, our study highlights limitations of current techniques for effectively leveraging model scale for compositional generalization, while our analysis also suggests promising directions for future work.
Submitted 24 October, 2022; v1 submitted 24 May, 2022;
originally announced May 2022.
-
Possibility Before Utility: Learning And Using Hierarchical Affordances
Authors:
Robby Costales,
Shariq Iqbal,
Fei Sha
Abstract:
Reinforcement learning algorithms struggle on tasks with complex hierarchical dependency structures. Humans and other intelligent agents do not waste time assessing the utility of every high-level action in existence, but instead only consider ones they deem possible in the first place. By focusing only on what is feasible, or "afforded", at the present moment, an agent can spend more time both evaluating the utility of and acting on what matters. To this end, we present Hierarchical Affordance Learning (HAL), a method that learns a model of hierarchical affordances in order to prune impossible subtasks for more effective learning. Existing works in hierarchical reinforcement learning provide agents with structural representations of subtasks but are not affordance-aware, and by grounding our definition of hierarchical affordances in the present state, our approach is more flexible than the multitude of approaches that ground their subtask dependencies in a symbolic history. While these logic-based methods often require complete knowledge of the subtask hierarchy, our approach is able to utilize incomplete and varying symbolic specifications. Furthermore, we demonstrate that relative to non-affordance-aware methods, HAL agents are better able to efficiently learn complex tasks, navigate environment stochasticity, and acquire diverse skills in the absence of extrinsic supervision -- all of which are hallmarks of human learning.
Submitted 23 March, 2022;
originally announced March 2022.
-
Active Learning for Point Cloud Semantic Segmentation via Spatial-Structural Diversity Reasoning
Authors:
Feifei Shao,
Yawei Luo,
Ping Liu,
Jie Chen,
Yi Yang,
Yulei Lu,
Jun Xiao
Abstract:
The expensive annotation cost is notoriously known as the main constraint for the development of the point cloud semantic segmentation technique. Active learning methods endeavor to reduce such cost by selecting and labeling only a subset of the point clouds, yet previous attempts ignore the spatial-structural diversity of the selected samples, inducing the model to select clustered candidates with similar shapes in a local area while missing other representative ones in the global environment. In this paper, we propose a new 3D region-based active learning method to tackle this problem. Dubbed SSDR-AL, our method groups the original point clouds into superpoints and incrementally selects the most informative and representative ones for label acquisition. We achieve the selection mechanism via a graph reasoning network that considers both the spatial and structural diversities of superpoints. To deploy SSDR-AL in a more practical scenario, we design a noise-aware iterative labeling strategy to confront the "noisy annotation" problem introduced by the previous "dominant labeling" strategy in superpoints. Extensive experiments on two point cloud benchmarks demonstrate the effectiveness of SSDR-AL in the semantic segmentation task. Particularly, SSDR-AL significantly outperforms the baseline method and reduces the annotation cost by up to 63.0% and 24.0% when achieving 90% performance of fully supervised learning, respectively.
Submitted 18 April, 2022; v1 submitted 25 February, 2022;
originally announced February 2022.
-
Policy Learning and Evaluation with Randomized Quasi-Monte Carlo
Authors:
Sebastien M. R. Arnold,
Pierre L'Ecuyer,
Liyu Chen,
Yi-fan Chen,
Fei Sha
Abstract:
Reinforcement learning constantly deals with hard integrals, for example when computing expectations in policy evaluation and policy iteration. These integrals are rarely analytically solvable and typically estimated with the Monte Carlo method, which induces high variance in policy values and gradients. In this work, we propose to replace Monte Carlo samples with low-discrepancy point sets. We combine policy gradient methods with Randomized Quasi-Monte Carlo, yielding variance-reduced formulations of policy gradient and actor-critic algorithms. These formulations are effective for policy evaluation and policy improvement, as they outperform state-of-the-art algorithms on standardized continuous control benchmarks. Our empirical analyses validate the intuition that replacing Monte Carlo with Quasi-Monte Carlo yields significantly more accurate gradient estimates.
Submitted 21 February, 2022; v1 submitted 15 February, 2022;
originally announced February 2022.
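The variance-reduction idea in the abstract above can be illustrated on a toy one-dimensional integral. The sketch below is not the paper's algorithm and involves no policy gradients: it only compares plain Monte Carlo against a randomized quasi-Monte Carlo estimator built from a van der Corput point set with a single random shift (a Cranley-Patterson rotation). The function names, the integrand, and the sample sizes are all assumptions chosen for illustration.

```python
import random
import statistics


def van_der_corput(i, base=2):
    """Return the i-th point of the van der Corput low-discrepancy sequence."""
    x, denom = 0.0, 1.0
    while i > 0:
        i, digit = divmod(i, base)
        denom *= base
        x += digit / denom
    return x


def mc_estimate(f, n, rng):
    """Plain Monte Carlo: average f over n independent uniform samples."""
    return sum(f(rng.random()) for _ in range(n)) / n


def rqmc_estimate(f, n, rng):
    """Randomized QMC: shift a fixed low-discrepancy point set by one
    random offset (mod 1), which keeps the estimator unbiased."""
    shift = rng.random()
    return sum(f((van_der_corput(i) + shift) % 1.0) for i in range(n)) / n


def compare(f, n=64, repeats=200, seed=0):
    """Return (mc_mean, mc_var, rqmc_mean, rqmc_var) over repeated estimates."""
    rng = random.Random(seed)
    mc = [mc_estimate(f, n, rng) for _ in range(repeats)]
    rqmc = [rqmc_estimate(f, n, rng) for _ in range(repeats)]
    return (statistics.mean(mc), statistics.pvariance(mc),
            statistics.mean(rqmc), statistics.pvariance(rqmc))


if __name__ == "__main__":
    # Integrand u^2 on [0, 1]; the exact integral is 1/3.
    print(compare(lambda u: u * u))
```

Both estimators target the same expectation, but the shifted point set covers the unit interval evenly, so the spread of its estimates across repeats should be noticeably smaller than that of plain Monte Carlo for a smooth integrand like this one.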
-
Improving Compositional Generalization with Latent Structure and Data Augmentation
Authors:
Linlu Qiu,
Peter Shaw,
Panupong Pasupat,
Paweł Krzysztof Nowak,
Tal Linzen,
Fei Sha,
Kristina Toutanova
Abstract:
Generic unstructured neural networks have been shown to struggle on out-of-distribution compositional generalization. Compositional data augmentation via example recombination has transferred some prior knowledge about compositionality to such black-box neural models for several semantic parsing tasks, but this often required task-specific engineering or provided limited gains.
We present a more powerful data recombination method using a model called Compositional Structure Learner (CSL). CSL is a generative model with a quasi-synchronous context-free grammar backbone, which we induce from the training data. We sample recombined examples from CSL and add them to the fine-tuning data of a pre-trained sequence-to-sequence model (T5). This procedure effectively transfers most of CSL's compositional bias to T5 for diagnostic tasks, and results in a model even stronger than a T5-CSL ensemble on two real world compositional generalization tasks. This results in new state-of-the-art performance for these challenging semantic parsing tasks requiring generalization to both natural language variation and novel compositions of elements.
Submitted 4 May, 2022; v1 submitted 14 December, 2021;
originally announced December 2021.
-
Co-training Transformer with Videos and Images Improves Action Recognition
Authors:
Bowen Zhang,
Jiahui Yu,
Christopher Fifty,
Wei Han,
Andrew M. Dai,
Ruoming Pang,
Fei Sha
Abstract:
In learning action recognition, models are typically pre-trained on object recognition with images, such as ImageNet, and later fine-tuned on target action recognition with videos. This approach has achieved good empirical performance especially with recent transformer-based video architectures. While recently many works aim to design more advanced transformer architectures for action recognition, less effort has been made on how to train video transformers. In this work, we explore several training paradigms and present two findings. First, video transformers benefit from joint training on diverse video datasets and label spaces (e.g., Kinetics is appearance-focused while SomethingSomething is motion-focused). Second, by further co-training with images (as single-frame videos), the video transformers learn even better video representations. We term this approach as Co-training Videos and Images for Action Recognition (CoVeR). In particular, when pretrained on ImageNet-21K based on the TimeSFormer architecture, CoVeR improves Kinetics-400 Top-1 Accuracy by 2.4%, Kinetics-600 by 2.3%, and SomethingSomething-v2 by 2.3%. When pretrained on larger-scale image datasets following previous state-of-the-art, CoVeR achieves best results on Kinetics-400 (87.2%), Kinetics-600 (87.9%), Kinetics-700 (79.8%), SomethingSomething-v2 (70.9%), and Moments-in-Time (46.1%), with a simple spatio-temporal video transformer.
Submitted 14 December, 2021;
originally announced December 2021.
-
Learning to Generalize Compositionally by Transferring Across Semantic Parsing Tasks
Authors:
Wang Zhu,
Peter Shaw,
Tal Linzen,
Fei Sha
Abstract:
Neural network models often generalize poorly to mismatched domains or distributions. In NLP, this issue arises in particular when models are expected to generalize compositionally, that is, to novel combinations of familiar words and constructions. We investigate learning representations that facilitate transfer learning from one compositional task to another: the representation and the task-specific layers of the models are strategically trained differently on a pre-finetuning task such that they generalize well on mismatched splits that require compositionality. We apply this method to semantic parsing, using three very different datasets, COGS, GeoQuery and SCAN, used alternately as the pre-finetuning and target task. Our method significantly improves compositional generalization over baselines on the test set of the target task, which is held out during fine-tuning. Ablation studies characterize the utility of the major steps in the proposed algorithm and support our hypothesis.
Submitted 9 November, 2021;
originally announced November 2021.
-
HyperPINN: Learning parameterized differential equations with physics-informed hypernetworks
Authors:
Filipe de Avila Belbute-Peres,
Yi-fan Chen,
Fei Sha
Abstract:
Many types of physics-informed neural network models have been proposed in recent years as approaches for learning solutions to differential equations. When a particular task requires solving a differential equation at multiple parameterizations, this requires either re-training the model, or expanding its representation capacity to include the parameterization -- both solutions that increase its computational cost. We propose the HyperPINN, which uses hypernetworks to learn to generate neural networks that can solve a differential equation from a given parameterization. We demonstrate with experiments on both a PDE and an ODE that this type of model can lead to neural network solutions to differential equations that maintain a small size, even when learning a family of solutions over a parameter space.
Submitted 28 October, 2021;
originally announced November 2021.
-
Mention Memory: incorporating textual knowledge into Transformers through entity mention attention
Authors:
Michiel de Jong,
Yury Zemlyanskiy,
Nicholas FitzGerald,
Fei Sha,
William Cohen
Abstract:
Natural language understanding tasks such as open-domain question answering often require retrieving and assimilating factual information from multiple sources. We propose to address this problem by integrating a semi-parametric representation of a large text corpus into a Transformer model as a source of factual knowledge. Specifically, our method represents knowledge with "mention memory", a table of dense vector representations of every entity mention in a corpus. The proposed model - TOME - is a Transformer that accesses the information through internal memory layers in which each entity mention in the input passage attends to the mention memory. This approach enables synthesis of and reasoning over many disparate sources of information within a single Transformer model. In experiments using a memory of 150 million Wikipedia mentions, TOME achieves strong performance on several open-domain knowledge-intensive tasks, including the claim verification benchmarks HoVer and FEVER and several entity-based QA benchmarks. We also show that the model learns to attend to informative mentions without any direct supervision. Finally, we demonstrate that the model can generalize to new unseen entities by updating the memory without retraining.
Submitted 19 April, 2022; v1 submitted 12 October, 2021;
originally announced October 2021.
-
Visually Grounded Concept Composition
Authors:
Bowen Zhang,
Hexiang Hu,
Linlu Qiu,
Peter Shaw,
Fei Sha
Abstract:
We investigate ways to compose complex concepts in texts from primitive ones while grounding them in images. We propose Concept and Relation Graph (CRG), which builds on top of constituency analysis and consists of recursively combined concepts with predicate functions. Meanwhile, we propose a concept composition neural network called Composer to leverage the CRG for visually grounded concept learning. Specifically, we learn the grounding of both primitive and all composed concepts by aligning them to images and show that learning to compose leads to more robust grounding results, measured in text-to-image matching accuracy. Notably, our model can model grounded concepts forming at both the finer-grained sentence level and the coarser-grained intermediate level (or word-level). Composer leads to pronounced improvement in matching accuracy when the evaluation data has significant compound divergence from the training data.
Submitted 28 September, 2021;
originally announced September 2021.
-
Systematic Generalization on gSCAN: What is Nearly Solved and What is Next?
Authors:
Linlu Qiu,
Hexiang Hu,
Bowen Zhang,
Peter Shaw,
Fei Sha
Abstract:
We analyze the grounded SCAN (gSCAN) benchmark, which was recently proposed to study systematic generalization for grounded language understanding. First, we study which aspects of the original benchmark can be solved by commonly used methods in multi-modal research. We find that a general-purpose Transformer-based model with cross-modal attention achieves strong performance on a majority of the gSCAN splits, surprisingly outperforming more specialized approaches from prior work. Furthermore, our analysis suggests that many of the remaining errors reveal the same fundamental challenge in systematic generalization of linguistic constructs regardless of visual context. Second, inspired by this finding, we propose challenging new tasks for gSCAN by generating data to incorporate relations between objects in the visual environment. Finally, we find that current models are surprisingly data-inefficient given the narrow scope of commands in gSCAN, suggesting another challenge for future work.
Submitted 24 September, 2021;
originally announced September 2021.
-
Towards Top-Down Just Noticeable Difference Estimation of Natural Images
Authors:
Qiuping Jiang,
Zhentao Liu,
Shiqi Wang,
Feng Shao,
Weisi Lin
Abstract:
The just noticeable difference (JND) of a natural image refers to the maximum magnitude of pixel intensity change that the typical human visual system (HVS) cannot perceive. Existing efforts on JND estimation are mainly dedicated to modeling the diverse masking effects in the spatial domain, the frequency domain, or both, and then fusing them into an overall JND estimate. In this work, we take a dramatically different, top-down approach to this problem. Instead of explicitly formulating and fusing different masking effects in a bottom-up way, the proposed JND estimation model first predicts a critical perceptual lossless (CPL) counterpart of the original image and then calculates the difference map between the original image and the predicted CPL image as the JND map. We conduct subjective experiments to determine the critical points of 500 images and find that the distribution of cumulative normalized KLT coefficient energy values over all 500 images at these critical points can be well characterized by a Weibull distribution. Given a test image, its corresponding critical point is determined by a simple weighted average scheme in which the weights are given by a fitted Weibull distribution function. The performance of the proposed JND model is evaluated explicitly with direct JND prediction and implicitly with two applications: JND-guided noise injection and JND-guided image compression. Experimental results demonstrate that the proposed JND model achieves better performance than several recent JND models. In addition, we compare the proposed JND model with existing visual difference predictor (VDP) metrics in terms of their capability in distortion detection and discrimination. The results indicate that our JND model also performs well in this task.
Submitted 24 May, 2022; v1 submitted 11 August, 2021;
originally announced August 2021.