-
Testing the Fault-Tolerance of Multi-Sensor Fusion Perception in Autonomous Driving Systems
Authors:
Haoxiang Tian,
Wenqiang Ding,
Xingshuo Han,
Guoquan Wu,
An Guo,
Junqi Zhang. Wei Chen,
Jun Wei,
Tianwei Zhang
Abstract:
High-level Autonomous Driving Systems (ADSs), such as Google Waymo and Baidu Apollo, typically rely on multi-sensor fusion (MSF) based approaches to perceive their surroundings. This strategy increases perception robustness by combining the respective strengths of the camera and LiDAR and directly affects the safety-critical driving decisions of autonomous vehicles (AVs). However, in real-world au…
▽ More
High-level Autonomous Driving Systems (ADSs), such as Google Waymo and Baidu Apollo, typically rely on multi-sensor fusion (MSF) based approaches to perceive their surroundings. This strategy increases perception robustness by combining the respective strengths of the camera and LiDAR and directly affects the safety-critical driving decisions of autonomous vehicles (AVs). However, in real-world autonomous driving scenarios, cameras and LiDAR are subject to various faults, which can probably significantly impact the decision-making and behaviors of ADSs. Existing MSF testing approaches only discovered corner cases that the MSF-based perception cannot accurately detected by MSF-based perception, while lacking research on how sensor faults affect the system-level behaviors of ADSs.
To address this gap, we conduct the first exploration of the fault tolerance of MSF perception-based ADS for sensor faults. In this paper, we systematically and comprehensively build fault models for cameras and LiDAR in AVs and inject them into the MSF perception-based ADS to test its behaviors in test scenarios. To effectively and efficiently explore the parameter spaces of sensor fault models, we design a feedback-guided differential fuzzer to discover the safety violations of MSF perception-based ADS caused by the injected sensor faults. We evaluate FADE on the representative and practical industrial ADS, Baidu Apollo. Our evaluation results demonstrate the effectiveness and efficiency of FADE, and we conclude some useful findings from the experimental results. To validate the findings in the physical world, we use a real Baidu Apollo 6.0 EDU autonomous vehicle to conduct the physical experiments, and the results show the practical significance of our findings.
△ Less
Submitted 17 April, 2025;
originally announced April 2025.
-
BOP Challenge 2024 on Model-Based and Model-Free 6D Object Pose Estimation
Authors:
Van Nguyen Nguyen,
Stephen Tyree,
Andrew Guo,
Mederic Fourmy,
Anas Gouda,
Taeyeop Lee,
Sungphill Moon,
Hyeontae Son,
Lukas Ranftl,
Jonathan Tremblay,
Eric Brachmann,
Bertram Drost,
Vincent Lepetit,
Carsten Rother,
Stan Birchfield,
Jiri Matas,
Yann Labbe,
Martin Sundermeyer,
Tomas Hodan
Abstract:
We present the evaluation methodology, datasets and results of the BOP Challenge 2024, the 6th in a series of public competitions organized to capture the state of the art in 6D object pose estimation and related tasks. In 2024, our goal was to transition BOP from lab-like setups to real-world scenarios. First, we introduced new model-free tasks, where no 3D object models are available and methods…
▽ More
We present the evaluation methodology, datasets and results of the BOP Challenge 2024, the 6th in a series of public competitions organized to capture the state of the art in 6D object pose estimation and related tasks. In 2024, our goal was to transition BOP from lab-like setups to real-world scenarios. First, we introduced new model-free tasks, where no 3D object models are available and methods need to onboard objects just from provided reference videos. Second, we defined a new, more practical 6D object detection task where identities of objects visible in a test image are not provided as input. Third, we introduced new BOP-H3 datasets recorded with high-resolution sensors and AR/VR headsets, closely resembling real-world scenarios. BOP-H3 include 3D models and onboarding videos to support both model-based and model-free tasks. Participants competed on seven challenge tracks. Notably, the best 2024 method for model-based 6D localization of unseen objects (FreeZeV2.1) achieves 22% higher accuracy on BOP-Classic-Core than the best 2023 method (GenFlow), and is only 4% behind the best 2023 method for seen objects (GPose2023) although being significantly slower (24.9 vs 2.7s per image). A more practical 2024 method for this task is Co-op which takes only 0.8s per image and is 13% more accurate than GenFlow. Methods have similar rankings on 6D detection as on 6D localization but higher run time. On model-based 2D detection of unseen objects, the best 2024 method (MUSE) achieves 21--29% relative improvement compared to the best 2023 method (CNOS). However, the 2D detection accuracy for unseen objects is still -35% behind the accuracy for seen objects (GDet2023), and the 2D detection stage is consequently the main bottleneck of existing pipelines for 6D localization/detection of unseen objects. The online evaluation system stays open and is available at http://bop.felk.cvut.cz/
△ Less
Submitted 23 April, 2025; v1 submitted 3 April, 2025;
originally announced April 2025.
-
Rubikon: Intelligent Tutoring for Rubik's Cube Learning Through AR-enabled Physical Task Reconfiguration
Authors:
Haocheng Ren,
Muzhe Wu,
Gregory Croisdale,
Anhong Guo,
Xu Wang
Abstract:
Learning to solve a Rubik's Cube requires the learners to repeatedly practice a skill component, e.g., identifying a misplaced square and putting it back. However, for 3D physical tasks such as this, generating sufficient repeated practice opportunities for learners can be challenging, in part because it is difficult for novices to reconfigure the physical object to specific states. We propose Rub…
▽ More
Learning to solve a Rubik's Cube requires the learners to repeatedly practice a skill component, e.g., identifying a misplaced square and putting it back. However, for 3D physical tasks such as this, generating sufficient repeated practice opportunities for learners can be challenging, in part because it is difficult for novices to reconfigure the physical object to specific states. We propose Rubikon, an intelligent tutoring system for learning to solve the Rubik's Cube. Rubikon reduces the necessity for repeated manual configurations of the Rubik's Cube without compromising the tactile experience of handling a physical cube. The foundational design of Rubikon is an AR setup, where learners manipulate a physical cube while seeing an AR-rendered cube on a display. Rubikon automatically generates configurations of the Rubik's Cube to target learners' weaknesses and help them exercise diverse knowledge components. In a between-subjects experiment, we showed that Rubikon learners scored 25% higher on a post-test compared to baselines.
△ Less
Submitted 14 May, 2025; v1 submitted 16 March, 2025;
originally announced March 2025.
-
HandProxy: Expanding the Affordances of Speech Interfaces in Immersive Environments with a Virtual Proxy Hand
Authors:
Chen Liang,
Yuxuan Liu,
Martez Mott,
Anhong Guo
Abstract:
Hand interactions are increasingly used as the primary input modality in immersive environments, but they are not always feasible due to situational impairments, motor limitations, and environmental constraints. Speech interfaces have been explored as an alternative to hand input in research and commercial solutions, but are limited to initiating basic hand gestures and system controls. We introdu…
▽ More
Hand interactions are increasingly used as the primary input modality in immersive environments, but they are not always feasible due to situational impairments, motor limitations, and environmental constraints. Speech interfaces have been explored as an alternative to hand input in research and commercial solutions, but are limited to initiating basic hand gestures and system controls. We introduce HandProxy, a system that expands the affordances of speech interfaces to support expressive hand interactions. Instead of relying on predefined speech commands directly mapped to possible interactions, HandProxy enables users to control the movement of a virtual hand as an interaction proxy, allowing them to describe the intended interactions naturally while the system translates speech into a sequence of hand controls for real-time execution. A user study with 20 participants demonstrated that HandProxy effectively enabled diverse hand interactions in virtual environments, achieving a 100% task completion rate with an average of 1.09 attempts per speech command and 91.8% command execution accuracy, while supporting flexible, natural speech input with varying levels of control and granularity.
△ Less
Submitted 13 March, 2025;
originally announced March 2025.
-
Efficient Transformer-based Decoder for Varshamov-Tenengolts Codes
Authors:
Yali Wei,
Alan J. X. Guo,
Zihui Yan,
Yufan Dai
Abstract:
In recent years, the rise of DNA data storage technology has brought significant attention to the challenge of correcting insertion, deletion, and substitution (IDS) errors. Among various coding methods for IDS correction, Varshamov-Tenengolts (VT) codes, primarily designed for single-error correction, have emerged as a central research focus. While existing decoding methods achieve high accuracy…
▽ More
In recent years, the rise of DNA data storage technology has brought significant attention to the challenge of correcting insertion, deletion, and substitution (IDS) errors. Among various coding methods for IDS correction, Varshamov-Tenengolts (VT) codes, primarily designed for single-error correction, have emerged as a central research focus. While existing decoding methods achieve high accuracy in correcting a single error, they often fail to correct multiple IDS errors. In this work, we observe that VT codes retain some capability for addressing multiple errors by introducing a transformer-based VT decoder (TVTD) along with symbol- and statistic-based codeword embedding. Experimental results demonstrate that the proposed TVTD achieves perfect correction of a single error. Furthermore, when decoding multiple errors across various codeword lengths, the bit error rate and frame error rate are significantly improved compared to existing hard decision and soft-in soft-out algorithms. Additionally, through model architecture optimization, the proposed method reduces time consumption by an order of magnitude compared to other soft decoders.
△ Less
Submitted 28 February, 2025;
originally announced February 2025.
-
ACiS: Complex Processing in the Switch Fabric
Authors:
Pouya Haghi,
Anqi Guo,
Tong Geng,
Anthony Skjellum,
Martin Herbordt
Abstract:
For the last three decades a core use of FPGAs has been for processing communication: FPGA-based SmartNICs are in widespread use from the datacenter to IoT. Augmenting switches with FPGAs, however, has been less studied, but has numerous advantages built around the processing being moved from the edge of the network to the center. Communication switches have previously been augmented to process co…
▽ More
For the last three decades a core use of FPGAs has been for processing communication: FPGA-based SmartNICs are in widespread use from the datacenter to IoT. Augmenting switches with FPGAs, however, has been less studied, but has numerous advantages built around the processing being moved from the edge of the network to the center. Communication switches have previously been augmented to process collectives, e.g., IBM BlueGene and Mellanox SHArP, but the support has been limited to a small set of predefined scalar operations and datatypes. Here we present ACiS, a framework and taxonomy for Advanced Computing in the Switch that unifies and expands our previous work in this area. In addition to fixed scalar collectives (Type 1), we propose three more types of in-switch application processing: (Type 2) User-defined operations and types, including data structures; (Type 3) Look-aside operations that have state within the operation and can have loops; and (Type 4) Fused collectives built by fusing multiple existing collectives or collectives with map computations. ACiS is supported in hardware with modular switch extensions including a CGRA architecture. Software support for ACiS includes evaluation and translation of relevant parts of user programs, compilation of user specifications into control flow graphs, and mapping the graphs into switch hardware. The overall goal is the transparent acceleration of HPC applications encapsulated within an MPI implementation.
△ Less
Submitted 30 January, 2025;
originally announced January 2025.
-
Marker Track: Accurate Fiducial Marker Tracking for Evaluation of Residual Motions During Breath-Hold Radiotherapy
Authors:
Aimee Guo,
Weihua Mao
Abstract:
Fiducial marker positions in projection image of cone-beam computed tomography (CBCT) scans have been studied to evaluate daily residual motion during breath-hold radiation therapy. Fiducial marker migration posed challenges in accurately locating markers, prompting the development of a novel algorithm that reconstructs volumetric probability maps of marker locations from filtered gradient maps of…
▽ More
Fiducial marker positions in projection image of cone-beam computed tomography (CBCT) scans have been studied to evaluate daily residual motion during breath-hold radiation therapy. Fiducial marker migration posed challenges in accurately locating markers, prompting the development of a novel algorithm that reconstructs volumetric probability maps of marker locations from filtered gradient maps of projections. This guides the development of a Python-based algorithm to detect fiducial markers in projection images using Meta AI's Segment Anything Model 2 (SAM 2). Retrospective data from a pancreatic cancer patient with two fiducial markers were analyzed. The three-dimensional (3D) marker positions from simulation computed tomography (CT) were compared to those reconstructed from CBCT images, revealing a decrease in relative distances between markers over time. Fiducial markers were successfully detected in 2777 out of 2786 projection frames. The average standard deviation of superior-inferior (SI) marker positions was 0.56 mm per breath-hold, with differences in average SI positions between two breath-holds in the same scan reaching up to 5.2 mm, and a gap of up to 7.3 mm between the end of the first and beginning of the second breath-hold. 3D marker positions were calculated using projection positions and confirmed marker migration. This method effectively calculates marker probability volume and enables accurate fiducial marker tracking during treatment without requiring any specialized equipment, additional radiation doses, or manual initialization and labeling. It has significant potential for automatically assessing daily residual motion to adjust planning margins, functioning as an adaptive radiation therapy tool.
△ Less
Submitted 26 January, 2025;
originally announced January 2025.
-
An LLM-Empowered Adaptive Evolutionary Algorithm For Multi-Component Deep Learning Systems
Authors:
Haoxiang Tian,
Xingshuo Han,
Guoquan Wu,
An Guo,
Yuan Zhou. Jie Zhang,
Shuo Li,
Jun Wei,
Tianwei Zhang
Abstract:
Multi-objective evolutionary algorithms (MOEAs) are widely used for searching optimal solutions in complex multi-component applications. Traditional MOEAs for multi-component deep learning (MCDL) systems face challenges in enhancing the search efficiency while maintaining the diversity. To combat these, this paper proposes $μ$MOEA, the first LLM-empowered adaptive evolutionary search algorithm to…
▽ More
Multi-objective evolutionary algorithms (MOEAs) are widely used for searching optimal solutions in complex multi-component applications. Traditional MOEAs for multi-component deep learning (MCDL) systems face challenges in enhancing the search efficiency while maintaining the diversity. To combat these, this paper proposes $μ$MOEA, the first LLM-empowered adaptive evolutionary search algorithm to detect safety violations in MCDL systems. Inspired by the context-understanding ability of Large Language Models (LLMs), $μ$MOEA promotes the LLM to comprehend the optimization problem and generate an initial population tailed to evolutionary objectives. Subsequently, it employs adaptive selection and variation to iteratively produce offspring, balancing the evolutionary efficiency and diversity. During the evolutionary process, to navigate away from the local optima, $μ$MOEA integrates the evolutionary experience back into the LLM. This utilization harnesses the LLM's quantitative reasoning prowess to generate differential seeds, breaking away from current optimal solutions. We evaluate $μ$MOEA in finding safety violations of MCDL systems, and compare its performance with state-of-the-art MOEA methods. Experimental results show that $μ$MOEA can significantly improve the efficiency and diversity of the evolutionary search.
△ Less
Submitted 1 January, 2025;
originally announced January 2025.
-
ObjectFinder: An Open-Vocabulary Assistive System for Interactive Object Search by Blind People
Authors:
Ruiping Liu,
Jiaming Zhang,
Angela Schön,
Karin Müller,
Junwei Zheng,
Kailun Yang,
Anhong Guo,
Kathrin Gerling,
Rainer Stiefelhagen
Abstract:
Searching for objects in unfamiliar scenarios is a challenging task for blind people. It involves specifying the target object, detecting it, and then gathering detailed information according to the user's intent. However, existing description- and detection-based assistive technologies do not sufficiently support the multifaceted nature of interactive object search tasks. We present ObjectFinder,…
▽ More
Searching for objects in unfamiliar scenarios is a challenging task for blind people. It involves specifying the target object, detecting it, and then gathering detailed information according to the user's intent. However, existing description- and detection-based assistive technologies do not sufficiently support the multifaceted nature of interactive object search tasks. We present ObjectFinder, an open-vocabulary wearable assistive system for interactive object search by blind people. ObjectFinder allows users to query target objects using flexible wording. Once the target object is detected, it provides egocentric localization information in real-time, including distance and direction. Users can then initiate different branches to gather detailed information based on their intent towards the target object, such as navigating to it or perceiving its surroundings. ObjectFinder is powered by a seamless combination of open-vocabulary models, namely an open-vocabulary object detector and a multimodal large language model. The ObjectFinder design concept and its development were carried out in collaboration with a blind co-designer. To evaluate ObjectFinder, we conducted an exploratory user study with eight blind participants. We compared ObjectFinder to BeMyAI and Google Lookout, popular description- and detection-based assistive applications. Our findings indicate that most participants felt more independent with ObjectFinder and preferred it for object search, as it enhanced scene context gathering and navigation, and allowed for active target identification. Finally, we discuss the implications for future assistive systems to support interactive object search.
△ Less
Submitted 30 April, 2025; v1 submitted 4 December, 2024;
originally announced December 2024.
-
Agnostic Learning of Arbitrary ReLU Activation under Gaussian Marginals
Authors:
Anxin Guo,
Aravindan Vijayaraghavan
Abstract:
We consider the problem of learning an arbitrarily-biased ReLU activation (or neuron) over Gaussian marginals with the squared loss objective. Despite the ReLU neuron being the basic building block of modern neural networks, we still do not understand the basic algorithmic question of whether one arbitrary ReLU neuron is learnable in the non-realizable setting. In particular, all existing polynomi…
▽ More
We consider the problem of learning an arbitrarily-biased ReLU activation (or neuron) over Gaussian marginals with the squared loss objective. Despite the ReLU neuron being the basic building block of modern neural networks, we still do not understand the basic algorithmic question of whether one arbitrary ReLU neuron is learnable in the non-realizable setting. In particular, all existing polynomial time algorithms only provide approximation guarantees for the better-behaved unbiased setting or restricted bias setting.
Our main result is a polynomial time statistical query (SQ) algorithm that gives the first constant factor approximation for arbitrary bias. It outputs a ReLU activation that achieves a loss of $O(\mathrm{OPT}) + \varepsilon$ in time $\mathrm{poly}(d,1/\varepsilon)$, where $\mathrm{OPT}$ is the loss obtained by the optimal ReLU activation. Our algorithm presents an interesting departure from existing algorithms, which are all based on gradient descent and thus fall within the class of correlational statistical query (CSQ) algorithms. We complement our algorithmic result by showing that no polynomial time CSQ algorithm can achieve a constant factor approximation. Together, these results shed light on the intrinsic limitation of gradient descent, while identifying arguably the simplest setting (a single neuron) where there is a separation between SQ and CSQ algorithms.
△ Less
Submitted 22 November, 2024; v1 submitted 21 November, 2024;
originally announced November 2024.
-
From Pen to Prompt: How Creative Writers Integrate AI into their Writing Practice
Authors:
Alicia Guo,
Shreya Sathyanarayanan,
Leijie Wang,
Jeffrey Heer,
Amy Zhang
Abstract:
Creative writing is a deeply human craft, yet AI systems using large language models (LLMs) offer the automation of significant parts of the writing process. So why do some creative writers choose to use AI? Through interviews and observed writing sessions with 18 creative writers who already use AI regularly in their writing practice, we find that creative writers are intentional about how they i…
▽ More
Creative writing is a deeply human craft, yet AI systems using large language models (LLMs) offer the automation of significant parts of the writing process. So why do some creative writers choose to use AI? Through interviews and observed writing sessions with 18 creative writers who already use AI regularly in their writing practice, we find that creative writers are intentional about how they incorporate AI, making many deliberate decisions about when and how to engage AI based on their core values, such as authenticity and craftsmanship. We characterize the interplay between writers' values, their fluid relationships with AI, and specific integration strategies -- ultimately enabling writers to create new AI workflows without compromising their creative values. We provide insight for writing communities, AI developers and future researchers on the importance of supporting transparency of these emerging writing processes and rethinking what AI features can best serve writers.
△ Less
Submitted 3 May, 2025; v1 submitted 5 November, 2024;
originally announced November 2024.
-
SoVAR: Building Generalizable Scenarios from Accident Reports for Autonomous Driving Testing
Authors:
An Guo,
Yuan Zhou,
Haoxiang Tian,
Chunrong Fang,
Yunjian Sun,
Weisong Sun,
Xinyu Gao,
Anh Tuan Luu,
Yang Liu,
Zhenyu Chen
Abstract:
Autonomous driving systems (ADSs) have undergone remarkable development and are increasingly employed in safety-critical applications. However, recently reported data on fatal accidents involving ADSs suggests that the desired level of safety has not yet been fully achieved. Consequently, there is a growing need for more comprehensive and targeted testing approaches to ensure safe driving. Scenari…
▽ More
Autonomous driving systems (ADSs) have undergone remarkable development and are increasingly employed in safety-critical applications. However, recently reported data on fatal accidents involving ADSs suggests that the desired level of safety has not yet been fully achieved. Consequently, there is a growing need for more comprehensive and targeted testing approaches to ensure safe driving. Scenarios from real-world accident reports provide valuable resources for ADS testing, including critical scenarios and high-quality seeds. However, existing scenario reconstruction methods from accident reports often exhibit limited accuracy in information extraction. Moreover, due to the diversity and complexity of road environments, matching current accident information with the simulation map data for reconstruction poses significant challenges. In this paper, we design and implement SoVAR, a tool for automatically generating road-generalizable scenarios from accident reports. SoVAR utilizes well-designed prompts with linguistic patterns to guide the large language model in extracting accident information from textual data. Subsequently, it formulates and solves accident-related constraints in conjunction with the extracted accident information to generate accident trajectories. Finally, SoVAR reconstructs accident scenarios on various map structures and converts them into test scenarios to evaluate its capability to detect defects in industrial ADSs. We experiment with SoVAR, using accident reports from the National Highway Traffic Safety Administration's database to generate test scenarios for the industrial-grade ADS Apollo. The experimental findings demonstrate that SoVAR can effectively generate generalized accident scenarios across different road structures. Furthermore, the results confirm that SoVAR identified 5 distinct safety violation types that contributed to the crash of Baidu Apollo.
△ Less
Submitted 12 September, 2024;
originally announced September 2024.
-
Average Causal Effect Estimation in DAGs with Hidden Variables: Extensions of Back-Door and Front-Door Criteria
Authors:
Anna Guo,
Razieh Nabi
Abstract:
The identification theory for causal effects in directed acyclic graphs (DAGs) with hidden variables is well-developed, but methods for estimating and inferring functionals beyond the g-formula remain limited. Previous studies have proposed semiparametric estimators for identifiable functionals in a broad class of DAGs with hidden variables. While demonstrating double robustness in some models, ex…
▽ More
The identification theory for causal effects in directed acyclic graphs (DAGs) with hidden variables is well-developed, but methods for estimating and inferring functionals beyond the g-formula remain limited. Previous studies have proposed semiparametric estimators for identifiable functionals in a broad class of DAGs with hidden variables. While demonstrating double robustness in some models, existing estimators face challenges, particularly with density estimation and numerical integration for continuous variables, and their estimates may fall outside the parameter space of the target estimand. Their asymptotic properties are also underexplored, especially when using flexible statistical and machine learning models for nuisance estimation. This study addresses these challenges by introducing novel one-step corrected plug-in and targeted minimum loss-based estimators of causal effects for a class of DAGs that extend classical back-door and front-door criteria (known as the treatment primal fixability criterion in prior literature). These estimators leverage machine learning to minimize modeling assumptions while ensuring key statistical properties such as asymptotic linearity, double robustness, efficiency, and staying within the bounds of the target parameter space. We establish conditions for nuisance functional estimates in terms of L2(P)-norms to achieve root-n consistent causal effect estimates. To facilitate practical application, we have developed the flexCausal package in R.
△ Less
Submitted 5 September, 2024;
originally announced September 2024.
-
CooTest: An Automated Testing Approach for V2X Communication Systems
Authors:
An Guo,
Xinyu Gao,
Zhenyu Chen,
Yuan Xiao,
Jiakai Liu,
Xiuting Ge,
Weisong Sun,
Chunrong Fang
Abstract:
Perceiving the complex driving environment precisely is crucial to the safe operation of autonomous vehicles. With the tremendous advancement of deep learning and communication technology, Vehicle-to-Everything (V2X) collaboration has the potential to address limitations in sensing distant objects and occlusion for a single-agent perception system. However, despite spectacular progress, several co…
▽ More
Perceiving the complex driving environment precisely is crucial to the safe operation of autonomous vehicles. With the tremendous advancement of deep learning and communication technology, Vehicle-to-Everything (V2X) collaboration has the potential to address limitations in sensing distant objects and occlusion for a single-agent perception system. However, despite spectacular progress, several communication challenges can undermine the effectiveness of multi-vehicle cooperative perception. The low interpretability of Deep Neural Networks (DNNs) and the high complexity of communication mechanisms make conventional testing techniques inapplicable for the cooperative perception of autonomous driving systems (ADS). Besides, the existing testing techniques, depending on manual data collection and labeling, become time-consuming and prohibitively expensive.
In this paper, we design and implement CooTest, the first automated testing tool of the V2X-oriented cooperative perception module. CooTest devises the V2X-specific metamorphic relation and equips communication and weather transformation operators that can reflect the impact of the various cooperative driving factors to produce transformed scenes. Furthermore, we adopt a V2X-oriented guidance strategy for the transformed scene generation process and improve testing efficiency. We experiment CooTest with multiple cooperative perception models with different fusion schemes to evaluate its performance on different tasks. The experiment results show that CooTest can effectively detect erroneous behaviors under various V2X-oriented driving conditions. Also, the results confirm that CooTest can improve detection average precision and decrease misleading cooperation errors by retraining with the generated scenes.
△ Less
Submitted 29 August, 2024;
originally announced August 2024.
-
Audio Description Customization
Authors:
Rosiana Natalie,
Ruei-Che Chang,
Smitha Sheshadri,
Anhong Guo,
Kotaro Hara
Abstract:
Blind and low-vision (BLV) people use audio descriptions (ADs) to access videos. However, current ADs are unalterable by end users, thus are incapable of supporting BLV individuals' potentially diverse needs and preferences. This research investigates if customizing AD could improve how BLV individuals consume videos. We conducted an interview study (Study 1) with fifteen BLV participants, which r…
▽ More
Blind and low-vision (BLV) people use audio descriptions (ADs) to access videos. However, current ADs are unalterable by end users, thus are incapable of supporting BLV individuals' potentially diverse needs and preferences. This research investigates if customizing AD could improve how BLV individuals consume videos. We conducted an interview study (Study 1) with fifteen BLV participants, which revealed desires for customizing properties like length, emphasis, speed, voice, format, tone, and language. At the same time, concerns like interruptions and increased interaction load due to customization emerged. To examine AD customization's effectiveness and tradeoffs, we designed CustomAD, a prototype that enables BLV users to customize AD content and presentation. An evaluation study (Study 2) with twelve BLV participants showed using CustomAD significantly enhanced BLV people's video understanding, immersion, and information navigation efficiency. Our work illustrates the importance of AD customization and offers a design that enhances video accessibility for BLV individuals.
△ Less
Submitted 21 August, 2024;
originally announced August 2024.
-
ProgramAlly: Creating Custom Visual Access Programs via Multi-Modal End-User Programming
Authors:
Jaylin Herskovitz,
Andi Xu,
Rahaf Alharbi,
Anhong Guo
Abstract:
Existing visual assistive technologies are built for simple and common use cases, and have few avenues for blind people to customize their functionalities. Drawing from prior work on DIY assistive technology, this paper investigates end-user programming as a means for users to create and customize visual access programs to meet their unique needs. We introduce ProgramAlly, a system for creating cu…
▽ More
Existing visual assistive technologies are built for simple and common use cases, and have few avenues for blind people to customize their functionalities. Drawing from prior work on DIY assistive technology, this paper investigates end-user programming as a means for users to create and customize visual access programs to meet their unique needs. We introduce ProgramAlly, a system for creating custom filters for visual information, e.g., 'find NUMBER on BUS', leveraging three end-user programming approaches: block programming, natural language, and programming by example. To implement ProgramAlly, we designed a representation of visual filtering tasks based on scenarios encountered by blind people, and integrated a set of on-device and cloud models for generating and running these programs. In user studies with 12 blind adults, we found that participants preferred different programming modalities depending on the task, and envisioned using visual access programs to address unique accessibility challenges that are otherwise difficult with existing applications. Through ProgramAlly, we present an exploration of how blind end-users can create visual access programs to customize and control their experiences.
△ Less
Submitted 19 August, 2024;
originally announced August 2024.
-
VRCopilot: Authoring 3D Layouts with Generative AI Models in VR
Authors:
Lei Zhang,
Jin Pan,
Jacob Gettig,
Steve Oney,
Anhong Guo
Abstract:
Immersive authoring provides an intuitive medium for users to create 3D scenes via direct manipulation in Virtual Reality (VR). Recent advances in generative AI have enabled the automatic creation of realistic 3D layouts. However, it is unclear how capabilities of generative AI can be used in immersive authoring to support fluid interactions, user agency, and creativity. We introduce VRCopilot, a…
▽ More
Immersive authoring provides an intuitive medium for users to create 3D scenes via direct manipulation in Virtual Reality (VR). Recent advances in generative AI have enabled the automatic creation of realistic 3D layouts. However, it is unclear how capabilities of generative AI can be used in immersive authoring to support fluid interactions, user agency, and creativity. We introduce VRCopilot, a mixed-initiative system that integrates pre-trained generative AI models into immersive authoring to facilitate human-AI co-creation in VR. VRCopilot presents multimodal interactions to support rapid prototyping and iterations with AI, and intermediate representations such as wireframes to augment user controllability over the created content. Through a series of user studies, we evaluated the potential and challenges in manual, scaffolded, and automatic creation in immersive authoring. We found that scaffolded creation using wireframes enhanced the user agency compared to automatic creation. We also found that manual creation via multimodal specification offers the highest sense of creativity and agency.
△ Less
Submitted 18 August, 2024;
originally announced August 2024.
-
EditScribe: Non-Visual Image Editing with Natural Language Verification Loops
Authors:
Ruei-Che Chang,
Yuxuan Liu,
Lotus Zhang,
Anhong Guo
Abstract:
Image editing is an iterative process that requires precise visual evaluation and manipulation for the output to match the editing intent. However, current image editing tools do not provide accessible interaction nor sufficient feedback for blind and low vision individuals to achieve this level of control. To address this, we developed EditScribe, a prototype system that makes image editing acces…
▽ More
Image editing is an iterative process that requires precise visual evaluation and manipulation for the output to match the editing intent. However, current image editing tools do not provide accessible interaction nor sufficient feedback for blind and low vision individuals to achieve this level of control. To address this, we developed EditScribe, a prototype system that makes image editing accessible using natural language verification loops powered by large multimodal models. Using EditScribe, the user first comprehends the image content through initial general and object descriptions, then specifies edit actions using open-ended natural language prompts. EditScribe performs the image edit, and provides four types of verification feedback for the user to verify the performed edit, including a summary of visual changes, AI judgement, and updated general and object descriptions. The user can ask follow-up questions to clarify and probe into the edits or verification feedback, before performing another edit. In a study with ten blind or low-vision users, we found that EditScribe supported participants to perform and verify image edit actions non-visually. We observed different prompting strategies from participants, and their perceptions on the various types of verification feedback. Finally, we discuss the implications of leveraging natural language verification loops to make visual authoring non-visually accessible.
△ Less
Submitted 13 August, 2024;
originally announced August 2024.
-
WorldScribe: Towards Context-Aware Live Visual Descriptions
Authors:
Ruei-Che Chang,
Yuxuan Liu,
Anhong Guo
Abstract:
Automated live visual descriptions can aid blind people in understanding their surroundings with autonomy and independence. However, providing descriptions that are rich, contextual, and just-in-time has been a long-standing challenge in accessibility. In this work, we develop WorldScribe, a system that generates automated live real-world visual descriptions that are customizable and adaptive to u…
▽ More
Automated live visual descriptions can aid blind people in understanding their surroundings with autonomy and independence. However, providing descriptions that are rich, contextual, and just-in-time has been a long-standing challenge in accessibility. In this work, we develop WorldScribe, a system that generates automated live real-world visual descriptions that are customizable and adaptive to users' contexts: (i) WorldScribe's descriptions are tailored to users' intents and prioritized based on semantic relevance. (ii) WorldScribe is adaptive to visual contexts, e.g., providing consecutively succinct descriptions for dynamic scenes, while presenting longer and detailed ones for stable settings. (iii) WorldScribe is adaptive to sound contexts, e.g., increasing volume in noisy environments, or pausing when conversations start. Powered by a suite of vision, language, and sound recognition models, WorldScribe introduces a description generation pipeline that balances the tradeoffs between their richness and latency to support real-time use. The design of WorldScribe is informed by prior work on providing visual descriptions and a formative study with blind participants. Our user study and subsequent pipeline evaluation show that WorldScribe can provide real-time and fairly accurate visual descriptions to facilitate environment understanding that is adaptive and customized to users' contexts. Finally, we discuss the implications and further steps toward making live visual descriptions more context-aware and humanized.
△ Less
Submitted 13 August, 2024;
originally announced August 2024.
-
Eliminating Backdoors in Neural Code Models for Secure Code Understanding
Authors:
Weisong Sun,
Yuchen Chen,
Chunrong Fang,
Yebo Feng,
Yuan Xiao,
An Guo,
Quanjun Zhang,
Yang Liu,
Baowen Xu,
Zhenyu Chen
Abstract:
Neural code models (NCMs) have been widely used to address various code understanding tasks, such as defect detection. However, numerous recent studies reveal that such models are vulnerable to backdoor attacks. Backdoored NCMs function normally on normal/clean code snippets, but exhibit adversary-expected behavior on poisoned code snippets injected with the adversary-crafted trigger. It poses a s…
▽ More
Neural code models (NCMs) have been widely used to address various code understanding tasks, such as defect detection. However, numerous recent studies reveal that such models are vulnerable to backdoor attacks. Backdoored NCMs function normally on normal/clean code snippets, but exhibit adversary-expected behavior on poisoned code snippets injected with the adversary-crafted trigger. It poses a significant security threat. Therefore, there is an urgent need for effective techniques to detect and eliminate backdoors stealthily implanted in NCMs.
To address this issue, in this paper, we innovatively propose a backdoor elimination technique for secure code understanding, called EliBadCode. EliBadCode eliminates backdoors in NCMs by inverting/reverse-engineering and unlearning backdoor triggers. Specifically, EliBadCode first filters the model vocabulary for trigger tokens based on the naming conventions of specific programming languages to reduce the trigger search space and cost. Then, EliBadCode introduces a sample-specific trigger position identification method, which can reduce the interference of non-backdoor (adversarial) perturbations for subsequent trigger inversion, thereby producing effective inverted backdoor triggers efficiently. Backdoor triggers can be viewed as backdoor (adversarial) perturbations. Subsequently, EliBadCode employs a Greedy Coordinate Gradient algorithm to optimize the inverted trigger and designs a trigger anchoring method to purify the inverted trigger. Finally, EliBadCode eliminates backdoors through model unlearning. We evaluate the effectiveness of EliBadCode in eliminating backdoors implanted in multiple NCMs used for three safety-critical code understanding tasks. The results demonstrate that EliBadCode can effectively eliminate backdoors while having minimal adverse effects on the normal functionality of the model.
△ Less
Submitted 20 February, 2025; v1 submitted 8 August, 2024;
originally announced August 2024.
-
Gumbel-Softmax Discretization Constraint, Differentiable IDS Channel, and an IDS-Correcting Code for DNA Storage
Authors:
Alan J. X. Guo,
Mengyi Wei,
Yufan Dai,
Yali Wei,
Pengchen Zhang
Abstract:
Insertion, deletion, and substitution (IDS) error-correcting codes have garnered increased attention with recent advancements in DNA storage technology. However, a universal method for designing IDS-correcting codes across varying channel settings remains underexplored. We present an autoencoder-based method, THEA-code, aimed at efficiently generating IDS-correcting codes for complex IDS channels.…
▽ More
Insertion, deletion, and substitution (IDS) error-correcting codes have garnered increased attention with recent advancements in DNA storage technology. However, a universal method for designing IDS-correcting codes across varying channel settings remains underexplored. We present an autoencoder-based method, THEA-code, aimed at efficiently generating IDS-correcting codes for complex IDS channels. In the work, a Gumbel-Softmax discretization constraint is proposed to discretize the features of the autoencoder, and a simulated differentiable IDS channel is developed as a differentiable alternative for IDS operations. These innovations facilitate the successful convergence of the autoencoder, resulting in channel-customized IDS-correcting codes with commendable performance across complex IDS channels.
△ Less
Submitted 5 January, 2025; v1 submitted 10 July, 2024;
originally announced July 2024.
-
Towards scalable efficient on-device ASR with transfer learning
Authors:
Laxmi Pandey,
Ke Li,
Jinxi Guo,
Debjyoti Paul,
Arthur Guo,
Jay Mahadeokar,
Xuedong Zhang
Abstract:
Multilingual pretraining for transfer learning significantly boosts the robustness of low-resource monolingual ASR models. This study systematically investigates three main aspects: (a) the impact of transfer learning on model performance during initial training or fine-tuning, (b) the influence of transfer learning across dataset domains and languages, and (c) the effect on rare-word recognition…
▽ More
Multilingual pretraining for transfer learning significantly boosts the robustness of low-resource monolingual ASR models. This study systematically investigates three main aspects: (a) the impact of transfer learning on model performance during initial training or fine-tuning, (b) the influence of transfer learning across dataset domains and languages, and (c) the effect on rare-word recognition compared to non-rare words. Our finding suggests that RNNT-loss pretraining, followed by monolingual fine-tuning with Minimum Word Error Rate (MinWER) loss, consistently reduces Word Error Rates (WER) across languages like Italian and French. WER Reductions (WERR) reach 36.2% and 42.8% compared to monolingual baselines for MLS and in-house datasets. Out-of-domain pretraining leads to 28% higher WERR than in-domain pretraining. Both rare and non-rare words benefit, with rare words showing greater improvements with out-of-domain pretraining, and non-rare words with in-domain pretraining.
△ Less
Submitted 23 July, 2024;
originally announced July 2024.
-
LMM-enhanced Safety-Critical Scenario Generation for Autonomous Driving System Testing From Non-Accident Traffic Videos
Authors:
Haoxiang Tian,
Xingshuo Han,
Yuan Zhou,
Guoquan Wu,
An Guo,
Mingfei Cheng,
Shuo Li,
Jun Wei,
Tianwei Zhang
Abstract:
Safety testing serves as the fundamental pillar for the development of autonomous driving systems (ADSs). To ensure the safety of ADSs, it is paramount to generate a diverse range of safety-critical test scenarios. While existing ADS practitioners primarily focus on reproducing real-world traffic accidents in simulation environments to create test scenarios, it's essential to highlight that many o…
▽ More
Safety testing serves as the fundamental pillar for the development of autonomous driving systems (ADSs). To ensure the safety of ADSs, it is paramount to generate a diverse range of safety-critical test scenarios. While existing ADS practitioners primarily focus on reproducing real-world traffic accidents in simulation environments to create test scenarios, it's essential to highlight that many of these accidents do not directly result in safety violations for ADSs due to the differences between human driving and autonomous driving. More importantly, we observe that some accident-free real-world scenarios can not only lead to misbehaviors in ADSs but also be leveraged for the generation of ADS violations during simulation testing. Therefore, it is of significant importance to discover safety violations of ADSs from routine traffic scenarios (i.e., non-crash scenarios).
We introduce LEADE, a novel methodology to achieve the above goal. It automatically generates abstract and concrete scenarios from real-traffic videos. Then it optimizes these scenarios to search for safety violations of the ADS in semantically consistent scenarios where human-driving worked safely. Specifically, LEADE enhances the ability of Large Multimodal Models (LMMs) to accurately construct abstract scenarios from traffic videos and generate concrete scenarios by multi-modal few-shot Chain of Thought (CoT). Based on them, LEADE assesses and increases the behavior differences between the ego vehicle and human-driving in semantic equivalent scenarios (here equivalent semantics means that each participant in test scenarios has the same behaviors as those observed in the original real traffic scenarios). We implement and evaluate LEADE on the industrial-grade Level-4 ADS, Apollo.
△ Less
Submitted 1 January, 2025; v1 submitted 16 June, 2024;
originally announced June 2024.
-
Scrutinize What We Ignore: Reining In Task Representation Shift Of Context-Based Offline Meta Reinforcement Learning
Authors:
Hai Zhang,
Boyuan Zheng,
Tianying Ji,
Jinhang Liu,
Anqi Guo,
Junqiao Zhao,
Lanqing Li
Abstract:
Offline meta reinforcement learning (OMRL) has emerged as a promising approach for interaction avoidance and strong generalization performance by leveraging pre-collected data and meta-learning techniques. Previous context-based approaches predominantly rely on the intuition that alternating optimization between the context encoder and the policy can lead to performance improvements, as long as th…
▽ More
Offline meta reinforcement learning (OMRL) has emerged as a promising approach for interaction avoidance and strong generalization performance by leveraging pre-collected data and meta-learning techniques. Previous context-based approaches predominantly rely on the intuition that alternating optimization between the context encoder and the policy can lead to performance improvements, as long as the context encoder follows the principle of maximizing the mutual information between the task variable $M$ and its latent representation $Z$ ($I(Z;M)$) while the policy adopts the standard offline reinforcement learning (RL) algorithms conditioning on the learned task representation.Despite promising results, the theoretical justification of performance improvements for such intuition remains underexplored.Inspired by the return discrepancy scheme in the model-based RL field, we find that the previous optimization framework can be linked with the general RL objective of maximizing the expected return, thereby explaining performance improvements. Furthermore, after scrutinizing this optimization framework, we observe that the condition for monotonic performance improvements does not consider the variation of the task representation. When these variations are considered, the previously established condition may no longer be sufficient to ensure monotonicity, thereby impairing the optimization process.We name this issue task representation shift and theoretically prove that the monotonic performance improvements can be guaranteed with appropriate context encoder updates.Our work opens up a new avenue for OMRL, leading to a better understanding between the task representation and performance improvements.
△ Less
Submitted 2 February, 2025; v1 submitted 20 May, 2024;
originally announced May 2024.
-
PEVA-Net: Prompt-Enhanced View Aggregation Network for Zero/Few-Shot Multi-View 3D Shape Recognition
Authors:
Dongyun Lin,
Yi Cheng,
Shangbo Mao,
Aiyuan Guo,
Yiqun Li
Abstract:
Large vision-language models have impressively promote the performance of 2D visual recognition under zero/few-shot scenarios. In this paper, we focus on exploiting the large vision-language model, i.e., CLIP, to address zero/few-shot 3D shape recognition based on multi-view representations. The key challenge for both tasks is to generate a discriminative descriptor of the 3D shape represented by…
▽ More
Large vision-language models have impressively promote the performance of 2D visual recognition under zero/few-shot scenarios. In this paper, we focus on exploiting the large vision-language model, i.e., CLIP, to address zero/few-shot 3D shape recognition based on multi-view representations. The key challenge for both tasks is to generate a discriminative descriptor of the 3D shape represented by multiple view images under the scenarios of either without explicit training (zero-shot 3D shape recognition) or training with a limited number of data (few-shot 3D shape recognition). We analyze that both tasks are relevant and can be considered simultaneously. Specifically, leveraging the descriptor which is effective for zero-shot inference to guide the tuning of the aggregated descriptor under the few-shot training can significantly improve the few-shot learning efficacy. Hence, we propose Prompt-Enhanced View Aggregation Network (PEVA-Net) to simultaneously address zero/few-shot 3D shape recognition. Under the zero-shot scenario, we propose to leverage the prompts built up from candidate categories to enhance the aggregation process of multiple view-associated visual features. The resulting aggregated feature serves for effective zero-shot recognition of the 3D shapes. Under the few-shot scenario, we first exploit a transformer encoder to aggregate the view-associated visual features into a global descriptor. To tune the encoder, together with the main classification loss, we propose a self-distillation scheme via a feature distillation loss by treating the zero-shot descriptor as the guidance signal for the few-shot descriptor. This scheme can significantly enhance the few-shot learning efficacy.
△ Less
Submitted 29 April, 2024;
originally announced April 2024.
-
InternLM2 Technical Report
Authors:
Zheng Cai,
Maosong Cao,
Haojiong Chen,
Kai Chen,
Keyu Chen,
Xin Chen,
Xun Chen,
Zehui Chen,
Zhi Chen,
Pei Chu,
Xiaoyi Dong,
Haodong Duan,
Qi Fan,
Zhaoye Fei,
Yang Gao,
Jiaye Ge,
Chenya Gu,
Yuzhe Gu,
Tao Gui,
Aijia Guo,
Qipeng Guo,
Conghui He,
Yingfan Hu,
Ting Huang,
Tao Jiang
, et al. (75 additional authors not shown)
Abstract:
The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI). However, replicating such advancements in open-source models has been challenging. This paper introduces InternLM2, an open-source LLM that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks, long-context m…
▽ More
The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI). However, replicating such advancements in open-source models has been challenging. This paper introduces InternLM2, an open-source LLM that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks, long-context modeling, and open-ended subjective evaluations through innovative pre-training and optimization techniques. The pre-training process of InternLM2 is meticulously detailed, highlighting the preparation of diverse data types including text, code, and long-context data. InternLM2 efficiently captures long-term dependencies, initially trained on 4k tokens before advancing to 32k tokens in pre-training and fine-tuning stages, exhibiting remarkable performance on the 200k ``Needle-in-a-Haystack" test. InternLM2 is further aligned using Supervised Fine-Tuning (SFT) and a novel Conditional Online Reinforcement Learning from Human Feedback (COOL RLHF) strategy that addresses conflicting human preferences and reward hacking. By releasing InternLM2 models in different training stages and model sizes, we provide the community with insights into the model's evolution.
△ Less
Submitted 25 March, 2024;
originally announced March 2024.
-
Pre-trained Model-based Actionable Warning Identification: A Feasibility Study
Authors:
Xiuting Ge,
Chunrong Fang,
Quanjun Zhang,
Daoyuan Wu,
Bowen Yu,
Qirui Zheng,
An Guo,
Shangwei Lin,
Zhihong Zhao,
Yang Liu,
Zhenyu Chen
Abstract:
Actionable Warning Identification (AWI) plays a pivotal role in improving the usability of static code analyzers. Currently, Machine Learning (ML)-based AWI approaches, which mainly learn an AWI classifier from labeled warnings, are notably common. However, these approaches still face the problem of restricted performance due to the direct reliance on a limited number of labeled warnings to develo…
▽ More
Actionable Warning Identification (AWI) plays a pivotal role in improving the usability of static code analyzers. Currently, Machine Learning (ML)-based AWI approaches, which mainly learn an AWI classifier from labeled warnings, are notably common. However, these approaches still face the problem of restricted performance due to the direct reliance on a limited number of labeled warnings to develop a classifier. Very recently, Pre-Trained Models (PTMs), which have been trained through billions of text/code tokens and demonstrated substantial success applications on various code-related tasks, could potentially circumvent the above problem. Nevertheless, the performance of PTMs on AWI has not been systematically investigated, leaving a gap in understanding their pros and cons. In this paper, we are the first to explore the feasibility of applying various PTMs for AWI. By conducting the extensive evaluation on 10K+ SpotBugs warnings from 10 large-scale and open-source projects, we observe that all studied PTMs are consistently 9.85%~21.12% better than the state-of-the-art ML-based AWI approaches. Besides, we investigate the impact of three primary aspects (i.e., data preprocessing, model training, and model prediction) in the typical PTM-based AWI workflow. Further, we identify the reasons for current PTMs' underperformance on AWI. Based on our findings, we provide several practical guidelines to enhance PTM-based AWI in future work.
△ Less
Submitted 5 March, 2024;
originally announced March 2024.
-
Exploring the Impact of AI Value Alignment in Collaborative Ideation: Effects on Perception, Ownership, and Output
Authors:
Alicia Guo,
Pat Pataranutaporn,
Pattie Maes
Abstract:
AI-based virtual assistants are increasingly used to support daily ideation tasks. The values or bias present in these agents can influence output in hidden ways. They may also affect how people perceive the ideas produced with these AI agents and lead to implications for the design of AI-based tools. We explored the effects of AI agents with different values on the ideation process and user perce…
▽ More
AI-based virtual assistants are increasingly used to support daily ideation tasks. The values or bias present in these agents can influence output in hidden ways. They may also affect how people perceive the ideas produced with these AI agents and lead to implications for the design of AI-based tools. We explored the effects of AI agents with different values on the ideation process and user perception of idea quality, ownership, agent competence, and values present in the output. Our study tasked 180 participants with brainstorming practical solutions to a set of problems with AI agents of different values. Results show no significant difference in self-evaluation of idea quality and perception of the agent based on value alignment; however, ideas generated reflected the AI's values and feeling of ownership is affected. This highlights an intricate interplay between AI values and human ideation, suggesting careful design considerations for future AI-supported brainstorming tools.
△ Less
Submitted 22 April, 2024; v1 submitted 20 February, 2024;
originally announced February 2024.
-
To Store or Not to Store: a graph theoretical approach for Dataset Versioning
Authors:
Anxin Guo,
Jingwei Li,
Pattara Sukprasert,
Samir Khuller,
Amol Deshpande,
Koyel Mukherjee
Abstract:
In this work, we study the cost efficient data versioning problem, where the goal is to optimize the storage and reconstruction (retrieval) costs of data versions, given a graph of datasets as nodes and edges capturing edit/delta information. One central variant we study is MinSum Retrieval (MSR) where the goal is to minimize the total retrieval costs, while keeping the storage costs bounded. This…
▽ More
In this work, we study the cost efficient data versioning problem, where the goal is to optimize the storage and reconstruction (retrieval) costs of data versions, given a graph of datasets as nodes and edges capturing edit/delta information. One central variant we study is MinSum Retrieval (MSR) where the goal is to minimize the total retrieval costs, while keeping the storage costs bounded. This problem (along with its variants) was introduced by Bhattacherjee et al. [VLDB'15]. While such problems are frequently encountered in collaborative tools (e.g., version control systems and data analysis pipelines), to the best of our knowledge, no existing research studies the theoretical aspects of these problems.
We establish that the currently best-known heuristic, LMG, can perform arbitrarily badly in a simple worst case. Moreover, we show that it is hard to get $o(n)$-approximation for MSR on general graphs even if we relax the storage constraints by an $O(\log n)$ factor. Similar hardness results are shown for other variants. Meanwhile, we propose poly-time approximation schemes for tree-like graphs, motivated by the fact that the graphs arising in practice from typical edit operations are often not arbitrary. As version graphs typically have low treewidth, we further develop new algorithms for bounded treewidth graphs.
Furthermore, we propose two new heuristics and evaluate them empirically. First, we extend LMG by considering more potential ``moves'', to propose a new heuristic LMG-All. LMG-All consistently outperforms LMG while having comparable run time on a wide variety of datasets, i.e., version graphs. Secondly, we apply our tree algorithms on the minimum-storage arborescence of an instance, yielding algorithms that are qualitatively better than all previous heuristics for MSR, as well as for another variant BoundedMin Retrieval (BMR).
△ Less
Submitted 18 February, 2024;
originally announced February 2024.
-
ClickSAM: Fine-tuning Segment Anything Model using click prompts for ultrasound image segmentation
Authors:
Aimee Guo,
Grace Fei,
Hemanth Pasupuleti,
Jing Wang
Abstract:
The newly released Segment Anything Model (SAM) is a popular tool used in image processing due to its superior segmentation accuracy, variety of input prompts, training capabilities, and efficient model design. However, its current model is trained on a diverse dataset not tailored to medical images, particularly ultrasound images. Ultrasound images tend to have a lot of noise, making it difficult…
▽ More
The newly released Segment Anything Model (SAM) is a popular tool used in image processing due to its superior segmentation accuracy, variety of input prompts, training capabilities, and efficient model design. However, its current model is trained on a diverse dataset not tailored to medical images, particularly ultrasound images. Ultrasound images tend to have a lot of noise, making it difficult to segment out important structures. In this project, we developed ClickSAM, which fine-tunes the Segment Anything Model using click prompts for ultrasound images. ClickSAM has two stages of training: the first stage is trained on single-click prompts centered in the ground-truth contours, and the second stage focuses on improving the model performance through additional positive and negative click prompts. By comparing the first stage predictions to the ground-truth masks, true positive, false positive, and false negative segments are calculated. Positive clicks are generated using the true positive and false negative segments, and negative clicks are generated using the false positive segments. The Centroidal Voronoi Tessellation algorithm is then employed to collect positive and negative click prompts in each segment that are used to enhance the model performance during the second stage of training. With click-train methods, ClickSAM exhibits superior performance compared to other existing models for ultrasound image segmentation.
△ Less
Submitted 24 February, 2024; v1 submitted 8 February, 2024;
originally announced February 2024.
-
InteractOut: Leveraging Interaction Proxies as Input Manipulation Strategies for Reducing Smartphone Overuse
Authors:
Tao Lu,
Hongxiao Zheng,
Tianying Zhang,
Xuhai Xu,
Anhong Guo
Abstract:
Smartphone overuse poses risks to people's physical and mental health. However, current intervention techniques mainly focus on explicitly changing screen content (i.e., output) and often fail to persistently reduce smartphone overuse due to being over-restrictive or over-flexible. We present the design and implementation of InteractOut, a suite of implicit input manipulation techniques that lever…
▽ More
Smartphone overuse poses risks to people's physical and mental health. However, current intervention techniques mainly focus on explicitly changing screen content (i.e., output) and often fail to persistently reduce smartphone overuse due to being over-restrictive or over-flexible. We present the design and implementation of InteractOut, a suite of implicit input manipulation techniques that leverage interaction proxies to weakly inhibit the natural execution of common user gestures on mobile devices. We present a design space for input manipulations and demonstrate 8 Android implementations of input interventions. We first conducted a pilot lab study (N=30) to evaluate the usability of these interventions. Based on the results, we then performed a 5-week within-subject field experiment (N=42) to evaluate InteractOut in real-world scenarios. Compared to the traditional and common timed lockout technique, InteractOut significantly reduced the usage time by an additional 15.6% and opening frequency by 16.5% on participant-selected target apps. InteractOut also achieved a 25.3% higher user acceptance rate, and resulted in less frustration and better user experience according to participants' subjective feedback. InteractOut demonstrates a new direction for smartphone overuse intervention and serves as a strong complementary set of techniques with existing methods.
△ Less
Submitted 19 February, 2024; v1 submitted 29 January, 2024;
originally announced January 2024.
-
SoundShift: Exploring Sound Manipulations for Accessible Mixed-Reality Awareness
Authors:
Ruei-Che Chang,
Chia-Sheng Hung,
Bing-Yu Chen,
Dhruv Jain,
Anhong Guo
Abstract:
Mixed-reality (MR) soundscapes blend real-world sound with virtual audio from hearing devices, presenting intricate auditory information that is hard to discern and differentiate. This is particularly challenging for blind or visually impaired individuals, who rely on sounds and descriptions in their everyday lives. To understand how complex audio information is consumed, we analyzed online forum…
▽ More
Mixed-reality (MR) soundscapes blend real-world sound with virtual audio from hearing devices, presenting intricate auditory information that is hard to discern and differentiate. This is particularly challenging for blind or visually impaired individuals, who rely on sounds and descriptions in their everyday lives. To understand how complex audio information is consumed, we analyzed online forum posts within the blind community, identifying prevailing challenges, needs, and desired solutions. We synthesized the results and propose SoundShift for increasing MR sound awareness, which includes six sound manipulations: Transparency Shift, Envelope Shift, Position Shift, Style Shift, Time Shift, and Sound Append. To evaluate the effectiveness of SoundShift, we conducted a user study with 18 blind participants across three simulated MR scenarios, where participants identified specific sounds within intricate soundscapes. We found that SoundShift increased MR sound awareness and minimized cognitive load. Finally, we developed three real-world example applications to demonstrate the practicality of SoundShift.
△ Less
Submitted 26 May, 2024; v1 submitted 19 January, 2024;
originally announced January 2024.
-
Team Flow at DRC2023: Building Common Ground and Text-based Turn-taking in a Travel Agent Spoken Dialogue System
Authors:
Ryu Hirai,
Shinya Iizuka,
Haruhisa Iseno,
Ao Guo,
Jingjing Jiang,
Atsumoto Ohashi,
Ryuichiro Higashinaka
Abstract:
At the Dialogue Robot Competition 2023 (DRC2023), which was held to improve the capability of dialogue robots, our team developed a system that could build common ground and take more natural turns based on user utterance texts. Our system generated queries for sightseeing spot searches using the common ground and engaged in dialogue while waiting for user comprehension.
At the Dialogue Robot Competition 2023 (DRC2023), which was held to improve the capability of dialogue robots, our team developed a system that could build common ground and take more natural turns based on user utterance texts. Our system generated queries for sightseeing spot searches using the common ground and engaged in dialogue while waiting for user comprehension.
△ Less
Submitted 21 December, 2023;
originally announced December 2023.
-
DoDo-Code: a Deep Levenshtein Distance Embedding-based Code for IDS Channel and DNA Storage
Authors:
Alan J. X. Guo,
Sihan Sun,
Xiang Wei,
Mengyi Wei,
Xin Chen
Abstract:
Recently, DNA storage has emerged as a promising data storage solution, offering significant advantages in storage density, maintenance cost efficiency, and parallel replication capability. Mathematically, the DNA storage pipeline can be viewed as an insertion, deletion, and substitution (IDS) channel. Because of the mathematical terra incognita of the Levenshtein distance, designing an IDS-correc…
▽ More
Recently, DNA storage has emerged as a promising data storage solution, offering significant advantages in storage density, maintenance cost efficiency, and parallel replication capability. Mathematically, the DNA storage pipeline can be viewed as an insertion, deletion, and substitution (IDS) channel. Because of the mathematical terra incognita of the Levenshtein distance, designing an IDS-correcting code is still a challenge. In this paper, we propose an innovative approach that utilizes deep Levenshtein distance embedding to bypass these mathematical challenges. By representing the Levenshtein distance between two sequences as a conventional distance between their corresponding embedding vectors, the inherent structural property of Levenshtein distance is revealed in the friendly embedding space. Leveraging this embedding space, we introduce the DoDo-Code, an IDS-correcting code that incorporates deep embedding of Levenshtein distance, deep embedding-based codeword search, and deep embedding-based segment correcting. To address the requirements of DNA storage, we also present a preliminary algorithm for long sequence decoding. As far as we know, the DoDo-Code is the first IDS-correcting code designed using plausible deep learning methodologies, potentially paving the way for a new direction in error-correcting code research. It is also the first IDS code that exhibits characteristics of being `optimal' in terms of redundancy, significantly outperforming the mainstream IDS-correcting codes of the Varshamov-Tenengolts code family in code rate.
△ Less
Submitted 19 December, 2023;
originally announced December 2023.
-
Levenshtein Distance Embedding with Poisson Regression for DNA Storage
Authors:
Xiang Wei,
Alan J. X. Guo,
Sihan Sun,
Mengyi Wei,
Wei Yu
Abstract:
Efficient computation or approximation of Levenshtein distance, a widely-used metric for evaluating sequence similarity, has attracted significant attention with the emergence of DNA storage and other biological applications. Sequence embedding, which maps Levenshtein distance to a conventional distance between embedding vectors, has emerged as a promising solution. In this paper, a novel neural n…
▽ More
Efficient computation or approximation of Levenshtein distance, a widely-used metric for evaluating sequence similarity, has attracted significant attention with the emergence of DNA storage and other biological applications. Sequence embedding, which maps Levenshtein distance to a conventional distance between embedding vectors, has emerged as a promising solution. In this paper, a novel neural network-based sequence embedding technique using Poisson regression is proposed. We first provide a theoretical analysis of the impact of embedding dimension on model performance and present a criterion for selecting an appropriate embedding dimension. Under this embedding dimension, the Poisson regression is introduced by assuming the Levenshtein distance between sequences of fixed length following a Poisson distribution, which naturally aligns with the definition of Levenshtein distance. Moreover, from the perspective of the distribution of embedding distances, Poisson regression approximates the negative log likelihood of the chi-squared distribution and offers advancements in removing the skewness. Through comprehensive experiments on real DNA storage data, we demonstrate the superior performance of the proposed method compared to state-of-the-art approaches.
△ Less
Submitted 13 December, 2023;
originally announced December 2023.
-
Abstract Syntax Tree for Programming Language Understanding and Representation: How Far Are We?
Authors:
Weisong Sun,
Chunrong Fang,
Yun Miao,
Yudu You,
Mengzhe Yuan,
Yuchen Chen,
Quanjun Zhang,
An Guo,
Xiang Chen,
Yang Liu,
Zhenyu Chen
Abstract:
Programming language understanding and representation (a.k.a code representation learning) has always been a hot and challenging task in software engineering. It aims to apply deep learning techniques to produce numerical representations of the source code features while preserving its semantics. These representations can be used for facilitating subsequent code-related tasks. The abstract syntax…
▽ More
Programming language understanding and representation (a.k.a code representation learning) has always been a hot and challenging task in software engineering. It aims to apply deep learning techniques to produce numerical representations of the source code features while preserving its semantics. These representations can be used for facilitating subsequent code-related tasks. The abstract syntax tree (AST), a fundamental code feature, illustrates the syntactic information of the source code and has been widely used in code representation learning. However, there is still a lack of systematic and quantitative evaluation of how well AST-based code representation facilitates subsequent code-related tasks. In this paper, we first conduct a comprehensive empirical study to explore the effectiveness of the AST-based code representation in facilitating follow-up code-related tasks. To do so, we compare the performance of models trained with code token sequence (Token for short) based code representation and AST-based code representation on three popular types of code-related tasks. Surprisingly, the overall quantitative statistical results demonstrate that models trained with AST-based code representation consistently perform worse across all three tasks compared to models trained with Token-based code representation. Our further quantitative analysis reveals that models trained with AST-based code representation outperform models trained with Token-based code representation in certain subsets of samples across all three tasks. We also conduct comprehensive experiments to evaluate and reveal the impact of the choice of AST parsing/preprocessing/encoding methods on AST-based code representation and subsequent code-related tasks. Our study provides future researchers with detailed guidance on how to select solutions at each stage to fully exploit AST.
△ Less
Submitted 1 December, 2023;
originally announced December 2023.
-
The Blessing of Randomness: SDE Beats ODE in General Diffusion-based Image Editing
Authors:
Shen Nie,
Hanzhong Allan Guo,
Cheng Lu,
Yuhao Zhou,
Chenyu Zheng,
Chongxuan Li
Abstract:
We present a unified probabilistic formulation for diffusion-based image editing, where a latent variable is edited in a task-specific manner and generally deviates from the corresponding marginal distribution induced by the original stochastic or ordinary differential equation (SDE or ODE). Instead, it defines a corresponding SDE or ODE for editing. In the formulation, we prove that the Kullback-…
▽ More
We present a unified probabilistic formulation for diffusion-based image editing, where a latent variable is edited in a task-specific manner and generally deviates from the corresponding marginal distribution induced by the original stochastic or ordinary differential equation (SDE or ODE). Instead, it defines a corresponding SDE or ODE for editing. In the formulation, we prove that the Kullback-Leibler divergence between the marginal distributions of the two SDEs gradually decreases while that for the ODEs remains as the time approaches zero, which shows the promise of SDE in image editing. Inspired by it, we provide the SDE counterparts for widely used ODE baselines in various tasks including inpainting and image-to-image translation, where SDE shows a consistent and substantial improvement. Moreover, we propose SDE-Drag -- a simple yet effective method built upon the SDE formulation for point-based content dragging. We build a challenging benchmark (termed DragBench) with open-set natural, art, and AI-generated images for evaluation. A user study on DragBench indicates that SDE-Drag significantly outperforms our ODE baseline, existing diffusion-based methods, and the renowned DragGAN. Our results demonstrate the superiority and versatility of SDE in image editing and push the boundary of diffusion-based editing methods.
△ Less
Submitted 29 February, 2024; v1 submitted 2 November, 2023;
originally announced November 2023.
-
Reliable Generation of Privacy-preserving Synthetic Electronic Health Record Time Series via Diffusion Models
Authors:
Muhang Tian,
Bernie Chen,
Allan Guo,
Shiyi Jiang,
Anru R. Zhang
Abstract:
Electronic Health Records (EHRs) are rich sources of patient-level data, offering valuable resources for medical data analysis. However, privacy concerns often restrict access to EHRs, hindering downstream analysis. Current EHR de-identification methods are flawed and can lead to potential privacy leakage. Additionally, existing publicly available EHR databases are limited, preventing the advancem…
▽ More
Electronic Health Records (EHRs) are rich sources of patient-level data, offering valuable resources for medical data analysis. However, privacy concerns often restrict access to EHRs, hindering downstream analysis. Current EHR de-identification methods are flawed and can lead to potential privacy leakage. Additionally, existing publicly available EHR databases are limited, preventing the advancement of medical research using EHR. This study aims to overcome these challenges by generating realistic and privacy-preserving synthetic electronic health records (EHRs) time series efficiently. We introduce a new method for generating diverse and realistic synthetic EHR time series data using Denoising Diffusion Probabilistic Models (DDPM). We conducted experiments on six databases: Medical Information Mart for Intensive Care III and IV (MIMIC-III/IV), the eICU Collaborative Research Database (eICU), and non-EHR datasets on Stocks and Energy. We compared our proposed method with eight existing methods. Our results demonstrate that our approach significantly outperforms all existing methods in terms of data fidelity while requiring less training effort. Additionally, data generated by our method yields a lower discriminative accuracy compared to other baseline methods, indicating the proposed method can generate data with less privacy risk. The proposed diffusion-model-based method can reliably and efficiently generate synthetic EHR time series, which facilitates the downstream medical data analysis. Our numerical results show the superiority of the proposed method over all other existing methods.
△ Less
Submitted 2 December, 2024; v1 submitted 23 October, 2023;
originally announced October 2023.
-
Dual-path Transformer Based Neural Beamformer for Target Speech Extraction
Authors:
Aoqi Guo,
Sichong Qian,
Baoxiang Li,
Dazhi Gao
Abstract:
Neural beamformers, which integrate both pre-separation and beamforming modules, have demonstrated impressive effectiveness in target speech extraction. Nevertheless, the performance of these beamformers is inherently limited by the predictive accuracy of the pre-separation module. In this paper, we introduce a neural beamformer supported by a dual-path transformer. Initially, we employ the cross-…
▽ More
Neural beamformers, which integrate both pre-separation and beamforming modules, have demonstrated impressive effectiveness in target speech extraction. Nevertheless, the performance of these beamformers is inherently limited by the predictive accuracy of the pre-separation module. In this paper, we introduce a neural beamformer supported by a dual-path transformer. Initially, we employ the cross-attention mechanism in the time domain to extract crucial spatial information related to beamforming from the noisy covariance matrix. Subsequently, in the frequency domain, the self-attention mechanism is employed to enhance the model's ability to process frequency-specific details. By design, our model circumvents the influence of pre-separation modules, delivering performance in a more comprehensive end-to-end manner. Experimental results reveal that our model not only outperforms contemporary leading neural beamforming algorithms in separation performance but also achieves this with a significant reduction in parameter count.
△ Less
Submitted 7 September, 2023; v1 submitted 30 August, 2023;
originally announced August 2023.
-
HANDAL: A Dataset of Real-World Manipulable Object Categories with Pose Annotations, Affordances, and Reconstructions
Authors:
Andrew Guo,
Bowen Wen,
Jianhe Yuan,
Jonathan Tremblay,
Stephen Tyree,
Jeffrey Smith,
Stan Birchfield
Abstract:
We present the HANDAL dataset for category-level object pose estimation and affordance prediction. Unlike previous datasets, ours is focused on robotics-ready manipulable objects that are of the proper size and shape for functional grasping by robot manipulators, such as pliers, utensils, and screwdrivers. Our annotation process is streamlined, requiring only a single off-the-shelf camera and semi…
▽ More
We present the HANDAL dataset for category-level object pose estimation and affordance prediction. Unlike previous datasets, ours is focused on robotics-ready manipulable objects that are of the proper size and shape for functional grasping by robot manipulators, such as pliers, utensils, and screwdrivers. Our annotation process is streamlined, requiring only a single off-the-shelf camera and semi-automated processing, allowing us to produce high-quality 3D annotations without crowd-sourcing. The dataset consists of 308k annotated image frames from 2.2k videos of 212 real-world objects in 17 categories. We focus on hardware and kitchen tool objects to facilitate research in practical scenarios in which a robot manipulator needs to interact with the environment beyond simple pushing or indiscriminate grasping. We outline the usefulness of our dataset for 6-DoF category-level pose+scale estimation and related tasks. We also provide 3D reconstructed meshes of all objects, and we outline some of the bottlenecks to be addressed for democratizing the collection of datasets like this one.
△ Less
Submitted 2 August, 2023;
originally announced August 2023.
-
SCA-PVNet: Self-and-Cross Attention Based Aggregation of Point Cloud and Multi-View for 3D Object Retrieval
Authors:
Dongyun Lin,
Yi Cheng,
Aiyuan Guo,
Shangbo Mao,
Yiqun Li
Abstract:
To address 3D object retrieval, substantial efforts have been made to generate highly discriminative descriptors of 3D objects represented by a single modality, e.g., voxels, point clouds or multi-view images. It is promising to leverage the complementary information from multi-modality representations of 3D objects to further improve retrieval performance. However, multi-modality 3D object retrie…
▽ More
To address 3D object retrieval, substantial efforts have been made to generate highly discriminative descriptors of 3D objects represented by a single modality, e.g., voxels, point clouds or multi-view images. It is promising to leverage the complementary information from multi-modality representations of 3D objects to further improve retrieval performance. However, multi-modality 3D object retrieval is rarely developed and analyzed on large-scale datasets. In this paper, we propose self-and-cross attention based aggregation of point cloud and multi-view images (SCA-PVNet) for 3D object retrieval. With deep features extracted from point clouds and multi-view images, we design two types of feature aggregation modules, namely the In-Modality Aggregation Module (IMAM) and the Cross-Modality Aggregation Module (CMAM), for effective feature fusion. IMAM leverages a self-attention mechanism to aggregate multi-view features while CMAM exploits a cross-attention mechanism to interact point cloud features with multi-view features. The final descriptor of a 3D object for object retrieval can be obtained via concatenating the aggregated features from both modules. Extensive experiments and analysis are conducted on three datasets, ranging from small to large scale, to show the superiority of the proposed SCA-PVNet over the state-of-the-art methods.
△ Less
Submitted 29 November, 2023; v1 submitted 20 July, 2023;
originally announced July 2023.
-
Enhanced Neural Beamformer with Spatial Information for Target Speech Extraction
Authors:
Aoqi Guo,
Junnan Wu,
Peng Gao,
Wenbo Zhu,
Qinwen Guo,
Dazhi Gao,
Yujun Wang
Abstract:
Recently, deep learning-based beamforming algorithms have shown promising performance in target speech extraction tasks. However, most systems do not fully utilize spatial information. In this paper, we propose a target speech extraction network that utilizes spatial information to enhance the performance of neural beamformer. To achieve this, we first use the UNet-TCN structure to model input fea…
▽ More
Recently, deep learning-based beamforming algorithms have shown promising performance in target speech extraction tasks. However, most systems do not fully utilize spatial information. In this paper, we propose a target speech extraction network that utilizes spatial information to enhance the performance of neural beamformer. To achieve this, we first use the UNet-TCN structure to model input features and improve the estimation accuracy of the speech pre-separation module by avoiding information loss caused by direct dimensionality reduction in other models. Furthermore, we introduce a multi-head cross-attention mechanism that enhances the neural beamformer's perception of spatial information by making full use of the spatial information received by the array. Experimental results demonstrate that our approach, which incorporates a more reasonable target mask estimation network and a spatial information-based cross-attention mechanism into the neural beamformer, effectively improves speech separation performance.
△ Less
Submitted 28 June, 2023;
originally announced June 2023.
-
No-Regret Learning in Dynamic Competition with Reference Effects Under Logit Demand
Authors:
Mengzi Amy Guo,
Donghao Ying,
Javad Lavaei,
Zuo-Jun Max Shen
Abstract:
This work is dedicated to the algorithm design in a competitive framework, with the primary goal of learning a stable equilibrium. We consider the dynamic price competition between two firms operating within an opaque marketplace, where each firm lacks information about its competitor. The demand follows the multinomial logit (MNL) choice model, which depends on the consumers' observed price and t…
▽ More
This work is dedicated to the algorithm design in a competitive framework, with the primary goal of learning a stable equilibrium. We consider the dynamic price competition between two firms operating within an opaque marketplace, where each firm lacks information about its competitor. The demand follows the multinomial logit (MNL) choice model, which depends on the consumers' observed price and their reference price, and consecutive periods in the repeated games are connected by reference price updates. We use the notion of stationary Nash equilibrium (SNE), defined as the fixed point of the equilibrium pricing policy for the single-period game, to simultaneously capture the long-run market equilibrium and stability. We propose the online projected gradient ascent algorithm (OPGA), where the firms adjust prices using the first-order derivatives of their log-revenues that can be obtained from the market feedback mechanism. Despite the absence of typical properties required for the convergence of online games, such as strong monotonicity and variational stability, we demonstrate that under diminishing step-sizes, the price and reference price paths generated by OPGA converge to the unique SNE, thereby achieving the no-regret learning and a stable market. Moreover, with appropriate step-sizes, we prove that this convergence exhibits a rate of $\mathcal{O}(1/t)$.
△ Less
Submitted 27 May, 2023;
originally announced May 2023.
-
SLLEN: Semantic-aware Low-light Image Enhancement Network
Authors:
Mingye Ju,
Chuheng Chen,
Charles A. Guo,
Jinshan Pan,
Jinhui Tang,
Dacheng Tao
Abstract:
How to effectively explore semantic feature is vital for low-light image enhancement (LLE). Existing methods usually utilize the semantic feature that is only drawn from the output produced by high-level semantic segmentation (SS) network. However, if the output is not accurately estimated, it would affect the high-level semantic feature (HSF) extraction, which accordingly interferes with LLE. To…
▽ More
How to effectively explore semantic feature is vital for low-light image enhancement (LLE). Existing methods usually utilize the semantic feature that is only drawn from the output produced by high-level semantic segmentation (SS) network. However, if the output is not accurately estimated, it would affect the high-level semantic feature (HSF) extraction, which accordingly interferes with LLE. To this end, we develop a simple and effective semantic-aware LLE network (SSLEN) composed of a LLE main-network (LLEmN) and a SS auxiliary-network (SSaN). In SLLEN, LLEmN integrates the random intermediate embedding feature (IEF), i.e., the information extracted from the intermediate layer of SSaN, together with the HSF into a unified framework for better LLE. SSaN is designed to act as a SS role to provide HSF and IEF. Moreover, thanks to a shared encoder between LLEmN and SSaN, we further propose an alternating training mechanism to facilitate the collaboration between them. Unlike currently available approaches, the proposed SLLEN is able to fully lever the semantic information, e.g., IEF, HSF, and SS dataset, to assist LLE, thereby leading to a more promising enhancement performance. Comparisons between the proposed SLLEN and other state-of-the-art techniques demonstrate the superiority of SLLEN with respect to LLE quality over all the comparable alternatives.
△ Less
Submitted 21 October, 2024; v1 submitted 21 November, 2022;
originally announced November 2022.
-
Team Flow at DRC2022: Pipeline System for Travel Destination Recommendation Task in Spoken Dialogue
Authors:
Ryu Hirai,
Atsumoto Ohashi,
Ao Guo,
Hideki Shiroma,
Xulin Zhou,
Yukihiko Tone,
Shinya Iizuka,
Ryuichiro Higashinaka
Abstract:
To improve the interactive capabilities of a dialogue system, e.g., to adapt to different customers, the Dialogue Robot Competition (DRC2022) was held. As one of the teams, we built a dialogue system with a pipeline structure containing four modules. The natural language understanding (NLU) and natural language generation (NLG) modules were GPT-2 based models, and the dialogue state tracking (DST)…
▽ More
To improve the interactive capabilities of a dialogue system, e.g., to adapt to different customers, the Dialogue Robot Competition (DRC2022) was held. As one of the teams, we built a dialogue system with a pipeline structure containing four modules. The natural language understanding (NLU) and natural language generation (NLG) modules were GPT-2 based models, and the dialogue state tracking (DST) and policy modules were designed on the basis of hand-crafted rules. After the preliminary round of the competition, we found that the low variation in training examples for the NLU and failed recommendation due to the policy used were probably the main reasons for the limited performance of the system.
△ Less
Submitted 17 October, 2022;
originally announced October 2022.
-
GMN: Generative Multi-modal Network for Practical Document Information Extraction
Authors:
Haoyu Cao,
Jiefeng Ma,
Antai Guo,
Yiqing Hu,
Hao Liu,
Deqiang Jiang,
Yinsong Liu,
Bo Ren
Abstract:
Document Information Extraction (DIE) has attracted increasing attention due to its various advanced applications in the real world. Although recent literature has already achieved competitive results, these approaches usually fail when dealing with complex documents with noisy OCR results or mutative layouts. This paper proposes Generative Multi-modal Network (GMN) for real-world scenarios to add…
▽ More
Document Information Extraction (DIE) has attracted increasing attention due to its various advanced applications in the real world. Although recent literature has already achieved competitive results, these approaches usually fail when dealing with complex documents with noisy OCR results or mutative layouts. This paper proposes Generative Multi-modal Network (GMN) for real-world scenarios to address these problems, which is a robust multi-modal generation method without predefined label categories. With the carefully designed spatial encoder and modal-aware mask module, GMN can deal with complex documents that are hard to serialized into sequential order. Moreover, GMN tolerates errors in OCR results and requires no character-level annotation, which is vital because fine-grained annotation of numerous documents is laborious and even requires annotators with specialized domain knowledge. Extensive experiments show that GMN achieves new state-of-the-art performance on several public DIE datasets and surpasses other methods by a large margin, especially in realistic scenes.
△ Less
Submitted 11 July, 2022;
originally announced July 2022.
-
Deep Squared Euclidean Approximation to the Levenshtein Distance for DNA Storage
Authors:
Alan J. X. Guo,
Cong Liang,
Qing-Hu Hou
Abstract:
Storing information in DNA molecules is of great interest because of its advantages in longevity, high storage density, and low maintenance cost. A key step in the DNA storage pipeline is to efficiently cluster the retrieved DNA sequences according to their similarities. Levenshtein distance is the most suitable metric on the similarity between two DNA sequences, but it is inferior in terms of com…
▽ More
Storing information in DNA molecules is of great interest because of its advantages in longevity, high storage density, and low maintenance cost. A key step in the DNA storage pipeline is to efficiently cluster the retrieved DNA sequences according to their similarities. Levenshtein distance is the most suitable metric on the similarity between two DNA sequences, but it is inferior in terms of computational complexity and less compatible with mature clustering algorithms. In this work, we propose a novel deep squared Euclidean embedding for DNA sequences using Siamese neural network, squared Euclidean embedding, and chi-squared regression. The Levenshtein distance is approximated by the squared Euclidean distance between the embedding vectors, which is fast calculated and clustering algorithm friendly. The proposed approach is analyzed theoretically and experimentally. The results show that the proposed embedding is efficient and robust.
△ Less
Submitted 11 July, 2022;
originally announced July 2022.
-
H-GCN: A Graph Convolutional Network Accelerator on Versal ACAP Architecture
Authors:
Chengming Zhang,
Tong Geng,
Anqi Guo,
Jiannan Tian,
Martin Herbordt,
Ang Li,
Dingwen Tao
Abstract:
Graph Neural Networks (GNNs) have drawn tremendous attention due to their unique capability to extend Machine Learning (ML) approaches to applications broadly-defined as having unstructured data, especially graphs. Compared with other Machine Learning (ML) modalities, the acceleration of Graph Neural Networks (GNNs) is more challenging due to the irregularity and heterogeneity derived from graph t…
▽ More
Graph Neural Networks (GNNs) have drawn tremendous attention due to their unique capability to extend Machine Learning (ML) approaches to applications broadly-defined as having unstructured data, especially graphs. Compared with other Machine Learning (ML) modalities, the acceleration of Graph Neural Networks (GNNs) is more challenging due to the irregularity and heterogeneity derived from graph typologies. Existing efforts, however, have focused mainly on handling graphs' irregularity and have not studied their heterogeneity.
To this end we propose H-GCN, a PL (Programmable Logic) and AIE (AI Engine) based hybrid accelerator that leverages the emerging heterogeneity of Xilinx Versal Adaptive Compute Acceleration Platforms (ACAPs) to achieve high-performance GNN inference. In particular, H-GCN partitions each graph into three subgraphs based on its inherent heterogeneity, and processes them using PL and AIE, respectively. To further improve performance, we explore the sparsity support of AIE and develop an efficient density-aware method to automatically map tiles of sparse matrix-matrix multiplication (SpMM) onto the systolic tensor array. Compared with state-of-the-art GCN accelerators, H-GCN achieves, on average, speedups of 1.1~2.3X.
△ Less
Submitted 27 June, 2022;
originally announced June 2022.
-
Policy-based Primal-Dual Methods for Concave CMDP with Variance Reduction
Authors:
Donghao Ying,
Mengzi Amy Guo,
Hyunin Lee,
Yuhao Ding,
Javad Lavaei,
Zuo-Jun Max Shen
Abstract:
We study Concave Constrained Markov Decision Processes (Concave CMDPs) where both the objective and constraints are defined as concave functions of the state-action occupancy measure. We propose the Variance-Reduced Primal-Dual Policy Gradient Algorithm (VR-PDPG), which updates the primal variable via policy gradient ascent and the dual variable via projected sub-gradient descent. Despite the chal…
▽ More
We study Concave Constrained Markov Decision Processes (Concave CMDPs) where both the objective and constraints are defined as concave functions of the state-action occupancy measure. We propose the Variance-Reduced Primal-Dual Policy Gradient Algorithm (VR-PDPG), which updates the primal variable via policy gradient ascent and the dual variable via projected sub-gradient descent. Despite the challenges posed by the loss of additivity structure and the nonconcave nature of the problem, we establish the global convergence of VR-PDPG by exploiting a form of hidden concavity. In the exact setting, we prove an $O(T^{-1/3})$ convergence rate for both the average optimality gap and constraint violation, which further improves to $O(T^{-1/2})$ under strong concavity of the objective in the occupancy measure. In the sample-based setting, we demonstrate that VR-PDPG achieves an $\widetilde{O}(ε^{-4})$ sample complexity for $ε$-global optimality. Moreover, by incorporating a diminishing pessimistic term into the constraint, we show that VR-PDPG can attain a zero constraint violation without compromising the convergence rate of the optimality gap. Finally, we validate the effectiveness of our methods through numerical experiments.
△ Less
Submitted 26 May, 2024; v1 submitted 21 May, 2022;
originally announced May 2022.
-
AI-assisted Optimization of the ECCE Tracking System at the Electron Ion Collider
Authors:
C. Fanelli,
Z. Papandreou,
K. Suresh,
J. K. Adkins,
Y. Akiba,
A. Albataineh,
M. Amaryan,
I. C. Arsene,
C. Ayerbe Gayoso,
J. Bae,
X. Bai,
M. D. Baker,
M. Bashkanov,
R. Bellwied,
F. Benmokhtar,
V. Berdnikov,
J. C. Bernauer,
F. Bock,
W. Boeglin,
M. Borysova,
E. Brash,
P. Brindza,
W. J. Briscoe,
M. Brooks,
S. Bueltmann
, et al. (258 additional authors not shown)
Abstract:
The Electron-Ion Collider (EIC) is a cutting-edge accelerator facility that will study the nature of the "glue" that binds the building blocks of the visible matter in the universe. The proposed experiment will be realized at Brookhaven National Laboratory in approximately 10 years from now, with detector design and R&D currently ongoing. Notably, EIC is one of the first large-scale facilities to…
▽ More
The Electron-Ion Collider (EIC) is a cutting-edge accelerator facility that will study the nature of the "glue" that binds the building blocks of the visible matter in the universe. The proposed experiment will be realized at Brookhaven National Laboratory in approximately 10 years from now, with detector design and R&D currently ongoing. Notably, EIC is one of the first large-scale facilities to leverage Artificial Intelligence (AI) already starting from the design and R&D phases. The EIC Comprehensive Chromodynamics Experiment (ECCE) is a consortium that proposed a detector design based on a 1.5T solenoid. The EIC detector proposal review concluded that the ECCE design will serve as the reference design for an EIC detector. Herein we describe a comprehensive optimization of the ECCE tracker using AI. The work required a complex parametrization of the simulated detector system. Our approach dealt with an optimization problem in a multidimensional design space driven by multiple objectives that encode the detector performance, while satisfying several mechanical constraints. We describe our strategy and show results obtained for the ECCE tracking system. The AI-assisted design is agnostic to the simulation framework and can be extended to other sub-detectors or to a system of sub-detectors to further optimize the performance of the EIC detector.
△ Less
Submitted 19 May, 2022; v1 submitted 18 May, 2022;
originally announced May 2022.