Search | arXiv e-print repository

Personalized Safety in LLMs: A Benchmark and A Planning-Based Agent Approach

Authors: Yuchen Wu, Edward Sun, Kaijie Zhu, Jianxun Lian, Jose Hernandez-Orallo, Aylin Caliskan, Jindong Wang

Abstract: Large language models (LLMs) typically generate identical or similar responses for all users given the same prompt, posing serious safety risks in high-stakes applications where user vulnerabilities differ widely. Existing safety evaluations primarily rely on context-independent metrics - such as factuality, bias, or toxicity - overlooking the fact that the same response may carry divergent risks depending on the user's background or condition. We introduce personalized safety to fill this gap and present PENGUIN - a benchmark comprising 14,000 scenarios across seven sensitive domains with both context-rich and context-free variants. Evaluating six leading LLMs, we demonstrate that personalized user information significantly improves safety scores by 43.2%, confirming the effectiveness of personalization in safety alignment. However, not all context attributes contribute equally to safety enhancement. To address this, we develop RAISE - a training-free, two-stage agent framework that strategically acquires user-specific background. RAISE improves safety scores by up to 31.6% over six vanilla LLMs, while maintaining a low interaction cost of just 2.7 user queries on average. Our findings highlight the importance of selective information gathering in safety-critical domains and offer a practical solution for personalizing LLM responses without model retraining. This work establishes a foundation for safety research that adapts to individual user contexts rather than assuming a universal harm standard.

Submitted 29 May, 2025; v1 submitted 24 May, 2025; originally announced May 2025.

arXiv:2505.10151 [pdf, other]

Training People to Reward Robots

Authors: Endong Sun, Yuqing Zhu, Matthew Howard

Abstract: Learning from demonstration (LfD) is a technique that allows expert teachers to teach task-oriented skills to robotic systems. However, the most effective way of guiding novice teachers to approach expert-level demonstrations quantitatively for specific teaching tasks remains an open question. To this end, this paper investigates the use of machine teaching (MT) to guide novice teachers to improve their teaching skills based on reinforcement learning from demonstration (RLfD). The paper reports an experiment in which novices receive MT-derived guidance to train their ability to teach a given motor skill with only 8 demonstrations and generalise this to previously unseen ones. Results indicate that the MT-guidance not only enhances robot learning performance by 89% on the training skill but also causes a 70% improvement in robot learning performance on skills not seen by subjects during training. These findings highlight the effectiveness of MT-guidance in upskilling human teaching behaviours, ultimately improving demonstration quality in RLfD.

Submitted 15 May, 2025; originally announced May 2025.

Comments: 6 pages

arXiv:2504.08739 [pdf, other]

doi 10.1145/3701716.3717853

Enhancing Product Search Interfaces with Sketch-Guided Diffusion and Language Agents

Authors: Edward Sun

Abstract: The rapid progress in diffusion models, transformers, and language agents has unlocked new possibilities, yet their potential in user interfaces and commercial applications remains underexplored. We present Sketch-Search Agent, a novel framework that transforms the image search experience by integrating a multimodal language agent with freehand sketches as control signals for diffusion models. Using the T2I-Adapter, Sketch-Search Agent combines sketches and text prompts to generate high-quality query images, encoded via a CLIP image encoder for efficient matching against an image corpus. Unlike existing methods, Sketch-Search Agent requires minimal setup, no additional training, and excels in sketch-based image retrieval and natural language interactions. The multimodal agent enhances user experience by dynamically retaining preferences, ranking results, and refining queries for personalized recommendations. This interactive design empowers users to create sketches and receive tailored product suggestions, showcasing the potential of diffusion models in user-centric image retrieval. Experiments confirm Sketch-Search Agent's high accuracy in delivering relevant product search results.

Submitted 21 March, 2025; originally announced April 2025.

Comments: Companion Proceedings of the ACM Web Conference 2025

arXiv:2504.00901 [pdf, other]

A Decade of Deep Learning for Remote Sensing Spatiotemporal Fusion: Advances, Challenges, and Opportunities

Authors: Enzhe Sun, Yongchuan Cui, Peng Liu, Jining Yan

Abstract: Hardware limitations and satellite launch costs make direct acquisition of high temporal-spatial resolution remote sensing imagery challenging. Remote sensing spatiotemporal fusion (STF) technology addresses this problem by merging high temporal but low spatial resolution imagery with high spatial but low temporal resolution imagery to efficiently generate high spatiotemporal resolution satellite images. STF provides unprecedented observational capabilities for land surface change monitoring, agricultural management, and environmental research. Deep learning (DL) methods have revolutionized the remote sensing spatiotemporal fusion field over the past decade through powerful automatic feature extraction and nonlinear modeling capabilities, significantly outperforming traditional methods in handling complex spatiotemporal data. Despite the rapid development of DL-based remote sensing STF, the community lacks a systematic review of this quickly evolving field. This paper comprehensively reviews DL developments in remote sensing STF over the last decade, analyzing key research trends, method classifications, commonly used datasets, and evaluation metrics. It discusses major challenges in existing research and identifies promising future research directions as references for researchers in this field to inspire new ideas. The specific models, datasets, and other information mentioned in this article have been collected in: https://github.com/yc-cui/Deep-Learning-Spatiotemporal-Fusion-Survey.

Submitted 1 April, 2025; originally announced April 2025.

arXiv:2503.19456 [pdf, ps, other]

Online Stochastic Matching with Unknown Arrival Order: Beating $0.5$ against the Online Optimum

Authors: Enze Sun, Zhihao Gavin Tang, Yifan Wang

Abstract: We study the online stochastic matching problem. Against the offline benchmark, Feldman, Gravin, and Lucier (SODA 2015) designed an optimal $0.5$-competitive algorithm. A recent line of work, initiated by Papadimitriou, Pollner, Saberi, and Wajc (MOR 2024), focuses on designing approximation algorithms against the online optimum. The online benchmark allows positive results surpassing the $0.5$ ratio. In this work, adapting the order-competitive analysis by Ezra, Feldman, Gravin, and Tang (SODA 2023), we design a $0.5+\Omega(1)$ order-competitive algorithm against the online benchmark with unknown arrival order. Our algorithm is significantly different from existing ones, as the known arrival order is crucial to the previous approximation algorithms.

Submitted 25 March, 2025; originally announced March 2025.

Comments: To appear in the 57th Annual ACM Symposium on Theory of Computing (STOC 2025)

arXiv:2503.18684 [pdf, other]

Efficient Continual Adaptation of Pretrained Robotic Policy with Online Meta-Learned Adapters

Authors: Ruiqi Zhu, Endong Sun, Guanhe Huang, Oya Celiktutan

Abstract: Continual adaptation is essential for general autonomous agents. For example, a household robot pretrained with a repertoire of skills must still adapt to unseen tasks specific to each household. Motivated by this, building upon parameter-efficient fine-tuning in language models, prior works have explored lightweight adapters to adapt pretrained policies, which can preserve learned features from the pretraining phase and demonstrate good adaptation performances. However, these approaches treat task learning separately, limiting knowledge transfer between tasks. In this paper, we propose Online Meta-Learned adapters (OMLA). Instead of applying adapters directly, OMLA can facilitate knowledge transfer from previously learned tasks to current learning tasks through a novel meta-learning objective. Extensive experiments in both simulated and real-world environments demonstrate that OMLA can lead to better adaptation performances compared to the baseline methods. The project link: https://ricky-zhu.github.io/OMLA/.

Submitted 27 March, 2025; v1 submitted 24 March, 2025; originally announced March 2025.

Comments: Project link: https://ricky-zhu.github.io/OMLA/

arXiv:2412.20138 [pdf, ps, other]

TradingAgents: Multi-Agents LLM Financial Trading Framework

Authors: Yijia Xiao, Edward Sun, Di Luo, Wei Wang

Abstract: Significant progress has been made in automated problem-solving using societies of agents powered by large language models (LLMs). In finance, efforts have largely focused on single-agent systems handling specific tasks or multi-agent frameworks independently gathering data. However, the multi-agent systems' potential to replicate real-world trading firms' collaborative dynamics remains underexplored. TradingAgents proposes a novel stock trading framework inspired by trading firms, featuring LLM-powered agents in specialized roles such as fundamental analysts, sentiment analysts, technical analysts, and traders with varied risk profiles. The framework includes Bull and Bear researcher agents assessing market conditions, a risk management team monitoring exposure, and traders synthesizing insights from debates and historical data to make informed decisions. By simulating a dynamic, collaborative trading environment, this framework aims to improve trading performance. Detailed architecture and extensive experiments reveal its superiority over baseline models, with notable improvements in cumulative returns, Sharpe ratio, and maximum drawdown, highlighting the potential of multi-agent LLM frameworks in financial trading. TradingAgents is available at https://github.com/TauricResearch/TradingAgents.

Submitted 3 June, 2025; v1 submitted 28 December, 2024; originally announced December 2024.

Comments: Tauric Research @ https://github.com/TauricResearch; Oral @ Multi-Agent AI in the Real World

arXiv:2412.07386 [pdf, other]

Algorithmic Phase Transitions in Language Models: A Mechanistic Case Study of Arithmetic

Authors: Alan Sun, Ethan Sun, Warren Shepard

Abstract: Zero-shot capabilities of large language models make them powerful tools for solving a range of tasks without explicit training. It remains unclear, however, how these models achieve such performance, or why they can zero-shot some tasks but not others. In this paper, we shed some light on this phenomenon by defining and investigating algorithmic stability in language models -- changes in problem-solving strategy employed by the model as a result of changes in task specification. We focus on a task where algorithmic stability is needed for generalization: two-operand arithmetic. Surprisingly, we find that Gemma-2-2b employs substantially different computational models on closely related subtasks, i.e. four-digit versus eight-digit addition. Our findings suggest that algorithmic instability may be a contributing factor to language models' poor zero-shot performance across certain logical reasoning tasks, as they struggle to abstract different problem-solving strategies and smoothly transition between them.

Submitted 10 December, 2024; originally announced December 2024.

Comments: 10 pages, 5 figures

arXiv:2411.08900 [pdf, other]

RNA-GPT: Multimodal Generative System for RNA Sequence Understanding

Authors: Yijia Xiao, Edward Sun, Yiqiao Jin, Wei Wang

Abstract: RNAs are essential molecules that carry genetic information vital for life, with profound implications for drug development and biotechnology. Despite this importance, RNA research is often hindered by the vast literature available on the topic. To streamline this process, we introduce RNA-GPT, a multi-modal RNA chat model designed to simplify RNA discovery by leveraging extensive RNA literature. RNA-GPT integrates RNA sequence encoders with linear projection layers and state-of-the-art large language models (LLMs) for precise representation alignment, enabling it to process user-uploaded RNA sequences and deliver concise, accurate responses. Built on a scalable training pipeline, RNA-GPT utilizes RNA-QA, an automated system that gathers RNA annotations from RNACentral using a divide-and-conquer approach with GPT-4o and latent Dirichlet allocation (LDA) to efficiently handle large datasets and generate instruction-tuning samples. Our experiments indicate that RNA-GPT effectively addresses complex RNA queries, thereby facilitating RNA research. Additionally, we present RNA-QA, a dataset of 407,616 RNA samples for modality alignment and instruction tuning, further advancing the potential of RNA research tools.

Submitted 29 October, 2024; originally announced November 2024.

Comments: Machine Learning for Structural Biology Workshop, NeurIPS 2024

arXiv:2411.08349 [pdf]

Flexible Thermoelectric Active Cooling Garment to Combat Extreme Heat

Authors: Tianshi Feng, Jiedong Wang, Ethan Sun, Antonio Di Buono, Renkun Chen

Abstract: With the increasing frequency, intensity, and duration of extreme heat events due to climate change, heat-related diseases or even mortality have become more prevalent. An efficient personal cooling strategy can mitigate heat stress by regulating the skin temperature within the thermal comfort zone. However, lightweight, wearable, and sustainable cooling garments are unavailable today. Here, we developed a TED-based cooling garment and demonstrated its effectiveness in active personal cooling. The garment is shown to maintain the skin temperature within its thermal comfort zone in a hot environment of up to 40 °C under mild forced convection conditions (air flow speed of 2.2 m s$^{-1}$). Furthermore, we demonstrated a portable cooling system with less than 700 grams of total weight, which includes the TED-based garment, a battery pack, and a temperature controller. The system showed long-term cooling on the skin with varying ambient temperatures from 35 to 40 °C. With the advantages of lightweight, flexible, controllable and long-term effective cooling, the TED cooling garments described in this work can contribute to enhanced health and comfort in an increasingly hotter climate.

Submitted 1 December, 2024; v1 submitted 13 November, 2024; originally announced November 2024.

arXiv:2410.21790 [pdf, other]

Reconstructing East Asian Temperatures from 1368 to 1911 Using Historical Documents, Climate Models, and Data Assimilation

Authors: Eric Sun, Kuan-hui Elaine Lin, Wan-Ling Tseng, Pao K. Wang, Hsin-Cheng Huang

Abstract: We propose a novel approach for reconstructing annual temperatures in East Asia from 1368 to 1911, leveraging the Reconstructed East Asian Climate Historical Encoded Series (REACHES). The lack of instrumental data during this period poses significant challenges to understanding past climate conditions. REACHES digitizes historical documents from the Ming and Qing dynasties of China, converting qualitative descriptions into a four-level ordinal temperature scale. However, these index-based data are biased toward abnormal or extreme weather phenomena, leading to data gaps that likely correspond to normal conditions. To address this bias and reconstruct historical temperatures at any point within East Asia, including locations without direct historical data, we employ a three-tiered statistical framework. First, we perform kriging to interpolate temperature data across East Asia, adopting a zero-mean assumption to handle missing information. Next, we utilize the Last Millennium Ensemble (LME) reanalysis data and apply quantile mapping to calibrate the kriged REACHES data to Celsius temperature scales. Finally, we introduce a novel Bayesian data assimilation method that integrates the kriged Celsius data with LME simulations to enhance reconstruction accuracy. We model the LME data at each geographic location using a flexible nonstationary autoregressive time series model and employ regularized maximum likelihood estimation with a fused lasso penalty. The resulting dynamic distribution serves as a prior, which is refined via Kalman filtering by incorporating the kriged Celsius REACHES data to yield posterior temperature estimates. This comprehensive integration of historical documentation, contemporary climate models, and advanced statistical methods improves the accuracy of historical temperature reconstructions and provides a crucial resource for future environmental and climate studies.

Submitted 18 January, 2025; v1 submitted 29 October, 2024; originally announced October 2024.

Comments: 28 pages, 16 figures, 1 table

MSC Class: 62P12

arXiv:2410.10238 [pdf, other]

ForgeryGPT: Multimodal Large Language Model For Explainable Image Forgery Detection and Localization

Authors: Jiawei Liu, Fanrui Zhang, Jiaying Zhu, Esther Sun, Qiang Zhang, Zheng-Jun Zha

Abstract: Multimodal Large Language Models (MLLMs), such as GPT4o, have shown strong capabilities in visual reasoning and explanation generation. However, despite these strengths, they face significant challenges in the increasingly critical task of Image Forgery Detection and Localization (IFDL). Moreover, existing IFDL methods are typically limited to the learning of low-level semantic-agnostic clues and merely provide a single outcome judgment. To tackle these issues, we propose ForgeryGPT, a novel framework that advances the IFDL task by capturing high-order forensics knowledge correlations of forged images from diverse linguistic feature spaces, while enabling explainable generation and interactive dialogue through a newly customized Large Language Model (LLM) architecture. Specifically, ForgeryGPT enhances traditional LLMs by integrating the Mask-Aware Forgery Extractor, which enables the extraction of precise forgery mask information from input images and facilitates pixel-level understanding of tampering artifacts. The Mask-Aware Forgery Extractor consists of a Forgery Localization Expert (FL-Expert) and a Mask Encoder, where the FL-Expert is augmented with an Object-agnostic Forgery Prompt and a Vocabulary-enhanced Vision Encoder, allowing effective capture of multi-scale fine-grained forgery details. To enhance its performance, we implement a three-stage training strategy, supported by our designed Mask-Text Alignment and IFDL Task-Specific Instruction Tuning datasets, which align vision-language modalities and improve forgery detection and instruction-following capabilities. Extensive experiments demonstrate the effectiveness of the proposed method.

Submitted 6 January, 2025; v1 submitted 14 October, 2024; originally announced October 2024.

Comments: 16 pages, 14 figures

arXiv:2409.16449 [pdf, ps, other]

doi 10.1117/12.3018522

Beyond CCDs: Characterization of sCMOS detectors for optical astronomy

Authors: Aditya Khandelwal, Sarik Jeram, Ryan Dungee, Albert W. K. Lau, Allison Lau, Ethen Sun, Phil Van-Lane, Shaojie Chen, Aaron Tohuvavohu, Ting S. Li

Abstract: Modern scientific complementary metal-oxide semiconductor (sCMOS) detectors provide a highly competitive alternative to charge-coupled devices (CCDs), the latter of which have historically been dominant in optical imaging. sCMOS detectors boast performance comparable to CCDs, with faster frame rates, lower read noise, and a higher dynamic range. Furthermore, their lower production costs are shifting the industry to abandon CCD support and production in favour of CMOS, making their characterization urgent. In this work, we characterized a variety of high-end commercially available sCMOS detectors to gauge the state of this technology in the context of applications in optical astronomy. We evaluated a range of sCMOS detectors, including larger pixel models such as the Teledyne Prime 95B and the Andor Sona-11, which are similar to CCDs in pixel size and suitable for wide-field astronomy. Additionally, we assessed smaller pixel detectors like the Ximea xiJ and Andor Sona-6, which are better suited for deep-sky imaging. Furthermore, high-sensitivity quantitative sCMOS detectors such as the Hamamatsu Orca-Quest C15550-20UP, capable of resolving individual photoelectrons, were also tested. In-lab testing showed low levels of dark current, read noise, faulty pixels, and fixed pattern noise, as well as linearity levels above $98\%$ across all detectors. The Orca-Quest had particularly low noise levels with a dark current of $0.0067 \pm 0.0003$ e$^-$/s (at $-20^\circ$C with air cooling) and a read noise of $0.37 \pm 0.09$ e$^-$ using its standard readout mode. Our tests revealed that the latest generation of sCMOS detectors excels in optical imaging performance, offering a more accessible alternative to CCDs for future optical astronomy instruments.

Submitted 6 December, 2024; v1 submitted 24 September, 2024; originally announced September 2024.

Comments: SPIE Astronomical Telescopes + Instrumentation, Proceedings Volume 13103, X-Ray, Optical, and Infrared Detectors for Astronomy XI; 131030R (2024)

arXiv:2409.15563 [pdf, other]

Using Machine Teaching to Boost Novices' Robot Teaching Skill

Authors: Yuqing Zhu, Endong Sun, Matthew Howard

Abstract: Recent evidence has shown that, contrary to expectations, it is difficult for users, especially novices, to teach robots tasks through LfD. This paper introduces a framework that leverages MT algorithms to train novices to become better teachers of robots, and verifies whether such teaching ability is retained beyond the period of training and generalises such that novices teach robots more effectively, even for skills for which training has not been received. A between-subjects study is reported, in which novice teachers are asked to teach simple motor skills to a robot. The results demonstrate that subjects who receive training show an average 78.83% improvement in teaching ability (as measured by accuracy of the skill learnt by the robot), and an average 63.69% improvement in the teaching of new skills not included as part of the training.

Submitted 23 September, 2024; originally announced September 2024.

arXiv:2409.13913 [pdf, other]

Target word activity detector: An approach to obtain ASR word boundaries without lexicon

Authors: Sunit Sivasankaran, Eric Sun, Jinyu Li, Yan Huang, Jing Pan

Abstract: Obtaining word timestamp information from end-to-end (E2E) ASR models remains challenging due to the lack of explicit time alignment during training. This issue is further complicated in multilingual models. Existing methods either rely on lexicons or introduce additional tokens, leading to scalability issues and increased computational costs. In this work, we propose a new approach to estimate word boundaries without relying on lexicons. Our method leverages word embeddings from sub-word token units and a pretrained ASR model, requiring only word alignment information during training. Our proposed method can scale up to any number of languages without incurring any additional cost. We validate our approach using a multilingual ASR model trained on five languages and demonstrate its effectiveness against a strong baseline.

Submitted 20 September, 2024; originally announced September 2024.

Comments: Submitted to ICASSP 2025

arXiv:2408.12524 [pdf, ps, other]

Stochastic Online Correlated Selection

Authors: Ziyun Chen, Zhiyi Huang, Enze Sun

Abstract: We study Stochastic Online Correlated Selection (SOCS), a family of online rounding algorithms for Non-IID Stochastic Online Submodular Welfare Maximization and special cases such as Online Stochastic Matching, Stochastic AdWords, and Stochastic Display Ads. At each step, the algorithm sees an online item's type and fractional allocation, then immediately allocates it to an agent. We propose a metric called the convergence rate for the quality of SOCS. This is cleaner than most metrics in the OCS literature. We propose a Type Decomposition that reduces SOCS to the two-way special case. First, we sample a surrogate type with half-integer allocation. The rounding is trivial for a one-way type fully allocated to an agent. For a two-way type split equally between two agents, we round it using two-way SOCS. We design the distribution of surrogate types to get two-way types as often as possible while respecting the original fractional allocation in expectation. Following this framework, we make progress on numerous problems: 1) Online Stochastic Matching: We improve the state-of-the-art $0.666$ competitive ratio for unweighted/vertex-weighted matching to $0.69$. 2) Query-Commit Matching: We enhance the ratio to $0.705$ in the Query-Commit model, improving the best previous $0.696$ and $0.662$ for unweighted and vertex-weighted matching. 3) Stochastic AdWords: We give a $0.6338$ competitive algorithm, breaking the $1-\frac{1}{e}$ barrier and answering a decade-old open question. 4) AdWords: The framework applies to the adversarial model if the rounding is oblivious to future items' distributions. We get the first multi-way OCS for AdWords, addressing an open question about OCS. This gives a $0.504$ competitive ratio for AdWords, improving the previous $0.501$. 5) Stochastic Display Ads: We design a $0.644$ competitive algorithm, breaking the $1-\frac{1}{e}$ barrier.

Submitted 22 August, 2024; originally announced August 2024.

arXiv:2408.11363

ProteinGPT: Multimodal LLM for Protein Property Prediction and Structure Understanding

Authors: Yijia Xiao, Edward Sun, Yiqiao Jin, Qifan Wang, Wei Wang

Abstract: Understanding biological processes, drug development, and biotechnological advancements requires a detailed analysis of protein structures and functions, a task that is inherently complex and time-consuming in traditional protein research. To streamline this process, we introduce ProteinGPT, a state-of-the-art multimodal large language model for proteins that enables users to upload protein sequences and/or structures for comprehensive analysis and responsive inquiries. ProteinGPT integrates protein sequence and structure encoders with linear projection layers to ensure precise representation adaptation and leverages a large language model (LLM) to generate accurate, contextually relevant responses. To train ProteinGPT, we constructed a large-scale dataset of 132,092 proteins, each annotated with 20-30 property tags and 5-10 QA pairs per protein, and optimized the instruction-tuning process using GPT-4o. Experiments demonstrate that ProteinGPT effectively generates informative responses to protein-related questions, achieving high performance on both semantic and lexical metrics and significantly outperforming baseline models and general-purpose LLMs in understanding and responding to protein-related queries. Our code and data are available at https://github.com/ProteinGPT/ProteinGPT.

Submitted 17 April, 2025; v1 submitted 21 August, 2024; originally announced August 2024.

Comments: Spotlight, Machine Learning for Genomics Explorations @ ICLR 2025

arXiv:2407.14212

Braille-to-Speech Generator: Audio Generation Based on Joint Fine-Tuning of CLIP and Fastspeech2

Authors: Chun Xu, En-Wei Sun

Abstract: An increasing number of Chinese people are troubled by different degrees of visual impairment, which has made the modal conversion between a single image or video frame in the visual field and the audio expressing the same information a research hotspot. Deep learning technologies such as OCR+Vocoder and Im2Wav enable English audio synthesis or image-to-sound matching in a self-supervised manner. However, the audio data used for training is limited and English is not universal for visually impaired people with different educational levels. Therefore, to address the problems of data volume and language applicability and to improve the reading efficiency of visually impaired people, an image-to-speech framework, CLIP-KNN-Fastspeech2, was constructed for the Chinese context. The framework integrates multiple basic models and adopts the strategy of independent pre-training and joint fine-tuning. First, the Chinese CLIP and Fastspeech2 text-to-speech models were pre-trained on two public datasets, MUGE and Baker, respectively, and their convergence was verified. Subsequently, joint fine-tuning was performed using a self-built Braille image dataset. Experimental results on multiple public datasets such as VGGSound, Flickr8k, ImageHear, and the self-built Braille dataset BIT-DP show that the model has improved objective indicators such as BLEU4, FAD (Fréchet Audio Distance), WER (Word Error Ratio), and even inference speed. This verifies that the constructed model can still synthesize high-quality speech under limited data, and also demonstrates the effectiveness of the joint training strategy that integrates multiple basic models.

Submitted 19 July, 2024; originally announced July 2024.

arXiv:2407.04973

LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts

Authors: Yijia Xiao, Edward Sun, Tianyu Liu, Wei Wang

Abstract: We propose LogicVista, an evaluation benchmark that assesses the integrated logical reasoning capabilities of multimodal large language models (MLLMs) in visual contexts. Recent advancements in MLLMs have demonstrated various fascinating abilities, from crafting poetry based on an image to performing mathematical reasoning. However, there is still a lack of systematic evaluation of MLLMs' proficiency in logical reasoning tasks, which are essential for activities like navigation and puzzle-solving. Thus we evaluate general logical cognition abilities across 5 logical reasoning tasks encompassing 9 different capabilities, using a sample of 448 multiple-choice questions. Each question is annotated with the correct answer and the human-written reasoning behind the selection, enabling both open-ended and multiple-choice evaluation. A total of 8 MLLMs are comprehensively evaluated using LogicVista. Code and data are available at https://github.com/Yijia-Xiao/LogicVista.

Submitted 6 July, 2024; originally announced July 2024.

Comments: LogicVista benchmarks the logical reasoning of multimodal large language models in visual tasks

arXiv:2405.02243

Towards Improving Learning from Demonstration Algorithms via MCMC Methods

Authors: Carl Qi, Edward Sun, Harry Zhang

Abstract: Behavioral cloning, or more broadly, learning from demonstrations (LfD), is a promising direction for robot policy learning in complex scenarios. Although straightforward to implement and data-efficient, behavioral cloning has its own drawbacks, limiting its efficacy in real robot setups. In this work, we take one step towards improving learning from demonstration algorithms by leveraging implicit energy-based policy models. Results suggest that in selected complex robot policy learning scenarios, treating supervised policy learning with an implicit model generally performs better, on average, than commonly used neural network-based explicit models, especially in the cases of approximating potentially discontinuous and multimodal functions.

Submitted 23 May, 2024; v1 submitted 3 May, 2024; originally announced May 2024.

Comments: arXiv admin note: text overlap with arXiv:2207.04638, arXiv:2204.03597 by other authors

arXiv:2312.17673

Jatmo: Prompt Injection Defense by Task-Specific Finetuning

Authors: Julien Piet, Maha Alrashed, Chawin Sitawarin, Sizhe Chen, Zeming Wei, Elizabeth Sun, Basel Alomair, David Wagner

Abstract: Large Language Models (LLMs) are attracting significant research attention due to their instruction-following abilities, allowing users and developers to leverage LLMs for a variety of tasks. However, LLMs are vulnerable to prompt-injection attacks: a class of attacks that hijack the model's instruction-following abilities, changing responses to prompts to undesired, possibly malicious ones. In this work, we introduce Jatmo, a method for generating task-specific models resilient to prompt-injection attacks. Jatmo leverages the fact that LLMs can only follow instructions once they have undergone instruction tuning. It harnesses a teacher instruction-tuned model to generate a task-specific dataset, which is then used to fine-tune a base model (i.e., a non-instruction-tuned model). Jatmo only needs a task prompt and a dataset of inputs for the task: it uses the teacher model to generate outputs. For situations with no pre-existing datasets, Jatmo can use a single example, or in some cases none at all, to produce a fully synthetic dataset. Our experiments on seven tasks show that Jatmo models provide similar quality of outputs on their specific task as standard LLMs, while being resilient to prompt injections. The best attacks succeeded in less than 0.5% of cases against our models, versus 87% success rate against GPT-3.5-Turbo. We release Jatmo at https://github.com/wagner-group/prompt-injection-defense.

Submitted 8 January, 2024; v1 submitted 29 December, 2023; originally announced December 2023.

Comments: 24 pages, 6 figures

arXiv:2311.05623

The 4m International Liquid Mirror Telescope: a brief history and some preliminary scientific results

Authors: Jean Surdej, Bhavya Ailawadhi, Talat Akhunov, Ermanno Borra, Monalisa Dubey, Naveen Dukiya, Jiuyang Fu, Baldeep Grewal, Paul Hickson, Brajesh Kumar, Kuntal Misra, Vibhore Negi, Anna Pospieszalska-Surdej, Kumar Pranshu, Ethen Sun

Abstract: The present article is based upon an invited talk delivered on the occasion of the inauguration of the 4m International Liquid Mirror Telescope (ILMT), which took place in Devasthal (ARIES, Uttarakhand, India) on 21 March 2023. We present a short history of liquid mirror telescopes and in particular of the 4m ILMT, which is the first liquid mirror telescope entirely dedicated to astrophysical observations. We discuss a few preliminary scientific results and illustrate some direct CCD images taken during the first commissioning phase of the telescope. We invite the reader to refer to the series of ILMT poster papers published in these same proceedings of the BINA3 workshop for more details about the instrument, operation, first observations, performance and scientific results.