-
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Authors:
Gheorghe Comanici,
Eric Bieber,
Mike Schaekermann,
Ice Pasupat,
Noveen Sachdeva,
Inderjit Dhillon,
Marcel Blistein,
Ori Ram,
Dan Zhang,
Evan Rosen,
Luke Marris,
Sam Petulla,
Colin Gaffney,
Asaf Aharoni,
Nathan Lintz,
Tiago Cardal Pais,
Henrik Jacobsson,
Idan Szpektor,
Nan-Jiang Jiang,
Krishna Haridasan,
Ahmed Omran,
Nikunj Saunshi,
Dara Bahri,
Gaurav Mishra,
Eric Chu
, et al. (3278 additional authors not shown)
Abstract:
In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal unde…
▽ More
In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.
△ Less
Submitted 7 July, 2025;
originally announced July 2025.
-
Depth Anything at Any Condition
Authors:
Boyuan Sun,
Modi Jin,
Bowen Yin,
Qibin Hou
Abstract:
We present Depth Anything at Any Condition (DepthAnything-AC), a foundation monocular depth estimation (MDE) model capable of handling diverse environmental conditions. Previous foundation MDE models achieve impressive performance across general scenes but not perform well in complex open-world environments that involve challenging conditions, such as illumination variations, adverse weather, and…
▽ More
We present Depth Anything at Any Condition (DepthAnything-AC), a foundation monocular depth estimation (MDE) model capable of handling diverse environmental conditions. Previous foundation MDE models achieve impressive performance across general scenes but not perform well in complex open-world environments that involve challenging conditions, such as illumination variations, adverse weather, and sensor-induced distortions. To overcome the challenges of data scarcity and the inability of generating high-quality pseudo-labels from corrupted images, we propose an unsupervised consistency regularization finetuning paradigm that requires only a relatively small amount of unlabeled data. Furthermore, we propose the Spatial Distance Constraint to explicitly enforce the model to learn patch-level relative relationships, resulting in clearer semantic boundaries and more accurate details. Experimental results demonstrate the zero-shot capabilities of DepthAnything-AC across diverse benchmarks, including real-world adverse weather benchmarks, synthetic corruption benchmarks, and general benchmarks.
Project Page: https://ghost233lism.github.io/depthanything-AC-page
Code: https://github.com/HVision-NKU/DepthAnythingAC
△ Less
Submitted 2 July, 2025;
originally announced July 2025.
-
LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs
Authors:
Boyuan Sun,
Jiaxing Zhao,
Xihan Wei,
Qibin Hou
Abstract:
In this paper, we present LLaVA-Scissor, a training-free token compression strategy designed for video multimodal large language models. Previous methods mostly attempt to compress tokens based on attention scores, but fail to effectively capture all semantic regions and often lead to token redundancy. Differently, we propose to leverage the Semantic Connected Components (SCC) approach that assign…
▽ More
In this paper, we present LLaVA-Scissor, a training-free token compression strategy designed for video multimodal large language models. Previous methods mostly attempt to compress tokens based on attention scores, but fail to effectively capture all semantic regions and often lead to token redundancy. Differently, we propose to leverage the Semantic Connected Components (SCC) approach that assigns tokens to distinct semantic regions within the token set, ensuring comprehensive semantic coverage. The outcome is a two-step spatio-temporal token compression strategy that utilizes SCC in both spatial and temporal domains. This strategy can effectively compress tokens by representing the entire video with a set of non-overlapping semantic tokens. We conduct extensive evaluations of the token compression capabilities of LLaVA-Scissor across diverse video understanding benchmarks, including video question answering, long video understanding, and comprehensive multi-choices benchmarks. Experimental results show that the proposed LLaVA-Scissor outperforms other token compression methods, achieving superior performance in various video understanding benchmarks, particularly at low token retention ratios. Project page: https://github.com/HumanMLLM/LLaVA-Scissor.
△ Less
Submitted 26 June, 2025;
originally announced June 2025.
-
HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context
Authors:
Qize Yang,
Shimin Yao,
Weixuan Chen,
Shenghao Fu,
Detao Bai,
Jiaxing Zhao,
Boyuan Sun,
Bowen Yin,
Xihan Wei,
Jingren Zhou
Abstract:
With the rapid evolution of multimodal large language models, the capacity to deeply understand and interpret human intentions has emerged as a critical capability, which demands detailed and thoughtful reasoning. In recent studies, Reinforcement Learning (RL) has demonstrated potential in enhancing the reasoning capabilities of Large Language Models (LLMs). Nonetheless, the challenges associated…
▽ More
With the rapid evolution of multimodal large language models, the capacity to deeply understand and interpret human intentions has emerged as a critical capability, which demands detailed and thoughtful reasoning. In recent studies, Reinforcement Learning (RL) has demonstrated potential in enhancing the reasoning capabilities of Large Language Models (LLMs). Nonetheless, the challenges associated with adapting RL to multimodal data and formats remain largely unaddressed. In this paper, we identify two issues in existing multimodal reasoning models: insufficient global context understanding and shortcut problems. Insufficient context understanding can happen when a model misinterprets multimodal context, resulting in incorrect answers. The shortcut problem occurs when the model overlooks crucial clues in multimodal inputs, directly addressing the query without considering the multimodal information. To tackle these issues, we emphasize the necessity for the model to reason with a clear understanding of the global context within multimodal inputs. This global context understanding can effectively prevent the model from overlooking key multimodal cues and ensure a thorough reasoning process. To ensure the accurate interpretation of multimodal context information, we implement a context reward judged by a large language model, alongside format and accuracy rewards. Additionally, to improve complex reasoning capability, we employ the LLM to assess the logical reward, determining whether the reasoning process successfully integrates multimodal information with logical methods. We also introduce a reasoning omni-modal benchmark, IntentBench, aimed at evaluating models in understanding complex human intentions and emotions. Our proposed method demonstrates advanced performance across multiple omni-modal benchmarks compared to other open-source omni-modal models.
△ Less
Submitted 26 June, 2025;
originally announced June 2025.
-
Multimodal Prompt Alignment for Facial Expression Recognition
Authors:
Fuyan Ma,
Yiran He,
Bin Sun,
Shutao Li
Abstract:
Prompt learning has been widely adopted to efficiently adapt vision-language models (VLMs) like CLIP for various downstream tasks. Despite their success, current VLM-based facial expression recognition (FER) methods struggle to capture fine-grained textual-visual relationships, which are essential for distinguishing subtle differences between facial expressions. To address this challenge, we propo…
▽ More
Prompt learning has been widely adopted to efficiently adapt vision-language models (VLMs) like CLIP for various downstream tasks. Despite their success, current VLM-based facial expression recognition (FER) methods struggle to capture fine-grained textual-visual relationships, which are essential for distinguishing subtle differences between facial expressions. To address this challenge, we propose a multimodal prompt alignment framework for FER, called MPA-FER, that provides fine-grained semantic guidance to the learning process of prompted visual features, resulting in more precise and interpretable representations. Specifically, we introduce a multi-granularity hard prompt generation strategy that utilizes a large language model (LLM) like ChatGPT to generate detailed descriptions for each facial expression. The LLM-based external knowledge is injected into the soft prompts by minimizing the feature discrepancy between the soft prompts and the hard prompts. To preserve the generalization abilities of the pretrained CLIP model, our approach incorporates prototype-guided visual feature alignment, ensuring that the prompted visual features from the frozen image encoder align closely with class-specific prototypes. Additionally, we propose a cross-modal global-local alignment module that focuses on expression-relevant facial features, further improving the alignment between textual and visual features. Extensive experiments demonstrate our framework outperforms state-of-the-art methods on three FER benchmark datasets, while retaining the benefits of the pretrained model and minimizing computational costs.
△ Less
Submitted 26 June, 2025;
originally announced June 2025.
-
SceneCrafter: Controllable Multi-View Driving Scene Editing
Authors:
Zehao Zhu,
Yuliang Zou,
Chiyu Max Jiang,
Bo Sun,
Vincent Casser,
Xiukun Huang,
Jiahao Wang,
Zhenpei Yang,
Ruiqi Gao,
Leonidas Guibas,
Mingxing Tan,
Dragomir Anguelov
Abstract:
Simulation is crucial for developing and evaluating autonomous vehicle (AV) systems. Recent literature builds on a new generation of generative models to synthesize highly realistic images for full-stack simulation. However, purely synthetically generated scenes are not grounded in reality and have difficulty in inspiring confidence in the relevance of its outcomes. Editing models, on the other ha…
▽ More
Simulation is crucial for developing and evaluating autonomous vehicle (AV) systems. Recent literature builds on a new generation of generative models to synthesize highly realistic images for full-stack simulation. However, purely synthetically generated scenes are not grounded in reality and have difficulty in inspiring confidence in the relevance of its outcomes. Editing models, on the other hand, leverage source scenes from real driving logs, and enable the simulation of different traffic layouts, behaviors, and operating conditions such as weather and time of day. While image editing is an established topic in computer vision, it presents fresh sets of challenges in driving simulation: (1) the need for cross-camera 3D consistency, (2) learning ``empty street" priors from driving data with foreground occlusions, and (3) obtaining paired image tuples of varied editing conditions while preserving consistent layout and geometry. To address these challenges, we propose SceneCrafter, a versatile editor for realistic 3D-consistent manipulation of driving scenes captured from multiple cameras. We build on recent advancements in multi-view diffusion models, using a fully controllable framework that scales seamlessly to multi-modality conditions like weather, time of day, agent boxes and high-definition maps. To generate paired data for supervising the editing model, we propose a novel framework on top of Prompt-to-Prompt to generate geometrically consistent synthetic paired data with global edits. We also introduce an alpha-blending framework to synthesize data with local edits, leveraging a model trained on empty street priors through novel masked training and multi-view repaint paradigm. SceneCrafter demonstrates powerful editing capabilities and achieves state-of-the-art realism, controllability, 3D consistency, and scene editing quality compared to existing baselines.
△ Less
Submitted 24 June, 2025;
originally announced June 2025.
-
Optimization of Flying Ad Hoc Network Topology and Collaborative Path Planning for Multiple UAVs
Authors:
Ming He,
Peizhao Wang,
Haihua Chen,
Bin Sun,
Hongpeng Wang
Abstract:
Multiple unmanned aerial vehicles (UAVs) play a vital role in monitoring and data collection in wide area environments with harsh conditions. In most scenarios, issues such as real-time data retrieval and real-time UAV positioning are often disregarded, essentially neglecting the communication constraints. In this paper, we comprehensively address both the coverage of the target area and the data…
▽ More
Multiple unmanned aerial vehicles (UAVs) play a vital role in monitoring and data collection in wide area environments with harsh conditions. In most scenarios, issues such as real-time data retrieval and real-time UAV positioning are often disregarded, essentially neglecting the communication constraints. In this paper, we comprehensively address both the coverage of the target area and the data transmission capabilities of the flying ad hoc network (FANET). The data throughput of the network is therefore maximized by optimizing the network topology and the UAV trajectories. The resultant optimization problem is effectively solved by the proposed reinforcement learning-based trajectory planning (RL-TP) algorithm and the convex-based topology optimization (C-TOP) algorithm sequentially. The RL-TP optimizes the UAV paths while considering the constraints of FANET. The C-TOP maximizes the data throughput of the network while simultaneously constraining the neighbors and transmit powers of the UAVs, which is shown to be a convex problem that can be efficiently solved in polynomial time. Simulations and field experimental results show that the proposed optimization strategy can effectively plan the UAV trajectories and significantly improve the data throughput of the FANET over the adaptive local minimum spanning tree (A-LMST) and cyclic pruning-assisted power optimization (CPAPO) methods.
△ Less
Submitted 22 June, 2025;
originally announced June 2025.
-
PCaM: A Progressive Focus Attention-Based Information Fusion Method for Improving Vision Transformer Domain Adaptation
Authors:
Zelin Zang,
Fei Wang,
Liangyu Li,
Jinlin Wu,
Chunshui Zhao,
Zhen Lei,
Baigui Sun
Abstract:
Unsupervised Domain Adaptation (UDA) aims to transfer knowledge from a labeled source domain to an unlabeled target domain. Recent UDA methods based on Vision Transformers (ViTs) have achieved strong performance through attention-based feature alignment. However, we identify a key limitation: foreground object mismatch, where the discrepancy in foreground object size and spatial distribution acros…
▽ More
Unsupervised Domain Adaptation (UDA) aims to transfer knowledge from a labeled source domain to an unlabeled target domain. Recent UDA methods based on Vision Transformers (ViTs) have achieved strong performance through attention-based feature alignment. However, we identify a key limitation: foreground object mismatch, where the discrepancy in foreground object size and spatial distribution across domains weakens attention consistency and hampers effective domain alignment. To address this issue, we propose the Progressive Focus Cross-Attention Mechanism (PCaM), which progressively filters out background information during cross-attention, allowing the model to focus on and fuse discriminative foreground semantics across domains. We further introduce an attentional guidance loss that explicitly directs attention toward task-relevant regions, enhancing cross-domain attention consistency. PCaM is lightweight, architecture-agnostic, and easy to integrate into existing ViT-based UDA pipelines. Extensive experiments on Office-Home, DomainNet, VisDA-2017, and remote sensing datasets demonstrate that PCaM significantly improves adaptation performance and achieves new state-of-the-art results, validating the effectiveness of attention-guided foreground fusion for domain adaptation.
△ Less
Submitted 27 May, 2025;
originally announced June 2025.
-
Robust Reinforcement Learning for Discrete Compositional Generation via General Soft Operators
Authors:
Marco Jiralerspong,
Esther Derman,
Danilo Vucetic,
Nikolay Malkin,
Bilun Sun,
Tianyu Zhang,
Pierre-Luc Bacon,
Gauthier Gidel
Abstract:
A major bottleneck in scientific discovery involves narrowing a large combinatorial set of objects, such as proteins or molecules, to a small set of promising candidates. While this process largely relies on expert knowledge, recent methods leverage reinforcement learning (RL) to enhance this filtering. They achieve this by estimating proxy reward functions from available datasets and using regula…
▽ More
A major bottleneck in scientific discovery involves narrowing a large combinatorial set of objects, such as proteins or molecules, to a small set of promising candidates. While this process largely relies on expert knowledge, recent methods leverage reinforcement learning (RL) to enhance this filtering. They achieve this by estimating proxy reward functions from available datasets and using regularization to generate more diverse candidates. These reward functions are inherently uncertain, raising a particularly salient challenge for scientific discovery. In this work, we show that existing methods, often framed as sampling proportional to a reward function, are inadequate and yield suboptimal candidates, especially in large search spaces. To remedy this issue, we take a robust RL approach and introduce a unified operator that seeks robustness to the uncertainty of the proxy reward function. This general operator targets peakier sampling distributions while encompassing known soft RL operators. It also leads us to a novel algorithm that identifies higher-quality, diverse candidates in both synthetic and real-world tasks. Ultimately, our work offers a new, flexible perspective on discrete compositional generation tasks. Code: https://github.com/marcojira/tgm.
△ Less
Submitted 20 June, 2025;
originally announced June 2025.
-
The Amazon Nova Family of Models: Technical Report and Model Card
Authors:
Amazon AGI,
Aaron Langford,
Aayush Shah,
Abhanshu Gupta,
Abhimanyu Bhatter,
Abhinav Goyal,
Abhinav Mathur,
Abhinav Mohanty,
Abhishek Kumar,
Abhishek Sethi,
Abi Komma,
Abner Pena,
Achin Jain,
Adam Kunysz,
Adam Opyrchal,
Adarsh Singh,
Aditya Rawal,
Adok Achar Budihal Prasad,
AdriĆ de Gispert,
Agnika Kumar,
Aishwarya Aryamane,
Ajay Nair,
Akilan M,
Akshaya Iyengar,
Akshaya Vishnu Kudlu Shanbhogue
, et al. (761 additional authors not shown)
Abstract:
We present Amazon Nova, a new generation of state-of-the-art foundation models that deliver frontier intelligence and industry-leading price performance. Amazon Nova Pro is a highly-capable multimodal model with the best combination of accuracy, speed, and cost for a wide range of tasks. Amazon Nova Lite is a low-cost multimodal model that is lightning fast for processing images, video, documents…
▽ More
We present Amazon Nova, a new generation of state-of-the-art foundation models that deliver frontier intelligence and industry-leading price performance. Amazon Nova Pro is a highly-capable multimodal model with the best combination of accuracy, speed, and cost for a wide range of tasks. Amazon Nova Lite is a low-cost multimodal model that is lightning fast for processing images, video, documents and text. Amazon Nova Micro is a text-only model that delivers our lowest-latency responses at very low cost. Amazon Nova Canvas is an image generation model that creates professional grade images with rich customization controls. Amazon Nova Reel is a video generation model offering high-quality outputs, customization, and motion control. Our models were built responsibly and with a commitment to customer trust, security, and reliability. We report benchmarking results for core capabilities, agentic performance, long context, functional adaptation, runtime performance, and human evaluation.
△ Less
Submitted 17 March, 2025;
originally announced June 2025.
-
FocalAD: Local Motion Planning for End-to-End Autonomous Driving
Authors:
Bin Sun,
Boao Zhang,
Jiayi Lu,
Xinjie Feng,
Jiachen Shang,
Rui Cao,
Mengchao Zheng,
Chuanye Wang,
Shichun Yang,
Yaoguang Cao,
Ziying Song
Abstract:
In end-to-end autonomous driving,the motion prediction plays a pivotal role in ego-vehicle planning. However, existing methods often rely on globally aggregated motion features, ignoring the fact that planning decisions are primarily influenced by a small number of locally interacting agents. Failing to attend to these critical local interactions can obscure potential risks and undermine planning…
▽ More
In end-to-end autonomous driving,the motion prediction plays a pivotal role in ego-vehicle planning. However, existing methods often rely on globally aggregated motion features, ignoring the fact that planning decisions are primarily influenced by a small number of locally interacting agents. Failing to attend to these critical local interactions can obscure potential risks and undermine planning reliability. In this work, we propose FocalAD, a novel end-to-end autonomous driving framework that focuses on critical local neighbors and refines planning by enhancing local motion representations. Specifically, FocalAD comprises two core modules: the Ego-Local-Agents Interactor (ELAI) and the Focal-Local-Agents Loss (FLA Loss). ELAI conducts a graph-based ego-centric interaction representation that captures motion dynamics with local neighbors to enhance both ego planning and agent motion queries. FLA Loss increases the weights of decision-critical neighboring agents, guiding the model to prioritize those more relevant to planning. Extensive experiments show that FocalAD outperforms existing state-of-the-art methods on the open-loop nuScenes datasets and closed-loop Bench2Drive benchmark. Notably, on the robustness-focused Adv-nuScenes dataset, FocalAD achieves even greater improvements, reducing the average colilision rate by 41.9% compared to DiffusionDrive and by 15.6% compared to SparseDrive.
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
A Sample Efficient Conditional Independence Test in the Presence of Discretization
Authors:
Boyang Sun,
Yu Yao,
Xinshuai Dong,
Zongfang Liu,
Tongliang Liu,
Yumou Qiu,
Kun Zhang
Abstract:
In many real-world scenarios, interested variables are often represented as discretized values due to measurement limitations. Applying Conditional Independence (CI) tests directly to such discretized data, however, can lead to incorrect conclusions. To address this, recent advancements have sought to infer the correct CI relationship between the latent variables through binarizing observed data.…
▽ More
In many real-world scenarios, interested variables are often represented as discretized values due to measurement limitations. Applying Conditional Independence (CI) tests directly to such discretized data, however, can lead to incorrect conclusions. To address this, recent advancements have sought to infer the correct CI relationship between the latent variables through binarizing observed data. However, this process inevitably results in a loss of information, which degrades the test's performance. Motivated by this, this paper introduces a sample-efficient CI test that does not rely on the binarization process. We find that the independence relationships of latent continuous variables can be established by addressing an over-identifying restriction problem with Generalized Method of Moments (GMM). Based on this insight, we derive an appropriate test statistic and establish its asymptotic distribution correctly reflecting CI by leveraging nodewise regression. Theoretical findings and Empirical results across various datasets demonstrate that the superiority and effectiveness of our proposed test. Our code implementation is provided in https://github.com/boyangaaaaa/DCT
△ Less
Submitted 10 June, 2025;
originally announced June 2025.
-
Optimal patient allocation for echocardiographic assessments
Authors:
Bozhi Sun,
Seda Tierney,
Jeffrey A. Feinstein,
Frederick Damen,
Alison L. Marsden,
Daniele E. Schiavazzi
Abstract:
Scheduling echocardiographic exams in a hospital presents significant challenges due to non-deterministic factors (e.g., patient no-shows, patient arrival times, diverse exam durations, etc.) and asymmetric resource constraints between fetal and non-fetal patient streams. To address these challenges, we first conducted extensive pre-processing on one week of operational data from the Echo Laborato…
▽ More
Scheduling echocardiographic exams in a hospital presents significant challenges due to non-deterministic factors (e.g., patient no-shows, patient arrival times, diverse exam durations, etc.) and asymmetric resource constraints between fetal and non-fetal patient streams. To address these challenges, we first conducted extensive pre-processing on one week of operational data from the Echo Laboratory at Stanford University's Lucile Packard Children's Hospital, to estimate patient no-show probabilities and derive empirical distributions of arrival times and exam durations. Based on these inputs, we developed a discrete-event stochastic simulation model using SimPy, and integrate it with the open source Gymnasium Python library. As a baseline for policy optimization, we developed a comparative framework to evaluate on-the-fly versus reservation-based allocation strategies, in which different proportions of resources are reserved in advance. Considering a hospital configuration with a 1:6 ratio of fetal to non-fetal rooms and a 4:2 ratio of fetal to non-fetal sonographers, we show that on-the-fly allocation generally yields better performance, more effectively adapting to patient variability and resource constraints. Building on this foundation, we apply reinforcement learning (RL) to derive an approximated optimal dynamic allocation policy. This RL-based policy is benchmarked against the best-performing rule-based strategies, allowing us to quantify their differences and provide actionable insights for improving echo lab efficiency through intelligent, data-driven resource management.
△ Less
Submitted 17 May, 2025;
originally announced June 2025.
-
OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation
Authors:
Ziyi Wang,
Yuxuan Lu,
Wenbo Li,
Amirali Amini,
Bo Sun,
Yakov Bart,
Weimin Lyu,
Jiri Gesi,
Tian Wang,
Jing Huang,
Yu Su,
Upol Ehsan,
Malihe Alikhani,
Toby Jia-Jun Li,
Lydia Chilton,
Dakuo Wang
Abstract:
Can large language models (LLMs) accurately simulate the next web action of a specific user? While LLMs have shown promising capabilities in generating ``believable'' human behaviors, evaluating their ability to mimic real user behaviors remains an open challenge, largely due to the lack of high-quality, publicly available datasets that capture both the observable actions and the internal reasonin…
▽ More
Can large language models (LLMs) accurately simulate the next web action of a specific user? While LLMs have shown promising capabilities in generating ``believable'' human behaviors, evaluating their ability to mimic real user behaviors remains an open challenge, largely due to the lack of high-quality, publicly available datasets that capture both the observable actions and the internal reasoning of an actual human user. To address this gap, we introduce OPERA, a novel dataset of Observation, Persona, Rationale, and Action collected from real human participants during online shopping sessions. OPERA is the first public dataset that comprehensively captures: user personas, browser observations, fine-grained web actions, and self-reported just-in-time rationales. We developed both an online questionnaire and a custom browser plugin to gather this dataset with high fidelity. Using OPERA, we establish the first benchmark to evaluate how well current LLMs can predict a specific user's next action and rationale with a given persona and <observation, action, rationale> history. This dataset lays the groundwork for future research into LLM agents that aim to act as personalized digital twins for human.
△ Less
Submitted 7 July, 2025; v1 submitted 5 June, 2025;
originally announced June 2025.
-
Enhancing Efficiency and Propulsion in Bio-mimetic Robotic Fish through End-to-End Deep Reinforcement Learning
Authors:
Xinyu Cui,
Boai Sun,
Yi Zhu,
Ning Yang,
Haifeng Zhang,
Weicheng Cui,
Dixia Fan,
Jun Wang
Abstract:
Aquatic organisms are known for their ability to generate efficient propulsion with low energy expenditure. While existing research has sought to leverage bio-inspired structures to reduce energy costs in underwater robotics, the crucial role of control policies in enhancing efficiency has often been overlooked. In this study, we optimize the motion of a bio-mimetic robotic fish using deep reinfor…
▽ More
Aquatic organisms are known for their ability to generate efficient propulsion with low energy expenditure. While existing research has sought to leverage bio-inspired structures to reduce energy costs in underwater robotics, the crucial role of control policies in enhancing efficiency has often been overlooked. In this study, we optimize the motion of a bio-mimetic robotic fish using deep reinforcement learning (DRL) to maximize propulsion efficiency and minimize energy consumption. Our novel DRL approach incorporates extended pressure perception, a transformer model processing sequences of observations, and a policy transfer scheme. Notably, significantly improved training stability and speed within our approach allow for end-to-end training of the robotic fish. This enables agiler responses to hydrodynamic environments and possesses greater optimization potential compared to pre-defined motion pattern controls. Our experiments are conducted on a serially connected rigid robotic fish in a free stream with a Reynolds number of 6000 using computational fluid dynamics (CFD) simulations. The DRL-trained policies yield impressive results, demonstrating both high efficiency and propulsion. The policies also showcase the agent's embodiment, skillfully utilizing its body structure and engaging with surrounding fluid dynamics, as revealed through flow analysis. This study provides valuable insights into the bio-mimetic underwater robots optimization through DRL training, capitalizing on their structural advantages, and ultimately contributing to more efficient underwater propulsion systems.
△ Less
Submitted 5 June, 2025;
originally announced June 2025.
-
Large Processor Chip Model
Authors:
Kaiyan Chang,
Mingzhi Chen,
Yunji Chen,
Zhirong Chen,
Dongrui Fan,
Junfeng Gong,
Nan Guo,
Yinhe Han,
Qinfen Hao,
Shuo Hou,
Xuan Huang,
Pengwei Jin,
Changxin Ke,
Cangyuan Li,
Guangli Li,
Huawei Li,
Kuan Li,
Naipeng Li,
Shengwen Liang,
Cheng Liu,
Hongwei Liu,
Jiahua Liu,
Junliang Lv,
Jianan Mu,
Jin Qin
, et al. (18 additional authors not shown)
Abstract:
Computer System Architecture serves as a crucial bridge between software applications and the underlying hardware, encompassing components like compilers, CPUs, coprocessors, and RTL designs. Its development, from early mainframes to modern domain-specific architectures, has been driven by rising computational demands and advancements in semiconductor technology. However, traditional paradigms in…
▽ More
Computer System Architecture serves as a crucial bridge between software applications and the underlying hardware, encompassing components like compilers, CPUs, coprocessors, and RTL designs. Its development, from early mainframes to modern domain-specific architectures, has been driven by rising computational demands and advancements in semiconductor technology. However, traditional paradigms in computer system architecture design are confronting significant challenges, including a reliance on manual expertise, fragmented optimization across software and hardware layers, and high costs associated with exploring expansive design spaces. While automated methods leveraging optimization algorithms and machine learning have improved efficiency, they remain constrained by a single-stage focus, limited data availability, and a lack of comprehensive human domain knowledge. The emergence of large language models offers transformative opportunities for the design of computer system architecture. By leveraging the capabilities of LLMs in areas such as code generation, data analysis, and performance modeling, the traditional manual design process can be transitioned to a machine-based automated design approach. To harness this potential, we present the Large Processor Chip Model (LPCM), an LLM-driven framework aimed at achieving end-to-end automated computer architecture design. The LPCM is structured into three levels: Human-Centric; Agent-Orchestrated; and Model-Governed. This paper utilizes 3D Gaussian Splatting as a representative workload and employs the concept of software-hardware collaborative design to examine the implementation of the LPCM at Level 1, demonstrating the effectiveness of the proposed approach. Furthermore, this paper provides an in-depth discussion on the pathway to implementing Level 2 and Level 3 of the LPCM, along with an analysis of the existing challenges.
△ Less
Submitted 3 June, 2025;
originally announced June 2025.
-
Black-box Adversarial Attacks on CNN-based SLAM Algorithms
Authors:
Maria Rafaela Gkeka,
Bowen Sun,
Evgenia Smirni,
Christos D. Antonopoulos,
Spyros Lalis,
Nikolaos Bellas
Abstract:
Continuous advancements in deep learning have led to significant progress in feature detection, resulting in enhanced accuracy in tasks like Simultaneous Localization and Mapping (SLAM). Nevertheless, the vulnerability of deep neural networks to adversarial attacks remains a challenge for their reliable deployment in applications, such as navigation of autonomous agents. Even though CNN-based SLAM…
▽ More
Continuous advancements in deep learning have led to significant progress in feature detection, resulting in enhanced accuracy in tasks like Simultaneous Localization and Mapping (SLAM). Nevertheless, the vulnerability of deep neural networks to adversarial attacks remains a challenge for their reliable deployment in applications, such as navigation of autonomous agents. Even though CNN-based SLAM algorithms are a growing area of research there is a notable absence of a comprehensive presentation and examination of adversarial attacks targeting CNN-based feature detectors, as part of a SLAM system. Our work introduces black-box adversarial perturbations applied to the RGB images fed into the GCN-SLAM algorithm. Our findings on the TUM dataset [30] reveal that even attacks of moderate scale can lead to tracking failure in as many as 76% of the frames. Moreover, our experiments highlight the catastrophic impact of attacking depth instead of RGB input images on the SLAM system.
△ Less
Submitted 30 May, 2025;
originally announced May 2025.
-
NTIRE 2025 challenge on Text to Image Generation Model Quality Assessment
Authors:
Shuhao Han,
Haotian Fan,
Fangyuan Kong,
Wenjie Liao,
Chunle Guo,
Chongyi Li,
Radu Timofte,
Liang Li,
Tao Li,
Junhui Cui,
Yunqiu Wang,
Yang Tai,
Jingwei Sun,
Jianhui Sun,
Xinli Yue,
Tianyi Wang,
Huan Hou,
Junda Lu,
Xinyang Huang,
Zitang Zhou,
Zijian Zhang,
Xuhui Zheng,
Xuecheng Wu,
Chong Peng,
Xuezhi Cao
, et al. (90 additional authors not shown)
Abstract:
This paper reports on the NTIRE 2025 challenge on Text to Image (T2I) generation model quality assessment, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2025. The aim of this challenge is to address the fine-grained quality assessment of text-to-image generation models. This challenge evaluates text-to-image models from two aspe…
▽ More
This paper reports on the NTIRE 2025 challenge on Text to Image (T2I) generation model quality assessment, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2025. The aim of this challenge is to address the fine-grained quality assessment of text-to-image generation models. This challenge evaluates text-to-image models from two aspects: image-text alignment and image structural distortion detection, and is divided into the alignment track and the structural track. The alignment track uses the EvalMuse-40K, which contains around 40K AI-Generated Images (AIGIs) generated by 20 popular generative models. The alignment track has a total of 371 registered participants. A total of 1,883 submissions are received in the development phase, and 507 submissions are received in the test phase. Finally, 12 participating teams submitted their models and fact sheets. The structure track uses the EvalMuse-Structure, which contains 10,000 AI-Generated Images (AIGIs) with corresponding structural distortion mask. A total of 211 participants have registered in the structure track. A total of 1155 submissions are received in the development phase, and 487 submissions are received in the test phase. Finally, 8 participating teams submitted their models and fact sheets. Almost all methods have achieved better results than baseline methods, and the winning methods in both tracks have demonstrated superior prediction performance on T2I model quality assessment.
△ Less
Submitted 22 May, 2025;
originally announced May 2025.
-
Enhancing User-Oriented Proactivity in Open-Domain Dialogues with Critic Guidance
Authors:
Yufeng Wang,
Jinwu Hu,
Ziteng Huang,
Kunyang Lin,
Zitian Zhang,
Peihao Chen,
Yu Hu,
Qianyue Wang,
Zhuliang Yu,
Bin Sun,
Xiaofen Xing,
Qingfang Zheng,
Mingkui Tan
Abstract:
Open-domain dialogue systems aim to generate natural and engaging conversations, providing significant practical value in real applications such as social robotics and personal assistants. The advent of large language models (LLMs) has greatly advanced this field by improving context understanding and conversational fluency. However, existing LLM-based dialogue systems often fall short in proactiv…
▽ More
Open-domain dialogue systems aim to generate natural and engaging conversations, providing significant practical value in real applications such as social robotics and personal assistants. The advent of large language models (LLMs) has greatly advanced this field by improving context understanding and conversational fluency. However, existing LLM-based dialogue systems often fall short in proactively understanding the user's chatting preferences and guiding conversations toward user-centered topics. This lack of user-oriented proactivity can lead users to feel unappreciated, reducing their satisfaction and willingness to continue the conversation in human-computer interactions. To address this issue, we propose a User-oriented Proactive Chatbot (UPC) to enhance the user-oriented proactivity. Specifically, we first construct a critic to evaluate this proactivity inspired by the LLM-as-a-judge strategy. Given the scarcity of high-quality training data, we then employ the critic to guide dialogues between the chatbot and user agents, generating a corpus with enhanced user-oriented proactivity. To ensure the diversity of the user backgrounds, we introduce the ISCO-800, a diverse user background dataset for constructing user agents. Moreover, considering the communication difficulty varies among users, we propose an iterative curriculum learning method that trains the chatbot from easy-to-communicate users to more challenging ones, thereby gradually enhancing its performance. Experiments demonstrate that our proposed training method is applicable to different LLMs, improving user-oriented proactivity and attractiveness in open-domain dialogues.
△ Less
Submitted 18 May, 2025;
originally announced May 2025.
-
Multi-Objective Memory Bandwidth Regulation and Cache Partitioning for Multicore Real-Time Systems
Authors:
Binqi Sun,
Zhihang Wei,
Andrea Bastoni,
Debayan Roy,
Mirco Theile,
Tomasz Kloda,
Rodolfo Pellizzoni,
Marco Caccamo
Abstract:
Memory bandwidth regulation and cache partitioning are widely used techniques for achieving predictable timing in real-time computing systems. Combined with partitioned scheduling, these methods require careful co-allocation of tasks and resources to cores, as task execution times strongly depend on available allocated resources. To address this challenge, this paper presents a 0-1 linear program…
▽ More
Memory bandwidth regulation and cache partitioning are widely used techniques for achieving predictable timing in real-time computing systems. Combined with partitioned scheduling, these methods require careful co-allocation of tasks and resources to cores, as task execution times strongly depend on available allocated resources. To address this challenge, this paper presents a 0-1 linear program for task-resource co-allocation, along with a multi-objective heuristic designed to minimize resource usage while guaranteeing schedulability under a preemptive EDF scheduling policy. Our heuristic employs a multi-layer framework, where an outer layer explores resource allocations using Pareto-pruned search, and an inner layer optimizes task allocation by solving a knapsack problem using dynamic programming. To evaluate the performance of the proposed optimization algorithm, we profile real-world benchmarks on an embedded AMD UltraScale+ ZCU102 platform, with fine-grained resource partitioning enabled by the Jailhouse hypervisor, leveraging cache set partitioning and MemGuard for memory bandwidth regulation. Experiments based on the benchmarking results show that the proposed 0-1 linear program outperforms existing mixed-integer programs by finding more optimal solutions within the same time limit. Moreover, the proposed multi-objective multi-layer heuristic performs consistently better than the state-of-the-art multi-resource-task co-allocation algorithm in terms of schedulability, resource usage, number of non-dominated solutions, and computational efficiency.
△ Less
Submitted 15 May, 2025;
originally announced May 2025.
-
ForesightNav: Learning Scene Imagination for Efficient Exploration
Authors:
Hardik Shah,
Jiaxu Xing,
Nico Messikommer,
Boyang Sun,
Marc Pollefeys,
Davide Scaramuzza
Abstract:
Understanding how humans leverage prior knowledge to navigate unseen environments while making exploratory decisions is essential for developing autonomous robots with similar abilities. In this work, we propose ForesightNav, a novel exploration strategy inspired by human imagination and reasoning. Our approach equips robotic agents with the capability to predict contextual information, such as oc…
▽ More
Understanding how humans leverage prior knowledge to navigate unseen environments while making exploratory decisions is essential for developing autonomous robots with similar abilities. In this work, we propose ForesightNav, a novel exploration strategy inspired by human imagination and reasoning. Our approach equips robotic agents with the capability to predict contextual information, such as occupancy and semantic details, for unexplored regions. These predictions enable the robot to efficiently select meaningful long-term navigation goals, significantly enhancing exploration in unseen environments. We validate our imagination-based approach using the Structured3D dataset, demonstrating accurate occupancy prediction and superior performance in anticipating unseen scene geometry. Our experiments show that the imagination module improves exploration efficiency in unseen environments, achieving a 100% completion rate for PointNav and an SPL of 67% for ObjectNav on the Structured3D Validation split. These contributions demonstrate the power of imagination-driven reasoning for autonomous systems to enhance generalizable and efficient exploration.
△ Less
Submitted 5 May, 2025; v1 submitted 22 April, 2025;
originally announced April 2025.
-
AIR: A Systematic Analysis of Annotations, Instructions, and Response Pairs in Preference Dataset
Authors:
Bingxiang He,
Wenbin Zhang,
Jiaxi Song,
Cheng Qian,
Zixuan Fu,
Bowen Sun,
Ning Ding,
Haiwen Hong,
Longtao Huang,
Hui Xue,
Ganqu Cui,
Wanxiang Che,
Zhiyuan Liu,
Maosong Sun
Abstract:
Preference learning is critical for aligning large language models (LLMs) with human values, yet its success hinges on high-quality datasets comprising three core components: Preference \textbf{A}nnotations, \textbf{I}nstructions, and \textbf{R}esponse Pairs. Current approaches conflate these components, obscuring their individual impacts and hindering systematic optimization. In this work, we pro…
▽ More
Preference learning is critical for aligning large language models (LLMs) with human values, yet its success hinges on high-quality datasets comprising three core components: Preference \textbf{A}nnotations, \textbf{I}nstructions, and \textbf{R}esponse Pairs. Current approaches conflate these components, obscuring their individual impacts and hindering systematic optimization. In this work, we propose \textbf{AIR}, a component-wise analysis framework that systematically isolates and optimizes each component while evaluating their synergistic effects. Through rigorous experimentation, AIR reveals actionable principles: annotation simplicity (point-wise generative scoring), instruction inference stability (variance-based filtering across LLMs), and response pair quality (moderate margins + high absolute scores). When combined, these principles yield +5.3 average gains over baseline method, even with only 14k high-quality pairs. Our work shifts preference dataset design from ad hoc scaling to component-aware optimization, offering a blueprint for efficient, reproducible alignment.
△ Less
Submitted 4 April, 2025;
originally announced April 2025.
-
SuperDec: 3D Scene Decomposition with Superquadric Primitives
Authors:
Elisabetta Fedele,
Boyang Sun,
Leonidas Guibas,
Marc Pollefeys,
Francis Engelmann
Abstract:
We present SuperDec, an approach for creating compact 3D scene representations via decomposition into superquadric primitives. While most recent works leverage geometric primitives to obtain photorealistic 3D scene representations, we propose to leverage them to obtain a compact yet expressive representation. We propose to solve the problem locally on individual objects and leverage the capabiliti…
▽ More
We present SuperDec, an approach for creating compact 3D scene representations via decomposition into superquadric primitives. While most recent works leverage geometric primitives to obtain photorealistic 3D scene representations, we propose to leverage them to obtain a compact yet expressive representation. We propose to solve the problem locally on individual objects and leverage the capabilities of instance segmentation methods to scale our solution to full 3D scenes. In doing that, we design a new architecture which efficiently decompose point clouds of arbitrary objects in a compact set of superquadrics. We train our architecture on ShapeNet and we prove its generalization capabilities on object instances extracted from the ScanNet++ dataset as well as on full Replica scenes. Finally, we show how a compact representation based on superquadrics can be useful for a diverse range of downstream applications, including robotic tasks and controllable visual content generation and editing.
△ Less
Submitted 1 April, 2025;
originally announced April 2025.
-
MoCha: Towards Movie-Grade Talking Character Synthesis
Authors:
Cong Wei,
Bo Sun,
Haoyu Ma,
Ji Hou,
Felix Juefei-Xu,
Zecheng He,
Xiaoliang Dai,
Luxin Zhang,
Kunpeng Li,
Tingbo Hou,
Animesh Sinha,
Peter Vajda,
Wenhu Chen
Abstract:
Recent advancements in video generation have achieved impressive motion realism, yet they often overlook character-driven storytelling, a crucial task for automated film, animation generation. We introduce Talking Characters, a more realistic task to generate talking character animations directly from speech and text. Unlike talking head, Talking Characters aims at generating the full portrait of…
▽ More
Recent advancements in video generation have achieved impressive motion realism, yet they often overlook character-driven storytelling, a crucial task for automated film, animation generation. We introduce Talking Characters, a more realistic task to generate talking character animations directly from speech and text. Unlike talking head, Talking Characters aims at generating the full portrait of one or more characters beyond the facial region. In this paper, we propose MoCha, the first of its kind to generate talking characters. To ensure precise synchronization between video and speech, we propose a speech-video window attention mechanism that effectively aligns speech and video tokens. To address the scarcity of large-scale speech-labeled video datasets, we introduce a joint training strategy that leverages both speech-labeled and text-labeled video data, significantly improving generalization across diverse character actions. We also design structured prompt templates with character tags, enabling, for the first time, multi-character conversation with turn-based dialogue-allowing AI-generated characters to engage in context-aware conversations with cinematic coherence. Extensive qualitative and quantitative evaluations, including human preference studies and benchmark comparisons, demonstrate that MoCha sets a new standard for AI-generated cinematic storytelling, achieving superior realism, expressiveness, controllability and generalization.
△ Less
Submitted 30 March, 2025;
originally announced March 2025.
-
SG-Tailor: Inter-Object Commonsense Relationship Reasoning for Scene Graph Manipulation
Authors:
Haoliang Shang,
Hanyu Wu,
Guangyao Zhai,
Boyang Sun,
Fangjinhua Wang,
Federico Tombari,
Marc Pollefeys
Abstract:
Scene graphs capture complex relationships among objects, serving as strong priors for content generation and manipulation. Yet, reasonably manipulating scene graphs -- whether by adding nodes or modifying edges -- remains a challenging and untouched task. Tasks such as adding a node to the graph or reasoning about a node's relationships with all others are computationally intractable, as even a s…
▽ More
Scene graphs capture complex relationships among objects, serving as strong priors for content generation and manipulation. Yet, reasonably manipulating scene graphs -- whether by adding nodes or modifying edges -- remains a challenging and untouched task. Tasks such as adding a node to the graph or reasoning about a node's relationships with all others are computationally intractable, as even a single edge modification can trigger conflicts due to the intricate interdependencies within the graph. To address these challenges, we introduce SG-Tailor, an autoregressive model that predicts the conflict-free relationship between any two nodes. SG-Tailor not only infers inter-object relationships, including generating commonsense edges for newly added nodes but also resolves conflicts arising from edge modifications to produce coherent, manipulated graphs for downstream tasks. For node addition, the model queries the target node and other nodes from the graph to predict the appropriate relationships. For edge modification, SG-Tailor employs a Cut-And-Stitch strategy to solve the conflicts and globally adjust the graph. Extensive experiments demonstrate that SG-Tailor outperforms competing methods by a large margin and can be seamlessly integrated as a plug-in module for scene generation and robotic manipulation tasks.
△ Less
Submitted 23 March, 2025;
originally announced March 2025.
-
Loop Closure from Two Views: Revisiting PGO for Scalable Trajectory Estimation through Monocular Priors
Authors:
Tian Yi Lim,
Boyang Sun,
Marc Pollefeys,
Hermann Blum
Abstract:
(Visual) Simultaneous Localization and Mapping (SLAM) remains a fundamental challenge in enabling autonomous systems to navigate and understand large-scale environments. Traditional SLAM approaches struggle to balance efficiency and accuracy, particularly in large-scale settings where extensive computational resources are required for scene reconstruction and Bundle Adjustment (BA). However, this…
▽ More
(Visual) Simultaneous Localization and Mapping (SLAM) remains a fundamental challenge in enabling autonomous systems to navigate and understand large-scale environments. Traditional SLAM approaches struggle to balance efficiency and accuracy, particularly in large-scale settings where extensive computational resources are required for scene reconstruction and Bundle Adjustment (BA). However, this scene reconstruction, in the form of sparse pointclouds of visual landmarks, is often only used within the SLAM system because navigation and planning methods require different map representations. In this work, we therefore investigate a more scalable Visual SLAM (VSLAM) approach without reconstruction, mainly based on approaches for two-view loop closures. By restricting the map to a sparse keyframed pose graph without dense geometry representations, our '2GO' system achieves efficient optimization with competitive absolute trajectory accuracy. In particular, we find that recent advancements in image matching and monocular depth priors enable very accurate trajectory optimization from two-view edges. We conduct extensive experiments on diverse datasets, including large-scale scenarios, and provide a detailed analysis of the trade-offs between runtime, accuracy, and map size. Our results demonstrate that this streamlined approach supports real-time performance, scales well in map size and trajectory duration, and effectively broadens the capabilities of VSLAM for long-duration deployments to large environments.
△ Less
Submitted 20 March, 2025;
originally announced March 2025.
-
SAM2-ELNet: Label Enhancement and Automatic Annotation for Remote Sensing Segmentation
Authors:
Jianhao Yang,
Wenshuo Yu,
Yuanchao Lv,
Jiance Sun,
Bokang Sun,
Mingyang Liu
Abstract:
Remote sensing image segmentation is crucial for environmental monitoring, disaster assessment, and resource management, directly affecting the accuracy and efficiency of surface information extraction. The performance of existing supervised models in remote sensing image segmentation tasks highly depends on the quality of label data. However, current label data mainly relies on manual annotation,…
▽ More
Remote sensing image segmentation is crucial for environmental monitoring, disaster assessment, and resource management, directly affecting the accuracy and efficiency of surface information extraction. The performance of existing supervised models in remote sensing image segmentation tasks highly depends on the quality of label data. However, current label data mainly relies on manual annotation, which comes with high time costs and is subject to subjective interference, resulting in distortion of label boundaries and often a loss of detail. To solve the above problems, our work proposes an Edge-enhanced Labeling Network, called SAM2-ELNet, which incorporates a labeling module and an edge attention mechanism. This model effectively addresses issues such as label detail loss, fragmentation, and inaccurate boundaries. Due to the scarcity of manually annotated remote sensing data, the feature extraction capabilities of traditional neural networks are limited. Our method uses the Hiera backbone of the pre-trained self-supervised large model segment anything model 2 (SAM2) as the encoder, achieves high-quality and efficient feature extraction even with small samples by fine-tuning on downstream tasks. This study compared the training effects of original and enhanced labels on the manually annotated Deep-SAR Oil Spill (SOS) dataset. Results showed that the model trained with enhanced labels performed better and had a lower final loss, indicating closer alignment with the real data distribution. Our work also explores the potential of extending the model into an efficient automatic annotation framework through generalization experiments, facilitating large-scale remote sensing image interpretation and intelligent recognition.
△ Less
Submitted 16 March, 2025;
originally announced March 2025.
-
Real-time simulation enabled navigation control of magnetic soft continuum robots in confined lumens
Authors:
Dezhong Tong,
Zhuonan Hao,
Jiyu Li,
Boxi Sun,
Mingchao Liu,
Liu Wang,
Weicheng Huang
Abstract:
Magnetic soft continuum robots (MSCRs) have emerged as a promising technology for minimally invasive interventions, offering enhanced dexterity and remote-controlled navigation in confined lumens. Unlike conventional guidewires with pre-shaped tips, MSCRs feature a magnetic tip that actively bends under applied magnetic fields. Despite extensive studies in modeling and simulation, achieving real-t…
▽ More
Magnetic soft continuum robots (MSCRs) have emerged as a promising technology for minimally invasive interventions, offering enhanced dexterity and remote-controlled navigation in confined lumens. Unlike conventional guidewires with pre-shaped tips, MSCRs feature a magnetic tip that actively bends under applied magnetic fields. Despite extensive studies in modeling and simulation, achieving real-time navigation control of MSCRs in confined lumens remains a significant challenge. The primary reasons are due to robot-lumen contact interactions and computational limitations in modeling MSCR nonlinear behavior under magnetic actuation. Existing approaches, such as Finite Element Method (FEM) simulations and energy-minimization techniques, suffer from high computational costs and oversimplified contact interactions, making them impractical for real-world applications. In this work, we develop a real-time simulation and navigation control framework that integrates hard-magnetic elastic rod theory, formulated within the Discrete Differential Geometry (DDG) framework, with an order-reduced contact handling strategy. Our approach captures large deformations and complex interactions while maintaining computational efficiency. Next, the navigation control problem is formulated as an inverse design task, where optimal magnetic fields are computed in real time by minimizing the constrained forces and enhancing navigation accuracy. We validate the proposed framework through comprehensive numerical simulations and experimental studies, demonstrating its robustness, efficiency, and accuracy. The results show that our method significantly reduces computational costs while maintaining high-fidelity modeling, making it feasible for real-time deployment in clinical settings.
△ Less
Submitted 11 March, 2025;
originally announced March 2025.
-
VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation
Authors:
Hanzhi Chen,
Boyang Sun,
Anran Zhang,
Marc Pollefeys,
Stefan Leutenegger
Abstract:
Future robots are envisioned as versatile systems capable of performing a variety of household tasks. The big question remains, how can we bridge the embodiment gap while minimizing physical robot learning, which fundamentally does not scale well. We argue that learning from in-the-wild human videos offers a promising solution for robotic manipulation tasks, as vast amounts of relevant data alread…
▽ More
Future robots are envisioned as versatile systems capable of performing a variety of household tasks. The big question remains, how can we bridge the embodiment gap while minimizing physical robot learning, which fundamentally does not scale well. We argue that learning from in-the-wild human videos offers a promising solution for robotic manipulation tasks, as vast amounts of relevant data already exist on the internet. In this work, we present VidBot, a framework enabling zero-shot robotic manipulation using learned 3D affordance from in-the-wild monocular RGB-only human videos. VidBot leverages a pipeline to extract explicit representations from them, namely 3D hand trajectories from videos, combining a depth foundation model with structure-from-motion techniques to reconstruct temporally consistent, metric-scale 3D affordance representations agnostic to embodiments. We introduce a coarse-to-fine affordance learning model that first identifies coarse actions from the pixel space and then generates fine-grained interaction trajectories with a diffusion model, conditioned on coarse actions and guided by test-time constraints for context-aware interaction planning, enabling substantial generalization to novel scenes and embodiments. Extensive experiments demonstrate the efficacy of VidBot, which significantly outperforms counterparts across 13 manipulation tasks in zero-shot settings and can be seamlessly deployed across robot systems in real-world environments. VidBot paves the way for leveraging everyday human videos to make robot learning more scalable.
△ Less
Submitted 27 March, 2025; v1 submitted 10 March, 2025;
originally announced March 2025.
-
GOD model: Privacy Preserved AI School for Personal Assistant
Authors:
PIN AI Team,
Bill Sun,
Gavin Guo,
Regan Peng,
Boliang Zhang,
Shouqiao Wang,
Laura Florescu,
Xi Wang,
Davide Crapis,
Ben Wu
Abstract:
Personal AI assistants (e.g., Apple Intelligence, Meta AI) offer proactive recommendations that simplify everyday tasks, but their reliance on sensitive user data raises concerns about privacy and trust. To address these challenges, we introduce the Guardian of Data (GOD), a secure, privacy-preserving framework for training and evaluating AI assistants directly on-device. Unlike traditional benchm…
▽ More
Personal AI assistants (e.g., Apple Intelligence, Meta AI) offer proactive recommendations that simplify everyday tasks, but their reliance on sensitive user data raises concerns about privacy and trust. To address these challenges, we introduce the Guardian of Data (GOD), a secure, privacy-preserving framework for training and evaluating AI assistants directly on-device. Unlike traditional benchmarks, the GOD model measures how well assistants can anticipate user needs-such as suggesting gifts-while protecting user data and autonomy. Functioning like an AI school, it addresses the cold start problem by simulating user queries and employing a curriculum-based approach to refine the performance of each assistant. Running within a Trusted Execution Environment (TEE), it safeguards user data while applying reinforcement and imitation learning to refine AI recommendations. A token-based incentive system encourages users to share data securely, creating a data flywheel that drives continuous improvement. Specifically, users mine with their data, and the mining rate is determined by GOD's evaluation of how well their AI assistant understands them across categories such as shopping, social interactions, productivity, trading, and Web3. By integrating privacy, personalization, and trust, the GOD model provides a scalable, responsible path for advancing personal AI assistants. For community collaboration, part of the framework is open-sourced at https://github.com/PIN-AI/God-Model.
△ Less
Submitted 27 February, 2025; v1 submitted 24 February, 2025;
originally announced February 2025.
-
EU-Nets: Enhanced, Explainable and Parsimonious U-Nets
Authors:
B. Sun,
P. Liò
Abstract:
In this study, we propose MHEX+, a framework adaptable to any U-Net architecture. Built upon MHEX+, we introduce novel U-Net variants, EU-Nets, which enhance explainability and uncertainty estimation, addressing the limitations of traditional U-Net models while improving performance and stability. A key innovation is the Equivalent Convolutional Kernel, which unifies consecutive convolutional laye…
▽ More
In this study, we propose MHEX+, a framework adaptable to any U-Net architecture. Built upon MHEX+, we introduce novel U-Net variants, EU-Nets, which enhance explainability and uncertainty estimation, addressing the limitations of traditional U-Net models while improving performance and stability. A key innovation is the Equivalent Convolutional Kernel, which unifies consecutive convolutional layers, boosting interpretability. For uncertainty estimation, we propose the collaboration gradient approach, measuring gradient consistency across decoder layers. Notably, EU-Nets achieve an average accuracy improvement of 1.389\% and a variance reduction of 0.83\% across all networks and datasets in our experiments, requiring fewer than 0.1M parameters.
△ Less
Submitted 25 February, 2025;
originally announced February 2025.
-
An Analytical Emotion Framework of Rumour Threads on Social Media
Authors:
Rui Xing,
Boyang Sun,
Kun Zhang,
Preslav Nakov,
Timothy Baldwin,
Jey Han Lau
Abstract:
Rumours in online social media pose significant risks to modern society, motivating the need for better understanding of how they develop. We focus specifically on the interface between emotion and rumours in threaded discourses, building on the surprisingly sparse literature on the topic which has largely focused on single aspect of emotions within the original rumour posts themselves, and largel…
▽ More
Rumours in online social media pose significant risks to modern society, motivating the need for better understanding of how they develop. We focus specifically on the interface between emotion and rumours in threaded discourses, building on the surprisingly sparse literature on the topic which has largely focused on single aspect of emotions within the original rumour posts themselves, and largely overlooked the comparative differences between rumours and non-rumours. In this work, we take one step further to provide a comprehensive analytical emotion framework with multi-aspect emotion detection, contrasting rumour and non-rumour threads and provide both correlation and causal analysis of emotions. We applied our framework on existing widely-used rumour datasets to further understand the emotion dynamics in online social media threads. Our framework reveals that rumours trigger more negative emotions (e.g., anger, fear, pessimism), while non-rumours evoke more positive ones. Emotions are contagious, rumours spread negativity, non-rumours spread positivity. Causal analysis shows surprise bridges rumours and other emotions; pessimism comes from sadness and fear, while optimism arises from joy and love.
△ Less
Submitted 13 May, 2025; v1 submitted 23 February, 2025;
originally announced February 2025.
-
Strategic priorities for transformative progress in advancing biology with proteomics and artificial intelligence
Authors:
Yingying Sun,
Jun A,
Zhiwei Liu,
Rui Sun,
Liujia Qian,
Samuel H. Payne,
Wout Bittremieux,
Markus Ralser,
Chen Li,
Yi Chen,
Zhen Dong,
Yasset Perez-Riverol,
Asif Khan,
Chris Sander,
Ruedi Aebersold,
Juan Antonio VizcaĆno,
Jonathan R Krieger,
Jianhua Yao,
Han Wen,
Linfeng Zhang,
Yunping Zhu,
Yue Xuan,
Benjamin Boyang Sun,
Liang Qiao,
Henning Hermjakob
, et al. (37 additional authors not shown)
Abstract:
Artificial intelligence (AI) is transforming scientific research, including proteomics. Advances in mass spectrometry (MS)-based proteomics data quality, diversity, and scale, combined with groundbreaking AI techniques, are unlocking new challenges and opportunities in biological discovery. Here, we highlight key areas where AI is driving innovation, from data analysis to new biological insights.…
▽ More
Artificial intelligence (AI) is transforming scientific research, including proteomics. Advances in mass spectrometry (MS)-based proteomics data quality, diversity, and scale, combined with groundbreaking AI techniques, are unlocking new challenges and opportunities in biological discovery. Here, we highlight key areas where AI is driving innovation, from data analysis to new biological insights. These include developing an AI-friendly ecosystem for proteomics data generation, sharing, and analysis; improving peptide and protein identification and quantification; characterizing protein-protein interactions and protein complexes; advancing spatial and perturbation proteomics; integrating multi-omics data; and ultimately enabling AI-empowered virtual cells.
△ Less
Submitted 21 February, 2025;
originally announced February 2025.
-
A Survey of Fuzzing Open-Source Operating Systems
Authors:
Kun Hu,
Qicai Chen,
Zilong Lu,
Wenzhuo Zhang,
Bihuan Chen,
You Lu,
Haowen Jiang,
Bingkun Sun,
Xin Peng,
Wenyun Zhao
Abstract:
Vulnerabilities in open-source operating systems (OSs) pose substantial security risks to software systems, making their detection crucial. While fuzzing has been an effective vulnerability detection technique in various domains, OS fuzzing (OSF) faces unique challenges due to OS complexity and multi-layered interaction, and has not been comprehensively reviewed. Therefore, this work systematicall…
▽ More
Vulnerabilities in open-source operating systems (OSs) pose substantial security risks to software systems, making their detection crucial. While fuzzing has been an effective vulnerability detection technique in various domains, OS fuzzing (OSF) faces unique challenges due to OS complexity and multi-layered interaction, and has not been comprehensively reviewed. Therefore, this work systematically surveys the state-of-the-art OSF techniques, categorizes them based on the general fuzzing process, and investigates challenges specific to kernel, file system, driver, and hypervisor fuzzing. Finally, future research directions for OSF are discussed. GitHub: https://github.com/pghk13/Survey-OSF.
△ Less
Submitted 20 February, 2025; v1 submitted 16 February, 2025;
originally announced February 2025.
-
Computing Efficient Envy-Free Partial Allocations of Indivisible Goods
Authors:
Robert Bredereck,
Andrzej Kaczmarczyk,
Junjie Luo,
Bin Sun
Abstract:
Envy-freeness is one of the most prominent fairness concepts in the allocation of indivisible goods. Even though trivial envy-free allocations always exist, rich literature shows this is not true when one additionally requires some efficiency concept (e.g., completeness, Pareto-efficiency, or social welfare maximization). In fact, in such case even deciding the existence of an efficient envy-free…
▽ More
Envy-freeness is one of the most prominent fairness concepts in the allocation of indivisible goods. Even though trivial envy-free allocations always exist, rich literature shows this is not true when one additionally requires some efficiency concept (e.g., completeness, Pareto-efficiency, or social welfare maximization). In fact, in such case even deciding the existence of an efficient envy-free allocation is notoriously computationally hard. In this paper, we explore the limits of efficient computability by relaxing standard efficiency concepts and analyzing how this impacts the computational complexity of the respective problems. Specifically, we allow partial allocations (where not all goods are allocated) and impose only very mild efficiency constraints, such as ensuring each agent receives a bundle with positive utility. Surprisingly, even such seemingly weak efficiency requirements lead to a diverse computational complexity landscape. We identify several polynomial-time solvable or fixed-parameter tractable cases for binary utilities, yet we also find NP-hardness in very restricted scenarios involving ternary utilities.
△ Less
Submitted 18 February, 2025;
originally announced February 2025.
-
Democratic Training Against Universal Adversarial Perturbations
Authors:
Bing Sun,
Jun Sun,
Wei Zhao
Abstract:
Despite their advances and success, real-world deep neural networks are known to be vulnerable to adversarial attacks. Universal adversarial perturbation, an input-agnostic attack, poses a serious threat for them to be deployed in security-sensitive systems. In this case, a single universal adversarial perturbation deceives the model on a range of clean inputs without requiring input-specific opti…
▽ More
Despite their advances and success, real-world deep neural networks are known to be vulnerable to adversarial attacks. Universal adversarial perturbation, an input-agnostic attack, poses a serious threat for them to be deployed in security-sensitive systems. In this case, a single universal adversarial perturbation deceives the model on a range of clean inputs without requiring input-specific optimization, which makes it particularly threatening. In this work, we observe that universal adversarial perturbations usually lead to abnormal entropy spectrum in hidden layers, which suggests that the prediction is dominated by a small number of ``feature'' in such cases (rather than democratically by many features). Inspired by this, we propose an efficient yet effective defense method for mitigating UAPs called \emph{Democratic Training} by performing entropy-based model enhancement to suppress the effect of the universal adversarial perturbations in a given model. \emph{Democratic Training} is evaluated with 7 neural networks trained on 5 benchmark datasets and 5 types of state-of-the-art universal adversarial attack methods. The results show that it effectively reduces the attack success rate, improves model robustness and preserves the model accuracy on clean samples.
△ Less
Submitted 8 February, 2025;
originally announced February 2025.
-
Knowing When to Stop Matters: A Unified Algorithm for Online Conversion under Horizon Uncertainty
Authors:
Yanzhao Wang,
Hasti Nourmohammadi Sigaroudi,
Bo Sun,
Omid Ardakanian,
Xiaoqi Tan
Abstract:
This paper investigates the online conversion problem, which involves sequentially trading a divisible resource (e.g., energy) under dynamically changing prices to maximize profit. A key challenge in online conversion is managing decisions under horizon uncertainty, where the duration of trading is either known, revealed partway, or entirely unknown. We propose a unified algorithm that achieves op…
▽ More
This paper investigates the online conversion problem, which involves sequentially trading a divisible resource (e.g., energy) under dynamically changing prices to maximize profit. A key challenge in online conversion is managing decisions under horizon uncertainty, where the duration of trading is either known, revealed partway, or entirely unknown. We propose a unified algorithm that achieves optimal competitive guarantees across these horizon models, accounting for practical constraints such as box constraints, which limit the maximum allowable trade per step. Additionally, we extend the algorithm to a learning-augmented version, leveraging horizon predictions to adaptively balance performance: achieving near-optimal results when predictions are accurate while maintaining strong guarantees when predictions are unreliable. These results advance the understanding of online conversion under various degrees of horizon uncertainty and provide more practical strategies to address real world constraints.
△ Less
Submitted 6 February, 2025;
originally announced February 2025.
-
Posted Price Mechanisms for Online Allocation with Diseconomies of Scale
Authors:
Hossein Nekouyan Jazi,
Bo Sun,
Raouf Boutaba,
Xiaoqi Tan
Abstract:
This paper addresses the online $k$-selection problem with diseconomies of scale (OSDoS), where a seller seeks to maximize social welfare by optimally pricing items for sequentially arriving buyers, accounting for increasing marginal production costs. Previous studies have investigated deterministic dynamic pricing mechanisms for such settings. However, significant challenges remain, particularly…
▽ More
This paper addresses the online $k$-selection problem with diseconomies of scale (OSDoS), where a seller seeks to maximize social welfare by optimally pricing items for sequentially arriving buyers, accounting for increasing marginal production costs. Previous studies have investigated deterministic dynamic pricing mechanisms for such settings. However, significant challenges remain, particularly in achieving optimality with small or finite inventories and developing effective randomized posted price mechanisms. To bridge this gap, we propose a novel randomized dynamic pricing mechanism for OSDoS, providing a tighter lower bound on the competitive ratio compared to prior work. Our approach ensures optimal performance in small inventory settings (i.e., when $k$ is small) and surpasses existing online mechanisms in large inventory settings (i.e., when $k$ is large), leading to the best-known posted price mechanism for optimizing online selection and allocation with diseconomies of scale across varying inventory sizes.
△ Less
Submitted 4 February, 2025;
originally announced February 2025.
-
Knowing When to Stop: Dynamic Context Cutoff for Large Language Models
Authors:
Roy Xie,
Junlin Wang,
Paul Rosu,
Chunyuan Deng,
Bolun Sun,
Zihao Lin,
Bhuwan Dhingra
Abstract:
Large language models (LLMs) process entire input contexts indiscriminately, which is inefficient in cases where the information required to answer a query is localized within the context. We present dynamic context cutoff, a human-inspired method enabling LLMs to self-terminate processing upon acquiring sufficient task-relevant information. Through analysis of model internals, we discover that sp…
▽ More
Large language models (LLMs) process entire input contexts indiscriminately, which is inefficient in cases where the information required to answer a query is localized within the context. We present dynamic context cutoff, a human-inspired method enabling LLMs to self-terminate processing upon acquiring sufficient task-relevant information. Through analysis of model internals, we discover that specific attention heads inherently encode "sufficiency signals" - detectable through lightweight classifiers - that predict when critical information has been processed. This reveals a new efficiency paradigm: models' internal understanding naturally dictates processing needs rather than external compression heuristics. Comprehensive experiments across six QA datasets (up to 40K tokens) with three model families (LLaMA/Qwen/Mistral, 1B0-70B) demonstrate 1.33x average token reduction while improving accuracy by 1.3%. Furthermore, our method demonstrates better performance with the same rate of token reduction compared to other context efficiency methods. Additionally, we observe an emergent scaling phenomenon: while smaller models require require probing for sufficiency detection, larger models exhibit intrinsic self-assessment capabilities through prompting.
△ Less
Submitted 2 February, 2025;
originally announced February 2025.
-
Permutation-Based Rank Test in the Presence of Discretization and Application in Causal Discovery with Mixed Data
Authors:
Xinshuai Dong,
Ignavier Ng,
Boyang Sun,
Haoyue Dai,
Guang-Yuan Hao,
Shunxing Fan,
Peter Spirtes,
Yumou Qiu,
Kun Zhang
Abstract:
Recent advances have shown that statistical tests for the rank of cross-covariance matrices play an important role in causal discovery. These rank tests include partial correlation tests as special cases and provide further graphical information about latent variables. Existing rank tests typically assume that all the continuous variables can be perfectly measured, and yet, in practice many variab…
▽ More
Recent advances have shown that statistical tests for the rank of cross-covariance matrices play an important role in causal discovery. These rank tests include partial correlation tests as special cases and provide further graphical information about latent variables. Existing rank tests typically assume that all the continuous variables can be perfectly measured, and yet, in practice many variables can only be measured after discretization. For example, in psychometric studies, the continuous level of certain personality dimensions of a person can only be measured after being discretized into order-preserving options such as disagree, neutral, and agree. Motivated by this, we propose Mixed data Permutation-based Rank Test (MPRT), which properly controls the statistical errors even when some or all variables are discretized. Theoretically, we establish the exchangeability and estimate the asymptotic null distribution by permutations; as a consequence, MPRT can effectively control the Type I error in the presence of discretization while previous methods cannot. Empirically, our method is validated by extensive experiments on synthetic data and real-world data to demonstrate its effectiveness as well as applicability in causal discovery.
△ Less
Submitted 12 June, 2025; v1 submitted 31 January, 2025;
originally announced January 2025.
-
Online Allocation with Multi-Class Arrivals: Group Fairness vs Individual Welfare
Authors:
Faraz Zargari,
Hossein Nekouyan Jazi,
Bo Sun,
Xiaoqi Tan
Abstract:
We introduce and study a multi-class online resource allocation problem with group fairness guarantees. The problem involves allocating a fixed amount of resources to a sequence of agents, each belonging to a specific group. The primary objective is to ensure fairness across different groups in an online setting. We focus on three fairness notions: one based on quantity and two based on utility. T…
▽ More
We introduce and study a multi-class online resource allocation problem with group fairness guarantees. The problem involves allocating a fixed amount of resources to a sequence of agents, each belonging to a specific group. The primary objective is to ensure fairness across different groups in an online setting. We focus on three fairness notions: one based on quantity and two based on utility. To achieve fair allocations, we develop two threshold-based online algorithms, proving their optimality under two fairness notions and near-optimality for the more challenging one. Additionally, we demonstrate a fundamental trade-off between group fairness and individual welfare using a novel representative function-based approach. To address this trade-off, we propose a set-aside multi-threshold algorithm that reserves a portion of the resource to ensure fairness across groups while utilizing the remaining resource to optimize efficiency under utility-based fairness notions. This algorithm is proven to achieve the Pareto-optimal trade-off. We also demonstrate that our problem can model a wide range of real-world applications, including network caching and cloud computing, and empirically evaluate our proposed algorithms in the network caching problem using real datasets.
△ Less
Submitted 27 January, 2025;
originally announced January 2025.
-
HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding
Authors:
Jiaxing Zhao,
Qize Yang,
Yixing Peng,
Detao Bai,
Shimin Yao,
Boyuan Sun,
Xiang Chen,
Shenghao Fu,
Weixuan chen,
Xihan Wei,
Liefeng Bo
Abstract:
In human-centric scenes, the ability to simultaneously understand visual and auditory information is crucial. While recent omni models can process multiple modalities, they generally lack effectiveness in human-centric scenes due to the absence of large-scale, specialized datasets and non-targeted architectures. In this work, we developed HumanOmni, the industry's first human-centric Omni-multimod…
▽ More
In human-centric scenes, the ability to simultaneously understand visual and auditory information is crucial. While recent omni models can process multiple modalities, they generally lack effectiveness in human-centric scenes due to the absence of large-scale, specialized datasets and non-targeted architectures. In this work, we developed HumanOmni, the industry's first human-centric Omni-multimodal large language model. We constructed a dataset containing over 2.4 million human-centric video clips with detailed captions and more than 14 million instructions, facilitating the understanding of diverse human-centric scenes. HumanOmni includes three specialized branches for understanding different types of scenes. It adaptively fuses features from these branches based on user instructions, significantly enhancing visual understanding in scenes centered around individuals. Moreover, HumanOmni integrates audio features to ensure a comprehensive understanding of environments and individuals. Our experiments validate HumanOmni's advanced capabilities in handling human-centric scenes across a variety of tasks, including emotion recognition, facial expression description, and action understanding. Our model will be open-sourced to facilitate further development and collaboration within both academia and industry.
△ Less
Submitted 25 January, 2025;
originally announced January 2025.
-
Glissando-Net: Deep sinGLe vIew category level poSe eStimation ANd 3D recOnstruction
Authors:
Bo Sun,
Hao Kang,
Li Guan,
Haoxiang Li,
Philippos Mordohai,
Gang Hua
Abstract:
We present a deep learning model, dubbed Glissando-Net, to simultaneously estimate the pose and reconstruct the 3D shape of objects at the category level from a single RGB image. Previous works predominantly focused on either estimating poses(often at the instance level), or reconstructing shapes, but not both. Glissando-Net is composed of two auto-encoders that are jointly trained, one for RGB im…
▽ More
We present a deep learning model, dubbed Glissando-Net, to simultaneously estimate the pose and reconstruct the 3D shape of objects at the category level from a single RGB image. Previous works predominantly focused on either estimating poses(often at the instance level), or reconstructing shapes, but not both. Glissando-Net is composed of two auto-encoders that are jointly trained, one for RGB images and the other for point clouds. We embrace two key design choices in Glissando-Net to achieve a more accurate prediction of the 3D shape and pose of the object given a single RGB image as input. First, we augment the feature maps of the point cloud encoder and decoder with transformed feature maps from the image decoder, enabling effective 2D-3D interaction in both training and prediction. Second, we predict both the 3D shape and pose of the object in the decoder stage. This way, we better utilize the information in the 3D point clouds presented only in the training stage to train the network for more accurate prediction. We jointly train the two encoder-decoders for RGB and point cloud data to learn how to pass latent features to the point cloud decoder during inference. In testing, the encoder of the 3D point cloud is discarded. The design of Glissando-Net is inspired by codeSLAM. Unlike codeSLAM, which targets 3D reconstruction of scenes, we focus on pose estimation and shape reconstruction of objects, and directly predict the object pose and a pose invariant 3D reconstruction without the need of the code optimization step. Extensive experiments, involving both ablation studies and comparison with competing methods, demonstrate the efficacy of our proposed method, and compare favorably with the state-of-the-art.
△ Less
Submitted 24 January, 2025;
originally announced January 2025.
-
Automatic Labelling & Semantic Segmentation with 4D Radar Tensors
Authors:
Botao Sun,
Ignacio Roldan,
Francesco Fioranelli
Abstract:
In this paper, an automatic labelling process is presented for automotive datasets, leveraging on complementary information from LiDAR and camera. The generated labels are then used as ground truth with the corresponding 4D radar data as inputs to a proposed semantic segmentation network, to associate a class label to each spatial voxel. Promising results are shown by applying both approaches to t…
▽ More
In this paper, an automatic labelling process is presented for automotive datasets, leveraging on complementary information from LiDAR and camera. The generated labels are then used as ground truth with the corresponding 4D radar data as inputs to a proposed semantic segmentation network, to associate a class label to each spatial voxel. Promising results are shown by applying both approaches to the publicly shared RaDelft dataset, with the proposed network achieving over 65% of the LiDAR detection performance, improving 13.2% in vehicle detection probability, and reducing 0.54 m in terms of Chamfer distance, compared to variants inspired from the literature.
△ Less
Submitted 20 January, 2025;
originally announced January 2025.
-
Data-driven Online Slice Admission Control and Resource Allocation for 5G and Beyond Networks
Authors:
Muhammad Sulaiman,
Bo Sun,
Mohammad Ali Salahuddin,
Raouf Boutaba,
Aladdin Saleh
Abstract:
Virtualization in 5G and beyond networks allows the creation of virtual networks, or network slices, tailored to meet the requirements of various applications. However, this flexibility introduces several challenges for infrastructure providers (InPs) in slice admission control (AC) and resource allocation. To maximize revenue, InPs must decide in real-time whether to admit new slice requests (SRs…
▽ More
Virtualization in 5G and beyond networks allows the creation of virtual networks, or network slices, tailored to meet the requirements of various applications. However, this flexibility introduces several challenges for infrastructure providers (InPs) in slice admission control (AC) and resource allocation. To maximize revenue, InPs must decide in real-time whether to admit new slice requests (SRs) given slices' revenues, limited infrastructure resources, unknown relationship between resource allocation and Quality of Service (QoS), and the unpredictability of future SRs. To address these challenges, this paper introduces a novel data-driven framework for 5G slice admission control that offers a guaranteed upper bound on the competitive ratio, i.e., the ratio between the revenue obtained by an oracle solution and that of the online solution. The proposed framework leverages a pricing function to dynamically estimate resources' pseudo-prices that reflect resource scarcity. Such prices are further coupled with a resource allocation algorithm, which leverages a machine-learned slice model and employs a primal-dual algorithm to determine the minimum-cost resource allocation. The resource cost is then compared with the offered revenue to admit or reject a SR. To demonstrate the efficacy of our framework, we train the data-driven slice model using real traces collected from our 5G testbed. Our results show that our novel approach achieves up to 42% improvement in the empirical competitive ratio, i.e., ratio between the optimal and the online solution, compared to other benchmark algorithms.
△ Less
Submitted 17 January, 2025;
originally announced January 2025.
-
Gene Regulatory Network Inference in the Presence of Selection Bias and Latent Confounders
Authors:
Gongxu Luo,
Haoyue Dai,
Boyang Sun,
Loka Li,
Biwei Huang,
Petar Stojanov,
Kun Zhang
Abstract:
Gene Regulatory Network Inference (GRNI) aims to identify causal relationships among genes using gene expression data, providing insights into regulatory mechanisms. A significant yet often overlooked challenge is selection bias, a process where only cells meeting specific criteria, such as gene expression thresholds, survive or are observed, distorting the true joint distribution of genes and thu…
▽ More
Gene Regulatory Network Inference (GRNI) aims to identify causal relationships among genes using gene expression data, providing insights into regulatory mechanisms. A significant yet often overlooked challenge is selection bias, a process where only cells meeting specific criteria, such as gene expression thresholds, survive or are observed, distorting the true joint distribution of genes and thus biasing GRNI results. Furthermore, gene expression is influenced by latent confounders, such as non-coding RNAs, which add complexity to GRNI. To address these challenges, we propose GISL (Gene Regulatory Network Inference in the presence of Selection bias and Latent confounders), a novel algorithm to infer true regulatory relationships in the presence of selection and confounding issues. Leveraging data obtained via multiple gene perturbation experiments, we show that the true regulatory relationships, as well as selection processes and latent confounders can be partially identified without strong parametric models and under mild graphical assumptions. Experimental results on both synthetic and real-world single-cell gene expression datasets demonstrate the superiority of GISL over existing methods.
△ Less
Submitted 17 January, 2025;
originally announced January 2025.
-
Breaking Barriers or Building Dependency? Exploring Team-LLM Collaboration in AI-infused Classroom Debate
Authors:
Zihan Zhang,
Black Sun,
Pengcheng An
Abstract:
Classroom debates are a unique form of collaborative learning characterized by fast-paced, high-intensity interactions that foster critical thinking and teamwork. Despite the recognized importance of debates, the role of AI tools, particularly LLM-based systems, in supporting this dynamic learning environment has been under-explored in HCI. This study addresses this opportunity by investigating th…
▽ More
Classroom debates are a unique form of collaborative learning characterized by fast-paced, high-intensity interactions that foster critical thinking and teamwork. Despite the recognized importance of debates, the role of AI tools, particularly LLM-based systems, in supporting this dynamic learning environment has been under-explored in HCI. This study addresses this opportunity by investigating the integration of LLM-based AI into real-time classroom debates. Over four weeks, 22 students in a Design History course participated in three rounds of debates with support from ChatGPT. The findings reveal how learners prompted the AI to offer insights, collaboratively processed its outputs, and divided labor in team-AI interactions. The study also surfaces key advantages of AI usage, reducing social anxiety, breaking communication barriers, and providing scaffolding for novices, alongside risks, such as information overload and cognitive dependency, which could limit learners' autonomy. We thereby discuss a set of nuanced implications for future HCI exploration.
△ Less
Submitted 26 January, 2025; v1 submitted 15 January, 2025;
originally announced January 2025.
-
Facial Dynamics in Video: Instruction Tuning for Improved Facial Expression Perception and Contextual Awareness
Authors:
Jiaxing Zhao,
Boyuan Sun,
Xiang Chen,
Xihan Wei
Abstract:
Facial expression captioning has found widespread application across various domains. Recently, the emergence of video Multimodal Large Language Models (MLLMs) has shown promise in general video understanding tasks. However, describing facial expressions within videos poses two major challenges for these models: (1) the lack of adequate datasets and benchmarks, and (2) the limited visual token cap…
▽ More
Facial expression captioning has found widespread application across various domains. Recently, the emergence of video Multimodal Large Language Models (MLLMs) has shown promise in general video understanding tasks. However, describing facial expressions within videos poses two major challenges for these models: (1) the lack of adequate datasets and benchmarks, and (2) the limited visual token capacity of video MLLMs. To address these issues, this paper introduces a new instruction-following dataset tailored for dynamic facial expression caption. The dataset comprises 5,033 high-quality video clips annotated manually, containing over 700,000 tokens. Its purpose is to improve the capability of video MLLMs to discern subtle facial nuances. Furthermore, we propose FaceTrack-MM, which leverages a limited number of tokens to encode the main character's face. This model demonstrates superior performance in tracking faces and focusing on the facial expressions of the main characters, even in intricate multi-person scenarios. Additionally, we introduce a novel evaluation metric combining event extraction, relation classification, and the longest common subsequence (LCS) algorithm to assess the content consistency and temporal sequence consistency of generated text. Moreover, we present FEC-Bench, a benchmark designed to assess the performance of existing video MLLMs in this specific task. All data and source code will be made publicly available.
△ Less
Submitted 14 January, 2025;
originally announced January 2025.
-
LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding
Authors:
Jiaxing Zhao,
Boyuan Sun,
Xiang Chen,
Xihan Wei,
Qibin Hou
Abstract:
In this paper, we introduce LLaVA-Octopus, a novel video multimodal large language model. LLaVA-Octopus adaptively weights features from different visual projectors based on user instructions, enabling us to leverage the complementary strengths of each projector. We observe that different visual projectors exhibit distinct characteristics when handling specific tasks. For instance, some projectors…
▽ More
In this paper, we introduce LLaVA-Octopus, a novel video multimodal large language model. LLaVA-Octopus adaptively weights features from different visual projectors based on user instructions, enabling us to leverage the complementary strengths of each projector. We observe that different visual projectors exhibit distinct characteristics when handling specific tasks. For instance, some projectors excel at capturing static details, while others are more effective at processing temporal information, and some are better suited for tasks requiring temporal coherence. By dynamically adjusting feature weights according to user instructions, LLaVA-Octopus dynamically selects and combines the most suitable features, significantly enhancing the model's performance in multimodal tasks. Experimental results demonstrate that LLaVA-Octopus achieves excellent performance across multiple benchmarks, especially in tasks such as video question answering, long video understanding, and comprehensive multi-choices benchmarks, highlighting its broad application potential.
△ Less
Submitted 14 March, 2025; v1 submitted 9 January, 2025;
originally announced January 2025.
-
FrontierNet: Learning Visual Cues to Explore
Authors:
Boyang Sun,
Hanzhi Chen,
Stefan Leutenegger,
Cesar Cadena,
Marc Pollefeys,
Hermann Blum
Abstract:
Exploration of unknown environments is crucial for autonomous robots; it allows them to actively reason and decide on what new data to acquire for different tasks, such as mapping, object discovery, and environmental assessment. Existing solutions, such as frontier-based exploration approaches, rely heavily on 3D map operations, which are limited by map quality and, more critically, often overlook…
▽ More
Exploration of unknown environments is crucial for autonomous robots; it allows them to actively reason and decide on what new data to acquire for different tasks, such as mapping, object discovery, and environmental assessment. Existing solutions, such as frontier-based exploration approaches, rely heavily on 3D map operations, which are limited by map quality and, more critically, often overlook valuable context from visual cues. This work aims at leveraging 2D visual cues for efficient autonomous exploration, addressing the limitations of extracting goal poses from a 3D map. We propose a visual-only frontier-based exploration system, with FrontierNet as its core component. FrontierNet is a learning-based model that (i) proposes frontiers, and (ii) predicts their information gain, from posed RGB images enhanced by monocular depth priors. Our approach provides an alternative to existing 3D-dependent goal-extraction approaches, achieving a 15\% improvement in early-stage exploration efficiency, as validated through extensive simulations and real-world experiments. The project is available at https://github.com/cvg/FrontierNet.
△ Less
Submitted 7 May, 2025; v1 submitted 8 January, 2025;
originally announced January 2025.