-
Breast Cancer Classification in Deep Ultraviolet Fluorescence Images Using a Patch-Level Vision Transformer Framework
Authors:
Pouya Afshin,
David Helminiak,
Tongtong Lu,
Tina Yen,
Julie M. Jorns,
Mollie Patton,
Bing Yu,
Dong Hye Ye
Abstract:
Breast-conserving surgery (BCS) aims to completely remove malignant lesions while maximizing healthy tissue preservation. Intraoperative margin assessment is essential to achieve a balance between thorough cancer resection and tissue conservation. A deep ultraviolet fluorescence scanning microscope (DUV-FSM) enables rapid acquisition of whole surface images (WSIs) for excised tissue, providing con…
▽ More
Breast-conserving surgery (BCS) aims to completely remove malignant lesions while maximizing healthy tissue preservation. Intraoperative margin assessment is essential to achieve a balance between thorough cancer resection and tissue conservation. A deep ultraviolet fluorescence scanning microscope (DUV-FSM) enables rapid acquisition of whole surface images (WSIs) for excised tissue, providing contrast between malignant and normal tissues. However, breast cancer classification with DUV WSIs is challenged by high resolutions and complex histopathological features. This study introduces a DUV WSI classification framework using a patch-level vision transformer (ViT) model, capturing local and global features. Grad-CAM++ saliency weighting highlights relevant spatial regions, enhances result interpretability, and improves diagnostic accuracy for benign and malignant tissue classification. A comprehensive 5-fold cross-validation demonstrates the proposed approach significantly outperforms conventional deep learning methods, achieving a classification accuracy of 98.33%.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
Doxing via the Lens: Revealing Privacy Leakage in Image Geolocation for Agentic Multi-Modal Large Reasoning Model
Authors:
Weidi Luo,
Qiming Zhang,
Tianyu Lu,
Xiaogeng Liu,
Yue Zhao,
Zhen Xiang,
Chaowei Xiao
Abstract:
The increasing capabilities of agentic multi-modal large reasoning models, such as ChatGPT o3, have raised critical concerns regarding privacy leakage through inadvertent image geolocation. In this paper, we conduct the first systematic and controlled study on the potential privacy risks associated with visual reasoning abilities of ChatGPT o3. We manually collect and construct a dataset comprisin…
▽ More
The increasing capabilities of agentic multi-modal large reasoning models, such as ChatGPT o3, have raised critical concerns regarding privacy leakage through inadvertent image geolocation. In this paper, we conduct the first systematic and controlled study on the potential privacy risks associated with visual reasoning abilities of ChatGPT o3. We manually collect and construct a dataset comprising 50 real-world images that feature individuals alongside privacy-relevant environmental elements, capturing realistic and sensitive scenarios for analysis. Our experimental evaluation reveals that ChatGPT o3 can predict user locations with high precision, achieving street-level accuracy (within one mile) in 60% of cases. Through analysis, we identify key visual cues, including street layout and front yard design, that significantly contribute to the model inference success. Additionally, targeted occlusion experiments demonstrate that masking critical features effectively mitigates geolocation accuracy, providing insights into potential defense mechanisms. Our findings highlight an urgent need for privacy-aware development for agentic multi-modal large reasoning models, particularly in applications involving private imagery.
△ Less
Submitted 29 April, 2025; v1 submitted 27 April, 2025;
originally announced April 2025.
-
Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models
Authors:
Guo Chen,
Zhiqi Li,
Shihao Wang,
Jindong Jiang,
Yicheng Liu,
Lidong Lu,
De-An Huang,
Wonmin Byeon,
Matthieu Le,
Tuomas Rintamaki,
Tyler Poon,
Max Ehrlich,
Tuomas Rintamaki,
Tyler Poon,
Tong Lu,
Limin Wang,
Bryan Catanzaro,
Jan Kautz,
Andrew Tao,
Zhiding Yu,
Guilin Liu
Abstract:
We introduce Eagle 2.5, a family of frontier vision-language models (VLMs) for long-context multimodal learning. Our work addresses the challenges in long video comprehension and high-resolution image understanding, introducing a generalist framework for both tasks. The proposed training framework incorporates Automatic Degrade Sampling and Image Area Preservation, two techniques that preserve con…
▽ More
We introduce Eagle 2.5, a family of frontier vision-language models (VLMs) for long-context multimodal learning. Our work addresses the challenges in long video comprehension and high-resolution image understanding, introducing a generalist framework for both tasks. The proposed training framework incorporates Automatic Degrade Sampling and Image Area Preservation, two techniques that preserve contextual integrity and visual details. The framework also includes numerous efficiency optimizations in the pipeline for long-context data training. Finally, we propose Eagle-Video-110K, a novel dataset that integrates both story-level and clip-level annotations, facilitating long-video understanding. Eagle 2.5 demonstrates substantial improvements on long-context multimodal benchmarks, providing a robust solution to the limitations of existing VLMs. Notably, our best model Eagle 2.5-8B achieves 72.4% on Video-MME with 512 input frames, matching the results of top-tier commercial model such as GPT-4o and large-scale open-source models like Qwen2.5-VL-72B and InternVL2.5-78B.
△ Less
Submitted 21 April, 2025;
originally announced April 2025.
-
Instance-Adaptive Keypoint Learning with Local-to-Global Geometric Aggregation for Category-Level Object Pose Estimation
Authors:
Xiao Zhang,
Lu Zou,
Tao Lu,
Yuan Yao,
Zhangjin Huang,
Guoping Wang
Abstract:
Category-level object pose estimation aims to predict the 6D pose and size of previously unseen instances from predefined categories, requiring strong generalization across diverse object instances. Although many previous methods attempt to mitigate intra-class variations, they often struggle with instances exhibiting complex geometries or significant deviations from canonical shapes. To address t…
▽ More
Category-level object pose estimation aims to predict the 6D pose and size of previously unseen instances from predefined categories, requiring strong generalization across diverse object instances. Although many previous methods attempt to mitigate intra-class variations, they often struggle with instances exhibiting complex geometries or significant deviations from canonical shapes. To address this challenge, we propose INKL-Pose, a novel category-level object pose estimation framework that enables INstance-adaptive Keypoint Learning with local-to-global geometric aggregation. Specifically, our approach first predicts semantically consistent and geometric informative keypoints through an Instance-Adaptive Keypoint Generator, then refines them with: (1) a Local Keypoint Feature Aggregator capturing fine-grained geometries, and (2) a Global Keypoint Feature Aggregator using bidirectional Mamba for structural consistency. To enable bidirectional modeling in Mamba, we introduce a Feature Sequence Flipping strategy that preserves spatial coherence while constructing backward feature sequences. Additionally, we design a surface loss and a separation loss to enforce uniform coverage and spatial diversity in keypoint distribution. The generated keypoints are finally mapped to a canonical space for regressing the object's 6D pose and size. Extensive experiments on CAMERA25, REAL275, and HouseCat6D demonstrate that INKL-Pose achieves state-of-the-art performance and significantly outperforms existing methods.
△ Less
Submitted 26 April, 2025; v1 submitted 21 April, 2025;
originally announced April 2025.
-
FedCIA: Federated Collaborative Information Aggregation for Privacy-Preserving Recommendation
Authors:
Mingzhe Han,
Dongsheng Li,
Jiafeng Xia,
Jiahao Liu,
Hansu Gu,
Peng Zhang,
Ning Gu,
Tun Lu
Abstract:
Recommendation algorithms rely on user historical interactions to deliver personalized suggestions, which raises significant privacy concerns. Federated recommendation algorithms tackle this issue by combining local model training with server-side model aggregation, where most existing algorithms use a uniform weighted summation to aggregate item embeddings from different client models. This appro…
▽ More
Recommendation algorithms rely on user historical interactions to deliver personalized suggestions, which raises significant privacy concerns. Federated recommendation algorithms tackle this issue by combining local model training with server-side model aggregation, where most existing algorithms use a uniform weighted summation to aggregate item embeddings from different client models. This approach has three major limitations: 1) information loss during aggregation, 2) failure to retain personalized local features, and 3) incompatibility with parameter-free recommendation algorithms. To address these limitations, we first review the development of recommendation algorithms and recognize that their core function is to share collaborative information, specifically the global relationship between users and items. With this understanding, we propose a novel aggregation paradigm named collaborative information aggregation, which focuses on sharing collaborative information rather than item parameters. Based on this new paradigm, we introduce the federated collaborative information aggregation (FedCIA) method for privacy-preserving recommendation. This method requires each client to upload item similarity matrices for aggregation, which allows clients to align their local models without constraining embeddings to a unified vector space. As a result, it mitigates information loss caused by direct summation, preserves the personalized embedding distributions of individual clients, and supports the aggregation of parameter-free models. Theoretical analysis and experimental results on real-world datasets demonstrate the superior performance of FedCIA compared with the state-of-the-art federated recommendation algorithms. Code is available at https://github.com/Mingzhe-Han/FedCIA.
△ Less
Submitted 19 April, 2025;
originally announced April 2025.
-
Active Learning of Symbolic NetKAT Automata
Authors:
Mark Moeller,
Tiago Ferreira,
Thomas Lu,
Nate Foster,
Alexandra Silva
Abstract:
NetKAT is a domain-specific programming language and logic that has been successfully used to specify and verify the behavior of packet-switched networks. This paper develops techniques for automatically learning NetKAT models of unknown networks using active learning. Prior work has explored active learning for a wide range of automata (e.g., deterministic, register, Büchi, timed etc.) and also d…
▽ More
NetKAT is a domain-specific programming language and logic that has been successfully used to specify and verify the behavior of packet-switched networks. This paper develops techniques for automatically learning NetKAT models of unknown networks using active learning. Prior work has explored active learning for a wide range of automata (e.g., deterministic, register, Büchi, timed etc.) and also developed applications, such as validating implementations of network protocols. We present algorithms for learning different types of NetKAT automata, including symbolic automata proposed in recent work. We prove the soundness of these algorithms, build a prototype implementation, and evaluate it on a standard benchmark. Our results highlight the applicability of symbolic NetKAT learning for realistic network configurations and topologies.
△ Less
Submitted 18 April, 2025;
originally announced April 2025.
-
Adaptive Human-Agent Teaming: A Review of Empirical Studies from the Process Dynamics Perspective
Authors:
Mengyao Wang,
Jiayun Wu,
Shuai Ma,
Nuo Li,
Peng Zhang,
Ning Gu,
Tun Lu
Abstract:
The rapid advancement of AI, including Large Language Models, has propelled autonomous agents forward, accelerating the human-agent teaming (HAT) paradigm to leverage complementary strengths. However, HAT research remains fragmented, often focusing on isolated team development phases or specific challenges like trust calibration while overlooking the real-world need for adaptability. Addressing th…
▽ More
The rapid advancement of AI, including Large Language Models, has propelled autonomous agents forward, accelerating the human-agent teaming (HAT) paradigm to leverage complementary strengths. However, HAT research remains fragmented, often focusing on isolated team development phases or specific challenges like trust calibration while overlooking the real-world need for adaptability. Addressing these gaps, a process dynamics perspective is adopted to systematically review HAT using the T$^4$ framework: Team Formation, Task and Role Development, Team Development, and Team Improvement. Each phase is examined in terms of its goals, actions, and evaluation metrics, emphasizing the co-evolution of task and team dynamics. Special focus is given to the second and third phases, highlighting key factors such as team roles, shared mental model, and backup behaviors. This holistic perspective identifies future research directions for advancing long-term adaptive HAT.
△ Less
Submitted 15 April, 2025;
originally announced April 2025.
-
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Authors:
Jinguo Zhu,
Weiyun Wang,
Zhe Chen,
Zhaoyang Liu,
Shenglong Ye,
Lixin Gu,
Hao Tian,
Yuchen Duan,
Weijie Su,
Jie Shao,
Zhangwei Gao,
Erfei Cui,
Xuehui Wang,
Yue Cao,
Yangzhou Liu,
Xingguang Wei,
Hongjie Zhang,
Haomin Wang,
Weiye Xu,
Hao Li,
Jiahao Wang,
Nianchen Deng,
Songze Li,
Yinan He,
Tan Jiang
, et al. (26 additional authors not shown)
Abstract:
We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single p…
▽ More
We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multi-modal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.
△ Less
Submitted 18 April, 2025; v1 submitted 14 April, 2025;
originally announced April 2025.
-
Art3D: Training-Free 3D Generation from Flat-Colored Illustration
Authors:
Xiaoyan Cong,
Jiayi Shen,
Zekun Li,
Rao Fu,
Tao Lu,
Srinath Sridhar
Abstract:
Large-scale pre-trained image-to-3D generative models have exhibited remarkable capabilities in diverse shape generations. However, most of them struggle to synthesize plausible 3D assets when the reference image is flat-colored like hand drawings due to the lack of 3D illusion, which are often the most user-friendly input modalities in art content creation. To this end, we propose Art3D, a traini…
▽ More
Large-scale pre-trained image-to-3D generative models have exhibited remarkable capabilities in diverse shape generations. However, most of them struggle to synthesize plausible 3D assets when the reference image is flat-colored like hand drawings due to the lack of 3D illusion, which are often the most user-friendly input modalities in art content creation. To this end, we propose Art3D, a training-free method that can lift flat-colored 2D designs into 3D. By leveraging structural and semantic features with pre- trained 2D image generation models and a VLM-based realism evaluation, Art3D successfully enhances the three-dimensional illusion in reference images, thus simplifying the process of generating 3D from 2D, and proves adaptable to a wide range of painting styles. To benchmark the generalization performance of existing image-to-3D models on flat-colored images without 3D feeling, we collect a new dataset, Flat-2D, with over 100 samples. Experimental results demonstrate the performance and robustness of Art3D, exhibiting superior generalizable capacity and promising practical applicability. Our source code and dataset will be publicly available on our project page: https://joy-jy11.github.io/ .
△ Less
Submitted 14 April, 2025;
originally announced April 2025.
-
From Token to Line: Enhancing Code Generation with a Long-Term Perspective
Authors:
Tingwei Lu,
Yangning Li,
Liyuan Wang,
Binghuai Lin,
Jiwei Tang,
Wanshi Xu,
Hai-Tao Zheng,
Yinghui Li,
Bingxu An,
Zhao Wei,
Yong Xu
Abstract:
The emergence of large language models (LLMs) has significantly promoted the development of code generation task, sparking a surge in pertinent literature. Current research is hindered by redundant generation results and a tendency to overfit local patterns in the short term. Although existing studies attempt to alleviate the issue by adopting a multi-token prediction strategy, there remains limit…
▽ More
The emergence of large language models (LLMs) has significantly promoted the development of code generation task, sparking a surge in pertinent literature. Current research is hindered by redundant generation results and a tendency to overfit local patterns in the short term. Although existing studies attempt to alleviate the issue by adopting a multi-token prediction strategy, there remains limited focus on choosing the appropriate processing length for generations. By analyzing the attention between tokens during the generation process of LLMs, it can be observed that the high spikes of the attention scores typically appear at the end of lines. This insight suggests that it is reasonable to treat each line of code as a fundamental processing unit and generate them sequentially. Inspired by this, we propose the \textbf{LSR-MCTS} algorithm, which leverages MCTS to determine the code line-by-line and select the optimal path. Further, we integrate a self-refine mechanism at each node to enhance diversity and generate higher-quality programs through error correction. Extensive experiments and comprehensive analyses on three public coding benchmarks demonstrate that our method outperforms the state-of-the-art performance approaches.
△ Less
Submitted 18 April, 2025; v1 submitted 10 April, 2025;
originally announced April 2025.
-
Unraveling Human-AI Teaming: A Review and Outlook
Authors:
Bowen Lou,
Tian Lu,
T. S. Raghu,
Yingjie Zhang
Abstract:
Artificial Intelligence (AI) is advancing at an unprecedented pace, with clear potential to enhance decision-making and productivity. Yet, the collaborative decision-making process between humans and AI remains underdeveloped, often falling short of its transformative possibilities. This paper explores the evolution of AI agents from passive tools to active collaborators in human-AI teams, emphasi…
▽ More
Artificial Intelligence (AI) is advancing at an unprecedented pace, with clear potential to enhance decision-making and productivity. Yet, the collaborative decision-making process between humans and AI remains underdeveloped, often falling short of its transformative possibilities. This paper explores the evolution of AI agents from passive tools to active collaborators in human-AI teams, emphasizing their ability to learn, adapt, and operate autonomously in complex environments. This paradigm shifts challenges traditional team dynamics, requiring new interaction protocols, delegation strategies, and responsibility distribution frameworks. Drawing on Team Situation Awareness (SA) theory, we identify two critical gaps in current human-AI teaming research: the difficulty of aligning AI agents with human values and objectives, and the underutilization of AI's capabilities as genuine team members. Addressing these gaps, we propose a structured research outlook centered on four key aspects of human-AI teaming: formulation, coordination, maintenance, and training. Our framework highlights the importance of shared mental models, trust-building, conflict resolution, and skill adaptation for effective teaming. Furthermore, we discuss the unique challenges posed by varying team compositions, goals, and complexities. This paper provides a foundational agenda for future research and practical design of sustainable, high-performing human-AI teams.
△ Less
Submitted 9 April, 2025; v1 submitted 8 April, 2025;
originally announced April 2025.
-
SCMPPI: Supervised Contrastive Multimodal Framework for Predicting Protein-Protein Interactions
Authors:
Shengrui XU,
Tianchi Lu,
Zikun Wang,
Jixiu Zhai
Abstract:
Protein-protein interaction (PPI) prediction plays a pivotal role in deciphering cellular functions and disease mechanisms. To address the limitations of traditional experimental methods and existing computational approaches in cross-modal feature fusion and false-negative suppression, we propose SCMPPI-a novel supervised contrastive multimodal framework. By effectively integrating sequence-based…
▽ More
Protein-protein interaction (PPI) prediction plays a pivotal role in deciphering cellular functions and disease mechanisms. To address the limitations of traditional experimental methods and existing computational approaches in cross-modal feature fusion and false-negative suppression, we propose SCMPPI-a novel supervised contrastive multimodal framework. By effectively integrating sequence-based features (AAC, DPC, ESMC-CKSAAP) with network topology (Node2Vec embeddings) and incorporating an enhanced contrastive learning strategy with negative sample filtering, SCMPPI achieves superior prediction performance. Extensive experiments on eight benchmark datasets demonstrate its state-of-the-art accuracy(98.13%) and AUC(99.69%), along with excellent cross-species generalization (AUC>99%). Successful applications in CD9 networks, Wnt pathway analysis, and cancer-specific networks further highlight its potential for disease target discovery, establishing SCMPPI as a powerful tool for multimodal biological data analysis.
△ Less
Submitted 27 April, 2025; v1 submitted 3 April, 2025;
originally announced April 2025.
-
CoGen: 3D Consistent Video Generation via Adaptive Conditioning for Autonomous Driving
Authors:
Yishen Ji,
Ziyue Zhu,
Zhenxin Zhu,
Kaixin Xiong,
Ming Lu,
Zhiqi Li,
Lijun Zhou,
Haiyang Sun,
Bing Wang,
Tong Lu
Abstract:
Recent progress in driving video generation has shown significant potential for enhancing self-driving systems by providing scalable and controllable training data. Although pretrained state-of-the-art generation models, guided by 2D layout conditions (e.g., HD maps and bounding boxes), can produce photorealistic driving videos, achieving controllable multi-view videos with high 3D consistency rem…
▽ More
Recent progress in driving video generation has shown significant potential for enhancing self-driving systems by providing scalable and controllable training data. Although pretrained state-of-the-art generation models, guided by 2D layout conditions (e.g., HD maps and bounding boxes), can produce photorealistic driving videos, achieving controllable multi-view videos with high 3D consistency remains a major challenge. To tackle this, we introduce a novel spatial adaptive generation framework, CoGen, which leverages advances in 3D generation to improve performance in two key aspects: (i) To ensure 3D consistency, we first generate high-quality, controllable 3D conditions that capture the geometry of driving scenes. By replacing coarse 2D conditions with these fine-grained 3D representations, our approach significantly enhances the spatial consistency of the generated videos. (ii) Additionally, we introduce a consistency adapter module to strengthen the robustness of the model to multi-condition control. The results demonstrate that this method excels in preserving geometric fidelity and visual realism, offering a reliable video generation solution for autonomous driving.
△ Less
Submitted 5 April, 2025; v1 submitted 28 March, 2025;
originally announced March 2025.
-
HyperFree: A Channel-adaptive and Tuning-free Foundation Model for Hyperspectral Remote Sensing Imagery
Authors:
Jingtao Li,
Yingyi Liu,
Xinyu Wang,
Yunning Peng,
Chen Sun,
Shaoyu Wang,
Zhendong Sun,
Tian Ke,
Xiao Jiang,
Tangwei Lu,
Anran Zhao,
Yanfei Zhong
Abstract:
Advanced interpretation of hyperspectral remote sensing images benefits many precise Earth observation tasks. Recently, visual foundation models have promoted the remote sensing interpretation but concentrating on RGB and multispectral images. Due to the varied hyperspectral channels,existing foundation models would face image-by-image tuning situation, imposing great pressure on hardware and time…
▽ More
Advanced interpretation of hyperspectral remote sensing images benefits many precise Earth observation tasks. Recently, visual foundation models have promoted the remote sensing interpretation but concentrating on RGB and multispectral images. Due to the varied hyperspectral channels,existing foundation models would face image-by-image tuning situation, imposing great pressure on hardware and time resources. In this paper, we propose a tuning-free hyperspectral foundation model called HyperFree, by adapting the existing visual prompt engineering. To process varied channel numbers, we design a learned weight dictionary covering full-spectrum from $0.4 \sim 2.5 \, μ\text{m}$, supporting to build the embedding layer dynamically. To make the prompt design more tractable, HyperFree can generate multiple semantic-aware masks for one prompt by treating feature distance as semantic-similarity. After pre-training HyperFree on constructed large-scale high-resolution hyperspectral images, HyperFree (1 prompt) has shown comparable results with specialized models (5 shots) on 5 tasks and 11 datasets.Code and dataset are accessible at https://rsidea.whu.edu.cn/hyperfree.htm.
△ Less
Submitted 27 March, 2025;
originally announced March 2025.
-
Representative Ranking for Deliberation in the Public Sphere
Authors:
Manon Revel,
Smitha Milli,
Tyler Lu,
Jamelle Watson-Daniels,
Max Nickel
Abstract:
Online comment sections, such as those on news sites or social media, have the potential to foster informal public deliberation, However, this potential is often undermined by the frequency of toxic or low-quality exchanges that occur in these settings. To combat this, platforms increasingly leverage algorithmic ranking to facilitate higher-quality discussions, e.g., by using civility classifiers…
▽ More
Online comment sections, such as those on news sites or social media, have the potential to foster informal public deliberation, However, this potential is often undermined by the frequency of toxic or low-quality exchanges that occur in these settings. To combat this, platforms increasingly leverage algorithmic ranking to facilitate higher-quality discussions, e.g., by using civility classifiers or forms of prosocial ranking. Yet, these interventions may also inadvertently reduce the visibility of legitimate viewpoints, undermining another key aspect of deliberation: representation of diverse views. We seek to remedy this problem by introducing guarantees of representation into these methods. In particular, we adopt the notion of justified representation (JR) from the social choice literature and incorporate a JR constraint into the comment ranking setting. We find that enforcing JR leads to greater inclusion of diverse viewpoints while still being compatible with optimizing for user engagement or other measures of conversational quality.
△ Less
Submitted 19 March, 2025;
originally announced March 2025.
-
Selecting and Pruning: A Differentiable Causal Sequentialized State-Space Model for Two-View Correspondence Learning
Authors:
Xiang Fang,
Shihua Zhang,
Hao Zhang,
Tao Lu,
Huabing Zhou,
Jiayi Ma
Abstract:
Two-view correspondence learning aims to discern true and false correspondences between image pairs by recognizing their underlying different information. Previous methods either treat the information equally or require the explicit storage of the entire context, tending to be laborious in real-world scenarios. Inspired by Mamba's inherent selectivity, we propose \textbf{CorrMamba}, a \textbf{Corr…
▽ More
Two-view correspondence learning aims to discern true and false correspondences between image pairs by recognizing their underlying different information. Previous methods either treat the information equally or require the explicit storage of the entire context, tending to be laborious in real-world scenarios. Inspired by Mamba's inherent selectivity, we propose \textbf{CorrMamba}, a \textbf{Corr}espondence filter leveraging \textbf{Mamba}'s ability to selectively mine information from true correspondences while mitigating interference from false ones, thus achieving adaptive focus at a lower cost. To prevent Mamba from being potentially impacted by unordered keypoints that obscured its ability to mine spatial information, we customize a causal sequential learning approach based on the Gumbel-Softmax technique to establish causal dependencies between features in a fully autonomous and differentiable manner. Additionally, a local-context enhancement module is designed to capture critical contextual cues essential for correspondence pruning, complementing the core framework. Extensive experiments on relative pose estimation, visual localization, and analysis demonstrate that CorrMamba achieves state-of-the-art performance. Notably, in outdoor relative pose estimation, our method surpasses the previous SOTA by $2.58$ absolute percentage points in AUC@20\textdegree, highlighting its practical superiority. Our code will be publicly available.
△ Less
Submitted 23 March, 2025;
originally announced March 2025.
-
NeuGrasp: Generalizable Neural Surface Reconstruction with Background Priors for Material-Agnostic Object Grasp Detection
Authors:
Qingyu Fan,
Yinghao Cai,
Chao Li,
Wenzhe He,
Xudong Zheng,
Tao Lu,
Bin Liang,
Shuo Wang
Abstract:
Robotic grasping in scenes with transparent and specular objects presents great challenges for methods relying on accurate depth information. In this paper, we introduce NeuGrasp, a neural surface reconstruction method that leverages background priors for material-agnostic grasp detection. NeuGrasp integrates transformers and global prior volumes to aggregate multi-view features with spatial encod…
▽ More
Robotic grasping in scenes with transparent and specular objects presents great challenges for methods relying on accurate depth information. In this paper, we introduce NeuGrasp, a neural surface reconstruction method that leverages background priors for material-agnostic grasp detection. NeuGrasp integrates transformers and global prior volumes to aggregate multi-view features with spatial encoding, enabling robust surface reconstruction in narrow and sparse viewing conditions. By focusing on foreground objects through residual feature enhancement and refining spatial perception with an occupancy-prior volume, NeuGrasp excels in handling objects with transparent and specular surfaces. Extensive experiments in both simulated and real-world scenarios show that NeuGrasp outperforms state-of-the-art methods in grasping while maintaining comparable reconstruction quality. More details are available at https://neugrasp.github.io/.
△ Less
Submitted 5 March, 2025;
originally announced March 2025.
-
YouthCare: Building a Personalized Collaborative Video Censorship Tool to Support Parent-Child Joint Media Engagement
Authors:
Wenxin Zhao,
Fangyu Yu,
Peng Zhang,
Hansu Gu,
Lin Wang,
Siyuan Qiao,
Tun Lu,
Ning Gu
Abstract:
To mitigate the negative impacts of online videos on teenagers, existing research and platforms have implemented various parental mediation mechanisms, such as Parent-Child Joint Media Engagement (JME). However, JME generally relies heavily on parents' time, knowledge, and experience. To fill this gap, we aim to design an automatic tool to help parents/children censor videos more effectively and e…
▽ More
To mitigate the negative impacts of online videos on teenagers, existing research and platforms have implemented various parental mediation mechanisms, such as Parent-Child Joint Media Engagement (JME). However, JME generally relies heavily on parents' time, knowledge, and experience. To fill this gap, we aim to design an automatic tool to help parents/children censor videos more effectively and efficiently in JME. For this goal, we first conducted a formative study to identify the needs and expectations of teenagers and parents for such a system. Based on the findings, we designed YouthCare, a personalized collaborative video censorship tool that supports parents and children to collaboratively filter out inappropriate content and select appropriate content in JME. An evaluation with 10 parent-child pairs demonstrated YouthCare's several strengths in supporting video censorship, while also highlighting some potential problems. These findings inspire us to propose several insights for the future design of parent-child collaborative JME systems.
△ Less
Submitted 3 March, 2025;
originally announced March 2025.
-
RemiHaven: Integrating "In-Town" and "Out-of-Town" Peers to Provide Personalized Reminiscence Support for Older Drifters
Authors:
Xuechen Zhang,
Changyang He,
Peng Zhang,
Hansu Gu,
Ning Gu,
Qi Shen,
Zhan Hu,
Tun Lu
Abstract:
With increasing social mobility and an aging society, more older adults in China are migrating to new cities, known as "older drifters." Due to fewer social connections and cultural adaptation challenges, they face negative emotions such as loneliness and depression. While reminiscence-based interventions have been used to improve older adults' psychological well-being, challenges such as the lack…
▽ More
With increasing social mobility and an aging society, more older adults in China are migrating to new cities, known as "older drifters." Due to fewer social connections and cultural adaptation challenges, they face negative emotions such as loneliness and depression. While reminiscence-based interventions have been used to improve older adults' psychological well-being, challenges such as the lack of tangible materials and limited social resources constrain the feasibility of traditional reminiscence approaches for older drifters. To address this challenge, we designed RemiHaven, a personalized reminiscence support tool based on a two-phase formative study. It integrates "In-Town" and "Out-of-Town" peer agents to enhance personalization, engagement, and emotional resonance in the reminiscence process, powered by Multimodal Large Language Models (MLLMs). Our evaluations show RemiHaven's strengths in supporting reminiscence while identifying potential challenges. We conclude by offering insights for the future design of reminiscence support tools for older migrants.
△ Less
Submitted 3 March, 2025;
originally announced March 2025.
-
A Multiple Transferable Neural Network Method with Domain Decomposition for Elliptic Interface Problems
Authors:
Tianzheng Lu,
Lili Ju,
Liyong Zhu
Abstract:
The transferable neural network (TransNet) is a two-layer shallow neural network with pre-determined and uniformly distributed neurons in the hidden layer, and the least-squares solvers can be particularly used to compute the parameters of its output layer when applied to the solution of partial differential equations. In this paper, we integrate the TransNet technique with the nonoverlapping doma…
▽ More
The transferable neural network (TransNet) is a two-layer shallow neural network with pre-determined and uniformly distributed neurons in the hidden layer, and the least-squares solvers can be particularly used to compute the parameters of its output layer when applied to the solution of partial differential equations. In this paper, we integrate the TransNet technique with the nonoverlapping domain decomposition and the interface conditions to develop a novel multiple transferable neural network (Multi-TransNet) method for solving elliptic interface problems, which typically contain discontinuities in both solutions and their derivatives across interfaces. We first propose an empirical formula for the TransNet to characterize the relationship between the radius of the domain-covering ball, the number of hidden-layer neurons, and the optimal neuron shape. In the Multi-TransNet method, we assign each subdomain one distinct TransNet with an adaptively determined number of hidden-layer neurons to maintain the globally uniform neuron distribution across the entire computational domain, and then unite all the subdomain TransNets together by incorporating the interface condition terms into the loss function. The empirical formula is also extended to the Multi-TransNet and further employed to estimate appropriate neuron shapes for the subdomain TransNets, greatly reducing the parameter tuning cost. Additionally, we propose a normalization approach to adaptively select the weighting parameters for the terms in the loss function. Ablation studies and extensive experiments with comparison tests on different types of elliptic interface problems with low to high contrast diffusion coefficients in two and three dimensions are carried out to numerically demonstrate the superior accuracy, efficiency, and robustness of the proposed Multi-TransNet method.
△ Less
Submitted 27 February, 2025;
originally announced February 2025.
-
A general language model for peptide identification
Authors:
Jixiu Zhai,
Tianchi Lu,
Haitian Zhong,
Ziyang Xu,
Yuhuan Liu,
Shengrui Xu,
Jingwan Wang,
Dan Huang
Abstract:
Advances in peptide identification are revolutionizing our ability to decipher protein functions and accelerate therapeutic discovery. We present PDeepPP, a deep learning framework that integrates pretrained protein language models with parallel transformer-CNN architectures, achieving state-of-the-art performance in peptide characterization tasks. The model's hybrid architecture demonstrates uniq…
▽ More
Advances in peptide identification are revolutionizing our ability to decipher protein functions and accelerate therapeutic discovery. We present PDeepPP, a deep learning framework that integrates pretrained protein language models with parallel transformer-CNN architectures, achieving state-of-the-art performance in peptide characterization tasks. The model's hybrid architecture demonstrates unique capabilities in capturing both local sequence motifs and global structural features, as evidenced by 29% improved cluster separation in UMAP visualizations compared to conventional approaches. Evaluated across 33 biological recognition tasks - including post-translational modification site prediction and bioactive peptide identification - PDeepPP outperformed existing methods in 25 tasks with average AUC improvements of 4.2%. Notably, it achieved 0.9726 accuracy with PR AUC 0.9977 in antimicrobial peptide detection while reducing false negatives by 37.5% in antimalarial recognition scenarios. This framework enables accurate large-scale peptide analysis, achieving 218* acceleration over sequence-alignment-based methods while maintaining 99.5% specificity in critical glycosylation site detection.PDeepPP establishes a new paradigm for computational peptide analysis through its synergistic architecture design, enabling rapid yet precise functional annotation that bridges molecular pattern recognition with translational biomedical applications.We have made our implementation, including code, data, and pretrained models, publicly available via GitHub (https://github.com/fondress/PDeepPP) and Hugging Face (https://huggingface.co/fondress/PDeppPP).
△ Less
Submitted 17 April, 2025; v1 submitted 21 February, 2025;
originally announced February 2025.
-
Improving LLM-powered Recommendations with Personalized Information
Authors:
Jiahao Liu,
Xueshuo Yan,
Dongsheng Li,
Guangping Zhang,
Hansu Gu,
Peng Zhang,
Tun Lu,
Li Shang,
Ning Gu
Abstract:
Due to the lack of explicit reasoning modeling, existing LLM-powered recommendations fail to leverage LLMs' reasoning capabilities effectively. In this paper, we propose a pipeline called CoT-Rec, which integrates two key Chain-of-Thought (CoT) processes -- user preference analysis and item perception analysis -- into LLM-powered recommendations, thereby enhancing the utilization of LLMs' reasonin…
▽ More
Due to the lack of explicit reasoning modeling, existing LLM-powered recommendations fail to leverage LLMs' reasoning capabilities effectively. In this paper, we propose a pipeline called CoT-Rec, which integrates two key Chain-of-Thought (CoT) processes -- user preference analysis and item perception analysis -- into LLM-powered recommendations, thereby enhancing the utilization of LLMs' reasoning abilities. CoT-Rec consists of two stages: (1) personalized information extraction, where user preferences and item perception are extracted, and (2) personalized information utilization, where this information is incorporated into the LLM-powered recommendation process. Experimental results demonstrate that CoT-Rec shows potential for improving LLM-powered recommendations. The implementation is publicly available at https://github.com/jhliu0807/CoT-Rec.
△ Less
Submitted 18 April, 2025; v1 submitted 19 February, 2025;
originally announced February 2025.
-
AgentCF++: Memory-enhanced LLM-based Agents for Popularity-aware Cross-domain Recommendations
Authors:
Jiahao Liu,
Shengkang Gu,
Dongsheng Li,
Guangping Zhang,
Mingzhe Han,
Hansu Gu,
Peng Zhang,
Tun Lu,
Li Shang,
Ning Gu
Abstract:
LLM-based user agents, which simulate user interaction behavior, are emerging as a promising approach to enhancing recommender systems. In real-world scenarios, users' interactions often exhibit cross-domain characteristics and are influenced by others. However, the memory design in current methods causes user agents to introduce significant irrelevant information during decision-making in cross-d…
▽ More
LLM-based user agents, which simulate user interaction behavior, are emerging as a promising approach to enhancing recommender systems. In real-world scenarios, users' interactions often exhibit cross-domain characteristics and are influenced by others. However, the memory design in current methods causes user agents to introduce significant irrelevant information during decision-making in cross-domain scenarios and makes them unable to recognize the influence of other users' interactions, such as popularity factors. To tackle this issue, we propose a dual-layer memory architecture combined with a two-step fusion mechanism. This design avoids irrelevant information during decision-making while ensuring effective integration of cross-domain preferences. We also introduce the concepts of interest groups and group-shared memory to better capture the influence of popularity factors on users with similar interests. Comprehensive experiments validate the effectiveness of AgentCF++. Our code is available at https://github.com/jhliu0807/AgentCF-plus.
△ Less
Submitted 18 April, 2025; v1 submitted 19 February, 2025;
originally announced February 2025.
-
Unbiased Collaborative Filtering with Fair Sampling
Authors:
Jiahao Liu,
Dongsheng Li,
Hansu Gu,
Peng Zhang,
Tun Lu,
Li Shang,
Ning Gu
Abstract:
Recommender systems leverage extensive user interaction data to model preferences; however, directly modeling these data may introduce biases that disproportionately favor popular items. In this paper, we demonstrate that popularity bias arises from the influence of propensity factors during training. Building on this insight, we propose a fair sampling (FS) method that ensures each user and each…
▽ More
Recommender systems leverage extensive user interaction data to model preferences; however, directly modeling these data may introduce biases that disproportionately favor popular items. In this paper, we demonstrate that popularity bias arises from the influence of propensity factors during training. Building on this insight, we propose a fair sampling (FS) method that ensures each user and each item has an equal likelihood of being selected as both positive and negative instances, thereby mitigating the influence of propensity factors. The proposed FS method does not require estimating propensity scores, thus avoiding the risk of failing to fully eliminate popularity bias caused by estimation inaccuracies. Comprehensive experiments demonstrate that the proposed FS method achieves state-of-the-art performance in both point-wise and pair-wise recommendation tasks. The code implementation is available at https://github.com/jhliu0807/Fair-Sampling.
△ Less
Submitted 18 April, 2025; v1 submitted 19 February, 2025;
originally announced February 2025.
-
JamendoMaxCaps: A Large Scale Music-caption Dataset with Imputed Metadata
Authors:
Abhinaba Roy,
Renhang Liu,
Tongyu Lu,
Dorien Herremans
Abstract:
We introduce JamendoMaxCaps, a large-scale music-caption dataset featuring over 200,000 freely licensed instrumental tracks from the renowned Jamendo platform. The dataset includes captions generated by a state-of-the-art captioning model, enhanced with imputed metadata. We also introduce a retrieval system that leverages both musical features and metadata to identify similar songs, which are then…
▽ More
We introduce JamendoMaxCaps, a large-scale music-caption dataset featuring over 200,000 freely licensed instrumental tracks from the renowned Jamendo platform. The dataset includes captions generated by a state-of-the-art captioning model, enhanced with imputed metadata. We also introduce a retrieval system that leverages both musical features and metadata to identify similar songs, which are then used to fill in missing metadata using a local large language model (LLLM). This approach allows us to provide a more comprehensive and informative dataset for researchers working on music-language understanding tasks. We validate this approach quantitatively with five different measurements. By making the JamendoMaxCaps dataset publicly available, we provide a high-quality resource to advance research in music-language understanding tasks such as music retrieval, multimodal representation learning, and generative music models.
△ Less
Submitted 11 February, 2025;
originally announced February 2025.
-
Ego vs. Exo and Active vs. Passive: Investigating the Effects of Viewpoint and Navigation on Spatial Immersion and Understanding in Immersive Storytelling
Authors:
Tao Lu,
Qian Zhu,
Tiffany Ma,
Wong Kam-Kwai,
Anlan Xie,
Alex Endert,
Yalong Yang
Abstract:
Visual storytelling combines visuals and narratives to communicate important insights. While web-based visual storytelling is well-established, leveraging the next generation of digital technologies for visual storytelling, specifically immersive technologies, remains underexplored. We investigated the impact of the story viewpoint (from the audience's perspective) and navigation (when progressing…
▽ More
Visual storytelling combines visuals and narratives to communicate important insights. While web-based visual storytelling is well-established, leveraging the next generation of digital technologies for visual storytelling, specifically immersive technologies, remains underexplored. We investigated the impact of the story viewpoint (from the audience's perspective) and navigation (when progressing through the story) on spatial immersion and understanding. First, we collected web-based 3D stories and elicited design considerations from three VR developers. We then adapted four selected web-based stories to an immersive format. Finally, we conducted a user study (N=24) to examine egocentric and exocentric viewpoints, active and passive navigation, and the combinations they form. Our results indicated significantly higher preferences for egocentric+active (higher agency and engagement) and exocentric+passive (higher focus on content). We also found a marginal significance of viewpoints on story understanding and a strong significance of navigation on spatial immersion.
△ Less
Submitted 6 February, 2025;
originally announced February 2025.
-
ImprovNet -- Generating Controllable Musical Improvisations with Iterative Corruption Refinement
Authors:
Keshav Bhandari,
Sungkyun Chang,
Tongyu Lu,
Fareza R. Enus,
Louis B. Bradshaw,
Dorien Herremans,
Simon Colton
Abstract:
Despite deep learning's remarkable advances in style transfer across various domains, generating controllable performance-level musical style transfer for complete symbolically represented musical works remains a challenging area of research. Much of this is owed to limited datasets, especially for genres such as jazz, and the lack of unified models that can handle multiple music generation tasks.…
▽ More
Despite deep learning's remarkable advances in style transfer across various domains, generating controllable performance-level musical style transfer for complete symbolically represented musical works remains a challenging area of research. Much of this is owed to limited datasets, especially for genres such as jazz, and the lack of unified models that can handle multiple music generation tasks. This paper presents ImprovNet, a transformer-based architecture that generates expressive and controllable musical improvisations through a self-supervised corruption-refinement training strategy. The improvisational style transfer is aimed at making meaningful modifications to one or more musical elements - melody, harmony or rhythm of the original composition with respect to the target genre. ImprovNet unifies multiple capabilities within a single model: it can perform cross-genre and intra-genre improvisations, harmonize melodies with genre-specific styles, and execute short prompt continuation and infilling tasks. The model's iterative generation framework allows users to control the degree of style transfer and structural similarity to the original composition. Objective and subjective evaluations demonstrate ImprovNet's effectiveness in generating musically coherent improvisations while maintaining structural relationships with the original pieces. The model outperforms Anticipatory Music Transformer in short continuation and infilling tasks and successfully achieves recognizable genre conversion, with 79\% of participants correctly identifying jazz-style improvisations of classical pieces. Our code and demo page can be found at https://github.com/keshavbhandari/improvnet.
△ Less
Submitted 15 May, 2025; v1 submitted 6 February, 2025;
originally announced February 2025.
-
Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models
Authors:
Zhiqi Li,
Guo Chen,
Shilong Liu,
Shihao Wang,
Vibashan VS,
Yishen Ji,
Shiyi Lan,
Hao Zhang,
Yilin Zhao,
Subhashree Radhakrishnan,
Nadine Chang,
Karan Sapra,
Amala Sanjay Deshmukh,
Tuomas Rintamaki,
Matthieu Le,
Ilia Karmanov,
Lukas Voegtle,
Philipp Fischer,
De-An Huang,
Timo Roman,
Tong Lu,
Jose M. Alvarez,
Bryan Catanzaro,
Jan Kautz,
Andrew Tao
, et al. (2 additional authors not shown)
Abstract:
Recently, promising progress has been made by open-source vision-language models (VLMs) in bringing their capabilities closer to those of proprietary frontier models. However, most open-source models only publish their final model weights, leaving the critical details of data strategies and implementation largely opaque. In this work, we address VLM post-training from a data-centric perspective, s…
▽ More
Recently, promising progress has been made by open-source vision-language models (VLMs) in bringing their capabilities closer to those of proprietary frontier models. However, most open-source models only publish their final model weights, leaving the critical details of data strategies and implementation largely opaque. In this work, we address VLM post-training from a data-centric perspective, showing the key role of data strategy in developing frontier VLMs. By studying and building our post-training data strategy from scratch, we share detailed insights into the development processes, aiming to benefit the development of competitive models for the open-source community. Our introduced data strategy, together with training recipes and model design, leads to a family of performant VLMs named Eagle2. Specifically, Eagle2-9B achieves state-of-the-art results across various multimodal benchmarks, matching certain competitive models with up to 70B parameters.
△ Less
Submitted 20 January, 2025;
originally announced January 2025.
-
Linked Data on Geo-annotated Events and Use Cases for the Resilience of Ukraine
Authors:
Manar Attar,
Shuai Wang,
Ronald Siebes,
Eirik Kultorp,
Zhisheng Huang,
Tianyang Lu
Abstract:
The mission of resilience of Ukrainian cities calls for international collaboration with the scientific community to increase the quality of information by identifying and integrating information from various news and social media sources. Linked Data technology can be used to unify, enrich, and integrate data from multiple sources. In our work, we focus on datasets about damaging events in Ukrain…
▽ More
The mission of resilience of Ukrainian cities calls for international collaboration with the scientific community to increase the quality of information by identifying and integrating information from various news and social media sources. Linked Data technology can be used to unify, enrich, and integrate data from multiple sources. In our work, we focus on datasets about damaging events in Ukraine due to Russia's invasion between February 2022 and the end of April 2023. We convert two selected datasets to Linked Data and enrich them with additional geospatial information. Following that, we present an algorithm for the detection of identical events from different datasets. Our pipeline makes it easy to convert and enrich datasets to integrated Linked Data. The resulting dataset consists of 10K reported events covering damage to hospitals, schools, roads, residential buildings, etc. Finally, we demonstrate in use cases how our dataset can be applied to different scenarios for resilience purposes.
△ Less
Submitted 24 December, 2024;
originally announced January 2025.
-
Value Compass Leaderboard: A Platform for Fundamental and Validated Evaluation of LLMs Values
Authors:
Jing Yao,
Xiaoyuan Yi,
Shitong Duan,
Jindong Wang,
Yuzhuo Bai,
Muhua Huang,
Peng Zhang,
Tun Lu,
Zhicheng Dou,
Maosong Sun,
Xing Xie
Abstract:
As Large Language Models (LLMs) achieve remarkable breakthroughs, aligning their values with humans has become imperative for their responsible development and customized applications. However, there still lack evaluations of LLMs values that fulfill three desirable goals. (1) Value Clarification: We expect to clarify the underlying values of LLMs precisely and comprehensively, while current evalu…
▽ More
As Large Language Models (LLMs) achieve remarkable breakthroughs, aligning their values with humans has become imperative for their responsible development and customized applications. However, there still lack evaluations of LLMs values that fulfill three desirable goals. (1) Value Clarification: We expect to clarify the underlying values of LLMs precisely and comprehensively, while current evaluations focus narrowly on safety risks such as bias and toxicity. (2) Evaluation Validity: Existing static, open-source benchmarks are prone to data contamination and quickly become obsolete as LLMs evolve. Additionally, these discriminative evaluations uncover LLMs' knowledge about values, rather than valid assessments of LLMs' behavioral conformity to values. (3) Value Pluralism: The pluralistic nature of human values across individuals and cultures is largely ignored in measuring LLMs value alignment. To address these challenges, we presents the Value Compass Leaderboard, with three correspondingly designed modules. It (i) grounds the evaluation on motivationally distinct \textit{basic values to clarify LLMs' underlying values from a holistic view; (ii) applies a \textit{generative evolving evaluation framework with adaptive test items for evolving LLMs and direct value recognition from behaviors in realistic scenarios; (iii) propose a metric that quantifies LLMs alignment with a specific value as a weighted sum over multiple dimensions, with weights determined by pluralistic values.
△ Less
Submitted 13 January, 2025;
originally announced January 2025.
-
FlexCache: Flexible Approximate Cache System for Video Diffusion
Authors:
Desen Sun,
Henry Tian,
Tim Lu,
Sihang Liu
Abstract:
Text-to-Video applications receive increasing attention from the public. Among these, diffusion models have emerged as the most prominent approach, offering impressive quality in visual content generation. However, it still suffers from substantial computational complexity, often requiring several minutes to generate a single video. While prior research has addressed the computational overhead in…
▽ More
Text-to-Video applications receive increasing attention from the public. Among these, diffusion models have emerged as the most prominent approach, offering impressive quality in visual content generation. However, it still suffers from substantial computational complexity, often requiring several minutes to generate a single video. While prior research has addressed the computational overhead in text-to-image diffusion models, the techniques developed are not directly suitable for video diffusion models due to the significantly larger cache requirements and enhanced computational demands associated with video generation.
We present FlexCache, a flexible approximate cache system that addresses the challenges in two main designs. First, we compress the caches before saving them to storage. Our compression strategy can reduce 6.7 times consumption on average. Then we find that the approximate cache system can achieve higher hit rate and computation savings by decoupling the object and background. We further design a tailored cache replacement policy to support the two techniques mentioned above better. Through our evaluation, FlexCache reaches 1.26 times higher throughput and 25% lower cost compared to the state-of-the-art diffusion approximate cache system.
△ Less
Submitted 17 December, 2024;
originally announced January 2025.
-
ImagineMap: Enhanced HD Map Construction with SD Maps
Authors:
Yishen Ji,
Zhiqi Li,
Tong Lu
Abstract:
Track Mapless demands models to process multi-view images and Standard-Definition (SD) maps, outputting lane and traffic element perceptions along with their topological relationships. We propose a novel architecture that integrates SD map priors to improve lane line and area detection performance. Inspired by TopoMLP, our model employs a two-stage structure: perception and reasoning. The downstre…
▽ More
Track Mapless demands models to process multi-view images and Standard-Definition (SD) maps, outputting lane and traffic element perceptions along with their topological relationships. We propose a novel architecture that integrates SD map priors to improve lane line and area detection performance. Inspired by TopoMLP, our model employs a two-stage structure: perception and reasoning. The downstream topology head uses the output from the upstream detection head, meaning accuracy improvements in detection significantly boost downstream performance.
△ Less
Submitted 22 December, 2024;
originally announced December 2024.
-
Accelerating Private Large Transformers Inference through Fine-grained Collaborative Computation
Authors:
Yuntian Chen,
Zhanyong Tang,
Tianpei Lu,
Bingsheng Zhang,
Zhiying Shi,
Zheng Wang
Abstract:
Homomorphic encryption (HE) and secret sharing (SS) enable computations on encrypted data, providing significant privacy benefits for large transformer-based models (TBM) in sensitive sectors like medicine and finance. However, private TBM inference incurs significant costs due to the coarse-grained application of HE and SS. We present FASTLMPI, a new approach to accelerate private TBM inference t…
▽ More
Homomorphic encryption (HE) and secret sharing (SS) enable computations on encrypted data, providing significant privacy benefits for large transformer-based models (TBM) in sensitive sectors like medicine and finance. However, private TBM inference incurs significant costs due to the coarse-grained application of HE and SS. We present FASTLMPI, a new approach to accelerate private TBM inference through fine-grained computation optimization. Specifically, through the fine-grained co-design of homomorphic encryption and secret sharing, FASTLMPI achieves efficient protocols for matrix multiplication, SoftMax, LayerNorm, and GeLU. In addition, FASTLMPI introduces a precise segmented approximation technique for differentiable non-linear, improving its fitting accuracy while maintaining a low polynomial degree. Compared to solution BOLT (S&P'24), FASTLMPI shows a remarkable 54% to 64% decrease in runtime and an impressive 72.2% reduction in communication costs.
△ Less
Submitted 25 January, 2025; v1 submitted 21 December, 2024;
originally announced December 2024.
-
Turbo-GS: Accelerating 3D Gaussian Fitting for High-Quality Radiance Fields
Authors:
Tao Lu,
Ankit Dhiman,
R Srinath,
Emre Arslan,
Angela Xing,
Yuanbo Xiangli,
R Venkatesh Babu,
Srinath Sridhar
Abstract:
Novel-view synthesis is an important problem in computer vision with applications in 3D reconstruction, mixed reality, and robotics. Recent methods like 3D Gaussian Splatting (3DGS) have become the preferred method for this task, providing high-quality novel views in real time. However, the training time of a 3DGS model is slow, often taking 30 minutes for a scene with 200 views. In contrast, our…
▽ More
Novel-view synthesis is an important problem in computer vision with applications in 3D reconstruction, mixed reality, and robotics. Recent methods like 3D Gaussian Splatting (3DGS) have become the preferred method for this task, providing high-quality novel views in real time. However, the training time of a 3DGS model is slow, often taking 30 minutes for a scene with 200 views. In contrast, our goal is to reduce the optimization time by training for fewer steps while maintaining high rendering quality. Specifically, we combine the guidance from both the position error and the appearance error to achieve a more effective densification. To balance the rate between adding new Gaussians and fitting old Gaussians, we develop a convergence-aware budget control mechanism. Moreover, to make the densification process more reliable, we selectively add new Gaussians from mostly visited regions. With these designs, we reduce the Gaussian optimization steps to one-third of the previous approach while achieving a comparable or even better novel view rendering quality. To further facilitate the rapid fitting of 4K resolution images, we introduce a dilation-based rendering technique. Our method, Turbo-GS, speeds up optimization for typical scenes and scales well to high-resolution (4K) scenarios on standard datasets. Through extensive experiments, we show that our method is significantly faster in optimization than other methods while retaining quality. Project page: https://ivl.cs.brown.edu/research/turbo-gs.
△ Less
Submitted 18 December, 2024;
originally announced December 2024.
-
CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding
Authors:
Guo Chen,
Yicheng Liu,
Yifei Huang,
Yuping He,
Baoqi Pei,
Jilan Xu,
Yali Wang,
Tong Lu,
Limin Wang
Abstract:
Most existing video understanding benchmarks for multimodal large language models (MLLMs) focus only on short videos. The limited number of benchmarks for long video understanding often rely solely on multiple-choice questions (MCQs). However, because of the inherent limitation of MCQ-based evaluation and the increasing reasoning ability of MLLMs, models can give the current answer purely by combi…
▽ More
Most existing video understanding benchmarks for multimodal large language models (MLLMs) focus only on short videos. The limited number of benchmarks for long video understanding often rely solely on multiple-choice questions (MCQs). However, because of the inherent limitation of MCQ-based evaluation and the increasing reasoning ability of MLLMs, models can give the current answer purely by combining short video understanding with elimination, without genuinely understanding the video content. To address this gap, we introduce CG-Bench, a novel benchmark designed for clue-grounded question answering in long videos. CG-Bench emphasizes the model's ability to retrieve relevant clues for questions, enhancing evaluation credibility. It features 1,219 manually curated videos categorized by a granular system with 14 primary categories, 171 secondary categories, and 638 tertiary categories, making it the largest benchmark for long video analysis. The benchmark includes 12,129 QA pairs in three major question types: perception, reasoning, and hallucination. Compensating the drawbacks of pure MCQ-based evaluation, we design two novel clue-based evaluation methods: clue-grounded white box and black box evaluations, to assess whether the model generates answers based on the correct understanding of the video. We evaluate multiple closed-source and open-source MLLMs on CG-Bench. Results indicate that current models significantly underperform in understanding long videos compared to short ones, and a significant gap exists between open-source and commercial models. We hope CG-Bench can advance the development of more trustworthy and capable MLLMs for long video understanding. All annotations and video data are released at https://cg-bench.github.io/leaderboard/.
△ Less
Submitted 16 December, 2024;
originally announced December 2024.
-
GenEx: Generating an Explorable World
Authors:
Taiming Lu,
Tianmin Shu,
Junfei Xiao,
Luoxin Ye,
Jiahao Wang,
Cheng Peng,
Chen Wei,
Daniel Khashabi,
Rama Chellappa,
Alan Yuille,
Jieneng Chen
Abstract:
Understanding, navigating, and exploring the 3D physical real world has long been a central challenge in the development of artificial intelligence. In this work, we take a step toward this goal by introducing GenEx, a system capable of planning complex embodied world exploration, guided by its generative imagination that forms priors (expectations) about the surrounding environments. GenEx genera…
▽ More
Understanding, navigating, and exploring the 3D physical real world has long been a central challenge in the development of artificial intelligence. In this work, we take a step toward this goal by introducing GenEx, a system capable of planning complex embodied world exploration, guided by its generative imagination that forms priors (expectations) about the surrounding environments. GenEx generates an entire 3D-consistent imaginative environment from as little as a single RGB image, bringing it to life through panoramic video streams. Leveraging scalable 3D world data curated from Unreal Engine, our generative model is rounded in the physical world. It captures a continuous 360-degree environment with little effort, offering a boundless landscape for AI agents to explore and interact with. GenEx achieves high-quality world generation, robust loop consistency over long trajectories, and demonstrates strong 3D capabilities such as consistency and active 3D mapping. Powered by generative imagination of the world, GPT-assisted agents are equipped to perform complex embodied tasks, including both goal-agnostic exploration and goal-driven navigation. These agents utilize predictive expectation regarding unseen parts of the physical world to refine their beliefs, simulate different outcomes based on potential decisions, and make more informed choices. In summary, we demonstrate that GenEx provides a transformative platform for advancing embodied AI in imaginative spaces and brings potential for extending these capabilities to real-world exploration.
△ Less
Submitted 20 January, 2025; v1 submitted 12 December, 2024;
originally announced December 2024.
-
Proc-GS: Procedural Building Generation for City Assembly with 3D Gaussians
Authors:
Yixuan Li,
Xingjian Ran,
Linning Xu,
Tao Lu,
Mulin Yu,
Zhenzhi Wang,
Yuanbo Xiangli,
Dahua Lin,
Bo Dai
Abstract:
Buildings are primary components of cities, often featuring repeated elements such as windows and doors. Traditional 3D building asset creation is labor-intensive and requires specialized skills to develop design rules. Recent generative models for building creation often overlook these patterns, leading to low visual fidelity and limited scalability. Drawing inspiration from procedural modeling t…
▽ More
Buildings are primary components of cities, often featuring repeated elements such as windows and doors. Traditional 3D building asset creation is labor-intensive and requires specialized skills to develop design rules. Recent generative models for building creation often overlook these patterns, leading to low visual fidelity and limited scalability. Drawing inspiration from procedural modeling techniques used in the gaming and visual effects industry, our method, Proc-GS, integrates procedural code into the 3D Gaussian Splatting (3D-GS) framework, leveraging their advantages in high-fidelity rendering and efficient asset management from both worlds. By manipulating procedural code, we can streamline this process and generate an infinite variety of buildings. This integration significantly reduces model size by utilizing shared foundational assets, enabling scalable generation with precise control over building assembly. We showcase the potential for expansive cityscape generation while maintaining high rendering fidelity and precise control on both real and synthetic cases.
△ Less
Submitted 10 December, 2024;
originally announced December 2024.
-
Driving with InternVL: Oustanding Champion in the Track on Driving with Language of the Autonomous Grand Challenge at CVPR 2024
Authors:
Jiahan Li,
Zhiqi Li,
Tong Lu
Abstract:
This technical report describes the methods we employed for the Driving with Language track of the CVPR 2024 Autonomous Grand Challenge. We utilized a powerful open-source multimodal model, InternVL-1.5, and conducted a full-parameter fine-tuning on the competition dataset, DriveLM-nuScenes. To effectively handle the multi-view images of nuScenes and seamlessly inherit InternVL's outstanding multi…
▽ More
This technical report describes the methods we employed for the Driving with Language track of the CVPR 2024 Autonomous Grand Challenge. We utilized a powerful open-source multimodal model, InternVL-1.5, and conducted a full-parameter fine-tuning on the competition dataset, DriveLM-nuScenes. To effectively handle the multi-view images of nuScenes and seamlessly inherit InternVL's outstanding multimodal understanding capabilities, we formatted and concatenated the multi-view images in a specific manner. This ensured that the final model could meet the specific requirements of the competition task while leveraging InternVL's powerful image understanding capabilities. Meanwhile, we designed a simple automatic annotation strategy that converts the center points of objects in DriveLM-nuScenes into corresponding bounding boxes. As a result, our single model achieved a score of 0.6002 on the final leadboard.
△ Less
Submitted 10 December, 2024;
originally announced December 2024.
-
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
Authors:
Zhe Chen,
Weiyun Wang,
Yue Cao,
Yangzhou Liu,
Zhangwei Gao,
Erfei Cui,
Jinguo Zhu,
Shenglong Ye,
Hao Tian,
Zhaoyang Liu,
Lixin Gu,
Xuehui Wang,
Qingyun Li,
Yimin Ren,
Zixuan Chen,
Jiapeng Luo,
Jiahao Wang,
Tan Jiang,
Bo Wang,
Conghui He,
Botian Shi,
Xingcheng Zhang,
Han Lv,
Yi Wang,
Wenqi Shao
, et al. (17 additional authors not shown)
Abstract:
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In this work, we delve into the relationship between model scaling and performance, systematically exploring the performance trends in vision…
▽ More
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In this work, we delve into the relationship between model scaling and performance, systematically exploring the performance trends in vision encoders, language models, dataset sizes, and test-time configurations. Through extensive evaluations on a wide range of benchmarks, including multi-discipline reasoning, document understanding, multi-image / video understanding, real-world comprehension, multimodal hallucination detection, visual grounding, multilingual capabilities, and pure language processing, InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. Notably, our model is the first open-source MLLMs to surpass 70% on the MMMU benchmark, achieving a 3.7-point improvement through Chain-of-Thought (CoT) reasoning and showcasing strong potential for test-time scaling. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems. HuggingFace demo see https://huggingface.co/spaces/OpenGVLab/InternVL
△ Less
Submitted 13 January, 2025; v1 submitted 6 December, 2024;
originally announced December 2024.
-
LLM-Align: Utilizing Large Language Models for Entity Alignment in Knowledge Graphs
Authors:
Xuan Chen,
Tong Lu,
Zhichun Wang
Abstract:
Entity Alignment (EA) seeks to identify and match corresponding entities across different Knowledge Graphs (KGs), playing a crucial role in knowledge fusion and integration. Embedding-based entity alignment (EA) has recently gained considerable attention, resulting in the emergence of many innovative approaches. Initially, these approaches concentrated on learning entity embeddings based on the st…
▽ More
Entity Alignment (EA) seeks to identify and match corresponding entities across different Knowledge Graphs (KGs), playing a crucial role in knowledge fusion and integration. Embedding-based entity alignment (EA) has recently gained considerable attention, resulting in the emergence of many innovative approaches. Initially, these approaches concentrated on learning entity embeddings based on the structural features of knowledge graphs (KGs) as defined by relation triples. Subsequent methods have integrated entities' names and attributes as supplementary information to improve the embeddings used for EA. However, existing methods lack a deep semantic understanding of entity attributes and relations. In this paper, we propose a Large Language Model (LLM) based Entity Alignment method, LLM-Align, which explores the instruction-following and zero-shot capabilities of Large Language Models to infer alignments of entities. LLM-Align uses heuristic methods to select important attributes and relations of entities, and then feeds the selected triples of entities to an LLM to infer the alignment results. To guarantee the quality of alignment results, we design a multi-round voting mechanism to mitigate the hallucination and positional bias issues that occur with LLMs. Experiments on three EA datasets, demonstrating that our approach achieves state-of-the-art performance compared to existing EA methods.
△ Less
Submitted 5 December, 2024;
originally announced December 2024.
-
Horizon-GS: Unified 3D Gaussian Splatting for Large-Scale Aerial-to-Ground Scenes
Authors:
Lihan Jiang,
Kerui Ren,
Mulin Yu,
Linning Xu,
Junting Dong,
Tao Lu,
Feng Zhao,
Dahua Lin,
Bo Dai
Abstract:
Seamless integration of both aerial and street view images remains a significant challenge in neural scene reconstruction and rendering. Existing methods predominantly focus on single domain, limiting their applications in immersive environments, which demand extensive free view exploration with large view changes both horizontally and vertically. We introduce Horizon-GS, a novel approach built up…
▽ More
Seamless integration of both aerial and street view images remains a significant challenge in neural scene reconstruction and rendering. Existing methods predominantly focus on single domain, limiting their applications in immersive environments, which demand extensive free view exploration with large view changes both horizontally and vertically. We introduce Horizon-GS, a novel approach built upon Gaussian Splatting techniques, tackles the unified reconstruction and rendering for aerial and street views. Our method addresses the key challenges of combining these perspectives with a new training strategy, overcoming viewpoint discrepancies to generate high-fidelity scenes. We also curate a high-quality aerial-to-ground views dataset encompassing both synthetic and real-world scene to advance further research. Experiments across diverse urban scene datasets confirm the effectiveness of our method.
△ Less
Submitted 2 December, 2024;
originally announced December 2024.
-
Oracle-guided Dynamic User Preference Modeling for Sequential Recommendation
Authors:
Jiafeng Xia,
Dongsheng Li,
Hansu Gu,
Tun Lu,
Peng Zhang,
Li Shang,
Ning Gu
Abstract:
Sequential recommendation methods can capture dynamic user preferences from user historical interactions to achieve better performance. However, most existing methods only use past information extracted from user historical interactions to train the models, leading to the deviations of user preference modeling. Besides past information, future information is also available during training, which c…
▽ More
Sequential recommendation methods can capture dynamic user preferences from user historical interactions to achieve better performance. However, most existing methods only use past information extracted from user historical interactions to train the models, leading to the deviations of user preference modeling. Besides past information, future information is also available during training, which contains the ``oracle'' user preferences in the future and will be beneficial to model dynamic user preferences. Therefore, we propose an oracle-guided dynamic user preference modeling method for sequential recommendation (Oracle4Rec), which leverages future information to guide model training on past information, aiming to learn ``forward-looking'' models. Specifically, Oracle4Rec first extracts past and future information through two separate encoders, then learns a forward-looking model through an oracle-guiding module which minimizes the discrepancy between past and future information. We also tailor a two-phase model training strategy to make the guiding more effective. Extensive experiments demonstrate that Oracle4Rec is superior to state-of-the-art sequential methods. Further experiments show that Oracle4Rec can be leveraged as a generic module in other sequential recommendation methods to improve their performance with a considerable margin.
△ Less
Submitted 1 December, 2024;
originally announced December 2024.
-
Generative World Explorer
Authors:
Taiming Lu,
Tianmin Shu,
Alan Yuille,
Daniel Khashabi,
Jieneng Chen
Abstract:
Planning with partial observation is a central challenge in embodied AI. A majority of prior works have tackled this challenge by developing agents that physically explore their environment to update their beliefs about the world state. In contrast, humans can $\textit{imagine}$ unseen parts of the world through a mental exploration and $\textit{revise}$ their beliefs with imagined observations. S…
▽ More
Planning with partial observation is a central challenge in embodied AI. A majority of prior works have tackled this challenge by developing agents that physically explore their environment to update their beliefs about the world state. In contrast, humans can $\textit{imagine}$ unseen parts of the world through a mental exploration and $\textit{revise}$ their beliefs with imagined observations. Such updated beliefs can allow them to make more informed decisions, without necessitating the physical exploration of the world at all times. To achieve this human-like ability, we introduce the $\textit{Generative World Explorer (Genex)}$, an egocentric world exploration framework that allows an agent to mentally explore a large-scale 3D world (e.g., urban scenes) and acquire imagined observations to update its belief. This updated belief will then help the agent to make a more informed decision at the current step. To train $\textit{Genex}$, we create a synthetic urban scene dataset, Genex-DB. Our experimental results demonstrate that (1) $\textit{Genex}$ can generate high-quality and consistent observations during long-horizon exploration of a large virtual physical world and (2) the beliefs updated with the generated observations can inform an existing decision-making model (e.g., an LLM agent) to make better plans.
△ Less
Submitted 19 November, 2024; v1 submitted 18 November, 2024;
originally announced November 2024.
-
The Communication-Friendly Privacy-Preserving Machine Learning against Malicious Adversaries
Authors:
Tianpei Lu,
Bingsheng Zhang,
Lichun Li,
Kui Ren
Abstract:
With the increasing emphasis on privacy regulations, such as GDPR, protecting individual privacy and ensuring compliance have become critical concerns for both individuals and organizations. Privacy-preserving machine learning (PPML) is an innovative approach that allows for secure data analysis while safeguarding sensitive information. It enables organizations to extract valuable insights from da…
▽ More
With the increasing emphasis on privacy regulations, such as GDPR, protecting individual privacy and ensuring compliance have become critical concerns for both individuals and organizations. Privacy-preserving machine learning (PPML) is an innovative approach that allows for secure data analysis while safeguarding sensitive information. It enables organizations to extract valuable insights from data without compromising privacy. Secure multi-party computation (MPC) is a key tool in PPML, as it allows multiple parties to jointly compute functions without revealing their private inputs, making it essential in multi-server environments. We address the performance overhead of existing maliciously secure protocols, particularly in finite rings like $\mathbb{Z}_{2^\ell}$, by introducing an efficient protocol for secure linear function evaluation. We implement our maliciously secure MPC protocol on GPUs, significantly improving its efficiency and scalability. We extend the protocol to handle linear and non-linear layers, ensuring compatibility with a wide range of machine-learning models. Finally, we comprehensively evaluate machine learning models by integrating our protocol into the workflow, enabling secure and efficient inference across simple and complex models, such as convolutional neural networks (CNNs).
△ Less
Submitted 14 November, 2024;
originally announced November 2024.
-
DeMod: A Holistic Tool with Explainable Detection and Personalized Modification for Toxicity Censorship
Authors:
Yaqiong Li,
Peng Zhang,
Hansu Gu,
Tun Lu,
Siyuan Qiao,
Yubo Shu,
Yiyang Shao,
Ning Gu
Abstract:
Although there have been automated approaches and tools supporting toxicity censorship for social posts, most of them focus on detection. Toxicity censorship is a complex process, wherein detection is just an initial task and a user can have further needs such as rationale understanding and content modification. For this problem, we conduct a needfinding study to investigate people's diverse needs…
▽ More
Although there have been automated approaches and tools supporting toxicity censorship for social posts, most of them focus on detection. Toxicity censorship is a complex process, wherein detection is just an initial task and a user can have further needs such as rationale understanding and content modification. For this problem, we conduct a needfinding study to investigate people's diverse needs in toxicity censorship and then build a ChatGPT-based censorship tool named DeMod accordingly. DeMod is equipped with the features of explainable Detection and personalized Modification, providing fine-grained detection results, detailed explanations, and personalized modification suggestions. We also implemented the tool and recruited 35 Weibo users for evaluation. The results suggest DeMod's multiple strengths like the richness of functionality, the accuracy of censorship, and ease of use. Based on the findings, we further propose several insights into the design of content censorship systems.
△ Less
Submitted 4 November, 2024;
originally announced November 2024.
-
Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance
Authors:
Zhangwei Gao,
Zhe Chen,
Erfei Cui,
Yiming Ren,
Weiyun Wang,
Jinguo Zhu,
Hao Tian,
Shenglong Ye,
Junjun He,
Xizhou Zhu,
Lewei Lu,
Tong Lu,
Yu Qiao,
Jifeng Dai,
Wenhai Wang
Abstract:
Multimodal large language models (MLLMs) have demonstrated impressive performance in vision-language tasks across a broad spectrum of domains. However, the large model scale and associated high computational costs pose significant challenges for training and deploying MLLMs on consumer-grade GPUs or edge devices, thereby hindering their widespread application. In this work, we introduce Mini-Inter…
▽ More
Multimodal large language models (MLLMs) have demonstrated impressive performance in vision-language tasks across a broad spectrum of domains. However, the large model scale and associated high computational costs pose significant challenges for training and deploying MLLMs on consumer-grade GPUs or edge devices, thereby hindering their widespread application. In this work, we introduce Mini-InternVL, a series of MLLMs with parameters ranging from 1B to 4B, which achieves 90% of the performance with only 5% of the parameters. This significant improvement in efficiency and effectiveness makes our models more accessible and applicable in various real-world scenarios. To further promote the adoption of our models, we develop a unified adaptation framework for Mini-InternVL, which enables our models to transfer and outperform specialized models in downstream tasks, including autonomous driving, medical images, and remote sensing. We believe that our study can provide valuable insights and resources to advance the development of efficient and effective MLLMs. Code is available at https://github.com/OpenGVLab/InternVL.
△ Less
Submitted 7 November, 2024; v1 submitted 21 October, 2024;
originally announced October 2024.
-
MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding
Authors:
Yue Cao,
Yangzhou Liu,
Zhe Chen,
Guangchen Shi,
Wenhai Wang,
Danhuai Zhao,
Tong Lu
Abstract:
Despite significant advancements in Multimodal Large Language Models (MLLMs) for understanding complex human intentions through cross-modal interactions, capturing intricate image details remains challenging. Previous methods integrating multiple vision encoders to enhance visual detail introduce redundancy and computational overhead. We observe that most MLLMs utilize only the last-layer feature…
▽ More
Despite significant advancements in Multimodal Large Language Models (MLLMs) for understanding complex human intentions through cross-modal interactions, capturing intricate image details remains challenging. Previous methods integrating multiple vision encoders to enhance visual detail introduce redundancy and computational overhead. We observe that most MLLMs utilize only the last-layer feature map of the vision encoder for visual representation, neglecting the rich fine-grained information in shallow feature maps. To address this issue, we propose \modelname, a simple yet effective multi-layer feature fuser that efficiently integrates deep and shallow features from Vision Transformers (ViTs). Specifically, it leverages semantically aligned deep features as queries to dynamically extract missing details from shallow features, thus preserving semantic alignment while enriching the representation with fine-grained information. Applied to the LLaVA-1.5 model, \modelname~achieves significant improvements in visual representation and benchmark performance, providing a more flexible and lightweight solution compared to multi-encoder ensemble methods. The code and model have been released at https://github.com/yuecao0119/MMFuser.
△ Less
Submitted 15 October, 2024;
originally announced October 2024.
-
Filtering Discomforting Recommendations with Large Language Models
Authors:
Jiahao Liu,
Yiyang Shao,
Peng Zhang,
Dongsheng Li,
Hansu Gu,
Chao Chen,
Longzhi Du,
Tun Lu,
Ning Gu
Abstract:
Personalized algorithms can inadvertently expose users to discomforting recommendations, potentially triggering negative consequences. The subjectivity of discomfort and the black-box nature of these algorithms make it challenging to effectively identify and filter such content. To address this, we first conducted a formative study to understand users' practices and expectations regarding discomfo…
▽ More
Personalized algorithms can inadvertently expose users to discomforting recommendations, potentially triggering negative consequences. The subjectivity of discomfort and the black-box nature of these algorithms make it challenging to effectively identify and filter such content. To address this, we first conducted a formative study to understand users' practices and expectations regarding discomforting recommendation filtering. Then, we designed a Large Language Model (LLM)-based tool named DiscomfortFilter, which constructs an editable preference profile for a user and helps the user express filtering needs through conversation to mask discomforting preferences within the profile. Based on the edited profile, DiscomfortFilter facilitates the discomforting recommendations filtering in a plug-and-play manner, maintaining flexibility and transparency. The constructed preference profile improves LLM reasoning and simplifies user alignment, enabling a 3.8B open-source LLM to rival top commercial models in an offline proxy task. A one-week user study with 24 participants demonstrated the effectiveness of DiscomfortFilter, while also highlighting its potential impact on platform recommendation outcomes. We conclude by discussing the ongoing challenges, highlighting its relevance to broader research, assessing stakeholder impact, and outlining future research directions.
△ Less
Submitted 23 January, 2025; v1 submitted 7 October, 2024;
originally announced October 2024.
-
Perception Compressor: A Training-Free Prompt Compression Framework in Long Context Scenarios
Authors:
Jiwei Tang,
Jin Xu,
Tingwei Lu,
Zhicheng Zhang,
Yiming Zhao,
Lin Hai,
Hai-Tao Zheng
Abstract:
Large language models (LLMs) demonstrate exceptional capabilities in various scenarios. However, they suffer from much redundant information and are sensitive to the position of key information in long context scenarios. To address these challenges, we present Perception Compressor, a training-free prompt compression framework. It includes a perception retriever that leverages guiding questions an…
▽ More
Large language models (LLMs) demonstrate exceptional capabilities in various scenarios. However, they suffer from much redundant information and are sensitive to the position of key information in long context scenarios. To address these challenges, we present Perception Compressor, a training-free prompt compression framework. It includes a perception retriever that leverages guiding questions and instruction to retrieve the most relevant demonstrations, a dual-slope ratio allocator to dynamically allocate compression ratios and open-book ratios, and a semi-guided iterative compression that retains key information at the token level while removing tokens that distract the LLM. We conduct extensive experiments on long context benchmarks, i.e., NaturalQuestions, LongBench, and MuSiQue. Experiment results show that Perception Compressor outperforms existing methods by a large margin, achieving state-of-the-art performance.
△ Less
Submitted 7 February, 2025; v1 submitted 28 September, 2024;
originally announced September 2024.
-
Joint Optimization of Data- and Model-Driven Probing Beams and Beam Predictor
Authors:
Tianheng Lu,
Fan Meng,
Zhilei Zhang,
Yongming Huang,
Cheng Zhang,
Xiaoyu Bai
Abstract:
Hierarchical search in millimeter-wave (mmWave) communications incurs significant beam training overhead and delay, especially in a dynamic environment. Deep learning-enabled beam prediction is promising to significantly mitigate the overhead and delay, efficiently utilizing the site-specific channel prior. In this work, we propose to jointly optimize a data- and model-driven probe beam module and…
▽ More
Hierarchical search in millimeter-wave (mmWave) communications incurs significant beam training overhead and delay, especially in a dynamic environment. Deep learning-enabled beam prediction is promising to significantly mitigate the overhead and delay, efficiently utilizing the site-specific channel prior. In this work, we propose to jointly optimize a data- and model-driven probe beam module and a cascaded data-driven beam predictor, with limitations in that the probe and communicate beams are restricted within the manifold space of uniform planer array and quantization of the phase modulator. First, The probe beam module senses the mmWave channel with a complex-valued neural network and outputs the counterpart RSRPs of probe beams. Second, the beam predictor estimates the RSRPs in the entire beamspace to minimize the prediction cross entropy and selects the optimal beam with the maximum RSRP value for data transmission. Additionally, we propose to add noise to the phase variables in the probe beam module, against quantization error. Simulation results show the effectiveness of our proposed scheme.
△ Less
Submitted 26 September, 2024;
originally announced September 2024.