-
Searching Clinical Data Using Generative AI
Authors:
Karan Hanswadkar,
Anika Kanchi,
Shivani Tripathi,
Shi Qiao,
Rony Chatterjee,
Alekh Jindal
Abstract:
Artificial Intelligence (AI) is making a major impact on healthcare, particularly through its application in natural language processing (NLP) and predictive analytics. The healthcare sector has increasingly adopted AI for tasks such as clinical data analysis and medical code assignment. However, searching for clinical information in large and often unorganized datasets remains a manual and error-prone process. Assisting this process with automations can help physicians improve their operational productivity significantly.
In this paper, we present a generative AI approach, coined SearchAI, to enhance the accuracy and efficiency of searching clinical data. Unlike traditional code assignment, which is a one-to-one problem, clinical data search is a one-to-many problem, i.e., a given search query can map to a family of codes. Healthcare professionals typically search for groups of related diseases, drugs, or conditions that map to many codes, and therefore, they need search tools that can handle keyword synonyms, semantic variants, and broad open-ended queries. SearchAI employs a hierarchical model that respects the coding hierarchy and improves the traversal of relationships from parent to child nodes. SearchAI navigates these hierarchies predictively and ensures that all paths are reachable without losing any relevant nodes.
To evaluate the effectiveness of SearchAI, we conducted a series of experiments using both public and production datasets. Our results show that SearchAI outperforms default hierarchical traversals across several metrics, including accuracy, robustness, performance, and scalability. SearchAI can help make clinical data more accessible, leading to streamlined workflows, reduced administrative burden, and enhanced coding and diagnostic accuracy.
Submitted 29 May, 2025;
originally announced May 2025.
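The one-to-many nature of clinical search described in the abstract above can be illustrated with a minimal sketch. It shows only a plain parent-to-child traversal of a code hierarchy, which is the kind of default traversal SearchAI is compared against, not SearchAI's predictive model. The hierarchy, the code names, and the function `expand_code` are hypothetical placeholders and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical toy hierarchy (placeholder codes, not a real coding system):
# each parent code maps to its direct child codes.
HIERARCHY: Dict[str, List[str]] = {
    "D1": ["D1.1", "D1.2"],
    "D1.1": ["D1.1.a", "D1.1.b"],
}


def expand_code(root: str, hierarchy: Dict[str, List[str]]) -> List[str]:
    """Return the root code and all of its descendants in depth-first order."""
    result: List[str] = []
    stack = [root]
    while stack:
        code = stack.pop()
        result.append(code)
        # Push children in reverse so they are visited in their listed order.
        stack.extend(reversed(hierarchy.get(code, [])))
    return result


family = expand_code("D1", HIERARCHY)
```

A single query that resolves to `D1` therefore maps to the whole family of codes below it, which is what makes the problem one-to-many.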
-
ChartGalaxy: A Dataset for Infographic Chart Understanding and Generation
Authors:
Zhen Li,
Duan Li,
Yukai Guo,
Xinyuan Guo,
Bowen Li,
Lanxi Xiao,
Shenyu Qiao,
Jiashu Chen,
Zijian Wu,
Hui Zhang,
Xinhuan Shu,
Shixia Liu
Abstract:
Infographic charts are a powerful medium for communicating abstract data by combining visual elements (e.g., charts, images) with textual information. However, their visual and structural richness poses challenges for large vision-language models (LVLMs), which are typically trained on plain charts. To bridge this gap, we introduce ChartGalaxy, a million-scale dataset designed to advance the understanding and generation of infographic charts. The dataset is constructed through an inductive process that identifies 75 chart types, 330 chart variations, and 68 layout templates from real infographic charts and uses them to create synthetic ones programmatically. We showcase the utility of this dataset through: 1) improving infographic chart understanding via fine-tuning, 2) benchmarking code generation for infographic charts, and 3) enabling example-based infographic chart generation. By capturing the visual and structural complexity of real design, ChartGalaxy provides a useful resource for enhancing multimodal reasoning and generation in LVLMs.
Submitted 31 May, 2025; v1 submitted 24 May, 2025;
originally announced May 2025.
-
Model Merging in Pre-training of Large Language Models
Authors:
Yunshui Li,
Yiyuan Ma,
Shen Yan,
Chaoyi Zhang,
Jing Liu,
Jianqiao Lu,
Ziwen Xu,
Mengzhao Chen,
Minrui Wang,
Shiyi Zhan,
Jin Ma,
Xunhao Lai,
Deyi Liu,
Yao Luo,
Xingyan Bin,
Hongbin Ren,
Mingji Han,
Wenhao Hao,
Bairen Yi,
LingJun Liu,
Bole Ma,
Xiaoying Jia,
Xun Zhou,
Siyuan Qiao,
Liang Xiang
, et al. (1 additional author not shown)
Abstract:
Model merging has emerged as a promising technique for enhancing large language models, though its application in large-scale pre-training remains relatively unexplored. In this paper, we present a comprehensive investigation of model merging techniques during the pre-training process. Through extensive experiments with both dense and Mixture-of-Experts (MoE) architectures ranging from millions to over 100 billion parameters, we demonstrate that merging checkpoints trained with constant learning rates not only achieves significant performance improvements but also enables accurate prediction of annealing behavior. These improvements lead to both more efficient model development and significantly lower training costs. Our detailed ablation studies on merging strategies and hyperparameters provide new insights into the underlying mechanisms while uncovering novel applications. Through comprehensive experimental analysis, we offer the open-source community practical pre-training guidelines for effective model merging.
Submitted 22 May, 2025; v1 submitted 17 May, 2025;
originally announced May 2025.
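As a rough illustration of what merging checkpoints means, the sketch below averages the parameters of several checkpoints, optionally with weights. This is plain (weighted) parameter averaging under simplifying assumptions: the abstract does not state which merging strategies or hyperparameters the paper uses, and the `Checkpoint` type and `merge_checkpoints` function are hypothetical names introduced here.

```python
from typing import Dict, List, Optional

# Assumption: a checkpoint is a mapping from parameter name to a flat list of
# weights. Real checkpoints hold tensors; lists keep the sketch dependency-free.
Checkpoint = Dict[str, List[float]]


def merge_checkpoints(
    checkpoints: List[Checkpoint],
    weights: Optional[List[float]] = None,
) -> Checkpoint:
    """Return the weighted average of the given checkpoints, parameter by parameter."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    if weights is None:
        # Default to a uniform average over all checkpoints.
        weights = [1.0] * len(checkpoints)
    if len(weights) != len(checkpoints):
        raise ValueError("need exactly one weight per checkpoint")
    total = sum(weights)
    merged: Checkpoint = {}
    for name, values in checkpoints[0].items():
        merged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints)) / total
            for i in range(len(values))
        ]
    return merged


ckpt_a = {"layer.weight": [1.0, 2.0]}
ckpt_b = {"layer.weight": [3.0, 4.0]}
merged = merge_checkpoints([ckpt_a, ckpt_b])
```

Passing `weights` lets later checkpoints count more than earlier ones; whether that helps is the kind of question the paper's ablation studies address.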
-
Disentangled Graph Representation Based on Substructure-Aware Graph Optimal Matching Kernel Convolutional Networks
Authors:
Mao Wang,
Tao Wu,
Xingping Xian,
Shaojie Qiao,
Weina Niu,
Canyixing Cui
Abstract:
Graphs effectively characterize relational data, driving graph representation learning methods that uncover underlying predictive information. As state-of-the-art approaches, Graph Neural Networks (GNNs) enable end-to-end learning for diverse tasks. Recent disentangled graph representation learning enhances interpretability by decoupling independent factors in graph data. However, existing methods often implicitly and coarsely characterize graph structures, limiting structural pattern analysis within the graph. This paper proposes the Graph Optimal Matching Kernel Convolutional Network (GOMKCN) to address this limitation. We view graphs as node-centric subgraphs, where each subgraph acts as a structural factor encoding position-specific information. This transforms graph prediction into structural pattern recognition. Inspired by CNNs, GOMKCN introduces the Graph Optimal Matching Kernel (GOMK) as a convolutional operator, computing similarities between subgraphs and learnable graph filters. Mathematically, GOMK maps subgraphs and filters into a Hilbert space, representing graphs as point sets. Disentangled representations emerge from projecting subgraphs onto task-optimized filters, which adaptively capture relevant structural patterns via gradient descent. Crucially, GOMK incorporates local correspondences in similarity measurement, resolving the trade-off between differentiability and accuracy in graph kernels. Experiments validate that GOMKCN achieves superior accuracy and interpretability in graph pattern mining and prediction. The framework advances the theoretical foundation for disentangled graph representation learning.
Submitted 22 April, 2025;
originally announced April 2025.
-
SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge Refinement
Authors:
Runnan Fang,
Xiaobin Wang,
Yuan Liang,
Shuofei Qiao,
Jialong Wu,
Zekun Xi,
Ningyu Zhang,
Yong Jiang,
Pengjun Xie,
Fei Huang,
Huajun Chen
Abstract:
In the interaction between agents and their environments, agents expand their capabilities by planning and executing actions. However, LLM-based agents face substantial challenges when deployed in novel environments or required to navigate unconventional action spaces. To empower agents to autonomously explore environments, optimize workflows, and enhance their understanding of actions, we propose SynWorld, a framework that allows agents to synthesize possible scenarios with multi-step action invocation within the action space and perform Monte Carlo Tree Search (MCTS) exploration to effectively refine their action knowledge in the current environment. Our experiments demonstrate that SynWorld is an effective and general approach to learning action knowledge in new environments. Code is available at https://github.com/zjunlp/SynWorld.
Submitted 1 June, 2025; v1 submitted 4 April, 2025;
originally announced April 2025.
-
Agentic Knowledgeable Self-awareness
Authors:
Shuofei Qiao,
Zhisong Qiu,
Baochang Ren,
Xiaobin Wang,
Xiangyuan Ru,
Ningyu Zhang,
Xiang Chen,
Yong Jiang,
Pengjun Xie,
Fei Huang,
Huajun Chen
Abstract:
Large Language Models (LLMs) have achieved considerable performance across various agentic planning tasks. However, traditional agent planning approaches adopt a "flood irrigation" methodology that indiscriminately injects gold trajectories, external feedback, and domain knowledge into agent models. This practice overlooks the fundamental human cognitive principle of situational self-awareness during decision-making: the ability to dynamically assess situational demands and strategically employ resources. We propose agentic knowledgeable self-awareness to address this gap, a novel paradigm enabling LLM-based agents to autonomously regulate knowledge utilization. Specifically, we propose KnowSelf, a data-centric approach that equips agents with human-like knowledgeable self-awareness. Concretely, we devise a heuristic situation judgement criterion to mark special tokens on the agent's self-explored trajectories for collecting training data. Through a two-stage training process, the agent model can switch between different situations by generating specific special tokens, achieving optimal planning effects with minimal costs. Our experiments demonstrate that KnowSelf can outperform various strong baselines on different tasks and models with minimal use of external knowledge. Code is available at https://github.com/zjunlp/KnowSelf.
Submitted 29 May, 2025; v1 submitted 4 April, 2025;
originally announced April 2025.
-
ZFusion: An Effective Fuser of Camera and 4D Radar for 3D Object Perception in Autonomous Driving
Authors:
Sheng Yang,
Tong Zhan,
Shichen Qiao,
Jicheng Gong,
Qing Yang,
Jian Wang,
Yanfeng Lu
Abstract:
Reliable 3D object perception is essential in autonomous driving. Owing to its sensing capabilities in all weather conditions, 4D radar has recently received much attention. However, compared to LiDAR, 4D radar provides a much sparser point cloud. In this paper, we propose a 3D object detection method, termed ZFusion, which fuses the 4D radar and vision modalities. As the core of ZFusion, our proposed FP-DDCA (Feature Pyramid-Double Deformable Cross Attention) fuser effectively complements the (sparse) radar information with the (dense) vision information. Specifically, with a feature-pyramid structure, the FP-DDCA fuser packs Transformer blocks to interactively fuse multi-modal features at different scales, thus enhancing perception accuracy. In addition, we utilize the Depth-Context-Split view transformation module due to the physical properties of 4D radar. Considering that 4D radar has a much lower cost than LiDAR, ZFusion is an attractive alternative to LiDAR-based methods. On typical traffic scenarios such as the VoD (View-of-Delft) dataset, experiments show that, at reasonable inference speed, ZFusion achieves state-of-the-art mAP (mean average precision) in the region of interest and competitive mAP over the entire area compared to the baseline methods, demonstrating performance close to that of LiDAR-based methods and greatly outperforming camera-only methods.
Submitted 7 April, 2025; v1 submitted 4 April, 2025;
originally announced April 2025.
-
Estimating Hourly Neighborhood Population Using Mobile Phone Data in the United States
Authors:
Huan Ning,
Zhenlong Li,
Manzhu Yu,
Shiyan Zhang,
Shan Qiao
Abstract:
Traditional population estimation techniques often fail to capture the dynamic fluctuations inherent in urban and rural population movements. Recognizing the need for a dynamic population dataset with high spatiotemporal resolution, we propose a method using smartphone-based human mobility data to reconstruct the hourly population for each neighborhood across the US. We quantify population fluctuations on an hourly, diurnal, daily, and seasonal basis, and compare these with static population data to highlight the limitations of traditional models in capturing temporal dynamics. This study provides one of the first hourly population products at a large geographic extent (US), contributing to various studies that involve dynamic populations with high spatiotemporal resolution, such as air pollution exposure analysis and emergency response.
Submitted 1 April, 2025;
originally announced April 2025.
-
Unified Autoregressive Visual Generation and Understanding with Continuous Tokens
Authors:
Lijie Fan,
Luming Tang,
Siyang Qin,
Tianhong Li,
Xuan Yang,
Siyuan Qiao,
Andreas Steiner,
Chen Sun,
Yuanzhen Li,
Tao Zhu,
Michael Rubinstein,
Michalis Raptis,
Deqing Sun,
Radu Soricut
Abstract:
We present UniFluid, a unified autoregressive framework for joint visual generation and understanding leveraging continuous visual tokens. Our unified autoregressive architecture processes multimodal image and text inputs, generating discrete tokens for text and continuous tokens for images. We find that, although there is an inherent trade-off between the image generation and understanding tasks, a carefully tuned training recipe enables them to improve each other. By selecting an appropriate loss balance weight, the unified model achieves results comparable to or exceeding those of single-task baselines on both tasks. Furthermore, we demonstrate that employing stronger pre-trained LLMs and random-order generation during training is important to achieve high-fidelity image generation within this unified framework. Built upon the Gemma model series, UniFluid exhibits competitive performance across both image generation and understanding, demonstrating strong transferability to various downstream tasks, including image editing for generation, as well as visual captioning and question answering for understanding.
Submitted 17 March, 2025;
originally announced March 2025.
-
A Survey of Large Language Model Empowered Agents for Recommendation and Search: Towards Next-Generation Information Retrieval
Authors:
Yu Zhang,
Shutong Qiao,
Jiaqi Zhang,
Tzu-Heng Lin,
Chen Gao,
Yong Li
Abstract:
Information technology has profoundly altered the way humans interact with information. The vast amount of content created, shared, and disseminated online has made it increasingly difficult to access relevant information. Over the past two decades, recommender systems and search (collectively referred to as information retrieval systems) have evolved significantly to address these challenges. Recent advances in large language models (LLMs) have demonstrated capabilities that surpass human performance in various language-related tasks and exhibit general understanding, reasoning, and decision-making abilities. This paper explores the transformative potential of LLM agents in enhancing recommender and search systems. We discuss the motivations and roles of LLM agents, and establish a classification framework to elaborate on the existing research. We highlight the immense potential of LLM agents in addressing current challenges in recommendation and search, providing insights into future research directions. This paper is the first to systematically review and classify the research on LLM agents in these domains, offering a novel perspective on leveraging this advanced AI technology for information retrieval. To help understand the existing works, we list the existing papers on LLM agent based recommendation and search at this link: https://github.com/tsinghua-fib-lab/LLM-Agent-for-Recommendation-and-Search.
Submitted 11 April, 2025; v1 submitted 7 March, 2025;
originally announced March 2025.
-
YouthCare: Building a Personalized Collaborative Video Censorship Tool to Support Parent-Child Joint Media Engagement
Authors:
Wenxin Zhao,
Fangyu Yu,
Peng Zhang,
Hansu Gu,
Lin Wang,
Siyuan Qiao,
Tun Lu,
Ning Gu
Abstract:
To mitigate the negative impacts of online videos on teenagers, existing research and platforms have implemented various parental mediation mechanisms, such as Parent-Child Joint Media Engagement (JME). However, JME generally relies heavily on parents' time, knowledge, and experience. To fill this gap, we aim to design an automatic tool to help parents and children censor videos more effectively and efficiently in JME. To this end, we first conducted a formative study to identify the needs and expectations of teenagers and parents for such a system. Based on the findings, we designed YouthCare, a personalized collaborative video censorship tool that helps parents and children collaboratively filter out inappropriate content and select appropriate content in JME. An evaluation with 10 parent-child pairs demonstrated several strengths of YouthCare in supporting video censorship, while also highlighting some potential problems. These findings inspire us to propose several insights for the future design of parent-child collaborative JME systems.
Submitted 3 March, 2025;
originally announced March 2025.
-
LightThinker: Thinking Step-by-Step Compression
Authors:
Jintian Zhang,
Yuqi Zhu,
Mengshu Sun,
Yujie Luo,
Shuofei Qiao,
Lun Du,
Da Zheng,
Huajun Chen,
Ningyu Zhang
Abstract:
Large language models (LLMs) have shown remarkable performance in complex reasoning tasks, but their efficiency is hindered by the substantial memory and computational costs associated with generating lengthy tokens. In this paper, we propose LightThinker, a novel method that enables LLMs to dynamically compress intermediate thoughts during reasoning. Inspired by human cognitive processes, LightThinker compresses verbose thought steps into compact representations and discards the original reasoning chains, thereby significantly reducing the number of tokens stored in the context window. This is achieved by training the model on when and how to perform compression through data construction, mapping hidden states to condensed gist tokens, and creating specialized attention masks. Additionally, we introduce the Dependency (Dep) metric to quantify the degree of compression by measuring the reliance on historical tokens during generation. Extensive experiments on four datasets and two models show that LightThinker reduces peak memory usage and inference time, while maintaining competitive accuracy. Our work provides a new direction for improving the efficiency of LLMs in complex reasoning tasks without sacrificing performance. Code will be released at https://github.com/zjunlp/LightThinker.
Submitted 21 February, 2025;
originally announced February 2025.
-
Organometallic-Inorganic Hybrid MXenes with Tunable Superconductivity
Authors:
Qi Fan,
Tao Bo,
Wei Guo,
Minghua Chen,
Qing Tang,
Yicong Yang,
Mian Li,
Ke Chen,
Fangfang Ge,
Jialu Li,
Sicong Qiao,
Changda Wang,
Li Song,
Lijing Yu,
Jinghua Guo,
Michael Naguib,
Zhifang Chai,
Qing Huang,
Chaochao Dun,
Ning Kang,
Yury Gogotsi,
Kun Liang
Abstract:
Ti-based two-dimensional transition-metal carbides (MXenes) have attracted attention due to their superior properties and are being explored across various applications [1,2]. Despite their versatile properties, superconductivity has never been demonstrated, or even predicted, for this important group of 2D materials. In this work, we have introduced an electrochemical intercalation protocol to construct versatile organometallic-inorganic hybrid MXenes and achieved tunable superconductivity in the metallocene-modified layered crystals. Through structural editing of the MXene matrix at the atomic scale and a meticulously modulated intercalation route, Ti$_3$C$_2$T$_x$ intercalated with metallocene species exhibits a superconducting transition temperature ($T_c$) of 10.2 K. Guest-intercalation-induced electron filling and strain engineering are responsible for the emerging superconductivity in this intrinsically non-superconducting material. Theoretically, simulated electron-phonon interaction effects further elucidate the nature of the changes in $T_c$. Furthermore, the $T_c$ of crafted artificial superlattices beyond Ti-based MXenes has been predicted, offering a general strategy for engineering superconductivity and magnetism in layered hybrid materials.
Submitted 16 February, 2025;
originally announced February 2025.
-
Fast Biclique Counting on Bipartite Graphs: A Node Pivot-based Approach
Authors:
Xiaowei Ye,
Rong-Hua Li,
Longlong Lin,
Shaojie Qiao,
Guoren Wang
Abstract:
Counting the number of $(p, q)$-bicliques (complete bipartite subgraphs) in a bipartite graph is a fundamental problem which plays a crucial role in numerous bipartite graph analysis applications. However, existing algorithms for counting $(p, q)$-bicliques often face significant computational challenges, particularly on large real-world networks. In this paper, we propose a general biclique counting framework, called NPivot, based on a novel concept of node-pivot. We show that previous methods can be viewed as specific implementations of this general framework. More importantly, we propose a novel implementation of NPivot based on a carefully designed minimum non-neighbor candidate partition strategy. We prove that our new implementation of NPivot has lower worst-case time complexity than the state-of-the-art methods. Beyond basic biclique counting, a nice feature of NPivot is that it also supports local counting (computing bicliques per node) and range counting (simultaneously counting bicliques within a size range). Extensive experiments on 12 real-world large datasets demonstrate that our proposed NPivot substantially outperforms state-of-the-art algorithms by up to two orders of magnitude.
Submitted 20 December, 2024;
originally announced December 2024.
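To make the counting problem in the abstract above concrete, the sketch below counts $(p, q)$-bicliques by brute force: it enumerates every set of $p$ left nodes and counts the $q$-subsets of their common right neighbors. This is a definitional baseline only, exponential in $p$, and is not the node-pivot framework the paper proposes. The function name `count_bicliques` and the adjacency-dictionary input format are assumptions made for illustration.

```python
from itertools import combinations
from math import comb
from typing import Dict, Hashable, Set


def count_bicliques(adj: Dict[Hashable, Set[Hashable]], p: int, q: int) -> int:
    """Count (p, q)-bicliques in a bipartite graph by exhaustive enumeration.

    `adj` maps each left-side node to the set of its right-side neighbors.
    """
    if p < 1 or q < 1:
        raise ValueError("p and q must be positive")
    count = 0
    for left_nodes in combinations(adj, p):
        # Right nodes adjacent to every chosen left node.
        common = set.intersection(*(adj[u] for u in left_nodes))
        # Any q of them complete a biclique with the chosen left nodes.
        count += comb(len(common), q)
    return count


# Toy graph: left nodes a, b, c; right nodes 1, 2.
graph = {"a": {1, 2}, "b": {1, 2}, "c": {2}}
edges = count_bicliques(graph, 1, 1)  # (1, 1)-bicliques are exactly the edges
```

In the toy graph, the only $(2, 2)$-biclique is formed by left nodes `a` and `b` with right nodes `1` and `2`.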
-
Oversight in Action: Experiences with Instructor-Moderated LLM Responses in an Online Discussion Forum
Authors:
Shuying Qiao,
Paul Denny,
Nasser Giacaman
Abstract:
The integration of large language models (LLMs) into computing education offers many potential benefits to student learning, and several novel pedagogical approaches have been reported in the literature. However, LLMs also present challenges, one of the most commonly cited being that of student over-reliance. This challenge is compounded by the fact that LLMs are always available to provide instant help and solutions to students, which can undermine their ability to independently solve problems and diagnose and resolve errors. Providing instructor oversight of LLM-generated content can mitigate this problem; however, it is often not practical in real-time learning contexts. Online class discussion forums, which are widely used in computing education, present an opportunity for exploring instructor oversight because they operate asynchronously. Unlike real-time interactions, the discussion forum format aligns with the expectation that responses may take time, making oversight not only feasible but also pedagogically appropriate. In this practitioner paper, we present the design, deployment, and evaluation of a 'bot' module that is controlled by the instructor and integrated into an online discussion forum. The bot assists the instructor by generating draft responses to student questions, which are reviewed, modified, and approved before release. Key features include the ability to leverage course materials, access archived discussions, and publish responses anonymously to encourage open participation. We report our experiences using this tool in a 12-week second-year software engineering course on object-oriented programming. Instructor feedback confirmed the tool successfully alleviated workload but highlighted a need for improvement in handling complex, context-dependent queries. We report the features that were viewed as most beneficial, and suggest avenues for future exploration.
Submitted 12 December, 2024;
originally announced December 2024.
-
FGP: Feature-Gradient-Prune for Efficient Convolutional Layer Pruning
Authors:
Qingsong Lv,
Jiasheng Sun,
Sheng Zhou,
Xu Zhang,
Liangcheng Li,
Yun Gao,
Sun Qiao,
Jie Song,
Jiajun Bu
Abstract:
To reduce computational overhead while maintaining model performance, model pruning techniques have been proposed. Among these, structured pruning, which removes entire convolutional channels or layers, significantly enhances computational efficiency and is compatible with hardware acceleration. However, existing pruning methods that rely solely on image features or gradients often result in the retention of redundant channels, negatively impacting inference efficiency. To address this issue, this paper introduces a novel pruning method called Feature-Gradient Pruning (FGP). This approach integrates both feature-based and gradient-based information to more effectively evaluate the importance of channels across various target classes, enabling a more accurate identification of channels that are critical to model performance. Experimental results demonstrate that the proposed method improves both model compactness and practicality while maintaining stable performance. Experiments conducted across multiple tasks and datasets show that FGP significantly reduces computational costs and minimizes accuracy loss compared to existing methods, highlighting its effectiveness in optimizing pruning outcomes. The source code is available at: https://github.com/FGP-code/FGP.
Submitted 19 November, 2024;
originally announced November 2024.
-
LLM-based Bi-level Multi-interest Learning Framework for Sequential Recommendation
Authors:
Shutong Qiao,
Chen Gao,
Wei Yuan,
Yong Li,
Hongzhi Yin
Abstract:
Sequential recommendation (SR) leverages users' dynamic preferences, with recent advances incorporating multi-interest learning to model diverse user interests. However, most multi-interest SR models rely on noisy, sparse implicit feedback, limiting recommendation accuracy. Large language models (LLMs) offer robust reasoning on low-quality data but face high computational costs and latency challenges for SR integration. We propose a novel LLM-based multi-interest SR framework combining implicit behavioral and explicit semantic perspectives. It includes two modules: the Implicit Behavioral Interest Module (IBIM), which learns from user behavior using a traditional SR model, and the Explicit Semantic Interest Module (ESIM), which uses clustering and prompt-engineered LLMs to extract semantic multi-interest representations from informative samples. Semantic insights from ESIM enhance IBIM's behavioral representations via modality alignment and semantic prediction tasks. During inference, only IBIM is used, ensuring efficient, LLM-free recommendations. Experiments on four real-world datasets validate the framework's effectiveness and practicality.
Submitted 7 May, 2025; v1 submitted 14 November, 2024;
originally announced November 2024.
-
DeMod: A Holistic Tool with Explainable Detection and Personalized Modification for Toxicity Censorship
Authors:
Yaqiong Li,
Peng Zhang,
Hansu Gu,
Tun Lu,
Siyuan Qiao,
Yubo Shu,
Yiyang Shao,
Ning Gu
Abstract:
Although there have been automated approaches and tools supporting toxicity censorship for social posts, most of them focus on detection. Toxicity censorship is a complex process, wherein detection is just an initial task and a user can have further needs such as rationale understanding and content modification. For this problem, we conduct a needfinding study to investigate people's diverse needs in toxicity censorship and then build a ChatGPT-based censorship tool named DeMod accordingly. DeMod is equipped with the features of explainable Detection and personalized Modification, providing fine-grained detection results, detailed explanations, and personalized modification suggestions. We also implemented the tool and recruited 35 Weibo users for evaluation. The results suggest DeMod's multiple strengths like the richness of functionality, the accuracy of censorship, and ease of use. Based on the findings, we further propose several insights into the design of content censorship systems.
Submitted 4 November, 2024;
originally announced November 2024.
-
Differentiable architecture search with multi-dimensional attention for spiking neural networks
Authors:
Yilei Man,
Linhai Xie,
Shushan Qiao,
Yumei Zhou,
Delong Shang
Abstract:
Spiking Neural Networks (SNNs) have gained enormous popularity in the field of artificial intelligence due to their low power consumption. However, the majority of SNN methods directly inherit the structure of Artificial Neural Networks (ANNs), usually leading to sub-optimal model performance in SNNs. To alleviate this problem, we integrate a Neural Architecture Search (NAS) method and propose Multi-Attention Differentiable Architecture Search (MA-DARTS) to directly automate the search for the optimal network structure of SNNs. Initially, we defined a differentiable two-level search space and conducted experiments within micro architecture under a fixed layer. Then, we incorporated a multi-dimensional attention mechanism and implemented the MA-DARTS algorithm in this search space. Comprehensive experiments demonstrate that our model achieves state-of-the-art performance on classification compared to other methods under the same parameters, with 94.40% accuracy on the CIFAR10 dataset and 76.52% accuracy on the CIFAR100 dataset. Additionally, we monitored and assessed the number of spikes (NoS) in each cell during the whole experiment. Notably, the number of spikes of the whole model stabilized at approximately 110K in validation and 100K in training on the datasets.
Submitted 1 November, 2024;
originally announced November 2024.
-
Twist-3 contribution in the Drell-Yan process with tensor-polarized deuteron
Authors:
Si-Yi Qiao,
Qin-Tao Song
Abstract:
The tensor-polarized structures of the deuteron can be probed through the proton-deuteron Drell-Yan process, where the proton is unpolarized and the deuteron is tensor polarized. This measurement will be conducted at Fermilab in the near future. In this reaction, the twist-3 contribution is not negligible compared to the twist-2 contribution due to the limited invariant mass of the dilepton pair. We calculate the twist-3 contribution for the Drell-Yan cross section with a tensor-polarized deuteron target, preserving the U(1)-gauge invariance of the hadronic tensor. The cross sections and weighted cross sections are expressed in terms of the tensor-polarized parton distribution functions (PDFs); thus, one can extract the PDFs $f_{1\scriptscriptstyle{LL}}$, $f_{\scriptscriptstyle{LT}}$, and $f^{(1)}_{\scriptscriptstyle{1LT}}$ from experimental measurements of the Drell-Yan process. Our study should help solve the puzzle in the tensor-polarized structures of the deuteron.
Submitted 24 March, 2025; v1 submitted 17 October, 2024;
originally announced October 2024.
-
Negative-Prompt-driven Alignment for Generative Language Model
Authors:
Shiqi Qiao,
Ning Xv,
Biao Liu,
Xin Geng
Abstract:
Large language models have achieved remarkable capabilities, but aligning their outputs with human values and preferences remains a significant challenge. Existing alignment methods primarily focus on positive examples while overlooking the importance of negative responses in guiding models away from undesirable behaviors. For instance, widely used alignment datasets reveal a scarcity of explicit negative examples that contradict human values, hindering their ability to discourage harmful or biased outputs during training. To address this limitation, we propose NEAT, i.e., NEgative-prompt-driven AlignmenT, to introduce negative prompts to generate undesirable responses alongside positive examples during the optimization process. NEAT explicitly penalizes the model for producing harmful outputs, guiding it not only toward desirable behaviors but also steering it away from generating undesirable, biased responses. This dual feedback mechanism enables better alignment with human preferences, crucial in contexts where avoiding harm is paramount. Starting from a pre-trained language model, NEAT performs online alignment by incorporating a ranking loss derived from an expanded preference dataset containing both positive and negative examples. Extensive experiments validate NEAT's effectiveness in significantly enhancing language models' alignment with human values and preferences.
Submitted 15 October, 2024;
originally announced October 2024.
-
GPR Full-Waveform Inversion through Adaptive Filtering of Model Parameters and Gradients Using CNN
Authors:
Peng Jiang,
Kun Wang,
Jiaxing Wang,
Zeliang Feng,
Shengjie Qiao,
Runhuai Deng,
Fengkai Zhang
Abstract:
GPR full-waveform inversion optimizes the subsurface property model iteratively to match the entire waveform information. However, the model gradients derived from wavefield continuation often contain errors, such as ghost values and excessively large values at transmitter and receiver points. Furthermore, models updated based on these gradients frequently exhibit unclear characterization of anomalous bodies or false anomalies, making it challenging to obtain accurate inversion results. To address these issues, we introduced a novel full-waveform inversion (FWI) framework that incorporates an embedded convolutional neural network (CNN) to adaptively filter model parameters and gradients. Specifically, we embedded the CNN module before the forward modeling process and ensured the entire FWI process remains differentiable. This design leverages the auto-grad tool of the deep learning library, allowing model values to pass through the CNN module during forward computation and model gradients to pass through the CNN module during backpropagation. Experiments have shown that filtering the model parameters during forward computation and the model gradients during backpropagation can ultimately yield high-quality inversion results.
Submitted 11 October, 2024;
originally announced October 2024.
-
Benchmarking Agentic Workflow Generation
Authors:
Shuofei Qiao,
Runnan Fang,
Zhisong Qiu,
Xiaobin Wang,
Ningyu Zhang,
Yong Jiang,
Pengjun Xie,
Fei Huang,
Huajun Chen
Abstract:
Large Language Models (LLMs), with their exceptional ability to handle a wide range of tasks, have driven significant advancements in tackling reasoning and planning tasks, wherein decomposing complex problems into executable workflows is a crucial step in this process. Existing workflow evaluation frameworks either focus solely on holistic performance or suffer from limitations such as restricted scenario coverage, simplistic workflow structures, and lax evaluation standards. To this end, we introduce WorfBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures. Additionally, we present WorfEval, a systemic evaluation protocol utilizing subsequence and subgraph matching algorithms to accurately quantify the LLM agent's workflow generation capabilities. Through comprehensive evaluations across different types of LLMs, we discover distinct gaps between the sequence planning capabilities and graph planning capabilities of LLM agents, with even GPT-4 exhibiting a gap of around 15%. We also train two open-source models and evaluate their generalization abilities on held-out tasks. Furthermore, we observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance with less time during inference. Code and dataset are available at https://github.com/zjunlp/WorfBench.
Submitted 23 February, 2025; v1 submitted 10 October, 2024;
originally announced October 2024.
-
Advancements in Road Lane Mapping: Comparative Fine-Tuning Analysis of Deep Learning-based Semantic Segmentation Methods Using Aerial Imagery
Authors:
Willow Liu,
Shuxin Qiao,
Kyle Gao,
Hongjie He,
Michael A. Chapman,
Linlin Xu,
Jonathan Li
Abstract:
This research addresses the need for high-definition (HD) maps for autonomous vehicles (AVs), focusing on road lane information derived from aerial imagery. While Earth observation data offers valuable resources for map creation, specialized models for road lane extraction are still underdeveloped in remote sensing. In this study, we perform an extensive comparison of twelve foundational deep learning-based semantic segmentation models for road lane marking extraction from high-definition remote sensing images, assessing their performance under transfer learning with partially labeled datasets. These models were fine-tuned on the partially labeled Waterloo Urban Scene dataset, and pre-trained on the SkyScapes dataset, simulating a likely scenario of real-life model deployment under partial labeling. We observed and assessed the fine-tuning performance and overall performance. Models showed significant performance improvements after fine-tuning, with mean IoU scores ranging from 33.56% to 76.11%, and recall ranging from 66.0% to 98.96%. Transformer-based models outperformed convolutional neural networks, emphasizing the importance of model pre-training and fine-tuning in enhancing HD map development for AV navigation.
Submitted 15 October, 2024; v1 submitted 8 October, 2024;
originally announced October 2024.
-
GISExplainer: On Explainability of Graph Neural Networks via Game-theoretic Interaction Subgraphs
Authors:
Xingping Xian,
Jianlu Liu,
Chao Wang,
Tao Wu,
Shaojie Qiao,
Xiaochuan Tang,
Qun Liu
Abstract:
Explainability is crucial for the application of black-box Graph Neural Networks (GNNs) in critical fields such as healthcare, finance, cybersecurity, and more. Various feature attribution methods, especially the perturbation-based methods, have been proposed to indicate how much each node/edge contributes to the model predictions. However, these methods fail to generate connected explanatory subgraphs that consider the causal interaction between edges within different coalition scales, which results in unfaithful explanations. In our study, we propose GISExplainer, a novel game-theoretic interaction-based explanation method that uncovers what the underlying GNNs have learned for node classification by discovering human-interpretable causal explanatory subgraphs. First, GISExplainer defines a causal attribution mechanism that considers the game-theoretic interaction of multi-granularity coalitions in the candidate explanatory subgraph to quantify the causal effect of an edge on the prediction. Second, GISExplainer assumes that the coalitions with negative effects on the predictions are also significant for model interpretation, and the contribution of the computation graph stems from the combined influence of both positive and negative interactions within the coalitions. Then, GISExplainer regards the explanation task as a sequential decision process, in which a salient edge is successively selected and connected to the previously selected subgraph based on its causal effect to form an explanatory subgraph, ultimately striving for better explanations. Additionally, an efficiency optimization scheme is proposed for the causal attribution mechanism through coalition sampling. Extensive experiments demonstrate that GISExplainer achieves better performance than state-of-the-art approaches w.r.t. two quantitative metrics: Fidelity and Sparsity.
Submitted 30 December, 2024; v1 submitted 23 September, 2024;
originally announced September 2024.
-
Absence of altermagnetic spin splitting character in rutile oxide RuO$_2$
Authors:
Jiayu Liu,
Jie Zhan,
Tongrui Li,
Jishan Liu,
Shufan Cheng,
Yuming Shi,
Liwei Deng,
Meng Zhang,
Chihao Li,
Jianyang Ding,
Qi Jiang,
Mao Ye,
Zhengtai Liu,
Zhicheng Jiang,
Siyu Wang,
Qian Li,
Yanwu Xie,
Yilin Wang,
Shan Qiao,
Jinsheng Wen,
Yan Sun,
Dawei Shen
Abstract:
Rutile RuO$_2$ has been posited as a potential $d$-wave altermagnetism candidate, with a predicted significant spin splitting up to 1.4 eV. Despite accumulating theoretical predictions and transport measurements, direct spectroscopic observation of spin splitting has remained elusive. Here, we employ spin- and angle-resolved photoemission spectroscopy to investigate the band structures and spin polarization of thin-film and single-crystal RuO$_2$. Contrary to expectations of altermagnetism, our analysis indicates that RuO$_2$'s electronic structure aligns with those predicted under non-magnetic conditions, exhibiting no evidence of the hypothesized spin splitting. Additionally, we observe significant in-plane spin polarization of the low-lying bulk bands, which is antisymmetric about the high-symmetry plane and contrary to the $d$-wave spin texture due to time-reversal symmetry breaking in altermagnetism. These findings definitively challenge the altermagnetic order previously proposed for rutile RuO$_2$, prompting a reevaluation of its magnetic properties.
Submitted 8 November, 2024; v1 submitted 20 September, 2024;
originally announced September 2024.
-
Graph-guided Cross-composition Feature Disentanglement for Compositional Zero-shot Learning
Authors:
Yuxia Geng,
Runkai Zhu,
Jiaoyan Chen,
Jintai Chen,
Xiang Chen,
Zhuo Chen,
Shuofei Qiao,
Yuxiang Wang,
Xiaoliang Xu,
Sheng-Jun Huang
Abstract:
Disentanglement of visual features of primitives (i.e., attributes and objects) has shown exceptional results in Compositional Zero-shot Learning (CZSL). However, due to the feature divergence of an attribute (resp. object) when combined with different objects (resp. attributes), it is challenging to learn disentangled primitive features that are general across different compositions. To this end, we propose the solution of cross-composition feature disentanglement, which takes multiple primitive-sharing compositions as inputs and constrains the disentangled primitive features to be general across these compositions. More specifically, we leverage a compositional graph to define the overall primitive-sharing relationships between compositions, and build a task-specific architecture upon the recently successful large pre-trained vision-language model (VLM) CLIP, with dual cross-composition disentangling adapters (called L-Adapter and V-Adapter) inserted into CLIP's frozen text and image encoders, respectively. Evaluation on three popular CZSL benchmarks shows that our proposed solution significantly improves the performance of CZSL, and its components have been verified by solid ablation studies. Our code and data are available at: https://github.com/zhurunkai/DCDA.
Submitted 29 May, 2025; v1 submitted 19 August, 2024;
originally announced August 2024.
-
Crystal-symmetry-paired spin-valley locking in a layered room-temperature antiferromagnet
Authors:
Fayuan Zhang,
Xingkai Cheng,
Zhouyi Yin,
Changchao Liu,
Liwei Deng,
Yuxi Qiao,
Zheng Shi,
Shuxuan Zhang,
Junhao Lin,
Zhengtai Liu,
Mao Ye,
Yaobo Huang,
Xiangyu Meng,
Cheng Zhang,
Taichi Okuda,
Kenya Shimada,
Shengtao Cui,
Yue Zhao,
Guang-Han Cao,
Shan Qiao,
Junwei Liu,
Chaoyu Chen
Abstract:
Recent theoretical efforts predicted a type of unconventional antiferromagnet characterized by the crystal symmetry C (rotation or mirror), which connects antiferromagnetic sublattices in real space and simultaneously couples spin and momentum in reciprocal space. This results in a unique C-paired spin-valley locking (SVL) and corresponding novel properties such as piezomagnetism and noncollinear spin current even without spin-orbit coupling. However, the unconventional antiferromagnets reported thus far are not layered materials, limiting their potential in spintronic applications. Additionally, they do not meet the necessary symmetry requirements for nonrelativistic spin current. Here, we report the realization of C-paired SVL in a layered room-temperature antiferromagnetic compound, Rb$_{1-\delta}$V$_2$Te$_2$O. Spin-resolved photoemission measurements directly demonstrate the opposite spin splitting between C-paired valleys. Quasi-particle interference patterns reveal the suppression of inter-valley scattering due to the spin selection rules, as a direct consequence of C-paired SVL. All these experiments are consistent with the results obtained from first-principles calculations. Our observations represent the first realization of layered antiferromagnets with C-paired SVL, enabling both the advantages of layered materials and possible control through crystal symmetry manipulation. These results hold significant promise and broad implications for advancements in magnetism, electronics, and information technology.
Submitted 2 August, 2024; v1 submitted 28 July, 2024;
originally announced July 2024.
-
Knowledge Mechanisms in Large Language Models: A Survey and Perspective
Authors:
Mengru Wang,
Yunzhi Yao,
Ziwen Xu,
Shuofei Qiao,
Shumin Deng,
Peng Wang,
Xiang Chen,
Jia-Chen Gu,
Yong Jiang,
Pengjun Xie,
Fei Huang,
Huajun Chen,
Ningyu Zhang
Abstract:
Understanding knowledge mechanisms in Large Language Models (LLMs) is crucial for advancing towards trustworthy AGI. This paper reviews knowledge mechanism analysis from a novel taxonomy including knowledge utilization and evolution. Knowledge utilization delves into the mechanism of memorization, comprehension and application, and creation. Knowledge evolution focuses on the dynamic progression of knowledge within individual and group LLMs. Moreover, we discuss what knowledge LLMs have learned, the reasons for the fragility of parametric knowledge, and the potential dark knowledge (hypothesis) that will be challenging to address. We hope this work can help understand knowledge in LLMs and provide insights for future research.
Submitted 4 December, 2024; v1 submitted 22 July, 2024;
originally announced July 2024.
-
Understanding the Robustness of Graph Neural Networks against Adversarial Attacks
Authors:
Tao Wu,
Canyixing Cui,
Xingping Xian,
Shaojie Qiao,
Chao Wang,
Lin Yuan,
Shui Yu
Abstract:
Recent studies have shown that graph neural networks (GNNs) are vulnerable to adversarial attacks, posing significant challenges to their deployment in safety-critical scenarios. This vulnerability has spurred a growing focus on designing robust GNNs. Despite this interest, current advancements have predominantly relied on empirical trial and error, resulting in a limited understanding of the robustness of GNNs against adversarial attacks. To address this issue, we conduct the first large-scale systematic study on the adversarial robustness of GNNs by considering the patterns of input graphs, the architecture of GNNs, and their model capacity, along with discussions on sensitive neurons and adversarial transferability. This work proposes a comprehensive empirical framework for analyzing the adversarial robustness of GNNs. To support the analysis of adversarial robustness in GNNs, we introduce two evaluation metrics: the confidence-based decision surface and the accuracy-based adversarial transferability rate. Through experimental analysis, we derive 11 actionable guidelines for designing robust GNNs, enabling model developers to gain deeper insights. The code of this study is available at https://github.com/star4455/GraphRE.
Submitted 25 May, 2025; v1 submitted 19 June, 2024;
originally announced June 2024.
-
GraphMU: Repairing Robustness of Graph Neural Networks via Machine Unlearning
Authors:
Tao Wu,
Xinwen Cao,
Chao Wang,
Shaojie Qiao,
Xingping Xian,
Lin Yuan,
Canyixing Cui,
Yanbing Liu
Abstract:
Graph Neural Networks (GNNs) have demonstrated significant application potential in various fields. However, GNNs are still vulnerable to adversarial attacks. Numerous adversarial defense methods for GNNs have been proposed to address the problem of adversarial attacks. However, these methods can only serve as a defense before poisoning and cannot repair a poisoned GNN. Therefore, there is an urgent need for a method to repair poisoned GNNs. In this paper, we address this gap by introducing the novel concept of model repair for GNNs. We propose a repair framework, Repairing Robustness of Graph Neural Networks via Machine Unlearning (GraphMU), which aims to fine-tune a poisoned GNN to forget adversarial samples without the need for complete retraining. We also introduce an unlearning validation method to ensure that our approach effectively forgets the specified poisoned data. To evaluate the effectiveness of GraphMU, we explore three fine-tuned subgraph construction scenarios based on the available perturbation information: (i) Known Perturbation Ratios, (ii) Known Complete Knowledge of Perturbations, and (iii) Unknown any Knowledge of Perturbations. Our extensive experiments, conducted across four citation datasets and four adversarial attack scenarios, demonstrate that GraphMU can effectively restore the performance of a poisoned GNN.
Submitted 19 June, 2024;
originally announced June 2024.
-
Low-probability of Intercept/Detect (LPI/LPD) Secure Communications Using Antenna Arrays Employing Rapid Sidelobe Time Modulation
Authors:
Jiahao Zhao,
Shichen Qiao,
John H. Booske,
Nader Behdad
Abstract:
We present an electronically-reconfigurable antenna array offering low probability of intercept/detect (LPI/LPD) and secure communications capabilities simultaneously at the physical layer. This antenna array is designed to provide rapidly time-varying sidelobes and a stationary main lobe. By performing rapid sidelobe time modulation (SLTM), the signal transmitted in the undesired directions (i.e., through sidelobes) undergoes spread-spectrum distortion making it more difficult to be detected, intercepted, and deciphered while the signal transmitted in the desired direction (i.e., through the main lobe) is unaffected. Therefore, the intended receiver would not need additional modifications (i.e. encryption keys) to detect and recover the signal. We describe the operating principles of this SLTM array and validate its spread-spectrum SLTM sequence generation in undesired directions through theory, simulations, and experiments. Using a fabricated SLTM prototype operating at X band, we conducted system-level measurements to demonstrate its LPI/LPD, secure communications, and jamming resilience capabilities. The presented method is a physical layer technique, which can bring LPI/LPD capabilities to existing communications systems by simply replacing their antennas with SLTM arrays. This technique can be used independently or in combination with additional coding and signal-processing techniques to achieve further enhancements in LPI/LPD and secure communications.
Submitted 17 June, 2024;
originally announced June 2024.
-
Agent Planning with World Knowledge Model
Authors:
Shuofei Qiao,
Runnan Fang,
Ningyu Zhang,
Yuqi Zhu,
Xiang Chen,
Shumin Deng,
Yong Jiang,
Pengjun Xie,
Fei Huang,
Huajun Chen
Abstract:
Recent endeavors towards directly using large language models (LLMs) as agent models to execute interactive planning tasks have shown commendable results. Despite their achievements, however, they still struggle with brainless trial-and-error in global planning and generating hallucinatory actions in local planning due to their poor understanding of the "real" physical world. Imitating humans' mental world knowledge model, which provides global prior knowledge before the task and maintains local dynamic knowledge during the task, in this paper we introduce a parametric World Knowledge Model (WKM) to facilitate agent planning. Concretely, we steer the agent model to self-synthesize knowledge from both expert and sampled trajectories. Then we develop WKM, providing prior task knowledge to guide the global planning and dynamic state knowledge to assist the local planning. Experimental results on three complex real-world simulated datasets with three state-of-the-art open-source LLMs, Mistral-7B, Gemma-7B, and Llama-3-8B, demonstrate that our method can achieve superior performance compared to various strong baselines. In addition, our analysis illustrates that WKM can effectively alleviate the blind trial-and-error and hallucinatory action issues, providing strong support for the agent's understanding of the world. Other interesting findings include: 1) our instance-level task knowledge can generalize better to unseen tasks, 2) weak WKM can guide strong agent model planning, and 3) unified WKM training has promising potential for further development. The code is available at https://github.com/zjunlp/WKM.
Submitted 3 January, 2025; v1 submitted 23 May, 2024;
originally announced May 2024.
-
Large band-splitting in $g$-wave type altermagnet CrSb
Authors:
Jianyang Ding,
Zhicheng Jiang,
Xiuhua Chen,
Zicheng Tao,
Zhengtai Liu,
Tongrui Li,
Jishan Liu,
Jianping Sun,
Jinguang Cheng,
Jiayu Liu,
Yichen Yang,
Runfeng Zhang,
Liwei Deng,
Wenchuan Jing,
Yu Huang,
Yuming Shi,
Mao Ye,
Shan Qiao,
Yilin Wang,
Yanfeng Guo,
Donglai Feng,
Dawei Shen
Abstract:
Altermagnetism (AM), a newly discovered magnetic state, ingeniously integrates the properties of ferromagnetism and antiferromagnetism, representing a significant breakthrough in the field of magnetic materials. Despite experimental verification of some typical AM materials, such as MnTe and MnTe$_2$, the pursuit of AM materials that feature larger spin splitting and higher transition temperature is still essential. Here, our research focuses on CrSb, which possesses a Néel temperature of up to 700 K and giant spin splitting near the Fermi level ($E_F$). Utilizing high-resolution angle-resolved photoemission spectroscopy and density functional theory calculations, we meticulously map the three-dimensional electronic structure of CrSb. Our photoemission spectroscopic results on both (0001) and (10$\overline{1}$0) cleavages of CrSb collaboratively reveal unprecedented details on AM-induced band splitting, and subsequently pin down its unique bulk $g$-wave symmetry through quantitative analysis of the angular and photon-energy dependence of spin splitting. Moreover, the observed spin splitting reaches a magnitude of 0.93 eV near $E_F$, the most substantial among all confirmed AM materials. This study not only validates the nature of CrSb as a prototype $g$-wave-like AM material but also underscores its pivotal role in pioneering applications in spintronics.
Submitted 15 November, 2024; v1 submitted 21 May, 2024;
originally announced May 2024.
-
MSene: A new large family of two-dimensional transition metal sulfide with MXene structure
Authors:
Shu-Xiang Qiao,
Yu-Lin Han,
Na Jiao,
Meng-Meng Zheng,
Hong-Yan Lu,
Ping Zhang
Abstract:
In this work, we theoretically report a new large family of two-dimensional (2D) transition metal sulfides $M$$_{2}$S with MXene structure in 2H and 1T phases, which we name as MSene. Twenty-four out of fifty-eight MSenes are proved to be stable. Notably, this family includes twelve superconducting (SC) materials, seven SC topological metals (SCTMs), four charge density wave (CDW) materials, and five magnetic materials including one ferromagnetic (FM) and four antiferromagnetic (AFM) materials. For example, 2H-Mo$_{2}$S is a SCTM which exhibits SC critical temperature ($T_{c}$) of 10.2 K and nontrivial topological properties; 1T-Hf$_{2}$S is a CDW material with the CDW originating from electron-phonon coupling. The CDW can be suppressed by compressive strain, leading to the emergence of superconductivity; 2H-Cr$_{2}$S and 1T-Mn$_{2}$S show FM and AFM properties, respectively. Thus, the new large family we predicted shows rich physical properties and significantly expands the repertoire of 2D materials. It serves as a novel platform for investigating the competition or coexistence of multiple orders such as SC, CDW, FM, AFM and topological orders in 2D materials.
Submitted 9 May, 2024; v1 submitted 6 May, 2024;
originally announced May 2024.
-
Advancing Multimodal Medical Capabilities of Gemini
Authors:
Lin Yang,
Shawn Xu,
Andrew Sellergren,
Timo Kohlberger,
Yuchen Zhou,
Ira Ktena,
Atilla Kiraly,
Faruk Ahmed,
Farhad Hormozdiari,
Tiam Jaroensri,
Eric Wang,
Ellery Wulczyn,
Fayaz Jamil,
Theo Guidroz,
Chuck Lau,
Siyuan Qiao,
Yun Liu,
Akshay Goel,
Kendall Park,
Arnav Agharwal,
Nick George,
Yang Wang,
Ryutaro Tanno,
David G. T. Barrett,
Wei-Hung Weng
, et al. (22 additional authors not shown)
Abstract:
Many clinical tasks require an understanding of specialized data, such as medical images and genomics, which is not typically found in general-purpose large multimodal models. Building upon Gemini's multimodal models, we develop several models within the new Med-Gemini family that inherit core capabilities of Gemini and are optimized for medical use via fine-tuning with 2D and 3D radiology, histopathology, ophthalmology, dermatology and genomic data. Med-Gemini-2D sets a new standard for AI-based chest X-ray (CXR) report generation based on expert evaluation, exceeding previous best results across two separate datasets by an absolute margin of 1% and 12%, where 57% and 96% of AI reports on normal cases, and 43% and 65% on abnormal cases, are evaluated as "equivalent or better" than the original radiologists' reports. We demonstrate the first ever large multimodal model-based report generation for 3D computed tomography (CT) volumes using Med-Gemini-3D, with 53% of AI reports considered clinically acceptable, although additional research is needed to meet expert radiologist reporting quality. Beyond report generation, Med-Gemini-2D surpasses the previous best performance in CXR visual question answering (VQA) and performs well in CXR classification and radiology VQA, exceeding SoTA or baselines on 17 of 20 tasks. In histopathology, ophthalmology, and dermatology image classification, Med-Gemini-2D surpasses baselines across 18 out of 20 tasks and approaches task-specific model performance. Beyond imaging, Med-Gemini-Polygenic outperforms the standard linear polygenic risk score-based approach for disease risk prediction and generalizes to genetically correlated diseases for which it has never been trained. Although further development and evaluation are necessary in the safety-critical medical domain, our results highlight the potential of Med-Gemini across a wide range of medical tasks.
Submitted 6 May, 2024;
originally announced May 2024.
-
Digital Twin Generators for Disease Modeling
Authors:
Nameyeh Alam,
Jake Basilico,
Daniele Bertolini,
Satish Casie Chetty,
Heather D'Angelo,
Ryan Douglas,
Charles K. Fisher,
Franklin Fuller,
Melissa Gomes,
Rishabh Gupta,
Alex Lang,
Anton Loukianov,
Rachel Mak-McCully,
Cary Murray,
Hanalei Pham,
Susanna Qiao,
Elena Ryapolova-Webb,
Aaron Smith,
Dimitri Theoharatos,
Anil Tolwani,
Eric W. Tramel,
Anna Vidovszky,
Judy Viduya,
Jonathan R. Walsh
Abstract:
A patient's digital twin is a computational model that describes the evolution of their health over time. Digital twins have the potential to revolutionize medicine by enabling individual-level computer simulations of human health, which can be used to conduct more efficient clinical trials or to recommend personalized treatment options. Due to the overwhelming complexity of human biology, machine learning approaches that leverage large datasets of historical patients' longitudinal health records to generate patients' digital twins are more tractable than potential mechanistic models. In this manuscript, we describe a neural network architecture that can learn conditional generative models of clinical trajectories, which we call Digital Twin Generators (DTGs), that can create digital twins of individual patients. We show that the same neural network architecture can be trained to generate accurate digital twins for patients across 13 different indications simply by changing the training set and tuning hyperparameters. By introducing a general purpose architecture, we aim to unlock the ability to scale machine learning approaches to larger datasets and across more indications so that a digital twin could be created for any patient in the world.
Submitted 2 May, 2024;
originally announced May 2024.
-
MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions
Authors:
Kai Zhang,
Yi Luan,
Hexiang Hu,
Kenton Lee,
Siyuan Qiao,
Wenhu Chen,
Yu Su,
Ming-Wei Chang
Abstract:
Image retrieval, i.e., finding desired images given a reference image, inherently encompasses rich, multi-faceted search intents that are difficult to capture solely using image-based measures. Recent works leverage text instructions to allow users to more freely express their search intents. However, they primarily focus on image pairs that are visually similar and/or can be characterized by a small set of pre-defined relations. The core thesis of this paper is that text instructions can enable retrieving images with richer relations beyond visual similarity. To show this, we introduce MagicLens, a series of self-supervised image retrieval models that support open-ended instructions. MagicLens is built on a key novel insight: image pairs that naturally occur on the same web pages contain a wide range of implicit relations (e.g., inside view of), and we can bring those implicit relations explicit by synthesizing instructions via foundation models. Trained on 36.7M (query image, instruction, target image) triplets with rich semantic relations mined from the web, MagicLens achieves results comparable with or better than prior best on eight benchmarks of various image retrieval tasks, while maintaining high parameter efficiency with a significantly smaller model size. Additional human analyses on a 1.4M-image unseen corpus further demonstrate the diversity of search intents supported by MagicLens. Code and models are publicly available at https://open-vision-language.github.io/MagicLens/.
Submitted 24 June, 2024; v1 submitted 28 March, 2024;
originally announced March 2024.
-
Geometric and electronic properties of two kinds of CrO2 magnetic monolayers: D3d and D2h phases
Authors:
Yang Zhang,
Xianggong Bo,
Jimeng Jing,
Lixia Wang,
Shiqian Qiao,
Hong Wu,
Yong Pu,
Feng Li
Abstract:
Due to the high magnetic coupling strength between the Cr elements, the bulk phase CrO2 is one of several ferromagnetic oxides known to have the highest Curie temperature. When the dimensionality of the material is reduced from 3D to 2D, the 2D CrO2 system material is expected to maintain a high Curie temperature. In this work, we predict two new phases of CrO2 monolayer (D3d and D2h) by using first-principles calculations. We have found that the Curie temperature of 2D CrO2 is much lower than that of its bulk phase, but still remains as high as 191K, which is comparable to that of Fe2Cr2Ge6. In addition, 1L D3d-CrO2 is in the ferromagnetic state, while 1L D2h-CrO2 is in the antiferromagnetic state. Also, the different geometric structure affects its electrical properties: the 1L D3d-CrO2 is a half-metal while 1L D2h-CrO2 is a semiconductor. Our studies have shown that there is a wealth of electrical and magnetic properties in CrO2.
Submitted 13 March, 2024;
originally announced March 2024.
-
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Authors:
Gemini Team,
Petko Georgiev,
Ving Ian Lei,
Ryan Burnell,
Libin Bai,
Anmol Gulati,
Garrett Tanzer,
Damien Vincent,
Zhufeng Pan,
Shibo Wang,
Soroosh Mariooryad,
Yifan Ding,
Xinyang Geng,
Fred Alcober,
Roy Frostig,
Mark Omernick,
Lexi Walker,
Cosmin Paduraru,
Christina Sorokin,
Andrea Tacchetti,
Colin Gaffney,
Samira Daruki,
Olcan Sercinoglu,
Zach Gleicher,
Juliette Love
, et al. (1112 additional authors not shown)
Abstract:
In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.
Submitted 16 December, 2024; v1 submitted 8 March, 2024;
originally announced March 2024.
-
KnowAgent: Knowledge-Augmented Planning for LLM-Based Agents
Authors:
Yuqi Zhu,
Shuofei Qiao,
Yixin Ou,
Shumin Deng,
Shiwei Lyu,
Yue Shen,
Lei Liang,
Jinjie Gu,
Huajun Chen,
Ningyu Zhang
Abstract:
Large Language Models (LLMs) have demonstrated great potential in complex reasoning tasks, yet they fall short when tackling more sophisticated challenges, especially when interacting with environments through generating executable actions. This inadequacy primarily stems from the lack of built-in action knowledge in language agents, which fails to effectively guide the planning trajectories during task solving and results in planning hallucination. To address this issue, we introduce KnowAgent, a novel approach designed to enhance the planning capabilities of LLMs by incorporating explicit action knowledge. Specifically, KnowAgent employs an action knowledge base and a knowledgeable self-learning strategy to constrain the action path during planning, enabling more reasonable trajectory synthesis, and thereby enhancing the planning performance of language agents. Experimental results on HotpotQA and ALFWorld based on various backbone models demonstrate that KnowAgent can achieve comparable or superior performance to existing baselines. Further analysis indicates the effectiveness of KnowAgent in mitigating planning hallucinations. Code is available at https://github.com/zjunlp/KnowAgent.
Submitted 21 February, 2025; v1 submitted 5 March, 2024;
originally announced March 2024.
-
Two-dimensional photonic crystal cavities in ZnSe quantum well structures
Authors:
Siqi Qiao,
Nils von den Driesch,
Xi Chen,
Stefan Trellenkamp,
Florian Lentz,
Christoph Krause,
Benjamin Bennemann,
Thorsten Brazda,
James M. LeBeau,
Alexander Pawlis
Abstract:
ZnSe and related materials like ZnMgSe and ZnCdSe are promising II-VI host materials for optically mediated quantum information technology such as single photon sources or spin qubits. Integrating these heterostructures into photonic crystal (PC) cavities enables further improvements, for example realizing Purcell-enhanced single photon sources with increased quantum efficiency. Here we report on the successful implementation of two-dimensional (2D) PC cavities in strained ZnSe quantum wells (QW) on top of a novel AlAs supporting layer. This approach overcomes typical obstacles associated with PC membrane fabrication in strained materials, such as cracks and strain relaxation in the corresponding devices. We demonstrate the attainment of the required mechanical stability in our PC devices, complete strain retainment and effective vertical optical confinement. Structural analysis of our PC cavities reveals excellent etching anisotropy. Additionally, elemental mapping in a scanning transmission electron microscope confirms the transformation of AlAs into AlOx by post-growth wet oxidation and reveals partial oxidation of ZnMgSe at the etched sidewalls in the PC. This knowledge is utilized to tailor FDTD simulations and to extract the ZnMgSe dispersion relation with small oxygen content. Optical characterization of the PC cavities with cross-polarized resonance scattering spectroscopy verifies the presence of cavity modes. The excellent agreement between simulation and measured cavity mode energies demonstrates wide tunability of the PC cavity and proves the pertinence of our model. This implementation of 2D PC cavities in the ZnSe material system establishes a solid foundation for future developments of ZnSe quantum devices.
Submitted 23 February, 2024;
originally announced February 2024.
-
Multi-view Intent Learning and Alignment with Large Language Models for Session-based Recommendation
Authors:
Shutong Qiao,
Wei Zhou,
Junhao Wen,
Chen Gao,
Qun Luo,
Peixuan Chen,
Yong Li
Abstract:
Session-based recommendation (SBR) methods often rely on user behavior data, which can struggle with the sparsity of session data, limiting performance. Researchers have identified that beyond behavioral signals, rich semantic information in item descriptions is crucial for capturing hidden user intent. While large language models (LLMs) offer new ways to leverage this semantic data, the challenges of session anonymity, short-sequence nature, and high LLM training costs have hindered the development of a lightweight, efficient LLM framework for SBR.
To address the above challenges, we propose an LLM-enhanced SBR framework that integrates semantic and behavioral signals from multiple views. This two-stage framework leverages the strengths of both LLMs and traditional SBR models while minimizing training costs. In the first stage, we use multi-view prompts to infer latent user intentions at the session semantic level, supported by an intent localization module to alleviate LLM hallucinations. In the second stage, we align and unify these semantic inferences with behavioral representations, effectively merging insights from both large and small models. Extensive experiments on two real datasets demonstrate that the LLM4SBR framework can effectively improve model performance. We release our codes along with the baselines at https://github.com/tsinghua-fib-lab/LLM4SBR.
Submitted 13 April, 2025; v1 submitted 21 February, 2024;
originally announced February 2024.
-
EasyInstruct: An Easy-to-use Instruction Processing Framework for Large Language Models
Authors:
Yixin Ou,
Ningyu Zhang,
Honghao Gui,
Ziwen Xu,
Shuofei Qiao,
Yida Xue,
Runnan Fang,
Kangwei Liu,
Lei Li,
Zhen Bi,
Guozhou Zheng,
Huajun Chen
Abstract:
In recent years, instruction tuning has gained increasing attention and emerged as a crucial technique to enhance the capabilities of Large Language Models (LLMs). To construct high-quality instruction datasets, many instruction processing approaches have been proposed, aiming to achieve a delicate balance between data quantity and data quality. Nevertheless, due to inconsistencies that persist among various instruction processing methods, there is no standard open-source instruction processing implementation framework available for the community, which hinders practitioners from further developing and advancing. To facilitate instruction processing research and development, we present EasyInstruct, an easy-to-use instruction processing framework for LLMs, which modularizes instruction generation, selection, and prompting, while also considering their combination and interaction. EasyInstruct is publicly released and actively maintained at https://github.com/zjunlp/EasyInstruct, along with an online demo app and a demo video for quick-start, calling for broader research centered on instruction data and synthetic data.
Submitted 23 June, 2024; v1 submitted 5 February, 2024;
originally announced February 2024.
-
AutoAct: Automatic Agent Learning from Scratch for QA via Self-Planning
Authors:
Shuofei Qiao,
Ningyu Zhang,
Runnan Fang,
Yujie Luo,
Wangchunshu Zhou,
Yuchen Eleanor Jiang,
Chengfei Lv,
Huajun Chen
Abstract:
Language agents have achieved considerable performance on various complex question-answering tasks by planning with external tools. Despite the incessant exploration in this field, existing language agent systems still struggle with costly, non-reproducible data reliance and face the challenge of compelling a single model for multiple functions. To this end, we introduce AutoAct, an automatic agent learning framework for QA that does not rely on large-scale annotated data and synthetic planning trajectories from closed-source models (e.g., GPT-4). Given limited data with a tool library, AutoAct first automatically synthesizes planning trajectories without any assistance from humans or strong closed-source models. Then, AutoAct leverages a division-of-labor strategy to automatically differentiate based on the target task information and synthesized trajectories, producing a sub-agent group to complete the task. We conduct comprehensive experiments with different LLMs, which demonstrates that AutoAct yields better or parallel performance compared to various strong baselines. Further analysis demonstrates the effectiveness of the division-of-labor strategy, with the trajectory quality generated by AutoAct generally outperforming that of others. Code will be available at https://github.com/zjunlp/AutoAct.
Submitted 26 May, 2024; v1 submitted 10 January, 2024;
originally announced January 2024.
-
Gemini: A Family of Highly Capable Multimodal Models
Authors:
Gemini Team,
Rohan Anil,
Sebastian Borgeaud,
Jean-Baptiste Alayrac,
Jiahui Yu,
Radu Soricut,
Johan Schalkwyk,
Andrew M. Dai,
Anja Hauth,
Katie Millican,
David Silver,
Melvin Johnson,
Ioannis Antonoglou,
Julian Schrittwieser,
Amelia Glaese,
Jilin Chen,
Emily Pitler,
Timothy Lillicrap,
Angeliki Lazaridou,
Orhan Firat,
James Molloy,
Michael Isard,
Paul R. Barham,
Tom Hennigan,
Benjamin Lee
, et al. (1326 additional authors not shown)
Abstract:
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.
Submitted 9 May, 2025; v1 submitted 18 December, 2023;
originally announced December 2023.
-
R3D-SWIN: Use Shifted Window Attention for Single-View 3D Reconstruction
Authors:
Chenhuan Li,
Meihua Xiao,
zehuan li,
Fangping Chen,
Shanshan Qiao,
Dingli Wang,
Mengxi Gao,
Siyi Zhang
Abstract:
Recently, vision transformers have performed well in various computer vision tasks, including voxel 3D reconstruction. However, the windows of the vision transformer are not multi-scale, and there is no connection between the windows, which limits the accuracy of voxel 3D reconstruction. Therefore, we propose a voxel 3D reconstruction network based on shifted window attention. To the best of our knowledge, this is the first work to apply shifted window attention to voxel 3D reconstruction. Experimental results on ShapeNet verify our method achieves SOTA accuracy in single-view reconstruction.
Submitted 6 March, 2024; v1 submitted 5 December, 2023;
originally announced December 2023.
-
IG Captioner: Information Gain Captioners are Strong Zero-shot Classifiers
Authors:
Chenglin Yang,
Siyuan Qiao,
Yuan Cao,
Yu Zhang,
Tao Zhu,
Alan Yuille,
Jiahui Yu
Abstract:
Generative training has been demonstrated to be powerful for building visual-language models. However, on zero-shot discriminative benchmarks, there is still a performance gap between models trained with generative and discriminative objectives. In this paper, we aim to narrow this gap by improving the efficacy of generative training on classification tasks, without any finetuning processes or additional modules.
Specifically, we focus on narrowing the gap between the generative captioner and the CLIP classifier. We begin by analysing the predictions made by the captioner and the classifier and observe that caption generation inherits the distribution bias of the language model trained on the pure text modality, making it less grounded in the visual signal. To tackle this problem, we redesign the scoring objective for the captioner to alleviate the distributional bias and focus on measuring the information gain brought by the visual inputs. We further design a generative training objective to match the evaluation objective. We call the model trained and evaluated with these procedures the Information Gain (IG) captioner. We pretrain the models on the public LAION-5B dataset and perform a series of discriminative evaluations. For zero-shot classification on ImageNet, the IG captioner achieves a $> 18\%$ improvement over the standard captioner and performs comparably to the CLIP classifier. The IG captioner also demonstrates strong performance on zero-shot image-text retrieval tasks on MSCOCO and Flickr30K. We hope this paper inspires further research towards unifying generative and discriminative training procedures for visual-language models.
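The abstract does not give the exact scoring formula, so the following is only a minimal sketch of one natural reading of "information gain": score each candidate caption by the difference between its log-likelihood given the image and its log-likelihood under the text-only prior. The function names and the toy probabilities are hypothetical and are not taken from the paper.

```python
import math

def information_gain_scores(cond_logp, prior_logp):
    """Score each caption by log p(caption | image) - log p(caption).

    Assumption: both inputs are dicts mapping caption -> log-probability,
    produced by some captioner with and without the image (hypothetical).
    """
    return {c: cond_logp[c] - prior_logp[c] for c in cond_logp}

def classify(cond_logp, prior_logp):
    """Return the caption with the largest information gain."""
    scores = information_gain_scores(cond_logp, prior_logp)
    return max(scores, key=scores.get)

# Toy numbers (made up): "dog" is a frequent phrase in text, so its prior is high.
cond = {
    "a photo of a dog": math.log(0.30),
    "a photo of an axolotl": math.log(0.20),
}
prior = {
    "a photo of a dog": math.log(0.25),
    "a photo of an axolotl": math.log(0.01),
}

plain_choice = max(cond, key=cond.get)   # picks the caption favoured by the text prior
ig_choice = classify(cond, prior)        # picks the caption the image supports most
```

In this toy case the plain likelihood prefers the common caption, while the gain-based score prefers the caption whose probability rose most once the image was observed, which is the kind of bias correction the abstract describes.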
Submitted 16 July, 2024; v1 submitted 27 November, 2023;
originally announced November 2023.
-
Limited bisimulations for nondeterministic fuzzy transition systems
Authors:
Sha Qiao,
Jun e Feng,
Ping Zhu
Abstract:
The limited version of bisimulation, called limited approximate bisimulation, has recently been introduced to fuzzy transition systems (FTSs). This article extends limited approximate bisimulation to nondeterministic fuzzy transition systems (NFTSs), which are more general structures than FTSs, and introduces a notion of $k$-limited $α$-bisimulation by using an approach of relational lifting, where $k$ is a natural number and $α\in[0,1]$. To give the algorithmic characterization, a fixed point characterization of $k$-limited $α$-bisimilarity is first provided. Then a $k$-limited $α$-bisimulation vector, whose $i$-th element is a $(k-i+1)$-limited $α$-bisimulation with $1\leq i\leq k+1$, is introduced to investigate conditions for two states to be $k$-limited $α$-bisimilar. Using these results, an $O(2k^2|V|^6\cdot\left|\longrightarrow\right|^2)$ algorithm is designed for computing the degree of similarity between two states, where $|V|$ is the number of states of the NFTS and $\left|\longrightarrow\right|$ is the greatest number of transitions from states. Finally, the relationship between $k$-limited $α$-bisimilarity and $α$-bisimulation under $\widetilde{S}$ is shown, from which a logical characterization of $k$-limited $α$-bisimilarity is obtained.
Submitted 26 November, 2023;
originally announced November 2023.
-
PolyMaX: General Dense Prediction with Mask Transformer
Authors:
Xuan Yang,
Liangzhe Yuan,
Kimberly Wilber,
Astuti Sharma,
Xiuye Gu,
Siyuan Qiao,
Stephanie Debats,
Huisheng Wang,
Hartwig Adam,
Mikhail Sirotenko,
Liang-Chieh Chen
Abstract:
Dense prediction tasks, such as semantic segmentation, depth estimation, and surface normal prediction, can be easily formulated as per-pixel classification (discrete outputs) or regression (continuous outputs). This per-pixel prediction paradigm has remained popular due to the prevalence of fully convolutional networks. However, on the recent frontier of segmentation tasks, the community has been witnessing a paradigm shift from per-pixel prediction to cluster-prediction with the emergence of transformer architectures, particularly mask transformers, which directly predict a label for a mask instead of a pixel. Despite this shift, methods based on the per-pixel prediction paradigm still dominate the benchmarks on the other dense prediction tasks that require continuous outputs, such as depth estimation and surface normal prediction. Motivated by the success of DORN and AdaBins in depth estimation, achieved by discretizing the continuous output space, we propose to generalize the cluster-prediction based method to general dense prediction tasks. This allows us to unify dense prediction tasks with the mask transformer framework. Remarkably, the resulting model PolyMaX demonstrates state-of-the-art performance on three benchmarks of the NYUD-v2 dataset. We hope our simple yet effective design can inspire more research on exploiting mask transformers for more dense prediction tasks. Code and model will be made available.
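To make "discretizing the continuous output space" concrete, here is a minimal sketch under simplifying assumptions: depth is split into uniform bins (a simplification chosen for illustration, not the binning used by DORN, AdaBins, or PolyMaX), a per-pixel distribution over bins is assumed to be given, and a continuous value is recovered as the probability-weighted average of bin centers. All function names and numbers are hypothetical.

```python
def bin_centers(d_min, d_max, num_bins):
    """Centers of num_bins uniform bins covering [d_min, d_max]."""
    width = (d_max - d_min) / num_bins
    return [d_min + (i + 0.5) * width for i in range(num_bins)]

def depth_to_bin(depth, d_min, d_max, num_bins):
    """Map a continuous depth to a bin index, clamped to the valid range."""
    width = (d_max - d_min) / num_bins
    idx = int((depth - d_min) / width)
    return min(max(idx, 0), num_bins - 1)

def expected_depth(probs, centers):
    """Recover a continuous depth as the expectation over bin centers."""
    return sum(p * c for p, c in zip(probs, centers))

centers = bin_centers(0.0, 10.0, 5)             # five bins of width 2
target_bin = depth_to_bin(4.2, 0.0, 10.0, 5)    # classification target for one pixel
far_bin = depth_to_bin(10.0, 0.0, 10.0, 5)      # upper boundary is clamped to the last bin
# Assumed per-pixel class probabilities, split between two neighbouring bins.
recovered = expected_depth([0.0, 0.5, 0.5, 0.0, 0.0], centers)
```

Treating each bin as a class turns regression into classification, and the weighted average lets the prediction fall between bin centers instead of snapping to one of them.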
Submitted 9 November, 2023;
originally announced November 2023.