-
Evolutionary Perspectives on the Evaluation of LLM-Based AI Agents: A Comprehensive Survey
Authors:
Jiachen Zhu,
Menghui Zhu,
Renting Rui,
Rong Shan,
Congmin Zheng,
Bo Chen,
Yunjia Xi,
Jianghao Lin,
Weiwen Liu,
Ruiming Tang,
Yong Yu,
Weinan Zhang
Abstract:
The advent of large language models (LLMs), such as GPT, Gemini, and DeepSeek, has significantly advanced natural language processing, giving rise to sophisticated chatbots capable of diverse language-related tasks. The transition from these traditional LLM chatbots to more advanced AI agents represents a pivotal evolutionary step. However, existing evaluation frameworks often blur the distinctions between LLM chatbots and AI agents, leading to confusion among researchers selecting appropriate benchmarks. To bridge this gap, this paper introduces a systematic analysis of current evaluation approaches, grounded in an evolutionary perspective. We provide a detailed analytical framework that clearly differentiates AI agents from LLM chatbots along five key aspects: complex environment, multi-source instructor, dynamic feedback, multi-modal perception, and advanced capability. Further, we categorize existing evaluation benchmarks based on external environments, driving forces, and resulting advanced internal capabilities. For each category, we delineate relevant evaluation attributes, presented comprehensively in practical reference tables. Finally, we synthesize current trends and outline future evaluation methodologies through four critical lenses: environment, agent, evaluator, and metrics. Our findings offer actionable guidance for researchers, facilitating the informed selection and application of benchmarks in AI agent evaluation, thus fostering continued advancement in this rapidly evolving research domain.
Submitted 6 June, 2025;
originally announced June 2025.
-
AgentNet: Decentralized Evolutionary Coordination for LLM-based Multi-Agent Systems
Authors:
Yingxuan Yang,
Huacan Chai,
Shuai Shao,
Yuanyi Song,
Siyuan Qi,
Renting Rui,
Weinan Zhang
Abstract:
The rapid advancement of large language models (LLMs) has enabled the development of multi-agent systems where multiple LLM-based agents collaborate on complex tasks. However, existing systems often rely on centralized coordination, leading to scalability bottlenecks, reduced adaptability, and single points of failure. Privacy and proprietary knowledge concerns further hinder cross-organizational collaboration, resulting in siloed expertise. We propose AgentNet, a decentralized, Retrieval-Augmented Generation (RAG)-based framework that enables LLM-based agents to specialize, evolve, and collaborate autonomously in a dynamically structured Directed Acyclic Graph (DAG). Unlike prior approaches with static roles or centralized control, AgentNet allows agents to adjust connectivity and route tasks based on local expertise and context. AgentNet introduces three key innovations: (1) a fully decentralized coordination mechanism that eliminates the need for a central orchestrator, enhancing robustness and emergent intelligence; (2) dynamic agent graph topology that adapts in real time to task demands, ensuring scalability and resilience; and (3) a retrieval-based memory system for agents that supports continual skill refinement and specialization. By minimizing centralized control and data exchange, AgentNet enables fault-tolerant, privacy-preserving collaboration across organizations. Experiments show that AgentNet achieves higher task accuracy than both single-agent and centralized multi-agent baselines.
Submitted 29 May, 2025; v1 submitted 1 April, 2025;
originally announced April 2025.
-
CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation
Authors:
Kounianhua Du,
Jizheng Chen,
Renting Rui,
Huacan Chai,
Lingyue Fu,
Wei Xia,
Yasheng Wang,
Ruiming Tang,
Yong Yu,
Weinan Zhang
Abstract:
Using large language models to generate code has shown promise for transforming software development. Despite the intelligence shown by large language models, their specificity in code generation can still be improved, owing to the syntactic gap and mismatched vocabulary between natural language and programming languages. In this paper, we propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework that bridges the gap between NL and PL to enhance the performance of LLMs. CodeGRAG builds a graphical view of code blocks based on their control flow and data flow to better interpret programming domain knowledge, which can help natural-language-based LLMs better understand code syntax and serve as a bridge among different programming languages. To incorporate the extracted structural knowledge into foundation models, we propose 1) a hard meta-graph prompt template that transforms the challenging syntax graph into an informative graphical view for tuning-free models and 2) a soft prompting technique that injects the domain knowledge of programming languages into model parameters by finetuning the models with the soft signals encoded by a GNN expert model. Specifically, two constraints are designed to improve the alignment and structure expressiveness, contributing to the informativeness of the single-token-sized external <GraphEmb> for enhanced code generation. CodeGRAG significantly improves the code generation ability of LLMs and can even offer performance gains for cross-lingual code generation. Implementation is available at https://anonymous.4open.science/r/Code-5970/ .
Submitted 19 May, 2025; v1 submitted 2 May, 2024;
originally announced May 2024.
-
CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models
Authors:
Lingyue Fu,
Huacan Chai,
Shuang Luo,
Kounianhua Du,
Weiming Zhang,
Longteng Fan,
Jiayi Lei,
Renting Rui,
Jianghao Lin,
Yuchen Fang,
Yifan Liu,
Jingkuan Wang,
Siyuan Qi,
Kangning Zhang,
Weinan Zhang,
Yong Yu
Abstract:
With the emergence of Large Language Models (LLMs), there has been a significant improvement in the programming capabilities of models, attracting growing attention from researchers. Evaluating the programming capabilities of LLMs is crucial, as it reflects the multifaceted abilities of LLMs and has numerous downstream applications. In this paper, we propose CodeApex, a bilingual benchmark dataset focusing on the programming comprehension, code generation, and code correction abilities of LLMs. The programming comprehension task tests LLMs on multiple-choice exam questions covering conceptual understanding, commonsense reasoning, and multi-hop reasoning. The code generation task evaluates LLMs by having them complete C++ functions based on provided descriptions and prototypes. The code correction task asks LLMs to fix real-world erroneous code segments with different error messages. We evaluate 12 widely used LLMs, including both general-purpose and specialized models. GPT-4 exhibits the best programming capabilities, achieving approximate accuracies of 69%, 54%, and 66% on the three tasks, respectively. Compared to human performance, there is still significant room for improvement in LLM programming. We hope that CodeApex can serve as a reference for evaluating the coding capabilities of LLMs, further promoting their development and growth.
Submitted 11 March, 2024; v1 submitted 5 September, 2023;
originally announced September 2023.
-
Who to Watch Next: Two-side Interactive Networks for Live Broadcast Recommendation
Authors:
Jiarui Jin,
Xianyu Chen,
Yuanbo Chen,
Weinan Zhang,
Renting Rui,
Zaifan Jiang,
Zhewen Su,
Yong Yu
Abstract:
With the prevalence of live broadcast business nowadays, a new type of recommendation service, called live broadcast recommendation, is widely used in many mobile e-commerce apps. Unlike classical item recommendation, live broadcast recommendation automatically recommends anchors, rather than items, to users, considering the interactions among triple-objects (i.e., users, anchors, items) rather than binary interactions between users and items. Existing methods based on binary objects, ranging from early matrix factorization to recently emerged deep learning, obtain objects' embeddings by mapping from pre-existing features. Directly applying these techniques would lead to limited performance, as they fail to encode collaborative signals among triple-objects. In this paper, we propose novel TWo-side Interactive NetworkS (TWINS) for live broadcast recommendation. To make full use of both static and dynamic information on the user and anchor sides, we combine a product-based neural network with a recurrent neural network to learn the embedding of each object. In addition, instead of directly measuring similarity, TWINS explicitly injects collaborative effects into the embedding process by modeling interactive patterns between the user's browsing history and the anchor's broadcast history in both item and anchor aspects. Furthermore, we design a novel co-retrieval technique to efficiently select key items from massive historical records. Offline experiments on real large-scale data show the superior performance of the proposed TWINS compared to representative methods, and further online experiments on the Diantao App show that TWINS achieves average improvements of around 8% on the ACTR metric, 3% on the UCTR metric, and 3.5% on the UCVR metric.
Submitted 9 February, 2022;
originally announced February 2022.
-
Differentiated context-aware hook placement for different owners' smartphones
Authors:
Tian Chen,
Wang Ya Zhe,
Liu Peng,
Dai Rui Rui,
Zhou An Yuan,
Zhuo Xin Wang
Abstract:
A hook is a piece of code that checks the user's privacy policy before a sensitive operation happens. We propose an automated solution named Prihook for hook placement in the Android Framework. To address specific context-aware user privacy concerns, hook placement in Prihook is personalized. Specifically, we design a User Privacy Preference Table (UPPT) to help users express their privacy concerns. We also leverage machine learning to discover a Potential Method Set (consisting of Sensor Data Access Methods and Sensor Control Methods), from which we can select a particular subset of methods in which to place hooks. We propose a mapping from words in the UPPT lexicon to methods in the Potential Method Set. With this mapping, Prihook is able to (a) select a specific set of methods; and (b) generate and place hooks automatically. We test Prihook separately on 6 typical UPPTs representing 6 kinds of resource-sensitive UPPTs, and no user privacy violation is found. The experimental results show that the hooks placed by Prihook incur small runtime overhead.
Submitted 20 August, 2019;
originally announced August 2019.