Enhancing Cluster Resilience: LLM-agent Based Autonomous Intelligent Cluster Diagnosis System and Evaluation Framework

Shi, Honghao; Cheng, Longkai; Wu, Wenli; Wang, Yuhang; Liu, Xuan; Nie, Shaokai; Wang, Weixv; Min, Xuebin; Men, Chunlei; Lin, Yonghua

Computer Science > Artificial Intelligence

arXiv:2411.05349 (cs)

[Submitted on 8 Nov 2024]

Abstract: Recent advancements in Large Language Models (LLMs) and related technologies such as Retrieval-Augmented Generation (RAG) and Diagram of Thought (DoT) have enabled the creation of autonomous intelligent systems capable of performing cluster diagnostics and troubleshooting. By integrating these technologies with self-play methodologies, we have developed an LLM-agent system designed to autonomously diagnose and resolve issues within AI clusters. Our innovations include a knowledge base tailored for cluster diagnostics, enhanced LLM algorithms, practical deployment strategies for agents, and a benchmark specifically designed for evaluating LLM capabilities in this domain. Through extensive experimentation across multiple dimensions, we have demonstrated the superiority of our system in addressing the challenges faced in cluster diagnostics, particularly in detecting and rectifying performance issues more efficiently and accurately than traditional methods.

Comments:	10 pages
Subjects:	Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
MSC classes:	68T42
Cite as:	arXiv:2411.05349 [cs.AI]
	(or arXiv:2411.05349v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2411.05349

Submission history

From: Honghao Shi
[v1] Fri, 8 Nov 2024 06:12:56 UTC (515 KB)
