Learning Dependencies in Distributed Cloud Applications to Identify and Localize Anomalies

Scheinert, Dominik; Acker, Alexander; Thamsen, Lauritz; Geldenhuys, Morgan K.; Kao, Odej

doi:10.1109/CloudIntelligence52565.2021.00011

arXiv:2103.05245 (cs)

[Submitted on 9 Mar 2021 (v1), last revised 9 Sep 2021 (this version, v2)]

Abstract: Operation and maintenance of large distributed cloud applications can quickly become unmanageably complex, putting human operators under immense stress when problems occur. Utilizing machine learning for identification and localization of anomalies in such systems supports human experts and enables fast mitigation. However, due to the various inter-dependencies of system components, anomalies do not only affect their origin but also propagate through the distributed system. Taking this into account, we present Arvalus and its variant D-Arvalus, a neural graph transformation method that models system components as nodes and their dependencies and placement as edges to improve the identification and localization of anomalies. Given a series of metric KPIs, our method predicts the most likely system state (either normal or an anomaly class) and performs localization when an anomaly is detected. During our experiments, we simulate a distributed cloud application deployment and synthetically inject anomalies. The evaluation shows the generally good prediction performance of Arvalus and reveals the advantage of D-Arvalus, which incorporates information about system component dependencies.
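The graph view described in the abstract (components as nodes, dependencies and placement as edges) can be illustrated with a minimal sketch. This is not the Arvalus or D-Arvalus model: the function names `build_graph` and `aggregate`, the component names, the host names, and the KPI values are all hypothetical, and the plain mean over neighbours merely stands in for a learned graph transformation. It only shows how an anomalous metric on one component could influence the representation of the components connected to it.

```python
from collections import defaultdict


def build_graph(components, dependencies, placement):
    """Return an undirected adjacency map over component names.

    components:   iterable of component names (graph nodes)
    dependencies: iterable of (a, b) pairs, meaning a depends on b
    placement:    dict mapping component name -> host name
    """
    adj = defaultdict(set)
    for name in components:
        adj[name]  # touch the key so isolated nodes still appear
    # Dependency edges.
    for a, b in dependencies:
        adj[a].add(b)
        adj[b].add(a)
    # Placement edges: components on the same host are connected.
    by_host = defaultdict(list)
    for name, host in placement.items():
        by_host[host].append(name)
    for members in by_host.values():
        for a in members:
            for b in members:
                if a != b:
                    adj[a].add(b)
    return dict(adj)


def aggregate(features, adj):
    """One round of mean aggregation over each node and its neighbours.

    Assumption: a simple average replaces the learned transformation.
    """
    out = {}
    for node, neighbours in adj.items():
        group = [features[node]] + [features[n] for n in sorted(neighbours)]
        dim = len(features[node])
        out[node] = [sum(v[i] for v in group) / len(group) for i in range(dim)]
    return out


# Hypothetical three-tier deployment with one KPI per component.
components = ["lb", "app", "db"]
dependencies = [("lb", "app"), ("app", "db")]
placement = {"lb": "host1", "app": "host1", "db": "host2"}
kpis = {"lb": [0.2], "app": [0.9], "db": [0.1]}

graph = build_graph(components, dependencies, placement)
smoothed = aggregate(kpis, graph)
```

In this toy setup the high value on `app` raises the aggregated values of `lb` and `db`, which mirrors the propagation effect the abstract describes. A real model would learn how to weight such neighbourhood information and would classify the resulting node states.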

Comments:	6 pages, 5 figures, 3 tables
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Cite as:	arXiv:2103.05245 [cs.DC]
	(or arXiv:2103.05245v2 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2103.05245
Journal reference:	IEEE/ACM CloudIntelligence (2021) 7-12
Related DOI:	https://doi.org/10.1109/CloudIntelligence52565.2021.00011

Submission history

From: Dominik Scheinert
[v1] Tue, 9 Mar 2021 06:34:05 UTC (869 KB)
[v2] Thu, 9 Sep 2021 15:34:56 UTC (869 KB)
