-
Sustainable Carbon-Aware and Water-Efficient LLM Scheduling in Geo-Distributed Cloud Datacenters
Authors:
Hayden Moore,
Sirui Qi,
Ninad Hogade,
Dejan Milojicic,
Cullen Bash,
Sudeep Pasricha
Abstract:
In recent years, Large Language Models (LLM) such as ChatGPT, CoPilot, and Gemini have been widely adopted in different areas. As the use of LLMs continues to grow, many efforts have focused on reducing the massive training overheads of these models. But it is the environmental impact of handling user requests to LLMs that is increasingly becoming a concern. Recent studies estimate that the costs…
▽ More
In recent years, Large Language Models (LLM) such as ChatGPT, CoPilot, and Gemini have been widely adopted in different areas. As the use of LLMs continues to grow, many efforts have focused on reducing the massive training overheads of these models. But it is the environmental impact of handling user requests to LLMs that is increasingly becoming a concern. Recent studies estimate that the costs of operating LLMs in their inference phase can exceed training costs by 25x per year. As LLMs are queried incessantly, the cumulative carbon footprint for the operational phase has been shown to far exceed the footprint during the training phase. Further, estimates indicate that 500 ml of fresh water is expended for every 20-50 requests to LLMs during inference. To address these important sustainability issues with LLMs, we propose a novel framework called SLIT to co-optimize LLM quality of service (time-to-first token), carbon emissions, water usage, and energy costs. The framework utilizes a machine learning (ML) based metaheuristic to enhance the sustainability of LLM hosting across geo-distributed cloud datacenters. Such a framework will become increasingly vital as LLMs proliferate.
△ Less
Submitted 29 May, 2025;
originally announced May 2025.
-
Hierarchical Multi-Agent Framework for Carbon-Efficient Liquid-Cooled Data Center Clusters
Authors:
Soumyendu Sarkar,
Avisek Naug,
Antonio Guillen,
Vineet Gundecha,
Ricardo Luna Gutierrez,
Sahand Ghorbanpour,
Sajad Mousavi,
Ashwin Ramesh Babu,
Desik Rengarajan,
Cullen Bash
Abstract:
Reducing the environmental impact of cloud computing requires efficient workload distribution across geographically dispersed Data Center Clusters (DCCs) and simultaneously optimizing liquid and air (HVAC) cooling with time shift of workloads within individual data centers (DC). This paper introduces Green-DCC, which proposes a Reinforcement Learning (RL) based hierarchical controller to optimize…
▽ More
Reducing the environmental impact of cloud computing requires efficient workload distribution across geographically dispersed Data Center Clusters (DCCs) and simultaneously optimizing liquid and air (HVAC) cooling with time shift of workloads within individual data centers (DC). This paper introduces Green-DCC, which proposes a Reinforcement Learning (RL) based hierarchical controller to optimize both workload and liquid cooling dynamically in a DCC. By incorporating factors such as weather, carbon intensity, and resource availability, Green-DCC addresses realistic constraints and interdependencies. We demonstrate how the system optimizes multiple data centers synchronously, enabling the scope of digital twins, and compare the performance of various RL approaches based on carbon emissions and sustainability metrics while also offering a framework and benchmark simulation for broader ML research in sustainability.
△ Less
Submitted 12 February, 2025;
originally announced February 2025.
-
A Framework for SLO, Carbon, and Wastewater-Aware Sustainable FaaS Cloud Platform Management
Authors:
Sirui Qi,
Hayden Moore,
Ninad Hogade,
Dejan Milojicic,
Cullen Bash,
Sudeep Pasricha
Abstract:
Function-as-a-Service (FaaS) is a growing cloud computing paradigm that is expected to reduce the user cost of service over traditional serverful approaches. However, the environmental impact of FaaS has not received much attention. We investigate FaaS scheduling and scaling from a sustainability perspective in this work. We find that the service-level objectives (SLOs) of FaaS and carbon emission…
▽ More
Function-as-a-Service (FaaS) is a growing cloud computing paradigm that is expected to reduce the user cost of service over traditional serverful approaches. However, the environmental impact of FaaS has not received much attention. We investigate FaaS scheduling and scaling from a sustainability perspective in this work. We find that the service-level objectives (SLOs) of FaaS and carbon emissions conflict with each other. We also find that SLO-focused FaaS scheduling can exacerbate water use in a datacenter. We propose a novel sustainability-focused FaaS scheduling and scaling framework to co-optimize SLO performance, carbon emissions, and wastewater generation.
△ Less
Submitted 9 October, 2024;
originally announced October 2024.
-
CASA: A Framework for SLO and Carbon-Aware Autoscaling and Scheduling in Serverless Cloud Computing
Authors:
S. Qi,
H. Moore,
N. Hogade,
D. Milojicic,
C. Bash,
S. Pasricha
Abstract:
Serverless computing is an emerging cloud computing paradigm that can reduce costs for cloud providers and their customers. However, serverless cloud platforms have stringent performance requirements (due to the need to execute short duration functions in a timely manner) and a growing carbon footprint. Traditional carbon-reducing techniques such as shutting down idle containers can reduce perform…
▽ More
Serverless computing is an emerging cloud computing paradigm that can reduce costs for cloud providers and their customers. However, serverless cloud platforms have stringent performance requirements (due to the need to execute short duration functions in a timely manner) and a growing carbon footprint. Traditional carbon-reducing techniques such as shutting down idle containers can reduce performance by increasing cold-start latencies of containers required in the future. This can cause higher violation rates of service level objectives (SLOs). Conversely, traditional latency-reduction approaches of prewarming containers or keeping them alive when not in use can improve performance but increase the associated carbon footprint of the serverless cluster platform. To strike a balance between sustainability and performance, in this paper, we propose a novel carbon- and SLO-aware framework called CASA to schedule and autoscale containers in a serverless cloud computing cluster. Experimental results indicate that CASA reduces the operational carbon footprint of a FaaS cluster by up to 2.6x while also reducing the SLO violation rate by up to 1.4x compared to the state-of-the-art.
△ Less
Submitted 31 August, 2024;
originally announced September 2024.
-
MOSAIC: A Multi-Objective Optimization Framework for Sustainable Datacenter Management
Authors:
Sirui Qi,
Dejan Milojicic,
Cullen Bash,
Sudeep Pasricha
Abstract:
In recent years, cloud service providers have been building and hosting datacenters across multiple geographical locations to provide robust services. However, the geographical distribution of datacenters introduces growing pressure to both local and global environments, particularly when it comes to water usage and carbon emissions. Unfortunately, efforts to reduce the environmental impact of suc…
▽ More
In recent years, cloud service providers have been building and hosting datacenters across multiple geographical locations to provide robust services. However, the geographical distribution of datacenters introduces growing pressure to both local and global environments, particularly when it comes to water usage and carbon emissions. Unfortunately, efforts to reduce the environmental impact of such datacenters often lead to an increase in the cost of datacenter operations. To co-optimize the energy cost, carbon emissions, and water footprint of datacenter operation from a global perspective, we propose a novel framework for multi-objective sustainable datacenter management (MOSAIC) that integrates adaptive local search with a collaborative decomposition-based evolutionary algorithm to intelligently manage geographical workload distribution and datacenter operations. Our framework sustainably allocates workloads to datacenters while taking into account multiple geography- and time-based factors including renewable energy sources, variable energy costs, power usage efficiency, carbon factors, and water intensity in energy. Our experimental results show that, compared to the best-known prior work frameworks, MOSAIC can achieve 27.45x speedup and 1.53x improvement in Pareto Hypervolume while reducing the carbon footprint by up to 1.33x, water footprint by up to 3.09x, and energy costs by up to 1.40x. In the simultaneous three-objective co-optimization scenario, MOSAIC achieves a cumulative improvement across all objectives (carbon, water, cost) of up to 4.61x compared to the state-of-the-arts.
△ Less
Submitted 14 November, 2023;
originally announced November 2023.
-
SHIELD: Sustainable Hybrid Evolutionary Learning Framework for Carbon, Wastewater, and Energy-Aware Data Center Management
Authors:
Sirui Qi,
Dejan Milojicic,
Cullen Bash,
Sudeep Pasricha
Abstract:
Today's cloud data centers are often distributed geographically to provide robust data services. But these geo-distributed data centers (GDDCs) have a significant associated environmental impact due to their increasing carbon emissions and water usage, which needs to be curtailed. Moreover, the energy costs of operating these data centers continue to rise. This paper proposes a novel framework to…
▽ More
Today's cloud data centers are often distributed geographically to provide robust data services. But these geo-distributed data centers (GDDCs) have a significant associated environmental impact due to their increasing carbon emissions and water usage, which needs to be curtailed. Moreover, the energy costs of operating these data centers continue to rise. This paper proposes a novel framework to co-optimize carbon emissions, water footprint, and energy costs of GDDCs, using a hybrid workload management framework called SHIELD that integrates machine learning guided local search with a decomposition-based evolutionary algorithm. Our framework considers geographical factors and time-based differences in power generation/use, costs, and environmental impacts to intelligently manage workload distribution across GDDCs and data center operation. Experimental results show that SHIELD can realize 34.4x speedup and 2.1x improvement in Pareto Hypervolume while reducing the carbon footprint by up to 3.7x, water footprint by up to 1.8x, energy costs by up to 1.3x, and a cumulative improvement across all objectives (carbon, water, cost) of up to 4.8x compared to the state-of-the-art.
△ Less
Submitted 24 August, 2023;
originally announced August 2023.