
The infrastructure powering IBM's Gen AI model development

Authors: Talia Gershon, Seetharami Seelam, Brian Belgodere, Milton Bonilla, Lan Hoang, Danny Barnett, I-Hsin Chung, Apoorve Mohan, Ming-Hung Chen, Lixiang Luo, Robert Walkup, Constantinos Evangelinos, Shweta Salaria, Marc Dombrowa, Yoonho Park, Apo Kayi, Liran Schour, Alim Alim, Ali Sydney, Pavlos Maniotis, Laurent Schares, Bernard Metzler, Bengi Karacali-Akyamac, Sophia Wen, Tatsuhiro Chiba, et al. (122 additional authors not shown)

Abstract: AI Infrastructure plays a key role in the speed and cost-competitiveness of developing and deploying advanced AI models. The current demand for powerful AI infrastructure for model training is driven by the emergence of generative AI and foundational models, where on occasion thousands of GPUs must cooperate on a single training job for the model to be trained in a reasonable time. Delivering efficient and high-performing AI training requires an end-to-end solution that combines hardware, software and holistic telemetry to cater for multiple types of AI workloads. In this report, we describe IBM's hybrid cloud infrastructure that powers our generative AI model development. This infrastructure includes (1) Vela: an AI-optimized supercomputing capability directly integrated into the IBM Cloud, delivering scalable, dynamic, multi-tenant and geographically distributed infrastructure for large-scale model training and other AI workflow steps and (2) Blue Vela: a large-scale, purpose-built, on-premises hosting environment that is optimized to support our largest and most ambitious AI model training tasks. Vela provides IBM with the dual benefit of high performance for internal use along with the flexibility to adapt to an evolving commercial landscape. Blue Vela provides us with the benefits of rapid development of our largest and most ambitious models, as well as future-proofing against the evolving model landscape in the industry. Taken together, they provide IBM with the ability to rapidly innovate in the development of both AI models and commercial offerings.

Submitted 13 January, 2025; v1 submitted 7 July, 2024; originally announced July 2024.

Comments: Corresponding Authors: Talia Gershon, Seetharami Seelam, Brian Belgodere, Milton Bonilla

arXiv:2211.12139

City-Wide Perceptions of Neighbourhood Quality using Street View Images

Authors: Emily Muller, Emily Gemmell, Ishmam Choudhury, Ricky Nathvani, Antje Barbara Metzler, James Bennett, Emily Denton, Seth Flaxman, Majid Ezzati

Abstract: The interactions of individuals with city neighbourhoods are determined, in part, by the perceived quality of urban environments. Perceived neighbourhood quality is a core component of urban vitality, influencing social cohesion, sense of community, safety, activity and mental health of residents. Large-scale assessment of perceptions of neighbourhood quality was pioneered by the Place Pulse projects. Researchers demonstrated the efficacy of crowd-sourcing perception ratings of image pairs across 56 cities and training a model to predict perceptions from street-view images. Variation across cities may limit Place Pulse's usefulness for assessing within-city perceptions. In this paper, we set forth a protocol for city-specific dataset collection for the perception: 'On which street would you prefer to walk?'. This paper describes our methodology, based in London, including collection of images and ratings, web development, model training and mapping. Assessment of within-city perceptions of neighbourhoods can identify inequities, inform planning priorities, and identify temporal dynamics. Code available: https://emilymuller1991.github.io/urban-perceptions/.

Submitted 24 November, 2022; v1 submitted 22 November, 2022; originally announced November 2022.

arXiv:2205.11261

An Elastic Ephemeral Datastore using Cheap, Transient Cloud Resources

Authors: Malte Brodmann, Nikolas Ioannou, Bernard Metzler, Jonas Pfefferle, Ana Klimovic

Abstract: Spot instances are virtual machines offered at 60-90% lower cost that can be reclaimed at any time, with only a short warning period. Spot instances have already been used to significantly reduce the cost of processing workloads in the cloud. However, leveraging spot instances to reduce the cost of stateful cloud applications is much more challenging, as the sudden preemptions lead to data loss. In this work, we propose leveraging spot instances to decrease the cost of ephemeral data management in distributed data analytics applications. We specifically target ephemeral data as this large class of data in modern analytics workloads has low durability requirements; if lost, the data can be regenerated by re-executing compute tasks. We design an elastic, distributed ephemeral datastore that handles node preemptions transparently to user applications and minimizes data loss by redistributing data during node preemption warning periods. We implement our elastic datastore on top of the Apache Crail datastore and evaluate the system with various workloads and VM types. By leveraging spot instances, we show that we can run TPC-DS queries with 60% lower cost compared to using on-demand VMs for the datastore, while only increasing end-to-end execution time by 2.1%.

Submitted 23 May, 2022; originally announced May 2022.

arXiv:2104.03075

Serverless Predictions: 2021-2030

Authors: Pedro Garcia Lopez, Aleksander Slominski, Michael Behrendt, Bernard Metzler

Abstract: Within the next 10 years, advances in resource disaggregation will enable full transparency for most Cloud applications: to run unmodified single-machine applications over effectively unlimited remote computing resources. In this article, we present five serverless predictions for the next decade that will realize this vision of transparency -- equivalent to Tim Wagner's Serverless SuperComputer or AnyScale's Infinite Laptop proposals.

Submitted 7 April, 2021; originally announced April 2021.

Comments: arXiv admin note: text overlap with arXiv:2006.01251

arXiv:2006.01251

Serverless End Game: Disaggregation enabling Transparency

Authors: Pedro García-López, Aleksander Slominski, Simon Shillaker, Michael Behrendt, Bernard Metzler

Abstract: For many years, the distributed systems community has struggled to smooth the transition from local to remote computing. Transparency means concealing the complexities of distributed programming like remote locations, failures or scaling. For us, full transparency implies that we can compile, debug and run unmodified single-machine code over effectively unlimited compute, storage, and memory resources. We elaborate in this article why resource disaggregation in serverless computing is the definitive catalyst to enable full transparency in the Cloud. We demonstrate with two experiments that we can achieve transparency today over disaggregated serverless resources and obtain comparable performance to local executions. We also show that locality cannot be neglected for many problems and we present five open research challenges: granular middleware and locality, memory disaggregation, virtualization, elastic programming models, and optimized deployment. If full transparency is possible, who needs explicit use of middleware if you can treat remote entities as local ones? Can we close the curtains of distributed systems complexity for the majority of users?

Submitted 1 June, 2020; originally announced June 2020.

arXiv:1703.07626

doi: 10.1016/j.future.2017.03.027

Energy-Efficient Data Transfers in Radio Astronomy with Software UDP RDMA

Authors: Przemyslaw Lenkiewicz, P. Chris Broekema, Bernard Metzler

Abstract: Modern radio astronomy relies on very large amounts of data that need to be transferred between various parts of astronomical instruments, over distances that are often in the range of tens or hundreds of kilometres. The Square Kilometre Array (SKA) will be the world's largest radio telescope; data rates between its components will exceed Terabits per second. This will impose a huge challenge on its data transport system, especially with regard to power consumption. High-speed data transfers using modern off-the-shelf hardware may impose a significant load on the receiving system with respect to CPU and DRAM usage. The SKA has a strict energy budget which demands a new, custom-designed data transport solution. In this paper we present SoftiWARP UDP, an unreliable datagram-based Remote Direct Memory Access (RDMA) protocol, which can significantly increase the energy-efficiency of high-speed data transfers for radio astronomy. We have implemented a fully functional software prototype of such a protocol, supporting RDMA Read and Write operations and zero-copy capabilities. We present measurements of power consumption and achieved bandwidth and investigate the behaviour of all examined protocols when subjected to packet loss.

Submitted 22 March, 2017; originally announced March 2017.

Comments: Preprint submitted to Future Generation Computer Systems, 15 pages

Showing 1–6 of 6 results for author: Metzler, B