-
DES-LOC: Desynced Low Communication Adaptive Optimizers for Training Foundation Models
Authors:
Alex Iacob,
Lorenzo Sani,
Mher Safaryan,
Paris Giampouras,
Samuel Horváth,
Andrej Jovanovic,
Meghdad Kurmanji,
Preslav Aleksandrov,
William F. Shen,
Xinchi Qiu,
Nicholas D. Lane
Abstract:
Scaling foundation model training with Distributed Data Parallel (DDP) methods is bandwidth-limited. Existing infrequent communication methods like Local SGD were designed to synchronize only model parameters and cannot be trivially applied to adaptive optimizers due to additional optimizer states. Current approaches extending Local SGD either lack convergence guarantees or require synchronizing all optimizer states, tripling communication costs. We propose Desynced Low Communication Adaptive Optimizers (DES-LOC), a family of optimizers assigning independent synchronization periods to parameters and momenta, enabling lower communication costs while preserving convergence. Through extensive experiments on language models of up to 1.7B, we show that DES-LOC can communicate 170x less than DDP and 2x less than the previous state-of-the-art Local ADAM. Furthermore, unlike previous heuristic approaches, DES-LOC is suited for practical training scenarios prone to system failures. DES-LOC offers a scalable, bandwidth-efficient, and fault-tolerant solution for foundation model training.
Submitted 28 May, 2025;
originally announced May 2025.
-
SparsyFed: Sparse Adaptive Federated Training
Authors:
Adriano Guastella,
Lorenzo Sani,
Alex Iacob,
Alessio Mora,
Paolo Bellavista,
Nicholas D. Lane
Abstract:
Sparse training is often adopted in cross-device federated learning (FL) environments where constrained devices collaboratively train a machine learning model on private data by exchanging pseudo-gradients across heterogeneous networks. Although sparse training methods can reduce communication overhead and computational burden in FL, they are often not used in practice for the following key reasons: (1) data heterogeneity makes it harder for clients to reach consensus on sparse models compared to dense ones, requiring longer training; (2) methods for obtaining sparse masks lack adaptivity to accommodate very heterogeneous data distributions, crucial in cross-device FL; and (3) additional hyperparameters are required, which are notably challenging to tune in FL. This paper presents SparsyFed, a practical federated sparse training method that critically addresses the problems above. Previous works have only solved one or two of these challenges at the expense of introducing new trade-offs, such as clients' consensus on masks versus sparsity pattern adaptivity. We show that SparsyFed simultaneously (1) can produce 95% sparse models, with negligible degradation in accuracy, while only needing a single hyperparameter, (2) achieves a per-round weight regrowth 200 times smaller than previous methods, and (3) allows the sparse masks to adapt to highly heterogeneous data distributions and outperform all baselines under such conditions.
Submitted 7 April, 2025;
originally announced April 2025.
-
BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology
Authors:
Ludovico Mitchener,
Jon M Laurent,
Benjamin Tenmann,
Siddharth Narayanan,
Geemi P Wellawatte,
Andrew White,
Lorenzo Sani,
Samuel G Rodriques
Abstract:
Large Language Models (LLMs) and LLM-based agents show great promise in accelerating scientific research. Existing benchmarks for measuring this potential and guiding future development continue to evolve from pure recall and rote knowledge tasks, towards more practical work such as literature review and experimental planning. Bioinformatics is a domain where fully autonomous AI-driven discovery may be near, but no extensive benchmarks for measuring progress have been introduced to date. We therefore present the Bioinformatics Benchmark (BixBench), a dataset comprising over 50 real-world scenarios of practical biological data analysis with nearly 300 associated open-answer questions designed to measure the ability of LLM-based agents to explore biological datasets, perform long, multi-step analytical trajectories, and interpret the nuanced results of those analyses. We evaluate the performance of two frontier LLMs (GPT-4o and Claude 3.5 Sonnet) using a custom agent framework we open source. We find that even the latest frontier models only achieve 17% accuracy in the open-answer regime, and no better than random in a multiple-choice setting. By exposing the current limitations of frontier models, we hope BixBench can spur the development of agents capable of conducting rigorous bioinformatic analysis and accelerate scientific discovery.
Submitted 7 March, 2025; v1 submitted 28 February, 2025;
originally announced March 2025.
-
Low-Eddington ratio, changing-look active galactic nuclei: the case of NGC 4614
Authors:
Elisabeta Lusso,
Lapo Casetti,
Marco Romoli,
Lara Fossi,
Emanuele Nardini,
Emanuele Arra,
Benedetta Barsi,
Clarissa Calamai,
Francesca Campani,
Riccardo Capogrosso,
Francesco Chiti Tegli,
Riccardo Ciantini,
Eirini Demertzi,
Marina A. Gaitani,
Asia Giudice,
Alessia Gori,
Lorenzo Graziani,
Laura Macchiarini,
Marianna Michelagnoli,
Chiara Niccolai,
Irene Parenti,
Simone Pistolesi,
Martina Rago,
Ofelia Romani,
Leonardo Sani,
et al. (5 additional authors not shown)
Abstract:
Active galactic nuclei (AGN) are known to be variable sources across the entire electromagnetic spectrum, in particular at optical/ultraviolet and X-ray energies. Over the past decades, a growing number of AGN have displayed type transitions: from type 1 to type 2 or vice versa within a few years or even several months. These galaxies have been commonly referred to as changing-look AGN (CLAGN). Here we report on a new CLAGN, NGC 4614, which transitioned from a type 1.9 to a type 2 state. NGC 4614 is a nearly face-on barred galaxy at redshift $z = 0.016$, classified as a low-luminosity AGN. Its central black hole has a mass of about $1.6\times 10^7 M_\odot$ and an Eddington ratio around 1 percent. We recently acquired optical spectra of NGC 4614 at the Telescopio Nazionale Galileo and the data clearly suggest that the broad H$\alpha$ component has strongly dimmed, if not disappeared. A very recent Swift observation confirmed our current optical data, with the AGN weakened by almost a factor of 10 with respect to previous X-ray observations. Indeed, NGC 4614 had also been observed by Swift/XRT 6 times in 2011, when the source was clearly detected in all observations. By fitting the stack of the 2011 Swift observations we obtain a photon index of $\Gamma=1.3\pm0.3$ and an equivalent hydrogen column density of $N_{\rm H} = (1.2\pm0.3)\times 10^{22}$ cm$^{-2}$, indicating that NGC 4614 can be moderately absorbed in the X-rays. Although a significant change in the foreground gas absorption that may have obscured the broad line region cannot be entirely ruled out, the most likely explanation for our optical and X-ray data is that NGC 4614 is experiencing a change in the accretion state that reduces the radiative efficiency of the X-ray corona.
Submitted 24 February, 2025;
originally announced February 2025.
-
LUNAR: LLM Unlearning via Neural Activation Redirection
Authors:
William F. Shen,
Xinchi Qiu,
Meghdad Kurmanji,
Alex Iacob,
Lorenzo Sani,
Yihong Chen,
Nicola Cancedda,
Nicholas D. Lane
Abstract:
Large Language Models (LLMs) benefit from training on ever larger amounts of textual data, but as a result, they increasingly incur the risk of leaking private information. The ability to selectively remove knowledge from LLMs is, therefore, a highly desirable capability. In this paper, we propose LUNAR, a novel unlearning methodology grounded in the Linear Representation Hypothesis. LUNAR operates by redirecting the representations of unlearned data to regions that trigger the model's inherent ability to express its inability to answer. LUNAR achieves state-of-the-art unlearning performance while significantly enhancing the controllability of the unlearned model during inference. Specifically, LUNAR achieves improvements of between 2.9x and 11.7x on the combined "unlearning efficacy" and "model utility" score ("Deviation Score") on the PISTOL dataset across various base models. We also demonstrate, through quantitative analysis and qualitative examples, LUNAR's superior controllability in generating coherent and contextually aware responses, mitigating undesired side effects of existing methods. Moreover, we demonstrate that LUNAR is robust against white-box adversarial attacks and versatile in handling real-world scenarios, such as processing sequential unlearning requests.
Submitted 10 February, 2025;
originally announced February 2025.
-
Photon: Federated LLM Pre-Training
Authors:
Lorenzo Sani,
Alex Iacob,
Zeyu Cao,
Royson Lee,
Bill Marino,
Yan Gao,
Dongqi Cai,
Zexi Li,
Wanru Zhao,
Xinchi Qiu,
Nicholas D. Lane
Abstract:
Scaling large language models (LLMs) demands extensive data and computing resources, which are traditionally constrained to data centers by the high-bandwidth requirements of distributed training. Low-bandwidth methods like federated learning (FL) could enable collaborative training of larger models across weakly-connected GPUs if they can effectively be used for pre-training. To achieve this, we introduce Photon, the first complete system for federated end-to-end LLM training, leveraging cross-silo FL for global-scale training with minimal communication overheads. Using Photon, we train the first federated family of decoder-only LLMs from scratch. We show that: (1) Photon can train model sizes up to 7B in a federated fashion while reaching an even better perplexity than centralized pre-training; (2) Photon model training time decreases with available compute, achieving a similar compute-time trade-off to centralized; and (3) Photon outperforms the wall-time of baseline distributed training methods by 35% via communicating 64x-512x less. Our proposal is robust to data heterogeneity and converges twice as fast as previous methods like DiLoCo. This surprising data efficiency stems from a unique approach combining small client batch sizes with extremely high learning rates, enabled by federated averaging's robustness to hyperparameters. Photon thus represents the first economical system for global internet-wide LLM pre-training.
Submitted 5 November, 2024;
originally announced November 2024.
-
DEPT: Decoupled Embeddings for Pre-training Language Models
Authors:
Alex Iacob,
Lorenzo Sani,
Meghdad Kurmanji,
William F. Shen,
Xinchi Qiu,
Dongqi Cai,
Yan Gao,
Nicholas D. Lane
Abstract:
Language Model pre-training uses broad data mixtures to enhance performance across domains and languages. However, training on such heterogeneous text corpora requires extensive and expensive efforts. Since these data sources vary significantly in lexical, syntactic, and semantic aspects, they cause negative interference or the "curse of multilinguality". To address these challenges, we propose a communication-efficient pre-training framework, DEPT. Our method decouples embeddings from the transformer body while simultaneously training the latter on multiple data sources without requiring a shared vocabulary. DEPT can: (1) train robustly and effectively under significant data heterogeneity, (2) minimize token embedding parameters to only what the data source vocabulary requires, while cutting communication costs in direct proportion to both the communication frequency and the reduction in parameters, (3) enhance transformer body plasticity and generalization, improving both average perplexity (up to 20%) and downstream task performance, and (4) enable training with custom optimized vocabularies per data source. We demonstrate DEPT's potential via the first vocabulary-agnostic federated pre-training of billion-scale models, reducing communication costs by orders of magnitude and embedding memory by 4-5x.
Submitted 7 April, 2025; v1 submitted 7 October, 2024;
originally announced October 2024.
-
Sheaf HyperNetworks for Personalized Federated Learning
Authors:
Bao Nguyen,
Lorenzo Sani,
Xinchi Qiu,
Pietro Liò,
Nicholas D. Lane
Abstract:
Graph hypernetworks (GHNs), constructed by combining graph neural networks (GNNs) with hypernetworks (HNs), leverage relational data across various domains such as neural architecture search, molecular property prediction and federated learning. Despite GNNs and HNs being individually successful, we show that GHNs present problems compromising their performance, such as over-smoothing and heterophily. Moreover, we cannot apply GHNs directly to personalized federated learning (PFL) scenarios, where an a priori client relation graph may be absent, private, or inaccessible. To mitigate these limitations in the context of PFL, we propose a novel class of HNs, sheaf hypernetworks (SHNs), which combine cellular sheaf theory with HNs to improve parameter sharing for PFL. We thoroughly evaluate SHNs across diverse PFL tasks, including multi-class classification, traffic and weather forecasting. Additionally, we provide a methodology for constructing client relation graphs in scenarios where such graphs are unavailable. We show that SHNs consistently outperform existing PFL solutions in complex non-IID scenarios. While the baselines' performance fluctuates depending on the task, SHNs show improvements of up to 2.7% in accuracy and 5.3% in mean squared error over the best-performing baseline.
Submitted 31 May, 2024;
originally announced May 2024.
-
Worldwide Federated Training of Language Models
Authors:
Alex Iacob,
Lorenzo Sani,
Bill Marino,
Preslav Aleksandrov,
William F. Shen,
Nicholas Donald Lane
Abstract:
The reliance of language model training on massive amounts of computation and vast datasets scraped from potentially low-quality, copyrighted, or sensitive data has come into question practically, legally, and ethically. Federated learning provides a plausible alternative by enabling previously untapped data to be voluntarily gathered from collaborating organizations. However, when scaled globally, federated learning requires collaboration across heterogeneous legal, security, and privacy regimes while accounting for the inherent locality of language data; this further exacerbates the established challenge of federated statistical heterogeneity. We propose a Worldwide Federated Language Model Training (WorldLM) system based on federations of federations, where each federation has the autonomy to account for factors such as its industry, operating jurisdiction, or competitive environment. WorldLM enables such autonomy in the presence of statistical heterogeneity via partial model localization by allowing sub-federations to attentively aggregate key layers from their constituents. Furthermore, it can adaptively share information across federations via residual layer embeddings. Evaluations of language modeling on naturally heterogeneous datasets show that WorldLM outperforms standard federations by up to $1.91\times$, approaches the personalized performance of fully local models, and maintains these advantages under privacy-enhancing techniques.
Submitted 27 May, 2024; v1 submitted 23 May, 2024;
originally announced May 2024.
-
The Future of Large Language Model Pre-training is Federated
Authors:
Lorenzo Sani,
Alex Iacob,
Zeyu Cao,
Bill Marino,
Yan Gao,
Tomas Paulik,
Wanru Zhao,
William F. Shen,
Preslav Aleksandrov,
Xinchi Qiu,
Nicholas D. Lane
Abstract:
Generative pre-trained large language models (LLMs) have demonstrated impressive performance over a wide range of tasks, thanks to the unprecedented amount of data they have been trained on. As established scaling laws indicate, LLMs' future performance improvement depends on the amount of computing and data sources they can leverage for pre-training. Federated learning (FL) has the potential to unleash the majority of the planet's data and computational resources, which are underutilized by the data-center-focused training methodology of current LLM practice. Our work presents a robust, flexible, reproducible FL approach that enables large-scale collaboration across institutions to train LLMs. We propose a scalable deployment system called Photon to enable the investigation and development of this new training paradigm for LLM pre-training. We show that Photon can be used by organizations interested in collaborating with their private data sources and computational resources for pre-training LLMs with billions of parameters. This paradigm would mobilize more computational and data resources while matching or potentially exceeding centralized performance. We further show the effectiveness of the federated training scales with model size and present our approach for training billion-scale federated LLMs using limited resources. Thus far, we have used Photon to train LLM models to the size of 7B parameters and anticipate larger models being completed in the near future. Finally, we show that LLM training is highly resilient to the classical challenges of federated statistical and hardware heterogeneity. Furthermore, we show that convergence is robust to partial participation, opening the avenue for compute-efficient collaborative training. Photon will help data-rich actors to become the protagonists of LLMs pre-training instead of leaving the stage to compute-rich actors alone.
Submitted 14 October, 2024; v1 submitted 17 May, 2024;
originally announced May 2024.
-
FedAnchor: Enhancing Federated Semi-Supervised Learning with Label Contrastive Loss for Unlabeled Clients
Authors:
Xinchi Qiu,
Yan Gao,
Lorenzo Sani,
Heng Pan,
Wanru Zhao,
Pedro P. B. Gusmao,
Mina Alibeigi,
Alex Iacob,
Nicholas D. Lane
Abstract:
Federated learning (FL) is a distributed learning paradigm that facilitates collaborative training of a shared global model across devices while keeping data localized. The deployment of FL in numerous real-world applications faces delays, primarily due to the prevalent reliance on supervised tasks. Generating detailed labels at edge devices, if feasible, is demanding, given resource constraints and the imperative for continuous data updates. In addressing these challenges, solutions such as federated semi-supervised learning (FSSL), which relies on unlabeled clients' data and a limited amount of labeled data on the server, become pivotal. In this paper, we propose FedAnchor, an innovative FSSL method that introduces a unique double-head structure, called anchor head, paired with the classification head trained exclusively on labeled anchor data on the server. The anchor head is empowered with a newly designed label contrastive loss based on the cosine similarity metric. Our approach mitigates the confirmation bias and overfitting issues associated with pseudo-labeling techniques based on high-confidence model prediction samples. Extensive experiments on CIFAR10/100 and SVHN datasets demonstrate that our method outperforms the state-of-the-art method by a significant margin in terms of convergence rate and model accuracy.
Submitted 15 February, 2024;
originally announced February 2024.
-
Pollen: High-throughput Federated Learning Simulation via Resource-Aware Client Placement
Authors:
Lorenzo Sani,
Pedro Porto Buarque de Gusmão,
Alex Iacob,
Wanru Zhao,
Xinchi Qiu,
Yan Gao,
Javier Fernandez-Marques,
Nicholas Donald Lane
Abstract:
Federated Learning (FL) is a privacy-focused machine learning paradigm that collaboratively trains models directly on edge devices. Simulation plays an essential role in FL adoption, helping develop novel aggregation and client sampling strategies. However, current simulators cannot emulate large-scale systems in a time-efficient manner, which limits their utility and casts doubts on generalizability.
This work proposes Pollen, a novel resource-aware system for speeding up simulations. Pollen addresses two limiting factors from existing simulators: (a) communication inefficiency derived from pull-based client execution and (b) inadequate load balance when using heterogeneous hardware. Pollen executes high-throughput FL simulations at scale by (a) using a push-based client placement system, (b) learning an adaptable scheduling of clients based on hardware statistics, and (c) estimating the optimal number of concurrent workers per GPU. We evaluate Pollen on four representative FL tasks and show that Pollen's placement model increases GPU utilization and reduces idle time. We compare Pollen to Flower, Flute, FedScale, Parrot, and pfl and show experimental speed-ups of days or weeks.
Submitted 20 May, 2024; v1 submitted 30 June, 2023;
originally announced June 2023.
-
Flower: A Friendly Federated Learning Research Framework
Authors:
Daniel J. Beutel,
Taner Topal,
Akhil Mathur,
Xinchi Qiu,
Javier Fernandez-Marques,
Yan Gao,
Lorenzo Sani,
Kwing Hei Li,
Titouan Parcollet,
Pedro Porto Buarque de Gusmão,
Nicholas D. Lane
Abstract:
Federated Learning (FL) has emerged as a promising technique for edge devices to collaboratively learn a shared prediction model, while keeping their training data on the device, thereby decoupling the ability to do machine learning from the need to store the data in the cloud. However, FL is difficult to implement realistically, both in terms of scale and systems heterogeneity. Although there are a number of research frameworks available to simulate FL algorithms, they do not support the study of scalable FL workloads on heterogeneous edge devices.
In this paper, we present Flower -- a comprehensive FL framework that distinguishes itself from existing platforms by offering new facilities to execute large-scale FL experiments and consider richly heterogeneous FL device scenarios. Our experiments show Flower can perform FL experiments up to 15M in client size using only a pair of high-end GPUs. Researchers can then seamlessly migrate experiments to real devices to examine other parts of the design space. We believe Flower provides the community with a critical new tool for FL study and development.
Submitted 5 March, 2022; v1 submitted 28 July, 2020;
originally announced July 2020.
-
Genetic Algorithms for the Optimization of Diffusion Parameters in Content-Based Image Retrieval
Authors:
Federico Magliani,
Laura Sani,
Stefano Cagnoni,
Andrea Prati
Abstract:
Several computer vision and artificial intelligence projects are nowadays exploiting the manifold data distribution using, e.g., the diffusion process. This approach has produced dramatic improvements on the final performance thanks to the application of such algorithms to the kNN graph. Unfortunately, this recent technique needs a manual configuration of several parameters, thus it is not straightforward to find the best configuration for each dataset. Moreover, the brute-force approach is computationally very demanding when used to optimally set the parameters of the diffusion approach. We propose to use genetic algorithms to find the optimal setting of all the diffusion parameters with respect to retrieval performance for each different dataset. Our approach is faster than others used as references (brute-force, random-search and PSO). A comparison with these methods has been made on three public image datasets: Oxford5k, Paris6k and Oxford105k.
Submitted 19 August, 2019;
originally announced August 2019.
-
Isolario: a Do-ut-des Approach to Improve the Appeal of BGP Route Collecting
Authors:
Enrico Gregori,
Alessandro Improta,
Luca Sani
Abstract:
The incompleteness of data collected from BGP route collecting projects is a well-known issue which potentially affects every research activity carried out on the analysis of the Internet inter-domain routing. Recent works explained that one of the possible solutions is to increase the number of ASes feeding these projects from the Internet periphery, in order to reveal the hidden portion of peering connectivity of their upstream providers. The main problem is that these projects are currently not appealing enough for the network administrators of these ASes, which are typically not aware of their existence or not interested enough to share their data. Our contribution is Isolario, a project based on the do-ut-des principle which aims at persuading network administrators to share their routing information by offering services in return, ranging from real-time analyses of the incoming BGP session(s) to historic analyses of routing reachability. To the best of our knowledge, Isolario is the only route collecting project publicly available which offers a set of services to its users to encourage their participation, aiming at increasing the amount of BGP data publicly available for research purposes.
Submitted 21 November, 2016;
originally announced November 2016.