-
HashEvict: A Pre-Attention KV Cache Eviction Strategy using Locality-Sensitive Hashing
Authors:
Minghui Liu,
Tahseen Rabbani,
Tony O'Halloran,
Ananth Sankaralingam,
Mary-Anne Hartley,
Furong Huang,
Cornelia Fermüller,
Yiannis Aloimonos
Abstract:
Transformer-based large language models (LLMs) use the key-value (KV) cache to significantly accelerate inference by storing the key and value embeddings of past tokens. However, this cache consumes significant GPU memory. In this work, we introduce HashEvict, an algorithm that uses locality-sensitive hashing (LSH) to compress the KV cache. HashEvict quickly locates tokens in the cache that are cosine dissimilar to the current query token. This is achieved by computing the Hamming distance between binarized Gaussian projections of the current token query and cached token keys, with a projection length much smaller than the embedding dimension. We maintain a lightweight binary structure in GPU memory to facilitate these calculations. Unlike existing compression strategies that compute attention to determine token retention, HashEvict makes these decisions pre-attention, thereby reducing computational costs. Additionally, HashEvict is dynamic - at every decoding step, the key and value of the current token replace the embeddings of a token expected to produce the lowest attention score. We demonstrate that HashEvict can compress the KV cache by 30%-70% while maintaining high performance across reasoning, multiple-choice, long-context retrieval and summarization tasks.
Submitted 4 June, 2025; v1 submitted 13 December, 2024;
originally announced December 2024.
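The eviction step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the pure-Python lists, the fixed seed, and the tie-breaking rule (first maximum wins) are all assumptions made for illustration. Keys and the query are hashed with the same random Gaussian projection, each projection is reduced to its sign bits, and the cached slot whose code is farthest from the query's code in Hamming distance is chosen for replacement, without computing any attention scores.

```python
import random


def make_projection(dim, n_bits, seed=0):
    """Random Gaussian projection with n_bits rows (n_bits is meant to be << dim)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_code(vec, projection):
    """Binarize the projection of vec: one sign bit per projection row."""
    code = 0
    for row in projection:
        dot = sum(r * v for r, v in zip(row, vec))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two non-negative integer codes."""
    return bin(a ^ b).count("1")


def eviction_index(query_code, key_codes):
    """Slot whose key code is farthest from the query code (first maximum on ties)."""
    return max(range(len(key_codes)), key=lambda i: hamming(query_code, key_codes[i]))


# Hypothetical toy cache: three cached keys of dimension 4, hashed to 16 bits.
projection = make_projection(dim=4, n_bits=16)
query = [1.0, 2.0, -0.5, 0.3]
keys = [query, [-x for x in query], [0.9, 2.1, -0.4, 0.2]]
key_codes = [lsh_code(k, projection) for k in keys]

# The slot holding the key that points away from the query is selected.
slot = eviction_index(lsh_code(query, projection), key_codes)

# The new token's code overwrites the evicted slot in the binary structure.
new_key = [0.5, 1.5, -0.2, 0.1]
key_codes[slot] = lsh_code(new_key, projection)
```

Only the short integer codes are compared at decision time, which is why the projection length matters more than the embedding dimension for the cost of this step.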
-
TabSeq: A Framework for Deep Learning on Tabular Data via Sequential Ordering
Authors:
Al Zadid Sultan Bin Habib,
Kesheng Wang,
Mary-Anne Hartley,
Gianfranco Doretto,
Donald A. Adjeroh
Abstract:
Effective analysis of tabular data still poses a significant problem in deep learning, mainly because features in tabular datasets are often heterogeneous and have different levels of relevance. This work introduces TabSeq, a novel framework for the sequential ordering of features, addressing the vital necessity to optimize the learning process. Features are not always equally informative, and for certain deep learning models, their random arrangement can hinder the model's learning capacity. Finding the optimum sequence order for such features could improve the deep learning models' learning process. The novel feature ordering technique we provide in this work is based on clustering and incorporates both local ordering and global ordering. It is designed to be used with a multi-head attention mechanism in a denoising autoencoder network. Our framework uses clustering to align comparable features and improve data organization. Multi-head attention focuses on essential characteristics, whereas the denoising autoencoder highlights important aspects by rebuilding from distorted inputs. This method improves the capability to learn from tabular data while lowering redundancy. Our research, demonstrating improved performance through appropriate feature sequence rearrangement using raw antibody microarray and two other real-world biomedical datasets, validates the impact of feature ordering. These results demonstrate that feature ordering can be a viable approach to improved deep learning of tabular data.
Submitted 21 October, 2024; v1 submitted 17 October, 2024;
originally announced October 2024.
-
Efficient Classification of Histopathology Images
Authors:
Mohammad Iqbal Nouyed,
Mary-Anne Hartley,
Gianfranco Doretto,
Donald A. Adjeroh
Abstract:
This work addresses how to efficiently classify challenging histopathology images, such as gigapixel whole-slide images for cancer diagnostics with image-level annotation. We use images with annotated tumor regions to identify a set of tumor patches and a set of benign patches in a cancerous slide. Due to the variable nature of the region of interest, the tumor-positive regions may make up an extreme minority of the pixels. This creates an important problem during patch-level classification, where the majority of patches from an image labeled as 'cancerous' are actually tumor-free. This problem is different from semantic segmentation, which associates a label with every pixel in an image, because after patch extraction we are only dealing with patch-level labels. Most existing approaches address the data imbalance issue by mitigating the data shortage in minority classes in order to prevent the model from being dominated by the majority classes. These methods include data re-sampling, loss re-weighting, margin modification, and data augmentation. In this work, we mitigate the patch-level class imbalance problem by taking a divide-and-conquer approach. First, we partition the data into sub-groups, and define three separate classification sub-problems based on these data partitions. Then, using an information-theoretic cluster-based sampling of deep image patch features, we sample discriminative patches from the sub-groups. Using these sampled patches, we build corresponding deep models to solve the new classification sub-problems. Finally, we integrate information learned from the respective models to make a final decision on the patches. Our results show that the proposed approach can perform competitively using a very low percentage of the available patches in a given whole-slide image.
Submitted 8 September, 2024;
originally announced September 2024.
-
Unlearning Information Bottleneck: Machine Unlearning of Systematic Patterns and Biases
Authors:
Ling Han,
Hao Huang,
Dustin Scheinost,
Mary-Anne Hartley,
María Rodríguez Martínez
Abstract:
Effective adaptation to distribution shifts in training data is pivotal for sustaining robustness in neural networks, especially when removing specific biases or outdated information, a process known as machine unlearning. Traditional approaches typically assume that data variations are random, which makes it difficult to adjust the model parameters accurately to remove patterns and characteristics from unlearned data. In this work, we present Unlearning Information Bottleneck (UIB), a novel information-theoretic framework designed to enhance the process of machine unlearning that effectively leverages the influence of systematic patterns and biases for parameter adjustment. By proposing a variational upper bound, we recalibrate the model parameters through a dynamic prior that integrates changes in data distribution with an affordable computational cost, allowing efficient and accurate removal of outdated or unwanted data patterns and biases. Our experiments across various datasets, models, and unlearning methods demonstrate that our approach effectively removes systematic patterns and biases while maintaining the performance of models post-unlearning.
Submitted 22 May, 2024;
originally announced May 2024.
-
Towards Independence Criterion in Machine Unlearning of Features and Labels
Authors:
Ling Han,
Nanqing Luo,
Hao Huang,
Jing Chen,
Mary-Anne Hartley
Abstract:
This work delves into the complexities of machine unlearning in the face of distributional shifts, particularly focusing on the challenges posed by non-uniform feature and label removal. With the advent of regulations like the GDPR emphasizing data privacy and the right to be forgotten, machine learning models face the daunting task of unlearning sensitive information without compromising their integrity or performance. Our research introduces a novel approach that leverages influence functions and principles of distributional independence to address these challenges. By proposing a comprehensive framework for machine unlearning, we aim to ensure privacy protection while maintaining model performance and adaptability across varying distributions. Our method not only facilitates efficient data removal but also dynamically adjusts the model to preserve its generalization capabilities. Through extensive experimentation, we demonstrate the efficacy of our approach in scenarios characterized by significant distributional shifts, making substantial contributions to the field of machine unlearning. This research paves the way for developing more resilient and adaptable unlearning techniques, ensuring models remain robust and accurate in the dynamic landscape of data privacy and machine learning.
Submitted 12 March, 2024;
originally announced March 2024.
-
TimEHR: Image-based Time Series Generation for Electronic Health Records
Authors:
Hojjat Karami,
Mary-Anne Hartley,
David Atienza,
Anisoara Ionescu
Abstract:
Time series in Electronic Health Records (EHRs) present unique challenges for generative models, such as irregular sampling, missing values, and high dimensionality. In this paper, we propose a novel generative adversarial network (GAN) model, TimEHR, to generate time series data from EHRs. In particular, TimEHR treats time series as images and is based on two conditional GANs. The first GAN generates missingness patterns, and the second GAN generates time series values based on the missingness pattern. Experimental results on three real-world EHR datasets show that TimEHR outperforms state-of-the-art methods in terms of fidelity, utility, and privacy metrics.
Submitted 9 February, 2024;
originally announced February 2024.
-
MEDITRON-70B: Scaling Medical Pretraining for Large Language Models
Authors:
Zeming Chen,
Alejandro Hernández Cano,
Angelika Romanou,
Antoine Bonnet,
Kyle Matoba,
Francesco Salvi,
Matteo Pagliardini,
Simin Fan,
Andreas Köpf,
Amirkeivan Mohtashami,
Alexandre Sallinen,
Alireza Sakhaeirad,
Vinitra Swamy,
Igor Krawczuk,
Deniz Bayazit,
Axel Marmet,
Syrielle Montariol,
Mary-Anne Hartley,
Martin Jaggi,
Antoine Bosselut
Abstract:
Large language models (LLMs) can potentially democratize access to medical knowledge. While many efforts have been made to harness and improve LLMs' medical knowledge and reasoning capacities, the resulting models are either closed-source (e.g., PaLM, GPT-4) or limited in scale (<= 13B parameters), which restricts their abilities. In this work, we improve access to large-scale medical LLMs by releasing MEDITRON: a suite of open-source LLMs with 7B and 70B parameters adapted to the medical domain. MEDITRON builds on Llama-2 (through our adaptation of Nvidia's Megatron-LM distributed trainer), and extends pretraining on a comprehensively curated medical corpus, including selected PubMed articles, abstracts, and internationally-recognized medical guidelines. Evaluations using four major medical benchmarks show significant performance gains over several state-of-the-art baselines before and after task-specific finetuning. Overall, MEDITRON achieves a 6% absolute performance gain over the best public baseline in its parameter class and 3% over the strongest baseline we finetuned from Llama-2. Compared to closed-source LLMs, MEDITRON-70B outperforms GPT-3.5 and Med-PaLM and is within 5% of GPT-4 and 10% of Med-PaLM-2. We release our code for curating the medical pretraining corpus and the MEDITRON model weights to drive open-source development of more capable medical LLMs.
Submitted 27 November, 2023;
originally announced November 2023.
-
MultiModN- Multimodal, Multi-Task, Interpretable Modular Networks
Authors:
Vinitra Swamy,
Malika Satayeva,
Jibril Frej,
Thierry Bossy,
Thijs Vogels,
Martin Jaggi,
Tanja Käser,
Mary-Anne Hartley
Abstract:
Predicting multiple real-world tasks in a single model often requires a particularly diverse feature space. Multimodal (MM) models aim to extract the synergistic predictive potential of multiple data types to create a shared feature space with aligned semantic meaning across inputs of drastically varying sizes (i.e. images, text, sound). Most current MM architectures fuse these representations in parallel, which not only limits their interpretability but also creates a dependency on modality availability. We present MultiModN, a multimodal, modular network that fuses latent representations in a sequence of any number, combination, or type of modality while providing granular real-time predictive feedback on any number or combination of predictive tasks. MultiModN's composable pipeline is interpretable-by-design, as well as innately multi-task and robust to the fundamental issue of biased missingness. We perform four experiments on several benchmark MM datasets across 10 real-world tasks (predicting medical diagnoses, academic performance, and weather), and show that MultiModN's sequential MM fusion does not compromise performance compared with a baseline of parallel fusion. By simulating the challenging bias of missing not-at-random (MNAR), this work shows that, contrary to MultiModN, parallel fusion baselines erroneously learn MNAR and suffer catastrophic failure when faced with different patterns of MNAR at inference. To the best of our knowledge, this is the first inherently MNAR-resistant approach to MM modeling. In conclusion, MultiModN provides granular insights, robustness, and flexibility without compromising performance.
Submitted 6 November, 2023; v1 submitted 25 September, 2023;
originally announced September 2023.
-
Modular Clinical Decision Support Networks (MoDN) -- Updatable, Interpretable, and Portable Predictions for Evolving Clinical Environments
Authors:
Cécile Trottet,
Thijs Vogels,
Martin Jaggi,
Mary-Anne Hartley
Abstract:
Data-driven Clinical Decision Support Systems (CDSS) have the potential to improve and standardise care with personalised probabilistic guidance. However, the size of data required necessitates collaborative learning from analogous CDSS's, which are often unsharable or imperfectly interoperable (IIO), meaning their feature sets are not perfectly overlapping. We propose Modular Clinical Decision Support Networks (MoDN) which allow flexible, privacy-preserving learning across IIO datasets, while providing interpretable, continuous predictive feedback to the clinician.
MoDN is a novel decision tree composed of feature-specific neural network modules. It creates dynamic personalised representations of patients, and can make multiple predictions of diagnoses, updatable at each step of a consultation. The modular design allows it to compartmentalise training updates to specific features and collaboratively learn between IIO datasets without sharing any data.
Submitted 12 November, 2022;
originally announced November 2022.
-
Optimal Model Averaging: Towards Personalized Collaborative Learning
Authors:
Felix Grimberg,
Mary-Anne Hartley,
Sai P. Karimireddy,
Martin Jaggi
Abstract:
In federated learning, differences in the data or objectives between the participating nodes motivate approaches to train a personalized machine learning model for each node. One such approach is weighted averaging between a locally trained model and the global model. In this theoretical work, we study weighted model averaging for arbitrary scalar mean estimation problems under minimal assumptions on the distributions. In a variant of the bias-variance trade-off, we find that there is always some positive amount of model averaging that reduces the expected squared error compared to the local model, provided only that the local model has a non-zero variance. Further, we quantify the (possibly negative) benefit of weighted model averaging as a function of the weight used and the optimal weight. Taken together, this work formalizes an approach to quantify the value of personalization in collaborative learning and provides a framework for future research to test the findings in multivariate parameter estimation and under a range of assumptions.
Submitted 25 October, 2021;
originally announced October 2021.
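The bias-variance claim in this abstract can be illustrated with a short worked derivation. It is a sketch under simplifying assumptions that are not taken from the paper: the local estimator is unbiased, the global estimator has bias $b$, and the two are independent. The symbols $\hat\theta_L$, $\hat\theta_G$, $\sigma_L^2$, $\sigma_G^2$, $b$, and $\alpha$ are introduced here for illustration only.

```latex
% Assumed setup (illustrative, not the paper's general setting):
%   E[\hat\theta_L] = \theta,      Var(\hat\theta_L) = \sigma_L^2 > 0,
%   E[\hat\theta_G] = \theta + b,  Var(\hat\theta_G) = \sigma_G^2,
%   \hat\theta_L and \hat\theta_G independent.
\begin{align*}
\hat\theta_\alpha &= (1-\alpha)\,\hat\theta_L + \alpha\,\hat\theta_G, \\
\mathrm{MSE}(\alpha) &= (1-\alpha)^2 \sigma_L^2 + \alpha^2 \left(\sigma_G^2 + b^2\right), \\
\left.\frac{d\,\mathrm{MSE}}{d\alpha}\right|_{\alpha=0} &= -2\sigma_L^2 < 0, \\
\alpha^\star &= \frac{\sigma_L^2}{\sigma_L^2 + \sigma_G^2 + b^2} \in (0, 1], \\
\mathrm{MSE}(0) - \mathrm{MSE}(\alpha) &= \left(\sigma_L^2 + \sigma_G^2 + b^2\right)\alpha\left(2\alpha^\star - \alpha\right).
\end{align*}
```

Under these assumptions, the negative slope at $\alpha = 0$ shows that some positive amount of averaging lowers the expected squared error whenever $\sigma_L^2 > 0$. The last line expresses the benefit through the weight used and the optimal weight: it is positive for $0 < \alpha < 2\alpha^\star$ and negative beyond that.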
-
WAFFLE: Weighted Averaging for Personalized Federated Learning
Authors:
Martin Beaussart,
Felix Grimberg,
Mary-Anne Hartley,
Martin Jaggi
Abstract:
In federated learning, model personalization can be a very effective strategy to deal with heterogeneous training data across clients. We introduce WAFFLE (Weighted Averaging For Federated LEarning), a personalized collaborative machine learning algorithm that leverages stochastic control variates for faster convergence. WAFFLE uses the Euclidean distance between clients' updates to weigh their individual contributions and thus minimize the personalized model loss on the specific agent of interest. Through a series of experiments, we compare our new approach to two recent personalized federated learning methods--Weight Erosion and APFL--as well as two general FL methods--Federated Averaging and SCAFFOLD. Performance is evaluated using two categories of non-identical client data distributions--concept shift and label skew--on two image data sets (MNIST and CIFAR10). Our experiments demonstrate the comparative effectiveness of WAFFLE, as it achieves or improves accuracy with faster convergence.
Submitted 13 December, 2021; v1 submitted 13 October, 2021;
originally announced October 2021.
-
IFedAvg: Interpretable Data-Interoperability for Federated Learning
Authors:
David Roschewitz,
Mary-Anne Hartley,
Luca Corinzia,
Martin Jaggi
Abstract:
Recently, the ever-growing demand for privacy-oriented machine learning has motivated researchers to develop federated and decentralized learning techniques, allowing individual clients to train models collaboratively without disclosing their private datasets. However, widespread adoption has been limited in domains relying on high levels of user trust, where assessment of data compatibility is essential. In this work, we define and address low interoperability induced by underlying client data inconsistencies in federated learning for tabular data. The proposed method, iFedAvg, builds on federated averaging, adding local element-wise affine layers to allow for a personalized and granular understanding of the collaborative learning process. This enables the detection of outlier datasets in the federation and the learning of compensation for local data distribution shifts without sharing any original data. We evaluate iFedAvg using several public benchmarks and a previously unstudied collection of real-world datasets from the 2014-2016 West African Ebola epidemic, jointly forming the largest such dataset in the world. In all evaluations, iFedAvg achieves competitive average performance with negligible overhead. It additionally shows substantial improvement on outlier clients, highlighting increased robustness to individual dataset shifts. Most importantly, our method provides valuable client-specific insights at a fine-grained level to guide interoperable federated learning.
Submitted 14 July, 2021;
originally announced July 2021.
-
Cheryl's Birthday
Authors:
Hans van Ditmarsch,
Michael Ian Hartley,
Barteld Kooi,
Jonathan Welton,
Joseph B. W. Yeo
Abstract:
We present four logic puzzles and after that their solutions. Joseph Yeo designed 'Cheryl's Birthday'. Mike Hartley came up with a novel solution for 'One Hundred Prisoners and a Light Bulb'. Jonathan Welton designed 'A Blind Guess' and 'Abby's Birthday'. Hans van Ditmarsch and Barteld Kooi authored the puzzlebook 'One Hundred Prisoners and a Light Bulb' that contains other knowledge puzzles, and that can also be found on the webpage http://personal.us.es/hvd/lightbulb.html dedicated to the book.
Submitted 27 July, 2017;
originally announced August 2017.
-
Kilombo: a Kilobot simulator to enable effective research in swarm robotics
Authors:
Fredrik Jansson,
Matthew Hartley,
Martin Hinsch,
Ivica Slavkov,
Noemí Carranza,
Tjelvar S. G. Olsson,
Roland M. Dries,
Johanna H. Grönqvist,
Athanasius F. M. Marée,
James Sharpe,
Jaap A. Kaandorp,
Verônica A. Grieneisen
Abstract:
The Kilobot is a widely used platform for investigation of swarm robotics. Physical Kilobots are slow moving and require frequent recalibration and charging, which significantly slows down the development cycle. Simulators can speed up the process of testing, exploring and hypothesis generation, but usually require time consuming and error-prone translation of code between simulator and robot. Moreover, code of different nature often obfuscates direct comparison, as well as determination of the cause of deviation, between simulator and actual robot swarm behaviour. To tackle these issues we have developed a C-based simulator that allows those working with Kilobots to use the same programme code in both the simulator and the physical robots. Use of our simulator, coined Kilombo, significantly simplifies and speeds up development, given that a simulation of 1000 robots can be run at a speed 100 times faster than real time on a desktop computer, making high-throughput pre-screening possible of potential algorithms that could lead to desired emergent behaviour. We argue that this strategy, here specifically developed for Kilobots, is of general importance for effective robot swarm research. The source code is freely available under the MIT license.
Submitted 9 May, 2016; v1 submitted 13 November, 2015;
originally announced November 2015.