-
Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement
Authors:
Simon Yu,
Liangyu Chen,
Sara Ahmadian,
Marzieh Fadaee
Abstract:
Finetuning large language models on instruction data is crucial for enhancing pre-trained knowledge and improving instruction-following capabilities. As instruction datasets proliferate, selecting optimal data for effective training becomes increasingly important. This work addresses the question: How can we determine the optimal subset of data for effective training? While existing research often emphasizes local criteria like instance quality for subset selection, we argue that a global approach focused on data diversity is more critical. Our method employs k-means clustering to ensure the selected subset effectively represents the full dataset. We propose an iterative refinement method inspired by active learning techniques to resample instances from clusters, reassessing each cluster's importance and sampling weight in every training iteration. This approach reduces the effect of outliers and automatically filters out clusters containing low-quality data. Through extensive evaluation across natural language reasoning, general world knowledge, code and math reasoning tasks, and by fine-tuning models from various families, we observe consistent improvements, achieving a 7% increase over random selection and a 3.8% improvement over state-of-the-art sampling methods. Our work highlights the significance of diversity-first sampling when finetuning LLMs to enhance performance across a broad array of evaluation tasks. Our code is available at https://github.com/for-ai/iterative-data-selection.
Submitted 17 September, 2024;
originally announced September 2024.
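The resampling step described in the abstract above can be illustrated with a minimal sketch. It assumes cluster assignments (for example from k-means) and per-cluster sampling weights are already available; the function name `sample_by_cluster`, the proportional quota rule, and the weights themselves are hypothetical, since the abstract does not specify how weights are computed or updated.

```python
import random


def sample_by_cluster(cluster_of, weights, budget, seed=0):
    """Draw roughly `budget` indices, splitting the budget across clusters
    in proportion to `weights` (hypothetical rule, not the paper's)."""
    rng = random.Random(seed)

    # Group instance indices by their cluster id.
    members = {}
    for idx, cluster in enumerate(cluster_of):
        members.setdefault(cluster, []).append(idx)

    total = sum(weights[c] for c in members)
    selected = []
    for cluster, idxs in members.items():
        # Proportional quota, capped by the cluster size.
        quota = min(len(idxs), round(budget * weights[cluster] / total))
        selected.extend(rng.sample(idxs, quota))
    return selected


# Toy data: two clusters of four instances; cluster 0 is weighted 3x higher.
cluster_of = [0, 0, 0, 0, 1, 1, 1, 1]
picked = sample_by_cluster(cluster_of, {0: 3.0, 1: 1.0}, budget=4)
```

Because quotas are rounded per cluster, the total drawn may differ slightly from `budget`. In an iterative scheme, the weights would be re-estimated after each training round and the function called again; giving a cluster a weight of zero removes it from the sample.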
-
GIST: Greedy Independent Set Thresholding for Diverse Data Summarization
Authors:
Matthew Fahrbach,
Srikumar Ramalingam,
Morteza Zadimoghaddam,
Sara Ahmadian,
Gui Citovsky,
Giulia DeSalvo
Abstract:
We introduce a novel subset selection problem called min-distance diversification with monotone submodular utility ($\textsf{MDMS}$), which has a wide variety of applications in machine learning, e.g., data sampling and feature selection. Given a set of points in a metric space, the goal of $\textsf{MDMS}$ is to maximize an objective function combining a monotone submodular utility term and a min-distance diversity term between any pair of selected points, subject to a cardinality constraint. We propose the $\texttt{GIST}$ algorithm, which achieves a $\frac{1}{2}$-approximation guarantee for $\textsf{MDMS}$ by approximating a series of maximum independent set problems with a bicriteria greedy algorithm. We also prove that it is NP-hard to approximate to within a factor of $0.5584$. Finally, we demonstrate that $\texttt{GIST}$ outperforms existing benchmarks on a real-world image classification task that studies single-shot subset selection for ImageNet.
Submitted 10 February, 2025; v1 submitted 29 May, 2024;
originally announced May 2024.
-
Unmasking Vulnerabilities: Cardinality Sketches under Adaptive Inputs
Authors:
Sara Ahmadian,
Edith Cohen
Abstract:
Cardinality sketches are popular data structures that enhance the efficiency of working with large data sets. The sketches are randomized representations of sets that are only of logarithmic size but can support set merges and approximate cardinality (i.e., distinct count) queries. When queries are not adaptive, that is, they do not depend on preceding query responses, the design provides strong guarantees of correctly answering a number of queries exponential in the sketch size $k$.
In this work, we investigate the performance of cardinality sketches in adaptive settings and unveil inherent vulnerabilities. We design an attack against the "standard" estimators that constructs an adversarial input by post-processing responses to a set of simple non-adaptive queries of size linear in the sketch size $k$. Empirically, our attack used only $4k$ queries with the widely used HyperLogLog (HLL++)~\citep{hyperloglog:2007,hyperloglogpractice:EDBT2013} sketch. The simple attack technique suggests it can be effective with post-processed natural workloads. Finally and importantly, we demonstrate that the vulnerability is inherent, as any estimator applied to known sketch structures can be attacked using a number of queries that is quadratic in $k$, matching a generic upper bound.
Submitted 27 May, 2024;
originally announced May 2024.
-
A novel scheme for modelling dissipation or thermalization in open quantum systems
Authors:
Fardin Kheirandish,
Elmira Bolandhemmat,
Narges Cheraghpour,
Ronak Moradi,
Servieh Ahmadian
Abstract:
In this letter, we introduce a novel method for investigating dissipation (gain) and thermalization in an open quantum system. In this method, the quantum system is coupled linearly with a copy of itself or with another system described by a finite number of bosonic operators. The time-dependent coupling functions play a fundamental role in this scheme. To demonstrate the efficiency and significance of the method, we apply it to some ubiquitous open quantum systems. Firstly, we investigate a quantum oscillator in the presence of a thermal bath at the inverse temperature $β$, obtaining the reduced density matrix, the Husimi distribution function, and the quantum heat distribution function accurately. The results are consistent with existing literature by appropriate choices for the time-dependent coupling function. To illustrate the generalizability of this method to systems interacting with multiple thermal baths, we study the interaction of a quantum oscillator with two thermal baths at different temperatures and obtain compatible results. Subsequently, we analyze a two-level atom with energy or phase dissipation and derive the spontaneous emission and the pure dephasing processes consistently using the new method. Finally, we investigate the Markovianity in a dissipative two-level system.
Submitted 15 November, 2024; v1 submitted 16 April, 2024;
originally announced April 2024.
-
Fair Active Ranking from Pairwise Preferences
Authors:
Sruthi Gorantla,
Sara Ahmadian
Abstract:
We investigate the problem of probably approximately correct and fair (PACF) ranking of items by adaptively evoking pairwise comparisons. Given a set of $n$ items that belong to disjoint groups, our goal is to find an $(ε, δ)$-PACF-Ranking according to a fair objective function that we propose. We assume access to an oracle, wherein, for each query, the learner can choose a pair of items and receive stochastic winner feedback from the oracle. Our proposed objective function asks to minimize the $\ell_q$ norm of the error of the groups, where the error of a group is the $\ell_p$ norm of the error of all the items within that group, for $p, q \geq 1$. This generalizes the objective function of $ε$-Best-Ranking, proposed by Saha & Gopalan (2019).
By adopting our objective function, we gain the flexibility to explore fundamental fairness concepts like equal or proportionate errors within a unified framework. Adjusting parameters $p$ and $q$ allows tailoring to specific fairness preferences. We present both group-blind and group-aware algorithms and analyze their sample complexity. We provide matching lower bounds up to certain logarithmic factors for group-blind algorithms. For a restricted class of group-aware algorithms, we show that we can get reasonable lower bounds. We conduct comprehensive experiments on both real-world and synthetic datasets to complement our theoretical findings.
Submitted 5 February, 2024;
originally announced February 2024.
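The nested-norm objective described in the abstract above (the $\ell_q$ norm over groups of the $\ell_p$ norm of item errors within each group) can be computed as in the following minimal sketch. The function names and the dictionary layout are hypothetical, and how the per-item errors themselves are defined is left to the paper.

```python
def group_error(errors, p):
    """l_p norm of the item errors inside one group (p >= 1)."""
    return sum(abs(e) ** p for e in errors) ** (1.0 / p)


def fair_objective(errors_by_group, p, q):
    """l_q norm, across groups, of each group's l_p error (p, q >= 1)."""
    per_group = [group_error(errs, p) for errs in errors_by_group.values()]
    return sum(g ** q for g in per_group) ** (1.0 / q)


# Toy example: group "a" has item errors 3 and 4, group "b" has no error.
value = fair_objective({"a": [3.0, 4.0], "b": [0.0]}, p=2, q=1)
```

With $p = 2$ and $q = 1$ the toy value is the sum of the groups' Euclidean errors. Raising $q$ puts more emphasis on the worst-off group, which is one way to read the abstract's remark that adjusting $p$ and $q$ tailors the objective to different fairness preferences.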
-
Reconstruction as a service: a data space for off-site image reconstruction in magnetic particle imaging
Authors:
Anselm von Gladiss,
Amir Shayan Ahmadian,
Jan Jürjens
Abstract:
Magnetic particle imaging (MPI) is an emerging medical imaging modality which offers a unique combination of high temporal and spatial resolution, sensitivity and biocompatibility. For system-matrix (SM) based image reconstruction in MPI, a huge amount of calibration data needs to be acquired prior to reconstruction in a time-consuming procedure. Conventionally, the data is recorded on-site inside the scanning device, which significantly limits the time that the scanning device is available for patient care in a clinical setting. Due to its size, handling the calibration data can be challenging. To solve these issues of recording and handling the data, data spaces could be used, as it has been shown that the calibration data can be measured in dedicated devices off-site. We propose a data space aimed at improving the efficiency of SM-based image reconstruction in MPI. The data space consists of imaging facilities, calibration data providers and reconstruction experts. Its specifications follow the reference architecture model of international data spaces (IDS). Use-cases of image reconstruction in MPI are formulated. The stakeholders and tasks are listed and mapped to the terminology of IDS. The signal chain in MPI is analysed to identify a minimum information model which is used by the data space.
△ Less
Submitted 11 January, 2024;
originally announced January 2024.
-
DeMEtRIS: Counting (near)-Cliques by Crawling
Authors:
Suman K. Bera,
Jayesh Choudhari,
Shahrzad Haddadan,
Sara Ahmadian
Abstract:
We study the problem of approximately counting cliques and near cliques in a graph, where access to the graph is only available through crawling its vertices, thus typically seeing only a small portion of it. This model, known as the random walk model or the neighborhood query model, has been introduced recently and captures real-life scenarios in which the entire graph is too massive to be stored as a whole or scanned entirely, and in which sampling vertices independently is non-trivial. We introduce DeMEtRIS: Dense Motif Estimation through Random Incident Sampling. This method provides a scalable algorithm for clique and near clique counting in the random walk model. We prove the correctness of our algorithm through rigorous mathematical analysis and extensive experiments. Both our theoretical results and our experiments show that DeMEtRIS obtains a high-precision estimate by crawling only a sub-linear portion of vertices; thus we demonstrate a significant improvement over previously known results.
△ Less
Submitted 7 December, 2022;
originally announced December 2022.
-
Improved Approximation for Fair Correlation Clustering
Authors:
Sara Ahmadian,
Maryam Negahbani
Abstract:
Correlation clustering is a ubiquitous paradigm in unsupervised machine learning where addressing unfairness is a major challenge. Motivated by this, we study Fair Correlation Clustering where the data points may belong to different protected groups and the goal is to ensure fair representation of all groups across clusters. Our paper significantly generalizes and improves on the quality guarantees of previous work of Ahmadi et al. and Ahmadian et al. as follows.
- We allow the user to specify an arbitrary upper bound on the representation of each group in a cluster.
- Our algorithm allows individuals to have multiple protected features and ensures fairness simultaneously across them all.
- We prove guarantees for clustering quality and fairness in this general setting. Furthermore, this improves on the results for the special cases studied in previous work. Our experiments on real-world data demonstrate that our clustering quality compared to the optimal solution is much better than what our theoretical result suggests.
Submitted 8 June, 2022;
originally announced June 2022.
-
A Modeling Framework for Reliability of Erasure Codes in SSD Arrays
Authors:
Mostafa Kishani,
Saba Ahmadian,
Hossein Asadi
Abstract:
To improve the reliability of SSD arrays, Redundant Array of Independent Disks (RAID) schemes are commonly employed. However, the conventional reliability models of HDD RAID cannot be applied to SSD arrays, as the nature of failures in SSDs is different from that in HDDs. Previous studies on the reliability of SSD arrays are based on deprecated SSD failure data and focus only on limited failure types (device failures and page failures caused by bit errors), while recent field studies have reported other failure types, including bad blocks and bad chips, and a high correlation between failures. In this paper, we explore the reliability of SSD arrays using field storage traces and a real-system implementation of conventional and emerging erasure codes. The reliability is evaluated by statistical fault injections that post-process the usage logs from the real-system implementation, while the fault/failure attributes are obtained from field data. As a case study, we examine conventional and emerging erasure codes in terms of both reliability and performance using Linux MD RAID and commercial SSDs. Our analysis shows that a) emerging erasure codes fail to replace RAID6 in terms of reliability, b) row-wise erasure codes are the most efficient choices for contemporary SSD devices, and c) previous models overestimate the SSD array reliability by up to six orders of magnitude, as they focus on the coincidence of bad pages and bad chips, which accounts for only a minority of Data Loss (DL) in SSD arrays. Our experiments show that the combination of bad chips with bad blocks is the major source of DL in RAID5 and emerging codes (contributing more than 54% and 90% of DL in RAID5 and emerging codes, respectively), while RAID6 remains robust under these failure combinations. Finally, the fault injection results show that SSD array reliability, as well as the failure breakdown, is significantly correlated with SSD type.
Submitted 23 December, 2021;
originally announced December 2021.
-
ETICA: Efficient Two-Level I/O Caching Architecture for Virtualized Platforms
Authors:
Saba Ahmadian,
Reza Salkhordeh,
Onur Mutlu,
Hossein Asadi
Abstract:
In this paper, we propose an Efficient Two-Level I/O Caching Architecture (ETICA) for virtualized platforms that can significantly improve I/O latency, endurance, and cost (in terms of cache size) while preserving the reliability of write-pending data blocks. As opposed to previous one-level I/O caching schemes in virtualized platforms, our proposed architecture 1) provides two levels of cache by employing both Dynamic Random-Access Memory (DRAM) and SSD in the I/O caching layer of virtualized platforms and 2) effectively partitions the cache space between running VMs to achieve maximum performance and minimum cache size. To manage the two-level cache, unlike the previous reuse distance calculation schemes such as Useful Reuse Distance (URD), which only consider the request type and neglect the impact of cache write policy, we propose a new metric, Policy Optimized reuse Distance (POD). The key idea of POD is to effectively calculate the reuse distance and estimate the amount of two-level DRAM+SSD cache space to allocate by considering both 1) the request type and 2) the cache write policy. Doing so results in enhanced performance and reduced cache size due to the allocation of cache blocks only for the requests that would be served by the I/O cache. ETICA maintains the reliability of write-pending data blocks and improves performance by 1) assigning an effective and fixed write policy at each level of the I/O cache hierarchy and 2) employing effective promotion and eviction methods between cache levels. Our extensive experiments conducted with a real implementation of the proposed two-level storage caching architecture show that ETICA provides 45% higher performance, compared to the state-of-the-art caching schemes in virtualized platforms, while improving both cache size and SSD endurance by 51.7% and 33.8%, respectively.
Submitted 14 June, 2021;
originally announced June 2021.
-
Maximizing Agreements for Ranking, Clustering and Hierarchical Clustering via MAX-CUT
Authors:
Vaggos Chatziafratis,
Mohammad Mahdian,
Sara Ahmadian
Abstract:
In this paper, we study a number of well-known combinatorial optimization problems that fit in the following paradigm: the input is a collection of (potentially inconsistent) local relationships between the elements of a ground set (e.g., pairwise comparisons, similar/dissimilar pairs, or ancestry structure of triples of points), and the goal is to aggregate this information into a global structure (e.g., a ranking, a clustering, or a hierarchical clustering) in a way that maximizes agreement with the input. Well-studied problems such as rank aggregation, correlation clustering, and hierarchical clustering with triplet constraints fall in this class of problems.
We study these problems on stochastic instances with a hidden embedded ground truth solution. Our main algorithmic contribution is a unified technique that uses the maximum cut problem in graphs to approximately solve these problems. Using this technique, we can often get approximation guarantees in the stochastic setting that are better than the known worst case inapproximability bounds for the corresponding problem. On the negative side, we improve the worst case inapproximability bound on several hierarchical clustering formulations through a reduction to related ranking problems.
Submitted 23 February, 2021;
originally announced February 2021.
-
The Wedge Picking Model: A dynamic graph model based on triadic closure
Authors:
Sara Ahmadian,
Shahrzad Haddadan
Abstract:
Social networks have become an inseparable part of human life, and processing them in an efficient manner is a top priority in the study of networks. These networks are highly dynamic and they are growing incessantly. Inspired by the concept of triadic closure, we propose a probabilistic mechanism to model the evolution of these dynamic graphs. Although triadic closure is ubiquitous in social networks and its presence helps form communities, probabilistic models encapsulating it have not been studied adequately.
We theoretically analyze our model and show how to bound the growth rate of some characteristics of the graph, such as degree of vertices. Leveraging our theoretical results, we develop a scheduling subroutine to process modifications of the graph in batches. Our scheduling subroutine is then used to speed up the state-of-the-art algorithms with negligible loss in their approximation guarantees. We demonstrate the applicability of our method by applying it to the densest subgraph and tri-densest subgraph discovery problem.
Submitted 2 December, 2020;
originally announced December 2020.
-
Fair Hierarchical Clustering
Authors:
Sara Ahmadian,
Alessandro Epasto,
Marina Knittel,
Ravi Kumar,
Mohammad Mahdian,
Benjamin Moseley,
Philip Pham,
Sergei Vassilvitskii,
Yuyan Wang
Abstract:
As machine learning has become more prevalent, researchers have begun to recognize the necessity of ensuring machine learning systems are fair. Recently, there has been an interest in defining a notion of fairness that mitigates over-representation in traditional clustering.
In this paper we extend this notion to hierarchical clustering, where the goal is to recursively partition the data to optimize a specific objective. For various natural objectives, we obtain simple, efficient algorithms to find a provably good fair hierarchical clustering. Empirically, we show that our algorithms can find a fair hierarchical clustering, with only a negligible loss in the objective.
Submitted 18 June, 2020; v1 submitted 17 June, 2020;
originally announced June 2020.
-
Fair Correlation Clustering
Authors:
Sara Ahmadian,
Alessandro Epasto,
Ravi Kumar,
Mohammad Mahdian
Abstract:
In this paper, we study correlation clustering under fairness constraints. Fair variants of $k$-median and $k$-center clustering have been studied recently, and approximation algorithms using a notion called fairlet decomposition have been proposed. We obtain approximation algorithms for fair correlation clustering under several important types of fairness constraints.
Our results hinge on obtaining a fairlet decomposition for correlation clustering by introducing a novel combinatorial optimization problem. We define a fairlet decomposition with cost similar to the $k$-median cost and this allows us to obtain approximation algorithms for a wide range of fairness constraints.
We complement our theoretical results with an in-depth analysis of our algorithms on real graphs where we show that fair solutions to correlation clustering can be obtained with limited increase in cost compared to the state-of-the-art (unfair) algorithms.
Submitted 2 March, 2020; v1 submitted 6 February, 2020;
originally announced February 2020.
-
Bisect and Conquer: Hierarchical Clustering via Max-Uncut Bisection
Authors:
Sara Ahmadian,
Vaggos Chatziafratis,
Alessandro Epasto,
Euiwoong Lee,
Mohammad Mahdian,
Konstantin Makarychev,
Grigory Yaroslavtsev
Abstract:
Hierarchical Clustering is an unsupervised data analysis method which has been widely used for decades. Despite its popularity, it had an underdeveloped analytical foundation and to address this, Dasgupta recently introduced an optimization viewpoint of hierarchical clustering with pairwise similarity information that spurred a line of work shedding light on old algorithms (e.g., Average-Linkage), but also designing new algorithms. Here, for the maximization dual of Dasgupta's objective (introduced by Moseley-Wang), we present polynomial-time .4246 approximation algorithms that use Max-Uncut Bisection as a subroutine. The previous best worst-case approximation factor in polynomial time was .336, improving only slightly over Average-Linkage which achieves 1/3. Finally, we complement our positive results by providing APX-hardness (even for 0-1 similarities), under the Small Set Expansion hypothesis.
Submitted 15 December, 2019;
originally announced December 2019.
-
Evaluating Reliability of SSD-Based I/O Caches in Enterprise Storage Systems
Authors:
Saba Ahmadian,
Farhad Taheri,
Hossein Asadi
Abstract:
In this paper, we present a comprehensive analysis investigating the reliability of SSD-based I/O caching architectures used in enterprise storage systems under power failure and high operating temperature. We explore a variety of SSDs from top vendors and investigate the cache reliability in a mirrored configuration. To this end, we first develop a physical fault injection and failure detection platform and then investigate the impact of workload-dependent parameters on the reliability of the I/O cache in the presence of two common failure types in data centers: power outages and high-temperature faults. We implement an I/O cache scheme using an open-source I/O cache module in the Linux operating system. The experimental results, obtained by conducting more than twenty thousand physical fault injections on the implemented I/O cache with different write policies, reveal that the failure rate of the I/O cache is significantly affected by workload-dependent parameters. Our results show that, unlike the workload's request access pattern, the other workload-dependent parameters such as request size, Working Set Size (WSS), and sequence of the accesses have considerable impact on the I/O cache failure rate. We observe a significant growth in the failure rate (by more than 14X) when the size of the requests is decreased. Furthermore, we observe that in addition to writes, the read accesses to the I/O cache are subject to failure in the presence of a sudden power outage (the failure mainly occurs while promoting data to the cache). In addition, we observe that the I/O cache experiences no data failure upon high-temperature faults.
Submitted 1 December, 2019;
originally announced December 2019.
-
LGLMF: Local Geographical based Logistic Matrix Factorization Model for POI Recommendation
Authors:
Hossein A. Rahmani,
Mohammad Aliannejadi,
Sajad Ahmadian,
Mitra Baratchi,
Mohsen Afsharchi,
Fabio Crestani
Abstract:
With the rapid growth of Location-Based Social Networks, personalized Points of Interest (POIs) recommendation has become a critical task to help users explore their surroundings. Due to the scarcity of check-in data, the availability of geographical information offers an opportunity to improve the accuracy of POI recommendation. Moreover, matrix factorization methods provide effective models which can be used in POI recommendation. However, there are two main challenges which should be addressed to improve the performance of POI recommendation methods. First, leveraging geographical information to capture both the user's personal, geographic profile and a location's geographic popularity. Second, incorporating the geographical model into the matrix factorization approaches. To address these problems, a POI recommendation method is proposed in this paper based on a Local Geographical Model, which considers both users' and locations' points of view. To this end, an effective geographical model is proposed by considering the user's main region of activity and the relevance of each location within that region. Then, the proposed local geographical model is fused into the Logistic Matrix Factorization to improve the accuracy of POI recommendation. Experimental results on two well-known datasets demonstrate that the proposed approach outperforms other state-of-the-art POI recommendation methods.
Submitted 14 September, 2019;
originally announced September 2019.
-
Clustering without Over-Representation
Authors:
Sara Ahmadian,
Alessandro Epasto,
Ravi Kumar,
Mohammad Mahdian
Abstract:
In this paper we consider clustering problems in which each point is endowed with a color. The goal is to cluster the points to minimize the classical clustering cost but with the additional constraint that no color is over-represented in any cluster. This problem is motivated by practical clustering settings, e.g., in clustering news articles where the color of an article is its source, it is preferable that no single news source dominates any cluster.
For the most general version of this problem, we obtain an algorithm that has provable guarantees of performance; our algorithm is based on finding a fractional solution using a linear program and rounding the solution subsequently. For the special case of the problem where no color has an absolute majority in any cluster, we obtain a simpler combinatorial algorithm also with provable guarantees. Experiments on real-world data show that our algorithms are effective in finding good clusterings without over-representation.
Submitted 29 May, 2019;
originally announced May 2019.
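As a small illustration of the constraint described in the abstract above, the following sketch checks a given clustering for over-represented colors. The function name `over_represented` and the threshold parameter `alpha` are assumptions made for this example (with `alpha=0.5` corresponding to the "absolute majority" case mentioned in the abstract); it is not the paper's algorithm, only a validity check on a candidate clustering.

```python
from collections import Counter

def over_represented(clusters, alpha=0.5):
    """Return (cluster index, color) pairs where a color exceeds an alpha fraction.

    clusters : list of clusters, each a list of point colors (hypothetical input format)
    alpha    : assumed threshold; 0.5 flags colors holding an absolute majority
    """
    violations = []
    for idx, colors in enumerate(clusters):
        for color, count in Counter(colors).items():
            if count > alpha * len(colors):
                violations.append((idx, color))
    return violations

# Example: news articles colored by source.
violations = over_represented([["bbc", "cnn"], ["bbc", "bbc", "cnn"]])
```

In the example, only the second cluster is flagged, because "bbc" accounts for two of its three articles.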
-
A New Method To Find The Nash Equilibrium Point in Financial Transmission Rights Bidding Problem
Authors:
Saeed Ahmadian,
Ramin Farajifijani
Abstract:
The financial transmission right (FTR) is an important tool for hedging congestion charges in restructured electricity markets. Participants in the transmission market are assumed to be generation companies (Gencos) that also take part in the energy market and are able to buy the FTRs they require. There are two types of FTR: obligation and option. Three main questions immediately arise for each player in the market: first, which type of FTR is the best choice; second, how much power each player needs to generate; and third, what bid prices should be offered. Deciding on these trade-offs is difficult and requires the definition of special matrices to measure risk in each possible condition in the transmission market. These matrices include the possibility of flow direction alteration, the probable forward and reverse power flow on each line, the maximum and minimum offered FTRs, and the worst-case load variation, all of which influence each player's decision. Based on these matrices, players try to maximize their expected payoffs by taking into account the associated risks. Supposing these matrices are known to the respective players, the FTR bidding problem is modeled as a bi-level optimization based on Nash equilibrium game theory, with the upper sub-problem representing player profit maximization and the lower sub-problem representing the optimal solution to the market clearing. An eight-bus system with six players is simulated to verify the proposed method, and the obtained results illustrate the complex interaction between FTR obligation and FTR option bidding strategies. Furthermore, the results show consistent impacts of FTR type, the forecast bid offers of the other players, and players' preferred risk levels on FTR bidding strategies.
Submitted 21 February, 2019;
originally announced February 2019.
-
LBICA: A Load Balancer for I/O Cache Architectures
Authors:
Saba Ahmadian,
Reza Salkhordeh,
Hossein Asadi
Abstract:
In recent years, enterprise Solid-State Drives (SSDs) have been used in the caching layer of high-performance servers to close the growing performance gap between processing units and the storage subsystem. SSD-based I/O caching is typically not effective in workloads with burst accesses, in which the caching layer itself becomes the performance bottleneck because of the large number of accesses. Existing I/O cache architectures mainly focus on maximizing the cache hit ratio while neglecting the average queue time of accesses. Previous studies suggested bypassing the cache when burst accesses are identified. These schemes, however, are not applicable to a general cache configuration and also result in significant performance degradation on burst accesses. In this paper, we propose a novel I/O cache load balancing scheme (LBICA) with adaptive write policy management to prevent the I/O cache from becoming a performance bottleneck in burst accesses. Unlike previous schemes, which disable the I/O cache or bypass the requests into the disk subsystem in burst accesses, our proposal selectively reduces the number of waiting accesses in the SSD queue and balances the load between the I/O cache and the disk subsystem while providing the maximum performance. The proposed scheme characterizes the workload based on the type of in-queue requests and assigns an effective cache write policy. We aim to bypass the accesses which 1) are served faster by the disk subsystem or 2) cannot be merged with other accesses in the I/O cache queue. In this way, the selected requests are served by the disk layer, which prevents the I/O cache from being overloaded. Our evaluations on a physical system show that LBICA reduces the load on the I/O cache by 48% and improves the performance of burst workloads by 30% compared to the latest state-of-the-art load balancing scheme.
Submitted 5 December, 2018;
originally announced December 2018.
-
ECI-Cache: A High-Endurance and Cost-Efficient I/O Caching Scheme for Virtualized Platforms
Authors:
Saba Ahmadian,
Onur Mutlu,
Hossein Asadi
Abstract:
In recent years, high interest in using Virtual Machines (VMs) in data centers and Cloud computing has significantly increased the demand for high-performance data storage systems. Recent studies suggest using SSDs as a caching layer for HDD-based storage subsystems in virtualization platforms. Such studies neglect to address the endurance and cost of SSDs, which can significantly affect the efficiency of I/O caching. Moreover, previous studies only configure the cache size to provide the required performance level for each VM, while neglecting other important parameters such as cache write policy and request type, which can adversely affect both performance-per-cost and endurance.
In this paper, we present a new high-Endurance and Cost-efficient I/O Caching (ECI-Cache) scheme for virtualized platforms, which can significantly improve both the performance-per-cost and endurance of storage subsystems as opposed to previously proposed I/O caching schemes. Unlike traditional I/O caching schemes which allocate cache size only based on reuse distance of accesses, we propose a new metric, Useful Reuse Distance (URD), which considers the request type in reuse distance calculation, resulting in improved performance-per-cost and endurance for the SSD cache. Via online characterization of workloads and using URD, ECI-Cache partitions the SSD cache across VMs and is able to dynamically adjust the cache size and write policy for each VM. To evaluate the proposed scheme, we have implemented ECI-Cache in an open source hypervisor, QEMU (version 2.8.0), on a server running the CentOS 7 operating system (kernel version 3.10.0-327). Experimental results show that our proposed scheme improves the performance, performance-per-cost, and endurance of the SSD cache by 17%, 30% and 65%, respectively, compared to the state-of-the-art dynamic cache partitioning scheme.
Submitted 2 May, 2018;
originally announced May 2018.
-
Investigating Power Outage Effects on Reliability of Solid-State Drives
Authors:
Saba Ahmadian,
Farhad Taheri,
Mehrshad Lotfi,
Maryam Karimi,
Hossein Asadi
Abstract:
Solid-State Drives (SSDs) have recently been employed in enterprise servers and high-end storage systems in order to enhance the performance of the storage subsystem. Although employing high-speed SSDs in storage subsystems can significantly improve system performance, it comes with a significant reliability threat for write operations upon power failures. In this paper, we present a comprehensive analysis investigating the impact of workload-dependent parameters on the reliability of SSDs under power failure for a variety of SSDs (from top manufacturers). To this end, we first develop a platform that provides two important features required for the study: a) realistic fault injection into the SSD in the computing system and b) a data loss detection mechanism on the SSD upon power failure. In the proposed physical fault injection platform, SSDs experience a real discharge phase of the Power Supply Unit (PSU) that occurs during power failure in data centers, which was neglected in previous studies. The impact of workload-dependent parameters such as workload Working Set Size (WSS), request size, request type, access pattern, and sequence of accesses on the failure of SSDs is carefully studied in the presence of realistic power failures. Experimental results over thousands of fault injections show that data loss occurs even after completion of the request (up to 700ms), where the failure rate is influenced by the type, size, access pattern, and sequence of I/O accesses, while other parameters such as workload WSS have no impact on the failure of SSDs.
Submitted 29 April, 2018;
originally announced May 2018.
-
Further Approximations for Demand Matching: Matroid Constraints and Minor-Closed Graphs
Authors:
Sara Ahmadian,
Zachary Friggstad
Abstract:
We pursue a study of the Generalized Demand Matching problem, a common generalization of the $b$-Matching and Knapsack problems. Here, we are given a graph with vertex capacities, edge profits, and asymmetric demands on the edges. The goal is to find a maximum-profit subset of edges so the demands of chosen edges do not violate vertex capacities. This problem is APX-hard and constant-factor approximations are known.
Our results fall into two categories. First, using iterated relaxation and various filtering strategies, we show via an efficient rounding algorithm that if an additional matroid structure $\mathcal M$ is given and we further only allow sets $F \subseteq E$ that are independent in $\mathcal M$, the natural LP relaxation has an integrality gap of at most $\frac{25}{3} \approx 8.333$. This can be improved in various special cases; for example, we improve over the 15-approximation for the previously-studied Coupled Placement problem [Korupolu et al. 2014] by giving a $7$-approximation.
Using similar techniques, we show the problem of computing a minimum-cost base in $\mathcal M$ satisfying vertex capacities admits a $(1,3)$-bicriteria approximation. This improves over the previous $(1,4)$-approximation in the special case that $\mathcal M$ is the graphic matroid over the given graph [Fukunaga and Nagamochi, 2009].
Second, we show Demand Matching admits a polynomial-time approximation scheme in graphs that exclude a fixed minor. If all demands are polynomially-bounded integers, this is somewhat easy using dynamic programming in bounded-treewidth graphs. Our main technical contribution is a sparsification lemma allowing us to scale the demands to be used in a more intricate dynamic programming algorithm, followed by randomized rounding to filter our scaled-demand solution to a feasible solution.
Submitted 29 May, 2017;
originally announced May 2017.
-
Better Guarantees for k-Means and Euclidean k-Median by Primal-Dual Algorithms
Authors:
Sara Ahmadian,
Ashkan Norouzi-Fard,
Ola Svensson,
Justin Ward
Abstract:
Clustering is a classic topic in optimization with $k$-means being one of the most fundamental such problems. In the absence of any restrictions on the input, the best known algorithm for $k$-means with a provable guarantee is a simple local search heuristic yielding an approximation guarantee of $9+ε$, a ratio that is known to be tight with respect to such methods.
We overcome this barrier by presenting a new primal-dual approach that allows us (1) to exploit the geometric structure of $k$-means and (2) to satisfy the hard constraint that at most $k$ clusters are selected without deteriorating the approximation guarantee. Our main result is a $6.357$-approximation algorithm with respect to the standard LP relaxation. Our techniques are quite general and we also show improved guarantees for the general version of $k$-means where the underlying metric is not required to be Euclidean and for $k$-median in Euclidean metrics.
Submitted 10 April, 2017; v1 submitted 23 December, 2016;
originally announced December 2016.
-
Approximation Algorithms for Clustering Problems with Lower Bounds and Outliers
Authors:
Sara Ahmadian,
Chaitanya Swamy
Abstract:
We consider clustering problems with non-uniform lower bounds and outliers, and obtain the first approximation guarantees for these problems. We have a set $\mathcal{F}$ of facilities with lower bounds $\{L_i\}_{i\in\mathcal{F}}$ and a set $\mathcal{D}$ of clients located in a common metric space $\{c(i,j)\}_{i,j\in\mathcal{F}\cup\mathcal{D}}$, and bounds $k$, $m$. A feasible solution is a pair $\bigl(S\subseteq\mathcal{F},\ σ:\mathcal{D}\mapsto S\cup\{\mathsf{out}\}\bigr)$, where $σ$ specifies the client assignments, such that $|S|\leq k$, $|σ^{-1}(i)|\geq L_i$ for all $i\in S$, and $|σ^{-1}(\mathsf{out})|\leq m$. In the lower-bounded min-sum-of-radii with outliers (LBkSRO) problem, the objective is to minimize $\sum_{i\in S}\max_{j\inσ^{-1}(i)}c(i,j)$, and in the lower-bounded $k$-supplier with outliers (LBkSO) problem, the objective is to minimize $\max_{i\in S}\max_{j\inσ^{-1}(i)}c(i,j)$.
We obtain an approximation factor of $12.365$ for LBkSRO, which improves to $3.83$ for the non-outlier version (i.e., $m=0$). These also constitute the first approximation bounds for the min-sum-of-radii objective when we consider lower bounds and outliers separately. We apply the primal-dual method to the relaxation where we Lagrangify the $|S|\leq k$ constraint. The chief technical contribution and novelty of our algorithm is that, departing from the standard paradigm used for such constrained problems, we obtain an $O(1)$-approximation despite the fact that we do not obtain a Lagrangian-multiplier-preserving algorithm for the Lagrangian relaxation. We believe that our ideas have broader applicability to other clustering problems with outliers as well.
We obtain approximation factors of $5$ and $3$ respectively for LBkSO and its non-outlier version. These are the first approximation results for $k$-supplier with non-uniform lower bounds.
Submitted 3 November, 2016; v1 submitted 4 August, 2016;
originally announced August 2016.
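To make the feasibility conditions and the two objectives defined in the abstract above concrete, here is a minimal sketch that evaluates a given candidate solution. It only checks and scores a solution; it is not the paper's approximation algorithm. The names `evaluate`, `OUT`, and the dictionary-based input format are assumptions chosen for this example, and the tiny instance places all points on a line.

```python
OUT = "out"  # assumed marker for clients discarded as outliers

def evaluate(S, sigma, lower, k, m, cost):
    """Check feasibility of (S, sigma) and compute both objectives.

    S     : set of open facilities
    sigma : dict mapping each client to a facility in S or to OUT
    lower : dict of lower bounds L_i per facility
    cost  : function cost(i, j) returning the metric distance c(i, j)
    Returns (feasible, sum_of_radii, max_radius).
    """
    served = {i: [] for i in S}
    outliers = 0
    for client, target in sigma.items():
        if target == OUT:
            outliers += 1
        elif target in served:
            served[target].append(client)
        else:
            # Client assigned to a facility that is not open.
            return False, None, None
    feasible = (
        len(S) <= k
        and outliers <= m
        and all(len(served[i]) >= lower[i] for i in S)
    )
    # Radius of a facility: distance to its farthest assigned client.
    radii = [max((cost(i, j) for j in served[i]), default=0) for i in S]
    return feasible, sum(radii), max(radii, default=0)

# Hypothetical instance: facilities and clients positioned on a line.
pos = {"f1": 0, "f2": 10, "a": 1, "b": 2, "c": 9, "d": 50}
dist = lambda i, j: abs(pos[i] - pos[j])
result = evaluate(
    {"f1", "f2"},
    {"a": "f1", "b": "f1", "c": "f2", "d": OUT},
    {"f1": 2, "f2": 1},
    k=2, m=1, cost=dist,
)
```

The second returned value is the min-sum-of-radii objective and the third is the $k$-supplier objective for the same solution.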
-
Local-Search based Approximation Algorithms for Mobile Facility Location Problems
Authors:
Sara Ahmadian,
Zachary Friggstad,
Chaitanya Swamy
Abstract:
We consider the mobile facility location (MFL) problem. We are given a set of facilities and clients located in a common metric space. The goal is to move each facility from its initial location to a destination and assign each client to the destination of some facility so as to minimize the sum of the movement costs of the facilities and the client-assignment costs. This abstracts facility-location settings where one has the flexibility of moving facilities from their current locations to other destinations so as to serve clients more efficiently by reducing their assignment costs.
We give the first local-search based approximation algorithm for this problem and achieve the best-known approximation guarantee. Our main result is a $(3+ε)$-approximation for this problem for any constant $ε>0$ using local search. The previous best guarantee was an 8-approximation algorithm based on LP-rounding. Our guarantee matches the best-known approximation guarantee for the $k$-median problem. Since there is an approximation-preserving reduction from the $k$-median problem to MFL, any improvement of our result would imply an analogous improvement for the $k$-median problem. Furthermore, our analysis is tight (up to $o(1)$ factors), since the tight example for the local-search based 3-approximation algorithm for $k$-median can be easily adapted to show that our local-search algorithm has a tight approximation ratio of 3. One of the chief novelties of the analysis is that, in order to generate a suitable collection of local-search moves whose resulting inequalities yield the desired bound on the cost of a local optimum, we define a tree-like structure that (loosely speaking) functions as a "recursion tree", which we use to spawn off local-search moves by exploring this tree to a constant depth.
Submitted 18 January, 2013;
originally announced January 2013.
-
Improved Approximation Guarantees for Lower-Bounded Facility Location
Authors:
Sara Ahmadian,
Chaitanya Swamy
Abstract:
We consider the lower-bounded facility location (LBFL) problem (also sometimes called load-balanced facility location), which is a generalization of uncapacitated facility location (UFL), where each open facility is required to serve a certain minimum amount of demand. More formally, an instance $\mathcal{I}$ of LBFL is specified by a set $\mathcal{F}$ of facilities with facility-opening costs $\{f_i\}$, a set $\mathcal{D}$ of clients, and connection costs $\{c_{ij}\}$ specifying the cost of assigning a client $j$ to a facility $i$, where the $c_{ij}$s form a metric. A feasible solution specifies a subset $F$ of facilities to open, and assigns each client $j$ to an open facility $i(j)\in F$ so that each open facility serves at least $M$ clients, where $M$ is an input parameter. The cost of such a solution is $\sum_{i\in F}f_i+\sum_j c_{i(j)j}$, and the goal is to find a feasible solution of minimum cost. The current best approximation ratio for LBFL is 448 [Svitkina 2008]. We substantially advance the state-of-the-art for LBFL by devising an approximation algorithm for LBFL that achieves a significantly-improved approximation guarantee of 82.6. Our improvement comes from a variety of ideas in algorithm design and analysis, which also yield new insights into LBFL.
Submitted 29 August, 2012; v1 submitted 15 April, 2011;
originally announced April 2011.
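The cost and feasibility definition in the abstract above can be spelled out with a short sketch. The function name `lbfl_cost` and the dictionary-based encoding of the instance are assumptions for illustration only; the code scores a given solution and does not implement the paper's approximation algorithm.

```python
def lbfl_cost(open_facilities, assignment, opening_cost, conn_cost, M):
    """Return the cost of a candidate solution, or None if it is infeasible.

    open_facilities : set F of opened facilities
    assignment      : dict mapping each client j to its facility i(j)
    opening_cost    : dict of facility-opening costs f_i
    conn_cost       : dict keyed by (i, j) giving connection costs c_ij
    M               : minimum number of clients each open facility must serve
    """
    load = {i: 0 for i in open_facilities}
    for client, facility in assignment.items():
        if facility not in load:
            return None  # client assigned to a facility that is not open
        load[facility] += 1
    if any(n < M for n in load.values()):
        return None  # some open facility serves fewer than M clients
    total = sum(opening_cost[i] for i in open_facilities)
    total += sum(conn_cost[(i, j)] for j, i in assignment.items())
    return total

# Hypothetical instance with two facilities and three clients.
opening = {"x": 5, "y": 7}
conn = {("x", "a"): 1, ("x", "b"): 2, ("y", "c"): 3}
assign = {"a": "x", "b": "x", "c": "y"}
cost_m1 = lbfl_cost({"x", "y"}, assign, opening, conn, M=1)
cost_m2 = lbfl_cost({"x", "y"}, assign, opening, conn, M=2)
```

With `M=1` the solution is feasible and its cost is the opening costs plus the connection costs; with `M=2` facility "y" serves only one client, so the same solution is rejected.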