Search | arXiv e-print repository

A QUBO Framework for Team Formation

Authors: Karan Vombatkere, Evimaria Terzi, Theodoros Lappas

Abstract: The team formation problem assumes a set of experts and a task, where each expert has a set of skills and the task requires some skills. The objective is to find a set of experts that maximizes coverage of the required skills while simultaneously minimizing the costs associated with the experts. Different definitions of cost have traditionally led to distinct problem formulations and algorithmic solutions. We introduce the unified TeamFormation formulation that captures all cost definitions for team formation problems that balance task coverage and expert cost. Specifically, we formulate three TeamFormation variants with different cost functions using quadratic unconstrained binary optimization (QUBO), and we evaluate two distinct general-purpose solution methods. We show that solutions based on the QUBO formulations of TeamFormation problems are at least as good as those produced by established baselines. Furthermore, we show that QUBO-based solutions leveraging graph neural networks can effectively learn representations of experts and skills to enable transfer learning, allowing node embeddings from one problem instance to be efficiently applied to another.

Submitted 29 March, 2025; originally announced March 2025.

arXiv:2503.05898 [pdf, other]

doi 10.1007/s10618-025-01090-x

Forming Coordinated Teams that Balance Task Coverage and Expert Workload

Authors: Karan Vombatkere, Evimaria Terzi, Aristides Gionis

Abstract: We study a new formulation of the team-formation problem, where the goal is to form teams to work on a given set of tasks requiring different skills. Deviating from the classic problem setting where one is asking to cover all skills of each given task, we aim to cover as many skills as possible while also trying to minimize the maximum workload among the experts. We do this by combining penalization terms for the coverage and load constraints into one objective. We call the corresponding assignment problem $\texttt{Balanced-Coverage}$, and show that it is NP-hard. We also consider a variant of this problem, where the experts are organized into a graph, which encodes how well they work together. Utilizing such a coordination graph, we aim to find teams to assign to tasks such that each team's radius does not exceed a given threshold. We refer to this problem as $\texttt{Network-Balanced-Coverage}$. We develop a generic template algorithm for approximating both problems in polynomial time, and we show that our template algorithm for $\texttt{Balanced-Coverage}$ has provable guarantees. We describe a set of computational speedups that we can apply to our algorithms and make them scale for reasonably large datasets. From the practical point of view, we demonstrate how to efficiently tune the two parts of the objective and tailor their importance to a particular application. Our experiments with a variety of real-world datasets demonstrate the utility of our problem formulation as well as the efficiency of our algorithms in practice.

Submitted 7 March, 2025; originally announced March 2025.

Journal ref: Data Mining and Knowledge Discovery (2025)

arXiv:2410.22591 [pdf, other]

FGCE: Feasible Group Counterfactual Explanations for Auditing Fairness

Authors: Christos Fragkathoulas, Vasiliki Papanikou, Evaggelia Pitoura, Evimaria Terzi

Abstract: This paper introduces the first graph-based framework for generating group counterfactual explanations to audit model fairness, a crucial aspect of trustworthy machine learning. Counterfactual explanations are instrumental in understanding and mitigating unfairness by revealing how inputs should change to achieve a desired outcome. Our framework, named Feasible Group Counterfactual Explanations (FGCEs), captures real-world feasibility constraints and constructs subgroups with similar counterfactuals, setting it apart from existing methods. It also addresses key trade-offs in counterfactual generation, including the balance between the number of counterfactuals, their associated costs, and the breadth of coverage achieved. To evaluate these trade-offs and assess fairness, we propose measures tailored to group counterfactual generation. Our experimental results on benchmark datasets demonstrate the effectiveness of our approach in managing feasibility constraints and trade-offs, as well as the potential of our proposed metrics in identifying and quantifying fairness issues.

Submitted 15 November, 2024; v1 submitted 29 October, 2024; originally announced October 2024.

arXiv:2407.19262 [pdf, other]

Understanding Memorisation in LLMs: Dynamics, Influencing Factors, and Implications

Authors: Till Speicher, Mohammad Aflah Khan, Qinyuan Wu, Vedant Nanda, Soumi Das, Bishwamittra Ghosh, Krishna P. Gummadi, Evimaria Terzi

Abstract: Understanding whether and to what extent large language models (LLMs) have memorised training data has important implications for the reliability of their output and the privacy of their training data. In order to cleanly measure and disentangle memorisation from other phenomena (e.g. in-context learning), we create an experimental framework that is based on repeatedly exposing LLMs to random strings. Our framework allows us to better understand the dynamics, i.e., the behaviour of the model, when repeatedly exposing it to random strings. Using our framework, we make several striking observations: (a) we find consistent phases of the dynamics across families of models (Pythia, Phi and Llama2), (b) we identify factors that make some strings easier to memorise than others, and (c) we identify the role of local prefixes and global context in memorisation. We also show that sequential exposition to different random strings has a significant effect on memorisation. Our results, often surprising, have significant downstream implications in the study and usage of LLMs.

Submitted 27 July, 2024; originally announced July 2024.

arXiv:2404.12957 [pdf, other]

doi 10.1145/3701551.3703562

Towards Reliable Latent Knowledge Estimation in LLMs: Zero-Prompt Many-Shot Based Factual Knowledge Extraction

Authors: Qinyuan Wu, Mohammad Aflah Khan, Soumi Das, Vedant Nanda, Bishwamittra Ghosh, Camila Kolling, Till Speicher, Laurent Bindschaedler, Krishna P. Gummadi, Evimaria Terzi

Abstract: In this paper, we focus on the challenging task of reliably estimating factual knowledge that is embedded inside large language models (LLMs). To avoid reliability concerns with prior approaches, we propose to eliminate prompt engineering when probing LLMs for factual knowledge. Our approach, called Zero-Prompt Latent Knowledge Estimator (ZP-LKE), leverages the in-context learning ability of LLMs to communicate both the factual knowledge question as well as the expected answer format. Our knowledge estimator is both conceptually simpler (i.e., doesn't depend on meta-linguistic judgments of LLMs) and easier to apply (i.e., is not LLM-specific), and we demonstrate that it can surface more of the latent knowledge embedded in LLMs. We also investigate how different design choices affect the performance of ZP-LKE. Using the proposed estimator, we perform a large-scale evaluation of the factual knowledge of a variety of open-source LLMs, like OPT, Pythia, Llama(2), Mistral, Gemma, etc. over a large set of relations and facts from the Wikidata knowledge base. We observe differences in the factual knowledge between different model families and models of different sizes, that some relations are consistently better known than others but that models differ in the precise facts they know, and differences in the knowledge of base models and their finetuned counterparts. Code available at: https://github.com/QinyuanWu0710/ZeroPrompt_LKE

Submitted 17 December, 2024; v1 submitted 19 April, 2024; originally announced April 2024.

arXiv:2403.00859 [pdf, other]

doi 10.1145/3589334.3645444

Team Formation amidst Conflicts

Authors: Iasonas Nikolaou, Evimaria Terzi

Abstract: In this work, we formulate the problem of team formation amidst conflicts. The goal is to assign individuals to tasks, with given capacities, taking into account individuals' task preferences and the conflicts between them. Using dependent rounding schemes as our main toolbox, we provide efficient approximation algorithms. Our framework is extremely versatile and can model many different real-world scenarios as they arise in educational settings and human-resource management. We test and deploy our algorithms on real-world datasets and we show that our algorithms find assignments that are better than those found by natural baselines. In the educational setting we also show how our assignments are far better than those done manually by human experts. In the human resource management application we show how our assignments increase the diversity of teams. Finally, using a synthetic dataset we demonstrate that our algorithms scale very well in practice.

Submitted 29 February, 2024; originally announced March 2024.

arXiv:2402.10243 [pdf, other]

Understanding team collapse via probabilistic graphical models

Authors: Iasonas Nikolaou, Konstantinos Pelechrinis, Evimaria Terzi

Abstract: In this work, we develop a graphical model to capture team dynamics. We analyze the model and show how to learn its parameters from data. Using our model we study the phenomenon of team collapse from a computational perspective. We use simulations and real-world experiments to find the main causes of team collapse. We also provide the principles of building resilient teams, i.e., teams that avoid collapsing. Finally, we use our model to analyze the structure of NBA teams and dive deeper into games of interest.

Submitted 14 February, 2024; originally announced February 2024.

arXiv:2311.10005 [pdf, other]

Towards Flexibility and Robustness of LSM Trees

Authors: Andy Huynh, Harshal A. Chaudhari, Evimaria Terzi, Manos Athanassoulis

Abstract: Log-Structured Merge trees (LSM trees) are increasingly used as part of the storage engine behind several data systems, and are frequently deployed in the cloud. As the number of applications relying on LSM-based storage backends increases, the problem of performance tuning of LSM trees receives increasing attention. We consider both nominal tunings - where workload and execution environment are accurately known a priori - and robust tunings - which consider uncertainty in the workload knowledge. This type of workload uncertainty is common in modern applications, notably in shared infrastructure environments like the public cloud. To address this problem, we introduce ENDURE, a new paradigm for tuning LSM trees in the presence of workload uncertainty. Specifically, we focus on the impact of the choice of compaction policy, size ratio, and memory allocation on the overall performance. ENDURE considers a robust formulation of the throughput maximization problem and recommends a tuning that offers near-optimal throughput when the executed workload is not the same, instead in a neighborhood of the expected workload. Additionally, we explore the robustness of flexible LSM designs by proposing a new unified design called K-LSM that encompasses existing designs. We deploy our robust tuning system, ENDURE, on a state-of-the-art key-value store, RocksDB, and demonstrate throughput improvements of up to 5x in the presence of uncertainty. Our results indicate that the tunings obtained by ENDURE are more robust than tunings obtained under our expanded LSM design space. This indicates that robustness may not be inherent to a design, instead, it is an outcome of a tuning process that explicitly accounts for uncertainty.

Submitted 16 November, 2023; originally announced November 2023.

Comments: 25 pages, 19 figures, VLDB-J. arXiv admin note: substantial text overlap with arXiv:2110.13801

arXiv:2309.04339 [pdf, other]

Online Submodular Maximization via Online Convex Optimization

Authors: Tareq Si Salem, Gözde Özcan, Iasonas Nikolaou, Evimaria Terzi, Stratis Ioannidis

Abstract: We study monotone submodular maximization under general matroid constraints in the online setting. We prove that online optimization of a large class of submodular functions, namely, weighted threshold potential functions, reduces to online convex optimization (OCO). This is precisely because functions in this class admit a concave relaxation; as a result, OCO policies, coupled with an appropriate rounding scheme, can be used to achieve sublinear regret in the combinatorial setting. We show that our reduction extends to many different versions of the online learning problem, including the dynamic regret, bandit, and optimistic-learning settings.

Submitted 7 January, 2024; v1 submitted 8 September, 2023; originally announced September 2023.

Comments: Accepted to AAAI Conference on Artificial Intelligence, 2024

arXiv:2110.13801 [pdf, other]

Endure: A Robust Tuning Paradigm for LSM Trees Under Workload Uncertainty

Authors: Andy Huynh, Harshal A. Chaudhari, Evimaria Terzi, Manos Athanassoulis

Abstract: Log-Structured Merge trees (LSM trees) are increasingly used as the storage engines behind several data systems, frequently deployed in the cloud. Similar to other database architectures, LSM trees take into account information about the expected workload (e.g., reads vs. writes, point vs. range queries) to optimize their performance via tuning. Operating in shared infrastructure like the cloud, however, comes with a degree of workload uncertainty due to multi-tenancy and the fast-evolving nature of modern applications. Systems with static tuning discount the variability of such hybrid workloads and hence provide an inconsistent and overall suboptimal performance. To address this problem, we introduce Endure - a new paradigm for tuning LSM trees in the presence of workload uncertainty. Specifically, we focus on the impact of the choice of compaction policies, size-ratio, and memory allocation on the overall performance. Endure considers a robust formulation of the throughput maximization problem, and recommends a tuning that maximizes the worst-case throughput over a neighborhood of each expected workload. Additionally, an uncertainty tuning parameter controls the size of this neighborhood, thereby allowing the output tunings to be conservative or optimistic. Through both model-based and extensive experimental evaluation of Endure in the state-of-the-art LSM-based storage engine, RocksDB, we show that the robust tuning methodology consistently outperforms classical tuning strategies. We benchmark Endure using 15 workload templates that generate more than 10000 unique noisy workloads. The robust tunings output by Endure lead up to a 5$\times$ improvement in throughput in presence of uncertainty. On the flip side, when the observed workload exactly matches the expected one, Endure tunings have negligible performance loss.

Submitted 2 November, 2021; v1 submitted 26 October, 2021; originally announced October 2021.

Comments: 21 pages, 30 figures

arXiv:2011.04428 [pdf, other]

Finding teams that balance expert load and task coverage

Authors: Sofia Maria Nikolakaki, Mingxiang Cai, Evimaria Terzi

Abstract: The rise of online labor markets (e.g., Freelancer, Guru and Upwork) has ignited a lot of research on team formation, where experts acquiring different skills form teams to complete tasks. The core idea in this line of work has been the strict requirement that the team of experts assigned to complete a given task should contain a superset of the skills required by the task. However, in many applications the required skills are often a wishlist of the entity that posts the task and not all of the skills are absolutely necessary. Thus, in our setting we relax the complete coverage requirement and we allow for tasks to be partially covered by the formed teams, assuming that the quality of task completion is proportional to the fraction of covered skills per task. At the same time, we assume that when multiple tasks need to be performed, the less the load of an expert the better the performance. We combine these two high-level objectives into one and define the BalancedTA problem. We also consider a generalization of this problem where each task consists of required and optional skills. In this setting, our objective is the same under the constraint that all required skills should be covered. From the technical point of view, we show that the BalancedTA problem (and its variant) is NP-hard and design efficient heuristics for solving it in practice. Using real datasets from three online market places, Freelancer, Guru and Upwork we demonstrate the efficiency of our methods and the practical utility of our framework.

Submitted 3 November, 2020; originally announced November 2020.

arXiv:2011.01897

A Multi-aspect Analysis of Gender Bias on Online Student Evaluations

Authors: Sofia Maria Nikolakaki, Joseph Lai, Evimaria Terzi

Abstract: Institutions widely use student evaluations to assess the faculty's teaching performance, but underlying trends and biases can influence their interpretation. Using data from Rate My Professors, we conduct the largest and most recent quantitative data analysis to study questions related to the evaluation criteria that students have when they review the performance of their male and female professors. Our analysis spans data from two decades (1999-2019), thus taking into account recent changes on the website and in the perception of students, and demonstrates interesting insights related to how students perceive the teaching style and personality traits of their male and female professors. We also present the first analysis that investigates how gender bias evolves over time and changes over space. We believe that our results are interesting from a sociological viewpoint, as they investigate the role of gender in higher education by disclosing how students perceive and evaluate professors of different genders. In addition, we believe that our findings can be useful to educational institutions when considering possible biases that exist in the evaluations of their faculty.

Submitted 7 December, 2020; v1 submitted 3 November, 2020; originally announced November 2020.

Comments: Withdrawal due to a technical issue that we do not know how and if we will be able to resolve

arXiv:2006.10904 [pdf, other]

Learn to Earn: Enabling Coordination within a Ride Hailing Fleet

Authors: Harshal A. Chaudhari, John W. Byers, Evimaria Terzi

Abstract: The problem of optimizing social welfare objectives on multi sided ride hailing platforms such as Uber, Lyft, etc., is challenging, due to misalignment of objectives between drivers, passengers, and the platform itself. An ideal solution aims to minimize the response time for each hyper local passenger ride request, while simultaneously maintaining high demand satisfaction and supply utilization across the entire city. Economists tend to rely on dynamic pricing mechanisms that stifle price sensitive excess demand and resolve the supply demand imbalances emerging in specific neighborhoods. In contrast, computer scientists primarily view it as a demand prediction problem with the goal of preemptively repositioning supply to such neighborhoods using black box coordinated multi agent deep reinforcement learning based approaches. Here, we introduce explainability in the existing supply repositioning approaches by establishing the need for coordination between the drivers at specific locations and times. Explicit need based coordination allows our framework to use a simpler non deep reinforcement learning based approach, thereby enabling it to explain its recommendations ex post. Moreover, it provides envy free recommendations i.e., drivers at the same location and time do not envy one another's future earnings. Our experimental evaluation demonstrates the effectiveness, the robustness, and the generalizability of our framework. Finally, in contrast to previous works, we make available a reinforcement learning environment for end to end reproducibility of our work and to encourage future comparative studies.

Submitted 16 July, 2020; v1 submitted 18 June, 2020; originally announced June 2020.

Comments: 16 pages, 9 figures

MSC Class: 68T05

ACM Class: I.2; K.4; J.6

arXiv:2002.07782 [pdf, other]

An Efficient Framework for Balancing Submodularity and Cost

Authors: Sofia Maria Nikolakaki, Alina Ene, Evimaria Terzi

Abstract: In the classical selection problem, the input consists of a collection of elements and the goal is to pick a subset of elements from the collection such that some objective function $f$ is maximized. This problem has been studied extensively in the data-mining community and it has multiple applications including influence maximization in social networks, team formation and recommender systems. A particularly popular formulation that captures the needs of many such applications is one where the objective function $f$ is a monotone and non-negative submodular function. In these cases, the corresponding computational problem can be solved using a simple greedy $(1-\frac{1}{e})$-approximation algorithm. In this paper, we consider a generalization of the above formulation where the goal is to optimize a function that maximizes the submodular function $f$ minus a linear cost function $c$. This formulation appears as a more natural one, particularly when one needs to strike a balance between the value of the objective function and the cost being paid in order to pick the selected elements. We address variants of this problem both in an offline setting, where the collection is known a priori, as well as in online settings, where the elements of the collection arrive in an online fashion. We demonstrate that by using simple variants of the standard greedy algorithm (used for submodular optimization) we can design algorithms that have provable approximation guarantees, are extremely efficient and work very well in practice.

Submitted 3 September, 2021; v1 submitted 18 February, 2020; originally announced February 2020.

Comments: Extended version of KDD 2021 paper
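
Illustrative note: the "simple greedy" baseline mentioned in the abstract above can be sketched as follows. This is a minimal sketch of the textbook greedy rule for a monotone submodular function under a cardinality constraint, not the cost-aware variants proposed in the paper. The coverage objective, the budget `k`, the function name `greedy_max_coverage`, and the toy `experts` data are all assumptions chosen for illustration.

```python
from typing import Dict, Hashable, List, Set


def greedy_max_coverage(sets: Dict[Hashable, Set[int]], k: int) -> List[Hashable]:
    """Pick up to k keys whose sets greedily maximize the size of their union."""
    chosen: List[Hashable] = []
    covered: Set[int] = set()
    for _ in range(k):
        best_key, best_gain = None, 0
        for key, items in sets.items():
            if key in chosen:
                continue
            gain = len(items - covered)  # marginal gain of adding this element
            if gain > best_gain:
                best_key, best_gain = key, gain
        if best_key is None:  # no remaining element adds anything new
            break
        chosen.append(best_key)
        covered |= sets[best_key]
    return chosen


# Hypothetical toy instance: each expert covers a set of skill ids.
experts = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1, 7}}
team = greedy_max_coverage(experts, k=2)
```

Each round adds the element with the largest marginal gain, which is the step that yields the $(1-\frac{1}{e})$ guarantee for monotone submodular objectives under a cardinality constraint. Subtracting a linear cost from the gain changes the analysis, which is the setting the paper addresses.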

arXiv:2002.07618 [pdf, ps, other]

doi 10.1145/3219819.3220056

Algorithms for Hiring and Outsourcing in the Online Labor Market

Authors: Aris Anagnostopoulos, Carlos Castillo, Adriano Fazzone, Stefano Leonardi, Evimaria Terzi

Abstract: Although freelancing work has grown substantially in recent years, in part facilitated by a number of online labor marketplaces, (e.g., Guru, Freelancer, Amazon Mechanical Turk), traditional forms of "in-sourcing" work continue being the dominant form of employment. This means that, at least for the time being, freelancing and salaried employment will continue to co-exist. In this paper, we provide algorithms for outsourcing and hiring workers in a general setting, where workers form a team and contribute different skills to perform a task. We call this model team formation with outsourcing. In our model, tasks arrive in an online fashion: neither the number nor the composition of the tasks is known a-priori. At any point in time, there is a team of hired workers who receive a fixed salary independently of the work they perform. This team is dynamic: new members can be hired and existing members can be fired, at some cost. Additionally, some parts of the arriving tasks can be outsourced and thus completed by non-team members, at a premium. Our contribution is an efficient online cost-minimizing algorithm for hiring and firing team members and outsourcing tasks. We present theoretical bounds obtained using a primal-dual scheme proving that our algorithms have a logarithmic competitive approximation ratio. We complement these results with experiments using semi-synthetic datasets based on actual task requirements and worker skills from three large online labor marketplaces.

Submitted 16 February, 2020; originally announced February 2020.

Comments: Published at 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining 2018

arXiv:1905.03037 [pdf, other]

The Guided Team-Partitioning Problem: Definition, Complexity, and Algorithm

Authors: Sanaz Bahargam, Theodoros Lappas, Evimaria Terzi

Abstract: A long line of literature has focused on the problem of selecting a team of individuals from a large pool of candidates, such that certain constraints are respected, and a given objective function is maximized. Even though extant research has successfully considered diverse families of objective functions and constraints, one of the most common limitations is the focus on the single-team paradigm. Despite its well-documented applications in multiple domains, this paradigm is not appropriate when the team-builder needs to partition the entire population into multiple teams. Team-partitioning tasks are very common in an educational setting, in which the teacher has to partition the students in her class into teams for collaborative projects. The task also emerges in the context of organizations, when managers need to partition the workforce into teams with specific properties to tackle relevant projects. In this work, we extend the team formation literature by introducing the Guided Team-Partitioning (GTP) problem, which asks for the partitioning of a population into teams such that the centroid of each team is as close as possible to a given target vector. As we describe in detail in our work, this formulation allows the team-builder to control the composition of the produced teams and has natural applications in practical settings. Algorithms for the GTP need to simultaneously consider the composition of multiple non-overlapping teams that compete for the same population of candidates. This makes the problem considerably more challenging than formulations that focus on the optimization of a single team. In fact, we prove that GTP is NP-hard to solve and even to approximate. The complexity of the problem motivates us to consider efficient algorithmic heuristics, which we evaluate via experiments on both real and synthetic datasets.

Submitted 30 April, 2019; originally announced May 2019.

arXiv:1811.05015 [pdf, other]

doi 10.1016/j.eswa.2018.10.046

A Team-Formation Algorithm for Faultline Minimization

Authors: Sanaz Bahargam, Behzad Golshan, Theodoros Lappas, Evimaria Terzi

Abstract: In recent years, the proliferation of online resumes and the need to evaluate large populations of candidates for on-site and virtual teams have led to a growing interest in automated team-formation. Given a large pool of candidates, the general problem requires the selection of a team of experts to complete a given task. Surprisingly, while ongoing research has studied numerous variations with different constraints, it has overlooked a factor with a well-documented impact on team cohesion and performance: team faultlines. Addressing this gap is challenging, as the available measures for faultlines in existing teams cannot be efficiently applied to faultline optimization. In this work, we meet this challenge with a new measure that can be efficiently used for both faultline measurement and minimization. We then use the measure to solve the problem of automatically partitioning a large population into low-faultline teams. By introducing faultlines to the team-formation literature, our work creates exciting opportunities for algorithmic work on faultline optimization, as well as on work that combines and studies the connection of faultlines with other influential team characteristics.

Submitted 12 November, 2018; originally announced November 2018.

arXiv:1801.07722 [pdf, other]

doi 10.1137/1.9781611975321.50

Markov Chain Monitoring

Authors: Harshal A. Chaudhari, Michael Mathioudakis, Evimaria Terzi

Abstract: In networking applications, one often wishes to obtain estimates about the number of objects at different parts of the network (e.g., the number of cars at an intersection of a road network or the number of packets expected to reach a node in a computer network) by monitoring the traffic in a small number of network nodes or edges. We formalize this task by defining the 'Markov Chain Monitoring' problem. Given an initial distribution of items over the nodes of a Markov chain, we wish to estimate the distribution of items at subsequent times. We do this by asking a limited number of queries that retrieve, for example, how many items transitioned to a specific node or over a specific edge at a particular time. We consider different types of queries, each defining a different variant of the Markov Chain Monitoring problem. For each variant, we design efficient algorithms for choosing the queries that make our estimates as accurate as possible. In our experiments with synthetic and real datasets, we demonstrate the efficiency and the efficacy of our algorithms in a variety of settings.

Submitted 23 January, 2018; originally announced January 2018.

Comments: 13 pages, 10 figures, 1 table

arXiv:1705.00399 [pdf, other]

doi 10.1145/2783258.2783259

Matrix completion with queries

Authors: Natali Ruchansky, Mark Crovella, Evimaria Terzi

Abstract: In many applications, e.g., recommender systems and traffic monitoring, the data comes in the form of a matrix that is only partially observed and low rank. A fundamental data-analysis task for these datasets is matrix completion, where the goal is to accurately infer the entries missing from the matrix. Even when the data satisfies the low-rank assumption, classical matrix-completion methods may output completions with significant error -- in that the reconstructed matrix differs significantly from the true underlying matrix. Often, this is due to the fact that the information contained in the observed entries is insufficient. In this work, we address this problem by proposing an active version of matrix completion, where queries can be made to the true underlying matrix. Subsequently, we design Order&Extend, which is the first algorithm to unify a matrix-completion approach and a querying strategy into a single algorithm. Order&Extend is able to identify and alleviate insufficient information by judiciously querying a small number of additional entries. In an extensive experimental evaluation on real-world datasets, we demonstrate that our algorithm is efficient and is able to accurately reconstruct the true matrix while asking only a small number of queries.

Submitted 30 April, 2017; originally announced May 2017.

Comments: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

arXiv:1705.00375 [pdf, other]

Targeted matrix completion

Authors: Natali Ruchansky, Mark Crovella, Evimaria Terzi

Abstract: Matrix completion is a problem that arises in many data-analysis settings where the input consists of a partially-observed matrix (e.g., recommender systems, traffic matrix analysis etc.). Classical approaches to matrix completion assume that the input partially-observed matrix is low rank. The success of these methods depends on the number of observed entries and the rank of the matrix; the larger the rank, the more entries need to be observed in order to accurately complete the matrix. In this paper, we deal with matrices that are not necessarily low rank themselves, but rather they contain low-rank submatrices. We propose Targeted, which is a general framework for completing such matrices. In this framework, we first extract the low-rank submatrices and then apply a matrix-completion algorithm to these low-rank submatrices as well as the remainder matrix separately. Although for the completion itself we use state-of-the-art completion methods, our results demonstrate that Targeted achieves significantly smaller reconstruction errors than other classical matrix-completion methods. One of the key technical contributions of the paper lies in the identification of the low-rank submatrices from the input partially-observed matrices.

Submitted 30 April, 2017; originally announced May 2017.

Comments: Proceedings of the 2017 SIAM International Conference on Data Mining (SDM)

arXiv:1703.08762 [pdf, other]

Team Formation for Scheduling Educational Material in Massive Online Classes

Authors: Sanaz Bahargam, Dóra Erdos, Azer Bestavros, Evimaria Terzi

Abstract: Whether teaching in a classroom or a Massive Online Open Course, it is crucial to present the material in a way that benefits the audience as a whole. We identify two important tasks to solve towards this objective: (1) group students so that they can maximally benefit from peer interaction and (2) find an optimal schedule of the educational material for each group. Thus, in this paper, we solve the problem of team formation and content scheduling for education. Given a time frame d, a set of students S with their required need to learn different activities T, and given k as the number of desired groups, we study the problem of finding k groups of students. The goal is to teach students within time frame d such that their potential for learning is maximized and find the best schedule for each group. We show this problem to be NP-hard and develop a polynomial-time algorithm for it. We show our algorithm to be effective on both synthetic and real data sets. For our experiments, we use real data on students' grades in a Computer Science department. As part of our contribution, we release a semi-synthetic dataset that mimics the properties of the real data.

Submitted 25 March, 2017; originally announced March 2017.

arXiv:1701.07221 [pdf, other]

Community-aware network sparsification

Authors: Aristides Gionis, Polina Rozenshtein, Nikolaj Tatti, Evimaria Terzi

Abstract: Network sparsification aims to reduce the number of edges of a network while maintaining its structural properties; such properties include shortest paths, cuts, spectral measures, or network modularity. Sparsification has multiple applications, such as speeding up graph-mining algorithms, graph visualization, as well as identifying the important network edges. In this paper we consider a novel formulation of the network-sparsification problem. In addition to the network, we also consider as input a set of communities. The goal is to sparsify the network so as to preserve the network structure with respect to the given communities. We introduce two variants of the community-aware sparsification problem, leading to sparsifiers that satisfy different connectedness community properties. From the technical point of view, we prove hardness results and devise effective approximation algorithms. Our experimental results on a large collection of datasets demonstrate the effectiveness of our algorithms.

Submitted 25 January, 2017; originally announced January 2017.

arXiv:1701.05352 [pdf, other]

Finding low-tension communities

Authors: Esther Galbrun, Behzad Golshan, Aristides Gionis, Evimaria Terzi

Abstract: Motivated by applications that arise in online social media and collaboration networks, there has been a lot of work on community-search and team-formation problems. In the former class of problems, the goal is to find a subgraph that satisfies a certain connectivity requirement and contains a given collection of seed nodes. In the latter class of problems, on the other hand, the goal is to find individuals who collectively have the skills required for a task and form a connected subgraph with certain properties. In this paper, we extend both the community-search and the team-formation problems by associating each individual with a profile. The profile is a numeric score that quantifies the position of an individual with respect to a topic. We adopt a model where each individual starts with a latent profile and arrives at a conformed profile through a dynamic conformation process, which takes into account the individual's social interaction and the tendency to conform with one's social environment. In this framework, social tension arises from the differences between the conformed profiles of neighboring individuals as well as from differences between individuals' conformed and latent profiles. Given a network of individuals, their latent profiles and this conformation process, we extend the community-search and the team-formation problems by requiring the output subgraphs to have low social tension. From the technical point of view, we study the complexity of these problems and propose algorithms for solving them effectively. Our experimental evaluation in a number of social networks reveals the efficacy and efficiency of our methods.

Submitted 19 January, 2017; originally announced January 2017.

Comments: A short version of this paper appeared in the 2017 SIAM International Conference on Data Mining, SDM'17. In this extended version, we discuss the team-formation problem variant, beside the original community-search problem, and include additional experimental results

arXiv:1612.05440 [pdf, other]

doi 10.1007/s10618-018-0602-x

Best Friends Forever (BFF): Finding Lasting Dense Subgraphs

Authors: Konstantinos Semertzidis, Evaggelia Pitoura, Evimaria Terzi, Panayiotis Tsaparas

Abstract: Graphs form a natural model for relationships and interactions between entities, for example, between people in social and cooperation networks, servers in computer networks, or tags and words in documents and tweets. But, which of these relationships or interactions are the most lasting ones? In this paper, we study the following problem: given a set of graph snapshots, which may correspond to the state of an evolving graph at different time instances, identify the set of nodes that are the most densely connected in all snapshots. We call this problem the Best Friends For Ever (BFF) problem. We provide definitions for density over multiple graph snapshots, that capture different semantics of connectedness over time, and we study the corresponding variants of the BFF problem. We then look at the On-Off BFF (O^2BFF) problem that relaxes the requirement of nodes being connected in all snapshots, and asks for the densest set of nodes in at least $k$ of a given set of graph snapshots. We show that this problem is NP-complete for all definitions of density, and we propose a set of efficient algorithms. Finally, we present experiments with synthetic and real datasets that show both the efficiency of our algorithms and the usefulness of the BFF and the O^2BFF problems.

Submitted 2 October, 2017; v1 submitted 16 December, 2016; originally announced December 2016.

Comments: 15 pages, 10 figures, 8 tables

Journal ref: Data Mining and Knowledge Discovery - Journal Track of ECML PKDD 2019

arXiv:1610.05516 [pdf, other]

Active Network Alignment: A Matching-Based Approach

Authors: Eric Malmi, Aristides Gionis, Evimaria Terzi

Abstract: Network alignment is the problem of matching the nodes of two graphs, maximizing the similarity of the matched nodes and the edges between them. This problem is encountered in a wide array of applications -- from biological networks to social networks to ontologies -- where multiple networked data sources need to be integrated. Due to the difficulty of the task, an accurate alignment can rarely be found without human assistance. Thus, it is of great practical importance to develop network alignment algorithms that can optimally leverage experts who are able to provide the correct alignment for a small number of nodes. Yet, only a handful of existing works address this active network alignment setting. The majority of the existing active methods focus on absolute queries ("are nodes $a$ and $b$ the same or not?"), whereas we argue that it is generally easier for a human expert to answer relative queries ("which node in the set $\{b_1, \ldots, b_n\}$ is the most similar to node $a$?"). This paper introduces two novel relative-query strategies, TopMatchings and GibbsMatchings, which can be applied on top of any network alignment method that constructs and solves a bipartite matching problem. Our methods identify the most informative nodes to query by sampling the matchings of the bipartite graph associated with the network-alignment instance. We compare the proposed approaches to several commonly-used query strategies and perform experiments on both synthetic and real-world datasets. Our sampling-based strategies yield the highest overall performance, outperforming all the baseline methods by more than 15 percentage points in some cases. In terms of accuracy, TopMatchings and GibbsMatchings perform comparably. However, GibbsMatchings is significantly more scalable, but it also requires hyperparameter tuning for a temperature parameter.

Submitted 6 September, 2017; v1 submitted 18 October, 2016; originally announced October 2016.

Comments: This is a pre-print of an article appearing at CIKM 2017

arXiv:1406.4173 [pdf, other]

A Divide-and-Conquer Algorithm for Betweenness Centrality

Authors: Dora Erdos, Vatche Ishakian, Azer Bestavros, Evimaria Terzi

Abstract: The problem of efficiently computing the betweenness centrality of nodes has been researched extensively. To date, the best known exact and centralized algorithm for this task is an algorithm proposed in 2001 by Brandes. The contribution of our paper is Brandes++, an algorithm for exact efficient computation of betweenness centrality. The crux of our algorithm is that we create a sketch of the graph, that we call the skeleton, by replacing subgraphs with simpler graph structures. Depending on the underlying graph structure, by using this skeleton and keeping appropriate summaries, Brandes++ can achieve significantly low running times in its computations. Extensive experimental evaluation on real-life datasets demonstrates the efficacy of our algorithm for different types of graphs. We release our code for the benefit of the research community.

Submitted 4 June, 2015; v1 submitted 16 June, 2014; originally announced June 2014.

Comments: Shorter version of this paper appeared in Siam Data Mining 2015

arXiv:1301.7455 [pdf, other]

Opinion Maximization in Social Networks

Authors: Aristides Gionis, Evimaria Terzi, Panayiotis Tsaparas

Abstract: The process of opinion formation through synthesis and contrast of different viewpoints has been the subject of many studies in economics and social sciences. Today, this process manifests itself also in online social networks and social media. The key characteristic of successful promotion campaigns is that they take into consideration such opinion-formation dynamics in order to create an overall favorable opinion about a specific information item, such as a person, a product, or an idea. In this paper, we adopt a well-established model for social-opinion dynamics and formalize the campaign-design problem as the problem of identifying a set of target individuals whose positive opinion about an information item will maximize the overall positive opinion for the item in the social network. We call this problem CAMPAIGN. We study the complexity of the CAMPAIGN problem, and design algorithms for solving it. Our experiments on real data demonstrate the efficiency and practical utility of our algorithms.

Submitted 30 January, 2013; originally announced January 2013.

Journal ref: Siam International Conference on Data Mining (SDM), 2013

arXiv:1201.6565 [pdf, other]

The Filter-Placement Problem and its Application to Minimizing Information Multiplicity

Authors: Dóra Erdös, Vatche Ishakian, Andrei Lapets, Evimaria Terzi, Azer Bestavros

Abstract: In many information networks, data items -- such as updates in social networks, news flowing through interconnected RSS feeds and blogs, measurements in sensor networks, route updates in ad-hoc networks -- propagate in an uncoordinated manner: nodes often relay information they receive to neighbors, independent of whether or not these neighbors received the same information from other sources. This uncoordinated data dissemination may result in significant, yet unnecessary communication and processing overheads, ultimately reducing the utility of information networks. To alleviate the negative impacts of this information multiplicity phenomenon, we propose that a subset of nodes (selected at key positions in the network) carry out additional information filtering functionality. Thus, nodes are responsible for the removal (or significant reduction) of the redundant data items relayed through them. We refer to such nodes as filters. We formally define the Filter Placement problem as a combinatorial optimization problem, and study its computational complexity for different types of graphs. We also present polynomial-time approximation algorithms and scalable heuristics for the problem. Our experimental results, which we obtained through extensive simulations on synthetic and real-world information flow networks, suggest that in many settings a relatively small number of filters are fairly effective in removing a large fraction of redundant information.

Submitted 31 January, 2012; originally announced January 2012.

Comments: VLDB2012

Journal ref: Proceedings of the VLDB Endowment (PVLDB), Vol. 5, No. 5, pp. 418-429 (2012)

arXiv:0810.5578 [pdf, ps, other]

Anonymizing Graphs

Authors: Tomas Feder, Shubha U. Nabar, Evimaria Terzi

Abstract: Motivated by recently discovered privacy attacks on social networks, we study the problem of anonymizing the underlying graph of interactions in a social network. We call a graph (k,l)-anonymous if for every node in the graph there exist at least k other nodes that share at least l of its neighbors. We consider two combinatorial problems arising from this notion of anonymity in graphs. More specifically, given an input graph we ask for the minimum number of edges to be added so that the graph becomes (k,l)-anonymous. We define two variants of this minimization problem and study their properties. We show that for certain values of k and l the problems are polynomial-time solvable, while for others they become NP-hard. Approximation algorithms for the latter cases are also given.

Submitted 30 October, 2008; originally announced October 2008.

Comments: 15 pages, 5 figures

arXiv:0809.3027 [pdf, ps, other]

Finding links and initiators: a graph reconstruction problem

Authors: Heikki Mannila, Evimaria Terzi

Abstract: Consider a 0-1 observation matrix M, where rows correspond to entities and columns correspond to signals; a value of 1 (or 0) in cell (i,j) of M indicates that signal j has been observed (or not observed) in entity i. Given such a matrix we study the problem of inferring the underlying directed links between entities (rows) and finding which entries in the matrix are initiators. We formally define this problem and propose an MCMC framework for estimating the links and the initiators given the matrix of observations M. We also show how this framework can be extended to incorporate a temporal aspect; instead of considering a single observation matrix M we consider a sequence of observation matrices M1,..., Mt over time. We show the connection between our problem and several problems studied in the field of social-network analysis. We apply our method to paleontological and ecological data and show that our algorithms work well in practice and give reasonable results.

Submitted 17 September, 2008; originally announced September 2008.

ACM Class: H.2.8

Showing 1–30 of 30 results for author: Terzi, E