-
Mesterséges Intelligencia Kutatások Magyarországon (Artificial Intelligence Research in Hungary)
Authors:
András A. Benczúr,
Tibor Gyimóthy,
Balázs Szegedy
Abstract:
Artificial intelligence (AI) has undergone remarkable development since the mid-2000s, particularly in the fields of machine learning and deep learning, driven by the explosive growth of large databases and computational capacity. Hungarian researchers recognized the significance of AI early on, actively participating in international research and achieving significant results in both theoretical and practical domains. This article presents some key achievements in Hungarian AI research. It highlights the results from the period before the rise of deep learning (the early 2010s), then discusses major theoretical advancements in Hungary after 2010. Finally, it provides a brief overview of AI-related applied scientific achievements from 2010 onward.
Submitted 24 February, 2025;
originally announced March 2025.
-
Generalized Naive Bayes
Authors:
Edith Alice Kovács,
Anna Ország,
Dániel Pfeifer,
András Benczúr
Abstract:
In this paper we introduce the so-called Generalized Naive Bayes structure as an extension of the Naive Bayes structure. We give a new greedy algorithm that finds a well-fitting Generalized Naive Bayes (GNB) probability distribution. We prove that this fits the data at least as well as the probability distribution determined by the classical Naive Bayes (NB). Then, under a not very restrictive condition, we give a second algorithm for which we can prove that it finds the optimal GNB probability distribution, i.e., the best-fitting structure in the sense of KL divergence. Both algorithms are constructed to maximize the information content and aim to minimize redundancy. Based on these algorithms, new methods for feature selection are introduced. We discuss the similarities and differences to other related algorithms in terms of structure, methodology, and complexity. Experimental results show that the algorithms introduced outperform the related algorithms in many cases.
Submitted 28 August, 2024;
originally announced August 2024.
-
A finite-sample generalization bound for stable LPV systems
Authors:
Daniel Racz,
Martin Gonzalez,
Mihaly Petreczky,
Andras Benczur,
Balint Daroczy
Abstract:
One of the main theoretical challenges in learning dynamical systems from data is providing upper bounds on the generalization error, that is, the difference between the expected prediction error and the empirical prediction error measured on some finite sample. In machine learning, a popular class of such bounds is the so-called Probably Approximately Correct (PAC) bounds. In this paper, we derive a PAC bound for stable continuous-time linear parameter-varying (LPV) systems. Our bound depends on the H2 norm of the chosen class of LPV systems, but does not depend on the time interval for which the signals are considered.
Submitted 21 May, 2024; v1 submitted 16 May, 2024;
originally announced May 2024.
-
Theoretical Evaluation of Asymmetric Shapley Values for Root-Cause Analysis
Authors:
Domokos M. Kelen,
Mihály Petreczky,
Péter Kersch,
András A. Benczúr
Abstract:
In this work, we examine Asymmetric Shapley Values (ASV), a variant of the popular SHAP additive local explanation method. ASV proposes a way to improve model explanations incorporating known causal relations between variables, and is also considered as a way to test for unfair discrimination in model predictions. Unexplored in previous literature, relaxing symmetry in Shapley values can have counter-intuitive consequences for model explanation. To better understand the method, we first show how local contributions correspond to global contributions of variance reduction. Using variance, we demonstrate multiple cases where ASV yields counter-intuitive attributions, arguably producing incorrect results for root-cause analysis. Second, we identify generalized additive models (GAM) as a restricted class for which ASV exhibits desirable properties. We support our arguments by proving multiple theoretical results about the method. Finally, we demonstrate the use of asymmetric attributions on multiple real-world datasets, comparing the results with and without restricted model families using gradient boosting and deep learning models.
Submitted 15 October, 2023;
originally announced October 2023.
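The effect of relaxing symmetry described in the abstract above can be illustrated with a minimal sketch. It assumes one common formulation of asymmetric Shapley values, in which marginal contributions are averaged only over feature orderings consistent with a known causal order. The function name `shapley`, the `allowed` filter, and the two-feature toy game are hypothetical and are not taken from the paper.

```python
from itertools import permutations


def shapley(players, value, allowed=lambda order: True):
    """Average marginal contributions over the permitted player orderings."""
    orders = [o for o in permutations(players) if allowed(o)]
    totals = {p: 0.0 for p in players}
    for order in orders:
        coalition = frozenset()
        for p in order:
            # Marginal gain of adding p to the players that precede it.
            totals[p] += value(coalition | {p}) - value(coalition)
            coalition = coalition | {p}
    return {p: t / len(orders) for p, t in totals.items()}


def value(coalition):
    # Hypothetical toy game: either feature alone explains the full outcome.
    return 1.0 if coalition else 0.0


# Classical (symmetric) Shapley values: all orderings are allowed.
symmetric = shapley(["x", "y"], value)

# Asymmetric variant: keep only orderings where the assumed cause "x"
# comes before its assumed effect "y".
asymmetric = shapley(
    ["x", "y"], value, allowed=lambda o: o.index("x") < o.index("y")
)
```

In this toy game the symmetric attribution splits the credit evenly between the two redundant features, while the asymmetric one assigns all of it to the feature placed first in the causal order.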
-
Constructing and sampling partite, $3$-uniform hypergraphs with given degree sequence
Authors:
Andras Hubai,
Tamas Robert Mezei,
Ferenc Beres,
Andras Benczur,
Istvan Miklos
Abstract:
Partite, $3$-uniform hypergraphs are $3$-uniform hypergraphs in which each hyperedge contains exactly one point from each of the $3$ disjoint vertex classes. We consider the degree sequence problem of partite, $3$-uniform hypergraphs, that is, to decide if such a hypergraph with prescribed degree sequences exists. We prove that this decision problem is NP-complete in general, and give a polynomial running time algorithm for third almost-regular degree sequences, that is, when each degree in one of the vertex classes is $k$ or $k-1$ for some fixed $k$, and there is no restriction for the other two vertex classes. We also consider the sampling problem, that is, to uniformly sample partite, $3$-uniform hypergraphs with prescribed degree sequences. We propose a Parallel Tempering method, where the hypothetical energy of the hypergraphs measures the deviation from the prescribed degree sequence. The method has been implemented and tested on synthetic and real data. It can also be applied for $χ^2$ testing of contingency tables. We have shown that this hypergraph-based $χ^2$ test is more sensitive than the standard $χ^2$ test. The extra sensitivity is especially advantageous on small data sets, where the proposed Parallel Tempering method shows promising performance.
Submitted 25 August, 2023;
originally announced August 2023.
-
ethp2psim: Evaluating and deploying privacy-enhanced peer-to-peer routing protocols for the Ethereum network
Authors:
Ferenc Béres,
István András Seres,
Domokos M. Kelen,
András A. Benczúr
Abstract:
Network-level privacy is the Achilles heel of financial privacy in cryptocurrencies. Financial privacy amounts to achieving and maintaining blockchain- and network-level privacy. Blockchain-level privacy recently received substantial attention. Specifically, several privacy-enhancing technologies were proposed and deployed to enhance blockchain-level privacy. On the other hand, network-level privacy, i.e., privacy on the peer-to-peer layer, has seen far less attention and development. In this work, we aim to provide a peer-to-peer network simulator, ethp2psim, that allows researchers to evaluate the privacy guarantees of privacy-enhanced broadcast and message routing algorithms. Our goal is two-fold. First, we want to enable researchers to implement their proposed protocols in our modular simulator framework. Second, our simulator allows researchers to evaluate the privacy guarantees of privacy-enhanced routing algorithms. Finally, ethp2psim can help choose the right protocol parameters for efficient, robust, and private deployment.
Submitted 26 June, 2023;
originally announced June 2023.
-
Vaccine skepticism detection by network embedding
Authors:
Ferenc Béres,
Rita Csoma,
Tamás Vilmos Michaletzky,
András A. Benczúr
Abstract:
We demonstrate the applicability of network embedding to vaccine skepticism, a controversial topic with a long history. With the Covid-19 pandemic outbreak at the end of 2019, the topic is more important than ever. Only a year after the first international cases were registered, multiple vaccines were developed and passed clinical testing. Besides the challenges of development, testing, and logistics, another factor that might play a significant role in the fight against the pandemic is people who are hesitant to get vaccinated, or who even state that they will refuse any vaccine offered to them. Two groups of people are commonly distinguished: (a) pro-vaxxers, who support vaccinating people, and (b) vax-skeptics, who question vaccine efficacy or the need for general vaccination against Covid-19. It is very difficult to tell exactly how many people share each of these views. It is even more difficult to understand all the reasons why vax-skeptic opinions are getting more popular. In this work, our intention was to develop techniques that are able to efficiently differentiate between pro-vaxxer and vax-skeptic content. After multiple data preprocessing steps, we analyzed the tweet text as well as the structure of user interactions on Twitter. We deployed several node embedding and community detection models that scale well for graphs with millions of edges.
Submitted 20 October, 2021;
originally announced October 2021.
-
System-aware dynamic partitioning for batch and streaming workloads
Authors:
Zoltán Zvara,
Péter G. N. Szabó,
Balázs Barnabás Lóránt,
András A. Benczúr
Abstract:
When processing data streams with highly skewed and nonstationary key distributions, we often observe overloaded partitions when hash partitioning fails to balance data correctly. To avoid slow tasks that delay the completion of the whole stage of computation, it is necessary to apply adaptive, on-the-fly partitioning that continuously recomputes an optimal partitioner, given the observed key distribution. While such solutions exist for batch processing of static data sets and stateless stream processing, the task is difficult for long-running stateful streaming jobs where the key distribution changes over time. Careful checkpointing and operator state migration are necessary to change the partitioning while the operation is running.
Our key result is a lightweight on-the-fly Dynamic Repartitioning (DR) module for distributed data processing systems (DDPS), including Apache Spark and Flink, which improves the performance with negligible overhead. DR can adaptively repartition data during execution using our Key Isolator Partitioner (KIP). In our experiments with real workloads and power-law distributions, we reach a speedup of 1.5-6 for a variety of Spark and Flink jobs.
Submitted 31 May, 2021;
originally announced May 2021.
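The idea of isolating heavy keys under a skewed distribution, as described in the abstract above, can be sketched roughly as follows. This is not the authors' Key Isolator Partitioner, whose details the abstract does not give. It is a simplified, hypothetical illustration in which the most frequent observed keys are placed greedily on the least-loaded partition and all other keys fall back to hashing. The names `hash_partition` and `build_isolating_partitioner` are assumptions.

```python
from collections import Counter
import zlib


def hash_partition(key, num_partitions):
    # crc32 is stable across runs, unlike Python's salted built-in hash().
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def build_isolating_partitioner(key_counts, num_partitions, num_heavy):
    """Pin the heaviest keys to the least-loaded partitions; hash the rest."""
    loads = [0] * num_partitions
    explicit = {}
    for key, count in key_counts.most_common(num_heavy):
        # Greedy choice: the partition with the smallest pinned load so far.
        target = min(range(num_partitions), key=loads.__getitem__)
        explicit[key] = target
        loads[target] += count

    def partition(key):
        if key in explicit:
            return explicit[key]
        return hash_partition(key, num_partitions)

    return partition


# Hypothetical observed key histogram with two dominant keys.
observed = Counter({"a": 100, "b": 90, "c": 1})
partition = build_isolating_partitioner(observed, num_partitions=2, num_heavy=2)
```

With this histogram the two dominant keys are placed on different partitions, which plain hashing cannot guarantee. A real system would also have to recompute the mapping as the distribution drifts and migrate operator state, which this sketch leaves out.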
-
Blockchain is Watching You: Profiling and Deanonymizing Ethereum Users
Authors:
Ferenc Béres,
István András Seres,
András A. Benczúr,
Mikerah Quintyne-Collins
Abstract:
Ethereum is the largest public blockchain by usage. It applies an account-based model, which is inferior to Bitcoin's unspent transaction output model from a privacy perspective. Due to its privacy shortcomings, several privacy-enhancing overlays have recently been deployed on Ethereum, such as non-custodial, trustless coin mixers and confidential transactions. In our privacy analysis of Ethereum's account-based model, we describe several patterns that characterize only a limited set of users and successfully apply these quasi-identifiers in address deanonymization tasks. Using Ethereum Name Service identifiers as ground truth information, we quantitatively compare algorithms in a recent branch of machine learning, so-called graph representation learning, as well as time-of-day activity and transaction fee based user profiling techniques. As an application, we rigorously assess the privacy guarantees of the Tornado Cash coin mixer by discovering strong heuristics to link the mixing parties. To the best of our knowledge, we are the first to propose and implement Ethereum user profiling techniques based on quasi-identifiers. Finally, we describe a malicious value-fingerprinting attack, a variant of the Danaan-gift attack, applicable to the confidential transaction overlays on Ethereum. By incorporating user activity statistics from our data set, we estimate the success probability of such an attack.
Submitted 13 October, 2020; v1 submitted 28 May, 2020;
originally announced May 2020.
-
Tangent Space Separability in Feedforward Neural Networks
Authors:
Bálint Daróczy,
Rita Aleksziev,
András Benczúr
Abstract:
Hierarchical neural networks are exponentially more efficient than their corresponding "shallow" counterparts with the same expressive power, but involve a huge number of parameters and require tedious amounts of training. By approximating the tangent subspace, we suggest a sparse representation that enables switching to shallow networks, GradNet, after a very early training stage. Our experiments show that the proposed approximation of the metric improves and sometimes even significantly surpasses the achievable performance of the original network, even after a few epochs of training the original feedforward network.
Submitted 18 December, 2019;
originally announced December 2019.
-
A Cryptoeconomic Traffic Analysis of Bitcoin's Lightning Network
Authors:
Ferenc Beres,
Istvan Andras Seres,
Andras A. Benczur
Abstract:
Lightning Network (LN) is designed to amend the scalability and privacy issues of Bitcoin. It's a payment channel network where Bitcoin transactions are issued off chain, onion routed through a private payment path with the aim to settle transactions in a faster, cheaper, and private manner, as they're not recorded in a costly-to-maintain, slow, and public ledger. In this work, we design a traffic simulator to empirically study LN's transaction fees and privacy provisions. The simulator relies on publicly available data of the network structure and generates transactions under assumptions we attempt to validate based on information spread by certain blog posts of LN node owners. Our findings on the estimated revenue from transaction fees are in line with widespread opinion that participation is economically irrational for the majority of large routing nodes who currently hold the network together. Either traffic or transaction fees must increase by orders of magnitude to make payment routing economically viable. We give worst-case estimates for the potential fee increase by assuming strong price competition among the routers. We estimate how current channel structures and pricing policies respond to a potential increase in traffic, how reduction in locked funds on channels would affect the network, and show examples of nodes who are estimated to operate with economically feasible revenue. Even if transactions are onion routed, strong statistical evidence on payment source and destination can be inferred, as many transaction paths only consist of a single intermediary by the side effect of LN's small-world nature. Based on our simulation experiments, we quantitatively characterize the privacy shortcomings of current LN operation, and propose a method to inject additional hops in routing paths to demonstrate how privacy can be strengthened with very little additional transactional cost.
Submitted 13 July, 2020; v1 submitted 21 November, 2019;
originally announced November 2019.
-
Expressive power of outer product manifolds on feed-forward neural networks
Authors:
Bálint Daróczy,
Rita Aleksziev,
András Benczúr
Abstract:
Hierarchical neural networks are exponentially more efficient than their corresponding "shallow" counterparts with the same expressive power, but involve a huge number of parameters and require tedious amounts of training. Our main idea is to mathematically understand and describe the hierarchical structure of feedforward neural networks by reparametrization invariant Riemannian metrics. By computing or approximating the tangent subspace, we better utilize the original network via sparse representations that enable switching to shallow networks after a very early training stage. Our experiments show that the proposed approximation of the metric improves and sometimes even significantly surpasses the achievable performance of the original network, even after a few epochs of training the original feedforward network.
Submitted 17 July, 2018;
originally announced July 2018.
-
Online Machine Learning in Big Data Streams
Authors:
András A. Benczúr,
Levente Kocsis,
Róbert Pálovics
Abstract:
The area of online machine learning in big data streams covers algorithms that are (1) distributed and (2) work from data streams with only a limited possibility to store past data. The first requirement mostly concerns software architectures and efficient algorithms. The second one also imposes nontrivial theoretical restrictions on the modeling methods: In the data stream model, older data is no longer available to revise earlier suboptimal modeling decisions as the fresh data arrives.
In this article, we provide an overview of distributed software architectures and libraries as well as machine learning models for online learning. We highlight the most important ideas for classification, regression, recommendation, and unsupervised modeling from streaming data, and we show how they are implemented in various distributed data stream processing systems.
This article is a reference material and not a survey. We do not attempt to be comprehensive in describing all existing methods and solutions; rather, we give pointers to the most important resources in the field. All related sub-fields, online algorithms, online learning, and distributed data processing are hugely dominant in current research and development with conceptually new research results and software components emerging at the time of writing. In this article, we refer to several survey results, both for distributed data processing and for online machine learning. Compared to past surveys, our article is different because we discuss recommender systems in extended detail.
Submitted 16 February, 2018;
originally announced February 2018.
-
Raising Graphs From Randomness to Reveal Information Networks
Authors:
Róbert Pálovics,
András A. Benczúr
Abstract:
We analyze the fine-grained connections between the average degree and the power-law degree distribution exponent in growing information networks. Our starting observation is a power-law degree distribution with a decreasing exponent and increasing average degree as a function of the network size. Our experiments are based on three Twitter at-mention networks and three more from the Koblenz Network Collection. We observe that popular network models cannot explain decreasing power-law degree distribution exponent and increasing average degree at the same time.
We propose a model that is the combination of exponential growth and a power-law developing network, in which new "homophily" edges are continuously added to nodes proportional to their current homophily degree. Parameters of the average degree growth and the power-law degree distribution exponent functions depend on the ratio of the network growth exponent parameters. Specifically, we connect the growth of the average degree to the decreasing exponent of the power-law degree distribution. Prior to our work, only one of the two cases was handled. Existing models and even their combinations can only reproduce some of our key new observations in growing information networks.
Submitted 2 January, 2017;
originally announced January 2017.
-
Item-to-item recommendation based on Contextual Fisher Information
Authors:
Bálint Daróczy,
Frederick Ayala-Gómez,
András Benczúr
Abstract:
Web recommendation services bear great importance in e-commerce, as they aid the user in navigating through the items that are most relevant to her needs. On a typical Web site, a long history of previous activities or purchases by the user is rarely available. Hence in most cases, recommenders propose items that are similar to the most recent ones viewed in the current user session. The corresponding task is called session-based item-to-item recommendation. For frequent items, it is easy to present item-to-item recommendations by "people who viewed this, also viewed" lists. However, most of the items belong to the long tail, where previous actions are sparsely available. Another difficulty is the so-called cold start problem, when the item has recently appeared and has not yet had time to accumulate a sufficient number of transactions. In order to recommend a next item in a session in sparse or cold start situations, we also have to incorporate item similarity models. In this paper we describe a probabilistic similarity model based on Random Fields to approximate item-to-item transition probabilities. We give a generative model for the item interactions based on arbitrary distance measures over the items, including explicit and implicit ratings and external metadata. The model may change in time to better fit recent events and recommend the next item based on the updated Fisher Information. Our new model outperforms both simple similarity baseline methods and recent item-to-item recommenders, under several different performance metrics and publicly available data sets. We reach significant gains in particular for recommending a new item following a rare item.
Submitted 8 November, 2016; v1 submitted 7 November, 2016;
originally announced November 2016.
-
Statistical analysis of NOMAO customer votes for spots of France
Authors:
Robert Palovics,
Balint Daroczy,
Andras Benczur,
Julia Pap,
Leonardo Ermann,
Samuel Phan,
Alexei D. Chepelianskii,
Dima L. Shepelyansky
Abstract:
We investigate the statistical properties of votes of customers for spots of France collected by the startup company NOMAO. The frequencies of votes per spot and per customer are characterized by power-law distributions that remain stable on a time scale of a decade when the number of votes is varied by almost two orders of magnitude. Using computer science methods, we explore the spectrum and the eigenvalues of a matrix containing user ratings to geolocalized items. Eigenvalues nicely map to large towns and regions but show a certain level of instability as we modify the interpretation of the underlying matrix. We evaluate imputation strategies that provide improved prediction performance by reaching geographically smooth eigenvectors. We point to possible links between the distribution of votes and the phenomenon of self-organized criticality.
Submitted 12 May, 2015;
originally announced May 2015.
-
Temporal influence over the Last.fm social network
Authors:
Róbert Pálovics,
András A. Benczúr
Abstract:
Several recent results show that social contacts influence the spread of certain properties over the network, but others question the methodology of these experiments by proposing that the measured effects may be due to homophily or a shared environment. In this paper we justify the existence of social influence by considering the temporal behavior of Last.fm users. In order to clearly distinguish influence from friends merely sharing the same interest, especially since Last.fm recommends friends based on similarity of taste, we separated the timeless effect of similar taste from the temporal impulses of immediately listening to the same artist after a friend. We measured a strong increase in listening to a completely new artist within a few hours after a friend, compared to non-friends representing a simple trend or external influence. In our experiment to eliminate network-independent elements of taste, we improved collaborative filtering and trend-based methods by blending them with simple time-aware recommendations based on the influence of friends. Our experiments are carried out over the two-year "scrobble" history of 70,000 Last.fm users.
Submitted 28 July, 2013;
originally announced July 2013.
-
Time evolution of Wikipedia network ranking
Authors:
Young-Ho Eom,
Klaus M. Frahm,
András Benczúr,
Dima L. Shepelyansky
Abstract:
We study the time evolution of ranking and spectral properties of the Google matrix of the English Wikipedia hyperlink network during the years 2003-2011. The statistical properties of the ranking of Wikipedia articles via PageRank and CheiRank probabilities, as well as the matrix spectrum, are shown to be stabilized for 2007-2011. Special emphasis is placed on the ranking of Wikipedia personalities and universities. We show that PageRank selection is dominated by politicians, while 2DRank, which combines PageRank and CheiRank, gives more weight to personalities from the arts. The Wikipedia PageRank of universities recovers 80 percent of the top universities of the Shanghai ranking during the considered time period.
Submitted 31 October, 2013; v1 submitted 24 April, 2013;
originally announced April 2013.
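The two rankings named in the abstract above can be sketched on a toy graph. The example assumes the standard definitions: PageRank is the stationary vector of the damped random walk on the link graph, and CheiRank is the same quantity computed on the graph with all links reversed. The damping factor 0.85, the five-link example graph, and the function names are illustrative assumptions, not values from the paper, and 2DRank is not implemented.

```python
def pagerank(links, n, damping=0.85, iters=100):
    """Power iteration on a directed graph given as (source, target) pairs."""
    out = [[] for _ in range(n)]
    for s, t in links:
        out[s].append(t)
    rank = [1.0 / n] * n
    for _ in range(iters):
        # Teleportation term, shared equally by all nodes.
        new = [(1.0 - damping) / n] * n
        for s in range(n):
            if out[s]:
                share = damping * rank[s] / len(out[s])
                for t in out[s]:
                    new[t] += share
            else:
                # Dangling node: spread its probability mass uniformly.
                for t in range(n):
                    new[t] += damping * rank[s] / n
        rank = new
    return rank


def cheirank(links, n, **kwargs):
    """PageRank of the graph with every link direction inverted."""
    return pagerank([(t, s) for s, t in links], n, **kwargs)


# Toy graph: node 0 links out to everyone, node 3 is linked by everyone.
links = [(0, 1), (0, 2), (0, 3), (1, 3), (2, 3)]
pr = pagerank(links, 4)
cr = cheirank(links, 4)
```

On this graph PageRank favors node 3, which collects incoming links, while CheiRank favors node 0, which emits them. This is the sense in which the two rankings capture complementary properties of a network.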
-
Métodos para la Selección y el Ajuste de Características en el Problema de la Detección de Spam (Methods for Feature Selection and Adjustment in the Spam Detection Problem)
Authors:
Carlos M. Lorenzetti,
Rocío L. Cecchini,
Ana G. Maguitman,
András A. Benczúr
Abstract:
Email is used daily by millions of people to communicate around the globe, and it is a mission-critical application for many businesses. Over the last decade, unsolicited bulk email has become a major problem for email users. An overwhelming amount of spam is flowing into users' mailboxes daily. In 2004, an estimated 62% of all email was attributed to spam. Spam is not only frustrating for most email users, it also strains the IT infrastructure of organizations and costs businesses billions of dollars in lost productivity. In recent years, spam has evolved from an annoyance into a serious security threat, and is now a prime medium for phishing of sensitive information, as well as the spread of malicious software. This work presents a first approach to attacking the spam problem. We propose an algorithm that will improve a classifier's results by adjusting its training set data. It improves the document's vocabulary representation by detecting good topic descriptors and discriminators.
Submitted 14 October, 2010; v1 submitted 1 June, 2010;
originally announced June 2010.
-
Randomized Approximation Schemes for Cuts and Flows in Capacitated Graphs
Authors:
Andras Benczur,
David R. Karger
Abstract:
We improve on random sampling techniques for approximately solving problems that involve cuts and flows in graphs. We give a near-linear-time construction that transforms any graph on n vertices into an O(n\log n)-edge graph on the same vertices whose cuts have approximately the same value as the original graph's. In this new graph, for example, we can run the O(m^{3/2})-time maximum flow algorithm of Goldberg and Rao to find an s--t minimum cut in O(n^{3/2}) time. This corresponds to a (1+epsilon)-times minimum s--t cut in the original graph. In a similar way, we can approximate a sparsest cut to within O(log n) in O(n^2) time using a previous O(mn)-time algorithm. A related approach leads to a randomized divide and conquer algorithm producing an approximately maximum flow in O(m sqrt{n}) time.
Submitted 23 July, 2002;
originally announced July 2002.