Succinct Coverage Oracles
Authors:
Ioannis Antonellis,
Anish Das Sarma,
Shaddin Dughmi
Abstract:
In this paper, we identify a fundamental algorithmic problem that we term succinct dynamic covering (SDC), arising in many modern-day web applications, including ad-serving and online recommendation systems in eBay and Netflix. Roughly speaking, SDC applies two restrictions to the well-studied Max-Coverage problem: Given an integer k, X={1,2,...,n} and I={S_1, ..., S_m}, S_i a subset of X, find…
▽ More
In this paper, we identify a fundamental algorithmic problem that we term succinct dynamic covering (SDC), arising in many modern-day web applications, including ad-serving and online recommendation systems in eBay and Netflix. Roughly speaking, SDC applies two restrictions to the well-studied Max-Coverage problem: Given an integer k, X={1,2,...,n} and I={S_1, ..., S_m}, S_i a subset of X, find a subset J of I, such that |J| <= k and the union of S in J is as large as possible. The two restrictions applied by SDC are: (1) Dynamic: At query-time, we are given a query Q, a subset of X, and our goal is to find J such that the intersection of Q with the union of S in J is as large as possible; (2) Space-constrained: We don't have enough space to store (and process) the entire input; specifically, we have o(mn), and maybe as little as O((m+n)polylog(mn)) space. The goal of SDC is to maintain a small data structure so as to answer most dynamic queries with high accuracy. We call such a scheme a Coverage Oracle.
We present algorithms and complexity results for coverage oracles. We present deterministic and probabilistic near-tight upper and lower bounds on the approximation ratio of SDC as a function of the amount of space available to the oracle. Our lower bound results show that to obtain constant-factor approximations we need Omega(mn) space. Fortunately, our upper bounds present an explicit tradeoff between space and approximation ratio, allowing us to determine the amount of space needed to guarantee certain accuracy.
△ Less
Submitted 8 April, 2010; v1 submitted 12 December, 2009;
originally announced December 2009.
Simrank++: Query rewriting through link analysis of the click graph
Authors:
Ioannis Antonellis,
Hector Garcia-Molina,
Chi-Chao Chang
Abstract:
We focus on the problem of query rewriting for sponsored search. We base rewrites on a historical click graph that records the ads that have been clicked on in response to past user queries. Given a query q, we first consider Simrank as a way to identify queries similar to q, i.e., queries whose ads a user may be interested in. We argue that Simrank fails to properly identify query similarities…
▽ More
We focus on the problem of query rewriting for sponsored search. We base rewrites on a historical click graph that records the ads that have been clicked on in response to past user queries. Given a query q, we first consider Simrank as a way to identify queries similar to q, i.e., queries whose ads a user may be interested in. We argue that Simrank fails to properly identify query similarities in our application, and we present two enhanced version of Simrank: one that exploits weights on click graph edges and another that exploits ``evidence.'' We experimentally evaluate our new schemes against Simrank, using actual click graphs and queries form Yahoo!, and using a variety of metrics. Our results show that the enhanced methods can yield more and better query rewrites.
△ Less
Submitted 4 December, 2007;
originally announced December 2007.
Exploring term-document matrices from matrix models in text mining
Authors:
Ioannis Antonellis,
Efstratios Gallopoulos
Abstract:
We explore a matrix-space model, that is a natural extension to the vector space model for Information Retrieval. Each document can be represented by a matrix that is based on document extracts (e.g. sentences, paragraphs, sections). We focus on the performance of this model for the specific case in which documents are originally represented as term-by-sentence matrices. We use the singular valu…
▽ More
We explore a matrix-space model, that is a natural extension to the vector space model for Information Retrieval. Each document can be represented by a matrix that is based on document extracts (e.g. sentences, paragraphs, sections). We focus on the performance of this model for the specific case in which documents are originally represented as term-by-sentence matrices. We use the singular value decomposition to approximate the term-by-sentence matrices and assemble these results to form the pseudo-``term-document'' matrix that forms the basis of a text mining method alternative to traditional VSM and LSI. We investigate the singular values of this matrix and provide experimental evidence suggesting that the method can be particularly effective in terms of accuracy for text collections with multi-topic documents, such as web pages with news.
△ Less
Submitted 21 February, 2006;
originally announced February 2006.