Search | arXiv e-print repository

Succinct Coverage Oracles

Authors: Ioannis Antonellis, Anish Das Sarma, Shaddin Dughmi

Abstract: In this paper, we identify a fundamental algorithmic problem that we term succinct dynamic covering (SDC), arising in many modern-day web applications, including ad-serving and online recommendation systems in eBay and Netflix. Roughly speaking, SDC applies two restrictions to the well-studied Max-Coverage problem: Given an integer k, X={1,2,...,n} and I={S_1, ..., S_m}, S_i a subset of X, find a subset J of I, such that |J| <= k and the union of S in J is as large as possible. The two restrictions applied by SDC are: (1) Dynamic: At query-time, we are given a query Q, a subset of X, and our goal is to find J such that the intersection of Q with the union of S in J is as large as possible; (2) Space-constrained: We don't have enough space to store (and process) the entire input; specifically, we have o(mn), and maybe as little as O((m+n)polylog(mn)) space. The goal of SDC is to maintain a small data structure so as to answer most dynamic queries with high accuracy. We call such a scheme a Coverage Oracle. We present algorithms and complexity results for coverage oracles. We present deterministic and probabilistic near-tight upper and lower bounds on the approximation ratio of SDC as a function of the amount of space available to the oracle. Our lower bound results show that to obtain constant-factor approximations we need Omega(mn) space. Fortunately, our upper bounds present an explicit tradeoff between space and approximation ratio, allowing us to determine the amount of space needed to guarantee certain accuracy.
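The query-time objective in the abstract above (choose at most k sets whose union covers as much of Q as possible) can be illustrated with the standard greedy heuristic for Max-Coverage. This is only a baseline sketch, not the paper's coverage oracle: it assumes the entire input fits in memory, which is exactly what SDC's space constraint rules out. The function name `greedy_query_cover` and the example sets are hypothetical.

```python
from typing import Dict, List, Set


def greedy_query_cover(sets: Dict[str, Set[int]], query: Set[int], k: int) -> List[str]:
    """Greedy baseline (NOT the paper's oracle): pick up to k sets that
    together cover as many elements of `query` as possible.

    Assumes the full family of sets is stored in memory.
    """
    uncovered = set(query)      # elements of Q not yet covered
    chosen: List[str] = []      # names of the selected sets, in pick order
    for _ in range(k):
        candidates = [name for name in sets if name not in chosen]
        if not candidates:
            break
        # Pick the set that covers the most still-uncovered query elements.
        best = max(candidates, key=lambda name: len(sets[name] & uncovered))
        if not sets[best] & uncovered:
            break               # no remaining set improves coverage of Q
        chosen.append(best)
        uncovered -= sets[best]
    return chosen


# Hypothetical example data.
example_sets = {"S1": {1, 2, 3}, "S2": {3, 4}, "S3": {5, 6, 7, 8}}
picked = greedy_query_cover(example_sets, query={1, 2, 3, 4}, k=2)
covered = set().union(*(example_sets[name] for name in picked)) & {1, 2, 3, 4}
print(picked, len(covered))
```

Note that S3 is the largest set but is never chosen here, because it contains no element of the query; this is the sense in which the dynamic version differs from plain Max-Coverage.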

Submitted 8 April, 2010; v1 submitted 12 December, 2009; originally announced December 2009.

arXiv:0712.0499 [pdf, ps, other]

Simrank++: Query rewriting through link analysis of the click graph

Authors: Ioannis Antonellis, Hector Garcia-Molina, Chi-Chao Chang

Abstract: We focus on the problem of query rewriting for sponsored search. We base rewrites on a historical click graph that records the ads that have been clicked on in response to past user queries. Given a query q, we first consider Simrank as a way to identify queries similar to q, i.e., queries whose ads a user may be interested in. We argue that Simrank fails to properly identify query similarities in our application, and we present two enhanced versions of Simrank: one that exploits weights on click graph edges and another that exploits "evidence." We experimentally evaluate our new schemes against Simrank, using actual click graphs and queries from Yahoo!, and using a variety of metrics. Our results show that the enhanced methods can yield more and better query rewrites.
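For orientation, the baseline that the abstract above starts from, plain Simrank on a bipartite query-ad click graph, can be sketched as follows. This is a minimal illustration of the standard bipartite SimRank iteration only; it does not implement the weighted or evidence-based variants the paper proposes. The decay value 0.8, the iteration count, the function name `bipartite_simrank`, and the toy click graph are all assumptions made for the example.

```python
from itertools import product
from typing import Dict, Set, Tuple


def bipartite_simrank(clicks: Dict[str, Set[str]], decay: float = 0.8,
                      iterations: int = 5) -> Dict[Tuple[str, str], float]:
    """Plain bipartite SimRank (baseline only, not Simrank++).

    `clicks` maps each query to the set of ads clicked for it.
    Returns query-query similarity scores.
    """
    queries = sorted(clicks)
    ads = sorted({ad for clicked in clicks.values() for ad in clicked})
    # Reverse adjacency: which queries led to a click on each ad.
    ad_queries = {ad: {q for q in queries if ad in clicks[q]} for ad in ads}

    def identity(nodes):
        return {(x, y): 1.0 if x == y else 0.0 for x, y in product(nodes, repeat=2)}

    def update(nodes, neighbors, other_sim):
        # One SimRank step: average similarity of neighbor pairs, scaled by decay.
        new_sim = {}
        for x, y in product(nodes, repeat=2):
            nx, ny = neighbors[x], neighbors[y]
            if x == y:
                new_sim[(x, y)] = 1.0
            elif not nx or not ny:
                new_sim[(x, y)] = 0.0
            else:
                total = sum(other_sim[(i, j)] for i in nx for j in ny)
                new_sim[(x, y)] = decay * total / (len(nx) * len(ny))
        return new_sim

    query_sim, ad_sim = identity(queries), identity(ads)
    for _ in range(iterations):
        # Both updates read the scores from the previous iteration.
        query_sim, ad_sim = (update(queries, clicks, ad_sim),
                             update(ads, ad_queries, query_sim))
    return query_sim


# Hypothetical toy click graph.
toy_clicks = {"camera": {"hp"}, "digital camera": {"hp"}, "flower": {"teleflora"}}
sim = bipartite_simrank(toy_clicks)
print(sim[("camera", "digital camera")], sim[("camera", "flower")])
```

In this toy graph the two camera queries share their only ad, so their score equals the decay constant, while "flower" shares no ad path with either and scores zero.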

Submitted 4 December, 2007; originally announced December 2007.

Comments: Available via http://dbpubs.stanford.edu/pub/2007-32

Report number: Stanford University, Infolab TR 2007-32

arXiv:cs/0602076 [pdf, ps, other]

Exploring term-document matrices from matrix models in text mining

Authors: Ioannis Antonellis, Efstratios Gallopoulos

Abstract: We explore a matrix-space model that is a natural extension of the vector space model for Information Retrieval. Each document can be represented by a matrix that is based on document extracts (e.g. sentences, paragraphs, sections). We focus on the performance of this model for the specific case in which documents are originally represented as term-by-sentence matrices. We use the singular value decomposition to approximate the term-by-sentence matrices and assemble these results to form the pseudo-"term-document" matrix that forms the basis of a text mining method alternative to traditional VSM and LSI. We investigate the singular values of this matrix and provide experimental evidence suggesting that the method can be particularly effective in terms of accuracy for text collections with multi-topic documents, such as web pages with news.

Submitted 21 February, 2006; originally announced February 2006.

Comments: SIAM Text Mining Workshop, SIAM Conference Data Mining, 2006

Report number: 03/02-06

Showing 1–3 of 3 results for author: Antonellis, I