
Showing 1–50 of 79 results for author: Phillips, J M

Searching in archive cs.
  1. arXiv:2509.07587  [pdf, ps, other]

    cs.DS

    The General Expiration Streaming Model: Diameter, $k$-Center, Counting, Sampling, and Friends

    Authors: Lotte Blank, Sergio Cabello, MohammadTaghi Hajiaghayi, Robert Krauthgamer, Sepideh Mahabadi, André Nusser, Jeff M. Phillips, Jonas Sauer

    Abstract: An important thread in the study of data-stream algorithms focuses on settings where stream items are active only for a limited time. We introduce a new expiration model, where each item arrives with its own expiration time. The special case where items expire in the order that they arrive, which we call consistent expirations, contains the classical sliding-window model of Datar, Gionis, Indyk, a…

    Submitted 9 September, 2025; originally announced September 2025.

  2. arXiv:2506.00051  [pdf, ps, other]

    cs.CY

    Beyond Monoliths: Expert Orchestration for More Capable, Democratic, and Safe Large Language Models

    Authors: Philip Quirke, Narmeen Oozeer, Chaithanya Bandi, Amir Abdullah, Jason Hoelscher-Obermaier, Jeff M. Phillips, Joshua Greaves, Clement Neo, Fazl Barez, Shriyash Upadhyay

    Abstract: This position paper argues that the prevailing trajectory toward ever larger, more expensive generalist foundation models controlled by a handful of big companies limits innovation and constrains progress. We challenge this approach by advocating for an "Expert Orchestration" framework as a superior alternative that democratizes LLM advancement. Our proposed framework intelligently selects from th…

    Submitted 28 May, 2025; originally announced June 2025.

    Comments: 9 pages, 2 figures

  3. arXiv:2505.02472  [pdf, other]

    cs.CG

    Trajectory Minimum Touching Ball

    Authors: Jeff M. Phillips, Jens Kristian Refsgaard Schou

    Abstract: We present algorithms to find the minimum radius sphere that intersects every trajectory in a set of $n$ trajectories composed of at most $k$ line segments each. When $k=1$, we can reduce the problem to the LP-type framework to achieve a linear time complexity. For $k \geq 4$ we provide a trajectory configuration with unbounded LP-type complexity, but also present an almost…

    Submitted 5 May, 2025; originally announced May 2025.

  4. arXiv:2504.11299  [pdf, other]

    stat.CO cs.CG cs.LG

    Efficient and Stable Multi-Dimensional Kolmogorov-Smirnov Distance

    Authors: Peter Matthew Jacobs, Foad Namjoo, Jeff M. Phillips

    Abstract: We revisit extending the Kolmogorov-Smirnov distance between probability distributions to the multidimensional setting and make new arguments about the proper way to approach this generalization. Our proposed formulation maximizes the difference over orthogonal dominating rectangular ranges (d-sided rectangles in R^d), and is an integral probability metric. We also prove that the distance between…

    Submitted 15 April, 2025; originally announced April 2025.

    Comments: 21 pages, Primary: stat.CO. Secondary: cs.CG, cs.LG

  5. arXiv:2504.05627  [pdf, other]

    cs.LG

    Maternal and Fetal Health Status Assessment by Using Machine Learning on Optical 3D Body Scans

    Authors: Ruting Cheng, Yijiang Zheng, Boyuan Feng, Chuhui Qiu, Zhuoxin Long, Joaquin A. Calderon, Xiaoke Zhang, Jaclyn M. Phillips, James K. Hahn

    Abstract: Monitoring maternal and fetal health during pregnancy is crucial for preventing adverse outcomes. While tests such as ultrasound scans offer high accuracy, they can be costly and inconvenient. Telehealth and more accessible body shape information provide pregnant women with a convenient way to monitor their health. This study explores the potential of 3D body scan data, captured during the 18-24 g…

    Submitted 7 April, 2025; originally announced April 2025.

  6. arXiv:2502.11324  [pdf, other]

    stat.ML cs.LG

    Robust High-Dimensional Mean Estimation With Low Data Size, an Empirical Study

    Authors: Cullen Anderson, Jeff M. Phillips

    Abstract: Robust statistics aims to compute quantities to represent data where a fraction of it may be arbitrarily corrupted. The most essential statistic is the mean, and in recent years, there has been a flurry of theoretical advancement for efficiently estimating the mean in high dimensions on corrupted data. While several algorithms have been proposed that achieve near-optimal error, they all rely on la…

    Submitted 16 February, 2025; originally announced February 2025.

    Journal ref: Transactions on Machine Learning Research (TMLR), February 2025, ISSN: 2835-8856

  7. Fast Comparative Analysis of Merge Trees Using Locality Sensitive Hashing

    Authors: Weiran Lyu, Raghavendra Sridharamurthy, Jeff M. Phillips, Bei Wang

    Abstract: Scalar field comparison is a fundamental task in scientific visualization. In topological data analysis, we compare topological descriptors of scalar fields -- such as persistence diagrams and merge trees -- because they provide succinct and robust abstract representations. Several similarity measures for topological descriptors seem to be both asymptotically and practically efficient with polynom…

    Submitted 12 September, 2024; originally announced September 2024.

    Comments: IEEE VIS 2024

  8. arXiv:2402.05280  [pdf, ps, other]

    cs.LG cs.CG

    No Dimensional Sampling Coresets for Classification

    Authors: Meysam Alishahi, Jeff M. Phillips

    Abstract: We refine and generalize what is known about coresets for classification problems via the sensitivity sampling framework. Such coresets seek the smallest possible subsets of input data, so one can optimize a loss function on the coreset and ensure approximation guarantees with respect to the original data. Our analysis provides the first no dimensional coresets, so the size does not depend on the…

    Submitted 22 July, 2024; v1 submitted 7 February, 2024; originally announced February 2024.

    Comments: 42 Pages

  9. arXiv:2402.02746  [pdf, other]

    cs.LG stat.ML

    Standard Gaussian Process is All You Need for High-Dimensional Bayesian Optimization

    Authors: Zhitong Xu, Haitao Wang, Jeff M Phillips, Shandian Zhe

    Abstract: A long-standing belief holds that Bayesian Optimization (BO) with standard Gaussian processes (GP) -- referred to as standard BO -- underperforms in high-dimensional optimization problems. While this belief seems plausible, it lacks both robust empirical evidence and theoretical justification. To address this gap, we present a systematic investigation. First, through a comprehensive evaluation acr…

    Submitted 11 March, 2025; v1 submitted 5 February, 2024; originally announced February 2024.

    Comments: ICLR 2025 camera-ready version

  10. arXiv:2311.05651  [pdf, other]

    cs.CG cs.DS cs.LG

    On Mergable Coresets for Polytope Distance

    Authors: Benwei Shi, Aditya Bhaskara, Wai Ming Tai, Jeff M. Phillips

    Abstract: We show that a constant-size constant-error coreset for polytope distance is simple to maintain under merges of coresets. However, increasing the size cannot improve the error bound significantly beyond that constant.

    Submitted 8 November, 2023; originally announced November 2023.

    Comments: Presented in SoCG'19 Young Researchers Forum (CG:YRF)

    ACM Class: I.3.5

  11. arXiv:2311.03393  [pdf, other]

    cs.DB cs.AI

    Sketching Multidimensional Time Series for Fast Discord Mining

    Authors: Chin-Chia Michael Yeh, Yan Zheng, Menghai Pan, Huiyuan Chen, Zhongfang Zhuang, Junpeng Wang, Liang Wang, Wei Zhang, Jeff M. Phillips, Eamonn Keogh

    Abstract: Time series discords are a useful primitive for time series anomaly detection, and the matrix profile is capable of capturing discord effectively. There exist many research efforts to improve the scalability of discord discovery with respect to the length of time series. However, there is surprisingly little work focused on reducing the time complexity of matrix profile computation associated with…

    Submitted 7 December, 2023; v1 submitted 5 November, 2023; originally announced November 2023.

  12. arXiv:2310.03919  [pdf, other]

    cs.IR cs.AI cs.LG

    An Efficient Content-based Time Series Retrieval System

    Authors: Chin-Chia Michael Yeh, Huiyuan Chen, Xin Dai, Yan Zheng, Junpeng Wang, Vivian Lai, Yujie Fan, Audrey Der, Zhongfang Zhuang, Liang Wang, Wei Zhang, Jeff M. Phillips

    Abstract: A Content-based Time Series Retrieval (CTSR) system is an information retrieval system for users to interact with time series emerged from multiple domains, such as finance, healthcare, and manufacturing. For example, users seeking to learn more about the source of a time series can submit the time series as a query to the CTSR system and retrieve a list of relevant time series with associated met…

    Submitted 5 October, 2023; originally announced October 2023.

  13. arXiv:2308.07418  [pdf, other]

    cs.LG stat.ML

    Locally Adaptive and Differentiable Regression

    Authors: Mingxuan Han, Varun Shankar, Jeff M Phillips, Chenglong Ye

    Abstract: Over-parameterized models like deep nets and random forests have become very popular in machine learning. However, the natural goals of continuity and differentiability, common in regression models, are now often ignored in modern overparametrized, locally-adaptive models. We propose a general framework to construct a global continuous and differentiable model based on a weighted average of locall…

    Submitted 12 October, 2023; v1 submitted 14 August, 2023; originally announced August 2023.

    Journal ref: Journal of Machine Learning for Modeling and Computing 2023

  14. arXiv:2306.16516  [pdf, ps, other]

    cs.CG cs.LG

    Dimension-Independent Kernel ε-Covers

    Authors: Jeff M. Phillips, Hasan Pourmahmood-Aghababa

    Abstract: We introduce the notion of an $\varepsilon$-cover for a kernel range space. A kernel range space concerns a set of points $X \subset \mathbb{R}^d$ and the space of all queries by a fixed kernel (e.g., a Gaussian kernel $K(p,\cdot) = \exp(-\|p-\cdot\|^2)$, where $p \in \mathbb{R}^d$). For a point set $X$ of size $n$, a query returns a vector of values $R_p \in \mathbb{R}^n$, where the $i$th coordin…

    Submitted 12 June, 2025; v1 submitted 28 June, 2023; originally announced June 2023.

    Comments: 27 pages

    Journal ref: Computing in Geometry and Topology, Volume 4(1), 2025, Article 5, pp. 1-28

  15. arXiv:2306.03173  [pdf, other]

    cs.LG

    Linear Distance Metric Learning with Noisy Labels

    Authors: Meysam Alishahi, Anna Little, Jeff M. Phillips

    Abstract: In linear distance metric learning, we are given data in one Euclidean metric space and the goal is to find an appropriate linear map to another Euclidean metric space which respects certain distance conditions as much as possible. In this paper, we formalize a simple and elegant method which reduces to a general continuous convex loss optimization problem, and for different noise models we derive…

    Submitted 20 December, 2023; v1 submitted 5 June, 2023; originally announced June 2023.

    Comments: 52 pages

  16. arXiv:2305.16606  [pdf, other]

    cs.IR

    Mitigating Exploitation Bias in Learning to Rank with an Uncertainty-aware Empirical Bayes Approach

    Authors: Tao Yang, Cuize Han, Chen Luo, Parth Gupta, Jeff M. Phillips, Qingyao Ai

    Abstract: Ranking is at the core of many artificial intelligence (AI) applications, including search engines, recommender systems, etc. Modern ranking systems are often constructed with learning-to-rank (LTR) models built from user behavior signals. While previous studies have demonstrated the effectiveness of using user behavior signals (e.g., clicks) as both features and labels of LTR algorithms, we argue…

    Submitted 25 May, 2023; originally announced May 2023.

  17. arXiv:2210.12704  [pdf, other]

    cs.LG

    Batch Multi-Fidelity Active Learning with Budget Constraints

    Authors: Shibo Li, Jeff M. Phillips, Xin Yu, Robert M. Kirby, Shandian Zhe

    Abstract: Learning functions with high-dimensional outputs is critical in many applications, such as physical simulation and engineering design. However, collecting training examples for these applications is often costly, e.g. by running numerical solvers. The recent work (Li et al., 2022) proposes the first multi-fidelity active learning approach for high-dimensional outputs, which can acquire examples at…

    Submitted 23 October, 2022; originally announced October 2022.

  18. arXiv:2209.01322  [pdf, other]

    cs.CG cs.LG

    Classifying Spatial Trajectories

    Authors: Hasan Pourmahmood-Aghababa, Jeff M. Phillips

    Abstract: We provide the first comprehensive study on how to classify trajectories using only their spatial representations, measured on 5 real-world data sets. Our comparison considers 20 distinct classifiers arising either as a KNN classifier of a popular distance, or as a more general type of classifier using a vectorized representation of each trajectory. We additionally develop new methods for how to v…

    Submitted 3 September, 2022; originally announced September 2022.

    Comments: 21 pages, 15 figures

  19. arXiv:2112.10931  [pdf, other]

    cs.NI

    Hiding Signal Strength Interference from Outside Adversaries

    Authors: Mingxuan Han, Jeff M. Phillips, Sneha Kumar Kasera

    Abstract: The presence of people can be detected by passively observing the signal strength of Wifi and related forms of communication. This paper tackles the question of how and when can this be prevented by adjustments to the transmitted signal strength, and other similar measures. The main contribution of this paper is a formal framework to analyze this problem, and the identification of several scenario…

    Submitted 20 December, 2021; originally announced December 2021.

    Comments: 6 pages, 2 figures

  20. arXiv:2108.12084  [pdf, other]

    cs.CL cs.AI cs.LG

    Harms of Gender Exclusivity and Challenges in Non-Binary Representation in Language Technologies

    Authors: Sunipa Dev, Masoud Monajatipoor, Anaelia Ovalle, Arjun Subramonian, Jeff M Phillips, Kai-Wei Chang

    Abstract: Gender is widely discussed in the context of language tasks and when examining the stereotypes propagated by language models. However, current discussions primarily treat gender as binary, which can perpetuate harms such as the cyclical erasure of non-binary gender identities. These harms are driven by model and dataset biases, which are consequences of the non-recognition and lack of understandin…

    Submitted 10 September, 2021; v1 submitted 26 August, 2021; originally announced August 2021.

    Journal ref: EMNLP 2021

  21. Practical and Configurable Network Traffic Classification Using Probabilistic Machine Learning

    Authors: Jiahui Chen, Joe Breen, Jeff M. Phillips, Jacobus Van der Merwe

    Abstract: Network traffic classification that is widely applicable and highly accurate is valuable for many network security and management tasks. A flexible and easily configurable classification framework is ideal, as it can be customized for use in a wide variety of networks. In this paper, we propose a highly configurable and flexible machine learning traffic classification method that relies only on st…

    Submitted 10 July, 2021; originally announced July 2021.

    Comments: Published in the Springer Cluster Computing journal

  22. arXiv:2106.13851  [pdf, other]

    cs.CG cs.LG

    Approximate Maximum Halfspace Discrepancy

    Authors: Michael Matheny, Jeff M. Phillips

    Abstract: Consider the geometric range space $(X, \mathcal{H}_d)$ where $X \subset \mathbb{R}^d$ and $\mathcal{H}_d$ is the set of ranges defined by $d$-dimensional halfspaces. In this setting we consider that $X$ is the disjoint union of a red and blue set. For each halfspace $h \in \mathcal{H}_d$ define a function $\Phi(h)$ that measures the "difference" between the fraction of red and fraction of blue point…

    Submitted 25 June, 2021; originally announced June 2021.

  23. arXiv:2104.02797  [pdf, other]

    cs.CL cs.HC

    VERB: Visualizing and Interpreting Bias Mitigation Techniques for Word Representations

    Authors: Archit Rathore, Sunipa Dev, Jeff M. Phillips, Vivek Srikumar, Yan Zheng, Chin-Chia Michael Yeh, Junpeng Wang, Wei Zhang, Bei Wang

    Abstract: Word vector embeddings have been shown to contain and amplify biases in data they are extracted from. Consequently, many techniques have been proposed to identify, mitigate, and attenuate these biases in word representations. In this paper, we utilize interactive visualization to increase the interpretability and accessibility of a collection of state-of-the-art debiasing techniques. To aid this,…

    Submitted 6 April, 2021; originally announced April 2021.

    Comments: 11 pages

  24. arXiv:2007.15924  [pdf, other]

    cs.CG

    Orientation-Preserving Vectorized Distance Between Curves

    Authors: Jeff M. Phillips, Hasan Pourmahmood-Aghababa

    Abstract: We introduce an orientation-preserving landmark-based distance for continuous curves, which can be viewed as an alternative to the Fréchet or Dynamic Time Warping distances. This measure retains many of the properties of those measures, and we prove some relations, but can be interpreted as a Euclidean distance in a particular vector space. Hence it is significantly easier to use, faster for gene…

    Submitted 24 May, 2021; v1 submitted 31 July, 2020; originally announced July 2020.

    Comments: 23 pages, 10 figures, 1 table, accepted for MSML21, this paper is in Primary: CG (Computational Geometry) and Secondary: LG (Machine Learning) categories

    ACM Class: I.3.5

  25. arXiv:2007.00049  [pdf, other]

    cs.CL cs.AI cs.LG

    OSCaR: Orthogonal Subspace Correction and Rectification of Biases in Word Embeddings

    Authors: Sunipa Dev, Tao Li, Jeff M Phillips, Vivek Srikumar

    Abstract: Language representations are known to carry stereotypical biases and, as a result, lead to biased predictions in downstream tasks. While existing methods are effective at mitigating biases by linear projection, such methods are too aggressive: they not only remove bias, but also erase valuable information from word embeddings. We develop new measures for evaluating specific information retention t…

    Submitted 10 September, 2021; v1 submitted 30 June, 2020; originally announced July 2020.

    Journal ref: EMNLP 2021

  26. arXiv:2002.02013  [pdf, other]

    cs.LG cs.DS stat.ML

    A Deterministic Streaming Sketch for Ridge Regression

    Authors: Benwei Shi, Jeff M. Phillips

    Abstract: We provide a deterministic space-efficient algorithm for estimating ridge regression. For $n$ data points with $d$ features and a large enough regularization parameter, we provide a solution within $\varepsilon$ L$_2$ error using only $O(d/\varepsilon)$ space. This is the first $o(d^2)$ space deterministic streaming algorithm with guaranteed solution error and risk bound for this classic problem.…

    Submitted 10 March, 2021; v1 submitted 5 February, 2020; originally announced February 2020.

    Comments: Fix a few typos. To be published in AISTATS 2021

    Journal ref: Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, PMLR 130:586-594, 2021

  27. arXiv:1912.07673  [pdf, ps, other]

    cs.DS cs.CG

    Finding the Mode of a Kernel Density Estimate

    Authors: Jasper C. H. Lee, Jerry Li, Christopher Musco, Jeff M. Phillips, Wai Ming Tai

    Abstract: Given points $p_1, \dots, p_n$ in $\mathbb{R}^d$, how do we find a point $x$ which maximizes $\frac{1}{n} \sum_{i=1}^n e^{-\|p_i - x\|^2}$? In other words, how do we find the maximizing point, or mode of a Gaussian kernel density estimation (KDE) centered at $p_1, \dots, p_n$? Given the power of KDEs in representing probability distributions and other continuous functions, the basic mode finding p…

    Submitted 16 December, 2019; originally announced December 2019.

  28. arXiv:1910.05862  [pdf, other]

    cs.LG stat.ML

    Constrained Non-Affine Alignment of Embeddings

    Authors: Yuwei Wang, Yan Zheng, Yanqing Peng, Chin-Chia Michael Yeh, Zhongfang Zhuang, Das Mahashweta, Bendre Mangesh, Feifei Li, Wei Zhang, Jeff M. Phillips

    Abstract: Embeddings are one of the fundamental building blocks for data analysis tasks. Embeddings are already essential tools for large language models and image analysis, and their use is being extended to many other research domains. The generation of these distributed representations is often a data- and computation-expensive process; yet the holistic analysis and adjustment of them after they have bee…

    Submitted 19 November, 2021; v1 submitted 13 October, 2019; originally announced October 2019.

  29. arXiv:1907.02171  [pdf, other]

    cs.CG

    Sketched MinDist

    Authors: Jeff M. Phillips, Pingfan Tang

    Abstract: We consider sketch vectors of geometric objects $J$ through the MinDist function \[ v_i(J) = \inf_{p \in J} \|p-q_i\| \] for $q_i \in Q$ from a point set $Q$. Collecting the vector of these sketch values induces a simple, effective, and powerful distance: the Euclidean distance between these sketched vectors. This paper shows how large this set $Q$ needs to be under a variety of shapes and scenar…

    Submitted 7 July, 2019; v1 submitted 3 July, 2019; originally announced July 2019.

  30. arXiv:1906.09381  [pdf, other]

    stat.ML cs.LG

    The Kernel Spatial Scan Statistic

    Authors: Mingxuan Han, Michael Matheny, Jeff M. Phillips

    Abstract: Kulldorff's (1997) seminal paper on spatial scan statistics (SSS) has led to many methods considering different regions of interest, different statistical models, and different approximations while also having numerous applications in epidemiology, environmental monitoring, and homeland security. SSS provides a way to rigorously test for the existence of an anomaly and provide statistical guarante…

    Submitted 9 August, 2019; v1 submitted 13 June, 2019; originally announced June 2019.

    Comments: 13 pages, 13 figures

  31. arXiv:1906.01693  [pdf, other]

    cs.DS

    Scalable Spatial Scan Statistics for Trajectories

    Authors: Michael Matheny, Dong Xie, Jeff M. Phillips

    Abstract: We define several new models for how to define anomalous regions among enormous sets of trajectories. These are based on spatial scan statistics, and identify a geometric region which captures a subset of trajectories which are significantly different in a measured characteristic from the background population. The model definition depends on how much a geometric region is contributed to by some o…

    Submitted 4 June, 2019; originally announced June 2019.

  32. arXiv:1903.08014  [pdf, other]

    cs.DS cs.CG

    Independent Range Sampling, Revisited Again

    Authors: Peyman Afshani, Jeff M. Phillips

    Abstract: We revisit the range sampling problem: the input is a set of points where each point is associated with a real-valued weight. The goal is to store them in a structure such that given a query range and an integer $k$, we can extract $k$ independent random samples from the points inside the query range, where the probability of sampling a point is proportional to its weight. This line of work was…

    Submitted 19 March, 2019; originally announced March 2019.

  33. arXiv:1903.03211  [pdf, other]

    cs.CG

    The VC Dimension of Metric Balls under Fréchet and Hausdorff Distances

    Authors: Anne Driemel, André Nusser, Jeff M. Phillips, Ioannis Psarros

    Abstract: The Vapnik-Chervonenkis dimension provides a notion of complexity for systems of sets. If the VC dimension is small, then knowing this can drastically simplify fundamental computational tasks such as classification, range counting, and density estimation through the use of sampling bounds. We analyze set systems where the ground set $X$ is a set of polygonal curves in $\mathbb{R}^d$ and the sets…

    Submitted 15 November, 2019; v1 submitted 7 March, 2019; originally announced March 2019.

    Comments: 24 pages, 5 figures

  34. arXiv:1811.04136  [pdf, ps, other]

    cs.LG cs.CG stat.ML

    The GaussianSketch for Almost Relative Error Kernel Distance

    Authors: Jeff M. Phillips, Wai Ming Tai

    Abstract: We introduce two versions of a new sketch for approximately embedding the Gaussian kernel into Euclidean inner product space. These work by truncating infinite expansions of the Gaussian kernel, and carefully invoking the RecursiveTensorSketch [Ahle et al. SODA 2020]. After providing concentration and approximation properties of these sketches, we use them to approximate the kernel distance betwee…

    Submitted 19 June, 2020; v1 submitted 9 November, 2018; originally announced November 2018.

  35. Improved Bounds on Information Dissemination by Manhattan Random Waypoint Model

    Authors: Aria Rezaei, Jie Gao, Jeff M. Phillips, Csaba D. Tóth

    Abstract: With the popularity of portable wireless devices it is important to model and predict how information or contagions spread by natural human mobility -- for understanding the spreading of deadly infectious diseases and for improving delay tolerant communication schemes. Formally, we model this problem by considering $M$ moving agents, where each agent initially carries a distinct bit of info…

    Submitted 19 September, 2018; originally announced September 2018.

    Comments: 10 pages, ACM SIGSPATIAL 2018, Seattle, US

    Journal ref: 26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL 18), 2018, Seattle, WA, USA

  36. arXiv:1806.01330  [pdf, other]

    cs.CL stat.ML

    Closed Form Word Embedding Alignment

    Authors: Sunipa Dev, Safia Hassan, Jeff M. Phillips

    Abstract: We develop a family of techniques to align word embeddings which are derived from different source datasets or created using different mechanisms (e.g., GloVe or word2vec). Our methods are simple and have a closed form to optimally rotate, translate, and scale to minimize root mean squared errors or maximize the average cosine similarity between two embeddings of the same vocabulary into the same…

    Submitted 17 November, 2020; v1 submitted 4 June, 2018; originally announced June 2018.

    Comments: Accepted ICDM 2019 and KAIS Special Issue

  37. arXiv:1804.11307  [pdf, other]

    cs.CG

    Practical Low-Dimensional Halfspace Range Space Sampling

    Authors: Michael Matheny, Jeff M. Phillips

    Abstract: We develop, analyze, implement, and compare new algorithms for creating $\varepsilon$-samples of range spaces defined by halfspaces which have size sub-quadratic in $1/\varepsilon$, and have runtime linear in the input size and near-quadratic in $1/\varepsilon$. The key to our solution is an efficient construction of partition trees. Despite not requiring any techniques developed after the early 1…

    Submitted 17 July, 2018; v1 submitted 30 April, 2018; originally announced April 2018.

  38. arXiv:1804.11287  [pdf, other]

    cs.CG

    Computing Approximate Statistical Discrepancy

    Authors: Michael Matheny, Jeff M. Phillips

    Abstract: Consider a geometric range space $(X,\mathcal{A})$ where each data point $x \in X$ has two or more values (say $r(x)$ and $b(x)$). Also consider a function $\Phi(A)$ defined on any subset $A \in (X,\mathcal{A})$ on the sum of values in that range e.g., $r_A = \sum_{x \in A} r(x)$ and $b_A = \sum_{x \in A} b(x)$. The $\Phi$-maximum range is $A^* = \arg \max_{A \in (X,\mathcal{A})} \Phi(A)$. Our goal is to find some $\hat{A}$ such tha…

    Submitted 27 September, 2018; v1 submitted 30 April, 2018; originally announced April 2018.

  39. arXiv:1804.11284  [pdf, other]

    cs.CG cs.LG

    Simple Distances for Trajectories via Landmarks

    Authors: Jeff M. Phillips, Pingfan Tang

    Abstract: We develop a new class of distances for objects including lines, hyperplanes, and trajectories, based on the distance to a set of landmarks. These distances easily and interpretably map objects to a Euclidean space, are simple to compute, and perform well in data analysis tasks. For trajectories, they match and in some cases significantly out-perform all state-of-the-art other metrics, can effortl…

    Submitted 11 June, 2019; v1 submitted 30 April, 2018; originally announced April 2018.

    ACM Class: I.3.5; G.3

  40. arXiv:1802.01751  [pdf, other]

    cs.LG cs.CG stat.ML

    Near-Optimal Coresets of Kernel Density Estimates

    Authors: Jeff M. Phillips, Wai Ming Tai

    Abstract: We construct near-optimal coresets for kernel density estimates for points in $\mathbb{R}^d$ when the kernel is positive definite. Specifically we show a polynomial time construction for a coreset of size $O(\sqrt{d}/\varepsilon\cdot \sqrt{\log 1/\varepsilon} )$, and we show a near-matching lower bound of size $\Omega(\min\{\sqrt{d}/\varepsilon, 1/\varepsilon^2\})$. When $d\geq 1/\varepsilon^2$, it is…

    Submitted 11 April, 2019; v1 submitted 5 February, 2018; originally announced February 2018.

    Comments: This paper is combined with arXiv:1710.04325

  41. arXiv:1710.06925  [pdf, other]

    cs.HC eess.SP

    Visualizing Sensor Network Coverage with Location Uncertainty

    Authors: Tim Sodergren, Jessica Hair, Jeff M. Phillips, Bei Wang

    Abstract: We present an interactive visualization system for exploring the coverage in sensor networks with uncertain sensor locations. We consider a simple case of uncertainty where the location of each sensor is confined to a discrete number of points sampled uniformly at random from a region with a fixed radius. Employing techniques from topological data analysis, we model and visualize network coverage…

    Submitted 18 October, 2017; originally announced October 2017.

  42. arXiv:1710.04325  [pdf, other]

    cs.LG cs.CG stat.ML

    Improved Coresets for Kernel Density Estimates

    Authors: Jeff M. Phillips, Wai Ming Tai

    Abstract: We study the construction of coresets for kernel density estimates. That is we show how to approximate the kernel density estimate described by a large point set with another kernel density estimate with a much smaller point set. For characteristic kernels (including Gaussian and Laplace kernels), our approximation preserves the $L_\infty$ error between kernel density estimates within error $\varepsilon$, w…

    Submitted 11 October, 2017; originally announced October 2017.

  43. arXiv:1709.04453  [pdf, other]

    cs.HC cs.CG

    Visualization of Big Spatial Data using Coresets for Kernel Density Estimates

    Authors: Yan Zheng, Yi Ou, Alexander Lex, Jeff M. Phillips

    Abstract: The size of large, geo-located datasets has reached scales where visualization of all data points is inefficient. Random sampling is a method to reduce the size of a dataset, yet it can introduce unwanted errors. We describe a method for subsampling of spatial data suitable for creating kernel density estimates from very large data and demonstrate that it results in less error than random sampling…

    Submitted 13 September, 2017; originally announced September 2017.

  44. arXiv:1702.03644  [pdf, other]

    cs.LG cs.DS

    Coresets for Kernel Regression

    Authors: Yan Zheng, Jeff M. Phillips

    Abstract: Kernel regression is an essential and ubiquitous tool for non-parametric data analysis, particularly popular among time series and spatial data. However, the central operation which is performed many times, evaluating a kernel on the data set, takes linear time. This is impractical for modern large data sets. In this paper we describe coresets for kernel regression: compressed data sets which ca…

    Submitted 31 May, 2017; v1 submitted 13 February, 2017; originally announced February 2017.

    Comments: 11 pages, 20 figures

  45. arXiv:1609.01226  [pdf, other]

    cs.LG stat.ML

    The Robustness of Estimator Composition

    Authors: Pingfan Tang, Jeff M. Phillips

    Abstract: We formalize notions of robustness for composite estimators via the notion of a breakdown point. A composite estimator successively applies two (or more) estimators: on data decomposed into disjoint parts, it applies the first estimator on each part, then the second estimator on the outputs of the first estimator. And so on, if the composition is of more than two estimators. Informally, the breakd…

    Submitted 5 September, 2016; originally announced September 2016.

    Comments: 14 pages, 2 figures, 29th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain

  46. arXiv:1602.05350  [pdf, ps, other]

    cs.LG

    Relative Error Embeddings for the Gaussian Kernel Distance

    Authors: Di Chen, Jeff M. Phillips

    Abstract: A reproducing kernel can define an embedding of a data point into an infinite dimensional reproducing kernel Hilbert space (RKHS). The norm in this space describes a distance, which we call the kernel distance. The random Fourier features (of Rahimi and Recht) describe an oblivious approximate mapping into finite dimensional Euclidean space that behaves similarly to the RKHS. We show in this paper t…

    Submitted 20 September, 2016; v1 submitted 17 February, 2016; originally announced February 2016.

  47. arXiv:1602.00412  [pdf, other]

    cs.DS

    Efficient Frequent Directions Algorithm for Sparse Matrices

    Authors: Mina Ghashami, Edo Liberty, Jeff M. Phillips

    Abstract: This paper describes Sparse Frequent Directions, a variant of Frequent Directions for sketching sparse matrices. It resembles the original algorithm in many ways: both receive the rows of an input matrix $A^{n \times d}$ one by one in the streaming setting and compute a small sketch $B \in R^{\ell \times d}$. Both share the same strong (provably optimal) asymptotic guarantees with respect to the s…

    Submitted 17 February, 2016; v1 submitted 1 February, 2016; originally announced February 2016.

  48. arXiv:1601.00630  [pdf, other]

    cs.DM stat.CO

    Approximating the Distribution of the Median and other Robust Estimators on Uncertain Data

    Authors: Kevin Buchin, Jeff M. Phillips, Pingfan Tang

    Abstract: Robust estimators, like the median of a point set, are important for data analysis in the presence of outliers. We study robust estimators for locationally uncertain points with discrete distributions. That is, each point in a data set has a discrete probability distribution describing its location. The probabilistic nature of uncertain data makes it challenging to compute such estimators, since t…

    Submitted 13 March, 2018; v1 submitted 4 January, 2016; originally announced January 2016.

    Comments: Full version of a paper to appear at SoCG 2018

  49. arXiv:1601.00617  [pdf, ps, other]

    cs.CG

    Coresets and Sketches

    Authors: Jeff M. Phillips

    Abstract: Geometric data summarization has become an essential tool in both geometric approximation algorithms and where geometry intersects with big data problems. In linear or near-linear time, large data sets can be compressed into a summary, and then more intricate algorithms can be run on the summaries whose results approximate those of the full data set. Coresets and sketches are the two most important…

    Submitted 12 June, 2016; v1 submitted 4 January, 2016; originally announced January 2016.

    Comments: Near-final version of Chapter 49 in Handbook on Discrete and Computational Geometry, 3rd edition

  50. arXiv:1512.05059  [pdf, other]

    cs.DS cs.LG stat.ML

    Streaming Kernel Principal Component Analysis

    Authors: Mina Ghashami, Daniel Perry, Jeff M. Phillips

    Abstract: Kernel principal component analysis (KPCA) provides a concise set of basis vectors which capture non-linear structures within large data sets, and is a central tool in data analysis and learning. To allow for non-linear relations, typically a full $n \times n$ kernel matrix is constructed over $n$ data points, but this requires too much space and time for large values of $n$. Techniques such as th…

    Submitted 16 December, 2015; originally announced December 2015.