-
N$^2$: A Unified Python Package and Test Bench for Nearest Neighbor-Based Matrix Completion
Authors:
Caleb Chin,
Aashish Khubchandani,
Harshvardhan Maskara,
Kyuseong Choi,
Jacob Feitelberg,
Albert Gong,
Manit Paul,
Tathagata Sadhukhan,
Anish Agarwal,
Raaz Dwivedi
Abstract:
Nearest neighbor (NN) methods have re-emerged as competitive tools for matrix completion, offering strong empirical performance and recent theoretical guarantees, including entry-wise error bounds, confidence intervals, and minimax optimality. Despite their simplicity, recent work has shown that NN approaches are robust to a range of missingness patterns and effective across diverse applications. This paper introduces N$^2$, a unified Python package and testbed that consolidates a broad class of NN-based methods through a modular, extensible interface. Built for both researchers and practitioners, N$^2$ supports rapid experimentation and benchmarking. Using this framework, we introduce a new NN variant that achieves state-of-the-art results in several settings. We also release a benchmark suite of real-world datasets, from healthcare and recommender systems to causal inference and LLM evaluation, designed to stress-test matrix completion methods beyond synthetic scenarios. Our experiments demonstrate that while classical methods excel on idealized data, NN-based techniques consistently outperform them in real-world settings.
Submitted 4 June, 2025;
originally announced June 2025.
-
Adaptively-weighted Nearest Neighbors for Matrix Completion
Authors:
Tathagata Sadhukhan,
Manit Paul,
Raaz Dwivedi
Abstract:
In this technical note, we introduce and analyze AWNN: an adaptively weighted nearest neighbor method for performing matrix completion. Nearest neighbor (NN) methods are widely used in missing data problems across multiple disciplines, such as in recommender systems and for counterfactual inference in panel data settings. Prior works have shown that, in addition to being intuitive and easy to implement, NN methods enjoy strong theoretical guarantees. However, the performance of most NN methods relies on an appropriate choice of the radii and of the weights assigned to each member of the nearest neighbor set. Despite two decades of work on nearest neighbor methods, there is no systematic approach for choosing the radii and the weights without relying on methods like cross-validation. AWNN addresses this challenge by judiciously balancing the bias-variance trade-off inherent in weighted nearest-neighbor regression. We provide theoretical guarantees for the proposed method under minimal assumptions and support the theory via synthetic experiments.
Submitted 14 May, 2025;
originally announced May 2025.
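
The role of the radius and the weights mentioned in the abstract above can be illustrated with a minimal sketch of a generic row-wise nearest neighbor estimate. This is not AWNN and not the authors' code: the function names, the mean-squared row distance, the uniform weights, and the toy ratings matrix are all assumptions made for illustration.

```python
import math


def row_distance(a, b):
    """Mean squared difference over columns observed in both rows (None = missing)."""
    diffs = [(x - y) ** 2 for x, y in zip(a, b) if x is not None and y is not None]
    return sum(diffs) / len(diffs) if diffs else math.inf


def nn_impute(matrix, i, j, radius):
    """Estimate entry (i, j) as the uniform average of column j over all
    other rows whose distance to row i is at most `radius`.
    Returns None when no row qualifies."""
    neighbors = [
        row[j]
        for k, row in enumerate(matrix)
        if k != i and row[j] is not None and row_distance(matrix[i], row) <= radius
    ]
    return sum(neighbors) / len(neighbors) if neighbors else None


# Hypothetical 4 x 3 ratings matrix; entry (0, 2) is missing.
ratings = [
    [5.0, 4.0, None],
    [5.0, 4.0, 2.0],
    [4.0, 5.0, 4.0],
    [1.0, 1.0, 5.0],
]

wide = nn_impute(ratings, 0, 2, radius=1.5)    # averages rows 1 and 2
narrow = nn_impute(ratings, 0, 2, radius=0.5)  # uses row 1 only
```

A larger radius admits more neighbors, which lowers variance but may add bias; a smaller radius does the reverse. In this sketch the radius is fixed by hand, which is exactly the tuning burden the abstract says AWNN is designed to remove.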
-
Finite sample valid confidence sets of mode
Authors:
Manit Paul,
Arun Kumar Kuchibhotla
Abstract:
Estimating the mode of a unimodal distribution is a classical problem in statistics. Although the literature offers several approaches to point estimation of the mode, very little has been explored about interval estimation of the mode. Our work proposes a collection of novel methods for obtaining finite-sample valid confidence sets for the mode of a unimodal distribution. We analyze the behaviour of the width of the proposed confidence sets under some regularity assumptions on the density around the mode and show that the width of these confidence sets shrinks to zero near optimally. Simply put, we show that it is possible to build finite-sample valid confidence sets for the mode that shrink to a singleton as the sample size increases. We support the theoretical results by showing the performance of the proposed methods on some synthetic datasets. We believe that our confidence sets can be improved both in construction and in terms of rate.
Submitted 31 March, 2025;
originally announced March 2025.
-
On adaptivity and minimax optimality of two-sided nearest neighbors
Authors:
Tathagata Sadhukhan,
Manit Paul,
Raaz Dwivedi
Abstract:
Nearest neighbor (NN) algorithms have been extensively used for missing data problems in recommender systems and sequential decision-making systems. Prior theoretical analysis has established favorable guarantees for NN when the underlying data is sufficiently smooth and the missingness probabilities are lower bounded. Here we analyze NN with non-smooth, non-linear functions and vast amounts of missingness. In particular, we consider matrix completion settings where the entries of the underlying matrix follow a latent non-linear factor model, with the non-linearity belonging to a Hölder function class that is less smooth than Lipschitz. Our results establish the following favorable properties for a suitable two-sided NN: (1) the mean squared error (MSE) of NN adapts to the smoothness of the non-linearity, (2) under certain regularity conditions, the NN error rate matches the rate obtained by an oracle equipped with the knowledge of both the row and column latent factors, and finally (3) NN's MSE is non-trivial for a wide range of settings even when several matrix entries might be missing deterministically. We support our theoretical findings via extensive numerical simulations and a case study with data from a mobile health study, HeartSteps.
Submitted 19 November, 2024;
originally announced November 2024.
-
Data-driven Crop Growth Simulation on Time-varying Generated Images using Multi-conditional Generative Adversarial Networks
Authors:
Lukas Drees,
Dereje T. Demie,
Madhuri R. Paul,
Johannes Leonhardt,
Sabine J. Seidel,
Thomas F. Döring,
Ribana Roscher
Abstract:
Image-based crop growth modeling can substantially contribute to precision agriculture by revealing spatial crop development over time, which allows an early and location-specific estimation of relevant future plant traits, such as leaf area or biomass. A prerequisite for realistic and sharp crop image generation is the integration of multiple growth-influencing conditions in a model, such as an image of an initial growth stage, the associated growth time, and further information about the field treatment. We present a two-stage framework consisting of an image prediction model followed by a growth estimation model, both of which are trained independently. The image prediction model is a conditional Wasserstein generative adversarial network (CWGAN). In the generator of this model, conditional batch normalization (CBN) is used to integrate different conditions along with the input image. This allows the model to generate time-varying artificial images dependent on multiple influencing factors of different kinds. These images are used by the second part of the framework for plant phenotyping by deriving plant-specific traits and comparing them with those of non-artificial (real) reference images. For various crop datasets, the framework allows realistic, sharp image predictions with a slight loss of quality from short-term to long-term predictions. Simulations of varying growth-influencing conditions performed with the trained framework provide valuable insights into how such factors relate to crop appearances, which is particularly useful in complex, less explored crop mixture systems. Further results show that adding process-based simulated biomass as a condition increases the accuracy of the derived phenotypic traits from the predicted images. This demonstrates the potential of our framework to serve as an interface between an image- and process-based crop growth model.
Submitted 6 December, 2023;
originally announced December 2023.
-
Effect of influence in voter models and its application in detecting significant interference in political elections
Authors:
Manit Paul,
Rishideep Roy,
Soudeep Deb
Abstract:
In this article, we study the effect of vector-valued interventions in votes under a binary voter model, where each voter expresses their vote as a $0$-$1$ valued random variable to choose between two candidates. We assume that the outcome is determined by the majority function, which is true for a democratic system. The term intervention includes cases of counting errors, reporting irregularities, electoral malpractice, etc. Our focus is to analyze the effect of the intervention on the final outcome. We construct statistical tests to detect significant irregularities in elections under two scenarios: one where exit poll data is available and, more broadly, one under the assumption of a cost function associated with causing the interventions. Relevant theoretical results on the consistency of the test procedures are also derived. Through a detailed simulation study, we show that the test procedure has good power and is robust across various settings. We also implement our method on three real-life data sets. The applications provide results consistent with existing knowledge and establish that the method can be adopted for crucial problems related to political elections.
Submitted 14 October, 2022;
originally announced October 2022.
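
As a toy illustration of the setting in the abstract above, the sketch below computes a majority outcome over 0-1 votes and shows how an intervention can change it. Modeling the intervention as a 0-1 vector that flips individual votes, the strict-majority tie rule, and all names are assumptions made here for illustration; the paper's actual intervention model and tests are not reproduced.

```python
def majority(votes):
    """Return 1 if strictly more than half of the 0-1 votes are 1, else 0."""
    return 1 if 2 * sum(votes) > len(votes) else 0


def apply_intervention(votes, flips):
    """Flip vote i wherever flips[i] == 1 (assumed form of the intervention)."""
    return [v ^ f for v, f in zip(votes, flips)]


# Hypothetical electorate of five voters; candidate 1 leads 3 to 2.
votes = [1, 1, 1, 0, 0]
flips = [1, 1, 0, 0, 0]  # two votes are altered

before = majority(votes)
after = majority(apply_intervention(votes, flips))
```

Here altering two of five votes reverses the outcome, which is the kind of effect on the final result that the abstract sets out to detect statistically.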
-
Unmasking the Lottery Ticket Hypothesis: What's Encoded in a Winning Ticket's Mask?
Authors:
Mansheej Paul,
Feng Chen,
Brett W. Larsen,
Jonathan Frankle,
Surya Ganguli,
Gintare Karolina Dziugaite
Abstract:
Modern deep learning involves training costly, highly overparameterized networks, thus motivating the search for sparser networks that can still be trained to the same accuracy as the full network (i.e. matching). Iterative magnitude pruning (IMP) is a state-of-the-art algorithm that can find such highly sparse matching subnetworks, known as winning tickets. IMP operates by iterative cycles of training, masking the smallest-magnitude weights, rewinding back to an early training point, and repeating. Despite its simplicity, the underlying principles for when and how IMP finds winning tickets remain elusive. In particular, what useful information does an IMP mask found at the end of training convey to a rewound network near the beginning of training? How does SGD allow the network to extract this information? And why is iterative pruning needed? We develop answers in terms of the geometry of the error landscape. First, we find that, at higher sparsities, pairs of pruned networks at successive pruning iterations are connected by a linear path with zero error barrier if and only if they are matching. This indicates that masks found at the end of training convey the identity of an axial subspace that intersects a desired linearly connected mode of a matching sublevel set. Second, we show SGD can exploit this information due to a strong form of robustness: it can return to this mode despite strong perturbations early in training. Third, we show how the flatness of the error landscape at the end of training determines a limit on the fraction of weights that can be pruned at each iteration of IMP. Finally, we show that the role of retraining in IMP is to find a network with new small weights to prune. Overall, these results make progress toward demystifying the existence of winning tickets by revealing the fundamental role of error landscape geometry.
Submitted 6 October, 2022;
originally announced October 2022.
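
The IMP cycle described in the abstract above (train, mask the smallest-magnitude weights, rewind, repeat) can be sketched on a flat list of weights. This is a minimal illustration, not the authors' implementation: `magnitude_mask`, `imp_mask`, and `toy_train` are hypothetical names, and `toy_train` is a stand-in for real SGD training that simply doubles the surviving weights.

```python
def magnitude_mask(weights, mask, prune_fraction):
    """Prune the given fraction of surviving weights with the smallest magnitude."""
    alive = [i for i, keep in enumerate(mask) if keep]
    n_prune = int(len(alive) * prune_fraction)
    to_prune = sorted(alive, key=lambda i: abs(weights[i]))[:n_prune]
    new_mask = list(mask)
    for i in to_prune:
        new_mask[i] = False
    return new_mask


def imp_mask(rewind_weights, train, rounds, prune_fraction):
    """Run `rounds` cycles of: rewind to the saved weights, train under the
    current mask, then prune by magnitude. Returns the final mask."""
    mask = [True] * len(rewind_weights)
    for _ in range(rounds):
        # Rewind: restart from the early-training weights, masked.
        start = [w if keep else 0.0 for w, keep in zip(rewind_weights, mask)]
        trained = train(start, mask)
        mask = magnitude_mask(trained, mask, prune_fraction)
    return mask


def toy_train(weights, mask):
    """Placeholder for training (assumption): scales surviving weights by 2."""
    return [2.0 * w if keep else 0.0 for w, keep in zip(weights, mask)]


final_mask = imp_mask([0.1, -0.5, 0.3, -0.05], toy_train, rounds=2, prune_fraction=0.5)
```

With the toy training step, the first round prunes the two smallest weights and the second round prunes one of the remaining two, so only the largest-magnitude weight survives.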
-
Lottery Tickets on a Data Diet: Finding Initializations with Sparse Trainable Networks
Authors:
Mansheej Paul,
Brett W. Larsen,
Surya Ganguli,
Jonathan Frankle,
Gintare Karolina Dziugaite
Abstract:
A striking observation about iterative magnitude pruning (IMP; Frankle et al. 2020) is that, after just a few hundred steps of dense training, the method can find a sparse sub-network that can be trained to the same accuracy as the dense network. However, the same does not hold at step 0, i.e. random initialization. In this work, we seek to understand how this early phase of pre-training leads to a good initialization for IMP both through the lens of the data distribution and the loss landscape geometry. Empirically we observe that, holding the number of pre-training iterations constant, training on a small fraction of (randomly chosen) data suffices to obtain an equally good initialization for IMP. We additionally observe that by pre-training only on "easy" training data, we can decrease the number of steps necessary to find a good initialization for IMP compared to training on the full dataset or a randomly chosen subset. Finally, we identify novel properties of the loss landscape of dense networks that are predictive of IMP performance, showing in particular that more examples being linearly mode connected in the dense network correlates well with good initializations for IMP. Combined, these results provide new insight into the role played by the early phase training in IMP.
Submitted 2 June, 2022;
originally announced June 2022.
-
Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the Neural Tangent Kernel
Authors:
Stanislav Fort,
Gintare Karolina Dziugaite,
Mansheej Paul,
Sepideh Kharaghani,
Daniel M. Roy,
Surya Ganguli
Abstract:
In suitably initialized wide networks, small learning rates transform deep neural networks (DNNs) into neural tangent kernel (NTK) machines, whose training dynamics is well-approximated by a linear weight expansion of the network at initialization. Standard training, however, diverges from its linearization in ways that are poorly understood. We study the relationship between the training dynamics of nonlinear deep networks, the geometry of the loss landscape, and the time evolution of a data-dependent NTK. We do so through a large-scale phenomenological analysis of training, synthesizing diverse measures characterizing loss landscape geometry and NTK dynamics. In multiple neural architectures and datasets, we find these diverse measures evolve in a highly correlated manner, revealing a universal picture of the deep learning process. In this picture, deep network training exhibits a highly chaotic rapid initial transient that within 2 to 3 epochs determines the final linearly connected basin of low loss containing the end point of training. During this chaotic transient, the NTK changes rapidly, learning useful features from the training data that enables it to outperform the standard initial NTK by a factor of 3 in less than 3 to 4 epochs. After this rapid chaotic transient, the NTK changes at constant velocity, and its performance matches that of full network training in 15% to 45% of training time. Overall, our analysis reveals a striking correlation between a diverse set of metrics over training time, governed by a rapid chaotic to stable transition in the first few epochs, that together poses challenges and opportunities for the development of more accurate theories of deep learning.
Submitted 28 October, 2020;
originally announced October 2020.
-
Automatically Assessing Quality of Online Health Articles
Authors:
Fariha Afsana,
Muhammad Ashad Kabir,
Naeemul Hassan,
Manoranjan Paul
Abstract:
The information ecosystem today is overwhelmed by an unprecedented quantity of data of varied quality on diverse topics. However, the quality of information disseminated in the field of medicine has been questioned, as the negative health consequences of health misinformation can be life-threatening. There is currently no generic automated tool for evaluating the quality of online health information spanning a broad range. To address this gap, in this paper, we applied a data mining approach to automatically assess the quality of online health articles based on 10 quality criteria. We have prepared a labeled dataset with 53012 features and applied different feature selection methods to identify the best feature subset, with which our trained classifier achieved an accuracy of 84% to 90%, varying across the 10 criteria. Our semantic analysis of features shows the underpinning associations between the selected features and the assessment criteria and further rationalizes our assessment approach. Our findings will help in identifying high-quality health articles and thus aid users in shaping their opinions and making the right choice when seeking health-related help online.
Submitted 6 April, 2020;
originally announced April 2020.
-
Combining Search, Social Media, and Traditional Data Sources to Improve Influenza Surveillance
Authors:
Mauricio Santillana,
Andre T. Nguyen,
Mark Dredze,
Michael J. Paul,
John S. Brownstein
Abstract:
We present a machine learning-based methodology capable of providing real-time ("nowcast") and forecast estimates of influenza activity in the US by leveraging data from multiple data sources including: Google searches, Twitter microblogs, nearly real-time hospital visit records, and data from a participatory surveillance system. Our main contribution consists of combining multiple influenza-like illness (ILI) activity estimates, generated independently with each data source, into a single prediction of ILI utilizing machine learning ensemble approaches. Our methodology exploits the information in each data source and produces accurate weekly ILI predictions up to four weeks ahead of the release of CDC's ILI reports. We evaluate the predictive ability of our ensemble approach during the 2013-2014 (retrospective) and 2014-2015 (live) flu seasons for each of the four weekly time horizons. Our ensemble approach demonstrates several advantages: (1) our ensemble method's predictions outperform every prediction using each data source independently, (2) our methodology can produce predictions one week ahead of GFT's real-time estimates with comparable accuracy, and (3) our two- and three-week forecast estimates have comparable accuracy to real-time predictions using an autoregressive model. Moreover, our results show that considerable insight is gained from incorporating disparate data streams, in the form of social media and crowdsourced data, into influenza predictions in all time horizons.
Submitted 27 August, 2015;
originally announced August 2015.
-
Hierarchical modelling of faecal egg counts to assess anthelmintic efficacy
Authors:
Michaela Paul,
Paul R. Torgerson,
Johan Höglund,
Reinhard Furrer
Abstract:
Counting the number of parasite eggs in faecal samples is a widely used diagnostic method to evaluate parasite burden. Typically a sub-sample of the diluted faeces is examined for eggs. The resulting egg counts are multiplied by a specific correction factor to estimate the mean parasite burden. To detect anthelmintic resistance, the mean parasite burdens of treated and untreated animals are compared. However, this standard method has some limitations. In particular, the analysis of repeated samples may produce quite variable results as the sampling variability due to the counting technique is ignored. We propose a hierarchical model that takes this sampling variability as well as between-animal variation into account. Bayesian inference is done via Markov chain Monte Carlo sampling. The performance of the hierarchical model is illustrated by a re-analysis of faecal egg count data from a Swedish study assessing the anthelmintic resistance of nematode parasites in sheep. A simulation study shows that the hierarchical model provides better classification of anthelmintic resistance compared to the standard method.
Submitted 12 January, 2014;
originally announced January 2014.
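
The standard method that the abstract above contrasts with the hierarchical model can be sketched as follows. This is an illustration under stated assumptions: the percent-reduction formula is the commonly used way of comparing treated and untreated mean burdens (the abstract does not spell it out), and the function names, the correction factor of 50, and the counts are hypothetical.

```python
def mean_burden(raw_counts, correction_factor):
    """Estimated mean parasite burden: mean raw egg count times the correction factor."""
    return correction_factor * sum(raw_counts) / len(raw_counts)


def percent_reduction(untreated_counts, treated_counts, correction_factor):
    """Percent reduction in mean burden from untreated to treated animals
    (assumed comparison rule; undefined if the untreated mean is zero)."""
    untreated = mean_burden(untreated_counts, correction_factor)
    treated = mean_burden(treated_counts, correction_factor)
    return 100.0 * (1.0 - treated / untreated)


# Hypothetical raw counts from three untreated and three treated animals.
untreated_counts = [10, 12, 8]
treated_counts = [1, 0, 2]

reduction = percent_reduction(untreated_counts, treated_counts, correction_factor=50)
```

Because every raw count is scaled by the same factor, a single egg more or less in the examined sub-sample shifts the estimated burden by the full correction factor. The sketch ignores this sampling variability, which is the limitation the hierarchical model is meant to address.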