
Learning to Maximize Mutual Information for Dynamic Feature Selection

Authors: Ian Covert, Wei Qiu, Mingyu Lu, Nayoon Kim, Nathan White, Su-In Lee

Abstract: Feature selection helps reduce data acquisition costs in ML, but the standard approach is to train models with static feature subsets. Here, we consider the dynamic feature selection (DFS) problem where a model sequentially queries features based on the presently available information. DFS is often addressed with reinforcement learning, but we explore a simpler approach of greedily selecting features based on their conditional mutual information. This method is theoretically appealing but requires oracle access to the data distribution, so we develop a learning approach based on amortized optimization. The proposed method is shown to recover the greedy policy when trained to optimality, and it outperforms numerous existing feature selection methods in our experiments, thus validating it as a simple but powerful approach for this problem.

Submitted 8 June, 2023; v1 submitted 2 January, 2023; originally announced January 2023.

Comments: ICML 2023 camera-ready

arXiv:2001.11552 [pdf]

Unwanted Advances in Higher Education: Uncovering Sexual Harassment Experiences in Academia with Text Mining

Authors: Amir Karami, Cynthia Nicole White, Kayla Ford, Suzanne Swan, Melek Yildiz Spinel

Abstract: Sexual harassment in academia is often a hidden problem because victims are usually reluctant to report their experiences. Recently, a web survey was developed to provide an opportunity to share thousands of sexual harassment experiences in academia. Using an efficient approach, this study collected and investigated more than 2,000 sexual harassment experiences to better understand these unwanted advances in higher education. This paper utilized text mining to disclose hidden topics and explore their weight across three variables: harasser gender, institution type, and victim's field of study. We mapped the topics on five themes drawn from the sexual harassment literature and found that more than 50% of the topics were assigned to the unwanted sexual attention theme. Fourteen percent of the topics were in the gender harassment theme, in which insulting, sexist, or degrading comments or behavior was directed towards women. Five percent of the topics involved sexual coercion (a benefit is offered in exchange for sexual favors), 5% involved sex discrimination, and 7% of the topics discussed retaliation against the victim for reporting the harassment, or for simply not complying with the harasser. Findings highlight the power differential between faculty and students, and the toll on students when professors abuse their power. While some topics did differ based on type of institution, there were no differences between the topics based on gender of harasser or field of study.
This research can be beneficial to researchers in further investigation of this paper's dataset, and to policymakers in improving existing policies to create a safe and supportive environment in academia.

Submitted 11 December, 2019; originally announced January 2020.

arXiv:1910.02379 [pdf, other]

Factors associated with injurious from falls in people with early stage Parkinson's disease

Authors: Sarini Abdullah, James McGree, Nicole White, Kerrie Mengersen, Graham Kerr

Abstract: Falls are common in people with Parkinson's disease (PD) and have detrimental effects which can lower the quality of life. While studies have been conducted to learn about falling in general, factors distinguishing injurious from non-injurious falls are less clear. A two-stage Bayesian logistic regression model was used to model the association of falls and injurious falls with data measured on patients. The forward stepwise selection procedure was used to determine which patient measures were associated with falls and injurious falls, and Bayesian model averaging (BMA) was used to account for uncertainty in this variable selection procedure. Data on 99 patients for a 12-month time period were considered in this analysis. Fifty-five percent of the patients experienced at least one fall, with a total of 335 fall cases, 25% of which were injurious falls. Fearful, Tinetti gait, and previous falls were the risk factors for fall/non-fall, with 77% accuracy, 76% sensitivity, and 76% specificity. Fall time, body mass index, anxiety, balance, gait, and gender were the risk factors associated with injurious falls. Thus, attaining normal body mass index and improving balance and gait could be seen as preventive efforts for injurious falls. There was no significant difference in the risk of falls between males and females, yet if falls occurred, females were more likely to get injured than males.

Submitted 6 October, 2019; originally announced October 2019.

Comments: 18 pages, 3 figures, 4 tables

MSC Class: 62P10; 62-07; 62J12 ACM Class: J.3.2; G.3.2; G.3.6

arXiv:1910.01864 [pdf, other]

Profile regression for subgrouping patients with early stage Parkinson's disease

Authors: Sarini Abdullah, James McGree, Nicole White, Kerrie Mengersen, Graham Kerr

Abstract: Falls are detrimental to people with Parkinson's Disease (PD) because of the potentially severe consequences to the patients' quality of life. While many studies have attempted to predict falls/non-falls, this study aimed to determine factors related to falls frequency in people with early PD. Ninety-nine participants with early stage PD were assessed based on two types of tests. The first type consisted of disease-specific tests, comprising the Unified Parkinson's Disease Rating Scale (UPDRS) and the Schwab and England activities of daily living scale (SEADL). A measure of postural instability and gait disorder (PIGD) and subtotal scores for subscales I, II, and III were derived from the UPDRS. The second type consisted of functional tests, including Tinetti gait and balance, Berg Balance Scale (BBS), Timed-Up and Go (TUG), Functional Reach (FR), Freezing of Gait (FOG), Mini Mental State Examination (MMSE), and Melbourne Edge Test (MET). Falls were recorded each month for 6 months. Clustering of patients via Finite Mixture Model (FMM) was conducted. Three clusters of patients were found: non- or single-fallers, low frequency fallers, and high frequency fallers. Several factors that are important to clustering PD patients were identified: UPDRS subscales II and III subtotals, PIGD, and SEADL. However, these factors could not differentiate low frequency fallers from high frequency fallers, whereas Tinetti, TUG, and BBS turned out to be important factors in clustering PD patients and could differentiate the three clusters.
FMM is able to cluster people with PD into three groups. We obtained several factors important to explaining the clusters and also found different roles of disease-specific measures and functional tests in clustering PD patients. Upon examining these measures, it might be possible to develop new disease treatments to prevent, or to delay, the occurrence of falls.

Submitted 4 October, 2019; originally announced October 2019.

Comments: 30 pages, 11 figures, 4 tables

MSC Class: 62-07; 62P10; 62H30 ACM Class: G.3.6; G.3.14; J.3.2

arXiv:1910.01313 [pdf, other]

Assessing the predictive ability of the UPDRS for falls classification in early stage Parkinson's disease

Authors: Sarini Abdullah, Nicole White, James McGree, Kerrie Mengersen, Graham Kerr

Abstract: Identification of risk factors associated with falls in people with Parkinson's Disease (PD) is important due to their high risk of falling. In this study, various ways of utilizing the Unified Parkinson's Disease Rating Scale (UPDRS) were assessed for the identification of risk factors and for the prediction of falls. Three statistical methods for classification were considered: decision trees, random forests, and logistic regression. UPDRS measurements on 51 participants with early stage PD, who completed monthly falls diaries for 12 months of follow-up, were analyzed. All classification methods applied produced similar results with regard to classification accuracy and the selected important variables. The highest classification rates were obtained from the model with individual items of the UPDRS, with 80% accuracy (85% sensitivity and 77% specificity), higher than in any previous study. A comparison of the independent performance of the four parts of the UPDRS revealed comparably high classification rates for Parts II and III of the UPDRS. Similar patterns with slightly different classification rates were observed for the 6- and 12-month follow-up times. Consistent predictors for falls selected by all classification methods at the two follow-up times are: thought disorder for UPDRS I, dressing and falling for UPDRS II, hand pronate/supinate for UPDRS III, and sleep disturbance and symptomatic orthostasis for UPDRS IV. For the aggregate measures, subtotal 2 (sum of UPDRS II items) and bradykinesia showed high association with fall/non-fall.
Fall/non-fall occurrences were more associated with individual items of the UPDRS than with the aggregate measures. UPDRS parts II and III produced comparably high classification rates for fall/non-fall prediction. Similar results were obtained for modelling data at 6-month and 12-month follow-up times.

Submitted 3 October, 2019; originally announced October 2019.

Comments: 29 pages, 7 figures, 5 tables

MSC Class: 62P10; 62-07; 62H30 ACM Class: G.3.6; G.3.7; J.3.2

arXiv:1907.00510 [pdf]

Hidden in Plain Sight For Too Long: Using Text Mining Techniques to Shine a Light on Workplace Sexism and Sexual Harassment

Authors: Amir Karami, Suzanne C. Swan, Cynthia Nicole White, Kayla Ford

Abstract: Objective: The goal of this study is to understand how people experience sexism and sexual harassment in the workplace by discovering themes in 2,362 experiences posted on the Everyday Sexism Project's website everydaysexism.com. Method: This study used both quantitative and qualitative methods. The quantitative method was a computational framework to collect and analyze a large number of workplace sexual harassment experiences. The qualitative method was the analysis of the topics generated by a text mining method. Results: Twenty-three topics were coded and then grouped into three overarching themes from the sex discrimination and sexual harassment literature. The Sex Discrimination theme included experiences in which women were treated unfavorably due to their sex, such as being passed over for promotion, denied opportunities, paid less than men, and ignored or talked over in meetings. The Sex Discrimination and Gender harassment theme included stories about sex discrimination and gender harassment, such as sexist hostility behaviors ranging from insults and jokes invoking misogynistic stereotypes to bullying behavior. The last theme, Unwanted Sexual Attention, contained stories describing sexual comments and behaviors used to degrade women. Unwanted touching was the highest weighted topic, indicating how common it was for website users to endure being touched, hugged or kissed, groped, and grabbed.
Conclusions: This study illustrates how researchers can use automatic processes to go beyond the limits of traditional research methods and investigate naturally occurring large scale datasets on the internet to achieve a better understanding of everyday workplace sexism experiences.

Submitted 30 June, 2019; originally announced July 2019.

arXiv:1602.02466 [pdf, other]

Overfitting hidden Markov models with an unknown number of states

Authors: Zoé van Havre, Judith Rousseau, Nicole White, Kerrie Mengersen

Abstract: This paper presents new theory and methodology for the Bayesian estimation of overfitted hidden Markov models, with finite state space. The goal is then to achieve posterior emptying of extra states. A prior configuration is constructed which favours configurations where the hidden Markov chain remains ergodic although it empties out some of the states. Asymptotic posterior convergence rates are proven theoretically, and demonstrated with a large sample simulation. The problem of overfitted HMMs is then considered in the context of smaller sample sizes, and due to computational and mixing issues two alternative prior structures are studied, one commonly used in practice, and a mixture of the two priors. The Prior Parallel Tempering approach of van Havre (2015) is also extended to HMMs to allow MCMC estimation of the complex posterior space. A replicate simulation study and an in-depth exploration is performed to compare the three priors with hyperparameters chosen according to the asymptotic constraints alongside less informative alternatives.

Submitted 8 February, 2016; originally announced February 2016.

Comments: Submitted to Bayesian Analysis on 04-August-2015

arXiv:1602.01915 [pdf, ps, other]

Clustering action potential spikes: Insights on the use of overfitted finite mixture models and Dirichlet process mixture models

Authors: Zoé van Havre, Nicole White, Judith Rousseau, Kerrie Mengersen

Abstract: The modelling of action potentials from extracellular recordings, or spike sorting, is a rich area of neuroscience research in which latent variable models are often used. Two such models, Overfitted Finite Mixture models (OFMs) and Dirichlet Process Mixture models (DPMs) are considered to provide insights for unsupervised clustering of complex, multivariate medical data when the number of clusters is unknown. OFM and DPM are structured in a similar hierarchical fashion but they are based on different philosophies with different underlying assumptions. This study investigates how these differences impact on a real study of spike sorting, for the estimation of multivariate Gaussian location-scale mixture models in the presence of common difficulties arising from complex medical data. The results provide insights allowing the future analyst to choose an approach suited to the situation and goal of the research problem at hand.

Submitted 4 February, 2016; originally announced February 2016.

Comments: Submitted to Australian & New Zealand Journal of Statistics on 31-Aug-2015

arXiv:1502.05427 [pdf]

doi 10.1371/journal.pone.0131739

Overfitting Bayesian Mixture Models with an Unknown Number of Components

Authors: Zoe van Havre, Nicole White, Judith Rousseau, Kerrie Mengersen

Abstract: This paper proposes solutions to three issues pertaining to the estimation of finite mixture models with an unknown number of components: the non-identifiability induced by overfitting the number of components, the mixing limitations of standard Markov Chain Monte Carlo (MCMC) sampling techniques, and the related label switching problem. An overfitting approach is used to estimate the number of components in a finite mixture model via a Zmix algorithm. Zmix provides a bridge between multidimensional samplers and test based estimation methods, whereby priors are chosen to encourage extra groups to have weights approaching zero. MCMC sampling is made possible by the implementation of prior parallel tempering, an extension of parallel tempering. Zmix can accurately estimate the number of components, posterior parameter estimates and allocation probabilities given a sufficiently large sample size. The results will reflect uncertainty in the final model and will report the range of possible candidate models and their respective estimated probabilities from a single run. Label switching is resolved with a computationally light-weight method, Zswitch, developed for overfitted mixtures by exploiting the intuitiveness of allocation-based relabelling algorithms and the precision of label-invariant loss functions. Four simulation studies are included to illustrate Zmix and Zswitch, as well as three case studies from the literature. All methods are available as part of the R package Zmix, which can currently be applied to univariate Gaussian mixture models.

Submitted 24 August, 2015; v1 submitted 18 February, 2015; originally announced February 2015.

Journal ref: Plos One, 10(7), e0131739 (2015)

Showing 1–9 of 9 results for author: White, N