Search | arXiv e-print repository

arXiv:2401.08308 [pdf]

Sources of HIV infections among MSM with a migration background: a viral phylogenetic case study in Amsterdam, the Netherlands

Authors: Alexandra Blenkinsop, Nikos Pantazis, Evangelia Georgia Kostaki, Lysandros Sofocleous, Ard van Sighem, Daniela Bezemer, Thijs van de Laar, Marc van der Valk, Peter Reiss, Godelieve de Bree, Oliver Ratmann

Abstract: Background: Men and women with a migration background comprise an increasing proportion of incident HIV cases across Western Europe. Several studies indicate a substantial proportion acquire HIV post-migration. Methods: We used partial HIV consensus sequences with linked demographic and clinical data from the opt-out ATHENA cohort of people with HIV in the Netherlands to quantify population-level sources of transmission to Dutch-born and foreign-born Amsterdam men who have sex with men (MSM) between 2010-2021. We identified phylogenetically and epidemiologically possible transmission pairs in local transmission chains and interpreted these in the context of estimated infection dates, quantifying transmission dynamics between sub-populations by world region of birth. Results: We estimate the majority of Amsterdam MSM who acquired their infection locally had a Dutch-born Amsterdam MSM source (56% [53-58%]). Dutch-born MSM were the predominant source population of infections among almost all foreign-born Amsterdam MSM sub-populations. Stratifying by two-year intervals indicated shifts in transmission dynamics, with a majority of infections originating from foreign-born MSM since 2018, although uncertainty ranges remained wide. Conclusions: In the context of declining HIV incidence among Amsterdam MSM, our data suggest whilst native-born MSM have predominantly driven transmissions in 2010-2021, the contribution from foreign-born MSM living in Amsterdam is increasing.

Submitted 16 January, 2024; originally announced January 2024.

arXiv:2311.06086 [pdf, other]

A three-step approach to production frontier estimation and the Matsuoka's distribution

Authors: Danilo Hiroshi Matsuoka, Guilherme Pumi, Hudson da Silva Torrent, Marcio Valk

Abstract: In this work, we introduce a three-step semiparametric methodology for the estimation of production frontiers. We consider a model inspired by the well-known Cobb-Douglas production function, wherein input factors operate multiplicatively within the model. Efficiency in the proposed model is assumed to follow a continuous univariate uniparametric distribution in $(0,1)$, referred to as Matsuoka's distribution, which is discussed in detail. Following model linearization, the first step is to semiparametrically estimate the regression function through a local linear smoother. The second step focuses on the estimation of the efficiency parameter. Finally, we estimate the production frontier through a plug-in methodology. We present a rigorous asymptotic theory related to the proposed three-step estimation, including consistency, and asymptotic normality, and derive rates for the convergences presented. Incidentally, we also study the Matsuoka's distribution, deriving its main properties. The Matsuoka's distribution exhibits a versatile array of shapes capable of effectively encapsulating the typical behavior of efficiency within production frontier models. To complement the large sample results obtained, a Monte Carlo simulation study is conducted to assess the finite sample performance of the proposed three-step methodology. An empirical application using a dataset of Danish milk producers is also presented.

Submitted 21 March, 2024; v1 submitted 10 November, 2023; originally announced November 2023.

MSC Class: 62E10; 62G08; 62F10

arXiv:2109.00952 [pdf, other]

Fault detection and diagnosis of batch process using dynamic ARMA-based control charts

Authors: Batista Nunes de Oliveira, Marcio Valk, Danilo Marcondes Filho

Abstract: A wide range of approaches for batch process monitoring can be found in the literature. This kind of process generates a very peculiar data structure, in which successive measurements of many process variables in each batch run are available. Traditional approaches do not take into account the time series nature of the data. The main reason is that time series inference theory is not based on replications of time series, as found in batch process data, but on variability in the time domain. This fact demands some adaptations of the theory in order to accommodate the model coefficient estimates, considering jointly the batch-to-batch sample variability (batch domain) and the serial correlation in each batch (time domain). In order to address this issue, this paper proposes a new approach grounded in a group of control charts based on the classical ARMA model for monitoring and diagnosis of batch process dynamics. The model coefficients are estimated (through the ordinary least squares method) for each historical time series sample batch, and modified Hotelling and t-Student distributions are derived and used to accommodate those estimates. A group of control charts based on those distributions is proposed for monitoring new batches. Additionally, these groups of charts aid fault diagnosis by identifying the source of disturbances. Through simulated and real data we show that this approach seems to work well for both purposes.

Submitted 2 September, 2021; originally announced September 2021.

arXiv:2106.09115 [pdf, other]

Clustering inference in multiple groups

Authors: Debora Zava Bello, Marcio Valk, Gabriela Bettella Cybis

Abstract: Inference in clustering is paramount to uncovering inherent group structure in data. Clustering methods which assess statistical significance have recently drawn attention owing to their importance for the identification of patterns in high dimensional data with applications in many scientific fields. We present here a U-statistics based approach, specially tailored for high-dimensional data, that clusters the data into three groups while assessing the significance of such partitions. Because our approach stands on the U-statistics based clustering framework of the methods in R package uclust, it inherits its characteristics, being a non-parametric method relying on very few assumptions about the data, and thus can be applied to a wide range of datasets. Furthermore, our method aims to be a more powerful tool to find the best partitions of the data into three groups when that particular structure is present. In order to do so, we first propose an extension of the test U-statistic and develop its asymptotic theory. Additionally, we propose a ternary non-nested significance clustering method. Our approach is tested through multiple simulations and found to have more statistical power than competing alternatives in all scenarios considered. Applications to peripheral blood mononuclear cells and to image recognition show the versatility of our proposal, presenting a superior performance when compared with other approaches.

Submitted 16 June, 2021; originally announced June 2021.

arXiv:1807.10338 [pdf, other]

doi 10.1016/j.jspi.2018.10.001

Beta Autoregressive Fractionally Integrated Moving Average Models

Authors: Guilherme Pumi, Marcio Valk, Cleber Bisognin, Fábio Mariano Bayer, Taiane Schaedler Prass

Abstract: In this work we introduce the class of beta autoregressive fractionally integrated moving average models for continuous random variables taking values in the continuous unit interval $(0,1)$. The proposed model accommodates a set of regressors and a long-range dependent time series structure. We derive the partial likelihood estimator for the parameters of the proposed model, obtain the associated score vector and Fisher information matrix. We also prove the consistency and asymptotic normality of the estimator under mild conditions. Hypotheses testing, diagnostic tools and forecasting are also proposed. A Monte Carlo simulation is considered to evaluate the finite sample performance of the partial likelihood estimators and to study some of the proposed tests. An empirical application is also presented and discussed.

Submitted 26 July, 2018; originally announced July 2018.

MSC Class: 62M10; 62F12; 62J12; 62J99

arXiv:1805.12179 [pdf, other]

U-statistical inference for hierarchical clustering

Authors: Marcio Valk, Gabriela Bettella Cybis

Abstract: Clustering methods are a valuable tool for the identification of patterns in high dimensional data with applications in many scientific problems. However, quantifying uncertainty in clustering is a challenging problem, particularly when dealing with High Dimension Low Sample Size (HDLSS) data. We develop here a U-statistics based clustering approach that assesses statistical significance in clustering and is specifically tailored to HDLSS scenarios. These non-parametric methods rely on very few assumptions about the data, and thus can be applied to a wide range of datasets for which the Euclidean distance captures relevant features. We propose two significance clustering algorithms, a hierarchical method and a non-nested version. In order to do so, we first propose an extension of a relevant U-statistic and develop its asymptotic theory. Our methods are tested through extensive simulations and found to be more powerful than competing alternatives. They are further showcased in two applications ranging from genetics to image recognition problems.

Submitted 30 May, 2018; originally announced May 2018.

Comments: 18 pages, 5 figures

arXiv:1606.03376 [pdf, other]

Clustering and Classification of Genetic Data Through U-Statistics

Authors: Gabriela Bettella Cybis, Marcio Valk, Silvia Regina Costa Lopes

Abstract: Genetic data are frequently categorical and have complex dependence structures that are not always well understood. For this reason, clustering and classification based on genetic data, while highly relevant, are challenging statistical problems. Here we consider a highly versatile U-statistics based approach built on dissimilarities between pairs of data points for nonparametric clustering. In this work we propose statistical tests to assess group homogeneity taking into account the multiple testing issues, and a clustering algorithm based on dissimilarities within and between groups that highly speeds up the homogeneity test. We also propose a test to verify classification significance of a sample in one of two groups. A Monte Carlo simulation study is presented to evaluate power of the classification test, considering different group sizes and degree of separation. Size and power of the homogeneity test are also analyzed through simulations that compare it to competing methods. Finally, the methodology is applied to three different genetic datasets: global human genetic diversity, breast tumor gene expression and Dengue virus serotypes. These applications showcase this statistical framework's ability to answer diverse biological questions while adapting to the specificities of the different datatypes.

Submitted 10 June, 2016; originally announced June 2016.

Comments: 27 pages, 4 figures

Showing 1–7 of 7 results for author: Valk, M