Search | arXiv e-print repository

arXiv:2407.19761 [pdf, other]

doi 10.1016/j.fsigen.2024.103146

Shotgun DNA sequencing for human identification: Dynamic SNP selection and likelihood ratio calculations accounting for errors

Authors: Mikkel Meyer Andersen, Marie-Louise Kampmann, Alberte Honoré Jepsen, Niels Morling, Poul Svante Eriksen, Claus Børsting, Jeppe Dyrberg Andersen

Abstract: In forensic genetics, short tandem repeats (STRs) are used for human identification (HID). Degraded biological trace samples with low amounts of short DNA fragments (low-quality DNA samples) pose a challenge for STR typing. Predefined single nucleotide polymorphisms (SNPs) can be amplified on short PCR fragments and used to generate SNP profiles from low-quality DNA samples. However, the stochastic results from low-quality DNA samples may result in frequent locus drop-outs and insufficient numbers of SNP genotypes for convincing identification of individuals. Shotgun DNA sequencing potentially analyses all DNA fragments in a sample in contrast to the targeted PCR-based sequencing methods and may be applied to DNA samples of very low quality, like heavily compromised crime-scene samples and ancient DNA samples. Here, we developed a statistical model for shotgun sequencing, sequence alignment, and genotype calling. Results from replicated shotgun sequencing of buccal swab (high-quality samples) and hair samples (low-quality samples) were arranged in a genotype-call confusion matrix to estimate the calling error probability by maximum likelihood and Bayesian inference. We developed formulas for calculating the evidential weight as a likelihood ratio (LR) based on data from dynamically selected SNPs from shotgun DNA sequencing. The method accounts for potential genotyping errors. Different genotype quality filters may be applied to account for genotyping errors. An error probability of zero resulted in the forensically commonly used LR formula. When considering a single SNP marker's contribution to the LR, error probabilities larger than zero reduced the LR contribution of matching genotypes and increased the LR in the case of a mismatch. We developed an open-source R package, wgsLR, which implements the method, including estimating the calling error probability and calculating LR values.

Submitted 29 July, 2024; originally announced July 2024.

Comments: 25 pages, 9 figures

arXiv:2201.08659 [pdf, other]

Unity Smoothing for Handling Inconsistent Evidence in Bayesian Networks and Unity Propagation for Faster Inference

Authors: Mads Lindskou, Torben Tvedebrink, Poul Svante Eriksen, Søren Højsgaard, Niels Morling

Abstract: We propose Unity Smoothing (US) for handling inconsistencies between a Bayesian network model and new unseen observations. We show that prediction accuracy using the junction tree algorithm with US is comparable to that of Laplace smoothing. Moreover, in applications where sparsity of the data structures is utilized, US outperforms Laplace smoothing in terms of memory usage. Furthermore, we detail how to avoid redundant calculations that must otherwise be performed during the message passing scheme in the junction tree algorithm, a technique we refer to as Unity Propagation (UP). Experimental results show that it is always faster to exploit UP on top of the Lauritzen-Spiegelhalter message passing scheme for the junction tree algorithm.

Submitted 21 January, 2022; originally announced January 2022.

arXiv:2103.03647 [pdf, other]

sparta: Sparse Tables and their Algebra with a View Towards High Dimensional Graphical Models

Authors: Mads Lindskou, Søren Højsgaard, Poul Svante Eriksen, Torben Tvedebrink

Abstract: A graphical model is a multivariate (potentially very high dimensional) probabilistic model, which is formed by combining lower dimensional components. Inference (computation of conditional probabilities) is based on message passing algorithms that utilize conditional independence structures. In graphical models for discrete variables with finite state spaces, there is a fundamental problem in high dimensions: A discrete distribution is represented by a table of values, and in high dimensions such tables can become prohibitively large. In inference, such tables must be multiplied, which can lead to even larger tables. The sparta package meets this challenge by implementing methods that efficiently handle multiplication and marginalization of sparse tables. The package was written in the R programming language and is freely available from the Comprehensive R Archive Network (CRAN). The companion package jti, also on CRAN, was developed to showcase the potential of sparta in connection to the Junction Tree Algorithm. We show that jti, using sparta as a backend for table operations, is able to handle highly complex graphical models which are otherwise infeasible due to lack of computer memory.

Submitted 2 June, 2021; v1 submitted 5 March, 2021; originally announced March 2021.

arXiv:2103.02366 [pdf, other]

Detecting Outliers in High-dimensional Data with Mixed Variable Types using Conditional Gaussian Regression Models

Authors: Mads Lindskou, Torben Tvedebrink, Poul Svante Eriksen, Niels Morling

Abstract: Outlier detection has gained increasing interest in recent years, due to newly emerging technologies and the huge amount of high-dimensional data that are now available. Outlier detection can help practitioners to identify unwanted noise and/or locate interesting abnormal observations. To address this, we developed a novel method for outlier detection for use in, possibly high-dimensional, datasets with both discrete and continuous variables. We exploit the family of decomposable graphical models in order to model the relationship between the variables and use this to form an exact likelihood ratio test for an observation that is considered an outlier. We show that our method outperforms the state-of-the-art Isolation Forest algorithm on a real data example.

Submitted 19 May, 2021; v1 submitted 3 March, 2021; originally announced March 2021.

arXiv:2012.00513 [pdf, other]

DNA mixture deconvolution using an evolutionary algorithm with multiple populations, hill-climbing, and guided mutation

Authors: Søren B. Vilsen, Torben Tvedebrink, Poul Svante Eriksen

Abstract: DNA samples from crime cases analysed in forensic genetics frequently contain DNA from multiple contributors. These occur as convolutions of the DNA profiles of the individual contributors to the DNA sample. Thus, in cases where one or more of the contributors were unknown, an objective of interest would be the separation, often called deconvolution, of these unknown profiles. In order to obtain deconvolutions of the unknown DNA profiles, we introduced a multiple population evolutionary algorithm (MEA). We allowed the mutation operator of the MEA to utilise that the fitness is based on a probabilistic model and guide it by using the deviations between the observed and the expected value for every element of the encoded individual. This guided mutation operator (GM) was designed such that the larger the deviation, the higher the probability of mutation. Furthermore, the GM was inhomogeneous in time, decreasing to a specified lower bound as the number of iterations increased. We analysed 102 two-person DNA mixture samples in varying mixture proportions. The samples were quantified using two different DNA prep. kits: (1) Illumina ForenSeq Panel B (30 samples), and (2) Applied Biosystems Precision ID Globalfiler NGS STR panel (72 samples). The DNA mixtures were deconvoluted by the MEA and compared to the true DNA profiles of the sample. We analysed three scenarios where we assumed: (1) the DNA profile of the major contributor was unknown, (2) the DNA profile of the minor contributor was unknown, and (3) both DNA profiles were unknown. Furthermore, we conducted a series of sensitivity experiments on the ForenSeq panel by varying the sub-population size, comparing a completely random homogeneous mutation operator to the guided operator with varying mutation decay rates, and allowing for hill-climbing of the parent population.

Submitted 1 December, 2020; originally announced December 2020.

arXiv:1509.07982 [pdf, other]

Targeted Fused Ridge Estimation of Inverse Covariance Matrices from Multiple High-Dimensional Data Classes

Authors: Anders Ellern Bilgrau, Carel F. W. Peeters, Poul Svante Eriksen, Martin Bøgsted, Wessel N. van Wieringen

Abstract: We consider the problem of jointly estimating multiple inverse covariance matrices from high-dimensional data consisting of distinct classes. An $\ell_2$-penalized maximum likelihood approach is employed. The suggested approach is flexible and generic, incorporating several other $\ell_2$-penalized estimators as special cases. In addition, the approach allows specification of target matrices through which prior knowledge may be incorporated and which can stabilize the estimation procedure in high-dimensional settings. The result is a targeted fused ridge estimator that is of use when the precision matrices of the constituent classes are believed to chiefly share the same structure while potentially differing in a number of locations of interest. It has many applications in (multi)factorial study designs. We focus on the graphical interpretation of precision matrices with the proposed estimator then serving as a basis for integrative or meta-analytic Gaussian graphical modeling. Situations are considered in which the classes are defined by data sets and subtypes of diseases. The performance of the proposed estimator in the graphical modeling setting is assessed through extensive simulation experiments. Its practical usability is illustrated by the differential network modeling of 12 large-scale gene expression data sets of diffuse large B-cell lymphoma subtypes. The estimator and its related procedures are incorporated into the R-package rags2ridges.

Submitted 26 March, 2020; v1 submitted 26 September, 2015; originally announced September 2015.

Comments: 52 pages, 11 figures

Journal ref: Journal of Machine Learning Research, 21(26):1–52, 2020

arXiv:1503.07990 [pdf, other]

Estimating a common covariance matrix for network meta-analysis of gene expression datasets in diffuse large B-cell lymphoma

Authors: Anders Ellern Bilgrau, Rasmus Froberg Brøndum, Poul Svante Eriksen, Karen Dybkær, Martin Bøgsted

Abstract: The estimation of covariance matrices of gene expressions has many applications in cancer systems biology. Many gene expression studies, however, are hampered by low sample size and it has therefore become popular to increase sample size by collecting gene expression data across studies. Motivated by the traditional meta-analysis using random effects models, we present a hierarchical random covariance model and use it for the meta-analysis of gene correlation networks across 11 large-scale gene expression studies of diffuse large B-cell lymphoma (DLBCL). We suggest using a maximum likelihood estimator for the underlying common covariance matrix and introduce an EM algorithm for estimation. In simulation experiments comparing the estimated covariance matrices by cophenetic correlation and Kullback-Leibler divergence, the suggested estimator was shown to perform better than, or no worse than, a simple pooled estimator. In a post hoc analysis of the estimated common covariance matrix for the DLBCL data, we were able to identify novel biologically meaningful gene correlation networks with eigengenes of prognostic value. In conclusion, the method seems to provide a generally applicable framework for meta-analysis when multiple features are measured and believed to share a common covariance matrix obscured by study-dependent noise.

Submitted 21 August, 2017; v1 submitted 27 March, 2015; originally announced March 2015.

Comments: 18 pages, 4 figures

arXiv:1406.6508 [pdf, other]

The multivariate Dirichlet-multinomial distribution and its application in forensic genetics to adjust for sub-population effects using the θ-correction

Authors: Torben Tvedebrink, Poul Svante Eriksen, Niels Morling

Abstract: In this paper, we discuss the construction of a multivariate generalisation of the Dirichlet-multinomial distribution. An example from forensic genetics in the statistical analysis of DNA mixtures motivates the study of this multivariate extension. In forensic genetics, adjustment of the match probabilities due to remote ancestry in the population is often done using the so-called θ-correction. This correction increases the probability of observing multiple copies of rare alleles and thereby reduces the weight of the evidence for rare genotypes. By numerical examples, we show how the θ-correction incorporated by the use of the multivariate Dirichlet-multinomial distribution affects the weight of evidence. Furthermore, we demonstrate how the θ-correction can be incorporated in a Markov structure needed to make efficient computations in a Bayesian network.

Submitted 4 November, 2014; v1 submitted 25 June, 2014; originally announced June 2014.

Comments: 11 pages, 4 figures
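
The effect of the θ-correction described in the abstract above can be illustrated with a small sketch. It uses the standard single-locus sampling formula that follows from the Dirichlet-multinomial model (often attributed to Balding and Nichols), not the multivariate construction of the paper itself. The function names and the example values of p and θ are assumptions chosen for illustration.

```python
def next_allele_prob(p, m, n, theta):
    """P(next sampled allele is A | m of the n alleles seen so far are A).

    p is the population frequency of allele A and theta the
    coancestry coefficient (theta = 0 recovers plain frequency p).
    """
    return (m * theta + (1.0 - theta) * p) / (1.0 + (n - 1) * theta)


def homozygote_match_prob(p, theta):
    """P(unknown person is AA | suspect is AA)."""
    # Suspect contributes two A alleles; the unknown's alleles are
    # the third and fourth draws.
    return next_allele_prob(p, 2, 2, theta) * next_allele_prob(p, 3, 3, theta)


def heterozygote_match_prob(p_a, p_b, theta):
    """P(unknown person is AB | suspect is AB)."""
    # Factor 2 covers both orders (A then B, B then A).
    return (2.0 * next_allele_prob(p_a, 1, 2, theta)
            * next_allele_prob(p_b, 1, 3, theta))


if __name__ == "__main__":
    p = 0.01  # a rare allele (illustrative value)
    for theta in (0.0, 0.01, 0.03):
        mp = homozygote_match_prob(p, theta)
        print(f"theta={theta:.2f}  match prob={mp:.3e}  LR={1.0 / mp:.3e}")
```

With θ = 0 the homozygote match probability reduces to p², and any θ > 0 raises it for a rare allele, which lowers the likelihood ratio 1/(match probability), in line with the abstract's statement that the correction reduces the weight of evidence for rare genotypes.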

arXiv:1304.2129 [pdf, other]

A gentle introduction to the discrete Laplace method for estimating Y-STR haplotype frequencies

Authors: Mikkel Meyer Andersen, Poul Svante Eriksen, Niels Morling

Abstract: Y-STR data simulated under a Fisher-Wright model of evolution with a single-step mutation model turns out to be well predicted by a method using discrete Laplace distributions.

Submitted 16 October, 2013; v1 submitted 8 April, 2013; originally announced April 2013.

Comments: 18 pages, 5 figures
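
As a rough illustration of the discrete Laplace distribution named in the abstract above, the sketch below evaluates its probability mass function, P(X = x) = ((1 - p)/(1 + p)) · p^|x - μ| for integer x, and multiplies it across loci for a single central haplotype. Treating loci as independent around one centre is a simplifying assumption made here; the function names and the example haplotypes and parameters are hypothetical.

```python
def discrete_laplace_pmf(x, mu, p):
    """Probability of integer x under a discrete Laplace distribution.

    mu is the integer location and 0 < p < 1 the dispersion parameter.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return (1.0 - p) / (1.0 + p) * p ** abs(x - mu)


def haplotype_prob(haplotype, center, p_per_locus):
    """Probability of a haplotype around one central haplotype.

    Assumes (for illustration) that loci are independent given the
    centre, so the per-locus probabilities are multiplied.
    """
    prob = 1.0
    for allele, mu, p in zip(haplotype, center, p_per_locus):
        prob *= discrete_laplace_pmf(allele, mu, p)
    return prob


if __name__ == "__main__":
    center = (14, 12, 29)          # hypothetical central Y-STR haplotype
    dispersion = (0.3, 0.2, 0.4)   # hypothetical per-locus parameters
    for hap in [(14, 12, 29), (15, 12, 29), (16, 11, 30)]:
        print(hap, haplotype_prob(hap, center, dispersion))
```

Haplotypes further from the centre in repeat counts receive geometrically smaller probabilities, which is the property that makes the distribution a natural fit for single-step mutation data.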

arXiv:1210.1773 [pdf, other]

Efficient Forward Simulation of Fisher-Wright Populations with Stochastic Population Size and Neutral Single Step Mutations in Haplotypes

Authors: Mikkel Meyer Andersen, Poul Svante Eriksen

Abstract: In both population genetics and forensic genetics it is important to know how haplotypes are distributed in a population. Simulation of population dynamics helps facilitate research on the distribution of haplotypes. In forensic genetics, the haplotypes can for example consist of lineage markers such as short tandem repeat loci on the Y chromosome (Y-STR). A dominant model for describing population dynamics is the simple, yet powerful, Fisher-Wright model. We describe an efficient algorithm for exact forward simulation of exact Fisher-Wright populations (rather than approximations such as the coalescent model). The efficiency comes from convenient data structures by changing the traditional view from individuals to haplotypes. The algorithm is implemented in the open-source R package 'fwsim' and is able to simulate very large populations. We focus on a haploid model and assume stochastic population size with flexible growth specification, no selection, a neutral single step mutation process, and self-reproducing individuals. These assumptions make the algorithm ideal for studying lineage markers such as Y-STR.

Submitted 5 October, 2012; originally announced October 2012.

Comments: 17 pages, 6 figures

MSC Class: 62-04 ACM Class: G.3
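
The haplotype-centred view mentioned in the abstract above can be sketched as follows. This is a deliberately simplified toy version, not the algorithm of the fwsim package: it assumes a constant population size (the paper allows stochastic size), draws parents and mutations one child at a time, and uses hypothetical function names. The population is stored as a mapping from haplotype to count rather than as a list of individuals.

```python
import random
from collections import Counter


def next_generation(population, n_children, mu, rng):
    """One Fisher-Wright generation with neutral single-step mutation.

    population maps haplotype tuples to counts. Each child picks a
    parent with probability proportional to the haplotype count, then
    each locus mutates by +1 or -1 with probability mu.
    """
    haplotypes = list(population.keys())
    counts = list(population.values())
    parents = rng.choices(haplotypes, weights=counts, k=n_children)
    children = Counter()
    for hap in parents:
        child = tuple(
            allele + rng.choice((-1, 1)) if rng.random() < mu else allele
            for allele in hap
        )
        children[child] += 1
    return children


def simulate(initial, generations, mu, seed=0):
    """Run the toy model at constant size (an assumption of this sketch)."""
    rng = random.Random(seed)
    population = Counter(initial)
    size = sum(population.values())
    for _ in range(generations):
        population = next_generation(population, size, mu, rng)
    return population


if __name__ == "__main__":
    final = simulate({(14, 12, 29): 1000}, generations=50, mu=0.003, seed=42)
    for hap, count in final.most_common(5):
        print(hap, count)
```

Storing counts per haplotype keeps memory proportional to the number of distinct haplotypes rather than the number of individuals. A more efficient version would also sample offspring and mutations per haplotype instead of per child.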

Showing 1–10 of 10 results for author: Eriksen, P S