-
Planning minimum regret $CO_2$ pipeline networks
Authors:
Stephan Bogs,
Ali Abdelshafy,
Grit Walther
Abstract:
The transition to a low-carbon economy necessitates effective carbon capture and storage (CCS) solutions, particularly for hard-to-abate sectors. Herein, pipeline networks are indispensable for cost-efficient $CO_2$ transportation over long distances. However, there is deep uncertainty regarding which industrial sectors will participate in such systems. This poses a significant challenge due to substantial investments as well as the lengthy planning and development timelines required for $CO_2$ pipeline projects, which are further constrained by limited upgrade options for already built infrastructure. The economies of scale inherent in pipeline construction exacerbate these challenges, leading to potential regret over earlier decisions. While numerous models were developed to optimize the initial layout of pipeline infrastructure based on known demand, a gap exists in addressing the incremental development of infrastructure in conjunction with deep uncertainty. Hence, this paper introduces a novel optimization model for $CO_2$ pipeline infrastructure development, minimizing regret as its objective function and incorporating various upgrade options, such as looping and pressure increases. The model's effectiveness is also demonstrated by presenting a comprehensive case study of Germany's cement and lime industries. The developed approach quantitatively illustrates the trade-off between different options, which can help in deriving effective strategies for $CO_2$ infrastructure development.
Submitted 17 February, 2025;
originally announced February 2025.
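A minimal sketch of the minimum-regret criterion named in the abstract above, assuming a small finite set of candidate designs and demand scenarios. The design names and cost figures are hypothetical and are not taken from the paper, whose model is a full optimization problem with upgrade options rather than an enumeration.

```python
def min_max_regret(costs):
    """Pick the design whose worst-case regret is smallest.

    costs maps a design name to a list of total costs, one per scenario.
    Regret of a design in a scenario = its cost minus the best cost
    achievable in that scenario by any design.
    """
    n_scenarios = len(next(iter(costs.values())))
    # Cheapest achievable cost in each scenario (hindsight optimum).
    best = [min(c[s] for c in costs.values()) for s in range(n_scenarios)]
    # Worst-case regret of each design across all scenarios.
    max_regret = {
        design: max(c[s] - best[s] for s in range(n_scenarios))
        for design, c in costs.items()
    }
    choice = min(max_regret, key=max_regret.get)
    return choice, max_regret


# Hypothetical costs for two scenarios: low and high industry participation.
costs = {
    "small_pipe": [10, 30],
    "large_pipe": [18, 20],
    "staged_looping": [13, 24],
}
choice, max_regret = min_max_regret(costs)
print(choice, max_regret)
```

In this toy instance neither the small nor the large pipe is chosen: the staged option is never the cheapest in hindsight, but its worst-case regret is the lowest.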
-
Fast and Optimal Changepoint Detection and Localization using Bonferroni Triplets
Authors:
Jayoon Jang,
Guenther Walther
Abstract:
The paper considers the problem of detecting and localizing changepoints in a sequence of independent observations. We propose to evaluate a local test statistic on a triplet of time points, for each such triplet in a particular collection. This collection is sparse enough so that the results of the local tests can simply be combined with a weighted Bonferroni correction. This results in a simple and fast method, {\sl Lean Bonferroni Changepoint detection} (LBD), that provides finite sample guarantees for the existence of changepoints as well as simultaneous confidence intervals for their locations. LBD is free of tuning parameters, and we show that LBD allows optimal inference for the detection of changepoints. To this end, we provide a lower bound for the critical constant that measures the difficulty of the changepoint detection problem, and we show that LBD attains this critical constant. We illustrate LBD for a number of distributional settings, namely when the observations are homoscedastic normal with known or unknown variance, for observations from a natural exponential family, and in a nonparametric setting where we assume only exchangeability for segments without a changepoint.
Submitted 18 October, 2024;
originally announced October 2024.
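A generic sketch of a weighted Bonferroni correction, the combination step mentioned in the abstract above. It assumes the local p-values are already computed; the weights and p-values below are hypothetical, and the specific triplet collection and weighting used by LBD are not reproduced here.

```python
def weighted_bonferroni(p_values, weights, alpha=0.05):
    """Reject hypothesis i when p_i <= alpha * w_i / sum(w).

    Because the per-test levels sum to alpha, the union bound keeps the
    family-wise error rate at or below alpha.
    """
    total = sum(weights)
    return [p <= alpha * w / total for p, w in zip(p_values, weights)]


# Hypothetical local p-values and weights (the third test gets twice the budget).
p_values = [0.001, 0.02, 0.04]
weights = [1, 1, 2]
rejections = weighted_bonferroni(p_values, weights, alpha=0.05)
print(rejections)
```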
-
Global Shipyard Capacities Limiting the Ramp-Up of Global Hydrogen Transport
Authors:
Maximilian Stargardt,
David Kress,
Heidi Heinrichs,
Jörn-Christian Meyer,
Jochen Linßen,
Grit Walther,
Detlef Stolten
Abstract:
Decarbonizing the global energy system requires significant expansions of renewable energy technologies. Given that cost-effective renewable sources are not necessarily situated in proximity to the largest energy demand centers globally, the maritime transportation of low-carbon energy carriers, such as renewable-based hydrogen or ammonia, will be needed. However, whether existent shipyards possess the required capacity to provide the necessary global fleet has not yet been answered. Therefore, this study estimates global tanker demand based on projections for global hydrogen demand, while comparing these projections with historic shipyard production. Our findings reveal a potential bottleneck until 2033-2039 if relying on liquefied hydrogen exclusively. This bottleneck could be circumvented by increasing local hydrogen production, utilizing pipelines, or liquefied ammonia as an energy carrier for hydrogen. Furthermore, the regional concentration of shipyard locations raises concerns about diversification. Increasing demand for container vessels could substantially hinder the scale-up of maritime hydrogen transport.
Submitted 30 April, 2024; v1 submitted 14 March, 2024;
originally announced March 2024.
-
Beta-trees: Multivariate histograms with confidence statements
Authors:
Guenther Walther,
Qian Zhao
Abstract:
Multivariate histograms are difficult to construct due to the curse of dimensionality. Motivated by $k$-d trees in computer science, we show how to construct an efficient data-adaptive partition of Euclidean space that possesses the following two properties: With high confidence the distribution from which the data are generated is close to uniform on each rectangle of the partition; and despite the data-dependent construction we can give guaranteed finite sample simultaneous confidence intervals for the probabilities (and hence for the average densities) of each rectangle in the partition. This partition will automatically adapt to the sizes of the regions where the distribution is close to uniform. The methodology produces confidence intervals whose widths depend only on the probability content of the rectangles and not on the dimensionality of the space, thus avoiding the curse of dimensionality. Moreover, the widths essentially match the optimal widths in the univariate setting. The simultaneous validity of the confidence intervals allows to use this construction, which we call {\sl Beta-trees}, for various data-analytic purposes. We illustrate this by using Beta-trees for visualizing data and for multivariate mode-hunting.
Submitted 2 August, 2023;
originally announced August 2023.
-
Designing Tractable Piecewise Affine Policies for Multi-Stage Adjustable Robust Optimization
Authors:
Simon Thomä,
Grit Walther,
Maximilian Schiffer
Abstract:
We study piecewise affine policies for multi-stage adjustable robust optimization (ARO) problems with non-negative right-hand side uncertainty. First, we construct new dominating uncertainty sets and show how a multi-stage ARO problem can be solved efficiently with a linear program when uncertainty is replaced by these new sets. We then demonstrate how solutions for this alternative problem can be transformed into solutions for the original problem. By carefully choosing the dominating sets, we prove strong approximation bounds for our policies and extend many previously best-known bounds for the two-staged problem variant to its multi-stage counterpart. Moreover, the new bounds are - to the best of our knowledge - the first bounds shown for the general multi-stage ARO problem considered. We extensively compare our policies to other policies from the literature and prove relative performance guarantees. In two numerical experiments, we identify beneficial and disadvantageous properties for different policies and present effective adjustments to tackle the most critical disadvantages of our policies. Overall, the experiments show that our piecewise affine policies can be computed by orders of magnitude faster than affine policies, while often yielding comparable or even better results.
Submitted 15 December, 2023; v1 submitted 1 July, 2022;
originally announced July 2022.
-
Tail bounds for empirically standardized sums
Authors:
Guenther Walther
Abstract:
Exponential tail bounds for sums play an important role in statistics, but the example of the $t$-statistic shows that the exponential tail decay may be lost when population parameters need to be estimated from the data. However, it turns out that if Studentizing is accompanied by estimating the location parameter in a suitable way, then the $t$-statistic regains the exponential tail behavior. Motivated by this example, the paper analyzes other ways of empirically standardizing sums and establishes tail bounds that are sub-Gaussian or even closer to normal for the following settings: Standardization with Studentized contrasts for normal observations, standardization with the log likelihood ratio statistic for observations from an exponential family, and standardization via self-normalization for observations from a symmetric distribution with unknown center of symmetry. The latter standardization gives rise to a novel scan statistic for heteroscedastic data whose asymptotic power is analyzed in the case where the observations have a log-concave distribution.
Submitted 19 March, 2022; v1 submitted 13 September, 2021;
originally announced September 2021.
-
Calibrating the scan statistic with size-dependent critical values: heuristics, methodology and computation
Authors:
Guenther Walther
Abstract:
It is known that the scan statistic with variable window size favors the detection of signals with small spatial extent and there is a corresponding loss of power for signals with large spatial extent. Recent results have shown that this loss is not inevitable: Using critical values that depend on the size of the window allows optimal detection for all signal sizes simultaneously, so there is no substantial price to pay for not knowing the correct window size and for scanning with a variable window size. This paper gives a review of the heuristics and methodology for such size-dependent critical values, their applications to various settings including the multivariate case, and recent results about fast algorithms for computing scan statistics.
Submitted 14 February, 2022; v1 submitted 17 July, 2021;
originally announced July 2021.
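A minimal sketch of a scan with size-dependent calibration, assuming unit-variance Gaussian noise. The penalty term sqrt(2 log(e n / |I|)) is one form used in the multiscale scanning literature and is an assumption here, not something stated in the abstract above. The brute-force loop over all intervals is for illustration only and is not one of the fast algorithms the paper reviews.

```python
import math


def penalized_scan(x):
    """Return (statistic, (start, end)) maximizing the penalized scan.

    For each interval [start, end) the standardized sum is reduced by a
    penalty that is larger for short windows, so that short and long
    windows compete on an equal footing.
    """
    n = len(x)
    prefix = [0.0]
    for v in x:
        prefix.append(prefix[-1] + v)
    best_stat, best_interval = -math.inf, None
    for start in range(n):
        for end in range(start + 1, n + 1):
            length = end - start
            z = (prefix[end] - prefix[start]) / math.sqrt(length)
            stat = z - math.sqrt(2 * math.log(math.e * n / length))
            if stat > best_stat:
                best_stat, best_interval = stat, (start, end)
    return best_stat, best_interval


# Noise-free toy signal: elevated mean 3 on positions 10..14.
x = [0] * 10 + [3] * 5 + [0] * 10
stat, interval = penalized_scan(x)
print(stat, interval)
```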
-
A Concise Guide on the Integration of Battery Electric Buses into Urban Bus Networks
Authors:
Nicolas Dirks,
Dennis Wagner,
Maximilian Schiffer,
Grit Walther
Abstract:
With the increasing market penetration of battery-electric buses into urban bus networks, practitioners face many novel planning problems. As a result, the interest in optimization-based decision-making for these planning problems increases but practitioners' requirements on planning solutions and current academic approaches often diverge. Against this background, this survey aims to provide a concise guide on optimization-based planning approaches for integrating battery-electric buses into urban bus networks for both practitioners and academics. First, we derive practitioners' requirements for integrating battery-electric buses from state-of-the-art specifications, project reports, and expert knowledge. Second, we analyze whether existing optimization-based planning models fulfill these practitioners' requirements. Based on this analysis, we carve out the existing gap between practice and research and discuss how to address these in future research.
Submitted 21 April, 2021;
originally announced April 2021.
-
On the Integration of Battery Electric Buses into Urban Bus Networks
Authors:
Nicolas Dirks,
Maximilian Schiffer,
Grit Walther
Abstract:
Cities all around the world struggle with urban air quality due to transportation-related emissions. In public transport networks, replacing internal combustion engine buses by electric buses provides an opportunity to improve air quality. Hence, many bus network operators currently ask for an optimal transformation plan to integrate battery electric buses into their fleet. Ideally, this plan also considers the installation of necessary charging infrastructure to ensure a fleet's operational feasibility. Against this background, we introduce an integrated modeling approach to determine a cost-optimal, long-term, multi-period transformation plan for integrating battery electric buses into urban bus networks. Our model connects central strategic and operational decisions. We minimize total cost of ownership and analyze potential reductions of nitrogen oxide emissions. Our results are based on a case study of a real-world bus network and show that a comprehensive integration of battery electric buses is feasible and economically beneficial. By analyzing the impact of battery capacities and charging power on the optimal fleet transformation, we show that medium-power charging facilities combined with medium-capacity batteries are superior to networks with low-power or high-power charging facilities.
Submitted 22 March, 2021;
originally announced March 2021.
-
Confidence bands for a log-concave density
Authors:
Guenther Walther,
Alnur Ali,
Xinyue Shen,
Stephen Boyd
Abstract:
We present a new approach for inference about a log-concave distribution: Instead of using the method of maximum likelihood, we propose to incorporate the log-concavity constraint in an appropriate nonparametric confidence set for the cdf $F$. This approach has the advantage that it automatically provides a measure of statistical uncertainty and it thus overcomes a marked limitation of the maximum likelihood estimate. In particular, we show how to construct confidence bands for the density that have a finite sample guaranteed confidence level. The nonparametric confidence set for $F$ which we introduce here has attractive computational and statistical properties: It allows to bring modern tools from optimization to bear on this problem via difference of convex programming, and it results in optimal statistical inference. We show that the width of the resulting confidence bands converges at nearly the parametric $n^{-\frac{1}{2}}$ rate when the log density is $k$-affine.
Submitted 6 May, 2022; v1 submitted 6 November, 2020;
originally announced November 2020.
-
Calibrating the scan statistic: finite sample performance vs. asymptotics
Authors:
Guenther Walther,
Andrew Perry
Abstract:
We consider the problem of detecting an elevated mean on an interval with unknown location and length in the univariate Gaussian sequence model. Recent results have shown that using scale-dependent critical values for the scan statistic allows to attain asymptotically optimal detection simultaneously for all signal lengths, thereby improving on the traditional scan, but this procedure has been criticized for losing too much power for short signals. We explain this discrepancy by showing that these asymptotic optimality results will necessarily be too imprecise to discern the performance of scan statistics in a practically relevant way, even in a large sample context. Instead, we propose to assess the performance with a new finite sample criterion. We then present three calibrations for scan statistics that perform well across a range of relevant signal lengths: The first calibration uses a particular adjustment to the critical values and is therefore tailored to the Gaussian case. The second calibration uses a scale-dependent adjustment to the significance levels and is therefore applicable to arbitrary known null distributions. The third calibration restricts the scan to a particular sparse subset of the scan windows and then applies a weighted Bonferroni adjustment to the corresponding test statistics. This {\sl Bonferroni scan} is also applicable to arbitrary null distributions and in addition is very simple to implement. We show how to apply these calibrations for scanning in a number of distributional settings: for normal observations with an unknown baseline and a known or unknown constant variance, for observations from a natural exponential family, for potentially heteroscedastic observations from a symmetric density by employing self-normalization in a novel way, and for exchangeable observations using tests based on permutations, ranks or signs.
Submitted 17 July, 2021; v1 submitted 13 August, 2020;
originally announced August 2020.
-
Large-scale inference with block structure
Authors:
Jiyao Kou,
Guenther Walther
Abstract:
The detection of weak and rare effects in large amounts of data arises in a number of modern data analysis problems. Known results show that in this situation the potential of statistical inference is severely limited by the large-scale multiple testing that is inherent in these problems. Here we show that fundamentally more powerful statistical inference is possible when there is some structure in the signal that can be exploited, e.g. if the signal is clustered in many small blocks, as is the case in some relevant applications. We derive the detection boundary in such a situation where we allow both the number of blocks and the block length to grow polynomially with sample size. We derive these results both for the univariate and the multivariate settings as well as for the problem of detecting clusters in a network. These results recover as special cases the sparse mixture detection problem (Donoho and Jin, 2004) where there is no structure in the signal, as well as the scan problem (Chan and Walther, 2013) where the signal comprises a single interval. We develop methodology that allows optimal adaptive detection in the general setting, thus exploiting the structure if it is present without incurring a relevant penalty in the case where there is no structure. The advantage of this methodology can be considerable, as in the case of no structure the means need to increase at the rate $\sqrt{\log n}$ to ensure detection, while the presence of structure allows detection even if the means $decrease$ at a polynomial rate.
Submitted 7 May, 2022; v1 submitted 28 June, 2019;
originally announced July 2019.
-
UniMorph 2.0: Universal Morphology
Authors:
Christo Kirov,
Ryan Cotterell,
John Sylak-Glassman,
Géraldine Walther,
Ekaterina Vylomova,
Patrick Xia,
Manaal Faruqui,
Sabrina J. Mielke,
Arya D. McCarthy,
Sandra Kübler,
David Yarowsky,
Jason Eisner,
Mans Hulden
Abstract:
The Universal Morphology (UniMorph) project is a collaborative effort to improve how NLP handles complex morphology across the world's languages. The project releases annotated morphological data using a universal tagset, the UniMorph schema. Each inflected form is associated with a lemma, which typically carries its underlying lexical meaning, and a bundle of morphological features from our schema. Additional supporting data and tools are also released on a per-language basis when available. UniMorph is based at the Center for Language and Speech Processing (CLSP) at Johns Hopkins University in Baltimore, Maryland and is sponsored by the DARPA LORELEI program. This paper details advances made to the collection, annotation, and dissemination of project resources since the initial UniMorph release described at LREC 2016.
Submitted 25 February, 2020; v1 submitted 25 October, 2018;
originally announced October 2018.
-
The CoNLL--SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection
Authors:
Ryan Cotterell,
Christo Kirov,
John Sylak-Glassman,
Géraldine Walther,
Ekaterina Vylomova,
Arya D. McCarthy,
Katharina Kann,
Sabrina J. Mielke,
Garrett Nicolai,
Miikka Silfverberg,
David Yarowsky,
Jason Eisner,
Mans Hulden
Abstract:
The CoNLL--SIGMORPHON 2018 shared task on supervised learning of morphological generation featured data sets from 103 typologically diverse languages. Apart from extending the number of languages involved in earlier supervised tasks of generating inflected forms, this year the shared task also featured a new second task which asked participants to inflect words in sentential context, similar to a cloze task. This second task featured seven languages. Task 1 received 27 submissions and task 2 received 6 submissions. Both tasks featured a low, medium, and high data condition. Nearly all submissions featured a neural component and built on highly-ranked systems from the earlier 2017 shared task. In the inflection task (task 1), 41 of the 52 languages present in last year's inflection task showed improvement by the best systems in the low-resource setting. The cloze task (task 2) proved to be difficult, and few submissions managed to consistently improve upon both a simple neural baseline system and a lemma-repeating baseline.
Submitted 25 February, 2020; v1 submitted 16 October, 2018;
originally announced October 2018.
-
CoNLL-SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection in 52 Languages
Authors:
Ryan Cotterell,
Christo Kirov,
John Sylak-Glassman,
Géraldine Walther,
Ekaterina Vylomova,
Patrick Xia,
Manaal Faruqui,
Sandra Kübler,
David Yarowsky,
Jason Eisner,
Mans Hulden
Abstract:
The CoNLL-SIGMORPHON 2017 shared task on supervised morphological generation required systems to be trained and tested in each of 52 typologically diverse languages. In sub-task 1, submitted systems were asked to predict a specific inflected form of a given lemma. In sub-task 2, systems were given a lemma and some of its specific inflected forms, and asked to complete the inflectional paradigm by predicting all of the remaining inflected forms. Both sub-tasks included high, medium, and low-resource conditions. Sub-task 1 received 24 system submissions, while sub-task 2 received 3 system submissions. Following the success of neural sequence-to-sequence models in the SIGMORPHON 2016 shared task, all but one of the submissions included a neural component. The results show that high performance can be achieved with small training datasets, so long as models have appropriate inductive bias or make use of additional unlabeled data or synthetic data. However, different biasing and data augmentation resulted in disjoint sets of inflected forms being predicted correctly, suggesting that there is room for future improvement.
Submitted 4 July, 2017; v1 submitted 27 June, 2017;
originally announced June 2017.
-
The Essential Histogram
Authors:
Housen Li,
Axel Munk,
Hannes Sieling,
Guenther Walther
Abstract:
The histogram is widely used as a simple, exploratory display of data, but it is usually not clear how to choose the number and size of bins. We construct a confidence set of distribution functions that optimally address the two main tasks of the histogram: estimating probabilities and detecting features such as increases and modes in the distribution. We define the essential histogram as the histogram in the confidence set with the fewest bins. Thus the essential histogram is the simplest visualization of the data that optimally achieves the main tasks of the histogram. The only assumption we make is that the data are independent and identically distributed. We provide a fast algorithm for the essential histogram, and illustrate our methodology with examples. An R-package is available on CRAN.
Submitted 28 May, 2019; v1 submitted 21 December, 2016;
originally announced December 2016.
-
Optimal detection of multi-sample aligned sparse signals
Authors:
Hock Peng Chan,
Guenther Walther
Abstract:
We describe, in the detection of multi-sample aligned sparse signals, the critical boundary separating detectable from nondetectable signals, and construct tests that achieve optimal detectability: penalized versions of the Berk-Jones and the higher-criticism test statistics evaluated over pooled scans, and an average likelihood ratio over the critical boundary. We show in our results an inter-play between the scale of the sequence length to signal length ratio, and the sparseness of the signals. In particular the difficulty of the detection problem is not noticeably affected unless this ratio grows exponentially with the number of sequences. We also recover the multiscale and sparse mixture testing problems as illustrative special cases.
Submitted 13 October, 2015;
originally announced October 2015.
-
Adaptive Concentration of Regression Trees, with Application to Random Forests
Authors:
Stefan Wager,
Guenther Walther
Abstract:
We study the convergence of the predictive surface of regression trees and forests. To support our analysis we introduce a notion of adaptive concentration for regression trees. This approach breaks tree training into a model selection phase in which we pick the tree splits, followed by a model fitting phase where we find the best regression model consistent with these splits. We then show that the fitted regression tree concentrates around the optimal predictor with the same splits: as d and n get large, the discrepancy is with high probability bounded on the order of sqrt(log(d) log(n)/k) uniformly over the whole regression surface, where d is the dimension of the feature space, n is the number of training examples, and k is the minimum leaf size for each tree. We also provide rate-matching lower bounds for this adaptive concentration statement. From a practical perspective, our result enables us to prove consistency results for adaptively grown forests in high dimensions, and to carry out valid post-selection inference in the sense of Berk et al. [2013] for subgroups defined by tree leaves.
Submitted 30 April, 2016; v1 submitted 22 March, 2015;
originally announced March 2015.
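The rate quoted in the abstract above, sqrt(log(d) log(n)/k), can be evaluated numerically to see how the minimum leaf size k trades off against the dimension d and the sample size n. The following is only a minimal sketch: the function name `concentration_rate` is hypothetical, and the constants and exact conditions of the paper's bound are not given in the abstract, so only the order of magnitude is computed.

```python
import math


def concentration_rate(d: int, n: int, k: int) -> float:
    """Return sqrt(log(d) * log(n) / k), the order of the bound in the abstract.

    d: dimension of the feature space
    n: number of training examples
    k: minimum leaf size of each tree
    Constants are omitted (assumption: only the rate is illustrated).
    """
    if d < 2 or n < 2 or k < 1:
        raise ValueError("need d >= 2, n >= 2 and k >= 1")
    return math.sqrt(math.log(d) * math.log(n) / k)


if __name__ == "__main__":
    # Larger leaves shrink the rate; growing d or n only enters logarithmically.
    for k in (10, 100, 1000):
        print(k, concentration_rate(d=100, n=100_000, k=k))
```

Under this expression, quadrupling k halves the rate, while d and n contribute only through their logarithms.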
-
Peer assessment enhances student learning
Authors:
Dennis L. Sun,
Naftali Harris,
Guenther Walther,
Michael Baiocchi
Abstract:
Feedback has a powerful influence on learning, but it is also expensive to provide. In large classes, it may even be impossible for instructors to provide individualized feedback. Peer assessment has received attention lately as a way of providing personalized feedback that scales to large classes. Besides these obvious benefits, some researchers have also conjectured that students learn by peer assessing, although no studies have ever conclusively demonstrated this effect. By conducting a randomized controlled trial in an introductory statistics class, we provide evidence that peer assessment causes significant gains in student achievement. The strength of our conclusions depends critically on the careful design of the experiment, which was made possible by a web-based platform that we developed. Hence, our study is also a proof of concept of the high-quality experiments that are possible with online tools.
Submitted 14 October, 2014;
originally announced October 2014.
-
Gold and Methane: A Noble Combination for Delicate Oxidation
Authors:
Duncan J. Mowbray,
Annapaola Migani,
Guido Walther,
David M. Cardamone,
Angel Rubio
Abstract:
The ability to partially oxidize methane at low temperatures and pressures would have important environmental and economic applications. Although methane oxidation on gold nanoparticles has been observed experimentally, our density functional theory (DFT) calculations indicate neither CH4, CH3, nor H adsorb on a neutral gold nanoparticle. However, by positively charging gold nanoparticles, e.g. through charge transfer to the TiO2 substrate, CH4 binding increases while O2 binding remains relatively unchanged. We demonstrate that CH4 adsorption is via bonding with the metal s levels. This holds from small gold clusters (Au2) to large gold nanoparticles (Au201), and for all fcc transition metal dimers. These results provide the chemical understanding necessary to tune the catalytic activity of metal nanoparticles for the partial oxidation of methane under delicate conditions.
Submitted 23 August, 2013;
originally announced August 2013.
-
On the Finite Dimensionality of Spaces of Absolutely Convergent Fourier Transforms
Authors:
Björn G. Walther
Abstract:
We extend the result of K. Karlander [Math. Scand. 80 (1997)] regarding finite dimensionality of spaces of absolutely convergent Fourier transforms.
Submitted 12 May, 2013;
originally announced May 2013.
-
Optimal detection of a jump in the intensity of a Poisson process or in a density with likelihood ratio statistics
Authors:
Camilo Rivera,
Guenther Walther
Abstract:
We consider the problem of detecting a `bump' in the intensity of a Poisson process or in a density. We analyze two types of likelihood ratio based statistics which allow for exact finite sample inference and asymptotically optimal detection: The maximum of the penalized square root of log likelihood ratios (`penalized scan') evaluated over a certain sparse set of intervals, and a certain average of log likelihood ratios (`condensed average likelihood ratio'). We show that penalizing the square root of the log likelihood ratio - rather than the log likelihood ratio itself - leads to a simple penalty term that yields optimal power. The thus derived penalty may prove useful for other problems that involve a Brownian bridge in the limit. The second key tool is an approximating set of intervals that is rich enough to allow for optimal detection but which is also sparse enough to allow justifying the validity of the penalization scheme simply via the union bound. This results in a considerable simplification in the theoretical treatment compared to the usual approach for this type of penalization technique, which requires establishing an exponential inequality for the variation of the test statistic. Another advantage of using the sparse approximating set is that it allows fast computation in nearly linear time.
We present a simulation study that illustrates the superior performance of the penalized scan and of the condensed average likelihood ratio compared to the standard scan statistic.
Submitted 25 February, 2014; v1 submitted 12 November, 2012;
originally announced November 2012.
-
The Average Likelihood Ratio for Large-scale Multiple Testing and Detecting Sparse Mixtures
Authors:
Guenther Walther
Abstract:
Large-scale multiple testing problems require the simultaneous assessment of many p-values. This paper compares several methods to assess the evidence in multiple binomial counts of p-values: the maximum of the binomial counts after standardization (the `higher-criticism statistic'), the maximum of the binomial counts after a log-likelihood ratio transformation (the `Berk-Jones statistic'), and a newly introduced average of the binomial counts after a likelihood ratio transformation. Simulations show that the higher criticism statistic has a superior performance to the Berk-Jones statistic in the case of very sparse alternatives (sparsity coefficient $β\gtrapprox 0.75$), while the situation is reversed for $β\lessapprox 0.75$. The average likelihood ratio is found to combine the favorable performance of higher criticism in the very sparse case with that of the Berk-Jones statistic in the less sparse case and thus appears to dominate both statistics. Some asymptotic optimality theory is considered but found to set in too slowly to illuminate the above findings, at least for sample sizes up to one million. In contrast, asymptotic approximations to the critical values of the Berk-Jones statistic that have been developed by Wellner and Koltchinskii (2003) and Jager and Wellner (2007) are found to give surprisingly accurate approximations even for quite small sample sizes.
Submitted 1 November, 2011;
originally announced November 2011.
-
Detection with the scan and the average likelihood ratio
Authors:
Hock Peng Chan,
Guenther Walther
Abstract:
We investigate the performance of the scan (maximum likelihood ratio statistic) and of the average likelihood ratio statistic in the problem of detecting a deterministic signal with unknown spatial extent in the prototypical univariate sampled data model with white Gaussian noise. Our results show that the scan statistic, a popular tool for detection problems, is optimal only for the detection of signals with the smallest spatial extent. For signals with larger spatial extent the scan is suboptimal, and the power loss can be considerable. In contrast, the average likelihood ratio statistic is optimal for the detection of signals on all scales except the smallest ones, where its performance is only slightly suboptimal. We give rigorous mathematical statements of these results as well as heuristic explanations which suggest that the essence of these findings applies to detection problems quite generally, such as the detection of clusters in models involving densities or intensities or the detection of multivariate signals. We present a modification of the average likelihood ratio that yields optimal detection of signals with arbitrary spatial extent and which has the additional benefit of allowing for a fast computation of the statistic. In contrast, optimal detection with the scan seems to require the use of scale-dependent critical values.
Submitted 25 February, 2014; v1 submitted 21 July, 2011;
originally announced July 2011.
-
Global Range Estimates for Maximal Oscillatory Integrals with Radial Testfunctions
Authors:
Björn G. Walther
Abstract:
We consider the maximal function of oscillatory integrals and prove a global estimate for radial test functions which is almost sharp with respect to the Sobolev regularity.
Submitted 24 March, 2011;
originally announced March 2011.
-
Inference and Modeling with Log-concave Distributions
Authors:
Guenther Walther
Abstract:
Log-concave distributions are an attractive choice for modeling and inference, for several reasons: The class of log-concave distributions contains most of the commonly used parametric distributions and thus is a rich and flexible nonparametric class of distributions. Further, the MLE exists and can be computed with readily available algorithms. Thus, no tuning parameter, such as a bandwidth, is necessary for estimation. Due to these attractive properties, there has been considerable recent research activity concerning the theory and applications of log-concave distributions. This article gives a review of these results.
Submitted 2 October, 2010;
originally announced October 2010.
-
Optimal and fast detection of spatial clusters with scan statistics
Authors:
Guenther Walther
Abstract:
We consider the detection of multivariate spatial clusters in the Bernoulli model with $N$ locations, where the design distribution has weakly dependent marginals. The locations are scanned with a rectangular window with sides parallel to the axes and with varying sizes and aspect ratios. Multivariate scan statistics pose a statistical problem due to the multiple testing over many scan windows, as well as a computational problem because statistics have to be evaluated on many windows. This paper introduces methodology that leads to both statistically optimal inference and computationally efficient algorithms. The main difference to the traditional calibration of scan statistics is the concept of grouping scan windows according to their sizes, and then applying different critical values to different groups. It is shown that this calibration of the scan statistic results in optimal inference for spatial clusters on both small scales and on large scales, as well as in the case where the cluster lives on one of the marginals. Methodology is introduced that allows for an efficient approximation of the set of all rectangles while still guaranteeing the statistical optimality results described above. It is shown that the resulting scan statistic has a computational complexity that is almost linear in $N$.
Submitted 25 February, 2010;
originally announced February 2010.
-
Multiscale inference about a density
Authors:
Lutz Duembgen,
Günther Walther
Abstract:
We introduce a multiscale test statistic based on local order statistics and spacings that provides simultaneous confidence statements for the existence and location of local increases and decreases of a density or a failure rate. The procedure provides guaranteed finite-sample significance levels, is easy to implement and possesses certain asymptotic optimality and adaptivity properties.
Submitted 7 August, 2008; v1 submitted 27 June, 2007;
originally announced June 2007.
-
Forward stagewise regression and the monotone lasso
Authors:
Trevor Hastie,
Jonathan Taylor,
Robert Tibshirani,
Guenther Walther
Abstract:
We consider the least angle regression and forward stagewise algorithms for solving penalized least squares regression problems. In Efron, Hastie, Johnstone & Tibshirani (2004) it is proved that the least angle regression algorithm, with a small modification, solves the lasso regression problem. Here we give an analogous result for incremental forward stagewise regression, showing that it solves a version of the lasso problem that enforces monotonicity. One consequence of this is as follows: while lasso makes optimal progress in terms of reducing the residual sum-of-squares per unit increase in $L_1$-norm of the coefficient $β$, forward stage-wise is optimal per unit $L_1$ arc-length traveled along the coefficient path. We also study a condition under which the coefficient paths of the lasso are monotone, and hence the different algorithms coincide. Finally, we compare the lasso and forward stagewise procedures in a simulation study involving a large number of correlated predictors.
Submitted 2 May, 2007;
originally announced May 2007.
-
Combined and Comparative Analysis of Power Spectra
Authors:
P. A. Sturrock,
J. D. Scargle,
G. Walther,
M. S. Wheatland
Abstract:
In solar physics, especially in exploratory stages of research, it is often necessary to compare the power spectra of two or more time series. One may, for instance, wish to estimate what the power spectrum of the combined data sets might have been, or one may wish to estimate the significance of a particular peak that shows up in two or more power spectra. One may also on occasion need to search for a complex of peaks in a single power spectrum, such as a fundamental and one or more harmonics, or a fundamental plus sidebands, etc. Visual inspection can be revealing, but it can also be misleading. This leads one to look for one or more ways of forming statistics, which readily lend themselves to significance estimation, from two or more power spectra. We derive formulas for statistics formed from the sum, the minimum, and the product of two or more power spectra. A distinguishing feature of our formulae is that, if each power spectrum has an exponential distribution, each statistic also has an exponential distribution.
Submitted 2 February, 2005;
originally announced February 2005.
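The abstract above states that statistics built from the minimum of several power spectra can be arranged to keep an exponential distribution. One standard construction with this property is sketched below; it is an illustration only and not necessarily the formula derived in the paper. The assumption is that the n powers at a given frequency are independent and each follows a unit-mean exponential distribution under the null hypothesis, in which case n times their minimum is again unit-mean exponential. The names `min_power_statistic` and `simulate_null` are hypothetical.

```python
import math
import random


def min_power_statistic(powers):
    """Scale the smallest of n powers by n.

    If the n powers are independent unit-mean exponentials, their minimum is
    exponential with rate n, so n * min(powers) is unit-mean exponential.
    """
    return len(powers) * min(powers)


def simulate_null(n_spectra=3, trials=20000, seed=0):
    """Monte Carlo check of the null distribution (assumed setup, not real data)."""
    rng = random.Random(seed)
    stats = [
        min_power_statistic([rng.expovariate(1.0) for _ in range(n_spectra)])
        for _ in range(trials)
    ]
    mean = sum(stats) / trials
    tail = sum(s > 1.0 for s in stats) / trials  # compare with exp(-1)
    return mean, tail


if __name__ == "__main__":
    mean, tail = simulate_null()
    print("sample mean:", mean, "P(stat > 1):", tail, "exp(-1):", math.exp(-1))
```

Because the statistic is unit-mean exponential under these assumptions, an observed value u would have significance exp(-u), the same rule that applies to a single exponentially distributed power.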
-
Comment on "Search for periodic modulations of the solar neutrino flux in Super-Kamiokande" by J. Yoo et al
Authors:
P. A. Sturrock,
D. O. Caldwell,
J. D. Scargle,
G. Walther,
M. S. Wheatland
Abstract:
We comment on a recent article by Yoo et al. that presents an analysis of Super-Kamiokande 10-day and 5-day data, correcting certain errors in that article. We also point out that, in using the Lomb-Scargle method of power spectrum analysis, Yoo et al. ignore much of the relevant data. A likelihood analysis, that can take account of all of the relevant data, yields evidence indicative of modulation by solar processes.
Submitted 3 August, 2004; v1 submitted 23 March, 2004;
originally announced March 2004.
-
Rotational Signature and Possible R-Mode Signature in the GALLEX Solar Neutrino Data
Authors:
P. A. Sturrock,
J. D. Scargle,
G. Walther,
M. S. Wheatland
Abstract:
Recent analysis of the Homestake data indicates that the solar neutrino flux contains a periodic variation that may be attributed to rotational modulation occurring deep in the solar interior, either in the tachocline or in the radiative zone. This paper presents an analysis of GALLEX data that yields supporting evidence of this rotational modulation at the 0.1% significance level. The depth of modulation inferred from the rotational signature is large enough to explain the neutrino deficit. The Rieger 157-day periodicity, first discovered in solar gamma-ray flares, is present also in Homestake data. A related oscillation with period 52 days is found in the GALLEX data. The relationship of these periods to the rotational period inferred from neutrino data suggests that they are due to r-mode oscillations.
Submitted 21 April, 1999;
originally announced April 1999.
-
Acoustoelectric Study of Interface Trapping Defects in GaAs Epitaxial Structures
Authors:
I. V. Ostrovskii,
S. V. Saiko,
O. Ya. Olikh,
H. G. Walther
Abstract:
A new acousto-electrical method making use of transient transverse acoustoelectric voltage (TAV) to study solid state structures is reported. This voltage arises after a surface acoustic wave (SAW) generating the signal is switched off. Related measurements consist in detecting the shape of transient voltage and its spectral and temperature dependence. Both theory and experiment show that this method is an effective tool to characterize trapping centers in the bulk as well as at surfaces or interfaces of epitaxial semiconductor structures.
Submitted 23 October, 1997;
originally announced October 1997.
-
Absence of Correlation between the Solar Neutrino Flux and the Sunspot Number
Authors:
Guenther Walther
Abstract:
There exists a considerable amount of research claiming a puzzling anti-correlation between the neutrino detection rate at the Homestake experiment and indicators of solar activity such as the sunspot number, giving rise to explanations involving the hypothesis of a neutrino magnetic moment. It is argued here that the claimed significant anti-correlation is due to a statistical fallacy. A proper test based on certain optimality criteria fails to detect a significant time variation of the neutrino flux in concert with the sunspot number, providing evidence that the observations are consistent with no correlation between the two series.
Submitted 2 October, 1997;
originally announced October 1997.
-
An Apparent Periodicity in the Gallex, Homestake and Kamiokande Neutrino Data
Authors:
P. A. Sturrock,
G. Walther
Abstract:
In order to explore a recent proposal that the solar core may contain a component that varies periodically with a period in the range 21.0 - 22.4 days, due either to rotation or to some form of oscillation, we have examined the time series formed from measurements of the solar neutrino flux by means of the GALLEX, Homestake and Kamiokande experiments. Direct Fourier transform analysis of the Homestake data shows that the most prominent peak in the entire spectrum (examined down to 5 days period) is found at a frequency of approximately 17.2 y$^{-1}$ corresponding to a period of approximately 21.3 days. According to the "shuffle test," the probability of finding this large a peak in the prescribed search band is about 0.03%, if it is assumed that there is no correlation between count rate and time. The GALLEX and Kamiokande data are examined in a way that searches for similarity in the shapes of the two spectra in sliding windows in frequency. We find that the "spectral correlation measure" peaks at 17.2 y$^{-1}$, and the shuffle test indicates that the probability of finding this large a peak at a specified frequency is 2%, if it is again assumed that for each time series there is no correlation between count rate and time. The combined significance estimate is of order 1 part in $10^5$ that the results are due to chance, on the assumption that there is no real structure to the count-rate time series.
Submitted 21 October, 1996; v1 submitted 20 September, 1996;
originally announced September 1996.