-
Signature Isolation Forest
Authors:
Marta Campi,
Guillaume Staerman,
Gareth W. Peters,
Tomoko Matsui
Abstract:
Functional Isolation Forest (FIF) is a recent state-of-the-art Anomaly Detection (AD) algorithm designed for functional data. It relies on a tree partition procedure where an abnormality score is computed by projecting each curve observation on a drawn dictionary through a linear inner product. This linear inner product and the dictionary are a priori choices that strongly influence the algorithm's performance and might lead to unreliable results, particularly with complex datasets. This work addresses these challenges by introducing Signature Isolation Forest, a novel class of AD algorithms leveraging the signature transform from rough path theory. Our objective is to remove the constraints imposed by FIF by proposing two algorithms that specifically target the linearity of the FIF inner product and the choice of the dictionary. We provide several numerical experiments, including a benchmark on real-world applications, showing the relevance of our methods.
Submitted 25 February, 2025; v1 submitted 7 March, 2024;
originally announced March 2024.
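The signature transform mentioned in the abstract above can be illustrated with a minimal sketch. This is not the Signature Isolation Forest algorithm itself, only the first two levels of the signature of a piecewise-linear path, computed segment by segment. The function name `signature_level2` and the list-of-points input format are assumptions made for this illustration.

```python
def signature_level2(path):
    """Truncated signature (levels 1 and 2) of a piecewise-linear path.

    path: list of points, each a sequence of d numbers (hypothetical format).
    Returns (level1, level2): a list of d numbers and a d x d nested list.
    """
    d = len(path[0])
    level1 = [0.0] * d                       # total increment X_t - X_0
    level2 = [[0.0] * d for _ in range(d)]   # iterated integrals S^{ij}
    for start, end in zip(path, path[1:]):
        delta = [b - a for a, b in zip(start, end)]
        for i in range(d):
            for j in range(d):
                # On a linear segment, the integral of (X^i - X^i_0) dX^j equals
                # (increment accumulated so far)^i * delta^j + delta^i * delta^j / 2.
                level2[i][j] += level1[i] * delta[j] + 0.5 * delta[i] * delta[j]
        for i in range(d):
            level1[i] += delta[i]
    return level1, level2
```

For the path (0, 0) → (1, 0) → (1, 1), this returns level 1 equal to `[1.0, 1.0]` and level 2 equal to `[[0.5, 1.0], [0.0, 0.5]]`. The off-diagonal terms differ, which shows that level 2 records the order in which the coordinates move, something a single linear projection of the curve would not capture.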
-
Data-Driven Framework for Uncovering Hidden Control Strategies in Evolutionary Analysis
Authors:
Nourddine Azzaoui,
Tomoko Matsui,
Daisuke Murakami
Abstract:
We have devised a data-driven framework for uncovering hidden control strategies used by an evolutionary system described by an evolutionary probability distribution. This innovative framework enables deciphering of the concealed mechanisms that contribute to the progression or mitigation of such situations as the spread of COVID-19. Novel algorithms are used to estimate the optimal control in tandem with the parameters for evolution in general dynamical systems, thereby extending the concept of model predictive control. This is a significant departure from conventional control methods, which require knowledge of the system to manipulate its evolution and of the controller's strategy or parameters. We used a generalized additive model, supplemented by extensive statistical testing, to identify a set of predictor covariates closely linked to the control. Using real-world COVID-19 data, we successfully delineated the descriptive behaviors of the COVID-19 epidemics in five prefectures in Japan and nine countries. We compared these nine countries and grouped them on the basis of shared profiles, providing valuable insights into their pandemic responses. Our findings underscore the potential of our framework as a powerful tool for understanding and managing complex evolutionary processes.
Submitted 26 September, 2023;
originally announced September 2023.
-
$C^*$-algebra Net: A New Approach Generalizing Neural Network Parameters to $C^*$-algebra
Authors:
Yuka Hashimoto,
Zhao Wang,
Tomoko Matsui
Abstract:
We propose a new framework that generalizes the parameters of neural network models to $C^*$-algebra-valued ones. A $C^*$-algebra is a generalization of the space of complex numbers. A typical example is the space of continuous functions on a compact space. This generalization enables us to combine multiple models continuously and use tools for functions such as regression and integration. Consequently, we can learn features of data efficiently and adapt the models to problems continuously. We apply our framework to practical problems such as density estimation and few-shot learning and show that our framework enables us to learn features of data even with a limited number of samples. Our new framework highlights the potential of applying the theory of $C^*$-algebra to general neural network models.
Submitted 22 June, 2022; v1 submitted 19 June, 2022;
originally announced June 2022.
-
Dynamic Programming and Linear Programming for Odds Problem
Authors:
Sachika Kurokawa,
Tomomi Matsui
Abstract:
This paper discusses the odds problem, proposed by Bruss in 2000, and its variants. A recurrence relation called a dynamic programming (DP) equation is used to find an optimal stopping policy for the odds problem and its variants. In 2013, Buchbinder, Jain, and Singh proposed a linear programming (LP) formulation for finding an optimal stopping policy for the classical secretary problem, which is a special case of the odds problem. The proposed linear programming problem, which maximizes the probability of a win, differs from the long-known DP equations. This paper shows that an ordinary DP equation is a modification of the dual problem of linear programming, including the LP formulation proposed by Buchbinder, Jain, and Singh.
Submitted 27 July, 2021;
originally announced July 2021.
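To make the DP equation for the odds problem concrete, here is a minimal sketch of a textbook backward recursion. It is not taken from the paper and may differ from the authors' formulation. It assumes independent success indicators with probabilities `p[0], ..., p[n-1]`, where a win means stopping on the last success; the function name `odds_problem_dp` is hypothetical.

```python
from fractions import Fraction

def odds_problem_dp(p):
    """Backward DP for the odds problem (illustrative sketch).

    p: success probabilities of independent trials, in observation order.
    Returns (optimal win probability, stop), where stop[k] is True when it is
    optimal to stop on a success observed at trial k.
    """
    n = len(p)
    value = Fraction(0)             # optimal win probability from trial k+1 on
    no_later_success = Fraction(1)  # probability that trials after k all fail
    stop = [False] * n
    for k in range(n - 1, -1, -1):
        pk = Fraction(p[k])
        # Stopping on a success at k wins iff no later success occurs.
        stop[k] = no_later_success >= value
        # DP equation: V_k = p_k * max(W_k, V_{k+1}) + (1 - p_k) * V_{k+1}
        value = pk * max(no_later_success, value) + (1 - pk) * value
        no_later_success *= 1 - pk
    return value, stop
```

For the classical secretary problem with four candidates, the k-th candidate is the best so far with probability 1/k, so `odds_problem_dp([Fraction(1, k) for k in range(1, 5)])` returns `(Fraction(11, 24), [False, True, True, True])`: skip the first candidate, then accept the next one who is the best so far.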
-
Analysis of COVID-19 evolution based on testing closeness of sequential data
Authors:
Tomoko Matsui,
Nourddine Azzaoui,
Daisuke Murakami
Abstract:
A practical algorithm has been developed for closeness analysis of sequential data that combines closeness testing with algorithms based on the Markov chain tester. It was applied to reported sequential data for COVID-19 to analyze the evolution of COVID-19 during a certain time period (week, month, etc.).
Submitted 30 June, 2021;
originally announced June 2021.
-
Improved log-Gaussian approximation for over-dispersed Poisson regression: application to spatial analysis of COVID-19
Authors:
Daisuke Murakami,
Tomoko Matsui
Abstract:
In the era of open data, Poisson and other count regression models are increasingly important. Still, conventional Poisson regression has remaining issues in terms of identifiability and computational efficiency. In particular, owing to an identification problem, Poisson regression can be unstable for small samples with many zeros. Given this, we develop a closed-form inference for over-dispersed Poisson regression, including Poisson additive mixed models. The approach is derived via a mode-based log-Gaussian approximation. The resulting method is fast, practical, and free from the identification problem. Monte Carlo experiments demonstrate that the estimation error of the proposed method is considerably smaller than that of the closed-form alternatives and as small as that of the usual Poisson regression. For counts with many zeros, our approximation has better estimation accuracy than conventional Poisson regression. We obtained similar results in the case of Poisson additive mixed modeling considering spatial or group effects. The developed method was applied to analyze COVID-19 data in Japan. The result suggests that the influences of pedestrian density, age, and other factors on the number of cases change over periods.
Submitted 30 October, 2021; v1 submitted 28 April, 2021;
originally announced April 2021.
-
Compositionally-warped additive mixed modeling for a wide variety of non-Gaussian spatial data
Authors:
Daisuke Murakami,
Mami Kajita,
Seiji Kajita,
Tomoko Matsui
Abstract:
With the advancement of geographical information systems, non-Gaussian spatial data sets are getting larger and more diverse. This study develops a general framework for fast and flexible non-Gaussian regression, especially for spatial/spatiotemporal modeling. The developed model, termed the compositionally-warped additive mixed model (CAMM), combines an additive mixed model (AMM) and the compositionally-warped Gaussian process to model a wide variety of non-Gaussian continuous data including spatial and other effects. A specific advantage of the proposed CAMM is that, unlike existing AMMs, it requires no explicit assumption about the data distribution. Monte Carlo experiments show the estimation accuracy and computational efficiency of CAMM for modeling non-Gaussian data including fat-tailed and/or skewed distributions. Finally, the model is applied to crime data to examine the empirical performance of the regression analysis and prediction. The result shows that CAMM provides intuitively reasonable coefficient estimates and outperforms AMM in terms of prediction accuracy. CAMM is verified to be a fast and flexible model that potentially covers a wide variety of non-Gaussian data modeling. The proposed approach is implemented in the R package spmoran.
Submitted 22 June, 2021; v1 submitted 10 January, 2021;
originally announced January 2021.
-
Spatiotemporal analysis of urban heatwaves using Tukey g-and-h random field models
Authors:
Daisuke Murakami,
Gareth W. Peters,
Tomoko Matsui,
Yoshiki Yamagata
Abstract:
The statistical quantification of temperature processes for the analysis of urban heat island (UHI) effects and local heatwaves is an increasingly important application domain in smart city dynamic modelling. This leads to the increased importance of real-time heatwave risk management on a fine-grained spatial resolution. This study attempts to analyze and develop new methods for modelling the spatio-temporal behavior of ground temperatures. The developed models consider higher-order stochastic spatial properties such as skewness and kurtosis, which are key components for understanding and describing local temperature fluctuations and UHIs. The developed models are applied to the greater Tokyo metropolitan area for a detailed real-world data case study. The analysis also demonstrates how to statistically incorporate a variety of real data sets. This includes remotely sensed imagery and a variety of ground-based monitoring site data to build models linking city and urban covariates to air temperature. The air temperature models are then used to capture high-resolution spatial emulator outputs for ground surface temperature modelling. The main class of processes studied includes the Tukey g-and-h processes for capturing spatial and temporal aspects of heat processes in urban environments.
Submitted 2 April, 2020;
originally announced April 2020.
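The Tukey g-and-h processes named in the abstract above are commonly built from the Tukey g-and-h transformation of a standard Gaussian variable, where g controls skewness and h controls tail heaviness (kurtosis). The sketch below shows only that standard scalar transformation, under the assumption h ≥ 0. The paper's random field construction is not reproduced here, and the function name `tukey_gh` is chosen for this illustration.

```python
import math

def tukey_gh(z, g=0.0, h=0.0):
    """Tukey g-and-h transformation of a standard-normal value z.

    g: skewness parameter (g = 0 gives a symmetric transform).
    h: tail-heaviness parameter, assumed non-negative in this sketch.
    """
    if h < 0:
        raise ValueError("this sketch assumes h >= 0")
    tail = math.exp(0.5 * h * z * z)   # stretches large |z|, adding kurtosis
    if g == 0.0:
        return z * tail                # limit of (exp(g*z) - 1) / g as g -> 0
    return math.expm1(g * z) / g * tail
```

With g = h = 0 the transformation is the identity, so the Gaussian case is recovered. A positive g stretches the right tail more than the left, and a positive h inflates both tails.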
-
A spatiotemporal analysis of participatory sensing data "tweets" and extreme climate events toward real-time urban risk management
Authors:
Yoshiki Yamagata,
Daisuke Murakami,
Gareth W. Peters,
Tomoko Matsui
Abstract:
Real-time urban climate monitoring provides useful information that can be utilized to help monitor and adapt to extreme events, including urban heatwaves. Typical approaches to the monitoring of climate data include weather station monitoring and remote sensing. However, climate monitoring stations are very often sparsely distributed in space, which significantly limits the ability to reveal exposure risks due to extreme climates at an intra-urban scale. Additionally, traditional remote sensing data sources are typically not received and analyzed in real time, which is often required for adaptive urban management of climate extremes, such as sudden heatwaves. Fortunately, recent social media, such as Twitter, furnish real-time and high-resolution spatial information that might be useful for climate condition estimation. The objective of this study is to utilize geo-tagged tweets (participatory sensing data) for urban temperature analysis. We first detect tweets relating to hotness (hot-tweets). Then, we study relationships between monitored temperatures and hot-tweets via a statistical model framework based on copula modelling methods. We demonstrate that there are strong relationships between "hot-tweets" and temperatures recorded at an intra-urban scale. Subsequently, as an application example, we investigate the use of "hot-tweets" to inform spatio-temporal Gaussian process interpolation of temperatures. We utilize a combination of spatially sparse weather monitoring sensor data and spatially and temporally dense lower-quality Twitter data. Here, a spatial best linear unbiased estimation technique is applied. The result suggests that tweets provide some useful auxiliary information for urban climate assessment. Lastly, the effectiveness of tweets for real-time urban risk management is discussed based on the results.
Submitted 17 September, 2015; v1 submitted 22 May, 2015;
originally announced May 2015.