Search | arXiv e-print repository

arXiv:2509.20161 [pdf, ps, other]

Efficient Multi-Objective Constrained Bayesian Optimization of Bridge Girder

Authors: Heine Havneraas Røstum, Joseph Morlier, Sebastien Gros, Ketil Aas-Jakobsen

Abstract: The buildings and construction sector is a significant source of greenhouse gas emissions, with cement production alone contributing 7% of global emissions and the industry as a whole accounting for approximately 37%. Reducing emissions by optimizing structural design can achieve significant global benefits. This article introduces an efficient multi-objective constrained Bayesian optimization approach to address this challenge. Rather than attempting to determine the full set of non-dominated solutions with arbitrary trade-offs, the approach searches for a solution matching a specified trade-off. Structural design is typically conducted using computationally expensive finite element simulations, whereas Bayesian optimization offers an efficient approach for optimizing problems that involve such high-cost simulations. The proposed method integrates proper orthogonal decomposition for dimensionality reduction of simulation results with Kriging partial least squares to enhance efficiency. Constrained expected improvement is used as an acquisition function for Bayesian optimization. The approach is demonstrated through a case study of a two-lane, three-span post-tensioned concrete bridge girder, incorporating fifteen design variables and nine constraints. A comparison with conventional design methods demonstrates the potential of this optimization approach to achieve substantial cost reductions, with savings of approximately 10% to 15% in financial costs and about 20% in environmental costs for the case study, while ensuring structural integrity.

Submitted 24 September, 2025; originally announced September 2025.

arXiv:2509.12833 [pdf, ps, other]

Safe Reinforcement Learning using Action Projection: Safeguard the Policy or the Environment?

Authors: Hannah Markgraf, Shamburaj Sawant, Hanna Krasowski, Lukas Schäfer, Sebastien Gros, Matthias Althoff

Abstract: Projection-based safety filters, which modify unsafe actions by mapping them to the closest safe alternative, are widely used to enforce safety constraints in reinforcement learning (RL). Two integration strategies are commonly considered: Safe environment RL (SE-RL), where the safeguard is treated as part of the environment, and safe policy RL (SP-RL), where it is embedded within the policy through differentiable optimization layers. Despite their practical relevance in safety-critical settings, a formal understanding of their differences is lacking. In this work, we present a theoretical comparison of SE-RL and SP-RL. We identify a key distinction in how each approach is affected by action aliasing, a phenomenon in which multiple unsafe actions are projected to the same safe action, causing information loss in the policy gradients. In SE-RL, this effect is implicitly approximated by the critic, while in SP-RL, it manifests directly as rank-deficient Jacobians during backpropagation through the safeguard. Our contributions are threefold: (i) a unified formalization of SE-RL and SP-RL in the context of actor-critic algorithms, (ii) a theoretical analysis of their respective policy gradient estimates, highlighting the role of action aliasing, and (iii) a comparative study of mitigation strategies, including a novel penalty-based improvement for SP-RL that aligns with established SE-RL practices. Empirical results support our theoretical predictions, showing that action aliasing is more detrimental for SP-RL than for SE-RL. However, with appropriate improvement strategies, SP-RL can match or outperform improved SE-RL across a range of environments. These findings provide actionable insights for choosing and refining projection-based safe RL methods based on task characteristics.

Submitted 16 September, 2025; originally announced September 2025.

arXiv:2508.02441 [pdf, ps, other]

Computationally efficient Gauss-Newton reinforcement learning for model predictive control

Authors: Dean Brandner, Sebastien Gros, Sergio Lucia

Abstract: Model predictive control (MPC) is widely used in process control due to its interpretability and ability to handle constraints. As a parametric policy in reinforcement learning (RL), MPC offers strong initial performance and low data requirements compared to black-box policies like neural networks. However, most RL methods rely on first-order updates, which scale well to large parameter spaces but converge at most linearly, making them inefficient when each policy update requires solving an optimal control problem, as is the case with MPC. While MPC policies are typically sparsely parameterized and thus amenable to second-order approaches, existing second-order methods demand second-order policy derivatives, which can be computationally and memory-wise intractable. This work introduces a Gauss-Newton approximation of the deterministic policy Hessian that eliminates the need for second-order policy derivatives, enabling superlinear convergence with minimal computational overhead. To further improve robustness, we propose a momentum-based Hessian averaging scheme for stable training under noisy estimates. We demonstrate the effectiveness of the approach on a nonlinear continuously stirred tank reactor (CSTR), showing faster convergence and improved data efficiency over state-of-the-art first-order methods.

Submitted 4 August, 2025; originally announced August 2025.

Comments: 14 pages, 8 figures, submitted to Elsevier

arXiv:2507.04356 [pdf, ps, other]

Mission-Aligned Learning-Informed Control of Autonomous Systems: Formulation and Foundations

Authors: Vyacheslav Kungurtsev, Gustav Sir, Akhil Anand, Sebastien Gros, Haozhe Tian, Homayoun Hamedmoghadam

Abstract: Research, innovation and practical capital investment have been increasing rapidly toward the realization of autonomous physical agents. This includes industrial and service robots, unmanned aerial vehicles, embedded control devices, and a number of other realizations of cybernetic/mechatronic implementations of intelligent autonomous devices. In this paper, we consider a stylized version of robotic care, which would normally involve a two-level Reinforcement Learning procedure that trains a policy for both lower level physical movement decisions as well as higher level conceptual tasks and their sub-components. In order to deliver greater safety and reliability in the system, we present the general formulation of this as a two-level optimization scheme which incorporates control at the lower level, and classical planning at the higher level, integrated with a capacity for learning. This synergistic integration of multiple methodologies -- control, classical planning, and RL -- presents an opportunity for greater insight for algorithm development, leading to more efficient and reliable performance. Here, the notion of reliability pertains to physical safety and interpretability into an otherwise black box operation of autonomous agents, concerning users and regulators. This work presents the necessary background and general formulation of the optimization framework, detailing each component and its integration with the others.

Submitted 6 July, 2025; originally announced July 2025.

arXiv:2505.16242 [pdf, ps, other]

Offline Guarded Safe Reinforcement Learning for Medical Treatment Optimization Strategies

Authors: Runze Yan, Xun Shen, Akifumi Wachi, Sebastien Gros, Anni Zhao, Xiao Hu

Abstract: When applying offline reinforcement learning (RL) in healthcare scenarios, the out-of-distribution (OOD) issues pose significant risks, as inappropriate generalization beyond clinical expertise can result in potentially harmful recommendations. While existing methods like conservative Q-learning (CQL) attempt to address the OOD issue, their effectiveness is limited by only constraining action selection by suppressing uncertain actions. This action-only regularization imitates clinician actions that prioritize short-term rewards, but it fails to regulate downstream state trajectories, thereby limiting the discovery of improved long-term treatment strategies. To safely improve policy beyond clinician recommendations while ensuring that state-action trajectories remain in-distribution, we propose Offline Guarded Safe Reinforcement Learning (OGSRL), a theoretically grounded model-based offline RL framework. OGSRL introduces a novel dual constraint mechanism for improving policy with reliability and safety. First, the OOD guardian is established to specify clinically validated regions for safe policy exploration. By constraining optimization within these regions, it enables the reliable exploration of treatment strategies that outperform clinician behavior by leveraging the full patient state history, without drifting into unsupported state-action trajectories. Second, we introduce a safety cost constraint that encodes medical knowledge about physiological safety boundaries, providing domain-specific safeguards even in areas where training data might contain potentially unsafe interventions. Notably, we provide theoretical guarantees on safety and near-optimality: policies that satisfy these constraints remain in safe and reliable regions and achieve performance close to the best possible policy supported by the data.

Submitted 22 May, 2025; originally announced May 2025.

arXiv:2505.01353 [pdf, other]

Differentiable Nonlinear Model Predictive Control

Authors: Jonathan Frey, Katrin Baumgärtner, Gianluca Frison, Dirk Reinhardt, Jasper Hoffmann, Leonard Fichtner, Sebastien Gros, Moritz Diehl

Abstract: The efficient computation of parametric solution sensitivities is a key challenge in the integration of learning-enhanced methods with nonlinear model predictive control (MPC), as their availability is crucial for many learning algorithms. While approaches presented in the machine learning community are limited to convex or unconstrained formulations, this paper discusses the computation of solution sensitivities of general nonlinear programs (NLPs) using the implicit function theorem (IFT) and smoothed optimality conditions treated in interior-point methods (IPM). We detail sensitivity computation within a sequential quadratic programming (SQP) method which employs an IPM for the quadratic subproblems. The publication is accompanied by an efficient open-source implementation within the framework, providing both forward and adjoint sensitivities for general optimal control problems, achieving speedups exceeding 3x over the state-of-the-art solver mpc.pytorch.

Submitted 2 May, 2025; originally announced May 2025.

Comments: 19 pages, 4 figures, 2 tables

arXiv:2502.02133 [pdf, other]

Synthesis of Model Predictive Control and Reinforcement Learning: Survey and Classification

Authors: Rudolf Reiter, Jasper Hoffmann, Dirk Reinhardt, Florian Messerer, Katrin Baumgärtner, Shamburaj Sawant, Joschka Boedecker, Moritz Diehl, Sebastien Gros

Abstract: The fields of MPC and RL consider two successful control techniques for Markov decision processes. Both approaches are derived from similar fundamental principles, and both are widely used in practical applications, including robotics, process control, energy systems, and autonomous driving. Despite their similarities, MPC and RL follow distinct paradigms that emerged from diverse communities and different requirements. Various technical discrepancies, particularly the role of an environment model as part of the algorithm, lead to methodologies with nearly complementary advantages. Due to their orthogonal benefits, research interest in combination methods has recently increased significantly, leading to a large and growing set of complex ideas leveraging MPC and RL. This work illuminates the differences, similarities, and fundamentals that allow for different combination algorithms and categorizes existing work accordingly. Particularly, we focus on the versatile actor-critic RL approach as a basis for our categorization and examine how the online optimization approach of MPC can be used to improve the overall closed-loop performance of a policy.

Submitted 4 February, 2025; originally announced February 2025.

arXiv:2501.06086 [pdf, other]

All AI Models are Wrong, but Some are Optimal

Authors: Akhil S Anand, Shambhuraj Sawant, Dirk Reinhardt, Sebastien Gros

Abstract: AI models that predict the future behavior of a system (a.k.a. predictive AI models) are central to intelligent decision-making. However, decision-making using predictive AI models often results in suboptimal performance. This is primarily because AI models are typically constructed to best fit the data, and hence to predict the most likely future rather than to enable high-performance decision-making. The hope that such prediction enables high-performance decisions is neither guaranteed in theory nor established in practice. In fact, there is increasing empirical evidence that predictive models must be tailored to decision-making objectives for performance. In this paper, we establish formal (necessary and sufficient) conditions that a predictive model (AI-based or not) must satisfy for a decision-making policy established using that model to be optimal. We then discuss their implications for building predictive AI models for sequential decision-making.

Submitted 10 January, 2025; originally announced January 2025.

arXiv:2411.18305 [pdf, other]

doi 10.1016/j.eswa.2025.127180

Application of Soft Actor-Critic Algorithms in Optimizing Wastewater Treatment with Time Delays Integration

Authors: Esmaeel Mohammadi, Daniel Ortiz-Arroyo, Aviaja Anna Hansen, Mikkel Stokholm-Bjerregaard, Sebastien Gros, Akhil S Anand, Petar Durdevic

Abstract: Wastewater treatment plants face unique challenges for process control due to their complex dynamics, slow time constants, and stochastic delays in observations and actions. These characteristics make conventional control methods, such as Proportional-Integral-Derivative controllers, suboptimal for achieving efficient phosphorus removal, a critical component of wastewater treatment to ensure environmental sustainability. This study addresses these challenges using a novel deep reinforcement learning approach based on the Soft Actor-Critic algorithm, integrated with a custom simulator designed to model the delayed feedback inherent in wastewater treatment plants. The simulator incorporates Long Short-Term Memory networks for accurate multi-step state predictions, enabling realistic training scenarios. To account for the stochastic nature of delays, agents were trained under three delay scenarios: no delay, constant delay, and random delay. The results demonstrate that incorporating random delays into the reinforcement learning framework significantly improves phosphorus removal efficiency while reducing operational costs. Specifically, the delay-aware agent achieved a 36% reduction in phosphorus emissions, 55% higher reward, 77% lower target deviation from the regulatory limit, and 9% lower total costs than traditional control methods in the simulated environment. These findings underscore the potential of reinforcement learning to overcome the limitations of conventional control strategies in wastewater treatment, providing an adaptive and cost-effective solution for phosphorus removal.

Submitted 27 November, 2024; originally announced November 2024.

Journal ref: Expert Systems with Applications Volume 277, 5 June 2025, 127180

arXiv:2410.06474 [pdf, ps, other]

Flipping-based Policy for Chance-Constrained Markov Decision Processes

Authors: Xun Shen, Shuo Jiang, Akifumi Wachi, Kaumune Hashimoto, Sebastien Gros

Abstract: Safe reinforcement learning (RL) is a promising approach for many real-world decision-making problems where ensuring safety is a critical necessity. In safe RL research, while expected cumulative safety constraints (ECSCs) are typically the first choices, chance constraints are often more pragmatic for incorporating safety under uncertainties. This paper proposes a flipping-based policy for Chance-Constrained Markov Decision Processes (CCMDPs). The flipping-based policy selects the next action by tossing a potentially distorted coin between two action candidates. The probability of the flip and the two action candidates vary depending on the state. We establish a Bellman equation for CCMDPs and further prove the existence of a flipping-based policy within the optimal solution sets. Since solving the problem with joint chance constraints is challenging in practice, we then prove that joint chance constraints can be approximated into Expected Cumulative Safety Constraints (ECSCs) and that there exists a flipping-based policy in the optimal solution sets for constrained MDPs with ECSCs. As a specific instance of practical implementations, we present a framework for adapting constrained policy optimization to train a flipping-based policy. This framework can be applied to other safe RL algorithms. We demonstrate that the flipping-based policy can improve the performance of the existing safe RL algorithms under the same limits of safety constraints on Safety Gym benchmarks.

Submitted 8 October, 2024; originally announced October 2024.

Comments: Accepted to NeurIPS 2024

arXiv:2401.00661 [pdf, ps, other]

Personalized Dynamic Pricing Policy for Electric Vehicles: Reinforcement learning approach

Authors: Sangjun Bae, Balazs Kulcsar, Sebastien Gros

Abstract: With the increasing number of fast-electric vehicle charging stations (fast-EVCSs) and the popularization of information technology, electricity price competition between fast-EVCSs is highly expected, in which the utilization of public and/or privacy-preserved information will play a crucial role. Self-interested electric vehicle (EV) users, on the other hand, try to select a fast-EVCS for charging in a way that maximizes their utilities based on electricity price, estimated waiting time, and their state of charge. While existing studies have largely focused on finding equilibrium prices, this study proposes a personalized dynamic pricing policy (PeDP) for a fast-EVCS to maximize revenue using a reinforcement learning (RL) approach. We first propose a simulation environment with multiple competing fast-EVCSs to model the selfish behavior of EV users using a game-based charging station selection model with a monetary utility function. In the environment, we propose a Q-learning-based PeDP to maximize the fast-EVCS's revenue. Through numerical simulations based on the environment: (1) we identify the importance of waiting time in the EV charging market by comparing the classic Bertrand competition model with the proposed PeDP for fast-EVCSs (from the system perspective); (2) we evaluate the performance of the proposed PeDP and analyze the effects of the information on the policy (from the service provider perspective); and (3) we show that privacy-preserved information sharing can be misused by artificial intelligence-based PeDP in a certain situation in the EV charging market (from the customer perspective).

Submitted 31 December, 2023; originally announced January 2024.

arXiv:2302.12667 [pdf, other]

Deep active learning for nonlinear system identification

Authors: Erlend Torje Berg Lundby, Adil Rasheed, Ivar Johan Halvorsen, Dirk Reinhardt, Sebastien Gros, Jan Tommy Gravdahl

Abstract: The exploding research interest for neural networks in modeling nonlinear dynamical systems is largely explained by the networks' capacity to model complex input-output relations directly from data. However, they typically need vast training data before they can be put to any good use. The data generation process for dynamical systems can be an expensive endeavor both in terms of time and resources. Active learning addresses this shortcoming by acquiring the most informative data, thereby reducing the need to collect enormous datasets. What makes the current work unique is integrating the deep active learning framework into nonlinear system identification. We formulate a general static deep active learning acquisition problem for nonlinear system identification. This is enabled by exploring system dynamics locally in different regions of the input space to obtain a simulated dataset covering the broader input space. This simulated dataset can be used in a static deep active learning acquisition scheme referred to as global explorations. The global exploration acquires a batch of initial states corresponding to the most informative state-action trajectories according to a batch acquisition function. The local exploration solves an optimal control problem, finding the control trajectory that maximizes some measure of information. After a batch of informative initial states is acquired, a new round of local explorations from the initial states in the batch is conducted to obtain a set of corresponding control trajectories that are to be applied on the system dynamics to get data from the system. Information measures used in the acquisition scheme are derived from the predictive variance of an ensemble of neural networks. The novel method outperforms standard data acquisition methods used for system identification of nonlinear dynamical systems in the case study performed on simulated data.

Submitted 24 February, 2023; originally announced February 2023.

arXiv:2301.01667 [pdf, other]

Learning-based MPC from Big Data Using Reinforcement Learning

Authors: Shambhuraj Sawant, Akhil S Anand, Dirk Reinhardt, Sebastien Gros

Abstract: This paper presents an approach for learning Model Predictive Control (MPC) schemes directly from data using Reinforcement Learning (RL) methods. The state-of-the-art learning methods use RL to improve the performance of parameterized MPC schemes. However, these learning algorithms are often gradient-based methods that require frequent evaluations of computationally expensive MPC schemes, thereby restricting their use on big datasets. We propose to tackle this issue by using tools from RL to learn a parameterized MPC scheme directly from data in an offline fashion. Our approach derives an MPC scheme without having to solve it over the collected dataset, thereby eliminating the computational complexity of existing techniques for big data. We evaluate the proposed method on three simulated experiments of varying complexity.

Submitted 4 January, 2023; originally announced January 2023.

arXiv:2212.03645 [pdf, ps, other]

doi 10.23919/MIPRO55190.2022.9803570

Systematic review of automatic translation of high-level security policy into firewall rules

Authors: Ivan Kovačević, Bruno Štengl, Stjepan Groš

Abstract: Firewalls are security devices that perform network traffic filtering. They are ubiquitous in the industry and are a common method used to enforce organizational security policy. Security policy is specified on a high level of abstraction, with statements such as "web browsing is allowed only on workstations inside the office network", and needs to be translated into low-level firewall rules to be enforceable. There has been a lot of work regarding optimization, analysis and platform independence of firewall rules, but an area that has seen much less success is automatic translation of high-level security policies into firewall rules. In addition to improving rules' readability, such translation would make it easier to detect errors. This paper surveys over twenty papers that aim to generate firewall rules according to a security policy specified on a higher level of abstraction. It also presents an overview of similar features in modern firewall systems. Most approaches define specialized domain languages that get compiled into firewall rule sets, with some of them relying on formal specification, ontology, or graphical models. The approaches have improved over time, but there are still many drawbacks that need to be solved before wider application.

Submitted 7 December, 2022; originally announced December 2022.

Comments: 6 pages, 1 figure; Published in the 2022 45th Jubilee International Convention on Information, Communication and Electronic Technology (MIPRO)

arXiv:2205.08856 [pdf, other]

Bridging the gap between QP-based and MPC-based RL

Authors: Shambhuraj Sawant, Sebastien Gros

Abstract: Reinforcement learning methods typically use Deep Neural Networks to approximate the value functions and policies underlying a Markov Decision Process. Unfortunately, DNN-based RL suffers from a lack of explainability of the resulting policy. In this paper, we instead approximate the policy and value functions using an optimization problem, taking the form of Quadratic Programs (QPs). We propose simple tools to promote structures in the QP, pushing it to resemble a linear MPC scheme. A generic unstructured QP offers high flexibility for learning, while a QP having the structure of an MPC scheme promotes the explainability of the resulting policy and additionally provides ways for its analysis. The tools we propose allow for continuously adjusting the trade-off between the former and the latter during learning. We illustrate the workings of our proposed method with the resulting structure using a point-mass task.

Submitted 18 May, 2022; originally announced May 2022.

arXiv:2204.12420 [pdf, other]

doi 10.1109/TTE.2022.3226683

Interpretable Battery Cycle Life Range Prediction Using Early Degradation Data at Cell Level

Authors: Huang Zhang, Yang Su, Faisal Altaf, Torsten Wik, Sebastien Gros

Abstract: Battery cycle life prediction using early degradation data has many potential applications throughout the battery product life cycle. For that reason, various data-driven methods have been proposed for point prediction of battery cycle life with minimum knowledge of the battery degradation mechanisms. However, managing the rapidly increasing amounts of batteries at end-of-life with lower economic and technical risk requires prediction of cycle life with quantified uncertainty, which is still lacking. The interpretability (i.e., the reason for high prediction accuracy) of these advanced data-driven methods is also worthy of investigation. Here, a Quantile Regression Forest (QRF) model, having the advantage of not assuming any specific distribution of cycle life, is introduced to make cycle life range prediction with uncertainty quantified as the width of the prediction interval, in addition to point predictions with high accuracy. The hyperparameters of the QRF model are optimized with a proposed alpha-logistic-weighted criterion so that the coverage probabilities associated with the prediction intervals are calibrated. The interpretability of the final QRF model is explored with two global model-agnostic methods, namely permutation importance and partial dependence plot.

Submitted 23 April, 2023; v1 submitted 26 April, 2022; originally announced April 2022.

arXiv:2203.13854 [pdf, other]

Quasi-Newton Iteration in Deterministic Policy Gradient

Authors: Arash Bahari Kordabad, Hossein Nejatbakhsh Esfahani, Wenqi Cai, Sebastien Gros

Abstract: This paper presents a model-free approximation for the Hessian of the performance of deterministic policies to use in the context of Reinforcement Learning based on Quasi-Newton steps in the policy parameters. We show that the approximate Hessian converges to the exact Hessian at the optimal policy, and allows for a superlinear convergence in the learning, provided that the policy parametrization is rich. The natural policy gradient method can be interpreted as a particular case of the proposed method. We analytically verify the formulation in a simple linear case and compare the convergence of the proposed method with the natural policy gradient in a nonlinear example.

Submitted 25 March, 2022; originally announced March 2022.

Comments: This paper has been accepted to 2022 American Control Conference (ACC). 6 pages

arXiv:2111.04146 [pdf, other]

Optimization of the Model Predictive Control Meta-Parameters Through Reinforcement Learning

Authors: Eivind Bøhn, Sebastien Gros, Signe Moe, Tor Arne Johansen

Abstract: Model predictive control (MPC) is increasingly being considered for control of fast systems and embedded applications. However, the MPC has some significant challenges for such systems. Its high computational complexity results in high power consumption from the control algorithm, which could account for a significant share of the energy resources in battery-powered embedded systems. The MPC parameters must be tuned, which is largely a trial-and-error process that affects the control performance, the robustness and the computational complexity of the controller to a high degree. In this paper, we propose a novel framework in which any parameter of the control algorithm can be jointly tuned using reinforcement learning (RL), with the goal of simultaneously optimizing the control performance and the power usage of the control algorithm. We propose the novel idea of optimizing the meta-parameters of MPC with RL, i.e. parameters affecting the structure of the MPC problem as opposed to the solution to a given problem. Our control algorithm is based on an event-triggered MPC where we learn when the MPC should be re-computed, and a dual mode MPC and linear state feedback control law applied in between MPC computations. We formulate a novel mixture-distribution policy and show that with joint optimization we achieve improvements that do not present themselves when optimizing the same parameters in isolation. We demonstrate our framework on the inverted pendulum control task, reducing the total computation time of the control system by 36% while also improving the control performance by 18.4% over the best-performing MPC baseline.

Submitted 7 November, 2021; originally announced November 2021.

Comments: This work has been submitted to the IEEE for possible publication

arXiv:2107.11102 [pdf, other]

doi 10.1109/ACCESS.2022.3147312

Automatically generating models of IT systems

Authors: Ivan Kovačević, Stjepan Groš, Ante Đerek

Abstract: Information technology system (ITS), informally, consists of hardware and software infrastructure (e.g., workstations, servers, laptops, installed software packages, databases, LANs, firewalls, etc.), along with physical and logical connections and inter-dependencies between various items. Nowadays, every company owns and operates an ITS, but detailed information about the system is rarely publicly available. However, there are many situations where the availability of such data would be beneficial. For example, cyber ranges need descriptions of complex realistic IT systems in order to provide an effective training and education platform. Furthermore, various algorithms in cybersecurity, in particular attack tree generation, need to be validated on realistic models of IT systems. In this paper, we describe a system we call the Generator that, based on the high-level requirements such as the number of employees and the business area the target company belongs to, generates a model of an ITS that satisfies the given requirements. We put special emphasis on the following two criteria: the generated ITS models a large amount of details, and ideally resembles a real system. Our survey of related literature found no sufficiently similar prior works, so we believe that this is the first attempt of building something like this. We created a proof-of-concept implementation of the Generator, validated it by generating ITS models for a simplified fictional financial institution, and analyzed the Generator's performance with respect to the problem size. The research was done in an iterative manner, with coauthors continuously providing feedback on intermediate results. (...) We intend to extend this prototype to allow probabilistic generation of IT systems when only a subset of parameters is explicitly defined, and further develop and validate our approach with the help of domain experts.

Submitted 31 January, 2022; v1 submitted 23 July, 2021; originally announced July 2021.

Comments: 20 pages, 16 figures

Journal ref: IEEE Access (2022)

arXiv:2106.06000 [pdf, ps, other]

Use of a non-peer reviewed sources in cyber-security scientific research

Authors: Dalibor Gernhardt, Stjepan Groš

Abstract: Most publicly available data on cyber incidents comes from private companies and non-academic sources. Common sources of information include various security bulletins, white papers, reports, court cases, and blog posts describing specific events, often from a single point of view, followed by occasional academic sources, usually conference proceedings. The main characteristics of the available data sources are: lack of peer review and unavailability of confidential data. In this paper, we use an indirect approach to identify trusted sources used in scientific work. We analyze how top-rated peer reviewed literature relies on the use of non-peer reviewed sources on cybersecurity incidents. To identify current non-peer reviewed sources on cybersecurity we analyze references in top rated peer reviewed computer security conferences. We also analyze how non-peer reviewed sources are used, to motivate or support research. We examined 808 articles from top conferences in the field of computer security. The results of this work are a list of the most commonly used non-peer reviewed data sources and information about the context in which this data is used. Since these sources are accepted in top conferences, other researchers can consider them in their future research. To the best of our knowledge, analysis on how non-peer reviewed sources are used in cyber-security scientific research has not been done before.

Submitted 10 June, 2021; originally announced June 2021.

Comments: 9 pages, 6 tables

arXiv:2106.05702 [pdf, ps, other]

Myths and Misconceptions about Attackers and Attacks

Authors: Stjepan Groš

Abstract: This paper is based on a three year project during which we studied attackers' behavior, reading military planning literature, and thinking on how would we do the same things they do, and what problems would we, as attackers, face. This research is still ongoing, but while participating in applications for other projects and talking to cyber security experts we constantly face the same issues, namely attackers' behavior is not well understood, and consequently, there are a number of misconceptions floating around that are simply not true, or are only partially true. This is actually expected as someone who casually follows news about incidents easily gets impression that attackers and attacks are everywhere and every one is under attack. Our goal in this paper is to debunk these myths, to show what attackers really can and can not, what dilemmas they face, what we don't know about attackers and attacks, etc. The conclusion is that, while attackers do have upper hand, they don't have absolute advantage, i.e. they also operate in an uncertain environment. Knowing this, means that defenses could be well established.

Submitted 10 June, 2021; originally announced June 2021.

Comments: 8 pages, 27 reference. This paper is work in progress and as such may contain inaccuracies, missing or unfinished sentences and paragraphs

arXiv:2106.01154 [pdf, other]

Controlled Update of Software Components using Concurrent Exection of Patched and Unpatched Versions

Authors: Stjepan Groš, Ivan Kovačević, Ivan Dujmić, Matej Petrinović

Abstract: Software patching is a common method of removing vulnerabilities in software components to make IT systems more secure. However, there are many cases where software patching is not possible due to the critical nature of the application, especially when the vendor providing the application guarantees correct operation only in a specific configuration. In this paper, we propose a method to solve this problem. The idea is to run unpatched and patched application instances concurrently, with the unpatched one having complete control and the output of the patched one being used only for comparison, to watch for differences that are consequences of introduced bugs. To test this idea, we developed a system that allows us to run web applications in parallel and tested three web applications. The experiments have shown that the idea is promising for web applications from the technical side. Furthermore, we discuss the potential limitations of this system and the idea in general, how long two instances should run in order to be able to claim with some probability that the patched version has not introduced any new bugs, other potential use cases of the proposed system where two application instances run concurrently, and finally the potential uses of this system with different types of applications, such as SCADA systems.

Submitted 2 June, 2021; originally announced June 2021.

Comments: 9 pages, 4 figures

arXiv:2104.02743 [pdf, other]

Approximate Robust NMPC using Reinforcement Learning

Authors: Hossein Nejatbakhsh Esfahani, Arash Bahari Kordabad, Sebastien Gros

Abstract: We present a Reinforcement Learning-based Robust Nonlinear Model Predictive Control (RL-RNMPC) framework for controlling nonlinear systems in the presence of disturbances and uncertainties. An approximate Robust Nonlinear Model Predictive Control (RNMPC) of low computational complexity is used in which the state trajectory uncertainty is modelled via ellipsoids. Reinforcement Learning is then used in order to handle the ellipsoidal approximation and improve the closed-loop performance of the scheme by adjusting the MPC parameters generating the ellipsoids. The approach is tested on a simulated Wheeled Mobile Robot (WMR) tracking a desired trajectory while avoiding static obstacles.

Submitted 6 April, 2021; originally announced April 2021.

Comments: This paper has been accepted to 2021 European Control Conference (ECC)

arXiv:2104.02411 [pdf, other]

MPC-based Reinforcement Learning for Economic Problems with Application to Battery Storage

Authors: Arash Bahari Kordabad, Wenqi Cai, Sebastien Gros

Abstract: In this paper, we are interested in optimal control problems with purely economic costs, which often yield optimal policies having a (nearly) bang-bang structure. We focus on policy approximations based on Model Predictive Control (MPC) and the use of the deterministic policy gradient method to optimize the MPC closed-loop performance in the presence of unmodelled stochasticity or model error. When the policy has a (nearly) bang-bang structure, we observe that the policy gradient method can struggle to produce meaningful steps in the policy parameters. To tackle this issue, we propose a homotopy strategy based on the interior-point method, providing a relaxation of the policy during the learning. We investigate a specific well-known battery storage problem, and show that the proposed method delivers a homogeneous and faster learning than a classical policy gradient approach.

Submitted 6 April, 2021; originally announced April 2021.

Comments: This paper has been accepted to ECC2021. 6 pages

arXiv:2102.11122 [pdf, other]

Reinforcement Learning of the Prediction Horizon in Model Predictive Control

Authors: Eivind Bøhn, Sebastien Gros, Signe Moe, Tor Arne Johansen

Abstract: Model predictive control (MPC) is a powerful trajectory optimization control technique capable of controlling complex nonlinear systems while respecting system constraints and ensuring safe operation. The MPC's capabilities come at the cost of a high online computational complexity, the requirement of an accurate model of the system dynamics, and the necessity of tuning its parameters to the specific control application. The main tunable parameter affecting the computational complexity is the prediction horizon length, controlling how far into the future the MPC predicts the system response and thus evaluates the optimality of its computed trajectory. A longer horizon generally increases the control performance, but requires an increasingly powerful computing platform, excluding certain control applications. The performance sensitivity to the prediction horizon length varies over the state space, and this motivated the adaptive horizon model predictive control (AHMPC), which adapts the prediction horizon according to some criteria. In this paper we propose to learn the optimal prediction horizon as a function of the state using reinforcement learning (RL). We show how the RL learning problem can be formulated and test our method on two control tasks, showing clear improvements over the fixed horizon MPC scheme, while requiring only minutes of learning.

Submitted 22 February, 2021; originally announced February 2021.

Comments: This work has been submitted to IFAC NMPC 2021 for possible publication

arXiv:2102.01383 [pdf, other]

Stability-Constrained Markov Decision Processes Using MPC

Authors: Mario Zanon, Sébastien Gros, Michele Palladino

Abstract: In this paper, we consider solving discounted Markov Decision Processes (MDPs) under the constraint that the resulting policy is stabilizing. In practice MDPs are solved based on some form of policy approximation. We will leverage recent results proposing to use Model Predictive Control (MPC) as a structured policy in the context of Reinforcement Learning to make it possible to introduce stability requirements directly inside the MPC-based policy. This will restrict the solution of the MDP to stabilizing policies by construction. The stability theory for MPC is most mature for the undiscounted MPC case. Hence, we will first show in this paper that stable discounted MDPs can be reformulated as undiscounted ones. This observation will entail that the MPC-based policy with stability requirements will produce the optimal policy for the discounted MDP if it is stable, and the best stabilizing policy otherwise.

Submitted 2 February, 2021; originally announced February 2021.

arXiv:2012.07369 [pdf, other]

Learning for MPC with Stability & Safety Guarantees

Authors: Sébastien Gros, Mario Zanon

Abstract: The combination of learning methods with Model Predictive Control (MPC) has attracted a significant amount of attention in the recent literature. The hope of this combination is to reduce the reliance of MPC schemes on accurate models, and to tap into the fast developing machine learning and reinforcement learning tools to exploit the growing amount of data available for many systems. In particular, the combination of reinforcement learning and MPC has been proposed as a viable and theoretically justified approach to introduce explainable, safe and stable policies in reinforcement learning. However, a formal theory detailing how the safety and stability of an MPC-based policy can be maintained through the parameter updates delivered by the learning tools is still lacking. This paper addresses this gap. The theory is developed for the generic Robust MPC case, and applied in simulation in the robust tube-based linear MPC case, where the theory is fairly easy to deploy in practice. The paper focuses on Reinforcement Learning as a learning tool, but it applies to any learning method that updates the MPC parameters online.

Submitted 22 July, 2022; v1 submitted 14 December, 2020; originally announced December 2020.

arXiv:2011.13365 [pdf, other]

Optimization of the Model Predictive Control Update Interval Using Reinforcement Learning

Authors: Eivind Bøhn, Sebastien Gros, Signe Moe, Tor Arne Johansen

Abstract: In control applications there is often a compromise that needs to be made with regards to the complexity and performance of the controller and the computational resources that are available. For instance, the typical hardware platform in embedded control applications is a microcontroller with limited memory and processing power, and for battery powered applications the control system can account for a significant portion of the energy consumption. We propose a controller architecture in which the computational cost is explicitly optimized along with the control objective. This is achieved by a three-part architecture where a high-level, computationally expensive controller generates plans, which a computationally simpler controller executes by compensating for prediction errors, while a recomputation policy decides when the plan should be recomputed. In this paper, we employ model predictive control (MPC) as the high-level plan-generating controller, a linear state feedback controller as the simpler compensating controller, and reinforcement learning (RL) to learn the recomputation policy. Simulation results for two examples showcase the architecture's ability to improve upon the MPC approach and find reasonable compromises weighing the performance on the control objective and the computational resources expended.

Submitted 26 November, 2020; originally announced November 2020.

Comments: Submitted to 3rd Annual Learning for Dynamics and Control Conference (L4DC 2021)

arXiv:2004.01430 [pdf, ps, other]

Reinforcement Learning for Mixed-Integer Problems Based on MPC

Authors: Sebastien Gros, Mario Zanon

Abstract: Model Predictive Control has been recently proposed as policy approximation for Reinforcement Learning, offering a path towards safe and explainable Reinforcement Learning. This approach has been investigated for Q-learning and actor-critic methods, both in the context of nominal Economic MPC and Robust (N)MPC, showing very promising results. In that context, actor-critic methods seem to be the most reliable approach. Many applications include a mixture of continuous and integer inputs, for which the classical actor-critic methods need to be adapted. In this paper, we present a policy approximation based on mixed-integer MPC schemes, and propose a computationally inexpensive technique to generate exploration in the mixed-integer input space that ensures a satisfaction of the constraints. We then propose a simple compatible advantage function approximation for the proposed policy, that allows one to build the gradient of the mixed-integer MPC-based policy.

Submitted 3 April, 2020; originally announced April 2020.

Comments: Accepted at IFAC 2020

arXiv:2004.00915 [pdf, ps, other]

Safe Reinforcement Learning via Projection on a Safe Set: How to Achieve Optimality?

Authors: Sebastien Gros, Mario Zanon, Alberto Bemporad

Abstract: For all its successes, Reinforcement Learning (RL) still struggles to deliver formal guarantees on the closed-loop behavior of the learned policy. Among other things, guaranteeing the safety of RL with respect to safety-critical systems is a very active research topic. Some recent contributions propose to rely on projections of the inputs delivered by the learned policy into a safe set, ensuring that the system safety is never jeopardized. Unfortunately, it is unclear whether this operation can be performed without disrupting the learning process. This paper addresses this issue. The problem is analysed in the context of $Q$-learning and policy gradient techniques. We show that the projection approach is generally disruptive in the context of $Q$-learning though a simple alternative solves the issue, while simple corrections can be used in the context of policy gradient methods in order to ensure that the policy gradients are unbiased. The proposed results extend to safe projections based on robust MPC techniques.

Submitted 2 April, 2020; originally announced April 2020.

Comments: Accepted at IFAC 2020

arXiv:2001.06616 [pdf, ps, other]

Research Directions in Cyber Threat Intelligence

Authors: Stjepan Groš

Abstract: Cyber threat intelligence is a relatively new field that has grown from two distinct fields, cyber security and intelligence. As such, it draws knowledge from and mixes the two fields. Yet, looking into current scientific research on cyber threat intelligence research, it is relatively scarce, which opens up a lot of opportunities. In this paper we define what cyber threat intelligence is, briefly review some aspects for cyber threat intelligence. Then, we analyze existing research fields that are much older than cyber threat intelligence but related to it. This opens up an opportunity to draw knowledge and methods from those older fields, and in that way advance cyber threat intelligence much faster than it would by following its own path. With such an approach we effectively give research directions for CTI.

Submitted 18 January, 2020; originally announced January 2020.

Comments: 6 pages

arXiv:1910.01721 [pdf, ps, other]

A Critical View on CIS Controls

Authors: Stjepan Groš

Abstract: CIS Controls is a set of 20 controls and 171 sub-controls that were created with an idea of having a list of something to implement so that organizations can increase their security. While good in theory, it is a big question of how viable this approach is in practice, and does it really help. There is only a minor number of critical views of CIS Controls and since CIS Controls are marketed by two very influential organizations they are very popular. Yet, there are alternatives published by ISO, NIST and even PCI consortium. In this paper we critically assess CIS Controls, assumptions on which they are based as well as validity of approach and claims made in its favor. The conclusion is that scientific community should be more active regarding this topic, but also that more material is necessary. This is something that CIS and SANS should support if they want to make CIS Controls viable alternative to other approaches.

Submitted 2 May, 2020; v1 submitted 3 October, 2019; originally announced October 2019.

Comments: 7 pages

Showing 1–32 of 32 results for author: Groš, S