-
Implementing NLPs in industrial process modeling: Addressing Categorical Variables
Authors:
Eleni D. Koronaki,
Geremy Loachamin Suntaxi,
Paris Papavasileiou,
Dimitrios G. Giovanis,
Martin Kathrein,
Andreas G. Boudouvis,
Stéphane P. A. Bordas
Abstract:
Important variables of processes are often categorical, i.e. names or labels representing, e.g., categories of inputs, types of reactors, or a sequence of steps. In this work, we use Natural Language Processing models to derive embeddings of such inputs that represent their actual meaning, or reflect the "distances" between categories, i.e. how similar or dissimilar they are. This is a marked difference from the current standard practice of using binary or one-hot encoding to replace categorical variables with sequences of ones and zeros. Combined with dimensionality reduction techniques, either linear such as Principal Component Analysis, or nonlinear such as Uniform Manifold Approximation and Projection, the proposed approach leads to a meaningful, low-dimensional feature space. The significance of obtaining meaningful embeddings is illustrated in the context of an industrial coating process for cutting tools that includes both numerical and categorical inputs. In this industrial process, subject matter expertise suggests that the categorical inputs are critical for determining the final outcome, but this cannot be taken into account with the current state-of-the-art. The proposed approach enables feature importance analysis, which is a marked improvement compared to the current state-of-the-art in the encoding of categorical variables. The proposed approach is not limited to the case study presented here and is suitable for applications with a similar mix of categorical and numerical critical inputs.
Submitted 16 March, 2025; v1 submitted 27 September, 2024;
originally announced September 2024.
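To make the contrast drawn in the abstract above concrete, here is a minimal sketch of why one-hot codes carry no notion of similarity while dense embeddings can. It is not the authors' pipeline: the category names and the two-dimensional vectors are invented for illustration and merely stand in for the output of a language model.

```python
import math
from itertools import combinations

# Hypothetical category labels; not taken from the paper.
categories = ["reactor_A", "reactor_B", "furnace_C"]


def one_hot(index, size):
    """Return a one-hot vector with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise(codes):
    """Distance between every pair of categories under a given encoding."""
    return {
        (a, b): euclidean(codes[a], codes[b])
        for a, b in combinations(categories, 2)
    }


one_hot_codes = {c: one_hot(i, len(categories)) for i, c in enumerate(categories)}

# Made-up dense vectors standing in for language-model embeddings (assumption).
toy_embeddings = {
    "reactor_A": [0.9, 0.1],
    "reactor_B": [0.8, 0.2],
    "furnace_C": [0.1, 0.9],
}

one_hot_dist = pairwise(one_hot_codes)
embed_dist = pairwise(toy_embeddings)

for pair in one_hot_dist:
    print(pair, round(one_hot_dist[pair], 3), round(embed_dist[pair], 3))
```

Under one-hot encoding every pair of categories is equally far apart, so a downstream model cannot tell that two reactor types resemble each other more than a reactor and a furnace. The toy embeddings encode that graded similarity directly.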
-
Discovering deposition process regimes: leveraging unsupervised learning for process insights, surrogate modeling, and sensitivity analysis
Authors:
Geremy Loachamín Suntaxi,
Paris Papavasileiou,
Eleni D. Koronaki,
Dimitrios G. Giovanis,
Georgios Gakis,
Ioannis G. Aviziotis,
Martin Kathrein,
Gabriele Pozzetti,
Christoph Czettl,
Stéphane P. A. Bordas,
Andreas G. Boudouvis
Abstract:
This work introduces a comprehensive approach utilizing data-driven methods to elucidate the deposition process regimes in Chemical Vapor Deposition (CVD) reactors and the interplay of physical mechanisms that dominate in each of them. Through this work, we address three key objectives. Firstly, our methodology relies on process outcomes, derived by a detailed CFD model, to identify clusters of "outcomes" corresponding to distinct process regimes, wherein the relative influence of input variables undergoes notable shifts. This phenomenon is experimentally validated through Arrhenius plot analysis, affirming the efficacy of our approach. Secondly, we demonstrate the development of an efficient surrogate model, based on Polynomial Chaos Expansion (PCE), that maintains accuracy, facilitating streamlined computational analyses. Finally, as a result of PCE, sensitivity analysis is made possible by means of Sobol' indices, which quantify the impact of process inputs across identified regimes. The insights gained from our analysis contribute to the formulation of hypotheses regarding phenomena occurring beyond the transition regime. Notably, the significance of temperature even in the diffusion-limited regime, as evidenced by the Arrhenius plot, suggests activation of gas phase reactions at elevated temperatures. Importantly, our proposed methods yield insights that align with experimental observations and theoretical principles, aiding decision-making in process design and optimization. By circumventing the need for costly and time-consuming experiments, our approach offers a pragmatic pathway towards enhanced process efficiency. Moreover, this study underscores the potential of data-driven computational methods for innovating reactor design paradigms.
Submitted 24 May, 2024;
originally announced May 2024.
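The abstract above notes that Sobol' indices follow from the PCE surrogate. The sketch below shows the standard way this works, assuming an orthonormal polynomial basis: the output variance is the sum of squared non-constant coefficients, and each index is the share of that sum attributable to a given input. The coefficient values and the two-input setup are hypothetical and are not results from the paper.

```python
# Hypothetical PCE coefficients keyed by multi-index
# (polynomial degree in input 1, polynomial degree in input 2).
# An orthonormal basis is assumed, so squared coefficients add up to variance.
coeffs = {
    (0, 0): 2.0,  # constant term: the mean, excluded from the variance
    (1, 0): 0.6,
    (0, 1): 0.3,
    (1, 1): 0.1,  # interaction term between the two inputs
    (2, 0): 0.2,
}


def sobol_from_pce(coeffs, n_inputs):
    """Return (first_order, total) Sobol' indices from PCE coefficients."""
    variance = sum(c ** 2 for alpha, c in coeffs.items() if any(alpha))
    first_order, total = [], []
    for i in range(n_inputs):
        # Terms that depend on input i and on no other input.
        only_i = sum(
            c ** 2
            for alpha, c in coeffs.items()
            if alpha[i] > 0 and all(a == 0 for j, a in enumerate(alpha) if j != i)
        )
        # Terms that depend on input i at all, including interactions.
        any_i = sum(c ** 2 for alpha, c in coeffs.items() if alpha[i] > 0)
        first_order.append(only_i / variance)
        total.append(any_i / variance)
    return first_order, total


first_order, total = sobol_from_pce(coeffs, n_inputs=2)
print("first-order:", first_order)
print("total:", total)
```

Because the indices are read directly off the coefficients, no additional model evaluations are needed once the surrogate is fitted, which is what makes the PCE route attractive for sensitivity analysis.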
-
Statistically Enhanced Learning: a feature engineering framework to boost (any) learning algorithms
Authors:
Florian Felice,
Christophe Ley,
Andreas Groll,
Stéphane Bordas
Abstract:
Feature engineering is of critical importance in the field of Data Science. While any data scientist knows the importance of rigorously preparing data to obtain well-performing models, only scarce literature formalizes its benefits. In this work, we will present the method of Statistically Enhanced Learning (SEL), a formalization framework of existing feature engineering and extraction tasks in Machine Learning (ML). The difference compared to classical ML consists in the fact that certain predictors are not directly observed but obtained as statistical estimators. Our goal is to study SEL, aiming to establish a formalized framework and illustrate its improved performance by means of simulations as well as applications to real-life use cases.
Submitted 29 June, 2023;
originally announced June 2023.
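The idea in the abstract above, that some predictors are not observed directly but obtained as statistical estimators, can be illustrated with a very small sketch. This is only one simple reading of that sentence, not the SEL framework itself: the unit names, the raw observations, and the choice of sample mean and sample standard deviation as estimators are all assumptions made for illustration.

```python
from statistics import mean, stdev

# Hypothetical raw observations per unit; invented for illustration.
history = {
    "unit_1": [10.0, 12.0, 11.0],
    "unit_2": [20.0, 19.0, 21.0, 20.0],
}


def estimated_features(samples):
    """Summarise raw observations by estimators that become model inputs."""
    return {
        "mean_hat": mean(samples),  # sample mean
        "std_hat": stdev(samples),  # sample standard deviation (n - 1 denominator)
    }


# One feature row per unit: the learner sees the estimates, not the raw samples.
rows = [
    {"unit": unit, **estimated_features(samples)}
    for unit, samples in history.items()
]

for row in rows:
    print(row)
```

Each row could then be joined with directly observed covariates and passed to any learning algorithm.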