-
A case study on the transformative potential of AI in software engineering on LeetCode and ChatGPT
Authors:
Manuel Merkel,
Jens Dörpinghaus
Abstract:
The recent surge in the field of generative artificial intelligence (GenAI) has the potential to bring about transformative changes across a range of sectors, including software engineering and education. As GenAI tools, such as OpenAI's ChatGPT, are increasingly utilised in software engineering, it becomes imperative to understand the impact of these technologies on the software product. This stu…
▽ More
The recent surge in the field of generative artificial intelligence (GenAI) has the potential to bring about transformative changes across a range of sectors, including software engineering and education. As GenAI tools, such as OpenAI's ChatGPT, are increasingly utilised in software engineering, it becomes imperative to understand the impact of these technologies on the software product. This study employs a methodological approach, comprising web scraping and data mining from LeetCode, with the objective of comparing the software quality of Python programs produced by LeetCode users with that generated by GPT-4o. In order to gain insight into these matters, this study addresses the question whether GPT-4o produces software of superior quality to that produced by humans.
The findings indicate that GPT-4o does not present a considerable impediment to code quality, understandability, or runtime when generating code on a limited scale. Indeed, the generated code even exhibits significantly lower values across all three metrics in comparison to the user-written code. However, no significantly superior values were observed for the generated code in terms of memory usage in comparison to the user code, which contravened the expectations. Furthermore, it will be demonstrated that GPT-4o encountered challenges in generalising to problems that were not included in the training data set.
This contribution presents a first large-scale study comparing generated code with human-written code based on LeetCode platform based on multiple measures including code quality, code understandability, time behaviour and resource utilisation. All data is publicly available for further research.
△ Less
Submitted 7 January, 2025;
originally announced January 2025.
-
A novel DFS/BFS approach towards link prediction
Authors:
Jens Dörpinghaus,
Tobias Hübenthal,
Denis Stepanov
Abstract:
Knowledge graphs have been shown to play a significant role in current knowledge mining fields, including life sciences, bioinformatics, computational social sciences, and social network analysis. The problem of link prediction bears many applications and has been extensively studied. However, most methods are restricted to dimension reduction, probabilistic model, or similarity-based approaches a…
▽ More
Knowledge graphs have been shown to play a significant role in current knowledge mining fields, including life sciences, bioinformatics, computational social sciences, and social network analysis. The problem of link prediction bears many applications and has been extensively studied. However, most methods are restricted to dimension reduction, probabilistic model, or similarity-based approaches and are inherently biased. In this paper, we provide a definition of graph prediction for link prediction and outline related work to support our novel approach, which integrates centrality measures with classical machine learning methods. We examine our experimental results in detail and identify areas for potential further research. Our method shows promise, particularly when utilizing randomly selected nodes and degree centrality.
△ Less
Submitted 17 September, 2024;
originally announced September 2024.
-
A Generic Framework for Hidden Markov Models on Biomedical Data
Authors:
Richard Fechner,
Jens Dörpinghaus,
Robert Rockenfeller,
Jennifer Faber
Abstract:
Background: Biomedical data are usually collections of longitudinal data assessed at certain points in time. Clinical observations assess the presences and severity of symptoms, which are the basis for description and modeling of disease progression. Deciphering potential underlying unknowns solely from the distinct observation would substantially improve the understanding of pathological cascades…
▽ More
Background: Biomedical data are usually collections of longitudinal data assessed at certain points in time. Clinical observations assess the presences and severity of symptoms, which are the basis for description and modeling of disease progression. Deciphering potential underlying unknowns solely from the distinct observation would substantially improve the understanding of pathological cascades. Hidden Markov Models (HMMs) have been successfully applied to the processing of possibly noisy continuous signals. The aim was to improve the application HMMs to multivariate time-series of categorically distributed data. Here, we used HHMs to study prediction of the loss of free walking ability as one major clinical deterioration in the most common autosomal dominantly inherited ataxia disorder worldwide. We used HHMs to investigate the prediction of loss of the ability to walk freely, representing a major clinical deterioration in the most common autosomal-dominant inherited ataxia disorder worldwide.
Results: We present a prediction pipeline which processes data paired with a configuration file, enabling to construct, validate and query a fully parameterized HMM-based model. In particular, we provide a theoretical and practical framework for multivariate time-series inference based on HMMs that includes constructing multiple HMMs, each to predict a particular observable variable. Our analysis is done on random data, but also on biomedical data based on Spinocerebellar ataxia type 3 disease.
Conclusions: HHMs are a promising approach to study biomedical data that naturally are represented as multivariate time-series. Our implementation of a HHMs framework is publicly available and can easily be adapted for further applications.
△ Less
Submitted 25 July, 2023;
originally announced July 2023.
-
Rule-based detection of access to education and training in Germany
Authors:
Jens Dörpinghaus,
David Samray,
Robert Helmrich
Abstract:
As a result of transformation processes, the German labor market is highly dependent on vocational training, retraining and continuing education. To match training seekers and offers, we present a novel approach towards the automated detection of access to education and training in German training offers and advertisements. We will in particular focus on (a) general school and education degrees an…
▽ More
As a result of transformation processes, the German labor market is highly dependent on vocational training, retraining and continuing education. To match training seekers and offers, we present a novel approach towards the automated detection of access to education and training in German training offers and advertisements. We will in particular focus on (a) general school and education degrees and schoolleaving certificates, (b) professional experience, (c) a previous apprenticeship and (d) a list of skills provided by the German Federal Employment Agency. This novel approach combines several methods: First, we provide a mapping of synonyms in education combining different qualifications and adding deprecated terms. Second, we provide a rule-based matching to identify the need for professional experience or apprenticeship. However, not all access requirements can be matched due to incompatible data schemata or non-standardizes requirements, e.g initial tests or interviews. While we can identify several shortcomings, the presented approach offers promising results for two data sets: training and re-training advertisements.
△ Less
Submitted 13 April, 2023;
originally announced April 2023.
-
Centrality Measures in multi-layer Knowledge Graphs
Authors:
Jens Dörpinghaus,
Vera Weil,
Carsten Düing,
Martin W. Sommer
Abstract:
Knowledge graphs play a central role for linking different data which leads to multiple layers. Thus, they are widely used in big data integration, especially for connecting data from different domains. Few studies have investigated the questions how multiple layers within graphs impact methods and algorithms developed for single-purpose networks, for example social networks. This manuscript inves…
▽ More
Knowledge graphs play a central role for linking different data which leads to multiple layers. Thus, they are widely used in big data integration, especially for connecting data from different domains. Few studies have investigated the questions how multiple layers within graphs impact methods and algorithms developed for single-purpose networks, for example social networks. This manuscript investigates the impact of multiple layers on centrality measures compared to single-purpose graph. In particular, (a) we develop an experimental environment to (b) evaluate two different centrality measures - degree and betweenness centrality - on random graphs inspired by social network analysis: small-world and scale-free networks. The presented approach (c) shows that the graph structures and topology has a great impact on its robustness for additional data stored. Although the experimental analysis of random graphs allows us to make some basic observations we will (d) make suggestions for additional research on particular graph structures that have a great impact on the stability of networks.
△ Less
Submitted 17 March, 2022;
originally announced March 2022.
-
Optimization of Retrieval Algorithms on Large Scale Knowledge Graphs
Authors:
Jens Dörpinghaus,
Andreas Stefan
Abstract:
Knowledge graphs have been shown to play an important role in recent knowledge mining and discovery, for example in the field of life sciences or bioinformatics. Although a lot of research has been done on the field of query optimization, query transformation and of course in storing and retrieving large scale knowledge graphs the field of algorithmic optimization is still a major challenge and a…
▽ More
Knowledge graphs have been shown to play an important role in recent knowledge mining and discovery, for example in the field of life sciences or bioinformatics. Although a lot of research has been done on the field of query optimization, query transformation and of course in storing and retrieving large scale knowledge graphs the field of algorithmic optimization is still a major challenge and a vital factor in using graph databases. Few researchers have addressed the problem of optimizing algorithms on large scale labeled property graphs. Here, we present two optimization approaches and compare them with a naive approach of directly querying the graph database. The aim of our work is to determine limiting factors of graph databases like Neo4j and we describe a novel solution to tackle these challenges. For this, we suggest a classification schema to differ between the complexity of a problem on a graph database. We evaluate our optimization approaches on a test system containing a knowledge graph derived biomedical publication data enriched with text mining data. This dense graph has more than 71M nodes and 850M relationships. The results are very encouraging and - depending on the problem - we were able to show a speedup of a factor between 44 and 3839.
△ Less
Submitted 10 February, 2020;
originally announced February 2020.
-
Towards context in large scale biomedical knowledge graphs
Authors:
Jens Dörpinghaus,
Andreas Stefan,
Bruce Schultz,
Marc Jacobs
Abstract:
Contextual information is widely considered for NLP and knowledge discovery in life sciences since it highly influences the exact meaning of natural language. The scientific challenge is not only to extract such context data, but also to store this data for further query and discovery approaches. Here, we propose a multiple step knowledge graph approach using labeled property graphs based on polyg…
▽ More
Contextual information is widely considered for NLP and knowledge discovery in life sciences since it highly influences the exact meaning of natural language. The scientific challenge is not only to extract such context data, but also to store this data for further query and discovery approaches. Here, we propose a multiple step knowledge graph approach using labeled property graphs based on polyglot persistence systems to utilize context data for context mining, graph queries, knowledge discovery and extraction. We introduce the graph-theoretic foundation for a general context concept within semantic networks and show a proof-of-concept based on biomedical literature and text mining. Our test system contains a knowledge graph derived from the entirety of PubMed and SCAIView data and is enriched with text mining data and domain specific language data using BEL. Here, context is a more general concept than annotations. This dense graph has more than 71M nodes and 850M relationships. We discuss the impact of this novel approach with 27 real world use cases represented by graph queries.
△ Less
Submitted 23 January, 2020;
originally announced January 2020.
-
Data Exploration and Validation on dense knowledge graphs for biomedical research
Authors:
Jens Dörpinghaus,
Alexander Apke,
Vanessa Lage-Rupprecht,
Andreas Stefan
Abstract:
Here we present a holistic approach for data exploration on dense knowledge graphs as a novel approach with a proof-of-concept in biomedical research. Knowledge graphs are increasingly becoming a vital factor in knowledge mining and discovery as they connect data using technologies from the semantic web. In this paper we extend a basic knowledge graph extracted from biomedical literature by contex…
▽ More
Here we present a holistic approach for data exploration on dense knowledge graphs as a novel approach with a proof-of-concept in biomedical research. Knowledge graphs are increasingly becoming a vital factor in knowledge mining and discovery as they connect data using technologies from the semantic web. In this paper we extend a basic knowledge graph extracted from biomedical literature by context data like named entities and relations obtained by text mining and other linked data sources like ontologies and databases. We will present an overview about this novel network. The aim of this work was to extend this current knowledge with approaches from graph theory. This method will build the foundation for quality control, validation of hypothesis, detection of missing data and time series analysis of biomedical knowledge in general. In this context we tried to apply multiple-valued decision diagrams to these questions. In addition this knowledge representation of linked data can be used as FAIR approach to answer semantic questions. This paper sheds new lights on dense and very large knowledge graphs and the importance of a graph-theoretic understanding of these networks.
△ Less
Submitted 8 December, 2019;
originally announced December 2019.