-
Beyond Surface Similarity: Evaluating LLM-Based Test Refactorings with Structural and Semantic Awareness
Authors:
Wendkûuni C. Ouédraogo,
Yinghua Li,
Xueqi Dang,
Xin Zhou,
Anil Koyuncu,
Jacques Klein,
David Lo,
Tegawendé F. Bissyandé
Abstract:
Large Language Models (LLMs) are increasingly employed to automatically refactor unit tests, aiming to enhance readability, naming, and structural clarity while preserving functional behavior. However, evaluating such refactorings remains challenging: traditional metrics like CodeBLEU are overly sensitive to renaming and structural edits, whereas embedding-based similarities capture semantics but ignore readability and modularity. We introduce CTSES, a composite metric that integrates CodeBLEU, METEOR, and ROUGE-L to balance behavior preservation, lexical quality, and structural alignment. CTSES is evaluated on over 5,000 test suites automatically refactored by GPT-4o and Mistral-Large-2407, using Chain-of-Thought prompting, across two established Java benchmarks: Defects4J and SF110. Our results show that CTSES yields more faithful and interpretable assessments, better aligned with developer expectations and human intuition than existing metrics.
Submitted 7 June, 2025;
originally announced June 2025.
-
Mind the Gap: A Readability-Aware Metric for Test Code Complexity
Authors:
Wendkûuni C. Ouédraogo,
Yinghua Li,
Xueqi Dang,
Xin Zhou,
Anil Koyuncu,
Jacques Klein,
David Lo,
Tegawendé F. Bissyandé
Abstract:
Automatically generated unit tests, whether from search-based tools like EvoSuite or from LLMs, vary significantly in structure and readability. Yet most evaluations rely on metrics like Cyclomatic Complexity and Cognitive Complexity, designed for functional code rather than test code. Recent studies have shown that SonarSource's Cognitive Complexity metric assigns near-zero scores to LLM-generated tests, yet its behavior on EvoSuite-generated tests and its applicability to test-specific code structures remain unexplored. We introduce CCTR, a Test-Aware Cognitive Complexity metric tailored for unit tests. CCTR integrates structural and semantic features like assertion density, annotation roles, and test composition patterns, dimensions ignored by traditional complexity models but critical for understanding test code. We evaluate 15,750 test suites generated by EvoSuite, GPT-4o, and Mistral Large-1024 across 350 classes from Defects4J and SF110. Results show CCTR effectively discriminates between structured and fragmented test suites, producing interpretable scores that better reflect developer-perceived effort. By bridging structural analysis and test readability, CCTR provides a foundation for more reliable evaluation and improvement of generated tests. We publicly release all data, prompts, and evaluation scripts to support replication.
Submitted 7 June, 2025;
originally announced June 2025.
-
On the boundedness of periodic Fourier integral operators in Lebesgue spaces with variable exponent
Authors:
Boukary Tai,
Mohamed Congo,
Marie Françoise Ouedraogo,
Arouna Ouedraogo
Abstract:
The aim of this paper is to investigate the boundedness of periodic Fourier integral operators in Lebesgue spaces with variable exponent $L^{p(\cdot)}$ on the $n$-dimensional torus. We deal with operators of type $(ρ, δ)$ whose symbols belong to the Hörmander class $S^{m}_{ρ, δ}(\mathbb{T}^{n}\times\mathbb{Z}^{n})$ for $0\leq δ<ρ\leq 1$.
Submitted 21 October, 2024;
originally announced October 2024.
-
Test smells in LLM-Generated Unit Tests
Authors:
Wendkûuni C. Ouédraogo,
Yinghua Li,
Kader Kaboré,
Xunzhu Tang,
Anil Koyuncu,
Jacques Klein,
David Lo,
Tegawendé F. Bissyandé
Abstract:
The use of Large Language Models (LLMs) in automated test generation is gaining popularity, with much of the research focusing on metrics like compilability rate, code coverage, and bug detection. However, an equally important quality metric is the presence of test smells: design flaws or anti-patterns in test code that hinder maintainability and readability. In this study, we explore the diffusion of test smells in LLM-generated unit test suites and compare them to those found in human-written ones. We analyze a benchmark of 20,500 LLM-generated test suites produced by four models (GPT-3.5, GPT-4, Mistral 7B, and Mixtral 8x7B) across five prompt engineering techniques, alongside a dataset of 780,144 human-written test suites from 34,637 projects. Leveraging TsDetect, a state-of-the-art tool capable of detecting 21 different types of test smells, we identify and analyze the prevalence and co-occurrence of various test smells in both human-written and LLM-generated test suites. Our findings reveal new insights into the strengths and limitations of LLMs in test generation. First, regarding prevalence, we observe that LLMs frequently generate tests with common test smells, such as Magic Number Test and Assertion Roulette. Second, in terms of co-occurrence, certain smells, like Long Test and Useless Test, tend to co-occur in LLM-generated suites, influenced by specific prompt techniques. Third, we find that project complexity and LLM-specific factors, including model size and context length, significantly affect the prevalence of test smells. Finally, the patterns of test smells in LLM-generated tests often mirror those in human-written tests, suggesting potential data leakage from training datasets. These insights underscore the need to refine LLM-based test generation for cleaner code and suggest improvements in both LLM capabilities and software testing practices.
Submitted 14 October, 2024;
originally announced October 2024.
-
Spherical Fourier multipliers related to Gelfand pairs
Authors:
Yaogan Mensah,
Marie Françoise Ouedraogo
Abstract:
In this paper, we introduce a family of Fourier multipliers using the spherical Fourier transform on Gelfand pairs. We refer to them as spherical Fourier multipliers. We study certain sufficient conditions under which they are bounded. Then, under the hypothesis of compactness of the underlying group and under certain summability conditions, we show that the spherical Fourier multipliers belong to some Schatten-von Neumann classes.
Submitted 28 September, 2024; v1 submitted 12 July, 2024;
originally announced July 2024.
-
Large-scale, Independent and Comprehensive study of the power of LLMs for test case generation
Authors:
Wendkûuni C. Ouédraogo,
Kader Kaboré,
Yinghua Li,
Haoye Tian,
Anil Koyuncu,
Jacques Klein,
David Lo,
Tegawendé F. Bissyandé
Abstract:
Unit testing is essential for software reliability, yet manual test creation is time-consuming and often neglected. Although search-based software testing improves efficiency, it produces tests with poor readability and maintainability. Although LLMs show promise for test generation, existing research lacks comprehensive evaluation across execution-driven assessment, reasoning-based prompting, and real-world testing scenarios. This study presents the first large-scale empirical evaluation of LLM-generated unit tests at the class level, systematically analyzing four state-of-the-art models - GPT-3.5, GPT-4, Mistral 7B, and Mixtral 8x7B - against EvoSuite across 216,300 test cases from Defects4J, SF110, and CMD (a dataset mitigating LLM training data leakage). We evaluate five prompting techniques - Zero-Shot Learning (ZSL), Few-Shot Learning (FSL), Chain-of-Thought (CoT), Tree-of-Thought (ToT), and Guided Tree-of-Thought (GToT) - assessing syntactic correctness, compilability, hallucination-driven failures, readability, code coverage metrics, and fault detection capabilities. Our findings challenge prior claims that in-context learning is ineffective for test generation in code-specialized LLMs. Reasoning-based prompting - particularly GToT - significantly enhances test reliability, compilability, and structural adherence in general-purpose LLMs. However, hallucination-driven failures remain a persistent challenge, manifesting as non-existent symbol references, incorrect API calls, and fabricated dependencies, resulting in high compilation failure rates (up to 86%). Execution-based classification and mutation testing reveal that many failing tests stem from hallucinated dependencies, limiting effective fault detection.
Submitted 5 July, 2025; v1 submitted 28 June, 2024;
originally announced July 2024.
-
On nilpotency in Leibniz algebras
Authors:
C. J. A. Béré,
M. F. Ouedraogo,
M. Ouattara
Abstract:
The main result of this paper is that if a (right) Leibniz algebra $L$ is \textit{right nilpotent} of degree $n$, then $L$ is \textit{strongly nilpotent} of degree less than or equal to $4n^2-2n+1$.
Submitted 25 May, 2016;
originally announced May 2016.
-
Classification of Traces and Associated Determinants on Odd-Class Operators in Odd Dimensions
Authors:
Carolina Neira Jiménez,
Marie Françoise Ouedraogo
Abstract:
To supplement the already known classification of traces on classical pseudodifferential operators, we present a classification of traces on the algebras of odd-class pseudodifferential operators of non-positive order acting on smooth functions on a closed odd-dimensional manifold. By means of the one-to-one correspondence between continuous traces on Lie algebras and determinants on the associated regular Lie groups, we give a classification of determinants on the group associated to the algebra of odd-class pseudodifferential operators with fixed non-positive order. Finally, we discuss two possible ways to extend the definition of a determinant outside a neighborhood of the identity on the Lie group associated to the algebra of odd-class pseudodifferential operators of order zero.
Submitted 21 April, 2012; v1 submitted 29 November, 2011;
originally announced November 2011.
-
The multiplicative anomaly for determinants revisited; locality
Authors:
Marie-Francoise Ouedraogo,
Sylvie Paycha
Abstract:
Observing that the logarithm of a product of two elliptic operators differs from the sum of the logarithms by a finite sum of operator brackets, we infer that regularised traces of this difference are local as finite sums of noncommutative residues. From an explicit local formula for such regularised traces, we derive an explicit local formula for the multiplicative anomaly of zeta-determinants which sheds light on its locality and yields back previously known results.
Submitted 25 April, 2009; v1 submitted 31 January, 2007;
originally announced January 2007.