Search | arXiv e-print repository

arXiv:2504.20605

TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models

Authors: Mihai Nadas, Laura Diosan, Andrei Piscoran, Andreea Tomescu

Abstract: Moral stories are a time-tested vehicle for transmitting values, yet modern NLP lacks a large, structured corpus that couples coherent narratives with explicit ethical lessons. We close this gap with TF1-EN-3M, the first open dataset of three million English-language fables generated exclusively by instruction-tuned models no larger than 8B parameters. Each story follows a six-slot scaffold (character -> trait -> setting -> conflict -> resolution -> moral), produced through a combinatorial prompt engine that guarantees genre fidelity while covering a broad thematic space. A hybrid evaluation pipeline blends (i) a GPT-based critic that scores grammar, creativity, moral clarity, and template adherence with (ii) reference-free diversity and readability metrics. Among ten open-weight candidates, an 8B-parameter Llama-3 variant delivers the best quality-speed trade-off, producing high-scoring fables on a single consumer GPU (<24 GB VRAM) at approximately 13.5 cents per 1,000 fables. We release the dataset, generation code, evaluation scripts, and full metadata under a permissive license, enabling exact reproducibility and cost benchmarking. TF1-EN-3M opens avenues for research in instruction following, narrative intelligence, value alignment, and child-friendly educational AI, demonstrating that large-scale moral storytelling no longer requires proprietary giant models.

Submitted 29 April, 2025; originally announced April 2025.
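
The six-slot scaffold and combinatorial prompt engine described in the abstract above can be illustrated with a minimal sketch. This is not the authors' released generation code: the slot values, the prompt template, and the function names build_prompts and estimated_cost_usd are all hypothetical, chosen only to show how a Cartesian product over slots yields many distinct prompts and how the quoted rate of roughly 13.5 cents per 1,000 fables scales with corpus size.

```python
from itertools import product

# Hypothetical slot vocabularies; the abstract does not list the real ones.
SLOTS = {
    "character": ["a fox", "a tortoise"],
    "trait": ["proud", "patient"],
    "setting": ["a riverbank", "a forest clearing"],
    "conflict": ["a race against a rival", "a shortage of food"],
    "resolution": ["a truce", "a shared meal"],
    "moral": ["slow and steady wins the race", "kindness is repaid"],
}

# Hypothetical prompt template covering the six slots in scaffold order.
TEMPLATE = (
    "Write a short fable about {character} who is {trait}, set in {setting}. "
    "The conflict is {conflict}, it ends with {resolution}, "
    "and the moral is: {moral}."
)


def build_prompts(slots=SLOTS, template=TEMPLATE):
    """Yield one prompt per combination of slot values (Cartesian product)."""
    keys = list(slots)
    for combo in product(*(slots[key] for key in keys)):
        yield template.format(**dict(zip(keys, combo)))


def estimated_cost_usd(n_fables, cents_per_thousand=13.5):
    """Scale the per-1,000-fable rate quoted in the abstract to n_fables."""
    return n_fables / 1000 * cents_per_thousand / 100


prompts = list(build_prompts())
print(len(prompts))                   # 2 values in each of 6 slots: 2**6 prompts
print(prompts[0])
print(estimated_cost_usd(3_000_000))  # rough cost of three million fables in USD
```

With two values per slot the sketch produces 64 prompts; larger vocabularies grow the prompt space multiplicatively, which is what makes a combinatorial engine practical for a corpus of this size. At the quoted rate, three million fables would come to roughly 405 US dollars, an approximation that inherits the abstract's own "approximately".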

arXiv:2504.18439

The Autonomous Software Stack of the FRED-003C: The Development That Led to Full-Scale Autonomous Racing

Authors: Zalán Demeter, Levente Puskás, Balázs Kovács, Ádám Matkovics, Martin Nádas, Balázs Tuba, Zsolt Farkas, Ármin Bogár-Németh, Gergely Bári

Abstract: Scientific development often takes place in the context of research projects carried out by dedicated students during their time at university. In the field of self-driving software research, the Formula Student Driverless competitions are an excellent platform to promote research and attract young engineers. This article presents the software stack developed by BME Formula Racing Team, that formed the foundation of the development that ultimately led us to full-scale autonomous racing. The experience we gained here contributes greatly to our successful participation in the Abu Dhabi Autonomous Racing League. We therefore think it is important to share the system we used, providing a valuable starting point for other ambitious students. We provide a detailed description of the software pipeline we used, including a brief description of the hardware-software architecture. Furthermore, we introduce the methods that we developed for the modules that implement perception; localisation and mapping, planning, and control tasks.

Submitted 25 April, 2025; originally announced April 2025.

Comments: Accepted to be published at 2025 IEEE Intelligent Vehicles Symposium (IV)

arXiv:2503.14023

Synthetic Data Generation Using Large Language Models: Advances in Text and Code

Authors: Mihai Nadas, Laura Diosan, Andreea Tomescu

Abstract: Large language models (LLMs) have unlocked new possibilities for generating synthetic training data in both natural language and code. By producing artificial but task-relevant examples, these models can significantly augment or even replace real-world datasets, especially when labeled data is scarce or sensitive. This paper surveys recent advances in using LLMs to create synthetic text and code, emphasizing prompt-based generation, retrieval-augmented pipelines, and iterative self-refinement. We show how these methods enrich low-resource tasks such as classification and question answering, as well as code-centric applications such as instruction tuning, code translation, and bug repair, by enabling automated verification of functional correctness. Alongside potential benefits like cost-effectiveness, broad coverage, and controllable diversity, we address challenges such as factual inaccuracies in generated text, lack of stylistic realism, and the risk of bias amplification. Proposed mitigations include filtering and weighting outputs and reinforcement learning with execution feedback for code. We conclude with open research directions like automated prompt engineering, cross-modal data synthesis, and robust evaluation frameworks, highlighting the importance of LLM-generated synthetic data in advancing AI while emphasizing ethical and quality safeguards.

Submitted 18 March, 2025; originally announced March 2025.

Comments: 21 pages, 3 tables, 64 references, preprint

Showing 1–3 of 3 results for author: Nadas, M