-
TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models
Authors:
Mihai Nadas,
Laura Diosan,
Andrei Piscoran,
Andreea Tomescu
Abstract:
Moral stories are a time-tested vehicle for transmitting values, yet modern NLP lacks a large, structured corpus that couples coherent narratives with explicit ethical lessons. We close this gap with TF1-EN-3M, the first open dataset of three million English-language fables generated exclusively by instruction-tuned models no larger than 8B parameters. Each story follows a six-slot scaffold (chara…
▽ More
Moral stories are a time-tested vehicle for transmitting values, yet modern NLP lacks a large, structured corpus that couples coherent narratives with explicit ethical lessons. We close this gap with TF1-EN-3M, the first open dataset of three million English-language fables generated exclusively by instruction-tuned models no larger than 8B parameters. Each story follows a six-slot scaffold (character -> trait -> setting -> conflict -> resolution -> moral), produced through a combinatorial prompt engine that guarantees genre fidelity while covering a broad thematic space.
A hybrid evaluation pipeline blends (i) a GPT-based critic that scores grammar, creativity, moral clarity, and template adherence with (ii) reference-free diversity and readability metrics. Among ten open-weight candidates, an 8B-parameter Llama-3 variant delivers the best quality-speed trade-off, producing high-scoring fables on a single consumer GPU (<24 GB VRAM) at approximately 13.5 cents per 1,000 fables.
We release the dataset, generation code, evaluation scripts, and full metadata under a permissive license, enabling exact reproducibility and cost benchmarking. TF1-EN-3M opens avenues for research in instruction following, narrative intelligence, value alignment, and child-friendly educational AI, demonstrating that large-scale moral storytelling no longer requires proprietary giant models.
△ Less
Submitted 29 April, 2025;
originally announced April 2025.
-
The Autonomous Software Stack of the FRED-003C: The Development That Led to Full-Scale Autonomous Racing
Authors:
Zalán Demeter,
Levente Puskás,
Balázs Kovács,
Ádám Matkovics,
Martin Nádas,
Balázs Tuba,
Zsolt Farkas,
Ármin Bogár-Németh,
Gergely Bári
Abstract:
Scientific development often takes place in the context of research projects carried out by dedicated students during their time at university. In the field of self-driving software research, the Formula Student Driverless competitions are an excellent platform to promote research and attract young engineers. This article presents the software stack developed by BME Formula Racing Team, that forme…
▽ More
Scientific development often takes place in the context of research projects carried out by dedicated students during their time at university. In the field of self-driving software research, the Formula Student Driverless competitions are an excellent platform to promote research and attract young engineers. This article presents the software stack developed by BME Formula Racing Team, that formed the foundation of the development that ultimately led us to full-scale autonomous racing. The experience we gained here contributes greatly to our successful participation in the Abu Dhabi Autonomous Racing League. We therefore think it is important to share the system we used, providing a valuable starting point for other ambitious students. We provide a detailed description of the software pipeline we used, including a brief description of the hardware-software architecture. Furthermore, we introduce the methods that we developed for the modules that implement perception; localisation and mapping, planning, and control tasks.
△ Less
Submitted 25 April, 2025;
originally announced April 2025.
-
Synthetic Data Generation Using Large Language Models: Advances in Text and Code
Authors:
Mihai Nadas,
Laura Diosan,
Andreea Tomescu
Abstract:
Large language models (LLMs) have unlocked new possibilities for generating synthetic training data in both natural language and code. By producing artificial but task-relevant examples, these models can significantly augment or even replace real-world datasets, especially when labeled data is scarce or sensitive. This paper surveys recent advances in using LLMs to create synthetic text and code,…
▽ More
Large language models (LLMs) have unlocked new possibilities for generating synthetic training data in both natural language and code. By producing artificial but task-relevant examples, these models can significantly augment or even replace real-world datasets, especially when labeled data is scarce or sensitive. This paper surveys recent advances in using LLMs to create synthetic text and code, emphasizing prompt-based generation, retrieval-augmented pipelines, and iterative self-refinement. We show how these methods enrich low-resource tasks such as classification and question answering, as well as code-centric applications such as instruction tuning, code translation, and bug repair, by enabling automated verification of functional correctness. Alongside potential benefits like cost-effectiveness, broad coverage, and controllable diversity, we address challenges such as factual inaccuracies in generated text, lack of stylistic realism, and the risk of bias amplification. Proposed mitigations include filtering and weighting outputs and reinforcement learning with execution feedback for code. We conclude with open research directions like automated prompt engineering, cross-modal data synthesis, and robust evaluation frameworks, highlighting the importance of LLM-generated synthetic data in advancing AI while emphasizing ethical and quality safeguards.
△ Less
Submitted 18 March, 2025;
originally announced March 2025.