DATETIME: A new benchmark to measure LLM translation and reasoning capabilities
Authors:
Edward Gaere,
Florian Wangenheim
Abstract:
This paper introduces DATETIME, a new high-quality benchmark designed to evaluate the translation and reasoning abilities of a Large Language Model (LLM) on datetimes. A datetime is simply a date and a time, for example '11th.february.2023 ,1:12:31'. Datetimes are an interesting domain because they are intuitive and straightforward for humans to process but present significant challenges for LLMs.…
▽ More
This paper introduces DATETIME, a new high-quality benchmark designed to evaluate the translation and reasoning abilities of a Large Language Model (LLM) on datetimes. A datetime is simply a date and a time, for example '11th.february.2023 ,1:12:31'. Datetimes are an interesting domain because they are intuitive and straightforward for humans to process but present significant challenges for LLMs. At the time of writing, no publicly available benchmark exists for systematically evaluating LLMs on datetime processing. Our experiments show that state-of-the-art models exhibit significant difficulty with tasks involving reasoning on datetimes, and that General Artificial Intelligence is still a distant aspiration. We hypothesize that working with datetimes necessitates translation and/or computation capabilities, and the tasks of the benchmark are organized accordingly. Significant dispersion in performance across models is observed with surprisingly poor performance even on apparently trivial tasks. Whilst frontier models such as ChatGPT, Claude and Llama3.1 have evidently been built and trained with datetime reasoning abilities, significant improvement is required for the open-source models.
△ Less
Submitted 22 April, 2025;
originally announced April 2025.
A Self-Integration Testbed for Decentralized Socio-technical Systems
Authors:
Farzam Fanitabasi,
Edward Gaere,
Evangelos Pournaras
Abstract:
The Internet of Things comes along with new challenges for experimenting, testing, and operating decentralized socio-technical systems at large-scale. In such systems, autonomous agents interact locally with their users, and remotely with other agents to make intelligent collective choices. Via these interactions they self-regulate the consumption and production of distributed resources. While suc…
▽ More
The Internet of Things comes along with new challenges for experimenting, testing, and operating decentralized socio-technical systems at large-scale. In such systems, autonomous agents interact locally with their users, and remotely with other agents to make intelligent collective choices. Via these interactions they self-regulate the consumption and production of distributed resources. While such complex systems are often deployed and operated using centralized computing infrastructures, the socio-technical nature of these decentralized systems requires new value-sensitive design paradigms; empowering trust, transparency, and alignment with citizens' social values, such as privacy preservation, autonomy, and fairness among citizens' choices. Currently, instruments and tools to study such systems and guide the prototyping process from simulation to live deployment are missing, or not practical in this distributed socio-technical context. This paper bridges this gap by introducing a novel testbed architecture for decentralized socio-technical systems running on IoT. This new architecture is designed for a seamless reusability of (i) application-independent decentralized services by an IoT application, and (ii) different IoT applications by the same decentralized service. This dual self-integration promises IoT applications that are simpler to prototype, and can interoperate with decentralized services during runtime to self-integrate more complex functionality. Such integration provides stronger validation of IoT applications, and improves resource utilization. Pressure and crash tests during continuous operations of several weeks, with more than 80K network joining and leaving of agents, 2.4M parameter changes, and 100M communicated messages, confirm the robustness and practicality of the testbed architecture.
△ Less
Submitted 22 July, 2020; v1 submitted 6 February, 2020;
originally announced February 2020.