LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs
Authors: Yujun Zhou, Jingdong Yang, Yue Huang, Kehan Guo, Zoe Emory, Bikram Ghosh, Amita Bedar, Sujay Shekar, Pin-Yu Chen, Tian Gao, Werner Geyer, Nuno Moniz, Nitesh V Chawla, Xiangliang Zhang
Abstract:
Artificial Intelligence (AI) is revolutionizing scientific research, yet its growing integration into laboratory environments presents critical safety challenges. While large language models (LLMs) increasingly assist in tasks ranging from procedural guidance to autonomous experiment orchestration, an "illusion of understanding" may lead researchers to overestimate their reliability. Such overreliance is especially hazardous in high-stakes laboratory settings, where failures in hazard identification or risk assessment can result in severe accidents. To address these concerns, we propose the Laboratory Safety Benchmark (LabSafety Bench), a comprehensive framework that evaluates LLMs and vision language models (VLMs) on their ability to identify potential hazards, assess risks, and predict the consequences of unsafe actions in lab environments. LabSafety Bench comprises 765 multiple-choice questions aligned with US Occupational Safety and Health Administration (OSHA) protocols, along with 520 realistic laboratory scenarios featuring dual evaluation tasks: the Hazards Identification Test and the Consequence Identification Test, with 4090 open-ended questions in total. Evaluations across eight proprietary models, seven open-weight LLMs, and four VLMs reveal that, despite advanced performance on structured assessments, no model achieves the safety threshold required for reliable operation. None scored above 75% on the Hazards Identification Test. Moreover, while proprietary models tend to excel in multiple-choice evaluations, their performance in open-ended, real-world scenario responses is comparable to that of open-source models. These findings underscore the urgent need for specialized evaluation frameworks to ensure the safe and responsible deployment of AI in laboratory settings.
Submitted 26 February, 2025; v1 submitted 18 October, 2024; originally announced October 2024.
People over trust AI-generated medical responses and view them to be as valid as doctors, despite low accuracy
Authors: Shruthi Shekar, Pat Pataranutaporn, Chethan Sarabu, Guillermo A. Cecchi, Pattie Maes
Abstract:
This paper presents a comprehensive analysis of how AI-generated medical responses are perceived and evaluated by non-experts. A total of 300 participants evaluated medical responses that were either written by a medical doctor on an online healthcare platform, or generated by a large language model and labeled by physicians as having high or low accuracy. Results showed that participants could not effectively distinguish between AI-generated and doctors' responses and demonstrated a preference for AI-generated responses, rating High Accuracy AI-generated responses as significantly more valid, trustworthy, and complete/satisfactory. On average, Low Accuracy AI-generated responses were rated similarly to doctors' responses, if not higher. Participants not only found these low-accuracy AI-generated responses to be valid, trustworthy, and complete/satisfactory but also indicated a high tendency to follow the potentially harmful medical advice and to seek unnecessary medical attention as a result of the response provided. This problematic reaction was comparable to, if not stronger than, the reaction they displayed towards doctors' responses. This increased trust placed in inaccurate or inappropriate AI-generated medical advice can lead to misdiagnosis and harmful consequences for individuals seeking help. Further, participants were more trusting of High Accuracy AI-generated responses when told they were given by a doctor, and experts rated AI-generated responses significantly higher when the source of the response was unknown. Both experts and non-experts exhibited bias, finding AI-generated responses to be more thorough and accurate than doctors' responses but still valuing the involvement of a doctor in the delivery of their medical advice. Implementing AI systems alongside medical professionals should be the future of using AI for the delivery of medical advice.
Submitted 11 August, 2024; originally announced August 2024.