-
RWESummary: A Framework and Test for Choosing Large Language Models to Summarize Real-World Evidence (RWE) Studies
Authors:
Arjun Mukerji,
Michael L. Jackson,
Jason Jones,
Neil Sanghavi
Abstract:
Large Language Models (LLMs) have been extensively evaluated for general summarization tasks as well as medical research assistance, but they have not been specifically evaluated for the task of summarizing real-world evidence (RWE) from structured output of RWE studies. We introduce RWESummary, a proposed addition to the MedHELM framework (Bedi, Cui, Fuentes, Unell et al., 2025) to enable benchma…
▽ More
Large Language Models (LLMs) have been extensively evaluated for general summarization tasks as well as medical research assistance, but they have not been specifically evaluated for the task of summarizing real-world evidence (RWE) from structured output of RWE studies. We introduce RWESummary, a proposed addition to the MedHELM framework (Bedi, Cui, Fuentes, Unell et al., 2025) to enable benchmarking of LLMs for this task. RWESummary includes one scenario and three evaluations covering major types of errors observed in summarization of medical research studies and was developed using Atropos Health proprietary data. Additionally, we use RWESummary to compare the performance of different LLMs in our internal RWE summarization tool. At the time of publication, with 13 distinct RWE studies, we found the Gemini 2.5 models performed best overall (both Flash and Pro). We suggest RWESummary as a novel and useful foundation model benchmark for real-world evidence study summarization.
△ Less
Submitted 23 June, 2025;
originally announced June 2025.
-
RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments
Authors:
Zeyi Liao,
Jaylen Jones,
Linxi Jiang,
Eric Fosler-Lussier,
Yu Su,
Zhiqiang Lin,
Huan Sun
Abstract:
Computer-use agents (CUAs) promise to automate complex tasks across operating systems (OS) and the web, but remain vulnerable to indirect prompt injection. Current evaluations of this threat either lack support realistic but controlled environments or ignore hybrid web-OS attack scenarios involving both interfaces. To address this, we propose RedTeamCUA, an adversarial testing framework featuring…
▽ More
Computer-use agents (CUAs) promise to automate complex tasks across operating systems (OS) and the web, but remain vulnerable to indirect prompt injection. Current evaluations of this threat either lack support realistic but controlled environments or ignore hybrid web-OS attack scenarios involving both interfaces. To address this, we propose RedTeamCUA, an adversarial testing framework featuring a novel hybrid sandbox that integrates a VM-based OS environment with Docker-based web platforms. Our sandbox supports key features tailored for red teaming, such as flexible adversarial scenario configuration, and a setting that decouples adversarial evaluation from navigational limitations of CUAs by initializing tests directly at the point of an adversarial injection. Using RedTeamCUA, we develop RTC-Bench, a comprehensive benchmark with 864 examples that investigate realistic, hybrid web-OS attack scenarios and fundamental security vulnerabilities. Benchmarking current frontier CUAs identifies significant vulnerabilities: Claude 3.7 Sonnet | CUA demonstrates an ASR of 42.9%, while Operator, the most secure CUA evaluated, still exhibits an ASR of 7.6%. Notably, CUAs often attempt to execute adversarial tasks with an Attempt Rate as high as 92.5%, although failing to complete them due to capability limitations. Nevertheless, we observe concerning ASRs of up to 50% in realistic end-to-end settings, with the recently released frontier Claude 4 Opus | CUA showing an alarming ASR of 48%, demonstrating that indirect prompt injection presents tangible risks for even advanced CUAs despite their capabilities and safeguards. Overall, RedTeamCUA provides an essential framework for advancing realistic, controlled, and systematic analysis of CUA vulnerabilities, highlighting the urgent need for robust defenses to indirect prompt injection prior to real-world deployment.
△ Less
Submitted 31 May, 2025; v1 submitted 27 May, 2025;
originally announced May 2025.
-
Programmatic Video Prediction Using Large Language Models
Authors:
Hao Tang,
Kevin Ellis,
Suhas Lohit,
Michael J. Jones,
Moitreya Chatterjee
Abstract:
The task of estimating the world model describing the dynamics of a real world process assumes immense importance for anticipating and preparing for future outcomes. For applications such as video surveillance, robotics applications, autonomous driving, etc. this objective entails synthesizing plausible visual futures, given a few frames of a video to set the visual context. Towards this end, we p…
▽ More
The task of estimating the world model describing the dynamics of a real world process assumes immense importance for anticipating and preparing for future outcomes. For applications such as video surveillance, robotics applications, autonomous driving, etc. this objective entails synthesizing plausible visual futures, given a few frames of a video to set the visual context. Towards this end, we propose ProgGen, which undertakes the task of video frame prediction by representing the dynamics of the video using a set of neuro-symbolic, human-interpretable set of states (one per frame) by leveraging the inductive biases of Large (Vision) Language Models (LLM/VLM). In particular, ProgGen utilizes LLM/VLM to synthesize programs: (i) to estimate the states of the video, given the visual context (i.e. the frames); (ii) to predict the states corresponding to future time steps by estimating the transition dynamics; (iii) to render the predicted states as visual RGB-frames. Empirical evaluations reveal that our proposed method outperforms competing techniques at the task of video frame prediction in two challenging environments: (i) PhyWorld (ii) Cart Pole. Additionally, ProgGen permits counter-factual reasoning and interpretable video generation attesting to its effectiveness and generalizability for video generation tasks.
△ Less
Submitted 20 May, 2025;
originally announced May 2025.
-
UWAV: Uncertainty-weighted Weakly-supervised Audio-Visual Video Parsing
Authors:
Yung-Hsuan Lai,
Janek Ebbers,
Yu-Chiang Frank Wang,
François Germain,
Michael Jeffrey Jones,
Moitreya Chatterjee
Abstract:
Audio-Visual Video Parsing (AVVP) entails the challenging task of localizing both uni-modal events (i.e., those occurring exclusively in either the visual or acoustic modality of a video) and multi-modal events (i.e., those occurring in both modalities concurrently). Moreover, the prohibitive cost of annotating training data with the class labels of all these events, along with their start and end…
▽ More
Audio-Visual Video Parsing (AVVP) entails the challenging task of localizing both uni-modal events (i.e., those occurring exclusively in either the visual or acoustic modality of a video) and multi-modal events (i.e., those occurring in both modalities concurrently). Moreover, the prohibitive cost of annotating training data with the class labels of all these events, along with their start and end times, imposes constraints on the scalability of AVVP techniques unless they can be trained in a weakly-supervised setting, where only modality-agnostic, video-level labels are available in the training data. To this end, recently proposed approaches seek to generate segment-level pseudo-labels to better guide model training. However, the absence of inter-segment dependencies when generating these pseudo-labels and the general bias towards predicting labels that are absent in a segment limit their performance. This work proposes a novel approach towards overcoming these weaknesses called Uncertainty-weighted Weakly-supervised Audio-visual Video Parsing (UWAV). Additionally, our innovative approach factors in the uncertainty associated with these estimated pseudo-labels and incorporates a feature mixup based training regularization for improved training. Empirical results show that UWAV outperforms state-of-the-art methods for the AVVP task on multiple metrics, across two different datasets, attesting to its effectiveness and generalizability.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
Improving Open-World Object Localization by Discovering Background
Authors:
Ashish Singh,
Michael J. Jones,
Kuan-Chuan Peng,
Anoop Cherian,
Moitreya Chatterjee,
Erik Learned-Miller
Abstract:
Our work addresses the problem of learning to localize objects in an open-world setting, i.e., given the bounding box information of a limited number of object classes during training, the goal is to localize all objects, belonging to both the training and unseen classes in an image, during inference. Towards this end, recent work in this area has focused on improving the characterization of objec…
▽ More
Our work addresses the problem of learning to localize objects in an open-world setting, i.e., given the bounding box information of a limited number of object classes during training, the goal is to localize all objects, belonging to both the training and unseen classes in an image, during inference. Towards this end, recent work in this area has focused on improving the characterization of objects either explicitly by proposing new objective functions (localization quality) or implicitly using object-centric auxiliary-information, such as depth information, pixel/region affinity map etc. In this work, we address this problem by incorporating background information to guide the learning of the notion of objectness. Specifically, we propose a novel framework to discover background regions in an image and train an object proposal network to not detect any objects in these regions. We formulate the background discovery task as that of identifying image regions that are not discriminative, i.e., those that are redundant and constitute low information content. We conduct experiments on standard benchmarks to showcase the effectiveness of our proposed approach and observe significant improvements over the previous state-of-the-art approaches for this task.
△ Less
Submitted 24 April, 2025;
originally announced April 2025.
-
Requirements for Recognition and Rapid Response to Unfamiliar Events Outside of Agent Design Scope
Authors:
Robert E. Wray,
Steven J. Jones,
John E. Laird
Abstract:
Regardless of past learning, an agent in an open world will face unfamiliar events outside of prior experience, existing models, or policies. Further, the agent will sometimes lack relevant knowledge and/or sufficient time to assess the situation and evaluate response options. How can an agent respond reasonably to situations that are outside of its original design scope? How can it recognize such…
▽ More
Regardless of past learning, an agent in an open world will face unfamiliar events outside of prior experience, existing models, or policies. Further, the agent will sometimes lack relevant knowledge and/or sufficient time to assess the situation and evaluate response options. How can an agent respond reasonably to situations that are outside of its original design scope? How can it recognize such situations sufficiently quickly and reliably to determine reasonable, adaptive courses of action? We identify key characteristics needed for solutions, review the state-of-the-art, and outline a proposed, novel approach that combines domain-general meta-knowledge (inspired by human cognition) and metareasoning. This approach offers potential for fast, adaptive responses to unfamiliar situations, more fully meeting the performance characteristics required for open-world, general agents.
△ Less
Submitted 13 June, 2025; v1 submitted 16 April, 2025;
originally announced April 2025.
-
SurvSurf: a partially monotonic neural network for first-hitting time prediction of intermittently observed discrete and continuous sequential events
Authors:
Yichen Kelly Chen,
Sören Dittmer,
Kinga Bernatowicz,
Josep Arús-Pous,
Kamen Bliznashki,
John Aston,
James H. F. Rudd,
Carola-Bibiane Schönlieb,
James Jones,
Michael Roberts
Abstract:
We propose a neural-network based survival model (SurvSurf) specifically designed for direct and simultaneous probabilistic prediction of the first hitting time of sequential events from baseline. Unlike existing models, SurvSurf is theoretically guaranteed to never violate the monotonic relationship between the cumulative incidence functions of sequential events, while allowing nonlinear influenc…
▽ More
We propose a neural-network based survival model (SurvSurf) specifically designed for direct and simultaneous probabilistic prediction of the first hitting time of sequential events from baseline. Unlike existing models, SurvSurf is theoretically guaranteed to never violate the monotonic relationship between the cumulative incidence functions of sequential events, while allowing nonlinear influence from predictors. It also incorporates implicit truths for unobserved intermediate events in model fitting, and supports both discrete and continuous time and events. We also identified a variant of the Integrated Brier Score (IBS) that showed robust correlation with the mean squared error (MSE) between the true and predicted probabilities by accounting for implied truths about the missing intermediate events. We demonstrated the superiority of SurvSurf compared to modern and traditional predictive survival models in two simulated datasets and two real-world datasets, using MSE, the more robust IBS and by measuring the extent of monotonicity violation.
△ Less
Submitted 7 April, 2025;
originally announced April 2025.
-
Anisotropic mesh spacing prediction using neural networks
Authors:
Callum Lock,
Oubay Hassan,
Ruben Sevilla,
Jason Jones
Abstract:
This work presents a framework to predict near-optimal anisotropic spacing functions suitable to perform simulations with unseen operating conditions or geometric configurations. The strategy consists of utilising the vast amount of high fidelity data available in industry to compute a target anisotropic spacing and train an artificial neural network to predict the spacing for unseen scenarios. Th…
▽ More
This work presents a framework to predict near-optimal anisotropic spacing functions suitable to perform simulations with unseen operating conditions or geometric configurations. The strategy consists of utilising the vast amount of high fidelity data available in industry to compute a target anisotropic spacing and train an artificial neural network to predict the spacing for unseen scenarios. The trained neural network outputs the metric tensor at the nodes of a coarse background mesh that is then used to generate meshes for unseen cases. Examples are used to demonstrate the effect of the network hyperparameters and the training dataset on the accuracy of the predictions. The potential is demonstrated for examples involving up to 11 geometric parameters on CFD simulations involving a full aircraft configuration.
△ Less
Submitted 1 April, 2025;
originally announced April 2025.
-
LimeSoDa: A Dataset Collection for Benchmarking of Machine Learning Regressors in Digital Soil Mapping
Authors:
J. Schmidinger,
S. Vogel,
V. Barkov,
A. -D. Pham,
R. Gebbers,
H. Tavakoli,
J. Correa,
T. R. Tavares,
P. Filippi,
E. J. Jones,
V. Lukas,
E. Boenecke,
J. Ruehlmann,
I. Schroeter,
E. Kramer,
S. Paetzold,
M. Kodaira,
A. M. J. -C. Wadoux,
L. Bragazza,
K. Metzger,
J. Huang,
D. S. M. Valente,
J. L. Safanelli,
E. L. Bottega,
R. S. D. Dalmolin
, et al. (11 additional authors not shown)
Abstract:
Digital soil mapping (DSM) relies on a broad pool of statistical methods, yet determining the optimal method for a given context remains challenging and contentious. Benchmarking studies on multiple datasets are needed to reveal strengths and limitations of commonly used methods. Existing DSM studies usually rely on a single dataset with restricted access, leading to incomplete and potentially mis…
▽ More
Digital soil mapping (DSM) relies on a broad pool of statistical methods, yet determining the optimal method for a given context remains challenging and contentious. Benchmarking studies on multiple datasets are needed to reveal strengths and limitations of commonly used methods. Existing DSM studies usually rely on a single dataset with restricted access, leading to incomplete and potentially misleading conclusions. To address these issues, we introduce an open-access dataset collection called Precision Liming Soil Datasets (LimeSoDa). LimeSoDa consists of 31 field- and farm-scale datasets from various countries. Each dataset has three target soil properties: (1) soil organic matter or soil organic carbon, (2) clay content and (3) pH, alongside a set of features. Features are dataset-specific and were obtained by optical spectroscopy, proximal- and remote soil sensing. All datasets were aligned to a tabular format and are ready-to-use for modeling. We demonstrated the use of LimeSoDa for benchmarking by comparing the predictive performance of four learning algorithms across all datasets. This comparison included multiple linear regression (MLR), support vector regression (SVR), categorical boosting (CatBoost) and random forest (RF). The results showed that although no single algorithm was universally superior, certain algorithms performed better in specific contexts. MLR and SVR performed better on high-dimensional spectral datasets, likely due to better compatibility with principal components. In contrast, CatBoost and RF exhibited considerably better performances when applied to datasets with a moderate number (< 20) of features. These benchmarking results illustrate that the performance of a method is highly context-dependent. LimeSoDa therefore provides an important resource for improving the development and evaluation of statistical methods in DSM.
△ Less
Submitted 20 May, 2025; v1 submitted 27 February, 2025;
originally announced February 2025.
-
Learning from Mistakes: Understanding Ad-hoc Logs through Analyzing Accidental Commits
Authors:
Yi-Hung Chou,
Yiyang Min,
April Yi Wang,
James A. Jones
Abstract:
Developers often insert temporary "print" or "log" instructions into their code to help them better understand runtime behavior, usually when the code is not behaving as they expected. Despite the fact that such monitoring instructions, or "ad-hoc logs," are so commonly used by developers, there is almost no existing literature that studies developers' practices in how they use them. This paucity…
▽ More
Developers often insert temporary "print" or "log" instructions into their code to help them better understand runtime behavior, usually when the code is not behaving as they expected. Despite the fact that such monitoring instructions, or "ad-hoc logs," are so commonly used by developers, there is almost no existing literature that studies developers' practices in how they use them. This paucity of knowledge of the use of these ephemeral logs may be largely due to the fact that they typically only exist in the developers' local environments and are removed before they commit their code to their revision control system. In this work, we overcome this challenge by observing that developers occasionally mistakenly forget to remove such instructions before committing, and then they remove them shortly later. Additionally, we further study such developer logging practices by watching and analyzing live-streamed coding videos. Through these empirical approaches, we study where, how, and why developers use ad-hoc logs to better understand their code and its execution. We collect 27 GB of accidental commits that removed 548,880 ad-hoc logs in JavaScript from GitHub Archive repositories to provide the first large-scale dataset and empirical studies on ad-hoc logging practices. Our results reveal several illuminating findings, including a particular propensity for developers to use ad-hoc logs in asynchronous and callback functions. Our findings provide both empirical evidence and a valuable dataset for researchers and tool developers seeking to enhance ad-hoc logging practices, and potentially deepen our understanding of developers' practices towards understanding of software's runtime behaviors.
△ Less
Submitted 17 April, 2025; v1 submitted 16 January, 2025;
originally announced January 2025.
-
ComplexVAD: Detecting Interaction Anomalies in Video
Authors:
Furkan Mumcu,
Michael J. Jones,
Yasin Yilmaz,
Anoop Cherian
Abstract:
Existing video anomaly detection datasets are inadequate for representing complex anomalies that occur due to the interactions between objects. The absence of complex anomalies in previous video anomaly detection datasets affects research by shifting the focus onto simple anomalies. To address this problem, we introduce a new large-scale dataset: ComplexVAD. In addition, we propose a novel method…
▽ More
Existing video anomaly detection datasets are inadequate for representing complex anomalies that occur due to the interactions between objects. The absence of complex anomalies in previous video anomaly detection datasets affects research by shifting the focus onto simple anomalies. To address this problem, we introduce a new large-scale dataset: ComplexVAD. In addition, we propose a novel method to detect complex anomalies via modeling the interactions between objects using a scene graph with spatio-temporal attributes. With our proposed method and two other state-of-the-art video anomaly detection methods, we obtain baseline scores on ComplexVAD and demonstrate that our new method outperforms existing works.
△ Less
Submitted 16 January, 2025;
originally announced January 2025.
-
Leveraging the Global Research Infrastructure to Characterize the Impact of National Science Foundation Research
Authors:
Jamaica Jones,
Ted Habermann
Abstract:
The Global Research infrastructure (GRI) is made up of the repositories and organizations that provide persistent identifiers (PIDs) and metadata for many kinds of research objects and connect these objects to funders, research institutions, researchers, and one another using PIDs. The INFORMATE Project has combined three data sources to focus on understanding how the global research infrastructur…
▽ More
The Global Research infrastructure (GRI) is made up of the repositories and organizations that provide persistent identifiers (PIDs) and metadata for many kinds of research objects and connect these objects to funders, research institutions, researchers, and one another using PIDs. The INFORMATE Project has combined three data sources to focus on understanding how the global research infrastructure might help the US National Science Foundation (NSF) and other federal agencies identify and characterize the impact of their support. In this paper we present INFORMATE observations of three data systems. The NSF Award database represents NSF funding while the NSF Public Access Repository (PAR) and CHORUS, as a proxy for the GRI, represent two different view of results of that funding. We compare the first at the level of awards and the second two at the level of published research articles. Our findings demonstrate that CHORUS datasets include significantly more NSF awards and more related papers than does PAR. Our findings also suggest that time plays a significant role in the inclusion of award metadata across the sources analyzed. Data in those sources travel very different journeys, each presenting different obstacles to metadata completeness and suggesting necessary actions on the parts of authors and publishers to ensure that publication and funding metadata are captured. We discuss these actions, as well as implications our findings have for emergent technologies such as artificial intelligence and natural language processing.
△ Less
Submitted 12 January, 2025;
originally announced January 2025.
-
Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding
Authors:
Joshua Jones,
Oier Mees,
Carmelo Sferrazza,
Kyle Stachowicz,
Pieter Abbeel,
Sergey Levine
Abstract:
Interacting with the world is a multi-sensory experience: achieving effective general-purpose interaction requires making use of all available modalities -- including vision, touch, and audio -- to fill in gaps from partial observation. For example, when vision is occluded reaching into a bag, a robot should rely on its senses of touch and sound. However, state-of-the-art generalist robot policies…
▽ More
Interacting with the world is a multi-sensory experience: achieving effective general-purpose interaction requires making use of all available modalities -- including vision, touch, and audio -- to fill in gaps from partial observation. For example, when vision is occluded reaching into a bag, a robot should rely on its senses of touch and sound. However, state-of-the-art generalist robot policies are typically trained on large datasets to predict robot actions solely from visual and proprioceptive observations. In this work, we propose FuSe, a novel approach that enables finetuning visuomotor generalist policies on heterogeneous sensor modalities for which large datasets are not readily available by leveraging natural language as a common cross-modal grounding. We combine a multimodal contrastive loss with a sensory-grounded language generation loss to encode high-level semantics. In the context of robot manipulation, we show that FuSe enables performing challenging tasks that require reasoning jointly over modalities such as vision, touch, and sound in a zero-shot setting, such as multimodal prompting, compositional cross-modal prompting, and descriptions of objects it interacts with. We show that the same recipe is applicable to widely different generalist policies, including both diffusion-based generalist policies and large vision-language-action (VLA) models. Extensive experiments in the real world show that FuSeis able to increase success rates by over 20% compared to all considered baselines.
△ Less
Submitted 14 January, 2025; v1 submitted 8 January, 2025;
originally announced January 2025.
-
Benchmarking large language models for materials synthesis: the case of atomic layer deposition
Authors:
Angel Yanguas-Gil,
Matthew T. Dearing,
Jeffrey W. Elam,
Jessica C. Jones,
Sungjoon Kim,
Adnan Mohammad,
Chi Thang Nguyen,
Bratin Sengupta
Abstract:
In this work we introduce an open-ended question benchmark, ALDbench, to evaluate the performance of large language models (LLMs) in materials synthesis, and in particular in the field of atomic layer deposition, a thin film growth technique used in energy applications and microelectronics. Our benchmark comprises questions with a level of difficulty ranging from graduate level to domain expert cu…
▽ More
In this work we introduce an open-ended question benchmark, ALDbench, to evaluate the performance of large language models (LLMs) in materials synthesis, and in particular in the field of atomic layer deposition, a thin film growth technique used in energy applications and microelectronics. Our benchmark comprises questions with a level of difficulty ranging from graduate level to domain expert current with the state of the art in the field. Human experts reviewed the questions along the criteria of difficulty and specificity, and the model responses along four different criteria: overall quality, specificity, relevance, and accuracy. We ran this benchmark on an instance of OpenAI's GPT-4o. The responses from the model received a composite quality score of 3.7 on a 1 to 5 scale, consistent with a passing grade. However, 36% of the questions received at least one below average score. An in-depth analysis of the responses identified at least five instances of suspected hallucination. Finally, we observed statistically significant correlations between the difficulty of the question and the quality of the response, the difficulty of the question and the relevance of the response, and the specificity of the question and the accuracy of the response as graded by the human experts. This emphasizes the need to evaluate LLMs across multiple criteria beyond difficulty or accuracy.
△ Less
Submitted 13 December, 2024;
originally announced December 2024.
-
Americans' Support for AI Development -- Measured Daily with Open Data and Methods
Authors:
Jason Jeffrey Jones
Abstract:
The rapid development of artificial intelligence should be accompanied by measurement of public sentiment at high temporal resolution. Accordingly, here I present analysis of daily repeated surveys beginning April 18, 2024 (total N=4067). The results indicate that in the population of American adults, support for further development of artificial intelligence was modestly positive and increased a…
▽ More
The rapid development of artificial intelligence should be accompanied by measurement of public sentiment at high temporal resolution. Accordingly, here I present analysis of daily repeated surveys beginning April 18, 2024 (total N=4067). The results indicate that in the population of American adults, support for further development of artificial intelligence was modestly positive and increased a statistically reliable amount over the past year. Female and low-trust respondents reported less support, however, both also displayed growing support over time. Republicans increased support at a faster rate than Democrats, pointing to potential polarization. These findings underscore the need for continuous, high-frequency surveys to accurately track shifts in public opinion on transformative technologies like AI.
△ Less
Submitted 12 June, 2025; v1 submitted 6 December, 2024;
originally announced December 2024.
-
Leveraging Propagated Infection to Crossfire Mutants
Authors:
Hang Du,
Vijay Krishna Palepu,
James A. Jones
Abstract:
Mutation testing was proposed to identify weaknesses in test suites by repeatedly generating artificially faulty versions of the software (mutants) and determining if the test suite is sufficient to detect them (kill them). When the tests are insufficient, each surviving mutant provides an opportunity to improve the test suite. We conducted a study and found that many such surviving mutants (up to…
▽ More
Mutation testing was proposed to identify weaknesses in test suites by repeatedly generating artificially faulty versions of the software (mutants) and determining if the test suite is sufficient to detect them (kill them). When the tests are insufficient, each surviving mutant provides an opportunity to improve the test suite. We conducted a study and found that many such surviving mutants (up to 84% for the subjects of our study) are detectable by simply augmenting existing tests with additional assertions, or assertion amplification. Moreover, we find that many of these mutants are detectable by multiple existing tests, giving developers options for how to detect them. To help with these challenges, we created a technique that performs memory-state analysis to identify candidate assertions that developers can use to detect the surviving mutants. Additionally, we build upon prior research that identifies ``crossfiring'' opportunities -- tests that coincidentally kill multiple mutants. To this end, we developed a theoretical model that describes the varying granularities that crossfiring can occur in the existing test suite, which provide opportunities and options for how to kill surviving mutants. We operationalize this model to an accompanying technique that optimizes the assertion amplification of the existing tests to crossfire multiple mutants with fewer added assertions, optionally concentrated within fewer tests. Our experiments show that we can kill all surviving mutants that are detectable with existing test data with only 1.1% of the identified assertion candidates, and increasing by a factor of 6x, on average, the number of killed mutants from amplified tests, over tests that do not crossfire.
△ Less
Submitted 14 November, 2024;
originally announced November 2024.
-
AmpleGCG-Plus: A Strong Generative Model of Adversarial Suffixes to Jailbreak LLMs with Higher Success Rates in Fewer Attempts
Authors:
Vishal Kumar,
Zeyi Liao,
Jaylen Jones,
Huan Sun
Abstract:
Although large language models (LLMs) are typically aligned, they remain vulnerable to jailbreaking through either carefully crafted prompts in natural language or, interestingly, gibberish adversarial suffixes. However, gibberish tokens have received relatively less attention despite their success in attacking aligned LLMs. Recent work, AmpleGCG~\citep{liao2024amplegcg}, demonstrates that a gener…
▽ More
Although large language models (LLMs) are typically aligned, they remain vulnerable to jailbreaking through either carefully crafted prompts in natural language or, interestingly, gibberish adversarial suffixes. However, gibberish tokens have received relatively less attention despite their success in attacking aligned LLMs. Recent work, AmpleGCG~\citep{liao2024amplegcg}, demonstrates that a generative model can quickly produce numerous customizable gibberish adversarial suffixes for any harmful query, exposing a range of alignment gaps in out-of-distribution (OOD) language spaces. To bring more attention to this area, we introduce AmpleGCG-Plus, an enhanced version that achieves better performance in fewer attempts. Through a series of exploratory experiments, we identify several training strategies to improve the learning of gibberish suffixes. Our results, verified under a strict evaluation setting, show that it outperforms AmpleGCG on both open-weight and closed-source models, achieving increases in attack success rate (ASR) of up to 17\% in the white-box setting against Llama-2-7B-chat, and more than tripling ASR in the black-box setting against GPT-4. Notably, AmpleGCG-Plus jailbreaks the newer GPT-4o series of models at similar rates to GPT-4, and, uncovers vulnerabilities against the recently proposed circuit breakers defense. We publicly release AmpleGCG-Plus along with our collected training datasets.
△ Less
Submitted 29 October, 2024;
originally announced October 2024.
-
Report on the Workshop on Simulations for Information Access (Sim4IA 2024) at SIGIR 2024
Authors:
Timo Breuer,
Christin Katharina Kreutz,
Norbert Fuhr,
Krisztian Balog,
Philipp Schaer,
Nolwenn Bernard,
Ingo Frommholz,
Marcel Gohsen,
Kaixin Ji,
Gareth J. F. Jones,
Jüri Keller,
Jiqun Liu,
Martin Mladenov,
Gabriella Pasi,
Johanne Trippas,
Xi Wang,
Saber Zerhoudi,
ChengXiang Zhai
Abstract:
This paper is a report of the Workshop on Simulations for Information Access (Sim4IA) workshop at SIGIR 2024. The workshop had two keynotes, a panel discussion, nine lightning talks, and two breakout sessions. Key takeaways were user simulation's importance in academia and industry, the possible bridging of online and offline evaluation, and the issues of organizing a companion shared task around…
▽ More
This paper is a report of the Workshop on Simulations for Information Access (Sim4IA) workshop at SIGIR 2024. The workshop had two keynotes, a panel discussion, nine lightning talks, and two breakout sessions. Key takeaways were user simulation's importance in academia and industry, the possible bridging of online and offline evaluation, and the issues of organizing a companion shared task around user simulations for information access. We report on how we organized the workshop, provide a brief overview of what happened at the workshop, and summarize the main topics and findings of the workshop and future work.
△ Less
Submitted 26 September, 2024;
originally announced September 2024.
-
What do we know about Hugging Face? A systematic literature review and quantitative validation of qualitative claims
Authors:
Jason Jones,
Wenxin Jiang,
Nicholas Synovic,
George K. Thiruvathukal,
James C. Davis
Abstract:
Background: Collaborative Software Package Registries (SPRs) are an integral part of the software supply chain. Much engineering work synthesizes SPR package into applications. Prior research has examined SPRs for traditional software, such as NPM (JavaScript) and PyPI (Python). Pre-Trained Model (PTM) Registries are an emerging class of SPR of increasing importance, because they support the deep…
▽ More
Background: Collaborative Software Package Registries (SPRs) are an integral part of the software supply chain. Much engineering work synthesizes SPR package into applications. Prior research has examined SPRs for traditional software, such as NPM (JavaScript) and PyPI (Python). Pre-Trained Model (PTM) Registries are an emerging class of SPR of increasing importance, because they support the deep learning supply chain.
Aims: Recent empirical research has examined PTM registries in ways such as vulnerabilities, reuse processes, and evolution. However, no existing research synthesizes them to provide a systematic understanding of the current knowledge. Some of the existing research includes qualitative claims lacking quantitative analysis. Our research fills these gaps by providing a knowledge synthesis and quantitative analyses.
Methods: We first conduct a systematic literature review (SLR). We then observe that some of the claims are qualitative. We identify quantifiable metrics associated with those claims, and measure in order to substantiate these claims.
Results: From our SLR, we identify 12 claims about PTM reuse on the HuggingFace platform, 4 of which lack quantitative validation. We successfully test 3 of these claims through a quantitative analysis, and directly compare one with traditional software. Our findings corroborate qualitative claims with quantitative measurements. Our findings are: (1) PTMs have a much higher turnover rate than traditional software, indicating a dynamic and rapidly evolving reuse environment within the PTM ecosystem; and (2) There is a strong correlation between documentation quality and PTM popularity.
Conclusions: We confirm qualitative research claims with concrete metrics, supporting prior qualitative and case study research. Our measures show further dynamics of PTM reuse, inspiring research infrastructure and new measures.
△ Less
Submitted 3 September, 2024; v1 submitted 12 June, 2024;
originally announced June 2024.
-
Toward Constraint Compliant Goal Formulation and Planning
Authors:
Steven J. Jones,
Robert E. Wray
Abstract:
One part of complying with norms, rules, and preferences is incorporating constraints (such as knowledge of ethics) into one's goal formulation and planning processing. We explore in a simple domain how the encoding of knowledge in different ethical frameworks influences an agent's goal formulation and planning processing and demonstrate ability of an agent to satisfy and satisfice when its collec…
▽ More
One part of complying with norms, rules, and preferences is incorporating constraints (such as knowledge of ethics) into one's goal formulation and planning processing. We explore in a simple domain how the encoding of knowledge in different ethical frameworks influences an agent's goal formulation and planning processing and demonstrate ability of an agent to satisfy and satisfice when its collection of relevant constraints includes a mix of "hard" and "soft" constraints of various types. How the agent attempts to comply with ethical constraints depends on the ethical framing and we investigate tradeoffs between deontological framing and utilitarian framing for complying with an ethical norm. Representative scenarios highlight how performing the same task with different framings of the same norm leads to different behaviors. Our explorations suggest an important role for metacognitive judgments in resolving ethical conflicts during goal formulation and planning.
△ Less
Submitted 10 June, 2024; v1 submitted 21 May, 2024;
originally announced May 2024.
-
Investigating Remote Hands-On Assistance for Collaborative Development of Embedded Systems
Authors:
Yan Chen,
Jasmine Jones
Abstract:
Developing embedded systems is a complex endeavor that frequently requires collaborative teamwork. With the rise of freelance work and the global shift towards remote work, the need for effective remote collaboration has become crucial for many developers and their clients. However, current communication and coordination tools are predominantly tailored for software development rather than hardwar…
▽ More
Developing embedded systems is a complex endeavor that frequently requires collaborative teamwork. With the rise of freelance work and the global shift towards remote work, the need for effective remote collaboration has become crucial for many developers and their clients. However, current communication and coordination tools are predominantly tailored for software development rather than hardware-focused tasks. This study investigates the potential for remote support tools specifically designed for embedded systems development. Through interviews with 12 experienced embedded systems developers, we explored their existing remote work practices, challenges, and requirements. We also conducted a user enactment study featuring a custom-designed remote manipulation agent, Handy, as a theoretical assistant, to identify the kinds of support developers would value in a collaborative setting. Our findings highlight the scenarios and strategies employed in remote work, the specific support needs, and the challenges related to information exchange, coordination, and execution. Additionally, we explore concerns around privacy, control, and trust when using remote physical manipulation tools. This research contributes to the field by integrating the development of embedded systems with the remote, on-demand collaboration and assistance typical of software environments, offering a solid empirical foundation for future research on remote manipulation agents in this area.
△ Less
Submitted 18 August, 2024; v1 submitted 24 April, 2024;
originally announced April 2024.
-
Multimodal 3D Object Detection on Unseen Domains
Authors:
Deepti Hegde,
Suhas Lohit,
Kuan-Chuan Peng,
Michael J. Jones,
Vishal M. Patel
Abstract:
LiDAR datasets for autonomous driving exhibit biases in properties such as point cloud density, range, and object dimensions. As a result, object detection networks trained and evaluated in different environments often experience performance degradation. Domain adaptation approaches assume access to unannotated samples from the test distribution to address this problem. However, in the real world,…
▽ More
LiDAR datasets for autonomous driving exhibit biases in properties such as point cloud density, range, and object dimensions. As a result, object detection networks trained and evaluated in different environments often experience performance degradation. Domain adaptation approaches assume access to unannotated samples from the test distribution to address this problem. However, in the real world, the exact conditions of deployment and access to samples representative of the test dataset may be unavailable while training. We argue that the more realistic and challenging formulation is to require robustness in performance to unseen target domains. We propose to address this problem in a two-pronged manner. First, we leverage paired LiDAR-image data present in most autonomous driving datasets to perform multimodal object detection. We suggest that working with multimodal features by leveraging both images and LiDAR point clouds for scene understanding tasks results in object detectors more robust to unseen domain shifts. Second, we train a 3D object detector to learn multimodal object features across different distributions and promote feature invariance across these source domains to improve generalizability to unseen target domains. To this end, we propose CLIX$^\text{3D}$, a multimodal fusion and supervised contrastive learning framework for 3D object detection that performs alignment of object features from same-class samples of different domains while pushing the features from different classes apart. We show that CLIX$^\text{3D}$ yields state-of-the-art domain generalization performance under multiple dataset shifts.
△ Less
Submitted 17 April, 2024;
originally announced April 2024.
-
Equivariant Spatio-Temporal Self-Supervision for LiDAR Object Detection
Authors:
Deepti Hegde,
Suhas Lohit,
Kuan-Chuan Peng,
Michael J. Jones,
Vishal M. Patel
Abstract:
Popular representation learning methods encourage feature invariance under transformations applied at the input. However, in 3D perception tasks like object localization and segmentation, outputs are naturally equivariant to some transformations, such as rotation. Using pre-training loss functions that encourage equivariance of features under certain transformations provides a strong self-supervis…
▽ More
Popular representation learning methods encourage feature invariance under transformations applied at the input. However, in 3D perception tasks like object localization and segmentation, outputs are naturally equivariant to some transformations, such as rotation. Using pre-training loss functions that encourage equivariance of features under certain transformations provides a strong self-supervision signal while also retaining information of geometric relationships between transformed feature representations. This can enable improved performance in downstream tasks that are equivariant to such transformations. In this paper, we propose a spatio-temporal equivariant learning framework by considering both spatial and temporal augmentations jointly. Our experiments show that the best performance arises with a pre-training approach that encourages equivariance to translation, scaling, and flip, rotation and scene flow. For spatial augmentations, we find that depending on the transformation, either a contrastive objective or an equivariance-by-classification objective yields best results. To leverage real-world object deformations and motion, we consider sequential LiDAR scene pairs and develop a novel 3D scene flow-based equivariance objective that leads to improved performance overall. We show our pre-training method for 3D object detection which outperforms existing equivariant and invariant approaches in many settings.
△ Less
Submitted 17 April, 2024;
originally announced April 2024.
-
Autonomous microARPES
Authors:
Steinn Ymir Agustsson,
Alfred J. H. Jones,
Davide Curcio,
Søren Ulstrup,
Jill Miwa,
Davide Mottin,
Panagiotis Karras,
Philip Hofmann
Abstract:
Angle-resolved photoemission spectroscopy (ARPES) is a technique used to map the occupied electronic structure of solids. Recent progress in X-ray focusing optics has led to the development of ARPES into a microscopic tool, permitting the electronic structure to be spatially mapped across the surface of a sample. This comes at the expense of a time-consuming scanning process to cover not only a th…
▽ More
Angle-resolved photoemission spectroscopy (ARPES) is a technique used to map the occupied electronic structure of solids. Recent progress in X-ray focusing optics has led to the development of ARPES into a microscopic tool, permitting the electronic structure to be spatially mapped across the surface of a sample. This comes at the expense of a time-consuming scanning process to cover not only a three-dimensional energy-momentum ($E, k_z, k_y$) space but also the two-dimensional surface area. Here, we implement a protocol to autonomously search both $\mathbf{k}$- and real space in order to find positions of particular interest, either because of their high photoemission intensity or because of sharp spectral features. The search is based on the use of Gaussian process regression and can easily be expanded to include additional parameters or optimisation criteria. This autonomous experimental control is implemented on the SGM4 micro-focus beamline of the synchrotron radiation source ASTRID2.
△ Less
Submitted 16 February, 2024;
originally announced March 2024.
-
Evaluating Versal AI Engines for option price discovery in market risk analysis
Authors:
Mark Klaisoongnoen,
Nick Brown,
Tim Dykes,
Jessica R. Jones,
Utz-Uwe Haus
Abstract:
Whilst Field-Programmable Gate Arrays (FPGAs) have been popular in accelerating high-frequency financial workload for many years, their application in quantitative finance, the utilisation of mathematical models to analyse financial markets and securities, is less mature. Nevertheless, recent work has demonstrated the benefits that FPGAs can deliver to quantitative workloads, and in this paper, we…
▽ More
Whilst Field-Programmable Gate Arrays (FPGAs) have been popular in accelerating high-frequency financial workload for many years, their application in quantitative finance, the utilisation of mathematical models to analyse financial markets and securities, is less mature. Nevertheless, recent work has demonstrated the benefits that FPGAs can deliver to quantitative workloads, and in this paper, we study whether the Versal ACAP and its AI Engines (AIEs) can also deliver improved performance. We focus specifically on the industry standard Strategic Technology Analysis Center's (STAC) derivatives risk analysis benchmark STAC-A2. Porting a purely FPGA-based accelerator STAC-A2 inspired market risk (SIMR) benchmark to the Versal ACAP device by combining Programmable Logic (PL) and AIEs, we explore the development approach and techniques, before comparing performance across PL and AIEs. Ultimately, we found that our AIE approach is slower than a highly optimised existing PL-only version due to limits on both the AIE and PL that we explore and describe.
△ Less
Submitted 19 February, 2024;
originally announced February 2024.
-
A Multi-Aspect Framework for Counter Narrative Evaluation using Large Language Models
Authors:
Jaylen Jones,
Lingbo Mo,
Eric Fosler-Lussier,
Huan Sun
Abstract:
Counter narratives - informed responses to hate speech contexts designed to refute hateful claims and de-escalate encounters - have emerged as an effective hate speech intervention strategy. While previous work has proposed automatic counter narrative generation methods to aid manual interventions, the evaluation of these approaches remains underdeveloped. Previous automatic metrics for counter na…
▽ More
Counter narratives - informed responses to hate speech contexts designed to refute hateful claims and de-escalate encounters - have emerged as an effective hate speech intervention strategy. While previous work has proposed automatic counter narrative generation methods to aid manual interventions, the evaluation of these approaches remains underdeveloped. Previous automatic metrics for counter narrative evaluation lack alignment with human judgment as they rely on superficial reference comparisons instead of incorporating key aspects of counter narrative quality as evaluation criteria. To address prior evaluation limitations, we propose a novel evaluation framework prompting LLMs to provide scores and feedback for generated counter narrative candidates using 5 defined aspects derived from guidelines from counter narrative specialized NGOs. We found that LLM evaluators achieve strong alignment to human-annotated scores and feedback and outperform alternative metrics, indicating their potential as multi-aspect, reference-free and interpretable evaluators for counter narrative evaluation.
△ Less
Submitted 29 March, 2024; v1 submitted 18 February, 2024;
originally announced February 2024.
-
Development and validation of an artificial intelligence model to accurately predict spinopelvic parameters
Authors:
Edward S. Harake,
Joseph R. Linzey,
Cheng Jiang,
Rushikesh S. Joshi,
Mark M. Zaki,
Jaes C. Jones,
Siri S. Khalsa,
John H. Lee,
Zachary Wilseck,
Jacob R. Joseph,
Todd C. Hollon,
Paul Park
Abstract:
Objective. Achieving appropriate spinopelvic alignment has been shown to be associated with improved clinical symptoms. However, measurement of spinopelvic radiographic parameters is time-intensive and interobserver reliability is a concern. Automated measurement tools have the promise of rapid and consistent measurements, but existing tools are still limited by some degree of manual user-entry re…
▽ More
Objective. Achieving appropriate spinopelvic alignment has been shown to be associated with improved clinical symptoms. However, measurement of spinopelvic radiographic parameters is time-intensive and interobserver reliability is a concern. Automated measurement tools have the promise of rapid and consistent measurements, but existing tools are still limited by some degree of manual user-entry requirements. This study presents a novel artificial intelligence (AI) tool called SpinePose that automatically predicts spinopelvic parameters with high accuracy without the need for manual entry.
Methods. SpinePose was trained and validated on 761 sagittal whole-spine X-rays to predict sagittal vertical axis (SVA), pelvic tilt (PT), pelvic incidence (PI), sacral slope (SS), lumbar lordosis (LL), T1-pelvic angle (T1PA), and L1-pelvic angle (L1PA). A separate test set of 40 X-rays was labeled by 4 reviewers, including fellowship-trained spine surgeons and a fellowship-trained radiologist with neuroradiology subspecialty certification. Median errors relative to the most senior reviewer were calculated to determine model accuracy on test images. Intraclass correlation coefficients (ICC) were used to assess inter-rater reliability.
Results. SpinePose exhibited the following median (interquartile range) parameter errors: SVA: 2.2(2.3)mm, p=0.93; PT: 1.3(1.2)°, p=0.48; SS: 1.7(2.2)°, p=0.64; PI: 2.2(2.1)°, p=0.24; LL: 2.6(4.0)°, p=0.89; T1PA: 1.1(0.9)°, p=0.42; and L1PA: 1.4(1.6)°, p=0.49. Model predictions also exhibited excellent reliability at all parameters (ICC: 0.91-1.0).
Conclusions. SpinePose accurately predicted spinopelvic parameters with excellent reliability comparable to fellowship-trained spine surgeons and neuroradiologists. Utilization of predictive AI tools in spinal imaging can substantially aid in patient selection and surgical planning.
△ Less
Submitted 8 February, 2024;
originally announced February 2024.
-
PeaTMOSS: A Dataset and Initial Analysis of Pre-Trained Models in Open-Source Software
Authors:
Wenxin Jiang,
Jerin Yasmin,
Jason Jones,
Nicholas Synovic,
Jiashen Kuo,
Nathaniel Bielanski,
Yuan Tian,
George K. Thiruvathukal,
James C. Davis
Abstract:
The development and training of deep learning models have become increasingly costly and complex. Consequently, software engineers are adopting pre-trained models (PTMs) for their downstream applications. The dynamics of the PTM supply chain remain largely unexplored, signaling a clear need for structured datasets that document not only the metadata but also the subsequent applications of these mo…
▽ More
The development and training of deep learning models have become increasingly costly and complex. Consequently, software engineers are adopting pre-trained models (PTMs) for their downstream applications. The dynamics of the PTM supply chain remain largely unexplored, signaling a clear need for structured datasets that document not only the metadata but also the subsequent applications of these models. Without such data, the MSR community cannot comprehensively understand the impact of PTM adoption and reuse. This paper presents the PeaTMOSS dataset, which comprises metadata for 281,638 PTMs and detailed snapshots for all PTMs with over 50 monthly downloads (14,296 PTMs), along with 28,575 open-source software repositories from GitHub that utilize these models. Additionally, the dataset includes 44,337 mappings from 15,129 downstream GitHub repositories to the 2,530 PTMs they use. To enhance the dataset's comprehensiveness, we developed prompts for a large language model to automatically extract model metadata, including the model's training datasets, parameters, and evaluation metrics. Our analysis of this dataset provides the first summary statistics for the PTM supply chain, showing the trend of PTM development and common shortcomings of PTM package documentation. Our example application reveals inconsistencies in software licenses across PTMs and their dependent projects. PeaTMOSS lays the foundation for future research, offering rich opportunities to investigate the PTM supply chain. We outline mining opportunities on PTMs, their downstream usage, and cross-cutting questions.
△ Less
Submitted 1 February, 2024;
originally announced February 2024.
-
A Hybrid Classical-Quantum HPC Workload
Authors:
Aniello Esposito,
Sebastien Cabaniols,
Jessica R. Jones,
David Brayford
Abstract:
A strategy for the orchestration of hybrid classical-quantum workloads on supercomputers featuring quantum devices is proposed. The method makes use of heterogeneous job launches with Slurm to interleave classical and quantum computation, thereby reducing idle time of the quantum components. To better understand the possible shortcomings and bottlenecks of such a workload, an example application i…
▽ More
A strategy for the orchestration of hybrid classical-quantum workloads on supercomputers featuring quantum devices is proposed. The method makes use of heterogeneous job launches with Slurm to interleave classical and quantum computation, thereby reducing idle time of the quantum components. To better understand the possible shortcomings and bottlenecks of such a workload, an example application is investigated that offloads parts of the computation to a quantum device. It executes on a classical HPC system, with a server mimicking the quantum device, within the MPMD paradigm in Slurm. Quantum circuits are synthesized by means of the Classiq software suite according to the needs of the scientific application, and the Qiskit Aer circuit simulator computes the state vectors. The HHL quantum algorithm for linear systems of equations is used to solve the algebraic problem from the discretization of a linear differential equation. Communication takes place over the MPI, which is broadly employed in the HPC community. Extraction of state vectors and circuit synthesis are the most time consuming, while communication is negligible in this setup. The present test bed serves as a basis for more advanced hybrid workloads eventually involving a real quantum device.
△ Less
Submitted 8 December, 2023;
originally announced December 2023.
-
PeaTMOSS: Mining Pre-Trained Models in Open-Source Software
Authors:
Wenxin Jiang,
Jason Jones,
Jerin Yasmin,
Nicholas Synovic,
Rajeev Sashti,
Sophie Chen,
George K. Thiruvathukal,
Yuan Tian,
James C. Davis
Abstract:
Developing and training deep learning models is expensive, so software engineers have begun to reuse pre-trained deep learning models (PTMs) and fine-tune them for downstream tasks. Despite the wide-spread use of PTMs, we know little about the corresponding software engineering behaviors and challenges.
To enable the study of software engineering with PTMs, we present the PeaTMOSS dataset: Pre-T…
▽ More
Developing and training deep learning models is expensive, so software engineers have begun to reuse pre-trained deep learning models (PTMs) and fine-tune them for downstream tasks. Despite the wide-spread use of PTMs, we know little about the corresponding software engineering behaviors and challenges.
To enable the study of software engineering with PTMs, we present the PeaTMOSS dataset: Pre-Trained Models in Open-Source Software. PeaTMOSS has three parts: a snapshot of (1) 281,638 PTMs, (2) 27,270 open-source software repositories that use PTMs, and (3) a mapping between PTMs and the projects that use them. We challenge PeaTMOSS miners to discover software engineering practices around PTMs. A demo and link to the full dataset are available at: https://github.com/PurdueDualityLab/PeaTMOSS-Demos.
△ Less
Submitted 5 October, 2023;
originally announced October 2023.
-
Robust Frame-to-Frame Camera Rotation Estimation in Crowded Scenes
Authors:
Fabien Delattre,
David Dirnfeld,
Phat Nguyen,
Stephen Scarano,
Michael J. Jones,
Pedro Miraldo,
Erik Learned-Miller
Abstract:
We present an approach to estimating camera rotation in crowded, real-world scenes from handheld monocular video. While camera rotation estimation is a well-studied problem, no previous methods exhibit both high accuracy and acceptable speed in this setting. Because the setting is not addressed well by other datasets, we provide a new dataset and benchmark, with high-accuracy, rigorously verified…
▽ More
We present an approach to estimating camera rotation in crowded, real-world scenes from handheld monocular video. While camera rotation estimation is a well-studied problem, no previous methods exhibit both high accuracy and acceptable speed in this setting. Because the setting is not addressed well by other datasets, we provide a new dataset and benchmark, with high-accuracy, rigorously verified ground truth, on 17 video sequences. Methods developed for wide baseline stereo (e.g., 5-point methods) perform poorly on monocular video. On the other hand, methods used in autonomous driving (e.g., SLAM) leverage specific sensor setups, specific motion models, or local optimization strategies (lagging batch processing) and do not generalize well to handheld video. Finally, for dynamic scenes, commonly used robustification techniques like RANSAC require large numbers of iterations, and become prohibitively slow. We introduce a novel generalization of the Hough transform on SO(3) to efficiently and robustly find the camera rotation most compatible with optical flow. Among comparably fast methods, ours reduces error by almost 50\% over the next best, and is more accurate than any method, irrespective of speed. This represents a strong new performance point for crowded scenes, an important setting for computer vision. The code and the dataset are available at https://fabiendelattre.com/robust-rotation-estimation.
△ Less
Submitted 15 September, 2023;
originally announced September 2023.
-
Fortran High-Level Synthesis: Reducing the barriers to accelerating HPC codes on FPGAs
Authors:
Gabriel Rodriguez-Canal,
Nick Brown,
Tim Dykes,
Jessica R. Jones,
Utz-Uwe Haus
Abstract:
In recent years the use of FPGAs to accelerate scientific applications has grown, with numerous applications demonstrating the benefit of FPGAs for high performance workloads. However, whilst High Level Synthesis (HLS) has significantly lowered the barrier to entry in programming FPGAs by enabling programmers to use C++, a major challenge is that most often these codes are not originally written i…
▽ More
In recent years the use of FPGAs to accelerate scientific applications has grown, with numerous applications demonstrating the benefit of FPGAs for high performance workloads. However, whilst High Level Synthesis (HLS) has significantly lowered the barrier to entry in programming FPGAs by enabling programmers to use C++, a major challenge is that most often these codes are not originally written in C++. Instead, Fortran is the lingua franca of scientific computing and-so it requires a complex and time consuming initial step to convert into C++ even before considering the FPGA.
In this paper we describe work enabling Fortran for AMD Xilinx FPGAs by connecting the LLVM Flang front end to AMD Xilinx's LLVM back end. This enables programmers to use Fortran as a first-class language for programming FPGAs, and as we demonstrate enjoy all the tuning and optimisation opportunities that HLS C++ provides. Furthermore, we demonstrate that certain language features of Fortran make it especially beneficial for programming FPGAs compared to C++. The result of this work is a lowering of the barrier to entry in using FPGAs for scientific computing, enabling programmers to leverage their existing codebase and language of choice on the FPGA directly.
△ Less
Submitted 25 August, 2023;
originally announced August 2023.
-
Observation of high-energy neutrinos from the Galactic plane
Authors:
R. Abbasi,
M. Ackermann,
J. Adams,
J. A. Aguilar,
M. Ahlers,
M. Ahrens,
J. M. Alameddine,
A. A. Alves Jr.,
N. M. Amin,
K. Andeen,
T. Anderson,
G. Anton,
C. Argüelles,
Y. Ashida,
S. Athanasiadou,
S. Axani,
X. Bai,
A. Balagopal V.,
S. W. Barwick,
V. Basu,
S. Baur,
R. Bay,
J. J. Beatty,
K. -H. Becker,
J. Becker Tjus
, et al. (364 additional authors not shown)
Abstract:
The origin of high-energy cosmic rays, atomic nuclei that continuously impact Earth's atmosphere, has been a mystery for over a century. Due to deflection in interstellar magnetic fields, cosmic rays from the Milky Way arrive at Earth from random directions. However, near their sources and during propagation, cosmic rays interact with matter and produce high-energy neutrinos. We search for neutrin…
▽ More
The origin of high-energy cosmic rays, atomic nuclei that continuously impact Earth's atmosphere, has been a mystery for over a century. Due to deflection in interstellar magnetic fields, cosmic rays from the Milky Way arrive at Earth from random directions. However, near their sources and during propagation, cosmic rays interact with matter and produce high-energy neutrinos. We search for neutrino emission using machine learning techniques applied to ten years of data from the IceCube Neutrino Observatory. We identify neutrino emission from the Galactic plane at the 4.5$σ$ level of significance, by comparing diffuse emission models to a background-only hypothesis. The signal is consistent with modeled diffuse emission from the Galactic plane, but could also arise from a population of unresolved point sources.
△ Less
Submitted 10 July, 2023;
originally announced July 2023.
-
A Practical Overview of Quantum Computing: Is Exascale Possible?
Authors:
James H. Davenport,
Jessica R. Jones,
Matthew Thomason
Abstract:
Despite numerous advances in the field and a seemingly ever-increasing amount of investment, we are still some years away from seeing a production quantum computer in action. However, it is possible to make some educated guesses about the operational difficulties and challenges that may be encountered in practice. We can be reasonably confident that the early machines will be hybrid, with the quan…
▽ More
Despite numerous advances in the field and a seemingly ever-increasing amount of investment, we are still some years away from seeing a production quantum computer in action. However, it is possible to make some educated guesses about the operational difficulties and challenges that may be encountered in practice. We can be reasonably confident that the early machines will be hybrid, with the quantum devices used in an apparently similar way to current accelerators such as FPGAs or GPUs. Compilers, libraries and the other tools relied upon currently for development of software will have to evolve/be reinvented to support the new technology, and training courses will have to be rethought completely rather than ``just'' updated alongside them.
The workloads we are likely to see making best use of these hybrid machines will initially be few, before rapidly increasing in diversity as we saw with the uptake of GPUs and other new technologies in the past. This will again be helped by the increase in the number of supporting libraries and development tools, and by the gradual re-development of existing software, to make use of the new quantum devices.
Unfortunately, at present the problem of error correction is still largely unsolved, although there have been many advances. Quantum computation is very sensitive to noise, leading to frequent errors during execution.
Quantum calculations, although asymptotically faster than their equivalents in ``traditional'' HPC, still take time, and while the profiling tools and programming approaches will have to change drastically, many of the skills honed in the current HPC industry will not suddenly become obsolete, but continue to be useful in the quantum era.
△ Less
Submitted 21 June, 2023;
originally announced June 2023.
-
Generative AI: Implications and Applications for Education
Authors:
Anastasia Olga,
Tzirides,
Akash Saini,
Gabriela Zapata,
Duane Searsmith,
Bill Cope,
Mary Kalantzis,
Vania Castro,
Theodora Kourkoulou,
John Jones,
Rodrigo Abrantes da Silva,
Jen Whiting,
Nikoleta Polyxeni Kastania
Abstract:
The launch of ChatGPT in November 2022 precipitated a panic among some educators while prompting qualified enthusiasm from others. Under the umbrella term Generative AI, ChatGPT is an example of a range of technologies for the delivery of computer-generated text, image, and other digitized media. This paper examines the implications for education of one generative AI technology, chatbots respondin…
▽ More
The launch of ChatGPT in November 2022 precipitated a panic among some educators while prompting qualified enthusiasm from others. Under the umbrella term Generative AI, ChatGPT is an example of a range of technologies for the delivery of computer-generated text, image, and other digitized media. This paper examines the implications for education of one generative AI technology, chatbots responding from large language models, or C-LLM. It reports on an application of a C-LLM to AI review and assessment of complex student work. In a concluding discussion, the paper explores the intrinsic limits of generative AI, bound as it is to language corpora and their textual representation through binary notation. Within these limits, we suggest the range of emerging and potential applications of Generative AI in education.
△ Less
Submitted 22 May, 2023; v1 submitted 12 May, 2023;
originally announced May 2023.
-
Examining the Potential for Conversational Exploratory Search using a Smart Speaker Digital Assistant
Authors:
Abhishek Kaushik,
Gareth J. F. Jones
Abstract:
Online Digital Assistants, such as Amazon Alexa, Google Assistant, Apple Siri are very popular and provide a range or services to their users, a key function is their ability to satisfy user information needs from the sources available to them. Users may often regard these applications as providing search services similar to Google type search engines. However, while it is clear that they are in g…
▽ More
Online Digital Assistants, such as Amazon Alexa, Google Assistant, Apple Siri are very popular and provide a range or services to their users, a key function is their ability to satisfy user information needs from the sources available to them. Users may often regard these applications as providing search services similar to Google type search engines. However, while it is clear that they are in general able to answer factoid questions effectively, it is much less obvious how well they support less specific or exploratory type search tasks. We describe an investigation examining the behaviour of the standard Amazon Alexa for exploratory search tasks. The results of our study show that it not effective in addressing these types of information needs. We propose extensions to Alexa designed to overcome these shortcomings. Our Custom Alexa application extends Alexa's conversational functionality for exploratory search. A user study shows that our extended Alexa application both enables users to more successfully complete exploratory search tasks and is well accepted by our test users.
△ Less
Submitted 18 March, 2023;
originally announced March 2023.
-
Comparing Conventional and Conversational Search Interaction using Implicit Evaluation Methods
Authors:
Abhishek Kaushik,
Gareth J. F. Jones
Abstract:
Conversational search applications offer the prospect of improved user experience in information seeking via agent support. However, it is not clear how searchers will respond to this mode of engagement, in comparison to a conventional user-driven search interface, such as those found in a standard web search engine. We describe a laboratory-based study directly comparing user behaviour for a conv…
▽ More
Conversational search applications offer the prospect of improved user experience in information seeking via agent support. However, it is not clear how searchers will respond to this mode of engagement, in comparison to a conventional user-driven search interface, such as those found in a standard web search engine. We describe a laboratory-based study directly comparing user behaviour for a conventional search interface (CSI) with that of an agent-mediated multiview conversational search interface (MCSI) which extends the CSI. User reaction and search outcomes of the two interfaces are compared using implicit evaluation using five analysis methods: claiming to have a better search experience in contrast to a corresponding standard search interface.
△ Less
Submitted 18 March, 2023; v1 submitted 16 March, 2023;
originally announced March 2023.
-
Computational-level Analysis of Constraint Compliance for General Intelligence
Authors:
Robert E. Wray,
Steven J. Jones,
John E. Laird
Abstract:
Human behavior is conditioned by codes and norms that constrain action. Rules, ``manners,'' laws, and moral imperatives are examples of classes of constraints that govern human behavior. These systems of constraints are "messy:" individual constraints are often poorly defined, what constraints are relevant in a particular situation may be unknown or ambiguous, constraints interact and conflict wit…
▽ More
Human behavior is conditioned by codes and norms that constrain action. Rules, ``manners,'' laws, and moral imperatives are examples of classes of constraints that govern human behavior. These systems of constraints are "messy:" individual constraints are often poorly defined, what constraints are relevant in a particular situation may be unknown or ambiguous, constraints interact and conflict with one another, and determining how to act within the bounds of the relevant constraints may be a significant challenge, especially when rapid decisions are needed. Despite such messiness, humans incorporate constraints in their decisions robustly and rapidly. General, artificially-intelligent agents must also be able to navigate the messiness of systems of real-world constraints in order to behave predictability and reliably. In this paper, we characterize sources of complexity in constraint processing for general agents and describe a computational-level analysis for such constraint compliance. We identify key algorithmic requirements based on the computational-level analysis and outline an initial, exploratory implementation of a general approach to constraint compliance.
△ Less
Submitted 15 June, 2023; v1 submitted 7 March, 2023;
originally announced March 2023.
-
Interpret Your Care: Predicting the Evolution of Symptoms for Cancer Patients
Authors:
Rupali Bhati,
Jennifer Jones,
Audrey Durand
Abstract:
Cancer treatment is an arduous process for patients and causes many side-effects during and post-treatment. The treatment can affect almost all body systems and result in pain, fatigue, sleep disturbances, cognitive impairments, etc. These conditions are often under-diagnosed or under-treated. In this paper, we use patient data to predict the evolution of their symptoms such that treatment-related…
▽ More
Cancer treatment is an arduous process for patients and causes many side-effects during and post-treatment. The treatment can affect almost all body systems and result in pain, fatigue, sleep disturbances, cognitive impairments, etc. These conditions are often under-diagnosed or under-treated. In this paper, we use patient data to predict the evolution of their symptoms such that treatment-related impairments can be prevented or effects meaningfully ameliorated. The focus of this study is on predicting the pain and tiredness level of a patient post their diagnosis. We implement an interpretable decision tree based model called LightGBM on real-world patient data consisting of 20163 patients. There exists a class imbalance problem in the dataset which we resolve using the oversampling technique of SMOTE. Our empirical results show that the value of the previous level of a symptom is a key indicator for prediction and the weighted average deviation in prediction of pain level is 3.52 and of tiredness level is 2.27.
△ Less
Submitted 19 February, 2023;
originally announced February 2023.
-
Improving Noise Robustness for Spoken Content Retrieval using Semi-supervised ASR and N-best Transcripts for BERT-based Ranking Models
Authors:
Yasufumi Moriya,
Gareth. J. F. Jones
Abstract:
BERT-based re-ranking and dense retrieval (DR) systems have been shown to improve search effectiveness for spoken content retrieval (SCR). However, both methods can still show a reduction in effectiveness when using ASR transcripts in comparison to accurate manual transcripts. We find that a known-item search task on the How2 dataset of spoken instruction videos shows a reduction in mean reciproca…
▽ More
BERT-based re-ranking and dense retrieval (DR) systems have been shown to improve search effectiveness for spoken content retrieval (SCR). However, both methods can still show a reduction in effectiveness when using ASR transcripts in comparison to accurate manual transcripts. We find that a known-item search task on the How2 dataset of spoken instruction videos shows a reduction in mean reciprocal rank (MRR) scores of 10-14%. As a potential method to reduce this disparity, we investigate the use of semi-supervised ASR transcripts and N-best ASR transcripts to mitigate ASR errors for spoken search using BERT-based ranking. Semi-supervised ASR transcripts brought 2-5.5% MRR improvements over standard ASR transcripts and our N-best early fusion methods for BERT DR systems improved MRR by 3-4%. Combining semi-supervised transcripts with N-best early fusion for BERT DR reduced the MRR gap in search effectiveness between manual and ASR transcripts by more than 50% from 14.32% to 6.58%.
△ Less
Submitted 15 January, 2023;
originally announced January 2023.
-
EVAL: Explainable Video Anomaly Localization
Authors:
Ashish Singh,
Michael J. Jones,
Erik Learned-Miller
Abstract:
We develop a novel framework for single-scene video anomaly localization that allows for human-understandable reasons for the decisions the system makes. We first learn general representations of objects and their motions (using deep networks) and then use these representations to build a high-level, location-dependent model of any particular scene. This model can be used to detect anomalies in ne…
▽ More
We develop a novel framework for single-scene video anomaly localization that allows for human-understandable reasons for the decisions the system makes. We first learn general representations of objects and their motions (using deep networks) and then use these representations to build a high-level, location-dependent model of any particular scene. This model can be used to detect anomalies in new videos of the same scene. Importantly, our approach is explainable - our high-level appearance and motion features can provide human-understandable reasons for why any part of a video is classified as normal or anomalous. We conduct experiments on standard video anomaly detection datasets (Street Scene, CUHK Avenue, ShanghaiTech and UCSD Ped1, Ped2) and show significant improvements over the previous state-of-the-art.
△ Less
Submitted 15 December, 2022;
originally announced December 2022.
-
Transformer-based normative modelling for anomaly detection of early schizophrenia
Authors:
Pedro F Da Costa,
Jessica Dafflon,
Sergio Leonardo Mendes,
João Ricardo Sato,
M. Jorge Cardoso,
Robert Leech,
Emily JH Jones,
Walter H. L. Pinaya
Abstract:
Despite the impact of psychiatric disorders on clinical health, early-stage diagnosis remains a challenge. Machine learning studies have shown that classifiers tend to be overly narrow in the diagnosis prediction task. The overlap between conditions leads to high heterogeneity among participants that is not adequately captured by classification models. To address this issue, normative approaches h…
▽ More
Despite the impact of psychiatric disorders on clinical health, early-stage diagnosis remains a challenge. Machine learning studies have shown that classifiers tend to be overly narrow in the diagnosis prediction task. The overlap between conditions leads to high heterogeneity among participants that is not adequately captured by classification models. To address this issue, normative approaches have surged as an alternative method. By using a generative model to learn the distribution of healthy brain data patterns, we can identify the presence of pathologies as deviations or outliers from the distribution learned by the model. In particular, deep generative models showed great results as normative models to identify neurological lesions in the brain. However, unlike most neurological lesions, psychiatric disorders present subtle changes widespread in several brain regions, making these alterations challenging to identify. In this work, we evaluate the performance of transformer-based normative models to detect subtle brain changes expressed in adolescents and young adults. We trained our model on 3D MRI scans of neurotypical individuals (N=1,765). Then, we obtained the likelihood of neurotypical controls and psychiatric patients with early-stage schizophrenia from an independent dataset (N=93) from the Human Connectome Project. Using the predicted likelihood of the scans as a proxy for a normative score, we obtained an AUROC of 0.82 when assessing the difference between controls and individuals with early-stage schizophrenia. Our approach surpassed recent normative methods based on brain age and Gaussian Process, showing the promising use of deep generative models to help in individualised analyses.
△ Less
Submitted 8 December, 2022;
originally announced December 2022.
-
Cross-Modal Knowledge Transfer Without Task-Relevant Source Data
Authors:
Sk Miraj Ahmed,
Suhas Lohit,
Kuan-Chuan Peng,
Michael J. Jones,
Amit K. Roy-Chowdhury
Abstract:
Cost-effective depth and infrared sensors as alternatives to usual RGB sensors are now a reality, and have some advantages over RGB in domains like autonomous navigation and remote sensing. As such, building computer vision and deep learning systems for depth and infrared data are crucial. However, large labeled datasets for these modalities are still lacking. In such cases, transferring knowledge…
▽ More
Cost-effective depth and infrared sensors as alternatives to usual RGB sensors are now a reality, and have some advantages over RGB in domains like autonomous navigation and remote sensing. As such, building computer vision and deep learning systems for depth and infrared data are crucial. However, large labeled datasets for these modalities are still lacking. In such cases, transferring knowledge from a neural network trained on a well-labeled large dataset in the source modality (RGB) to a neural network that works on a target modality (depth, infrared, etc.) is of great value. For reasons like memory and privacy, it may not be possible to access the source data, and knowledge transfer needs to work with only the source models. We describe an effective solution, SOCKET: SOurce-free Cross-modal KnowledgE Transfer for this challenging task of transferring knowledge from one source modality to a different target modality without access to task-relevant source data. The framework reduces the modality gap using paired task-irrelevant data, as well as by matching the mean and variance of the target features with the batch-norm statistics that are present in the source models. We show through extensive experiments that our method significantly outperforms existing source-free methods for classification tasks which do not account for the modality gap.
△ Less
Submitted 8 September, 2022;
originally announced September 2022.
-
Graph Neural Networks for Low-Energy Event Classification & Reconstruction in IceCube
Authors:
R. Abbasi,
M. Ackermann,
J. Adams,
N. Aggarwal,
J. A. Aguilar,
M. Ahlers,
M. Ahrens,
J. M. Alameddine,
A. A. Alves Jr.,
N. M. Amin,
K. Andeen,
T. Anderson,
G. Anton,
C. Argüelles,
Y. Ashida,
S. Athanasiadou,
S. Axani,
X. Bai,
A. Balagopal V.,
M. Baricevic,
S. W. Barwick,
V. Basu,
R. Bay,
J. J. Beatty,
K. -H. Becker
, et al. (359 additional authors not shown)
Abstract:
IceCube, a cubic-kilometer array of optical sensors built to detect atmospheric and astrophysical neutrinos between 1 GeV and 1 PeV, is deployed 1.45 km to 2.45 km below the surface of the ice sheet at the South Pole. The classification and reconstruction of events from the in-ice detectors play a central role in the analysis of data from IceCube. Reconstructing and classifying events is a challen…
▽ More
IceCube, a cubic-kilometer array of optical sensors built to detect atmospheric and astrophysical neutrinos between 1 GeV and 1 PeV, is deployed 1.45 km to 2.45 km below the surface of the ice sheet at the South Pole. The classification and reconstruction of events from the in-ice detectors play a central role in the analysis of data from IceCube. Reconstructing and classifying events is a challenge due to the irregular detector geometry, inhomogeneous scattering and absorption of light in the ice and, below 100 GeV, the relatively low number of signal photons produced per event. To address this challenge, it is possible to represent IceCube events as point cloud graphs and use a Graph Neural Network (GNN) as the classification and reconstruction method. The GNN is capable of distinguishing neutrino events from cosmic-ray backgrounds, classifying different neutrino event types, and reconstructing the deposited energy, direction and interaction vertex. Based on simulation, we provide a comparison in the 1-100 GeV energy range to the current state-of-the-art maximum likelihood techniques used in current IceCube analyses, including the effects of known systematic uncertainties. For neutrino event classification, the GNN increases the signal efficiency by 18% at a fixed false positive rate (FPR), compared to current IceCube methods. Alternatively, the GNN offers a reduction of the FPR by over a factor 8 (to below half a percent) at a fixed signal efficiency. For the reconstruction of energy, direction, and interaction vertex, the resolution improves by an average of 13%-20% compared to current maximum likelihood techniques in the energy range of 1-30 GeV. The GNN, when run on a GPU, is capable of processing IceCube events at a rate nearly double of the median IceCube trigger rate of 2.7 kHz, which opens the possibility of using low energy neutrinos in online searches for transient events.
△ Less
Submitted 11 October, 2022; v1 submitted 7 September, 2022;
originally announced September 2022.
-
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Authors:
Aarohi Srivastava,
Abhinav Rastogi,
Abhishek Rao,
Abu Awal Md Shoeb,
Abubakar Abid,
Adam Fisch,
Adam R. Brown,
Adam Santoro,
Aditya Gupta,
Adrià Garriga-Alonso,
Agnieszka Kluska,
Aitor Lewkowycz,
Akshat Agarwal,
Alethea Power,
Alex Ray,
Alex Warstadt,
Alexander W. Kocurek,
Ali Safaya,
Ali Tazarv,
Alice Xiang,
Alicia Parrish,
Allen Nie,
Aman Hussain,
Amanda Askell,
Amanda Dsouza
, et al. (426 additional authors not shown)
Abstract:
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-futur…
▽ More
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
△ Less
Submitted 12 June, 2023; v1 submitted 9 June, 2022;
originally announced June 2022.
-
Online User Profiling to Detect Social Bots on Twitter
Authors:
Maryam Heidari,
James H Jr Jones,
Ozlem Uzuner
Abstract:
Social media platforms can expose influential trends in many aspects of everyday life. However, the movements they represent can be contaminated by disinformation. Social bots are one of the significant sources of disinformation in social media. Social bots can pose serious cyber threats to society and public opinion. This research aims to develop machine learning models to detect bots based on th…
▽ More
Social media platforms can expose influential trends in many aspects of everyday life. However, the movements they represent can be contaminated by disinformation. Social bots are one of the significant sources of disinformation in social media. Social bots can pose serious cyber threats to society and public opinion. This research aims to develop machine learning models to detect bots based on the extracted user's profile from a Tweet's text. Online users' profile shows the user's personal information, such as age, gender, education, and personality. In this work, the user's profile is constructed based on the user's online posts. This work's main contribution is three-fold: First, we aim to improve bot detection through machine learning models based on the user's personal information generated by the user's online comments. When comparing two online posts, the similarity of personal information makes it difficult to differentiate a bot from a human user. However, this research turns personal information similarity among two online posts into an advantage for the new bot detection model. The new proposed model for bot detection creates user profiles based on personal information such as age, personality, gender, education from users' online posts and introduces a machine learning model to detect social bots with high prediction accuracy based on personal information. Second, create a new public data set that shows the user's profile for more than 6900 Twitter accounts in the Cresci 2017 data set.
△ Less
Submitted 9 March, 2022;
originally announced March 2022.
-
Achieving Reliable Human Assessment of Open-Domain Dialogue Systems
Authors:
Tianbo Ji,
Yvette Graham,
Gareth J. F. Jones,
Chenyang Lyu,
Qun Liu
Abstract:
Evaluation of open-domain dialogue systems is highly challenging and development of better techniques is highlighted time and again as desperately needed. Despite substantial efforts to carry out reliable live evaluation of systems in recent competitions, annotations have been abandoned and reported as too unreliable to yield sensible results. This is a serious problem since automatic metrics are…
▽ More
Evaluation of open-domain dialogue systems is highly challenging and development of better techniques is highlighted time and again as desperately needed. Despite substantial efforts to carry out reliable live evaluation of systems in recent competitions, annotations have been abandoned and reported as too unreliable to yield sensible results. This is a serious problem since automatic metrics are not known to provide a good indication of what may or may not be a high-quality conversation. Answering the distress call of competitions that have emphasized the urgent need for better evaluation techniques in dialogue, we present the successful development of human evaluation that is highly reliable while still remaining feasible and low cost. Self-replication experiments reveal almost perfectly repeatable results with a correlation of $r=0.969$. Furthermore, due to the lack of appropriate methods of statistical significance testing, the likelihood of potential improvements to systems occurring due to chance is rarely taken into account in dialogue evaluation, and the evaluation we propose facilitates application of standard tests. Since we have developed a highly reliable evaluation method, new insights into system performance can be revealed. We therefore include a comparison of state-of-the-art models (i) with and without personas, to measure the contribution of personas to conversation quality, as well as (ii) prescribed versus freely chosen topics. Interestingly with respect to personas, results indicate that personas do not positively contribute to conversation quality as expected.
△ Less
Submitted 11 March, 2022;
originally announced March 2022.
-
Smart Pointers and Shared Memory Synchronisation for Efficient Inter-process Communication in ROS on an Autonomous Vehicle
Authors:
Costin Iordache,
Stephen M. Fendyke,
Mike J. Jones,
Robert A. Buckley
Abstract:
Despite the stringent requirements of a real-time system, the reliance of the Robot Operating System (ROS) on the loopback network interface imposes a considerable overhead on the transport of high bandwidth data, while the nodelet package, which is an efficient mechanism for intra-process communication, does not address the problem of efficient local inter-process communication (IPC). To remedy t…
▽ More
Despite the stringent requirements of a real-time system, the reliance of the Robot Operating System (ROS) on the loopback network interface imposes a considerable overhead on the transport of high bandwidth data, while the nodelet package, which is an efficient mechanism for intra-process communication, does not address the problem of efficient local inter-process communication (IPC). To remedy this, we propose a novel integration into ROS of smart pointers and synchronisation primitives stored in shared memory. These obey the same semantics and, more importantly, exhibit the same performance as their C++ standard library counterparts, making them preferable to other local IPC mechanisms. We present a series of benchmarks for our mechanism - which we call LOT (Low Overhead Transport) - and use them to assess its performance on realistic data loads based on Five's Autonomous Vehicle (AV) system, and extend our analysis to the case where multiple ROS nodes are running in Docker containers. We find that our mechanism performs up to two orders of magnitude better than the standard IPC via local loopback. Finally, we apply industry-standard profiling techniques to explore the hotspots of code running in both user and kernel space, comparing our implementation against alternatives.
△ Less
Submitted 16 August, 2021;
originally announced August 2021.
-
Translation Quality Assessment: A Brief Survey on Manual and Automatic Methods
Authors:
Lifeng Han,
Gareth J. F. Jones,
Alan F. Smeaton
Abstract:
To facilitate effective translation modeling and translation studies, one of the crucial questions to address is how to assess translation quality. From the perspectives of accuracy, reliability, repeatability and cost, translation quality assessment (TQA) itself is a rich and challenging task. In this work, we present a high-level and concise survey of TQA methods, including both manual judgement…
▽ More
To facilitate effective translation modeling and translation studies, one of the crucial questions to address is how to assess translation quality. From the perspectives of accuracy, reliability, repeatability and cost, translation quality assessment (TQA) itself is a rich and challenging task. In this work, we present a high-level and concise survey of TQA methods, including both manual judgement criteria and automated evaluation metrics, which we classify into further detailed sub-categories. We hope that this work will be an asset for both translation model researchers and quality assessment researchers. In addition, we hope that it will enable practitioners to quickly develop a better understanding of the conventional TQA field, and to find corresponding closely relevant evaluation solutions for their own needs. This work may also serve inspire further development of quality assessment and evaluation methodologies for other natural language processing (NLP) tasks in addition to machine translation (MT), such as automatic text summarization (ATS), natural language understanding (NLU) and natural language generation (NLG).
△ Less
Submitted 5 May, 2021;
originally announced May 2021.
-
TRECVID 2020: A comprehensive campaign for evaluating video retrieval tasks across multiple application domains
Authors:
George Awad,
Asad A. Butt,
Keith Curtis,
Jonathan Fiscus,
Afzal Godil,
Yooyoung Lee,
Andrew Delgado,
Jesse Zhang,
Eliot Godard,
Baptiste Chocot,
Lukas Diduch,
Jeffrey Liu,
Alan F. Smeaton,
Yvette Graham,
Gareth J. F. Jones,
Wessel Kraaij,
Georges Quenot
Abstract:
The TREC Video Retrieval Evaluation (TRECVID) is a TREC-style video analysis and retrieval evaluation with the goal of promoting progress in research and development of content-based exploitation and retrieval of information from digital video via open, metrics-based evaluation. Over the last twenty years this effort has yielded a better understanding of how systems can effectively accomplish such…
▽ More
The TREC Video Retrieval Evaluation (TRECVID) is a TREC-style video analysis and retrieval evaluation with the goal of promoting progress in research and development of content-based exploitation and retrieval of information from digital video via open, metrics-based evaluation. Over the last twenty years this effort has yielded a better understanding of how systems can effectively accomplish such processing and how one can reliably benchmark their performance. TRECVID has been funded by NIST (National Institute of Standards and Technology) and other US government agencies. In addition, many organizations and individuals worldwide contribute significant time and effort. TRECVID 2020 represented a continuation of four tasks and the addition of two new tasks. In total, 29 teams from various research organizations worldwide completed one or more of the following six tasks: 1. Ad-hoc Video Search (AVS), 2. Instance Search (INS), 3. Disaster Scene Description and Indexing (DSDI), 4. Video to Text Description (VTT), 5. Activities in Extended Video (ActEV), 6. Video Summarization (VSUM). This paper is an introduction to the evaluation framework, tasks, data, and measures used in the evaluation campaign.
△ Less
Submitted 27 April, 2021;
originally announced April 2021.