-
Unified Modeling Language Code Generation from Diagram Images Using Multimodal Large Language Models
Authors:
Averi Bates,
Ryan Vavricka,
Shane Carleton,
Ruosi Shao,
Chongle Pan
Abstract:
The Unified Modeling Language (UML) is a standardized visual language widely used for modeling and documenting the design of software systems. Although many tools generate UML diagrams from UML code, generating executable UML code from image-based UML diagrams remains challenging. This paper proposes a new approach that automatically generates UML code using a multimodal large language model (MM-LLM). Synthetic UML activity and sequence diagram datasets were created to train and test the model. We compared standard fine-tuning with LoRA techniques to optimize base models. The experiments measured code generation accuracy across different model sizes and training strategies. The results demonstrate that domain-adapted MM-LLMs can automate UML code generation, with the best model achieving BLEU and SSIM scores of 0.779 and 0.942, respectively, on sequence diagrams. This can enable the modernization of legacy systems and decrease the manual effort in software development workflows.
Submitted 15 May, 2025; v1 submitted 15 March, 2025;
originally announced March 2025.
-
Integration of nested cross-validation, automated hyperparameter optimization, high-performance computing to reduce and quantify the variance of test performance estimation of deep learning models
Authors:
Paul Calle,
Averi Bates,
Justin C. Reynolds,
Yunlong Liu,
Haoyang Cui,
Sinaro Ly,
Chen Wang,
Qinghao Zhang,
Alberto J. de Armendi,
Shashank S. Shettar,
Kar Ming Fung,
Qinggong Tang,
Chongle Pan
Abstract:
The variability and biases in the real-world performance benchmarking of deep learning models for medical imaging compromise their trustworthiness for real-world deployment. The common approach of holding out a single fixed test set fails to quantify the variance in the estimation of test performance metrics. This study introduces NACHOS (Nested and Automated Cross-validation and Hyperparameter Optimization using Supercomputing) to reduce and quantify the variance of test performance metrics of deep learning models. NACHOS integrates Nested Cross-Validation (NCV) and Automated Hyperparameter Optimization (AHPO) within a parallelized high-performance computing (HPC) framework. NACHOS was demonstrated on a chest X-ray repository and an Optical Coherence Tomography (OCT) dataset under multiple data partitioning schemes. Beyond performance estimation, DACHOS (Deployment with Automated Cross-validation and Hyperparameter Optimization using Supercomputing) is introduced to leverage AHPO and cross-validation to build the final model on the full dataset, improving expected deployment performance. The findings underscore the importance of NCV in quantifying and reducing estimation variance, AHPO in optimizing hyperparameters consistently across test folds, and HPC in ensuring computational feasibility. By integrating these methodologies, NACHOS and DACHOS provide a scalable, reproducible, and trustworthy framework for DL model evaluation and deployment in medical imaging.
Submitted 11 March, 2025;
originally announced March 2025.
-
ORCHID: Streaming Threat Detection over Versioned Provenance Graphs
Authors:
Akul Goyal,
Jason Liu,
Adam Bates,
Gang Wang
Abstract:
While Endpoint Detection and Response (EDR) systems are able to efficiently monitor threats by comparing static rules to the event stream, their inability to incorporate past system context leads to high rates of false alarms. Recent work has demonstrated Provenance-based Intrusion Detection Systems (Prov-IDS) that can examine the causal relationships between abnormal behaviors to improve threat classification. However, employing these Prov-IDS in practical settings remains difficult: state-of-the-art neural-network-based systems are only fast in a fully offline deployment model that increases attacker dwell time, while simultaneously using simplified and less accurate provenance graphs to reduce memory consumption. Thus, today's Prov-IDS cannot operate effectively in the real-time streaming setting required for commercial EDR viability.
This work presents the design and implementation of ORCHID, a novel Prov-IDS that performs fine-grained detection of process-level threats over a real-time event stream. ORCHID takes advantage of the unique immutable properties of versioned provenance graphs to iteratively embed the entire graph in a sequential RNN model while only consuming a fraction of the computation and memory costs. We evaluate ORCHID on four public datasets, including DARPA TC, to show that ORCHID can provide competitive classification performance while eliminating detection lag and reducing memory consumption by two orders of magnitude.
Submitted 23 August, 2024;
originally announced August 2024.
-
Carbon Filter: Real-time Alert Triage Using Large Scale Clustering and Fast Search
Authors:
Jonathan Oliver,
Raghav Batta,
Adam Bates,
Muhammad Adil Inam,
Shelly Mehta,
Shugao Xia
Abstract:
"Alert fatigue" is one of the biggest challenges faced by the Security Operations Center (SOC) today, with analysts spending more than half of their time reviewing false alerts. Endpoint detection products raise alerts by pattern matching on event telemetry against behavioral rules that describe potentially malicious behavior, but can suffer from high false positives that distract from actual attacks. While alert triage techniques based on data provenance may show promise, these techniques can take over a minute to inspect a single alert, while EDR customers may face tens of millions of alerts per day; the current reality is that these approaches aren't nearly scalable enough for production environments.
We present Carbon Filter, a statistical-learning-based system that dramatically reduces the number of alerts analysts need to manually review. Our approach is based on the observation that false alert triggers can be efficiently identified and separated from suspicious behaviors by examining the process initiation context (e.g., the command line) that launched the responsible process. Through the use of fast-search algorithms for training and inference, our approach scales to millions of alerts per day. Through batching queries to the model, we observe a theoretical maximum throughput of 20 million alerts per hour. Based on the analysis of tens of millions of alerts from customer deployments, our solution resulted in a 6-fold improvement in the signal-to-noise ratio without compromising on alert triage performance.
Submitted 7 May, 2024;
originally announced May 2024.
-
ATLASv2: ATLAS Attack Engagements, Version 2
Authors:
Andy Riddle,
Kim Westfall,
Adam Bates
Abstract:
ATLASv2 is based on a previously generated dataset included in "ATLAS: A Sequence-based Learning Approach for Attack Investigation." The original ATLAS dataset comprises Windows Security Auditing system logs, Firefox logs, and DNS logs captured via Wireshark. In ATLASv2, we aim to enrich the ATLAS dataset with higher-quality background noise and additional logging vantage points. This work replicates the ten attack scenarios described in ATLAS, but extends the logging to include Sysmon logs and events tracked through VMware Carbon Black Cloud.
The main contribution of ATLASv2 is to improve the quality of the benign system activity and the integration of the attack scenarios. Instead of relying on automated scripts to generate activity, we had two researchers use the victim machines as their primary workstations throughout the course of the engagement. This allowed us to capture system logs of actual user behavior. Additionally, the researchers conducted the attacks in a lab setup, allowing the integration of the attack into the workflow of the victim user. This allows the ATLASv2 dataset to provide realistic system logs that mirror the system log activity generated in real-world attacks.
Submitted 3 October, 2023;
originally announced January 2024.
-
Ellipsis: Towards Efficient System Auditing for Real-Time Systems
Authors:
Ayoosh Bansal,
Anant Kandikuppa,
Chien-Ying Chen,
Monowar Hasan,
Adam Bates,
Sibin Mohan
Abstract:
System auditing is a powerful tool that provides insight into the nature of suspicious events in computing systems, allowing machine operators to detect and subsequently investigate security incidents. While auditing has proven invaluable to the security of traditional computers, existing audit frameworks are rarely designed with consideration for Real-Time Systems (RTS). The transparency provided by system auditing would be of tremendous benefit in a variety of security-critical RTS domains (e.g., autonomous vehicles); however, if audit mechanisms are not carefully integrated into RTS, auditing can be rendered ineffectual and violate the real-world temporal requirements of the RTS.
In this paper, we demonstrate how to adapt commodity audit frameworks to RTS. Using Linux Audit as a case study, we first demonstrate that the volume of audit events generated by commodity frameworks is unsustainable within the temporal and resource constraints of real-time (RT) applications. To address this, we present Ellipsis, a set of kernel-based reduction techniques that leverage the periodic repetitive nature of RT applications to aggressively reduce the costs of system-level auditing. Ellipsis generates succinct descriptions of RT applications' expected activity while retaining a detailed record of unexpected activities, enabling analysis of suspicious activity while meeting temporal constraints. Our evaluation of Ellipsis using ArduPilot (an open-source autopilot application suite) demonstrates up to a 93% reduction in audit log generation.
Submitted 4 August, 2022;
originally announced August 2022.
-
Dynamic imaging using Motion-Compensated SmooThness Regularization on Manifolds (MoCo-SToRM)
Authors:
Qing Zou,
Luis A. Torres,
Sean B. Fain,
Nara S. Higano,
Alister J. Bates,
Mathews Jacob
Abstract:
We introduce an unsupervised motion-compensated reconstruction scheme for high-resolution free-breathing pulmonary MRI. We model the image frames in the time series as deformed versions of a 3D template image volume. We assume the deformation maps to be points on a smooth manifold in high-dimensional space. Specifically, we model the deformation map at each time instant as the output of a CNN-based generator that shares the same weights across all time frames and is driven by a low-dimensional latent vector. The time series of latent vectors accounts for the dynamics in the dataset, including respiratory motion and bulk motion. The template image volume, the parameters of the generator, and the latent vectors are learned directly from the k-t space data in an unsupervised fashion. Our experimental results show improved reconstructions compared to state-of-the-art methods, especially in the context of bulk motion during the scans.
Submitted 6 December, 2021;
originally announced December 2021.
-
Discriminative Attribution from Counterfactuals
Authors:
Nils Eckstein,
Alexander S. Bates,
Gregory S. X. E. Jefferis,
Jan Funke
Abstract:
We present a method for neural network interpretability that combines feature attribution with counterfactual explanations to generate attribution maps highlighting the most discriminative features between pairs of classes. We show that this method can be used to quantitatively evaluate the performance of feature attribution methods in an objective manner, thus preventing potential observer bias. We evaluate the proposed method on three diverse datasets, including a challenging artificial dataset and real-world biological data. We show quantitatively and qualitatively that the highlighted features are substantially more discriminative than those extracted using conventional attribution methods, and we argue that this type of explanation is better suited for understanding fine-grained class differences as learned by a deep neural network.
Submitted 27 September, 2021;
originally announced September 2021.
-
Toward Lattice QCD On Billion Core Approximate Computers
Authors:
Alexandra Bates,
Joseph Bates
Abstract:
We present evidence of the feasibility of using billion-core approximate computers to run simple U(1) sigma models, and discuss how the approach might be extended to Lattice Quantum Chromodynamics (LQCD) models. This work is motivated by the extreme time, power, and cost needed to run LQCD on current computing hardware. We show that, using massively parallel approximate hardware, at least some models can run with great speed and power efficiency without sacrificing accuracy. As a test of accuracy, a 32 x 32 x 32 U(1) sigma model yielded similar results using floating-point and approximate representations for the spins. A 20-million-point 3D model, run on a 34,000-core single-board prototype approximate computer, showed encouraging accuracy with a ~750 times improvement in speed and a ~2500 times improvement in speed/watt compared to a traditional CPU. These results suggest there is value in future research to determine whether similar speed-ups and accuracies are possible when running full LQCD on the compact billion-core approximate computing systems that are now practical.
Submitted 29 October, 2020;
originally announced October 2020.
-
UNICORN: Runtime Provenance-Based Detector for Advanced Persistent Threats
Authors:
Xueyuan Han,
Thomas Pasquier,
Adam Bates,
James Mickens,
Margo Seltzer
Abstract:
Advanced Persistent Threats (APTs) are difficult to detect due to their "low-and-slow" attack patterns and frequent use of zero-day exploits. We present UNICORN, an anomaly-based APT detector that effectively leverages data provenance analysis. From modeling to detection, UNICORN tailors its design specifically for the unique characteristics of APTs. Through extensive yet time-efficient graph analysis, UNICORN explores provenance graphs that provide rich contextual and historical information to identify stealthy anomalous activities without pre-defined attack signatures. Using a graph sketching technique, it summarizes long-running system execution with space efficiency to combat slow-acting attacks that take place over a long time span. UNICORN further improves its detection capability using a novel modeling approach to understand long-term behavior as the system evolves. Our evaluation shows that UNICORN outperforms an existing state-of-the-art APT detection system and detects real-life APT scenarios with high accuracy.
Submitted 14 January, 2020; v1 submitted 6 January, 2020;
originally announced January 2020.
-
Runtime Analysis of Whole-System Provenance
Authors:
Thomas Pasquier,
Xueyuan Han,
Thomas Moyer,
Adam Bates,
Olivier Hermant,
David Eyers,
Jean Bacon,
Margo Seltzer
Abstract:
Identifying the root cause and impact of a system intrusion remains a foundational challenge in computer security. Digital provenance provides a detailed history of the flow of information within a computing system, connecting suspicious events to their root causes. Although existing provenance-based auditing techniques provide value in forensic analysis, they assume that such analysis takes place only retrospectively. Such post-hoc analysis is insufficient for realtime security applications; moreover, even for forensic tasks, prior provenance collection systems exhibited poor performance and scalability, jeopardizing the timeliness of query responses.
We present CamQuery, which provides inline, realtime provenance analysis, making it suitable for implementing security applications. CamQuery is a Linux Security Module that offers support for both userspace and in-kernel execution of analysis applications. We demonstrate the applicability of CamQuery to a variety of runtime security applications including data loss prevention, intrusion detection, and regulatory compliance. In evaluation, we demonstrate that CamQuery reduces the latency of realtime query mechanisms, while imposing minimal overheads on system execution. CamQuery thus enables the further deployment of provenance-based technologies to address central challenges in computer security.
Submitted 25 August, 2018; v1 submitted 18 August, 2018;
originally announced August 2018.
-
An Optimal Dimensionality Multi-shell Sampling Scheme with Accurate and Efficient Transforms for Diffusion MRI
Authors:
Alice P. Bates,
Zubair Khalid,
Jason D. McEwen,
Rodney A. Kennedy
Abstract:
This paper proposes a multi-shell sampling scheme and corresponding transforms for the accurate reconstruction of the diffusion signal in diffusion MRI by expansion in the spherical polar Fourier (SPF) basis. The sampling scheme uses an optimal number of samples, equal to the degrees of freedom of the band-limited diffusion signal in the SPF domain, and allows for computationally efficient reconstruction. We use synthetic data sets to demonstrate that the proposed scheme allows for greater reconstruction accuracy of the diffusion signal than the multi-shell sampling schemes obtained using the generalised electrostatic energy minimisation (gEEM) method used in the Human Connectome Project. We also demonstrate that the proposed sampling scheme allows for increased angular discrimination and improved rotational invariance of reconstruction accuracy compared with the gEEM schemes.
Submitted 20 April, 2017;
originally announced May 2017.
-
Multi-shell Sampling Scheme with Accurate and Efficient Transforms for Diffusion MRI
Authors:
Alice P. Bates,
Zubair Khalid,
Rodney A. Kennedy,
Jason D. McEwen
Abstract:
We propose a multi-shell sampling grid and develop corresponding transforms for the accurate reconstruction of the diffusion signal in diffusion MRI by expansion in the spherical polar Fourier (SPF) basis. The transform is exact in the radial direction and accurate, on the order of machine precision, in the angular direction. The sampling scheme uses an optimal number of samples equal to the degrees of freedom of the diffusion signal in the SPF domain.
Submitted 22 February, 2017;
originally announced February 2017.
-
Retrofitting Applications with Provenance-Based Security Monitoring
Authors:
Adam Bates,
Kevin Butler,
Alin Dobra,
Brad Reaves,
Patrick Cable,
Thomas Moyer,
Nabil Schear
Abstract:
Data provenance is a valuable tool for detecting and preventing cyber attacks, providing insight into the nature of suspicious events. For example, an administrator can use provenance to identify the perpetrator of a data leak, track an attacker's actions following an intrusion, or even control the flow of outbound data within an organization. Unfortunately, providing relevant data provenance for complex, heterogeneous software deployments is challenging, requiring both the tedious instrumentation of many application components and a unified architecture for aggregating information between components.
In this work, we present a composition of techniques for bringing affordable and holistic provenance capabilities to complex application workflows, with particular consideration for the exemplar domain of web services. We present DAP, a transparent architecture for capturing detailed data provenance for web service components. Our approach builds on the key insight that minimal knowledge of open protocols can be leveraged to extract precise and efficient provenance information by interposing on application components' communications, granting DAP compatibility with existing web services without requiring instrumentation or developer cooperation. We show how our system can be used in real time to monitor system intrusions or detect data exfiltration attacks while imposing less than 5.1 ms end-to-end overhead on web requests. Through the introduction of a garbage collection optimization, DAP is able to monitor system activity without suffering from excessive storage overhead. DAP thus serves not only as a provenance-aware web framework, but also as a case study in the non-invasive deployment of provenance capabilities for complex application workflows.
Submitted 1 September, 2016;
originally announced September 2016.
-
Efficient Computation of Slepian Functions for Arbitrary Regions on the Sphere
Authors:
Alice P. Bates,
Zubair Khalid,
Rodney A. Kennedy
Abstract:
In this paper, we develop a new method for the fast and memory-efficient computation of Slepian functions on the sphere. Slepian functions, which arise as the solution of the Slepian concentration problem on the sphere, have desirable properties for applications where measurements are only available within a spatially limited region on the sphere and/or a function is required to be analyzed over the spatially limited region. Slepian functions are currently not easily computed for large band-limits for an arbitrary spatial region due to high computational and large memory storage requirements. For the special case of a polar cap, the symmetry of the region enables the decomposition of the Slepian concentration problem into smaller subproblems and consequently the efficient computation of Slepian functions for large band-limits. By exploiting the efficient computation of Slepian functions for the polar cap region on the sphere, we develop a formulation, supported by a fast algorithm, for the approximate computation of Slepian functions for an arbitrary spatial region to enable the analysis of modern datasets that support large band-limits. For the proposed algorithm, we carry out accuracy analysis of the approximation, computational complexity analysis, and review of memory storage requirements. We illustrate, through numerical experiments, that the proposed method enables faster computation, and has smaller storage requirements, while allowing for sufficiently accurate computation of the Slepian functions.
Submitted 31 August, 2017; v1 submitted 18 August, 2016;
originally announced August 2016.