Search | arXiv e-print repository

arXiv:2409.07189 [pdf, other]

A Perspective on AI-Guided Molecular Simulations in VR: Exploring Strategies for Imitation Learning in Hyperdimensional Molecular Systems

Authors: Mohamed Dhouioui, Jonathan Barnoud, Rhoslyn Roebuck Williams, Harry J. Stroud, Phil Bates, David R. Glowacki

Abstract: Molecular dynamics simulations are a crucial computational tool for researchers to understand and engineer molecular structure and function in areas such as drug discovery, protein engineering, and material design. Despite their utility, MD simulations are expensive, owing to the high dimensionality of molecular systems. Interactive molecular dynamics in virtual reality (iMD-VR) has recently been… ▽ More Molecular dynamics simulations are a crucial computational tool for researchers to understand and engineer molecular structure and function in areas such as drug discovery, protein engineering, and material design. Despite their utility, MD simulations are expensive, owing to the high dimensionality of molecular systems. Interactive molecular dynamics in virtual reality (iMD-VR) has recently been developed as a 'human-in-the-loop' strategy, which leverages high-performance computing to accelerate the researcher's ability to solve the hyperdimensional sampling problem. By providing an immersive 3D environment that enables visualization and manipulation of real-time molecular motion, iMD-VR enables researchers and students to efficiently and intuitively explore and navigate these complex, high-dimensional systems. iMD-VR platforms offer a unique opportunity to quickly generate rich datasets that capture human experts' spatial insight regarding molecular structure and function. This paper explores the possibility of employing user-generated iMD-VR datasets to train AI agents via imitation learning (IL). IL is an important technique in robotics that enables agents to mimic complex behaviors from expert demonstrations, thus circumventing the need for explicit programming or intricate reward design. We review the utilization of IL for manipulation tasks in robotics and discuss how iMD-VR recordings could be used to train IL models for solving specific molecular 'tasks'. We then investigate how such approaches could be applied to the data captured from iMD-VR recordings. Finally, we outline the future research directions and potential challenges of using AI agents to augment human expertise to efficiently navigate conformational spaces, highlighting how this approach could provide valuable insight across domains such as materials science, protein engineering, and computer-aided drug design. △ Less

Submitted 11 September, 2024; originally announced September 2024.

Comments: (Accepted for presentation at the First Workshop on "eXtended Reality \& Intelligent Agents" (XRIA24) @ ECAI24, Santiago De Compostela (Spain), 20 October 2024)

arXiv:2306.01075 [pdf, other]

Pedestrian Crossing Action Recognition and Trajectory Prediction with 3D Human Keypoints

Authors: Jiachen Li, Xinwei Shi, Feiyu Chen, Jonathan Stroud, Zhishuai Zhang, Tian Lan, Junhua Mao, Jeonhyung Kang, Khaled S. Refaat, Weilong Yang, Eugene Ie, Congcong Li

Abstract: Accurate understanding and prediction of human behaviors are critical prerequisites for autonomous vehicles, especially in highly dynamic and interactive scenarios such as intersections in dense urban areas. In this work, we aim at identifying crossing pedestrians and predicting their future trajectories. To achieve these goals, we not only need the context information of road geometry and other t… ▽ More Accurate understanding and prediction of human behaviors are critical prerequisites for autonomous vehicles, especially in highly dynamic and interactive scenarios such as intersections in dense urban areas. In this work, we aim at identifying crossing pedestrians and predicting their future trajectories. To achieve these goals, we not only need the context information of road geometry and other traffic participants but also need fine-grained information of the human pose, motion and activity, which can be inferred from human keypoints. In this paper, we propose a novel multi-task learning framework for pedestrian crossing action recognition and trajectory prediction, which utilizes 3D human keypoints extracted from raw sensor data to capture rich information on human pose and activity. Moreover, we propose to apply two auxiliary tasks and contrastive learning to enable auxiliary supervisions to improve the learned keypoints representation, which further enhances the performance of major tasks. We validate our approach on a large-scale in-house dataset, as well as a public benchmark dataset, and show that our approach achieves state-of-the-art performance on a wide range of evaluation metrics. The effectiveness of each model component is validated in a detailed ablation study. △ Less

Submitted 1 June, 2023; originally announced June 2023.

Comments: ICRA 2023

arXiv:2111.06339 [pdf]

Supporting and Controlling Complex Concurrency in Fault- Tolerant Distributed Systems

Authors: Jie Xu, Brian Randell, Alexander Romanovsky, Robert J. Stroud, Avelino F. Zorzo

Abstract: Distributed computing often gives rise to complex concurrent and interacting activities. In some cases several concurrent activities may be working together, i.e. cooperating, to solve a given problem; in other cases, the activities may be independent but needing to share common system resources for which they must compete. Many difficulties and limitations occur in the widely advocated objects an… ▽ More Distributed computing often gives rise to complex concurrent and interacting activities. In some cases several concurrent activities may be working together, i.e. cooperating, to solve a given problem; in other cases, the activities may be independent but needing to share common system resources for which they must compete. Many difficulties and limitations occur in the widely advocated objects and (trans)actions model when it is supposed to support cooperating activities. We have introduced previously the concept of coordinated atomic (CA) actions [Xu et al. 1995]; this paper analyzes and examines the derived objects and CA actions model for constructing fault-tolerant distributed systems and providing unified support for both cooperative and competitive concurrency. Our investigation reveals and clarifies several significant problems that have not previously been studied extensively, including the problem of ensuring consistent access to shared objects from a joint action as opposed to a set of independent actions. Conceptual and implementation-related solutions are proposed and illustrated. △ Less

Submitted 11 November, 2021; originally announced November 2021.

Comments: 8 pages

Journal ref: International Symposium on Special Topics of Computers, 1998

arXiv:2104.04182 [pdf, other]

FIBER: Fill-in-the-Blanks as a Challenging Video Understanding Evaluation Framework

Authors: Santiago Castro, Ruoyao Wang, Pingxuan Huang, Ian Stewart, Oana Ignat, Nan Liu, Jonathan C. Stroud, Rada Mihalcea

Abstract: We propose fill-in-the-blanks as a video understanding evaluation framework and introduce FIBER -- a novel dataset consisting of 28,000 videos and descriptions in support of this evaluation framework. The fill-in-the-blanks setting tests a model's understanding of a video by requiring it to predict a masked noun phrase in the caption of the video, given the video and the surrounding text. The FIBE… ▽ More We propose fill-in-the-blanks as a video understanding evaluation framework and introduce FIBER -- a novel dataset consisting of 28,000 videos and descriptions in support of this evaluation framework. The fill-in-the-blanks setting tests a model's understanding of a video by requiring it to predict a masked noun phrase in the caption of the video, given the video and the surrounding text. The FIBER benchmark does not share the weaknesses of the current state-of-the-art language-informed video understanding tasks, namely: (1) video question answering using multiple-choice questions, where models perform relatively well because they exploit linguistic biases in the task formulation, thus making our framework challenging for the current state-of-the-art systems to solve; and (2) video captioning, which relies on an open-ended evaluation framework that is often inaccurate because system answers may be perceived as incorrect if they differ in form from the ground truth. The FIBER dataset and our code are available at https://lit.eecs.umich.edu/fiber/. △ Less

Submitted 22 March, 2022; v1 submitted 9 April, 2021; originally announced April 2021.

Comments: Accepted at ACL 2022 Main conference. Camera-ready version

arXiv:2007.14937 [pdf, other]

Learning Video Representations from Textual Web Supervision

Authors: Jonathan C. Stroud, Zhichao Lu, Chen Sun, Jia Deng, Rahul Sukthankar, Cordelia Schmid, David A. Ross

Abstract: Videos on the Internet are paired with pieces of text, such as titles and descriptions. This text typically describes the most important content in the video, such as the objects in the scene and the actions being performed. Based on this observation, we propose to use text as a method for learning video representations. To accomplish this, we propose a data collection process and use it to collec… ▽ More Videos on the Internet are paired with pieces of text, such as titles and descriptions. This text typically describes the most important content in the video, such as the objects in the scene and the actions being performed. Based on this observation, we propose to use text as a method for learning video representations. To accomplish this, we propose a data collection process and use it to collect 70M video clips shared publicly on the Internet, and we then train a model to pair each video with its associated text. We evaluate the model on several down-stream action recognition tasks, including Kinetics, HMDB-51, and UCF-101. We find that this approach is an effective method of pre-training video representations. Specifically, it outperforms all existing methods for self-supervised and cross-modal video representation learning. △ Less

Submitted 27 August, 2021; v1 submitted 29 July, 2020; originally announced July 2020.

arXiv:1912.02256 [pdf, other]

Compositional Temporal Visual Grounding of Natural Language Event Descriptions

Authors: Jonathan C. Stroud, Ryan McCaffrey, Rada Mihalcea, Jia Deng, Olga Russakovsky

Abstract: Temporal grounding entails establishing a correspondence between natural language event descriptions and their visual depictions. Compositional modeling becomes central: we first ground atomic descriptions "girl eating an apple," "batter hitting the ball" to short video segments, and then establish the temporal relationships between the segments. This compositional structure enables models to reco… ▽ More Temporal grounding entails establishing a correspondence between natural language event descriptions and their visual depictions. Compositional modeling becomes central: we first ground atomic descriptions "girl eating an apple," "batter hitting the ball" to short video segments, and then establish the temporal relationships between the segments. This compositional structure enables models to recognize a wider variety of events not seen during training through recognizing their atomic sub-events. Explicit temporal modeling accounts for a wide variety of temporal relationships that can be expressed in language: e.g., in the description "girl stands up from the table after eating an apple" the visual ordering of the events is reversed, with first "eating an apple" followed by "standing up from the table." We leverage these observations to develop a unified deep architecture, CTG-Net, to perform temporal grounding of natural language event descriptions to videos. We demonstrate that our system outperforms prior state-of-the-art methods on the DiDeMo, Tempo-TL, and Tempo-HL temporal grounding datasets. △ Less

Submitted 4 December, 2019; originally announced December 2019.

Comments: Project page: jonathancstroud.com/ctg

arXiv:1812.08249 [pdf, other]

D3D: Distilled 3D Networks for Video Action Recognition

Authors: Jonathan C. Stroud, David A. Ross, Chen Sun, Jia Deng, Rahul Sukthankar

Abstract: State-of-the-art methods for video action recognition commonly use an ensemble of two networks: the spatial stream, which takes RGB frames as input, and the temporal stream, which takes optical flow as input. In recent work, both of these streams consist of 3D Convolutional Neural Networks, which apply spatiotemporal filters to the video clip before performing classification. Conceptually, the tem… ▽ More State-of-the-art methods for video action recognition commonly use an ensemble of two networks: the spatial stream, which takes RGB frames as input, and the temporal stream, which takes optical flow as input. In recent work, both of these streams consist of 3D Convolutional Neural Networks, which apply spatiotemporal filters to the video clip before performing classification. Conceptually, the temporal filters should allow the spatial stream to learn motion representations, making the temporal stream redundant. However, we still see significant benefits in action recognition performance by including an entirely separate temporal stream, indicating that the spatial stream is "missing" some of the signal captured by the temporal stream. In this work, we first investigate whether motion representations are indeed missing in the spatial stream of 3D CNNs. Second, we demonstrate that these motion representations can be improved by distillation, by tuning the spatial stream to predict the outputs of the temporal stream, effectively combining both models into a single stream. Finally, we show that our Distilled 3D Network (D3D) achieves performance on par with two-stream approaches, using only a single model and with no need to compute optical flow. △ Less

Submitted 5 February, 2019; v1 submitted 19 December, 2018; originally announced December 2018.

Comments: Added link to code and models at https://www.jonathancstroud.com/d3d

arXiv:1704.04671 [pdf, other]

Temporal Action Localization by Structured Maximal Sums

Authors: Zehuan Yuan, Jonathan C. Stroud, Tong Lu, Jia Deng

Abstract: We address the problem of temporal action localization in videos. We pose action localization as a structured prediction over arbitrary-length temporal windows, where each window is scored as the sum of frame-wise classification scores. Additionally, our model classifies the start, middle, and end of each action as separate components, allowing our system to explicitly model each action's temporal… ▽ More We address the problem of temporal action localization in videos. We pose action localization as a structured prediction over arbitrary-length temporal windows, where each window is scored as the sum of frame-wise classification scores. Additionally, our model classifies the start, middle, and end of each action as separate components, allowing our system to explicitly model each action's temporal evolution and take advantage of informative temporal dependencies present in this structure. In this framework, we localize actions by searching for the structured maximal sum, a problem for which we develop a novel, provably-efficient algorithmic solution. The frame-wise classification scores are computed using features from a deep Convolutional Neural Network (CNN), which are trained end-to-end to directly optimize for a novel structured objective. We evaluate our system on the THUMOS 14 action detection benchmark and achieve competitive performance. △ Less

Submitted 15 April, 2017; originally announced April 2017.

Comments: Accepted to CVPR 2017

arXiv:1611.05788 [pdf, other]

Data Science in Service of Performing Arts: Applying Machine Learning to Predicting Audience Preferences

Authors: Jacob Abernethy, Cyrus Anderson, Alex Chojnacki, Chengyu Dai, John Dryden, Eric Schwartz, Wenbo Shen, Jonathan Stroud, Laura Wendlandt, Sheng Yang, Daniel Zhang

Abstract: Performing arts organizations aim to enrich their communities through the arts. To do this, they strive to match their performance offerings to the taste of those communities. Success relies on understanding audience preference and predicting their behavior. Similar to most e-commerce or digital entertainment firms, arts presenters need to recommend the right performance to the right customer at t… ▽ More Performing arts organizations aim to enrich their communities through the arts. To do this, they strive to match their performance offerings to the taste of those communities. Success relies on understanding audience preference and predicting their behavior. Similar to most e-commerce or digital entertainment firms, arts presenters need to recommend the right performance to the right customer at the right time. As part of the Michigan Data Science Team (MDST), we partnered with the University Musical Society (UMS), a non-profit performing arts presenter housed in the University of Michigan, Ann Arbor. We are providing UMS with analysis and business intelligence, utilizing historical individual-level sales data. We built a recommendation system based on collaborative filtering, gaining insights into the artistic preferences of customers, along with the similarities between performances. To better understand audience behavior, we used statistical methods from customer-base analysis. We characterized customer heterogeneity via segmentation, and we modeled customer cohorts to understand and predict ticket purchasing patterns. Finally, we combined statistical modeling with natural language processing (NLP) to explore the impact of wording in program descriptions. These ongoing efforts provide a platform to launch targeted marketing campaigns, helping UMS carry out its mission by allocating its resources more efficiently. Celebrating its 138th season, UMS is a 2014 recipient of the National Medal of Arts, and it continues to enrich communities by connecting world-renowned artists with diverse audiences, especially students in their formative years. We aim to contribute to that mission through data science and customer analytics. △ Less

Submitted 29 September, 2016; originally announced November 2016.

Comments: Presented at the Data For Good Exchange 2016

arXiv:1610.00580 [pdf, other]

Flint Water Crisis: Data-Driven Risk Assessment Via Residential Water Testing

Authors: Jacob Abernethy, Cyrus Anderson, Chengyu Dai, Arya Farahi, Linh Nguyen, Adam Rauh, Eric Schwartz, Wenbo Shen, Guangsha Shi, Jonathan Stroud, Xinyu Tan, Jared Webb, Sheng Yang

Abstract: Recovery from the Flint Water Crisis has been hindered by uncertainty in both the water testing process and the causes of contamination. In this work, we develop an ensemble of predictive models to assess the risk of lead contamination in individual homes and neighborhoods. To train these models, we utilize a wide range of data sources, including voluntary residential water tests, historical recor… ▽ More Recovery from the Flint Water Crisis has been hindered by uncertainty in both the water testing process and the causes of contamination. In this work, we develop an ensemble of predictive models to assess the risk of lead contamination in individual homes and neighborhoods. To train these models, we utilize a wide range of data sources, including voluntary residential water tests, historical records, and city infrastructure data. Additionally, we use our models to identify the most prominent factors that contribute to a high risk of lead contamination. In this analysis, we find that lead service lines are not the only factor that is predictive of the risk of lead contamination of water. These results could be used to guide the long-term recovery efforts in Flint, minimize the immediate damages, and improve resource-allocation decisions for similar water infrastructure crises. △ Less

Submitted 30 September, 2016; originally announced October 2016.

Comments: Presented at the Data For Good Exchange 2016

Showing 1–10 of 10 results for author: Stroud, J