-
A Perspective on AI-Guided Molecular Simulations in VR: Exploring Strategies for Imitation Learning in Hyperdimensional Molecular Systems
Authors:
Mohamed Dhouioui,
Jonathan Barnoud,
Rhoslyn Roebuck Williams,
Harry J. Stroud,
Phil Bates,
David R. Glowacki
Abstract:
Molecular dynamics simulations are a crucial computational tool for researchers to understand and engineer molecular structure and function in areas such as drug discovery, protein engineering, and material design. Despite their utility, MD simulations are expensive, owing to the high dimensionality of molecular systems. Interactive molecular dynamics in virtual reality (iMD-VR) has recently been developed as a 'human-in-the-loop' strategy, which leverages high-performance computing to accelerate the researcher's ability to solve the hyperdimensional sampling problem. By providing an immersive 3D environment that enables visualization and manipulation of real-time molecular motion, iMD-VR enables researchers and students to efficiently and intuitively explore and navigate these complex, high-dimensional systems. iMD-VR platforms offer a unique opportunity to quickly generate rich datasets that capture human experts' spatial insight regarding molecular structure and function. This paper explores the possibility of employing user-generated iMD-VR datasets to train AI agents via imitation learning (IL). IL is an important technique in robotics that enables agents to mimic complex behaviors from expert demonstrations, thus circumventing the need for explicit programming or intricate reward design. We review the utilization of IL for manipulation tasks in robotics and discuss how iMD-VR recordings could be used to train IL models for solving specific molecular 'tasks'. We then investigate how such approaches could be applied to the data captured from iMD-VR recordings. Finally, we outline the future research directions and potential challenges of using AI agents to augment human expertise to efficiently navigate conformational spaces, highlighting how this approach could provide valuable insight across domains such as materials science, protein engineering, and computer-aided drug design.
Submitted 11 September, 2024;
originally announced September 2024.
-
Pedestrian Crossing Action Recognition and Trajectory Prediction with 3D Human Keypoints
Authors:
Jiachen Li,
Xinwei Shi,
Feiyu Chen,
Jonathan Stroud,
Zhishuai Zhang,
Tian Lan,
Junhua Mao,
Jeonhyung Kang,
Khaled S. Refaat,
Weilong Yang,
Eugene Ie,
Congcong Li
Abstract:
Accurate understanding and prediction of human behaviors are critical prerequisites for autonomous vehicles, especially in highly dynamic and interactive scenarios such as intersections in dense urban areas. In this work, we aim at identifying crossing pedestrians and predicting their future trajectories. To achieve these goals, we not only need the context information of road geometry and other traffic participants but also need fine-grained information of the human pose, motion and activity, which can be inferred from human keypoints. In this paper, we propose a novel multi-task learning framework for pedestrian crossing action recognition and trajectory prediction, which utilizes 3D human keypoints extracted from raw sensor data to capture rich information on human pose and activity. Moreover, we propose to apply two auxiliary tasks and contrastive learning to enable auxiliary supervisions to improve the learned keypoints representation, which further enhances the performance of major tasks. We validate our approach on a large-scale in-house dataset, as well as a public benchmark dataset, and show that our approach achieves state-of-the-art performance on a wide range of evaluation metrics. The effectiveness of each model component is validated in a detailed ablation study.
Submitted 1 June, 2023;
originally announced June 2023.
-
Supporting and Controlling Complex Concurrency in Fault-Tolerant Distributed Systems
Authors:
Jie Xu,
Brian Randell,
Alexander Romanovsky,
Robert J. Stroud,
Avelino F. Zorzo
Abstract:
Distributed computing often gives rise to complex concurrent and interacting activities. In some cases several concurrent activities may be working together, i.e. cooperating, to solve a given problem; in other cases, the activities may be independent but need to share common system resources for which they must compete. Many difficulties and limitations arise in the widely advocated objects and (trans)actions model when it is used to support cooperating activities. We previously introduced the concept of coordinated atomic (CA) actions [Xu et al. 1995]; this paper analyzes and examines the derived objects and CA actions model for constructing fault-tolerant distributed systems and providing unified support for both cooperative and competitive concurrency. Our investigation reveals and clarifies several significant problems that have not previously been studied extensively, including the problem of ensuring consistent access to shared objects from a joint action as opposed to a set of independent actions. Conceptual and implementation-related solutions are proposed and illustrated.
Submitted 11 November, 2021;
originally announced November 2021.
-
FIBER: Fill-in-the-Blanks as a Challenging Video Understanding Evaluation Framework
Authors:
Santiago Castro,
Ruoyao Wang,
Pingxuan Huang,
Ian Stewart,
Oana Ignat,
Nan Liu,
Jonathan C. Stroud,
Rada Mihalcea
Abstract:
We propose fill-in-the-blanks as a video understanding evaluation framework and introduce FIBER -- a novel dataset consisting of 28,000 videos and descriptions in support of this evaluation framework. The fill-in-the-blanks setting tests a model's understanding of a video by requiring it to predict a masked noun phrase in the caption of the video, given the video and the surrounding text. The FIBER benchmark does not share the weaknesses of the current state-of-the-art language-informed video understanding tasks, namely: (1) video question answering using multiple-choice questions, where models perform relatively well because they exploit linguistic biases in the task formulation, thus making our framework challenging for the current state-of-the-art systems to solve; and (2) video captioning, which relies on an open-ended evaluation framework that is often inaccurate because system answers may be perceived as incorrect if they differ in form from the ground truth. The FIBER dataset and our code are available at https://lit.eecs.umich.edu/fiber/.
Submitted 22 March, 2022; v1 submitted 9 April, 2021;
originally announced April 2021.
-
Learning Video Representations from Textual Web Supervision
Authors:
Jonathan C. Stroud,
Zhichao Lu,
Chen Sun,
Jia Deng,
Rahul Sukthankar,
Cordelia Schmid,
David A. Ross
Abstract:
Videos on the Internet are paired with pieces of text, such as titles and descriptions. This text typically describes the most important content in the video, such as the objects in the scene and the actions being performed. Based on this observation, we propose to use text as a method for learning video representations. To accomplish this, we propose a data collection process and use it to collect 70M video clips shared publicly on the Internet, and we then train a model to pair each video with its associated text. We evaluate the model on several down-stream action recognition tasks, including Kinetics, HMDB-51, and UCF-101. We find that this approach is an effective method of pre-training video representations. Specifically, it outperforms all existing methods for self-supervised and cross-modal video representation learning.
Submitted 27 August, 2021; v1 submitted 29 July, 2020;
originally announced July 2020.
-
Compositional Temporal Visual Grounding of Natural Language Event Descriptions
Authors:
Jonathan C. Stroud,
Ryan McCaffrey,
Rada Mihalcea,
Jia Deng,
Olga Russakovsky
Abstract:
Temporal grounding entails establishing a correspondence between natural language event descriptions and their visual depictions. Compositional modeling becomes central: we first ground atomic descriptions "girl eating an apple," "batter hitting the ball" to short video segments, and then establish the temporal relationships between the segments. This compositional structure enables models to recognize a wider variety of events not seen during training through recognizing their atomic sub-events. Explicit temporal modeling accounts for a wide variety of temporal relationships that can be expressed in language: e.g., in the description "girl stands up from the table after eating an apple" the visual ordering of the events is reversed, with first "eating an apple" followed by "standing up from the table." We leverage these observations to develop a unified deep architecture, CTG-Net, to perform temporal grounding of natural language event descriptions to videos. We demonstrate that our system outperforms prior state-of-the-art methods on the DiDeMo, Tempo-TL, and Tempo-HL temporal grounding datasets.
Submitted 4 December, 2019;
originally announced December 2019.
-
D3D: Distilled 3D Networks for Video Action Recognition
Authors:
Jonathan C. Stroud,
David A. Ross,
Chen Sun,
Jia Deng,
Rahul Sukthankar
Abstract:
State-of-the-art methods for video action recognition commonly use an ensemble of two networks: the spatial stream, which takes RGB frames as input, and the temporal stream, which takes optical flow as input. In recent work, both of these streams consist of 3D Convolutional Neural Networks, which apply spatiotemporal filters to the video clip before performing classification. Conceptually, the temporal filters should allow the spatial stream to learn motion representations, making the temporal stream redundant. However, we still see significant benefits in action recognition performance by including an entirely separate temporal stream, indicating that the spatial stream is "missing" some of the signal captured by the temporal stream. In this work, we first investigate whether motion representations are indeed missing in the spatial stream of 3D CNNs. Second, we demonstrate that these motion representations can be improved by distillation, by tuning the spatial stream to predict the outputs of the temporal stream, effectively combining both models into a single stream. Finally, we show that our Distilled 3D Network (D3D) achieves performance on par with two-stream approaches, using only a single model and with no need to compute optical flow.
Submitted 5 February, 2019; v1 submitted 19 December, 2018;
originally announced December 2018.
-
Temporal Action Localization by Structured Maximal Sums
Authors:
Zehuan Yuan,
Jonathan C. Stroud,
Tong Lu,
Jia Deng
Abstract:
We address the problem of temporal action localization in videos. We pose action localization as a structured prediction over arbitrary-length temporal windows, where each window is scored as the sum of frame-wise classification scores. Additionally, our model classifies the start, middle, and end of each action as separate components, allowing our system to explicitly model each action's temporal evolution and take advantage of informative temporal dependencies present in this structure. In this framework, we localize actions by searching for the structured maximal sum, a problem for which we develop a novel, provably-efficient algorithmic solution. The frame-wise classification scores are computed using features from a deep Convolutional Neural Network (CNN), which are trained end-to-end to directly optimize for a novel structured objective. We evaluate our system on the THUMOS 14 action detection benchmark and achieve competitive performance.
Submitted 15 April, 2017;
originally announced April 2017.
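The window-scoring idea in the abstract above can be illustrated with a much simpler stand-in. If a window's score is the sum of its frame-wise scores, the best single window is the maximum-sum contiguous run of frames, which Kadane's algorithm finds in linear time. The sketch below is not the paper's structured maximal sum algorithm, which additionally models the start, middle, and end of each action; the function name `best_window` and the idea of feeding it raw per-frame scores are assumptions made for illustration.

```python
from typing import List, Tuple


def best_window(scores: List[float]) -> Tuple[int, int, float]:
    """Return (start, end, total) for the max-sum non-empty window.

    `scores` holds one (hypothetical) classification score per frame;
    `end` is inclusive. Uses Kadane's algorithm, O(n) time.
    """
    if not scores:
        raise ValueError("scores must be non-empty")

    best_start = best_end = cur_start = 0
    best_sum = cur_sum = scores[0]

    for i in range(1, len(scores)):
        if cur_sum < 0:
            # A negative running total can only hurt; restart the window here.
            cur_start = i
            cur_sum = scores[i]
        else:
            cur_sum += scores[i]

        if cur_sum > best_sum:
            best_sum = cur_sum
            best_start, best_end = cur_start, i

    return best_start, best_end, best_sum


if __name__ == "__main__":
    # Positive scores suggest the action is present in that frame.
    print(best_window([-1.0, 2.0, 3.0, -4.0, 5.0, -2.0]))
```

Restarting the window whenever the running total goes negative is what keeps the search linear rather than quadratic in the number of frames.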
-
Data Science in Service of Performing Arts: Applying Machine Learning to Predicting Audience Preferences
Authors:
Jacob Abernethy,
Cyrus Anderson,
Alex Chojnacki,
Chengyu Dai,
John Dryden,
Eric Schwartz,
Wenbo Shen,
Jonathan Stroud,
Laura Wendlandt,
Sheng Yang,
Daniel Zhang
Abstract:
Performing arts organizations aim to enrich their communities through the arts. To do this, they strive to match their performance offerings to the taste of those communities. Success relies on understanding audience preference and predicting their behavior. Similar to most e-commerce or digital entertainment firms, arts presenters need to recommend the right performance to the right customer at the right time. As part of the Michigan Data Science Team (MDST), we partnered with the University Musical Society (UMS), a non-profit performing arts presenter housed in the University of Michigan, Ann Arbor. We are providing UMS with analysis and business intelligence, utilizing historical individual-level sales data. We built a recommendation system based on collaborative filtering, gaining insights into the artistic preferences of customers, along with the similarities between performances. To better understand audience behavior, we used statistical methods from customer-base analysis. We characterized customer heterogeneity via segmentation, and we modeled customer cohorts to understand and predict ticket purchasing patterns. Finally, we combined statistical modeling with natural language processing (NLP) to explore the impact of wording in program descriptions. These ongoing efforts provide a platform to launch targeted marketing campaigns, helping UMS carry out its mission by allocating its resources more efficiently. Celebrating its 138th season, UMS is a 2014 recipient of the National Medal of Arts, and it continues to enrich communities by connecting world-renowned artists with diverse audiences, especially students in their formative years. We aim to contribute to that mission through data science and customer analytics.
Submitted 29 September, 2016;
originally announced November 2016.
-
Flint Water Crisis: Data-Driven Risk Assessment Via Residential Water Testing
Authors:
Jacob Abernethy,
Cyrus Anderson,
Chengyu Dai,
Arya Farahi,
Linh Nguyen,
Adam Rauh,
Eric Schwartz,
Wenbo Shen,
Guangsha Shi,
Jonathan Stroud,
Xinyu Tan,
Jared Webb,
Sheng Yang
Abstract:
Recovery from the Flint Water Crisis has been hindered by uncertainty in both the water testing process and the causes of contamination. In this work, we develop an ensemble of predictive models to assess the risk of lead contamination in individual homes and neighborhoods. To train these models, we utilize a wide range of data sources, including voluntary residential water tests, historical records, and city infrastructure data. Additionally, we use our models to identify the most prominent factors that contribute to a high risk of lead contamination. In this analysis, we find that lead service lines are not the only factor that is predictive of the risk of lead contamination of water. These results could be used to guide the long-term recovery efforts in Flint, minimize the immediate damages, and improve resource-allocation decisions for similar water infrastructure crises.
Submitted 30 September, 2016;
originally announced October 2016.