-
Leveraging Public Cloud Infrastructure for Real-time Connected Vehicle Speed Advisory at a Signalized Corridor
Authors:
Hsien-Wen Deng,
M Sabbir Salek,
Mizanur Rahman,
Mashrur Chowdhury,
Mitch Shue,
Amy W. Apon
Abstract:
In this study, we developed a real-time connected vehicle (CV) speed advisory application that uses public cloud services and tested it on a simulated signalized corridor for different roadway traffic conditions. First, we developed a scalable serverless cloud computing architecture leveraging public cloud services offered by Amazon Web Services (AWS) to support the requirements of a real-time CV application. Second, we developed an optimization-based real-time CV speed advisory algorithm by taking a modular design approach, which makes the application automatically scalable and deployable in the cloud using the serverless architecture. Third, we developed a cloud-in-the-loop simulation testbed using AWS and an open-source microscopic roadway traffic simulator called Simulation of Urban Mobility (SUMO). Our analyses based on different roadway traffic conditions showed that the serverless CV speed advisory application meets the latency requirement of real-time CV mobility applications. In addition, our serverless CV speed advisory application reduced the average stopped delay (by 77%) and the aggregated risk of collision (by 21%) at the signalized intersections of a corridor. These results demonstrate the feasibility as well as the efficacy of utilizing public cloud infrastructure to implement real-time roadway traffic management applications in a CV environment.
Submitted 29 January, 2024;
originally announced January 2024.
-
Synthetic Image Data for Deep Learning
Authors:
Jason W. Anderson,
Marcin Ziolkowski,
Ken Kennedy,
Amy W. Apon
Abstract:
Realistic synthetic image data rendered from 3D models can be used to augment image sets and train image classification and semantic segmentation models. In this work, we explore how high-quality physically-based rendering and domain randomization can efficiently create a large synthetic dataset based on production 3D CAD models of a real vehicle. We use this dataset to quantify the effectiveness of synthetic augmentation using U-net and Double-U-net models. We found that, for this domain, synthetic images were an effective technique for augmenting limited sets of real training data. We observed that models trained on purely synthetic images had a very low mean prediction IoU on real validation images. We also observed that adding even very small amounts of real images to a synthetic dataset greatly improved accuracy, and that models trained on datasets augmented with synthetic images were more accurate than those trained on real images alone. Finally, we found that in use cases that benefit from incremental training or model specialization, pretraining a base model on synthetic images provided a sizeable reduction in the training cost of transfer learning, allowing up to 90% of the model training to be front-loaded.
Submitted 12 December, 2022;
originally announced December 2022.
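The mean prediction IoU (intersection over union) named in the abstract above is a standard segmentation metric. A minimal sketch of how it could be computed for binary masks follows; the function names, the nested-list mask representation, and the convention that two empty masks score 1.0 are assumptions made for illustration, not details taken from the paper.

```python
def iou(pred, truth):
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                inter += 1  # pixel is set in both masks
            if p or t:
                union += 1  # pixel is set in at least one mask
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(preds, truths):
    """Average the per-image IoU over paired lists of masks."""
    scores = [iou(p, t) for p, t in zip(preds, truths)]
    return sum(scores) / len(scores)
```

A score near 0 means the predicted mask barely overlaps the labelled one, which is the situation the abstract reports for models trained on purely synthetic images.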
-
Proactive Query Expansion for Streaming Data Using External Source
Authors:
Farah Alshanik,
Amy Apon,
Yuheng Du,
Alexander Herzog,
Ilya Safro
Abstract:
Query expansion is the process of reformulating the original query by adding relevant words. Choosing which terms to add in order to improve the performance of the query expansion methods or to enhance the quality of the retrieved results is an important aspect of any information retrieval system. Adding words that positively impact the quality of the search query, or that are informative enough, plays an important role in retrieving relevant documents that cover a certain topic and can improve the efficiency of the information retrieval system. Typically, query expansion techniques are used to add or substitute words to a given search query to collect relevant data. In this paper, we design and implement an automated query expansion pipeline. We outline several tools using different methods to expand the query. Our methods depend on targeting emergent events in streaming data over time and finding the hidden topics from targeted documents using probabilistic topic models. We employ Dynamic Eigenvector Centrality to trigger the emergent events, and Latent Dirichlet Allocation to discover the topics. Also, we use an external data source as a secondary stream to supplement the primary stream with relevant words and expand the query using the words from both primary and secondary streams. An experimental study is performed on Twitter data (primary stream) related to the events that happened during protests in Baltimore in 2015. The quality of the retrieved results was measured using quality indicators of the streaming data: tweet count, hashtag count, and hashtag clustering.
Submitted 17 January, 2022;
originally announced January 2022.
-
Accelerating Text Mining Using Domain-Specific Stop Word Lists
Authors:
Farah Alshanik,
Amy Apon,
Alexander Herzog,
Ilya Safro,
Justin Sybrandt
Abstract:
Text preprocessing is an essential step in text mining. Removing words that can negatively impact the quality of prediction algorithms or are not informative enough is a crucial storage-saving technique in text indexing and results in improved computational efficiency. Typically, a generic stop word list is applied to a dataset regardless of the domain. However, many common words are different from one domain to another but have no significance within a particular domain. Eliminating domain-specific common words in a corpus reduces the dimensionality of the feature space and improves the performance of text mining tasks. In this paper, we present a novel mathematical approach for the automatic extraction of domain-specific words called the hyperplane-based approach. This new approach depends on the notion of a low-dimensional representation of the word in vector space and its distance from the hyperplane. The hyperplane-based approach can significantly reduce text dimensionality by eliminating irrelevant features. We compare the hyperplane-based approach with other feature selection methods, namely chi-squared (χ²) and mutual information. An experimental study on three different datasets and five classification algorithms measures the dimensionality reduction and the increase in classification performance. Results indicate that the hyperplane-based approach can reduce the dimensionality of the corpus by 90% and outperforms mutual information. The computational time to identify the domain-specific words is significantly lower than that of mutual information.
Submitted 18 November, 2020;
originally announced December 2020.
-
Vision-based Pedestrian Alert Safety System (PASS) for Signalized Intersections
Authors:
Mhafuzul Islam,
Mizanur Rahman,
Mashrur Chowdhury,
Gurcan Comert,
Eshaa Deepak Sood,
Amy Apon
Abstract:
Although Vehicle-to-Pedestrian (V2P) communication can significantly improve pedestrian safety at a signalized intersection, this safety is hindered as pedestrians often do not carry hand-held devices (e.g., dedicated short-range communication (DSRC) and 5G-enabled cell phones) to communicate with connected vehicles nearby. To overcome this limitation, in this study, traffic cameras at a signalized intersection were used to accurately detect and locate pedestrians via a vision-based deep learning technique to generate safety alerts in real time about possible conflicts between vehicles and pedestrians. The contribution of this paper lies in the development of a system using a vision-based deep learning model that is able to generate personal safety messages (PSMs) in real time (every 100 milliseconds). We develop a pedestrian alert safety system (PASS) to generate a safety alert of an imminent pedestrian-vehicle crash using generated PSMs to improve pedestrian safety at a signalized intersection. Our approach estimates the location and velocity of a pedestrian more accurately than existing DSRC-enabled pedestrian hand-held devices. A connected vehicle application, the Pedestrian in Signalized Crosswalk Warning (PSCW), was developed to evaluate the vision-based PASS. Numerical analyses show that our vision-based PASS is able to satisfy the accuracy and latency requirements of pedestrian safety applications in a connected vehicle environment.
Submitted 1 July, 2019;
originally announced July 2019.
-
Multi-class Twitter Data Categorization and Geocoding with a Novel Computing Framework
Authors:
Sakib Mahmud Khan,
Mashrur Chowdhury,
Linh B. Ngo,
Amy Apon
Abstract:
This study details the progress in transportation data analysis with a novel computing framework in keeping with the continuous evolution of computing technology. The computing framework combines the Labelled Latent Dirichlet Allocation (L-LDA)-incorporated Support Vector Machine (SVM) classifier with the supporting computing strategy on publicly available Twitter data in determining transportation-related events to provide reliable information to travelers. The analytical approach includes analyzing tweets using text classification and geocoding locations based on string similarity. A case study conducted for New York City and its surrounding areas demonstrates the feasibility of the analytical approach. Approximately 700,010 tweets are analyzed to extract relevant transportation-related information for one week. The SVM classifier achieves more than 85% accuracy in identifying transportation-related tweets from structured data. To further categorize the transportation-related tweets into sub-classes (incident, congestion, construction, special events, and other events), three supervised classifiers are used: L-LDA, SVM, and L-LDA-incorporated SVM. Findings from this study demonstrate that the analytical framework, which uses the L-LDA-incorporated SVM, can classify roadway transportation-related data from Twitter with over 98.3% accuracy, which is significantly higher than the accuracies achieved by standalone L-LDA and SVM.
Submitted 28 August, 2019; v1 submitted 8 May, 2019;
originally announced May 2019.
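The abstract above mentions geocoding locations based on string similarity. A minimal sketch of that general idea using Python's standard `difflib` module follows; the gazetteer entries, their approximate coordinates, the `geocode` function, and the 0.8 similarity cutoff are all hypothetical choices for illustration and do not describe the framework used in the study.

```python
import difflib

# Hypothetical gazetteer: place name -> (latitude, longitude).
# Coordinates are approximate and for illustration only.
GAZETTEER = {
    "brooklyn bridge": (40.7061, -73.9969),
    "lincoln tunnel": (40.7629, -74.0110),
    "holland tunnel": (40.7274, -74.0211),
}


def geocode(mention, cutoff=0.8):
    """Return (matched name, coordinates) for the closest gazetteer entry,
    or None when no entry is at least `cutoff` similar to the mention."""
    matches = difflib.get_close_matches(
        mention.lower(), list(GAZETTEER), n=1, cutoff=cutoff
    )
    if not matches:
        return None
    name = matches[0]
    return name, GAZETTEER[name]
```

With this sketch, a misspelled mention such as "Brooklyn Brdge" still resolves to the "brooklyn bridge" entry, while a string unrelated to every entry returns `None` rather than a spurious location.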
-
Scalable Dynamic Topic Modeling with Clustered Latent Dirichlet Allocation (CLDA)
Authors:
Chris Gropp,
Alexander Herzog,
Ilya Safro,
Paul W. Wilson,
Amy W. Apon
Abstract:
Topic modeling, a method for extracting the underlying themes from a collection of documents, is an increasingly important component of the design of intelligent systems enabling the sense-making of highly dynamic and diverse streams of text data. Traditional methods such as Dynamic Topic Modeling (DTM) do not lend themselves well to direct parallelization because of dependencies from one time step to another. In this paper, we introduce and empirically analyze Clustered Latent Dirichlet Allocation (CLDA), a method for extracting dynamic latent topics from a collection of documents. Our approach is based on data decomposition in which the data is partitioned into segments, followed by topic modeling on the individual segments. The resulting local models are then combined into a global solution using clustering. The decomposition and resulting parallelization lead to very fast runtime even on very large datasets. Our approach furthermore provides insight into how the composition of topics changes over time and can also be applied using other data partitioning strategies over any discrete features of the data, such as geographic features or classes of users. In this paper, CLDA is applied successfully to seventeen years of NIPS conference papers (2,484 documents and 3,280,697 words), seventeen years of computer science journal abstracts (533,560 documents and 32,551,540 words), and to forty years of the PubMed corpus (4,025,978 documents and 273,853,980 words).
Submitted 4 October, 2019; v1 submitted 24 October, 2016;
originally announced October 2016.