-
Adapting Whisper for Streaming Speech Recognition via Two-Pass Decoding
Authors:
Haoran Zhou,
Xingchen Song,
Brendan Fahy,
Qiaochu Song,
Binbin Zhang,
Zhendong Peng,
Anshul Wadhawan,
Denglin Jiang,
Apurv Verma,
Vinay Ramesh,
Srivas Prasad,
Michele M. Franceschini
Abstract:
OpenAI Whisper is a family of robust Automatic Speech Recognition (ASR) models trained on 680,000 hours of audio. However, its encoder-decoder architecture, trained with a sequence-to-sequence objective, lacks native support for streaming ASR. In this paper, we fine-tune Whisper for streaming ASR using the WeNet toolkit by adopting a Unified Two-pass (U2) structure. We introduce an additional Conn…
▽ More
OpenAI Whisper is a family of robust Automatic Speech Recognition (ASR) models trained on 680,000 hours of audio. However, its encoder-decoder architecture, trained with a sequence-to-sequence objective, lacks native support for streaming ASR. In this paper, we fine-tune Whisper for streaming ASR using the WeNet toolkit by adopting a Unified Two-pass (U2) structure. We introduce an additional Connectionist Temporal Classification (CTC) decoder trained with causal attention masks to generate streaming partial transcripts, while the original Whisper decoder reranks these partial outputs. Our experiments on LibriSpeech and an earnings call dataset demonstrate that, with adequate fine-tuning data, Whisper can be adapted into a capable streaming ASR model. We also introduce a hybrid tokenizer approach, which uses a smaller token space for the CTC decoder while retaining Whisper's original token space for the attention decoder, resulting in improved data efficiency and generalization.
△ Less
Submitted 13 June, 2025;
originally announced June 2025.
-
Olfactory Inertial Odometry: Sensor Calibration and Drift Compensation
Authors:
Kordel K. France,
Ovidiu Daescu,
Anirban Paul,
Shalini Prasad
Abstract:
Visual inertial odometry (VIO) is a process for fusing visual and kinematic data to understand a machine's state in a navigation task. Olfactory inertial odometry (OIO) is an analog to VIO that fuses signals from gas sensors with inertial data to help a robot navigate by scent. Gas dynamics and environmental factors introduce disturbances into olfactory navigation tasks that can make OIO difficult…
▽ More
Visual inertial odometry (VIO) is a process for fusing visual and kinematic data to understand a machine's state in a navigation task. Olfactory inertial odometry (OIO) is an analog to VIO that fuses signals from gas sensors with inertial data to help a robot navigate by scent. Gas dynamics and environmental factors introduce disturbances into olfactory navigation tasks that can make OIO difficult to facilitate. With our work here, we define a process for calibrating a robot for OIO that generalizes to several olfaction sensor types. Our focus is specifically on calibrating OIO for centimeter-level accuracy in localizing an odor source on a slow-moving robot platform to demonstrate use cases in robotic surgery and touchless security screening. We demonstrate our process for OIO calibration on a real robotic arm and show how this calibration improves performance over a cold-start olfactory navigation task.
△ Less
Submitted 4 June, 2025;
originally announced June 2025.
-
When Two LLMs Debate, Both Think They'll Win
Authors:
Pradyumna Shyama Prasad,
Minh Nhat Nguyen
Abstract:
Can LLMs accurately adjust their confidence when facing opposition? Building on previous studies measuring calibration on static fact-based question-answering tasks, we evaluate Large Language Models (LLMs) in a dynamic, adversarial debate setting, uniquely combining two realistic factors: (a) a multi-turn format requiring models to update beliefs as new information emerges, and (b) a zero-sum str…
▽ More
Can LLMs accurately adjust their confidence when facing opposition? Building on previous studies measuring calibration on static fact-based question-answering tasks, we evaluate Large Language Models (LLMs) in a dynamic, adversarial debate setting, uniquely combining two realistic factors: (a) a multi-turn format requiring models to update beliefs as new information emerges, and (b) a zero-sum structure to control for task-related uncertainty, since mutual high-confidence claims imply systematic overconfidence. We organized 60 three-round policy debates among ten state-of-the-art LLMs, with models privately rating their confidence (0-100) in winning after each round. We observed five concerning patterns: (1) Systematic overconfidence: models began debates with average initial confidence of 72.9% vs. a rational 50% baseline. (2) Confidence escalation: rather than reducing confidence as debates progressed, debaters increased their win probabilities, averaging 83% by the final round. (3) Mutual overestimation: in 61.7% of debates, both sides simultaneously claimed >=75% probability of victory, a logical impossibility. (4) Persistent self-debate bias: models debating identical copies increased confidence from 64.1% to 75.2%; even when explicitly informed their chance of winning was exactly 50%, confidence still rose (from 50.0% to 57.1%). (5) Misaligned private reasoning: models' private scratchpad thoughts sometimes differed from their public confidence ratings, raising concerns about faithfulness of chain-of-thought reasoning. These results suggest LLMs lack the ability to accurately self-assess or update their beliefs in dynamic, multi-turn tasks; a major concern as LLMs are now increasingly deployed without careful review in assistant and agentic roles.
Code for our experiments is available at https://github.com/pradyuprasad/llms_overconfidence
△ Less
Submitted 9 June, 2025; v1 submitted 25 May, 2025;
originally announced May 2025.
-
Revenue-Optimal Efficient Mechanism Design with General Type Spaces
Authors:
Siddharth Prasad,
Maria-Florina Balcan,
Tuomas Sandholm
Abstract:
We derive the revenue-optimal efficient (welfare-maximizing) mechanism in a general multidimensional mechanism design setting when type spaces -- that is, the underlying domains from which agents' values come from -- can capture arbitrarily complex informational constraints about the agents. Type spaces can encode information about agents representing, for example, machine learning predictions of…
▽ More
We derive the revenue-optimal efficient (welfare-maximizing) mechanism in a general multidimensional mechanism design setting when type spaces -- that is, the underlying domains from which agents' values come from -- can capture arbitrarily complex informational constraints about the agents. Type spaces can encode information about agents representing, for example, machine learning predictions of agent behavior, institutional knowledge about feasible market outcomes (such as item substitutability or complementarity in auctions), and correlations between multiple agents. Prior work has only dealt with connected type spaces, which are not expressive enough to capture many natural kinds of constraints such as disjunctive constraints. We provide two characterizations of the optimal mechanism based on allocations and connected components; both make use of an underlying network flow structure to the mechanism design. Our results significantly generalize and improve the prior state of the art in revenue-optimal efficient mechanism design. They also considerably expand the scope of what forms of agent information can be expressed and used to improve revenue.
△ Less
Submitted 19 May, 2025;
originally announced May 2025.
-
Weakest Bidder Types and New Core-Selecting Combinatorial Auctions
Authors:
Siddharth Prasad,
Maria-Florina Balcan,
Tuomas Sandholm
Abstract:
Core-selecting combinatorial auctions are popular auction designs that constrain prices to eliminate the incentive for any group of bidders -- with the seller -- to renegotiate for a better deal. They help overcome the low-revenue issues of classical combinatorial auctions. We introduce a new class of core-selecting combinatorial auctions that leverage bidder information available to the auction d…
▽ More
Core-selecting combinatorial auctions are popular auction designs that constrain prices to eliminate the incentive for any group of bidders -- with the seller -- to renegotiate for a better deal. They help overcome the low-revenue issues of classical combinatorial auctions. We introduce a new class of core-selecting combinatorial auctions that leverage bidder information available to the auction designer. We model such information through constraints on the joint type space of the bidders -- these are constraints on bidders' private valuations that are known to hold by the auction designer before bids are elicited. First, we show that type space information can overcome the well-known impossibility of incentive-compatible core-selecting combinatorial auctions. We present a revised and generalized version of that impossibility result that depends on how much information is conveyed by the type spaces. We then devise a new family of core-selecting combinatorial auctions and show that they minimize the sum of bidders' incentives to deviate from truthful bidding. We develop new constraint generation techniques -- and build upon existing quadratic programming techniques -- to compute core prices, and conduct experiments to evaluate the incentive, revenue, fairness, and computational merits of our new auctions. Our new core-selecting auctions directly improve upon existing designs that have been used in many high-stakes auctions around the world. We envision that they will be a useful addition to any auction designer's toolkit.
△ Less
Submitted 19 May, 2025;
originally announced May 2025.
-
Know Or Not: a library for evaluating out-of-knowledge base robustness
Authors:
Jessica Foo,
Pradyumna Shyama Prasad,
Shaun Khoo
Abstract:
While the capabilities of large language models (LLMs) have progressed significantly, their use in high-stakes applications have been limited due to risks of hallucination. One key approach in reducing hallucination is retrieval-augmented generation (RAG), but even in such setups, LLMs may still hallucinate when presented with questions outside of the knowledge base. Such behavior is unacceptable…
▽ More
While the capabilities of large language models (LLMs) have progressed significantly, their use in high-stakes applications have been limited due to risks of hallucination. One key approach in reducing hallucination is retrieval-augmented generation (RAG), but even in such setups, LLMs may still hallucinate when presented with questions outside of the knowledge base. Such behavior is unacceptable in high-stake applications where LLMs are expected to abstain from answering queries it does not have sufficient context on. In this work, we present a novel methodology for systematically evaluating out-of-knowledge base (OOKB) robustness of LLMs (whether LLMs know or do not know) in the RAG setting, without the need for manual annotation of gold standard answers. We implement our methodology in knowornot, an open-source library that enables users to develop their own customized evaluation data and pipelines for OOKB robustness. knowornot comprises four main features. Firstly, it provides a unified, high-level API that streamlines the process of setting up and running robustness benchmarks. Secondly, its modular architecture emphasizes extensibility and flexibility, allowing users to easily integrate their own LLM clients and RAG settings. Thirdly, its rigorous data modeling design ensures experiment reproducibility, reliability and traceability. Lastly, it implements a comprehensive suite of tools for users to customize their pipelines. We demonstrate the utility of knowornot by developing a challenging benchmark, PolicyBench, which spans four Question-Answer (QA) chatbots on government policies, and analyze its OOKB robustness. The source code of knowornot is available https://github.com/govtech-responsibleai/KnowOrNot.
△ Less
Submitted 18 May, 2025;
originally announced May 2025.
-
MXDOTP: A RISC-V ISA Extension for Enabling Microscaling (MX) Floating-Point Dot Products
Authors:
Gamze İslamoğlu,
Luca Bertaccini,
Arpan Suravi Prasad,
Francesco Conti,
Angelo Garofalo,
Luca Benini
Abstract:
Fast and energy-efficient low-bitwidth floating-point (FP) arithmetic is essential for Artificial Intelligence (AI) systems. Microscaling (MX) standardized formats have recently emerged as a promising alternative to baseline low-bitwidth FP formats, offering improved accuracy with a block-wise shared exponent scale combined with per-element values. However, efficiently executing the key linear alg…
▽ More
Fast and energy-efficient low-bitwidth floating-point (FP) arithmetic is essential for Artificial Intelligence (AI) systems. Microscaling (MX) standardized formats have recently emerged as a promising alternative to baseline low-bitwidth FP formats, offering improved accuracy with a block-wise shared exponent scale combined with per-element values. However, efficiently executing the key linear algebra primitives for AI applications on MX formats requires specialized hardware support for the fundamental operators such as scaled dot product. In this work, we propose MXDOTP, the first RISC-V ISA extension for MX dot products, focusing on the 8-bit MXFP8 FP format. We extend the open-source Snitch RISC-V core with a dedicated MXFP8 dot product-accumulate unit, which fully consumes blocks of eight 8-bit operands packed into 64-bit inputs. To feed MXDOTP at full utilization with four operands per cycle, including block scales, we exploit Snitch's Stream Semantic Registers (SSRs), achieving up to 80% utilization with minimal impact on the Snitch core's architecture and no modification to the register file. Implemented in 12 nm FinFET, a cluster with eight MXDOTP-extended cores reaches up to 356 GFLOPS/W when computing MXFP8 matrix multiplications at 0.8 V, 1 GHz. Compared to a software baseline, where MX dot products are computed by type casting FP8 inputs to FP32 for higher accumulation precision and applying explicit block scaling, the cluster achieves 25x speedup and 12.5x better energy efficiency at a minimal 5.1% area increase.
△ Less
Submitted 19 May, 2025;
originally announced May 2025.
-
A Sensor Agnostic Domain Generalization Framework for Leveraging Geospatial Foundation Models: Enhancing Semantic Segmentation viaSynergistic Pseudo-Labeling and Generative Learning
Authors:
Anan Yaghmour,
Melba M. Crawford,
Saurabh Prasad
Abstract:
Remote sensing enables a wide range of critical applications such as land cover and land use mapping, crop yield prediction, and environmental monitoring. Advances in satellite technology have expanded remote sensing datasets, yet high-performance segmentation models remain dependent on extensive labeled data, challenged by annotation scarcity and variability across sensors, illumination, and geog…
▽ More
Remote sensing enables a wide range of critical applications such as land cover and land use mapping, crop yield prediction, and environmental monitoring. Advances in satellite technology have expanded remote sensing datasets, yet high-performance segmentation models remain dependent on extensive labeled data, challenged by annotation scarcity and variability across sensors, illumination, and geography. Domain adaptation offers a promising solution to improve model generalization. This paper introduces a domain generalization approach to leveraging emerging geospatial foundation models by combining soft-alignment pseudo-labeling with source-to-target generative pre-training. We further provide new mathematical insights into MAE-based generative learning for domain-invariant feature learning. Experiments with hyperspectral and multispectral remote sensing datasets confirm our method's effectiveness in enhancing adaptability and segmentation.
△ Less
Submitted 2 May, 2025;
originally announced May 2025.
-
Study on Downlink CSI compression: Are Neural Networks the Only Solution?
Authors:
K. Sai Praneeth,
Anil Kumar Yerrapragada,
Achyuth Sagireddi,
Sai Prasad,
Radha Krishna Ganti
Abstract:
Massive Multi Input Multi Output (MIMO) systems enable higher data rates in the downlink (DL) with spatial multiplexing achieved by forming narrow beams. The higher DL data rates are achieved by effective implementation of spatial multiplexing and beamforming which is subject to availability of DL channel state information (CSI) at the base station. For Frequency Division Duplexing (FDD) systems,…
▽ More
Massive Multi Input Multi Output (MIMO) systems enable higher data rates in the downlink (DL) with spatial multiplexing achieved by forming narrow beams. The higher DL data rates are achieved by effective implementation of spatial multiplexing and beamforming which is subject to availability of DL channel state information (CSI) at the base station. For Frequency Division Duplexing (FDD) systems, the DL CSI has to be transmitted by User Equipment (UE) to the gNB and it constitutes a significant overhead which scales with the number of transmitter antennas and the granularity of the CSI. To address the overhead issue, AI/ML methods using auto-encoders have been investigated, where an encoder neural network model at the UE compresses the CSI and a decoder neural network model at the gNB reconstructs it. However, the use of AI/ML methods has a number of challenges related to (1) model complexity, (2) model generalization across channel scenarios and (3) inter-vendor compatibility of the two sides of the model. In this work, we investigate a more traditional dimensionality reduction method that uses Principal Component Analysis (PCA) and therefore does not suffer from the above challenges. Simulation results show that PCA based CSI compression actually achieves comparable reconstruction performance to commonly used deep neural networks based models.
△ Less
Submitted 10 February, 2025;
originally announced February 2025.
-
Scalable Higher Resolution Polar Sea Ice Classification and Freeboard Calculation from ICESat-2 ATL03 Data
Authors:
Jurdana Masuma Iqrah,
Younghyun Koo,
Wei Wang,
Hongjie Xie,
Sushil K. Prasad
Abstract:
ICESat-2 (IS2) by NASA is an Earth-observing satellite that measures high-resolution surface elevation. The IS2's ATL07 and ATL10 sea ice elevation and freeboard products of 10m-200m segments which aggregated 150 signal photons from the raw ATL03 (geolocated photon) data. These aggregated products can potentially overestimate local sea surface height, thus underestimating the calculations of freeb…
▽ More
ICESat-2 (IS2) by NASA is an Earth-observing satellite that measures high-resolution surface elevation. The IS2's ATL07 and ATL10 sea ice elevation and freeboard products of 10m-200m segments which aggregated 150 signal photons from the raw ATL03 (geolocated photon) data. These aggregated products can potentially overestimate local sea surface height, thus underestimating the calculations of freeboard (sea ice height above sea surface). To achieve a higher resolution of sea surface height and freeboard information, in this work we utilize a 2m window to resample the ATL03 data. Then, we classify these 2m segments into thick sea ice, thin ice, and open water using deep learning methods (Long short-term memory and Multi-layer perceptron models). To obtain labeled training data for our deep learning models, we use segmented Sentinel-2 (S2) multi-spectral imagery overlapping with IS2 tracks in space and time to auto-label IS2 data, followed by some manual corrections in the regions of transition between different ice/water types or cloudy regions. We employ a parallel workflow for this auto-labeling using PySpark to scale, and we achieve 9-fold data loading and 16.25-fold map-reduce speedup. To train our models, we employ a Horovod-based distributed deep-learning workflow on a DGX A100 8 GPU cluster, achieving a 7.25-fold speedup. Next, we calculate the local sea surface heights based on the open water segments. Finally, we scale the freeboard calculation using the derived local sea level and achieve 8.54-fold data loading and 15.7-fold map-reduce speedup. Compared with the ATL07 (local sea level) and ATL10 (freeboard) data products, our results show higher resolutions and accuracy (96.56%).
△ Less
Submitted 4 February, 2025;
originally announced February 2025.
-
Random Adaptive Cache Placement Policy
Authors:
Vrushank Ahire,
Pranav Menon,
Aniruddh Muley,
Abhinandan S. Prasad
Abstract:
This paper presents a new hybrid cache replacement algorithm that combines random allocation with a modified V-Way cache implementation. Our RAC adapts to complex cache access patterns and optimizes cache usage by improving the utilization of cache sets, unlike traditional cache policies. The algorithm utilizes a 16-way set-associative cache with 2048 sets, incorporating dynamic allocation and fle…
▽ More
This paper presents a new hybrid cache replacement algorithm that combines random allocation with a modified V-Way cache implementation. Our RAC adapts to complex cache access patterns and optimizes cache usage by improving the utilization of cache sets, unlike traditional cache policies. The algorithm utilizes a 16-way set-associative cache with 2048 sets, incorporating dynamic allocation and flexible tag management. RAC extends the V-Way cache design and its variants by optimizing tag and data storage for enhanced efficiency.
We evaluated the algorithm using the ChampSim simulator with four diverse benchmark traces and observed significant improvements in cache hit rates up to 80.82% hit rate. Although the improvements in the instructions per cycle (IPC) were moderate, our findings emphasize the algorithm's potential to enhance cache utilization and reduce memory access times.
△ Less
Submitted 4 February, 2025;
originally announced February 2025.
-
ScooterLab: A Programmable and Participatory Sensing Research Testbed using Micromobility Vehicles
Authors:
Ubaidullah Khan,
Raveen Wijewickrama,
Buddhi Ashan M. K.,
A. H. M. Nazmus Sakib,
Khoi Trinh,
Christina Duthie,
Nima Najafian,
Ahmer Patel,
R. N. Molina,
Anindya Maiti,
Sushil K. Prasad,
Greg P. Griffin,
Murtuza Jadliwala
Abstract:
Micromobility vehicles, such as e-scooters, are increasingly popular in urban communities but present significant challenges in terms of road safety, user privacy, infrastructure planning, and civil engineering. Addressing these critical issues requires a large-scale and easily accessible research infrastructure to collect diverse mobility and contextual data from micromobility users in realistic…
▽ More
Micromobility vehicles, such as e-scooters, are increasingly popular in urban communities but present significant challenges in terms of road safety, user privacy, infrastructure planning, and civil engineering. Addressing these critical issues requires a large-scale and easily accessible research infrastructure to collect diverse mobility and contextual data from micromobility users in realistic settings. To this end, we present ScooterLab, a community research testbed comprising a fleet of customizable battery-powered micromobility vehicles retrofitted with advanced sensing, communication, and control capabilities. ScooterLab enables interdisciplinary research at the intersection of computing, mobility, and urban planning by providing researchers with tools to design and deploy customized sensing experiments and access curated datasets. The testbed will enable advances in machine learning, privacy, and urban transportation research while promoting sustainable mobility.
△ Less
Submitted 10 January, 2025;
originally announced January 2025.
-
Optimizing Multitask Industrial Processes with Predictive Action Guidance
Authors:
Naval Kishore Mehta,
Arvind,
Shyam Sunder Prasad,
Sumeet Saurav,
Sanjay Singh
Abstract:
Monitoring complex assembly processes is critical for maintaining productivity and ensuring compliance with assembly standards. However, variability in human actions and subjective task preferences complicate accurate task anticipation and guidance. To address these challenges, we introduce the Multi-Modal Transformer Fusion and Recurrent Units (MMTFRU) Network for egocentric activity anticipation…
▽ More
Monitoring complex assembly processes is critical for maintaining productivity and ensuring compliance with assembly standards. However, variability in human actions and subjective task preferences complicate accurate task anticipation and guidance. To address these challenges, we introduce the Multi-Modal Transformer Fusion and Recurrent Units (MMTFRU) Network for egocentric activity anticipation, utilizing multimodal fusion to improve prediction accuracy. Integrated with the Operator Action Monitoring Unit (OAMU), the system provides proactive operator guidance, preventing deviations in the assembly process. OAMU employs two strategies: (1) Top-5 MMTF-RU predictions, combined with a reference graph and an action dictionary, for next-step recommendations; and (2) Top-1 MMTF-RU predictions, integrated with a reference graph, for detecting sequence deviations and predicting anomaly scores via an entropy-informed confidence mechanism. We also introduce Time-Weighted Sequence Accuracy (TWSA) to evaluate operator efficiency and ensure timely task completion. Our approach is validated on the industrial Meccano dataset and the largescale EPIC-Kitchens-55 dataset, demonstrating its effectiveness in dynamic environments.
△ Less
Submitted 9 January, 2025;
originally announced January 2025.
-
Hybrid deep convolution model for lung cancer detection with transfer learning
Authors:
Sugandha Saxena,
S. N. Prasad,
Ashwin M Polnaya,
Shweta Agarwala
Abstract:
Advances in healthcare research have significantly enhanced our understanding of disease mechanisms, diagnostic precision, and therapeutic options. Yet, lung cancer remains one of the leading causes of cancer-related mortality worldwide due to challenges in early and accurate diagnosis. While current lung cancer detection models show promise, there is considerable potential for further improving t…
▽ More
Advances in healthcare research have significantly enhanced our understanding of disease mechanisms, diagnostic precision, and therapeutic options. Yet, lung cancer remains one of the leading causes of cancer-related mortality worldwide due to challenges in early and accurate diagnosis. While current lung cancer detection models show promise, there is considerable potential for further improving the accuracy for timely intervention. To address this challenge, we introduce a hybrid deep convolution model leveraging transfer learning, named the Maximum Sensitivity Neural Network (MSNN). MSNN is designed to improve the precision of lung cancer detection by refining sensitivity and specificity. This model has surpassed existing deep learning approaches through experimental validation, achieving an accuracy of 98% and a sensitivity of 97%. By overlaying sensitivity maps onto lung Computed Tomography (CT) scans, it enables the visualization of regions most indicative of malignant or benign classifications. This innovative method demonstrates exceptional performance in distinguishing lung cancer with minimal false positives, thereby enhancing the accuracy of medical diagnoses.
△ Less
Submitted 5 June, 2025; v1 submitted 6 January, 2025;
originally announced January 2025.
-
Foundation Models and Adaptive Feature Selection: A Synergistic Approach to Video Question Answering
Authors:
Sai Bhargav Rongali,
Mohamad Hassan N C,
Ankit Jha,
Neha Bhargava,
Saurabh Prasad,
Biplab Banerjee
Abstract:
This paper tackles the intricate challenge of video question-answering (VideoQA). Despite notable progress, current methods fall short of effectively integrating questions with video frames and semantic object-level abstractions to create question-aware video representations. We introduce Local-Global Question Aware Video Embedding (LGQAVE), which incorporates three major innovations to integrate…
▽ More
This paper tackles the intricate challenge of video question-answering (VideoQA). Despite notable progress, current methods fall short of effectively integrating questions with video frames and semantic object-level abstractions to create question-aware video representations. We introduce Local-Global Question Aware Video Embedding (LGQAVE), which incorporates three major innovations to integrate multi-modal knowledge better and emphasize semantic visual concepts relevant to specific questions. LGQAVE moves beyond traditional ad-hoc frame sampling by utilizing a cross-attention mechanism that precisely identifies the most relevant frames concerning the questions. It captures the dynamics of objects within these frames using distinct graphs, grounding them in question semantics with the miniGPT model. These graphs are processed by a question-aware dynamic graph transformer (Q-DGT), which refines the outputs to develop nuanced global and local video representations. An additional cross-attention module integrates these local and global embeddings to generate the final video embeddings, which a language model uses to generate answers. Extensive evaluations across multiple benchmarks demonstrate that LGQAVE significantly outperforms existing models in delivering accurate multi-choice and open-ended answers.
△ Less
Submitted 12 December, 2024;
originally announced December 2024.
-
Grounded Language Design for Lightweight Diagramming for Formal Methods
Authors:
Siddhartha Prasad,
Ben Greenman,
Tim Nelson,
Shriram Krishnamurthi
Abstract:
Model finding, as embodied by SAT solvers and similar tools, is used widely, both in embedding settings and as a tool in its own right. For instance, tools like Alloy target SAT to enable users to incrementally define, explore, verify, and diagnose sophisticated specifications for a large number of complex systems.
These tools critically include a visualizer that lets users graphically explore t…
▽ More
Model finding, as embodied by SAT solvers and similar tools, is used widely, both in embedding settings and as a tool in its own right. For instance, tools like Alloy target SAT to enable users to incrementally define, explore, verify, and diagnose sophisticated specifications for a large number of complex systems.
These tools critically include a visualizer that lets users graphically explore these generated models. As we show, however, default visualizers, which know nothing about the domain, are unhelpful and even actively violate presentational and cognitive principles. At the other extreme, full-blown visualizations require significant effort as well as knowledge a specifier might not possess; they can also exhibit bad failure modes (including silent failure). Instead, we need a language to capture essential domain information for lightweight diagramming. We ground our language design in both the cognitive science literature on diagrams and on a large number of example custom visualizations. This identifies the key elements of lightweight diagrams. We distill these into a small set of orthogonal primitives. We extend an Alloy-like tool to support these primitives. We evaluate the effectiveness of the produced diagrams, finding them good for reasoning. We then compare this against many other drawing languages and tools to show that this work defines a new niche that is lightweight, effective, and driven by sound principles.
△ Less
Submitted 4 December, 2024;
originally announced December 2024.
-
Phase-Informed Tool Segmentation for Manual Small-Incision Cataract Surgery
Authors:
Bhuvan Sachdeva,
Naren Akash,
Tajamul Ashraf,
Simon Mueller,
Thomas Schultz,
Maximilian W. M. Wintergerst,
Niharika Singri Prasad,
Kaushik Murali,
Mohit Jain
Abstract:
Cataract surgery is the most common surgical procedure globally, with a disproportionately higher burden in developing countries. While automated surgical video analysis has been explored in general surgery, its application to ophthalmic procedures remains limited. Existing works primarily focus on Phaco cataract surgery, an expensive technique not accessible in regions where cataract treatment is…
▽ More
Cataract surgery is the most common surgical procedure globally, with a disproportionately higher burden in developing countries. While automated surgical video analysis has been explored in general surgery, its application to ophthalmic procedures remains limited. Existing works primarily focus on Phaco cataract surgery, an expensive technique not accessible in regions where cataract treatment is most needed. In contrast, Manual Small-Incision Cataract Surgery (MSICS) is the preferred low-cost, faster alternative in high-volume settings and for challenging cases. However, no dataset exists for MSICS. To address this gap, we introduce Sankara-MSICS, the first comprehensive dataset containing 53 surgical videos annotated for 18 surgical phases and 3,527 frames with 13 surgical tools at the pixel level. We benchmark this dataset on state-of-the-art models and present ToolSeg, a novel framework that enhances tool segmentation by introducing a phase-conditional decoder and a simple yet effective semi-supervised setup leveraging pseudo-labels from foundation models. Our approach significantly improves segmentation performance, achieving a $23.77\%$ to $38.10\%$ increase in mean Dice scores, with a notable boost for tools that are less prevalent and small. Furthermore, we demonstrate that ToolSeg generalizes to other surgical settings, showcasing its effectiveness on the CaDIS dataset.
△ Less
Submitted 3 December, 2024; v1 submitted 25 November, 2024;
originally announced November 2024.
-
A Novel Breast Ultrasound Image Augmentation Method Using Advanced Neural Style Transfer: An Efficient and Explainable Approach
Authors:
Lipismita Panigrahi,
Prianka Rani Saha,
Jurdana Masuma Iqrah,
Sushil Prasad
Abstract:
Clinical diagnosis of breast malignancy (BM) is a challenging problem in the recent era. In particular, Deep learning (DL) models have continued to offer important solutions for early BM diagnosis but their performance experiences overfitting due to the limited volume of breast ultrasound (BUS) image data. Further, large BUS datasets are difficult to manage due to privacy and legal concerns. Hence…
▽ More
Clinical diagnosis of breast malignancy (BM) is a challenging problem in the recent era. In particular, Deep learning (DL) models have continued to offer important solutions for early BM diagnosis but their performance experiences overfitting due to the limited volume of breast ultrasound (BUS) image data. Further, large BUS datasets are difficult to manage due to privacy and legal concerns. Hence, image augmentation is a necessary and challenging step to improve the performance of the DL models. However, the current DL-based augmentation models are inadequate and operate as a black box resulting lack of information and justifications about their suitability and efficacy. Additionally, pre and post-augmentation need high-performance computational resources and time to produce the augmented image and evaluate the model performance. Thus, this study aims to develop a novel efficient augmentation approach for BUS images with advanced neural style transfer (NST) and Explainable AI (XAI) harnessing GPU-based parallel infrastructure. We scale and distribute the training of the augmentation model across 8 GPUs using the Horovod framework on a DGX cluster, achieving a 5.09 speedup while maintaining the model's accuracy. The proposed model is evaluated on 800 (348 benign and 452 malignant) BUS images and its performance is analyzed with other progressive techniques, using different quantitative analyses. The result indicates that the proposed approach can successfully augment the BUS images with 92.47% accuracy.
△ Less
Submitted 31 October, 2024;
originally announced November 2024.
-
On the Application of Deep Learning for Precise Indoor Positioning in 6G
Authors:
Sai Prasanth Kotturi,
Anil Kumar Yerrapragada,
Sai Prasad,
Radha Krishna Ganti
Abstract:
Accurate localization in indoor environments is a challenge due to the Non Line of Sight (NLoS) nature of the signaling. In this paper, we explore the use of AI/ML techniques for positioning accuracy enhancement in Indoor Factory (InF) scenarios. The proposed neural network, which we term LocNet, is trained on measurements such as Channel Impulse Response (CIR) and Reference Signal Received Power…
▽ More
Accurate localization in indoor environments is a challenge due to the Non Line of Sight (NLoS) nature of the signaling. In this paper, we explore the use of AI/ML techniques for positioning accuracy enhancement in Indoor Factory (InF) scenarios. The proposed neural network, which we term LocNet, is trained on measurements such as Channel Impulse Response (CIR) and Reference Signal Received Power (RSRP) from multiple Transmit Receive Points (TRPs). Simulation results show that when using measurements from 18 TRPs, LocNet achieves a 9 cm positioning accuracy at the 90th percentile. Additionally, we demonstrate that the same model generalizes effectively even when measurements from some TRPs randomly become unavailable. Lastly, we provide insights on the robustness of the trained model to the errors in ground truth labels used for training.
△ Less
Submitted 25 October, 2024;
originally announced October 2024.
-
Characterizing Robocalls with Multiple Vantage Points
Authors:
Sathvik Prasad,
Aleksandr Nahapetyan,
Bradley Reaves
Abstract:
Telephone spam has been among the highest network security concerns for users for many years. In response, industry and government have deployed new technologies and regulations to curb the problem, and academic and industry researchers have provided methods and measurements to characterize robocalls. Have these efforts borne fruit? Are the research characterizations reliable, and have the prevent…
▽ More
Telephone spam has been among the highest network security concerns for users for many years. In response, industry and government have deployed new technologies and regulations to curb the problem, and academic and industry researchers have provided methods and measurements to characterize robocalls. Have these efforts borne fruit? Are the research characterizations reliable, and have the prevention and deterrence mechanisms succeeded?
In this paper, we address these questions through analysis of data from several independently-operated vantage points, ranging from industry and academic voice honeypots to public enforcement and consumer complaints, some with over 5 years of historic data. We first describe how we address the non-trivial methodological challenges of comparing disparate data sources, including comparing audio and transcripts from about 3 million voice calls. We also detail the substantial coherency of these diverse perspectives, which dramatically strengthens the evidence for the conclusions we draw about robocall characterization and mitigation while highlighting advantages of each approach. Among our many findings, we find that unsolicited calls are in slow decline, though complaints and call volumes remain high. We also find that robocallers have managed to adapt to STIR/SHAKEN, a mandatory call authentication scheme. In total, our findings highlight the most promising directions for future efforts to characterize and stop telephone spam.
△ Less
Submitted 22 October, 2024;
originally announced October 2024.
-
PiLocNet: Physics-informed neural network on 3D localization with rotating point spread function
Authors:
Mingda Lu,
Zitian Ao,
Chao Wang,
Sudhakar Prasad,
Raymond H. Chan
Abstract:
For the 3D localization problem using point spread function (PSF) engineering, we propose a novel enhancement of our previously introduced localization neural network, LocNet. The improved network is a physics-informed neural network (PINN) that we call PiLocNet. Previous works on the localization problem may be categorized separately into model-based optimization and neural network approaches. Ou…
▽ More
For the 3D localization problem using point spread function (PSF) engineering, we propose a novel enhancement of our previously introduced localization neural network, LocNet. The improved network is a physics-informed neural network (PINN) that we call PiLocNet. Previous works on the localization problem may be categorized separately into model-based optimization and neural network approaches. Our PiLocNet combines the unique strengths of both approaches by incorporating forward-model-based information into the network via a data-fitting loss term that constrains the neural network to yield results that are physically sensible. We additionally incorporate certain regularization terms from the variational method, which further improves the robustness of the network in the presence of image noise, as we show for the Poisson and Gaussian noise models. This framework accords interpretability to the neural network, and the results we obtain show its superiority. Although the paper focuses on the use of single-lobe rotating PSF to encode the full 3D source location, we expect the method to be widely applicable to other PSFs and imaging problems that are constrained by known forward processes.
△ Less
Submitted 9 February, 2025; v1 submitted 17 October, 2024;
originally announced October 2024.
-
Investigation of Hierarchical Spectral Vision Transformer Architecture for Classification of Hyperspectral Imagery
Authors:
Wei Liu,
Saurabh Prasad,
Melba Crawford
Abstract:
In the past three years, there has been significant interest in hyperspectral imagery (HSI) classification using vision Transformers for analysis of remotely sensed data. Previous research predominantly focused on the empirical integration of convolutional neural networks (CNNs) to augment the network's capability to extract local feature information. Yet, the theoretical justification for vision…
▽ More
In the past three years, there has been significant interest in hyperspectral imagery (HSI) classification using vision Transformers for analysis of remotely sensed data. Previous research predominantly focused on the empirical integration of convolutional neural networks (CNNs) to augment the network's capability to extract local feature information. Yet, the theoretical justification for vision Transformers out-performing CNN architectures in HSI classification remains a question. To address this issue, a unified hierarchical spectral vision Transformer architecture, specifically tailored for HSI classification, is investigated. In this streamlined yet effective vision Transformer architecture, multiple mixer modules are strategically integrated separately. These include the CNN-mixer, which executes convolution operations; the spatial self-attention (SSA)-mixer and channel self-attention (CSA)-mixer, both of which are adaptations of classical self-attention blocks; and hybrid models such as the SSA+CNN-mixer and CSA+CNN-mixer, which merge convolution with self-attention operations. This integration facilitates the development of a broad spectrum of vision Transformer-based models tailored for HSI classification. In terms of the training process, a comprehensive analysis is performed, contrasting classical CNN models and vision Transformer-based counterparts, with particular attention to disturbance robustness and the distribution of the largest eigenvalue of the Hessian. From the evaluations conducted on various mixer models rooted in the unified architecture, it is concluded that the unique strength of vision Transformers can be attributed to their overarching architecture, rather than being exclusively reliant on individual multi-head self-attention (MSA) components.
△ Less
Submitted 13 September, 2024;
originally announced September 2024.
-
Jäger: Automated Telephone Call Traceback
Authors:
David Adei,
Varun Madathil,
Sathvik Prasad,
Bradley Reaves,
Alessandra Scafuro
Abstract:
Unsolicited telephone calls that facilitate fraud or unlawful telemarketing continue to overwhelm network users and the regulators who prosecute them. The first step in prosecuting phone abuse is traceback -- identifying the call originator. This fundamental investigative task currently requires hours of manual effort per call. In this paper, we introduce Jäger, a distributed secure call traceback…
▽ More
Unsolicited telephone calls that facilitate fraud or unlawful telemarketing continue to overwhelm network users and the regulators who prosecute them. The first step in prosecuting phone abuse is traceback -- identifying the call originator. This fundamental investigative task currently requires hours of manual effort per call. In this paper, we introduce Jäger, a distributed secure call traceback system. Jäger can trace a call in a few seconds, even with partial deployment, while cryptographically preserving the privacy of call parties, carrier trade secrets like peers and call volume, and limiting the threat of bulk analysis. We establish definitions and requirements of secure traceback, then develop a suite of protocols that meet these requirements using witness encryption, oblivious pseudorandom functions, and group signatures. We prove these protocols secure in the universal composibility framework. We then demonstrate that Jäger has low compute and bandwidth costs per call, and these costs scale linearly with call volume. Jäger provides an efficient, secure, privacy-preserving system to revolutionize telephone abuse investigation with minimal costs to operators.
△ Less
Submitted 17 September, 2024; v1 submitted 4 September, 2024;
originally announced September 2024.
-
ContextQ: Generated Questions to Support Meaningful Parent-Child Dialogue While Co-Reading
Authors:
Griffin Dietz Smith,
Siddhartha Prasad,
Matt J. Davidson,
Leah Findlater,
R. Benjamin Shapiro
Abstract:
Much of early literacy education happens at home with caretakers reading books to young children. Prior research demonstrates how having dialogue with children during co-reading can develop critical reading readiness skills, but most adult readers are unsure if and how to lead effective conversations. We present ContextQ, a tablet-based reading application to unobtrusively present auto-generated d…
▽ More
Much of early literacy education happens at home with caretakers reading books to young children. Prior research demonstrates how having dialogue with children during co-reading can develop critical reading readiness skills, but most adult readers are unsure if and how to lead effective conversations. We present ContextQ, a tablet-based reading application to unobtrusively present auto-generated dialogic questions to caretakers to support this dialogic reading practice. An ablation study demonstrates how our method of encoding educator expertise into the question generation pipeline can produce high-quality output; and through a user study with 12 parent-child dyads (child age: 4-6), we demonstrate that this system can serve as a guide for parents in leading contextually meaningful dialogue, leading to significantly more conversational turns from both the parent and the child and deeper conversations with connections to the child's everyday life.
△ Less
Submitted 6 May, 2024;
originally announced May 2024.
-
Learning from Students: Applying t-Distributions to Explore Accurate and Efficient Formats for LLMs
Authors:
Jordan Dotzel,
Yuzong Chen,
Bahaa Kotb,
Sushma Prasad,
Gang Wu,
Sheng Li,
Mohamed S. Abdelfattah,
Zhiru Zhang
Abstract:
The increasing size of large language models (LLMs) traditionally requires low-precision integer formats to meet strict latency and power demands. Yet recently, alternative formats such as Normal Float (NF4) have increased model accuracy at the cost of increased chip area. In this work, we first conduct a large-scale analysis of LLM weights and activations across 30 networks and conclude that most…
▽ More
The increasing size of large language models (LLMs) traditionally requires low-precision integer formats to meet strict latency and power demands. Yet recently, alternative formats such as Normal Float (NF4) have increased model accuracy at the cost of increased chip area. In this work, we first conduct a large-scale analysis of LLM weights and activations across 30 networks and conclude that most distributions follow a Student's t-distribution. We then derive a new theoretically optimal format, Student Float (SF4), that improves over NF4 across modern LLMs, for example increasing the average accuracy on LLaMA2-7B by 0.76% across tasks. Using this format as a high-accuracy reference, we then propose augmenting E2M1 with two variants of supernormal support for higher model accuracy. Finally, we explore the quality and efficiency frontier across 11 datatypes by evaluating their model accuracy and hardware complexity. We discover a Pareto curve composed of INT4, E2M1, and E2M1 with supernormal support, which offers a continuous tradeoff between model accuracy and chip area. For example, E2M1 with supernormal support increases the accuracy of Phi-2 by up to 2.19% with 1.22% area overhead, enabling more LLM-based applications to be run at four bits. The supporting code is hosted at https://github.com/cornell-zhang/llm-datatypes.
△ Less
Submitted 10 June, 2024; v1 submitted 5 May, 2024;
originally announced May 2024.
-
MedPromptExtract (Medical Data Extraction Tool): Anonymization and Hi-fidelity Automated data extraction using NLP and prompt engineering
Authors:
Roomani Srivastava,
Suraj Prasad,
Lipika Bhat,
Sarvesh Deshpande,
Barnali Das,
Kshitij Jadhav
Abstract:
Introduction: The labour-intensive nature of data extraction from sources like discharge summaries (DS) poses significant obstacles to the digitisation of medical records particularly for low- and middle-income countries (LMICs). In this paper we present a completely automated method MedPromptExtract to efficiently extract data from DS while maintaining confidentiality. Methods: The source of data…
▽ More
Introduction: The labour-intensive nature of data extraction from sources like discharge summaries (DS) poses significant obstacles to the digitisation of medical records particularly for low- and middle-income countries (LMICs). In this paper we present a completely automated method MedPromptExtract to efficiently extract data from DS while maintaining confidentiality. Methods: The source of data was Discharge Summaries (DS) from Kokilaben Dhirubhai Ambani Hospital (KDAH) of patients having Acute Kidney Injury (AKI). A pre-existing tool EIGEN which leverages semi-supervised learning techniques for high-fidelity information extraction was used to anonymize the DS, Natural Language Processing (NLP) was used to extract data from regular fields. We used Prompt Engineering and Large Language Model(LLM) to extract custom clinical information from free flowing text describing the patients stay in the hospital. Twelve features associated with occurrence of AKI were extracted. The LLM responses were validated against clinicians annotations. Results: The MedPromptExtracttool first subjected DS to the anonymization pipeline which took three seconds per summary. Successful anonymization was verified by clinicians, thereafter NLP pipeline extracted structured text from the anonymized pdfs at the rate of 0.2 seconds per summary with 100% accuracy.Finally DS were analysed by the LLM pipeline using Gemini Pro for the twelve features. Accuracy metrics were calculated by comparing model responses to clinicians annotations with seven features achieving AUCs above 0.9, indicating high fidelity of the extraction process. Conclusion: MedPromptExtract serves as an automated adaptable tool for efficient data extraction from medical records with a dynamic user interface. Keywords: Digitizing Medical Records, Automated Anonymisation, Information Retrieval, Large Language Models, Prompt Engineering
△ Less
Submitted 6 September, 2024; v1 submitted 4 May, 2024;
originally announced May 2024.
-
EncodeNet: A Framework for Boosting DNN Accuracy with Entropy-driven Generalized Converting Autoencoder
Authors:
Hasanul Mahmud,
Kevin Desai,
Palden Lama,
Sushil K. Prasad
Abstract:
Image classification is a fundamental task in computer vision, and the quest to enhance DNN accuracy without inflating model size or latency remains a pressing concern. We make a couple of advances in this regard, leading to a novel EncodeNet design and training framework. The first advancement involves Converting Autoencoders, a novel approach that transforms images into an easy-to-classify image…
▽ More
Image classification is a fundamental task in computer vision, and the quest to enhance DNN accuracy without inflating model size or latency remains a pressing concern. We make a couple of advances in this regard, leading to a novel EncodeNet design and training framework. The first advancement involves Converting Autoencoders, a novel approach that transforms images into an easy-to-classify image of its class. Our prior work that applied the Converting Autoencoder and a simple classifier in tandem achieved moderate accuracy over simple datasets, such as MNIST and FMNIST. However, on more complex datasets like CIFAR-10, the Converting Autoencoder has a large reconstruction loss, making it unsuitable for enhancing DNN accuracy. To address these limitations, we generalize the design of Converting Autoencoders by leveraging a larger class of DNNs, those with architectures comprising feature extraction layers followed by classification layers. We incorporate a generalized algorithmic design of the Converting Autoencoder and intraclass clustering to identify representative images, leading to optimized image feature learning. Next, we demonstrate the effectiveness of our EncodeNet design and training framework, improving the accuracy of well-trained baseline DNNs while maintaining the overall model size. EncodeNet's building blocks comprise the trained encoder from our generalized Converting Autoencoders transferring knowledge to a lightweight classifier network - also extracted from the baseline DNN. Our experimental results demonstrate that EncodeNet improves the accuracy of VGG16 from 92.64% to 94.05% on CIFAR-10 and RestNet20 from 74.56% to 76.04% on CIFAR-100. It outperforms state-of-the-art techniques that rely on knowledge distillation and attention mechanisms, delivering higher accuracy for models of comparable size.
△ Less
Submitted 21 April, 2024;
originally announced April 2024.
-
Insights into the Lottery Ticket Hypothesis and Iterative Magnitude Pruning
Authors:
Tausifa Jan Saleem,
Ramanjit Ahuja,
Surendra Prasad,
Brejesh Lall
Abstract:
Lottery ticket hypothesis for deep neural networks emphasizes the importance of initialization used to re-train the sparser networks obtained using the iterative magnitude pruning process. An explanation for why the specific initialization proposed by the lottery ticket hypothesis tends to work better in terms of generalization (and training) performance has been lacking. Moreover, the underlying…
▽ More
Lottery ticket hypothesis for deep neural networks emphasizes the importance of initialization used to re-train the sparser networks obtained using the iterative magnitude pruning process. An explanation for why the specific initialization proposed by the lottery ticket hypothesis tends to work better in terms of generalization (and training) performance has been lacking. Moreover, the underlying principles in iterative magnitude pruning, like the pruning of smaller magnitude weights and the role of the iterative process, lack full understanding and explanation. In this work, we attempt to provide insights into these phenomena by empirically studying the volume/geometry and loss landscape characteristics of the solutions obtained at various stages of the iterative magnitude pruning process.
△ Less
Submitted 25 June, 2024; v1 submitted 22 March, 2024;
originally announced March 2024.
-
A Parallel Workflow for Polar Sea-Ice Classification using Auto-labeling of Sentinel-2 Imagery
Authors:
Jurdana Masuma Iqrah,
Wei Wang,
Hongjie Xie,
Sushil Prasad
Abstract:
The observation of the advancing and retreating pattern of polar sea ice cover stands as a vital indicator of global warming. This research aims to develop a robust, effective, and scalable system for classifying polar sea ice as thick/snow-covered, young/thin, or open water using Sentinel-2 (S2) images. Since the S2 satellite is actively capturing high-resolution imagery over the earth's surface,…
▽ More
The observation of the advancing and retreating pattern of polar sea ice cover stands as a vital indicator of global warming. This research aims to develop a robust, effective, and scalable system for classifying polar sea ice as thick/snow-covered, young/thin, or open water using Sentinel-2 (S2) images. Since the S2 satellite is actively capturing high-resolution imagery over the earth's surface, there are lots of images that need to be classified. One major obstacle is the absence of labeled S2 training data (images) to act as the ground truth. We demonstrate a scalable and accurate method for segmenting and automatically labeling S2 images using carefully determined color thresholds. We employ a parallel workflow using PySpark to scale and achieve 9-fold data loading and 16-fold map-reduce speedup on auto-labeling S2 images based on thin cloud and shadow-filtered color-based segmentation to generate label data. The auto-labeled data generated from this process are then employed to train a U-Net machine learning model, resulting in good classification accuracy. As training the U-Net classification model is computationally heavy and time-consuming, we distribute the U-Net model training to scale it over 8 GPUs using the Horovod framework over a DGX cluster with a 7.21x speedup without affecting the accuracy of the model. Using the Antarctic's Ross Sea region as an example, the U-Net model trained on auto-labeled data achieves a classification accuracy of 98.97% for auto-labeled training datasets when the thin clouds and shadows from the S2 images are filtered out.
△ Less
Submitted 19 March, 2024;
originally announced March 2024.
-
A Converting Autoencoder Toward Low-latency and Energy-efficient DNN Inference at the Edge
Authors:
Hasanul Mahmud,
Peng Kang,
Kevin Desai,
Palden Lama,
Sushil Prasad
Abstract:
Reducing inference time and energy usage while maintaining prediction accuracy has become a significant concern for deep neural networks (DNN) inference on resource-constrained edge devices. To address this problem, we propose a novel approach based on "converting" autoencoder and lightweight DNNs. This improves upon recent work such as early-exiting framework and DNN partitioning. Early-exiting f…
▽ More
Reducing inference time and energy usage while maintaining prediction accuracy has become a significant concern for deep neural networks (DNN) inference on resource-constrained edge devices. To address this problem, we propose a novel approach based on "converting" autoencoder and lightweight DNNs. This improves upon recent work such as early-exiting framework and DNN partitioning. Early-exiting frameworks spend different amounts of computation power for different input data depending upon their complexity. However, they can be inefficient in real-world scenarios that deal with many hard image samples. On the other hand, DNN partitioning algorithms that utilize the computation power of both the cloud and edge devices can be affected by network delays and intermittent connections between the cloud and the edge. We present CBNet, a low-latency and energy-efficient DNN inference framework tailored for edge devices. It utilizes a "converting" autoencoder to efficiently transform hard images into easy ones, which are subsequently processed by a lightweight DNN for inference. To the best of our knowledge, such autoencoder has not been proposed earlier. Our experimental results using three popular image-classification datasets on a Raspberry Pi 4, a Google Cloud instance, and an instance with Nvidia Tesla K80 GPU show that CBNet achieves up to 4.8x speedup in inference latency and 79% reduction in energy usage compared to competing techniques while maintaining similar or higher accuracy.
△ Less
Submitted 11 March, 2024;
originally announced March 2024.
-
Evacuation Management Framework towards Smart City-wide Intelligent Emergency Interactive Response System
Authors:
Anuj Abraham,
Yi Zhang,
Shitala Prasad
Abstract:
A smart city solution toward future 6G network deployment allows small and medium sized enterprises (SMEs), industry, and government entities to connect with the infrastructures and play a crucial role in enhancing emergency preparedness with advanced sensors. The objective of this work is to propose a set of coordinated technological solutions to transform an existing emergency response system in…
▽ More
A smart city solution toward future 6G network deployment allows small and medium sized enterprises (SMEs), industry, and government entities to connect with the infrastructures and play a crucial role in enhancing emergency preparedness with advanced sensors. The objective of this work is to propose a set of coordinated technological solutions to transform an existing emergency response system into an intelligent interactive system, thereby improving the public services and the quality of life for residents at home, on road, in hospitals, transport hubs, etc. In this context, we consider a city wide view from three different application scenes that are closely related to peoples daily life, to optimize the actions taken at relevant departments. Therefore, using artificial intelligence (AI) and machine learning (ML) techniques to enable the next generation connected vehicle experiences, we specifically focus on accidents happening in indoor households, urban roads, and at large public facilities. This smart interactive response system will benefit from advanced sensor fusion and AI by formulating a real time dynamic model.
△ Less
Submitted 7 March, 2024;
originally announced March 2024.
-
A Survey on Human-AI Teaming with Large Pre-Trained Models
Authors:
Vanshika Vats,
Marzia Binta Nizam,
Minghao Liu,
Ziyuan Wang,
Richard Ho,
Mohnish Sai Prasad,
Vincent Titterton,
Sai Venkat Malreddy,
Riya Aggarwal,
Yanwen Xu,
Lei Ding,
Jay Mehta,
Nathan Grinnell,
Li Liu,
Sijia Zhong,
Devanathan Nallur Gandamani,
Xinyi Tang,
Rohan Ghosalkar,
Celeste Shen,
Rachel Shen,
Nafisa Hussain,
Kesav Ravichandran,
James Davis
Abstract:
In the rapidly evolving landscape of artificial intelligence (AI), the collaboration between human intelligence and AI systems, known as Human-AI (HAI) Teaming, has emerged as a cornerstone for advancing problem-solving and decision-making processes. The advent of Large Pre-trained Models (LPtM) has significantly transformed this landscape, offering unprecedented capabilities by leveraging vast am…
▽ More
In the rapidly evolving landscape of artificial intelligence (AI), the collaboration between human intelligence and AI systems, known as Human-AI (HAI) Teaming, has emerged as a cornerstone for advancing problem-solving and decision-making processes. The advent of Large Pre-trained Models (LPtM) has significantly transformed this landscape, offering unprecedented capabilities by leveraging vast amounts of data to understand and predict complex patterns. This paper surveys the pivotal integration of LPtMs with HAI, emphasizing how these models enhance collaborative intelligence beyond traditional approaches. It examines the potential of LPtMs in augmenting human capabilities, discussing this collaboration for AI model improvements, effective teaming, ethical considerations, and their broad applied implications in various sectors. Through this exploration, the study sheds light on the transformative impact of LPtM-enhanced HAI Teaming, providing insights for future research, policy development, and strategic implementations aimed at harnessing the full potential of this collaboration for research and societal benefit.
△ Less
Submitted 26 June, 2024; v1 submitted 7 March, 2024;
originally announced March 2024.
-
Gaze-Vector Estimation in the Dark with Temporally Encoded Event-driven Neural Networks
Authors:
Abeer Banerjee,
Naval K. Mehta,
Shyam S. Prasad,
Himanshu,
Sumeet Saurav,
Sanjay Singh
Abstract:
In this paper, we address the intricate challenge of gaze vector prediction, a pivotal task with applications ranging from human-computer interaction to driver monitoring systems. Our innovative approach is designed for the demanding setting of extremely low-light conditions, leveraging a novel temporal event encoding scheme, and a dedicated neural network architecture. The temporal encoding metho…
▽ More
In this paper, we address the intricate challenge of gaze vector prediction, a pivotal task with applications ranging from human-computer interaction to driver monitoring systems. Our innovative approach is designed for the demanding setting of extremely low-light conditions, leveraging a novel temporal event encoding scheme, and a dedicated neural network architecture. The temporal encoding method seamlessly integrates Dynamic Vision Sensor (DVS) events with grayscale guide frames, generating consecutively encoded images for input into our neural network. This unique solution not only captures diverse gaze responses from participants within the active age group but also introduces a curated dataset tailored for low-light conditions. The encoded temporal frames paired with our network showcase impressive spatial localization and reliable gaze direction in their predictions. Achieving a remarkable 100-pixel accuracy of 100%, our research underscores the potency of our neural network to work with temporally consecutive encoded images for precise gaze vector predictions in challenging low-light videos, contributing to the advancement of gaze prediction technologies.
△ Less
Submitted 5 March, 2024;
originally announced March 2024.
-
New Sequence-Independent Lifting Techniques for Cutting Planes and When They Induce Facets
Authors:
Siddharth Prasad,
Ellen Vitercik,
Maria-Florina Balcan,
Tuomas Sandholm
Abstract:
Sequence-independent lifting is a procedure for strengthening valid inequalities of an integer program. We generalize the sequence-independent lifting method of Gu, Nemhauser, and Savelsbergh (GNS lifting) for cover inequalities and correct an error in their proposed generalization. We obtain a new sequence-independent lifting technique -- piecewise-constant (PC) lifting -- with a number of intere…
▽ More
Sequence-independent lifting is a procedure for strengthening valid inequalities of an integer program. We generalize the sequence-independent lifting method of Gu, Nemhauser, and Savelsbergh (GNS lifting) for cover inequalities and correct an error in their proposed generalization. We obtain a new sequence-independent lifting technique -- piecewise-constant (PC) lifting -- with a number of interesting properties. We derive a broad set of sufficient conditions under which PC lifting is facet defining. To our knowledge, this is the first characterization of facet-defining sequence-independent liftings that are efficiently computable from the underlying cover. Finally, we demonstrate via experiments that PC lifting can be a useful alternative to GNS lifting. We test our new lifting techniques atop a number of novel cover cut generation routines, which prove to be effective in experiments with CPLEX.
△ Less
Submitted 24 January, 2024;
originally announced January 2024.
-
Conceptual Mutation Testing for Student Programming Misconceptions
Authors:
Siddhartha Prasad,
Ben Greenman,
Tim Nelson,
Shriram Krishnamurthi
Abstract:
Context: Students often misunderstand programming problem descriptions. This can lead them to solve the wrong problem, which creates frustration, obstructs learning, and imperils grades. Researchers have found that students can be made to better understand the problem by writing examples before they start programming. These examples are checked against correct and wrong implementations -- analogou…
▽ More
Context: Students often misunderstand programming problem descriptions. This can lead them to solve the wrong problem, which creates frustration, obstructs learning, and imperils grades. Researchers have found that students can be made to better understand the problem by writing examples before they start programming. These examples are checked against correct and wrong implementations -- analogous to mutation testing -- provided by course staff. Doing so results in better student understanding of the problem as well as better test suites to accompany the program, both of which are desirable educational outcomes.
Inquiry: Producing mutant implementations requires care. If there are too many, or they are too obscure, students will end up spending a lot of time on an unproductive task and also become frustrated. Instead, we want a small number of mutants that each correspond to common problem misconceptions. This paper presents a workflow with partial automation to produce mutants of this form which, notably, are not those produced by mutation-testing tools.
Approach: We comb through student tests that fail a correct implementation. The student misconceptions are embedded in these failures. We then use methods to semantically cluster these failures. These clusters are then translated into conceptual mutants. These can then be run against student data to determine whether we they are better than prior methods. Some of these processes also enjoy automation.
Knowledge: We find that student misconceptions illustrated by failing tests can be operationalized by the above process. The resulting mutants do much better at identifying student misconceptions.
Grounding: Our findings are grounded in a manual analysis of student examples and a quantitative evaluation of both our clustering techniques and our process for making conceptual mutants. The clustering evaluation compares against a ground truth using standard cluster-correspondence measures, while the mutant evaluation examines how conceptual mutants perform against student data.
Importance: Our work contributes a workflow, with some automation, to reduce the cost and increase the effectiveness of generating conceptually interesting mutants. Such mutants can both improve learning outcomes and reduce student frustration, leading to better educational outcomes. In the process, we also identify a variation of mutation testing not commonly discussed in the software literature.
△ Less
Submitted 28 December, 2023;
originally announced January 2024.
-
Siracusa: A 16 nm Heterogenous RISC-V SoC for Extended Reality with At-MRAM Neural Engine
Authors:
Arpan Suravi Prasad,
Moritz Scherer,
Francesco Conti,
Davide Rossi,
Alfio Di Mauro,
Manuel Eggimann,
Jorge Tómas Gómez,
Ziyun Li,
Syed Shakib Sarwar,
Zhao Wang,
Barbara De Salvo,
Luca Benini
Abstract:
Extended reality (XR) applications are Machine Learning (ML)-intensive, featuring deep neural networks (DNNs) with millions of weights, tightly latency-bound (10-20 ms end-to-end), and power-constrained (low tens of mW average power). While ML performance and efficiency can be achieved by introducing neural engines within low-power systems-on-chip (SoCs), system-level power for nontrivial DNNs dep…
▽ More
Extended reality (XR) applications are Machine Learning (ML)-intensive, featuring deep neural networks (DNNs) with millions of weights, tightly latency-bound (10-20 ms end-to-end), and power-constrained (low tens of mW average power). While ML performance and efficiency can be achieved by introducing neural engines within low-power systems-on-chip (SoCs), system-level power for nontrivial DNNs depends strongly on the energy of non-volatile memory (NVM) access for network weights. This work introduces Siracusa, a near-sensor heterogeneous SoC for next-generation XR devices manufactured in 16 nm CMOS. Siracusa couples an octa-core cluster of RISC-V digital signal processing cores with a novel tightly-coupled "At-Memory" integration between a state-of-the-art digital neural engine called N-EUREKA and an on-chip NVM based on magnetoresistive memory(MRAM), achieving 1.7x higher throughput and 3x better energy efficiency than XR SoCs using NVM as background memory. The fabricated SoC prototype achieves an area efficiency of 65.2 GOp/s/mm2 and a peak energy efficiency of 8.84 TOp/J for DNN inference while supporting complex heterogeneous application workloads, which combine ML with conventional signal processing and control.
△ Less
Submitted 14 April, 2024; v1 submitted 22 December, 2023;
originally announced December 2023.
-
Report on 2023 CyberTraining PI Meeting, 26-27 September 2023
Authors:
Geoffrey Fox,
Mary P Thomas,
Sajal Bhatia,
Marisa Brazil,
Nicole M Gasparini,
Venkatesh Mohan Merwade,
Henry J. Neeman,
Jeff Carver,
Henri Casanova,
Vipin Chaudhary,
Dirk Colbry,
Lonnie Crosby,
Prasun Dewan,
Jessica Eisma,
Nicole M Gasparini,
Ahmed Irfan,
Kate Kaehey,
Qianqian Liu,
Zhen Ni,
Sushil Prasad,
Apan Qasem,
Erik Saule,
Prabha Sundaravadivel,
Karen Tomko
Abstract:
This document describes a two-day meeting held for the Principal Investigators (PIs) of NSF CyberTraining grants. The report covers invited talks, panels, and six breakout sessions. The meeting involved over 80 PIs and NSF program managers (PMs). The lessons recorded in detail in the report are a wealth of information that could help current and future PIs, as well as NSF PMs, understand the futur…
▽ More
This document describes a two-day meeting held for the Principal Investigators (PIs) of NSF CyberTraining grants. The report covers invited talks, panels, and six breakout sessions. The meeting involved over 80 PIs and NSF program managers (PMs). The lessons recorded in detail in the report are a wealth of information that could help current and future PIs, as well as NSF PMs, understand the future directions suggested by the PI community. The meeting was held simultaneously with that of the PIs of the NSF Cyberinfrastructure for Sustained Scientific Innovation (CSSI) program. This co-location led to two joint sessions: one with NSF speakers and the other on broader impact. Further, the joint poster and refreshment sessions benefited from the interactions between CSSI and CyberTraining PIs.
△ Less
Submitted 28 December, 2023; v1 submitted 20 December, 2023;
originally announced December 2023.
-
Minimizing Factual Inconsistency and Hallucination in Large Language Models
Authors:
Muneeswaran I,
Shreya Saxena,
Siva Prasad,
M V Sai Prakash,
Advaith Shankar,
Varun V,
Vishal Vaddina,
Saisubramaniam Gopalakrishnan
Abstract:
Large Language Models (LLMs) are widely used in critical fields such as healthcare, education, and finance due to their remarkable proficiency in various language-related tasks. However, LLMs are prone to generating factually incorrect responses or "hallucinations," which can lead to a loss of credibility and trust among users. To address this issue, we propose a multi-stage framework that generat…
▽ More
Large Language Models (LLMs) are widely used in critical fields such as healthcare, education, and finance due to their remarkable proficiency in various language-related tasks. However, LLMs are prone to generating factually incorrect responses or "hallucinations," which can lead to a loss of credibility and trust among users. To address this issue, we propose a multi-stage framework that generates the rationale first, verifies and refines incorrect ones, and uses them as supporting references to generate the answer. The generated rationale enhances the transparency of the answer and our framework provides insights into how the model arrived at this answer, by using this rationale and the references to the context. In this paper, we demonstrate its effectiveness in improving the quality of responses to drug-related inquiries in the life sciences industry. Our framework improves traditional Retrieval Augmented Generation (RAG) by enabling OpenAI GPT-3.5-turbo to be 14-25% more faithful and 16-22% more accurate on two datasets. Furthermore, fine-tuning samples based on our framework improves the accuracy of smaller open-access LLMs by 33-42% and competes with RAG on commercial models.
△ Less
Submitted 23 November, 2023;
originally announced November 2023.
-
Joint-YODNet: A Light-weight Object Detector for UAVs to Achieve Above 100fps
Authors:
Vipin Gautam,
Shitala Prasad,
Sharad Sinha
Abstract:
Small object detection via UAV (Unmanned Aerial Vehicle) images captured from drones and radar is a complex task with several formidable challenges. This domain encompasses numerous complexities that impede the accurate detection and localization of small objects. To address these challenges, we propose a novel method called JointYODNet for UAVs to detect small objects, leveraging a joint loss fun…
▽ More
Small object detection via UAV (Unmanned Aerial Vehicle) images captured from drones and radar is a complex task with several formidable challenges. This domain encompasses numerous complexities that impede the accurate detection and localization of small objects. To address these challenges, we propose a novel method called JointYODNet for UAVs to detect small objects, leveraging a joint loss function specifically designed for this task. Our method revolves around the development of a joint loss function tailored to enhance the detection performance of small objects. Through extensive experimentation on a diverse dataset of UAV images captured under varying environmental conditions, we evaluated different variations of the loss function and determined the most effective formulation. The results demonstrate that our proposed joint loss function outperforms existing methods in accurately localizing small objects. Specifically, our method achieves a recall of 0.971, and a F1Score of 0.975, surpassing state-of-the-art techniques. Additionally, our method achieves a [email protected](%) of 98.6, indicating its robustness in detecting small objects across varying scales
△ Less
Submitted 27 September, 2023;
originally announced September 2023.
-
AaP-ReID: Improved Attention-Aware Person Re-identification
Authors:
Vipin Gautam,
Shitala Prasad,
Sharad Sinha
Abstract:
Person re-identification (ReID) is a well-known problem in the field of computer vision. The primary objective is to identify a specific individual within a gallery of images. However, this task is challenging due to various factors, such as pose variations, illumination changes, obstructions, and the presence ofconfusing backgrounds. Existing ReID methods often fail to capture discriminative feat…
▽ More
Person re-identification (ReID) is a well-known problem in the field of computer vision. The primary objective is to identify a specific individual within a gallery of images. However, this task is challenging due to various factors, such as pose variations, illumination changes, obstructions, and the presence ofconfusing backgrounds. Existing ReID methods often fail to capture discriminative features (e.g., head, shoes, backpacks) and instead capture irrelevant features when the target is occluded. Motivated by the success of part-based and attention-based ReID methods, we improve AlignedReID++ and present AaP-ReID, a more effective method for person ReID that incorporates channel-wise attention into a ResNet-based architecture. Our method incorporates the Channel-Wise Attention Bottleneck (CWAbottleneck) block and can learn discriminating features by dynamically adjusting the importance ofeach channel in the feature maps. We evaluated Aap-ReID on three benchmark datasets: Market-1501, DukeMTMC-reID, and CUHK03. When compared with state-of-the-art person ReID methods, we achieve competitive results with rank-1 accuracies of 95.6% on Market-1501, 90.6% on DukeMTMC-reID, and 82.4% on CUHK03.
△ Less
Submitted 27 September, 2023;
originally announced September 2023.
-
YOLORe-IDNet: An Efficient Multi-Camera System for Person-Tracking
Authors:
Vipin Gautam,
Shitala Prasad,
Sharad Sinha
Abstract:
The growing need for video surveillance in public spaces has created a demand for systems that can track individuals across multiple cameras feeds in real-time. While existing tracking systems have achieved impressive performance using deep learning models, they often rely on pre-existing images of suspects or historical data. However, this is not always feasible in cases where suspicious individu…
▽ More
The growing need for video surveillance in public spaces has created a demand for systems that can track individuals across multiple cameras feeds in real-time. While existing tracking systems have achieved impressive performance using deep learning models, they often rely on pre-existing images of suspects or historical data. However, this is not always feasible in cases where suspicious individuals are identified in real-time and without prior knowledge. We propose a person-tracking system that combines correlation filters and Intersection Over Union (IOU) constraints for robust tracking, along with a deep learning model for cross-camera person re-identification (Re-ID) on top of YOLOv5. The proposed system quickly identifies and tracks suspect in real-time across multiple cameras and recovers well after full or partial occlusion, making it suitable for security and surveillance applications. It is computationally efficient and achieves a high F1-Score of 79% and an IOU of 59% comparable to existing state-of-the-art algorithms, as demonstrated in our evaluation on a publicly available OTB-100 dataset. The proposed system offers a robust and efficient solution for the real-time tracking of individuals across multiple camera feeds. Its ability to track targets without prior knowledge or historical data is a significant improvement over existing systems, making it well-suited for public safety and surveillance applications.
△ Less
Submitted 23 September, 2023;
originally announced September 2023.
-
Content Prompting: Modeling Content Provider Dynamics to Improve User Welfare in Recommender Ecosystems
Authors:
Siddharth Prasad,
Martin Mladenov,
Craig Boutilier
Abstract:
Users derive value from a recommender system (RS) only to the extent that it is able to surface content (or items) that meet their needs/preferences. While RSs often have a comprehensive view of user preferences across the entire user base, content providers, by contrast, generally have only a local view of the preferences of users that have interacted with their content. This limits a provider's…
▽ More
Users derive value from a recommender system (RS) only to the extent that it is able to surface content (or items) that meet their needs/preferences. While RSs often have a comprehensive view of user preferences across the entire user base, content providers, by contrast, generally have only a local view of the preferences of users that have interacted with their content. This limits a provider's ability to offer new content to best serve the broader population. In this work, we tackle this information asymmetry with content prompting policies. A content prompt is a hint or suggestion to a provider to make available novel content for which the RS predicts unmet user demand. A prompting policy is a sequence of such prompts that is responsive to the dynamics of a provider's beliefs, skills and incentives. We aim to determine a joint prompting policy that induces a set of providers to make content available that optimizes user social welfare in equilibrium, while respecting the incentives of the providers themselves. Our contributions include: (i) an abstract model of the RS ecosystem, including content provider behaviors, that supports such prompting; (ii) the design and theoretical analysis of sequential prompting policies for individual providers; (iii) a mixed integer programming formulation for optimal joint prompting using path planning in content space; and (iv) simple, proof-of-concept experiments illustrating how such policies improve ecosystem health and user welfare.
△ Less
Submitted 2 September, 2023;
originally announced September 2023.
-
Dense Retrieval Adaptation using Target Domain Description
Authors:
Helia Hashemi,
Yong Zhuang,
Sachith Sri Ram Kothur,
Srivas Prasad,
Edgar Meij,
W. Bruce Croft
Abstract:
In information retrieval (IR), domain adaptation is the process of adapting a retrieval model to a new domain whose data distribution is different from the source domain. Existing methods in this area focus on unsupervised domain adaptation where they have access to the target document collection or supervised (often few-shot) domain adaptation where they additionally have access to (limited) labe…
▽ More
In information retrieval (IR), domain adaptation is the process of adapting a retrieval model to a new domain whose data distribution is different from the source domain. Existing methods in this area focus on unsupervised domain adaptation where they have access to the target document collection or supervised (often few-shot) domain adaptation where they additionally have access to (limited) labeled data in the target domain. There also exists research on improving zero-shot performance of retrieval models with no adaptation. This paper introduces a new category of domain adaptation in IR that is as-yet unexplored. Here, similar to the zero-shot setting, we assume the retrieval model does not have access to the target document collection. In contrast, it does have access to a brief textual description that explains the target domain. We define a taxonomy of domain attributes in retrieval tasks to understand different properties of a source domain that can be adapted to a target domain. We introduce a novel automatic data construction pipeline that produces a synthetic document collection, query set, and pseudo relevance labels, given a textual domain description. Extensive experiments on five diverse target domains show that adapting dense retrieval models using the constructed synthetic data leads to effective retrieval performance on the target domain.
△ Less
Submitted 5 July, 2023;
originally announced July 2023.
-
Anomaly Detection in Satellite Videos using Diffusion Models
Authors:
Akash Awasthi,
Son Ly,
Jaer Nizam,
Samira Zare,
Videet Mehta,
Safwan Ahmed,
Keshav Shah,
Ramakrishna Nemani,
Saurabh Prasad,
Hien Van Nguyen
Abstract:
The definition of anomaly detection is the identification of an unexpected event. Real-time detection of extreme events such as wildfires, cyclones, or floods using satellite data has become crucial for disaster management. Although several earth-observing satellites provide information about disasters, satellites in the geostationary orbit provide data at intervals as frequent as every minute, ef…
▽ More
The definition of anomaly detection is the identification of an unexpected event. Real-time detection of extreme events such as wildfires, cyclones, or floods using satellite data has become crucial for disaster management. Although several earth-observing satellites provide information about disasters, satellites in the geostationary orbit provide data at intervals as frequent as every minute, effectively creating a video from space. There are many techniques that have been proposed to identify anomalies in surveillance videos; however, the available datasets do not have dynamic behavior, so we discuss an anomaly framework that can work on very high-frequency datasets to find very fast-moving anomalies. In this work, we present a diffusion model which does not need any motion component to capture the fast-moving anomalies and outperforms the other baseline methods.
△ Less
Submitted 25 May, 2023;
originally announced June 2023.
-
Toward Polar Sea-Ice Classification using Color-based Segmentation and Auto-labeling of Sentinel-2 Imagery to Train an Efficient Deep Learning Model
Authors:
Jurdana Masuma Iqrah,
Younghyun Koo,
Wei Wang,
Hongjie Xie,
Sushil Prasad
Abstract:
Global warming is an urgent issue that is generating catastrophic environmental changes, such as the melting of sea ice and glaciers, particularly in the polar regions. The melting pattern and retreat of polar sea ice cover is an essential indicator of global warming. The Sentinel-2 satellite (S2) captures high-resolution optical imagery over the polar regions. This research aims at developing a r…
▽ More
Global warming is an urgent issue that is generating catastrophic environmental changes, such as the melting of sea ice and glaciers, particularly in the polar regions. The melting pattern and retreat of polar sea ice cover is an essential indicator of global warming. The Sentinel-2 satellite (S2) captures high-resolution optical imagery over the polar regions. This research aims at developing a robust and effective system for classifying polar sea ice as thick or snow-covered, young or thin, or open water using S2 images. A key challenge is the lack of labeled S2 training data to serve as the ground truth. We demonstrate a method with high precision to segment and automatically label the S2 images based on suitably determined color thresholds and employ these auto-labeled data to train a U-Net machine model (a fully convolutional neural network), yielding good classification accuracy. Evaluation results over S2 data from the polar summer season in the Ross Sea region of the Antarctic show that the U-Net model trained on auto-labeled data has an accuracy of 90.18% over the original S2 images, whereas the U-Net model trained on manually labeled data has an accuracy of 91.39%. Filtering out the thin clouds and shadows from the S2 images further improves U-Net's accuracy, respectively, to 98.97% for auto-labeled and 98.40% for manually labeled training datasets.
△ Less
Submitted 8 March, 2023;
originally announced March 2023.
-
Bicriteria Multidimensional Mechanism Design with Side Information
Authors:
Maria-Florina Balcan,
Siddharth Prasad,
Tuomas Sandholm
Abstract:
We develop a versatile methodology for multidimensional mechanism design that incorporates side information about agents to generate high welfare and high revenue simultaneously. Side information sources include advice from domain experts, predictions from machine learning models, and even the mechanism designer's gut instinct. We design a tunable mechanism that integrates side information with an…
▽ More
We develop a versatile methodology for multidimensional mechanism design that incorporates side information about agents to generate high welfare and high revenue simultaneously. Side information sources include advice from domain experts, predictions from machine learning models, and even the mechanism designer's gut instinct. We design a tunable mechanism that integrates side information with an improved VCG-like mechanism based on weakest types, which are agent types that generate the least welfare. We show that our mechanism, when carefully tuned, generates welfare and revenue competitive with the prior-free total social surplus, and its performance decays gracefully as the side information quality decreases. We consider a number of side information formats including distribution-free predictions, predictions that express uncertainty, agent types constrained to low-dimensional subspaces of the ambient type space, and the traditional setting with known priors over agent types. In each setting we design mechanisms based on weakest types and prove performance guarantees.
△ Less
Submitted 9 October, 2024; v1 submitted 27 February, 2023;
originally announced February 2023.
-
Large-Scale Knowledge Synthesis and Complex Information Retrieval from Biomedical Documents
Authors:
Shreya Saxena,
Raj Sangani,
Siva Prasad,
Shubham Kumar,
Mihir Athale,
Rohan Awhad,
Vishal Vaddina
Abstract:
Recent advances in the healthcare industry have led to an abundance of unstructured data, making it challenging to perform tasks such as efficient and accurate information retrieval at scale. Our work offers an all-in-one scalable solution for extracting and exploring complex information from large-scale research documents, which would otherwise be tedious. First, we briefly explain our knowledge…
▽ More
Recent advances in the healthcare industry have led to an abundance of unstructured data, making it challenging to perform tasks such as efficient and accurate information retrieval at scale. Our work offers an all-in-one scalable solution for extracting and exploring complex information from large-scale research documents, which would otherwise be tedious. First, we briefly explain our knowledge synthesis process to extract helpful information from unstructured text data of research documents. Then, on top of the knowledge extracted from the documents, we perform complex information retrieval using three major components- Paragraph Retrieval, Triplet Retrieval from Knowledge Graphs, and Complex Question Answering (QA). These components combine lexical and semantic-based methods to retrieve paragraphs and triplets and perform faceted refinement for filtering these search results. The complexity of biomedical queries and documents necessitates using a QA system capable of handling queries more complex than factoid queries, which we evaluate qualitatively on the COVID-19 Open Research Dataset (CORD-19) to demonstrate the effectiveness and value-add.
△ Less
Submitted 14 February, 2023;
originally announced February 2023.
-
Scalable mRMR feature selection to handle high dimensional datasets: Vertical partitioning based Iterative MapReduce framework
Authors:
Yelleti Vivek,
P. S. V. S. Sai Prasad
Abstract:
While building machine learning models, Feature selection (FS) stands out as an essential preprocessing step used to handle the uncertainty and vagueness in the data. Recently, the minimum Redundancy and Maximum Relevance (mRMR) approach has proven to be effective in obtaining the irredundant feature subset. Owing to the generation of voluminous datasets, it is essential to design scalable solutio…
▽ More
While building machine learning models, Feature selection (FS) stands out as an essential preprocessing step used to handle the uncertainty and vagueness in the data. Recently, the minimum Redundancy and Maximum Relevance (mRMR) approach has proven to be effective in obtaining the irredundant feature subset. Owing to the generation of voluminous datasets, it is essential to design scalable solutions using distributed/parallel paradigms. MapReduce solutions are proven to be one of the best approaches to designing fault-tolerant and scalable solutions. This work analyses the existing MapReduce approaches for mRMR feature selection and identifies the limitations thereof. In the current study, we proposed VMR_mRMR, an efficient vertical partitioning-based approach using a memorization approach, thereby overcoming the extant approaches limitations. The experiment analysis says that VMR_mRMR significantly outperformed extant approaches and achieved a better computational gain (C.G). In addition, we also conducted a comparative analysis with the horizontal partitioning approach HMR_mRMR [1] to assess the strengths and limitations of the proposed approach.
△ Less
Submitted 24 July, 2024; v1 submitted 21 August, 2022;
originally announced August 2022.
-
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Authors:
Aarohi Srivastava,
Abhinav Rastogi,
Abhishek Rao,
Abu Awal Md Shoeb,
Abubakar Abid,
Adam Fisch,
Adam R. Brown,
Adam Santoro,
Aditya Gupta,
Adrià Garriga-Alonso,
Agnieszka Kluska,
Aitor Lewkowycz,
Akshat Agarwal,
Alethea Power,
Alex Ray,
Alex Warstadt,
Alexander W. Kocurek,
Ali Safaya,
Ali Tazarv,
Alice Xiang,
Alicia Parrish,
Allen Nie,
Aman Hussain,
Amanda Askell,
Amanda Dsouza
, et al. (426 additional authors not shown)
Abstract:
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-futur…
▽ More
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
△ Less
Submitted 12 June, 2023; v1 submitted 9 June, 2022;
originally announced June 2022.
-
Private Eye: On the Limits of Textual Screen Peeking via Eyeglass Reflections in Video Conferencing
Authors:
Yan Long,
Chen Yan,
Shilin Xiao,
Shivan Prasad,
Wenyuan Xu,
Kevin Fu
Abstract:
Using mathematical modeling and human subjects experiments, this research explores the extent to which emerging webcams might leak recognizable textual and graphical information gleaming from eyeglass reflections captured by webcams. The primary goal of our work is to measure, compute, and predict the factors, limits, and thresholds of recognizability as webcam technology evolves in the future. Ou…
▽ More
Using mathematical modeling and human subjects experiments, this research explores the extent to which emerging webcams might leak recognizable textual and graphical information gleaming from eyeglass reflections captured by webcams. The primary goal of our work is to measure, compute, and predict the factors, limits, and thresholds of recognizability as webcam technology evolves in the future. Our work explores and characterizes the viable threat models based on optical attacks using multi-frame super resolution techniques on sequences of video frames. Our models and experimental results in a controlled lab setting show it is possible to reconstruct and recognize with over 75% accuracy on-screen texts that have heights as small as 10 mm with a 720p webcam. We further apply this threat model to web textual contents with varying attacker capabilities to find thresholds at which text becomes recognizable. Our user study with 20 participants suggests present-day 720p webcams are sufficient for adversaries to reconstruct textual content on big-font websites. Our models further show that the evolution towards 4K cameras will tip the threshold of text leakage to reconstruction of most header texts on popular websites. Besides textual targets, a case study on recognizing a closed-world dataset of Alexa top 100 websites with 720p webcams shows a maximum recognition accuracy of 94% with 10 participants even without using machine-learning models. Our research proposes near-term mitigations including a software prototype that users can use to blur the eyeglass areas of their video streams. For possible long-term defenses, we advocate an individual reflection testing procedure to assess threats under various settings, and justify the importance of following the principle of least privilege for privacy-sensitive scenarios.
△ Less
Submitted 16 January, 2023; v1 submitted 8 May, 2022;
originally announced May 2022.