-
The Confidence Paradox: Can LLM Know When It's Wrong
Authors:
Sahil Tripathi,
Md Tabrez Nafis,
Imran Hussain,
Jiechao Gao
Abstract:
Document Visual Question Answering (DocVQA) systems are increasingly deployed in real world applications, yet they remain ethically opaque-often producing overconfident answers to ambiguous questions or failing to communicate uncertainty in a trustworthy manner. This misalignment between model confidence and actual knowledge poses significant risks, particularly in domains requiring ethical accoun…
▽ More
Document Visual Question Answering (DocVQA) systems are increasingly deployed in real world applications, yet they remain ethically opaque-often producing overconfident answers to ambiguous questions or failing to communicate uncertainty in a trustworthy manner. This misalignment between model confidence and actual knowledge poses significant risks, particularly in domains requiring ethical accountability. Existing approaches such as LayoutLMv3, UDOP, and DONUT have advanced SOTA performance by focusing on architectural sophistication and accuracy; however, they fall short in ethical responsiveness.
To address these limitations, we introduce HonestVQA, a self-supervised honesty calibration framework for ethically aligned DocVQA. Our model-agnostic method quantifies uncertainty to identify knowledge gaps, aligns model confidence with actual correctness using weighted loss functions, and enforces ethical response behavior via contrastive learning. We further introduce two principled evaluation metrics--Honesty Score (H-Score) and Ethical Confidence Index (ECI)--to benchmark alignment between confidence, accuracy, and ethical communication. Empirically, HonestVQA improves DocVQA accuracy by up to 4.3% and F1 by 4.3% across SpDocVQA, InfographicsVQA, and SROIE datasets. It reduces overconfidence, lowering H-Score and ECI by 0.072 and 0.078, respectively. In cross domain evaluation, it achieves up to 78.9% accuracy and 76.1% F1-score, demonstrating strong generalization. Ablation shows a 3.8% drop in accuracy without alignment or contrastive loss.
△ Less
Submitted 29 June, 2025;
originally announced June 2025.
-
EASG-Bench: Video Q&A Benchmark with Egocentric Action Scene Graphs
Authors:
Ivan Rodin,
Tz-Ying Wu,
Kyle Min,
Sharath Nittur Sridhar,
Antonino Furnari,
Subarna Tripathi,
Giovanni Maria Farinella
Abstract:
We introduce EASG-Bench, a question-answering benchmark for egocentric videos where the question-answering pairs are created from spatio-temporally grounded dynamic scene graphs capturing intricate relationships among actors, actions, and objects. We propose a systematic evaluation framework and evaluate several language-only and video large language models (video-LLMs) on this benchmark. We obser…
▽ More
We introduce EASG-Bench, a question-answering benchmark for egocentric videos where the question-answering pairs are created from spatio-temporally grounded dynamic scene graphs capturing intricate relationships among actors, actions, and objects. We propose a systematic evaluation framework and evaluate several language-only and video large language models (video-LLMs) on this benchmark. We observe a performance gap in language-only and video-LLMs, especially on questions focusing on temporal ordering, thus identifying a research gap in the area of long-context video understanding. To promote the reproducibility of our findings and facilitate further research, the benchmark and accompanying code are available at the following GitHub page: https://github.com/fpv-iplab/EASG-bench.
△ Less
Submitted 6 June, 2025;
originally announced June 2025.
-
PALADIN : Robust Neural Fingerprinting for Text-to-Image Diffusion Models
Authors:
Murthy L,
Subarna Tripathi
Abstract:
The risk of misusing text-to-image generative models for malicious uses, especially due to the open-source development of such models, has become a serious concern. As a risk mitigation strategy, attributing generative models with neural fingerprinting is emerging as a popular technique. There has been a plethora of recent work that aim for addressing neural fingerprinting. A trade-off between the…
▽ More
The risk of misusing text-to-image generative models for malicious uses, especially due to the open-source development of such models, has become a serious concern. As a risk mitigation strategy, attributing generative models with neural fingerprinting is emerging as a popular technique. There has been a plethora of recent work that aim for addressing neural fingerprinting. A trade-off between the attribution accuracy and generation quality of such models has been studied extensively. None of the existing methods yet achieved $100\%$ attribution accuracy. However, any model with less than \emph{perfect} accuracy is practically non-deployable. In this work, we propose an accurate method to incorporate neural fingerprinting for text-to-image diffusion models leveraging the concepts of cyclic error correcting codes from the literature of coding theory.
△ Less
Submitted 28 May, 2025;
originally announced June 2025.
-
Keystep Recognition using Graph Neural Networks
Authors:
Julia Lee Romero,
Kyle Min,
Subarna Tripathi,
Morteza Karimzadeh
Abstract:
We pose keystep recognition as a node classification task, and propose a flexible graph-learning framework for fine-grained keystep recognition that is able to effectively leverage long-term dependencies in egocentric videos. Our approach, termed GLEVR, consists of constructing a graph where each video clip of the egocentric video corresponds to a node. The constructed graphs are sparse and comput…
▽ More
We pose keystep recognition as a node classification task, and propose a flexible graph-learning framework for fine-grained keystep recognition that is able to effectively leverage long-term dependencies in egocentric videos. Our approach, termed GLEVR, consists of constructing a graph where each video clip of the egocentric video corresponds to a node. The constructed graphs are sparse and computationally efficient, outperforming existing larger models substantially. We further leverage alignment between egocentric and exocentric videos during training for improved inference on egocentric videos, as well as adding automatic captioning as an additional modality. We consider each clip of each exocentric video (if available) or video captions as additional nodes during training. We examine several strategies to define connections across these nodes. We perform extensive experiments on the Ego-Exo4D dataset and show that our proposed flexible graph-based framework notably outperforms existing methods.
△ Less
Submitted 1 June, 2025;
originally announced June 2025.
-
Searching Clinical Data Using Generative AI
Authors:
Karan Hanswadkar,
Anika Kanchi,
Shivani Tripathi,
Shi Qiao,
Rony Chatterjee,
Alekh Jindal
Abstract:
Artificial Intelligence (AI) is making a major impact on healthcare, particularly through its application in natural language processing (NLP) and predictive analytics. The healthcare sector has increasingly adopted AI for tasks such as clinical data analysis and medical code assignment. However, searching for clinical information in large and often unorganized datasets remains a manual and error-…
▽ More
Artificial Intelligence (AI) is making a major impact on healthcare, particularly through its application in natural language processing (NLP) and predictive analytics. The healthcare sector has increasingly adopted AI for tasks such as clinical data analysis and medical code assignment. However, searching for clinical information in large and often unorganized datasets remains a manual and error-prone process. Assisting this process with automations can help physicians improve their operational productivity significantly.
In this paper, we present a generative AI approach, coined SearchAI, to enhance the accuracy and efficiency of searching clinical data. Unlike traditional code assignment, which is a one-to-one problem, clinical data search is a one-to-many problem, i.e., a given search query can map to a family of codes. Healthcare professionals typically search for groups of related diseases, drugs, or conditions that map to many codes, and therefore, they need search tools that can handle keyword synonyms, semantic variants, and broad open-ended queries. SearchAI employs a hierarchical model that respects the coding hierarchy and improves the traversal of relationships from parent to child nodes. SearchAI navigates these hierarchies predictively and ensures that all paths are reachable without losing any relevant nodes.
To evaluate the effectiveness of SearchAI, we conducted a series of experiments using both public and production datasets. Our results show that SearchAI outperforms default hierarchical traversals across several metrics, including accuracy, robustness, performance, and scalability. SearchAI can help make clinical data more accessible, leading to streamlined workflows, reduced administrative burden, and enhanced coding and diagnostic accuracy.
△ Less
Submitted 29 May, 2025;
originally announced May 2025.
-
SG-Blend: Learning an Interpolation Between Improved Swish and GELU for Robust Neural Representations
Authors:
Gaurav Sarkar,
Jay Gala,
Subarna Tripathi
Abstract:
The design of activation functions remains a pivotal component in optimizing deep neural networks. While prevailing choices like Swish and GELU demonstrate considerable efficacy, they often exhibit domain-specific optima. This work introduces SG-Blend, a novel activation function that blends our proposed SSwish, a first-order symmetric variant of Swish and the established GELU through dynamic inte…
▽ More
The design of activation functions remains a pivotal component in optimizing deep neural networks. While prevailing choices like Swish and GELU demonstrate considerable efficacy, they often exhibit domain-specific optima. This work introduces SG-Blend, a novel activation function that blends our proposed SSwish, a first-order symmetric variant of Swish and the established GELU through dynamic interpolation. By adaptively blending these constituent functions via learnable parameters, SG-Blend aims to harness their complementary strengths: SSwish's controlled non-monotonicity and symmetry, and GELU's smooth, probabilistic profile, to achieve a more universally robust balance between model expressivity and gradient stability. We conduct comprehensive empirical evaluations across diverse modalities and architectures, showing performance improvements across all considered natural language and computer vision tasks and models. These results, achieved with negligible computational overhead, underscore SG-Blend's potential as a versatile, drop-in replacement that consistently outperforms strong contemporary baselines. The code is available at https://anonymous.4open.science/r/SGBlend-6CBC.
△ Less
Submitted 29 May, 2025;
originally announced May 2025.
-
PICO: Reconstructing 3D People In Contact with Objects
Authors:
Alpár Cseke,
Shashank Tripathi,
Sai Kumar Dwivedi,
Arjun Lakshmipathy,
Agniv Chatterjee,
Michael J. Black,
Dimitrios Tzionas
Abstract:
Recovering 3D Human-Object Interaction (HOI) from single color images is challenging due to depth ambiguities, occlusions, and the huge variation in object shape and appearance. Thus, past work requires controlled settings such as known object shapes and contacts, and tackles only limited object classes. Instead, we need methods that generalize to natural images and novel object classes. We tackle…
▽ More
Recovering 3D Human-Object Interaction (HOI) from single color images is challenging due to depth ambiguities, occlusions, and the huge variation in object shape and appearance. Thus, past work requires controlled settings such as known object shapes and contacts, and tackles only limited object classes. Instead, we need methods that generalize to natural images and novel object classes. We tackle this in two main ways: (1) We collect PICO-db, a new dataset of natural images uniquely paired with dense 3D contact on both body and object meshes. To this end, we use images from the recent DAMON dataset that are paired with contacts, but these contacts are only annotated on a canonical 3D body. In contrast, we seek contact labels on both the body and the object. To infer these given an image, we retrieve an appropriate 3D object mesh from a database by leveraging vision foundation models. Then, we project DAMON's body contact patches onto the object via a novel method needing only 2 clicks per patch. This minimal human input establishes rich contact correspondences between bodies and objects. (2) We exploit our new dataset of contact correspondences in a novel render-and-compare fitting method, called PICO-fit, to recover 3D body and object meshes in interaction. PICO-fit infers contact for the SMPL-X body, retrieves a likely 3D object mesh and contact from PICO-db for that object, and uses the contact to iteratively fit the 3D body and object meshes to image evidence via optimization. Uniquely, PICO-fit works well for many object categories that no existing method can tackle. This is crucial to enable HOI understanding to scale in the wild. Our data and code are available at https://pico.is.tue.mpg.de.
△ Less
Submitted 24 April, 2025;
originally announced April 2025.
-
InteractVLM: 3D Interaction Reasoning from 2D Foundational Models
Authors:
Sai Kumar Dwivedi,
Dimitrije Antić,
Shashank Tripathi,
Omid Taheri,
Cordelia Schmid,
Michael J. Black,
Dimitrios Tzionas
Abstract:
We introduce InteractVLM, a novel method to estimate 3D contact points on human bodies and objects from single in-the-wild images, enabling accurate human-object joint reconstruction in 3D. This is challenging due to occlusions, depth ambiguities, and widely varying object shapes. Existing methods rely on 3D contact annotations collected via expensive motion-capture systems or tedious manual label…
▽ More
We introduce InteractVLM, a novel method to estimate 3D contact points on human bodies and objects from single in-the-wild images, enabling accurate human-object joint reconstruction in 3D. This is challenging due to occlusions, depth ambiguities, and widely varying object shapes. Existing methods rely on 3D contact annotations collected via expensive motion-capture systems or tedious manual labeling, limiting scalability and generalization. To overcome this, InteractVLM harnesses the broad visual knowledge of large Vision-Language Models (VLMs), fine-tuned with limited 3D contact data. However, directly applying these models is non-trivial, as they reason only in 2D, while human-object contact is inherently 3D. Thus we introduce a novel Render-Localize-Lift module that: (1) embeds 3D body and object surfaces in 2D space via multi-view rendering, (2) trains a novel multi-view localization model (MV-Loc) to infer contacts in 2D, and (3) lifts these to 3D. Additionally, we propose a new task called Semantic Human Contact estimation, where human contact predictions are conditioned explicitly on object semantics, enabling richer interaction modeling. InteractVLM outperforms existing work on contact estimation and also facilitates 3D reconstruction from an in-the wild image. Code and models are available at https://interactvlm.is.tue.mpg.de.
△ Less
Submitted 7 April, 2025;
originally announced April 2025.
-
Weed Detection using Convolutional Neural Network
Authors:
Santosh Kumar Tripathi,
Shivendra Pratap Singh,
Devansh Sharma,
Harshavardhan U Patekar
Abstract:
In this paper we use convolutional neural networks (CNNs) for weed detection in agricultural land. We specifically investigate the application of two CNN layer types, Conv2d and dilated Conv2d, for weed detection in crop fields. The suggested method extracts features from the input photos using pre-trained models, which are subsequently adjusted for weed detection. The findings of the experiment,…
▽ More
In this paper we use convolutional neural networks (CNNs) for weed detection in agricultural land. We specifically investigate the application of two CNN layer types, Conv2d and dilated Conv2d, for weed detection in crop fields. The suggested method extracts features from the input photos using pre-trained models, which are subsequently adjusted for weed detection. The findings of the experiment, which used a sizable collection of dataset consisting of 15336 segments, being 3249 of soil, 7376 of soybean, 3520 grass and 1191 of broadleaf weeds. show that the suggested approach can accurately and successfully detect weeds at an accuracy of 94%. This study has significant ramifications for lowering the usage of toxic herbicides and increasing the effectiveness of weed management in agriculture.
△ Less
Submitted 20 February, 2025;
originally announced February 2025.
-
Graph-Based Multimodal and Multi-view Alignment for Keystep Recognition
Authors:
Julia Lee Romero,
Kyle Min,
Subarna Tripathi,
Morteza Karimzadeh
Abstract:
Egocentric videos capture scenes from a wearer's viewpoint, resulting in dynamic backgrounds, frequent motion, and occlusions, posing challenges to accurate keystep recognition. We propose a flexible graph-learning framework for fine-grained keystep recognition that is able to effectively leverage long-term dependencies in egocentric videos, and leverage alignment between egocentric and exocentric…
▽ More
Egocentric videos capture scenes from a wearer's viewpoint, resulting in dynamic backgrounds, frequent motion, and occlusions, posing challenges to accurate keystep recognition. We propose a flexible graph-learning framework for fine-grained keystep recognition that is able to effectively leverage long-term dependencies in egocentric videos, and leverage alignment between egocentric and exocentric videos during training for improved inference on egocentric videos. Our approach consists of constructing a graph where each video clip of the egocentric video corresponds to a node. During training, we consider each clip of each exocentric video (if available) as additional nodes. We examine several strategies to define connections across these nodes and pose keystep recognition as a node classification task on the constructed graphs. We perform extensive experiments on the Ego-Exo4D dataset and show that our proposed flexible graph-based framework notably outperforms existing methods by more than 12 points in accuracy. Furthermore, the constructed graphs are sparse and compute efficient. We also present a study examining on harnessing several multimodal features, including narrations, depth, and object class labels, on a heterogeneous graph and discuss their corresponding contribution to the keystep recognition performance.
△ Less
Submitted 7 January, 2025;
originally announced January 2025.
-
Spatio-Temporal Forecasting of PM2.5 via Spatial-Diffusion guided Encoder-Decoder Architecture
Authors:
Malay Pandey,
Vaishali Jain,
Nimit Godhani,
Sachchida Nand Tripathi,
Piyush Rai
Abstract:
In many problem settings that require spatio-temporal forecasting, the values in the time-series not only exhibit spatio-temporal correlations but are also influenced by spatial diffusion across locations. One such example is forecasting the concentration of fine particulate matter (PM2.5) in the atmosphere which is influenced by many complex factors, the most important ones being diffusion due to…
▽ More
In many problem settings that require spatio-temporal forecasting, the values in the time-series not only exhibit spatio-temporal correlations but are also influenced by spatial diffusion across locations. One such example is forecasting the concentration of fine particulate matter (PM2.5) in the atmosphere which is influenced by many complex factors, the most important ones being diffusion due to meteorological factors as well as transport across vast distances over a period of time. We present a novel Spatio-Temporal Graph Neural Network architecture, that specifically captures these dependencies to forecast the PM2.5 concentration. Our model is based on an encoder-decoder architecture where the encoder and decoder parts leverage gated recurrent units (GRU) augmented with a graph neural network (TransformerConv) to account for spatial diffusion. Our model can also be seen as a generalization of various existing models for time-series or spatio-temporal forecasting. We demonstrate the model's effectiveness on two real-world PM2.5 datasets: (1) data collected by us using a recently deployed network of low-cost PM$_{2.5}$ sensors from 511 locations spanning the entirety of the Indian state of Bihar over a period of one year, and (2) another publicly available dataset that covers severely polluted regions from China for a period of 4 years. Our experimental results show our model's impressive ability to account for both spatial as well as temporal dependencies precisely.
△ Less
Submitted 18 December, 2024;
originally announced December 2024.
-
Optimizing Mixture-of-Experts Inference Time Combining Model Deployment and Communication Scheduling
Authors:
Jialong Li,
Shreyansh Tripathi,
Lakshay Rastogi,
Yiming Lei,
Rui Pan,
Yiting Xia
Abstract:
As machine learning models scale in size and complexity, their computational requirements become a significant barrier. Mixture-of-Experts (MoE) models alleviate this issue by selectively activating relevant experts. Despite this, MoE models are hindered by high communication overhead from all-to-all operations, low GPU utilization due to the synchronous communication constraint, and complications…
▽ More
As machine learning models scale in size and complexity, their computational requirements become a significant barrier. Mixture-of-Experts (MoE) models alleviate this issue by selectively activating relevant experts. Despite this, MoE models are hindered by high communication overhead from all-to-all operations, low GPU utilization due to the synchronous communication constraint, and complications from heterogeneous GPU environments.
This paper presents Aurora, which optimizes both model deployment and all-to-all communication scheduling to address these challenges in MoE inference. Aurora achieves minimal communication times by strategically ordering token transmissions in all-to-all communications. It improves GPU utilization by colocating experts from different models on the same device, avoiding the limitations of synchronous all-to-all communication. We analyze Aurora's optimization strategies theoretically across four common GPU cluster settings: exclusive vs. colocated models on GPUs, and homogeneous vs. heterogeneous GPUs. Aurora provides optimal solutions for three cases, and for the remaining NP-hard scenario, it offers a polynomial-time sub-optimal solution with only a 1.07x degradation from the optimal.
Aurora is the first approach to minimize MoE inference time via optimal model deployment and communication scheduling across various scenarios. Evaluations demonstrate that Aurora significantly accelerates inference, achieving speedups of up to 2.38x in homogeneous clusters and 3.54x in heterogeneous environments. Moreover, Aurora enhances GPU utilization by up to 1.5x compared to existing methods.
△ Less
Submitted 22 October, 2024;
originally announced October 2024.
-
SDFit: 3D Object Pose and Shape by Fitting a Morphable SDF to a Single Image
Authors:
Dimitrije Antić,
Georgios Paschalidis,
Shashank Tripathi,
Theo Gevers,
Sai Kumar Dwivedi,
Dimitrios Tzionas
Abstract:
Recovering 3D object pose and shape from a single image is a challenging and highly ill-posed problem. This is due to strong (self-)occlusions, depth ambiguities, the vast intra- and inter-class shape variance, and lack of 3D ground truth for natural images. While existing methods train deep networks on synthetic datasets to predict 3D shapes, they often struggle to generalize to real-world scenar…
▽ More
Recovering 3D object pose and shape from a single image is a challenging and highly ill-posed problem. This is due to strong (self-)occlusions, depth ambiguities, the vast intra- and inter-class shape variance, and lack of 3D ground truth for natural images. While existing methods train deep networks on synthetic datasets to predict 3D shapes, they often struggle to generalize to real-world scenarios, lack an explicit feedback loop for refining noisy estimates, and primarily focus on geometry without explicitly considering pixel alignment. To this end, we make two key observations: (1) a robust solution requires a model that imposes a strong category-specific shape prior to constrain the search space, and (2) foundational models embed 2D images and 3D shapes in joint spaces; both help resolve ambiguities. Hence, we propose SDFit, a novel optimization framework that is built on three key innovations: First, we use a learned morphable signed-distance-function (mSDF) model that acts as a strong shape prior, thus constraining the shape space. Second, we use foundational models to establish rich 2D-to-3D correspondences between image features and the mSDF. Third, we develop a fitting pipeline that iteratively refines both shape and pose, aligning the mSDF to the image. We evaluate SDFit on the Pix3D, Pascal3D+, and COMIC image datasets. SDFit performs on par with SotA methods, while demonstrating exceptional robustness to occlusions and requiring no retraining for unseen images. Therefore, SDFit contributes new insights for generalizing in the wild, paving the way for future research. Code will be released.
△ Less
Submitted 10 March, 2025; v1 submitted 24 September, 2024;
originally announced September 2024.
-
When Learning Meets Dynamics: Distributed User Connectivity Maximization in UAV-Based Communication Networks
Authors:
Bowei Li,
Saugat Tripathi,
Salman Hosain,
Ran Zhang,
Jiang,
Xie,
Miao Wang
Abstract:
Distributed management over Unmanned Aerial Vehicle (UAV) based communication networks (UCNs) has attracted increasing research attention. In this work, we study a distributed user connectivity maximization problem in a UCN. The work features a horizontal study over different levels of information exchange during the distributed iteration and a consideration of dynamics in UAV set and user distrib…
▽ More
Distributed management over Unmanned Aerial Vehicle (UAV) based communication networks (UCNs) has attracted increasing research attention. In this work, we study a distributed user connectivity maximization problem in a UCN. The work features a horizontal study over different levels of information exchange during the distributed iteration and a consideration of dynamics in UAV set and user distribution, which are not well addressed in the existing works. Specifically, the studied problem is first formulated into a time-coupled mixed-integer non-convex optimization problem. A heuristic two-stage UAV-user association policy is proposed to faster determine the user connectivity. To tackle the NP-hard problem in scalable manner, the distributed user connectivity maximization algorithm 1 (DUCM-1) is proposed under the multi-agent deep Q learning (MA-DQL) framework. DUCM-1 emphasizes on designing different information exchange levels and evaluating how they impact the learning convergence with stationary and dynamic user distribution. To comply with the UAV dynamics, DUCM-2 algorithm is developed which is devoted to autonomously handling arbitrary quit's and join-in's of UAVs in a considered time horizon. Extensive simulations are conducted i) to conclude that exchanging state information with a deliberated task-specific reward function design yields the best convergence performance, and ii) to show the efficacy and robustness of DUCM-2 against the dynamics.
△ Less
Submitted 9 September, 2024;
originally announced September 2024.
-
HUMOS: Human Motion Model Conditioned on Body Shape
Authors:
Shashank Tripathi,
Omid Taheri,
Christoph Lassner,
Michael J. Black,
Daniel Holden,
Carsten Stoll
Abstract:
Generating realistic human motion is essential for many computer vision and graphics applications. The wide variety of human body shapes and sizes greatly impacts how people move. However, most existing motion models ignore these differences, relying on a standardized, average body. This leads to uniform motion across different body types, where movements don't match their physical characteristics…
▽ More
Generating realistic human motion is essential for many computer vision and graphics applications. The wide variety of human body shapes and sizes greatly impacts how people move. However, most existing motion models ignore these differences, relying on a standardized, average body. This leads to uniform motion across different body types, where movements don't match their physical characteristics, limiting diversity. To solve this, we introduce a new approach to develop a generative motion model based on body shape. We show that it's possible to train this model using unpaired data by applying cycle consistency, intuitive physics, and stability constraints, which capture the relationship between identity and movement. The resulting model generates diverse, physically plausible, and dynamically stable human motions that are both quantitatively and qualitatively more realistic than current state-of-the-art methods. More details are available on our project page https://CarstenEpic.github.io/humos/.
△ Less
Submitted 3 April, 2025; v1 submitted 5 September, 2024;
originally announced September 2024.
-
Ego-VPA: Egocentric Video Understanding with Parameter-efficient Adaptation
Authors:
Tz-Ying Wu,
Kyle Min,
Subarna Tripathi,
Nuno Vasconcelos
Abstract:
Video understanding typically requires fine-tuning the large backbone when adapting to new domains. In this paper, we leverage the egocentric video foundation models (Ego-VFMs) based on video-language pre-training and propose a parameter-efficient adaptation for egocentric video tasks, namely Ego-VPA. It employs a local sparse approximation for each video frame/text feature using the basis prompts…
▽ More
Video understanding typically requires fine-tuning the large backbone when adapting to new domains. In this paper, we leverage the egocentric video foundation models (Ego-VFMs) based on video-language pre-training and propose a parameter-efficient adaptation for egocentric video tasks, namely Ego-VPA. It employs a local sparse approximation for each video frame/text feature using the basis prompts, and the selected basis prompts are used to synthesize video/text prompts. Since the basis prompts are shared across frames and modalities, it models context fusion and cross-modal transfer in an efficient fashion. Experiments show that Ego-VPA excels in lightweight adaptation (with only 0.84% learnable parameters), largely improving over baselines and reaching the performance of full fine-tuning.
△ Less
Submitted 26 February, 2025; v1 submitted 28 July, 2024;
originally announced July 2024.
-
SViTT-Ego: A Sparse Video-Text Transformer for Egocentric Video
Authors:
Hector A. Valdez,
Kyle Min,
Subarna Tripathi
Abstract:
Pretraining egocentric vision-language models has become essential to improving downstream egocentric video-text tasks. These egocentric foundation models commonly use the transformer architecture. The memory footprint of these models during pretraining can be substantial. Therefore, we pretrain SViTT-Ego, the first sparse egocentric video-text transformer model integrating edge and node sparsific…
▽ More
Pretraining egocentric vision-language models has become essential to improving downstream egocentric video-text tasks. These egocentric foundation models commonly use the transformer architecture. The memory footprint of these models during pretraining can be substantial. Therefore, we pretrain SViTT-Ego, the first sparse egocentric video-text transformer model integrating edge and node sparsification. We pretrain on the EgoClip dataset and incorporate the egocentric-friendly objective EgoNCE, instead of the frequently used InfoNCE. Most notably, SViTT-Ego obtains a +2.8% gain on EgoMCQ (intra-video) accuracy compared to LAVILA large, with no additional data augmentation techniques other than standard image augmentations, yet pretrainable on memory-limited devices.
△ Less
Submitted 12 June, 2024;
originally announced June 2024.
-
A PCA based Keypoint Tracking Approach to Automated Facial Expressions Encoding
Authors:
Shivansh Chandra Tripathi,
Rahul Garg
Abstract:
The Facial Action Coding System (FACS) for studying facial expressions is manual and requires significant effort and expertise. This paper explores the use of automated techniques to generate Action Units (AUs) for studying facial expressions. We propose an unsupervised approach based on Principal Component Analysis (PCA) and facial keypoint tracking to generate data-driven AUs called PCA AUs usin…
▽ More
The Facial Action Coding System (FACS) for studying facial expressions is manual and requires significant effort and expertise. This paper explores the use of automated techniques to generate Action Units (AUs) for studying facial expressions. We propose an unsupervised approach based on Principal Component Analysis (PCA) and facial keypoint tracking to generate data-driven AUs called PCA AUs using the publicly available DISFA dataset. The PCA AUs comply with the direction of facial muscle movements and are capable of explaining over 92.83 percent of the variance in other public test datasets (BP4D-Spontaneous and CK+), indicating their capability to generalize facial expressions. The PCA AUs are also comparable to a keypoint-based equivalence of FACS AUs in terms of variance explained on the test datasets. In conclusion, our research demonstrates the potential of automated techniques to be an alternative to manual FACS labeling which could lead to efficient real-time analysis of facial expressions in psychology and related fields. To promote further research, we have made code repository publicly available.
△ Less
Submitted 13 June, 2024;
originally announced June 2024.
-
Unsupervised learning of Data-driven Facial Expression Coding System (DFECS) using keypoint tracking
Authors:
Shivansh Chandra Tripathi,
Rahul Garg
Abstract:
The development of existing facial coding systems, such as the Facial Action Coding System (FACS), relied on manual examination of facial expression videos for defining Action Units (AUs). To overcome the labor-intensive nature of this process, we propose the unsupervised learning of an automated facial coding system by leveraging computer-vision-based facial keypoint tracking. In this novel facia…
▽ More
The development of existing facial coding systems, such as the Facial Action Coding System (FACS), relied on manual examination of facial expression videos for defining Action Units (AUs). To overcome the labor-intensive nature of this process, we propose the unsupervised learning of an automated facial coding system by leveraging computer-vision-based facial keypoint tracking. In this novel facial coding system called the Data-driven Facial Expression Coding System (DFECS), the AUs are estimated by applying dimensionality reduction to facial keypoint movements from a neutral frame through a proposed Full Face Model (FFM). FFM employs a two-level decomposition using advanced dimensionality reduction techniques such as dictionary learning (DL) and non-negative matrix factorization (NMF). These techniques enhance the interpretability of AUs by introducing constraints such as sparsity and positivity to the encoding matrix. Results show that DFECS AUs estimated from the DISFA dataset can account for an average variance of up to 91.29 percent in test datasets (CK+ and BP4D-Spontaneous) and also surpass the variance explained by keypoint-based equivalents of FACS AUs in these datasets. Additionally, 87.5 percent of DFECS AUs are interpretable, i.e., align with the direction of facial muscle movements. In summary, advancements in automated facial coding systems can accelerate facial expression analysis across diverse fields such as security, healthcare, and entertainment. These advancements offer numerous benefits, including enhanced detection of abnormal behavior, improved pain analysis in healthcare settings, and enriched emotion-driven interactions. To facilitate further research, the code repository of DFECS has been made publicly accessible.
△ Less
Submitted 8 June, 2024;
originally announced June 2024.
-
Contrastive Language Video Time Pre-training
Authors:
Hengyue Liu,
Kyle Min,
Hector A. Valdez,
Subarna Tripathi
Abstract:
We introduce LAVITI, a novel approach to learning language, video, and temporal representations in long-form videos via contrastive learning. Different from pre-training on video-text pairs like EgoVLP, LAVITI aims to align language, video, and temporal features by extracting meaningful moments in untrimmed videos. Our model employs a set of learnable moment queries to decode clip-level visual, la…
▽ More
We introduce LAVITI, a novel approach to learning language, video, and temporal representations in long-form videos via contrastive learning. Different from pre-training on video-text pairs like EgoVLP, LAVITI aims to align language, video, and temporal features by extracting meaningful moments in untrimmed videos. Our model employs a set of learnable moment queries to decode clip-level visual, language, and temporal features. In addition to vision and language alignment, we introduce relative temporal embeddings (TE) to represent timestamps in videos, which enables contrastive learning of time. Significantly different from traditional approaches, the prediction of a particular timestamp is transformed by computing the similarity score between the predicted TE and all TEs. Furthermore, existing approaches for video understanding are mainly designed for short videos due to high computational complexity and memory footprint. Our method can be trained on the Ego4D dataset with only 8 NVIDIA RTX-3090 GPUs in a day. We validated our method on CharadesEgo action recognition, achieving state-of-the-art results.
△ Less
Submitted 3 June, 2024;
originally announced June 2024.
-
D-VRE: From a Jupyter-enabled Private Research Environment to Decentralized Collaborative Research Ecosystem
Authors:
Yuandou Wang,
Sheejan Tripathi,
Siamak Farshidi,
Zhiming Zhao
Abstract:
Today, scientific research is increasingly data-centric and compute-intensive, relying on data and models across distributed sources. However, it still faces challenges in the traditional cooperation mode, due to the high storage and computing cost, geo-location barriers, and local confidentiality regulations. The Jupyter environment has recently emerged and evolved as a vital virtual research env…
▽ More
Today, scientific research is increasingly data-centric and compute-intensive, relying on data and models across distributed sources. However, it still faces challenges in the traditional cooperation mode, due to the high storage and computing cost, geo-location barriers, and local confidentiality regulations. The Jupyter environment has recently emerged and evolved as a vital virtual research environment for scientific computing, which researchers can use to scale computational analyses up to larger datasets and high-performance computing resources. Nevertheless, existing approaches lack robust support of a decentralized cooperation mode to unlock the full potential of decentralized collaborative scientific research, e.g., seamlessly secure data sharing. In this work, we change the basic structure and legacy norms of current research environments via the seamless integration of Jupyter with Ethereum blockchain capabilities. As such, it creates a Decentralized Virtual Research Environment (D-VRE) from private computational notebooks to decentralized collaborative research ecosystem. We propose a novel architecture for the D-VRE and prototype some essential D-VRE elements for enabling secure data sharing with decentralized identity, user-centric agreement-making, membership, and research asset management. To validate our method, we conducted an experimental study to test all functionalities of D-VRE smart contracts and their gas consumption. In addition, we deployed the D-VRE prototype on a test net of the Ethereum blockchain for demonstration. The feedback from the studies showcases the current prototype's usability, ease of use, and potential and suggests further improvements.
△ Less
Submitted 26 June, 2024; v1 submitted 24 May, 2024;
originally announced May 2024.
-
VideoSAGE: Video Summarization with Graph Representation Learning
Authors:
Jose M. Rojas Chaves,
Subarna Tripathi
Abstract:
We propose a graph-based representation learning framework for video summarization. First, we convert an input video to a graph where nodes correspond to each of the video frames. Then, we impose sparsity on the graph by connecting only those pairs of nodes that are within a specified temporal distance. We then formulate the video summarization task as a binary node classification problem, precise…
▽ More
We propose a graph-based representation learning framework for video summarization. First, we convert an input video to a graph where nodes correspond to each of the video frames. Then, we impose sparsity on the graph by connecting only those pairs of nodes that are within a specified temporal distance. We then formulate the video summarization task as a binary node classification problem, precisely classifying video frames whether they should belong to the output summary video. A graph constructed this way aims to capture long-range interactions among video frames, and the sparsity ensures the model trains without hitting the memory and compute bottleneck. Experiments on two datasets(SumMe and TVSum) demonstrate the effectiveness of the proposed nimble model compared to existing state-of-the-art summarization approaches while being one order of magnitude more efficient in compute time and memory
△ Less
Submitted 14 April, 2024;
originally announced April 2024.
-
Loss Regularizing Robotic Terrain Classification
Authors:
Shakti Deo Kumar,
Sudhanshu Tripathi,
Krishna Ujjwal,
Sarvada Sakshi Jha,
Suddhasil De
Abstract:
Locomotion mechanics of legged robots are suitable when pacing through difficult terrains. Recognising terrains for such robots are important to fully yoke the versatility of their movements. Consequently, robotic terrain classification becomes significant to classify terrains in real time with high accuracy. The conventional classifiers suffer from overfitting problem, low accuracy problem, high…
▽ More
Locomotion mechanics of legged robots are suitable when pacing through difficult terrains. Recognising terrains for such robots are important to fully yoke the versatility of their movements. Consequently, robotic terrain classification becomes significant to classify terrains in real time with high accuracy. The conventional classifiers suffer from overfitting problem, low accuracy problem, high variance problem, and not suitable for live dataset. On the other hand, classifying a growing dataset is difficult for convolution based terrain classification. Supervised recurrent models are also not practical for this classification. Further, the existing recurrent architectures are still evolving to improve accuracy of terrain classification based on live variable-length sensory data collected from legged robots. This paper proposes a new semi-supervised method for terrain classification of legged robots, avoiding preprocessing of long variable-length dataset. The proposed method has a stacked Long Short-Term Memory architecture, including a new loss regularization. The proposed method solves the existing problems and improves accuracy. Comparison with the existing architectures show the improvements.
△ Less
Submitted 20 March, 2024;
originally announced March 2024.
-
PRECISE Framework: GPT-based Text For Improved Readability, Reliability, and Understandability of Radiology Reports For Patient-Centered Care
Authors:
Satvik Tripathi,
Liam Mutter,
Meghana Muppuri,
Suhani Dheer,
Emiliano Garza-Frias,
Komal Awan,
Aakash Jha,
Michael Dezube,
Azadeh Tabari,
Christopher P. Bridge,
Dania Daye
Abstract:
This study introduces and evaluates the PRECISE framework, utilizing OpenAI's GPT-4 to enhance patient engagement by providing clearer and more accessible chest X-ray reports at a sixth-grade reading level. The framework was tested on 500 reports, demonstrating significant improvements in readability, reliability, and understandability. Statistical analyses confirmed the effectiveness of the PRECI…
▽ More
This study introduces and evaluates the PRECISE framework, utilizing OpenAI's GPT-4 to enhance patient engagement by providing clearer and more accessible chest X-ray reports at a sixth-grade reading level. The framework was tested on 500 reports, demonstrating significant improvements in readability, reliability, and understandability. Statistical analyses confirmed the effectiveness of the PRECISE approach, highlighting its potential to foster patient-centric care delivery in healthcare decision-making.
△ Less
Submitted 19 February, 2024;
originally announced March 2024.
-
Fusing Multiple Algorithms for Heterogeneous Online Learning
Authors:
Darshan Gadginmath,
Shivanshu Tripathi,
Fabio Pasqualetti
Abstract:
This study addresses the challenge of online learning in contexts where agents accumulate disparate data, face resource constraints, and use different local algorithms. This paper introduces the Switched Online Learning Algorithm (SOLA), designed to solve the heterogeneous online learning problem by amalgamating updates from diverse agents through a dynamic switching mechanism contingent upon thei…
▽ More
This study addresses the challenge of online learning in contexts where agents accumulate disparate data, face resource constraints, and use different local algorithms. This paper introduces the Switched Online Learning Algorithm (SOLA), designed to solve the heterogeneous online learning problem by amalgamating updates from diverse agents through a dynamic switching mechanism contingent upon their respective performance and available resources. We theoretically analyze the design of the selecting mechanism to ensure that the regret of SOLA is bounded. Our findings show that the number of changes in selection needs to be bounded by a parameter dependent on the performance of the different local algorithms. Additionally, two test cases are presented to emphasize the effectiveness of SOLA, first on an online linear regression problem and then on an online classification problem with the MNIST dataset.
△ Less
Submitted 8 December, 2023;
originally announced December 2023.
-
Action Scene Graphs for Long-Form Understanding of Egocentric Videos
Authors:
Ivan Rodin,
Antonino Furnari,
Kyle Min,
Subarna Tripathi,
Giovanni Maria Farinella
Abstract:
We present Egocentric Action Scene Graphs (EASGs), a new representation for long-form understanding of egocentric videos. EASGs extend standard manually-annotated representations of egocentric videos, such as verb-noun action labels, by providing a temporally evolving graph-based description of the actions performed by the camera wearer, including interacted objects, their relationships, and how a…
▽ More
We present Egocentric Action Scene Graphs (EASGs), a new representation for long-form understanding of egocentric videos. EASGs extend standard manually-annotated representations of egocentric videos, such as verb-noun action labels, by providing a temporally evolving graph-based description of the actions performed by the camera wearer, including interacted objects, their relationships, and how actions unfold in time. Through a novel annotation procedure, we extend the Ego4D dataset by adding manually labeled Egocentric Action Scene Graphs offering a rich set of annotations designed for long-from egocentric video understanding. We hence define the EASG generation task and provide a baseline approach, establishing preliminary benchmarks. Experiments on two downstream tasks, egocentric action anticipation and egocentric activity summarization, highlight the effectiveness of EASGs for long-form egocentric video understanding. We will release the dataset and the code to replicate experiments and annotations.
△ Less
Submitted 6 December, 2023;
originally announced December 2023.
-
FRCSyn Challenge at WACV 2024:Face Recognition Challenge in the Era of Synthetic Data
Authors:
Pietro Melzi,
Ruben Tolosana,
Ruben Vera-Rodriguez,
Minchul Kim,
Christian Rathgeb,
Xiaoming Liu,
Ivan DeAndres-Tame,
Aythami Morales,
Julian Fierrez,
Javier Ortega-Garcia,
Weisong Zhao,
Xiangyu Zhu,
Zheyu Yan,
Xiao-Yu Zhang,
Jinlin Wu,
Zhen Lei,
Suvidha Tripathi,
Mahak Kothari,
Md Haider Zama,
Debayan Deb,
Bernardo Biesseck,
Pedro Vidal,
Roger Granada,
Guilherme Fickel,
Gustavo Führ
, et al. (22 additional authors not shown)
Abstract:
Despite the widespread adoption of face recognition technology around the world, and its remarkable performance on current benchmarks, there are still several challenges that must be covered in more detail. This paper offers an overview of the Face Recognition Challenge in the Era of Synthetic Data (FRCSyn) organized at WACV 2024. This is the first international challenge aiming to explore the use…
▽ More
Despite the widespread adoption of face recognition technology around the world, and its remarkable performance on current benchmarks, there are still several challenges that must be covered in more detail. This paper offers an overview of the Face Recognition Challenge in the Era of Synthetic Data (FRCSyn) organized at WACV 2024. This is the first international challenge aiming to explore the use of synthetic data in face recognition to address existing limitations in the technology. Specifically, the FRCSyn Challenge targets concerns related to data privacy issues, demographic biases, generalization to unseen scenarios, and performance limitations in challenging scenarios, including significant age disparities between enrollment and testing, pose variations, and occlusions. The results achieved in the FRCSyn Challenge, together with the proposed benchmark, contribute significantly to the application of synthetic data to improve face recognition technology.
△ Less
Submitted 17 November, 2023;
originally announced November 2023.
-
MUNCH: Modelling Unique 'N Controllable Heads
Authors:
Debayan Deb,
Suvidha Tripathi,
Pranit Puri
Abstract:
The automated generation of 3D human heads has been an intriguing and challenging task for computer vision researchers. Prevailing methods synthesize realistic avatars but with limited control over the diversity and quality of rendered outputs and suffer from limited correlation between shape and texture of the character. We propose a method that offers quality, diversity, control, and realism alo…
▽ More
The automated generation of 3D human heads has been an intriguing and challenging task for computer vision researchers. Prevailing methods synthesize realistic avatars but with limited control over the diversity and quality of rendered outputs and suffer from limited correlation between shape and texture of the character. We propose a method that offers quality, diversity, control, and realism along with explainable network design, all desirable features to game-design artists in the domain. First, our proposed Geometry Generator identifies disentangled latent directions and generate novel and diverse samples. A Render Map Generator then learns to synthesize multiply high-fidelty physically-based render maps including Albedo, Glossiness, Specular, and Normals. For artists preferring fine-grained control over the output, we introduce a novel Color Transformer Model that allows semantic color control over generated maps. We also introduce quantifiable metrics called Uniqueness and Novelty and a combined metric to test the overall performance of our model. Demo for both shapes and textures can be found: https://munch-seven.vercel.app/. We will release our model along with the synthetic dataset.
△ Less
Submitted 4 October, 2023;
originally announced October 2023.
-
DECO: Dense Estimation of 3D Human-Scene Contact In The Wild
Authors:
Shashank Tripathi,
Agniv Chatterjee,
Jean-Claude Passy,
Hongwei Yi,
Dimitrios Tzionas,
Michael J. Black
Abstract:
Understanding how humans use physical contact to interact with the world is key to enabling human-centric artificial intelligence. While inferring 3D contact is crucial for modeling realistic and physically-plausible human-object interactions, existing methods either focus on 2D, consider body joints rather than the surface, use coarse 3D body regions, or do not generalize to in-the-wild images. I…
▽ More
Understanding how humans use physical contact to interact with the world is key to enabling human-centric artificial intelligence. While inferring 3D contact is crucial for modeling realistic and physically-plausible human-object interactions, existing methods either focus on 2D, consider body joints rather than the surface, use coarse 3D body regions, or do not generalize to in-the-wild images. In contrast, we focus on inferring dense, 3D contact between the full body surface and objects in arbitrary images. To achieve this, we first collect DAMON, a new dataset containing dense vertex-level contact annotations paired with RGB images containing complex human-object and human-scene contact. Second, we train DECO, a novel 3D contact detector that uses both body-part-driven and scene-context-driven attention to estimate vertex-level contact on the SMPL body. DECO builds on the insight that human observers recognize contact by reasoning about the contacting body parts, their proximity to scene objects, and the surrounding scene context. We perform extensive evaluations of our detector on DAMON as well as on the RICH and BEHAVE datasets. We significantly outperform existing SOTA methods across all benchmarks. We also show qualitatively that DECO generalizes well to diverse and challenging real-world human interactions in natural images. The code, data, and models are available at https://deco.is.tue.mpg.de.
△ Less
Submitted 26 September, 2023;
originally announced September 2023.
-
Implementation of Fast and Power Efficient SEC-DAEC and SEC-DAEC-TAEC Codecs on FPGA
Authors:
Sayan Tripathi,
Jhilam Jana,
Jaydeb Bhaumik
Abstract:
The reliability of memory devices is affected by radiation induced soft errors. Multiple cell upsets (MCUs) caused by radiation corrupt data stored in multiple cells within memories. Error correction codes (ECCs) are typically used to mitigate the effects of MCUs. Single error correction-double error detection (SEC-DED) codes are not the right choice against MCUs, but are more suitable for protect…
▽ More
The reliability of memory devices is affected by radiation induced soft errors. Multiple cell upsets (MCUs) caused by radiation corrupt data stored in multiple cells within memories. Error correction codes (ECCs) are typically used to mitigate the effects of MCUs. Single error correction-double error detection (SEC-DED) codes are not the right choice against MCUs, but are more suitable for protecting memory against single cell upset (SCU). Single error correction-double adjacent error correction (SEC-DAEC) and single error correction-double adjacent error correction-triple adjacent error correction (SEC-DAEC-TAEC) codes are more suitable due to the increasing tendency of adjacent errors. This paper presents the implementation of fast and low power multi-bit adjacent error correction codes for protecting memories. Related SEC-DAEC and SEC-DAEC-TAEC codecs with data length of 16-bit, 32-bit and 64-bit have been implemented. It is found from FPGA based implementation results that the modified designs have comparable area and have less delay and power consumption.
△ Less
Submitted 30 July, 2023;
originally announced July 2023.
-
Emotional Speech-Driven Animation with Content-Emotion Disentanglement
Authors:
Radek Daněček,
Kiran Chhatre,
Shashank Tripathi,
Yandong Wen,
Michael J. Black,
Timo Bolkart
Abstract:
To be widely adopted, 3D facial avatars must be animated easily, realistically, and directly from speech signals. While the best recent methods generate 3D animations that are synchronized with the input audio, they largely ignore the impact of emotions on facial expressions. Realistic facial animation requires lip-sync together with the natural expression of emotion. To that end, we propose EMOTE…
▽ More
To be widely adopted, 3D facial avatars must be animated easily, realistically, and directly from speech signals. While the best recent methods generate 3D animations that are synchronized with the input audio, they largely ignore the impact of emotions on facial expressions. Realistic facial animation requires lip-sync together with the natural expression of emotion. To that end, we propose EMOTE (Expressive Model Optimized for Talking with Emotion), which generates 3D talking-head avatars that maintain lip-sync from speech while enabling explicit control over the expression of emotion. To achieve this, we supervise EMOTE with decoupled losses for speech (i.e., lip-sync) and emotion. These losses are based on two key observations: (1) deformations of the face due to speech are spatially localized around the mouth and have high temporal frequency, whereas (2) facial expressions may deform the whole face and occur over longer intervals. Thus, we train EMOTE with a per-frame lip-reading loss to preserve the speech-dependent content, while supervising emotion at the sequence level. Furthermore, we employ a content-emotion exchange mechanism in order to supervise different emotions on the same audio, while maintaining the lip motion synchronized with the speech. To employ deep perceptual losses without getting undesirable artifacts, we devise a motion prior in the form of a temporal VAE. Due to the absence of high-quality aligned emotional 3D face datasets with speech, EMOTE is trained with 3D pseudo-ground-truth extracted from an emotional video dataset (i.e., MEAD). Extensive qualitative and perceptual evaluations demonstrate that EMOTE produces speech-driven facial animations with better lip-sync than state-of-the-art methods trained on the same data, while offering additional, high-quality emotional control.
△ Less
Submitted 26 September, 2023; v1 submitted 15 June, 2023;
originally announced June 2023.
-
Single-Stage Visual Relationship Learning using Conditional Queries
Authors:
Alakh Desai,
Tz-Ying Wu,
Subarna Tripathi,
Nuno Vasconcelos
Abstract:
Research in scene graph generation (SGG) usually considers two-stage models, that is, detecting a set of entities, followed by combining them and labeling all possible relationships. While showing promising results, the pipeline structure induces large parameter and computation overhead, and typically hinders end-to-end optimizations. To address this, recent research attempts to train single-stage…
▽ More
Research in scene graph generation (SGG) usually considers two-stage models, that is, detecting a set of entities, followed by combining them and labeling all possible relationships. While showing promising results, the pipeline structure induces large parameter and computation overhead, and typically hinders end-to-end optimizations. To address this, recent research attempts to train single-stage models that are computationally efficient. With the advent of DETR, a set based detection model, one-stage models attempt to predict a set of subject-predicate-object triplets directly in a single shot. However, SGG is inherently a multi-task learning problem that requires modeling entity and predicate distributions simultaneously. In this paper, we propose Transformers with conditional queries for SGG, namely, TraCQ with a new formulation for SGG that avoids the multi-task learning problem and the combinatorial entity pair distribution. We employ a DETR-based encoder-decoder design and leverage conditional queries to significantly reduce the entity label space as well, which leads to 20% fewer parameters compared to state-of-the-art single-stage models. Experimental results show that TraCQ not only outperforms existing single-stage scene graph generation methods, it also beats many state-of-the-art two-stage methods on the Visual Genome dataset, yet is capable of end-to-end training and faster inference.
△ Less
Submitted 9 June, 2023;
originally announced June 2023.
-
On the Coverage of Cognitive mmWave Networks with Directional Sensing and Communication
Authors:
Shuchi Tripathi,
Abhishek K. Gupta,
SaiDhiraj Amuru
Abstract:
Millimeter-waves' propagation characteristics create prospects for spatial and temporal spectrum sharing in a variety of contexts, including cognitive spectrum sharing (CSS). However, CSS along with omnidirectional sensing, is not efficient at mmWave frequencies due to their directional nature of transmission, as this limits secondary networks' ability to access the spectrum. This inspired us to c…
▽ More
Millimeter-waves' propagation characteristics create prospects for spatial and temporal spectrum sharing in a variety of contexts, including cognitive spectrum sharing (CSS). However, CSS along with omnidirectional sensing, is not efficient at mmWave frequencies due to their directional nature of transmission, as this limits secondary networks' ability to access the spectrum. This inspired us to create an analytical approach using stochastic geometry to examine the implications of directional cognitive sensing in mmWave networks. We explore a scenario where multiple secondary transmitter-receiver pairs coexist with a primary transmitter-receiver pair, forming a cognitive network. The positions of the secondary transmitters are modelled using a homogeneous Poisson point process (PPP) with corresponding secondary receivers located around them. A threshold on directional transmission is imposed on each secondary transmitter in order to limit its interference at the primary receiver. We derive the medium-access-probability of a secondary user along with the fraction of the secondary transmitters active at a time-instant. To understand cognition's feasibility, we derive the coverage probabilities of primary and secondary links. We provide various design insights via numerical results. For example, we investigate the interference-threshold's optimal value while ensuring coverage for both links and its dependence on various parameters. We find that directionality improves both links' performance as a key factor. Further, allowing location-aware secondary directionality can help achieve similar coverage for all secondary links.
△ Less
Submitted 2 June, 2023;
originally announced June 2023.
-
Safe and Secure Smart Home using Cisco Packet Tracer
Authors:
Shivansh Walia,
Tejas Iyer,
Shubham Tripathi,
Akshith Vanaparthy
Abstract:
This project presents an implementation and designing of safe, secure and smart home with enhanced levels of security features which uses IoT-based technology. We got our motivation for this project after learning about movement of west towards smart homes and designs. This galvanized us to engage in this work as we wanted for homeowners to have a greater control over their in-house environment wh…
▽ More
This project presents an implementation and designing of safe, secure and smart home with enhanced levels of security features which uses IoT-based technology. We got our motivation for this project after learning about movement of west towards smart homes and designs. This galvanized us to engage in this work as we wanted for homeowners to have a greater control over their in-house environment while also promising more safety and security features for the denizen. This contrivance of smart-home archetype has been intended to assimilate many kinds of sensors, boards along with advanced IoT devices and programming languages all of which in conjunction validate control and monitoring prowess over discrete electronic items present in home.
△ Less
Submitted 24 April, 2023;
originally announced April 2023.
-
SViTT: Temporal Learning of Sparse Video-Text Transformers
Authors:
Yi Li,
Kyle Min,
Subarna Tripathi,
Nuno Vasconcelos
Abstract:
Do video-text transformers learn to model temporal relationships across frames? Despite their immense capacity and the abundance of multimodal training data, recent work has revealed the strong tendency of video-text models towards frame-based spatial representations, while temporal reasoning remains largely unsolved. In this work, we identify several key challenges in temporal learning of video-t…
▽ More
Do video-text transformers learn to model temporal relationships across frames? Despite their immense capacity and the abundance of multimodal training data, recent work has revealed the strong tendency of video-text models towards frame-based spatial representations, while temporal reasoning remains largely unsolved. In this work, we identify several key challenges in temporal learning of video-text transformers: the spatiotemporal trade-off from limited network size; the curse of dimensionality for multi-frame modeling; and the diminishing returns of semantic information by extending clip length. Guided by these findings, we propose SViTT, a sparse video-text architecture that performs multi-frame reasoning with significantly lower cost than naive transformers with dense attention. Analogous to graph-based networks, SViTT employs two forms of sparsity: edge sparsity that limits the query-key communications between tokens in self-attention, and node sparsity that discards uninformative visual tokens. Trained with a curriculum which increases model sparsity with the clip length, SViTT outperforms dense transformer baselines on multiple video-text retrieval and question answering benchmarks, with a fraction of computational cost. Project page: http://svcl.ucsd.edu/projects/svitt.
△ Less
Submitted 18 April, 2023;
originally announced April 2023.
-
Unbiased Scene Graph Generation in Videos
Authors:
Sayak Nag,
Kyle Min,
Subarna Tripathi,
Amit K. Roy Chowdhury
Abstract:
The task of dynamic scene graph generation (SGG) from videos is complicated and challenging due to the inherent dynamics of a scene, temporal fluctuation of model predictions, and the long-tailed distribution of the visual relationships in addition to the already existing challenges in image-based SGG. Existing methods for dynamic SGG have primarily focused on capturing spatio-temporal context usi…
▽ More
The task of dynamic scene graph generation (SGG) from videos is complicated and challenging due to the inherent dynamics of a scene, temporal fluctuation of model predictions, and the long-tailed distribution of the visual relationships in addition to the already existing challenges in image-based SGG. Existing methods for dynamic SGG have primarily focused on capturing spatio-temporal context using complex architectures without addressing the challenges mentioned above, especially the long-tailed distribution of relationships. This often leads to the generation of biased scene graphs. To address these challenges, we introduce a new framework called TEMPURA: TEmporal consistency and Memory Prototype guided UnceRtainty Attenuation for unbiased dynamic SGG. TEMPURA employs object-level temporal consistencies via transformer-based sequence modeling, learns to synthesize unbiased relationship representations using memory-guided training, and attenuates the predictive uncertainty of visual relations using a Gaussian Mixture Model (GMM). Extensive experiments demonstrate that our method achieves significant (up to 10% in some cases) performance gain over existing methods highlighting its superiority in generating more unbiased scene graphs.
△ Less
Submitted 29 June, 2023; v1 submitted 3 April, 2023;
originally announced April 2023.
-
3D Human Pose Estimation via Intuitive Physics
Authors:
Shashank Tripathi,
Lea Müller,
Chun-Hao P. Huang,
Omid Taheri,
Michael J. Black,
Dimitrios Tzionas
Abstract:
Estimating 3D humans from images often produces implausible bodies that lean, float, or penetrate the floor. Such methods ignore the fact that bodies are typically supported by the scene. A physics engine can be used to enforce physical plausibility, but these are not differentiable, rely on unrealistic proxy bodies, and are difficult to integrate into existing optimization and learning frameworks…
▽ More
Estimating 3D humans from images often produces implausible bodies that lean, float, or penetrate the floor. Such methods ignore the fact that bodies are typically supported by the scene. A physics engine can be used to enforce physical plausibility, but these are not differentiable, rely on unrealistic proxy bodies, and are difficult to integrate into existing optimization and learning frameworks. In contrast, we exploit novel intuitive-physics (IP) terms that can be inferred from a 3D SMPL body interacting with the scene. Inspired by biomechanics, we infer the pressure heatmap on the body, the Center of Pressure (CoP) from the heatmap, and the SMPL body's Center of Mass (CoM). With these, we develop IPMAN, to estimate a 3D body from a color image in a "stable" configuration by encouraging plausible floor contact and overlapping CoP and CoM. Our IP terms are intuitive, easy to implement, fast to compute, differentiable, and can be integrated into existing optimization and regression methods. We evaluate IPMAN on standard datasets and MoYo, a new dataset with synchronized multi-view images, ground-truth 3D bodies with complex poses, body-floor contact, CoM and pressure. IPMAN produces more plausible results than the state of the art, improving accuracy for static poses, while not hurting dynamic ones. Code and data are available for research at https://ipman.is.tue.mpg.de.
△ Less
Submitted 24 July, 2023; v1 submitted 31 March, 2023;
originally announced March 2023.
-
Fuzzified advanced robust hashes for identification of digital and physical objects
Authors:
Shashank Tripathi,
Volker Skwarek
Abstract:
With the rising numbers for IoT objects, it is becoming easier to penetrate counterfeit objects into the mainstream market by adversaries. Such infiltration of bogus products can be addressed with third-party-verifiable identification. Generally, state-of-the-art identification schemes do not guarantee that an identifier e.g. barcodes or RFID itself cannot be forged. This paper introduces identifi…
▽ More
With the rising numbers for IoT objects, it is becoming easier to penetrate counterfeit objects into the mainstream market by adversaries. Such infiltration of bogus products can be addressed with third-party-verifiable identification. Generally, state-of-the-art identification schemes do not guarantee that an identifier e.g. barcodes or RFID itself cannot be forged. This paper introduces identification patterns representing the objects intrinsic identity by robust hashes and not only by generated identification patterns. Inspired by these two notions, a collection of uniquely identifiable attributes called quasi-identifiers (QI) can be used to identify an object. Since all attributes do not contribute equally towards an object's identity, each QI has a different contribution towards the identifier. A robust hash developed utilising the QI has been named fuzzified robust hashes (FaR hashes), which can be used as an object identifier. Although the FaR hash is a single hash string, selected bits change in response to the modification of QI. On the other hand, other QIs in the object are more important for the object's identity. If these QIs change, the complete FaR hash is going to change. The calculation of FaR hash using attributes should allow third parties to generate the identifier and compare it with the current one to verify the genuineness of the object.
△ Less
Submitted 30 March, 2023;
originally announced March 2023.
-
MIME: Human-Aware 3D Scene Generation
Authors:
Hongwei Yi,
Chun-Hao P. Huang,
Shashank Tripathi,
Lea Hering,
Justus Thies,
Michael J. Black
Abstract:
Generating realistic 3D worlds occupied by moving humans has many applications in games, architecture, and synthetic data creation. But generating such scenes is expensive and labor intensive. Recent work generates human poses and motions given a 3D scene. Here, we take the opposite approach and generate 3D indoor scenes given 3D human motion. Such motions can come from archival motion capture or…
▽ More
Generating realistic 3D worlds occupied by moving humans has many applications in games, architecture, and synthetic data creation. But generating such scenes is expensive and labor intensive. Recent work generates human poses and motions given a 3D scene. Here, we take the opposite approach and generate 3D indoor scenes given 3D human motion. Such motions can come from archival motion capture or from IMU sensors worn on the body, effectively turning human movement in a "scanner" of the 3D world. Intuitively, human movement indicates the free-space in a room and human contact indicates surfaces or objects that support activities such as sitting, lying or touching. We propose MIME (Mining Interaction and Movement to infer 3D Environments), which is a generative model of indoor scenes that produces furniture layouts that are consistent with the human movement. MIME uses an auto-regressive transformer architecture that takes the already generated objects in the scene as well as the human motion as input, and outputs the next plausible object. To train MIME, we build a dataset by populating the 3D FRONT scene dataset with 3D humans. Our experiments show that MIME produces more diverse and plausible 3D scenes than a recent generative scene method that does not know about human movement. Code and data will be available for research at https://mime.is.tue.mpg.de.
△ Less
Submitted 8 December, 2022;
originally announced December 2022.
-
Algorithmic Bias in Machine Learning Based Delirium Prediction
Authors:
Sandhya Tripathi,
Bradley A Fritz,
Michael S Avidan,
Yixin Chen,
Christopher R King
Abstract:
Although prediction models for delirium, a commonly occurring condition during general hospitalization or post-surgery, have not gained huge popularity, their algorithmic bias evaluation is crucial due to the existing association between social determinants of health and delirium risk. In this context, using MIMIC-III and another academic hospital dataset, we present some initial experimental evid…
▽ More
Although prediction models for delirium, a commonly occurring condition during general hospitalization or post-surgery, have not gained huge popularity, their algorithmic bias evaluation is crucial due to the existing association between social determinants of health and delirium risk. In this context, using MIMIC-III and another academic hospital dataset, we present some initial experimental evidence showing how sociodemographic features such as sex and race can impact the model performance across subgroups. With this work, our intent is to initiate a discussion about the intersectionality effects of old age, race and socioeconomic factors on the early-stage detection and prevention of delirium using ML.
△ Less
Submitted 26 November, 2022; v1 submitted 8 November, 2022;
originally announced November 2022.
-
DELFI: Deep Mixture Models for Long-term Air Quality Forecasting in the Delhi National Capital Region
Authors:
Naishadh Parmar,
Raunak Shah,
Tushar Goswamy,
Vatsalya Tandon,
Ravi Sahu,
Ronak Sutaria,
Purushottam Kar,
Sachchida Nand Tripathi
Abstract:
The identification and control of human factors in climate change is a rapidly growing concern and robust, real-time air-quality monitoring and forecasting plays a critical role in allowing effective policy formulation and implementation. This paper presents DELFI, a novel deep learning-based mixture model to make effective long-term predictions of Particulate Matter (PM) 2.5 concentrations. A key…
▽ More
The identification and control of human factors in climate change is a rapidly growing concern and robust, real-time air-quality monitoring and forecasting plays a critical role in allowing effective policy formulation and implementation. This paper presents DELFI, a novel deep learning-based mixture model to make effective long-term predictions of Particulate Matter (PM) 2.5 concentrations. A key novelty in DELFI is its multi-scale approach to the forecasting problem. The observation that point predictions are more suitable in the short-term and probabilistic predictions in the long-term allows accurate predictions to be made as much as 24 hours in advance. DELFI incorporates meteorological data as well as pollutant-based features to ensure a robust model that is divided into two parts: (i) a stack of three Long Short-Term Memory (LSTM) networks that perform differential modelling of the same window of past data, and (ii) a fully-connected layer enabling attention to each of the components. Experimental evaluation based on deployment of 13 stations in the Delhi National Capital Region (Delhi-NCR) in India establishes that DELFI offers far superior predictions especially in the long-term as compared to even non-parametric baselines. The Delhi-NCR recorded the 3rd highest PM levels amongst 39 mega-cities across the world during 2011-2015 and DELFI's performance establishes it as a potential tool for effective long-term forecasting of PM levels to enable public health management and environment protection.
△ Less
Submitted 28 October, 2022;
originally announced October 2022.
-
PERI: Part Aware Emotion Recognition In The Wild
Authors:
Akshita Mittel,
Shashank Tripathi
Abstract:
Emotion recognition aims to interpret the emotional states of a person based on various inputs including audio, visual, and textual cues. This paper focuses on emotion recognition using visual features. To leverage the correlation between facial expression and the emotional state of a person, pioneering methods rely primarily on facial features. However, facial features are often unreliable in nat…
▽ More
Emotion recognition aims to interpret the emotional states of a person based on various inputs including audio, visual, and textual cues. This paper focuses on emotion recognition using visual features. To leverage the correlation between facial expression and the emotional state of a person, pioneering methods rely primarily on facial features. However, facial features are often unreliable in natural unconstrained scenarios, such as in crowded scenes, as the face lacks pixel resolution and contains artifacts due to occlusion and blur. To address this, in the wild emotion recognition exploits full-body person crops as well as the surrounding scene context. In a bid to use body pose for emotion recognition, such methods fail to realize the potential that facial expressions, when available, offer. Thus, the aim of this paper is two-fold. First, we demonstrate our method, PERI, to leverage both body pose and facial landmarks. We create part aware spatial (PAS) images by extracting key regions from the input image using a mask generated from both body pose and facial landmarks. This allows us to exploit body pose in addition to facial context whenever available. Second, to reason from the PAS images, we introduce context infusion (Cont-In) blocks. These blocks attend to part-specific information, and pass them onto the intermediate features of an emotion recognition network. Our approach is conceptually simple and can be applied to any existing emotion recognition method. We provide our results on the publicly available in the wild EMOTIC dataset. Compared to existing methods, PERI achieves superior performance and leads to significant improvements in the mAP of emotion categories, while decreasing Valence, Arousal and Dominance errors. Importantly, we observe that our method improves performance in both images with fully visible faces as well as in images with occluded or blurred faces.
△ Less
Submitted 18 October, 2022;
originally announced October 2022.
-
Leveraging unsupervised data and domain adaptation for deep regression in low-cost sensor calibration
Authors:
Swapnil Dey,
Vipul Arora,
Sachchida Nand Tripathi
Abstract:
Air quality monitoring is becoming an essential task with rising awareness about air quality. Low cost air quality sensors are easy to deploy but are not as reliable as the costly and bulky reference monitors. The low quality sensors can be calibrated against the reference monitors with the help of deep learning. In this paper, we translate the task of sensor calibration into a semi-supervised dom…
▽ More
Air quality monitoring is becoming an essential task with rising awareness about air quality. Low cost air quality sensors are easy to deploy but are not as reliable as the costly and bulky reference monitors. The low quality sensors can be calibrated against the reference monitors with the help of deep learning. In this paper, we translate the task of sensor calibration into a semi-supervised domain adaptation problem and propose a novel solution for the same. The problem is challenging because it is a regression problem with covariate shift and label gap. We use histogram loss instead of mean squared or mean absolute error, which is commonly used for regression, and find it useful against covariate shift. To handle the label gap, we propose weighting of samples for adversarial entropy optimization. In experimental evaluations, the proposed scheme outperforms many competitive baselines, which are based on semi-supervised and supervised domain adaptation, in terms of R2 score and mean absolute error. Ablation studies show the relevance of each proposed component in the entire scheme.
△ Less
Submitted 2 October, 2022;
originally announced October 2022.
-
Maximum Minimal Feedback Vertex Set: A Parameterized Perspective
Authors:
Ajinkya Gaikwad,
Hitendra Kumar,
Soumen Maity,
Saket Saurabh,
Shuvam Kant Tripathi
Abstract:
In this paper we study a maximization version of the classical Feedback Vertex Set (FVS) problem, namely, the Max Min FVS problem, in the realm of parameterized complexity. In this problem, given an undirected graph $G$, a positive integer $k$, the question is to check whether $G$ has a minimal feedback vertex set of size at least $k$. We obtain following results for Max Min FVS.
1) We first des…
▽ More
In this paper we study a maximization version of the classical Feedback Vertex Set (FVS) problem, namely, the Max Min FVS problem, in the realm of parameterized complexity. In this problem, given an undirected graph $G$, a positive integer $k$, the question is to check whether $G$ has a minimal feedback vertex set of size at least $k$. We obtain following results for Max Min FVS.
1) We first design a fixed parameter tractable (FPT) algorithm for Max Min FVS running in time $10^kn^{\mathcal{O}(1)}$.
2) Next, we consider the problem parameterized by the vertex cover number of the input graph (denoted by $\mathsf{vc}(G)$), and design an algorithm with running time $2^{\mathcal{O}(\mathsf{vc}(G)\log \mathsf{vc}(G))}n^{\mathcal{O}(1)}$. We complement this result by showing that the problem parameterized by $\mathsf{vc}(G)$ does not admit a polynomial compression unless coNP $\subseteq$ NP/poly.
3) Finally, we give an FPT-approximation scheme (fpt-AS) parameterized by $\mathsf{vc}(G)$. That is, we design an algorithm that for every $ε>0$, runs in time $2^{\mathcal{O}\left(\frac{\mathsf{vc}(G)}ε\right)} n^{\mathcal{O}(1)}$ and returns a minimal feedback vertex set of size at least $(1-ε){\sf opt}$.
△ Less
Submitted 3 August, 2022;
originally announced August 2022.
-
Learning Long-Term Spatial-Temporal Graphs for Active Speaker Detection
Authors:
Kyle Min,
Sourya Roy,
Subarna Tripathi,
Tanaya Guha,
Somdeb Majumdar
Abstract:
Active speaker detection (ASD) in videos with multiple speakers is a challenging task as it requires learning effective audiovisual features and spatial-temporal correlations over long temporal windows. In this paper, we present SPELL, a novel spatial-temporal graph learning framework that can solve complex tasks such as ASD. To this end, each person in a video frame is first encoded in a unique n…
▽ More
Active speaker detection (ASD) in videos with multiple speakers is a challenging task as it requires learning effective audiovisual features and spatial-temporal correlations over long temporal windows. In this paper, we present SPELL, a novel spatial-temporal graph learning framework that can solve complex tasks such as ASD. To this end, each person in a video frame is first encoded in a unique node for that frame. Nodes corresponding to a single person across frames are connected to encode their temporal dynamics. Nodes within a frame are also connected to encode inter-person relationships. Thus, SPELL reduces ASD to a node classification task. Importantly, SPELL is able to reason over long temporal contexts for all nodes without relying on computationally expensive fully connected graph neural networks. Through extensive experiments on the AVA-ActiveSpeaker dataset, we demonstrate that learning graph-based representations can significantly improve the active speaker detection performance owing to its explicit spatial and temporal structure. SPELL outperforms all previous state-of-the-art approaches while requiring significantly lower memory and computational resources. Our code is publicly available at https://github.com/SRA2/SPELL
△ Less
Submitted 12 October, 2022; v1 submitted 15 July, 2022;
originally announced July 2022.
-
Deep Learning to Jointly Schema Match, Impute, and Transform Databases
Authors:
Sandhya Tripathi,
Bradley A. Fritz,
Mohamed Abdelhack,
Michael S. Avidan,
Yixin Chen,
Christopher R. King
Abstract:
An applied problem facing all areas of data science is harmonizing data sources. Joining data from multiple origins with unmapped and only partially overlapping features is a prerequisite to developing and testing robust, generalizable algorithms, especially in health care. We approach this issue in the common but difficult case of numeric features such as nearly Gaussian and binary features, wher…
▽ More
An applied problem facing all areas of data science is harmonizing data sources. Joining data from multiple origins with unmapped and only partially overlapping features is a prerequisite to developing and testing robust, generalizable algorithms, especially in health care. We approach this issue in the common but difficult case of numeric features such as nearly Gaussian and binary features, where unit changes and variable shift make simple matching of univariate summaries unsuccessful. We develop two novel procedures to address this problem. First, we demonstrate multiple methods of "fingerprinting" a feature based on its associations to other features. In the setting of even modest prior information, this allows most shared features to be accurately identified. Second, we demonstrate a deep learning algorithm for translation between databases. Unlike prior approaches, our algorithm takes advantage of discovered mappings while identifying surrogates for unshared features and learning transformations. In synthetic and real-world experiments using two electronic health record databases, our algorithms outperform existing baselines for matching variable sets, while jointly learning to impute unshared or transformed variables.
△ Less
Submitted 22 June, 2022;
originally announced July 2022.
-
Moving Smart Contracts -- A Privacy Preserving Method for Off-Chain Data Trust
Authors:
Simon Tschirner,
Shashank Shekher Tripathi,
Mathias Roeper,
Markus M. Becker,
Volker Skwarek
Abstract:
Blockchains provide environments where parties can interact transparently and securely peer-to-peer without needing a trusted third party. Parties can trust the integrity and correctness of transactions and the verifiable execution of binary code on the blockchain (smart contracts) inside the system. Including information from outside of the blockchain remains challenging. A challenge is data priv…
▽ More
Blockchains provide environments where parties can interact transparently and securely peer-to-peer without needing a trusted third party. Parties can trust the integrity and correctness of transactions and the verifiable execution of binary code on the blockchain (smart contracts) inside the system. Including information from outside of the blockchain remains challenging. A challenge is data privacy. In a public system, shared data becomes public and, coming from a single source, often lacks credibility. A private system gives the parties control over their data and sources but trades in positive aspects as transparency. Often, not the data itself is the most critical information but the result of a computation performed on it.
An example is research data certification. To keep data private but still prove data provenance, researchers can store a hash value of that data on the blockchain. This hash value is either calculated locally on private data without the chance for validation or is calculated on the blockchain, meaning that data must be published and stored on the blockchain -- a problem of the overall data amount stored on and distributed with the ledger. A system we called moving smart contracts bypasses this problem: Data remain local, but trusted nodes can access them and execute trusted smart contract code stored on the blockchain. This method avoids the system-wide distribution of research data and makes it accessible and verifiable with trusted software.
△ Less
Submitted 18 May, 2022; v1 submitted 17 May, 2022;
originally announced May 2022.
-
Automated Application Processing
Authors:
Eshita Sharma,
Keshav Gupta,
Lubaina Machinewala,
Samaksh Dhingra,
Shrey Tripathi,
Shreyas V S,
Sujit Kumar Chakrabarti
Abstract:
Recruitment in large organisations often involves interviewing a large number of candidates. The process is resource intensive and complex. Therefore, it is important to carry it out efficiently and effectively. Planning the selection process consists of several problems, each of which maps to one or the other well-known computing problem. Research that looks at each of these problems in isolation…
▽ More
Recruitment in large organisations often involves interviewing a large number of candidates. The process is resource intensive and complex. Therefore, it is important to carry it out efficiently and effectively. Planning the selection process consists of several problems, each of which maps to one or the other well-known computing problem. Research that looks at each of these problems in isolation is rich and mature. However, research that takes an integrated view of the problem is not common. In this paper, we take two of the most important aspects of the application processing problem, namely review/interview panel creation and interview scheduling. We have implemented our approach as a prototype system and have used it to automatically plan the interview process of a real-life data set. Our system provides a distinctly better plan than the existing practice, which is predominantly manual. We have explored various algorithmic options and have customised them to solve these panel creation and interview scheduling problems. We have evaluated these design options experimentally on a real data set and have presented our observations. Our prototype and experimental process and results may be a very good starting point for a full-fledged development project for automating application processing process.
△ Less
Submitted 19 April, 2022;
originally announced April 2022.
-
EvoSTS Forecasting: Evolutionary Sparse Time-Series Forecasting
Authors:
Ethan Jacob Moyer,
Alisha Isabelle Augustin,
Satvik Tripathi,
Ansh Aashish Dholakia,
Andy Nguyen,
Isamu Mclean Isozaki,
Daniel Schwartz,
Edward Kim
Abstract:
In this work, we highlight our novel evolutionary sparse time-series forecasting algorithm also known as EvoSTS. The algorithm attempts to evolutionary prioritize weights of Long Short-Term Memory (LSTM) Network that best minimize the reconstruction loss of a predicted signal using a learned sparse coded dictionary. In each generation of our evolutionary algorithm, a set number of children with th…
▽ More
In this work, we highlight our novel evolutionary sparse time-series forecasting algorithm also known as EvoSTS. The algorithm attempts to evolutionary prioritize weights of Long Short-Term Memory (LSTM) Network that best minimize the reconstruction loss of a predicted signal using a learned sparse coded dictionary. In each generation of our evolutionary algorithm, a set number of children with the same initial weights are spawned. Each child undergoes a training step and adjusts their weights on the same data. Due to stochastic back-propagation, the set of children has a variety of weights with different levels of performance. The weights that best minimize the reconstruction loss with a given signal dictionary are passed to the next generation. The predictions from the best-performing weights of the first and last generation are compared. We found improvements while comparing the weights of these two generations. However, due to several confounding parameters and hyperparameter limitations, some of the weights had negligible improvements. To the best of our knowledge, this is the first attempt to use sparse coding in this way to optimize time series forecasting model weights, such as those of an LSTM network.
△ Less
Submitted 14 April, 2022;
originally announced April 2022.
-
Text Spotting Transformers
Authors:
Xiang Zhang,
Yongwen Su,
Subarna Tripathi,
Zhuowen Tu
Abstract:
In this paper, we present TExt Spotting TRansformers (TESTR), a generic end-to-end text spotting framework using Transformers for text detection and recognition in the wild. TESTR builds upon a single encoder and dual decoders for the joint text-box control point regression and character recognition. Other than most existing literature, our method is free from Region-of-Interest operations and heu…
▽ More
In this paper, we present TExt Spotting TRansformers (TESTR), a generic end-to-end text spotting framework using Transformers for text detection and recognition in the wild. TESTR builds upon a single encoder and dual decoders for the joint text-box control point regression and character recognition. Other than most existing literature, our method is free from Region-of-Interest operations and heuristics-driven post-processing procedures; TESTR is particularly effective when dealing with curved text-boxes where special cares are needed for the adaptation of the traditional bounding-box representations. We show our canonical representation of control points suitable for text instances in both Bezier curve and polygon annotations. In addition, we design a bounding-box guided polygon detection (box-to-polygon) process. Experiments on curved and arbitrarily shaped datasets demonstrate state-of-the-art performances of the proposed TESTR algorithm.
△ Less
Submitted 4 April, 2022;
originally announced April 2022.