Search | arXiv e-print repository

FLoRIST: Singular Value Thresholding for Efficient and Accurate Federated Fine-Tuning of Large Language Models

Authors: Hariharan Ramesh, Jyotikrishna Dass

Abstract: Integrating Low-Rank Adaptation (LoRA) into federated learning offers a promising solution for parameter-efficient fine-tuning of Large Language Models (LLMs) without sharing local data. However, several methods designed for federated LoRA present significant challenges in balancing communication efficiency, model accuracy, and computational cost, particularly among heterogeneous clients. These methods either rely on simplistic averaging of local adapters, which introduces aggregation noise, require transmitting large stacked local adapters, leading to poor communication efficiency, or necessitate reconstructing a memory-dense global weight-update matrix and performing computationally expensive decomposition to design client-specific low-rank adapters. In this work, we propose FLoRIST, a federated fine-tuning framework that achieves mathematically accurate aggregation without incurring high communication or computational overhead. Instead of constructing the full global weight-update matrix at the server, FLoRIST employs an efficient decomposition pipeline by performing singular value decomposition on stacked local adapters separately. This approach operates within a compact intermediate space to represent the accumulated information from local LoRAs. We introduce tunable singular value thresholding for server-side optimal rank selection to construct a pair of global low-rank adapters shared by all clients. Extensive empirical evaluations across multiple datasets and LLMs demonstrate that FLoRIST consistently strikes the best balance between superior communication efficiency and competitive performance in both homogeneous and heterogeneous setups.

Submitted 10 June, 2025; originally announced June 2025.

Comments: 21 pages, 12 figures
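
The aggregation idea in the FLoRIST abstract above (SVD on the stacked local adapters separately, a compact intermediate space, and singular value thresholding) can be illustrated with a minimal NumPy sketch. This is a reconstruction from the abstract only, not the authors' implementation: the function name `aggregate_lora`, the client weights, and the cumulative-singular-value criterion controlled by `tau` are all assumptions made for illustration.

```python
import numpy as np

def aggregate_lora(Bs, As, weights, tau=0.95):
    """Hypothetical server-side aggregation of LoRA pairs (B_k, A_k).

    Bs[k] has shape (d, r_k), As[k] has shape (r_k, n); ranks may differ.
    tau is an assumed threshold on the cumulative share of singular values.
    """
    # Stacking makes B_stack @ A_stack equal sum_k w_k * B_k @ A_k,
    # without ever forming that (d, n) matrix explicitly.
    B_stack = np.concatenate([w * B for w, B in zip(weights, Bs)], axis=1)
    A_stack = np.concatenate(As, axis=0)

    # Decompose the two stacked factors separately.
    Ub, sb, Vbt = np.linalg.svd(B_stack, full_matrices=False)
    Ua, sa, Vat = np.linalg.svd(A_stack, full_matrices=False)

    # Small intermediate matrix whose size depends only on the summed ranks.
    core = (sb[:, None] * Vbt) @ (Ua * sa[None, :])
    Um, sm, Vmt = np.linalg.svd(core, full_matrices=False)

    # Keep the smallest rank whose singular values reach a tau share.
    energy = np.cumsum(sm) / np.sum(sm)
    rank = min(int(np.searchsorted(energy, tau)) + 1, len(sm))

    B_global = (Ub @ Um[:, :rank]) * sm[:rank]   # shape (d, rank)
    A_global = Vmt[:rank] @ Vat                  # shape (rank, n)
    return B_global, A_global, rank
```

With `tau=1.0` the product `B_global @ A_global` reproduces the weighted sum of local updates up to floating-point error; lower values of `tau` trade accuracy for a smaller pair of adapters to send back to clients.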

arXiv:2506.00172 [pdf, ps, other]

Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents

Authors: Kaivalya Hariharan, Uzay Girit, Atticus Wang, Jacob Andreas

Abstract: Benchmarks for large language models (LLMs) have predominantly assessed short-horizon, localized reasoning. Existing long-horizon suites (e.g. SWE-bench) rely on manually curated issues, so expanding or tuning difficulty demands expensive human effort and evaluations quickly saturate. However, many real-world tasks, such as software engineering or scientific research, require agents to rapidly comprehend and manipulate novel, complex structures dynamically; evaluating these capabilities requires the ability to construct large and varied sets of problems for agents to solve. We introduce Breakpoint, a benchmarking methodology that automatically generates code-repair tasks by adversarially corrupting functions within real-world software repositories. Breakpoint systematically controls task difficulty along two clear dimensions: local reasoning (characterized by code complexity metrics such as cyclomatic complexity) and system-level reasoning (characterized by call-graph centrality and the number of simultaneously corrupted interdependent functions). In experiments across more than 900 generated tasks we demonstrate that our methodology can scale to arbitrary difficulty, with state-of-the-art models' success rates ranging from 55% on the easiest tasks down to 0% on the hardest.

Submitted 30 May, 2025; originally announced June 2025.

Comments: 21 pages, 14 figures

arXiv:2505.11204 [pdf, other]

RanDeS: Randomized Delta Superposition for Multi-Model Compression

Authors: Hangyu Zhou, Aaron Gokaslan, Volodymyr Kuleshov, Bharath Hariharan

Abstract: From a multi-model compression perspective, model merging enables memory-efficient serving of multiple models fine-tuned from the same base, but suffers from degraded performance due to interference among their task-specific parameter adjustments (i.e., deltas). In this paper, we reformulate model merging as a compress-and-retrieve scheme, revealing that the task interference arises from the summation of irrelevant deltas during model retrieval. To address this issue, we use random orthogonal transformations to decorrelate these vectors into self-cancellation. We show that this approach drastically reduces interference, improving performance across both vision and language tasks. Since these transformations are fully defined by random seeds, adding new models requires no extra memory. Further, their data- and model-agnostic nature enables easy addition or removal of models with minimal compute overhead, supporting efficient and flexible multi-model serving.

Submitted 16 May, 2025; originally announced May 2025.

Comments: https://github.com/Zhou-Hangyu/randes
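
The compress-and-retrieve scheme described in the RanDeS abstract above can be sketched in a few lines of NumPy. The sketch assumes the simplest seed-defined orthogonal transformation, a random diagonal matrix of ±1 entries; the abstract does not say which transformations the paper uses, and the helper names `signs`, `compress`, and `retrieve` are hypothetical.

```python
import numpy as np

def signs(seed, n):
    # A diagonal +/-1 matrix is orthogonal and its own inverse; it is fully
    # determined by the seed, so nothing has to be stored per model.
    return np.random.default_rng(seed).choice([-1.0, 1.0], size=n)

def compress(deltas, seeds):
    # Superpose all transformed deltas into one vector of the same size.
    return sum(signs(s, d.size) * d for s, d in zip(seeds, deltas))

def retrieve(superposed, seed):
    # Undo one model's transform: its delta comes back exactly, while the
    # other deltas stay randomly signed and act as decorrelated noise.
    return signs(seed, superposed.size) * superposed

# Three strongly correlated deltas (hypothetical data for illustration).
rng = np.random.default_rng(42)
shared = rng.normal(size=10_000)
deltas = [shared + 0.1 * rng.normal(size=shared.size) for _ in range(3)]
seeds = [0, 1, 2]

packed = compress(deltas, seeds)
interference = retrieve(packed, seeds[0]) - deltas[0]
cosine = interference @ deltas[0] / (
    np.linalg.norm(interference) * np.linalg.norm(deltas[0])
)
```

In a plain sum of these deltas, the interference on model 0 would point almost exactly along `deltas[0]`; after the random sign flips its cosine similarity with `deltas[0]` is close to zero, which is the decorrelation effect the abstract describes.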

arXiv:2504.14039 [pdf, other]

MEQA: A Meta-Evaluation Framework for Question & Answer LLM Benchmarks

Authors: Jaime Raldua Veuthey, Zainab Ali Majid, Suhas Hariharan, Jacob Haimes

Abstract: As Large Language Models (LLMs) advance, their potential for widespread societal impact grows simultaneously. Hence, rigorous LLM evaluations are both a technical necessity and social imperative. While numerous evaluation benchmarks have been developed, there remains a critical gap in meta-evaluation: effectively assessing benchmarks' quality. We propose MEQA, a framework for the meta-evaluation of question and answer (QA) benchmarks, to provide standardized assessments, quantifiable scores, and enable meaningful intra-benchmark comparisons. We demonstrate this approach on cybersecurity benchmarks, using human and LLM evaluators, highlighting the benchmarks' strengths and weaknesses. We motivate our choice of test domain by AI models' dual nature as powerful defensive tools and security threats.

Submitted 18 April, 2025; originally announced April 2025.

arXiv:2504.12110 [pdf, other]

Towards LLM Agents for Earth Observation

Authors: Chia Hsiang Kao, Wenting Zhao, Shreelekha Revankar, Samuel Speas, Snehal Bhagat, Rajeev Datta, Cheng Perng Phoo, Utkarsh Mall, Carl Vondrick, Kavita Bala, Bharath Hariharan

Abstract: Earth Observation (EO) provides critical planetary data for environmental monitoring, disaster management, climate science, and other scientific domains. Here we ask: Are AI systems ready for reliable Earth Observation? We introduce \datasetnamenospace, a benchmark of 140 yes/no questions from NASA Earth Observatory articles across 13 topics and 17 satellite sensors. Using the Google Earth Engine API as a tool, LLM agents can only achieve an accuracy of 33% because the code fails to run over 58% of the time. We improve the failure rate for open models by fine-tuning on synthetic data, allowing much smaller models (Llama-3.1-8B) to achieve comparable accuracy to much larger ones (e.g., DeepSeek-R1). Taken together, our findings identify significant challenges to be solved before AI agents can automate earth observation, and suggest paths forward. The project page is available at https://iandrover.github.io/UnivEarth.

Submitted 16 April, 2025; originally announced April 2025.

Comments: 36 pages

arXiv:2504.07093 [pdf, ps, other]

FlashDepth: Real-time Streaming Video Depth Estimation at 2K Resolution

Authors: Gene Chou, Wenqi Xian, Guandao Yang, Mohamed Abdelfattah, Bharath Hariharan, Noah Snavely, Ning Yu, Paul Debevec

Abstract: A versatile video depth estimation model should (1) be accurate and consistent across frames, (2) produce high-resolution depth maps, and (3) support real-time streaming. We propose FlashDepth, a method that satisfies all three requirements, performing depth estimation on a 2044x1148 streaming video at 24 FPS. We show that, with careful modifications to pretrained single-image depth models, these capabilities are enabled with relatively little data and training. We evaluate our approach across multiple unseen datasets against state-of-the-art depth models, and find that ours outperforms them in terms of boundary sharpness and speed by a significant margin, while maintaining competitive accuracy. We hope our model will enable various applications that require high-resolution depth, such as video editing, and online decision-making, such as robotics. We release all code and model weights at https://github.com/Eyeline-Research/FlashDepth

Submitted 30 May, 2025; v1 submitted 9 April, 2025; originally announced April 2025.

arXiv:2504.00409 [pdf]

Semantic Mastery: Enhancing LLMs with Advanced Natural Language Understanding

Authors: Mohanakrishnan Hariharan

Abstract: Large language models (LLMs) have greatly improved their capability in performing NLP tasks. However, deeper semantic understanding, contextual coherence, and more subtle reasoning are still difficult to obtain. The paper discusses state-of-the-art methodologies that advance LLMs with more advanced NLU techniques, such as semantic parsing, knowledge integration, and contextual reinforcement learning. We analyze the use of structured knowledge graphs, retrieval-augmented generation (RAG), and fine-tuning strategies that match models with human-level understanding. Furthermore, we address the incorporation of transformer-based architectures, contrastive learning, and hybrid symbolic-neural methods that address problems like hallucinations, ambiguity, and inconsistency in the factual perspectives involved in performing complex NLP tasks, such as question answering, text summarization, and dialogue generation. Our findings show the importance of semantic precision for enhancing AI-driven language systems and suggest future research directions to bridge the gap between statistical language models and true natural language understanding.

Submitted 1 April, 2025; originally announced April 2025.

arXiv:2503.16335 [pdf]

Enhancing Software Quality Assurance with an Adaptive Differential Evolution based Quantum Variational Autoencoder-Transformer Model

Authors: Seshu Babu Barma, Mohanakrishnan Hariharan, Satish Arvapalli

Abstract: An AI-powered quality engineering platform uses artificial intelligence to boost software quality assessments through automated defect prediction and optimized performance alongside improved feature extraction. Existing models have difficulty handling noisy data, imbalances, pattern recognition complexities, ineffective feature extraction, and generalization weaknesses. To overcome these challenges, we develop a new model, the Adaptive Differential Evolution based Quantum Variational Autoencoder-Transformer Model (ADE-QVAET), which uses a Quantum Variational Autoencoder-Transformer (QVAET) to obtain high-dimensional latent features and maintain sequential dependencies together with contextual relationships, resulting in superior defect prediction accuracy. Adaptive Differential Evolution (ADE) Optimization utilizes an adaptive parameter tuning method that enhances model convergence and predictive performance. ADE-QVAET integrates advanced AI techniques to create a robust solution for scalable and accurate software defect prediction that represents a top-level AI-driven technology for quality engineering applications. At a training percentage (TP) of 90, the proposed ADE-QVAET model attains accuracy, precision, recall, and F1-score of 98.08%, 92.45%, 94.67%, and 98.12%, respectively.

Submitted 20 March, 2025; originally announced March 2025.

arXiv:2503.11511 [pdf, other]

Alzheimer's Disease Classification Using Retinal OCT: TransnetOCT and Swin Transformer Models

Authors: Siva Manohar Reddy Kesu, Neelam Sinha, Hariharan Ramasangu, Thomas Gregor Issac

Abstract: Retinal optical coherence tomography (OCT) images are biomarkers for neurodegenerative diseases, which are rising in prevalence. Early detection of Alzheimer's disease using retinal OCT is a challenging task. This work utilizes advanced deep learning techniques to classify retinal OCT images of subjects with Alzheimer's disease (AD) and healthy controls (CO). The goal is to enhance diagnostic capabilities through efficient image analysis. In the proposed model, raw OCT images are preprocessed with ImageJ and given to various deep-learning models to evaluate the accuracy. The best classification architecture is TransNetOCT, which has an average accuracy of 98.18% for input OCT images and 98.91% for segmented OCT images for five-fold cross-validation compared to other models, and the Swin Transformer model has achieved an accuracy of 93.54%. The evaluation accuracy metric demonstrated the capability of the TransNetOCT and Swin Transformer models to classify AD and CO subjects reliably, contributing to the potential for improved diagnostic processes in clinical settings.

Submitted 14 March, 2025; originally announced March 2025.

Comments: 18 pages, 25 figures

arXiv:2503.06459 [pdf, other]

Deterministically approximating the volume of a Kostka polytope

Authors: Hariharan Narayanan, Piyush Srivastava

Abstract: Polynomial-time deterministic approximation of volumes of polytopes, up to an approximation factor that grows at most sub-exponentially with the dimension, remains an open problem. Recent work on this question has focused on identifying interesting classes of polytopes for which such approximation algorithms can be obtained. In this paper, we focus on one such class of polytopes: the Kostka polytopes. The volumes of Kostka polytopes appear naturally in questions of random matrix theory, in the context of evaluating the probability density that a random Hermitian matrix with fixed spectrum $λ$ has a given diagonal $μ$ (the so-called randomized Schur-Horn problem): the corresponding Kostka polytope is denoted $\mathrm{GT}(λ, μ)$. We give a polynomial-time deterministic algorithm for approximating the volume of a ($Ω(n^2)$ dimensional) Kostka polytope $\mathrm{GT}(λ, μ)$ to within a multiplicative factor of $\exp(O(n\log n))$, when $λ$ is an integral partition with $n$ parts, with entries bounded above by a polynomial in $n$, and $μ$ is an integer vector lying in the interior of the permutohedron (i.e., convex hull of all permutations) of $λ$. The algorithm thus gives asymptotically correct estimates of the log-volume of Kostka polytopes corresponding to such $(λ, μ)$. Our approach is based on a partition function interpretation of a continuous analogue of Schur polynomials.

Submitted 5 April, 2025; v1 submitted 9 March, 2025; originally announced March 2025.

Comments: Added further discussion

arXiv:2502.17541 [pdf, ps, other]

Dataset Featurization: Uncovering Natural Language Features through Unsupervised Data Reconstruction

Authors: Michal Bravansky, Vaclav Kubon, Suhas Hariharan, Robert Kirk

Abstract: Interpreting data is central to modern research. Large language models (LLMs) show promise in providing such natural language interpretations of data, yet simple feature extraction methods such as prompting often fail to produce accurate and versatile descriptions for diverse datasets and lack control over granularity and scale. To address these limitations, we propose a domain-agnostic method for dataset featurization that provides precise control over the number of features extracted while maintaining compact and descriptive representations comparable to human labeling. Our method optimizes the selection of informative binary features by evaluating the ability of an LLM to reconstruct the original data using those features. We demonstrate its effectiveness in dataset modeling tasks and through two case studies: (1) Constructing a feature representation of jailbreak tactics that compactly captures both the effectiveness and diversity of a larger set of human-crafted attacks; and (2) automating the discovery of features that align with human preferences, achieving accuracy and robustness comparable to human-crafted features. Moreover, we show that the pipeline scales effectively, improving as additional features are sampled, making it suitable for large and diverse datasets.

Submitted 29 May, 2025; v1 submitted 24 February, 2025; originally announced February 2025.

arXiv:2502.14156 [pdf, other]

Mixed Signals: A Diverse Point Cloud Dataset for Heterogeneous LiDAR V2X Collaboration

Authors: Katie Z Luo, Minh-Quan Dao, Zhenzhen Liu, Mark Campbell, Wei-Lun Chao, Kilian Q. Weinberger, Ezio Malis, Vincent Fremont, Bharath Hariharan, Mao Shan, Stewart Worrall, Julie Stephany Berrio Perez

Abstract: Vehicle-to-everything (V2X) collaborative perception has emerged as a promising solution to address the limitations of single-vehicle perception systems. However, existing V2X datasets are limited in scope, diversity, and quality. To address these gaps, we present Mixed Signals, a comprehensive V2X dataset featuring 45.1k point clouds and 240.6k bounding boxes collected from three connected autonomous vehicles (CAVs) equipped with two different types of LiDAR sensors, plus a roadside unit with dual LiDARs. Our dataset provides precisely aligned point clouds and bounding box annotations across 10 classes, ensuring reliable data for perception training. We provide detailed statistical analysis on the quality of our dataset and extensively benchmark existing V2X methods on it. Mixed Signals V2X Dataset is one of the highest quality, large-scale datasets publicly available for V2X perception research. Details on the website https://mixedsignalsdataset.cs.cornell.edu/.

Submitted 19 February, 2025; originally announced February 2025.

arXiv:2502.10638 [pdf, other]

Script&Shift: A Layered Interface Paradigm for Integrating Content Development and Rhetorical Strategy with LLM Writing Assistants

Authors: Momin Siddiqui, Roy Pea, Hari Subramonyam

Abstract: Good writing is a dynamic process of knowledge transformation, where writers refine and evolve ideas through planning, translating, and reviewing. Generative AI-powered writing tools can enhance this process but may also disrupt the natural flow of writing, such as when using LLMs for complex tasks like restructuring content across different sections or creating smooth transitions. We introduce Script&Shift, a layered interface paradigm designed to minimize these disruptions by aligning writing intents with LLM capabilities to support diverse content development and rhetorical strategies. By bridging envisioning, semantic, and articulatory distances, Script&Shift's interactions allow writers to leverage LLMs for various content development tasks (scripting) and experiment with diverse organization strategies while tailoring their writing for different audiences (shifting). This approach preserves creative control while encouraging divergent and iterative writing. Our evaluation shows that Script&Shift enables writers to creatively and efficiently incorporate LLMs while preserving a natural flow of composition.

Submitted 14 February, 2025; originally announced February 2025.

arXiv:2502.10060 [pdf, other]

DiSciPLE: Learning Interpretable Programs for Scientific Visual Discovery

Authors: Utkarsh Mall, Cheng Perng Phoo, Mia Chiquier, Bharath Hariharan, Kavita Bala, Carl Vondrick

Abstract: Visual data is used in numerous different scientific workflows ranging from remote sensing to ecology. As the amount of observation data increases, the challenge is not just to make accurate predictions but also to understand the underlying mechanisms for those predictions. Good interpretation is important in scientific workflows, as it allows for better decision-making by providing insights into the data. This paper introduces an automatic way of obtaining such interpretable-by-design models, by learning programs that interleave neural networks. We propose DiSciPLE (Discovering Scientific Programs using LLMs and Evolution), an evolutionary algorithm that leverages common sense and prior knowledge of large language models (LLMs) to create Python programs explaining visual data. Additionally, we propose two improvements, a program critic and a program simplifier, to further improve our method's ability to synthesize good programs. On three different real-world problems, DiSciPLE learns state-of-the-art programs on novel tasks with no prior literature. For example, we can learn programs with 35% lower error than the closest non-interpretable baseline for population density estimation.

Submitted 14 February, 2025; originally announced February 2025.

arXiv:2502.06682 [pdf, other]

Transfer Your Perspective: Controllable 3D Generation from Any Viewpoint in a Driving Scene

Authors: Tai-Yu Pan, Sooyoung Jeon, Mengdi Fan, Jinsu Yoo, Zhenyang Feng, Mark Campbell, Kilian Q. Weinberger, Bharath Hariharan, Wei-Lun Chao

Abstract: Self-driving cars relying solely on ego-centric perception face limitations in sensing, often failing to detect occluded, faraway objects. Collaborative autonomous driving (CAV) seems like a promising direction, but collecting data for development is non-trivial. It requires placing multiple sensor-equipped agents in a real-world driving scene, simultaneously! As such, existing datasets are limited in locations and agents. We introduce a novel surrogate to the rescue, which is to generate realistic perception from different viewpoints in a driving scene, conditioned on a real-world sample - the ego-car's sensory data. This surrogate has huge potential: it could potentially turn any ego-car dataset into a collaborative driving one to scale up the development of CAV. We present the very first solution, using a combination of simulated collaborative data and real ego-car data. Our method, Transfer Your Perspective (TYP), learns a conditioned diffusion model whose output samples are not only realistic but also consistent in both semantics and layouts with the given ego-car data. Empirical results demonstrate TYP's effectiveness in aiding in a CAV setting. In particular, TYP enables us to (pre-)train collaborative perception algorithms like early and late fusion with little or no real-world collaborative data, greatly facilitating downstream CAV applications.

Submitted 1 April, 2025; v1 submitted 10 February, 2025; originally announced February 2025.

Comments: Accepted to CVPR 2025

arXiv:2501.04896 [pdf, other]

Quantifying Itch and its Impact on Sleep Using Machine Learning and Radio Signals

Authors: Michail Ouroutzoglou, Mingmin Zhao, Joshua Hellerstein, Hariharan Rahul, Asima Badic, Brian S. Kim, Dina Katabi

Abstract: Chronic itch affects 13% of the US population, is highly debilitating, and underlies many medical conditions. A major challenge in clinical care and new therapeutics development is the lack of an objective measure for quantifying itch, leading to reliance on subjective measures like patients' self-assessment of itch severity. In this paper, we show that a home radio device paired with artificial intelligence (AI) can concurrently capture scratching and evaluate its impact on sleep quality by analyzing radio signals bouncing in the environment. The device eliminates the need for wearable sensors or skin contact, enabling monitoring of chronic itch over extended periods at home without burdening patients or interfering with their skin condition. To validate the technology, we conducted an observational clinical study of chronic pruritus patients, monitored at home for one month using both the radio device and an infrared camera. Comparing the output of the device to ground truth data from the camera demonstrates its feasibility and accuracy (ROC AUC = 0.997, sensitivity = 0.825, specificity = 0.997). The results reveal a significant correlation between scratching and low sleep quality, manifested as a reduction in sleep efficiency (R = 0.6, p < 0.001) and an increase in sleep latency (R = 0.68, p < 0.001). Our study underscores the potential of passive, long-term, at-home monitoring of chronic scratching and its sleep implications, offering a valuable tool for both clinical care of chronic itch patients and pharmaceutical clinical trials.

Submitted 8 January, 2025; originally announced January 2025.

arXiv:2501.04654 [pdf, other]

Recorder: Comprehensive Parallel I/O Tracing and Analysis

Authors: Chen Wang, Izzet Yildirim, Hariharan Devarajan, Kathryn Mohror, Marc Snir

Abstract: This paper presents Recorder, a parallel I/O tracing tool designed to capture comprehensive I/O information on HPC applications. Recorder traces I/O calls across various I/O layers, storing all function parameters for each captured call. The volume of stored information scales linearly with the application's execution scale. To address this, we present a sophisticated pattern-recognition-based compression algorithm. This algorithm identifies and compresses recurring I/O patterns both within individual processes and across multiple processes, significantly reducing space and time overheads. We evaluate the proposed compression algorithm using I/O benchmarks and real-world applications, demonstrating that Recorder can store more information while requiring approximately 12x less storage space compared to its predecessor. Notably, for applications with typical parallel I/O patterns, Recorder achieves a constant trace size regardless of execution scale. Additionally, a comparison with the profiling tool Darshan shows that Recorder captures detailed I/O information without incurring substantial overhead. The richer data collected by Recorder enables new insights and facilitates more in-depth I/O studies, offering valuable contributions to the I/O research community.

Submitted 8 January, 2025; originally announced January 2025.

Comments: 29 pages. Under Review. Submitted to the Journal of Supercomputing

arXiv:2412.09551 [pdf, other]

Video Creation by Demonstration

Authors: Yihong Sun, Hao Zhou, Liangzhe Yuan, Jennifer J. Sun, Yandong Li, Xuhui Jia, Hartwig Adam, Bharath Hariharan, Long Zhao, Ting Liu

Abstract: We explore a novel video creation experience, namely Video Creation by Demonstration. Given a demonstration video and a context image from a different scene, we generate a physically plausible video that continues naturally from the context image and carries out the action concepts from the demonstration. To enable this capability, we present $δ$-Diffusion, a self-supervised training approach that learns from unlabeled videos by conditional future frame prediction. Unlike most existing video generation controls that are based on explicit signals, we adopt the form of implicit latent control for the maximal flexibility and expressiveness required by general videos. By leveraging a video foundation model with an appearance bottleneck design on top, we extract action latents from demonstration videos for conditioning the generation process with minimal appearance leakage. Empirically, $δ$-Diffusion outperforms related baselines in terms of both human preference and large-scale machine evaluations, and demonstrates potential for interactive world simulation. Sampled video generation results are available at https://delta-diffusion.github.io/.

Submitted 12 December, 2024; originally announced December 2024.

Comments: Project page at https://delta-diffusion.github.io/

arXiv:2412.08563 [pdf]

doi 10.52783/jes.7210

Physics Based Differentiable Rendering for Inverse Problems and Beyond

Authors: Preetish Kakkar, Srijani Mukherjee, Hariharan Ragothaman, Vishal Mehta

Abstract: Physics-based differentiable rendering (PBDR) has become an efficient method in computer vision, graphics, and machine learning for addressing an array of inverse problems. PBDR allows patterns to be generated from perceptions which can be applied to enhance object attributes like geometry, substances, and lighting by adding physical models of light propagation and materials interaction. Due to these capabilities, differentiable rendering has been employed in a wider range of sectors such as autonomous navigation, scene reconstruction, and material design. We provide an extensive overview of PBDR techniques in this study, emphasizing their creation, effectiveness, and limitations while managing inverse situations. We demonstrate modern techniques and examine their value in everyday situations.

Submitted 9 January, 2025; v1 submitted 11 December, 2024; originally announced December 2024.

Journal ref: Journal of Electrical Systems, Vol. 20 No. 11s (2024)

arXiv:2412.03174 [pdf, other]

Resilient Timed Elastic Band Planner for Collision-Free Navigation in Unknown Environments

Authors: Geesara Kulathunga, Abdurrahman Yilmaz, Zhuoling Huang, Ibrahim Hroob, Hariharan Arunachalam, Leonardo Guevara, Alexandr Klimchik, Grzegorz Cielniak, Marc Hanheide

Abstract: In autonomous navigation, trajectory replanning, refinement, and control command generation are essential for effective motion planning. This paper presents a resilient approach to trajectory replanning addressing scenarios where the initial planner's solution becomes infeasible. The proposed method incorporates a hybrid A* algorithm to generate feasible trajectories when the primary planner fails and applies a soft constraints-based smoothing technique to refine these trajectories, ensuring continuity, obstacle avoidance, and kinematic feasibility. Obstacle constraints are modelled using a dynamic Voronoi map to improve navigation through narrow passages. This approach enhances the consistency of trajectory planning, speeds up convergence, and meets real-time computational requirements. In environments with an obstacle density of around 30% or higher (measured as the ratio of free space before and after placing new obstacles), the Resilient Timed Elastic Band (RTEB) planner achieves approximately 20% reduction in traverse distance, traverse time, and control effort compared to the Timed Elastic Band (TEB) planner and Nonlinear Model Predictive Control (NMPC) planner. These improvements demonstrate the RTEB planner's potential for application in field robotics, particularly in agricultural and industrial environments, where navigating unstructured terrain is crucial for ensuring efficiency and operational resilience.

Submitted 4 December, 2024; originally announced December 2024.

arXiv:2411.16016 [pdf]

Establishing Design Routines for Efficient Control of Automated Robots

Authors: Hariharan Ragothaman, Harihar M, SK Guhananthan

Abstract: With continual advancements in technology, efforts to develop robots simulating human behavior have intensified. Cognitive robotics, combined with artificial intelligence (AI), has proven effective in surveying and research analysis. However, despite progress, human intervention remains necessary, and incorporating AI into robotic systems continues to pose challenges. This paper explores methodologies to integrate AI into robotic designs, aiming to enhance human-robot interactions. Several approaches are proposed to improve robotic performance, including routines for efficient control in varied environments and the incorporation of digital image processing for enhanced line-of-sight capabilities. A key contribution of this work is testing robotic systems in real-time environments to assess efficiency relative to existing models. Additionally, the paper introduces a robotic system with universal control capabilities, suitable for industrial applications, developed and programmed on the Arduino platform. Features such as GPS control for safe operations and progressive memory algorithms for efficient memory management are presented, offering advancements in both industrial and research applications.

Submitted 24 November, 2024; originally announced November 2024.

Journal ref: Young Scientists Convention, MSEC, March 2012

arXiv:2411.13549 [pdf, other]

Generating 3D-Consistent Videos from Unposed Internet Photos

Authors: Gene Chou, Kai Zhang, Sai Bi, Hao Tan, Zexiang Xu, Fujun Luan, Bharath Hariharan, Noah Snavely

Abstract: We address the problem of generating videos from unposed internet photos. A handful of input images serve as keyframes, and our model interpolates between them to simulate a path moving between the cameras. Given random images, a model's ability to capture underlying geometry, recognize scene identity, and relate frames in terms of camera position and orientation reflects a fundamental understanding of 3D structure and scene layout. However, existing video models such as Luma Dream Machine fail at this task. We design a self-supervised method that takes advantage of the consistency of videos and variability of multiview internet photos to train a scalable, 3D-aware video model without any 3D annotations such as camera parameters. We validate that our method outperforms all baselines in terms of geometric and appearance consistency. We also show our model benefits applications that enable camera control, such as 3D Gaussian Splatting. Our results suggest that we can scale up scene-level 3D learning using only 2D data such as videos and multiview internet photos.

Submitted 20 November, 2024; originally announced November 2024.

arXiv:2411.08813 [pdf, other]

Rethinking CyberSecEval: An LLM-Aided Approach to Evaluation Critique

Authors: Suhas Hariharan, Zainab Ali Majid, Jaime Raldua Veuthey, Jacob Haimes

Abstract: A key development in the cybersecurity evaluations space is the work carried out by Meta, through their CyberSecEval approach. While this work is undoubtedly a useful contribution to a nascent field, there are notable features that limit its utility. Key drawbacks focus on the insecure code detection part of Meta's methodology. We explore these limitations, and use our exploration as a test case for LLM-assisted benchmark analysis.

Submitted 13 November, 2024; originally announced November 2024.

Comments: NeurIPS 2024, 2 pages

arXiv:2411.02095 [pdf]

doi 10.5121/ijcga.2024.14401

The evolution of volumetric video: A survey of smart transcoding and compression approaches

Authors: Preetish Kakkar, Hariharan Ragothaman

Abstract: Volumetric video, the capture and display of three-dimensional (3D) imagery, has emerged as a revolutionary technology poised to transform the media landscape, enabling immersive experiences that transcend the limitations of traditional 2D video. One of the key challenges in this domain is the efficient delivery of these high-bandwidth, data-intensive volumetric video streams, which requires innovative transcoding and compression techniques. This research paper explores the state-of-the-art in volumetric video compression and delivery, with a focus on the potential of AI-driven solutions to address the unique challenges posed by this emerging medium.

Submitted 9 January, 2025; v1 submitted 4 November, 2024; originally announced November 2024.

Journal ref: International Journal of Computer Graphics & Animation (IJCGA) 2024

arXiv:2411.00210 [pdf, other]

Scale-Aware Recognition in Satellite Images under Resource Constraints

Authors: Shreelekha Revankar, Cheng Perng Phoo, Utkarsh Mall, Bharath Hariharan, Kavita Bala

Abstract: Recognition of features in satellite imagery (forests, swimming pools, etc.) depends strongly on the spatial scale of the concept and therefore the resolution of the images. This poses two challenges: Which resolution is best suited for recognizing a given concept, and where and when should the costlier higher-resolution (HR) imagery be acquired? We present a novel scheme to address these challenges by introducing three components: (1) A technique to distill knowledge from models trained on HR imagery to recognition models that operate on imagery of lower resolution (LR), (2) a sampling strategy for HR imagery based on model disagreement, and (3) an LLM-based approach for inferring concept "scale". With these components we present a system to efficiently perform scale-aware recognition in satellite imagery, improving accuracy over single-scale inference while following budget constraints. Our novel approach offers up to a 26.3% improvement over entirely HR baselines, using 76.3% fewer HR images.

Submitted 2 February, 2025; v1 submitted 31 October, 2024; originally announced November 2024.

Comments: 16 pages, 4 figures

arXiv:2410.23891 [pdf, other]

AllClear: A Comprehensive Dataset and Benchmark for Cloud Removal in Satellite Imagery

Authors: Hangyu Zhou, Chia-Hsiang Kao, Cheng Perng Phoo, Utkarsh Mall, Bharath Hariharan, Kavita Bala

Abstract: Clouds in satellite imagery pose a significant challenge for downstream applications. A major challenge in current cloud removal research is the absence of a comprehensive benchmark and a sufficiently large and diverse training dataset. To address this problem, we introduce the largest public dataset -- $\textit{AllClear}$ for cloud removal, featuring 23,742 globally distributed regions of interest (ROIs) with diverse land-use patterns, comprising 4 million images in total. Each ROI includes complete temporal captures from the year 2022, with (1) multi-spectral optical imagery from Sentinel-2 and Landsat 8/9, (2) synthetic aperture radar (SAR) imagery from Sentinel-1, and (3) auxiliary remote sensing products such as cloud masks and land cover maps. We validate the effectiveness of our dataset by benchmarking performance, demonstrating the scaling law -- the PSNR rises from $28.47$ to $33.87$ with $30\times$ more data, and conducting ablation studies on the temporal length and the importance of individual modalities. This dataset aims to provide comprehensive coverage of the Earth's surface and promote better cloud removal results.

Submitted 31 October, 2024; originally announced October 2024.

Comments: Accepted at NeurIPS 2024 Datasets and Benchmarks Track. Code and data available at https://allclear.cs.cornell.edu/

arXiv:2410.02646 [pdf, other]

Learning 3D Perception from Others' Predictions

Authors: Jinsu Yoo, Zhenyang Feng, Tai-Yu Pan, Yihong Sun, Cheng Perng Phoo, Xiangyu Chen, Mark Campbell, Kilian Q. Weinberger, Bharath Hariharan, Wei-Lun Chao

Abstract: Accurate 3D object detection in real-world environments requires a huge amount of annotated data with high quality. Acquiring such data is tedious and expensive, and often needs repeated effort when a new sensor is adopted or when the detector is deployed in a new environment. We investigate a new scenario to construct 3D object detectors: learning from the predictions of a nearby unit that is equipped with an accurate detector. For example, when a self-driving car enters a new area, it may learn from other traffic participants whose detectors have been optimized for that area. This setting is label-efficient, sensor-agnostic, and communication-efficient: nearby units only need to share the predictions with the ego agent (e.g., car). Naively using the received predictions as ground-truths to train the detector for the ego car, however, leads to inferior performance. We systematically study the problem and identify viewpoint mismatches and mislocalization (due to synchronization and GPS errors) as the main causes, which unavoidably result in false positives, false negatives, and inaccurate pseudo labels. We propose a distance-based curriculum, first learning from closer units with similar viewpoints and subsequently improving the quality of other units' predictions via self-training. We further demonstrate that an effective pseudo label refinement module can be trained with a handful of annotated data, largely reducing the data quantity necessary to train an object detector. We validate our approach on the recently released real-world collaborative driving dataset, using reference cars' predictions as pseudo labels for the ego car. Extensive experiments including several scenarios (e.g., different sensors, detectors, and domains) demonstrate the effectiveness of our approach toward label-efficient learning of 3D perception from other units' predictions.

Submitted 29 March, 2025; v1 submitted 3 October, 2024; originally announced October 2024.

Comments: Accepted to ICLR 2025

arXiv:2409.19841 [pdf, other]

Counter-Current Learning: A Biologically Plausible Dual Network Approach for Deep Learning

Authors: Chia-Hsiang Kao, Bharath Hariharan

Abstract: Despite its widespread use in neural networks, error backpropagation has faced criticism for its lack of biological plausibility, suffering from issues such as the backward locking problem and the weight transport problem. These limitations have motivated researchers to explore more biologically plausible learning algorithms that could potentially shed light on how biological neural systems adapt and learn. Inspired by the counter-current exchange mechanisms observed in biological systems, we propose counter-current learning (CCL), a biologically plausible framework for credit assignment in neural networks. This framework employs a feedforward network to process input data and a feedback network to process targets, with each network enhancing the other through anti-parallel signal propagation. By leveraging the more informative signals from the bottom layer of the feedback network to guide the updates of the top layer of the feedforward network and vice versa, CCL enables the simultaneous transformation of source inputs to target outputs and the dynamic mutual influence of these transformations. Experimental results on MNIST, FashionMNIST, CIFAR10, and CIFAR100 datasets using multi-layer perceptrons and convolutional neural networks demonstrate that CCL achieves comparable performance to other biologically plausible algorithms while offering a more biologically realistic learning mechanism. Furthermore, we showcase the applicability of our approach to an autoencoder task, underscoring its potential for unsupervised representation learning. Our work presents a direction for biologically inspired and plausible learning algorithms, offering an alternative mechanism of learning and adaptation in neural networks.

Submitted 23 October, 2024; v1 submitted 29 September, 2024; originally announced September 2024.

Comments: Accepted at NeurIPS 2024. Code available at https://github.com/IandRover/CCL-NeurIPS24

arXiv:2409.16484 [pdf, other]

BehAV: Behavioral Rule Guided Autonomy Using VLMs for Robot Navigation in Outdoor Scenes

Authors: Kasun Weerakoon, Mohamed Elnoor, Gershom Seneviratne, Vignesh Rajagopal, Senthil Hariharan Arul, Jing Liang, Mohamed Khalid M Jaffar, Dinesh Manocha

Abstract: We present BehAV, a novel approach for autonomous robot navigation in outdoor scenes guided by human instructions and leveraging Vision Language Models (VLMs). Our method interprets human commands using a Large Language Model (LLM) and categorizes the instructions into navigation and behavioral guidelines. Navigation guidelines consist of directional commands (e.g., "move forward until") and associated landmarks (e.g., "the building with blue windows"), while behavioral guidelines encompass regulatory actions (e.g., "stay on") and their corresponding objects (e.g., "pavements"). We use VLMs for their zero-shot scene understanding capabilities to estimate landmark locations from RGB images for robot navigation. Further, we introduce a novel scene representation that utilizes VLMs to ground behavioral rules into a behavioral cost map. This cost map encodes the presence of behavioral objects within the scene and assigns costs based on their regulatory actions. The behavioral cost map is integrated with a LiDAR-based occupancy map for navigation. To navigate outdoor scenes while adhering to the instructed behaviors, we present an unconstrained Model Predictive Control (MPC)-based planner that prioritizes both reaching landmarks and following behavioral guidelines. We evaluate the performance of BehAV on a quadruped robot across diverse real-world scenarios, demonstrating a 22.49% improvement in alignment with human-teleoperated actions, as measured by Frechet distance, and achieving a 40% higher navigation success rate compared to state-of-the-art methods.

Submitted 2 October, 2024; v1 submitted 24 September, 2024; originally announced September 2024.

arXiv:2409.15394 [pdf, other]

doi 10.1145/3641519.3657395

Neural Control Variates with Automatic Integration

Authors: Zilu Li, Guandao Yang, Qingqing Zhao, Xi Deng, Leonidas Guibas, Bharath Hariharan, Gordon Wetzstein

Abstract: This paper presents a method to leverage arbitrary neural network architecture for control variates. Control variates are crucial in reducing the variance of Monte Carlo integration, but they hinge on finding a function that both correlates with the integrand and has a known analytical integral. Traditional approaches rely on heuristics to choose this function, which might not be expressive enough to correlate well with the integrand. Recent research alleviates this issue by modeling the integrands with a learnable parametric model, such as a neural network. However, the challenge remains in creating an expressive parametric model with a known analytical integral. This paper proposes a novel approach to construct learnable parametric control variate functions from arbitrary neural network architectures. Instead of using a network to approximate the integrand directly, we employ the network to approximate the anti-derivative of the integrand. This allows us to use automatic differentiation to create a function whose integration can be constructed by the antiderivative network. We apply our method to solve partial differential equations using the Walk-on-sphere algorithm. Our results indicate that this approach is unbiased and uses various network architectures to achieve lower variance than other control variate methods.

Submitted 23 September, 2024; originally announced September 2024.

Journal ref: SIGGRAPH Conference Papers 2024

arXiv:2408.10240 [pdf, other]

AltCanvas: A Tile-Based Image Editor with Generative AI for Blind or Visually Impaired People

Authors: Seonghee Lee, Maho Kohga, Steve Landau, Sile O'Modhrain, Hari Subramonyam

Abstract: People with visual impairments often struggle to create content that relies heavily on visual elements, particularly when conveying spatial and structural information. Existing accessible drawing tools, which construct images line by line, are suitable for simple tasks like math but not for more expressive artwork. On the other hand, emerging generative AI-based text-to-image tools can produce expressive illustrations from descriptions in natural language, but they lack precise control over image composition and properties. To address this gap, our work integrates generative AI with a constructive approach that provides users with enhanced control and editing capabilities. Our system, AltCanvas, features a tile-based interface enabling users to construct visual scenes incrementally, with each tile representing an object within the scene. Users can add, edit, move, and arrange objects while receiving speech and audio feedback. Once completed, the scene can be rendered as a color illustration or as a vector for tactile graphic generation. Involving 14 blind or low-vision users in design and evaluation, we found that participants effectively used the AltCanvas workflow to create illustrations.

Submitted 4 August, 2024; originally announced August 2024.

arXiv:2408.10239 [pdf, ps, other]

A Conceptual Framework for Ethical Evaluation of Machine Learning Systems

Authors: Neha R. Gupta, Jessica Hullman, Hari Subramonyam

Abstract: Research in Responsible AI has developed a range of principles and practices to ensure that machine learning systems are used in a manner that is ethical and aligned with human values. However, a critical yet often neglected aspect of ethical ML is the ethical implications that appear when designing evaluations of ML systems. For instance, teams may have to balance a trade-off between highly informative tests to ensure downstream product safety, with potential fairness harms inherent to the implemented testing procedures. We conceptualize ethics-related concerns in standard ML evaluation techniques. Specifically, we present a utility framework, characterizing the key trade-off in ethical evaluation as balancing information gain against potential ethical harms. The framework is then a tool for characterizing challenges teams face, and systematically disentangling competing considerations that teams seek to balance. Differentiating between different types of issues encountered in evaluation allows us to highlight best practices from analogous domains, such as clinical trials and automotive crash testing, which navigate these issues in ways that can offer inspiration to improve evaluation processes in ML. Our analysis underscores the critical need for development teams to deliberately assess and manage ethical complexities that arise during the evaluation of ML systems, and for the industry to move towards designing institutional policies to support ethical evaluations.

Submitted 4 August, 2024; originally announced August 2024.

arXiv:2408.08301 [pdf, other]

VLPG-Nav: Object Navigation Using Visual Language Pose Graph and Object Localization Probability Maps

Authors: Senthil Hariharan Arul, Dhruva Kumar, Vivek Sugirtharaj, Richard Kim, Xuewei Qi, Rajasimman Madhivanan, Arnie Sen, Dinesh Manocha

Abstract: We present VLPG-Nav, a visual language navigation method for guiding robots to specified objects within household scenes. Unlike existing methods primarily focused on navigating the robot toward objects, our approach considers the additional challenge of centering the object within the robot's camera view. Our method builds a visual language pose graph (VLPG) that functions as a spatial map of VL embeddings. Given an open vocabulary object query, we plan a viewpoint for object navigation using the VLPG. Despite navigating to the viewpoint, real-world challenges like object occlusion, displacement, and the robot's localization error can prevent visibility. We build an object localization probability map that leverages the robot's current observations and prior VLPG. When the object isn't visible, the probability map is updated and an alternate viewpoint is computed. In addition, we propose an object-centering formulation that locally adjusts the robot's pose to center the object in the camera view. We evaluate the effectiveness of our approach through simulations and real-world experiments, evaluating its ability to successfully view and center the object within the camera field of view. VLPG-Nav demonstrates improved performance in locating the object, navigating around occlusions, and centering the object within the robot's camera view, outperforming the selected baselines in the evaluation metrics.

Submitted 15 August, 2024; originally announced August 2024.

arXiv:2407.19108 [pdf, other]

ObjectCarver: Semi-automatic segmentation, reconstruction and separation of 3D objects

Authors: Gemmechu Hassena, Jonathan Moon, Ryan Fujii, Andrew Yuen, Noah Snavely, Steve Marschner, Bharath Hariharan

Abstract: Implicit neural fields have made remarkable progress in reconstructing 3D surfaces from multiple images; however, they encounter challenges when it comes to separating individual objects within a scene. Previous work has attempted to tackle this problem by introducing a framework to train separate signed distance fields (SDFs) simultaneously for each of N objects and using a regularization term to prevent objects from overlapping. However, all of these methods require segmentation masks to be provided, which are not always readily available. We introduce our method, ObjectCarver, to tackle the problem of object separation from just click input in a single view. Given posed multi-view images and a set of user-input clicks to prompt segmentation of the individual objects, our method decomposes the scene into separate objects and reconstructs a high-quality 3D surface for each one. We introduce a loss function that prevents floaters and avoids inappropriate carving-out due to occlusion. In addition, we introduce a novel scene initialization method that significantly speeds up the process while preserving geometric details compared to previous approaches. Despite requiring neither ground truth masks nor monocular cues, our method outperforms baselines both qualitatively and quantitatively. In addition, we introduce a new benchmark dataset for evaluation.

Submitted 26 July, 2024; originally announced July 2024.

Comments: Project page is: https://objectcarver.github.io/

arXiv:2407.04694 [pdf, other]

Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs

Authors: Rudolf Laine, Bilal Chughtai, Jan Betley, Kaivalya Hariharan, Jeremy Scheurer, Mikita Balesni, Marius Hobbhahn, Alexander Meinke, Owain Evans

Abstract: AI assistants such as ChatGPT are trained to respond to users by saying, "I am a large language model". This raises questions. Do such models know that they are LLMs and reliably act on this knowledge? Are they aware of their current circumstances, such as being deployed to the public? We refer to a model's knowledge of itself and its circumstances as situational awareness. To quantify situational awareness in LLMs, we introduce a range of behavioral tests, based on question answering and instruction following. These tests form the Situational Awareness Dataset (SAD), a benchmark comprising 7 task categories and over 13,000 questions. The benchmark tests numerous abilities, including the capacity of LLMs to (i) recognize their own generated text, (ii) predict their own behavior, (iii) determine whether a prompt is from internal evaluation or real-world deployment, and (iv) follow instructions that depend on self-knowledge. We evaluate 16 LLMs on SAD, including both base (pretrained) and chat models. While all models perform better than chance, even the highest-scoring model (Claude 3 Opus) is far from a human baseline on certain tasks. We also observe that performance on SAD is only partially predicted by metrics of general knowledge (e.g. MMLU). Chat models, which are finetuned to serve as AI assistants, outperform their corresponding base models on SAD but not on general knowledge tasks. The purpose of SAD is to facilitate scientific understanding of situational awareness in LLMs by breaking it down into quantitative abilities.
Situational awareness is important because it enhances a model's capacity for autonomous planning and action. While this has potential benefits for automation, it also introduces novel risks related to AI safety and control. Code and latest results available at https://situational-awareness-dataset.org.

Submitted 5 July, 2024; originally announced July 2024.

Comments: 11 page main body, 98 page appendix, 58 figures

arXiv:2406.11819 [pdf, other]

MegaScenes: Scene-Level View Synthesis at Scale

Authors: Joseph Tung, Gene Chou, Ruojin Cai, Guandao Yang, Kai Zhang, Gordon Wetzstein, Bharath Hariharan, Noah Snavely

Abstract: Scene-level novel view synthesis (NVS) is fundamental to many vision and graphics applications. Recently, pose-conditioned diffusion models have led to significant progress by extracting 3D information from 2D foundation models, but these methods are limited by the lack of scene-level training data. Common dataset choices either consist of isolated objects (Objaverse), or of object-centric scenes with limited pose distributions (DTU, CO3D). In this paper, we create a large-scale scene-level dataset from Internet photo collections, called MegaScenes, which contains over 100K structure from motion (SfM) reconstructions from around the world. Internet photos represent a scalable data source but come with challenges such as lighting and transient objects. We address these issues to further create a subset suitable for the task of NVS. Additionally, we analyze failure cases of state-of-the-art NVS methods and significantly improve generation consistency. Through extensive experiments, we validate the effectiveness of both our dataset and method on generating in-the-wild scenes. For details on the dataset and code, see our project page at https://megascenes.github.io.

Submitted 21 August, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

Comments: Accepted at ECCV 2024. Our project page is at https://megascenes.github.io

arXiv:2406.06613 [pdf, other]

GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents

Authors: Anthony Costarelli, Mat Allen, Roman Hauksson, Grace Sodunke, Suhas Hariharan, Carlson Cheng, Wenjie Li, Joshua Clymer, Arjun Yadav

Abstract: Large language models have demonstrated remarkable few-shot performance on many natural language understanding tasks. Despite several demonstrations of using large language models in complex, strategic scenarios, there lacks a comprehensive framework for evaluating agents' performance across various types of reasoning found in games. To address this gap, we introduce GameBench, a cross-domain benchmark for evaluating strategic reasoning abilities of LLM agents. We focus on 9 different game environments, where each covers at least one axis of key reasoning skill identified in strategy games, and select games for which strategy explanations are unlikely to form a significant portion of models' pretraining corpuses. Our evaluations use GPT-3 and GPT-4 in their base form along with two scaffolding frameworks designed to enhance strategic reasoning ability: Chain-of-Thought (CoT) prompting and Reasoning Via Planning (RAP). Our results show that none of the tested models match human performance, and at worst GPT-4 performs worse than random action. CoT and RAP both improve scores but not comparable to human levels.

Submitted 22 July, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

arXiv:2405.16034 [pdf, other]

DiffuBox: Refining 3D Object Detection with Point Diffusion

Authors: Xiangyu Chen, Zhenzhen Liu, Katie Z Luo, Siddhartha Datta, Adhitya Polavaram, Yan Wang, Yurong You, Boyi Li, Marco Pavone, Wei-Lun Chao, Mark Campbell, Bharath Hariharan, Kilian Q. Weinberger

Abstract: Ensuring robust 3D object detection and localization is crucial for many applications in robotics and autonomous driving. Recent models, however, face difficulties in maintaining high performance when applied to domains with differing sensor setups or geographic locations, often resulting in poor localization accuracy due to domain shift. To overcome this challenge, we introduce a novel diffusion-based box refinement approach. This method employs a domain-agnostic diffusion model, conditioned on the LiDAR points surrounding a coarse bounding box, to simultaneously refine the box's location, size, and orientation. We evaluate this approach under various domain adaptation settings, and our results reveal significant improvements across different datasets, object classes and detectors. Our PyTorch implementation is available at https://github.com/cxy1997/DiffuBox.

Submitted 6 December, 2024; v1 submitted 24 May, 2024; originally announced May 2024.

arXiv:2405.14841 [pdf, other]

MOD-UV: Learning Mobile Object Detectors from Unlabeled Videos

Authors: Yihong Sun, Bharath Hariharan

Abstract: Embodied agents must detect and localize objects of interest, e.g. traffic participants for self-driving cars. Supervision in the form of bounding boxes for this task is extremely expensive. As such, prior work has looked at unsupervised instance detection and segmentation, but in the absence of annotated boxes, it is unclear how pixels must be grouped into objects and which objects are of interest. This results in over-/under-segmentation and irrelevant objects. Inspired by human visual system and practical applications, we posit that the key missing cue for unsupervised detection is motion: objects of interest are typically mobile objects that frequently move and their motions can specify separate instances. In this paper, we propose MOD-UV, a Mobile Object Detector learned from Unlabeled Videos only. We begin with instance pseudo-labels derived from motion segmentation, but introduce a novel training paradigm to progressively discover small objects and static-but-mobile objects that are missed by motion segmentation. As a result, though only learned from unlabeled videos, MOD-UV can detect and segment mobile objects from a single static image. Empirically, we achieve state-of-the-art performance in unsupervised mobile object detection on Waymo Open, nuScenes, and KITTI Datasets without using any external data or supervised models. Code is available at https://github.com/YihongSun/MOD-UV.

Submitted 31 July, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

Comments: ECCV 2024

arXiv:2405.12946 [pdf, other]

Tutorly: Turning Programming Videos Into Apprenticeship Learning Environments with LLMs

Authors: Wengxi Li, Roy Pea, Nick Haber, Hari Subramonyam

Abstract: Online programming videos, including tutorials and streamcasts, are widely popular and contain a wealth of expert knowledge. However, effectively utilizing these resources to achieve targeted learning goals can be challenging. Unlike direct tutoring, video content lacks tailored guidance based on individual learning paces, personalized feedback, and interactive engagement necessary for support and monitoring. Our work transforms programming videos into one-on-one tutoring experiences using the cognitive apprenticeship framework. Tutorly, developed as a JupyterLab Plugin, allows learners to (1) set personalized learning goals, (2) engage in learning-by-doing through a conversational LLM-based mentor agent, (3) receive guidance and feedback based on a student model that steers the mentor moves. In a within-subject study with 16 participants learning exploratory data analysis from a streamcast, Tutorly significantly improved their performance from 61.9% to 76.6% based on a post-test questionnaire. Tutorly demonstrates the potential for enhancing programming video learning experiences with LLM and learner modeling.

Submitted 21 May, 2024; originally announced May 2024.

arXiv:2405.02260 [pdf, other]

Leveraging Large Language Models to Enhance Domain Expert Inclusion in Data Science Workflows

Authors: Jasmine Y. Shih, Vishal Mohanty, Yannis Katsis, Hariharan Subramonyam

Abstract: Domain experts can play a crucial role in guiding data scientists to optimize machine learning models while ensuring contextual relevance for downstream use. However, in current workflows, such collaboration is challenging due to differing expertise, abstract documentation practices, and lack of access and visibility into low-level implementation artifacts. To address these challenges and enable domain expert participation, we introduce CellSync, a collaboration framework comprising (1) a Jupyter Notebook extension that continuously tracks changes to dataframes and model metrics and (2) a Large Language Model powered visualization dashboard that makes those changes interpretable to domain experts. Through CellSync's cell-level dataset visualization with code summaries, domain experts can interactively examine how individual data and modeling operations impact different data segments. The chat features enable data-centric conversations and targeted feedback to data scientists. Our preliminary evaluation shows that CellSync provides transparency and promotes critical discussions about the intents and implications of data operations.

Submitted 3 May, 2024; originally announced May 2024.

arXiv:2404.17673 [pdf, other]

Learning Manipulation Tasks in Dynamic and Shared 3D Spaces

Authors: Hariharan Arunachalam, Marc Hanheide, Sariah Mghames

Abstract: Automating the segregation process is a need for every sector experiencing a high volume of materials handling, repetitive and exhaustive operations, in addition to risky exposures. Learning automated pick-and-place operations can be efficiently done by introducing collaborative autonomous systems (e.g. manipulators) in the workplace and among human operators. In this paper, we propose a deep reinforcement learning strategy to learn the place task of multi-categorical items from a shared workspace between dual-manipulators and to multi-goal destinations, assuming the pick has been already completed. The learning strategy leverages first a stochastic actor-critic framework to train an agent's policy network, and second, a dynamic 3D Gym environment where both static and dynamic obstacles (e.g. human factors and robot mate) constitute the state space of a Markov decision process. Learning is conducted in a Gazebo simulator and experiments show an increase in cumulative reward function for the agent further away from human factors. Future investigations will be conducted to enhance the task performance for both agents simultaneously.

Submitted 26 April, 2024; originally announced April 2024.

Comments: 5 pages

arXiv:2404.05139 [pdf, other]

Better Monocular 3D Detectors with LiDAR from the Past

Authors: Yurong You, Cheng Perng Phoo, Carlos Andres Diaz-Ruiz, Katie Z Luo, Wei-Lun Chao, Mark Campbell, Bharath Hariharan, Kilian Q Weinberger

Abstract: Accurate 3D object detection is crucial to autonomous driving. Though LiDAR-based detectors have achieved impressive performance, the high cost of LiDAR sensors precludes their widespread adoption in affordable vehicles. Camera-based detectors are cheaper alternatives but often suffer inferior performance compared to their LiDAR-based counterparts due to inherent depth ambiguities in images. In this work, we seek to improve monocular 3D detectors by leveraging unlabeled historical LiDAR data. Specifically, at inference time, we assume that the camera-based detectors have access to multiple unlabeled LiDAR scans from past traversals at locations of interest (potentially from other high-end vehicles equipped with LiDAR sensors). Under this setup, we proposed a novel, simple, and end-to-end trainable framework, termed AsyncDepth, to effectively extract relevant features from asynchronous LiDAR traversals of the same location for monocular 3D detectors. We show consistent and significant performance gain (up to 9 AP) across multiple state-of-the-art models and datasets with a negligible additional latency of 9.66 ms and a small storage cost.

Submitted 9 April, 2024; v1 submitted 7 April, 2024; originally announced April 2024.

Comments: Accepted by ICRA 2024. The code can be found at https://github.com/YurongYou/AsyncDepth

arXiv:2402.17721 [pdf, other]

Prototyping with Prompts: Emerging Approaches and Challenges in Generative AI Design for Collaborative Software Teams

Authors: Hari Subramonyam, Divy Thakkar, Andrew Ku, Jürgen Dieber, Anoop Sinha

Abstract: Generative AI models are increasingly being integrated into human task workflows, enabling the production of expressive content across a wide range of contexts. Unlike traditional human-AI design methods, the new approach to designing generative capabilities focuses heavily on prompt engineering strategies. This shift requires a deeper understanding of how collaborative software teams establish and apply design guidelines, iteratively prototype prompts, and evaluate them to achieve specific outcomes. To explore these dynamics, we conducted design studies with 39 industry professionals, including UX designers, AI engineers, and product managers. Our findings highlight emerging practices and role shifts in AI system prototyping among multistakeholder teams. We observe various prompting and prototyping strategies, highlighting the pivotal role of to-be-generated content characteristics in enabling rapid, iterative prototyping with generative AI. By identifying associated challenges, such as the limited model interpretability and overfitting the design to specific example content, we outline considerations for generative AI prototyping.

Submitted 30 March, 2025; v1 submitted 27 February, 2024; originally announced February 2024.

arXiv:2402.17699 [pdf, other]

Gradient-based Discrete Sampling with Automatic Cyclical Scheduling

Authors: Patrick Pynadath, Riddhiman Bhattacharya, Arun Hariharan, Ruqi Zhang

Abstract: Discrete distributions, particularly in high-dimensional deep models, are often highly multimodal due to inherent discontinuities. While gradient-based discrete sampling has proven effective, it is susceptible to becoming trapped in local modes due to the gradient information. To tackle this challenge, we propose an automatic cyclical scheduling, designed for efficient and accurate sampling in multimodal discrete distributions. Our method contains three key components: (1) a cyclical step size schedule where large steps discover new modes and small steps exploit each mode; (2) a cyclical balancing schedule, ensuring "balanced" proposals for given step sizes and high efficiency of the Markov chain; and (3) an automatic tuning scheme for adjusting the hyperparameters in the cyclical schedules, allowing adaptability across diverse datasets with minimal tuning. We prove the non-asymptotic convergence and inference guarantee for our method in general discrete distributions. Extensive experiments demonstrate the superiority of our method in sampling complex multimodal discrete distributions.

Submitted 24 October, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

arXiv:2402.08938 [pdf, other]

doi 10.1145/3643834.3661577

AINeedsPlanner: A Workbook to Support Effective Collaboration Between AI Experts and Clients

Authors: Dae Hyun Kim, Hyungyu Shin, Shakhnozakhon Yadgarova, Jinho Son, Hariharan Subramonyam, Juho Kim

Abstract: Clients often partner with AI experts to develop AI applications tailored to their needs. In these partnerships, careful planning and clear communication are critical, as inaccurate or incomplete specifications can result in misaligned model characteristics, expensive reworks, and potential friction between collaborators. Unfortunately, given the complexity of requirements ranging from functionality, data, and governance, effective guidelines for collaborative specification of requirements in client-AI expert collaborations are missing. In this work, we introduce AINeedsPlanner, a workbook that AI experts and clients can use to facilitate effective interchange and clear specifications. The workbook is based on (1) an interview of 10 completed AI application project teams, which identifies and characterizes steps in AI application planning and (2) a study with 12 AI experts, which defines a taxonomy of AI experts' information needs and dimensions that affect the information needs. Finally, we demonstrate the workbook's utility with two case studies in real-world settings.

Submitted 26 May, 2024; v1 submitted 13 February, 2024; originally announced February 2024.

Comments: To appear in DIS 2024

arXiv:2312.08793 [pdf, other]

Forbidden Facts: An Investigation of Competing Objectives in Llama-2

Authors: Tony T. Wang, Miles Wang, Kaivalya Hariharan, Nir Shavit

Abstract: LLMs often face competing pressures (for example helpfulness vs. harmlessness). To understand how models resolve such conflicts, we study Llama-2-chat models on the forbidden fact task. Specifically, we instruct Llama-2 to truthfully complete a factual recall statement while forbidding it from saying the correct answer. This often makes the model give incorrect answers. We decompose Llama-2 into 1000+ components, and rank each one with respect to how useful it is for forbidding the correct answer. We find that in aggregate, around 35 components are enough to reliably implement the full suppression behavior. However, these components are fairly heterogeneous and many operate using faulty heuristics. We discover that one of these heuristics can be exploited via a manually designed adversarial attack which we call The California Attack. Our results highlight some roadblocks standing in the way of being able to successfully interpret advanced ML systems. Project website available at https://forbiddenfacts.github.io.

Submitted 31 December, 2023; v1 submitted 14 December, 2023; originally announced December 2023.

Comments: Accepted to the ATTRIB and SoLaR workshops at NeurIPS 2023; (v3: clarified experimental details)

arXiv:2312.06960 [pdf, other]

Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment

Authors: Utkarsh Mall, Cheng Perng Phoo, Meilin Kelsey Liu, Carl Vondrick, Bharath Hariharan, Kavita Bala

Abstract: We introduce a method to train vision-language models for remote-sensing images without using any textual annotations. Our key insight is to use co-located internet imagery taken on the ground as an intermediary for connecting remote-sensing images and language. Specifically, we train an image encoder for remote sensing images to align with the image encoder of CLIP using a large amount of paired internet and satellite images. Our unsupervised approach enables the training of a first-of-its-kind large-scale vision language model (VLM) for remote sensing images at two different resolutions. We show that these VLMs enable zero-shot, open-vocabulary image classification, retrieval, segmentation and visual question answering for satellite images. On each of these tasks, our VLM trained without textual annotations outperforms existing VLMs trained with supervision, with gains of up to 20% for classification and 80% for segmentation.

Submitted 11 December, 2023; originally announced December 2023.

arXiv:2312.06131 [pdf, other]

ML-based Modeling to Predict I/O Performance on Different Storage Sub-systems

Authors: Yiheng Xu, Pranav Sivaraman, Hariharan Devarajan, Kathryn Mohror, Abhinav Bhatele

Abstract: Parallel applications can spend a significant amount of time performing I/O on large-scale supercomputers. Fast near-compute storage accelerators called burst buffers can reduce the time a processor spends performing I/O and mitigate I/O bottlenecks. However, determining if a given application could be accelerated using burst buffers is not straightforward even for storage experts. The relationship between an application's I/O characteristics (such as I/O volume, processes involved, etc.) and the best storage sub-system for it can be complicated. As a result, adapting parallel applications to use burst buffers efficiently is a trial-and-error process. In this work, we present a Python-based tool called PrismIO that enables programmatic analysis of I/O traces. Using PrismIO, we identify bottlenecks on burst buffers and parallel file systems and explain why certain I/O patterns perform poorly. Further, we use machine learning to model the relationship between I/O characteristics and burst buffer selections. We run IOR (an I/O benchmark) with various I/O characteristics on different storage systems and collect performance data. We use the data as the input for training the model. Our model can predict if a file of an application should be placed on BBs for unseen IOR scenarios with an accuracy of 94.47% and for four real applications with an accuracy of 95.86%.

Submitted 11 January, 2024; v1 submitted 11 December, 2023; originally announced December 2023.

arXiv:2312.05984 [pdf, ps, other]

Accurate Differential Operators for Hybrid Neural Fields

Authors: Aditya Chetan, Guandao Yang, Zichen Wang, Steve Marschner, Bharath Hariharan

Abstract: Neural fields have become widely used in various fields, from shape representation to neural rendering, and for solving partial differential equations (PDEs). With the advent of hybrid neural field representations like Instant NGP that leverage small MLPs and explicit representations, these models train quickly and can fit large scenes. Yet in many applications like rendering and simulation, hybrid neural fields can cause noticeable and unreasonable artifacts. This is because they do not yield accurate spatial derivatives needed for these downstream applications. In this work, we propose two ways to circumvent these challenges. Our first approach is a post hoc operator that uses local polynomial fitting to obtain more accurate derivatives from pre-trained hybrid neural fields. Additionally, we also propose a self-supervised fine-tuning approach that refines the hybrid neural field to yield accurate derivatives directly while preserving the initial signal. We show applications of our method to rendering, collision simulation, and solving PDEs. We observe that using our approach yields more accurate derivatives, reducing artifacts and leading to more accurate simulations in downstream applications.

Submitted 1 June, 2025; v1 submitted 10 December, 2023; originally announced December 2023.

Comments: Accepted in CVPR 2025. Project page is available at https://justachetan.github.io/hnf-derivatives/

Showing 1–50 of 182 results for author: Hariharan