-
Bonsai: Interpretable Tree-Adaptive Grounded Reasoning
Authors:
Kate Sanders,
Benjamin Van Durme
Abstract:
To develop general-purpose collaborative agents, humans need reliable AI systems that can (1) adapt to new domains and (2) transparently reason with uncertainty to allow for verification and correction. Black-box models demonstrate powerful data processing abilities but do not satisfy these criteria due to their opaqueness, domain specificity, and lack of uncertainty awareness. We introduce Bonsai, a compositional and probabilistic reasoning system that generates adaptable inference trees by retrieving relevant grounding evidence and using it to compute likelihoods of sub-claims derived from broader natural language inferences. Bonsai's reasoning power is tunable at test-time via evidence scaling and it demonstrates reliable handling of varied domains including transcripts, photographs, videos, audio, and databases. Question-answering and human alignment experiments demonstrate that Bonsai matches the performance of domain-specific black-box methods while generating interpretable, grounded, and uncertainty-aware reasoning traces.
Submitted 4 April, 2025;
originally announced April 2025.
-
WikiVideo: Article Generation from Multiple Videos
Authors:
Alexander Martin,
Reno Kriz,
William Gantt Walden,
Kate Sanders,
Hannah Recknor,
Eugene Yang,
Francis Ferraro,
Benjamin Van Durme
Abstract:
We present the challenging task of automatically creating a high-level Wikipedia-style article that aggregates information from multiple diverse videos about real-world events, such as natural disasters or political elections. Videos are intuitive sources for retrieval-augmented generation (RAG), but most contemporary RAG workflows focus heavily on text and existing methods for video-based summarization focus on low-level scene understanding rather than high-level event semantics. To close this gap, we introduce WikiVideo, a benchmark consisting of expert-written articles and densely annotated videos that provide evidence for articles' claims, facilitating the integration of video into RAG pipelines and enabling the creation of in-depth content that is grounded in multimodal sources. We further propose Collaborative Article Generation (CAG), a novel interactive method for article creation from multiple videos. CAG leverages an iterative interaction between an r1-style reasoning model and a VideoLLM to draw higher level inferences about the target event than is possible with VideoLLMs alone, which fixate on low-level visual features. We benchmark state-of-the-art VideoLLMs and CAG in both oracle retrieval and RAG settings and find that CAG consistently outperforms alternative methods, while suggesting intriguing avenues for future work.
Submitted 1 April, 2025;
originally announced April 2025.
-
CLAIMCHECK: How Grounded are LLM Critiques of Scientific Papers?
Authors:
Jiefu Ou,
William Gantt Walden,
Kate Sanders,
Zhengping Jiang,
Kaiser Sun,
Jeffrey Cheng,
William Jurayj,
Miriam Wanner,
Shaobo Liang,
Candice Morgan,
Seunghoon Han,
Weiqi Wang,
Chandler May,
Hannah Recknor,
Daniel Khashabi,
Benjamin Van Durme
Abstract:
A core part of scientific peer review involves providing expert critiques that directly assess the scientific claims a paper makes. While it is now possible to automatically generate plausible (if generic) reviews, ensuring that these reviews are sound and grounded in the papers' claims remains challenging. To facilitate LLM benchmarking on these challenges, we introduce CLAIMCHECK, an annotated dataset of NeurIPS 2023 and 2024 submissions and reviews mined from OpenReview. CLAIMCHECK is richly annotated by ML experts for weakness statements in the reviews and the paper claims that they dispute, as well as fine-grained labels of the validity, objectivity, and type of the identified weaknesses. We benchmark several LLMs on three claim-centric tasks supported by CLAIMCHECK, requiring models to (1) associate weaknesses with the claims they dispute, (2) predict fine-grained labels for weaknesses and rewrite the weaknesses to enhance their specificity, and (3) verify a paper's claims with grounded reasoning. Our experiments reveal that cutting-edge LLMs, while capable of predicting weakness labels in (2), continue to underperform relative to human experts on all other tasks.
Submitted 27 March, 2025;
originally announced March 2025.
-
MMMORRF: Multimodal Multilingual Modularized Reciprocal Rank Fusion
Authors:
Saron Samuel,
Dan DeGenaro,
Jimena Guallar-Blasco,
Kate Sanders,
Oluwaseun Eisape,
Tanner Spendlove,
Arun Reddy,
Alexander Martin,
Andrew Yates,
Eugene Yang,
Cameron Carpenter,
David Etter,
Efsun Kayi,
Matthew Wiesner,
Kenton Murray,
Reno Kriz
Abstract:
Videos inherently contain multiple modalities, including visual events, text overlays, sounds, and speech, all of which are important for retrieval. However, state-of-the-art multimodal language models like VAST and LanguageBind are built on vision-language models (VLMs), and thus overly prioritize visual signals. Retrieval benchmarks further reinforce this bias by focusing on visual queries and neglecting other modalities. We create a search system, MMMORRF, that extracts text and features from both visual and audio modalities and integrates them with a novel modality-aware weighted reciprocal rank fusion. MMMORRF is both effective and efficient, demonstrating practicality in searching videos based on users' information needs instead of visual descriptive queries. We evaluate MMMORRF on MultiVENT 2.0 and TVR, two multimodal benchmarks designed for more targeted information needs, and find that it improves nDCG@20 by 81% over leading multimodal encoders and 37% over single-modality retrieval, demonstrating the value of integrating diverse modalities.
Submitted 9 May, 2025; v1 submitted 26 March, 2025;
originally announced March 2025.
-
Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval
Authors:
Arun Reddy,
Alexander Martin,
Eugene Yang,
Andrew Yates,
Kate Sanders,
Kenton Murray,
Reno Kriz,
Celso M. de Melo,
Benjamin Van Durme,
Rama Chellappa
Abstract:
In this work, we tackle the problem of text-to-video retrieval (T2VR). Inspired by the success of late interaction techniques in text-document, text-image, and text-video retrieval, our approach, Video-ColBERT, introduces a simple and efficient mechanism for fine-grained similarity assessment between queries and videos. Video-ColBERT is built upon 3 main components: a fine-grained spatial and temporal token-wise interaction, query and visual expansions, and a dual sigmoid loss during training. We find that this interaction and training paradigm leads to strong individual, yet compatible, representations for encoding video content. These representations lead to increases in performance on common text-to-video retrieval benchmarks compared to other bi-encoder methods.
Submitted 24 March, 2025;
originally announced March 2025.
-
Randomly Sampled Language Reasoning Problems Explain Limits of LLMs
Authors:
Kavi Gupta,
Kate Sanders,
Armando Solar-Lezama
Abstract:
While LLMs have revolutionized the field of machine learning due to their high performance across a range of tasks, they are known to perform poorly in planning, hallucinate false answers, have degraded performance on less canonical versions of the same task, and answer incorrectly on a variety of specific prompts. There are several emerging theories of LLM performance with some predictive power, among them that LLMs lack world modeling ability, that they have an undesirable bias towards an autoregressive prior, and that they perform less well on more novel problems. The existing literature on novelty has focused on tasks of relatively high complexity, studying perturbations of canonical but complex problems. In this paper, we attempt to isolate novelty as a factor in LLM underperformance. To this end, we consider an extremely simple domain: next token prediction on simple language tasks. The twist is that these language tasks are unseen, as they are randomly drawn from a large, parsimoniously defined set of languages arising from simple grammar rules. This allows us to isolate the effect of task novelty and see if it is sufficient to explain low performance. We find that LLMs uniformly underperform n-gram models (which do not have the capacity for world modeling) on these tasks, both when used as next token predictors and as reasoners.
Submitted 26 May, 2025; v1 submitted 6 January, 2025;
originally announced January 2025.
-
MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval
Authors:
Reno Kriz,
Kate Sanders,
David Etter,
Kenton Murray,
Cameron Carpenter,
Kelly Van Ochten,
Hannah Recknor,
Jimena Guallar-Blasco,
Alexander Martin,
Ronald Colaianni,
Nolan King,
Eugene Yang,
Benjamin Van Durme
Abstract:
Efficiently retrieving and synthesizing information from large-scale multimodal collections has become a critical challenge. However, existing video retrieval datasets suffer from scope limitations, primarily focusing on matching descriptive but vague queries with small collections of professionally edited, English-centric videos. To address this gap, we introduce MultiVENT 2.0, a large-scale, multilingual event-centric video retrieval benchmark featuring a collection of more than 218,000 news videos and 3,906 queries targeting specific world events. These queries specifically target information found in the visual content, audio, embedded text, and text metadata of the videos, requiring systems to leverage all these sources to succeed at the task. Preliminary results show that state-of-the-art vision-language models struggle significantly with this task, and while alternative approaches show promise, they are still insufficient to adequately address this problem. These findings underscore the need for more robust multimodal retrieval systems, as effective video retrieval is a crucial step towards multimodal content understanding and generation.
Submitted 10 February, 2025; v1 submitted 15 October, 2024;
originally announced October 2024.
-
Grounding Partially-Defined Events in Multimodal Data
Authors:
Kate Sanders,
Reno Kriz,
David Etter,
Hannah Recknor,
Alexander Martin,
Cameron Carpenter,
Jingyang Lin,
Benjamin Van Durme
Abstract:
How are we able to learn about complex current events just from short snippets of video? While natural language enables straightforward ways to represent under-specified, partially observable events, visual data does not facilitate analogous methods and, consequently, introduces unique challenges in event understanding. With the growing prevalence of vision-capable AI agents, these systems must be able to model events from collections of unstructured video data. To tackle robust event modeling in multimodal settings, we introduce a multimodal formulation for partially-defined events and cast the extraction of these events as a three-stage span retrieval task. We propose a corresponding benchmark for this task, MultiVENT-G, that consists of 14.5 hours of densely annotated current event videos and 1,168 text documents, containing 22.8K labeled event-centric entities. We propose a collection of LLM-driven approaches to the task of multimodal event analysis, and evaluate them on MultiVENT-G. Results illustrate the challenges that abstract event understanding poses and demonstrate promise in event-centric video-language systems.
Submitted 7 October, 2024;
originally announced October 2024.
-
Core: Robust Factual Precision with Informative Sub-Claim Identification
Authors:
Zhengping Jiang,
Jingyu Zhang,
Nathaniel Weir,
Seth Ebner,
Miriam Wanner,
Kate Sanders,
Daniel Khashabi,
Anqi Liu,
Benjamin Van Durme
Abstract:
Hallucinations pose a challenge to the application of large language models (LLMs), thereby motivating the development of metrics to evaluate factual precision. We observe that popular metrics using the Decompose-Then-Verify framework, such as FActScore, can be manipulated by adding obvious or repetitive subclaims to artificially inflate scores. This observation motivates our new customizable plug-and-play subclaim selection component called Core, which filters down individual subclaims according to their uniqueness and informativeness. We show that many popular factual precision metrics augmented by Core are substantially more robust on a wide range of knowledge domains. We release an evaluation framework supporting easy and modular use of Core and various decomposition strategies, which we recommend the community adopt. We also release an expansion of the FActScore biography dataset to facilitate further studies of decomposition-based factual precision evaluation.
Submitted 15 October, 2024; v1 submitted 3 July, 2024;
originally announced July 2024.
-
A Survey of Video Datasets for Grounded Event Understanding
Authors:
Kate Sanders,
Benjamin Van Durme
Abstract:
While existing video benchmarks largely consider specialized downstream tasks like retrieval or question-answering (QA), contemporary multimodal AI systems must be capable of well-rounded common-sense reasoning akin to human visual understanding. A critical component of human temporal-visual perception is our ability to identify and cognitively model "things happening", or events. Historically, video benchmark tasks have implicitly tested for this ability (e.g., video captioning, in which models describe visual events with natural language), but they do not consider video event understanding as a task in itself. Recent work has begun to explore video analogues to textual event extraction but consists of competing task definitions and datasets limited to highly specific event types. Therefore, while there is a rich domain of event-centric video research spanning the past 10+ years, it is unclear how video event understanding should be framed and what resources we have to study it. In this paper, we survey 105 video datasets that require event understanding capability, consider how they contribute to the study of robust event understanding in video, and assess proposed video event extraction tasks in the context of this body of research. We propose suggestions informed by this survey for dataset curation and task framing, with an emphasis on the uniquely temporal nature of video events and ambiguity in visual content.
Submitted 13 June, 2024;
originally announced June 2024.
-
On the Evaluation of Machine-Generated Reports
Authors:
James Mayfield,
Eugene Yang,
Dawn Lawrie,
Sean MacAvaney,
Paul McNamee,
Douglas W. Oard,
Luca Soldaini,
Ian Soboroff,
Orion Weller,
Efsun Kayi,
Kate Sanders,
Marc Mason,
Noah Hibbler
Abstract:
Large Language Models (LLMs) have enabled new ways to satisfy information needs. Although great strides have been made in applying them to settings like document ranking and short-form text generation, they still struggle to compose complete, accurate, and verifiable long-form reports. Reports with these qualities are necessary to satisfy the complex, nuanced, or multi-faceted information needs of users. In this perspective paper, we draw together opinions from industry and academia, and from a variety of related research areas, to present our vision for automatic report generation, and -- critically -- a flexible framework by which such reports can be evaluated. In contrast with other summarization tasks, automatic report generation starts with a detailed description of an information need, stating the necessary background, requirements, and scope of the report. Further, the generated reports should be complete, accurate, and verifiable. These qualities, which are desirable -- if not required -- in many analytic report-writing settings, require rethinking how to build and evaluate systems that exhibit these qualities. To foster new efforts in building these systems, we present an evaluation framework that draws on ideas found in various evaluations. To test completeness and accuracy, the framework uses nuggets of information, expressed as questions and answers, that need to be part of any high-quality generated report. Additionally, evaluation of citations that map claims made in the report to their source documents ensures verifiability.
Submitted 9 May, 2024; v1 submitted 1 May, 2024;
originally announced May 2024.
-
Tur[k]ingBench: A Challenge Benchmark for Web Agents
Authors:
Kevin Xu,
Yeganeh Kordi,
Tanay Nayak,
Adi Asija,
Yizhong Wang,
Kate Sanders,
Adam Byerly,
Jingyu Zhang,
Benjamin Van Durme,
Daniel Khashabi
Abstract:
Can advanced multi-modal models effectively tackle complex web-based tasks? Such tasks are often found on crowdsourcing platforms, where crowdworkers engage in challenging micro-tasks within web-based environments. Building on this idea, we present TurkingBench, a benchmark consisting of tasks presented as web pages with textual instructions and multi-modal contexts. Unlike previous approaches that rely on artificially synthesized web pages, our benchmark uses natural HTML pages originally designed for crowdsourcing workers to perform various annotation tasks. Each task's HTML instructions are instantiated with different values derived from crowdsourcing tasks, creating diverse instances. This benchmark includes 32.2K instances spread across 158 tasks. To support the evaluation of TurkingBench, we have developed a framework that links chatbot responses to actions on web pages (e.g., modifying a text box, selecting a radio button). We assess the performance of cutting-edge private and open-source models, including language-only and vision-language models (such as GPT4 and InternVL), on this benchmark. Our results show that while these models outperform random chance, there is still significant room for improvement. We hope that this benchmark will drive progress in the evaluation and development of web-based agents.
Submitted 21 February, 2025; v1 submitted 18 March, 2024;
originally announced March 2024.
-
TV-TREES: Multimodal Entailment Trees for Neuro-Symbolic Video Reasoning
Authors:
Kate Sanders,
Nathaniel Weir,
Benjamin Van Durme
Abstract:
It is challenging for models to understand complex, multimodal content such as television clips, and this is in part because video-language models often rely on single-modality reasoning and lack interpretability. To combat these issues we propose TV-TREES, the first multimodal entailment tree generator. TV-TREES serves as an approach to video understanding that promotes interpretable joint-modality reasoning by searching for trees of entailment relationships between simple text-video evidence and higher-level conclusions that prove question-answer pairs. We also introduce the task of multimodal entailment tree generation to evaluate reasoning quality. Our method's performance on the challenging TVQA benchmark demonstrates interpretable, state-of-the-art zero-shot performance on full clips, illustrating that multimodal entailment tree generation can be a best-of-both-worlds alternative to black-box systems.
Submitted 10 October, 2024; v1 submitted 29 February, 2024;
originally announced February 2024.
-
Enhancing Systematic Decompositional Natural Language Inference Using Informal Logic
Authors:
Nathaniel Weir,
Kate Sanders,
Orion Weller,
Shreya Sharma,
Dongwei Jiang,
Zhengping Jiang,
Bhavana Dalvi Mishra,
Oyvind Tafjord,
Peter Jansen,
Peter Clark,
Benjamin Van Durme
Abstract:
Recent language models enable new opportunities for structured reasoning with text, such as the construction of intuitive, proof-like textual entailment trees without relying on brittle formal logic. However, progress in this direction has been hampered by a long-standing lack of a clear protocol for determining what valid compositional entailment is. This absence causes noisy datasets and limited performance gains by modern neuro-symbolic engines. To address these problems, we formulate a consistent and theoretically grounded approach to annotating decompositional entailment and evaluate its impact on LLM-based textual inference. We find that our new dataset, RDTE (Recognizing Decompositional Textual Entailment), has a substantially higher internal consistency (+9%) than prior decompositional entailment datasets. We also find that training an RDTE-oriented entailment classifier via knowledge distillation and employing it in an entailment tree reasoning engine significantly improves both accuracy and proof quality, illustrating the practical benefit of this advance for textual inference.
Submitted 12 August, 2024; v1 submitted 22 February, 2024;
originally announced February 2024.
-
MultiVENT: Multilingual Videos of Events with Aligned Natural Text
Authors:
Kate Sanders,
David Etter,
Reno Kriz,
Benjamin Van Durme
Abstract:
Everyday news coverage has shifted from traditional broadcasts towards a wide range of presentation formats such as first-hand, unedited video footage. Datasets that reflect the diverse array of multimodal, multilingual news sources available online could be used to teach models to benefit from this shift, but existing news video datasets focus on traditional news broadcasts produced for English-speaking audiences. We address this limitation by constructing MultiVENT, a dataset of multilingual, event-centric videos grounded in text documents across five target languages. MultiVENT includes both news broadcast videos and non-professional event footage, which we use to analyze the state of online news videos and how they can be leveraged to build robust, factually accurate models. Finally, we provide a model for complex, multilingual video retrieval to serve as a baseline for information retrieval using MultiVENT.
Submitted 6 July, 2023;
originally announced July 2023.
-
Ambiguous Images With Human Judgments for Robust Visual Event Classification
Authors:
Kate Sanders,
Reno Kriz,
Anqi Liu,
Benjamin Van Durme
Abstract:
Contemporary vision benchmarks predominantly consider tasks on which humans can achieve near-perfect performance. However, humans are frequently presented with visual data that they cannot classify with 100% certainty, and models trained on standard vision benchmarks achieve low performance when evaluated on this data. To address this issue, we introduce a procedure for creating datasets of ambiguous images and use it to produce SQUID-E ("Squidy"), a collection of noisy images extracted from videos. All images are annotated with ground truth values and a test set is annotated with human uncertainty judgments. We use this dataset to characterize human uncertainty in vision tasks and evaluate existing visual event classification models. Experimental results suggest that existing vision models are not sufficiently equipped to provide meaningful outputs for ambiguous images and that datasets of this nature can be used to assess and improve such models through model training and direct evaluation of model calibration. These findings motivate large-scale ambiguous dataset creation and further research focusing on noisy visual data.
Submitted 22 October, 2022; v1 submitted 6 October, 2022;
originally announced October 2022.
-
A Multi-Chamber Smart Suction Cup for Adaptive Gripping and Haptic Exploration
Authors:
Tae Myung Huh,
Kate Sanders,
Michael Danielczuk,
Monica Li,
Yunliang Chen,
Ken Goldberg,
Hannah S. Stuart
Abstract:
We present a novel robot end-effector for gripping and haptic exploration. Tactile sensing through suction flow monitoring is applied to a new suction cup design that contains multiple chambers for air flow. Each chamber connects with its own remote pressure transducer, which enables both absolute and differential pressure measures between chambers. By changing the overall vacuum applied to this smart suction cup, it can perform different functions such as gentle haptic exploration (low pressure) and monitoring breaks in the seal during strong astrictive gripping (high pressure). Haptic exploration of surfaces through sliding and palpation can guide the selection of suction grasp locations and help to identify the local surface geometry. During suction gripping, this design localizes breaks in the suction seal between four quadrants with up to 97% accuracy and detects breaks in the suction seal early enough to avoid total grasp failure.
Submitted 18 October, 2021; v1 submitted 5 May, 2021;
originally announced May 2021.
-
RV-GAN: Segmenting Retinal Vascular Structure in Fundus Photographs using a Novel Multi-scale Generative Adversarial Network
Authors:
Sharif Amit Kamran,
Khondker Fariha Hossain,
Alireza Tavakkoli,
Stewart Lee Zuckerbrod,
Kenton M. Sanders,
Salah A. Baker
Abstract:
High-fidelity segmentation of both macro and microvascular structure of the retina plays a pivotal role in determining degenerative retinal diseases, yet it is a difficult problem. Due to successive resolution loss in the encoding phase combined with the inability to recover this lost information in the decoding phase, autoencoding-based segmentation approaches are limited in their ability to extract retinal microvascular structure. We propose RV-GAN, a new multi-scale generative architecture for accurate retinal vessel segmentation to alleviate this. The proposed architecture uses two generators and two multi-scale autoencoding discriminators for better microvessel localization and segmentation. In order to avoid the loss of fidelity suffered by traditional GAN-based segmentation systems, we introduce a novel weighted feature matching loss. This new loss incorporates and prioritizes features from the discriminator's decoder over the encoder. Doing so, combined with the fact that the discriminator's decoder attempts to determine real or fake images at the pixel level, better preserves macro and microvascular structure. By combining reconstruction and weighted feature matching loss, the proposed architecture achieves an area under the curve (AUC) of 0.9887, 0.9914, and 0.9887 in pixel-wise segmentation of retinal vasculature from three publicly available datasets, namely DRIVE, CHASE-DB1, and STARE, respectively. Additionally, RV-GAN outperforms other architectures in two additional relevant metrics, mean intersection-over-union (Mean-IOU) and structural similarity measure (SSIM).
Submitted 14 May, 2021; v1 submitted 2 January, 2021;
originally announced January 2021.
-
Mechanical Search on Shelves using Lateral Access X-RAY
Authors:
Huang Huang,
Marcus Dominguez-Kuhne,
Jeffrey Ichnowski,
Vishal Satish,
Michael Danielczuk,
Kate Sanders,
Andrew Lee,
Anelia Angelova,
Vincent Vanhoucke,
Ken Goldberg
Abstract:
Efficiently finding an occluded object with lateral access arises in many contexts such as warehouses, retail, healthcare, shipping, and homes. We introduce LAX-RAY (Lateral Access maXimal Reduction of occupancY support Area), a system to automate the mechanical search for occluded objects on shelves. For such lateral access environments, LAX-RAY couples a perception pipeline predicting a target object occupancy support distribution with a mechanical search policy that sequentially selects occluding objects to push to the side to reveal the target as efficiently as possible. Within the context of extruded polygonal objects and a stationary target with a known aspect ratio, we explore three lateral access search policies: Distribution Area Reduction (DAR), Distribution Entropy Reduction (DER), and Distribution Entropy Reduction over Multiple Time Steps (DER-MT) utilizing the support distribution and prior information. We evaluate these policies using the First-Order Shelf Simulator (FOSS) in which we simulate 800 random shelf environments of varying difficulty, and in a physical shelf environment with a Fetch robot and an embedded PrimeSense RGBD Camera. Average simulation results of 87.3% success rate demonstrate better performance of DER-MT with 2 prediction steps. When deployed on the robot, results show a success rate of at least 80% for all policies, suggesting that LAX-RAY can efficiently reveal the target object in reality. Both results show significantly better performance of the three proposed policies compared to a baseline policy with uniform probability distribution assumption in non-trivial cases, showing the importance of distribution prediction. Code, videos, and supplementary material can be found at https://sites.google.com/berkeley.edu/lax-ray.
Submitted 23 November, 2020;
originally announced November 2020.
-
Non-Markov Policies to Reduce Sequential Failures in Robot Bin Picking
Authors:
Kate Sanders,
Michael Danielczuk,
Jeffrey Mahler,
Ajay Tanwani,
Ken Goldberg
Abstract:
A new generation of automated bin picking systems using deep learning is evolving to support increasing demand for e-commerce. To accommodate a wide variety of products, many automated systems include multiple gripper types and/or tool changers. However, for some objects, sequential grasp failures are common: when a computed grasp fails to lift and remove the object, the bin is often left unchanged; as the sensor input is consistent, the system retries the same grasp over and over, resulting in a significant reduction in mean successful picks per hour (MPPH). Based on an empirical study of sequential failures, we characterize a class of "sequential failure objects" (SFOs) -- objects prone to sequential failures based on a novel taxonomy. We then propose three non-Markov picking policies that incorporate memory of past failures to modify subsequent actions. Simulation experiments on SFO models and the EGAD dataset suggest that the non-Markov policies significantly outperform the Markov policy in terms of the sequential failure rate and MPPH. In physical experiments on 50 heaps of 12 SFOs, the most effective non-Markov policy increased MPPH over the Dex-Net Markov policy by 107%.
Submitted 20 July, 2020;
originally announced July 2020.
-
Fundus2Angio: A Conditional GAN Architecture for Generating Fluorescein Angiography Images from Retinal Fundus Photography
Authors:
Sharif Amit Kamran,
Khondker Fariha Hossain,
Alireza Tavakkoli,
Stewart Lee Zuckerbrod,
Salah A. Baker,
Kenton M. Sanders
Abstract:
Carrying out clinical diagnosis of retinal vascular degeneration using Fluorescein Angiography (FA) is a time-consuming process and can pose significant adverse effects on the patient. Angiography requires insertion of a dye that may cause severe adverse effects and can even be fatal. Currently, there are no non-invasive systems capable of generating Fluorescein Angiography images. However, retinal fundus photography is a non-invasive imaging technique that can be completed in a few seconds. In order to eliminate the need for FA, we propose a conditional generative adversarial network (GAN) to translate fundus images to FA images. The proposed GAN consists of a novel residual block capable of generating high-quality FA images. These images are important tools in the differential diagnosis of retinal diseases without the need for an invasive procedure with possible side effects. Our experiments show that the proposed architecture outperforms other state-of-the-art generative networks. Furthermore, our proposed model achieves better qualitative results indistinguishable from real angiograms.
Submitted 29 September, 2020; v1 submitted 11 May, 2020;
originally announced May 2020.