PRISM: Efficient Long-Range Reasoning With Short-Context LLMs

Authors: Dulhan Jayalath, James Bradley Wendt, Nicholas Monath, Sandeep Tata, Beliz Gunel

Abstract: Long-range tasks demand reasoning over long inputs. Current solutions require large compute budgets, training data, model weight access, or complex task-specific designs. We introduce PRISM, which processes information as a stream of chunks while maintaining a structured in-context memory specified with a typed hierarchical schema. PRISM outperforms baselines on diverse tasks while using at least 4x shorter contexts than long-context models. This approach is token-efficient, producing concise outputs and efficiently leveraging key-value (KV) caches to reduce costs by up to 54% compared to alternative short-context methods. PRISM scales down to tiny chunks (<500 tokens) without increasing encoding costs or sacrificing quality, and generalizes to new tasks with minimal effort by automatically generating schemas from task descriptions.

Submitted 12 March, 2025; v1 submitted 25 December, 2024; originally announced December 2024.

Comments: 28 pages, 7 figures, 5 tables

arXiv:2411.05715

On the Role of Noise in AudioVisual Integration: Evidence from Artificial Neural Networks that Exhibit the McGurk Effect

Authors: Lukas Grasse, Matthew S. Tata

Abstract: Humans are able to fuse information from both auditory and visual modalities to help with understanding speech. This is frequently demonstrated through a phenomenon known as the McGurk Effect, during which a listener is presented with incongruent auditory and visual speech that fuse together into the percept of an illusory intermediate phoneme. Building on a recent framework that proposes how to address developmental 'why' questions using artificial neural networks, we evaluated a set of recent artificial neural networks trained on audiovisual speech by testing them with audiovisually incongruent words designed to elicit the McGurk effect. We compared networks trained on clean speech to those trained on noisy speech, and discovered that training with noisy speech led to an increase in both visual responses and McGurk responses across all models. Furthermore, we observed that systematically increasing the level of auditory noise during ANN training also increased the amount of audiovisual integration up to a point, but at extreme noise levels, this integration failed to develop. These results suggest that excessive noise exposure during critical periods of audiovisual learning may negatively influence the development of audiovisual speech integration. This work also demonstrates that the McGurk effect reliably emerges untrained from the behaviour of both supervised and unsupervised networks. This supports the notion that artificial neural networks might be useful models for certain aspects of perception and cognition.

Submitted 8 November, 2024; originally announced November 2024.

arXiv:2407.15021

Enhancing Incremental Summarization with Structured Representations

Authors: EunJeong Hwang, Yichao Zhou, James Bradley Wendt, Beliz Gunel, Nguyen Vo, Jing Xie, Sandeep Tata

Abstract: Large language models (LLMs) often struggle with processing extensive input contexts, which can lead to redundant, inaccurate, or incoherent summaries. Recent methods have used unstructured memory to incrementally process these contexts, but they still suffer from information overload due to the volume of unstructured data handled. In our study, we introduce structured knowledge representations ($GU_{json}$), which significantly improve summarization performance by 40% and 14% across two public datasets. Most notably, we propose the Chain-of-Key strategy ($CoK_{json}$) that dynamically updates or augments these representations with new information, rather than recreating the structured memory for each new source. This method further enhances performance by 7% and 4% on the datasets.

Submitted 20 July, 2024; originally announced July 2024.

arXiv:2406.05079

SUMIE: A Synthetic Benchmark for Incremental Entity Summarization

Authors: Eunjeong Hwang, Yichao Zhou, Beliz Gunel, James Bradley Wendt, Sandeep Tata

Abstract: No existing dataset adequately tests how well language models can incrementally update entity summaries - a crucial ability as these models rapidly advance. The Incremental Entity Summarization (IES) task is vital for maintaining accurate, up-to-date knowledge. To address this, we introduce SUMIE, a fully synthetic dataset designed to expose real-world IES challenges. This dataset effectively highlights problems like incorrect entity association and incomplete information presentation. Unlike common synthetic datasets, ours captures the complexity and nuances found in real-world data. We generate informative and diverse attributes, summaries, and unstructured paragraphs in sequence, ensuring high quality. The alignment between generated summaries and paragraphs exceeds 96%, confirming the dataset's quality. Extensive experiments demonstrate the dataset's difficulty - state-of-the-art LLMs struggle to update summaries with an F1 higher than 80.4%. We will open source the benchmark and the evaluation metrics to help the community make progress on IES tasks.

Submitted 7 June, 2024; originally announced June 2024.

Comments: 24 figures, 4 tables

arXiv:2404.15565

CASPR: Automated Evaluation Metric for Contrastive Summarization

Authors: Nirupan Ananthamurugan, Dat Duong, Philip George, Ankita Gupta, Sandeep Tata, Beliz Gunel

Abstract: Summarizing comparative opinions about entities (e.g., hotels, phones) from a set of source reviews, often referred to as contrastive summarization, can considerably aid users in decision making. However, reliably measuring the contrastiveness of the output summaries without relying on human evaluations remains an open problem. Prior work has proposed a token-overlap-based metric, Distinctiveness Score, to measure contrast, but it does not account for meaning-preserving lexical variations. In this work, we propose an automated evaluation metric, CASPR, to better measure contrast between a pair of summaries. Our metric is based on a simple and lightweight method that leverages the natural language inference (NLI) task to measure contrast by segmenting reviews into single-claim sentences and carefully aggregating NLI scores between them to come up with a summary-level score. We compare CASPR with Distinctiveness Score and a simple yet powerful baseline based on BERTScore. Our results on a prior dataset, CoCoTRIP, demonstrate that CASPR can more reliably capture the contrastiveness of the summary pairs compared to the baselines.

Submitted 13 May, 2024; v1 submitted 23 April, 2024; originally announced April 2024.

arXiv:2403.19710

STRUM-LLM: Attributed and Structured Contrastive Summarization

Authors: Beliz Gunel, James B. Wendt, Jing Xie, Yichao Zhou, Nguyen Vo, Zachary Fisher, Sandeep Tata

Abstract: Users often struggle with decision-making between two options (A vs B), as it usually requires time-consuming research across multiple web pages. We propose STRUM-LLM that addresses this challenge by generating attributed, structured, and helpful contrastive summaries that highlight key differences between the two options. STRUM-LLM identifies helpful contrast: the specific attributes along which the two options differ significantly and which are most likely to influence the user's decision. Our technique is domain-agnostic, and does not require any human-labeled data or fixed attribute list as supervision. STRUM-LLM attributes all extractions back to the input sources along with textual evidence, and it does not have a limit on the length of input sources that it can process. STRUM-LLM Distilled has 100x more throughput than the models with comparable performance while being 10x smaller. In this paper, we provide extensive evaluations for our method and lay out future directions for our currently deployed system.

Submitted 25 March, 2024; originally announced March 2024.

arXiv:2304.03932

3D GANs and Latent Space: A comprehensive survey

Authors: Satya Pratheek Tata, Subhankar Mishra

Abstract: Generative Adversarial Networks (GANs) have emerged as a significant player in generative modeling by mapping lower-dimensional random noise to higher-dimensional spaces. These networks have been used to generate high-resolution images and 3D objects. The efficient modeling of 3D objects and human faces is crucial in the development process of 3D graphical environments such as games or simulations. 3D GANs are a new type of generative model used for 3D reconstruction, point cloud reconstruction, and 3D semantic scene completion. The choice of distribution for noise is critical as it represents the latent space. Understanding a GAN's latent space is essential for fine-tuning the generated samples, as demonstrated by the morphing of semantically meaningful parts of images. In this work, we explore the latent space and 3D GANs, examine several GAN variants and training methods to gain insights into improving 3D GAN training, and suggest potential future directions for further research.

Submitted 8 April, 2023; originally announced April 2023.

arXiv:2212.10047

An Augmentation Strategy for Visually Rich Documents

Authors: Jing Xie, James B. Wendt, Yichao Zhou, Seth Ebner, Sandeep Tata

Abstract: Many business workflows require extracting important fields from form-like documents (e.g. bank statements, bills of lading, purchase orders, etc.). Recent techniques for automating this task work well only when trained with large datasets. In this work we propose a novel data augmentation technique to improve performance when training data is scarce, e.g. 10-250 documents. Our technique, which we call FieldSwap, works by swapping out the key phrases of a source field with the key phrases of a target field to generate new synthetic examples of the target field for use in training. We demonstrate that this approach can yield 1-7 F1 point improvements in extraction performance.

Submitted 22 December, 2022; v1 submitted 20 December, 2022; originally announced December 2022.

Comments: 9 pages, 6 figures, 3 tables

arXiv:2211.15421

doi 10.1145/3580305.3599929

VRDU: A Benchmark for Visually-rich Document Understanding

Authors: Zilong Wang, Yichao Zhou, Wei Wei, Chen-Yu Lee, Sandeep Tata

Abstract: Understanding visually-rich business documents to extract structured data and automate business workflows has been receiving attention both in academia and industry. Although recent multi-modal language models have achieved impressive results, we find that existing benchmarks do not reflect the complexity of real documents seen in industry. In this work, we identify the desiderata for a more comprehensive benchmark and propose one we call Visually Rich Document Understanding (VRDU). VRDU contains two datasets that represent several challenges: rich schema including diverse data types as well as hierarchical entities, complex templates including tables and multi-column layouts, and diversity of different layouts (templates) within a single document type. We design few-shot and conventional experiment settings along with a carefully designed matching algorithm to evaluate extraction results. We report the performance of strong baselines and offer three observations: (1) generalizing to new document templates is still very challenging, (2) few-shot performance has a lot of headroom, and (3) models struggle with hierarchical fields such as line-items in an invoice. We plan to open source the benchmark and the evaluation toolkit. We hope this helps the community make progress on these challenging tasks in extracting structured data from visually rich documents.

Submitted 16 September, 2023; v1 submitted 14 November, 2022; originally announced November 2022.

Comments: KDD 2023

arXiv:2210.16391

Radically Lower Data-Labeling Costs for Visually Rich Document Extraction Models

Authors: Yichao Zhou, James B. Wendt, Navneet Potti, Jing Xie, Sandeep Tata

Abstract: A key bottleneck in building automatic extraction models for visually rich documents like invoices is the cost of acquiring the several thousand high-quality labeled documents that are needed to train a model with acceptable accuracy. We propose Selective Labeling to simplify the labeling task to provide "yes/no" labels for candidate extractions predicted by a model trained on partially labeled documents. We combine this with a custom active learning strategy to find the predictions that the model is most uncertain about. We show through experiments on document types drawn from 3 different domains that selective labeling can reduce the cost of acquiring labeled data by $10\times$ with a negligible loss in accuracy.

Submitted 28 October, 2022; originally announced October 2022.

Comments: 9 pages, 8 figures, 3 tables

arXiv:2201.02647

Data-Efficient Information Extraction from Form-Like Documents

Authors: Beliz Gunel, Navneet Potti, Sandeep Tata, James B. Wendt, Marc Najork, Jing Xie

Abstract: Automating information extraction from form-like documents at scale is a pressing need due to its potential impact on automating business workflows across many industries like financial services, insurance, and healthcare. The key challenge is that form-like documents in these business workflows can be laid out in virtually infinitely many ways; hence, a good solution to this problem should generalize to documents with unseen layouts and languages. A solution to this problem requires a holistic understanding of both the textual segments and the visual cues within a document, which is non-trivial. While the natural language processing and computer vision communities are starting to tackle this problem, there has not been much focus on (1) data-efficiency, and (2) ability to generalize across different document types and languages. In this paper, we show that when we have only a small number of labeled documents for training (~50), a straightforward transfer learning approach from a considerably structurally-different larger labeled corpus yields up to a 27 F1 point improvement over simply training on the small corpus in the target domain. We improve on this with a simple multi-domain transfer learning approach, that is currently in production use, and show that this yields up to a further 8 F1 point improvement. We make the case that data efficiency is critical to enable information extraction systems to scale to handle hundreds of different document-types, and learning good representations is critical to accomplishing this.

Submitted 7 January, 2022; originally announced January 2022.

Comments: Published at the 2nd Document Intelligence Workshop @ KDD 2021 (https://document-intelligence.github.io/DI-2021/)

arXiv:2101.02415

Simplified DOM Trees for Transferable Attribute Extraction from the Web

Authors: Yichao Zhou, Ying Sheng, Nguyen Vo, Nick Edmonds, Sandeep Tata

Abstract: There has been a steady need to precisely extract structured knowledge from the web (i.e. HTML documents). Given a web page, extracting a structured object along with various attributes of interest (e.g. price, publisher, author, and genre for a book) can facilitate a variety of downstream applications such as large-scale knowledge base construction, e-commerce product search, and personalized recommendation. Considering each web page is rendered from an HTML DOM tree, existing approaches formulate the problem as a DOM tree node tagging task. However, they either rely on computationally expensive visual feature engineering or are incapable of modeling the relationship among the tree nodes. In this paper, we propose a novel transferable method, Simplified DOM Trees for Attribute Extraction (SimpDOM), to tackle the problem by efficiently retrieving useful context for each node by leveraging the tree structure. We study two challenging experimental settings: (i) intra-vertical few-shot extraction, and (ii) cross-vertical few-shot extraction with out-of-domain knowledge, to evaluate our approach. Extensive experiments on the SWDE public dataset show that SimpDOM outperforms the state-of-the-art (SOTA) method by 1.44% on the F1 score. We also find that utilizing knowledge from a different vertical (cross-vertical extraction) is surprisingly useful and helps beat the SOTA by a further 1.37%.

Submitted 7 January, 2021; originally announced January 2021.

Comments: 10 pages, 9 figures

arXiv:2010.10755

doi 10.1145/3394486.3403153

FreeDOM: A Transferable Neural Architecture for Structured Information Extraction on Web Documents

Authors: Bill Yuchen Lin, Ying Sheng, Nguyen Vo, Sandeep Tata

Abstract: Extracting structured data from HTML documents is a long-studied problem with a broad range of applications like augmenting knowledge bases, supporting faceted search, and providing domain-specific experiences for key verticals like shopping and movies. Previous approaches have either required a small number of examples for each target site or relied on carefully handcrafted heuristics built over visual renderings of websites. In this paper, we present a novel two-stage neural approach, named FreeDOM, which overcomes both these limitations. The first stage learns a representation for each DOM node in the page by combining both the text and markup information. The second stage captures longer range distance and semantic relatedness using a relational neural network. By combining these stages, FreeDOM is able to generalize to unseen sites after training on a small number of seed sites from that vertical without requiring expensive hand-crafted features over visual renderings of the page. Through experiments on a public dataset with 8 different verticals, we show that FreeDOM beats the previous state of the art by nearly 3.7 F1 points on average without requiring features over rendered pages or expensive hand-crafted features.

Submitted 21 October, 2020; originally announced October 2020.

Comments: in Proc. of KDD 2020 (Research Track). Figure 5 updated

arXiv:2005.11442

Active Learning for Skewed Data Sets

Authors: Abbas Kazerouni, Qi Zhao, Jing Xie, Sandeep Tata, Marc Najork

Abstract: Consider a sequential active learning problem where, at each round, an agent selects a batch of unlabeled data points, queries their labels and updates a binary classifier. While there exists a rich body of work on active learning in this general form, in this paper, we focus on problems with two distinguishing characteristics: severe class imbalance (skew) and small amounts of initial training data. Both of these problems occur with surprising frequency in many web applications. For instance, detecting offensive or sensitive content in online communities (pornography, violence, and hate-speech) is receiving enormous attention from industry as well as research communities. Such problems have both the characteristics we describe -- a vast majority of content is not offensive, so the number of positive examples for such content is orders of magnitude smaller than the negative examples. Furthermore, there is usually only a small amount of initial training data available when building machine-learned models to solve such problems. To address both these issues, we propose a hybrid active learning algorithm (HAL) that balances exploiting the knowledge available through the currently labeled training examples with exploring the large amount of unlabeled data available. Through simulation results, we show that HAL makes significantly better choices for what points to label when compared to strong baselines like margin-sampling. Classifiers trained on the examples selected for labeling by HAL easily outperform the baselines on target metrics (like area under the precision-recall curve) given the same budget for labeling examples. We believe HAL offers a simple, intuitive, and computationally tractable way to structure active learning for a wide range of machine learning applications.

Submitted 22 May, 2020; originally announced May 2020.

arXiv:2002.02807

doi 10.1145/3381755

Adaptive control for hindlimb locomotion in a simulated mouse through temporal cerebellar learning

Authors: T. P. Jensen, S. Tata, A. J. Ijspeert, S. Tolu

Abstract: Human beings and other vertebrates show remarkable performance and efficiency in locomotion, but the functioning of their biological control systems for locomotion is still only partially understood. The basic patterns and timing for locomotion are provided by a central pattern generator (CPG) in the spinal cord. The cerebellum is known to play an important role in adaptive locomotion. Recent studies have given insights into the error signals responsible for driving the cerebellar adaptation in locomotion. However, the question of how the cerebellar output influences the gait remains unanswered. We hypothesize that the cerebellar correction is applied to the pattern formation part of the CPG. Here, a bio-inspired control system for adaptive locomotion of the musculoskeletal system of the mouse is presented, where a cerebellar-like module adapts the step time by using the double support interlimb asymmetry as a temporal teaching signal. The control system is tested on a simulated mouse in a split-belt treadmill setup similar to those used in experiments with real mice. The results show adaptive locomotion behavior in the interlimb parameters similar to that seen in humans and mice. The control system adaptively decreases the double support asymmetry that occurs due to environmental perturbations in the split-belt protocol.

Submitted 17 February, 2020; v1 submitted 7 February, 2020; originally announced February 2020.

Comments: To be published in NICE '20: Proceedings of the 8th Annual Neuro-inspired Computational Elements Workshop. 8 pages, 13 figures

arXiv:1811.00652

Modeling IoT-aware Business Processes - A State of the Art Report

Authors: Nadja Brouns, Samir Tata, Heiko Ludwig, E. Serral Asensio, Paul Grefen

Abstract: This research report presents an analysis of the state of the art of modeling Internet of Things (IoT)-aware business processes. IoT links the physical world to the digital world. Traditionally, we would find information about events and processes in the physical world in the digital world entered by humans and humans using this information to control the physical world. In the IoT paradigm, the physical world is equipped with sensors and actuators to create a direct link with the digital world. Business processes are used to coordinate a complex environment including multiple actors for a common goal, typically in the context of administrative work. In the past few years, we have seen research efforts on the possibilities to model IoT-aware business processes, extending process coordination to real world entities directly. This set of research efforts is relatively small when compared to the overall research effort into the IoT and much of the work is still in the early research stage. To create a basis for a bridge between IoT and BPM, the goal of this report is to collect and analyze the state of the art of existing frameworks for modeling IoT-aware business processes.

Submitted 1 November, 2018; originally announced November 2018.

Comments: 42 pages

Report number: RJ 10540

Journal ref: IBM Research Report 2018

arXiv:1105.4252

Column-Oriented Storage Techniques for MapReduce

Authors: Avrilia Floratou, Jignesh Patel, Eugene Shekita, Sandeep Tata

Abstract: Users of MapReduce often run into performance problems when they scale up their workloads. Many of the problems they encounter can be overcome by applying techniques learned from over three decades of research on parallel DBMSs. However, translating these techniques to a MapReduce implementation such as Hadoop presents unique challenges that can lead to new design choices. This paper describes how column-oriented storage techniques can be incorporated in Hadoop in a way that preserves its popular programming APIs. We show that simply using binary storage formats in Hadoop can provide a 3x performance boost over the naive use of text files. We then introduce a column-oriented storage format that is compatible with the replication and scheduling constraints of Hadoop and show that it can speed up MapReduce jobs on real workloads by an order of magnitude. We also show that dealing with complex column types such as arrays, maps, and nested records, which are common in MapReduce jobs, can incur significant CPU overhead. Finally, we introduce a novel skip list column format and lazy record construction strategy that avoids deserializing unwanted records to provide an additional 1.5x performance boost. Experiments on a real intranet crawl are used to show that our column-oriented storage techniques can improve the performance of the map phase in Hadoop by as much as two orders of magnitude.

Submitted 21 May, 2011; originally announced May 2011.

Comments: VLDB2011

Report number: Proceedings of the VLDB Endowment (PVLDB), Vol. 4, No. 7, pp. 419-429 (2011)

arXiv:1103.2408

Using Paxos to Build a Scalable, Consistent, and Highly Available Datastore

Authors: Jun Rao, Eugene J. Shekita, Sandeep Tata

Abstract: Spinnaker is an experimental datastore that is designed to run on a large cluster of commodity servers in a single datacenter. It features key-based range partitioning, 3-way replication, and a transactional get-put API with the option to choose either strong or timeline consistency on reads. This paper describes Spinnaker's Paxos-based replication protocol. The use of Paxos ensures that a data partition in Spinnaker will be available for reads and writes as long as a majority of its replicas are alive. Unlike traditional master-slave replication, this is true regardless of the failure sequence that occurs. We show that Paxos replication can be competitive with alternatives that provide weaker consistency guarantees. Compared to an eventually consistent datastore, we show that Spinnaker can be as fast or even faster on reads and only 5% to 10% slower on writes.

Submitted 11 March, 2011; originally announced March 2011.

Comments: VLDB2011

Journal ref: Proceedings of the VLDB Endowment (PVLDB), Vol. 4, No. 4, pp. 243-254 (2011)

Showing 1–18 of 18 results for author: Tata, S