-
ShopTalk: A System for Conversational Faceted Search
Authors:
Gurmeet Manku,
James Lee-Thorp,
Bhargav Kanagal,
Joshua Ainslie,
Jingchen Feng,
Zach Pearson,
Ebenezer Anjorin,
Sudeep Gandhe,
Ilya Eckstein,
Jim Rosswog,
Sumit Sanghai,
Michael Pohl,
Larry Adams,
D. Sivakumar
Abstract:
We present ShopTalk, a multi-turn conversational faceted search system for shopping that is designed to handle large and complex schemas that are beyond the scope of state-of-the-art slot-filling systems. ShopTalk decouples dialog management from fulfillment, thereby allowing the dialog understanding system to be domain-agnostic and not tied to the particular shopping application. The dialog understanding system consists of a deep-learned Contextual Language Understanding module, which interprets user utterances, and a primarily rules-based Dialog-State Tracker (DST), which updates the dialog state and formulates search requests intended for the fulfillment engine. The interface between the two modules consists of a minimal set of domain-agnostic "intent operators," which instruct the DST on how to update the dialog state. ShopTalk was deployed in 2020 on the Google Assistant for Shopping searches.
Submitted 2 September, 2021;
originally announced September 2021.
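The split described in the ShopTalk abstract, where a language-understanding module emits domain-agnostic "intent operators" and a rules-based tracker applies them to the dialog state, can be illustrated with a minimal sketch. The abstract does not name the operators or the state layout, so the operator names (`SET`, `ADD`, `REMOVE`, `CLEAR`), the facet-to-values dictionary, and both function names below are hypothetical and chosen only for illustration.

```python
def apply_operator(state, op):
    """Apply one hypothetical intent operator to a dialog state.

    state: dict mapping a facet name to a set of selected values.
    op:    tuple (name, facet, value); the operator names are assumptions,
           not the operators used by ShopTalk.
    Returns a new state; the input state is left unchanged.
    """
    name, facet, value = op
    new_state = {f: set(vals) for f, vals in state.items()}
    if name == "SET":        # replace whatever was selected for this facet
        new_state[facet] = {value}
    elif name == "ADD":      # add one more acceptable value
        new_state.setdefault(facet, set()).add(value)
    elif name == "REMOVE":   # drop one value; drop the facet if it becomes empty
        if facet in new_state:
            new_state[facet].discard(value)
            if not new_state[facet]:
                del new_state[facet]
    elif name == "CLEAR":    # forget the facet entirely
        new_state.pop(facet, None)
    else:
        raise ValueError(f"unknown operator: {name}")
    return new_state


def to_search_request(state):
    """Turn the dialog state into a plain request for a fulfillment engine."""
    return {facet: sorted(values) for facet, values in sorted(state.items())}


# A four-turn conversation expressed as operators.
state = {}
for op in [
    ("SET", "category", "shoes"),
    ("ADD", "color", "red"),
    ("ADD", "color", "blue"),
    ("REMOVE", "color", "red"),
]:
    state = apply_operator(state, op)

print(to_search_request(state))
```

Because the operators mention only facets and values, nothing in the tracker depends on the shopping domain; that is the property the abstract attributes to the interface.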
-
ReadTwice: Reading Very Large Documents with Memories
Authors:
Yury Zemlyanskiy,
Joshua Ainslie,
Michiel de Jong,
Philip Pham,
Ilya Eckstein,
Fei Sha
Abstract:
Knowledge-intensive tasks such as question answering often require assimilating information from different sections of large inputs such as books or article collections. We propose ReadTwice, a simple and effective technique that combines several strengths of prior approaches to model long-range dependencies with Transformers. The main idea is to read text in small segments, in parallel, summarizing each segment into a memory table to be used in a second read of the text. We show that the method outperforms models of comparable size on several question answering (QA) datasets and sets a new state of the art on the challenging NarrativeQA task, with questions about entire books. Source code and pre-trained checkpoints for ReadTwice can be found at https://goo.gle/research-readtwice.
Submitted 11 May, 2021; v1 submitted 10 May, 2021;
originally announced May 2021.
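The two-pass control flow described in the ReadTwice abstract (read small segments independently, summarize each into a memory table, then read again with that table available) can be sketched as follows. This is only a toy illustration of the data flow: the real model uses Transformer encoders, whereas the `summarize` function here is a stand-in that simply keeps capitalized tokens, and all function names are assumptions.

```python
def split_segments(tokens, size):
    """Cut a token list into consecutive segments of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def summarize(segment):
    """Stand-in for a learned summarizer: keep capitalized tokens as memory entries."""
    return [tok for tok in segment if tok[:1].isupper()]


def read_twice(tokens, size):
    segments = split_segments(tokens, size)
    # First read: each segment is summarized on its own, so this step
    # could run in parallel across segments.
    memory_table = [entry for seg in segments for entry in summarize(seg)]
    # Second read: every segment is processed again, now paired with the
    # memory table built from the whole input.
    return [{"segment": seg, "memory": memory_table} for seg in segments]


tokens = "Alice met Bob in Paris and later Bob left".split()
for item in read_twice(tokens, size=3):
    print(item["segment"], "| memory:", item["memory"])
```

The point of the second pass is that each segment can use information from distant segments without the model ever attending over the full input at once.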
-
FNet: Mixing Tokens with Fourier Transforms
Authors:
James Lee-Thorp,
Joshua Ainslie,
Ilya Eckstein,
Santiago Ontanon
Abstract:
We show that Transformer encoder architectures can be sped up, with limited accuracy costs, by replacing the self-attention sublayers with simple linear transformations that "mix" input tokens. These linear mixers, along with standard nonlinearities in feed-forward layers, prove competent at modeling semantic relationships in several text classification tasks. Most surprisingly, we find that replacing the self-attention sublayer in a Transformer encoder with a standard, unparameterized Fourier Transform achieves 92-97% of the accuracy of BERT counterparts on the GLUE benchmark, but trains 80% faster on GPUs and 70% faster on TPUs at standard 512 input lengths. At longer input lengths, our FNet model is significantly faster: when compared to the "efficient" Transformers on the Long Range Arena benchmark, FNet matches the accuracy of the most accurate models, while outpacing the fastest models across all sequence lengths on GPUs (and across relatively shorter lengths on TPUs). Finally, FNet has a light memory footprint and is particularly efficient at smaller model sizes; for a fixed speed and accuracy budget, small FNet models outperform Transformer counterparts.
Submitted 26 May, 2022; v1 submitted 8 May, 2021;
originally announced May 2021.
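The unparameterized Fourier mixing mentioned in the FNet abstract can be sketched in a few lines. The abstract does not spell out how the transform is applied; the sketch below assumes a discrete Fourier transform along the hidden dimension, then along the sequence dimension, keeping only the real part. It uses a naive O(n²) DFT from the standard library for readability (a real implementation would use an FFT), and the function names are illustrative.

```python
import cmath


def dft(values):
    """Naive discrete Fourier transform of a list of (complex) numbers."""
    n = len(values)
    return [
        sum(values[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]


def fourier_mix(x):
    """Mix tokens with a 2D DFT and keep the real part.

    x: list of tokens, each a list of hidden features (seq_len x hidden).
    There are no learned parameters in this sublayer.
    """
    seq_len, hidden = len(x), len(x[0])
    # Transform each token's feature vector (hidden dimension).
    rows = [dft(row) for row in x]
    # Transform each feature across tokens (sequence dimension).
    cols = [dft([rows[t][h] for t in range(seq_len)]) for h in range(hidden)]
    # Transpose back to seq_len x hidden and drop the imaginary part.
    return [[cols[h][t].real for h in range(hidden)] for t in range(seq_len)]


x = [[1.0, 0.0],
     [0.0, 0.0]]
print(fourier_mix(x))
```

In this example a single nonzero input entry spreads to every output position, which is the sense in which the transform "mixes" tokens without any attention weights.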
-
DOCENT: Learning Self-Supervised Entity Representations from Large Document Collections
Authors:
Yury Zemlyanskiy,
Sudeep Gandhe,
Ruining He,
Bhargav Kanagal,
Anirudh Ravula,
Juraj Gottweis,
Fei Sha,
Ilya Eckstein
Abstract:
This paper explores learning rich self-supervised entity representations from large amounts of the associated text. Once pre-trained, these models become applicable to multiple entity-centric tasks such as ranked retrieval, knowledge base completion, question answering, and more. Unlike other methods that harvest self-supervision signals based merely on a local context within a sentence, we radically expand the notion of context to include any available text related to an entity. This enables a new class of powerful, high-capacity representations that can ultimately distill much of the useful information about an entity from multiple text sources, without any human supervision.
We present several training strategies that, unlike prior approaches, learn to jointly predict words and entities -- strategies we compare experimentally on downstream tasks in the TV-Movies domain, such as MovieLens tag prediction from user reviews and natural language movie search. As evidenced by results, our models match or outperform competitive baselines, sometimes with little or no fine-tuning, and can scale to very large corpora.
Finally, we make our datasets and pre-trained models publicly available. This includes Reviews2Movielens (see https://goo.gle/research-docent ), mapping the up to 1B word corpus of Amazon movie reviews (He and McAuley, 2016) to MovieLens tags (Harper and Konstan, 2016), as well as Reddit Movie Suggestions (see https://urikz.github.io/docent ) with natural language queries and corresponding community recommendations.
Submitted 25 February, 2021;
originally announced February 2021.
-
Google COVID-19 Search Trends Symptoms Dataset: Anonymization Process Description (version 1.0)
Authors:
Shailesh Bavadekar,
Andrew Dai,
John Davis,
Damien Desfontaines,
Ilya Eckstein,
Katie Everett,
Alex Fabrikant,
Gerardo Flores,
Evgeniy Gabrilovich,
Krishna Gadepalli,
Shane Glass,
Rayman Huang,
Chaitanya Kamath,
Dennis Kraft,
Akim Kumok,
Hinali Marfatia,
Yael Mayer,
Benjamin Miller,
Adam Pearce,
Irippuge Milinda Perera,
Venky Ramachandran,
Karthik Raman,
Thomas Roessler,
Izhak Shafran,
Tomer Shekel
, et al. (5 additional authors not shown)
Abstract:
This report describes the aggregation and anonymization process applied to the initial version of the COVID-19 Search Trends symptoms dataset (published at https://goo.gle/covid19symptomdataset on September 2, 2020), a publicly available dataset that shows aggregated, anonymized trends in Google searches for symptoms (and some related topics). The anonymization process is designed to protect the daily symptom search activity of every user with $\varepsilon$-differential privacy for $\varepsilon$ = 1.68.
Submitted 2 September, 2020;
originally announced September 2020.
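To give a feel for what an $\varepsilon$ = 1.68 guarantee means for a single count, here is a minimal sketch of the textbook Laplace mechanism. The abstract above does not say which mechanism or sensitivity bound the dataset uses, so both the choice of Laplace noise and the sensitivity of 1 are assumptions made for illustration; the function names are hypothetical as well.

```python
import random


def laplace_scale(epsilon, sensitivity=1.0):
    """Noise scale b = sensitivity / epsilon for the Laplace mechanism."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon


def sample_laplace(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize_count(true_count, epsilon, rng, sensitivity=1.0):
    """Return the count plus Laplace noise calibrated to epsilon."""
    return true_count + sample_laplace(laplace_scale(epsilon, sensitivity), rng)


rng = random.Random(0)  # fixed seed so the illustration is repeatable
print("noise scale:", laplace_scale(1.68))
print("noisy count:", privatize_count(100, 1.68, rng))
```

A smaller $\varepsilon$ gives a larger noise scale and therefore stronger protection at the cost of accuracy; with sensitivity 1 and $\varepsilon$ = 1.68 the scale is 1/1.68, roughly 0.6.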