Showing 1–11 of 11 results for author: Keung, P

Searching in archive cs.
  1. arXiv:2311.08593

    cs.CL cs.IR

    Summarization-Based Document IDs for Generative Retrieval with Language Models

    Authors: Haoxin Li, Daniel Cheng, Phillip Keung, Jungo Kasai, Noah A. Smith

    Abstract: Generative retrieval (Wang et al., 2022; Tay et al., 2022) is a popular approach for end-to-end document retrieval that directly generates document identifiers given an input query. We introduce summarization-based document IDs, in which each document's ID is composed of an extractive summary or abstractive keyphrases generated by a language model, rather than an integer ID sequence or bags of n-g…

    Submitted 29 October, 2024; v1 submitted 14 November, 2023; originally announced November 2023.

    Comments: To appear at the NLP for Wikipedia Workshop in EMNLP 2024

  2. arXiv:2301.04761

    cs.CL cs.LG

    NarrowBERT: Accelerating Masked Language Model Pretraining and Inference

    Authors: Haoxin Li, Phillip Keung, Daniel Cheng, Jungo Kasai, Noah A. Smith

    Abstract: Large-scale language model pretraining is a very successful form of self-supervised learning in natural language processing, but it is increasingly expensive to perform as the models and pretraining corpora have become larger over time. We propose NarrowBERT, a modified transformer encoder that increases the throughput for masked language model pretraining by more than $2\times$. NarrowBERT sparsi…

    Submitted 5 June, 2023; v1 submitted 11 January, 2023; originally announced January 2023.

    Comments: To appear in ACL 2023 (main conference)

  3. arXiv:2211.16671

    cs.CL

    Domain Mismatch Doesn't Always Prevent Cross-Lingual Transfer Learning

    Authors: Daniel Edmiston, Phillip Keung, Noah A. Smith

    Abstract: Cross-lingual transfer learning without labeled target language data or parallel text has been surprisingly effective in zero-shot cross-lingual classification, question answering, unsupervised machine translation, etc. However, some recent publications have claimed that domain mismatch prevents cross-lingual transfer, and their results show that unsupervised bilingual lexicon induction (UBLI) and…

    Submitted 29 November, 2022; originally announced November 2022.

    Comments: 8 pages, 1 figure. Published/presented at LREC (2022)

    Journal ref: Proceedings of the Thirteenth Language Resources and Evaluation Conference (LREC 2022), Pages 892-899

  4. arXiv:2010.07761

    cs.CL cs.LG

    Unsupervised Bitext Mining and Translation via Self-trained Contextual Embeddings

    Authors: Phillip Keung, Julian Salazar, Yichao Lu, Noah A. Smith

    Abstract: We describe an unsupervised method to create pseudo-parallel corpora for machine translation (MT) from unaligned text. We use multilingual BERT to create source and target sentence embeddings for nearest-neighbor search and adapt the model via self-training. We validate our technique by extracting parallel sentence pairs on the BUCC 2017 bitext mining task and observe up to a 24.5 point increase (…

    Submitted 15 October, 2020; originally announced October 2020.

    Comments: To appear in the Transactions of the Association for Computational Linguistics

  5. arXiv:2010.02573

    cs.CL cs.IR cs.LG

    The Multilingual Amazon Reviews Corpus

    Authors: Phillip Keung, Yichao Lu, György Szarvas, Noah A. Smith

    Abstract: We present the Multilingual Amazon Reviews Corpus (MARC), a large-scale collection of Amazon reviews for multilingual text classification. The corpus contains reviews in English, Japanese, German, French, Spanish, and Chinese, which were collected between 2015 and 2019. Each record in the dataset contains the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized…

    Submitted 6 October, 2020; originally announced October 2020.

    Comments: To appear in EMNLP 2020

  6. arXiv:2005.00932

    cs.CL cs.LG

    Improving Non-autoregressive Neural Machine Translation with Monolingual Data

    Authors: Jiawei Zhou, Phillip Keung

    Abstract: Non-autoregressive (NAR) neural machine translation is usually done via knowledge distillation from an autoregressive (AR) model. Under this framework, we leverage large monolingual corpora to improve the NAR model's performance, with the goal of transferring the AR model's generalization ability while preventing overfitting. On top of a strong NAR baseline, our experimental results on the WMT14 E…

    Submitted 29 November, 2020; v1 submitted 2 May, 2020; originally announced May 2020.

    Comments: Published in ACL 2020

  7. arXiv:2004.15001

    cs.CL cs.LG

    Don't Use English Dev: On the Zero-Shot Cross-Lingual Evaluation of Contextual Embeddings

    Authors: Phillip Keung, Yichao Lu, Julian Salazar, Vikas Bhardwaj

    Abstract: Multilingual contextual embeddings have demonstrated state-of-the-art performance in zero-shot cross-lingual transfer learning, where multilingual BERT is fine-tuned on one source language and evaluated on a different target language. However, published results for mBERT zero-shot accuracy vary as much as 17 points on the MLDoc classification task across four papers. We show that the standard prac…

    Submitted 6 October, 2020; v1 submitted 30 April, 2020; originally announced April 2020.

    Comments: To appear in EMNLP 2020

  8. arXiv:2002.05150

    eess.AS cs.CL cs.LG cs.SD

    Attentional Speech Recognition Models Misbehave on Out-of-domain Utterances

    Authors: Phillip Keung, Wei Niu, Yichao Lu, Julian Salazar, Vikas Bhardwaj

    Abstract: We discuss the problem of echographic transcription in autoregressive sequence-to-sequence attentional architectures for automatic speech recognition, where a model produces very long sequences of repetitive outputs when presented with out-of-domain utterances. We decode audio from the British National Corpus with an attentional encoder-decoder model trained solely on the LibriSpeech corpus. We ob…

    Submitted 12 February, 2020; originally announced February 2020.

    Comments: Artifacts like our filtered Audio BNC dataset can be found at https://github.com/aws-samples/seq2seq-asr-misbehaves

  9. arXiv:1909.00153

    cs.CL cs.LG

    Adversarial Learning with Contextual Embeddings for Zero-resource Cross-lingual Classification and NER

    Authors: Phillip Keung, Yichao Lu, Vikas Bhardwaj

    Abstract: Contextual word embeddings (e.g. GPT, BERT, ELMo, etc.) have demonstrated state-of-the-art performance on various NLP tasks. Recent work with the multilingual version of BERT has shown that the model performs very well in zero-shot and zero-resource cross-lingual settings, where only labeled English data is used to finetune the model. We improve upon multilingual BERT's zero-resource cross-lingual…

    Submitted 19 March, 2020; v1 submitted 31 August, 2019; originally announced September 2019.

    Comments: In EMNLP 2019

  10. arXiv:1804.08198

    cs.CL

    A neural interlingua for multilingual machine translation

    Authors: Yichao Lu, Phillip Keung, Faisal Ladhak, Vikas Bhardwaj, Shaonan Zhang, Jason Sun

    Abstract: We incorporate an explicit neural interlingua into a multilingual encoder-decoder neural machine translation (NMT) architecture. We demonstrate that our model learns a language-independent representation by performing direct zero-shot translation (without using pivot translation), and by using the source sentence embeddings to create an English Yelp review classifier that, through the mediation of…

    Submitted 16 October, 2018; v1 submitted 22 April, 2018; originally announced April 2018.

    Comments: Accepted in WMT 18

  11. arXiv:1703.09439

    cs.CL cs.NE

    A practical approach to dialogue response generation in closed domains

    Authors: Yichao Lu, Phillip Keung, Shaonan Zhang, Jason Sun, Vikas Bhardwaj

    Abstract: We describe a prototype dialogue response generation model for the customer service domain at Amazon. The model, which is trained in a weakly supervised fashion, measures the similarity between customer questions and agent answers using a dual encoder network, a Siamese-like neural network architecture. Answer templates are extracted from embeddings derived from past agent answers, without turn-by…

    Submitted 28 March, 2017; originally announced March 2017.