Search | arXiv e-print repository

arXiv:2005.04961 [pdf, ps, other]

Embedding-based Scientific Literature Discovery in a Text Editor Application

Authors: Onur Gökçe, Jonathan Prada, Nikola I. Nikolov, Nianlong Gu, Richard H. R. Hahnloser

Abstract: Each claim in a research paper requires all relevant prior knowledge to be discovered, assimilated, and appropriately cited. However, despite the availability of powerful search engines and sophisticated text editing software, discovering relevant papers and integrating the knowledge into a manuscript remain complex tasks associated with high cognitive load. To define comprehensive search queries… ▽ More Each claim in a research paper requires all relevant prior knowledge to be discovered, assimilated, and appropriately cited. However, despite the availability of powerful search engines and sophisticated text editing software, discovering relevant papers and integrating the knowledge into a manuscript remain complex tasks associated with high cognitive load. To define comprehensive search queries requires strong motivation from authors, irrespective of their familiarity with the research field. Moreover, switching between independent applications for literature discovery, bibliography management, reading papers, and writing text burdens authors further and interrupts their creative process. Here, we present a web application that combines text editing and literature discovery in an interactive user interface. The application is equipped with a search engine that couples Boolean keyword filtering with nearest neighbor search over text embeddings, providing a discovery experience tuned to an author's manuscript and his interests. Our application aims to take a step towards more enjoyable and effortless academic writing. The demo of the application (https://SciEditorDemo2020.herokuapp.com/) and a short video tutorial (https://youtu.be/pkdVU60IcRc) are available online. △ Less

Submitted 11 May, 2020; originally announced May 2020.

arXiv:2004.14788 [pdf, other]

Character-Level Translation with Self-attention

Authors: Yingqiang Gao, Nikola I. Nikolov, Yuhuang Hu, Richard H. R. Hahnloser

Abstract: We explore the suitability of self-attention models for character-level neural machine translation. We test the standard transformer model, as well as a novel variant in which the encoder block combines information from nearby characters using convolutions. We perform extensive experiments on WMT and UN datasets, testing both bilingual and multilingual translation to English using up to three inpu… ▽ More We explore the suitability of self-attention models for character-level neural machine translation. We test the standard transformer model, as well as a novel variant in which the encoder block combines information from nearby characters using convolutions. We perform extensive experiments on WMT and UN datasets, testing both bilingual and multilingual translation to English using up to three input languages (French, Spanish, and Chinese). Our transformer variant consistently outperforms the standard transformer at the character-level and converges faster while learning more robust character-level alignments. △ Less

Submitted 30 April, 2020; originally announced April 2020.

Comments: ACL 2020

arXiv:2004.03965 [pdf, other]

Rapformer: Conditional Rap Lyrics Generation with Denoising Autoencoders

Authors: Nikola I. Nikolov, Eric Malmi, Curtis G. Northcutt, Loreto Parisi

Abstract: The ability to combine symbols to generate language is a defining characteristic of human intelligence, particularly in the context of artistic story-telling through lyrics. We develop a method for synthesizing a rap verse based on the content of any text (e.g., a news article), or for augmenting pre-existing rap lyrics. Our method, called Rapformer, is based on training a Transformer-based denois… ▽ More The ability to combine symbols to generate language is a defining characteristic of human intelligence, particularly in the context of artistic story-telling through lyrics. We develop a method for synthesizing a rap verse based on the content of any text (e.g., a news article), or for augmenting pre-existing rap lyrics. Our method, called Rapformer, is based on training a Transformer-based denoising autoencoder to reconstruct rap lyrics from content words extracted from the lyrics, trying to preserve the essential meaning, while matching the target style. Rapformer features a novel BERT-based paraphrasing scheme for rhyme enhancement which increases the average rhyme density of output lyrics by 10%. Experimental results on three diverse input domains show that Rapformer is capable of generating technically fluent verses that offer a good trade-off between content preservation and style transfer. Furthermore, a Turing-test-like experiment reveals that Rapformer fools human lyrics experts 25% of the time. △ Less

Submitted 13 December, 2020; v1 submitted 8 April, 2020; originally announced April 2020.

Comments: Accepted at INLG 2020

arXiv:1907.12951 [pdf, other]

Abstractive Document Summarization without Parallel Data

Authors: Nikola I. Nikolov, Richard H. R. Hahnloser

Abstract: Abstractive summarization typically relies on large collections of paired articles and summaries. However, in many cases, parallel data is scarce and costly to obtain. We develop an abstractive summarization system that relies only on large collections of example summaries and non-matching articles. Our approach consists of an unsupervised sentence extractor that selects salient sentences to inclu… ▽ More Abstractive summarization typically relies on large collections of paired articles and summaries. However, in many cases, parallel data is scarce and costly to obtain. We develop an abstractive summarization system that relies only on large collections of example summaries and non-matching articles. Our approach consists of an unsupervised sentence extractor that selects salient sentences to include in the final summary, as well as a sentence abstractor that is trained on pseudo-parallel and synthetic data, that paraphrases each of the extracted sentences. We perform an extensive evaluation of our method: on the CNN/DailyMail benchmark, on which we compare our approach to fully supervised baselines, as well as on the novel task of automatically generating a press release from a scientific journal article, which is well suited for our system. We show promising performance on both tasks, without relying on any article-summary pairs. △ Less

Submitted 3 March, 2020; v1 submitted 30 July, 2019; originally announced July 2019.

Comments: LREC 2020

arXiv:1907.10873 [pdf, other]

Summary Refinement through Denoising

Authors: Nikola I. Nikolov, Alessandro Calmanovici, Richard H. R. Hahnloser

Abstract: We propose a simple method for post-processing the outputs of a text summarization system in order to refine its overall quality. Our approach is to train text-to-text rewriting models to correct information redundancy errors that may arise during summarization. We train on synthetically generated noisy summaries, testing three different types of noise that introduce out-of-context information wit… ▽ More We propose a simple method for post-processing the outputs of a text summarization system in order to refine its overall quality. Our approach is to train text-to-text rewriting models to correct information redundancy errors that may arise during summarization. We train on synthetically generated noisy summaries, testing three different types of noise that introduce out-of-context information within each summary. When applied on top of extractive and abstractive summarization baselines, our summary denoising models yield metric improvements while reducing redundancy. △ Less

Submitted 25 July, 2019; originally announced July 2019.

Comments: RANLP 2019

arXiv:1810.08237 [pdf, other]

Large-scale Hierarchical Alignment for Data-driven Text Rewriting

Authors: Nikola I. Nikolov, Richard H. R. Hahnloser

Abstract: We propose a simple unsupervised method for extracting pseudo-parallel monolingual sentence pairs from comparable corpora representative of two different text styles, such as news articles and scientific papers. Our approach does not require a seed parallel corpus, but instead relies solely on hierarchical search over pre-trained embeddings of documents and sentences. We demonstrate the effectiven… ▽ More We propose a simple unsupervised method for extracting pseudo-parallel monolingual sentence pairs from comparable corpora representative of two different text styles, such as news articles and scientific papers. Our approach does not require a seed parallel corpus, but instead relies solely on hierarchical search over pre-trained embeddings of documents and sentences. We demonstrate the effectiveness of our method through automatic and extrinsic evaluation on text simplification from the normal to the Simple Wikipedia. We show that pseudo-parallel sentences extracted with our method not only supplement existing parallel data, but can even lead to competitive performance on their own. △ Less

Submitted 25 July, 2019; v1 submitted 18 October, 2018; originally announced October 2018.

Comments: RANLP 2019

arXiv:1805.03330 [pdf, other]

Character-level Chinese-English Translation through ASCII Encoding

Authors: Nikola I. Nikolov, Yuhuang Hu, Mi Xue Tan, Richard H. R. Hahnloser

Abstract: Character-level Neural Machine Translation (NMT) models have recently achieved impressive results on many language pairs. They mainly do well for Indo-European language pairs, where the languages share the same writing system. However, for translating between Chinese and English, the gap between the two different writing systems poses a major challenge because of a lack of systematic correspondenc… ▽ More Character-level Neural Machine Translation (NMT) models have recently achieved impressive results on many language pairs. They mainly do well for Indo-European language pairs, where the languages share the same writing system. However, for translating between Chinese and English, the gap between the two different writing systems poses a major challenge because of a lack of systematic correspondence between the individual linguistic units. In this paper, we enable character-level NMT for Chinese, by breaking down Chinese characters into linguistic units similar to that of Indo-European languages. We use the Wubi encoding scheme, which preserves the original shape and semantic information of the characters, while also being reversible. We show promising results from training Wubi-based models on the character- and subword-level with recurrent as well as convolutional models. △ Less

Submitted 27 August, 2018; v1 submitted 8 May, 2018; originally announced May 2018.

Comments: 7 pages, 3 figures, 3rd Conference on Machine Translation (WMT18), 2018

arXiv:1804.08875 [pdf, other]

Data-driven Summarization of Scientific Articles

Authors: Nikola I. Nikolov, Michael Pfeiffer, Richard H. R. Hahnloser

Abstract: Data-driven approaches to sequence-to-sequence modelling have been successfully applied to short text summarization of news articles. Such models are typically trained on input-summary pairs consisting of only a single or a few sentences, partially due to limited availability of multi-sentence training data. Here, we propose to use scientific articles as a new milestone for text summarization: lar… ▽ More Data-driven approaches to sequence-to-sequence modelling have been successfully applied to short text summarization of news articles. Such models are typically trained on input-summary pairs consisting of only a single or a few sentences, partially due to limited availability of multi-sentence training data. Here, we propose to use scientific articles as a new milestone for text summarization: large-scale training data come almost for free with two types of high-quality summaries at different levels - the title and the abstract. We generate two novel multi-sentence summarization datasets from scientific articles and test the suitability of a wide range of existing extractive and abstractive neural network-based summarization approaches. Our analysis demonstrates that scientific papers are suitable for data-driven text summarization. Our results could serve as valuable benchmarks for scaling sequence-to-sequence models to very long sequences. △ Less

Submitted 24 April, 2018; originally announced April 2018.

Comments: 8 pages, 3 figures. 7th International Workshop on Mining Scientific Publications, LREC 2018

Showing 1–8 of 8 results for author: Nikolov, N I