Showing 1–7 of 7 results for author: Chousa, K

  1. arXiv:2405.09017  [pdf, other]

    cs.CL

    A Japanese-Chinese Parallel Corpus Using Crowdsourcing for Web Mining

    Authors: Masaaki Nagata, Makoto Morishita, Katsuki Chousa, Norihito Yasuda

    Abstract: Using crowdsourcing, we collected more than 10,000 URL pairs (parallel top page pairs) of bilingual websites that contain parallel documents and created a Japanese-Chinese parallel corpus of 4.6M sentence pairs from these websites. We used a Japanese-Chinese bilingual dictionary of 160K word pairs for document and sentence alignment. We then used high-quality 1.2M Japanese-Chinese sentence pairs t…

    Submitted 14 May, 2024; originally announced May 2024.

    Comments: Work in progress

  2. arXiv:2404.09002  [pdf, other]

    cs.CL

    WikiSplit++: Easy Data Refinement for Split and Rephrase

    Authors: Hayato Tsukagoshi, Tsutomu Hirao, Makoto Morishita, Katsuki Chousa, Ryohei Sasano, Koichi Takeda

    Abstract: The task of Split and Rephrase, which splits a complex sentence into multiple simple sentences with the same meaning, improves readability and enhances the performance of downstream tasks in natural language processing (NLP). However, while Split and Rephrase can be improved using a text-to-text generation approach that applies encoder-decoder models fine-tuned with a large-scale dataset, it still…

    Submitted 13 April, 2024; originally announced April 2024.

    Comments: Accepted at LREC-COLING 2024

  3. arXiv:2202.12607  [pdf, ps, other]

    cs.CL

    JParaCrawl v3.0: A Large-scale English-Japanese Parallel Corpus

    Authors: Makoto Morishita, Katsuki Chousa, Jun Suzuki, Masaaki Nagata

    Abstract: Most current machine translation models are mainly trained with parallel corpora, and their translation accuracy largely depends on the quality and quantity of the corpora. Although there are billions of parallel sentences for a few language pairs, effectively dealing with most language pairs is difficult due to a lack of publicly available parallel corpora. This paper creates a large parallel cor…

    Submitted 28 February, 2022; v1 submitted 25 February, 2022; originally announced February 2022.

    Comments: 7 pages

  4. arXiv:2106.05450  [pdf, other]

    cs.CL

    Input Augmentation Improves Constrained Beam Search for Neural Machine Translation: NTT at WAT 2021

    Authors: Katsuki Chousa, Makoto Morishita

    Abstract: This paper describes our systems that were submitted to the restricted translation task at WAT 2021. In this task, the systems are required to output translated sentences that contain all given word constraints. Our system combined input augmentation and constrained beam search algorithms. Through experiments, we found that this combination significantly improves translation accuracy and can save…

    Submitted 9 June, 2021; originally announced June 2021.

    Comments: 9 pages, 4 figures, WAT 2021 Restricted Translation Task

  5. arXiv:2004.14517  [pdf, ps, other]

    cs.CL

    Bilingual Text Extraction as Reading Comprehension

    Authors: Katsuki Chousa, Masaaki Nagata, Masaaki Nishino

    Abstract: In this paper, we propose a method to extract bilingual texts automatically from noisy parallel corpora by framing the problem as a token-level span prediction, such as SQuAD-style Reading Comprehension. To extract a span of the target document that is a translation of a given source sentence (span), we use either QANet or multilingual BERT. QANet can be trained for a specific parallel corpus from…

    Submitted 29 April, 2020; originally announced April 2020.

    Comments: 7 pages

  6. arXiv:1911.11933  [pdf, other]

    cs.CL

    Simultaneous Neural Machine Translation using Connectionist Temporal Classification

    Authors: Katsuki Chousa, Katsuhito Sudoh, Satoshi Nakamura

    Abstract: Simultaneous machine translation is a variant of machine translation that starts the translation process before the end of an input. This task faces a trade-off between translation accuracy and latency. We have to determine when we start the translation for observed inputs so far, to achieve good practical performance. In this work, we propose a neural machine translation method to determine this…

    Submitted 26 November, 2019; originally announced November 2019.

  7. arXiv:1807.11219  [pdf, ps, other]

    cs.CL

    Training Neural Machine Translation using Word Embedding-based Loss

    Authors: Katsuki Chousa, Katsuhito Sudoh, Satoshi Nakamura

    Abstract: In neural machine translation (NMT), the computational cost at the output layer increases with the size of the target-side vocabulary. Using a limited-size vocabulary instead may cause a significant decrease in translation quality. This trade-off is derived from a softmax-based loss function that handles in-dictionary words independently, in which word similarity is not considered. In this paper,…

    Submitted 30 July, 2018; originally announced July 2018.