-
Domain Shift Analysis in Chest Radiographs Classification in a Veterans Healthcare Administration Population
Authors:
Mayanka Chandrashekar,
Ian Goethert,
Md Inzamam Ul Haque,
Benjamin McMahon,
Sayera Dhaubhadel,
Kathryn Knight,
Joseph Erdos,
Donna Reagan,
Caroline Taylor,
Peter Kuzmak,
John Michael Gaziano,
Eileen McAllister,
Lauren Costa,
Yuk-Lam Ho,
Kelly Cho,
Suzanne Tamang,
Samah Fodeh-Jarad,
Olga S. Ovchinnikova,
Amy C. Justice,
Jacob Hinkle,
Ioana Danciu
Abstract:
Objectives: This study aims to assess the impact of domain shift on chest X-ray classification accuracy and to analyze the influence of ground truth label quality and demographic factors such as age group, sex, and study year. Materials and Methods: We used a DenseNet121 model pretrained on the MIMIC-CXR dataset for deep learning-based multilabel classification using ground truth labels from radiology reports extracted using the CheXpert and CheXbert labelers. We compared the performance of the 14 chest X-ray labels on the MIMIC-CXR and Veterans Healthcare Administration chest X-ray dataset (VA-CXR). The VA-CXR dataset comprises over 259k chest X-ray images spanning the years 2010 to 2022. Results: The validation of ground truth and the assessment of multi-label classification performance across various NLP extraction tools revealed that the VA-CXR dataset exhibited lower disagreement rates than the MIMIC-CXR dataset. Additionally, there were notable differences in AUC scores between models utilizing CheXpert and CheXbert. When evaluating multi-label classification performance across different datasets, minimal domain shift was observed in unseen datasets, except for the label "Enlarged Cardiomediastinum." The study year's subgroup analyses exhibited the most significant variations in multi-label classification model performance. These findings underscore the importance of considering domain shifts in chest X-ray classification tasks, particularly concerning study years. Conclusion: Our study reveals the significant impact of domain shift and demographic factors on chest X-ray classification, emphasizing the need for improved transfer learning and equitable model development. Addressing these challenges is crucial for advancing medical imaging and enhancing patient care.
Submitted 30 July, 2024;
originally announced July 2024.
-
Development and Validation of a Machine Learning Algorithm for Clinical Wellness Visit Classification in Cats and Dogs
Authors:
Donald Szlosek,
Michael Coyne,
Julia Riggot,
Kevin Knight,
DJ McCrann,
Dave Kincaid
Abstract:
Early disease detection in veterinary care relies on identifying subclinical abnormalities in asymptomatic animals during wellness visits. This study introduces an algorithm designed to distinguish between wellness and other veterinary visits. The purpose of this study is to validate the use of a visit classification algorithm compared to manual classification of veterinary visits by three board-certified veterinarians. Using a dataset of 11,105 clinical visits from 2012 to 2017 involving 655 animals (85.3% canines and 14.7% felines) across 544 U.S. veterinary establishments, the model was trained using a Gradient Boosting Machine model. Three validators were tasked with classifying 400 visits, including both wellness and other types of visits, selected randomly from the same database used for initial algorithm training, aiming to maintain consistency and relevance between the training and application phases; visit classifications were subsequently categorized into "wellness" or "other" based on majority consensus among validators to assess the algorithm's performance in identifying wellness visits. The algorithm demonstrated a specificity of 0.94 (95% CI: 0.91 to 0.96), implying its accuracy in distinguishing non-wellness visits. The algorithm had a sensitivity of 0.86 (95% CI: 0.80 to 0.92), indicating its ability to correctly identify wellness visits as compared to the annotations provided by veterinary experts. The balanced accuracy, calculated as 0.90 (95% CI: 0.87 to 0.93), further confirms the algorithm's overall effectiveness. The algorithm exhibits strong specificity and sensitivity, ensuring accurate identification of a high proportion of wellness visits. Overall, this algorithm holds promise for advancing research on preventive care's role in subclinical disease identification, but prospective studies are needed for validation.
Submitted 14 June, 2024;
originally announced June 2024.
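Note on the balanced accuracy reported in the abstract above: the 0.90 figure is consistent with the usual definition of balanced accuracy as the mean of sensitivity and specificity. The following is a minimal sketch of that arithmetic, assuming the standard definition; the function name `balanced_accuracy` is illustrative and does not come from the paper.

```python
def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Mean of sensitivity and specificity (standard definition, assumed here)."""
    for value in (sensitivity, specificity):
        if not 0.0 <= value <= 1.0:
            raise ValueError("sensitivity and specificity must lie in [0, 1]")
    return (sensitivity + specificity) / 2.0


# Point estimates quoted in the abstract: sensitivity 0.86, specificity 0.94.
reported = balanced_accuracy(0.86, 0.94)
print(f"balanced accuracy = {reported:.2f}")
```

The confidence interval quoted in the abstract cannot be reproduced this way, since it depends on the underlying visit counts, which the abstract does not give.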
-
VISION: Toward a Standardized Process for Radiology Image Management at the National Level
Authors:
Kathryn Knight,
Ioana Danciu,
Olga Ovchinnikova,
Jacob Hinkle,
Mayanka Chandra Shekar,
Debangshu Mukherjee,
Eileen McAllister,
Caitlin Rizy,
Kelly Cho,
Amy C. Justice,
Joseph Erdos,
Peter Kuzmak,
Lauren Costa,
Yuk-Lam Ho,
Reddy Madipadga,
Suzanne Tamang,
Ian Goethert
Abstract:
The compilation and analysis of radiological images poses numerous challenges for researchers. The sheer volume of data as well as the computational needs of algorithms capable of operating on images are extensive. Additionally, the assembly of these images alone is difficult, as these exams may differ widely in terms of clinical context, structured annotation available for model training, modality, and patient identifiers. In this paper, we describe our experiences and challenges in establishing a trusted collection of radiology images linked to the United States Department of Veterans Affairs (VA) electronic health record database. We also discuss implications in making this repository research-ready for medical investigators. Key insights include uncovering the specific procedures required for transferring images from a clinical to a research-ready environment, as well as roadblocks and bottlenecks in this process that may hinder future efforts at automation.
Submitted 29 April, 2024;
originally announced April 2024.
-
Leveraging Large Language Models to Extract Information on Substance Use Disorder Severity from Clinical Notes: A Zero-shot Learning Approach
Authors:
Maria Mahbub,
Gregory M. Dams,
Sudarshan Srinivasan,
Caitlin Rizy,
Ioana Danciu,
Jodie Trafton,
Kathryn Knight
Abstract:
Substance use disorder (SUD) poses a major concern due to its detrimental effects on health and society. SUD identification and treatment depend on a variety of factors such as severity, co-determinants (e.g., withdrawal symptoms), and social determinants of health. Existing diagnostic coding systems used by American insurance providers, like the International Classification of Diseases (ICD-10), lack granularity for certain diagnoses, but clinicians will add this granularity (as that found within the Diagnostic and Statistical Manual of Mental Disorders classification or DSM-5) as supplemental unstructured text in clinical notes. Traditional natural language processing (NLP) methods face limitations in accurately parsing such diverse clinical language. Large Language Models (LLMs) offer promise in overcoming these challenges by adapting to diverse language patterns. This study investigates the application of LLMs for extracting severity-related information for various SUD diagnoses from clinical notes. We propose a workflow employing zero-shot learning of LLMs with carefully crafted prompts and post-processing techniques. Through experimentation with Flan-T5, an open-source LLM, we demonstrate its superior recall compared to the rule-based approach. Focusing on 11 categories of SUD diagnoses, we show the effectiveness of LLMs in extracting severity information, contributing to improved risk assessment and treatment planning for SUD patients.
Submitted 18 March, 2024;
originally announced March 2024.
-
Question-Answering System Extracts Information on Injection Drug Use from Clinical Notes
Authors:
Maria Mahbub,
Ian Goethert,
Ioana Danciu,
Kathryn Knight,
Sudarshan Srinivasan,
Suzanne Tamang,
Karine Rozenberg-Ben-Dror,
Hugo Solares,
Susana Martins,
Jodie Trafton,
Edmon Begoli,
Gregory Peterson
Abstract:
Background: Injection drug use (IDU) is a dangerous health behavior that increases mortality and morbidity. Identifying IDU early and initiating harm reduction interventions can benefit individuals at risk. However, extracting IDU behaviors from patients' electronic health records (EHR) is difficult because there is no International Classification of Disease (ICD) code and the only place IDU information can be indicated is unstructured free-text clinical notes. Although natural language processing can efficiently extract this information from unstructured data, there are no validated tools. Methods: To address this gap in clinical information, we design and demonstrate a question-answering (QA) framework to extract information on IDU from clinical notes. Our framework involves two main steps: (1) generating a gold-standard QA dataset and (2) developing and testing the QA model. We utilize 2323 clinical notes of 1145 patients sourced from the VA Corporate Data Warehouse to construct the gold-standard dataset for developing and evaluating the QA model. We also demonstrate the QA model's ability to extract IDU-related information on temporally out-of-distribution data. Results: Here we show that for a strict match between gold-standard and predicted answers, the QA model achieves 51.65% F1 score. For a relaxed match between the gold-standard and predicted answers, the QA model obtains 78.03% F1 score, along with 85.38% Precision and 79.02% Recall scores. Moreover, the QA model demonstrates consistent performance when subjected to temporally out-of-distribution data. Conclusions: Our study introduces a QA framework designed to extract IDU information from clinical notes, aiming to enhance the accurate and efficient detection of people who inject drugs, extract relevant information, and ultimately facilitate informed patient care.
Submitted 28 December, 2023; v1 submitted 15 May, 2023;
originally announced May 2023.
-
F*** workflows: when parts of FAIR are missing
Authors:
Sean R. Wilkinson,
Greg Eisenhauer,
Anuj J. Kapadia,
Kathryn Knight,
Jeremy Logan,
Patrick Widener,
Matthew Wolf
Abstract:
The FAIR principles for scientific data (Findable, Accessible, Interoperable, Reusable) are also relevant to other digital objects such as research software and scientific workflows that operate on scientific data. The FAIR principles can be applied to the data being handled by a scientific workflow as well as the processes, software, and other infrastructure which are necessary to specify and execute a workflow. The FAIR principles were designed as guidelines, rather than rules, that would allow for differences in standards for different communities and for different degrees of compliance. There are many practical considerations which impact the level of FAIR-ness that can actually be achieved, including policies, traditions, and technologies. Because of these considerations, obstacles are often encountered during the workflow lifecycle that trace directly to shortcomings in the implementation of the FAIR principles. Here, we detail some cases, without naming names, in which data and workflows were Findable but otherwise lacking in areas commonly needed and expected by modern FAIR methods, tools, and users. We describe how some of these problems, all of which were overcome successfully, have motivated us to push on systems and approaches for fully FAIR workflows.
Submitted 19 September, 2022;
originally announced September 2022.
-
Two Approaches to Building Collaborative, Task-Oriented Dialog Agents through Self-Play
Authors:
Arkady Arkhangorodsky,
Scot Fang,
Victoria Knight,
Ajay Nagesh,
Maria Ryskina,
Kevin Knight
Abstract:
Task-oriented dialog systems are often trained on human/human dialogs, such as collected from Wizard-of-Oz interfaces. However, human/human corpora are frequently too small for supervised training to be effective. This paper investigates two approaches to training agent-bots and user-bots through self-play, in which they autonomously explore an API environment, discovering communication strategies that enable them to solve the task. We give empirical results for both reinforcement learning and game-theoretic equilibrium finding.
Submitted 20 September, 2021;
originally announced September 2021.
-
MeetDot: Videoconferencing with Live Translation Captions
Authors:
Arkady Arkhangorodsky,
Christopher Chu,
Scot Fang,
Yiqi Huang,
Denglin Jiang,
Ajay Nagesh,
Boliang Zhang,
Kevin Knight
Abstract:
We present MeetDot, a videoconferencing system with live translation captions overlaid on screen. The system aims to facilitate conversation between people who speak different languages, thereby reducing communication barriers between multilingual participants. Currently, our system supports speech and captions in 4 languages and combines automatic speech recognition (ASR) and machine translation (MT) in a cascade. We use the re-translation strategy to translate the streamed speech, resulting in caption flicker. Additionally, our system has very strict latency requirements to have acceptable call quality. We implement several features to enhance user experience and reduce their cognitive load, such as smooth scrolling captions and reducing caption flicker. The modular architecture allows us to integrate different ASR and MT services in our backend. Our system provides an integrated evaluation suite to optimize key intrinsic evaluation metrics such as accuracy, latency and erasure. Finally, we present an innovative cross-lingual word-guessing game as an extrinsic evaluation metric to measure end-to-end system performance. We plan to make our system open-source for research purposes.
Submitted 20 September, 2021;
originally announced September 2021.
-
Learning Mathematical Properties of Integers
Authors:
Maria Ryskina,
Kevin Knight
Abstract:
Embedding words in high-dimensional vector spaces has proven valuable in many natural language applications. In this work, we investigate whether similarly-trained embeddings of integers can capture concepts that are useful for mathematical applications. We probe the integer embeddings for mathematical knowledge, apply them to a set of numerical reasoning tasks, and show that by learning the representations from mathematical sequence data, we can substantially improve over number embeddings learned from English text corpora.
Submitted 15 September, 2021;
originally announced September 2021.
-
What Clinical Trials Can Teach Us about the Development of More Resilient AI for Cybersecurity
Authors:
Edmon Begoli,
Robert A. Bridges,
Sean Oesch,
Kathryn E. Knight
Abstract:
Policy-mandated, rigorously administered scientific testing is needed to provide transparency into the efficacy of artificial intelligence-based (AI-based) cyber defense tools for consumers and to prioritize future research and development. In this article, we propose a model that is informed by our experience, urged forward by massive scale cyberattacks, and inspired by parallel developments in the biomedical field and the unprecedentedly fast development of new vaccines to combat global pathogens.
Submitted 13 May, 2021;
originally announced May 2021.
-
A Hybrid Task-Oriented Dialog System with Domain and Task Adaptive Pretraining
Authors:
Boliang Zhang,
Ying Lyu,
Ning Ding,
Tianhao Shen,
Zhaoyang Jia,
Kun Han,
Kevin Knight
Abstract:
This paper describes our submission for the End-to-end Multi-domain Task Completion Dialog shared task at the 9th Dialog System Technology Challenge (DSTC-9). Participants in the shared task build an end-to-end task completion dialog system which is evaluated by human evaluation and a user simulator based automatic evaluation. Different from traditional pipelined approaches where modules are optimized individually and suffer from cascading failure, we propose an end-to-end dialog system that 1) uses Generative Pretraining 2 (GPT-2) as the backbone to jointly solve Natural Language Understanding, Dialog State Tracking, and Natural Language Generation tasks, 2) adopts Domain and Task Adaptive Pretraining to tailor GPT-2 to the dialog domain before finetuning, 3) utilizes heuristic pre/post-processing rules that greatly simplify the prediction tasks and improve generalizability, and 4) equips a fault tolerance module to correct errors and inappropriate responses. Our proposed method significantly outperforms baselines and ties for first place in the official evaluation. We make our source code publicly available.
Submitted 8 February, 2021;
originally announced February 2021.
-
Why Neural Machine Translation Prefers Empty Outputs
Authors:
Xing Shi,
Yijun Xiao,
Kevin Knight
Abstract:
We investigate why neural machine translation (NMT) systems assign high probability to empty translations. We find two explanations. First, label smoothing makes correct-length translations less confident, making it easier for the empty translation to finally outscore them. Second, NMT systems use the same, high-frequency EoS word to end all target sentences, regardless of length. This creates an implicit smoothing that increases zero-length translations. Using different EoS types in target sentences of different lengths exposes and eliminates this implicit smoothing.
Submitted 24 December, 2020;
originally announced December 2020.
-
MUSE: Textual Attributes Guided Portrait Painting Generation
Authors:
Xiaodan Hu,
Pengfei Yu,
Kevin Knight,
Heng Ji,
Bo Li,
Honghui Shi
Abstract:
We propose a novel approach, MUSE, to illustrate textual attributes visually via portrait generation. MUSE takes a set of attributes written in text, in addition to facial features extracted from a photo of the subject as input. We propose 11 attribute types to represent inspirations from a subject's profile, emotion, story, and environment. We propose a novel stacked neural network architecture by extending an image-to-image generative model to accept textual attributes. Experiments show that our approach significantly outperforms several state-of-the-art methods without using textual attributes, with Inception Score increased by 6% and Fréchet Inception Distance (FID) score decreased by 11%, respectively. We also propose a new attribute reconstruction metric to evaluate whether the generated portraits preserve the subject's attributes. Experiments show that our approach can accurately illustrate 78% of textual attributes, which also help MUSE capture the subject in a more creative and expressive way.
Submitted 19 September, 2021; v1 submitted 9 November, 2020;
originally announced November 2020.
-
DiDi's Machine Translation System for WMT2020
Authors:
Tanfang Chen,
Weiwei Wang,
Wenyang Wei,
Xing Shi,
Xiangang Li,
Jieping Ye,
Kevin Knight
Abstract:
This paper describes DiDi AI Labs' submission to the WMT2020 news translation shared task. We participate in the translation direction of Chinese->English. In this direction, we use the Transformer as our baseline model, and integrate several techniques for model enhancement, including data filtering, data selection, back-translation, fine-tuning, model ensembling, and re-ranking. As a result, our submission achieves a BLEU score of 36.6 in Chinese->English.
Submitted 16 October, 2020;
originally announced October 2020.
-
ReviewRobot: Explainable Paper Review Generation based on Knowledge Synthesis
Authors:
Qingyun Wang,
Qi Zeng,
Lifu Huang,
Kevin Knight,
Heng Ji,
Nazneen Fatema Rajani
Abstract:
To assist the human review process, we build a novel ReviewRobot to automatically assign a review score and write comments for multiple categories such as novelty and meaningful comparison. A good review needs to be knowledgeable, namely that the comments should be constructive and informative to help improve the paper; and explainable by providing detailed evidence. ReviewRobot achieves these goals via three steps: (1) We perform domain-specific Information Extraction to construct a knowledge graph (KG) from the target paper under review, a related work KG from the papers cited by the target paper, and a background KG from a large collection of previous papers in the domain. (2) By comparing these three KGs, we predict a review score and detailed structured knowledge as evidence for each review category. (3) We carefully select and generalize human review sentences into templates, and apply these templates to transform the review scores and evidence into natural language comments. Experimental results show that our review score predictor reaches 71.4%-100% accuracy. Human assessment by domain experts shows that 41.7%-70.5% of the comments generated by ReviewRobot are valid and constructive, and better than human-written ones 20% of the time. Thus, ReviewRobot can serve as an assistant for paper reviewers, program chairs and authors.
Submitted 3 December, 2020; v1 submitted 12 October, 2020;
originally announced October 2020.
-
MEEP: An Open-Source Platform for Human-Human Dialog Collection and End-to-End Agent Training
Authors:
Arkady Arkhangorodsky,
Amittai Axelrod,
Christopher Chu,
Scot Fang,
Yiqi Huang,
Ajay Nagesh,
Xing Shi,
Boliang Zhang,
Kevin Knight
Abstract:
We create a new task-oriented dialog platform (MEEP) where agents are given considerable freedom in terms of utterances and API calls, but are constrained to work within a push-button environment. We include facilities for collecting human-human dialog corpora, and for training automatic agents in an end-to-end fashion. We demonstrate MEEP with a dialog assistant that lets users specify trip destinations.
Submitted 9 October, 2020;
originally announced October 2020.
-
Solving Historical Dictionary Codes with a Neural Language Model
Authors:
Christopher Chu,
Raphael Valenti,
Kevin Knight
Abstract:
We solve difficult word-based substitution codes by constructing a decoding lattice and searching that lattice with a neural language model. We apply our method to a set of enciphered letters exchanged between US Army General James Wilkinson and agents of the Spanish Crown in the late 1700s and early 1800s, obtained from the US Library of Congress. We are able to decipher 75.1% of the cipher-word tokens correctly.
Submitted 9 October, 2020;
originally announced October 2020.
-
Learning to Pronounce Chinese Without a Pronunciation Dictionary
Authors:
Christopher Chu,
Scot Fang,
Kevin Knight
Abstract:
We demonstrate a program that learns to pronounce Chinese text in Mandarin, without a pronunciation dictionary. From non-parallel streams of Chinese characters and Chinese pinyin syllables, it establishes a many-to-many mapping between characters and pronunciations. Using unsupervised methods, the program effectively deciphers writing into speech. Its token-level character-to-syllable accuracy is 89%, which significantly exceeds the 22% accuracy of prior work.
Submitted 9 October, 2020;
originally announced October 2020.
-
Automated Empathy Detection for Oncology Encounters
Authors:
Zhuohao Chen,
James Gibson,
Ming-Chang Chiu,
Qiaohong Hu,
Tara K Knight,
Daniella Meeker,
James A Tulsky,
Kathryn I Pollak,
Shrikanth Narayanan
Abstract:
Empathy involves understanding other people's situation, perspective, and feelings. In clinical interactions, it helps clinicians establish rapport with a patient and support patient-centered care and decision making. Understanding physician communication through observation of audio-recorded encounters is largely carried out with manual annotation and analysis. However, manual annotation has a prohibitively high cost. In this paper, a multimodal system is proposed for the first time to automatically detect empathic interactions in recordings of real-world face-to-face oncology encounters that might accelerate manual processes. An automatic speech and language processing pipeline is employed to segment and diarize the audio as well as for transcription of speech into text. Lexical and acoustic features are derived to help detect both empathic opportunities offered by the patient, and the expressed empathy by the oncologist. We make the empathy predictions using Support Vector Machines (SVMs) and evaluate the performance on different combinations of features in terms of average precision (AP).
Submitted 1 July, 2020;
originally announced July 2020.
-
Parallel Corpus Filtering via Pre-trained Language Models
Authors:
Boliang Zhang,
Ajay Nagesh,
Kevin Knight
Abstract:
Web-crawled data provides a good source of parallel corpora for training machine translation models. It is automatically obtained, but extremely noisy, and recent work shows that neural machine translation systems are more sensitive to noise than traditional statistical machine translation methods. In this paper, we propose a novel approach to filter out noisy sentence pairs from web-crawled corpora via pre-trained language models. We measure sentence parallelism by leveraging the multilingual capability of BERT and use the Generative Pre-training (GPT) language model as a domain filter to balance data domains. We evaluate the proposed method on the WMT 2018 Parallel Corpus Filtering shared task and on our own web-crawled Japanese-Chinese parallel corpus, which we make publicly available. Our method significantly outperforms baselines and achieves a new state-of-the-art. In an unsupervised setting, our method achieves comparable performance to the top-1 supervised method.
Submitted 13 May, 2020;
originally announced May 2020.
-
Translating Translationese: A Two-Step Approach to Unsupervised Machine Translation
Authors:
Nima Pourdamghani,
Nada Aldarrab,
Marjan Ghazvininejad,
Kevin Knight,
Jonathan May
Abstract:
Given a rough, word-by-word gloss of a source language sentence, target language natives can uncover the latent, fully-fluent rendering of the translation. In this work we explore this intuition by breaking translation into a two-step process: generating a rough gloss by means of a dictionary and then 'translating' the resulting pseudo-translation, or 'Translationese', into a fully fluent translation. We build our Translationese decoder once from a mish-mash of parallel data that has the target language in common and then can build dictionaries on demand using unsupervised techniques, resulting in rapidly generated unsupervised neural MT systems for many source languages. We apply this process to 14 test languages, obtaining better or comparable translation results on high-resource languages than previously published unsupervised MT studies, and obtaining good quality results for low-resource languages that have never been used in an unsupervised MT scenario.
Submitted 11 June, 2019;
originally announced June 2019.
-
One SQL to Rule Them All
Authors:
Edmon Begoli,
Tyler Akidau,
Fabian Hueske,
Julian Hyde,
Kathryn Knight,
Kenneth Knowles
Abstract:
Real-time data analysis and management are increasingly critical for today's businesses. SQL is the de facto lingua franca for these endeavors, yet support for robust streaming analysis and management with SQL remains limited. Many approaches restrict semantics to a reduced subset of features and/or require a suite of non-standard constructs. Additionally, use of event timestamps to provide native support for analyzing events according to when they actually occurred is not pervasive, and often comes with important limitations. We present a three-part proposal for integrating robust streaming into the SQL standard, namely: (1) time-varying relations as a foundation for classical tables as well as streaming data, (2) event time semantics, (3) a limited set of optional keyword extensions to control the materialization of time-varying query results. Motivated and illustrated using examples and lessons learned from implementations in Apache Calcite, Apache Flink, and Apache Beam, we show how with these minimal additions it is possible to utilize the complete suite of standard SQL semantics to perform robust stream processing.
Submitted 28 May, 2019;
originally announced May 2019.
-
PaperRobot: Incremental Draft Generation of Scientific Ideas
Authors:
Qingyun Wang,
Lifu Huang,
Zhiying Jiang,
Kevin Knight,
Heng Ji,
Mohit Bansal,
Yi Luan
Abstract:
We present PaperRobot, which performs as an automatic research assistant by (1) conducting deep understanding of a large collection of human-written papers in a target domain and constructing comprehensive background knowledge graphs (KGs); (2) creating new ideas by predicting links from the background KGs, by combining graph attention and contextual text attention; (3) incrementally writing some key elements of a new paper based on memory-attention networks: from the input title along with predicted related entities to generate a paper abstract, from the abstract to generate conclusion and future work, and finally from future work to generate a title for a follow-on paper. Turing Tests, where a biomedical domain expert is asked to compare a system output and a human-authored string, show that PaperRobot-generated abstracts, conclusion and future work sections, and new titles are chosen over human-written ones up to 30%, 24% and 12% of the time, respectively.
Submitted 31 May, 2019; v1 submitted 20 May, 2019;
originally announced May 2019.
-
Plan-And-Write: Towards Better Automatic Storytelling
Authors:
Lili Yao,
Nanyun Peng,
Ralph Weischedel,
Kevin Knight,
Dongyan Zhao,
Rui Yan
Abstract:
Automatic storytelling is challenging since it requires generating long, coherent natural language that describes a sensible sequence of events. Despite considerable efforts on automatic story generation in the past, prior work either is restricted in plot planning, or can only generate stories in a narrow domain. In this paper, we explore open-domain story generation that writes stories given a title (topic) as input. We propose a plan-and-write hierarchical generation framework that first plans a storyline, and then generates a story based on the storyline. We compare two planning strategies. The dynamic schema interweaves story planning and its surface realization in text, while the static schema plans out the entire storyline before generating stories. Experiments show that with explicit storyline planning, the generated stories are more diverse, coherent, and on topic than those generated without creating a full plan, according to both automatic and human evaluations.
Submitted 19 February, 2019; v1 submitted 14 November, 2018;
originally announced November 2018.
-
Decipherment of Historical Manuscript Images
Authors:
Xusen Yin,
Nada Aldarrab,
Beáta Megyesi,
Kevin Knight
Abstract:
European libraries and archives are filled with enciphered manuscripts from the early modern period. These include military and diplomatic correspondence, records of secret societies, private letters, and so on. Although they are enciphered with classical cryptographic algorithms, their contents are unavailable to working historians. We therefore attack the problem of automatically converting cipher manuscript images into plaintext. We develop unsupervised models for character segmentation, character-image clustering, and decipherment of cluster sequences. We experiment with both pipelined and joint models, and we give empirical results for multiple ciphers.
Submitted 2 June, 2019; v1 submitted 9 October, 2018;
originally announced October 2018.
-
Describing a Knowledge Base
Authors:
Qingyun Wang,
Xiaoman Pan,
Lifu Huang,
Boliang Zhang,
Zhiying Jiang,
Heng Ji,
Kevin Knight
Abstract:
We aim to automatically generate natural language descriptions about an input structured knowledge base (KB). We build our generation framework based on a pointer network which can copy facts from the input KB, and add two attention mechanisms: (i) slot-aware attention to capture the association between a slot type and its corresponding slot value; and (ii) a new table position self-attention to capture the inter-dependencies among related slots. For evaluation, besides standard metrics including BLEU, METEOR, and ROUGE, we propose a KB reconstruction based metric by extracting a KB from the generation output and comparing it with the input KB. We also create a new data set which includes 106,216 pairs of structured KBs and their corresponding natural language descriptions for two distinct entity types. Experiments show that our approach significantly outperforms state-of-the-art methods. The reconstructed KB achieves 68.8% - 72.6% F-score.
Submitted 30 September, 2018; v1 submitted 5 September, 2018;
originally announced September 2018.
-
Augmenting Statistical Machine Translation with Subword Translation of Out-of-Vocabulary Words
Authors:
Nelson F. Liu,
Jonathan May,
Michael Pust,
Kevin Knight
Abstract:
Most statistical machine translation systems cannot translate words that are unseen in the training data. However, humans can translate many classes of out-of-vocabulary (OOV) words (e.g., novel morphological variants, misspellings, and compounds) without context by using orthographic clues. Following this observation, we describe and evaluate several general methods for OOV translation that use only subword information. We pose the OOV translation problem as a standalone task and intrinsically evaluate our approaches on fourteen typologically diverse languages across varying resource levels. Adding OOV translators to a statistical machine translation system yields consistent BLEU gains (0.5 points on average, and up to 2.0) for all fourteen languages, especially in low-resource scenarios.
Submitted 16 August, 2018;
originally announced August 2018.
-
Fast Locality Sensitive Hashing for Beam Search on GPU
Authors:
Xing Shi,
Shizhen Xu,
Kevin Knight
Abstract:
We present a GPU-based Locality Sensitive Hashing (LSH) algorithm to speed up beam search for sequence models. We utilize the winner-take-all (WTA) hash, which is based on relative ranking order of hidden dimensions and thus resilient to perturbations in numerical values. Our algorithm is designed by fully considering the underlying architecture of CUDA-enabled GPUs (Algorithm/Architecture Co-design): 1) A parallel Cuckoo hash table is applied for LSH code lookup (guaranteed O(1) lookup time); 2) Candidate lists are shared across beams to maximize the parallelism; 3) Top frequent words are merged into candidate lists to improve performance. Experiments on 4 large-scale neural machine translation models demonstrate that our algorithm can achieve up to 4x speedup on softmax module, and 2x overall speedup without hurting BLEU on GPU.
Submitted 2 June, 2018;
originally announced June 2018.
-
Modeling Naive Psychology of Characters in Simple Commonsense Stories
Authors:
Hannah Rashkin,
Antoine Bosselut,
Maarten Sap,
Kevin Knight,
Yejin Choi
Abstract:
Understanding a narrative requires reading between the lines and reasoning about the unspoken but obvious implications about events and people's mental states - a capability that is trivial for humans but remarkably hard for machines. To facilitate research addressing this challenge, we introduce a new annotation framework to explain naive psychology of story characters as fully-specified chains of mental states with respect to motivations and emotional reactions. Our work presents a new large-scale dataset with rich low-level annotations and establishes baseline performance on several new tasks, suggesting avenues for future research.
Submitted 16 May, 2018;
originally announced May 2018.
-
Paper Abstract Writing through Editing Mechanism
Authors:
Qingyun Wang,
Zhihao Zhou,
Lifu Huang,
Spencer Whitehead,
Boliang Zhang,
Heng Ji,
Kevin Knight
Abstract:
We present a paper abstract writing system based on an attentive neural sequence-to-sequence model that can take a title as input and automatically generate an abstract. We design a novel Writing-editing Network that can attend to both the title and the previously generated abstract drafts and then iteratively revise and polish the abstract. With two series of Turing tests, where the human judges are asked to distinguish the system-generated abstracts from human-written ones, our system passes Turing tests by junior domain experts at a rate up to 30% and by non-experts at a rate up to 80%.
Submitted 15 May, 2018;
originally announced May 2018.
-
Multi-lingual Common Semantic Space Construction via Cluster-consistent Word Embedding
Authors:
Lifu Huang,
Kyunghyun Cho,
Boliang Zhang,
Heng Ji,
Kevin Knight
Abstract:
We construct a multilingual common semantic space based on distributional semantics, where words from multiple languages are projected into a shared space to enable knowledge and resource transfer across languages. Beyond word alignment, we introduce multiple cluster-level alignments and enforce the word clusters to be consistently distributed across multiple languages. We exploit three signals for clustering: (1) neighbor words in the monolingual word embedding space; (2) character-level information; and (3) linguistic properties (e.g., apposition, locative suffix) derived from linguistic structure knowledge bases available for thousands of languages. We introduce a new cluster-consistent correlational neural network to construct the common semantic space by aligning words as well as clusters. Intrinsic evaluation on monolingual and multilingual QVEC tasks shows our approach achieves significantly higher correlation with linguistic features than state-of-the-art multi-lingual embedding learning methods do. Using low-resource language name tagging as a case study for extrinsic evaluation, our approach achieves up to 24.5% absolute F-score gain over the state of the art.
Submitted 20 April, 2018;
originally announced April 2018.
-
Learning from Past Mistakes: Improving Automatic Speech Recognition Output via Noisy-Clean Phrase Context Modeling
Authors:
Prashanth Gurunath Shivakumar,
Haoqi Li,
Kevin Knight,
Panayiotis Georgiou
Abstract:
Automatic speech recognition (ASR) systems often make unrecoverable errors due to subsystem pruning (acoustic, language and pronunciation models); for example pruning words due to acoustics using short-term context, prior to rescoring with long-term context based on linguistics. In this work we model ASR as a phrase-based noisy transformation channel and propose an error correction system that can learn from the aggregate errors of all the independent modules constituting the ASR and attempt to invert those. The proposed system can exploit long-term context using a neural network language model and can better choose between existing ASR output possibilities as well as re-introduce previously pruned or unseen (out-of-vocabulary) phrases. It provides corrections under poorly performing ASR conditions without degrading any accurate transcriptions; such corrections are greater on top of out-of-domain and mismatched data ASR. Our system consistently provides improvements over the baseline ASR, even when baseline is further optimized through recurrent neural network language model rescoring. This demonstrates that any ASR improvements can be exploited independently and that our proposed system can potentially still provide benefits on highly optimized ASR. Finally, we present an extensive analysis of the type of errors corrected by our system.
Submitted 28 March, 2019; v1 submitted 7 February, 2018;
originally announced February 2018.
-
Recurrent Neural Networks as Weighted Language Recognizers
Authors:
Yining Chen,
Sorcha Gilroy,
Andreas Maletti,
Jonathan May,
Kevin Knight
Abstract:
We investigate the computational complexity of various problems for simple recurrent neural networks (RNNs) as formal models for recognizing weighted languages. We focus on the single-layer, ReLU-activation, rational-weight RNNs with softmax, which are commonly used in natural language processing applications. We show that most problems for such RNNs are undecidable, including consistency, equivalence, minimization, and the determination of the highest-weighted string. However, for consistent RNNs the last problem becomes decidable, although the solution length can surpass all computable bounds. If additionally the string is limited to polynomial length, the problem becomes NP-complete and APX-hard. In summary, this shows that approximations and heuristic algorithms are necessary in practical applications of those RNNs.
Submitted 4 March, 2018; v1 submitted 14 November, 2017;
originally announced November 2017.
-
Unsupervised Neural Hidden Markov Models
Authors:
Ke Tran,
Yonatan Bisk,
Ashish Vaswani,
Daniel Marcu,
Kevin Knight
Abstract:
In this work, we present the first results for neuralizing an Unsupervised Hidden Markov Model. We evaluate our approach on tag induction. Our approach outperforms existing generative models and is competitive with the state-of-the-art though with a simpler model easily extended to include additional context.
Submitted 28 September, 2016;
originally announced September 2016.
-
Transfer Learning for Low-Resource Neural Machine Translation
Authors:
Barret Zoph,
Deniz Yuret,
Jonathan May,
Kevin Knight
Abstract:
The encoder-decoder framework for neural machine translation (NMT) has been shown effective in large data scenarios, but is much less effective for low-resource languages. We present a transfer learning method that significantly improves Bleu scores across a range of low-resource languages. Our key idea is to first train a high-resource language pair (the parent model), then transfer some of the learned parameters to the low-resource pair (the child model) to initialize and constrain training. Using our transfer learning method we improve baseline NMT models by an average of 5.6 Bleu on four low-resource language pairs. Ensembling and unknown word replacement add another 2 Bleu which brings the NMT performance on low-resource machine translation close to a strong syntax based machine translation (SBMT) system, exceeding its performance on one language pair. Additionally, using the transfer learning model for re-scoring, we can improve the SBMT system by an average of 1.3 Bleu, improving the state-of-the-art on low-resource machine translation.
Submitted 7 April, 2016;
originally announced April 2016.
-
Multi-Source Neural Translation
Authors:
Barret Zoph,
Kevin Knight
Abstract:
We build a multi-source machine translation model and train it to maximize the probability of a target English string given French and German sources. Using the neural encoder-decoder framework, we explore several combination methods and report up to +4.8 Bleu increases on top of a very strong attention-based neural translation model.
Submitted 4 January, 2016;
originally announced January 2016.
-
Using Syntax-Based Machine Translation to Parse English into Abstract Meaning Representation
Authors:
Michael Pust,
Ulf Hermjakob,
Kevin Knight,
Daniel Marcu,
Jonathan May
Abstract:
We present a parser for Abstract Meaning Representation (AMR). We treat English-to-AMR conversion within the framework of string-to-tree, syntax-based machine translation (SBMT). To make this work, we transform the AMR structure into a form suitable for the mechanics of SBMT and useful for modeling. We introduce an AMR-specific language model and add data and features drawn from semantic resources. Our resulting AMR parser improves upon state-of-the-art results by 7 Smatch points.
Submitted 28 April, 2015; v1 submitted 24 April, 2015;
originally announced April 2015.
-
Empirical Methods for Compound Splitting
Authors:
Philipp Koehn,
Kevin Knight
Abstract:
Compounded words are a challenge for NLP applications such as machine translation (MT). We introduce methods to learn splitting rules from monolingual and parallel corpora. We evaluate them against a gold standard and measure their impact on performance of statistical MT systems. Results show accuracy of 99.1% and performance gains for MT of 0.039 BLEU on a German-English noun phrase translation task.
Submitted 22 February, 2003;
originally announced February 2003.
-
Machine Transliteration
Authors:
Kevin Knight,
Jonathan Graehl
Abstract:
It is challenging to translate names and technical terms across languages with different alphabets and sound inventories. These items are commonly transliterated, i.e., replaced with approximate phonetic equivalents. For example, "computer" in English comes out as "konpyuutaa" in Japanese. Translating such items from Japanese back to English is even more challenging, and of practical interest, as transliterated items make up the bulk of text phrases not found in bilingual dictionaries. We describe and evaluate a method for performing backwards transliterations by machine. This method uses a generative model, incorporating several distinct stages in the transliteration process.
Submitted 14 April, 1997;
originally announced April 1997.
-
Unification-Based Glossing
Authors:
Vasileios Hatzivassiloglou,
Kevin Knight
Abstract:
We present an approach to syntax-based machine translation that combines unification-style interpretation with statistical processing. This approach enables us to translate any Japanese newspaper article into English, with quality far better than a word-for-word translation. Novel ideas include the use of feature structures to encode word lattices and the use of unification to compose and manipulate lattices. Unification also allows us to specify abstract features that delay target-language synthesis until enough source-language information is assembled. Our statistical component enables us to search efficiently among competing translations and locate those with high English fluency.
Submitted 9 June, 1995;
originally announced June 1995.
-
Two-level, Many-Paths Generation
Authors:
Kevin Knight,
Vasileios Hatzivassiloglou
Abstract:
Large-scale natural language generation requires the integration of vast amounts of knowledge: lexical, grammatical, and conceptual. A robust generator must be able to operate well even when pieces of knowledge are missing. It must also be robust against incomplete or inaccurate inputs. To attack these problems, we have built a hybrid generator, in which gaps in symbolic knowledge are filled by statistical methods. We describe algorithms and show experimental results. We also discuss how the hybrid generation model can be used to simplify current generators and enhance their portability, even when perfect knowledge is in principle obtainable.
Submitted 9 June, 1995;
originally announced June 1995.
-
Filling Knowledge Gaps in a Broad-Coverage Machine Translation System
Authors:
Kevin Knight,
Ishwar Chander,
Matthew Haines,
Vasileios Hatzivassiloglou,
Eduard Hovy,
Masayo Iida,
Steve K. Luk,
Richard Whitney,
Kenji Yamada
Abstract:
Knowledge-based machine translation (KBMT) techniques yield high quality in domains with detailed semantic models, limited vocabulary, and controlled input grammar. Scaling up along these dimensions means acquiring large knowledge resources. It also means behaving reasonably when definitive knowledge is not yet available. This paper describes how we can fill various KBMT knowledge gaps, often using robust statistical techniques. We describe quantitative and qualitative results from JAPANGLOSS, a broad-coverage Japanese-English MT system.
Submitted 9 June, 1995;
originally announced June 1995.
-
Integrating Knowledge Bases and Statistics in MT
Authors:
Kevin Knight,
Ishwar Chander,
Matthew Haines,
Vasileios Hatzivassiloglou,
Eduard Hovy,
Masayo Iida,
Steve K. Luk,
Akitoshi Okumura,
Richard Whitney,
Kenji Yamada
Abstract:
We summarize recent machine translation (MT) research at the Information Sciences Institute of USC, and we describe its application to the development of a Japanese-English newspaper MT system. Our work aims at scaling up grammar-based, knowledge-based MT techniques. This scale-up involves the use of statistical methods, both in acquiring effective knowledge resources and in making reasonable linguistic choices in the face of knowledge gaps.
Submitted 5 September, 1994;
originally announced September 1994.
-
Building a Large-Scale Knowledge Base for Machine Translation
Authors:
Kevin Knight,
Steve K. Luk
Abstract:
Knowledge-based machine translation (KBMT) systems have achieved excellent results in constrained domains, but have not yet scaled up to newspaper text. The reason is that knowledge resources (lexicons, grammar rules, world models) must be painstakingly handcrafted from scratch. One of the hypotheses being tested in the PANGLOSS machine translation project is whether or not these resources can be semi-automatically acquired on a very large scale. This paper focuses on the construction of a large ontology (or knowledge base, or world model) for supporting KBMT. It contains representations for some 70,000 commonly encountered objects, processes, qualities, and relations. The ontology was constructed by merging various online dictionaries, semantic networks, and bilingual resources, through semi-automatic methods. Some of these methods (e.g., conceptual matching of semantic taxonomies) are broadly applicable to problems of importing/exporting knowledge from one KB to another. Other methods (e.g., bilingual matching) allow a knowledge engineer to build up an index to a KB in a second language, such as Spanish or Japanese.
Submitted 29 July, 1994;
originally announced July 1994.
-
Automated Postediting of Documents
Authors:
Kevin Knight,
Ishwar Chander
Abstract:
Large amounts of low- to medium-quality English texts are now being produced by machine translation (MT) systems, optical character readers (OCR), and non-native speakers of English. Most of this text must be postedited by hand before it sees the light of day. Improving text quality is tedious work, but its automation has not received much research attention. Anyone who has postedited a technical report or thesis written by a non-native speaker of English knows the potential of an automated postediting system. For the case of MT-generated text, we argue for the construction of postediting modules that are portable across MT systems, as an alternative to hardcoding improvements inside any one system. As an example, we have built a complete self-contained postediting module for the task of article selection (a, an, the) for English noun phrases. This is a notoriously difficult problem for Japanese-English MT. Our system contains over 200,000 rules derived automatically from online text resources. We report on learning algorithms, accuracy, and comparisons with human performance.
Submitted 29 July, 1994;
originally announced July 1994.