-
Many-to-English Machine Translation Tools, Data, and Pretrained Models
Authors:
Thamme Gowda,
Zhao Zhang,
Chris A Mattmann,
Jonathan May
Abstract:
While there are more than 7000 languages in the world, most translation research efforts have targeted a few high-resource languages. Commercial translation systems support only one hundred languages or fewer, and do not make these models available for transfer to low-resource languages. In this work, we present useful tools for machine translation research: MTData, NLCodec, and RTG. We demonstrate their usefulness by creating a multilingual neural machine translation model capable of translating from 500 source languages to English. We make this multilingual model readily downloadable and usable as a service, or as a parent model for transfer learning to even lower-resource languages.
Submitted 1 July, 2021; v1 submitted 1 April, 2021;
originally announced April 2021.
-
Technology Readiness Levels for Machine Learning Systems
Authors:
Alexander Lavin,
Ciarán M. Gilligan-Lee,
Alessya Visnjic,
Siddha Ganju,
Dava Newman,
Atılım Güneş Baydin,
Sujoy Ganguly,
Danny Lange,
Amit Sharma,
Stephan Zheng,
Eric P. Xing,
Adam Gibson,
James Parr,
Chris Mattmann,
Yarin Gal
Abstract:
The development and deployment of machine learning (ML) systems can be executed easily with modern tools, but the process is typically rushed and means-to-an-end. The lack of diligence can lead to technical debt, scope creep and misaligned objectives, model misuse and failures, and expensive consequences. Engineering systems, on the other hand, follow well-defined processes and testing standards to streamline development for high-quality, reliable results. The extreme is spacecraft systems, where mission critical measures and robustness are ingrained in the development process. Drawing on experience in both spacecraft engineering and ML (from research through product across domain areas), we have developed a proven systems engineering approach for machine learning development and deployment. Our "Machine Learning Technology Readiness Levels" (MLTRL) framework defines a principled process to ensure robust, reliable, and responsible systems while being streamlined for ML workflows, including key distinctions from traditional software engineering. Even more, MLTRL defines a lingua franca for people across teams and organizations to work collaboratively on artificial intelligence and machine learning technologies. Here we describe the framework and elucidate it with several real world use-cases of developing ML methods from basic research through productization and deployment, in areas such as medical diagnostics, consumer computer vision, satellite imagery, and particle physics.
Submitted 29 November, 2021; v1 submitted 11 January, 2021;
originally announced January 2021.
-
MARVIN: An Open Machine Learning Corpus and Environment for Automated Machine Learning Primitive Annotation and Execution
Authors:
Chris A. Mattmann,
Sujen Shah,
Brian Wilson
Abstract:
In this demo paper, we introduce the DARPA D3M program for automatic machine learning (ML) and JPL's MARVIN tool, which provides an environment to locate, annotate, and execute machine learning primitives for use in ML pipelines. MARVIN is a web-based application and associated back-end interface written in Python that enables composition of ML pipelines from hundreds of primitives from the world of Scikit-Learn, Keras, DL4J and other widely used libraries. MARVIN allows for the creation of Docker containers that run on Kubernetes clusters within DARPA to provide an execution environment for automated machine learning. MARVIN currently contains over 400 datasets and challenge problems from a wide array of ML domains, ranging from routine classification and regression to advanced video/image classification and remote sensing.
Submitted 11 August, 2018;
originally announced August 2018.
-
Measurement Context Extraction from Text: Discovering Opportunities and Gaps in Earth Science
Authors:
Kyle Hundman,
Chris A. Mattmann
Abstract:
We propose Marve, a system for extracting measurement values, units, and related words from natural language text. Marve uses conditional random fields (CRF) to identify measurement values and units, followed by a rule-based system to find related entities, descriptors and modifiers within a sentence. Sentence tokens are represented by an undirected graphical model, and rules are based on part-of-speech and word dependency patterns connecting values and units to contextual words. Marve is unique in its focus on measurement context, and early experimentation demonstrates Marve's ability to generate high-precision extractions with strong recall. We also discuss Marve's role in refining measurement requirements for NASA's proposed HyspIRI mission, a hyperspectral infrared imaging satellite that will study the world's ecosystems. In general, our work with HyspIRI demonstrates the value of semantic measurement extractions in characterizing quantitative discussion contained in large corpora of natural language text. These extractions accelerate broad, cross-cutting research and expose scientists to new algorithmic approaches and experimental nuances. They also facilitate identification of scientific opportunities enabled by HyspIRI, leading to more efficient scientific investment and research.
Submitted 11 October, 2017;
originally announced October 2017.
-
Scalable Pooled Time Series of Big Video Data from the Deep Web
Authors:
Chris Mattmann,
Madhav Sharan
Abstract:
We contribute a scalable implementation of Ryoo et al.'s Pooled Time Series algorithm from CVPR 2015. The updated algorithm has been evaluated on a large and diverse dataset of approximately 6800 videos collected from a crawl of the deep web related to human trafficking on DARPA's MEMEX effort. We describe the properties of Pooled Time Series and the motivation for using it to relate videos collected from the deep web. We highlight issues that we found while running Pooled Time Series on larger datasets and discuss solutions for those issues. Our solution centers on re-imagining Pooled Time Series as a Hadoop-based algorithm in which we compute portions of the eventual solution in parallel on large commodity clusters. We demonstrate that our new Hadoop-based algorithm works well on the 6800-video dataset and shares all of the properties described in the CVPR 2015 paper. We suggest avenues of future work in the project.
Submitted 21 October, 2016;
originally announced October 2016.
-
Ensemble Maximum Entropy Classification and Linear Regression for Author Age Prediction
Authors:
Joey Hong,
Chris Mattmann,
Paul Ramirez
Abstract:
The evolution of the internet has created an abundance of unstructured data on the web, a significant part of which is textual. The task of author profiling seeks to find the demographics of people solely from their linguistic and content-based features in text. The ability to describe traits of authors clearly has applications in fields such as security and forensics, as well as marketing. Rather than treating age prediction purely as a classification problem, we also frame it as a regression problem and use an ensemble chain method that incorporates the power of both classification and regression to learn the author's exact age.
Submitted 4 October, 2016;
originally announced October 2016.