-
Scientific Statement Classification over arXiv.org
Authors:
Deyan Ginev,
Bruce R. Miller
Abstract:
We introduce a new classification task for scientific statements and release a large-scale dataset for supervised learning. Our resource is derived from a machine-readable representation of the arXiv.org collection of preprint articles. We explore fifty author-annotated categories and empirically motivate a task design of grouping 10.5 million annotated paragraphs into thirteen classes. We demonstrate that the task setup aligns with known success rates from the state of the art, peaking at a 0.91 F1-score via a BiLSTM encoder-decoder model. Additionally, we introduce a lexeme serialization for mathematical formulas, and observe that context-aware models could improve when also trained on the symbolic modality. Finally, we discuss the limitations of both data and task design, and outline potential directions towards increasingly complex models of scientific discourse, beyond isolated statements.
Submitted 28 August, 2019;
originally announced August 2019.
-
Strategies for Parallel Markup
Authors:
Bruce R. Miller
Abstract:
Cross-referenced parallel markup for mathematics allows the combination of both presentation and content representations while associating the components of each. Interesting applications are enabled by such an arrangement, such as interaction with parts of the presentation to manipulate and query the corresponding content, and enhanced search indexing. Although the idea of such markup is hardly new, effective techniques for creating and manipulating it are more difficult than they appear. Since the structures and tokens in the two formats often do not correspond one-to-one, decisions and heuristics must be developed to determine in which way each component refers to and is referred to by components of the other representation. Conversion between fine- and coarse-grained parallel markup complicates ID assignments. In this paper, we will describe the techniques developed for LaTeXML, a TeX/LaTeX to XML converter, to create cross-referenced parallel MathML. While we do not yet consider LaTeXML's content MathML to be useful, the current effort is a step towards that continuing goal.
Submitted 2 July, 2015;
originally announced July 2015.
-
LaTeXML 2012 - A Year of LaTeXML
Authors:
Deyan Ginev,
Bruce R. Miller
Abstract:
LaTeXML, a $\TeX$ to XML converter, is being used in a wide range of MKM applications. In this paper, we present a progress report for the 2012 calendar year. Noteworthy enhancements include: increased coverage such as Wikipedia syntax; enhanced capabilities such as embeddable JavaScript and CSS resources and RDFa support; a web service for remote processing via web-sockets; along with general accuracy and reliability improvements.
Submitted 25 April, 2014;
originally announced April 2014.
-
E-books and Graphics with LaTeXML
Authors:
Deyan Ginev,
Bruce R. Miller,
Silviu Oprea
Abstract:
Marked by the highlights of native generation of EPUB E-books and TikZ support for creating SVG images, we present an annual report of LaTeXML development in 2013. LaTeXML provides a reimplementation of the $\TeX$ parser, geared towards preserving macro semantics; it supports an array of output formats, notably HTML5, EPUB, XHTML and its own $\LaTeX$-near XML. Other highlights include enhancing performance when used inside high-throughput build-systems, via incorporating a native ZIP archive workflow, as well as a simplified installation procedure that now allows LaTeXML to be deployed as a cloud service. To this end, we also introduce an official plugin-based scheme for publishing new features that go beyond the core scope of LaTeXML, such as web services or unconventional post-processors. The software suite has now migrated to GitHub and we welcome forks and patches from the wider FLOSS community.
Submitted 25 April, 2014;
originally announced April 2014.