-
Audience Reach of Scientific Data Visualizations in Planetarium-Screened Films
Authors:
Kalina Borkiewicz,
Eric Jensen,
Yiwen Miao,
Stuart Levy,
J. P. Naiman,
Jeff Carpenter,
Katherine E. Isaacs
Abstract:
Quantifying the global reach of planetarium dome shows presents significant challenges due to the lack of standardized viewership tracking mechanisms across diverse planetarium venues. We present an analysis of the global impact of dome shows, using data on four documentary films from a single visualization lab. Specifically, we designed and administered a viewership survey of four long-running shows that contained cinematic scientific visualizations. Reported survey data show that between 1.2 and 2.6 million people have viewed these four films across the 68 responding planetariums (mean: 1.9 million). When we include estimates and extrapolate to the 315 planetariums that licensed these shows, we arrive at an estimate of 16.5 to 24.1 million people having seen these films (mean: 20.3 million).
Submitted 30 October, 2024;
originally announced November 2024.
-
pathfinder: A Semantic Framework for Literature Review and Knowledge Discovery in Astronomy
Authors:
Kartheik G. Iyer,
Mikaeel Yunus,
Charles O'Neill,
Christine Ye,
Alina Hyk,
Kiera McCormick,
Ioana Ciuca,
John F. Wu,
Alberto Accomazzi,
Simone Astarita,
Rishabh Chakrabarty,
Jesse Cranney,
Anjalie Field,
Tirthankar Ghosal,
Michele Ginolfi,
Marc Huertas-Company,
Maja Jablonska,
Sandor Kruk,
Huiling Liu,
Gabriel Marchidan,
Rohit Mistry,
J. P. Naiman,
J. E. G. Peek,
Mugdha Polimera,
Sergio J. Rodriguez, et al. (5 additional authors not shown)
Abstract:
The exponential growth of astronomical literature poses significant challenges for researchers navigating and synthesizing general insights or even domain-specific knowledge. We present Pathfinder, a machine learning framework designed to enable literature review and knowledge discovery in astronomy, focusing on semantic searching with natural language instead of syntactic searches with keywords. Utilizing state-of-the-art large language models (LLMs) and a corpus of 350,000 peer-reviewed papers from the Astrophysics Data System (ADS), Pathfinder offers an innovative approach to scientific inquiry and literature exploration. Our framework couples advanced retrieval techniques with LLM-based synthesis to search astronomical literature by semantic context as a complement to currently existing methods that use keywords or citation graphs. It addresses complexities of jargon, named entities, and temporal aspects through time-based and citation-based weighting schemes. We demonstrate the tool's versatility through case studies, showcasing its application in various research scenarios. The system's performance is evaluated using custom benchmarks, including single-paper and multi-paper tasks. Beyond literature review, Pathfinder offers unique capabilities for reformatting answers in ways that are accessible to various audiences (e.g. in a different language or as simplified text), visualizing research landscapes, and tracking the impact of observatories and methodologies. This tool represents a significant advancement in applying AI to astronomical research, aiding researchers at all career stages in navigating modern astronomy literature.
Submitted 2 August, 2024;
originally announced August 2024.
-
Large Synthetic Data from the arXiv for OCR Post Correction of Historic Scientific Articles
Authors:
Jill P. Naiman,
Morgan G. Cosillo,
Peter K. G. Williams,
Alyssa Goodman
Abstract:
Scientific articles published prior to the "age of digitization" (~1997) require Optical Character Recognition (OCR) to transform scanned documents into machine-readable text, a process that often produces errors. We develop a pipeline for the generation of a synthetic ground truth/OCR dataset to correct the OCR results of the astrophysics literature holdings of the NASA Astrophysics Data System (ADS). By mining the arXiv we create, to the authors' knowledge, the largest scientific synthetic ground truth/OCR post-correction dataset, consisting of 203,354,393 character pairs. We provide baseline models trained with this dataset and find mean improvements in character and word error rates of 7.71% and 18.82%, respectively, for historical OCR text. When used to classify parts of sentences as inline math, we find a classification F1 score of 77.82%. Interactive dashboards to explore the dataset are available online: https://readingtimemachine.github.io/projects/1-ocr-groundtruth-may2023, and data and code, within the limitations of our agreement with the arXiv, are hosted on GitHub: https://github.com/ReadingTimeMachine/ocr_post_correction.
Submitted 20 September, 2023;
originally announced September 2023.
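The character and word error rates quoted in the abstract above are commonly defined as an edit distance divided by the length of the ground truth. The following is a minimal sketch of that textbook definition, not the authors' evaluation code; the function names `edit_distance`, `character_error_rate`, and `word_error_rate` are hypothetical, and whitespace tokenization for words is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of tokens)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i items of ref
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(truth, ocr):
    """Character-level edit distance normalized by ground-truth length."""
    if not truth:
        raise ValueError("ground truth must be non-empty")
    return edit_distance(truth, ocr) / len(truth)


def word_error_rate(truth, ocr):
    """Word-level edit distance normalized by ground-truth word count.

    Assumption: words are separated by whitespace.
    """
    truth_words = truth.split()
    if not truth_words:
        raise ValueError("ground truth must contain at least one word")
    return edit_distance(truth_words, ocr.split()) / len(truth_words)
```

Under this definition, an improvement would be the drop in either rate between the raw OCR text and the corrected text, each measured against the same ground truth.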
-
The Digitization of Historical Astrophysical Literature with Highly-Localized Figures and Figure Captions
Authors:
Jill P. Naiman,
Peter K. G. Williams,
Alyssa Goodman
Abstract:
Scientific articles published prior to the "age of digitization" in the late 1990s contain figures which are "trapped" within their scanned pages. While progress has been made in extracting figures and their captions, there is currently no robust method for this process. We present a YOLO-based method for use on scanned pages, after they have been processed with Optical Character Recognition (OCR), which uses both grayscale and OCR features. We focus our efforts on translating the intersection-over-union (IOU) metric from the field of object detection to document layout analysis and quantify "high localization" levels as an IOU of 0.9. When applied to the astrophysics literature holdings of the NASA Astrophysics Data System (ADS), we find F1 scores of 90.9% (92.2%) for figures (figure captions) with the IOU cut-off of 0.9, which is a significant improvement over other state-of-the-art methods.
Submitted 22 February, 2023;
originally announced February 2023.
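The intersection-over-union metric and the 0.9 "high localization" cut-off mentioned in the abstract above can be illustrated with a short sketch. It assumes axis-aligned boxes given as `(x_min, y_min, x_max, y_max)`; the function names `iou` and `is_highly_localized` are hypothetical and are not taken from the authors' code.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes.

    Assumption: each box is (x_min, y_min, x_max, y_max) with x_min <= x_max
    and y_min <= y_max.
    """
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def is_highly_localized(predicted, truth, threshold=0.9):
    """True when a predicted box meets the IOU cut-off against the true box."""
    return iou(predicted, truth) >= threshold
```

A cut-off of 0.9 is strict: a predicted figure box that misses 20% of the true box's height already falls to an IOU of 0.8 and would not count as a match.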
-
Generalizability in Document Layout Analysis for Scientific Article Figure & Caption Extraction
Authors:
Jill P. Naiman
Abstract:
The lack of generalizability -- in which a model trained on one dataset cannot provide accurate results for a different dataset -- is a known problem in the field of document layout analysis. Thus, when a model is used to locate important page objects in scientific literature such as figures, tables, captions, and math formulas, the model often cannot be applied successfully to new domains. While several solutions have been proposed, including newer and updated deep learning models, larger hand-annotated datasets, and the generation of large synthetic datasets, so far there is no "magic bullet" for translating a model trained on a particular domain or historical time period to a new field. Here we present our ongoing work in translating our document layout analysis model from the historical astrophysical literature to the larger corpus of scientific documents within the HathiTrust U.S. Federal Documents collection. We use this example as an avenue to highlight some of the problems with generalizability in the document layout analysis community and discuss several challenges and possible solutions to address these issues. All code for this work is available on The Reading Time Machine GitHub repository (https://github.com/ReadingTimeMachine/htrc_short_conf).
Submitted 25 January, 2023;
originally announced January 2023.
-
Figure and Figure Caption Extraction for Mixed Raster and Vector PDFs: Digitization of Astronomical Literature with OCR Features
Authors:
J. P. Naiman,
Peter K. G. Williams,
Alyssa Goodman
Abstract:
Scientific articles published prior to the "age of digitization" in the late 1990s contain figures which are "trapped" within their scanned pages. While progress has been made in extracting figures and their captions, there is currently no robust method for this process. We present a YOLO-based method for use on scanned pages, post-Optical Character Recognition (OCR), which uses both grayscale and OCR features. When applied to the astrophysics literature holdings of the Astrophysics Data System (ADS), we find F1 scores of 90.9% (92.2%) for figures (figure captions) with the intersection-over-union (IOU) cut-off of 0.9, which is a significant improvement over other state-of-the-art methods.
Submitted 9 September, 2022;
originally announced September 2022.
-
CloudFindr: A Deep Learning Cloud Artifact Masker for Satellite DEM Data
Authors:
Kalina Borkiewicz,
Viraj Shah,
J. P. Naiman,
Chuanyue Shen,
Stuart Levy,
Jeff Carpenter
Abstract:
Artifact removal is an integral component of cinematic scientific visualization, and is especially challenging with big datasets in which artifacts are difficult to define. In this paper, we describe a method for creating cloud artifact masks which can be used to remove artifacts from satellite imagery using a combination of traditional image processing together with deep learning based on U-Net. Compared to previous methods, our approach does not require multi-channel spectral imagery but performs successfully on single-channel Digital Elevation Models (DEMs). DEMs are a representation of the topography of the Earth and have a variety of applications including planetary science, geology, flood modeling, and city planning.
Submitted 26 October, 2021;
originally announced October 2021.
-
Clustering-informed Cinematic Astrophysical Data Visualization with Application to the Moon-forming Terrestrial Synestia
Authors:
Patrick D. Aleo,
Simon J. Lock,
Donna J. Cox,
Stuart A. Levy,
J. P. Naiman,
A. J. Christensen,
Kalina Borkiewicz,
Robert Patterson
Abstract:
Scientific visualization tools are currently not optimized to create cinematic, production-quality representations of numerical data for the purpose of science communication. In our pipeline Estra, we outline a step-by-step process from a raw simulation to a finished render as a way to teach non-experts in the field of visualization how to achieve production-quality outputs on their own. We demonstrate the feasibility of using the visual effects software Houdini for cinematic astrophysical data visualization, informed by machine learning clustering algorithms. To demonstrate the capabilities of this pipeline, we used a post-impact, thermally-equilibrated Moon-forming synestia from \cite{Lock18}. Our approach aims to identify "physically interpretable" clusters, where clusters identified in an appropriate phase space (e.g. here we use a temperature-entropy phase space) correspond to physically meaningful structures within the simulation data. Clustering results can then be used to highlight these structures by informing the color-mapping process in a simplified Houdini software shading network, where dissimilar phase-space clusters are mapped to different color values for easier visual identification. Cluster information can also be used in 3D position space, via Houdini's Scene View, to aid in physical cluster finding, simulation prototyping, and data exploration. Our clustering-based renders are compared to those created by the Advanced Visualization Lab (AVL) team for the full dome show "Imagine the Moon" as proof of concept. With Estra, scientists have a tool to create their own production-quality, data-driven visualizations.
Submitted 29 May, 2020;
originally announced June 2020.
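The idea in the abstract above of mapping dissimilar phase-space clusters to different color values can be sketched outside Houdini. This is only an illustration under stated assumptions: cluster labels are taken as already computed by some clustering algorithm, the function name `cluster_colors` is hypothetical, and evenly spaced hues are one simple palette choice rather than the shading network the authors describe.

```python
import colorsys


def cluster_colors(labels):
    """Assign one RGB color per cluster label, with evenly spaced hues.

    Assumption: `labels` holds one hashable, sortable cluster id per data point.
    Returns a list of (r, g, b) tuples in [0, 1], one per data point.
    """
    unique = sorted(set(labels))
    n = len(unique)
    # Spread hues evenly around the color wheel so clusters are easy to tell apart.
    palette = {
        label: colorsys.hsv_to_rgb(k / n, 1.0, 1.0)
        for k, label in enumerate(unique)
    }
    return [palette[label] for label in labels]
```

For example, `cluster_colors([0, 1, 0, 2])` gives the first and third points the same color and the other two points distinct colors, which could then be written out as a per-point color attribute.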
-
Cinematic Visualization of Multiresolution Data: Ytini for Adaptive Mesh Refinement in Houdini
Authors:
Kalina Borkiewicz,
J. P. Naiman,
Haoming Lai
Abstract:
We have entered the era of large multidimensional datasets represented by increasingly complex data structures. Current tools for scientific visualization are not optimized to efficiently and intuitively create cinematic, production-quality, time-evolving representations of numerical data for broad-impact science communication via film, media, or journalism. To present such data in a cinematic environment, it is advantageous to develop methods that integrate these complex data structures into industry-standard visual effects software packages, which provide a myriad of control features otherwise unavailable in traditional scientific visualization software. In this paper, we present the general methodology for the import and visualization of nested multiresolution datasets in commercially available visual effects software. We further provide a specific example of importing Adaptive Mesh Refinement data into the software Houdini. This paper builds on our previous work, which describes a method for using Houdini to visualize uniform Cartesian datasets. We summarize a tutorial available on the website www.ytini.com, which includes sample data downloads, Python code, and various other resources to simplify the process of importing and rendering multiresolution data.
Submitted 1 November, 2018; v1 submitted 8 August, 2018;
originally announced August 2018.