-
Campaign Knowledge Network: Building Knowledge for Campaign Efficiency
Authors:
Sachith Withana,
Kshitij Mehta,
Matthew Wolf,
Beth Plale
Abstract:
In the landscape of exascale computing, collaborative research campaigns are conducted as co-design activities of loosely coordinated experiments. But the higher-level context and the knowledge of individual experimental activity are lost over time. We undertook a knowledge capture and representation aid called Campaign Knowledge Network (CKN), a co-design and analysis tool. We demonstrate that CKN can satisfy the Hoarde abstraction and can distill campaign context from runtime information, thereby creating a knowledge resource upon which analysis tools can run to provide more efficient experimentation.
Submitted 6 December, 2021;
originally announced December 2021.
-
Pilot evaluation of Collection API with PID Kernel Information
Authors:
Yu Luo,
Beth Plale
Abstract:
As digital data become increasingly available for research, there is a growing awareness of the value of domain-agnostic Persistent Identifiers (PIDs) for data. A PID is a globally unique reference to a digital object, which in our case is data. In an ecosystem of connected digital objects, a PID will reference a digital object, and the digital object will be a simple entity, a collection of homogeneous objects, or a set of heterogeneous objects.
In this paper, we study two recent recommendations from the Research Data Alliance (RDA) that both address pieces of an ecosystem of connected digital objects. The recommendations address Persistent ID records and representations of collections of data. We evaluate different approaches to where key information about a data collection should be located between these two component solutions.
Submitted 3 July, 2019; v1 submitted 8 May, 2019;
originally announced May 2019.
-
Reliable Access to Massive Restricted Texts: Experience-based Evaluation
Authors:
Zong Peng,
Beth Plale
Abstract:
Libraries are seeing growing numbers of digitized textual corpora that frequently come with restrictions on their content. Large corpora for computational analysis, while of interest to scholars, can be cumbersome because of the combination of size, granularity of access, and access restrictions. Efficient management of such a collection for general access, especially under failures, depends on the primary storage system. In this paper, we identify the requirements of managing a massive text corpus for computational analysis and use them as a basis to evaluate candidate storage solutions. The study is based on the 5.9 billion page collection of the HathiTrust digital library. Our findings led to the choice of Cassandra 3.x for the primary back end store, which is currently in deployment in the HathiTrust Research Center.
Submitted 2 March, 2019;
originally announced March 2019.
-
Fast Data Management with Distributed Streaming SQL
Authors:
Milinda Pathirage,
Beth Plale
Abstract:
To stay competitive in today's data-driven economy, enterprises large and small are turning to stream processing platforms to process high-volume, high-velocity, and diverse streams of data (fast data) as they arrive. Low-level programming models provided by the popular systems of today suffer from a lack of responsiveness to change: enhancements require code changes with attendant large turn-around times. Even though distributed SQL query engines have been available for Big Data, we still lack support for SQL-based stream querying capabilities in distributed stream processing systems. In this white paper, we identify a set of requirements and propose a standard SQL-based streaming query model for management of what has been referred to as Fast Data.
Submitted 12 November, 2015;
originally announced November 2015.