Search | arXiv e-print repository

Elimination of annotation dependencies in validation for Modern JSON Schema

Authors: Lyes Attouche, Mohamed-Amine Baazizi, Dario Colazzo, Giorgio Ghelli, Stefan Klessinger, Carlo Sartiani, Stefanie Scherzinger

Abstract: JSON Schema is a logical language used to define the structure of JSON values. JSON Schema syntax is based on nested schema objects. In all versions of JSON Schema until Draft-07, collectively known as Classical JSON Schema, the semantics of a schema was entirely described by the set of JSON values that it validates. This semantics was the basis for a thorough theoretical study and for the development of tools to decide satisfiability and equivalence of schemas. Unfortunately, Classical JSON Schema suffered a severe limitation in its ability to express extensions of object schemas, which caused the introduction, with Draft 2019-09, of two disruptive features: annotation dependency and dynamic references. These new features undermine the previously developed semantic theory, and the algorithms used to decide satisfiability for Classical JSON Schema are not easy to extend. One possible solution is rewriting a schema written in Modern JSON Schema into an equivalent schema in Classical JSON Schema. In this paper we prove that the elimination of annotation-dependent keywords cannot, in general, avoid an exponential increase of the schema dimension. We provide an algorithm to eliminate these keywords that, despite the theoretical lower bound, behaves quite well in practice, as we verify with an extensive set of experiments.

Submitted 14 March, 2025; originally announced March 2025.
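
The annotation dependency mentioned in this abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: `evaluate` is a hypothetical toy validator that supports only `properties`, `allOf`, `additionalProperties: false` and `unevaluatedProperties: false`, and it assumes every property subschema is the empty schema `{}`. It shows why `additionalProperties` cannot see properties declared inside `allOf`, while `unevaluatedProperties` can, because it depends on which properties sibling keywords have already evaluated.

```python
def evaluate(schema, value):
    """Toy validator: returns (valid, evaluated_property_names).

    Hypothetical helper covering a tiny fragment of JSON Schema only.
    """
    if not isinstance(value, dict):
        # The keywords handled here constrain objects only.
        return True, set()

    valid = True
    evaluated = set()

    # "properties" evaluates the listed names that are present.
    props = schema.get("properties", {})
    for name, sub in props.items():
        if name in value:
            ok, _ = evaluate(sub, value[name])
            valid = valid and ok
            evaluated.add(name)

    # "additionalProperties: false" looks only at the adjacent "properties".
    if schema.get("additionalProperties") is False:
        if set(value) - set(props):
            valid = False

    # "allOf" passes the evaluated names of its branches upwards.
    for sub in schema.get("allOf", []):
        ok, names = evaluate(sub, value)
        valid = valid and ok
        evaluated |= names

    # "unevaluatedProperties: false" depends on everything evaluated so far.
    if schema.get("unevaluatedProperties") is False:
        if set(value) - evaluated:
            valid = False

    return valid, evaluated


base = {"properties": {"name": {}}}
classical = {"allOf": [base], "properties": {"age": {}},
             "additionalProperties": False}
modern = {"allOf": [base], "properties": {"age": {}},
          "unevaluatedProperties": False}

doc = {"name": "Ada", "age": 36}
classical_ok = evaluate(classical, doc)[0]  # "name" counts as additional
modern_ok = evaluate(modern, doc)[0]        # "name" was evaluated via allOf
```

In this sketch the classical schema rejects `doc` while the modern one accepts it, which is the extension problem the abstract refers to.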

arXiv:2307.10034 [pdf, other]

Validation of Modern JSON Schema: Formalization and Complexity

Authors: Lyes Attouche, Mohamed-Amine Baazizi, Dario Colazzo, Giorgio Ghelli, Carlo Sartiani, Stefanie Scherzinger

Abstract: JSON Schema is the de-facto standard schema language for JSON data. The language went through many minor revisions, but the most recent versions of the language added two novel features, dynamic references and annotation-dependent validation, that change the evaluation model. Modern JSON Schema is the name used to indicate all versions from Draft 2019-09, which are characterized by these new features, while Classical JSON Schema is used to indicate the previous versions. These new "modern" features make the schema language quite difficult to understand, and have generated many discussions about the correct interpretation of their official specifications; for this reason we undertook the task of their formalization. During this process, we also analyzed the complexity of data validation in Modern JSON Schema, with the idea of confirming the PTIME complexity of Classical JSON Schema validation, and we were surprised to discover a completely different truth: data validation, which is expected to be an extremely efficient process, acquires, with Modern JSON Schema features, a PSPACE complexity. In this paper, we give the first formal description of Modern JSON Schema, which we consider a central contribution of the work that we present here. We then prove that its data validation problem is PSPACE-complete. We prove that the origin of the problem lies in dynamic references, and not in annotation-dependent validation. We study the schema and data complexities, showing that the problem is PSPACE-complete with respect to the schema size even with a fixed instance, but is in PTIME when the schema is fixed and only the instance size is allowed to vary. Finally, we run experiments that show that there are families of schemas where the difference in asymptotic complexity between dynamic and static references is extremely visible, even with small schemas.

Submitted 1 February, 2024; v1 submitted 19 July, 2023; originally announced July 2023.

arXiv:2202.13434 [pdf, ps, other]

Negation-Closure for JSON Schema

Authors: Mohamed-Amine Baazizi, Dario Colazzo, Giorgio Ghelli, Carlo Sartiani, Stefanie Scherzinger

Abstract: JSON Schema is an evolving standard for describing families of JSON documents. It is a logical language, based on a set of assertions that describe features of the JSON value under analysis and on logical or structural combinators for these assertions, including a negation operator. Most logical languages with negation enjoy negation closure, that is, for every operator they have a negation dual that expresses its negation. We show that this is not the case for JSON Schema, study how that changed with the latest versions of the Draft, and discuss how the language may be enriched accordingly. In the process, we define an algebraic reformulation of JSON Schema, which we successfully employed in a prototype system for generating schema witnesses.

Submitted 27 February, 2022; originally announced February 2022.
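
As a minimal sketch of what a "negation dual" means, the snippet below checks one keyword for which a dual does exist. It does not reproduce the paper's result about the keywords that lack one. `validates` is a hypothetical toy validator covering only `type: "number"`, `minimum`, numeric `exclusiveMaximum` (as in Draft-06 and later) and `not`. Because `minimum` accepts every non-number, the negation of `{"minimum": 5}` is "a number strictly below 5".

```python
def is_number(value):
    # JSON numbers only; Python's bool is a subclass of int, so exclude it.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validates(schema, value):
    """Toy validator for a four-keyword fragment (hypothetical helper)."""
    if schema.get("type") == "number" and not is_number(value):
        return False
    # Numeric assertions are vacuously satisfied by non-numbers.
    if "minimum" in schema and is_number(value) and value < schema["minimum"]:
        return False
    if ("exclusiveMaximum" in schema and is_number(value)
            and value >= schema["exclusiveMaximum"]):
        return False
    if "not" in schema and validates(schema["not"], value):
        return False
    return True


negated = {"not": {"minimum": 5}}
dual = {"type": "number", "exclusiveMaximum": 5}  # same meaning, no "not"

samples = [4, 5, 6.5, "text", None, True, [1], {"a": 1}]
agree = all(validates(negated, v) == validates(dual, v) for v in samples)
```

On these samples the two schemas agree, so `dual` expresses the negation without using `not`.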

arXiv:2202.12849 [pdf, other]

Witness Generation for JSON Schema

Authors: Lyes Attouche, Mohamed-Amine Baazizi, Dario Colazzo, Giorgio Ghelli, Carlo Sartiani, Stefanie Scherzinger

Abstract: JSON Schema is an important, evolving standard schema language for families of JSON documents. It is based on a complex combination of structural and Boolean assertions, and features negation and recursion. The static analysis of JSON Schema documents comprises practically relevant problems, including schema satisfiability, inclusion, and equivalence. These three problems can be reduced to witness generation: given a schema, generate an element of the schema, if it exists, and report failure otherwise. Schema satisfiability, inclusion, and equivalence have been shown to be decidable, by reduction to reachability in alternating tree automata. However, no witness generation algorithm has yet been formally described. We contribute a first, direct algorithm for JSON Schema witness generation. We study its effectiveness and efficiency, in experiments over several schema collections, including thousands of real-world schemas. Our focus is on the completeness of the language, where we only exclude the uniqueItems operator, and on the ability of the algorithm to run in a reasonable time on a large set of real-world examples, despite the exponential complexity of the underlying problem.

Submitted 16 July, 2022; v1 submitted 25 February, 2022; originally announced February 2022.

arXiv:2107.08677 [pdf, ps, other]

An Empirical Study on the "Usage of Not" in Real-World JSON Schema Documents (Long Version)

Authors: Mohamed-Amine Baazizi, Dario Colazzo, Giorgio Ghelli, Carlo Sartiani, Stefanie Scherzinger

Abstract: In this paper, we study the usage of negation in JSON Schema data modeling. Negation is a logical operator that is rarely present in type systems and schema description languages, since it complicates decision problems. As a consequence, many software tools, but also formal frameworks for working with JSON Schema, do not fully support negation. As of today, the question whether covering negation is practically relevant, or a mainly theoretical exercise (albeit challenging), is open. This motivates us to study whether negation is really used in practice, for which aims, and whether it could be - in principle - replaced by simpler operators. We have collected the most diverse corpus of JSON Schema documents analyzed so far, based on a crawl of 90k open source schemas hosted on GitHub. We perform a systematic analysis, quantify usage patterns of negation, and also qualitatively analyze schemas. We show that negation is indeed used, following a stable set of patterns, with the potential to mature into design patterns.

Submitted 19 July, 2021; originally announced July 2021.

arXiv:2104.14828 [pdf, ps, other]

Not Elimination and Witness Generation for JSON Schema

Authors: Mohamed-Amine Baazizi, Dario Colazzo, Giorgio Ghelli, Carlo Sartiani, Stefanie Scherzinger

Abstract: JSON Schema is an evolving standard for the description of families of JSON documents. JSON Schema is a logical language, based on a set of assertions that describe features of the JSON value under analysis and on logical or structural combinators for these assertions. As for any logical language, problems like satisfaction, not-elimination, schema satisfiability, schema inclusion and equivalence, as well as witness generation, have both theoretical and practical interest. While satisfaction is trivial, all other problems are quite difficult, due to the combined presence of negation, recursion, and complex assertions in JSON Schema. To make things even more complex and interesting, JSON Schema is not algebraic, since we have both syntactic and semantic interactions between different keywords in the same schema object. With such motivations, we present in this paper an algebraic characterization of JSON Schema, obtained by adding appropriate operators, and by mirroring existing ones. We then present algebra-based approaches for dealing with not-elimination and witness generation problems, which play a central role as they lead to solutions for the other mentioned complex problems.

Submitted 7 May, 2021; v1 submitted 30 April, 2021; originally announced April 2021.

arXiv:1810.00751 [pdf, other]

CBPF: leveraging context and content information for better recommendations

Authors: Zahra Vahidi Ferdousi, Dario Colazzo, Elsa Negre

Abstract: Recommender systems help users find appropriate items among large volumes of information. Different types of recommender systems have been proposed. Among these, context-aware recommender systems aim at personalizing the recommendations as much as possible, based on the context in which the user is situated. In this paper we present an approach integrating contextual information into the recommendation process by modeling either item-based or user-based influence of the context on ratings, using the Pearson Correlation Coefficient. The proposed solution aims at taking advantage of content and contextual information in the recommendation process. We evaluate and show the effectiveness of our approach on three different contextual datasets and analyze the performance of the variants of our approach based on the characteristics of these datasets, especially the sparsity level of the input data and the amount of available information.

Submitted 1 October, 2018; originally announced October 2018.

Comments: 15 pages, 4 figures, this is the long version of the paper submitted to the conference ADMA'18
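
The abstract names the Pearson Correlation Coefficient as the measure of how context influences ratings. The snippet below is only a minimal sketch of that coefficient, not the CBPF method: the function name `pearson` and the sample data (a hypothetical "weekend" context flag and one user's ratings) are assumptions made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant sequence has no defined correlation; this sketch
        # treats that case as "no measurable influence".
        return 0.0
    return cov / sqrt(var_x * var_y)


# Hypothetical data: 1 = rated on a weekend, 0 = rated on a weekday.
weekend = [0, 0, 1, 1, 0, 1]
ratings = [2, 3, 5, 4, 2, 5]
influence = pearson(weekend, ratings)
```

A value of `influence` near 1 or -1 would suggest that this context dimension moves the ratings strongly; a value near 0 would suggest little linear effect.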

arXiv:1507.01708 [pdf, ps, other]

Typing Regular Path Query Languages for Data Graphs

Authors: Dario Colazzo, Carlo Sartiani

Abstract: Regular path query languages for data graphs are essentially untyped. The lack of type information greatly limits the optimization opportunities for query engines and makes application development more complex. In this paper we discuss a simple, yet expressive, schema language for edge-labelled data graphs. This schema language is then used to define a query type inference approach with good precision properties.

Submitted 7 July, 2015; originally announced July 2015.

arXiv:1205.6698 [pdf, other]

Type-Based Detection of XML Query-Update Independence

Authors: Nicole Bidoit-Tollu, Dario Colazzo, Federico Ulliana

Abstract: This paper presents a novel static analysis technique to detect XML query-update independence, in the presence of a schema. Rather than types, our system infers chains of types. Each chain represents a path that can be traversed on a valid document during query/update evaluation. The resulting independence analysis is precise, although it raises a challenging issue: recursive schemas may lead to inferring infinitely many chains. A sound and complete approximation technique ensuring a finite analysis in any case is presented, together with an efficient implementation performing the chain-based analysis in polynomial space and time.

Submitted 30 May, 2012; originally announced May 2012.

Comments: VLDB2012

Journal ref: Proceedings of the VLDB Endowment (PVLDB), Vol. 5, No. 9, pp. 872-883 (2012)

arXiv:1108.4596 [pdf]

XML content warehousing: Improving sociological studies of mailing lists and web data

Authors: Benjamin Nguyen, Antoine Vion, François-Xavier Dudouet, Dario Colazzo, Ioana Manolescu, Pierre Senellart

Abstract: In this paper, we present the guidelines for an XML-based approach for the sociological study of Web data such as the analysis of mailing lists or databases available online. The use of an XML warehouse is a flexible solution for storing and processing this kind of data. We propose an implemented solution and show possible applications with our case study of profiles of experts involved in W3C standard-setting activity. We illustrate the sociological use of semi-structured databases by presenting our XML Schema for mailing-list warehousing. An XML Schema allows many adjunctions or crossings of data sources, without modifying existing data sets, while allowing possible structural evolution. We also show that the existence of hidden data implies increased complexity for traditional SQL users. XML content warehousing allows both exhaustive warehousing and recursive queries through contents, with far less dependence on the initial storage. We finally present the possibility of exporting the data stored in the warehouse to commonly used advanced software devoted to sociological analysis.

Submitted 23 August, 2011; originally announced August 2011.

Journal ref: Bulletin de Méthodologie Sociologique (BMS) (2011) 27p

arXiv:1104.2079 [pdf, ps, other]

Optimizing XML querying using type-based document projection

Authors: Véronique Benzaken, Giuseppe Castagna, Dario Colazzo, Kim Nguyen

Abstract: XML data projection (or pruning) is a natural optimization for main memory query engines: given a query Q over a document D, the subtrees of D that are not necessary to evaluate Q are pruned, thus producing a smaller document D'; the query Q is then executed on D', hence avoiding the allocation and processing of nodes that will never be reached by Q. In this article, we propose a new approach, based on types, that greatly improves current solutions. Besides providing comparable or greater precision and far lower pruning overhead, our solution, unlike current approaches, takes into account backward axes and predicates, and can be applied to multiple queries rather than just to single ones. A side contribution is a new type system for XPath able to handle backward axes. The soundness of our approach is formally proved. Furthermore, we prove that the approach is also complete (i.e., yields the best possible type-driven pruning) for a relevant class of queries and Schemas. We further validate our approach using the XMark and XPathMark benchmarks and show that pruning not only improves the main memory query engine's performance (as expected) but also that of state-of-the-art native XML databases.

Submitted 11 April, 2011; originally announced April 2011.

Comments: 65 pages A4 format

ACM Class: H.2.5; F.3.2

arXiv:1002.0971 [pdf]

The WebStand Project

Authors: Benjamin Nguyen, François-Xavier Dudouet, Dario Colazzo, Antoine Vion, Ioana Manolescu, Pierre Senellart

Abstract: In this paper we present the state of advancement of the French ANR WebStand project. The objective of this project is to construct a customizable XML-based warehouse platform to acquire, transform, analyze, store, query and export data from the web, in particular mailing lists, with the final intention of using this data to perform sociological studies focused on social groups of the World Wide Web, with a specific emphasis on the temporal aspects of this data. We are currently using this system to analyze the standardization process of the W3C, through its social network of standard setters.

Submitted 4 February, 2010; originally announced February 2010.

Journal ref: WebSci'09: Society On-Line Conference, Greece (2009)

Showing 1–12 of 12 results for author: Colazzo, D