-
Representing and querying data tensors in RDF and SPARQL
Authors:
Piotr Marciniak,
Piotr Sowinski,
Maria Ganzha
Abstract:
Embedding tensors in databases has recently gained in significance, due to the rapid proliferation of machine learning methods (including LLMs) which produce embeddings in the form of tensors. To support emerging use cases hybridizing machine learning with knowledge graphs, a robust and efficient tensor representation scheme is needed. We introduce a novel approach for representing data tensors as literals in RDF, along with an extension of SPARQL implementing specialized functionalities for handling such literals. The extension includes 36 SPARQL functions and four aggregates. To support this approach, we provide a thoroughly tested, open-source implementation based on Apache Jena, along with an exemplary knowledge graph and query set.
Submitted 27 April, 2025;
originally announced April 2025.
-
Faster ED-String Matching with $k$ Mismatches
Authors:
Paweł Gawrychowski,
Adam Górkiewicz,
Pola Marciniak,
Solon P. Pissis,
Karol Pokorski
Abstract:
We revisit the complexity of approximate pattern matching in an elastic-degenerate string. Such a string is a sequence of $n$ finite sets of strings of total length $N$, and compactly describes a collection of strings obtained by first choosing exactly one string in every set, and then concatenating them together. This is motivated by the need to store a collection of highly similar DNA sequences.
The basic algorithmic question on elastic-degenerate strings is pattern matching: given such an elastic-degenerate string and a standard pattern of length $m$, check if the pattern occurs in one of the strings in the described collection. Bernardini et al.~[SICOMP 2022] showed how to leverage fast matrix multiplication to obtain an $\tilde{\mathcal{O}}(nm^{\omega-1})+\mathcal{O}(N)$-time complexity for this problem, where $\omega$ is the matrix multiplication exponent. However, the best result so far for finding occurrences with $k$ mismatches, where $k$ is a constant, is the $\tilde{\mathcal{O}}(nm^{2}+N)$-time algorithm of Pissis et al.~[CPM 2025]. This raises the question of whether increasing the dependency on $m$ from $m^{\omega-1}$ to quadratic is necessary when moving from $k=0$ to larger (but still constant) $k$.
We design an $\tilde{\mathcal{O}}(nm^{1.5}+N)$-time algorithm for pattern matching with $k$ mismatches in an elastic-degenerate string, for any constant $k$. To obtain this time bound, we leverage the structural characterization of occurrences with $k$ mismatches of Charalampopoulos et al.~[FOCS 2020] together with the fast Fourier transform. We need to work with multiple patterns at the same time, instead of a single pattern, which requires refining the original characterization. This might be of independent interest.
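To make the problem statement concrete, the following is a minimal brute-force sketch of the definition only: it expands an elastic-degenerate string into every string it describes and checks each one for an occurrence of the pattern with at most $k$ mismatches. It is not the algorithm of the paper and takes exponential time in general; the function names (`expand`, `occurs_with_mismatches`, `ed_match_naive`) and the toy input are illustrative assumptions.

```python
from itertools import product


def expand(ed_string):
    """Yield every string described by an elastic-degenerate string.

    `ed_string` is a sequence of collections of strings; one string is
    chosen from every collection and the choices are concatenated.
    """
    for choice in product(*ed_string):
        yield "".join(choice)


def occurs_with_mismatches(text, pattern, k):
    """Return True if `pattern` occurs in `text` with at most k mismatches."""
    m = len(pattern)
    for start in range(len(text) - m + 1):
        window = text[start:start + m]
        mismatches = sum(1 for a, b in zip(window, pattern) if a != b)
        if mismatches <= k:
            return True
    return False


def ed_match_naive(ed_string, pattern, k=0):
    """Brute-force check over all expansions (exponential, definition only)."""
    return any(occurs_with_mismatches(t, pattern, k) for t in expand(ed_string))


# Hypothetical toy input: three sets, the middle one including the empty string.
ed = [["A"], ["C", "CG", ""], ["T"]]
print(sorted(expand(ed)))
print(ed_match_naive(ed, "GT"), ed_match_naive(ed, "GG"), ed_match_naive(ed, "GG", k=1))
```

The sketch enumerates up to the product of the set sizes, which is exactly the blow-up that the algorithms discussed in the abstract avoid.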
Submitted 3 March, 2025;
originally announced March 2025.
-
Deepfake tweets automatic detection
Authors:
Adam Frej,
Adrian Kaminski,
Piotr Marciniak,
Szymon Szmajdzinski,
Soveatin Kuntur,
Anna Wroblewska
Abstract:
This study addresses the critical challenge of detecting DeepFake tweets by leveraging advanced natural language processing (NLP) techniques to distinguish between genuine and AI-generated texts. Given the increasing prevalence of misinformation, our research utilizes the TweepFake dataset to train and evaluate various machine learning models. The objective is to identify effective strategies for recognizing DeepFake content, thereby enhancing the integrity of digital communications. By developing reliable methods for detecting AI-generated misinformation, this work contributes to a more trustworthy online information environment.
Submitted 24 June, 2024;
originally announced June 2024.
-
Small Is Not Always Beautiful
Authors:
Pawel Marciniak,
Nikitas Liogkas,
Arnaud Legout,
Eddie Kohler
Abstract:
Peer-to-peer content distribution systems have been enjoying great popularity, and are now gaining momentum as a means of disseminating video streams over the Internet. In many of these protocols, including the popular BitTorrent, content is split into mostly fixed-size pieces, allowing a client to download data from many peers simultaneously. This makes piece size potentially critical for performance. However, previous research efforts have largely overlooked this parameter, opting to focus on others instead. This paper presents the results of real experiments with varying piece sizes on a controlled BitTorrent testbed. We demonstrate that this parameter is indeed critical, as it determines the degree of parallelism in the system, and we investigate optimal piece sizes for distributing small and large content. We also pinpoint a related design trade-off, and explain how BitTorrent's choice of dividing pieces into subpieces attempts to address it.
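As a rough illustration of why piece size bounds parallelism, the sketch below splits content of a given length into fixed-size pieces, with a possibly shorter final piece, and reports how many pieces exist to be fetched from different peers. The function name `piece_lengths` and the example sizes are assumptions chosen for illustration; they are not taken from the paper's experiments.

```python
def piece_lengths(content_length, piece_size):
    """Return the length of each piece when content is cut into fixed-size pieces.

    All pieces have `piece_size` bytes except possibly the last one,
    which holds whatever remains.
    """
    if content_length < 0 or piece_size <= 0:
        raise ValueError("content_length must be >= 0 and piece_size > 0")
    full, remainder = divmod(content_length, piece_size)
    lengths = [piece_size] * full
    if remainder:
        lengths.append(remainder)
    return lengths


# Hypothetical example: the same 1 MiB of content with two piece sizes.
one_mib = 1024 * 1024
for size in (256 * 1024, 16 * 1024):
    pieces = piece_lengths(one_mib, size)
    # The piece count is an upper bound on how many distinct pieces
    # a client could be downloading from different peers at once.
    print(size, len(pieces))
```

With larger pieces the same content yields fewer pieces, so fewer peers can serve distinct pieces at the same time; this is only the arithmetic behind the trade-off, not a model of the measured behaviour.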
Submitted 7 February, 2008;
originally announced February 2008.