-
Decomposing Words for Enhanced Compression: Exploring the Number of Runs in the Extended Burrows-Wheeler Transform
Authors:
Florian Ingels,
Anaïs Denis,
Bastien Cazaux
Abstract:
The Burrows-Wheeler Transform (BWT) is a fundamental component in many data structures for text indexing and compression, widely used in areas such as bioinformatics and information retrieval. The extended BWT (eBWT) generalizes the classical BWT to multisets of strings, providing a flexible framework that captures many BWT-like constructions. Several known variants of the BWT can be viewed as ins…
▽ More
The Burrows-Wheeler Transform (BWT) is a fundamental component in many data structures for text indexing and compression, widely used in areas such as bioinformatics and information retrieval. The extended BWT (eBWT) generalizes the classical BWT to multisets of strings, providing a flexible framework that captures many BWT-like constructions. Several known variants of the BWT can be viewed as instances of the eBWT applied to specific decompositions of a word. A central property of the BWT, essential for its compressibility, is the number of maximal ranges of equal letters, named runs. In this article, we explore how different decompositions of a word impact the number of runs in the resulting eBWT. First, we show that the number of decompositions of a word is exponential, even under minimal constraints on the size of the subsets in the decomposition. Second, we present an infinite family of words for which the ratio of the number of runs between the worst and best decompositions is unbounded, under the same minimal constraints. These results illustrate the potential cost of decomposition choices in eBWT-based compression and underline the challenges in optimizing run-length encoding in generalized BWT frameworks.
△ Less
Submitted 5 June, 2025;
originally announced June 2025.
-
Do Convolutional Networks need to be Deep for Text Classification ?
Authors:
Hoa T. Le,
Christophe Cerisara,
Alexandre Denis
Abstract:
We study in this work the importance of depth in convolutional models for text classification, either when character or word inputs are considered. We show on 5 standard text classification and sentiment analysis tasks that deep models indeed give better performances than shallow networks when the text input is represented as a sequence of characters. However, a simple shallow-and-wide network out…
▽ More
We study in this work the importance of depth in convolutional models for text classification, either when character or word inputs are considered. We show on 5 standard text classification and sentiment analysis tasks that deep models indeed give better performances than shallow networks when the text input is represented as a sequence of characters. However, a simple shallow-and-wide network outperforms deep models such as DenseNet with word inputs. Our shallow word model further establishes new state-of-the-art performances on two datasets: Yelp Binary (95.9\%) and Yelp Full (64.9\%).
△ Less
Submitted 13 July, 2017;
originally announced July 2017.
-
General Purpose Textual Sentiment Analysis and Emotion Detection Tools
Authors:
Alexandre Denis,
Samuel Cruz-Lara,
Nadia Bellalem
Abstract:
Textual sentiment analysis and emotion detection consists in retrieving the sentiment or emotion carried by a text or document. This task can be useful in many domains: opinion mining, prediction, feedbacks, etc. However, building a general purpose tool for doing sentiment analysis and emotion detection raises a number of issues, theoretical issues like the dependence to the domain or to the langu…
▽ More
Textual sentiment analysis and emotion detection consists in retrieving the sentiment or emotion carried by a text or document. This task can be useful in many domains: opinion mining, prediction, feedbacks, etc. However, building a general purpose tool for doing sentiment analysis and emotion detection raises a number of issues, theoretical issues like the dependence to the domain or to the language but also pratical issues like the emotion representation for interoperability. In this paper we present our sentiment/emotion analysis tools, the way we propose to circumvent the di culties and the applications they are used for.
△ Less
Submitted 11 September, 2013;
originally announced September 2013.
-
Multimodal Meaning Representation for Generic Dialogue Systems Architectures
Authors:
Frédéric Landragin,
Alexandre Denis,
Annalisa Ricci,
Laurent Romary
Abstract:
An unified language for the communicative acts between agents is essential for the design of multi-agents architectures. Whatever the type of interaction (linguistic, multimodal, including particular aspects such as force feedback), whatever the type of application (command dialogue, request dialogue, database querying), the concepts are common and we need a generic meta-model. In order to tend…
▽ More
An unified language for the communicative acts between agents is essential for the design of multi-agents architectures. Whatever the type of interaction (linguistic, multimodal, including particular aspects such as force feedback), whatever the type of application (command dialogue, request dialogue, database querying), the concepts are common and we need a generic meta-model. In order to tend towards task-independent systems, we need to clarify the modules parameterization procedures. In this paper, we focus on the characteristics of a meta-model designed to represent meaning in linguistic and multimodal applications. This meta-model is called MMIL for MultiModal Interface Language, and has first been specified in the framework of the IST MIAMM European project. What we want to test here is how relevant is MMIL for a completely different context (a different task, a different interaction type, a different linguistic domain). We detail the exploitation of MMIL in the framework of the IST OZONE European project, and we draw the conclusions on the role of MMIL in the parameterization of task-independent dialogue managers.
△ Less
Submitted 16 March, 2007;
originally announced March 2007.