-
Index and Materialized View Selection in Data Warehouses
Authors:
Kamel Aouiche,
Jérôme Darmont
Abstract:
The aim of this article is to present an overview of the major families of state-of-the-art index and materialized view selection methods, and to discuss the issues and future trends in data warehouse performance optimization. We particularly focus on data mining-based heuristics we developed to reduce the selection problem complexity and target the most pertinent candidate indexes and materialize…
▽ More
The aim of this article is to present an overview of the major families of state-of-the-art index and materialized view selection methods, and to discuss the issues and future trends in data warehouse performance optimization. We particularly focus on data mining-based heuristics we developed to reduce the selection problem complexity and target the most pertinent candidate indexes and materialized views.
△ Less
Submitted 27 January, 2017;
originally announced January 2017.
-
Web 2.0 OLAP: From Data Cubes to Tag Clouds
Authors:
Kamel Aouiche,
Daniel Lemire,
Robert Godin
Abstract:
Increasingly, business projects are ephemeral. New Business Intelligence tools must support ad-lib data sources and quick perusal. Meanwhile, tag clouds are a popular community-driven visualization technique. Hence, we investigate tag-cloud views with support for OLAP operations such as roll-ups, slices, dices, clustering, and drill-downs. As a case study, we implemented an application where users…
▽ More
Increasingly, business projects are ephemeral. New Business Intelligence tools must support ad-lib data sources and quick perusal. Meanwhile, tag clouds are a popular community-driven visualization technique. Hence, we investigate tag-cloud views with support for OLAP operations such as roll-ups, slices, dices, clustering, and drill-downs. As a case study, we implemented an application where users can upload data and immediately navigate through its ad hoc dimensions. To support social networking, views can be easily shared and embedded in other Web sites. Algorithmically, our tag-cloud views are approximate range top-k queries over spontaneous data cubes. We present experimental evidence that iceberg cuboids provide adequate online approximations. We benchmark several browser-oblivious tag-cloud layout optimizations.
△ Less
Submitted 15 March, 2016; v1 submitted 16 May, 2009;
originally announced May 2009.
-
Sorting improves word-aligned bitmap indexes
Authors:
Daniel Lemire,
Owen Kaser,
Kamel Aouiche
Abstract:
Bitmap indexes must be compressed to reduce input/output costs and minimize CPU usage. To accelerate logical operations (AND, OR, XOR) over bitmaps, we use techniques based on run-length encoding (RLE), such as Word-Aligned Hybrid (WAH) compression. These techniques are sensitive to the order of the rows: a simple lexicographical sort can divide the index size by 9 and make indexes several times f…
▽ More
Bitmap indexes must be compressed to reduce input/output costs and minimize CPU usage. To accelerate logical operations (AND, OR, XOR) over bitmaps, we use techniques based on run-length encoding (RLE), such as Word-Aligned Hybrid (WAH) compression. These techniques are sensitive to the order of the rows: a simple lexicographical sort can divide the index size by 9 and make indexes several times faster. We investigate row-reordering heuristics. Simply permuting the columns of the table can increase the sorting efficiency by 40%. Secondary contributions include efficient algorithms to construct and aggregate bitmaps. The effect of word length is also reviewed by constructing 16-bit, 32-bit and 64-bit indexes. Using 64-bit CPUs, we find that 64-bit indexes are slightly faster than 32-bit indexes despite being nearly twice as large.
△ Less
Submitted 29 July, 2016; v1 submitted 23 January, 2009;
originally announced January 2009.
-
Frequent itemsets mining for database auto-administration
Authors:
Kamel Aouiche,
Jérôme Darmont,
Le Gruenwald
Abstract:
With the wide development of databases in general and data warehouses in particular, it is important to reduce the tasks that a database administrator must perform manually. The aim of auto-administrative systems is to administrate and adapt themselves automatically without loss (or even with a gain) in performance. The idea of using data mining techniques to extract useful knowledge for adminis…
▽ More
With the wide development of databases in general and data warehouses in particular, it is important to reduce the tasks that a database administrator must perform manually. The aim of auto-administrative systems is to administrate and adapt themselves automatically without loss (or even with a gain) in performance. The idea of using data mining techniques to extract useful knowledge for administration from the data themselves has existed for some years. However, little research has been achieved. This idea nevertheless remains a very promising approach, notably in the field of data warehousing, where queries are very heterogeneous and cannot be interpreted easily. The aim of this study is to search for a way of extracting useful knowledge from stored data themselves to automatically apply performance optimization techniques, and more particularly indexing techniques. We have designed a tool that extracts frequent itemsets from a given workload to compute an index configuration that helps optimizing data access time. The experiments we performed showed that the index configurations generated by our tool allowed performance gains of 15% to 25% on a test database and a test data warehouse.
△ Less
Submitted 16 September, 2008;
originally announced September 2008.
-
A Join Index for XML Data Warehouses
Authors:
Hadj Mahboubi,
Kamel Aouiche,
Jérôme Darmont
Abstract:
XML data warehouses form an interesting basis for decision-support applications that exploit complex data. However, native-XML database management systems (DBMSs) currently bear limited performances and it is necessary to research for ways to optimize them. In this paper, we propose a new join index that is specifically adapted to the multidimensional architecture of XML warehouses. It eliminate…
▽ More
XML data warehouses form an interesting basis for decision-support applications that exploit complex data. However, native-XML database management systems (DBMSs) currently bear limited performances and it is necessary to research for ways to optimize them. In this paper, we propose a new join index that is specifically adapted to the multidimensional architecture of XML warehouses. It eliminates join operations while preserving the information contained in the original warehouse. A theoretical study and experimental results demonstrate the efficiency of our join index. They also show that native XML DBMSs can compete with XML-compatible, relational DBMSs when warehousing and analyzing XML data.
△ Less
Submitted 11 September, 2008;
originally announced September 2008.
-
Dynamic index selection in data warehouses
Authors:
Stéphane Azefack,
Kamel Aouiche,
Jérôme Darmont
Abstract:
Analytical queries defined on data warehouses are complex and use several join operations that are very costly, especially when run on very large data volumes. To improve response times, data warehouse administrators casually use indexing techniques. This task is nevertheless complex and fastidious. In this paper, we present an automatic, dynamic index selection method for data warehouses that i…
▽ More
Analytical queries defined on data warehouses are complex and use several join operations that are very costly, especially when run on very large data volumes. To improve response times, data warehouse administrators casually use indexing techniques. This task is nevertheless complex and fastidious. In this paper, we present an automatic, dynamic index selection method for data warehouses that is based on incremental frequent itemset mining from a given query workload. The main advantage of this approach is that it helps update the set of selected indexes when workload evolves instead of recreating it from scratch. Preliminary experimental results illustrate the efficiency of this approach, both in terms of performance enhancement and overhead.
△ Less
Submitted 11 September, 2008;
originally announced September 2008.
-
Materialized View Selection by Query Clustering in XML Data Warehouses
Authors:
Hadj Mahboubi,
Kamel Aouiche,
Jérôme Darmont
Abstract:
XML data warehouses form an interesting basis for decision-support applications that exploit complex data. However, native XML database management systems currently bear limited performances and it is necessary to design strategies to optimize them. In this paper, we propose an automatic strategy for the selection of XML materialized views that exploits a data mining technique, more precisely th…
▽ More
XML data warehouses form an interesting basis for decision-support applications that exploit complex data. However, native XML database management systems currently bear limited performances and it is necessary to design strategies to optimize them. In this paper, we propose an automatic strategy for the selection of XML materialized views that exploits a data mining technique, more precisely the clustering of the query workload. To validate our strategy, we implemented an XML warehouse modeled along the XCube specifications. We executed a workload of XQuery decision-support queries on this warehouse, with and without using our strategy. Our experimental results demonstrate its efficiency, even when queries are complex.
△ Less
Submitted 11 September, 2008;
originally announced September 2008.
-
Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes
Authors:
Owen Kaser,
Daniel Lemire,
Kamel Aouiche
Abstract:
Bitmap indexes must be compressed to reduce input/output costs and minimize CPU usage. To accelerate logical operations (AND, OR, XOR) over bitmaps, we use techniques based on run-length encoding (RLE), such as Word-Aligned Hybrid (WAH) compression. These techniques are sensitive to the order of the rows: a simple lexicographical sort can divide the index size by 9 and make indexes several times…
▽ More
Bitmap indexes must be compressed to reduce input/output costs and minimize CPU usage. To accelerate logical operations (AND, OR, XOR) over bitmaps, we use techniques based on run-length encoding (RLE), such as Word-Aligned Hybrid (WAH) compression. These techniques are sensitive to the order of the rows: a simple lexicographical sort can divide the index size by 9 and make indexes several times faster. We investigate reordering heuristics based on computed attribute-value histograms. Simply permuting the columns of the table based on these histograms can increase the sorting efficiency by 40%.
△ Less
Submitted 19 January, 2009; v1 submitted 14 August, 2008;
originally announced August 2008.
-
Tri de la table de faits et compression des index bitmaps avec alignement sur les mots
Authors:
Kamel Aouiche,
Daniel Lemire,
Owen Kaser
Abstract:
Bitmap indexes are frequently used to index multidimensional data. They rely mostly on sequential input/output. Bitmaps can be compressed to reduce input/output costs and minimize CPU usage. The most efficient compression techniques are based on run-length encoding (RLE), such as Word-Aligned Hybrid (WAH) compression. This type of compression accelerates logical operations (AND, OR) over the bit…
▽ More
Bitmap indexes are frequently used to index multidimensional data. They rely mostly on sequential input/output. Bitmaps can be compressed to reduce input/output costs and minimize CPU usage. The most efficient compression techniques are based on run-length encoding (RLE), such as Word-Aligned Hybrid (WAH) compression. This type of compression accelerates logical operations (AND, OR) over the bitmaps. However, run-length encoding is sensitive to the order of the facts. Thus, we propose to sort the fact tables. We review lexicographic, Gray-code, and block-wise sorting. We found that a lexicographic sort improves compression--sometimes generating indexes twice as small--and make indexes several times faster. While sorting takes time, this is partially offset by the fact that it is faster to index a sorted table. Column order is significant: it is generally preferable to put the columns having more distinct values at the beginning. A block-wise sort is much less efficient than a full sort. Moreover, we found that Gray-code sorting is not better than lexicographic sorting when using word-aligned compression.
△ Less
Submitted 14 August, 2008; v1 submitted 21 May, 2008;
originally announced May 2008.
-
Collaborative OLAP with Tag Clouds: Web 2.0 OLAP Formalism and Experimental Evaluation
Authors:
Kamel Aouiche,
Daniel Lemire,
Robert Godin
Abstract:
Increasingly, business projects are ephemeral. New Business Intelligence tools must support ad-lib data sources and quick perusal. Meanwhile, tag clouds are a popular community-driven visualization technique. Hence, we investigate tag-cloud views with support for OLAP operations such as roll-ups, slices, dices, clustering, and drill-downs. As a case study, we implemented an application where users…
▽ More
Increasingly, business projects are ephemeral. New Business Intelligence tools must support ad-lib data sources and quick perusal. Meanwhile, tag clouds are a popular community-driven visualization technique. Hence, we investigate tag-cloud views with support for OLAP operations such as roll-ups, slices, dices, clustering, and drill-downs. As a case study, we implemented an application where users can upload data and immediately navigate through its ad hoc dimensions. To support social networking, views can be easily shared and embedded in other Web sites. Algorithmically, our tag-cloud views are approximate range top-k queries over spontaneous data cubes. We present experimental evidence that iceberg cuboids provide adequate online approximations. We benchmark several browser-oblivious tag-cloud layout optimizations.
△ Less
Submitted 15 March, 2016; v1 submitted 11 October, 2007;
originally announced October 2007.
-
Data Mining-based Materialized View and Index Selection in Data Warehouses
Authors:
Kamel Aouiche,
Jérôme Darmont
Abstract:
Materialized views and indexes are physical structures for accelerating data access that are casually used in data warehouses. However, these data structures generate some maintenance overhead. They also share the same storage space. Most existing studies about materialized view and index selection consider these structures separately. In this paper, we adopt the opposite stance and couple mater…
▽ More
Materialized views and indexes are physical structures for accelerating data access that are casually used in data warehouses. However, these data structures generate some maintenance overhead. They also share the same storage space. Most existing studies about materialized view and index selection consider these structures separately. In this paper, we adopt the opposite stance and couple materialized view and index selection to take view-index interactions into account and achieve efficient storage space sharing. Candidate materialized views and indexes are selected through a data mining process. We also exploit cost models that evaluate the respective benefit of indexing and view materialization, and help select a relevant configuration of indexes and materialized views among the candidates. Experimental results show that our strategy performs better than an independent selection of materialized views and indexes.
△ Less
Submitted 10 July, 2007;
originally announced July 2007.
-
An Architecture Framework for Complex Data Warehouses
Authors:
Jérôme Darmont,
Omar Boussaid,
Jean-Christian Ralaivao,
Kamel Aouiche
Abstract:
Nowadays, many decision support applications need to exploit data that are not only numerical or symbolic, but also multimedia, multistructure, multisource, multimodal, and/or multiversion. We term such data complex data. Managing and analyzing complex data involves a lot of different issues regarding their structure, storage and processing, and metadata are a key element in all these processes.…
▽ More
Nowadays, many decision support applications need to exploit data that are not only numerical or symbolic, but also multimedia, multistructure, multisource, multimodal, and/or multiversion. We term such data complex data. Managing and analyzing complex data involves a lot of different issues regarding their structure, storage and processing, and metadata are a key element in all these processes. Such problems have been addressed by classical data warehousing (i.e., applied to "simple" data). However, data warehousing approaches need to be adapted for complex data. In this paper, we first propose a precise, though open, definition of complex data. Then we present a general architecture framework for warehousing complex data. This architecture heavily relies on metadata and domain-related knowledge, and rests on the XML language, which helps storing data, metadata and domain-specific knowledge altogether, and facilitates communication between the various warehousing processes.
△ Less
Submitted 10 July, 2007;
originally announced July 2007.
-
Sélection simultanée d'index et de vues matérialisées
Authors:
Nora Maiz,
Kamel Aouiche,
Jérôme Darmont
Abstract:
Indices and materialized views are physical structures that accelerate data access in data warehouses. However, these data structures generate some maintenance overhead. They also share the same storage space. The existing studies about index and materialized view selection consider these structures separately. In this paper, we adopt the opposite stance and couple index and materialized view se…
▽ More
Indices and materialized views are physical structures that accelerate data access in data warehouses. However, these data structures generate some maintenance overhead. They also share the same storage space. The existing studies about index and materialized view selection consider these structures separately. In this paper, we adopt the opposite stance and couple index and materialized view selection to take into account the interactions between them and achieve an efficient storage space sharing. We develop cost models that evaluate the respective benefit of indexing and view materialization. These cost models are then exploited by a greedy algorithm to select a relevant configuration of indices and materialized views. Experimental results show that our strategy performs better than the independent selection of indices and materialized views.
△ Less
Submitted 9 July, 2007;
originally announced July 2007.
-
Un index de jointure pour les entrepôts de données XML
Authors:
Hadj Mahboubi,
Kamel Aouiche,
Jérôme Darmont
Abstract:
XML data warehouses form an interesting basis for decision-support applications that exploit heterogeneous data from multiple sources. However, XML-native database systems currently bear limited performances and it is necessary to research ways to optimize them. In this paper, we propose a new index that is specifically adapted to the multidimensional architecture of XML warehouses and eliminate…
▽ More
XML data warehouses form an interesting basis for decision-support applications that exploit heterogeneous data from multiple sources. However, XML-native database systems currently bear limited performances and it is necessary to research ways to optimize them. In this paper, we propose a new index that is specifically adapted to the multidimensional architecture of XML warehouses and eliminates join operations, while preserving the information contained in the original warehouse. A theoretical study and experimental results demonstrate the efficiency of our index, even when queries are complex.
△ Less
Submitted 9 July, 2007;
originally announced July 2007.
-
Espaces de représentation multidimensionnels dédiés à la visualisation
Authors:
Riadh Ben Messaoud,
Kamel Aouiche,
Cécile Favre
Abstract:
In decision-support systems, the visual component is important for On Line Analysis Processing (OLAP). In this paper, we propose a new approach that faces the visualization problem due to data sparsity. We use the results of a Multiple Correspondence Analysis (MCA) to reduce the negative effect of sparsity by organizing differently data cube cells. Our approach does not reduce sparsity, however…
▽ More
In decision-support systems, the visual component is important for On Line Analysis Processing (OLAP). In this paper, we propose a new approach that faces the visualization problem due to data sparsity. We use the results of a Multiple Correspondence Analysis (MCA) to reduce the negative effect of sparsity by organizing differently data cube cells. Our approach does not reduce sparsity, however it tries to build relevant representation spaces where facts are efficiently gathered. In order to evaluate our approach, we propose an homogeneity criterion based on geometric neighborhood of cells. The obtained experimental results have shown the efficiency of our method.
△ Less
Submitted 9 July, 2007;
originally announced July 2007.
-
Vers l'auto-administration des entrepôts de données
Authors:
Kamel Aouiche,
Jérôme Darmont
Abstract:
With the wide development of databases in general and data warehouses in particular, it is important to reduce the tasks that a database administrator must perform manually. The idea of using data mining techniques to extract useful knowledge for administration from the data themselves has existed for some years. However, little research has been achieved. The aim of this study is to search for…
▽ More
With the wide development of databases in general and data warehouses in particular, it is important to reduce the tasks that a database administrator must perform manually. The idea of using data mining techniques to extract useful knowledge for administration from the data themselves has existed for some years. However, little research has been achieved. The aim of this study is to search for a way of extracting useful knowledge from stored data to automatically apply performance optimization techniques, and more particularly indexing techniques. We have designed a tool that extracts frequent itemsets from a given workload to compute an index configuration that helps optimizing data access time. The experiments we performed showed that the index configurations generated by our tool allowed performance gains of 15% to 25% on a test database and a test data warehouse.
△ Less
Submitted 26 April, 2007;
originally announced April 2007.
-
Clustering-Based Materialized View Selection in Data Warehouses
Authors:
Kamel Aouiche,
Pierre-Emmanuel Jouve,
Jerome Darmont
Abstract:
Materialized view selection is a non-trivial task. Hence, its complexity must be reduced. A judicious choice of views must be cost-driven and influenced by the workload experienced by the system. In this paper, we propose a framework for materialized view selection that exploits a data mining technique (clustering), in order to determine clusters of similar queries. We also propose a view mergin…
▽ More
Materialized view selection is a non-trivial task. Hence, its complexity must be reduced. A judicious choice of views must be cost-driven and influenced by the workload experienced by the system. In this paper, we propose a framework for materialized view selection that exploits a data mining technique (clustering), in order to determine clusters of similar queries. We also propose a view merging algorithm that builds a set of candidate views, as well as a greedy process for selecting a set of views to materialize. This selection is based on cost models that evaluate the cost of accessing data using views and the cost of storing these views. To validate our strategy, we executed a workload of decision-support queries on a test data warehouse, with and without using our strategy. Our experimental results demonstrate its efficiency, even when storage space is limited.
△ Less
Submitted 23 March, 2007;
originally announced March 2007.
-
Automatic Selection of Bitmap Join Indexes in Data Warehouses
Authors:
Kamel Aouiche,
Jerome Darmont,
Omar Boussaid,
Fadila Bentayeb
Abstract:
The queries defined on data warehouses are complex and use several join operations that induce an expensive computational cost. This cost becomes even more prohibitive when queries access very large volumes of data. To improve response time, data warehouse administrators generally use indexing techniques such as star join indexes or bitmap join indexes. This task is nevertheless complex and fast…
▽ More
The queries defined on data warehouses are complex and use several join operations that induce an expensive computational cost. This cost becomes even more prohibitive when queries access very large volumes of data. To improve response time, data warehouse administrators generally use indexing techniques such as star join indexes or bitmap join indexes. This task is nevertheless complex and fastidious. Our solution lies in the field of data warehouse auto-administration. In this framework, we propose an automatic index selection strategy. We exploit a data mining technique ; more precisely frequent itemset mining, in order to determine a set of candidate indexes from a given workload. Then, we propose several cost models allowing to create an index configuration composed by the indexes providing the best profit. These models evaluate the cost of accessing data using bitmap join indexes, and the cost of updating and storing these indexes.
△ Less
Submitted 23 March, 2007;
originally announced March 2007.
-
A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP
Authors:
Kamel Aouiche,
Daniel Lemire
Abstract:
A data warehouse cannot materialize all possible views, hence we must estimate quickly, accurately, and reliably the size of views to determine the best candidates for materialization. Many available techniques for view-size estimation make particular statistical assumptions and their error can be large. Comparatively, unassuming probabilistic techniques are slower, but they estimate accurately…
▽ More
A data warehouse cannot materialize all possible views, hence we must estimate quickly, accurately, and reliably the size of views to determine the best candidates for materialization. Many available techniques for view-size estimation make particular statistical assumptions and their error can be large. Comparatively, unassuming probabilistic techniques are slower, but they estimate accurately and reliability very large view sizes using little memory. We compare five unassuming hashing-based view-size estimation techniques including Stochastic Probabilistic Counting and LogLog Probabilistic Counting. Our experiments show that only Generalized Counting, Gibbons-Tirthapura, and Adaptive Counting provide universally tight estimates irrespective of the size of the view; of those, only Adaptive Counting remains constantly fast as we increase the memory budget.
△ Less
Submitted 8 December, 2008; v1 submitted 13 March, 2007;
originally announced March 2007.
-
Unasssuming View-Size Estimation Techniques in OLAP
Authors:
Kamel Aouiche,
Daniel Lemire
Abstract:
Even if storage was infinite, a data warehouse could not materialize all possible views due to the running time and update requirements. Therefore, it is necessary to estimate quickly, accurately, and reliably the size of views. Many available techniques make particular statistical assumptions and their error can be quite large. Unassuming techniques exist, but typically assume we have independe…
▽ More
Even if storage was infinite, a data warehouse could not materialize all possible views due to the running time and update requirements. Therefore, it is necessary to estimate quickly, accurately, and reliably the size of views. Many available techniques make particular statistical assumptions and their error can be quite large. Unassuming techniques exist, but typically assume we have independent hashing for which there is no known practical implementation. We adapt an unassuming estimator due to Gibbons and Tirthapura: its theoretical bounds do not make unpractical assumptions. We compare this technique experimentally with stochastic probabilistic counting, LogLog probabilistic counting, and multifractal statistical models. Our experiments show that we can reliably and accurately (within 10%, 19 times out 20) estimate view sizes over large data sets (1.5 GB) within minutes, using almost no memory. However, only Gibbons-Tirthapura provides universally tight estimates irrespective of the size of the view. For large views, probabilistic counting has a small edge in accuracy, whereas the competitive sampling-based method (multifractal) we tested is an order of magnitude faster but can sometimes provide poor estimates (relative error of 100%). In our tests, LogLog probabilistic counting is not competitive. Experimental validation on the US Census 1990 data set and on the Transaction Processing Performance (TPC H) data set is provided.
△ Less
Submitted 8 December, 2008; v1 submitted 12 March, 2007;
originally announced March 2007.