On the Use of Relative Validity Indices for Comparing Clustering Approaches
Authors:
Luke W. Yerbury,
Ricardo J. G. B. Campello,
G. C. Livingston Jr,
Mark Goldsworthy,
Lachlan O'Neil
Abstract:
Relative Validity Indices (RVIs) such as the Silhouette Width Criterion and Davies-Bouldin indices are the most widely used tools for evaluating and optimising clustering outcomes. Traditionally, their ability to rank collections of candidate dataset partitions has been used to guide the selection of the number of clusters, and to compare partitions from different clustering algorithms. However, there is a growing trend in the literature to use RVIs when selecting a Similarity Paradigm (SP) for clustering: the combination of normalisation procedure, representation method, and distance measure that determines how the object dissimilarities used in clustering are computed. Despite the growing prevalence of this practice, there has been no empirical or theoretical investigation into the suitability of RVIs for this purpose. Moreover, since RVIs are themselves computed from object dissimilarities, it remains unclear how they would need to be implemented for fair comparisons of different SPs. This study presents the first comprehensive investigation into the reliability of RVIs for SP selection. We conducted extensive experiments with seven popular RVIs on over 2.7 million clustering partitions of synthetic and real-world datasets, encompassing feature-vector and time-series data. We identified fundamental conceptual limitations undermining the use of RVIs for SP selection, and our empirical findings confirmed this predicted unsuitability. Among our recommendations, we suggest instead that practitioners select SPs by using external validation on high-quality labelled datasets or carefully designed outcome-oriented objective criteria, both of which should be informed by careful consideration of dataset characteristics and domain requirements. Our findings have important implications for clustering methodology and evaluation, suggesting the need for more rigorous approaches to SP selection.
Submitted 20 November, 2024; v1 submitted 16 April, 2024;
originally announced April 2024.
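The abstract above notes that RVIs are computed from object dissimilarities, which complicates comparisons across Similarity Paradigms. The following is a minimal sketch of that concern, not the authors' experimental setup: it scores one fixed partition with a hand-written mean silhouette width under two distance measures. The toy points, the labels, and the function names are all assumptions made for illustration, and the sketch assumes at least two clusters.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def silhouette(points, labels, dist):
    """Mean silhouette width of a partition under dissimilarity `dist`.

    Assumes the partition has at least two clusters.
    """
    clusters = sorted(set(labels))
    widths = []
    for i, p in enumerate(points):
        # Mean dissimilarity from p to every cluster, excluding p itself.
        mean_to = {}
        for c in clusters:
            members = [q for j, q in enumerate(points)
                       if labels[j] == c and j != i]
            if members:
                mean_to[c] = sum(dist(p, q) for q in members) / len(members)
        own = labels[i]
        if own not in mean_to:
            # Singleton cluster: silhouette width is conventionally 0.
            widths.append(0.0)
            continue
        a = mean_to[own]                                    # cohesion
        b = min(v for c, v in mean_to.items() if c != own)  # separation
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(widths) / len(widths)

# Hypothetical toy data: two well-separated groups, one fixed partition.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 1, 1, 1]

s_euc = silhouette(points, labels, euclidean)
s_man = silhouette(points, labels, manhattan)
print(f"Euclidean: {s_euc:.3f}  Manhattan: {s_man:.3f}")
```

The partition is identical in both calls, yet the two scores differ because the index inherits the geometry of whichever distance it is given. A higher score under one distance therefore does not by itself show that the distance yields a better clustering.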
A smart building semantic platform to enable data re-use in energy analytics applications: the Data Clearing House
Authors:
Daniel Hugo,
John McCulloch,
Akram Hameed,
Will Borghei,
Martin Grimeland,
Verity Felstead,
Mark Goldsworthy
Abstract:
Systems in the built environment continuously emit time series data about resource usage (e.g., energy and water), embedded electrical generation/storage, status of equipment, patterns of building occupancy, and readings from IoT sensors. This presents opportunities for new analytics and supervisory control applications that help reduce greenhouse gas emissions due to energy demand, if the barrier of data heterogeneity can be overcome. Semantic models of buildings -- representing structure, integrated equipment, and the many internal connections -- can help achieve interoperable data re-use by describing overall context, in addition to metadata. In this paper, we describe the Data Clearing House (DCH), a semantic building platform that hosts sensor data, building models, and analytics applications. The platform supports the key phases in the lifecycle of semantic building data, which include: cost-effective ingestion of Building Management System (BMS), IoT, metering and meteorological time series data from a wide range of open and proprietary systems; importing and validating semantic models of sites and buildings using the Brick Schema; interacting with a discovery API via a high-level domain-specific query language; and deploying applications to modelled buildings. Having onboarded multiple buildings belonging to our own organisation and external partners, we are able to comment on the challenges to the success of this approach. As an example use-case of the semantic building platform, we describe a measurement and verification (M&V) application implementing the 'whole facility' (Option C) method of the International Performance Measurement and Verification Protocol (IPMVP) for evaluating electrical metering data. This compares energy consumption between nominated baseline and analysis time periods, to quantify the energy savings achieved after implementing an intervention on a site.
Submitted 20 November, 2023;
originally announced November 2023.
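The baseline-versus-analysis comparison described in the abstract above can be sketched as follows. This is a simplified illustration, not the DCH application or a complete IPMVP Option C implementation: it compares mean daily consumption between two periods and omits the adjustments (for example, for weather or occupancy) that a real M&V analysis would normally apply. The meter readings, the period boundaries, and the function names are hypothetical.

```python
from datetime import date

def mean_daily_kwh(readings, start, end):
    """Mean of daily meter readings (date, kWh) with start <= date <= end."""
    values = [kwh for day, kwh in readings if start <= day <= end]
    if not values:
        raise ValueError("no readings in the requested period")
    return sum(values) / len(values)

def unadjusted_savings(readings, baseline, analysis):
    """Compare mean daily consumption between two (start, end) periods.

    No routine or non-routine adjustments are applied (an assumption
    made to keep the sketch short).
    """
    base = mean_daily_kwh(readings, *baseline)
    post = mean_daily_kwh(readings, *analysis)
    return {
        "baseline_daily_kwh": base,
        "analysis_daily_kwh": post,
        "daily_savings_kwh": base - post,
        "savings_percent": 100.0 * (base - post) / base,
    }

# Hypothetical whole-facility electricity readings (date, kWh per day).
readings = [
    (date(2023, 1, 1), 100.0), (date(2023, 1, 2), 110.0),
    (date(2023, 1, 3), 90.0), (date(2023, 1, 4), 100.0),
    (date(2023, 2, 1), 80.0), (date(2023, 2, 2), 85.0),
    (date(2023, 2, 3), 75.0), (date(2023, 2, 4), 80.0),
]

result = unadjusted_savings(
    readings,
    baseline=(date(2023, 1, 1), date(2023, 1, 4)),
    analysis=(date(2023, 2, 1), date(2023, 2, 4)),
)
print(result)
```

Averaging per day rather than summing lets the two periods differ in length, which is one simple way to keep the comparison meaningful when the nominated periods are not the same size.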