-
Data clustering: an essential technique in data science
Authors:
Tai Dinh,
Wong Hauchi,
Daniil Lisik,
Michal Koren,
Dat Tran,
Philip S. Yu,
Joaquín Torres-Sospedra
Abstract:
This paper explores the critical role of data clustering in data science, emphasizing its methodologies, tools, and diverse applications. Traditional techniques, such as partitional and hierarchical clustering, are analyzed alongside advanced approaches such as data stream, density-based, graph-based, and model-based clustering for handling complex structured datasets. The paper highlights key principles underpinning clustering, outlines widely used tools and frameworks, introduces the workflow of clustering in data science, discusses challenges in practical implementation, and examines various applications of clustering. By focusing on these foundations and applications, the discussion underscores clustering's transformative potential. The paper concludes with insights into future research directions, emphasizing clustering's role in driving innovation and enabling data-driven decision-making.
Submitted 30 January, 2025; v1 submitted 24 December, 2024;
originally announced December 2024.
-
UJI Probes: Dataset of Wi-Fi Probe Requests
Authors:
Tomáš Bravenec,
Joaquín Torres-Sospedra,
Michael Gould,
Tomas Fryza
Abstract:
This paper focuses on the creation of a new, publicly available Wi-Fi probe request dataset. Probe requests belong to the family of management frames used by the 802.11 (Wi-Fi) protocol. As the situation changes year by year and technology improves, probe request studies need to be carried out on up-to-date data. We provide a month-long probe request capture in an office environment, covering work days, weekends, and holidays, and consisting of over 1 400 000 probe requests. We provide a description of all the important aspects of the dataset. Apart from the raw packet capture, we also provide a Radio Map (RM) of the office to ensure that users of the dataset have all the possible information about the environment. To protect privacy, user information in the dataset is anonymized. This anonymization is done in a way that protects the privacy of users while preserving the ability to analyze the dataset to almost the same level as raw data. Furthermore, we showcase several possible use cases for the dataset, such as presence detection, temporal Received Signal Strength Indicator (RSSI) stability, and privacy protection evaluation.
Submitted 8 December, 2023; v1 submitted 20 July, 2023;
originally announced August 2023.
-
SURIMI: Supervised Radio Map Augmentation with Deep Learning and a Generative Adversarial Network for Fingerprint-based Indoor Positioning
Authors:
Darwin Quezada-Gaibor,
Joaquín Torres-Sospedra,
Jari Nurmi,
Yevgeni Koucheryavy,
Joaquín Huerta
Abstract:
Indoor Positioning based on Machine Learning has drawn increasing attention in both academia and industry, as meaningful information can be extracted from the reference data. Many researchers are using supervised, semi-supervised, and unsupervised Machine Learning models to reduce the positioning error and offer reliable solutions to end-users. In this article, we propose a new architecture that combines a Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and a Generative Adversarial Network (GAN) in order to increase the training data and thus improve the positioning accuracy. The proposed combination of supervised and unsupervised models was tested on 17 public datasets, providing an extensive analysis of its performance. As a result, the positioning error has been reduced in more than 70% of them.
Submitted 13 July, 2022;
originally announced July 2022.
-
What Your Wearable Devices Revealed About You and Possibilities of Non-Cooperative 802.11 Presence Detection During Your Last IPIN Visit
Authors:
Tomas Bravenec,
Joaquín Torres-Sospedra,
Michael Gould,
Tomas Fryza
Abstract:
The focus on privacy-related measures regarding wireless networks has grown in the last couple of years. This is especially important with technologies like Wi-Fi or Bluetooth, which are all around us and which our smartphones use not just for connection to the internet or other devices, but for localization purposes as well. In this paper, we analyze and evaluate probe request frames of the 802.11 wireless protocol captured during the 11th International Conference on Indoor Positioning and Indoor Navigation (IPIN) 2021. We explore the temporal occupancy of the conference space during four days of the conference and non-cooperatively track the presence of devices in the proximity of the session rooms using 802.11 management frames, with and without MAC address randomization. We carried out this analysis without trying to reveal the identity of the users or in any way reverse the MAC address randomization. As a result of the analysis, we detected that there are still many devices not adopting MAC randomization, either because it is not implemented or because users disabled it. In addition, many devices can be easily tracked despite employing MAC randomization.
Submitted 7 November, 2022; v1 submitted 11 July, 2022;
originally announced July 2022.
-
Exploration of User Privacy in 802.11 Probe Requests with MAC Address Randomization Using Temporal Pattern Analysis
Authors:
Tomas Bravenec,
Joaquín Torres-Sospedra,
Michael Gould,
Tomas Fryza
Abstract:
Wireless networks have become an integral part of our daily lives, and lately there is increased concern about privacy and protecting the identity of individual users. In this paper we address the evolution of privacy measures in Wi-Fi probe request frames. We focus on the lack of privacy measures before the implementation of MAC Address Randomization, and on the way anti-tracking measures evolved throughout the last decade. We do not try to reverse MAC address randomization to get the real address of the device, but instead analyse the possibility of further tracking/localization without needing the real MAC address of the specific users. To gain better analysis results, we introduce a temporal pattern matching approach to the identification of devices using randomized MAC addresses.
Submitted 22 June, 2022;
originally announced June 2022.
-
Data Cleansing for Indoor Positioning Wi-Fi Fingerprinting Datasets
Authors:
Darwin Quezada-Gaibor,
Lucie Klus,
Joaquín Torres-Sospedra,
Elena Simona Lohan,
Jari Nurmi,
Carlos Granell,
Joaquín Huerta
Abstract:
The number of wearable and IoT devices requiring positioning and localisation services grows exponentially every year. This rapid growth also produces millions of data entries that need to be pre-processed prior to being used in any indoor positioning system to ensure the data quality and provide a high Quality of Service (QoS) to the end-user. In this paper, we offer a novel and straightforward data cleansing algorithm for WLAN fingerprinting radio maps. This algorithm is based on the correlation among fingerprints using the Received Signal Strength (RSS) values and the Access Point (AP) identifiers. We use those to compute the correlation among all samples in the dataset and remove fingerprints with a low level of correlation from the dataset. We evaluated the proposed method on 14 independent publicly available datasets. As a result, an average of 14% of fingerprints were removed from the datasets. The 2D positioning error was reduced by 2.7% and the 3D positioning error by 5.3%, with a slight increase in the floor hit rate of 1.2% on average. Consequently, the average speed of position prediction was also increased by 14%.
Submitted 4 May, 2022;
originally announced May 2022.
-
Lightweight Hybrid CNN-ELM Model for Multi-building and Multi-floor Classification
Authors:
Darwin Quezada-Gaibor,
Joaquín Torres-Sospedra,
Jari Nurmi,
Yevgeni Koucheryavy,
Joaquín Huerta
Abstract:
Machine learning models have become an essential tool in current indoor positioning solutions, given their high capability to extract meaningful information from the environment. Convolutional neural networks (CNNs) are among the most used neural networks (NNs) because they are capable of learning complex patterns from the input data. Another model used in indoor positioning solutions is the Extreme Learning Machine (ELM), which provides acceptable generalization performance as well as fast learning. In this paper, we offer a lightweight combination of CNN and ELM, which provides a quick and accurate classification of building and floor, suitable for power- and resource-constrained devices. As a result, the proposed model is 58% faster than the benchmark, with a slight improvement in the classification accuracy (by less than 1%).
Submitted 21 April, 2022;
originally announced April 2022.
-
Towards Ubiquitous Indoor Positioning: Comparing Systems across Heterogeneous Datasets
Authors:
Joaquín Torres-Sospedra,
Ivo Silva,
Lucie Klus,
Darwin Quezada-Gaibor,
Antonino Crivello,
Paolo Barsocchi,
Cristiano Pendão,
Elena Simona Lohan,
Jari Nurmi,
Adriano Moreira
Abstract:
The evaluation of Indoor Positioning Systems (IPS) mostly relies on local deployments in the researchers' or partners' facilities. The complexity of preparing comprehensive experiments, collecting data, and considering multiple scenarios usually limits the evaluation area and, therefore, the assessment of the proposed systems. The requirements and features of controlled experiments cannot be generalized, since the use of the same sensors or anchor density cannot be guaranteed. The dawn of datasets is pushing IPS evaluation to a level similar to that of machine-learning models, where new proposals are evaluated over many heterogeneous datasets. This paper proposes a way to evaluate IPSs in multiple scenarios, which is validated with three use cases. The results show that the proposed aggregation of the evaluation metric values is a useful tool for high-level comparison of IPSs.
Submitted 20 September, 2021;
originally announced September 2021.