-
Digital Shielding for Cross-Domain Wi-Fi Signal Adaptation using Relativistic Average Generative Adversarial Network
Authors:
Danilo Avola,
Federica Bruni,
Gian Luca Foresti,
Daniele Pannone,
Amedeo Ranaldi
Abstract:
Wi-Fi sensing uses radio-frequency signals from Wi-Fi devices to analyze environments, enabling tasks such as tracking people, detecting intrusions, and recognizing gestures. The rise of this technology is driven by the IEEE 802.11bf standard and growing demand for tools that can ensure privacy and operate through obstacles. However, the performance of Wi-Fi sensing is heavily influenced by enviro…
▽ More
Wi-Fi sensing uses radio-frequency signals from Wi-Fi devices to analyze environments, enabling tasks such as tracking people, detecting intrusions, and recognizing gestures. The rise of this technology is driven by the IEEE 802.11bf standard and growing demand for tools that can ensure privacy and operate through obstacles. However, the performance of Wi-Fi sensing is heavily influenced by environmental conditions, especially when extracting spatial and temporal features from the surrounding scene. A key challenge is achieving robust generalization across domains, ensuring stable performance even when the sensing environment changes significantly. This paper introduces a novel deep learning model for cross-domain adaptation of Wi-Fi signals, inspired by physical signal shielding. The model uses a Relativistic average Generative Adversarial Network (RaGAN) with Bidirectional Long Short-Term Memory (Bi-LSTM) architectures for both the generator and discriminator. To simulate physical shielding, an acrylic box lined with electromagnetic shielding fabric was constructed, mimicking a Faraday cage. Wi-Fi signal spectra were collected from various materials both inside (domain-free) and outside (domain-dependent) the box to train the model. A multi-class Support Vector Machine (SVM) was trained on domain-free spectra and tested on signals denoised by the RaGAN. The system achieved 96% accuracy and demonstrated strong material discrimination capabilities, offering potential for use in security applications to identify concealed objects based on their composition.
△ Less
Submitted 29 April, 2025;
originally announced April 2025.
-
Autoencoder Models for Point Cloud Environmental Synthesis from WiFi Channel State Information: A Preliminary Study
Authors:
Daniele Pannone,
Danilo Avola
Abstract:
This paper introduces a deep learning framework for generating point clouds from WiFi Channel State Information data. We employ a two-stage autoencoder approach: a PointNet autoencoder with convolutional layers for point cloud generation, and a Convolutional Neural Network autoencoder to map CSI data to a matching latent space. By aligning these latent spaces, our method enables accurate environme…
▽ More
This paper introduces a deep learning framework for generating point clouds from WiFi Channel State Information data. We employ a two-stage autoencoder approach: a PointNet autoencoder with convolutional layers for point cloud generation, and a Convolutional Neural Network autoencoder to map CSI data to a matching latent space. By aligning these latent spaces, our method enables accurate environmental point cloud reconstruction from WiFi data. Experimental results validate the effectiveness of our approach, highlighting its potential for wireless sensing and environmental mapping applications.
△ Less
Submitted 29 April, 2025;
originally announced April 2025.
-
Semantically Guided Representation Learning For Action Anticipation
Authors:
Anxhelo Diko,
Danilo Avola,
Bardh Prenkaj,
Federico Fontana,
Luigi Cinque
Abstract:
Action anticipation is the task of forecasting future activity from a partially observed sequence of events. However, this task is exposed to intrinsic future uncertainty and the difficulty of reasoning upon interconnected actions. Unlike previous works that focus on extrapolating better visual and temporal information, we concentrate on learning action representations that are aware of their sema…
▽ More
Action anticipation is the task of forecasting future activity from a partially observed sequence of events. However, this task is exposed to intrinsic future uncertainty and the difficulty of reasoning upon interconnected actions. Unlike previous works that focus on extrapolating better visual and temporal information, we concentrate on learning action representations that are aware of their semantic interconnectivity based on prototypical action patterns and contextual co-occurrences. To this end, we propose the novel Semantically Guided Representation Learning (S-GEAR) framework. S-GEAR learns visual action prototypes and leverages language models to structure their relationship, inducing semanticity. To gather insights on S-GEAR's effectiveness, we test it on four action anticipation benchmarks, obtaining improved results compared to previous works: +3.5, +2.7, and +3.5 absolute points on Top-1 Accuracy on Epic-Kitchen 55, EGTEA Gaze+ and 50 Salads, respectively, and +0.8 on Top-5 Recall on Epic-Kitchens 100. We further observe that S-GEAR effectively transfers the geometric associations between actions from language to visual prototypes. Finally, S-GEAR opens new research frontiers in anticipation tasks by demonstrating the intricate impact of action semantic interconnectivity.
△ Less
Submitted 2 July, 2024;
originally announced July 2024.
-
ReViT: Enhancing Vision Transformers Feature Diversity with Attention Residual Connections
Authors:
Anxhelo Diko,
Danilo Avola,
Marco Cascio,
Luigi Cinque
Abstract:
Vision Transformer (ViT) self-attention mechanism is characterized by feature collapse in deeper layers, resulting in the vanishing of low-level visual features. However, such features can be helpful to accurately represent and identify elements within an image and increase the accuracy and robustness of vision-based recognition systems. Following this rationale, we propose a novel residual attent…
▽ More
Vision Transformer (ViT) self-attention mechanism is characterized by feature collapse in deeper layers, resulting in the vanishing of low-level visual features. However, such features can be helpful to accurately represent and identify elements within an image and increase the accuracy and robustness of vision-based recognition systems. Following this rationale, we propose a novel residual attention learning method for improving ViT-based architectures, increasing their visual feature diversity and model robustness. In this way, the proposed network can capture and preserve significant low-level features, providing more details about the elements within the scene being analyzed. The effectiveness and robustness of the presented method are evaluated on five image classification benchmarks, including ImageNet1k, CIFAR10, CIFAR100, Oxford Flowers-102, and Oxford-IIIT Pet, achieving improved performances. Additionally, experiments on the COCO2017 dataset show that the devised approach discovers and incorporates semantic and spatial relationships for object detection and instance segmentation when implemented into spatial-aware transformer models.
△ Less
Submitted 2 August, 2024; v1 submitted 17 February, 2024;
originally announced February 2024.
-
Medicinal Boxes Recognition on a Deep Transfer Learning Augmented Reality Mobile Application
Authors:
Danilo Avola,
Luigi Cinque,
Alessio Fagioli,
Gian Luca Foresti,
Marco Raoul Marini,
Alessio Mecca,
Daniele Pannone
Abstract:
Taking medicines is a fundamental aspect to cure illnesses. However, studies have shown that it can be hard for patients to remember the correct posology. More aggravating, a wrong dosage generally causes the disease to worsen. Although, all relevant instructions for a medicine are summarized in the corresponding patient information leaflet, the latter is generally difficult to navigate and unders…
▽ More
Taking medicines is a fundamental aspect to cure illnesses. However, studies have shown that it can be hard for patients to remember the correct posology. More aggravating, a wrong dosage generally causes the disease to worsen. Although, all relevant instructions for a medicine are summarized in the corresponding patient information leaflet, the latter is generally difficult to navigate and understand. To address this problem and help patients with their medication, in this paper we introduce an augmented reality mobile application that can present to the user important details on the framed medicine. In particular, the app implements an inference engine based on a deep neural network, i.e., a densenet, fine-tuned to recognize a medicinal from its package. Subsequently, relevant information, such as posology or a simplified leaflet, is overlaid on the camera feed to help a patient when taking a medicine. Extensive experiments to select the best hyperparameters were performed on a dataset specifically collected to address this task; ultimately obtaining up to 91.30\% accuracy as well as real-time capabilities.
△ Less
Submitted 26 March, 2022;
originally announced March 2022.
-
Analyzing EEG Data with Machine and Deep Learning: A Benchmark
Authors:
Danilo Avola,
Marco Cascio,
Luigi Cinque,
Alessio Fagioli,
Gian Luca Foresti,
Marco Raoul Marini,
Daniele Pannone
Abstract:
Nowadays, machine and deep learning techniques are widely used in different areas, ranging from economics to biology. In general, these techniques can be used in two ways: trying to adapt well-known models and architectures to the available data, or designing custom architectures. In both cases, to speed up the research process, it is useful to know which type of models work best for a specific pr…
▽ More
Nowadays, machine and deep learning techniques are widely used in different areas, ranging from economics to biology. In general, these techniques can be used in two ways: trying to adapt well-known models and architectures to the available data, or designing custom architectures. In both cases, to speed up the research process, it is useful to know which type of models work best for a specific problem and/or data type. By focusing on EEG signal analysis, and for the first time in literature, in this paper a benchmark of machine and deep learning for EEG signal classification is proposed. For our experiments we used the four most widespread models, i.e., multilayer perceptron, convolutional neural network, long short-term memory, and gated recurrent unit, highlighting which one can be a good starting point for developing EEG classification models.
△ Less
Submitted 18 March, 2022;
originally announced March 2022.
-
Human Silhouette and Skeleton Video Synthesis through Wi-Fi signals
Authors:
Danilo Avola,
Marco Cascio,
Luigi Cinque,
Alessio Fagioli,
Gian Luca Foresti
Abstract:
The increasing availability of wireless access points (APs) is leading towards human sensing applications based on Wi-Fi signals as support or alternative tools to the widespread visual sensors, where the signals enable to address well-known vision-related problems such as illumination changes or occlusions. Indeed, using image synthesis techniques to translate radio frequencies to the visible spe…
▽ More
The increasing availability of wireless access points (APs) is leading towards human sensing applications based on Wi-Fi signals as support or alternative tools to the widespread visual sensors, where the signals enable to address well-known vision-related problems such as illumination changes or occlusions. Indeed, using image synthesis techniques to translate radio frequencies to the visible spectrum can become essential to obtain otherwise unavailable visual data. This domain-to-domain translation is feasible because both objects and people affect electromagnetic waves, causing radio and optical frequencies variations. In literature, models capable of inferring radio-to-visual features mappings have gained momentum in the last few years since frequency changes can be observed in the radio domain through the channel state information (CSI) of Wi-Fi APs, enabling signal-based feature extraction, e.g., amplitude. On this account, this paper presents a novel two-branch generative neural network that effectively maps radio data into visual features, following a teacher-student design that exploits a cross-modality supervision strategy. The latter conditions signal-based features in the visual domain to completely replace visual data. Once trained, the proposed method synthesizes human silhouette and skeleton videos using exclusively Wi-Fi signals. The approach is evaluated on publicly available data, where it obtains remarkable results for both silhouette and skeleton videos generation, demonstrating the effectiveness of the proposed cross-modality supervision strategy.
△ Less
Submitted 11 March, 2022;
originally announced March 2022.
-
Study on Transfer Learning Capabilities for Pneumonia Classification in Chest-X-Rays Image
Authors:
Danilo Avola,
Andrea Bacciu,
Luigi Cinque,
Alessio Fagioli,
Marco Raoul Marini,
Riccardo Taiello
Abstract:
Over the last year, the severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) and its variants have highlighted the importance of screening tools with high diagnostic accuracy for new illnesses such as COVID-19. To that regard, deep learning approaches have proven as effective solutions for pneumonia classification, especially when considering chest-x-rays images. However, this lung infecti…
▽ More
Over the last year, the severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) and its variants have highlighted the importance of screening tools with high diagnostic accuracy for new illnesses such as COVID-19. To that regard, deep learning approaches have proven as effective solutions for pneumonia classification, especially when considering chest-x-rays images. However, this lung infection can also be caused by other viral, bacterial or fungi pathogens. Consequently, efforts are being poured toward distinguishing the infection source to help clinicians to diagnose the correct disease origin. Following this tendency, this study further explores the effectiveness of established neural network architectures on the pneumonia classification task through the transfer learning paradigm. To present a comprehensive comparison, 12 well-known ImageNet pre-trained models were fine-tuned and used to discriminate among chest-x-rays of healthy people, and those showing pneumonia symptoms derived from either a viral (i.e., generic or SARS-CoV-2) or bacterial source. Furthermore, since a common public collection distinguishing between such categories is currently not available, two distinct datasets of chest-x-rays images, describing the aforementioned sources, were combined and employed to evaluate the various architectures. The experiments were performed using a total of 6330 images split between train, validation and test sets. For all models, common classification metrics were computed (e.g., precision, f1-score) and most architectures obtained significant performances, reaching, among the others, up to 84.46% average f1-score when discriminating the 4 identified classes. Moreover, confusion matrices and activation maps computed via the Grad-CAM algorithm were also reported to present an informed discussion on the networks classifications.
△ Less
Submitted 23 April, 2022; v1 submitted 6 October, 2021;
originally announced October 2021.
-
SIRe-Networks: Convolutional Neural Networks Architectural Extension for Information Preservation via Skip/Residual Connections and Interlaced Auto-Encoders
Authors:
Danilo Avola,
Luigi Cinque,
Alessio Fagioli,
Gian Luca Foresti
Abstract:
Improving existing neural network architectures can involve several design choices such as manipulating the loss functions, employing a diverse learning strategy, exploiting gradient evolution at training time, optimizing the network hyper-parameters, or increasing the architecture depth. The latter approach is a straightforward solution, since it directly enhances the representation capabilities…
▽ More
Improving existing neural network architectures can involve several design choices such as manipulating the loss functions, employing a diverse learning strategy, exploiting gradient evolution at training time, optimizing the network hyper-parameters, or increasing the architecture depth. The latter approach is a straightforward solution, since it directly enhances the representation capabilities of a network; however, the increased depth generally incurs in the well-known vanishing gradient problem. In this paper, borrowing from different methods addressing this issue, we introduce an interlaced multi-task learning strategy, defined SIRe, to reduce the vanishing gradient in relation to the object classification task. The presented methodology directly improves a convolutional neural network (CNN) by preserving information from the input image through interlaced auto-encoders (AEs), and further refines the base network architecture by means of skip and residual connections. To validate the presented methodology, a simple CNN and various implementations of famous networks are extended via the SIRe strategy and extensively tested on five collections, i.e., MNIST, Fashion-MNIST, CIFAR-10, CIFAR-100, and Caltech-256; where the SIRe-extended architectures achieve significantly increased performances across all models and datasets, thus confirming the presented approach effectiveness.
△ Less
Submitted 26 October, 2022; v1 submitted 6 October, 2021;
originally announced October 2021.
-
3D Hand Pose and Shape Estimation from RGB Images for Keypoint-Based Hand Gesture Recognition
Authors:
Danilo Avola,
Luigi Cinque,
Alessio Fagioli,
Gian Luca Foresti,
Adriano Fragomeni,
Daniele Pannone
Abstract:
Estimating the 3D pose of a hand from a 2D image is a well-studied problem and a requirement for several real-life applications such as virtual reality, augmented reality, and hand gesture recognition. Currently, reasonable estimations can be computed from single RGB images, especially when a multi-task learning approach is used to force the system to consider the shape of the hand when its pose i…
▽ More
Estimating the 3D pose of a hand from a 2D image is a well-studied problem and a requirement for several real-life applications such as virtual reality, augmented reality, and hand gesture recognition. Currently, reasonable estimations can be computed from single RGB images, especially when a multi-task learning approach is used to force the system to consider the shape of the hand when its pose is determined. However, depending on the method used to represent the hand, the performance can drop considerably in real-life tasks, suggesting that stable descriptions are required to achieve satisfactory results. In this paper, we present a keypoint-based end-to-end framework for 3D hand and pose estimation and successfully apply it to the task of hand gesture recognition as a study case. Specifically, after a pre-processing step in which the images are normalized, the proposed pipeline uses a multi-task semantic feature extractor generating 2D heatmaps and hand silhouettes from RGB images, a viewpoint encoder to predict the hand and camera view parameters, a stable hand estimator to produce the 3D hand pose and shape, and a loss function to guide all of the components jointly during the learning phase. Tests were performed on a 3D pose and shape estimation benchmark dataset to assess the proposed framework, which obtained state-of-the-art performance. Our system was also evaluated on two hand-gesture recognition benchmark datasets and significantly outperformed other keypoint-based approaches, indicating that it is an effective solution that is able to generate stable 3D estimates for hand pose and shape.
△ Less
Submitted 9 May, 2022; v1 submitted 28 September, 2021;
originally announced September 2021.
-
Multimodal Feature Fusion and Knowledge-Driven Learning via Experts Consult for Thyroid Nodule Classification
Authors:
Danilo Avola,
Luigi Cinque,
Alessio Fagioli,
Sebastiano Filetti,
Giorgio Grani,
Emanuele RodolĂ
Abstract:
Computer-aided diagnosis (CAD) is becoming a prominent approach to assist clinicians spanning across multiple fields. These automated systems take advantage of various computer vision (CV) procedures, as well as artificial intelligence (AI) techniques, to formulate a diagnosis of a given image, e.g., computed tomography and ultrasound. Advances in both areas (CV and AI) are enabling ever increasin…
▽ More
Computer-aided diagnosis (CAD) is becoming a prominent approach to assist clinicians spanning across multiple fields. These automated systems take advantage of various computer vision (CV) procedures, as well as artificial intelligence (AI) techniques, to formulate a diagnosis of a given image, e.g., computed tomography and ultrasound. Advances in both areas (CV and AI) are enabling ever increasing performances of CAD systems, which can ultimately avoid performing invasive procedures such as fine-needle aspiration. In this study, a novel end-to-end knowledge-driven classification framework is presented. The system focuses on multimodal data generated by thyroid ultrasonography, and acts as a CAD system by providing a thyroid nodule classification into the benign and malignant categories. Specifically, the proposed system leverages cues provided by an ensemble of experts to guide the learning phase of a densely connected convolutional network (DenseNet). The ensemble is composed by various networks pretrained on ImageNet, including AlexNet, ResNet, VGG, and others. The previously computed multimodal feature parameters are used to create ultrasonography domain experts via transfer learning, decreasing, moreover, the number of samples required for training. To validate the proposed method, extensive experiments were performed, providing detailed performances for both the experts ensemble and the knowledge-driven DenseNet. As demonstrated by the results, the proposed system achieves relevant performances in terms of qualitative metrics for the thyroid nodule classification task, thus resulting in a great asset when formulating a diagnosis.
△ Less
Submitted 25 October, 2021; v1 submitted 28 May, 2020;
originally announced May 2020.
-
Wi-Fi Passive Person Re-Identification based on Channel State Information
Authors:
Danilo Avola,
Marco Cascio,
Luigi Cinque,
Daniele Pannone
Abstract:
With the increasing need for wireless data transfer, Wi-Fi networks have rapidly grown in recent years providing high throughput and easy deployment. Nowadays, Access Points (APs) can be found easily wherever we go, therefore Wi-Fi sensing applications have caught a great deal of interest from the research community. Since human presence and movement influence the Wi-Fi signals transmitted by APs,…
▽ More
With the increasing need for wireless data transfer, Wi-Fi networks have rapidly grown in recent years providing high throughput and easy deployment. Nowadays, Access Points (APs) can be found easily wherever we go, therefore Wi-Fi sensing applications have caught a great deal of interest from the research community. Since human presence and movement influence the Wi-Fi signals transmitted by APs, it is possible to exploit those signals for person Re-Identification (Re-ID) task. Traditional techniques for Wi-Fi sensing applications are usually based on the Received Signal Strength Indicator (RSSI) measurement. However, recently, due to the RSSI instability, the researchers in this field propose Channel State Information (CSI) measurement based methods. In this paper we explain how changes in Signal Noise Ratio (SNR), obtained from CSI measurements, combined with Neural Networks can be used for person Re-ID achieving remarkable preliminary results. Due to the lack of available public data in the current state-of-the-art to test such type of task, we acquired a dataset that properly fits the aforementioned task.
△ Less
Submitted 12 November, 2019;
originally announced November 2019.
-
Deep Temporal Analysis for Non-Acted Body Affect Recognition
Authors:
Danilo Avola,
Luigi Cinque,
Alessio Fagioli,
Gian Luca Foresti,
Cristiano Massaroni
Abstract:
Affective computing is a field of great interest in many computer vision applications, including video surveillance, behaviour analysis, and human-robot interaction. Most of the existing literature has addressed this field by analysing different sets of face features. However, in the last decade, several studies have shown how body movements can play a key role even in emotion recognition. The maj…
▽ More
Affective computing is a field of great interest in many computer vision applications, including video surveillance, behaviour analysis, and human-robot interaction. Most of the existing literature has addressed this field by analysing different sets of face features. However, in the last decade, several studies have shown how body movements can play a key role even in emotion recognition. The majority of these experiments on the body are performed by trained actors whose aim is to simulate emotional reactions. These unnatural expressions differ from the more challenging genuine emotions, thus invalidating the obtained results. In this paper, a solution for basic non-acted emotion recognition based on 3D skeleton and Deep Neural Networks (DNNs) is provided. The proposed work introduces three majors contributions. First, unlike the current state-of-the-art in non-acted body affect recognition, where only static or global body features are considered, in this work also temporal local movements performed by subjects in each frame are examined. Second, an original set of global and time-dependent features for body movement description is provided. Third, to the best of out knowledge, this is the first attempt to use deep learning methods for non-acted body affect recognition. Due to the novelty of the topic, only the UCLIC dataset is currently considered the benchmark for comparative tests. On the latter, the proposed method outperforms all the competitors.
△ Less
Submitted 23 July, 2019;
originally announced July 2019.
-
Exploiting Recurrent Neural Networks and Leap Motion Controller for Sign Language and Semaphoric Gesture Recognition
Authors:
Danilo Avola,
Marco Bernardi,
Luigi Cinque,
Gian Luca Foresti,
Cristiano Massaroni
Abstract:
In human interactions, hands are a powerful way of expressing information that, in some cases, can be used as a valid substitute for voice, as it happens in Sign Language. Hand gesture recognition has always been an interesting topic in the areas of computer vision and multimedia. These gestures can be represented as sets of feature vectors that change over time. Recurrent Neural Networks (RNNs) a…
▽ More
In human interactions, hands are a powerful way of expressing information that, in some cases, can be used as a valid substitute for voice, as it happens in Sign Language. Hand gesture recognition has always been an interesting topic in the areas of computer vision and multimedia. These gestures can be represented as sets of feature vectors that change over time. Recurrent Neural Networks (RNNs) are suited to analyse this type of sets thanks to their ability to model the long term contextual information of temporal sequences. In this paper, a RNN is trained by using as features the angles formed by the finger bones of human hands. The selected features, acquired by a Leap Motion Controller (LMC) sensor, have been chosen because the majority of human gestures produce joint movements that generate truly characteristic corners. A challenging subset composed by a large number of gestures defined by the American Sign Language (ASL) is used to test the proposed solution and the effectiveness of the selected angles. Moreover, the proposed method has been compared to other state of the art works on the SHREC dataset, thus demonstrating its superiority in hand gesture recognition accuracy.
△ Less
Submitted 28 March, 2018;
originally announced March 2018.
-
The UMCD Dataset
Authors:
Danilo Avola,
Gian Luca Foresti,
Niki Martinel,
Daniele Pannone,
Claudio Piciarelli
Abstract:
In recent years, the technological improvements of low-cost small-scale Unmanned Aerial Vehicles (UAVs) are promoting an ever-increasing use of them in different tasks. In particular, the use of small-scale UAVs is useful in all these low-altitude tasks in which common UAVs cannot be adopted, such as recurrent comprehensive view of wide environments, frequent monitoring of military areas, real-tim…
▽ More
In recent years, the technological improvements of low-cost small-scale Unmanned Aerial Vehicles (UAVs) are promoting an ever-increasing use of them in different tasks. In particular, the use of small-scale UAVs is useful in all these low-altitude tasks in which common UAVs cannot be adopted, such as recurrent comprehensive view of wide environments, frequent monitoring of military areas, real-time classification of static and moving entities (e.g., people, cars, etc.). These tasks can be supported by mosaicking and change detection algorithms achieved at low-altitude. Currently, public datasets for testing these algorithms are not available. This paper presents the UMCD dataset, the first collection of geo-referenced video sequences acquired at low-altitude for mosaicking and change detection purposes. Five reference scenarios are also reported.
△ Less
Submitted 5 April, 2017;
originally announced April 2017.