-
SiM3D: Single-instance Multiview Multimodal and Multisetup 3D Anomaly Detection Benchmark
Authors:
Alex Costanzino,
Pierluigi Zama Ramirez,
Luigi Lella,
Matteo Ragaglia,
Alessandro Oliva,
Giuseppe Lisanti,
Luigi Di Stefano
Abstract:
We propose SiM3D, the first benchmark considering the integration of multiview and multimodal information for comprehensive 3D anomaly detection and segmentation (ADS), where the task is to produce a voxel-based Anomaly Volume. Moreover, SiM3D focuses on a scenario of high interest in manufacturing: single-instance anomaly detection, where only one object, either real or synthetic, is available fo…
▽ More
We propose SiM3D, the first benchmark considering the integration of multiview and multimodal information for comprehensive 3D anomaly detection and segmentation (ADS), where the task is to produce a voxel-based Anomaly Volume. Moreover, SiM3D focuses on a scenario of high interest in manufacturing: single-instance anomaly detection, where only one object, either real or synthetic, is available for training. In this respect, SiM3D stands out as the first ADS benchmark that addresses the challenge of generalising from synthetic training data to real test data. SiM3D includes a novel multimodal multiview dataset acquired using top-tier industrial sensors and robots. The dataset features multiview high-resolution images (12 Mpx) and point clouds (7M points) for 333 instances of eight types of objects, alongside a CAD model for each type. We also provide manually annotated 3D segmentation GTs for anomalous test samples. To establish reference baselines for the proposed multiview 3D ADS task, we adapt prominent singleview methods and assess their performance using novel metrics that operate on Anomaly Volumes.
△ Less
Submitted 26 June, 2025;
originally announced June 2025.
-
Understanding Adversarial Training with Energy-based Models
Authors:
Mujtaba Hussain Mirza,
Maria Rosaria Briglia,
Filippo Bartolucci,
Senad Beadini,
Giuseppe Lisanti,
Iacopo Masi
Abstract:
We aim at using Energy-based Model (EBM) framework to better understand adversarial training (AT) in classifiers, and additionally to analyze the intrinsic generative capabilities of robust classifiers. By viewing standard classifiers through an energy lens, we begin by analyzing how the energies of adversarial examples, generated by various attacks, differ from those of the natural samples. The c…
▽ More
We aim at using Energy-based Model (EBM) framework to better understand adversarial training (AT) in classifiers, and additionally to analyze the intrinsic generative capabilities of robust classifiers. By viewing standard classifiers through an energy lens, we begin by analyzing how the energies of adversarial examples, generated by various attacks, differ from those of the natural samples. The central focus of our work is to understand the critical phenomena of Catastrophic Overfitting (CO) and Robust Overfitting (RO) in AT from an energy perspective. We analyze the impact of existing AT approaches on the energy of samples during training and observe that the behavior of the ``delta energy' -- change in energy between original sample and its adversarial counterpart -- diverges significantly when CO or RO occurs. After a thorough analysis of these energy dynamics and their relationship with overfitting, we propose a novel regularizer, the Delta Energy Regularizer (DER), designed to smoothen the energy landscape during training. We demonstrate that DER is effective in mitigating both CO and RO across multiple benchmarks. We further show that robust classifiers, when being used as generative models, have limits in handling trade-off between image quality and variability. We propose an improved technique based on a local class-wise principal component analysis (PCA) and energy-based guidance for better class-specific initialization and adaptive stopping, enhancing sample diversity and generation quality. Considering that we do not explicitly train for generative modeling, we achieve a competitive Inception Score (IS) and Fréchet inception distance (FID) compared to hybrid discriminative-generative models.
△ Less
Submitted 28 May, 2025;
originally announced May 2025.
-
What is Adversarial Training for Diffusion Models?
Authors:
Briglia Maria Rosaria,
Mujtaba Hussain Mirza,
Giuseppe Lisanti,
Iacopo Masi
Abstract:
We answer the question in the title, showing that adversarial training (AT) for diffusion models (DMs) fundamentally differs from classifiers: while AT in classifiers enforces output invariance, AT in DMs requires equivariance to keep the diffusion process aligned with the data distribution. AT is a way to enforce smoothness in the diffusion flow, improving robustness to outliers and corrupted dat…
▽ More
We answer the question in the title, showing that adversarial training (AT) for diffusion models (DMs) fundamentally differs from classifiers: while AT in classifiers enforces output invariance, AT in DMs requires equivariance to keep the diffusion process aligned with the data distribution. AT is a way to enforce smoothness in the diffusion flow, improving robustness to outliers and corrupted data. Unlike prior art, our method makes no assumptions about the noise model and integrates seamlessly into diffusion training by adding random noise, similar to randomized smoothing, or adversarial noise, akin to AT. This enables intrinsic capabilities such as handling noisy data, dealing with extreme variability such as outliers, preventing memorization, and improving robustness. We rigorously evaluate our approach with proof-of-concept datasets with known distributions in low- and high-dimensional space, thereby taking a perfect measure of errors; we further evaluate on standard benchmarks such as CIFAR-10, CelebA and LSUN Bedroom, showing strong performance under severe noise, data corruption, and iterative adversarial attacks.
△ Less
Submitted 27 May, 2025;
originally announced May 2025.
-
Scaling LLaNA: Advancing NeRF-Language Understanding Through Large-Scale Training
Authors:
Andrea Amaduzzi,
Pierluigi Zama Ramirez,
Giuseppe Lisanti,
Samuele Salti,
Luigi Di Stefano
Abstract:
Recent advances in Multimodal Large Language Models (MLLMs) have shown remarkable capabilities in understanding both images and 3D data, yet these modalities face inherent limitations in comprehensively representing object geometry and appearance. Neural Radiance Fields (NeRFs) have emerged as a promising alternative, encoding both geometric and photorealistic properties within the weights of a si…
▽ More
Recent advances in Multimodal Large Language Models (MLLMs) have shown remarkable capabilities in understanding both images and 3D data, yet these modalities face inherent limitations in comprehensively representing object geometry and appearance. Neural Radiance Fields (NeRFs) have emerged as a promising alternative, encoding both geometric and photorealistic properties within the weights of a simple Multi-Layer Perceptron (MLP). This work investigates the feasibility and effectiveness of ingesting NeRFs into an MLLM. We introduce LLaNA, the first MLLM able to perform new tasks such as NeRF captioning and Q\&A, by directly processing the weights of a NeRF's MLP. Notably, LLaNA is able to extract information about the represented objects without the need to render images or materialize 3D data structures. In addition, we build the first large-scale NeRF-language dataset, composed by more than 300K NeRFs trained on ShapeNet and Objaverse, with paired textual annotations that enable various NeRF-language tasks. Based on this dataset, we develop a benchmark to evaluate the NeRF understanding capability of our method. Results show that directly processing NeRF weights leads to better performance on NeRF-Language tasks compared to approaches that rely on either 2D or 3D representations derived from NeRFs.
△ Less
Submitted 18 April, 2025;
originally announced April 2025.
-
Perturb, Attend, Detect and Localize (PADL): Robust Proactive Image Defense
Authors:
Filippo Bartolucci,
Iacopo Masi,
Giuseppe Lisanti
Abstract:
Image manipulation detection and localization have received considerable attention from the research community given the blooming of Generative Models (GMs). Detection methods that follow a passive approach may overfit to specific GMs, limiting their application in real-world scenarios, due to the growing diversity of generative models. Recently, approaches based on a proactive framework have show…
▽ More
Image manipulation detection and localization have received considerable attention from the research community given the blooming of Generative Models (GMs). Detection methods that follow a passive approach may overfit to specific GMs, limiting their application in real-world scenarios, due to the growing diversity of generative models. Recently, approaches based on a proactive framework have shown the possibility of dealing with this limitation. However, these methods suffer from two main limitations, which raises concerns about potential vulnerabilities: i) the manipulation detector is not robust to noise and hence can be easily fooled; ii) the fact that they rely on fixed perturbations for image protection offers a predictable exploit for malicious attackers, enabling them to reverse-engineer and evade detection. To overcome this issue we propose PADL, a new solution able to generate image-specific perturbations using a symmetric scheme of encoding and decoding based on cross-attention, which drastically reduces the possibility of reverse engineering, even when evaluated with adaptive attack [31]. Additionally, PADL is able to pinpoint manipulated areas, facilitating the identification of specific regions that have undergone alterations, and has more generalization power than prior art on held-out generative models. Indeed, although being trained only on an attribute manipulation GAN model [15], our method generalizes to a range of unseen models with diverse architectural designs, such as StarGANv2, BlendGAN, DiffAE, StableDiffusion and StableDiffusionXL. Additionally, we introduce a novel evaluation protocol, which offers a fair evaluation of localisation performance in function of detection accuracy and better captures real-world scenarios.
△ Less
Submitted 26 September, 2024;
originally announced September 2024.
-
Learning to Be a Transformer to Pinpoint Anomalies
Authors:
Alex Costanzino,
Pierluigi Zama Ramirez,
Giuseppe Lisanti,
Luigi Di Stefano
Abstract:
To efficiently deploy strong, often pre-trained feature extractors, recent Industrial Anomaly Detection and Segmentation (IADS) methods process low-resolution images, e.g., 224x224 pixels, obtained by downsampling the original input images. However, while numerous industrial applications demand the identification of both large and small defects, downsampling the input image to a low resolution may…
▽ More
To efficiently deploy strong, often pre-trained feature extractors, recent Industrial Anomaly Detection and Segmentation (IADS) methods process low-resolution images, e.g., 224x224 pixels, obtained by downsampling the original input images. However, while numerous industrial applications demand the identification of both large and small defects, downsampling the input image to a low resolution may hinder a method's ability to pinpoint tiny anomalies. We propose a novel Teacher--Student paradigm to leverage strong pre-trained features while processing high-resolution input images very efficiently. The core idea concerns training two shallow MLPs (the Students) by nominal images so as to mimic the mappings between the patch embeddings induced by the self-attention layers of a frozen vision Transformer (the Teacher). Indeed, learning these mappings sets forth a challenging pretext task that small-capacity models are unlikely to accomplish on out-of-distribution data such as anomalous images. Our method can spot anomalies from high-resolution images and runs way faster than competitors, achieving state-of-the-art performance on MVTec AD and the best segmentation results on VisA. We also propose novel evaluation metrics to capture robustness to defect size, i.e., the ability to preserve good localisation from large anomalies to tiny ones. Evaluating our method also by these metrics reveals its neatly superior performance.
△ Less
Submitted 26 June, 2025; v1 submitted 4 July, 2024;
originally announced July 2024.
-
LLaNA: Large Language and NeRF Assistant
Authors:
Andrea Amaduzzi,
Pierluigi Zama Ramirez,
Giuseppe Lisanti,
Samuele Salti,
Luigi Di Stefano
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated an excellent understanding of images and 3D data. However, both modalities have shortcomings in holistically capturing the appearance and geometry of objects. Meanwhile, Neural Radiance Fields (NeRFs), which encode information within the weights of a simple Multi-Layer Perceptron (MLP), have emerged as an increasingly widespread modality t…
▽ More
Multimodal Large Language Models (MLLMs) have demonstrated an excellent understanding of images and 3D data. However, both modalities have shortcomings in holistically capturing the appearance and geometry of objects. Meanwhile, Neural Radiance Fields (NeRFs), which encode information within the weights of a simple Multi-Layer Perceptron (MLP), have emerged as an increasingly widespread modality that simultaneously encodes the geometry and photorealistic appearance of objects. This paper investigates the feasibility and effectiveness of ingesting NeRF into MLLM. We create LLaNA, the first general-purpose NeRF-language assistant capable of performing new tasks such as NeRF captioning and Q\&A. Notably, our method directly processes the weights of the NeRF's MLP to extract information about the represented objects without the need to render images or materialize 3D data structures. Moreover, we build a dataset of NeRFs with text annotations for various NeRF-language tasks with no human intervention. Based on this dataset, we develop a benchmark to evaluate the NeRF understanding capability of our method. Results show that processing NeRF weights performs favourably against extracting 2D or 3D representations from NeRFs.
△ Less
Submitted 22 November, 2024; v1 submitted 17 June, 2024;
originally announced June 2024.
-
Test Time Training for Industrial Anomaly Segmentation
Authors:
Alex Costanzino,
Pierluigi Zama Ramirez,
Mirko Del Moro,
Agostino Aiezzo,
Giuseppe Lisanti,
Samuele Salti,
Luigi Di Stefano
Abstract:
Anomaly Detection and Segmentation (AD&S) is crucial for industrial quality control. While existing methods excel in generating anomaly scores for each pixel, practical applications require producing a binary segmentation to identify anomalies. Due to the absence of labeled anomalies in many real scenarios, standard practices binarize these maps based on some statistics derived from a validation s…
▽ More
Anomaly Detection and Segmentation (AD&S) is crucial for industrial quality control. While existing methods excel in generating anomaly scores for each pixel, practical applications require producing a binary segmentation to identify anomalies. Due to the absence of labeled anomalies in many real scenarios, standard practices binarize these maps based on some statistics derived from a validation set containing only nominal samples, resulting in poor segmentation performance. This paper addresses this problem by proposing a test time training strategy to improve the segmentation performance. Indeed, at test time, we can extract rich features directly from anomalous samples to train a classifier that can discriminate defects effectively. Our general approach can work downstream to any AD&S method that provides an anomaly score map as output, even in multimodal settings. We demonstrate the effectiveness of our approach over baselines through extensive experimentation and evaluation on MVTec AD and MVTec 3D-AD.
△ Less
Submitted 4 April, 2024;
originally announced April 2024.
-
Multimodal Industrial Anomaly Detection by Crossmodal Feature Mapping
Authors:
Alex Costanzino,
Pierluigi Zama Ramirez,
Giuseppe Lisanti,
Luigi Di Stefano
Abstract:
The paper explores the industrial multimodal Anomaly Detection (AD) task, which exploits point clouds and RGB images to localize anomalies. We introduce a novel light and fast framework that learns to map features from one modality to the other on nominal samples. At test time, anomalies are detected by pinpointing inconsistencies between observed and mapped features. Extensive experiments show th…
▽ More
The paper explores the industrial multimodal Anomaly Detection (AD) task, which exploits point clouds and RGB images to localize anomalies. We introduce a novel light and fast framework that learns to map features from one modality to the other on nominal samples. At test time, anomalies are detected by pinpointing inconsistencies between observed and mapped features. Extensive experiments show that our approach achieves state-of-the-art detection and segmentation performance in both the standard and few-shot settings on the MVTec 3D-AD dataset while achieving faster inference and occupying less memory than previous multimodal AD methods. Moreover, we propose a layer-pruning technique to improve memory and time efficiency with a marginal sacrifice in performance.
△ Less
Submitted 8 July, 2024; v1 submitted 7 December, 2023;
originally announced December 2023.
-
Looking at words and points with attention: a benchmark for text-to-shape coherence
Authors:
Andrea Amaduzzi,
Giuseppe Lisanti,
Samuele Salti,
Luigi Di Stefano
Abstract:
While text-conditional 3D object generation and manipulation have seen rapid progress, the evaluation of coherence between generated 3D shapes and input textual descriptions lacks a clear benchmark. The reason is twofold: a) the low quality of the textual descriptions in the only publicly available dataset of text-shape pairs; b) the limited effectiveness of the metrics used to quantitatively asse…
▽ More
While text-conditional 3D object generation and manipulation have seen rapid progress, the evaluation of coherence between generated 3D shapes and input textual descriptions lacks a clear benchmark. The reason is twofold: a) the low quality of the textual descriptions in the only publicly available dataset of text-shape pairs; b) the limited effectiveness of the metrics used to quantitatively assess such coherence. In this paper, we propose a comprehensive solution that addresses both weaknesses. Firstly, we employ large language models to automatically refine textual descriptions associated with shapes. Secondly, we propose a quantitative metric to assess text-to-shape coherence, through cross-attention mechanisms. To validate our approach, we conduct a user study and compare quantitatively our metric with existing ones. The refined dataset, the new metric and a set of text-shape pairs validated by the user study comprise a novel, fine-grained benchmark that we publicly release to foster research on text-to-shape coherence of text-conditioned 3D generative models. Benchmark available at https://cvlab-unibo.github.io/CrossCoherence-Web/.
△ Less
Submitted 14 September, 2023;
originally announced September 2023.
-
Semantic Image Synthesis via Class-Adaptive Cross-Attention
Authors:
Tomaso Fontanini,
Claudio Ferrari,
Giuseppe Lisanti,
Massimo Bertozzi,
Andrea Prati
Abstract:
In semantic image synthesis the state of the art is dominated by methods that use customized variants of the SPatially-Adaptive DE-normalization (SPADE) layers, which allow for good visual generation quality and editing versatility. By design, such layers learn pixel-wise modulation parameters to de-normalize the generator activations based on the semantic class each pixel belongs to. Thus, they t…
▽ More
In semantic image synthesis the state of the art is dominated by methods that use customized variants of the SPatially-Adaptive DE-normalization (SPADE) layers, which allow for good visual generation quality and editing versatility. By design, such layers learn pixel-wise modulation parameters to de-normalize the generator activations based on the semantic class each pixel belongs to. Thus, they tend to overlook global image statistics, ultimately leading to unconvincing local style editing and causing global inconsistencies such as color or illumination distribution shifts. Also, SPADE layers require the semantic segmentation mask for mapping styles in the generator, preventing shape manipulations without manual intervention. In response, we designed a novel architecture where cross-attention layers are used in place of SPADE for learning shape-style correlations and so conditioning the image generation process. Our model inherits the versatility of SPADE, at the same time obtaining state-of-the-art generation quality, as well as improved global and local style transfer. Code and models available at https://github.com/TFonta/CA2SIS.
△ Less
Submitted 30 July, 2024; v1 submitted 30 August, 2023;
originally announced August 2023.
-
Conditioning Diffusion Models via Attributes and Semantic Masks for Face Generation
Authors:
Nico Giambi,
Giuseppe Lisanti
Abstract:
Deep generative models have shown impressive results in generating realistic images of faces. GANs managed to generate high-quality, high-fidelity images when conditioned on semantic masks, but they still lack the ability to diversify their output. Diffusion models partially solve this problem and are able to generate diverse samples given the same condition. In this paper, we propose a multi-cond…
▽ More
Deep generative models have shown impressive results in generating realistic images of faces. GANs managed to generate high-quality, high-fidelity images when conditioned on semantic masks, but they still lack the ability to diversify their output. Diffusion models partially solve this problem and are able to generate diverse samples given the same condition. In this paper, we propose a multi-conditioning approach for diffusion models via cross-attention exploiting both attributes and semantic masks to generate high-quality and controllable face images. We also studied the impact of applying perceptual-focused loss weighting into the latent space instead of the pixel space. Our method extends the previous approaches by introducing conditioning on more than one set of features, guaranteeing a more fine-grained control over the generated face images. We evaluate our approach on the CelebA-HQ dataset, and we show that it can generate realistic and diverse samples while allowing for fine-grained control over multiple attributes and semantic regions. Additionally, we perform an ablation study to evaluate the impact of different conditioning strategies on the quality and diversity of the generated images.
△ Less
Submitted 27 September, 2023; v1 submitted 1 June, 2023;
originally announced June 2023.
-
Multimodal Side-Tuning for Document Classification
Authors:
Stefano Pio Zingaro,
Giuseppe Lisanti,
Maurizio Gabbrielli
Abstract:
In this paper, we propose to exploit the side-tuning framework for multimodal document classification. Side-tuning is a methodology for network adaptation recently introduced to solve some of the problems related to previous approaches. Thanks to this technique it is actually possible to overcome model rigidity and catastrophic forgetting of transfer learning by fine-tuning. The proposed solution…
▽ More
In this paper, we propose to exploit the side-tuning framework for multimodal document classification. Side-tuning is a methodology for network adaptation recently introduced to solve some of the problems related to previous approaches. Thanks to this technique it is actually possible to overcome model rigidity and catastrophic forgetting of transfer learning by fine-tuning. The proposed solution uses off-the-shelf deep learning architectures leveraging the side-tuning framework to combine a base model with a tandem of two side networks. We show that side-tuning can be successfully employed also when different data sources are considered, e.g. text and images in document classification. The experimental results show that this approach pushes further the limit for document classification accuracy with respect to the state of the art.
△ Less
Submitted 23 January, 2023; v1 submitted 16 January, 2023;
originally announced January 2023.
-
IMAGO: A family photo album dataset for a socio-historical analysis of the twentieth century
Authors:
Lorenzo Stacchio,
Alessia Angeli,
Giuseppe Lisanti,
Daniela Calanca,
Gustavo Marfia
Abstract:
Although one of the most popular practices in photography since the end of the 19th century, an increase in scholarly interest in family photo albums dates back to the early 1980s. Such collections of photos may reveal sociological and historical insights regarding specific cultures and times. They are, however, in most cases scattered among private homes and only available on paper or photographi…
▽ More
Although one of the most popular practices in photography since the end of the 19th century, an increase in scholarly interest in family photo albums dates back to the early 1980s. Such collections of photos may reveal sociological and historical insights regarding specific cultures and times. They are, however, in most cases scattered among private homes and only available on paper or photographic film, thus making their analysis by academics such as historians, social-cultural anthropologists and cultural theorists very cumbersome. In this paper, we analyze the IMAGO dataset including photos belonging to family albums assembled at the University of Bologna's Rimini campus since 2004. Following a deep learning-based approach, the IMAGO dataset has offered the opportunity of experimenting with photos taken between year 1845 and year 2009, with the goals of assessing the dates and the socio-historical contexts of the images, without use of any other sources of information. Exceeding our initial expectations, such analysis has revealed its merit not only in terms of the performance of the approach adopted in this work, but also in terms of the foreseeable implications and use for the benefit of socio-historical research. To the best of our knowledge, this is the first work that moves along this path in literature.
△ Less
Submitted 3 December, 2020;
originally announced December 2020.
-
Group Re-Identification via Unsupervised Transfer of Sparse Features Encoding
Authors:
Giuseppe Lisanti,
Niki Martinel,
Alberto Del Bimbo,
Gian Luca Foresti
Abstract:
Person re-identification is best known as the problem of associating a single person that is observed from one or more disjoint cameras. The existing literature has mainly addressed such an issue, neglecting the fact that people usually move in groups, like in crowded scenarios. We believe that the additional information carried by neighboring individuals provides a relevant visual context that ca…
▽ More
Person re-identification is best known as the problem of associating a single person that is observed from one or more disjoint cameras. The existing literature has mainly addressed such an issue, neglecting the fact that people usually move in groups, like in crowded scenarios. We believe that the additional information carried by neighboring individuals provides a relevant visual context that can be exploited to obtain a more robust match of single persons within the group. Despite this, re-identifying groups of people compound the common single person re-identification problems by introducing changes in the relative position of persons within the group and severe self-occlusions. In this paper, we propose a solution for group re-identification that grounds on transferring knowledge from single person re-identification to group re-identification by exploiting sparse dictionary learning. First, a dictionary of sparse atoms is learned using patches extracted from single person images. Then, the learned dictionary is exploited to obtain a sparsity-driven residual group representation, which is finally matched to perform the re-identification. Extensive experiments on the i-LIDS groups and two newly collected datasets show that the proposed solution outperforms state-of-the-art approaches.
△ Less
Submitted 28 July, 2017;
originally announced July 2017.
-
Context-Aware Trajectory Prediction
Authors:
Federico Bartoli,
Giuseppe Lisanti,
Lamberto Ballan,
Alberto Del Bimbo
Abstract:
Human motion and behaviour in crowded spaces is influenced by several factors, such as the dynamics of other moving agents in the scene, as well as the static elements that might be perceived as points of attraction or obstacles. In this work, we present a new model for human trajectory prediction which is able to take advantage of both human-human and human-space interactions. The future trajecto…
▽ More
Human motion and behaviour in crowded spaces is influenced by several factors, such as the dynamics of other moving agents in the scene, as well as the static elements that might be perceived as points of attraction or obstacles. In this work, we present a new model for human trajectory prediction which is able to take advantage of both human-human and human-space interactions. The future trajectory of humans, are generated by observing their past positions and interactions with the surroundings. To this end, we propose a "context-aware" recurrent neural network LSTM model, which can learn and predict human motion in crowded spaces such as a sidewalk, a museum or a shopping mall. We evaluate our model on a public pedestrian datasets, and we contribute a new challenging dataset that collects videos of humans that navigate in a (real) crowded space such as a big museum. Results show that our approach can predict human trajectories better when compared to previous state-of-the-art forecasting models.
△ Less
Submitted 6 May, 2017;
originally announced May 2017.
-
Multi Channel-Kernel Canonical Correlation Analysis for Cross-View Person Re-Identification
Authors:
Giuseppe Lisanti,
Svebor Karaman,
Iacopo Masi
Abstract:
In this paper we introduce a method to overcome one of the main challenges of person re-identification in multi-camera networks, namely cross-view appearance changes. The proposed solution addresses the extreme variability of person appearance in different camera views by exploiting multiple feature representations. For each feature, Kernel Canonical Correlation Analysis (KCCA) with different kern…
▽ More
In this paper we introduce a method to overcome one of the main challenges of person re-identification in multi-camera networks, namely cross-view appearance changes. The proposed solution addresses the extreme variability of person appearance in different camera views by exploiting multiple feature representations. For each feature, Kernel Canonical Correlation Analysis (KCCA) with different kernels is exploited to learn several projection spaces in which the appearance correlation between samples of the same person observed from different cameras is maximized. An iterative logistic regression is finally used to select and weigh the contributions of each feature projections and perform the matching between the two views. Experimental evaluation shows that the proposed solution obtains comparable performance on VIPeR and PRID 450s datasets and improves on PRID and CUHK01 datasets with respect to the state of the art.
△ Less
Submitted 21 March, 2017; v1 submitted 7 July, 2016;
originally announced July 2016.
-
A Multi-Camera Image Processing and Visualization System for Train Safety Assessment
Authors:
Giuseppe Lisanti,
Svebor Karaman,
Daniele Pezzatini,
Alberto Del Bimbo
Abstract:
In this paper we present a machine vision system to efficiently monitor, analyze and present visual data acquired with a railway overhead gantry equipped with multiple cameras. This solution aims to improve the safety of daily life railway transportation in a two- fold manner: (1) by providing automatic algorithms that can process large imagery of trains (2) by helping train operators to keep atte…
▽ More
In this paper we present a machine vision system to efficiently monitor, analyze and present visual data acquired with a railway overhead gantry equipped with multiple cameras. This solution aims to improve the safety of daily life railway transportation in a two- fold manner: (1) by providing automatic algorithms that can process large imagery of trains (2) by helping train operators to keep attention on any possible malfunction. The system is designed with the latest cutting edge, high-rate visible and thermal cameras that ob- serve a train passing under an railway overhead gantry. The machine vision system is composed of three principal modules: (1) an automatic wagon identification system, recognizing the wagon ID according to the UIC classification of railway coaches; (2) a temperature monitoring system; (3) a system for the detection, localization and visualization of the pantograph of the train. These three machine vision modules process batch trains sequences and their resulting analysis are presented to an operator using a multitouch user interface. We detail all technical aspects of our multi-camera portal: the hardware requirements, the software developed to deal with the high-frame rate cameras and ensure reliable acquisition, the algorithms proposed to solve each computer vision task, and the multitouch interaction and visualization interface. We evaluate each component of our system on a dataset recorded in an ad-hoc railway test-bed, showing the potential of our proposed portal for train safety assessment.
△ Less
Submitted 28 July, 2015;
originally announced July 2015.
-
Continuous Localization and Mapping of a Pan Tilt Zoom Camera for Wide Area Tracking
Authors:
Giuseppe Lisanti,
Iacopo Masi,
Federico Pernici,
Alberto Del Bimbo
Abstract:
Pan-tilt-zoom (PTZ) cameras are powerful to support object identification and recognition in far-field scenes. However, the effective use of PTZ cameras in real contexts is complicated by the fact that a continuous on-line camera calibration is needed and the absolute pan, tilt and zoom positional values provided by the camera actuators cannot be used because are not synchronized with the video st…
▽ More
Pan-tilt-zoom (PTZ) cameras are powerful to support object identification and recognition in far-field scenes. However, the effective use of PTZ cameras in real contexts is complicated by the fact that a continuous on-line camera calibration is needed and the absolute pan, tilt and zoom positional values provided by the camera actuators cannot be used because are not synchronized with the video stream. So, accurate calibration must be directly extracted from the visual content of the frames. Moreover, the large and abrupt scale changes, the scene background changes due to the camera operation and the need of camera motion compensation make target tracking with these cameras extremely challenging. In this paper, we present a solution that provides continuous on-line calibration of PTZ cameras which is robust to rapid camera motion, changes of the environment due to illumination or moving objects and scales beyond thousands of landmarks. The method directly derives the relationship between the position of a target in the 3D world plane and the corresponding scale and position in the 2D image, and allows real-time tracking of multiple targets with high and stable degree of accuracy even at far distances and any zooming level.
△ Less
Submitted 23 March, 2015; v1 submitted 25 January, 2014;
originally announced January 2014.