-
Multi-scale Image Super Resolution with a Single Auto-Regressive Model
Authors:
Enrique Sanchez,
Isma Hadji,
Adrian Bulat,
Christos Tzelepis,
Brais Martinez,
Georgios Tzimiropoulos
Abstract:
In this paper we tackle Image Super Resolution (ISR), using recent advances in Visual Auto-Regressive (VAR) modeling. VAR iteratively estimates the residual in latent space between gradually increasing image scales, a process referred to as next-scale prediction. Thus, the strong priors learned during pre-training align well with the downstream task (ISR). To our knowledge, only VARSR has exploited this synergy so far, showing promising results. However, due to the limitations of existing residual quantizers, VARSR works only at a fixed resolution, i.e. it fails to map intermediate outputs to the corresponding image scales. Additionally, it relies on a 1B transformer architecture (VAR-d24), and leverages a large-scale private dataset to achieve state-of-the-art results. We address these limitations through two novel components: a) a Hierarchical Image Tokenization approach with a multi-scale image tokenizer that progressively represents images at different scales while simultaneously enforcing token overlap across scales, and b) a Direct Preference Optimization (DPO) regularization term that, relying solely on the LR and HR tokenizations, encourages the transformer to produce the latter over the former. To the best of our knowledge, this is the first time a quantizer is trained to force semantically consistent residuals at different scales, and the first time that preference-based optimization is used to train a VAR. Using these two components, our model can denoise the LR image and super-resolve at half and full target upscale factors in a single forward pass. Additionally, we achieve state-of-the-art results on ISR, while using a small model (300M params vs ~1B params of VARSR), and without using external training data.
Submitted 5 June, 2025;
originally announced June 2025.
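The abstract above mentions a DPO regularization term that prefers the HR tokenization over the LR one. The exact term used in the paper is not given in the abstract, so what follows is only a minimal sketch of the standard DPO objective, assuming the HR token sequence plays the "preferred" role and the LR token sequence the "rejected" role. The function names, the scalar log-probability inputs, and the value of `beta` are illustrative assumptions.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_logp_preferred: float,
    policy_logp_rejected: float,
    ref_logp_preferred: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed temperature; not taken from the paper
) -> float:
    """Standard DPO loss for one (preferred, rejected) pair.

    Each argument is the summed log-probability of a token sequence under
    the trained model (policy) or a frozen reference model. Here the
    preferred sequence is assumed to be the HR tokenization and the
    rejected one the LR tokenization.
    """
    policy_margin = policy_logp_preferred - policy_logp_rejected
    ref_margin = ref_logp_preferred - ref_logp_rejected
    return -log_sigmoid(beta * (policy_margin - ref_margin))


if __name__ == "__main__":
    # The policy favors the preferred sequence more than the reference does,
    # so the loss falls below log(2), its value at zero margin.
    print(dpo_loss(-10.0, -14.0, -12.0, -12.0))
```

The loss equals log(2) when the policy and the reference agree on the margin, and shrinks as the policy assigns relatively more probability to the preferred sequence.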
-
A European Multi-Center Breast Cancer MRI Dataset
Authors:
Gustav Müller-Franzes,
Lorena Escudero Sánchez,
Nicholas Payne,
Alexandra Athanasiou,
Michael Kalogeropoulos,
Aitor Lopez,
Alfredo Miguel Soro Busto,
Julia Camps Herrero,
Nika Rasoolzadeh,
Tianyu Zhang,
Ritse Mann,
Debora Jutz,
Maike Bode,
Christiane Kuhl,
Wouter Veldhuis,
Oliver Lester Saldanha,
JieFu Zhu,
Jakob Nikolas Kather,
Daniel Truhn,
Fiona J. Gilbert
Abstract:
Detecting breast cancer early is of the utmost importance to effectively treat the millions of women afflicted by breast cancer worldwide every year. Although mammography is the primary imaging modality for screening breast cancer, there is an increasing interest in adding magnetic resonance imaging (MRI) to screening programmes, particularly for women at high risk. Recent guidelines by the European Society of Breast Imaging (EUSOBI) recommended breast MRI as a supplemental screening tool for women with dense breast tissue. However, acquiring and reading MRI scans requires significantly more time from expert radiologists. This highlights the need to develop new automated methods to detect cancer accurately using MRI and Artificial Intelligence (AI), which have the potential to support radiologists in breast MRI interpretation and classification and help detect cancer earlier. For this reason, the ODELIA consortium has made this multi-centre dataset publicly available to assist in developing AI tools for the detection of breast cancer on MRI.
Submitted 31 May, 2025;
originally announced June 2025.
-
BOUQuET: dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation
Authors:
The Omnilingual MT Team,
Pierre Andrews,
Mikel Artetxe,
Mariano Coria Meglioli,
Marta R. Costa-jussà,
Joe Chuang,
David Dale,
Cynthia Gao,
Jean Maillard,
Alex Mourachko,
Christophe Ropers,
Safiyyah Saleem,
Eduardo Sánchez,
Ioannis Tsiamas,
Arina Turkatenko,
Albert Ventayol-Boada,
Shireen Yates
Abstract:
This paper presents BOUQuET, a multicentric and multi-register/domain dataset and benchmark, and its broader collaborative extension initiative. This dataset is handcrafted in non-English languages first, each of these source languages being represented among the 23 languages commonly used by half of the world's population and therefore having the potential to serve as pivot languages that will enable more accurate translations. The dataset is specially designed to avoid contamination and be multicentric, so as to enforce representation of multilingual language features. In addition, the dataset goes beyond the sentence level, as it is organized in paragraphs of various lengths. Compared with related machine translation (MT) datasets, we show that BOUQuET has a broader representation of domains while simplifying the translation task for non-experts. Therefore, BOUQuET is especially suitable for the open initiative and call for translation participation that we are launching to extend it into a multi-way parallel corpus covering any written language.
Submitted 6 February, 2025;
originally announced February 2025.
-
Large Concept Models: Language Modeling in a Sentence Representation Space
Authors:
LCM team,
Loïc Barrault,
Paul-Ambroise Duquenne,
Maha Elbayad,
Artyom Kozhevnikov,
Belen Alastruey,
Pierre Andrews,
Mariano Coria,
Guillaume Couairon,
Marta R. Costa-jussà,
David Dale,
Hady Elsahar,
Kevin Heffernan,
João Maria Janeiro,
Tuan Tran,
Christophe Ropers,
Eduardo Sánchez,
Robin San Roman,
Alexandre Mourachko,
Safiyyah Saleem,
Holger Schwenk
Abstract:
LLMs have revolutionized the field of artificial intelligence and have emerged as the de-facto tool for many tasks. The current established technology of LLMs is to process input and generate output at the token level. This is in sharp contrast to humans who operate at multiple levels of abstraction, well beyond single words, to analyze information and to generate creative content. In this paper, we present an attempt at an architecture which operates on an explicit higher-level semantic representation, which we name a concept. Concepts are language- and modality-agnostic and represent a higher level idea or action in a flow. Hence, we build a "Large Concept Model". In this study, as proof of feasibility, we assume that a concept corresponds to a sentence, and use an existing sentence embedding space, SONAR, which supports up to 200 languages in both text and speech modalities.
The Large Concept Model is trained to perform autoregressive sentence prediction in an embedding space. We explore multiple approaches, namely MSE regression, variants of diffusion-based generation, and models operating in a quantized SONAR space. These explorations are performed using 1.6B parameter models and training data in the order of 1.3T tokens. We then scale one architecture to a model size of 7B parameters and training data of about 2.7T tokens. We perform an experimental evaluation on several generative tasks, namely summarization and a new task of summary expansion. Finally, we show that our model exhibits impressive zero-shot generalization performance to many languages, outperforming existing LLMs of the same size. The training code of our models is freely available.
Submitted 15 December, 2024; v1 submitted 11 December, 2024;
originally announced December 2024.
-
Y-NQ: English-Yorùbá Evaluation dataset for Open-Book Reading Comprehension and Text Generation
Authors:
Marta R. Costa-jussà,
Joy Chen,
Ifeoluwanimi Adebara,
Joe Chuang,
Christophe Ropers,
Eduardo Sánchez
Abstract:
The purpose of this work is to share an English-Yorùbá evaluation dataset for open-book reading comprehension and text generation to assess the performance of models in both a high- and a low-resource language. The dataset contains 358 questions and answers on 338 English documents and 208 Yorùbá documents. The average document length is ~10k words for English and 430 words for Yorùbá. Experiments show a consistent disparity in performance between the two languages, with Yorùbá falling behind English on automatic metrics even though documents are much shorter for this language. For a small set of documents with comparable length, performance on Yorùbá drops by a factor of 2.5. When analyzing performance by length, we observe that Yorùbá performance decreases dramatically for documents that reach 1500 words, while English performance is barely affected at that length. Our dataset opens the door to assessing whether English LLM reading comprehension capabilities extend to Yorùbá, which for the evaluated LLMs is not the case.
Submitted 11 December, 2024;
originally announced December 2024.
-
LCFO: Long Context and Long Form Output Dataset and Benchmarking
Authors:
Marta R. Costa-jussà,
Pierre Andrews,
Mariano Coria Meglioli,
Joy Chen,
Joe Chuang,
David Dale,
Christophe Ropers,
Alexandre Mourachko,
Eduardo Sánchez,
Holger Schwenk,
Tuan Tran,
Arina Turkatenko,
Carleigh Wood
Abstract:
This paper presents the Long Context and Long Form Output (LCFO) benchmark, a novel evaluation framework for assessing gradual summarization and summary expansion capabilities across diverse domains. LCFO consists of long input documents (5k words average length), each of which comes with three summaries of different lengths (20%, 10%, and 5% of the input text), as well as approximately 15 questions and answers (QA) related to the input content. Notably, LCFO also provides alignments between specific QA pairs and corresponding summaries in 7 domains. The primary motivation behind providing summaries of different lengths is to establish a controllable framework for generating long texts from shorter inputs, i.e. summary expansion. To establish an evaluation metric framework for summarization and summary expansion, we provide human evaluation scores for human-generated outputs, as well as results from various state-of-the-art large language models (LLMs). GPT-4o-mini achieves the best human evaluation scores among automatic systems in both summarization and summary expansion tasks (~ +10% and +20%, respectively). It even surpasses human output quality in the case of short summaries (~ +7%). Overall, automatic metrics achieve low correlations with human evaluation scores (~ 0.4) but moderate correlation on specific evaluation aspects such as fluency and attribution (~ 0.6). The LCFO benchmark offers a standardized platform for evaluating summarization and summary expansion performance, as well as corresponding automatic metrics, thereby providing an important evaluation framework to advance generative AI.
Submitted 12 December, 2024; v1 submitted 11 December, 2024;
originally announced December 2024.
-
On the Role of Speech Data in Reducing Toxicity Detection Bias
Authors:
Samuel J. Bell,
Mariano Coria Meglioli,
Megan Richards,
Eduardo Sánchez,
Christophe Ropers,
Skyler Wang,
Adina Williams,
Levent Sagun,
Marta R. Costa-jussà
Abstract:
Text toxicity detection systems exhibit significant biases, producing disproportionate rates of false positives on samples mentioning demographic groups. But what about toxicity detection in speech? To investigate the extent to which text-based biases are mitigated by speech-based systems, we produce a set of high-quality group annotations for the multilingual MuTox dataset, and then leverage these annotations to systematically compare speech- and text-based toxicity classifiers. Our findings indicate that access to speech data during inference supports reduced bias against group mentions, particularly for ambiguous and disagreement-inducing samples. Our results also suggest that improving classifiers, rather than transcription pipelines, is more helpful for reducing group bias. We publicly release our annotations and provide recommendations for future toxicity dataset construction.
Submitted 16 May, 2025; v1 submitted 12 November, 2024;
originally announced November 2024.
-
Digitizing Touch with an Artificial Multimodal Fingertip
Authors:
Mike Lambeta,
Tingfan Wu,
Ali Sengul,
Victoria Rose Most,
Nolan Black,
Kevin Sawyer,
Romeo Mercado,
Haozhi Qi,
Alexander Sohn,
Byron Taylor,
Norb Tydingco,
Gregg Kammerer,
Dave Stroud,
Jake Khatha,
Kurt Jenkins,
Kyle Most,
Neal Stein,
Ricardo Chavira,
Thomas Craven-Bartle,
Eric Sanchez,
Yitian Ding,
Jitendra Malik,
Roberto Calandra
Abstract:
Touch is a crucial sensing modality that provides rich information about object properties and interactions with the physical environment. Humans and robots both benefit from using touch to perceive and interact with the surrounding environment (Johansson and Flanagan, 2009; Li et al., 2020; Calandra et al., 2017). However, no existing systems provide rich, multi-modal digital touch-sensing capabilities through a hemispherical compliant embodiment. Here, we describe several conceptual and technological innovations to improve the digitization of touch. These advances are embodied in an artificial finger-shaped sensor with advanced sensing capabilities. Significantly, this fingertip contains high-resolution sensors (~8.3 million taxels) that respond to omnidirectional touch, capture multi-modal signals, and use on-device artificial intelligence to process the data in real time. Evaluations show that the artificial fingertip can resolve spatial features as small as 7 um, sense normal and shear forces with a resolution of 1.01 mN and 1.27 mN, respectively, perceive vibrations up to 10 kHz, sense heat, and even sense odor. Furthermore, it embeds an on-device AI neural network accelerator that acts as a peripheral nervous system on a robot and mimics the reflex arc found in humans. These results demonstrate the possibility of digitizing touch with superhuman performance. The implications are profound, and we anticipate potential applications in robotics (industrial, medical, agricultural, and consumer-level), virtual reality and telepresence, prosthetics, and e-commerce. Toward digitizing touch at scale, we open-source a modular platform to facilitate future research on the nature of touch.
Submitted 4 November, 2024;
originally announced November 2024.
-
Uncovering the Genetic Basis of Glioblastoma Heterogeneity through Multimodal Analysis of Whole Slide Images and RNA Sequencing Data
Authors:
Ahmad Berjaoui,
Louis Roussel,
Eduardo Hugo Sanchez,
Elizabeth Cohen-Jonathan Moyal
Abstract:
Glioblastoma is a highly aggressive form of brain cancer characterized by rapid progression and poor prognosis. Despite advances in treatment, the underlying genetic mechanisms driving this aggressiveness remain poorly understood. In this study, we employed multimodal deep learning approaches to investigate glioblastoma heterogeneity using joint image/RNA-seq analysis. Our results reveal novel genes associated with glioblastoma. By leveraging a combination of whole-slide images and RNA-seq, as well as introducing novel methods to encode RNA-seq data, we identified specific genetic profiles that may explain different patterns of glioblastoma progression. These findings provide new insights into the genetic mechanisms underlying glioblastoma heterogeneity and highlight potential targets for therapeutic intervention. Code and data downloading instructions are available at: https://github.com/ma3oun/gbheterogeneity.
Submitted 19 May, 2025; v1 submitted 23 October, 2024;
originally announced October 2024.
-
The Overfocusing Bias of Convolutional Neural Networks: A Saliency-Guided Regularization Approach
Authors:
David Bertoin,
Eduardo Hugo Sanchez,
Mehdi Zouitine,
Emmanuel Rachelson
Abstract:
Despite transformers being considered as the new standard in computer vision, convolutional neural networks (CNNs) still outperform them in low-data regimes. Nonetheless, CNNs often make decisions based on narrow, specific regions of input images, especially when training data is limited. This behavior can severely compromise the model's generalization capabilities, making it disproportionately dependent on certain features that might not represent the broader context of images. While the conditions leading to this phenomenon remain elusive, the primary intent of this article is to shed light on this observed behavior of neural networks. Our research endeavors to prioritize comprehensive insight and to outline an initial response to this phenomenon. In line with this, we introduce Saliency Guided Dropout (SGDrop), a pioneering regularization approach tailored to address this specific issue. SGDrop utilizes attribution methods on the feature map to identify and then reduce the influence of the most salient features during training. This process encourages the network to diversify its attention and not focus solely on specific standout areas. Our experiments across several visual classification benchmarks validate SGDrop's role in enhancing generalization. Significantly, models incorporating SGDrop display more expansive attributions and neural activity, offering a more comprehensive view of input images in contrast to their traditionally trained counterparts.
Submitted 25 September, 2024;
originally announced September 2024.
-
Linguini: A benchmark for language-agnostic linguistic reasoning
Authors:
Eduardo Sánchez,
Belen Alastruey,
Christophe Ropers,
Pontus Stenetorp,
Mikel Artetxe,
Marta R. Costa-jussà
Abstract:
We propose a new benchmark to measure a language model's linguistic reasoning skills without relying on pre-existing language-specific knowledge. The test covers 894 questions grouped in 160 problems across 75 (mostly) extremely low-resource languages, extracted from the International Linguistic Olympiad corpus. To attain high accuracy on this benchmark, models don't need previous knowledge of the tested language, as all the information needed to solve the linguistic puzzle is presented in the context. We find that, while all analyzed models rank below 25% accuracy, there is a significant gap between open and closed models, with the best-performing proprietary model at 24.05% and the best-performing open model at 8.84%.
Submitted 18 September, 2024;
originally announced September 2024.
-
The Role of Generative Systems in Historical Photography Management: A Case Study on Catalan Archives
Authors:
Èric Śanchez,
Adrià Molina,
Oriol Ramos Terrades
Abstract:
The use of image analysis in automated photography management is an increasing trend in heritage institutions. Such tools alleviate the human cost associated with the manual and expensive annotation of new data sources while facilitating fast access to the citizenship through online indexes and search engines. However, available tagging and description tools are usually designed around modern photographs in English, neglecting historical corpora in minoritized languages, each of which exhibits intrinsic particularities. The primary objective of this research is to study the quantitative contribution of generative systems in the description of historical sources. This is done by contextualizing the task of captioning historical photographs from the Catalan archives as a case study. Our findings provide practitioners with tools and directions on transfer learning for captioning models based on visual adaptation and linguistic proximity.
Submitted 5 September, 2024;
originally announced September 2024.
-
Mitigating Metropolitan Carbon Emissions with Dynamic Eco-driving at Scale
Authors:
Vindula Jayawardana,
Baptiste Freydt,
Ao Qu,
Cameron Hickert,
Edgar Sanchez,
Catherine Tang,
Mark Taylor,
Blaine Leonard,
Cathy Wu
Abstract:
The sheer scale and diversity of transportation make it a formidable sector to decarbonize. Here, we consider an emerging opportunity to reduce carbon emissions: the growing adoption of semi-autonomous vehicles, which can be programmed to mitigate stop-and-go traffic through intelligent speed commands and, thus, reduce emissions. But would such dynamic eco-driving move the needle on climate change? A comprehensive impact analysis has been out of reach due to the vast array of traffic scenarios and the complexity of vehicle emissions. We address this challenge with large-scale scenario modeling efforts and by using multi-task deep reinforcement learning with a carefully designed network decomposition strategy. We perform an in-depth prospective impact assessment of dynamic eco-driving at 6,011 signalized intersections across three major US metropolitan cities, simulating a million traffic scenarios. Overall, we find that vehicle trajectories optimized for emissions can cut city-wide intersection carbon emissions by 11-22%, without harming throughput or safety, and with reasonable assumptions, equivalent to the national emissions of Israel and Nigeria, respectively. We find that 10% eco-driving adoption yields 25%-50% of the total reduction, and nearly 70% of the benefits come from 20% of intersections, suggesting near-term implementation pathways. However, the composition of this high-impact subset of intersections varies considerably across different adoption levels, with minimal overlap, calling for careful strategic planning for eco-driving deployments. Moreover, the impact of eco-driving, when considered jointly with projections of vehicle electrification and hybrid vehicle adoption, remains significant. More broadly, this work paves the way for large-scale analysis of traffic externalities, such as time, safety, and air quality, and the potential impact of solution strategies.
Submitted 10 August, 2024;
originally announced August 2024.
-
Machine Translation Hallucination Detection for Low and High Resource Languages using Large Language Models
Authors:
Kenza Benkirane,
Laura Gongas,
Shahar Pelles,
Naomi Fuchs,
Joshua Darmon,
Pontus Stenetorp,
David Ifeoluwa Adelani,
Eduardo Sánchez
Abstract:
Recent advancements in massively multilingual machine translation systems have significantly enhanced translation accuracy; however, even the best performing systems still generate hallucinations, severely impacting user trust. Detecting hallucinations in Machine Translation (MT) remains a critical challenge, particularly since existing methods excel with High-Resource Languages (HRLs) but exhibit substantial limitations when applied to Low-Resource Languages (LRLs). This paper evaluates sentence-level hallucination detection approaches using Large Language Models (LLMs) and semantic similarity within massively multilingual embeddings. Our study spans 16 language directions, covering HRLs and LRLs with diverse scripts. We find that the choice of model is essential for performance. On average, for HRLs, Llama3-70B outperforms the previous state of the art by as much as 0.16 MCC (Matthews Correlation Coefficient). However, for LRLs we observe that Claude Sonnet outperforms other LLMs on average by 0.03 MCC. The key takeaway from our study is that LLMs can achieve performance comparable to or even better than previously proposed models, despite not being explicitly trained for any machine translation task. However, their advantage is less significant for LRLs.
Submitted 20 October, 2024; v1 submitted 23 July, 2024;
originally announced July 2024.
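The entry above reports its results in MCC (Matthews Correlation Coefficient). As a rough illustration of how that metric is computed for binary hallucination labels, here is a small sketch; the label convention (1 = hallucination, 0 = faithful translation) and the toy data are assumptions for illustration, not values from the paper.

```python
import math
from typing import Sequence


def matthews_corrcoef(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary MCC computed from confusion-matrix counts.

    Returns 0.0 when any marginal count is zero (the usual convention
    for an undefined denominator).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Toy labels (assumed): 1 = hallucinated translation, 0 = faithful.
    gold = [1, 1, 1, 0, 0, 0]
    pred = [1, 1, 0, 0, 0, 1]
    print(matthews_corrcoef(gold, pred))
```

MCC ranges from -1 to 1, so a gain of 0.16 is measured on that scale rather than as a percentage of accuracy.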
-
MeMSVD: Long-Range Temporal Structure Capturing Using Incremental SVD
Authors:
Ioanna Ntinou,
Enrique Sanchez,
Georgios Tzimiropoulos
Abstract:
This paper is on long-term video understanding where the goal is to recognise human actions over long temporal windows (up to minutes long). In prior work, long temporal context is captured by constructing a long-term memory bank consisting of past and future video features which are then integrated into standard (short-term) video recognition backbones through the use of attention mechanisms. Two well-known problems related to this approach are the quadratic complexity of the attention operation and the fact that the whole feature bank must be stored in memory for inference. To address both issues, we propose an alternative to attention-based schemes which is based on a low-rank approximation of the memory obtained using Singular Value Decomposition. Our scheme has two advantages: (a) it reduces complexity by more than an order of magnitude, and (b) it is amenable to an efficient implementation for the calculation of the memory bases in an incremental fashion which does not require the storage of the whole feature bank in memory. The proposed scheme matches or surpasses the accuracy achieved by attention-based mechanisms while being memory-efficient. Through extensive experiments, we demonstrate that our framework generalises to different architectures and tasks, outperforming the state-of-the-art in three datasets.
Submitted 11 June, 2024;
originally announced June 2024.
-
Reorienting Learning Game Design in Design-Based Research: a Case Study
Authors:
Nadine Mandran,
Estelle Prior,
Eric Sanchez,
Mathieu Vermeulen
Abstract:
One of the main difficulties in designing Learning Games (LG) remains the collaboration between the various experts involved. Our literature review focuses on the pitfalls and principles that have been identified by various authors in learning games design. Based on this review, a prototype was designed to support the LG design process and to study more precisely the collaboration between actors (teachers, researchers, game designers, data analysts and computer scientists). Indeed, according to the state of the art, the skills and knowledge involved in design are difficult to integrate. The prototype has been tested in a real-world scenario for designing learning games to teach algorithmics. Through participant observation in thirty-three workshops involving nine experts, we were able to identify recurring pitfalls as we applied the recommendations in the literature. The analysis of these workshops led us to propose eight principles aimed at facilitating collaboration in the learning game design process and re-evaluating research on it.
Submitted 9 January, 2024;
originally announced January 2024.
-
Multiscale Vision Transformers meet Bipartite Matching for efficient single-stage Action Localization
Authors:
Ioanna Ntinou,
Enrique Sanchez,
Georgios Tzimiropoulos
Abstract:
Action Localization is a challenging problem that combines detection and recognition tasks, which are often addressed separately. State-of-the-art methods rely on off-the-shelf bounding box detections pre-computed at high resolution, and propose transformer models that focus on the classification task alone. Such two-stage solutions are prohibitive for real-time deployment. On the other hand, single-stage methods target both tasks by devoting part of the network (generally the backbone) to sharing the majority of the workload, compromising performance for speed. These methods build on adding a DETR head with learnable queries that after cross- and self-attention can be sent to corresponding MLPs for detecting a person's bounding box and action. However, DETR-like architectures are challenging to train and can incur high complexity.
In this paper, we observe that a straight bipartite matching loss can be applied to the output tokens of a vision transformer. This results in a backbone + MLP architecture that can do both tasks without the need for an extra encoder-decoder head and learnable queries. We show that a single MViTv2-S architecture trained with bipartite matching to perform both tasks surpasses the same MViTv2-S when trained with RoI align on pre-computed bounding boxes. With a careful design of token pooling and the proposed training pipeline, our Bipartite-Matching Vision Transformer model, BMViT, achieves +3 mAP on AVA2.2 w.r.t. the two-stage MViTv2-S counterpart. Code is available at https://github.com/IoannaNti/BMViT
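To make the matching step concrete, here is a minimal sketch of bipartite matching between ground-truth instances and output tokens. It assumes a precomputed cost matrix and uses brute-force search over assignments, which is only practical for tiny examples; the function name `bipartite_match` and the cost values are hypothetical and are not taken from the BMViT code.

```python
from itertools import permutations


def bipartite_match(cost):
    """Return (total_cost, assignment) minimising sum(cost[i][assignment[i]]).

    cost[i][j] is the (assumed, precomputed) cost of matching ground-truth
    instance i to output token j. Requires len(cost) <= len(cost[0]).
    Brute force over all injective assignments, so only suitable for small inputs.
    """
    n_gt, n_tokens = len(cost), len(cost[0])
    best = None
    for perm in permutations(range(n_tokens), n_gt):
        total = sum(cost[i][perm[i]] for i in range(n_gt))
        if best is None or total < best[0]:
            best = (total, perm)
    return best


# Two ground-truth instances, three output tokens (hypothetical costs).
cost = [
    [0.9, 0.1, 0.5],
    [0.2, 0.8, 0.7],
]
total, assignment = bipartite_match(cost)
print(assignment)  # token index chosen for each ground-truth instance
```

In practice a polynomial-time solver such as the Hungarian algorithm would replace the brute-force loop; the matched pairs then determine which tokens receive the box and action losses.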
Submitted 23 May, 2024; v1 submitted 29 December, 2023;
originally announced December 2023.
-
Data-Driven Traffic Reconstruction and Kernel Methods for Identifying Stop-and-Go Congestion
Authors:
Edgar Ramirez Sanchez,
Shreyaa Raghavan,
Cathy Wu
Abstract:
Identifying stop-and-go events (SAGs) in traffic flow presents an important avenue for advancing data-driven research for climate change mitigation and sustainability, owing to their substantial impact on carbon emissions, travel time, fuel consumption, and roadway safety. In fact, SAGs are estimated to account for 33-50% of highway driving externalities. However, insufficient attention has been paid to precisely quantifying where, when, and how much these SAGs take place, which is necessary for downstream decision making, such as intervention design and policy analysis. A key challenge is that the data available to researchers and governments are typically sparse and aggregated to a granularity that obscures SAGs. To overcome such data limitations, this study explores the use of traffic reconstruction techniques for SAG identification. In particular, we introduce a kernel-based method for identifying spatio-temporal features in traffic and leverage bootstrapping to quantify the uncertainty of the reconstruction process. Experimental results on California highway data demonstrate the promise of the method for capturing SAGs. This work contributes to a foundation for data-driven decision making to advance sustainability of traffic systems.
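The bootstrapping step mentioned in the abstract can be sketched generically as a percentile bootstrap. The example below assumes a list of scalar measurements (for instance, reconstructed speeds at one location) and the mean as the statistic of interest; the function name `bootstrap_interval`, the sample values, and these choices are illustrative assumptions, not the procedure used in the paper.

```python
import random
import statistics


def bootstrap_interval(samples, stat=statistics.mean, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `stat` over `samples`.

    Resamples with replacement n_resamples times and returns the empirical
    (alpha/2, 1 - alpha/2) quantiles of the resampled statistic.
    """
    rng = random.Random(seed)
    n = len(samples)
    estimates = sorted(
        stat([rng.choice(samples) for _ in range(n)]) for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical speed measurements (km/h) at one road segment.
speeds = [62.0, 58.5, 15.0, 12.5, 60.0, 18.0, 55.5, 14.0]
print(bootstrap_interval(speeds))
```

A wide interval signals that the estimate at that location is poorly constrained by the available data.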
Submitted 5 December, 2023;
originally announced December 2023.
-
ViKi-HyCo: A Hybrid-Control approach for complex car-like maneuvers
Authors:
Edison P. Velasco Sánchez,
Miguel Ángel Muñoz-Bañón,
Francisco A. Candelas,
Santiago T. Puente,
Fernando Torres
Abstract:
While Visual Servoing is deeply studied to perform simple maneuvers, the literature does not commonly address complex cases where the target is far out of the camera's field of view (FOV) during the maneuver. For this reason, in this paper, we present ViKi-HyCo (Visual Servoing and Kinematic Hybrid-Controller). This approach generates the necessary maneuvers for the complex positioning of a non-holonomic mobile robot in outdoor environments. In this method, we use LiDAR-camera fusion to estimate object bounding boxes using image and metric modalities. With the multi-modality nature of our representation, we can automatically obtain a target for a visual servoing controller. At the same time, we also have a metric target, which allows us to hybridize with a kinematic controller. Given this hybridization, we can perform complex maneuvers even when the target is far away from the camera's FOV. The proposed approach does not require an object-tracking algorithm and can be applied to any robotic positioning task where its kinematic model is known. ViKi-HyCo has an error of 0.0428 ± 0.0467 m in the X-axis and 0.0515 ± 0.0323 m in the Y-axis at the end of a complete positioning task.
Submitted 16 May, 2024; v1 submitted 13 November, 2023;
originally announced November 2023.
-
Gender-specific Machine Translation with Large Language Models
Authors:
Eduardo Sánchez,
Pierre Andrews,
Pontus Stenetorp,
Mikel Artetxe,
Marta R. Costa-jussà
Abstract:
While machine translation (MT) systems have seen significant improvements, it is still common for translations to reflect societal biases, such as gender bias. Decoder-only Large Language Models (LLMs) have demonstrated potential in MT, albeit with performance slightly lagging behind traditional encoder-decoder Neural Machine Translation (NMT) systems. However, LLMs offer a unique advantage: the ability to control the properties of the output through prompts. In this study, we leverage this flexibility to explore LLaMa's capability to produce gender-specific translations. Our results indicate that LLaMa can generate gender-specific translations with translation accuracy and gender bias comparable to NLLB, a state-of-the-art multilingual NMT system. Furthermore, our experiments reveal that LLaMa's gender-specific translations rely on coreference resolution to determine gender, showing higher gender variance in gender-ambiguous datasets but maintaining consistency in less ambiguous contexts. This research investigates the potential and challenges of using LLMs for gender-specific translations as an instance of the controllability of outputs offered by LLMs.
Submitted 16 April, 2024; v1 submitted 6 September, 2023;
originally announced September 2023.
-
Local primordial non-Gaussianity from the large-scale clustering of photometric DESI luminous red galaxies
Authors:
Mehdi Rezaie,
Ashley J. Ross,
Hee-Jong Seo,
Hui Kong,
Anna Porredon,
Lado Samushia,
Edmond Chaussidon,
Alex Krolewski,
Arnaud de Mattia,
Florian Beutler,
Jessica Nicole Aguilar,
Steven Ahlen,
Shadab Alam,
Santiago Avila,
Benedict Bahr-Kalus,
Jose Bermejo-Climent,
David Brooks,
Todd Claybaugh,
Shaun Cole,
Kyle Dawson,
Axel de la Macorra,
Peter Doel,
Andreu Font-Ribera,
Jaime E. Forero-Romero,
Satya Gontcho A Gontcho
, et al. (24 additional authors not shown)
Abstract:
We use angular clustering of luminous red galaxies from the Dark Energy Spectroscopic Instrument (DESI) imaging surveys to constrain the local primordial non-Gaussianity parameter $f_{\rm NL}$. Our sample comprises over 12 million targets, covering 14,000 square degrees of the sky, with redshifts in the range $0.2 < z < 1.35$. We identify Galactic extinction, survey depth, and astronomical seeing as the primary sources of systematic error, and employ linear regression and artificial neural networks to alleviate non-cosmological excess clustering on large scales. Our methods are tested against simulations with and without $f_{\rm NL}$ and systematics, showing superior performance of the neural network treatment. The neural network with a set of nine imaging property maps passes our systematic null test criteria, and is chosen as the fiducial treatment. Assuming the universality relation, we find $f_{\rm NL} = 34^{+24(+50)}_{-44(-73)}$ at 68% (95%) confidence. We apply a series of robustness tests (e.g., cuts on imaging, declination, or scales used) that show consistency in the obtained constraints. We study how the regression method biases the measured angular power-spectrum and degrades the $f_{\rm NL}$ constraining power. The use of the nine maps more than doubles the uncertainty compared to using only the three primary maps in the regression. Our results thus motivate the development of more efficient methods that avoid over-correction, protect large-scale clustering information, and preserve constraining power. Additionally, our results encourage further studies of $f_{\rm NL}$ with DESI spectroscopic samples, where the inclusion of 3D clustering modes should help separate imaging systematics and lessen the degradation in the $f_{\rm NL}$ uncertainty.
Submitted 25 June, 2024; v1 submitted 4 July, 2023;
originally announced July 2023.
-
Read, look and detect: Bounding box annotation from image-caption pairs
Authors:
Eduardo Hugo Sanchez
Abstract:
Various methods have been proposed to detect objects while reducing the cost of data annotation. For instance, weakly supervised object detection (WSOD) methods rely only on image-level annotations during training. Unfortunately, data annotation remains expensive since annotators must provide the categories describing the content of each image and labeling is restricted to a fixed set of categories. In this paper, we propose a method to locate and label objects in an image by using a form of weaker supervision: image-caption pairs. By leveraging recent advances in vision-language (VL) models and self-supervised vision transformers (ViTs), our method is able to perform phrase grounding and object detection in a weakly supervised manner. Our experiments demonstrate the effectiveness of our approach by achieving a 47.51% recall@1 score in phrase grounding on Flickr30k Entities and establishing a new state-of-the-art in object detection by achieving 21.1 mAP@50 and 10.5 mAP@50:95 on MS COCO when exclusively relying on image-caption pairs.
Submitted 9 June, 2023;
originally announced June 2023.
-
Special Session: Approximation and Fault Resiliency of DNN Accelerators
Authors:
Mohammad Hasan Ahmadilivani,
Mario Barbareschi,
Salvatore Barone,
Alberto Bosio,
Masoud Daneshtalab,
Salvatore Della Torca,
Gabriele Gavarini,
Maksim Jenihhin,
Jaan Raik,
Annachiara Ruospo,
Ernesto Sanchez,
Mahdi Taheri
Abstract:
Deep Learning, and in particular Deep Neural Networks (DNNs), is nowadays widely used in many scenarios, including safety-critical applications such as autonomous driving. In this context, besides energy efficiency and performance, reliability plays a crucial role since a system failure can jeopardize human life. As with any other device, the reliability of hardware architectures running DNNs has to be evaluated, usually through costly fault injection campaigns. This paper explores the approximation and fault resiliency of DNN accelerators. We propose to use approximate (AxC) arithmetic circuits to agilely emulate errors in hardware without performing fault injection on the DNN. To allow fast evaluation of AxC DNNs, we developed an efficient GPU-based simulation framework. Further, we propose a fine-grain analysis of fault resiliency by examining fault propagation and masking in networks.
Submitted 31 May, 2023;
originally announced June 2023.
-
PSP Framework: A novel risk assessment method in compliance with ISO/SAE-21434
Authors:
Franco Oberti,
Ernesto Sanchez,
Alessandro Savino,
Filippo Parisi,
Stefano Di Carlo
Abstract:
As more cars connect to the internet and other devices, the automotive market has become a lucrative target for cyberattacks. This has made the industry more vulnerable to security threats. As a result, car manufacturers and governments are working together to reduce risks and prevent cyberattacks in the automotive sector. However, existing attack feasibility models derived from the information technology field may not always provide accurate assessments of the potential risks faced by Vehicle Electronic Control Units in different operating conditions and domains. This paper introduces the PUNCH Softronix and Politecnico di Torino (PSP) framework to address this issue. This framework is designed to provide accurate assessments compatible with the attack feasibility models defined by the automotive product security standards. The PSP framework utilizes social sentiment analysis to evaluate the real threat risk levels.
Submitted 9 May, 2023;
originally announced May 2023.
-
Automated Segmentation of Computed Tomography Images with Submanifold Sparse Convolutional Networks
Authors:
Saúl Alonso-Monsalve,
Leigh H. Whitehead,
Adam Aurisano,
Lorena Escudero Sanchez
Abstract:
Quantitative cancer image analysis relies on the accurate delineation of tumours, a very specialised and time-consuming task. For this reason, methods for automated segmentation of tumours in medical imaging have been extensively developed in recent years, with Computed Tomography being one of the most popular imaging modalities explored. However, the large number of 3D voxels in a typical scan makes it prohibitive to analyse the entire volume at once on conventional hardware. To overcome this issue, downsampling and/or resampling are generally applied when using traditional convolutional neural networks in medical imaging. In this paper, we propose a new methodology that introduces a process of sparsification of the input images and submanifold sparse convolutional networks as an alternative to downsampling. As a proof of concept, we applied this new methodology to Computed Tomography images of renal cancer patients, obtaining segmentation performance for kidneys and tumours that is competitive with previous methods (~84.6% Dice similarity coefficient), while achieving a significant improvement in computation time (2-3 min per training epoch).
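For readers unfamiliar with the metric quoted above, the Dice similarity coefficient between a predicted and a reference binary mask is 2|A ∩ B| / (|A| + |B|). The sketch below computes it on flattened masks; the function name `dice` and the convention of returning 1.0 when both masks are empty are assumptions for illustration and may differ from the evaluation code used in the paper.

```python
def dice(pred, target):
    """Dice similarity coefficient between two equally sized binary masks.

    Masks are flat sequences of 0/1 (or False/True) voxel labels.
    Returns 1.0 when both masks are empty (an assumed convention).
    """
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical 4-voxel masks: one voxel overlaps, the prediction has one extra voxel.
print(dice([1, 1, 0, 0], [1, 0, 0, 0]))
```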
Submitted 6 December, 2022;
originally announced December 2022.
-
Bayesian Prompt Learning for Image-Language Model Generalization
Authors:
Mohammad Mahdi Derakhshani,
Enrique Sanchez,
Adrian Bulat,
Victor Guilherme Turrisi da Costa,
Cees G. M. Snoek,
Georgios Tzimiropoulos,
Brais Martinez
Abstract:
Foundational image-language models have generated considerable interest due to their efficient adaptation to downstream tasks by prompt learning. Prompt learning treats part of the language model input as trainable while freezing the rest, and optimizes an Empirical Risk Minimization objective. However, Empirical Risk Minimization is known to suffer from distributional shifts which hurt generalizability to prompts unseen during training. By leveraging the regularization ability of Bayesian methods, we frame prompt learning from the Bayesian perspective and formulate it as a variational inference problem. Our approach regularizes the prompt space, reduces overfitting to the seen prompts and improves the prompt generalization on unseen prompts. Our framework is implemented by modeling the input prompt space in a probabilistic manner, as an a priori distribution which makes our proposal compatible with prompt learning approaches that are unconditional or conditional on the image. We demonstrate empirically on 15 benchmarks that Bayesian prompt learning provides an appropriate coverage of the prompt space, prevents learning spurious features, and exploits transferable invariant features. This results in better generalization of unseen prompts, even across different datasets and domains. Code available at: https://github.com/saic-fi/Bayesian-Prompt-Learning
Submitted 20 August, 2023; v1 submitted 5 October, 2022;
originally announced October 2022.
-
REST: REtrieve & Self-Train for generative action recognition
Authors:
Adrian Bulat,
Enrique Sanchez,
Brais Martinez,
Georgios Tzimiropoulos
Abstract:
This work is on training a generative action/video recognition model whose output is a free-form action-specific caption describing the video (rather than an action class label). A generative approach has practical advantages like producing more fine-grained and human-readable output, and being naturally open-world. To this end, we propose to adapt a pre-trained generative Vision & Language (V&L) Foundation Model for video/action recognition. While recently there have been a few attempts to adapt V&L models trained with contrastive learning (e.g. CLIP) for video/action, to the best of our knowledge, we propose the very first method that sets out to accomplish this goal for a generative model. We first show that direct fine-tuning of a generative model to produce action classes suffers from severe overfitting. To alleviate this, we introduce REST, a training framework consisting of two key components: (a) an unsupervised method for adapting the generative model to action/video by means of pseudo-caption generation and Self-training, i.e. without using any action-specific labels; (b) a Retrieval approach based on CLIP for discovering a diverse set of pseudo-captions for each video to train the model. Importantly, we show that both components are necessary to obtain high accuracy. We evaluate REST on the problem of zero-shot action recognition where we show that our approach is very competitive when compared to contrastive learning-based methods. Code will be made available.
Submitted 29 September, 2022;
originally announced September 2022.
-
Calibrating Ensembles for Scalable Uncertainty Quantification in Deep Learning-based Medical Segmentation
Authors:
Thomas Buddenkotte,
Lorena Escudero Sanchez,
Mireia Crispin-Ortuzar,
Ramona Woitek,
Cathal McCague,
James D. Brenton,
Ozan Öktem,
Evis Sala,
Leonardo Rundo
Abstract:
Uncertainty quantification in automated image analysis is highly desired in many applications. Typically, machine learning models in classification or segmentation are only developed to provide binary answers; however, quantifying the uncertainty of the models can play a critical role, for example in active learning or human-machine interaction. Uncertainty quantification is especially difficult when using deep learning-based models, which are the state-of-the-art in many imaging applications. The current uncertainty quantification approaches do not scale well in high-dimensional real-world problems. Scalable solutions often rely on classical techniques, such as dropout during inference or training ensembles of identical models with different random seeds, to obtain a posterior distribution. In this paper, we show that these approaches fail to approximate the classification probability. In contrast, we propose a scalable and intuitive framework to calibrate ensembles of deep learning models to produce uncertainty quantification measurements that approximate the classification probability. On unseen test data, we demonstrate improved calibration, sensitivity (in two out of three cases) and precision when compared with the standard approaches. We further motivate the usage of our method in active learning, creating pseudo-labels to learn from unlabeled images, and human-machine collaboration.
Submitted 20 September, 2022;
originally announced September 2022.
-
Lumen Shape Reconstruction using a Soft Robotic Balloon Catheter and Electrical Impedance Tomography
Authors:
James Avery,
Mark Runciman,
Cristina Fiani,
Elena Monfort Sanchez,
Saina Akhond,
Zhuang Liu,
Kirill Aristovich,
George Mylonas
Abstract:
Incorrectly sized balloon catheters can lead to increased post-surgical complications, yet even with preoperative imaging, correct selection remains a challenge. With limited feedback during surgery, it is difficult to verify correct deployment. We propose the use of integrated impedance measurements and Electrical Impedance Tomography (EIT) imaging to assess the deformation of the balloon and determine the size and shape of the surrounding lumen. Previous work using single impedance measurements, or pressure data and analytical models, whilst demonstrating high sizing accuracy, has assumed a circular cross section. Here we extend these methods by adding a multitude of electrodes to detect elliptical and occluded lumens and obtain EIT images to localise deformations. Using a 14 Fr (5.3 mm) catheter as an example, numerical simulations were performed to find the optimal electrode configuration of two rings of 8 electrodes spaced 10 mm apart. The simulations predicted that the maximum detectable aspect ratio decreased from 0.9 for a 14 mm balloon to 0.5 at 30 mm. The sizing and ellipticity detection results were verified experimentally. A prototype robotic balloon catheter was constructed to automatically inflate a compliant balloon while simultaneously recording EIT and pressure data. Data were collected in experiments replicating stenotic vessels with an elliptical and asymmetrical profile, and the widening of a lumen during angioplasty. After calibration, the system was able to correctly localise the occlusion and detect aspect ratios of 0.75. EIT images further localised the occlusion and visualised the dilation of the lumen during balloon inflation.
Submitted 23 August, 2022; v1 submitted 25 July, 2022;
originally announced July 2022.
-
Molecular information theory meets protein folding
Authors:
Ignacio E. Sánchez,
Ezequiel A. Galpern,
Martín M. Garibaldi,
Diego U. Ferreiro
Abstract:
We propose an application of molecular information theory to analyze the folding of single domain proteins. We analyze results from various areas of protein science, such as sequence-based potentials, reduced amino acid alphabets, backbone configurational entropy, secondary structure content, residue burial layers, and mutational studies of protein stability changes. We found that the average information contained in the sequences of evolved proteins is very close to the average information needed to specify a fold, ~2.2 $\pm$ 0.3 bits/(site operation). The effective alphabet size in evolved proteins equals the effective number of conformations of a residue in the compact unfolded state, at around 5. We calculated an energy-to-information conversion efficiency upon folding of around 50%, lower than the theoretical limit of 70%, but much higher than human-built macroscopic machines. We propose a simple mapping between molecular information theory and energy landscape theory and explore the connections between sequence evolution, configurational entropy and the energetics of protein folding.
Submitted 29 June, 2022;
originally announced June 2022.
-
CAN-MM: Multiplexed Message Authentication Code for Controller Area Network message authentication in road vehicles
Authors:
Franco Oberti,
Ernesto Sanchez,
Alessandro Savino,
Filippo Parisi,
Stefano Di Carlo
Abstract:
The automotive market is increasingly profitable for cyberattacks with the constant shift toward fully interconnected vehicles. Electronic Control Units (ECUs) installed on cars often operate in a critical and hostile environment. Hence, both carmakers and governments have decided to support a series of initiatives to mitigate risks and threats belonging to the automotive domain. The Controller Area Network (CAN) is the primary communication protocol in the automotive field, and the integrity of the communication over this network is assured through Message Authentication Codes (MAC). However, limitations in throughput and frame size restrict the application of this technique to specific versions of the CAN protocol, leaving several vehicles still unprotected. This paper presents CAN Multiplexed MAC (CAN-MM), a new approach exploiting frequency modulation to multiplex MAC data with standard CAN communication. CAN-MM allows transmitting MAC payloads while maintaining full backward compatibility with all versions of the standard CAN protocol. Moreover, multiplexing allows sending DATA and MAC simultaneously.
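The abstract does not say which MAC construction CAN-MM uses, so the following is only a generic sketch of computing and verifying a truncated message authentication code over a CAN frame. HMAC-SHA256, the 8-byte truncation, the freshness counter, and the names `can_mac` and `verify_can_mac` are all assumptions for illustration; the frequency-modulation multiplexing itself is a physical-layer mechanism and is not modelled here.

```python
import hashlib
import hmac


def can_mac(key: bytes, can_id: int, payload: bytes, counter: int, mac_len: int = 8) -> bytes:
    """Truncated HMAC-SHA256 over identifier, freshness counter and payload (assumed layout)."""
    message = can_id.to_bytes(4, "big") + counter.to_bytes(4, "big") + payload
    return hmac.new(key, message, hashlib.sha256).digest()[:mac_len]


def verify_can_mac(key: bytes, can_id: int, payload: bytes, counter: int, tag: bytes) -> bool:
    """Recompute the tag and compare in constant time."""
    expected = can_mac(key, can_id, payload, counter, mac_len=len(tag))
    return hmac.compare_digest(expected, tag)


# Hypothetical shared key and frame contents.
key = b"\x01" * 16
frame_id, data, counter = 0x123, bytes([0x10, 0x20, 0x30, 0x40]), 7
tag = can_mac(key, frame_id, data, counter)
print(verify_can_mac(key, frame_id, data, counter, tag))
```

Because a classic CAN data field holds at most 8 bytes, a tag of this size would leave no room for data in the same frame, which is the kind of constraint that motivates sending the MAC over a separate multiplexed channel.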
Submitted 22 May, 2024; v1 submitted 6 June, 2022;
originally announced June 2022.
-
LIN-MM: Multiplexed Message Authentication Code for Local Interconnect Network message authentication in road vehicles
Authors:
Franco Oberti,
Ernesto Sanchez,
Alessandro Savino,
Filippo Parisi,
Mirco Brero,
Stefano Di Carlo
Abstract:
The automotive market is profitable for cyberattacks with the constant shift toward interconnected vehicles. Electronic Control Units (ECUs) installed on cars often operate in a critical and hostile environment. Hence, both carmakers and governments have supported initiatives to mitigate risks and threats belonging to the automotive domain. The Local Interconnect Network (LIN) is one of the most used communication protocols in the automotive field. Today's LIN buses have just a few light security mechanisms to assure integrity through Message Authentication Codes (MAC). However, several limitations with strong constraints make applying those techniques to LIN networks challenging, leaving several vehicles still unprotected. This paper presents LIN Multiplexed MAC (LIN-MM), a new approach for exploiting signal modulation to multiplex MAC data with standard LIN communication. LIN-MM allows for transmitting MAC payloads while maintaining full backward compatibility with all versions of the standard LIN protocol.
Submitted 6 June, 2022;
originally announced June 2022.
-
From Keypoints to Object Landmarks via Self-Training Correspondence: A novel approach to Unsupervised Landmark Discovery
Authors:
Dimitrios Mallis,
Enrique Sanchez,
Matt Bell,
Georgios Tzimiropoulos
Abstract:
This paper proposes a novel paradigm for the unsupervised learning of object landmark detectors. Contrary to existing methods that build on auxiliary tasks such as image generation or equivariance, we propose a self-training approach where, departing from generic keypoints, a landmark detector and descriptor is trained to improve itself, tuning the keypoints into distinctive landmarks. To this end, we propose an iterative algorithm that alternates between producing new pseudo-labels through feature clustering and learning distinctive features for each pseudo-class through contrastive learning. With a shared backbone for the landmark detector and descriptor, the keypoint locations progressively converge to stable landmarks, filtering out the less stable ones. Compared to previous works, our approach can learn points that are more flexible in terms of capturing large viewpoint changes. We validate our method on a variety of difficult datasets, including LS3D, BBCPose, Human3.6M and PennAction, achieving new state-of-the-art results.
Submitted 25 February, 2023; v1 submitted 31 May, 2022;
originally announced May 2022.
-
Schrödinger's FP: Dynamic Adaptation of Floating-Point Containers for Deep Learning Training
Authors:
Miloš Nikolić,
Enrique Torres Sanchez,
Jiahui Wang,
Ali Hadi Zadeh,
Mostafa Mahmoud,
Ameer Abdelhadi,
Kareem Ibrahim,
Andreas Moshovos
Abstract:
The transfer of tensors from/to memory during neural network training dominates time and energy. To improve energy efficiency and performance, research has been exploring ways to use narrower data representations. So far, these attempts have relied on user-directed trial-and-error to achieve convergence. We present methods that relieve users of this responsibility. Our methods dynamically adjust the size and format of the floating-point containers used for activations and weights during training, achieving adaptivity across three dimensions: i) which datatype to use, ii) on which tensor, and iii) how it changes over time. The different meanings and distributions of exponents and mantissas lead us to tailored approaches for each. We present two lossy pairs of methods to eliminate as many mantissa and exponent bits as possible without affecting accuracy. Quantum Mantissa and Quantum Exponent are machine learning compression methods that tap into the gradient descent algorithm to learn the minimal mantissa and exponent bitlengths on a per-layer granularity. They automatically learn that many tensors can use just 1 or 2 mantissa bits and 3 or 4 exponent bits. Overall, the two machine learning methods reduce the footprint by $4.74\times$. Alternatively, BitWave observes changes in the loss function during training to adjust mantissa and exponent bitlengths network-wide, yielding a $3.19\times$ reduction in footprint. Finally, we present an optional method, Gecko, to exploit the naturally emerging, lop-sided exponent distribution to losslessly compress resulting exponents from Quantum Exponent or BitWave and, on average, improve compression rates to $5.64\times$ and $4.56\times$.
Submitted 16 May, 2024; v1 submitted 28 April, 2022;
originally announced April 2022.
-
EXT-TAURUM P2T: an Extended Secure CAN-FD Architecture for Road Vehicles
Authors:
Franco Oberti,
Alessandro Savino,
Ernesto Sanchez,
Filippo Parisi,
Stefano Di Carlo
Abstract:
The automobile industry no longer relies on purely mechanical systems; instead, it benefits from advanced Electronic Control Units (ECUs) that provide new and complex functionalities in the effort to move toward fully connected cars. However, connected cars provide a dangerous playground for hackers. Vehicles are becoming increasingly vulnerable to cyber attacks as they come equipped with more connected features and control systems. This situation may expose strategic assets in the automotive value chain. In this scenario, the Controller Area Network (CAN) is the most widely used communication protocol in the automotive domain. However, this protocol lacks encryption and authentication. Consequently, any malicious or hijacked node can cause catastrophic accidents and financial loss. Starting from an analysis of the vulnerabilities of the CAN communication protocol in the automotive domain, this paper proposes EXT-TAURUM P2T, a new low-cost secure CAN-FD architecture for the automotive domain implementing secure communication among ECUs, a novel key provisioning strategy, intelligent throughput management, and hardware signature mechanisms. The proposed architecture has been implemented using a commercial Multi-Protocol Vehicle Interface module, and the obtained results experimentally demonstrate the approach's feasibility.
Submitted 7 March, 2022; v1 submitted 15 December, 2021;
originally announced December 2021.
-
Advancing COVID-19 Diagnosis with Privacy-Preserving Collaboration in Artificial Intelligence
Authors:
Xiang Bai,
Hanchen Wang,
Liya Ma,
Yongchao Xu,
Jiefeng Gan,
Ziwei Fan,
Fan Yang,
Ke Ma,
Jiehua Yang,
Song Bai,
Chang Shu,
Xinyu Zou,
Renhao Huang,
Changzheng Zhang,
Xiaowu Liu,
Dandan Tu,
Chuou Xu,
Wenqing Zhang,
Xi Wang,
Anguo Chen,
Yu Zeng,
Dehua Yang,
Ming-Wei Wang,
Nagaraj Holalkere,
Neil J. Halin
, et al. (21 additional authors not shown)
Abstract:
Artificial intelligence (AI) provides a promising substitution for streamlining COVID-19 diagnoses. However, concerns surrounding security and trustworthiness impede the collection of large-scale representative medical data, posing a considerable challenge for training a well-generalised model in clinical practices. To address this, we launch the Unified CT-COVID AI Diagnostic Initiative (UCADI), where the AI model can be trained in a distributed manner and independently executed at each host institution under a federated learning (FL) framework without data sharing. Here we show that our FL model outperformed all the local models by a large margin (test sensitivity/specificity in China: 0.973/0.951, in the UK: 0.730/0.942), achieving comparable performance with a panel of professional radiologists. We further evaluated the model on hold-out data (collected from two other hospitals left out of the FL) and heterogeneous data (acquired with contrast materials), provided visual explanations for decisions made by the model, and analysed the trade-offs between the model performance and the communication costs in the federated training process. Our study is based on 9,573 chest computed tomography scans (CTs) from 3,336 patients collected from 23 hospitals located in China and the UK. Collectively, our work advanced the prospects of utilising federated learning for privacy-preserving AI in digital health.
Submitted 17 November, 2021;
originally announced November 2021.
-
Subpixel Heatmap Regression for Facial Landmark Localization
Authors:
Adrian Bulat,
Enrique Sanchez,
Georgios Tzimiropoulos
Abstract:
Deep Learning models based on heatmap regression have revolutionized the task of facial landmark localization with existing models working robustly under large poses, non-uniform illumination and shadows, occlusions and self-occlusions, low resolution and blur. However, despite their wide adoption, heatmap regression approaches suffer from discretization-induced errors related to both the heatmap encoding and decoding process. In this work we show that these errors have a surprisingly large negative impact on facial alignment accuracy. To alleviate this problem, we propose a new approach for the heatmap encoding and decoding process by leveraging the underlying continuous distribution. To take full advantage of the newly proposed encoding-decoding mechanism, we also introduce a Siamese-based training that enforces heatmap consistency across various geometric image transformations. Our approach offers noticeable gains across multiple datasets setting a new state-of-the-art result in facial landmark localization. Code alongside the pretrained models will be made available at https://www.adrianbulat.com/face-alignment
Submitted 3 November, 2021;
originally announced November 2021.
-
Pre-training strategies and datasets for facial representation learning
Authors:
Adrian Bulat,
Shiyang Cheng,
Jing Yang,
Andrew Garbett,
Enrique Sanchez,
Georgios Tzimiropoulos
Abstract:
What is the best way to learn a universal face representation? Recent work on Deep Learning in the area of face analysis has focused on supervised learning for specific tasks of interest (e.g. face recognition, facial landmark localization etc.) but has overlooked the overarching question of how to find a facial representation that can be readily adapted to several facial analysis tasks and datasets. To this end, we make the following 4 contributions: (a) We introduce, for the first time, a comprehensive evaluation benchmark for facial representation learning consisting of 5 important face analysis tasks. (b) We systematically investigate two ways of large-scale representation learning applied to faces: supervised and unsupervised pre-training. Importantly, we focus our evaluations on the case of few-shot facial learning. (c) We investigate important properties of the training datasets including their size and quality (labelled, unlabelled or even uncurated). (d) To draw our conclusions, we conducted a very large number of experiments. Our two main findings are: (1) Unsupervised pre-training on completely in-the-wild, uncurated data provides consistent and, in some cases, significant accuracy improvements for all facial tasks considered. (2) Many existing facial video datasets seem to have a large amount of redundancy. We will release code and pre-trained models to facilitate future research.
Submitted 20 July, 2022; v1 submitted 30 March, 2021;
originally announced March 2021.
-
Affective Processes: stochastic modelling of temporal context for emotion and facial expression recognition
Authors:
Enrique Sanchez,
Mani Kumar Tellamekala,
Michel Valstar,
Georgios Tzimiropoulos
Abstract:
Temporal context is key to the recognition of expressions of emotion. Existing methods that rely on recurrent or self-attention models to enforce temporal consistency work at the feature level, ignore the task-specific temporal dependencies, and fail to model context uncertainty. To alleviate these issues, we build upon the framework of Neural Processes to propose a method for apparent emotion recognition with three key novel components: (a) probabilistic contextual representation with a global latent variable model; (b) temporal context modelling using task-specific predictions in addition to features; and (c) smart temporal context selection. We validate our approach on four databases, two for Valence and Arousal estimation (SEWA and AffWild2), and two for Action Unit intensity estimation (DISFA and BP4D). Results show a consistent improvement over a series of strong baselines as well as over state-of-the-art methods.
Submitted 24 March, 2021;
originally announced March 2021.
-
A machine learning approach to galaxy properties: joint redshift-stellar mass probability distributions with Random Forest
Authors:
S. Mucesh,
W. G. Hartley,
A. Palmese,
O. Lahav,
L. Whiteway,
A. F. L. Bluck,
A. Alarcon,
A. Amon,
K. Bechtol,
G. M. Bernstein,
A. Carnero Rosell,
M. Carrasco Kind,
A. Choi,
K. Eckert,
S. Everett,
D. Gruen,
R. A. Gruendl,
I. Harrison,
E. M. Huff,
N. Kuropatkin,
I. Sevilla-Noarbe,
E. Sheldon,
B. Yanny,
M. Aguena,
S. Allam
, et al. (50 additional authors not shown)
Abstract:
We demonstrate that highly accurate joint redshift-stellar mass probability distribution functions (PDFs) can be obtained using the Random Forest (RF) machine learning (ML) algorithm, even with few photometric bands available. As an example, we use the Dark Energy Survey (DES), combined with the COSMOS2015 catalogue for redshifts and stellar masses. We build two ML models: one containing deep photometry in the $griz$ bands, and the second reflecting the photometric scatter present in the main DES survey, with carefully constructed representative training data in each case. We validate our joint PDFs for $10,699$ test galaxies by utilizing the copula probability integral transform and the Kendall distribution function, and their univariate counterparts to validate the marginals. Benchmarked against a basic set-up of the template-fitting code BAGPIPES, our ML-based method outperforms template fitting on all of our predefined performance metrics. In addition to accuracy, the RF is extremely fast, able to compute joint PDFs for a million galaxies in just under $6$ min with consumer computer hardware. Such speed enables PDFs to be derived in real time within analysis codes, solving potential storage issues. As part of this work we have developed GALPRO, a highly intuitive and efficient Python package to rapidly generate multivariate PDFs on-the-fly. GALPRO is documented and available for researchers to use in their cosmology and galaxy evolution studies.
Submitted 19 February, 2021; v1 submitted 10 December, 2020;
originally announced December 2020.
-
Semi-supervised Facial Action Unit Intensity Estimation with Contrastive Learning
Authors:
Enrique Sanchez,
Adrian Bulat,
Anestis Zaganidis,
Georgios Tzimiropoulos
Abstract:
This paper tackles the challenging problem of estimating the intensity of Facial Action Units with few labeled images. Contrary to previous works, our method does not require manual selection of key frames, and produces state-of-the-art results with as little as $2\%$ of annotated frames, which are \textit{randomly chosen}. To this end, we propose a semi-supervised learning approach where a spatio-temporal model combining a feature extractor and a temporal module is learned in two stages. The first stage uses datasets of unlabeled videos to learn a strong spatio-temporal representation of facial behavior dynamics based on contrastive learning. To our knowledge, we are the first to build upon this framework for modeling facial behavior in an unsupervised manner. The second stage uses another dataset of randomly chosen labeled frames to train a regressor on top of our spatio-temporal model for estimating the AU intensity. We show that although backpropagation through time is applied only with respect to the output of the network for extremely sparse and randomly chosen labeled frames, our model can be effectively trained to estimate AU intensity accurately, thanks to the unsupervised pre-training of the first stage. We experimentally validate that our method outperforms existing methods when working with as little as $2\%$ of randomly chosen data on both the DISFA and BP4D datasets, without a careful choice of labeled frames, a time-consuming task still required in previous approaches.
Submitted 4 November, 2020; v1 submitted 3 November, 2020;
originally announced November 2020.
-
Machine Learning for Searching the Dark Energy Survey for Trans-Neptunian Objects
Authors:
B. Henghes,
O. Lahav,
D. W. Gerdes,
E. Lin,
R. Morgan,
T. M. C. Abbott,
M. Aguena,
S. Allam,
J. Annis,
S. Avila,
E. Bertin,
D. Brooks,
D. L. Burke,
A. Carnero Rosell,
M. Carrasco Kind,
J. Carretero,
C. Conselice,
M. Costanzi,
L. N. da Costa,
J. De Vicente,
S. Desai,
H. T. Diehl,
P. Doel,
S. Everett,
I. Ferrero
, et al. (34 additional authors not shown)
Abstract:
In this paper we investigate how implementing machine learning could improve the efficiency of the search for Trans-Neptunian Objects (TNOs) within Dark Energy Survey (DES) data when used alongside orbit fitting. The discovery of multiple TNOs that appear to show a similarity in their orbital parameters has led to the suggestion that one or more undetected planets, an as yet undiscovered "Planet 9", may be present in the outer Solar System. DES is well placed to detect such a planet and has already been used to discover many other TNOs. Here, we perform tests on eight different supervised machine learning algorithms, using a dataset consisting of simulated TNOs buried within real DES noise data. We found that the best performing classifier was the Random Forest which, when optimised, performed well at detecting the rare objects. We achieve an area under the receiver operating characteristic (ROC) curve (AUC) of $0.996 \pm 0.001$. After optimizing the decision threshold of the Random Forest, we achieve a recall of 0.96 while maintaining a precision of 0.80. Finally, by using the optimized classifier to pre-select objects, we are able to run the orbit-fitting stage of our detection pipeline five times faster.
Submitted 10 December, 2020; v1 submitted 27 September, 2020;
originally announced September 2020.
-
Dynamic Portfolio Optimization with Real Datasets Using Quantum Processors and Quantum-Inspired Tensor Networks
Authors:
Samuel Mugel,
Carlos Kuchkovsky,
Escolastico Sanchez,
Samuel Fernandez-Lorenzo,
Jorge Luis-Hita,
Enrique Lizaso,
Roman Orus
Abstract:
In this paper we tackle the problem of dynamic portfolio optimization, i.e., determining the optimal trading trajectory for an investment portfolio of assets over a period of time, taking into account transaction costs and other possible constraints. This problem is central to quantitative finance. After a detailed introduction to the problem, we implement a number of quantum and quantum-inspired algorithms on different hardware platforms to solve its discrete formulation using real data from daily prices over 8 years of 52 assets, and do a detailed comparison of the obtained Sharpe ratios, profits and computing times. In particular, we implement classical solvers (Gekko, exhaustive), D-Wave Hybrid quantum annealing, two different approaches based on Variational Quantum Eigensolvers on IBM-Q (one of them brand-new and tailored to the problem), and for the first time in this context also a quantum-inspired optimizer based on Tensor Networks. In order to fit the data into each specific hardware platform, we also consider doing a preprocessing based on clustering of assets. From our comparison, we conclude that D-Wave Hybrid and Tensor Networks are able to handle the largest systems, where we do calculations up to 1272 fully-connected qubits for demonstrative purposes. Finally, we also discuss how to mathematically implement other possible real-life constraints, as well as several ideas to further improve the performance of the studied methods.
Submitted 6 December, 2021; v1 submitted 30 June, 2020;
originally announced July 2020.
-
A recurrent cycle consistency loss for progressive face-to-face synthesis
Authors:
Enrique Sanchez,
Michel Valstar
Abstract:
This paper addresses a major flaw of the cycle consistency loss when used to preserve the input appearance in the face-to-face synthesis domain. In particular, we show that the images generated by a network trained using this loss conceal noise that hinders their use for further tasks. To overcome this limitation, we propose a "recurrent cycle consistency loss" which, for different sequences of target attributes, minimises the distance between the output images, independent of any intermediate step. We empirically validate not only that our loss enables the re-use of generated images, but that it also improves their quality. In addition, we propose the very first network that covers the task of unconstrained landmark-guided face-to-face synthesis. Contrary to previous works, our proposed approach enables the transfer of a particular set of input features to a large span of poses and expressions, whereby the target landmarks become the ground-truth points. We then evaluate the consistency of our proposed approach to synthesise faces at the target landmarks. To the best of our knowledge, we are the first to propose a loss to overcome the limitation of the cycle consistency loss, and the first to propose an "in-the-wild" landmark-guided synthesis approach. Code and models for this paper can be found at https://github.com/ESanchezLozano/GANnotation
Submitted 14 April, 2020;
originally announced April 2020.
-
A Transfer Learning approach to Heatmap Regression for Action Unit intensity estimation
Authors:
Ioanna Ntinou,
Enrique Sanchez,
Adrian Bulat,
Michel Valstar,
Georgios Tzimiropoulos
Abstract:
Action Units (AUs) are geometrically-based atomic facial muscle movements known to produce appearance changes at specific facial locations. Motivated by this observation, we propose a novel AU modelling problem that consists of jointly estimating their localisation and intensity. To this end, we propose a simple yet efficient approach based on Heatmap Regression that merges both problems into a single task. A Heatmap models whether an AU occurs or not at a given spatial location. To accommodate the joint modelling of AU intensity, we propose variable size heatmaps, with their amplitude and size varying according to the labelled intensity. Using Heatmap Regression, we can build on the progress recently witnessed in facial landmark localisation. Building upon the similarities between both problems, we devise a transfer learning approach where we exploit the knowledge of a network trained on large-scale facial landmark datasets. In particular, we explore different alternatives for transfer learning through a) fine-tuning, b) adaptation layers, c) attention maps, and d) reparametrisation. Our approach effectively inherits the rich facial features produced by a strong face alignment network, with minimal extra computational cost. We empirically validate that our system sets a new state-of-the-art on three popular datasets, namely BP4D, DISFA, and FERA2017.
Submitted 14 April, 2020;
originally announced April 2020.
-
Learning Disentangled Representations via Mutual Information Estimation
Authors:
Eduardo Hugo Sanchez,
Mathieu Serrurier,
Mathias Ortner
Abstract:
In this paper, we investigate the problem of learning disentangled representations. Given a pair of images sharing some attributes, we aim to create a low-dimensional representation which is split into two parts: a shared representation that captures the common information between the images and an exclusive representation that contains the specific information of each image. To address this issue, we propose a model based on mutual information estimation without relying on image reconstruction or image generation. Mutual information maximization is performed to capture the attributes of data in the shared and exclusive representations while we minimize the mutual information between the shared and exclusive representation to enforce representation disentanglement. We show that these representations are useful to perform downstream tasks such as image classification and image retrieval based on the shared or exclusive component. Moreover, classification results show that our model outperforms the state-of-the-art model based on VAE/GAN approaches in representation disentanglement.
Submitted 9 December, 2019;
originally announced December 2019.
-
Object landmark discovery through unsupervised adaptation
Authors:
Enrique Sanchez,
Georgios Tzimiropoulos
Abstract:
This paper proposes a method to ease the unsupervised learning of object landmark detectors. Similarly to previous methods, our approach is fully unsupervised in the sense that it does not require or make any use of annotated landmarks for the target object category. Contrary to previous works, we do however assume that a landmark detector, which has already learned a structured representation for a given object category in a fully supervised manner, is available. Under this setting, our main idea boils down to adapting the given pre-trained network to the target object categories in a fully unsupervised manner. To this end, our method uses the pre-trained network as a core which remains frozen and does not get updated during training, and learns, in an unsupervised manner, only a projection matrix to perform the adaptation to the target categories. By building upon an existing structured representation learned in a supervised manner, the optimization problem solved by our method is much more constrained, with significantly fewer parameters to learn, which seems to be important for the case of unsupervised learning. We show that our method surpasses fully unsupervised techniques trained from scratch as well as a strong baseline based on fine-tuning, and produces state-of-the-art results on several datasets. Code can be found at https://github.com/ESanchezLozano/SAIC-Unsupervised-landmark-detection-NeurIPS2019 .
Submitted 21 October, 2019;
originally announced October 2019.
-
CADS: Core-Aware Dynamic Scheduler for Multicore Memory Controllers
Authors:
Eduardo Olmedo Sanchez,
Xian-He Sun
Abstract:
Memory controller scheduling is crucial in multicore processors, where DRAM bandwidth is shared. Since the increased number of requests from multiple cores becomes a bottleneck, scheduling the requests efficiently is necessary to utilize all the computing power these processors offer. However, current multicore processors use traditional memory controllers, which are designed for single-core processors. They are unable to adapt to the changing characteristics of memory workloads that run simultaneously on multiple cores. Existing schedulers may disrupt locality and bank parallelism among data requests coming from different cores. Hence, novel memory controllers that consider and adapt to the memory access characteristics, and share memory resources efficiently and fairly, are necessary. We introduce the Core-Aware Dynamic Scheduler (CADS) for multicore memory controllers. CADS uses Reinforcement Learning (RL) to alter its scheduling strategy dynamically at runtime. Our scheduler utilizes locality among data requests from multiple cores and exploits parallelism in accessing multiple banks of DRAM. CADS is also able to share the DRAM while guaranteeing fairness to all cores accessing memory. Using the CADS policy, we achieve 20% better cycles per instruction (CPI) when running memory-intensive and compute-intensive PARSEC parallel benchmarks simultaneously, and 16% better CPI with SPEC 2006 benchmarks.
Submitted 17 July, 2019;
originally announced July 2019.
-
Learning Disentangled Representations of Satellite Image Time Series
Authors:
Eduardo Sanchez,
Mathieu Serrurier,
Mathias Ortner
Abstract:
In this paper, we investigate how to learn a suitable representation of satellite image time series in an unsupervised manner by leveraging large amounts of unlabeled data. Additionally, we aim to disentangle the representation of time series into two representations: a shared representation that captures the common information between the images of a time series and an exclusive representation that contains the specific information of each image of the time series. To address these issues, we propose a model that combines a novel component called cross-domain autoencoders with the variational autoencoder (VAE) and generative adversarial network (GAN) methods. In order to learn disentangled representations of time series, our model learns the multimodal image-to-image translation task. We train our model using satellite image time series from the Sentinel-2 mission. Several experiments are carried out to evaluate the obtained representations. We show that these disentangled representations can be very useful for multiple tasks such as image classification, image retrieval, image segmentation and change detection.
Submitted 21 March, 2019;
originally announced March 2019.
-
Triple consistency loss for pairing distributions in GAN-based face synthesis
Authors:
Enrique Sanchez,
Michel Valstar
Abstract:
Generative Adversarial Networks have shown impressive results for the task of object translation, including face-to-face translation. A key component behind the success of recent approaches is the self-consistency loss, which encourages a network to recover the original input image when the output generated for a desired attribute is itself passed through the same network, but with the target attribute inverted. While the self-consistency loss yields photo-realistic results, it can be shown that the input and target domains, supposed to be close, differ substantially. This is empirically found by observing that a network recovers the input image even if attributes other than the inversion of the original goal are set as target. This prevents combining networks for different tasks, or using a network to do progressive forward passes. In this paper, we show empirical evidence of this effect, and propose a new loss to bridge the gap between the distributions of the input and target domains. This "triple consistency loss" aims to minimise the distance between the outputs generated by the network for different routes to the target, independent of any intermediate steps. To show this is effective, we incorporate the triple consistency loss into the training of a new landmark-guided face-to-face synthesis, where, contrary to previous works, the generated images can simultaneously undergo a large transformation in both expression and pose. To the best of our knowledge, we are the first to tackle the problem of mismatching distributions in self-domain synthesis, and to propose "in-the-wild" landmark-guided synthesis. Code will be available at https://github.com/ESanchezLozano/GANnotation
Submitted 8 November, 2018;
originally announced November 2018.