-
The Aloe Family Recipe for Open and Specialized Healthcare LLMs
Authors:
Dario Garcia-Gasulla,
Jordi Bayarri-Planas,
Ashwin Kumar Gururajan,
Enrique Lopez-Cuena,
Adrian Tormos,
Daniel Hinjos,
Pablo Bernabeu-Perez,
Anna Arias-Duart,
Pablo Agustin Martin-Torres,
Marta Gonzalez-Mallo,
Sergio Alvarez-Napagao,
Eduard Ayguadé-Parra,
Ulises Cortés
Abstract:
Purpose: With advancements in Large Language Models (LLMs) for healthcare, the need arises for competitive open-source models to protect the public interest. This work contributes to the field of open medical LLMs by optimizing key stages of data preprocessing and training, while showing how to improve model safety (through DPO) and efficacy (through RAG). The evaluation methodology used, which includes four different types of tests, defines a new standard for the field. The resultant models, shown to be competitive with the best private alternatives, are released with a permissive license.
Methods: Building on top of strong base models like Llama 3.1 and Qwen 2.5, Aloe Beta uses a custom dataset to enhance public data with synthetic Chain of Thought examples. The models undergo alignment with Direct Preference Optimization, emphasizing ethical and policy-aligned performance in the presence of jailbreaking attacks. Evaluation includes close-ended, open-ended, safety and human assessments, to maximize the reliability of results.
Results: Recommendations are made across the entire pipeline, backed by the solid performance of the Aloe Family. These models deliver competitive performance across healthcare benchmarks and medical fields, and are often preferred by healthcare professionals. On bias and toxicity, the Aloe Beta models significantly improve safety, showing resilience to unseen jailbreaking attacks. For a responsible release, a detailed risk assessment specific to healthcare is attached to the Aloe Family models.
Conclusion: The Aloe Beta models, and the recipe that leads to them, are a significant contribution to the open-source medical LLM field, offering top-of-the-line performance while maintaining high ethical requirements. This work sets a new standard for developing and reporting aligned LLMs in healthcare.
Submitted 28 May, 2025; v1 submitted 7 May, 2025;
originally announced May 2025.
-
TuRTLe: A Unified Evaluation of LLMs for RTL Generation
Authors:
Dario Garcia-Gasulla,
Gokcen Kestor,
Emanuele Parisi,
Miquel Albertí-Binimelis,
Cristian Gutierrez,
Razine Moundir Ghorab,
Orlando Montenegro,
Bernat Homs,
Miquel Moreto
Abstract:
The rapid advancements in LLMs have driven the adoption of generative AI in various domains, including Electronic Design Automation (EDA). Unlike traditional software development, EDA presents unique challenges, as generated RTL code must not only be syntactically correct and functionally accurate but also synthesizable by hardware generators while meeting performance, power, and area (PPA) constraints. These additional requirements introduce complexities that existing code-generation benchmarks often fail to capture, limiting their effectiveness in evaluating LLMs for RTL generation. To address this gap, we propose TuRTLe, a unified evaluation framework designed to systematically assess LLMs across key RTL generation tasks. TuRTLe integrates multiple existing benchmarks and automates the evaluation process, enabling a comprehensive assessment of LLM performance in syntax correctness, functional correctness, synthesis, PPA optimization, and exact line completion. Using this framework, we benchmark a diverse set of open LLMs and analyze their strengths and weaknesses in EDA-specific tasks. Our results show that reasoning-based models, such as DeepSeek R1, consistently outperform others across multiple evaluation criteria, but at the cost of increased computational overhead and inference latency. Additionally, base models are better suited to module completion tasks, while instruct-tuned models perform better in specification-to-RTL tasks.
Submitted 30 May, 2025; v1 submitted 31 March, 2025;
originally announced April 2025.
-
Efficient Safety Retrofitting Against Jailbreaking for LLMs
Authors:
Dario Garcia-Gasulla,
Adrian Tormos,
Anna Arias-Duart,
Daniel Hinjos,
Oscar Molina-Sedano,
Ashwin Kumar Gururajan,
Maria Eugenia Cardello
Abstract:
Direct Preference Optimization (DPO) is an efficient alignment technique that steers LLMs towards preferable outputs by training on preference data, bypassing the need for explicit reward models. Its simplicity enables easy adaptation to various domains and safety requirements. This paper examines DPO's effectiveness in model safety against jailbreaking attacks while minimizing data requirements and training costs. We introduce Egida, a dataset expanded from multiple sources, which includes 27 different safety topics and 18 different attack styles, complemented with synthetic and human labels. This data is used to boost the safety of state-of-the-art LLMs (Llama-3.1-8B/70B-Instruct, Qwen-2.5-7B/72B-Instruct) across topics and attack styles. In addition to safety evaluations, we assess their post-alignment performance degradation in general-purpose tasks, and their tendency to over-refuse. Following the proposed methodology, trained models reduce their Attack Success Rate by 10%-30%, using small training efforts (2,000 samples) with low computational cost ($3 for 8B models, $20 for 72B models). Safety-aligned models generalize to unseen topics and attack styles, with the most successful attack style reaching a success rate around 5%. Size and family are found to strongly influence model malleability towards safety, pointing at the importance of pre-training choices. To validate our findings, a large independent assessment of human preference agreement with Llama-Guard-3-8B is conducted by the authors and the associated dataset Egida-HSafe is released. Overall, this study illustrates how affordable and accessible it is to enhance LLM safety using DPO while outlining its current limitations. All datasets and models are released to enable reproducibility and further research.
Submitted 25 February, 2025; v1 submitted 19 February, 2025;
originally announced February 2025.
-
Automatic Evaluation of Healthcare LLMs Beyond Question-Answering
Authors:
Anna Arias-Duart,
Pablo Agustin Martin-Torres,
Daniel Hinjos,
Pablo Bernabeu-Perez,
Lucia Urcelay Ganzabal,
Marta Gonzalez Mallo,
Ashwin Kumar Gururajan,
Enrique Lopez-Cuena,
Sergio Alvarez-Napagao,
Dario Garcia-Gasulla
Abstract:
Current Large Language Model (LLM) benchmarks are often based on open-ended or close-ended QA evaluations, avoiding the requirement of human labor. Close-ended measurements evaluate the factuality of responses but lack expressiveness. Open-ended measurements capture the model's capacity to produce discourse responses but are harder to assess for correctness. These two approaches are commonly used, either independently or together, though their relationship remains poorly understood. This work is focused on the healthcare domain, where both factuality and discourse matter greatly. It introduces a comprehensive, multi-axis suite for healthcare LLM evaluation, exploring correlations between open and close benchmarks and metrics. Findings include blind spots and overlaps in current methodologies. As an updated sanity check, we release a new medical benchmark --CareQA-- with both open and closed variants. Finally, we propose a novel metric for open-ended evaluations -- Relaxed Perplexity -- to mitigate the identified limitations.
Submitted 10 February, 2025;
originally announced February 2025.
-
Data Augmentation with Diffusion Models for Colon Polyp Localization on the Low Data Regime: How much real data is enough?
Authors:
Adrian Tormos,
Blanca Llauradó,
Fernando Núñez,
Axel Romero,
Dario Garcia-Gasulla,
Javier Béjar
Abstract:
The scarcity of data in medical domains hinders the performance of Deep Learning models. Data augmentation techniques can alleviate that problem, but they usually rely on functional transformations of the data that are not guaranteed to preserve the original task. Approximating the distribution of the data with generative models is a way of reducing that problem and of obtaining new samples that resemble the original data. Denoising Diffusion models are a promising Deep Learning technique that can learn good approximations of different kinds of data like images, time series or tabular data.
Automatic colonoscopy analysis, and specifically polyp localization in colonoscopy videos, is a task that can assist clinical diagnosis and treatment. The annotation of video frames for training a deep learning model is a time-consuming task, and usually only small datasets can be obtained. Fine-tuning application models using a large dataset of generated data could be an alternative to improve their performance. We conduct a set of experiments training different diffusion models that can jointly generate colonoscopy images with localization annotations using a combination of existing open datasets. The generated data is used in various transfer learning experiments on the task of polyp localization with a model based on YOLO v9 in the low data regime.
Submitted 28 November, 2024;
originally announced November 2024.
-
Pareto-Optimized Open-Source LLMs for Healthcare via Context Retrieval
Authors:
Jordi Bayarri-Planas,
Ashwin Kumar Gururajan,
Dario Garcia-Gasulla
Abstract:
This study leverages optimized context retrieval to enhance open-source Large Language Models (LLMs) for cost-effective, high performance healthcare AI. We demonstrate that this approach achieves state-of-the-art accuracy on medical question answering at a fraction of the cost of proprietary models, significantly improving the cost-accuracy Pareto frontier on the MedQA benchmark. Key contributions include: (1) OpenMedQA, a novel benchmark revealing a performance gap in open-ended medical QA compared to multiple-choice formats; (2) a practical, reproducible pipeline for context retrieval optimization; and (3) open-source resources (Prompt Engine, CoT/ToT/Thinking databases) to empower healthcare AI development. By advancing retrieval techniques and QA evaluation, we enable more affordable and reliable LLM solutions for healthcare.
Submitted 3 April, 2025; v1 submitted 23 September, 2024;
originally announced September 2024.
-
Present and Future Generalization of Synthetic Image Detectors
Authors:
Pablo Bernabeu-Perez,
Enrique Lopez-Cuena,
Dario Garcia-Gasulla
Abstract:
The continued release of increasingly realistic image generation models creates a demand for synthetic image detectors. To build effective detectors we must first understand how factors like data source diversity, training methodologies and image alterations affect their generalization capabilities. This work conducts a systematic analysis and uses its insights to develop practical guidelines for training robust synthetic image detectors. Model generalization capabilities are evaluated across different setups (e.g. scale, sources, transformations) including real-world deployment conditions. Through an extensive benchmarking of state-of-the-art detectors across diverse and recent datasets, we show that while current approaches excel in specific scenarios, no single detector achieves universal effectiveness. Critical flaws are identified in detectors, and workarounds are proposed to enable the deployment of real-world detector applications enhancing accuracy, reliability and robustness beyond the limitations of current systems.
Submitted 26 November, 2024; v1 submitted 21 September, 2024;
originally announced September 2024.
-
Aloe: A Family of Fine-tuned Open Healthcare LLMs
Authors:
Ashwin Kumar Gururajan,
Enrique Lopez-Cuena,
Jordi Bayarri-Planas,
Adrian Tormos,
Daniel Hinjos,
Pablo Bernabeu-Perez,
Anna Arias-Duart,
Pablo Agustin Martin-Torres,
Lucia Urcelay-Ganzabal,
Marta Gonzalez-Mallo,
Sergio Alvarez-Napagao,
Eduard Ayguadé-Parra,
Ulises Cortés,
Dario Garcia-Gasulla
Abstract:
As the capabilities of Large Language Models (LLMs) in healthcare and medicine continue to advance, there is a growing need for competitive open-source models that can safeguard public interest. With the increasing availability of highly competitive open base models, the impact of continued pre-training is increasingly uncertain. In this work, we explore the role of instruct tuning, model merging, alignment, red teaming and advanced inference schemes, as means to improve current open models. To that end, we introduce the Aloe family, a set of open medical LLMs highly competitive within its scale range. Aloe models are trained on the current best base models (Mistral, LLaMA 3), using a new custom dataset which combines public data sources improved with synthetic Chain of Thought (CoT). Aloe models undergo an alignment phase, becoming some of the first policy-aligned open healthcare LLMs using Direct Preference Optimization, setting a new standard for ethical performance in healthcare LLMs. Model evaluation expands to include various bias and toxicity datasets, a dedicated red teaming effort, and a much-needed risk assessment for healthcare LLMs. Finally, to explore the limits of current LLMs in inference, we study several advanced prompt engineering strategies to boost performance across benchmarks, yielding state-of-the-art results for open healthcare 7B LLMs, unprecedented at this scale.
Submitted 3 May, 2024;
originally announced May 2024.
-
Padding Aware Neurons
Authors:
Dario Garcia-Gasulla,
Victor Gimenez-Abalos,
Pablo Martin-Torres
Abstract:
Convolutional layers are a fundamental component of most image-related models. These layers often implement by default a static padding policy (e.g., zero padding), to control the scale of the internal representations, and to allow kernel activations centered on the border regions. In this work we identify Padding Aware Neurons (PANs), a type of filter that is found in most (if not all) convolutional models trained with static padding. PANs focus on the characterization and recognition of input border location, introducing a spatial inductive bias into the model (e.g., how close to the input's border a pattern typically is). We propose a method to identify PANs through their activations, and explore their presence in several popular pre-trained models, finding PANs on all models explored, from dozens to hundreds. We discuss and illustrate different types of PANs, their kernels and behaviour. To understand their relevance, we test their impact on model performance, and find padding and PANs to induce strong and characteristic biases in the data. Finally, we discuss whether or not PANs are desirable, as well as the potential side effects of their presence in the context of model performance, generalisation, efficiency and safety.
Submitted 14 September, 2023;
originally announced September 2023.
-
Exploring the Role of Explainability in AI-Assisted Embryo Selection
Authors:
Lucia Urcelay,
Daniel Hinjos,
Pablo A. Martin-Torres,
Marta Gonzalez,
Marta Mendez,
Salva Cívico,
Sergio Álvarez-Napagao,
Dario Garcia-Gasulla
Abstract:
In Vitro Fertilization is among the most widespread treatments for infertility. One of its main challenges is the evaluation and selection of embryos for implantation, a process with large inter- and intra-clinician variability. Deep learning based methods are gaining attention, but their opaque nature compromises their acceptance in the clinical context, where transparency in decision making is key. In this paper we analyze the current work on the explainability of AI-assisted embryo analysis models, identifying its limitations. We also discuss how these models could be integrated in the clinical context as decision support systems, considering the needs of clinicians and patients. Finally, we propose guidelines for increasing interpretability and trustworthiness, pushing this technology forward towards established clinical practice.
Submitted 1 August, 2023;
originally announced August 2023.
-
When & How to Transfer with Transfer Learning
Authors:
Adrian Tormos,
Dario Garcia-Gasulla,
Victor Gimenez-Abalos,
Sergio Alvarez-Napagao
Abstract:
In deep learning, transfer learning (TL) has become the de facto approach when dealing with image related tasks. Visual features learnt for one task have been shown to be reusable for other tasks, improving performance significantly. By reusing deep representations, TL enables the use of deep models in domains with limited data availability, limited computational resources and/or limited access to human experts, domains which include the vast majority of real-life applications. This paper conducts an experimental evaluation of TL, exploring its trade-offs with respect to performance, environmental footprint, human hours and computational requirements. Results highlight the cases where a cheap feature extraction approach is preferable, and the situations where an expensive fine-tuning effort may be worth the added cost. Finally, a set of guidelines on the use of TL is proposed.
Submitted 8 November, 2022;
originally announced November 2022.
-
Healthy Twitter discussions? Time will tell
Authors:
Dmitry Gnatyshak,
Dario Garcia-Gasulla,
Sergio Alvarez-Napagao,
Jamie Arjona,
Tommaso Venturini
Abstract:
Studying misinformation and how to deal with unhealthy behaviours within online discussions has recently become an important field of research within social studies. With the rapid development of social media, and the increasing amount of available information and sources, rigorous manual analysis of such discourses has become unfeasible. Many approaches tackle the issue by studying the semantic and syntactic properties of discussions following a supervised approach, for example using natural language processing on a dataset labeled for abusive, fake or bot-generated content. Solutions based on the existence of a ground truth are limited to those domains which may have ground truth. However, within the context of misinformation, it may be difficult or even impossible to assign labels to instances. In this context, we consider the use of temporal dynamic patterns as an indicator of discussion health. Working in a domain for which ground truth was unavailable at the time (early COVID-19 pandemic discussions) we explore the characterization of discussions based on the volume and time of contributions. First we explore the types of discussions in an unsupervised manner, and then characterize these types using the concept of ephemerality, which we formalize. In the end, we discuss the potential use of our ephemerality definition for labeling online discourses based on how desirable, healthy and constructive they are.
Submitted 12 May, 2022; v1 submitted 21 March, 2022;
originally announced March 2022.
-
Focus! Rating XAI Methods and Finding Biases
Authors:
Anna Arias-Duart,
Ferran Parés,
Dario Garcia-Gasulla,
Victor Gimenez-Abalos
Abstract:
AI explainability improves the transparency of models, making them more trustworthy. Such goals are motivated by the emergence of deep learning models, which are obscure by nature; even in the domain of images, where deep learning has succeeded the most, explainability is still poorly assessed. In the field of image recognition many feature attribution methods have been proposed with the purpose of explaining a model's behavior using visual cues. However, no metrics have been established so far to assess and select these methods objectively. In this paper we propose a consistent evaluation score for feature attribution methods -- the Focus -- designed to quantify their coherency to the task. While most previous work adds out-of-distribution noise to samples, we introduce a methodology to add noise from within the distribution. This is done through mosaics of instances from different classes, and the explanations these generate. On those, we compute a visual pseudo-precision metric, Focus. First, we show the robustness of the approach through a set of randomization experiments. Then we use Focus to compare six popular explainability techniques across several CNN architectures and classification datasets. Our results find some methods to be consistently reliable (LRP, GradCAM), while others produce class-agnostic explanations (SmoothGrad, IG). Finally we introduce another application of Focus, using it for the identification and characterization of biases found in models. This empowers bias-management tools, in another small step towards trustworthy AI.
Submitted 28 February, 2022; v1 submitted 28 September, 2021;
originally announced September 2021.
-
Signs for Ethical AI: A Route Towards Transparency
Authors:
Dario Garcia-Gasulla,
Atia Cortés,
Sergio Alvarez-Napagao,
Ulises Cortés
Abstract:
Today, Artificial Intelligence (AI) has a direct impact on the daily life of billions of people. Being applied to sectors like finance, health, security and advertisement, AI fuels some of the biggest companies and research institutions in the world. Its impact in the near future seems difficult to predict or bound. In contrast to all this power, society remains mostly ignorant of the capabilities and standard practices of AI today. To address this imbalance, improving current interactions between people and AI systems, we propose a transparency scheme to be implemented on any AI system open to the public. The scheme is based on two pillars: Data Privacy and AI Transparency. The first recognizes the relevance of data for AI, and is supported by GDPR. The second considers aspects of AI transparency currently unregulated: AI capabilities, purpose and source. We design this pillar based on ethical principles. For each of the two pillars, we define a three-level display. The first level is based on visual signs, inspired by traffic signs managing the interaction between people and cars, and designed for quick and universal interpretability. The second level uses factsheets, providing limited details. The last level provides access to all available information. After detailing and exemplifying the proposed transparency scheme, we define a set of principles for creating transparent by design software, to be used during the integration of AI components on user-oriented services.
Submitted 9 May, 2022; v1 submitted 29 September, 2020;
originally announced September 2020.
-
The MAMe Dataset: On the relevance of High Resolution and Variable Shape image properties
Authors:
Ferran Parés,
Anna Arias-Duart,
Dario Garcia-Gasulla,
Gema Campo-Francés,
Nina Viladrich,
Eduard Ayguadé,
Jesús Labarta
Abstract:
In the image classification task, the most common approach is to resize all images in a dataset to a unique shape, while reducing their precision to a size which facilitates experimentation at scale. This practice has benefits from a computational perspective, but it entails negative side-effects on performance due to loss of information and image deformation. In this work we introduce the MAMe dataset, an image classification dataset with remarkably high resolution and variable shape properties. The goal of MAMe is to provide a tool for studying the impact of such properties in image classification, while motivating research in the field. The MAMe dataset contains thousands of artworks from three different museums, and proposes a classification task consisting of differentiating between 29 mediums (i.e. materials and techniques) supervised by art experts. After reviewing the singularity of MAMe in the context of current image classification tasks, a thorough description of the task is provided, together with dataset statistics. Experiments are conducted to evaluate the impact of using high resolution images, variable shape inputs and both properties at the same time. Results illustrate the positive impact on performance of using high resolution images, while highlighting the lack of solutions to exploit variable shapes. An additional experiment exposes the distinctiveness between the MAMe dataset and the prototypical ImageNet dataset. Finally, the baselines are inspected using explainability methods and expert knowledge, to gain insights on the challenges that remain ahead.
Submitted 20 May, 2021; v1 submitted 27 July, 2020;
originally announced July 2020.
-
DOME: Recommendations for supervised machine learning validation in biology
Authors:
Ian Walsh,
Dmytro Fishman,
Dario Garcia-Gasulla,
Tiina Titma,
Gianluca Pollastri,
The ELIXIR Machine Learning focus group,
Jen Harrow,
Fotis E. Psomopoulos,
Silvio C. E. Tosatto
Abstract:
Modern biology frequently relies on machine learning to provide predictions and improve decision processes. There have been recent calls for more scrutiny on machine learning performance and possible limitations. Here we present a set of community-wide recommendations aiming to help establish standards of supervised machine learning validation in biology. Adopting a structured methods description for machine learning based on data, optimization, model, evaluation (DOME) will aim to help both reviewers and readers to better understand and assess the performance and limitations of a method or outcome. The recommendations are formulated as questions to anyone wishing to pursue implementation of a machine learning algorithm. Answers to these questions can be easily included in the supplementary material of published papers.
Submitted 7 January, 2021; v1 submitted 25 June, 2020;
originally announced June 2020.
-
Global Data Science Project for COVID-19
Authors:
Toyotaro Suzumura,
Dario Garcia-Gasulla,
Sergio Alvarez Napagao,
Irene Li,
Hiroshi Maruyama,
Hiroki Kanezashi,
Raquel Pérez-Arnal,
Kunihiko Miyoshi,
Euma Ishii,
Keita Suzuki,
Sayaka Shiba,
Mariko Kurokawa,
Yuta Kanzawa,
Naomi Nakagawa,
Masatoshi Hanai,
Yixin Li,
Tianxiao Li
Abstract:
This paper aims at providing a summary of the Global Data Science Project (GDSC) for COVID-19 as of May 31, 2020. COVID-19 has greatly impacted our societies through both direct and indirect effects transmitted by the policy measures to counter the spread of the virus. We quantitatively analysed the multifaceted impacts of the COVID-19 pandemic on our societies including people's mobility, health, and social behaviour changes. People's mobility has changed significantly due to the implementation of travel restrictions and quarantine measures. Indeed, the physical distance has widened at international (cross-border), national and regional level. At international level, due to the travel restrictions, the number of international flights plunged by around 88 percent overall during March. In particular, the number of flights connecting Europe dropped drastically in mid-March after the United States announced travel restrictions to Europe and the EU and participating countries agreed to close borders, an 84 percent decline compared to March 10th. Similarly, we examined the impacts of quarantine measures in three major cities: Tokyo (Japan), New York City (the United States), and Barcelona (Spain). Within all three cities, we found a significant decline in traffic volume. We also identified increased concern for mental health through the analysis of posts on social networking services such as Twitter and Instagram. Notably, in the beginning of April 2020, the number of posts with #depression on Instagram doubled, which might reflect the rise in mental health awareness among Instagram users. Besides, we identified changes in a wide range of people's social behaviors, as well as economic impacts, through the analysis of Instagram data and primary survey data.
Submitted 3 August, 2021; v1 submitted 9 June, 2020;
originally announced June 2020.
-
The Impact of COVID-19 on Flight Networks
Authors:
Toyotaro Suzumura,
Hiroki Kanezashi,
Mishal Dholakia,
Euma Ishii,
Sergio Alvarez Napagao,
Raquel Pérez-Arnal,
Dario Garcia-Gasulla,
Toshiaki Murofushi
Abstract:
As COVID-19 transmissions spread worldwide, governments have announced and enforced travel restrictions to prevent further infections. Such restrictions have a direct effect on the volume of international flights among these countries, resulting in extensive social and economic costs. To better understand the situation in a quantitative manner, we used the OpenSky Network data to clarify flight patterns and flight densities around the world and observe relationships between flight numbers and new infections, and between flight numbers and the economy (unemployment rate) in Barcelona. We found that the number of daily flights gradually decreased and suddenly dropped 64% during the second half of March 2020 after the US and Europe enacted travel restrictions. We also observed a 51% decrease in the global flight network density during this period. Regarding new COVID-19 cases, the world had an unexpected surge regardless of travel restrictions. Finally, the layoffs for temporary workers in the tourism and airplane business increased 4.3-fold in the weeks following Spain's decision to close its borders.
Submitted 14 February, 2021; v1 submitted 4 June, 2020;
originally announced June 2020.
-
What are We Depressed about When We Talk about COVID19: Mental Health Analysis on Tweets Using Natural Language Processing
Authors:
Irene Li,
Yixin Li,
Tianxiao Li,
Sergio Alvarez-Napagao,
Dario Garcia-Gasulla,
Toyotaro Suzumura
Abstract:
The outbreak of coronavirus disease 2019 (COVID-19) has recently affected human life to a great extent. Besides direct physical and economic threats, the pandemic also indirectly impacts people's mental health conditions, which can be overwhelming but difficult to measure. The problem may stem from various causes such as unemployment status, stay-at-home policy, fear of the virus, and so forth. In this work, we focus on applying natural language processing (NLP) techniques to analyze tweets in terms of mental health. We trained deep models that classify each tweet into the following emotions: anger, anticipation, disgust, fear, joy, sadness, surprise and trust. We built the EmoCT (Emotion-Covid19-Tweet) dataset for training purposes by manually labeling 1,000 English tweets. Furthermore, we propose and compare two methods to find out the reasons that are causing sadness and fear.
Submitted 8 June, 2020; v1 submitted 22 April, 2020;
originally announced April 2020.
-
Obstruction level detection of sewer videos using convolutional neural networks
Authors:
Mario A. Gutierrez-Mondragon,
Dario Garcia-Gasulla,
Sergio Alvarez-Napagao,
Jaume Brossa-Ordoñez,
Rafael Gimenez-Esteban
Abstract:
Worldwide, sewer networks are designed to transport wastewater to a centralized treatment plant to be treated and returned to the environment. This process is critical for the current society, preventing waterborne illnesses, providing safe drinking water and enhancing general sanitation. To keep a sewer network perfectly operational, sampling inspections are performed constantly to identify obstructions. Typically, a Closed-Circuit Television system is used to record the inside of pipes and report the obstruction level, which may trigger a cleaning operation. Currently, the obstruction level assessment is done manually, which is time-consuming and inconsistent. In this work, we design a methodology to train a Convolutional Neural Network for identifying the level of obstruction in pipes, thus reducing the human effort required on such a frequent and repetitive task. We gathered a database of videos that are explored and adapted to generate useful frames to feed into the model. Our resulting classifier obtains deployment-ready performance. To validate the consistency of the approach and its industrial applicability, we integrate the Layer-wise Relevance Propagation explainability technique, which enables us to further understand the behavior of the neural network for this task. In the end, the proposed system can provide higher speed, accuracy, and consistency in the process of sewer examination. Our analysis also uncovers some guidelines on how to further improve the quality of the data gathering methodology.
Submitted 4 February, 2020;
originally announced February 2020.
-
Random Forest as a Tumour Genetic Marker Extractor
Authors:
Raquel Pérez-Arnal,
Dario Garcia-Gasulla,
David Torrents,
Ferran Parés,
Ulises Cortés,
Jesús Labarta,
Eduard Ayguadé
Abstract:
Finding tumour genetic markers is essential to biomedicine due to their relevance for cancer detection and therapy development. In this paper, we explore a recently released dataset of chromosome rearrangements in 2,586 cancer patients, where different sorts of alterations have been detected. Using a Random Forest classifier, we evaluate the relevance of several features (some directly available in the original data, some engineered by us) related to chromosome rearrangements. This evaluation results in a set of potential tumour genetic markers, some of which are validated in the bibliography, while others are potentially novel.
Submitted 26 November, 2019;
originally announced November 2019.
-
Towards a Goal-oriented Agent-based Simulation framework for High-Performance Computing
Authors:
Dmitry Gnatyshak,
Luis Oliva-Felipe,
Sergio Álvarez-Napagao,
Julian Padget,
Javier Vázquez-Salceda,
Dario Garcia-Gasulla,
Ulises Cortés
Abstract:
Currently, agent-based simulation frameworks force the user to choose between simulations involving a large number of agents (at the expense of limited agent reasoning capability) or simulations including agents with increased reasoning capabilities (at the expense of a limited number of agents per simulation). This paper describes a first attempt at putting goal-oriented agents into large agent-based (micro-)simulations. We discuss a model for goal-oriented agents in High-Performance Computing (HPC) and then briefly discuss its implementation in PyCOMPSs (a library that eases the parallelisation of tasks) to build such a platform that benefits from a large number of agents with the capacity to execute complex cognitive agents.
Submitted 22 November, 2019;
originally announced November 2019.
-
MetH: A family of high-resolution and variable-shape image challenges
Authors:
Ferran Parés,
Dario Garcia-Gasulla,
Harald Servat,
Jesús Labarta,
Eduard Ayguadé
Abstract:
High-resolution and variable-shape images have not yet been properly addressed by the AI community. The approach of down-sampling data often used with convolutional neural networks is sub-optimal for many tasks, and has too many drawbacks to be considered a sustainable alternative. In light of the increasing importance of problems that can benefit from exploiting high-resolution (HR) and variable-shape images, and with the goal of promoting research in that direction, we introduce a new family of datasets (MetH). The four proposed problems include two image classification tasks, one image regression task and one super-resolution task. Each of these datasets contains thousands of art pieces captured by HR and variable-shape images, labeled by experts at the Metropolitan Museum of Art. We perform an analysis, which shows how the proposed tasks go well beyond current public alternatives in both pixel size and aspect ratio variance. At the same time, the performance obtained by popular architectures on these tasks shows that there is ample room for improvement. To wrap up the relevance of the contribution we review the fields, both in AI and high-performance computing, that could benefit from the proposed challenges.
Submitted 29 September, 2020; v1 submitted 20 November, 2019;
originally announced November 2019.
-
Feature discriminativity estimation in CNNs for transfer learning
Authors:
Victor Gimenez-Abalos,
Armand Vilalta,
Dario Garcia-Gasulla,
Jesus Labarta,
Eduard Ayguadé
Abstract:
The purpose of feature extraction on convolutional neural networks is to reuse deep representations learnt by a pre-trained model to solve a new, potentially unrelated problem. However, raw feature extraction from all layers is unfeasible given the massive size of these networks. Recently, a supervised method using complexity reduction was proposed, resulting in significant improvements in performance for transfer learning tasks. This approach first computes the discriminative power of features, and then discretises them using thresholds computed for the task. In this paper, we analyse the behaviour of these thresholds, with the purpose of finding a methodology for their estimation. After a comprehensive study, we find a very strong correlation between problem size and threshold value, with a coefficient of determination above 90%. These results allow us to propose a unified model for threshold estimation, with potential application to transfer learning tasks.
Submitted 8 November, 2019;
originally announced November 2019.
-
Adaptive Pattern Matching with Reinforcement Learning for Dynamic Graphs
Authors:
Hiroki Kanezashi,
Toyotaro Suzumura,
Dario Garcia-Gasulla,
Min-hwan Oh,
Satoshi Matsuoka
Abstract:
Graph pattern matching algorithms to handle million-scale dynamic graphs are widely used in many applications such as social network analytics and suspicious transaction detection in financial networks. On the other hand, the computational complexity of many graph pattern matching algorithms is high, and it is not affordable to extract patterns from million-scale graphs. Moreover, most real-world networks are time-evolving, updating their structures continuously, which makes it harder to update and output newly matched patterns in real time. Many incremental graph pattern matching algorithms which reduce the number of updates have been proposed to handle such dynamic graphs. However, it is still challenging to recompute vertices in the incremental graph pattern matching algorithms in a single process, and that prevents real-time analysis. We propose an incremental graph pattern matching algorithm to deal with time-evolving graph data and also propose an adaptive optimization system based on reinforcement learning to recompute vertices in the incremental process more efficiently. Then we discuss the qualitative efficiency of our system with several types of data graphs and pattern graphs. We evaluate the performance using million-scale attributed and time-evolving social graphs. Our incremental algorithm is up to 10.1 times faster than an existing graph pattern matching algorithm, and 1.95 times faster with the adaptive system in a computation node than naive incremental processing.
Submitted 21 December, 2018;
originally announced December 2018.
-
A Visual Distance for WordNet
Authors:
Raquel Pérez-Arnal,
Armand Vilalta,
Dario Garcia-Gasulla,
Ulises Cortés,
Eduard Ayguadé,
Jesus Labarta
Abstract:
Measuring the distance between concepts is an important field of study of Natural Language Processing, as it can be used to improve tasks related to the interpretation of those same concepts. WordNet, which includes a wide variety of concepts associated with words (i.e., synsets), is often used as a source for computing those distances. In this paper, we explore a distance for WordNet synsets based on visual features, instead of lexical ones. For this purpose, we extract the graphic features generated within a deep convolutional neural network trained with ImageNet and use those features to generate a representative of each synset. Based on those representatives, we define a distance measure of synsets, which complements the traditional lexical distances. Finally, we propose some experiments to evaluate its performance and compare it with the current state-of-the-art.
Submitted 27 April, 2018; v1 submitted 24 April, 2018;
originally announced April 2018.
-
Full-Network Embedding in a Multimodal Embedding Pipeline
Authors:
Armand Vilalta,
Dario Garcia-Gasulla,
Ferran Parés,
Eduard Ayguadé,
Jesus Labarta,
Ulises Cortés,
Toyotaro Suzumura
Abstract:
The current state-of-the-art for image annotation and image retrieval tasks is obtained through deep neural networks, which combine an image representation and a text representation into a shared embedding space. In this paper we evaluate the impact of using the Full-Network embedding in this setting, replacing the original image representation in a competitive multimodal embedding generation scheme. Unlike the one-layer image embeddings typically used by most approaches, the Full-Network embedding provides a multi-scale representation of images, which results in richer characterizations. To measure the influence of the Full-Network embedding, we evaluate its performance on three different datasets, and compare the results with the original multimodal embedding generation scheme when using a one-layer image embedding, and with the rest of the state-of-the-art. Results for image annotation and image retrieval tasks indicate that the Full-Network embedding is consistently superior to the one-layer embedding. These results motivate the integration of the Full-Network embedding on any multimodal embedding generation scheme, something feasible thanks to the flexibility of the approach.
Submitted 9 August, 2017; v1 submitted 24 July, 2017;
originally announced July 2017.
-
Building Graph Representations of Deep Vector Embeddings
Authors:
Dario Garcia-Gasulla,
Armand Vilalta,
Ferran Parés,
Jonatan Moreno,
Eduard Ayguadé,
Jesus Labarta,
Ulises Cortés,
Toyotaro Suzumura
Abstract:
Patterns stored within pre-trained deep neural networks compose large and powerful descriptive languages that can be used for many different purposes. Typically, deep network representations are implemented within vector embedding spaces, which enables the use of traditional machine learning algorithms on top of them. In this short paper we propose the construction of a graph embedding space instead, introducing a methodology to transform the knowledge coded within a deep convolutional network into a topological space (i.e. a network). We outline how such a graph can hold data instances, data features, relations between instances and features, and relations among features. Finally, we introduce some preliminary experiments to illustrate how the resultant graph embedding space can be exploited through graph analytics algorithms.
Submitted 9 August, 2017; v1 submitted 24 July, 2017;
originally announced July 2017.
-
An Out-of-the-box Full-network Embedding for Convolutional Neural Networks
Authors:
Dario Garcia-Gasulla,
Armand Vilalta,
Ferran Parés,
Jonatan Moreno,
Eduard Ayguadé,
Jesus Labarta,
Ulises Cortés,
Toyotaro Suzumura
Abstract:
Transfer learning for feature extraction can be used to exploit deep representations in contexts where there is very little training data, where there are limited computational resources, or when tuning the hyper-parameters needed for training is not an option. While previous contributions to feature extraction propose embeddings based on a single layer of the network, in this paper we propose a full-network embedding which successfully integrates convolutional and fully connected features, coming from all layers of a deep convolutional neural network. To do so, the embedding normalizes features in the context of the problem, and discretizes their values to reduce noise and regularize the embedding space. Significantly, this also reduces the computational cost of processing the resultant representations. The proposed method is shown to outperform single layer embeddings on several image classification tasks, while also being more robust to the choice of the pre-trained model used for obtaining the initial features. The performance gap in classification accuracy between thoroughly tuned solutions and the full-network embedding is also reduced, which makes the proposed approach a competitive solution for a large set of applications.
Submitted 22 May, 2017;
originally announced May 2017.
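The two post-processing steps named in the abstract above (normalizing features in the context of the problem, then discretizing their values) might be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the per-feature standardization and the thresholds of -0.5 and 0.5 are all assumptions chosen for the example, since the abstract does not specify them.

```python
from statistics import mean, pstdev

def standardize_columns(rows):
    """Standardize each feature (column) with statistics from this dataset only."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    # A constant feature has zero deviation; fall back to 1.0 to avoid dividing by zero.
    stds = [pstdev(c) or 1.0 for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def discretize(rows, low=-0.5, high=0.5):
    """Map each standardized value to -1, 0 or 1 (thresholds are illustrative assumptions)."""
    return [[-1 if v < low else (1 if v > high else 0) for v in row] for row in rows]

# Hypothetical activations: three images, two features each.
features = [[0.0, 2.0], [1.0, 2.0], [2.0, 8.0]]
embedding = discretize(standardize_columns(features))
```

Reducing every feature to three values is one way to obtain the noise reduction and lower processing cost that the abstract mentions, at the price of discarding fine-grained magnitudes.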
-
Fluid Communities: A Competitive, Scalable and Diverse Community Detection Algorithm
Authors:
Ferran Parés,
Dario Garcia-Gasulla,
Armand Vilalta,
Jonatan Moreno,
Eduard Ayguadé,
Jesús Labarta,
Ulises Cortés,
Toyotaro Suzumura
Abstract:
We introduce a community detection algorithm (Fluid Communities) based on the idea of fluids interacting in an environment, expanding and contracting as a result of that interaction. Fluid Communities is based on the propagation methodology, which represents the state-of-the-art in terms of computational cost and scalability. While being highly efficient, Fluid Communities is able to find communities in synthetic graphs with an accuracy close to the current best alternatives. Additionally, Fluid Communities is the first propagation-based algorithm capable of identifying a variable number of communities in a network. To illustrate the relevance of the algorithm, we evaluate the diversity of the communities found by Fluid Communities, and find them to be significantly different from the ones found by alternative methods.
Submitted 9 October, 2017; v1 submitted 27 March, 2017;
originally announced March 2017.
-
On the Behavior of Convolutional Nets for Feature Extraction
Authors:
Dario Garcia-Gasulla,
Ferran Parés,
Armand Vilalta,
Jonatan Moreno,
Eduard Ayguadé,
Jesús Labarta,
Ulises Cortés,
Toyotaro Suzumura
Abstract:
Deep neural networks are representation learning techniques. During training, a deep net is capable of generating a descriptive language of unprecedented size and detail in machine learning. Extracting the descriptive language coded within a trained CNN model (in the case of image data), and reusing it for other purposes is a field of interest, as it provides access to the visual descriptors previously learnt by the CNN after processing millions of images, without requiring an expensive training phase. Contributions to this field (commonly known as feature representation transfer or transfer learning) have been purely empirical so far, extracting all CNN features from a single layer close to the output and testing their performance by feeding them to a classifier. This approach has provided consistent results, although its relevance is limited to classification tasks. In a completely different approach, in this paper we statistically measure the discriminative power of every single feature found within a deep CNN, when used for characterizing every class of 11 datasets. We seek to provide new insights into the behavior of CNN features, particularly the ones from convolutional layers, as this can be relevant for their application to knowledge representation and reasoning. Our results confirm that low and middle level features may behave differently to high level features, but only under certain conditions. We find that all CNN features can be used for knowledge representation purposes both by their presence or by their absence, doubling the information a single CNN feature may provide. We also study how much noise these features may include, and propose a thresholding approach to discard most of it. All these insights have a direct application to the generation of CNN embedding spaces.
Submitted 29 January, 2018; v1 submitted 3 March, 2017;
originally announced March 2017.
-
Hierarchical Hyperlink Prediction for the WWW
Authors:
Dario Garcia-Gasulla,
Eduard Ayguadé,
Jesús Labarta,
Ulises Cortés,
Toyotaro Suzumura
Abstract:
The hyperlink prediction task, that of proposing new links between webpages, can be used to improve search engines, expand the visibility of web pages, and increase the connectivity and navigability of the web. Hyperlink prediction is typically performed on webgraphs composed of thousands or millions of vertices, where on average each webpage contains fewer than fifty links. Algorithms processing graphs so large and sparse need to be both scalable and precise, a challenging combination. Similarity-based algorithms are among the most scalable solutions within the link prediction field, due to their parallel nature and computational simplicity. These algorithms independently explore the nearby topological features of every missing link from the graph in order to determine its likelihood. Unfortunately, the precision of similarity-based algorithms is limited, which has prevented their broad application so far. In this work we explore the performance of similarity-based algorithms for the particular problem of hyperlink prediction on large webgraphs, and propose a novel method which assumes the existence of hierarchical properties. We evaluate this new approach on several webgraphs and compare its performance with that of the current best similarity-based algorithms. Its remarkable performance leads us to argue for the applicability of the proposal, identifying several use cases of hyperlink prediction. We also describe the approach we took for the computation of large-scale graphs from the perspective of high-performance computing, providing details on the implementation and parallelization of code.
Submitted 28 November, 2016;
originally announced November 2016.
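To illustrate what the abstract above means by a similarity-based algorithm that scores each missing link from its nearby topology, here is a minimal sketch of the classic common-neighbours score. It is not the hierarchical method the paper proposes; the function name and the toy graph are hypothetical, and the graph is treated as undirected for simplicity, whereas webgraphs are directed.

```python
from itertools import combinations

def common_neighbour_scores(adjacency):
    """Score every unlinked pair of vertices by the number of neighbours they share."""
    scores = {}
    for u, v in combinations(sorted(adjacency), 2):
        if v in adjacency[u]:
            continue  # the link already exists, nothing to predict
        shared = adjacency[u] & adjacency[v]
        if shared:
            scores[(u, v)] = len(shared)
    return scores

# Hypothetical undirected graph stored as vertex -> set of neighbours.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}
# Candidate links ordered from most to least likely under this score.
ranked = sorted(common_neighbour_scores(graph).items(), key=lambda kv: -kv[1])
```

Each pair is scored independently of all others, which is the property that makes this family of algorithms straightforward to parallelize.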
-
Limitations and Alternatives for the Evaluation of Large-scale Link Prediction
Authors:
Dario Garcia-Gasulla,
Eduard Ayguadé,
Jesús Labarta,
Ulises Cortés
Abstract:
Link prediction, the problem of identifying missing links among a set of inter-related data entities, is a popular field of research due to its application to graph-like domains. Producing consistent evaluations of the performance of the many link prediction algorithms being proposed can be challenging due to variable graph properties, such as size and density. In this paper we first discuss traditional data mining solutions which are applicable to link prediction evaluation, arguing about their capacity for producing faithful and useful evaluations. We also introduce an innovative modification to a traditional evaluation methodology with the goal of adapting it to the problem of evaluating link prediction algorithms when applied to large graphs, by tackling the problem of class imbalance. We empirically evaluate the proposed methodology and, building on these findings, make a case for its importance on the evaluation of large-scale graph processing.
Submitted 25 November, 2016; v1 submitted 2 November, 2016;
originally announced November 2016.
-
A Visual Embedding for the Unsupervised Extraction of Abstract Semantics
Authors:
D. Garcia-Gasulla,
J. Béjar,
U. Cortés,
E. Ayguadé,
J. Labarta,
T. Suzumura,
R. Chen
Abstract:
Vector-space word representations obtained from neural network models have been shown to enable semantic operations based on vector arithmetic. In this paper, we explore the existence of similar information on vector representations of images. For that purpose we define a methodology to obtain large, sparse vector representations of image classes, and generate vectors through the state-of-the-art deep learning architecture GoogLeNet for 20K images obtained from ImageNet. We first evaluate the resultant vector-space semantics through its correlation with WordNet distances, and find vector distances to be strongly correlated with linguistic semantics. We then explore the location of images within the vector space, finding elements close in WordNet to be clustered together, regardless of significant visual variances (e.g. 118 dog types). More surprisingly, we find that the space separates complex classes (e.g. living things) in an unsupervised manner, without prior knowledge. Afterwards, we consider vector arithmetic. Although we are unable to obtain meaningful results in this regard, we discuss the various problems we encountered and how we plan to solve them. Finally, we discuss the impact of our research for cognitive systems, focusing on the role of the architecture being used.
Submitted 16 December, 2016; v1 submitted 31 July, 2015;
originally announced July 2015.