-
Movie Facts and Fibs (MF$^2$): A Benchmark for Long Movie Understanding
Authors:
Emmanouil Zaranis,
António Farinhas,
Saul Santos,
Beatriz Canaverde,
Miguel Moura Ramos,
Aditya K Surikuchi,
André Viveiros,
Baohao Liao,
Elena Bueno-Benito,
Nithin Sivakumaran,
Pavlo Vasylenko,
Shoubin Yu,
Sonal Sannigrahi,
Wafaa Mohammed,
Ben Peters,
Danae Sánchez Villegas,
Elias Stengel-Eskin,
Giuseppe Attanasio,
Jaehong Yoon,
Stella Frank,
Alessandro Suglia,
Chrysoula Zerva,
Desmond Elliott,
Mariella Dimiccoli,
Mohit Bansal, et al. (6 additional authors not shown)
Abstract:
Despite recent progress in vision-language models (VLMs), holistic understanding of long-form video content remains a significant challenge, partly due to limitations in current benchmarks. Many focus on peripheral, ``needle-in-a-haystack'' details, encouraging context-insensitive retrieval over deep comprehension. Others rely on large-scale, semi-automatically generated questions (often produced by language models themselves) that are easier for models to answer but fail to reflect genuine understanding. In this paper, we introduce MF$^2$, a new benchmark for evaluating whether models can comprehend, consolidate, and recall key narrative information from full-length movies (50-170 minutes long). MF$^2$ includes over 50 full-length, open-licensed movies, each paired with manually constructed sets of claim pairs -- one true (fact) and one plausible but false (fib), totalling over 850 pairs. These claims target core narrative elements such as character motivations and emotions, causal chains, and event order, and refer to memorable moments that humans can recall without rewatching the movie. Instead of multiple-choice formats, we adopt a binary claim evaluation protocol: for each pair, models must correctly identify both the true and false claims. This reduces biases like answer ordering and enables a more precise assessment of reasoning. Our experiments demonstrate that both open-weight and closed state-of-the-art models fall well short of human performance, underscoring the relative ease of the task for humans and their superior ability to retain and reason over critical narrative information -- an ability current VLMs lack.
Submitted 6 June, 2025;
originally announced June 2025.
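The binary claim-pair protocol described in the abstract can be sketched as follows. This is a minimal illustration, not the benchmark's released scoring code: the `ClaimPair` type and the function names `pairwise_accuracy` and `claim_accuracy` are hypothetical, and the only assumption taken from the abstract is that a pair counts as correct when the model accepts the fact and rejects the fib.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ClaimPair:
    """Model verdicts for one fact/fib pair (hypothetical container)."""
    fact_pred: bool  # verdict on the true claim; True means "judged true"
    fib_pred: bool   # verdict on the false claim; True means "judged true"


def pairwise_accuracy(pairs: Sequence[ClaimPair]) -> float:
    """Share of pairs where the fact is accepted AND the fib is rejected."""
    if not pairs:
        return 0.0
    correct = sum(1 for p in pairs if p.fact_pred and not p.fib_pred)
    return correct / len(pairs)


def claim_accuracy(pairs: Sequence[ClaimPair]) -> float:
    """Share of individual claims judged correctly, ignoring pairing."""
    if not pairs:
        return 0.0
    hits = sum(int(p.fact_pred) + int(not p.fib_pred) for p in pairs)
    return hits / (2 * len(pairs))


# Toy verdicts, invented for illustration only.
preds = [
    ClaimPair(fact_pred=True, fib_pred=False),   # both claims right
    ClaimPair(fact_pred=True, fib_pred=True),    # fib wrongly accepted
    ClaimPair(fact_pred=False, fib_pred=False),  # fact wrongly rejected
    ClaimPair(fact_pred=True, fib_pred=False),   # both claims right
]
always_true = [ClaimPair(fact_pred=True, fib_pred=True)] * 4

print(pairwise_accuracy(preds), claim_accuracy(preds))
print(pairwise_accuracy(always_true), claim_accuracy(always_true))
```

In this sketch, a model that answers "true" to every claim gets half of the individual claims right but scores zero on pairs, which is one way to see why the pairwise criterion is the stricter of the two.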
-
Different Speech Translation Models Encode and Translate Speaker Gender Differently
Authors:
Dennis Fucci,
Marco Gaido,
Matteo Negri,
Luisa Bentivogli,
Andre Martins,
Giuseppe Attanasio
Abstract:
Recent studies on interpreting the hidden states of speech models have shown their ability to capture speaker-specific features, including gender. Does this finding also hold for speech translation (ST) models? If so, what are the implications for the speaker's gender assignment in translation? We address these questions from an interpretability perspective, using probing methods to assess gender encoding across diverse ST models. Results on three language directions (English-French/Italian/Spanish) indicate that while traditional encoder-decoder models capture gender information, newer architectures -- integrating a speech encoder with a machine translation system via adapters -- do not. We also demonstrate that low gender encoding capabilities result in systems' tendency toward a masculine default, a translation bias that is more pronounced in newer architectures.
Submitted 2 June, 2025;
originally announced June 2025.
-
Broadband acousto-optic modulators on Silicon Nitride
Authors:
Scott E. Kenning,
Tzu-Han Chang,
Alaina G. Attanasio,
Warren Jin,
Avi Feshali,
Yu Tian,
Mario Paniccia,
Sunil A. Bhave
Abstract:
Stress-optic modulators are emerging as a necessary building block of photonic integrated circuits tasked with controlling and manipulating classical and quantum optical systems. While photonic platforms such as lithium niobate and silicon on insulator have well-developed modulator ecosystems, silicon nitride so far does not. As silicon nitride has favorable optical properties, such as ultra-low loss and a large optical transparency window, a rich ecosystem of potential photonic integrated circuits is therefore inhibited. Here we demonstrate a traveling-wave, optically broadband acousto-optic spiral modulator architecture at a wavelength of 1550 nm using 90 nm thick silicon nitride waveguides and demonstrate their use in an optomechanical sensing system. The spiral weaves the light repeatedly through the acoustic field up to 38 times, factoring in the time evolution of the acoustic field during the light's transit through spirals up to 26 cm in length. These modulators avoid heterogeneous integration, release processes, complicated fabrication procedures, and modifications of the commercial foundry fabricated photonic layer stack by exploiting ultra-low-loss waveguides to enable long phonon-photon interaction lengths required for efficient modulation. The design allows for thick top oxide cladding of 4 $\mu$m such that the low-loss optical properties of thin silicon nitride can be preserved, ultimately achieving a $V_\pi$ of 8.98 V at 704 MHz with 1.13 dB of insertion loss. Our modulators are the first optically broadband high-frequency acousto-optic modulators on thin silicon nitride, and the novel architecture is accessible to any low-loss photonic platform. We demonstrate an immediate use case for these devices in a high-Q optomechanical sensing system.
Submitted 6 May, 2025;
originally announced May 2025.
-
MSTS: A Multimodal Safety Test Suite for Vision-Language Models
Authors:
Paul Röttger,
Giuseppe Attanasio,
Felix Friedrich,
Janis Goldzycher,
Alicia Parrish,
Rishabh Bhardwaj,
Chiara Di Bonaventura,
Roman Eng,
Gaia El Khoury Geagea,
Sujata Goswami,
Jieun Han,
Dirk Hovy,
Seogyeong Jeong,
Paloma Jeretič,
Flor Miriam Plaza-del-Arco,
Donya Rooein,
Patrick Schramowski,
Anastassia Shaitarova,
Xudong Shen,
Richard Willats,
Andrea Zugarini,
Bertie Vidgen
Abstract:
Vision-language models (VLMs), which process image and text inputs, are increasingly integrated into chat assistants and other consumer AI applications. Without proper safeguards, however, VLMs may give harmful advice (e.g. how to self-harm) or encourage unsafe behaviours (e.g. to consume drugs). Despite these clear hazards, little work so far has evaluated VLM safety and the novel risks created by multimodal inputs. To address this gap, we introduce MSTS, a Multimodal Safety Test Suite for VLMs. MSTS comprises 400 test prompts across 40 fine-grained hazard categories. Each test prompt consists of a text and an image that only in combination reveal their full unsafe meaning. With MSTS, we find clear safety issues in several open VLMs. We also find some VLMs to be safe by accident, meaning that they are safe because they fail to understand even simple test prompts. We translate MSTS into ten languages, showing non-English prompts to increase the rate of unsafe model responses. We also show models to be safer when tested with text only rather than multimodal prompts. Finally, we explore the automation of VLM safety assessments, finding even the best safety classifiers to be lacking.
Submitted 17 January, 2025;
originally announced January 2025.
-
Watching the Watchers: Exposing Gender Disparities in Machine Translation Quality Estimation
Authors:
Emmanouil Zaranis,
Giuseppe Attanasio,
Sweta Agrawal,
André F. T. Martins
Abstract:
Quality estimation (QE) -- the automatic assessment of translation quality -- has recently become crucial across several stages of the translation pipeline, from data curation to training and decoding. While QE metrics have been optimized to align with human judgments, whether they encode social biases has been largely overlooked. Biased QE risks favoring certain demographic groups over others, e.g., by exacerbating gaps in visibility and usability. This paper defines and investigates gender bias of QE metrics and discusses its downstream implications for machine translation (MT). Experiments with state-of-the-art QE metrics across multiple domains, datasets, and languages reveal significant bias. When a human entity's gender in the source is undisclosed, masculine-inflected translations score higher than feminine-inflected ones, and gender-neutral translations are penalized. Even when contextual cues disambiguate gender, using context-aware QE metrics leads to more errors in selecting the correct translation inflection for feminine referents than for masculine ones. Moreover, a biased QE metric affects data filtering and quality-aware decoding. Our findings underscore the need for a renewed focus on developing and evaluating QE metrics centered on gender.
Submitted 2 June, 2025; v1 submitted 14 October, 2024;
originally announced October 2024.
-
Building Bridges: A Dataset for Evaluating Gender-Fair Machine Translation into German
Authors:
Manuel Lardelli,
Giuseppe Attanasio,
Anne Lauscher
Abstract:
The translation of gender-neutral person-referring terms (e.g., the students) is often non-trivial. Translating from English into German poses an interesting case -- in German, person-referring nouns are usually gender-specific, and if the gender of the referent(s) is unknown or diverse, the generic masculine (die Studenten (m.)) is commonly used. This solution, however, reduces the visibility of other genders, such as women and non-binary people. To counteract gender discrimination, a societal movement towards using gender-fair language exists (e.g., by adopting neosystems). However, gender-fair German is currently barely supported in machine translation (MT), requiring post-editing or manual translations. We address this research gap by studying gender-fair language in English-to-German MT. Concretely, we enrich a community-created gender-fair language dictionary and sample multi-sentence test instances from encyclopedic text and parliamentary speeches. Using these novel resources, we conduct the first benchmark study involving two commercial systems and six neural MT models for translating words in isolation and natural contexts across two domains. Our findings show that most systems produce mainly masculine forms and rarely gender-neutral variants, highlighting the need for future research. We release code and data at https://github.com/g8a9/building-bridges-gender-fair-german-mt.
Submitted 10 June, 2024;
originally announced June 2024.
-
Classist Tools: Social Class Correlates with Performance in NLP
Authors:
Amanda Cercas Curry,
Giuseppe Attanasio,
Zeerak Talat,
Dirk Hovy
Abstract:
Since the foundational work of William Labov on the social stratification of language (Labov, 1964), linguistics has made concentrated efforts to explore the links between sociodemographic characteristics and language production and perception. But while there is strong evidence for socio-demographic characteristics in language, they are infrequently used in Natural Language Processing (NLP). Age and gender are somewhat well represented, but Labov's original target, socioeconomic status, is noticeably absent. And yet it matters. We show empirically that NLP disadvantages less-privileged socioeconomic groups. We annotate a corpus of 95K utterances from movies with social class, ethnicity and geographical language variety and measure the performance of NLP systems on three tasks: language modelling, automatic speech recognition, and grammar error correction. We find significant performance disparities that can be attributed to socioeconomic status as well as ethnicity and geographical differences. With NLP technologies becoming ever more ubiquitous and quotidian, they must accommodate all language varieties to avoid disadvantaging already marginalised groups. We argue for the inclusion of socioeconomic class in future language technologies.
Submitted 7 March, 2024;
originally announced March 2024.
-
Twists, Humps, and Pebbles: Multilingual Speech Recognition Models Exhibit Gender Performance Gaps
Authors:
Giuseppe Attanasio,
Beatrice Savoldi,
Dennis Fucci,
Dirk Hovy
Abstract:
Current automatic speech recognition (ASR) models are designed to be used across many languages and tasks without substantial changes. However, this broad language coverage hides performance gaps within languages, for example, across genders. Our study systematically evaluates the performance of two widely used multilingual ASR models on three datasets, encompassing 19 languages from eight language families and two speaking conditions. Our findings reveal clear gender disparities, with the advantaged group varying across languages and models. Surprisingly, those gaps are not explained by acoustic or lexical properties. However, probing internal model states reveals a correlation with the gendered performance gap. That is, the easier it is to distinguish speaker gender in a language using probes, the more the gap reduces, favoring female speakers. Our results show that gender disparities persist even in state-of-the-art models. Our findings have implications for the improvement of multilingual ASR systems, underscoring the importance of accessibility to training data and nuanced evaluation to predict and mitigate gender gaps. We release all code and artifacts at https://github.com/g8a9/multilingual-asr-gender-gap.
Submitted 3 October, 2024; v1 submitted 27 February, 2024;
originally announced February 2024.
-
A Tale of Pronouns: Interpretability Informs Gender Bias Mitigation for Fairer Instruction-Tuned Machine Translation
Authors:
Giuseppe Attanasio,
Flor Miriam Plaza-del-Arco,
Debora Nozza,
Anne Lauscher
Abstract:
Recent instruction fine-tuned models can solve multiple NLP tasks when prompted to do so, with machine translation (MT) being a prominent use case. However, current research often focuses on standard performance benchmarks, leaving compelling fairness and ethical considerations behind. In MT, this might lead to misgendered translations, resulting, among other harms, in the perpetuation of stereotypes and prejudices. In this work, we address this gap by investigating whether and to what extent such models exhibit gender bias in machine translation and how we can mitigate it. Concretely, we compute established gender bias metrics on the WinoMT corpus from English to German and Spanish. We discover that IFT models default to male-inflected translations, even disregarding female occupational stereotypes. Next, using interpretability methods, we unveil that models systematically overlook the pronoun indicating the gender of a target occupation in misgendered translations. Finally, based on this finding, we propose an easy-to-implement and effective bias mitigation solution based on few-shot learning that leads to significantly fairer translations.
Submitted 25 October, 2023; v1 submitted 18 October, 2023;
originally announced October 2023.
-
Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions
Authors:
Federico Bianchi,
Mirac Suzgun,
Giuseppe Attanasio,
Paul Röttger,
Dan Jurafsky,
Tatsunori Hashimoto,
James Zou
Abstract:
Training large language models to follow instructions makes them perform better on a wide range of tasks and generally become more helpful. However, a perfectly helpful model will follow even the most malicious instructions and readily generate harmful content. In this paper, we raise concerns over the safety of models that only emphasize helpfulness, not harmlessness, in their instruction-tuning. We show that several popular instruction-tuned models are highly unsafe. Moreover, we show that adding just 3% safety examples (a few hundred demonstrations) when fine-tuning a model like LLaMA can substantially improve its safety. Our safety-tuning does not make models significantly less capable or helpful as measured by standard benchmarks. However, we do find exaggerated safety behaviours, where too much safety-tuning makes models refuse perfectly safe prompts if they superficially resemble unsafe ones. As a whole, our results illustrate trade-offs in training LLMs to be helpful and training them to be safe.
Submitted 19 March, 2024; v1 submitted 14 September, 2023;
originally announced September 2023.
-
Explaining Speech Classification Models via Word-Level Audio Segments and Paralinguistic Features
Authors:
Eliana Pastor,
Alkis Koudounas,
Giuseppe Attanasio,
Dirk Hovy,
Elena Baralis
Abstract:
Recent advances in eXplainable AI (XAI) have provided new insights into how models for vision, language, and tabular data operate. However, few approaches exist for understanding speech models. Existing work focuses on a few spoken language understanding (SLU) tasks, and explanations are difficult to interpret for most users. We introduce a new approach to explain speech classification models. We generate easy-to-interpret explanations via input perturbation on two information levels. 1) Word-level explanations reveal how each word-related audio segment impacts the outcome. 2) Paralinguistic features (e.g., prosody and background noise) answer the counterfactual: ``What would the model prediction be if we edited the audio signal in this way?'' We validate our approach by explaining two state-of-the-art SLU models on two speech classification tasks in English and Italian. Our findings demonstrate that the explanations are faithful to the model's inner workings and plausible to humans. Our method and findings pave the way for future research on interpreting speech models.
Submitted 14 September, 2023;
originally announced September 2023.
-
Weigh Your Own Words: Improving Hate Speech Counter Narrative Generation via Attention Regularization
Authors:
Helena Bonaldi,
Giuseppe Attanasio,
Debora Nozza,
Marco Guerini
Abstract:
Recent computational approaches for combating online hate speech involve the automatic generation of counter narratives by adapting Pretrained Transformer-based Language Models (PLMs) with human-curated data. This process, however, can produce in-domain overfitting, resulting in models generating acceptable narratives only for hatred similar to training data, with little portability to other targets or to real-world toxic language. This paper introduces novel attention regularization methodologies to improve the generalization capabilities of PLMs for counter narratives generation. Overfitting to training-specific terms is then discouraged, resulting in more diverse and richer narratives. We experiment with two attention-based regularization techniques on a benchmark English dataset. Regularized models produce better counter narratives than state-of-the-art approaches in most cases, both in terms of automatic metrics and human evaluation, especially when hateful targets are not present in the training data. This work paves the way for better and more flexible counter-speech generation models, a task for which datasets are highly challenging to produce.
Submitted 5 September, 2023;
originally announced September 2023.
-
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
Authors:
Paul Röttger,
Hannah Rose Kirk,
Bertie Vidgen,
Giuseppe Attanasio,
Federico Bianchi,
Dirk Hovy
Abstract:
Without proper safeguards, large language models will readily follow malicious instructions and generate toxic content. This risk motivates safety efforts such as red-teaming and large-scale feedback learning, which aim to make models both helpful and harmless. However, there is a tension between these two objectives, since harmlessness requires models to refuse to comply with unsafe prompts, and thus not be helpful. Recent anecdotal evidence suggests that some models may have struck a poor balance, so that even clearly safe prompts are refused if they use similar language to unsafe prompts or mention sensitive topics. In this paper, we introduce a new test suite called XSTest to identify such eXaggerated Safety behaviours in a systematic way. XSTest comprises 250 safe prompts across ten prompt types that well-calibrated models should not refuse to comply with, and 200 unsafe prompts as contrasts that models, for most applications, should refuse. We describe XSTest's creation and composition, and then use the test suite to highlight systematic failure modes in state-of-the-art language models as well as more general challenges in building safer language models.
Submitted 1 April, 2024; v1 submitted 2 August, 2023;
originally announced August 2023.
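One way to read the two prompt sets in the abstract is as a pair of refusal rates: how often a model refuses safe prompts (exaggerated safety) and how often it refuses the unsafe contrasts. The sketch below assumes each response has already been labelled as a refusal or not; the function name `refusal_rates`, the record layout, and the toy data are illustrative assumptions, not the test suite's actual tooling or annotation scheme.

```python
from typing import Dict, Iterable, List, Tuple


def refusal_rates(records: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Compute refusal rates per prompt set.

    Each record is (is_safe_prompt, model_refused). How a response gets
    labelled as a refusal is outside this sketch.
    """
    safe: List[bool] = []
    unsafe: List[bool] = []
    for is_safe_prompt, model_refused in records:
        (safe if is_safe_prompt else unsafe).append(model_refused)

    def rate(flags: List[bool]) -> float:
        # Fraction of prompts in the set that were refused.
        return sum(flags) / len(flags) if flags else 0.0

    return {
        "safe_refusal_rate": rate(safe),      # lower is better
        "unsafe_refusal_rate": rate(unsafe),  # higher is better for most uses
    }


# Toy labels, invented for illustration only.
toy_records = [
    (True, False), (True, True), (True, False), (True, False),
    (False, True), (False, False),
]
rates = refusal_rates(toy_records)
print(rates)
```

Reporting the two rates separately keeps the trade-off visible: a model can push the unsafe refusal rate up simply by refusing more often, which would also raise the safe refusal rate.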
-
ITALIC: An Italian Intent Classification Dataset
Authors:
Alkis Koudounas,
Moreno La Quatra,
Lorenzo Vaiani,
Luca Colomba,
Giuseppe Attanasio,
Eliana Pastor,
Luca Cagliero,
Elena Baralis
Abstract:
Recent large-scale Spoken Language Understanding datasets focus predominantly on English and do not account for language-specific phenomena such as particular phonemes or words in different lects. We introduce ITALIC, the first large-scale speech dataset designed for intent classification in Italian. The dataset comprises 16,521 crowdsourced audio samples recorded by 70 speakers from various Italian regions and annotated with intent labels and additional metadata. We explore the versatility of ITALIC by evaluating current state-of-the-art speech and text models. Results on intent classification suggest that increasing scale and running language adaptation yield better speech models, monolingual text models outscore multilingual ones, and that speech recognition on ITALIC is more challenging than on existing Italian benchmarks. We release both the dataset and the annotation scheme to streamline the development of new Italian SLU models and language-specific datasets.
Submitted 14 June, 2023;
originally announced June 2023.
-
E Pluribus Unum: Guidelines on Multi-Objective Evaluation of Recommender Systems
Authors:
Patrick John Chia,
Giuseppe Attanasio,
Jacopo Tagliabue,
Federico Bianchi,
Ciro Greco,
Gabriel de Souza P. Moreira,
Davide Eynard,
Fahd Husain
Abstract:
Recommender Systems today are still mostly evaluated in terms of accuracy, with other aspects beyond the immediate relevance of recommendations, such as diversity, long-term user retention and fairness, often taking a back seat. Moreover, reconciling multiple performance perspectives is by definition indeterminate, presenting a stumbling block to those in the pursuit of rounded evaluation of Recommender Systems. EvalRS 2022 -- a data challenge designed around Multi-Objective Evaluation -- was a first practical endeavour, providing many insights into the requirements and challenges of balancing multiple objectives in evaluation. In this work, we reflect on EvalRS 2022 and expound upon crucial learnings to formulate a first-principles approach toward Multi-Objective model selection, and outline a set of guidelines for carrying out a Multi-Objective Evaluation challenge, with potential applicability to the problem of rounded evaluation of competing models in real-world deployments.
Submitted 20 April, 2023;
originally announced April 2023.
-
Is It Worth the (Environmental) Cost? Limited Evidence for Temporal Adaptation via Continuous Training
Authors:
Giuseppe Attanasio,
Debora Nozza,
Federico Bianchi,
Dirk Hovy
Abstract:
Language is constantly changing and evolving, leaving language models to become quickly outdated. Consequently, we should continuously update our models with new data to expose them to new events and facts. However, that requires additional computing, which means new carbon emissions. Do any measurable benefits justify this cost? This paper looks for empirical evidence to support continuous training. We reproduce existing benchmarks and extend them to include additional time periods, models, and tasks. Our results show that the downstream task performance of temporally adapted English models for social media data does not improve over time. Pretrained models without temporal adaptation are actually significantly more effective and efficient. However, we also note a lack of suitable temporal benchmarks. Our findings invite a critical reflection on when and how to temporally adapt language models, accounting for sustainability.
Submitted 4 May, 2023; v1 submitted 13 October, 2022;
originally announced October 2022.
-
ferret: a Framework for Benchmarking Explainers on Transformers
Authors:
Giuseppe Attanasio,
Eliana Pastor,
Chiara Di Bonaventura,
Debora Nozza
Abstract:
As Transformers are increasingly relied upon to solve complex NLP problems, there is an increased need for their decisions to be humanly interpretable. While several explainable AI (XAI) techniques for interpreting the outputs of transformer-based models have been proposed, there is still a lack of easy access to using and comparing them. We introduce ferret, a Python library to simplify the use and comparisons of XAI methods on transformer-based classifiers. With ferret, users can visualize and compare transformers-based models output explanations using state-of-the-art XAI methods on any free-text or existing XAI corpora. Moreover, users can also evaluate ad-hoc XAI metrics to select the most faithful and plausible explanations. To align with the recently consolidated process of sharing and using transformers-based models from Hugging Face, ferret interfaces directly with its Python library. In this paper, we showcase ferret to benchmark XAI methods used on transformers for sentiment analysis and hate speech detection. We show how specific methods provide consistently better explanations and are preferable in the context of transformer models.
Submitted 2 March, 2023; v1 submitted 2 August, 2022;
originally announced August 2022.
-
EvalRS: a Rounded Evaluation of Recommender Systems
Authors:
Jacopo Tagliabue,
Federico Bianchi,
Tobias Schnabel,
Giuseppe Attanasio,
Ciro Greco,
Gabriel de Souza P. Moreira,
Patrick John Chia
Abstract:
Much of the complexity of Recommender Systems (RSs) comes from the fact that they are used as part of more complex applications and affect user experience through a varied range of user interfaces. However, research has focused almost exclusively on the ability of RSs to produce accurate item rankings while giving little attention to the evaluation of RS behavior in real-world scenarios. Such a narrow focus has limited the capacity of RSs to have a lasting impact in the real world and makes them vulnerable to undesired behavior, such as reinforcing data biases. We propose EvalRS as a new type of challenge, in order to foster this discussion among practitioners and build new methodologies in the open for testing RSs "in the wild".
Submitted 12 August, 2022; v1 submitted 12 July, 2022;
originally announced July 2022.
-
Contrastive language and vision learning of general fashion concepts
Authors:
Patrick John Chia,
Giuseppe Attanasio,
Federico Bianchi,
Silvia Terragni,
Ana Rita Magalhães,
Diogo Goncalves,
Ciro Greco,
Jacopo Tagliabue
Abstract:
The steady rise of online shopping goes hand in hand with the development of increasingly complex ML and NLP models. While most use cases are cast as specialized supervised learning problems, we argue that practitioners would greatly benefit from more transferable representations of products. In this work, we build on recent developments in contrastive learning to train FashionCLIP, a CLIP-like model for the fashion industry. We showcase its capabilities for retrieval, classification and grounding, and release our model and code to the community.
Submitted 18 April, 2023; v1 submitted 8 April, 2022;
originally announced April 2022.
-
Entropy-based Attention Regularization Frees Unintended Bias Mitigation from Lists
Authors:
Giuseppe Attanasio,
Debora Nozza,
Dirk Hovy,
Elena Baralis
Abstract:
Natural Language Processing (NLP) models risk overfitting to specific terms in the training data, thereby reducing their performance, fairness, and generalizability. For example, neural hate speech detection models are strongly influenced by identity terms like gay or women, resulting in false positives, severe unintended bias, and lower performance. Most mitigation techniques use lists of identity terms or samples from the target domain during training. However, this approach requires a-priori knowledge and introduces further bias if important terms are neglected. Instead, we propose a knowledge-free Entropy-based Attention Regularization (EAR) to discourage overfitting to training-specific terms. An additional objective function penalizes tokens with low self-attention entropy. We fine-tune BERT via EAR: the resulting model matches or exceeds state-of-the-art performance for hate speech classification and bias metrics on three benchmark corpora in English and Italian. EAR also reveals overfitting terms, i.e., terms most likely to induce bias, to help identify their effect on the model, task, and predictions.
Submitted 17 March, 2022;
originally announced March 2022.
-
Contrastive Language-Image Pre-training for the Italian Language
Authors:
Federico Bianchi,
Giuseppe Attanasio,
Raphael Pisoni,
Silvia Terragni,
Gabriele Sarti,
Sri Lakshmi
Abstract:
CLIP (Contrastive Language-Image Pre-training) is a very recent multi-modal model that jointly learns representations of images and texts. The model is trained on a massive amount of English data and shows impressive performance on zero-shot classification tasks. Training the same model on a different language is not trivial, since data in other languages might not be sufficient and the model needs high-quality translations of the texts to guarantee good performance. In this paper, we present the first CLIP model for the Italian Language (CLIP-Italian), trained on more than 1.4 million image-text pairs. Results show that CLIP-Italian outperforms the multilingual CLIP model on the tasks of image retrieval and zero-shot classification.
Submitted 19 August, 2021;
originally announced August 2021.