-
NeurIPS 2025 E2LM Competition : Early Training Evaluation of Language Models
Authors:
Mouadh Yagoubi,
Yasser Dahou,
Billel Mokeddem,
Younes Belkada,
Phuc H. Le-Khac,
Basma El Amel Boussaha,
Reda Alami,
Jingwei Zuo,
Damiano Marsili,
Mugariya Farooq,
Mounia Lalmas,
Georgia Gkioxari,
Patrick Gallinari,
Philip Torr,
Hakim Hacid
Abstract:
Existing benchmarks have proven effective for assessing the performance of fully trained large language models. However, we find striking differences in the early training stages of small models, where benchmarks often fail to provide meaningful or discriminative signals. To explore how these differences arise, this competition tackles the challenge of designing scientific knowledge evaluation tas…
▽ More
Existing benchmarks have proven effective for assessing the performance of fully trained large language models. However, we find striking differences in the early training stages of small models, where benchmarks often fail to provide meaningful or discriminative signals. To explore how these differences arise, this competition tackles the challenge of designing scientific knowledge evaluation tasks specifically tailored for measuring early training progress of language models. Participants are invited to develop novel evaluation methodologies or adapt existing benchmarks to better capture performance differences among language models. To support this effort, we provide three pre-trained small models (0.5B, 1B, and 3B parameters), along with intermediate checkpoints sampled during training up to 200B tokens. All experiments and development work can be run on widely available free cloud-based GPU platforms, making participation accessible to researchers with limited computational resources. Submissions will be evaluated based on three criteria: the quality of the performance signal they produce, the consistency of model rankings at 1 trillion tokens of training, and their relevance to the scientific knowledge domain. By promoting the design of tailored evaluation strategies for early training, this competition aims to attract a broad range of participants from various disciplines, including those who may not be machine learning experts or have access to dedicated GPU resources. Ultimately, this initiative seeks to make foundational LLM research more systematic and benchmark-informed from the earliest phases of model development.
△ Less
Submitted 9 June, 2025;
originally announced June 2025.
-
Text2Tracks: Prompt-based Music Recommendation via Generative Retrieval
Authors:
Enrico Palumbo,
Gustavo Penha,
Andreas Damianou,
José Luis Redondo García,
Timothy Christopher Heath,
Alice Wang,
Hugues Bouchard,
Mounia Lalmas
Abstract:
In recent years, Large Language Models (LLMs) have enabled users to provide highly specific music recommendation requests using natural language prompts (e.g. "Can you recommend some old classics for slow dancing?"). In this setup, the recommended tracks are predicted by the LLM in an autoregressive way, i.e. the LLM generates the track titles one token at a time. While intuitive, this approach ha…
▽ More
In recent years, Large Language Models (LLMs) have enabled users to provide highly specific music recommendation requests using natural language prompts (e.g. "Can you recommend some old classics for slow dancing?"). In this setup, the recommended tracks are predicted by the LLM in an autoregressive way, i.e. the LLM generates the track titles one token at a time. While intuitive, this approach has several limitation. First, it is based on a general purpose tokenization that is optimized for words rather than for track titles. Second, it necessitates an additional entity resolution layer that matches the track title to the actual track identifier. Third, the number of decoding steps scales linearly with the length of the track title, slowing down inference. In this paper, we propose to address the task of prompt-based music recommendation as a generative retrieval task. Within this setting, we introduce novel, effective, and efficient representations of track identifiers that significantly outperform commonly used strategies. We introduce Text2Tracks, a generative retrieval model that learns a mapping from a user's music recommendation prompt to the relevant track IDs directly. Through an offline evaluation on a dataset of playlists with language inputs, we find that (1) the strategy to create IDs for music tracks is the most important factor for the effectiveness of Text2Tracks and semantic IDs significantly outperform commonly used strategies that rely on song titles as identifiers (2) provided with the right choice of track identifiers, Text2Tracks outperforms sparse and dense retrieval solutions trained to retrieve tracks from language prompts.
△ Less
Submitted 2 April, 2025; v1 submitted 31 March, 2025;
originally announced March 2025.
-
Policy-as-Prompt: Rethinking Content Moderation in the Age of Large Language Models
Authors:
Konstantina Palla,
José Luis Redondo García,
Claudia Hauff,
Francesco Fabbri,
Henrik Lindström,
Daniel R. Taber,
Andreas Damianou,
Mounia Lalmas
Abstract:
Content moderation plays a critical role in shaping safe and inclusive online environments, balancing platform standards, user expectations, and regulatory frameworks. Traditionally, this process involves operationalising policies into guidelines, which are then used by downstream human moderators for enforcement, or to further annotate datasets for training machine learning moderation models. How…
▽ More
Content moderation plays a critical role in shaping safe and inclusive online environments, balancing platform standards, user expectations, and regulatory frameworks. Traditionally, this process involves operationalising policies into guidelines, which are then used by downstream human moderators for enforcement, or to further annotate datasets for training machine learning moderation models. However, recent advancements in large language models (LLMs) are transforming this landscape. These models can now interpret policies directly as textual inputs, eliminating the need for extensive data curation. This approach offers unprecedented flexibility, as moderation can be dynamically adjusted through natural language interactions. This paradigm shift raises important questions about how policies are operationalised and the implications for content moderation practices. In this paper, we formalise the emerging policy-as-prompt framework and identify five key challenges across four domains: Technical Implementation (1. translating policy to prompts, 2. sensitivity to prompt structure and formatting), Sociotechnical (3. the risk of technological determinism in policy formation), Organisational (4. evolving roles between policy and machine learning teams), and Governance (5. model governance and accountability). Through analysing these challenges across technical, sociotechnical, organisational, and governance dimensions, we discuss potential mitigation approaches. This research provides actionable insights for practitioners and lays the groundwork for future exploration of scalable and adaptive content moderation systems in digital ecosystems.
△ Less
Submitted 25 February, 2025;
originally announced February 2025.
-
PODTILE: Facilitating Podcast Episode Browsing with Auto-generated Chapters
Authors:
Azin Ghazimatin,
Ekaterina Garmash,
Gustavo Penha,
Kristen Sheets,
Martin Achenbach,
Oguz Semerci,
Remi Galvez,
Marcus Tannenberg,
Sahitya Mantravadi,
Divya Narayanan,
Ofeliya Kalaydzhyan,
Douglas Cole,
Ben Carterette,
Ann Clifton,
Paul N. Bennett,
Claudia Hauff,
Mounia Lalmas
Abstract:
Listeners of long-form talk-audio content, such as podcast episodes, often find it challenging to understand the overall structure and locate relevant sections. A practical solution is to divide episodes into chapters--semantically coherent segments labeled with titles and timestamps. Since most episodes on our platform at Spotify currently lack creator-provided chapters, automating the creation o…
▽ More
Listeners of long-form talk-audio content, such as podcast episodes, often find it challenging to understand the overall structure and locate relevant sections. A practical solution is to divide episodes into chapters--semantically coherent segments labeled with titles and timestamps. Since most episodes on our platform at Spotify currently lack creator-provided chapters, automating the creation of chapters is essential. Scaling the chapterization of podcast episodes presents unique challenges. First, episodes tend to be less structured than written texts, featuring spontaneous discussions with nuanced transitions. Second, the transcripts are usually lengthy, averaging about 16,000 tokens, which necessitates efficient processing that can preserve context. To address these challenges, we introduce PODTILE, a fine-tuned encoder-decoder transformer to segment conversational data. The model simultaneously generates chapter transitions and titles for the input transcript. To preserve context, each input text is augmented with global context, including the episode's title, description, and previous chapter titles. In our intrinsic evaluation, PODTILE achieved an 11% improvement in ROUGE score over the strongest baseline. Additionally, we provide insights into the practical benefits of auto-generated chapters for listeners navigating episode content. Our findings indicate that auto-generated chapters serve as a useful tool for engaging with less popular podcasts. Finally, we present empirical evidence that using chapter titles can enhance effectiveness of sparse retrieval in search tasks.
△ Less
Submitted 21 October, 2024;
originally announced October 2024.
-
Diffusion Model for Slate Recommendation
Authors:
Federico Tomasi,
Francesco Fabbri,
Mounia Lalmas,
Zhenwen Dai
Abstract:
Slate recommendation is a technique commonly used on streaming platforms and e-commerce sites to present multiple items together. A significant challenge with slate recommendation is managing the complex combinatorial choice space. Traditional methods often simplify this problem by assuming users engage with only one item at a time. However, this simplification does not reflect the reality, as use…
▽ More
Slate recommendation is a technique commonly used on streaming platforms and e-commerce sites to present multiple items together. A significant challenge with slate recommendation is managing the complex combinatorial choice space. Traditional methods often simplify this problem by assuming users engage with only one item at a time. However, this simplification does not reflect the reality, as users often interact with multiple items simultaneously. In this paper, we address the general slate recommendation problem, which accounts for simultaneous engagement with multiple items. We propose a generative approach using Diffusion Models, leveraging their ability to learn structures in high-dimensional data. Our model generates high-quality slates that maximize user satisfaction by overcoming the challenges of the combinatorial choice space. Furthermore, our approach enhances the diversity of recommendations. Extensive offline evaluations on applications such as music playlist generation and e-commerce bundle recommendations show that our model outperforms state-of-the-art baselines in both relevance and diversity.
△ Less
Submitted 13 August, 2024;
originally announced August 2024.
-
Long-term Off-Policy Evaluation and Learning
Authors:
Yuta Saito,
Himan Abdollahpouri,
Jesse Anderton,
Ben Carterette,
Mounia Lalmas
Abstract:
Short- and long-term outcomes of an algorithm often differ, with damaging downstream effects. A known example is a click-bait algorithm, which may increase short-term clicks but damage long-term user engagement. A possible solution to estimate the long-term outcome is to run an online experiment or A/B test for the potential algorithms, but it takes months or even longer to observe the long-term o…
▽ More
Short- and long-term outcomes of an algorithm often differ, with damaging downstream effects. A known example is a click-bait algorithm, which may increase short-term clicks but damage long-term user engagement. A possible solution to estimate the long-term outcome is to run an online experiment or A/B test for the potential algorithms, but it takes months or even longer to observe the long-term outcomes of interest, making the algorithm selection process unacceptably slow. This work thus studies the problem of feasibly yet accurately estimating the long-term outcome of an algorithm using only historical and short-term experiment data. Existing approaches to this problem either need a restrictive assumption about the short-term outcomes called surrogacy or cannot effectively use short-term outcomes, which is inefficient. Therefore, we propose a new framework called Long-term Off-Policy Evaluation (LOPE), which is based on reward function decomposition. LOPE works under a more relaxed assumption than surrogacy and effectively leverages short-term rewards to substantially reduce the variance. Synthetic experiments show that LOPE outperforms existing approaches particularly when surrogacy is severely violated and the long-term reward is noisy. In addition, real-world experiments on large-scale A/B test data collected on a music streaming platform show that LOPE can estimate the long-term outcome of actual algorithms more accurately than existing feasible methods.
△ Less
Submitted 24 April, 2024;
originally announced April 2024.
-
Towards Graph Foundation Models for Personalization
Authors:
Andreas Damianou,
Francesco Fabbri,
Paul Gigioli,
Marco De Nadai,
Alice Wang,
Enrico Palumbo,
Mounia Lalmas
Abstract:
In the realm of personalization, integrating diverse information sources such as consumption signals and content-based representations is becoming increasingly critical to build state-of-the-art solutions. In this regard, two of the biggest trends in research around this subject are Graph Neural Networks (GNNs) and Foundation Models (FMs). While GNNs emerged as a popular solution in industry for p…
▽ More
In the realm of personalization, integrating diverse information sources such as consumption signals and content-based representations is becoming increasingly critical to build state-of-the-art solutions. In this regard, two of the biggest trends in research around this subject are Graph Neural Networks (GNNs) and Foundation Models (FMs). While GNNs emerged as a popular solution in industry for powering personalization at scale, FMs have only recently caught attention for their promising performance in personalization tasks like ranking and retrieval. In this paper, we present a graph-based foundation modeling approach tailored to personalization. Central to this approach is a Heterogeneous GNN (HGNN) designed to capture multi-hop content and consumption relationships across a range of recommendable item types. To ensure the generality required from a Foundation Model, we employ a Large Language Model (LLM) text-based featurization of nodes that accommodates all item types, and construct the graph using co-interaction signals, which inherently transcend content specificity. To facilitate practical generalization, we further couple the HGNN with an adaptation mechanism based on a two-tower (2T) architecture, which also operates agnostically to content type. This multi-stage approach ensures high scalability; while the HGNN produces general purpose embeddings, the 2T component models in a continuous space the sheer size of user-item interaction data. Our comprehensive approach has been rigorously tested and proven effective in delivering recommendations across a diverse array of products within a real-world, industrial audio streaming platform.
△ Less
Submitted 12 March, 2024;
originally announced March 2024.
-
Generalized User Representations for Transfer Learning
Authors:
Ghazal Fazelnia,
Sanket Gupta,
Claire Keum,
Mark Koh,
Ian Anderson,
Mounia Lalmas
Abstract:
We present a novel framework for user representation in large-scale recommender systems, aiming at effectively representing diverse user taste in a generalized manner. Our approach employs a two-stage methodology combining representation learning and transfer learning. The representation learning model uses an autoencoder that compresses various user features into a representation space. In the se…
▽ More
We present a novel framework for user representation in large-scale recommender systems, aiming at effectively representing diverse user taste in a generalized manner. Our approach employs a two-stage methodology combining representation learning and transfer learning. The representation learning model uses an autoencoder that compresses various user features into a representation space. In the second stage, downstream task-specific models leverage user representations via transfer learning instead of curating user features individually. We further augment this methodology on the representation's input features to increase flexibility and enable reaction to user events, including new user experiences, in Near-Real Time. Additionally, we propose a novel solution to manage deployment of this framework in production models, allowing downstream models to work independently. We validate the performance of our framework through rigorous offline and online experiments within a large-scale system, showcasing its remarkable efficacy across multiple evaluation tasks. Finally, we show how the proposed framework can significantly reduce infrastructure costs compared to alternative approaches.
△ Less
Submitted 1 March, 2024;
originally announced March 2024.
-
ForTune: Running Offline Scenarios to Estimate Impact on Business Metrics
Authors:
Georges Dupret,
Konstantin Sozinov,
Carmen Barcena Gonzalez,
Ziggy Zacks,
Amber Yuan,
Benjamin Carterette,
Manuel Mai,
Shubham Bansal,
Gwo Liang Leo Lien,
Andrey Gatash,
Roberto Sanchis Ojeda,
Mounia Lalmas
Abstract:
Making ideal decisions as a product leader in a web-facing company is extremely difficult. In addition to navigating the ambiguity of customer satisfaction and achieving business goals, one must also pave a path forward for ones' products and services to remain relevant, desirable, and profitable. Data and experimentation to test product hypotheses are key to informing product decisions. Online co…
▽ More
Making ideal decisions as a product leader in a web-facing company is extremely difficult. In addition to navigating the ambiguity of customer satisfaction and achieving business goals, one must also pave a path forward for ones' products and services to remain relevant, desirable, and profitable. Data and experimentation to test product hypotheses are key to informing product decisions. Online controlled experiments by A/B testing may provide the best data to support such decisions with high confidence, but can be time-consuming and expensive, especially when one wants to understand impact to key business metrics such as retention or long-term value. Offline experimentation allows one to rapidly iterate and test, but often cannot provide the same level of confidence, and cannot easily shine a light on impact on business metrics. We introduce a novel, lightweight, and flexible approach to investigating hypotheses, called scenario analysis, that aims to support product leaders' decisions using data about users and estimates of business metrics. Its strengths are that it can provide guidance on trade-offs that are incurred by growing or shifting consumption, estimate trends in long-term outcomes like retention and other important business metrics, and can generate hypotheses about relationships between metrics at scale.
△ Less
Submitted 28 January, 2025; v1 submitted 29 February, 2024;
originally announced March 2024.
-
Impatient Bandits: Optimizing Recommendations for the Long-Term Without Delay
Authors:
Thomas M. McDonald,
Lucas Maystre,
Mounia Lalmas,
Daniel Russo,
Kamil Ciosek
Abstract:
Recommender systems are a ubiquitous feature of online platforms. Increasingly, they are explicitly tasked with increasing users' long-term satisfaction. In this context, we study a content exploration task, which we formalize as a multi-armed bandit problem with delayed rewards. We observe that there is an apparent trade-off in choosing the learning signal: Waiting for the full reward to become a…
▽ More
Recommender systems are a ubiquitous feature of online platforms. Increasingly, they are explicitly tasked with increasing users' long-term satisfaction. In this context, we study a content exploration task, which we formalize as a multi-armed bandit problem with delayed rewards. We observe that there is an apparent trade-off in choosing the learning signal: Waiting for the full reward to become available might take several weeks, hurting the rate at which learning happens, whereas measuring short-term proxy rewards reflects the actual long-term goal only imperfectly. We address this challenge in two steps. First, we develop a predictive model of delayed rewards that incorporates all information obtained to date. Full observations as well as partial (short or medium-term) outcomes are combined through a Bayesian filter to obtain a probabilistic belief. Second, we devise a bandit algorithm that takes advantage of this new predictive model. The algorithm quickly learns to identify content aligned with long-term success by carefully balancing exploration and exploitation. We apply our approach to a podcast recommendation problem, where we seek to identify shows that users engage with repeatedly over two months. We empirically validate that our approach results in substantially better performance compared to approaches that either optimize for short-term proxies, or wait for the long-term outcome to be fully realized.
△ Less
Submitted 20 July, 2023; v1 submitted 19 July, 2023;
originally announced July 2023.
-
Disentangling Causal Effects from Sets of Interventions in the Presence of Unobserved Confounders
Authors:
Olivier Jeunen,
Ciarán M. Gilligan-Lee,
Rishabh Mehrotra,
Mounia Lalmas
Abstract:
The ability to answer causal questions is crucial in many domains, as causal inference allows one to understand the impact of interventions. In many applications, only a single intervention is possible at a given time. However, in some important areas, multiple interventions are concurrently applied. Disentangling the effects of single interventions from jointly applied interventions is a challeng…
▽ More
The ability to answer causal questions is crucial in many domains, as causal inference allows one to understand the impact of interventions. In many applications, only a single intervention is possible at a given time. However, in some important areas, multiple interventions are concurrently applied. Disentangling the effects of single interventions from jointly applied interventions is a challenging task -- especially as simultaneously applied interventions can interact. This problem is made harder still by unobserved confounders, which influence both treatments and outcome. We address this challenge by aiming to learn the effect of a single-intervention from both observational data and sets of interventions. We prove that this is not generally possible, but provide identification proofs demonstrating that it can be achieved under non-linear continuous structural causal models with additive, multivariate Gaussian noise -- even when unobserved confounders are present. Importantly, we show how to incorporate observed covariates and learn heterogeneous treatment effects. Based on the identifiability proofs, we provide an algorithm that learns the causal model parameters by pooling data from different regimes and jointly maximizing the combined likelihood. The effectiveness of our method is empirically demonstrated on both synthetic and real-world data.
△ Less
Submitted 11 October, 2022;
originally announced October 2022.
-
Mostra: A Flexible Balancing Framework to Trade-off User, Artist and Platform Objectives for Music Sequencing
Authors:
Emanuele Bugliarello,
Rishabh Mehrotra,
James Kirk,
Mounia Lalmas
Abstract:
We consider the task of sequencing tracks on music streaming platforms where the goal is to maximise not only user satisfaction, but also artist- and platform-centric objectives, needed to ensure long-term health and sustainability of the platform. Grounding the work across four objectives: Sat, Discovery, Exposure and Boost, we highlight the need and the potential to trade-off performance across…
▽ More
We consider the task of sequencing tracks on music streaming platforms where the goal is to maximise not only user satisfaction, but also artist- and platform-centric objectives, needed to ensure long-term health and sustainability of the platform. Grounding the work across four objectives: Sat, Discovery, Exposure and Boost, we highlight the need and the potential to trade-off performance across these objectives, and propose Mostra, a Set Transformer-based encoder-decoder architecture equipped with submodular multi-objective beam search decoding. The proposed model affords system designers the power to balance multiple goals, and dynamically control the impact on one objective to satisfy other objectives. Through extensive experiments on data from a large-scale music streaming platform, we present insights on the trade-offs that exist across different objectives, and demonstrate that the proposed framework leads to a superior, just-in-time balancing across the various metrics of interest.
△ Less
Submitted 21 April, 2022;
originally announced April 2022.
-
Every Colour You Are: Stance Prediction and Turnaround in Controversial Issues
Authors:
Eduardo Graells-Garrido,
Ricardo Baeza-Yates,
Mounia Lalmas
Abstract:
Web platforms have allowed political manifestation and debate for decades. Technology changes have brought new opportunities for expression, and the availability of longitudinal data of these debates entice new questions regarding who participates, and who updates their opinion. The aim of this work is to provide a methodology to measure these phenomena, and to test this methodology on a specific…
▽ More
Web platforms have allowed political manifestation and debate for decades. Technology changes have brought new opportunities for expression, and the availability of longitudinal data of these debates entice new questions regarding who participates, and who updates their opinion. The aim of this work is to provide a methodology to measure these phenomena, and to test this methodology on a specific topic, abortion, as observed on one of the most popular micro-blogging platforms. To do so, we followed the discussion on Twitter about abortion in two Spanish-speaking countries from 2015 to 2018. Our main insights are two fold. On the one hand, people adopted new technologies to express their stances, particularly colored variations of heart emojis ([green heart] & [purple heart]) in a way that mirrored physical manifestations on abortion. On the other hand, even on issues with strong opinions, opinions can change, and these changes show differences in demographic groups. These findings imply that debate on the Web embraces new ways of stance adherence, and that changes of opinion can be measured and characterized.
△ Less
Submitted 19 May, 2020;
originally announced May 2020.
-
Quantifying Biases in Online Information Exposure
Authors:
Dimitar Nikolov,
Mounia Lalmas,
Alessandro Flammini,
Filippo Menczer
Abstract:
Our consumption of online information is mediated by filtering, ranking, and recommendation algorithms that introduce unintentional biases as they attempt to deliver relevant and engaging content. It has been suggested that our reliance on online technologies such as search engines and social media may limit exposure to diverse points of view and make us vulnerable to manipulation by disinformatio…
▽ More
Our consumption of online information is mediated by filtering, ranking, and recommendation algorithms that introduce unintentional biases as they attempt to deliver relevant and engaging content. It has been suggested that our reliance on online technologies such as search engines and social media may limit exposure to diverse points of view and make us vulnerable to manipulation by disinformation. In this paper, we mine a massive dataset of Web traffic to quantify two kinds of bias: (i) homogeneity bias, which is the tendency to consume content from a narrow set of information sources, and (ii) popularity bias, which is the selective exposure to content from top sites. Our analysis reveals different bias levels across several widely used Web platforms. Search exposes users to a diverse set of sources, while social media traffic tends to exhibit high popularity and homogeneity bias. When we focus our analysis on traffic to news sites, we find higher levels of popularity bias, with smaller differences across applications. Overall, our results quantify the extent to which our choices of online systems confine us inside "social bubbles."
△ Less
Submitted 18 July, 2018;
originally announced July 2018.
-
Sentiment Visualisation Widgets for Exploratory Search
Authors:
Eduardo Graells-Garrido,
Mounia Lalmas,
Ricardo Baeza-Yates
Abstract:
This paper proposes the usage of \emph{visualisation widgets} for exploratory search with \emph{sentiment} as a facet. Starting from specific design goals for depiction of ambivalence in sentiment, two visualization widgets were implemented: \emph{scatter plot} and \emph{parallel coordinates}. Those widgets were evaluated against a text baseline in a small-scale usability study with exploratory ta…
▽ More
This paper proposes the usage of \emph{visualisation widgets} for exploratory search with \emph{sentiment} as a facet. Starting from specific design goals for depiction of ambivalence in sentiment, two visualization widgets were implemented: \emph{scatter plot} and \emph{parallel coordinates}. Those widgets were evaluated against a text baseline in a small-scale usability study with exploratory tasks using Wikipedia as dataset. The study results indicate that users spend more time browsing with scatter plots in a positive way. A post-hoc analysis of individual differences in behavior revealed that when considering two types of users, \emph{explorers} and \emph{achievers}, engagement with scatter plots is positive and significantly greater \textit{when users are explorers}. We discuss the implications of these findings for sentiment-based exploratory search and personalised user interfaces.
△ Less
Submitted 8 January, 2016;
originally announced January 2016.
-
Data Portraits and Intermediary Topics: Encouraging Exploration of Politically Diverse Profiles
Authors:
Eduardo Graells-Garrido,
Mounia Lalmas,
Ricardo Baeza-Yates
Abstract:
In micro-blogging platforms, people connect and interact with others. However, due to cognitive biases, they tend to interact with like-minded people and read agreeable information only. Many efforts to make people connect with those who think differently have not worked well. In this paper, we hypothesize, first, that previous approaches have not worked because they have been direct -- they have…
▽ More
In micro-blogging platforms, people connect and interact with others. However, due to cognitive biases, they tend to interact with like-minded people and read agreeable information only. Many efforts to make people connect with those who think differently have not worked well. In this paper, we hypothesize, first, that previous approaches have not worked because they have been direct -- they have tried to explicitly connect people with those having opposing views on sensitive issues. Second, that neither recommendation or presentation of information by themselves are enough to encourage behavioral change. We propose a platform that mixes a recommender algorithm and a visualization-based user interface to explore recommendations. It recommends politically diverse profiles in terms of distance of latent topics, and displays those recommendations in a visual representation of each user's personal content. We performed an "in the wild" evaluation of this platform, and found that people explored more recommendations when using a biased algorithm instead of ours. In line with our hypothesis, we also found that the mixture of our recommender algorithm and our user interface, allowed politically interested users to exhibit an unbiased exploration of the recommended profiles. Finally, our results contribute insights in two aspects: first, which individual differences are important when designing platforms aimed at behavioral change; and second, which algorithms and user interfaces should be mixed to help users avoid cognitive mechanisms that lead to biased behavior.
△ Less
Submitted 4 January, 2016;
originally announced January 2016.
-
Encouraging Diversity- and Representation-Awareness in Geographically Centralized Content
Authors:
Eduardo Graells-Garrido,
Mounia Lalmas,
Ricardo Baeza-Yates
Abstract:
In centralized countries, not only population, media and economic power are concentrated, but people give more attention to central locations. While this is not inherently bad, this behavior extends to micro-blogging platforms: central locations get more attention in terms of information flow. In this paper we study the effects of an information filtering algorithm that decentralizes content in su…
▽ More
In centralized countries, not only population, media and economic power are concentrated, but people give more attention to central locations. While this is not inherently bad, this behavior extends to micro-blogging platforms: central locations get more attention in terms of information flow. In this paper we study the effects of an information filtering algorithm that decentralizes content in such platforms. Particularly, we find that users from non-central locations were not able to identify the geographical diversity on timelines generated by the algorithm, which were diverse by construction. To make users see the inherent diversity, we define a design rationale to approach this problem, focused on an already known visualization technique: treemaps. Using interaction data from an "in the wild" deployment of our proposed system, we find that, even though there are effects of centralization in exploratory user behavior, the treemap was able to make users see the inherent geographical diversity of timelines, and engage with user generated content. With these results in mind, we propose practical actions for micro-blogging platforms to account for the differences and biased behavior induced by centralization.
△ Less
Submitted 7 October, 2015;
originally announced October 2015.
-
Finding Intermediary Topics Between People of Opposing Views: A Case Study
Authors:
Eduardo Graells-Garrido,
Mounia Lalmas,
Ricardo Baeza-Yates
Abstract:
In micro-blogging platforms, people can connect with others and have conversations on a wide variety of topics. However, because of homophily and selective exposure, users tend to connect with like-minded people and only read agreeable information. Motivated by this scenario, in this paper we study the diversity of intermediary topics, which are latent topics estimated from user generated content.…
▽ More
In micro-blogging platforms, people can connect with others and have conversations on a wide variety of topics. However, because of homophily and selective exposure, users tend to connect with like-minded people and only read agreeable information. Motivated by this scenario, in this paper we study the diversity of intermediary topics, which are latent topics estimated from user generated content. These topics can be used as features in recommender systems aimed at introducing people of diverse political viewpoints. We conducted a case study on Twitter, considering the debate about a sensitive issue in Chile, where we quantified homophilic behavior in terms of political discussion and then we evaluated the diversity of intermediary topics in terms of political stances of users.
△ Less
Submitted 30 July, 2015; v1 submitted 2 June, 2015;
originally announced June 2015.
-
First Women, Second Sex: Gender Bias in Wikipedia
Authors:
Eduardo Graells-Garrido,
Mounia Lalmas,
Filippo Menczer
Abstract:
Contributing to history has never been as easy as it is today. Anyone with access to the Web is able to play a part on Wikipedia, an open and free encyclopedia. Wikipedia, available in many languages, is one of the most visited websites in the world and arguably one of the primary sources of knowledge on the Web. However, not everyone is contributing to Wikipedia from a diversity point of view; se…
▽ More
Contributing to history has never been as easy as it is today. Anyone with access to the Web is able to play a part on Wikipedia, an open and free encyclopedia. Wikipedia, available in many languages, is one of the most visited websites in the world and arguably one of the primary sources of knowledge on the Web. However, not everyone is contributing to Wikipedia from a diversity point of view; several groups are severely underrepresented. One of those groups is women, who make up approximately 16% of the current contributor community, meaning that most of the content is written by men. In addition, although there are specific guidelines of verifiability, notability, and neutral point of view that must be adhered by Wikipedia content, these guidelines are supervised and enforced by men.
In this paper, we propose that gender bias is not about participation and representation only, but also about characterization of women. We approach the analysis of gender bias by defining a methodology for comparing the characterizations of men and women in biographies in three aspects: meta-data, language, and network structure. Our results show that, indeed, there are differences in characterization and structure. Some of these differences are reflected from the off-line world documented by Wikipedia, but other differences can be attributed to gender bias in Wikipedia content. We contextualize these differences in feminist theory and discuss their implications for Wikipedia policy.
△ Less
Submitted 2 June, 2015; v1 submitted 8 February, 2015;
originally announced February 2015.
-
An Exploration of Cursor tracking Data
Authors:
David Warnock,
Mounia Lalmas
Abstract:
Cursor tracking data contains information about website visitors which may provide new ways to understand visitors and their needs. This paper presents an Amazon Mechanical Turk study where participants were tracked as they used modified variants of the Wikipedia and BBC News websites. Participants were asked to complete reading and information-finding tasks. The results showed that it was possibl…
▽ More
Cursor tracking data contains information about website visitors which may provide new ways to understand visitors and their needs. This paper presents an Amazon Mechanical Turk study where participants were tracked as they used modified variants of the Wikipedia and BBC News websites. Participants were asked to complete reading and information-finding tasks. The results showed that it was possible to differentiate between users reading content and users looking for information based on cursor data. The effects of website aesthetics, user interest and cursor hardware were also analysed which showed it was possible to identify hardware from cursor data, but no relationship between cursor data and engagement was found. The implications of these results, from the impact on web analytics to the design of experiments to assess user engagement, are discussed.
△ Less
Submitted 3 February, 2015; v1 submitted 1 February, 2015;
originally announced February 2015.
-
Data Portraits: Connecting People of Opposing Views
Authors:
Eduardo Graells-Garrido,
Mounia Lalmas,
Daniele Quercia
Abstract:
Social networks allow people to connect with each other and have conversations on a wide variety of topics. However, users tend to connect with like-minded people and read agreeable information, a behavior that leads to group polarization. Motivated by this scenario, we study how to take advantage of partial homophily to suggest agreeable content to users authored by people with opposite views on…
▽ More
Social networks allow people to connect with each other and have conversations on a wide variety of topics. However, users tend to connect with like-minded people and read agreeable information, a behavior that leads to group polarization. Motivated by this scenario, we study how to take advantage of partial homophily to suggest agreeable content to users authored by people with opposite views on sensitive issues. We introduce a paradigm to present a data portrait of users, in which their characterizing topics are visualized and their corresponding tweets are displayed using an organic design. Among their tweets we inject recommended tweets from other people considering their views on sensitive issues in addition to topical relevance, indirectly motivating connections between dissimilar people. To evaluate our approach, we present a case study on Twitter about a sensitive topic in Chile, where we estimate user stances for regular people and find intermediary topics. We then evaluated our design in a user study. We found that recommending topically relevant content from authors with opposite views in a baseline interface had a negative emotional effect. We saw that our organic visualization design reverts that effect. We also observed significant individual differences linked to evaluation of recommendations. Our results suggest that organic visualization may revert the negative effects of providing potentially sensitive content.
△ Less
Submitted 19 November, 2013;
originally announced November 2013.
-
Beyond Cumulated Gain and Average Precision: Including Willingness and Expectation in the User Model
Authors:
Benjamin Piwowarski,
Georges Dupret,
Mounia Lalmas
Abstract:
In this paper, we define a new metric family based on two concepts: The definition of the stopping criterion and the notion of satisfaction, where the former depends on the willingness and expectation of a user exploring search results. Both concepts have been discussed so far in the IR literature, but we argue in this paper that defining a proper single valued metric depends on merging them into…
▽ More
In this paper, we define a new metric family based on two concepts: The definition of the stopping criterion and the notion of satisfaction, where the former depends on the willingness and expectation of a user exploring search results. Both concepts have been discussed so far in the IR literature, but we argue in this paper that defining a proper single valued metric depends on merging them into a single conceptual framework.
△ Less
Submitted 20 September, 2012;
originally announced September 2012.
-
Exploring a Multidimensional Representation of Documents and Queries (extended version)
Authors:
Benjamin Piwowarski,
Ingo Frommholz,
Mounia Lalmas,
Keith van Rijsbergen
Abstract:
In Information Retrieval (IR), whether implicitly or explicitly, queries and documents are often represented as vectors. However, it may be more beneficial to consider documents and/or queries as multidimensional objects. Our belief is this would allow building "truly" interactive IR systems, i.e., where interaction is fully incorporated in the IR framework.
The probabilistic formalism of quan…
▽ More
In Information Retrieval (IR), whether implicitly or explicitly, queries and documents are often represented as vectors. However, it may be more beneficial to consider documents and/or queries as multidimensional objects. Our belief is this would allow building "truly" interactive IR systems, i.e., where interaction is fully incorporated in the IR framework.
The probabilistic formalism of quantum physics represents events and densities as multidimensional objects. This paper presents our first step towards building an interactive IR framework upon this formalism, by stating how the first interaction of the retrieval process, when the user types a query, can be formalised. Our framework depends on a number of parameters affecting the final document ranking. In this paper we experimentally investigate the effect of these parameters, showing that the proposed representation of documents and queries as multidimensional objects can compete with standard approaches, with the additional prospect to be applied to interactive retrieval.
△ Less
Submitted 18 February, 2010; v1 submitted 17 February, 2010;
originally announced February 2010.
-
A Quantum-based Model for Interactive Information Retrieval (extended version)
Authors:
B. Piwowarski,
M. Lalmas
Abstract:
Even the best information retrieval model cannot always identify the most useful answers to a user query. This is in particular the case with web search systems, where it is known that users tend to minimise their effort to access relevant information. It is, however, believed that the interaction between users and a retrieval system, such as a web search engine, can be exploited to provide bett…
▽ More
Even the best information retrieval model cannot always identify the most useful answers to a user query. This is in particular the case with web search systems, where it is known that users tend to minimise their effort to access relevant information. It is, however, believed that the interaction between users and a retrieval system, such as a web search engine, can be exploited to provide better answers to users. Interactive Information Retrieval (IR) systems, in which users access information through a series of interactions with the search system, are concerned with building models for IR, where interaction plays a central role. There are many possible interactions between a user and a search system, ranging from query (re)formulation to relevance feedback. However, capturing them within a single framework is difficult and previously proposed approaches have mostly focused on relevance feedback. In this paper, we propose a general framework for interactive IR that is able to capture the full interaction process in a principled way. Our approach relies upon a generalisation of the probability framework of quantum physics, whose strong geometric component can be a key towards a successful interactive IR model.
△ Less
Submitted 23 June, 2009; v1 submitted 22 June, 2009;
originally announced June 2009.