-
Digital Advertising in a Post-Cookie World: Charting the Impact of Google's Topics API
Authors:
Jesús Romero,
Ángel Cuevas,
Rubén Cuevas
Abstract:
Integrating Google's Topics API into the digital advertising ecosystem represents a significant shift toward privacy-conscious advertising practices. This article analyses the implications of implementing Topics API on ad networks, focusing on competition dynamics and ad space accessibility. Through simulations based on extensive datasets capturing user behavior and market share data for ad networks, we evaluate metrics such as Ad Placement Eligibility, Low Competition Rate, and solo competitor. The findings reveal a noticeable impact on ad networks, with larger players strengthening their dominance and smaller networks facing challenges securing ad spaces and competing effectively. Moreover, the study explores the potential environmental implications of Google's actions, highlighting the need to carefully consider policy and regulatory measures to ensure fair competition and privacy protection. Overall, this research contributes valuable insights into the evolving dynamics of digital advertising and highlights the importance of balancing privacy with competition and innovation in the online advertising landscape.
Submitted 21 September, 2024;
originally announced September 2024.
-
adF: A Novel System for Measuring Web Fingerprinting through Ads
Authors:
Miguel A. Bermejo-Agueda,
Patricia Callejo,
Rubén Cuevas,
Ángel Cuevas
Abstract:
This paper introduces adF, a novel system for analyzing the vulnerability of different devices, Operating Systems (OSes), and browsers to web fingerprinting. adF performs its measurements from code inserted in ads. We have used our system in several ad campaigns that delivered 5.40 million ad impressions. The collected data allow us to assess the vulnerability of current desktop and mobile devices to web fingerprinting. Based on our results, we estimate that 66% of desktop devices and 40% of mobile devices can be uniquely fingerprinted with our web fingerprinting system. However, the resilience to web fingerprinting varies significantly across browsers and device types, with Chrome on desktops being the most vulnerable configuration.
To counter web fingerprinting, we propose ShieldF, a simple solution that blocks browsers from reporting the attributes that our dataset analysis found to have the most significant discrimination power. Our experiments reveal that ShieldF outperforms all anti-fingerprinting solutions proposed by major browsers (Chrome, Safari and Firefox), increasing resilience to web fingerprinting by up to 62% for some device configurations. ShieldF is available as an add-on for any Chromium-based browser. Moreover, it is readily adoptable by browser and mobile app developers. Its widespread use would lead to a significant improvement in the protection offered by browsers and mobile apps against web fingerprinting.
Submitted 12 September, 2024; v1 submitted 15 November, 2023;
originally announced November 2023.
-
Analysis and implementation of nanotargeting on LinkedIn based on publicly available non-PII
Authors:
Ángel Merino,
José González-Cabañas,
Ángel Cuevas,
Rubén Cuevas
Abstract:
The literature has shown that combining a few non-Personal Identifiable Information (non-PII) items is enough to make a user unique in a dataset including millions of users. This work demonstrates that a combination of a few non-PII items can be activated to nanotarget users. We demonstrate that the combination of the location and 5 rare (13 random) skills in a LinkedIn profile is enough to make a user unique in a user base of ~970M users with a probability of 75%. The novelty is that these attributes are publicly accessible to anyone registered on LinkedIn and can be activated through advertising campaigns. We ran an experiment configuring ad campaigns using the location and skills of three of the paper's authors, demonstrating how all the ads using ≥13 skills were delivered exclusively to the targeted user. We reported this vulnerability to LinkedIn, which initially ignored the problem, but fixed it as of November 2023.
Submitted 16 May, 2024; v1 submitted 16 October, 2023;
originally announced October 2023.
-
Collecting Qualitative Data at Scale with Large Language Models: A Case Study
Authors:
Alejandro Cuevas,
Jennifer V. Scurrell,
Eva M. Brown,
Jason Entenmann,
Madeleine I. G. Daepp
Abstract:
Chatbots have shown promise as tools to scale qualitative data collection. Recent advances in Large Language Models (LLMs) could accelerate this process by allowing researchers to easily deploy sophisticated interviewing chatbots. We test this assumption by conducting a large-scale user study (n=399) evaluating 3 different chatbots, two of which are LLM-based and a baseline which employs hard-coded questions. We evaluate the results with respect to participant engagement and experience, established metrics of chatbot quality grounded in theories of effective communication, and a novel scale evaluating "richness" or the extent to which responses capture the complexity and specificity of the social context under study. We find that, while the chatbots were able to elicit high-quality responses based on established evaluation metrics, the responses rarely capture participants' specific motives or personalized examples, and thus perform poorly with respect to richness. We further find low inter-rater reliability between LLMs and humans in the assessment of both quality and richness metrics. Our study offers a cautionary tale for scaling and evaluating qualitative research with LLMs.
Submitted 3 December, 2024; v1 submitted 18 September, 2023;
originally announced September 2023.
-
On the notion of polynomial reach: a statistical application
Authors:
Alejandro Cholaquidis,
Antonio Cuevas,
Leonardo Moreno
Abstract:
The volume function $V(t)$ of a compact set $S \subset \mathbb{R}^d$ is just the Lebesgue measure of the set of points within a distance to $S$ not larger than $t$. According to some classical results in geometric measure theory, the volume function turns out to be a polynomial, at least in a finite interval, under a quite intuitive, easy to interpret, sufficient condition (called "positive reach") which can be seen as an extension of the notion of convexity. However, many other simple sets, not fulfilling the positive reach condition, also have a polynomial volume function. To our knowledge, there is no general, simple geometric description of such sets. Still, the polynomial character of $V(t)$ has some relevant consequences since the polynomial coefficients carry some useful geometric information. In particular, the constant term is the volume of $S$ and the first order coefficient is the boundary measure (in Minkowski's sense). This paper is focused on sets whose volume function is polynomial on some interval starting at zero, whose length (that we call "polynomial reach") might be unknown. Our main goal is to approximate such polynomial reach by statistical means, using only a large enough random sample of points inside $S$. The practical motivation is simple: when the value of the polynomial reach, or rather a lower bound for it, is approximately known, the polynomial coefficients can be estimated from the sample points by using standard methods in polynomial approximation. As a result, we get a quite general method to estimate the volume and boundary measure of the set, relying only on an inner sample of points and not requiring the use of any smoothing parameter. This paper explores the theoretical and practical aspects of this idea.
Submitted 1 February, 2024; v1 submitted 1 July, 2023;
originally announced July 2023.
-
CarbonTag: A Browser-Based Method for Approximating Energy Consumption of Online Ads
Authors:
José González Cabañas,
Patricia Callejo,
Rubén Cuevas,
Steffen Svatberg,
Tommy Torjesen,
Ángel Cuevas,
Antonio Pastor,
Mikko Kotila
Abstract:
Energy is today the most critical environmental challenge. The amount of carbon emissions contributing to climate change is significantly influenced by both the production and consumption of energy. Measuring and reducing the energy consumption of services is a crucial step toward reducing adverse environmental effects caused by carbon emissions. Millions of websites rely on online advertisements to generate revenue, with most websites earning most or all of their revenues from ads. As a result, hundreds of billions of online ads are delivered daily to internet users to be rendered in their browsers. Both the delivery and rendering of each ad consume energy. This study investigates how much energy online ads use in the rendering process and offers a way to predict it as part of rendering the ad. To the best of the authors' knowledge, this is the first study to calculate the energy usage of single advertisements in the rendering process. Our research further introduces different levels of consumption by which online ads can be classified based on energy efficiency. This classification will allow advertisers to add energy efficiency metrics and optimize campaigns toward consuming as little energy as possible.
Submitted 26 June, 2023; v1 submitted 25 October, 2022;
originally announced November 2022.
-
Observations From an Online Security Competition and Its Implications on Crowdsourced Security
Authors:
Alejandro Cuevas,
Emma Hogan,
Hanan Hibshi,
Nicolas Christin
Abstract:
The crowdsourced security industry, particularly bug bounty programs, has grown dramatically over the past years and has become the main source of software security reviews for many companies. However, the academic literature has largely omitted security teams, particularly in crowd work contexts. As such, we know very little about how distributed security teams organize and collaborate, and what technology needs they have. We fill this gap by conducting focus groups with the top five teams (out of 18,201 participating teams) of a computer security Capture-the-Flag (CTF) competition. We find that these teams adopted a set of strategies centered on specialties, which allowed them to reduce issues relating to dispersion, double work, and lack of previous collaboration. Observing the current issues of a model centered on individual workers in security crowd work platforms, our study makes the case that scaling security work to teams is feasible and beneficial. Finally, we identify various areas which warrant future work, such as issues of social identity in high-skilled crowd work environments.
Submitted 26 April, 2022;
originally announced April 2022.
-
Unique on Facebook: Formulation and Evidence of (Nano)targeting Individual Users with non-PII Data
Authors:
José González-Cabañas,
Ángel Cuevas,
Rubén Cuevas,
Juan López-Fernández,
David García
Abstract:
The privacy of an individual is bounded by the ability of a third party to reveal their identity. Certain data items such as a passport ID or a mobile phone number may be used to uniquely identify a person. These are referred to as Personal Identifiable Information (PII) items. Previous literature has also reported that, in datasets including millions of users, a combination of several non-PII items (which alone are not enough to identify an individual) can uniquely identify an individual within the dataset. In this paper, we define a data-driven model to quantify the number of interests from a user that make them unique on Facebook. To the best of our knowledge, this represents the first study of individuals' uniqueness at the world population scale. Besides, users' interests are actionable non-PII items that can be used to define ad campaigns and deliver tailored ads to Facebook users. We run an experiment through 21 Facebook ad campaigns that target three of the authors of this paper to prove that, if an advertiser knows enough interests from a user, the Facebook Advertising Platform can be systematically exploited to deliver ads exclusively to a specific user. We refer to this practice as nanotargeting. Finally, we discuss the harmful risks associated with nanotargeting such as psychological persuasion, user manipulation, or blackmailing, and provide easily implementable countermeasures to preclude attacks based on nanotargeting campaigns on Facebook.
Submitted 16 October, 2021; v1 submitted 13 October, 2021;
originally announced October 2021.
-
A deep dive into the accuracy of IP Geolocation Databases and its impact on online advertising
Authors:
Patricia Callejo,
Marco Gramaglia,
Rubén Cuevas,
Ángel Cuevas
Abstract:
The quest for an ever more personalized Internet experience relies on enriched contextual information about each user. Online advertising also follows this approach. Location is certainly part of the context information that advertising stakeholders leverage. However, when this information is not directly available from the end users, advertising stakeholders infer it using geolocation databases, matching IP addresses to a position on earth. The accuracy of this approach has often been questioned in the past; however, the reality check on an advertising DSP shows that this technique accounts for a large fraction of the served advertisements. In this paper, we revisit the work in the field, most of which dates from almost a decade ago, through the lens of big data. More specifically, we i) benchmark two commercial Internet geolocation databases, evaluating the quality of their information using a ground truth database of user positions containing more than 2 billion samples, ii) analyze the internals of these databases, devising a theoretical upper bound for the quality of the Internet geolocation approach, and iii) run an empirical study that unveils the monetary impact of this technology by considering the costs associated with a real-world ad impressions dataset. We show that when factoring cost in, IP geolocation technology may be, under certain campaign characteristics, a better alternative than GPS from an economic point of view, despite its inferior performance.
Submitted 1 June, 2022; v1 submitted 27 September, 2021;
originally announced September 2021.
-
How resilient is the Open Web to the COVID-19 pandemic?
Authors:
José González-Cabañas,
Patricia Callejo,
Pelayo Vallina,
Ángel Cuevas,
Rubén Cuevas,
Antonio Fernández Anta
Abstract:
In this paper, we use the term Open Web to refer to the set of services offered freely to Internet users, representing a pillar of modern societies. Despite its importance for society, it is unknown how the COVID-19 pandemic is affecting the Open Web. In this paper, we address this issue, focusing our analysis on Spain, one of the countries which have been most impacted by the pandemic.
On the one hand, we study the impact of the pandemic on the financial backbone of the Open Web, the online advertising business. To this end, we leverage concepts from Supply-Demand economic theory to perform a careful analysis of the elasticity of the supply of ad spaces to the financial shortage of the online advertising business and the subsequent reduction in the price of ad spaces. On the other hand, we analyze the distribution of the Open Web composition across business categories and its evolution during the COVID-19 pandemic. These analyses are conducted between Jan 1st and Dec 31st, 2020, using a reference dataset comprising information from more than 18 billion ad spaces.
Our results indicate that the Open Web has experienced a moderate shift in its composition across business categories. However, this change is not produced by the financial shortage of the online advertising business, because as our analysis shows, the Open Web's supply of ad spaces is inelastic (i.e., insensitive) to the sustained low-price of ad spaces during the pandemic. Instead, existing evidence suggests that the reported shift in the Open Web composition is likely due to the change in the users' online behavior (e.g., browsing and mobile apps utilization patterns).
Submitted 28 March, 2022; v1 submitted 30 July, 2021;
originally announced July 2021.
-
Digital Contact Tracing: Large-scale Geolocation Data as an Alternative to Bluetooth-based Apps' Failure
Authors:
José González-Cabañas,
Ángel Cuevas,
Rubén Cuevas,
Martin Maier
Abstract:
The currently deployed contact-tracing mobile apps have failed as an efficient solution in the context of the COVID-19 pandemic. None of them has managed to attract the number of active users required to achieve an efficient operation. This urges the research community to re-open the debate and explore new avenues that lead to efficient contact-tracing solutions. This paper contributes to this debate with an alternative contact-tracing solution that leverages already available geolocation information owned by BigTech companies with very large penetration rates in most countries adopting contact-tracing mobile apps. Moreover, our solution provides sufficient privacy guarantees to protect the identity of infected users and to preclude Health Authorities from obtaining individuals' contact graphs.
Submitted 28 March, 2022; v1 submitted 18 January, 2021;
originally announced January 2021.
-
Establishing Trust in Online Advertising with Signed Transactions
Authors:
Antonio Pastor,
Rubén Cuevas,
Ángel Cuevas,
Arturo Azcorra
Abstract:
Programmatic advertising operates one of the most sophisticated and efficient service platforms on the Internet. However, the complexity of this ecosystem is a direct cause of one of the most important problems in online advertising, the lack of transparency. This lack of transparency enables subsequent problems such as advertising fraud, which causes billions of dollars in losses. In this paper we propose Ads.chain, a technological solution to the lack-of-transparency problem in programmatic advertising. Ads.chain extends the current effort of the Interactive Advertising Bureau (IAB) in providing traceability in online advertising through the Ads.txt and Ads.cert solutions, addressing the limitations of these techniques. Ads.chain is (to the best of the authors' knowledge) the first solution that provides end-to-end cryptographic traceability at the ad transaction level. It is a communication protocol that can be seamlessly embedded into ad-tags and the OpenRTB protocol, the de-facto standards for communications in online advertising, allowing an incremental adoption by the industry. We have implemented Ads.chain and made the code publicly available. We assess the performance of Ads.chain through a thorough analysis in a lab environment that emulates a real ad delivery process at real-life throughputs. The obtained results show that Ads.chain can be implemented with limited impact on the hardware resources and marginal delay increments at the publishers lower than 0.20 milliseconds per ad space on webpages and 2.6 milliseconds at the programmatic advertising platforms. These results confirm that Ads.chain's impact on the user experience and the overall operation of the programmatic ad delivery process can be considered negligible.
Submitted 7 January, 2021; v1 submitted 13 May, 2020;
originally announced May 2020.
-
Gaussian process imputation of multiple financial series
Authors:
Taco de Wolff,
Alejandro Cuevas,
Felipe Tobar
Abstract:
In Financial Signal Processing, multiple time series such as financial indicators, stock prices and exchange rates are strongly coupled due to their dependence on the latent state of the market and therefore they are required to be jointly analysed. We focus on learning the relationships among financial time series by modelling them through a multi-output Gaussian process (MOGP) with expressive covariance functions. Learning these market dependencies among financial series is crucial for the imputation and prediction of financial observations. The proposed model is validated experimentally on two real-world financial datasets for which their correlations across channels are analysed. We compare our model against other MOGPs and the independent Gaussian process on real financial data.
Submitted 11 February, 2020;
originally announced February 2020.
-
MOGPTK: The Multi-Output Gaussian Process Toolkit
Authors:
Taco de Wolff,
Alejandro Cuevas,
Felipe Tobar
Abstract:
We present MOGPTK, a Python package for multi-channel data modelling using Gaussian processes (GP). The aim of this toolkit is to make multi-output GP (MOGP) models accessible to researchers, data scientists, and practitioners alike. MOGPTK uses a Python front-end, relies on the GPflow suite and is built on a TensorFlow back-end, thus enabling GPU-accelerated training. The toolkit facilitates implementing the entire pipeline of GP modelling, including data loading, parameter initialization, model learning, parameter interpretation, up to data imputation and extrapolation. MOGPTK implements the main multi-output covariance kernels from literature, as well as spectral-based parameter initialization strategies. The source code, tutorials and examples in the form of Jupyter notebooks, together with the API documentation, can be found at http://github.com/GAMES-UChile/mogptk
Submitted 9 February, 2020;
originally announced February 2020.
-
Does Facebook Use Sensitive Data for Advertising Purposes? Worldwide Analysis and GDPR Impact
Authors:
Ángel Cuevas,
José González Cabañas,
Aritz Arrate,
Rubén Cuevas
Abstract:
The recent European General Data Protection Regulation (GDPR) and other data protection regulations restrict the processing of some categories of personal data (health, political orientation, sexual preferences, religious beliefs, ethnic origin, etc.) due to the privacy risks associated with such information. The GDPR refers to these categories as sensitive personal data. This paper quantifies the portion of Facebook (FB) users, across 197 countries, who are labeled with advertising interests linked to potentially sensitive personal data. Our study reveals that Facebook labels 67% of users with potentially sensitive interests. This corresponds to 22% of the population in the referred 197 countries. Moreover, our work shows that the GDPR enforcement had a negligible impact in this context since the portion of FB users labeled with sensitive interests in the European Union remains almost the same 5 months before and 9 months after the GDPR was enacted. The paper also illustrates potential risks associated with the use of sensitive interests. For instance, we quantify the portion of FB users labeled with the interest "Homosexuality" in countries where being gay may be punished with the death penalty. The last contribution is the implementation of a web browser extension that allows FB users to easily remove the potentially sensitive interests FB has assigned them.
Submitted 23 July, 2019;
originally announced July 2019.
-
Markerless Augmented Advertising for Sports Videos
Authors:
Hallee E. Wong,
Osman Akar,
Emmanuel Antonio Cuevas,
Iuliana Tabian,
Divyaa Ravichandran,
Iris Fu,
Cambron Carter
Abstract:
Markerless augmented reality can be a challenging computer vision task, especially in live broadcast settings and in the absence of information related to the video capture such as the intrinsic camera parameters. This typically requires the assistance of a skilled artist, along with the use of advanced video editing tools in a post-production environment. We present an automated video augmentation pipeline that identifies textures of interest and overlays an advertisement onto these regions. We constrain the advertisement to be placed in a way that is aesthetic and natural. The aim is to augment the scene such that there is no longer a need for commercial breaks. In order to achieve seamless integration of the advertisement with the original video we build a 3D representation of the scene, place the advertisement in 3D, and then project it back onto the image plane. After successful placement in a single frame, we use homography-based, shape-preserving tracking such that the advertisement appears perspective correct for the duration of a video clip. The tracker is designed to handle smooth camera motion and shot boundaries.
Submitted 22 July, 2019;
originally announced July 2019.
-
Large-scale analysis of user exposure to online advertising in Facebook
Authors:
Aritz Arrate,
José González Cabañas,
Ángel Cuevas,
María Calderón,
Rubén Cuevas
Abstract:
Online advertising is the major source of income for a large portion of Internet Services. There exists a body of literature aiming at optimizing ad engagement, understanding the privacy and ethical implications of online advertising, etc. However, to the best of our knowledge, no previous work has analysed at large scale the exposure of real users to online advertising. This paper performs a comprehensive analysis of the exposure of users to ads and advertisers using a dataset including more than 7M ads from 140K unique advertisers delivered to more than 5K users that was collected between October 2016 and May 2018. The study focuses on Facebook, which is the second largest advertising platform, behind only Google, in terms of revenue, and has more than 2.2B monthly active users. Our analysis reveals that Facebook users are exposed (in median) to 70 ads per week, which come from 12 advertisers. Ads represent between 10% and 15% of all the information received in users' newsfeed. A small increment of 1% in the portion of ads in the newsfeed could roughly represent a revenue increase of 8.17M USD per week for Facebook. Finally, we also reveal that Facebook users are overprofiled since in the best case only 22.76% of the interests Facebook assigns to users for advertising purposes are actually related to the ads those users receive.
Submitted 26 December, 2018; v1 submitted 27 November, 2018;
originally announced November 2018.
-
Facebook Use of Sensitive Data for Advertising in Europe
Authors:
José González Cabañas,
Ángel Cuevas,
Rubén Cuevas
Abstract:
The upcoming European General Data Protection Regulation (GDPR) prohibits the processing and exploitation of some categories of personal data (health, political orientation, sexual preferences, religious beliefs, ethnic origin, etc.) due to the obvious privacy risks that may be derived from a malicious use of this type of information. These categories are referred to as sensitive personal data. Facebook has been recently fined EUR 1.2M in Spain for collecting, storing and processing sensitive personal data for advertising purposes. This paper quantifies the portion of Facebook users in the European Union (EU) who are labeled with interests linked to sensitive personal data. The results of our study reveal that Facebook labels 73% of EU users with sensitive interests. This corresponds to 40% of the overall EU population. We also estimate that a malicious third party could unveil the identity of Facebook users that have been assigned a sensitive interest at a cost as low as EUR 0.015 per user. Finally, we propose and implement a web browser extension to inform Facebook users of the sensitive interests Facebook has assigned them.
Submitted 14 February, 2018;
originally announced February 2018.
-
Analyzing gender inequality through large-scale Facebook advertising data
Authors:
David Garcia,
Yonas Mitike Kassa,
Angel Cuevas,
Manuel Cebrian,
Esteban Moro,
Iyad Rahwan,
Ruben Cuevas
Abstract:
Online social media are information resources that can have a transformative power in society. While the Web was envisioned as an equalizing force that allows everyone to access information, the digital divide prevents large numbers of people from being present online. Online social media in particular are prone to gender inequality, an important issue given the link between social media use and employment. Understanding gender inequality in social media is a challenging task due to the necessity of data sources that can provide large-scale measurements across multiple countries. Here we show how the Facebook Gender Divide (FGD), a metric based on aggregated statistics of more than 1.4 billion users in 217 countries, explains various aspects of worldwide gender inequality. Our analysis shows that the FGD encodes gender equality indices in education, health, and economic opportunity. We find gender differences in network externalities that suggest that using social media has an added value for women. Furthermore, we find that low values of the FGD are associated with increases in economic gender equality. Our results suggest that online social networks, while suffering evident gender imbalance, may lower the barriers that women have to access informational resources and help to narrow the economic gender gap.
Submitted 24 March, 2019; v1 submitted 10 October, 2017;
originally announced October 2017.
-
How far is Facebook from me? Facebook network infrastructure analysis
Authors:
Reza Farahbakhsh,
Angel Cuevas,
Antonio M. Ortiz,
Xiao Han,
Noel Crespi
Abstract:
Facebook is today the most popular social network with more than one billion subscribers worldwide. To provide good quality of service (e.g., low access delay) to its clients, FB relies on Akamai, which provides a worldwide content distribution network with a large number of edge servers that are much closer to FB subscribers. In this article we aim to depict a global picture of the current FB network infrastructure deployment taking into account both native FB servers and Akamai nodes. Toward this end, we have performed a measurement-based analysis during a period of two weeks using 463 PlanetLab nodes distributed across 41 countries. Based on the obtained data we compare the average access delay that nodes in different countries experience accessing both native FB servers and Akamai nodes. In addition, we obtain a wide view of the deployment of Akamai nodes serving FB users worldwide. Finally, we analyze the geographical coverage of those nodes, and demonstrate that in most cases Akamai nodes located in a particular country serve not only local FB subscribers, but also FB users located in nearby countries.
Submitted 1 May, 2017;
originally announced May 2017.
-
Characterization of Cross-posting Activity for Professional Users across Facebook, Twitter and Google+
Authors:
Reza Farahbakhsh,
Angel Cuevas,
Noel Crespi
Abstract:
Professional players in social media (e.g., big companies, politicians, athletes, celebrities, etc.) are intensively using Online Social Networks (OSNs) in order to interact with a huge number of regular OSN users for different purposes (marketing campaigns, customer feedback, public reputation improvement, etc.). Hence, due to the large catalog of existing OSNs, professional players usually have OSN accounts in different systems. In this context an interesting question is whether professional users publish the same information across their OSN accounts, or whether they actually use different OSNs in a different manner. We define as cross-posting activity the action of publishing the same information in two or more OSNs. This paper aims at characterizing the cross-posting activity of professional users across three major OSNs: Facebook, Twitter and Google+. To this end, we perform a large-scale measurement-based analysis across more than 2M posts collected from 616 professional users with active accounts in the three referred OSNs. Then we characterize the phenomenon of cross-posting and analyze the behavioral patterns based on the identified characteristics.
Submitted 1 May, 2017;
originally announced May 2017.
-
Understanding the evolution of multimedia content in the Internet through BitTorrent glasses
Authors:
Reza Farahbakhsh,
Angel Cuevas,
Ruben Cuevas,
Roberto Gonzalez,
Noel Crespi
Abstract:
Today's Internet traffic is mostly dominated by multimedia content and the prediction is that this trend will intensify in the future. Therefore, main Internet players, such as ISPs, content delivery platforms (e.g. YouTube, BitTorrent, Netflix, etc.) or CDN operators, need to understand the evolution of multimedia content availability and popularity in order to adapt their infrastructures and resources to satisfy clients' requirements while minimizing their costs. This paper presents a thorough analysis of the evolution of multimedia content available in BitTorrent. Specifically, we analyze the evolution of four relevant metrics across different content categories: content availability, content popularity, content size and user feedback. To this end we leverage a large-scale dataset formed by 4 snapshots collected from the most popular BitTorrent portal, namely The Pirate Bay, between Nov. 2009 and Feb. 2012. Overall our dataset is formed by more than 160k content items that attracted more than 185M download sessions.
Submitted 1 May, 2017;
originally announced May 2017.
-
Analysis of publicly disclosed information in Facebook profiles
Authors:
Reza Farahbakhsh,
Xiao Han,
Angel Cuevas,
Noel Crespi
Abstract:
Facebook, the most popular online social network, is a virtual environment where users share information and are in contact with friends. Apart from many useful aspects, there is a large amount of personal and sensitive information publicly available that is accessible to external entities/users. In this paper we study the public exposure of Facebook profile attributes to understand which types of attributes are considered more sensitive by Facebook users in terms of privacy, and thus are rarely disclosed, and which attributes are available in most Facebook profiles. Furthermore, we also analyze the public exposure of Facebook users by counting the number of attributes that users make publicly available on average. To complete our analysis we have crawled the profile information of 479K randomly selected Facebook users. Finally, in order to demonstrate the utility of the publicly available information in Facebook profiles we show in this paper three case studies. The first one carries out a gender-based analysis to understand whether men or women share more or less information. The second case study depicts the age distribution of Facebook users. The last case study uses data inferred from Facebook profiles to map the distribution of worldwide population across cities according to their size.
Submitted 1 May, 2017;
originally announced May 2017.
-
Are You Really Hidden? Predicting Current City from Profile and Social Relationship
Authors:
Xiao Han,
Leye Wang,
Jiangtao Wen,
Angel Cuevas,
Chao Chen,
Noel Crespi
Abstract:
Privacy has become a major concern in Online Social Networks (OSNs) due to threats such as advertising spam, online stalking and identity theft. Although many users hide or do not fill out their private attributes in OSNs, prior studies point out that the hidden attributes may be inferred from some other public information. Thus, users' private information could still be at risk of exposure. Hitherto, little work helps users to assess the exposure probability/risk that the hidden attributes can be correctly predicted, let alone provides them with pointed countermeasures. In this article, we focus our study on the exposure risk assessment of a particular privacy-sensitive attribute - current city - in Facebook. Specifically, we first design a novel current city prediction approach that discloses users' hidden 'current city' from their self-exposed information. Based on 371,913 Facebook users' data, we verify that our proposed prediction approach can predict users' current city more accurately than state-of-the-art approaches. Furthermore, we inspect the prediction results and model the current city exposure probability via some measurable characteristics of the self-exposed information. Finally, we construct an exposure estimator to assess the current city exposure risk for individual users, given their self-exposed information. Several case studies are presented to illustrate how to use our proposed estimator for privacy protection.
Submitted 4 August, 2015;
originally announced August 2015.
-
Google+ or Google-?: Dissecting the Evolution of the New OSN in its First Year
Authors:
Roberto Gonzalez,
Ruben Cuevas,
Reza Motamedi,
Reza Rejaie,
Angel Cuevas
Abstract:
In an era when Facebook and Twitter dominate the market for social media, Google has introduced Google+ (G+) and reported a significant growth in its size while others called it a ghost town. This raises the question of whether G+ can really attract a significant number of connected and active users despite the dominance of Facebook and Twitter.
This paper tackles the above question by presenting a detailed characterization of G+ based on large scale measurements. We identify the main components of G+ structure, characterize the key features of their users and their evolution over time. We then conduct detailed analysis on the evolution of connectivity and activity among users in the largest connected component (LCC) of G+ structure, and compare their characteristics with other major OSNs. We show that despite the dramatic growth in the size of G+, the relative size of LCC has been decreasing and its connectivity has become less clustered. While the aggregate user activity has gradually increased, only a very small fraction of users exhibit any type of activity. To our knowledge, our study offers the most comprehensive characterization of G+ based on the largest collected data sets.
Submitted 26 March, 2013; v1 submitted 25 May, 2012;
originally announced May 2012.
-
Where are my followers? Understanding the Locality Effect in Twitter
Authors:
Roberto Gonzalez,
Ruben Cuevas,
Angel Cuevas,
Carmen Guerrero
Abstract:
Twitter is one of the most used applications in the current Internet with more than 200M accounts created so far. Like other large-scale systems, Twitter can benefit from exploiting the Locality effect existing among its users. In this paper we perform the first comprehensive study of the Locality effect of Twitter. For this purpose we have collected the geographical location of around 1M Twitter users and 16M of their followers. Our results demonstrate that language and cultural characteristics determine the level of Locality expected for different countries. Countries with a language other than English, such as Brazil, typically show a high intra-country Locality, whereas those where English is an official or co-official language suffer from an external Locality effect. That is, their users have a larger number of followers in the US than within their own country. This has two causes: first, the US is the dominant country in Twitter, accounting for around half of the users, and second, these countries share a common language and cultural characteristics with the US.
Submitted 18 May, 2011;
originally announced May 2011.
-
TorrentGuard: stopping scam and malware distribution in the BitTorrent ecosystem
Authors:
Michal Kryczka,
Ruben Cuevas,
Roberto Gonzalez,
Angel Cuevas,
Arturo Azcorra
Abstract:
In this paper we conduct a large-scale measurement study in order to analyse the fake content publishing phenomenon in the BitTorrent ecosystem. Our results reveal that fake content represents an important portion (35%) of the files shared in BitTorrent and just a few tens of users are responsible for 90% of this content. Furthermore, more than 99% of the analysed fake files are linked to either malware or scam websites. This creates a serious threat for the BitTorrent ecosystem. To address this issue, we present a new detection tool named TorrentGuard for the early detection of fake content. Based on our evaluation this tool may prevent the download of more than 35 million fake files per year. This could help to reduce the number of computer infections and scams suffered by BitTorrent users. TorrentGuard is already available and it can be accessed through either a webpage or a Vuze plugin.
Submitted 19 April, 2012; v1 submitted 18 May, 2011;
originally announced May 2011.
-
Is Content Publishing in BitTorrent Altruistic or Profit-Driven
Authors:
Ruben Cuevas,
Michal Kryczka,
Angel Cuevas,
Sebastian Kaune,
Carmen Guerrero,
Reza Rejaie
Abstract:
BitTorrent is the most popular P2P content delivery application where individual users share various types of content with tens of thousands of other users. The growing popularity of BitTorrent is primarily due to the availability of valuable content without any cost for the consumers. However, apart from required resources, publishing (sharing) valuable (and often copyrighted) content has serious legal implications for the users who publish the material (or publishers). This raises the question of whether (at least major) content publishers behave in an altruistic fashion or have other incentives such as financial ones. In this study, we identify the content publishers of more than 55k torrents in 2 major BitTorrent portals and examine their behavior. We demonstrate that a small fraction of publishers are responsible for 66% of published content and 75% of the downloads. Our investigations reveal that these major publishers fall into two different profiles. On one hand, antipiracy agencies and malicious publishers publish a large amount of fake files to protect copyrighted content and spread malware respectively. On the other hand, content publishing in BitTorrent is largely driven by companies with financial incentives. Therefore, if these companies lose their interest or are unable to publish content, BitTorrent traffic/portals may disappear or at least their associated traffic will be significantly reduced.
Submitted 22 July, 2010; v1 submitted 14 July, 2010;
originally announced July 2010.