-
The potential of LLM-generated reports in DevSecOps
Authors:
Nikolaos Lykousas,
Vasileios Argyropoulos,
Fran Casino
Abstract:
Alert fatigue is a common issue faced by software teams using the DevSecOps paradigm. The overwhelming number of warnings and alerts generated by security and code scanning tools, particularly in smaller teams where resources are limited, leads to desensitization and diminished responsiveness to security warnings, potentially exposing systems to vulnerabilities. This paper explores the potential o…
▽ More
Alert fatigue is a common issue faced by software teams using the DevSecOps paradigm. The overwhelming number of warnings and alerts generated by security and code scanning tools, particularly in smaller teams where resources are limited, leads to desensitization and diminished responsiveness to security warnings, potentially exposing systems to vulnerabilities. This paper explores the potential of LLMs in generating actionable security reports that emphasize the financial impact and consequences of detected security issues, such as credential leaks, if they remain unaddressed. A survey conducted among developers indicates that LLM-generated reports significantly enhance the likelihood of immediate action on security issues by providing clear, comprehensive, and motivating insights. Integrating these reports into DevSecOps workflows can mitigate attention saturation and alert fatigue, ensuring that critical security warnings are addressed effectively.
△ Less
Submitted 2 October, 2024;
originally announced October 2024.
-
Assessing LLMs in Malicious Code Deobfuscation of Real-world Malware Campaigns
Authors:
Constantinos Patsakis,
Fran Casino,
Nikolaos Lykousas
Abstract:
The integration of large language models (LLMs) into various pipelines is increasingly widespread, effectively automating many manual tasks and often surpassing human capabilities. Cybersecurity researchers and practitioners have recognised this potential. Thus, they are actively exploring its applications, given the vast volume of heterogeneous data that requires processing to identify anomalies,…
▽ More
The integration of large language models (LLMs) into various pipelines is increasingly widespread, effectively automating many manual tasks and often surpassing human capabilities. Cybersecurity researchers and practitioners have recognised this potential. Thus, they are actively exploring its applications, given the vast volume of heterogeneous data that requires processing to identify anomalies, potential bypasses, attacks, and fraudulent incidents. On top of this, LLMs' advanced capabilities in generating functional code, comprehending code context, and summarising its operations can also be leveraged for reverse engineering and malware deobfuscation. To this end, we delve into the deobfuscation capabilities of state-of-the-art LLMs. Beyond merely discussing a hypothetical scenario, we evaluate four LLMs with real-world malicious scripts used in the notorious Emotet malware campaign. Our results indicate that while not absolutely accurate yet, some LLMs can efficiently deobfuscate such payloads. Thus, fine-tuning LLMs for this task can be a viable potential for future AI-powered threat intelligence pipelines in the fight against obfuscated malware.
△ Less
Submitted 30 April, 2024;
originally announced April 2024.
-
Tales from the Git: Automating the detection of secrets on code and assessing developers' passwords choices
Authors:
Nikolaos Lykousas,
Constantinos Patsakis
Abstract:
Typical users are known to use and reuse weak passwords. Yet, as cybersecurity concerns continue to rise, understanding the password practices of software developers becomes increasingly important. In this work, we examine developers' passwords on public repositories. Our dedicated crawler collected millions of passwords from public GitHub repositories; however, our focus is on their unique charac…
▽ More
Typical users are known to use and reuse weak passwords. Yet, as cybersecurity concerns continue to rise, understanding the password practices of software developers becomes increasingly important. In this work, we examine developers' passwords on public repositories. Our dedicated crawler collected millions of passwords from public GitHub repositories; however, our focus is on their unique characteristics. To this end, this is the first study investigating the developer traits in password selection across different programming languages and contexts, e.g. email and database. Despite the fact that developers may have carelessly leaked their code on public repositories, our findings indicate that they tend to use significantly more secure passwords, regardless of the underlying programming language and context. Nevertheless, when the context allows, they often resort to similar password selection criteria as typical users. The public availability of such information in a cleartext format indicates that there is still much room for improvement and that further targeted awareness campaigns are necessary.
△ Less
Submitted 3 July, 2023;
originally announced July 2023.
-
Man vs the machine: The Struggle for Effective Text Anonymisation in the Age of Large Language Models
Authors:
Constantinos Patsakis,
Nikolaos Lykousas
Abstract:
The collection and use of personal data are becoming more common in today's data-driven culture. While there are many advantages to this, including better decision-making and service delivery, it also poses significant ethical issues around confidentiality and privacy. Text anonymisation tries to prune and/or mask identifiable information from a text while keeping the remaining content intact to a…
▽ More
The collection and use of personal data are becoming more common in today's data-driven culture. While there are many advantages to this, including better decision-making and service delivery, it also poses significant ethical issues around confidentiality and privacy. Text anonymisation tries to prune and/or mask identifiable information from a text while keeping the remaining content intact to alleviate privacy concerns. Text anonymisation is especially important in industries like healthcare, law, as well as research, where sensitive and personal information is collected, processed, and exchanged under high legal and ethical standards.
Although text anonymization is widely adopted in practice, it continues to face considerable challenges. The most significant challenge is striking a balance between removing information to protect individuals' privacy while maintaining the text's usability for future purposes. The question is whether these anonymisation methods sufficiently reduce the risk of re-identification, in which an individual can be identified based on the remaining information in the text.
In this work, we challenge the effectiveness of these methods and how we perceive identifiers. We assess the efficacy of these methods against the elephant in the room, the use of AI over big data. While most of the research is focused on identifying and removing personal information, there is limited discussion on whether the remaining information is sufficient to deanonymise individuals and, more precisely, who can do it. To this end, we conduct an experiment using GPT over anonymised texts of famous people to determine whether such trained networks can deanonymise them. The latter allows us to revise these methods and introduce a novel methodology that employs Large Language Models to improve the anonymity of texts.
△ Less
Submitted 22 March, 2023;
originally announced March 2023.
-
The Cynicism of Modern Cybercrime: Automating the Analysis of Surface Web Marketplaces
Authors:
Nikolaos Lykousas,
Vasilios Koutsokostas,
Fran Casino,
Constantinos Patsakis
Abstract:
Cybercrime is continuously growing in numbers and becoming more sophisticated. Currently, there are various monetisation and money laundering methods, creating a huge, underground economy worldwide. A clear indicator of these activities is online marketplaces which allow cybercriminals to trade their stolen assets and services. While traditionally these marketplaces are available through the dark…
▽ More
Cybercrime is continuously growing in numbers and becoming more sophisticated. Currently, there are various monetisation and money laundering methods, creating a huge, underground economy worldwide. A clear indicator of these activities is online marketplaces which allow cybercriminals to trade their stolen assets and services. While traditionally these marketplaces are available through the dark web, several of them have emerged in the surface web. In this work, we perform a longitudinal analysis of a surface web marketplace. The information was collected through targeted web scrapping that allowed us to identify hundreds of merchants' profiles for the most widely used surface web marketplaces. In this regard, we discuss the products traded in these markets, their prices, their availability, and the exchange currency. This analysis is performed in an automated way through a machine learning-based pipeline, allowing us to quickly and accurately extract the needed information. The outcomes of our analysis evince that illegal practices are leveraged in surface marketplaces and that there are not effective mechanisms towards their takedown at the time of writing.
△ Less
Submitted 25 May, 2021;
originally announced May 2021.
-
Analysis and Correlation of Visual Evidence in Campaigns of Malicious Office Documents
Authors:
Fran Casino,
Nikolaos Totosis,
Theodoros Apostolopoulos,
Nikolaos Lykousas,
Constantinos Patsakis
Abstract:
Many malware campaigns use Microsoft (MS) Office documents as droppers to download and execute their malicious payload. Such campaigns often use these documents because MS Office is installed in billions of devices and that these files allow the execution of arbitrary VBA code. Recent versions of MS Office prevent the automatic execution of VBA macros, so malware authors try to convince users into…
▽ More
Many malware campaigns use Microsoft (MS) Office documents as droppers to download and execute their malicious payload. Such campaigns often use these documents because MS Office is installed in billions of devices and that these files allow the execution of arbitrary VBA code. Recent versions of MS Office prevent the automatic execution of VBA macros, so malware authors try to convince users into enabling the content via images that, e.g. forge system or technical errors. In this work, we leverage these visual elements to construct lightweight malware signatures that can be applied with minimal effort. We test and validate our approach using an extensive database of malware samples and identify correlations between different campaigns that illustrate that some campaigns are either using the same tools or that there is some collaboration between them.
△ Less
Submitted 30 March, 2021;
originally announced March 2021.
-
Inside the x-rated world of "premium" social media accounts
Authors:
Nikolaos Lykousas,
Fran Casino,
Constantinos Patsakis
Abstract:
During the last few years, there has been an upsurge of social media influencers who are part of the adult entertainment industry, referred to as Performers. To monetize their online presence, Performers often engage in practices which violate community guidelines of social media, such as selling subscriptions for accessing their private "premium" social media accounts, where they distribute adult…
▽ More
During the last few years, there has been an upsurge of social media influencers who are part of the adult entertainment industry, referred to as Performers. To monetize their online presence, Performers often engage in practices which violate community guidelines of social media, such as selling subscriptions for accessing their private "premium" social media accounts, where they distribute adult content. In this paper, we collect and analyze data from FanCentro, an online marketplace where Performers can sell adult content and subscriptions to private accounts in platforms like Snapchat and Instagram. Our work aims to shed light on the semi-illicit adult content market layered on the top of popular social media platforms and its offerings, as well as to profile the demographics, activity and content produced by Performers.
△ Less
Submitted 24 September, 2020;
originally announced September 2020.
-
Intercepting Hail Hydra: Real-Time Detection of Algorithmically Generated Domains
Authors:
Fran Casino,
Nikolaos Lykousas,
Ivan Homoliak,
Constantinos Patsakis,
Julio Hernandez-Castro
Abstract:
A crucial technical challenge for cybercriminals is to keep control over the potentially millions of infected devices that build up their botnets, without compromising the robustness of their attacks. A single, fixed C&C server, for example, can be trivially detected either by binary or traffic analysis and immediately sink-holed or taken-down by security researchers or law enforcement. Botnets of…
▽ More
A crucial technical challenge for cybercriminals is to keep control over the potentially millions of infected devices that build up their botnets, without compromising the robustness of their attacks. A single, fixed C&C server, for example, can be trivially detected either by binary or traffic analysis and immediately sink-holed or taken-down by security researchers or law enforcement. Botnets often use Domain Generation Algorithms (DGAs), primarily to evade take-down attempts. DGAs can enlarge the lifespan of a malware campaign, thus potentially enhancing its profitability. They can also contribute to hindering attack accountability. In this work, we introduce HYDRAS, the most comprehensive and representative dataset of Algorithmically-Generated Domains (AGD) available to date. The dataset contains more than 100 DGA families, including both real-world and adversarially designed ones. We analyse the dataset and discuss the possibility of differentiating between benign requests (to real domains) and malicious ones (to AGDs) in real-time. The simultaneous study of so many families and variants introduces several challenges; nonetheless, it alleviates biases found in previous literature employing small datasets which are frequently overfitted, exploiting characteristic features of particular families that do not generalise well.We thoroughly compare our approach with the current state-of-the-art and highlight some methodological shortcomings in the actual state of practice. The outcomes obtained show that our proposed approach significantly outperforms the current state-of-the-art in terms of both classification performance and efficiency.
△ Less
Submitted 2 August, 2021; v1 submitted 6 August, 2020;
originally announced August 2020.
-
Large-scale analysis of grooming in modern social networks
Authors:
Nikolaos Lykousas,
Constantinos Patsakis
Abstract:
Social networks are evolving to engage their users more by providing them with more functionalities. One of the most attracting ones is streaming. Users may broadcast part of their daily lives to thousands of others world-wide and interact with them in real-time. Unfortunately, this feature is reportedly exploited for grooming. In this work, we provide the first in-depth analysis of this problem f…
▽ More
Social networks are evolving to engage their users more by providing them with more functionalities. One of the most attracting ones is streaming. Users may broadcast part of their daily lives to thousands of others world-wide and interact with them in real-time. Unfortunately, this feature is reportedly exploited for grooming. In this work, we provide the first in-depth analysis of this problem for social live streaming services. More precisely, using a dataset that we collected, we identify predatory behaviours and grooming on chats that bypassed the moderation mechanisms of the LiveMe, the service under investigation. Beyond the traditional text approaches, we also investigate the relevance of emojis in this context, as well as the user interactions through the gift mechanisms of LiveMe. Finally, our analysis indicates the possibility of grooming towards minors, showing the extent of the problem in such platforms.
△ Less
Submitted 16 April, 2020;
originally announced April 2020.
-
Unravelling Ariadne's Thread: Exploring the Threats of Decentalised DNS
Authors:
Constantinos Patsakis,
Fran Casino,
Nikolaos Lykousas,
Vasilios Katos
Abstract:
The current landscape of the core Internet technologies shows considerable centralisation with the big tech companies controlling the vast majority of traffic and services. This has sparked a wide range of decentralisation initiatives with perhaps the most profound and successful being the blockchain technology. In the past years, a core Internet infrastructure, domain name system (DNS), is being…
▽ More
The current landscape of the core Internet technologies shows considerable centralisation with the big tech companies controlling the vast majority of traffic and services. This has sparked a wide range of decentralisation initiatives with perhaps the most profound and successful being the blockchain technology. In the past years, a core Internet infrastructure, domain name system (DNS), is being revised mainly due to its inherent security and privacy issues. One of the proposed panaceas is Blockchain-based DNS, which claims to solve many issues of traditional DNS. However, this does not come without security concerns and issues, as any introduction and adoption of a new technology does - let alone a disruptive one such as blockchain. In this work, we discuss a number of associated threats, including emerging ones, and we validate many of them with real-world data. In this regard, we explore a part of the blockchain DNS ecosystem in terms of the browser extensions using such technologies, the chain itself (Namecoin and Emercoin), the domains, and users which have been registered in both platforms. Finally, we provide some countermeasures to address the identified threats, and we propose a fertile common ground for further research.
△ Less
Submitted 7 December, 2019;
originally announced December 2019.
-
Sharing emotions at scale: The Vent dataset
Authors:
Nikolaos Lykousas,
Costantinos Patsakis,
Andreas Kaltenbrunner,
Vicenç Gómez
Abstract:
The continuous and increasing use of social media has enabled the expression of human thoughts, opinions, and everyday actions publicly at an unprecedented scale. We present the Vent dataset, the largest annotated dataset of text, emotions, and social connections to date. It comprises more than 33 millions of posts by nearly a million of users together with their social connections. Each post has…
▽ More
The continuous and increasing use of social media has enabled the expression of human thoughts, opinions, and everyday actions publicly at an unprecedented scale. We present the Vent dataset, the largest annotated dataset of text, emotions, and social connections to date. It comprises more than 33 millions of posts by nearly a million of users together with their social connections. Each post has an associated emotion. There are 705 different emotions, organized in 63 "emotion categories", forming a two-level taxonomy of affects. Our initial statistical analysis describes the global patterns of activity in the Vent platform, revealing large heterogenities and certain remarkable regularities regarding the use of the different emotions. We focus on the aggregated use of emotions, the temporal activity, and the social network of users, and outline possible methods to infer emotion networks based on the user activity. We also analyze the text and describe the affective landscape of Vent, finding agreements with existing (small scale) annotated corpus in terms of emotion categories and positive/negative valences. Finally, we discuss possible research questions that can be addressed from this unique dataset.
△ Less
Submitted 24 March, 2019; v1 submitted 15 January, 2019;
originally announced January 2019.
-
Adult content in Social Live Streaming Services: Characterizing deviant users and relationships
Authors:
Nikolaos Lykousas,
Vicenç Gómez,
Constantinos Patsakis
Abstract:
Social Live Stream Services (SLSS) exploit a new level of social interaction. One of the main challenges in these services is how to detect and prevent deviant behaviors that violate community guidelines. In this work, we focus on adult content production and consumption in two widely used SLSS, namely Live.me and Loops Live, which have millions of users producing massive amounts of video content…
▽ More
Social Live Stream Services (SLSS) exploit a new level of social interaction. One of the main challenges in these services is how to detect and prevent deviant behaviors that violate community guidelines. In this work, we focus on adult content production and consumption in two widely used SLSS, namely Live.me and Loops Live, which have millions of users producing massive amounts of video content on a daily basis. We use a pre-trained deep learning model to identify broadcasters of adult content. Our results indicate that moderation systems in place are highly ineffective in suspending the accounts of such users. We create two large datasets by crawling the social graphs of these platforms, which we analyze to identify characterizing traits of adult content producers and consumers, and discover interesting patterns of relationships among them, evident in both networks.
△ Less
Submitted 27 June, 2018;
originally announced June 2018.