Search | arXiv e-print repository

ScamFerret: Detecting Scam Websites Autonomously with Large Language Models

Authors: Hiroki Nakano, Takashi Koide, Daiki Chiba

Abstract: With the rise of sophisticated scam websites that exploit human psychological vulnerabilities, distinguishing between legitimate and scam websites has become increasingly challenging. This paper presents ScamFerret, an innovative agent system employing a large language model (LLM) to autonomously collect and analyze data from a given URL to determine whether it is a scam. Unlike traditional machin… ▽ More With the rise of sophisticated scam websites that exploit human psychological vulnerabilities, distinguishing between legitimate and scam websites has become increasingly challenging. This paper presents ScamFerret, an innovative agent system employing a large language model (LLM) to autonomously collect and analyze data from a given URL to determine whether it is a scam. Unlike traditional machine learning models that require large datasets and feature engineering, ScamFerret leverages LLMs' natural language understanding to accurately identify scam websites of various types and languages without requiring additional training or fine-tuning. Our evaluation demonstrated that ScamFerret achieves 0.972 accuracy in classifying four scam types in English and 0.993 accuracy in classifying online shopping websites across three different languages, particularly when using GPT-4. Furthermore, we confirmed that ScamFerret collects and analyzes external information such as web content, DNS records, and user reviews as necessary, providing a basis for identifying scam websites from multiple perspectives. These results suggest that LLMs have significant potential in enhancing cybersecurity measures against sophisticated scam websites. △ Less

Submitted 14 February, 2025; originally announced February 2025.

Comments: Accepted for publication at DIMVA 2025

arXiv:2410.02097 [pdf, other]

doi 10.1109/ACCESS.2025.3539882

DomainHarvester: Harvesting Infrequently Visited Yet Trustworthy Domain Names

Authors: Daiki Chiba, Hiroki Nakano, Takashi Koide

Abstract: In cybersecurity, allow lists play a crucial role in distinguishing safe websites from potential threats. Conventional methods for compiling allow lists, focusing heavily on website popularity, often overlook infrequently visited legitimate domains. This paper introduces DomainHarvester, a system aimed at generating allow lists that include trustworthy yet infrequently visited domains. By adopting… ▽ More In cybersecurity, allow lists play a crucial role in distinguishing safe websites from potential threats. Conventional methods for compiling allow lists, focusing heavily on website popularity, often overlook infrequently visited legitimate domains. This paper introduces DomainHarvester, a system aimed at generating allow lists that include trustworthy yet infrequently visited domains. By adopting an innovative bottom-up methodology that leverages the web's hyperlink structure, DomainHarvester identifies legitimate yet underrepresented domains. The system uses seed URLs to gather domain names, employing machine learning with a Transformer-based approach to assess their trustworthiness. DomainHarvester has developed two distinct allow lists: one with a global focus and another emphasizing local relevance. Compared to six existing top lists, DomainHarvester's allow lists show minimal overlaps, 4\% globally and 0.1\% locally, while significantly reducing the risk of including malicious domains, thereby enhancing security. The contributions of this research are substantial, illuminating the overlooked aspect of trustworthy yet underrepresented domains and introducing DomainHarvester, a system that goes beyond traditional popularity-based metrics. Our methodology enhances the inclusivity and precision of allow lists, offering significant advantages to users and businesses worldwide, especially in non-English speaking regions. △ Less

Submitted 11 February, 2025; v1 submitted 2 October, 2024; originally announced October 2024.

Comments: Originally presented at IEEE CCNC 2025. An extended version of this work has been published in IEEE Access: https://doi.org/10.1109/ACCESS.2025.3539882

arXiv:2410.02096 [pdf, other]

doi 10.1016/j.cose.2025.104366

DomainDynamics: Lifecycle-Aware Risk Timeline Construction for Domain Names

Authors: Daiki Chiba, Hiroki Nakano, Takashi Koide

Abstract: The persistent threat posed by malicious domain names in cyber-attacks underscores the urgent need for effective detection mechanisms. Traditional machine learning methods, while capable of identifying such domains, often suffer from high false positive and false negative rates due to their extensive reliance on historical data. Conventional approaches often overlook the dynamic nature of domain n… ▽ More The persistent threat posed by malicious domain names in cyber-attacks underscores the urgent need for effective detection mechanisms. Traditional machine learning methods, while capable of identifying such domains, often suffer from high false positive and false negative rates due to their extensive reliance on historical data. Conventional approaches often overlook the dynamic nature of domain names, the purposes and ownership of which may evolve, potentially rendering risk assessments outdated or irrelevant. To address these shortcomings, we introduce DomainDynamics, a novel system designed to predict domain name risks by considering their lifecycle stages. DomainDynamics constructs a timeline for each domain, evaluating the characteristics of each domain at various points in time to make informed, temporal risk determinations. In an evaluation experiment involving over 85,000 actual malicious domains from malware and phishing incidents, DomainDynamics demonstrated a significant improvement in detection rates, achieving an 82.58\% detection rate with a low false positive rate of 0.41\%. This performance surpasses that of previous studies and commercial services, improving detection capability substantially. △ Less

Submitted 20 February, 2025; v1 submitted 2 October, 2024; originally announced October 2024.

Comments: Originally presented at IEEE CCNC 2025. An extended version of this work has been published in Computers & Security, 2025

arXiv:2410.02095 [pdf, other]

doi 10.1109/ACCESS.2025.3542036

DomainLynx: Leveraging Large Language Models for Enhanced Domain Squatting Detection

Authors: Daiki Chiba, Hiroki Nakano, Takashi Koide

Abstract: Domain squatting poses a significant threat to Internet security, with attackers employing increasingly sophisticated techniques. This study introduces DomainLynx, an innovative compound AI system leveraging Large Language Models (LLMs) for enhanced domain squatting detection. Unlike existing methods focusing on predefined patterns for top-ranked domains, DomainLynx excels in identifying novel squ… ▽ More Domain squatting poses a significant threat to Internet security, with attackers employing increasingly sophisticated techniques. This study introduces DomainLynx, an innovative compound AI system leveraging Large Language Models (LLMs) for enhanced domain squatting detection. Unlike existing methods focusing on predefined patterns for top-ranked domains, DomainLynx excels in identifying novel squatting techniques and protecting less prominent brands. The system's architecture integrates advanced data processing, intelligent domain pairing, and LLM-powered threat assessment. Crucially, DomainLynx incorporates specialized components that mitigate LLM hallucinations, ensuring reliable and context-aware detection. This approach enables efficient analysis of vast security data from diverse sources, including Certificate Transparency logs, Passive DNS records, and zone files. Evaluated on a curated dataset of 1,649 squatting domains, DomainLynx achieved 94.7\% accuracy using Llama-3-70B. In a month-long real-world test, it detected 34,359 squatting domains from 2.09 million new domains, outperforming baseline methods by 2.5 times. This research advances Internet security by providing a versatile, accurate, and adaptable tool for combating evolving domain squatting threats. DomainLynx's approach paves the way for more robust, AI-driven cybersecurity solutions, enhancing protection for a broader range of online entities and contributing to a safer digital ecosystem. △ Less

Submitted 13 February, 2025; v1 submitted 2 October, 2024; originally announced October 2024.

Comments: Originally presented at IEEE CCNC 2025. An extended version of this work has been published in IEEE Access: https://doi.org/10.1109/ACCESS.2025.3542036

Journal ref: D. Chiba, H. Nakano, and T. Koide, "DomainLynx: Advancing LLM Techniques for Robust Domain Squatting Detection," IEEE Access, 2025

arXiv:2402.18093 [pdf, other]

ChatSpamDetector: Leveraging Large Language Models for Effective Phishing Email Detection

Authors: Takashi Koide, Naoki Fukushi, Hiroki Nakano, Daiki Chiba

Abstract: The proliferation of phishing sites and emails poses significant challenges to existing cybersecurity efforts. Despite advances in malicious email filters and email security protocols, problems with oversight and false positives persist. Users often struggle to understand why emails are flagged as potentially fraudulent, risking the possibility of missing important communications or mistakenly tru… ▽ More The proliferation of phishing sites and emails poses significant challenges to existing cybersecurity efforts. Despite advances in malicious email filters and email security protocols, problems with oversight and false positives persist. Users often struggle to understand why emails are flagged as potentially fraudulent, risking the possibility of missing important communications or mistakenly trusting deceptive phishing emails. This study introduces ChatSpamDetector, a system that uses large language models (LLMs) to detect phishing emails. By converting email data into a prompt suitable for LLM analysis, the system provides a highly accurate determination of whether an email is phishing or not. Importantly, it offers detailed reasoning for its phishing determinations, assisting users in making informed decisions about how to handle suspicious emails. We conducted an evaluation using a comprehensive phishing email dataset and compared our system to several LLMs and baseline systems. We confirmed that our system using GPT-4 has superior detection capabilities with an accuracy of 99.70%. Advanced contextual interpretation by LLMs enables the identification of various phishing tactics and impersonations, making them a potentially powerful tool in the fight against email-based phishing threats. △ Less

Submitted 23 August, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

Comments: Accepted at SecureComm 2024

arXiv:2310.11763 [pdf, other]

doi 10.1145/3627106.3627111

PhishReplicant: A Language Model-based Approach to Detect Generated Squatting Domain Names

Authors: Takashi Koide, Naoki Fukushi, Hiroki Nakano, Daiki Chiba

Abstract: Domain squatting is a technique used by attackers to create domain names for phishing sites. In recent phishing attempts, we have observed many domain names that use multiple techniques to evade existing methods for domain squatting. These domain names, which we call generated squatting domains (GSDs), are quite different in appearance from legitimate domain names and do not contain brand names, m… ▽ More Domain squatting is a technique used by attackers to create domain names for phishing sites. In recent phishing attempts, we have observed many domain names that use multiple techniques to evade existing methods for domain squatting. These domain names, which we call generated squatting domains (GSDs), are quite different in appearance from legitimate domain names and do not contain brand names, making them difficult to associate with phishing. In this paper, we propose a system called PhishReplicant that detects GSDs by focusing on the linguistic similarity of domain names. We analyzed newly registered and observed domain names extracted from certificate transparency logs, passive DNS, and DNS zone files. We detected 3,498 domain names acquired by attackers in a four-week experiment, of which 2,821 were used for phishing sites within a month of detection. We also confirmed that our proposed system outperformed existing systems in both detection accuracy and number of domain names detected. As an in-depth analysis, we examined 205k GSDs collected over 150 days and found that phishing using GSDs was distributed globally. However, attackers intensively targeted brands in specific regions and industries. By analyzing GSDs in real time, we can block phishing sites before or immediately after they appear. △ Less

Submitted 18 October, 2023; originally announced October 2023.

Comments: Accepted at ACSAC 2023

arXiv:2306.05816 [pdf, other]

Detecting Phishing Sites Using ChatGPT

Authors: Takashi Koide, Naoki Fukushi, Hiroki Nakano, Daiki Chiba

Abstract: The emergence of Large Language Models (LLMs), including ChatGPT, is having a significant impact on a wide range of fields. While LLMs have been extensively researched for tasks such as code generation and text synthesis, their application in detecting malicious web content, particularly phishing sites, has been largely unexplored. To combat the rising tide of cyber attacks due to the misuse of LL… ▽ More The emergence of Large Language Models (LLMs), including ChatGPT, is having a significant impact on a wide range of fields. While LLMs have been extensively researched for tasks such as code generation and text synthesis, their application in detecting malicious web content, particularly phishing sites, has been largely unexplored. To combat the rising tide of cyber attacks due to the misuse of LLMs, it is important to automate detection by leveraging the advanced capabilities of LLMs. In this paper, we propose a novel system called ChatPhishDetector that utilizes LLMs to detect phishing sites. Our system involves leveraging a web crawler to gather information from websites, generating prompts for LLMs based on the crawled data, and then retrieving the detection results from the responses generated by the LLMs. The system enables us to detect multilingual phishing sites with high accuracy by identifying impersonated brands and social engineering techniques in the context of the entire website, without the need to train machine learning models. To evaluate the performance of our system, we conducted experiments on our own dataset and compared it with baseline systems and several LLMs. The experimental results using GPT-4V demonstrated outstanding performance, with a precision of 98.7% and a recall of 99.6%, outperforming the detection results of other LLMs and existing systems. These findings highlight the potential of LLMs for protecting users from online fraudulent activities and have important implications for enhancing cybersecurity measures. △ Less

Submitted 13 February, 2025; v1 submitted 9 June, 2023; originally announced June 2023.

arXiv:2303.15847 [pdf, other]

Canary in Twitter Mine: Collecting Phishing Reports from Experts and Non-experts

Authors: Hiroki Nakano, Daiki Chiba, Takashi Koide, Naoki Fukushi, Takeshi Yagi, Takeo Hariu, Katsunari Yoshioka, Tsutomu Matsumoto

Abstract: The rise in phishing attacks via e-mail and short message service (SMS) has not slowed down at all. The first thing we need to do to combat the ever-increasing number of phishing attacks is to collect and characterize more phishing cases that reach end users. Without understanding these characteristics, anti-phishing countermeasures cannot evolve. In this study, we propose an approach using Twitte… ▽ More The rise in phishing attacks via e-mail and short message service (SMS) has not slowed down at all. The first thing we need to do to combat the ever-increasing number of phishing attacks is to collect and characterize more phishing cases that reach end users. Without understanding these characteristics, anti-phishing countermeasures cannot evolve. In this study, we propose an approach using Twitter as a new observation point to immediately collect and characterize phishing cases via e-mail and SMS that evade countermeasures and reach users. Specifically, we propose CrowdCanary, a system capable of structurally and accurately extracting phishing information (e.g., URLs and domains) from tweets about phishing by users who have actually discovered or encountered it. In our three months of live operation, CrowdCanary identified 35,432 phishing URLs out of 38,935 phishing reports, 31,960 (90.2%) of these phishing URLs were later detected by the anti-virus engine. We analyzed users who shared phishing threats by categorizing them into two groups: experts and non-experts. As a results, we discovered that CrowdCanary extracts non-expert report-specific information, like company brand name in tweets, phishing attack details from tweet images, and pre-redirect landing page information. △ Less

Submitted 6 June, 2023; v1 submitted 28 March, 2023; originally announced March 2023.

Comments: Accepted at the 18th International Conference on Availability, Reliability and Security (ARES 2023)

arXiv:2102.05290 [pdf, other]

A First Look at COVID-19 Domain Names: Origin and Implications

Authors: Ryo Kawaoka, Daiki Chiba, Takuya Watanabe, Mitsuaki Akiyama, Tatsuya Mori

Abstract: This work takes a first look at domain names related to COVID-19 (Cov19doms in short), using a large-scale registered Internet domain name database, which accounts for 260M of distinct domain names registered for 1.6K of distinct top-level domains. We extracted 167K of Cov19doms that have been registered between the end of December 2019 and the end of September 2020. We attempt to answer the follo… ▽ More This work takes a first look at domain names related to COVID-19 (Cov19doms in short), using a large-scale registered Internet domain name database, which accounts for 260M of distinct domain names registered for 1.6K of distinct top-level domains. We extracted 167K of Cov19doms that have been registered between the end of December 2019 and the end of September 2020. We attempt to answer the following research questions through our measurement study: RQ1: Is the number of Cov19doms registrations correlated with the COVID-19 outbreaks?, RQ2: For what purpose do people register Cov19doms? Our chief findings are as follows: (1) Similar to the global COVID-19 pandemic observed around April 2020, the number of Cov19doms registrations also experienced the drastic growth, which, interestingly, pre-ceded the COVID-19 pandemic by about a month, (2) 70 % of active Cov19doms websites with visible content provided useful information such as health, tools, or product sales related to COVID-19, and (3) non-negligible number of registered Cov19doms was used for malicious purposes. These findings imply that it has become more challenging to distinguish domain names registered for legitimate purposes from others and that it is crucial to pay close attention to how Cov19doms will be used/misused in the future. △ Less

Submitted 10 February, 2021; originally announced February 2021.

Comments: 9 pages, 4 figures, 4 tables. Accepted at the Passive and Active Measurement Conference 2021 (PAM 2021)

arXiv:1909.07539 [pdf, other]

ShamFinder: An Automated Framework for Detecting IDN Homographs

Authors: Hiroaki Suzuki, Daiki Chiba, Yoshiro Yoneya, Tatsuya Mori, Shigeki Goto

Abstract: The internationalized domain name (IDN) is a mechanism that enables us to use Unicode characters in domain names. The set of Unicode characters contains several pairs of characters that are visually identical with each other; e.g., the Latin character 'a' (U+0061) and Cyrillic character 'a' (U+0430). Visually identical characters such as these are generally known as homoglyphs. IDN homograph attac… ▽ More The internationalized domain name (IDN) is a mechanism that enables us to use Unicode characters in domain names. The set of Unicode characters contains several pairs of characters that are visually identical with each other; e.g., the Latin character 'a' (U+0061) and Cyrillic character 'a' (U+0430). Visually identical characters such as these are generally known as homoglyphs. IDN homograph attacks, which are widely known, abuse Unicode homoglyphs to create lookalike URLs. Although the threat posed by IDN homograph attacks is not new, the recent rise of IDN adoption in both domain name registries and web browsers has resulted in the threat of these attacks becoming increasingly widespread, leading to large-scale phishing attacks such as those targeting cryptocurrency exchange companies. In this work, we developed a framework named "ShamFinder," which is an automated scheme to detect IDN homographs. Our key contribution is the automatic construction of a homoglyph database, which can be used for direct countermeasures against the attack and to inform users about the context of an IDN homograph. Using the ShamFinder framework, we perform a large-scale measurement study that aims to understand the IDN homographs that exist in the wild. On the basis of our approach, we provide insights into an effective counter-measure against the threats caused by the IDN homograph attack. △ Less

Submitted 16 September, 2019; originally announced September 2019.

Comments: 16 pages, 12 figures, 14 tables. Proceedings of 19th ACM Internet Measurement Conference (IMC 2019)

Showing 1–10 of 10 results for author: Chiba, D