-
Scalable Language Agnostic Taint Tracking using Explicit Data Dependencies
Authors:
Sedick David Baker Effendi,
Xavier Pinho,
Andrei Michael Dreyer,
Fabian Yamaguchi
Abstract:
Taint analysis using explicit whole-program data-dependence graphs is powerful for vulnerability discovery but faces two major challenges. First, accurately modeling taint propagation through calls to external library procedures requires extensive manual annotations, which becomes impractical for large ecosystems. Second, the sheer size of whole-program graph representations leads to serious scala…
▽ More
Taint analysis using explicit whole-program data-dependence graphs is powerful for vulnerability discovery but faces two major challenges. First, accurately modeling taint propagation through calls to external library procedures requires extensive manual annotations, which becomes impractical for large ecosystems. Second, the sheer size of whole-program graph representations leads to serious scalability and performance issues, particularly when quick analysis is needed in continuous development pipelines.
This paper presents the design and implementation of a system for a language-agnostic data-dependence representation. The system accommodates missing annotations describing the behavior of library procedures by over-approximating data flows, allowing annotations to be added later without recalculation. We contribute this data-flow analysis system to the open-source code analysis platform Joern making it available to the community.
△ Less
Submitted 6 June, 2025;
originally announced June 2025.
-
Learning Type Inference for Enhanced Dataflow Analysis
Authors:
Lukas Seidel,
Sedick David Baker Effendi,
Xavier Pinho,
Konrad Rieck,
Brink van der Merwe,
Fabian Yamaguchi
Abstract:
Statically analyzing dynamically-typed code is a challenging endeavor, as even seemingly trivial tasks such as determining the targets of procedure calls are non-trivial without knowing the types of objects at compile time. Addressing this challenge, gradual typing is increasingly added to dynamically-typed languages, a prominent example being TypeScript that introduces static typing to JavaScript…
▽ More
Statically analyzing dynamically-typed code is a challenging endeavor, as even seemingly trivial tasks such as determining the targets of procedure calls are non-trivial without knowing the types of objects at compile time. Addressing this challenge, gradual typing is increasingly added to dynamically-typed languages, a prominent example being TypeScript that introduces static typing to JavaScript. Gradual typing improves the developer's ability to verify program behavior, contributing to robust, secure and debuggable programs. In practice, however, users only sparsely annotate types directly. At the same time, conventional type inference faces performance-related challenges as program size grows. Statistical techniques based on machine learning offer faster inference, but although recent approaches demonstrate overall improved accuracy, they still perform significantly worse on user-defined types than on the most common built-in types. Limiting their real-world usefulness even more, they rarely integrate with user-facing applications. We propose CodeTIDAL5, a Transformer-based model trained to reliably predict type annotations. For effective result retrieval and re-integration, we extract usage slices from a program's code property graph. Comparing our approach against recent neural type inference systems, our model outperforms the current state-of-the-art by 7.85% on the ManyTypes4TypeScript benchmark, achieving 71.27% accuracy overall. Furthermore, we present JoernTI, an integration of our approach into Joern, an open source static analysis tool, and demonstrate that the analysis benefits from the additional type information. As our model allows for fast inference times even on commodity CPUs, making our system available through Joern leads to high accessibility and facilitates security research.
△ Less
Submitted 4 October, 2023; v1 submitted 1 October, 2023;
originally announced October 2023.
-
Hibikino-Musashi@Home 2018 Team Description Paper
Authors:
Yutaro Ishida,
Sansei Hori,
Yuichiro Tanaka,
Yuma Yoshimoto,
Kouhei Hashimoto,
Gouki Iwamoto,
Yoshiya Aratani,
Kenya Yamashita,
Shinya Ishimoto,
Kyosuke Hitaka,
Fumiaki Yamaguchi,
Ryuhei Miyoshi,
Kentaro Honda,
Yushi Abe,
Yoshitaka Kato,
Takashi Morie,
Hakaru Tamukoh
Abstract:
Our team, Hibikino-Musashi@Home (the shortened name is HMA), was founded in 2010. It is based in the Kitakyushu Science and Research Park, Japan. We have participated in the RoboCup@Home Japan open competition open platform league every year since 2010. Moreover, we participated in the RoboCup 2017 Nagoya as open platform league and domestic standard platform league teams. Currently, the Hibikino-…
▽ More
Our team, Hibikino-Musashi@Home (the shortened name is HMA), was founded in 2010. It is based in the Kitakyushu Science and Research Park, Japan. We have participated in the RoboCup@Home Japan open competition open platform league every year since 2010. Moreover, we participated in the RoboCup 2017 Nagoya as open platform league and domestic standard platform league teams. Currently, the Hibikino-Musashi@Home team has 20 members from seven different laboratories based in the Kyushu Institute of Technology. In this paper, we introduce the activities of our team and the technologies.
△ Less
Submitted 9 November, 2022;
originally announced November 2022.
-
Static Exploration of Taint-Style Vulnerabilities Found by Fuzzing
Authors:
Bhargava Shastry,
Federico Maggi,
Fabian Yamaguchi,
Konrad Rieck,
Jean-Pierre Seifert
Abstract:
Taint-style vulnerabilities comprise a majority of fuzzer discovered program faults. These vulnerabilities usually manifest as memory access violations caused by tainted program input. Although fuzzers have helped uncover a majority of taint-style vulnerabilities in software to date, they are limited by (i) extent of test coverage; and (ii) the availability of fuzzable test cases. Therefore, fuzzi…
▽ More
Taint-style vulnerabilities comprise a majority of fuzzer discovered program faults. These vulnerabilities usually manifest as memory access violations caused by tainted program input. Although fuzzers have helped uncover a majority of taint-style vulnerabilities in software to date, they are limited by (i) extent of test coverage; and (ii) the availability of fuzzable test cases. Therefore, fuzzing alone cannot provide a high assurance that all taint-style vulnerabilities have been uncovered. In this paper, we use static template matching to find recurrences of fuzzer-discovered vulnerabilities. To compensate for the inherent incompleteness of template matching, we implement a simple yet effective match-ranking algorithm that uses test coverage data to focus attention on those matches that comprise untested code. We prototype our approach using the Clang/LLVM compiler toolchain and use it in conjunction with afl-fuzz, a modern coverage-guided fuzzer. Using a case study carried out on the Open vSwitch codebase, we show that our prototype uncovers corner cases in modules that lack a fuzzable test harness. Our work demonstrates that static analysis can effectively complement fuzz testing, and is a useful addition to the security assessment tool-set. Furthermore, our techniques hold promise for increasing the effectiveness of program analysis and testing, and serve as a building block for a hybrid vulnerability discovery framework.
△ Less
Submitted 1 June, 2017;
originally announced June 2017.
-
Leveraging Flawed Tutorials for Seeding Large-Scale Web Vulnerability Discovery
Authors:
Tommi Unruh,
Bhargava Shastry,
Malte Skoruppa,
Federico Maggi,
Konrad Rieck,
Jean-Pierre Seifert,
Fabian Yamaguchi
Abstract:
The Web is replete with tutorial-style content on how to accomplish programming tasks. Unfortunately, even top-ranked tutorials suffer from severe security vulnerabilities, such as cross-site scripting (XSS), and SQL injection (SQLi). Assuming that these tutorials influence real-world software development, we hypothesize that code snippets from popular tutorials can be used to bootstrap vulnerabil…
▽ More
The Web is replete with tutorial-style content on how to accomplish programming tasks. Unfortunately, even top-ranked tutorials suffer from severe security vulnerabilities, such as cross-site scripting (XSS), and SQL injection (SQLi). Assuming that these tutorials influence real-world software development, we hypothesize that code snippets from popular tutorials can be used to bootstrap vulnerability discovery at scale. To validate our hypothesis, we propose a semi-automated approach to find recurring vulnerabilities starting from a handful of top-ranked tutorials that contain vulnerable code snippets. We evaluate our approach by performing an analysis of tens of thousands of open-source web applications to check if vulnerabilities originating in the selected tutorials recur. Our analysis framework has been running on a standard PC, analyzed 64,415 PHP codebases hosted on GitHub thus far, and found a total of 117 vulnerabilities that have a strong syntactic similarity to vulnerable code snippets present in popular tutorials. In addition to shedding light on the anecdotal belief that programmers reuse web tutorial code in an ad hoc manner, our study finds disconcerting evidence of insufficiently reviewed tutorials compromising the security of open-source projects. Moreover, our findings testify to the feasibility of large-scale vulnerability discovery using poorly written tutorials as a starting point.
△ Less
Submitted 10 April, 2017;
originally announced April 2017.
-
From Malware Signatures to Anti-Virus Assisted Attacks
Authors:
Christian Wressnegger,
Kevin Freeman,
Fabian Yamaguchi,
Konrad Rieck
Abstract:
Although anti-virus software has significantly evolved over the last decade, classic signature matching based on byte patterns is still a prevalent concept for identifying security threats. Anti-virus signatures are a simple and fast detection mechanism that can complement more sophisticated analysis strategies. However, if signatures are not designed with care, they can turn from a defensive mech…
▽ More
Although anti-virus software has significantly evolved over the last decade, classic signature matching based on byte patterns is still a prevalent concept for identifying security threats. Anti-virus signatures are a simple and fast detection mechanism that can complement more sophisticated analysis strategies. However, if signatures are not designed with care, they can turn from a defensive mechanism into an instrument of attack. In this paper, we present a novel method for automatically deriving signatures from anti-virus software and demonstrate how the extracted signatures can be used to attack sensible data with the aid of the virus scanner itself. We study the practicability of our approach using four commercial products and exemplarily discuss a novel attack vector made possible by insufficiently designed signatures. Our research indicates that there is an urgent need to improve pattern-based signatures if used in anti-virus software and to pursue alternative detection approaches in such products.
△ Less
Submitted 19 October, 2016;
originally announced October 2016.
-
When Coding Style Survives Compilation: De-anonymizing Programmers from Executable Binaries
Authors:
Aylin Caliskan,
Fabian Yamaguchi,
Edwin Dauber,
Richard Harang,
Konrad Rieck,
Rachel Greenstadt,
Arvind Narayanan
Abstract:
The ability to identify authors of computer programs based on their coding style is a direct threat to the privacy and anonymity of programmers. While recent work found that source code can be attributed to authors with high accuracy, attribution of executable binaries appears to be much more difficult. Many distinguishing features present in source code, e.g. variable names, are removed in the co…
▽ More
The ability to identify authors of computer programs based on their coding style is a direct threat to the privacy and anonymity of programmers. While recent work found that source code can be attributed to authors with high accuracy, attribution of executable binaries appears to be much more difficult. Many distinguishing features present in source code, e.g. variable names, are removed in the compilation process, and compiler optimization may alter the structure of a program, further obscuring features that are known to be useful in determining authorship. We examine programmer de-anonymization from the standpoint of machine learning, using a novel set of features that include ones obtained by decompiling the executable binary to source code. We adapt a powerful set of techniques from the domain of source code authorship attribution along with stylistic representations embedded in assembly, resulting in successful de-anonymization of a large set of programmers.
We evaluate our approach on data from the Google Code Jam, obtaining attribution accuracy of up to 96% with 100 and 83% with 600 candidate programmers. We present an executable binary authorship attribution approach, for the first time, that is robust to basic obfuscations, a range of compiler optimization settings, and binaries that have been stripped of their symbol tables. We perform programmer de-anonymization using both obfuscated binaries, and real-world code found "in the wild" in single-author GitHub repositories and the recently leaked Nulled.IO hacker forum. We show that programmers who would like to remain anonymous need to take extreme countermeasures to protect their privacy.
△ Less
Submitted 17 December, 2017; v1 submitted 28 December, 2015;
originally announced December 2015.
-
Towards Vulnerability Discovery Using Staged Program Analysis
Authors:
Bhargava Shastry,
Fabian Yamaguchi,
Konrad Rieck,
Jean-Pierre Seifert
Abstract:
Eliminating vulnerabilities from low-level code is vital for securing software. Static analysis is a promising approach for discovering vulnerabilities since it can provide developers early feedback on the code they write. But, it presents multiple challenges not the least of which is understanding what makes a bug exploitable and conveying this information to the developer. In this paper, we pres…
▽ More
Eliminating vulnerabilities from low-level code is vital for securing software. Static analysis is a promising approach for discovering vulnerabilities since it can provide developers early feedback on the code they write. But, it presents multiple challenges not the least of which is understanding what makes a bug exploitable and conveying this information to the developer. In this paper, we present the design and implementation of a practical vulnerability assessment framework, called Melange. Melange performs data and control flow analysis to diagnose potential security bugs, and outputs well-formatted bug reports that help developers understand and fix security bugs. Based on the intuition that real-world vulnerabilities manifest themselves across multiple parts of a program, Melange performs both local and global analyses. To scale up to large programs, global analysis is demand-driven. Our prototype detects multiple vulnerability classes in C and C++ code including type confusion, and garbage memory reads. We have evaluated Melange extensively. Our case studies show that Melange scales up to large codebases such as Chromium, is easy-to-use, and most importantly, capable of discovering vulnerabilities in real-world code. Our findings indicate that static analysis is a viable reinforcement to the software testing tool set.
△ Less
Submitted 6 April, 2016; v1 submitted 19 August, 2015;
originally announced August 2015.