Software Security

Motivation: Modern software development relies heavily on APIs and third-party components. This reliance increases the security risks of software systems, as APIs and third-party components can contain exploitable vulnerabilities.

Approach: This theme aims to mitigate these risks by creating an advanced software composition analysis solution that scans dependency hierarchies and builds new deep learning architectures to analyse code and document repository data and flag vulnerabilities. Toward this goal, my colleagues and I have designed a comprehensive array of novel automated solutions that leverage data-driven approaches to support developers and security practitioners in various tasks, including:

Identifying vulnerability-fixing commits: In practice, there is often a delay between the time a vulnerability is fixed and the time it is publicly disclosed, creating a risk that users of open-source software (OSS) are unaware of vulnerabilities in their applications. We proposed to automatically identify vulnerability-fixing commits by analyzing each commit both broadly, e.g., using multiple information sources [VulCurator (ESEC/FSE'22)], and deeply, e.g., using multiple granularities of code changes [MiDas (TSE'23)].
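
To make "multiple granularities of code changes" concrete, the following is a minimal sketch that splits one commit's unified diff into file-level, hunk-level, and line-level units. It is an illustration only and not the published implementation: the choice of these three granularities, the names `FileChange`, `Hunk`, and `split_commit`, and the sample diff are all assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Hunk:
    header: str
    added: list[str] = field(default_factory=list)
    removed: list[str] = field(default_factory=list)


@dataclass
class FileChange:
    path: str
    hunks: list[Hunk] = field(default_factory=list)


def split_commit(diff_text: str) -> list[FileChange]:
    """Split a unified diff into file-level, hunk-level, and line-level units.

    Simplification: a changed line whose own content starts with "++ " or
    "-- " would be mistaken for a file header; a real parser must handle that.
    """
    files: list[FileChange] = []
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            # New-file header: start a file-level unit, dropping the "b/" prefix.
            path = line[4:]
            if path.startswith("b/"):
                path = path[2:]
            files.append(FileChange(path=path))
        elif line.startswith("--- "):
            continue  # old-file header, not a removed line
        elif line.startswith("@@") and files:
            files[-1].hunks.append(Hunk(header=line))
        elif files and files[-1].hunks:
            hunk = files[-1].hunks[-1]
            if line.startswith("+"):
                hunk.added.append(line[1:])
            elif line.startswith("-"):
                hunk.removed.append(line[1:])
    return files


# Hypothetical diff used only to exercise the parser.
SAMPLE_DIFF = """\
diff --git a/src/files.py b/src/files.py
--- a/src/files.py
+++ b/src/files.py
@@ -10,2 +10,3 @@
-    path = base + name
+    name = os.path.basename(name)
+    path = os.path.join(base, name)
     return open(path)
"""

for change in split_commit(SAMPLE_DIFF):
    for hunk in change.hunks:
        print(change.path, hunk.header, len(hunk.added), len(hunk.removed))
```

Each level could then be fed to its own feature extractor or classifier; how the levels are encoded and combined is left open here.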

Identifying affected libraries from vulnerability reports: To improve the security of software supply chains, developers must be alerted about vulnerable dependencies or libraries. However, vulnerability reports do not always explicitly list the affected libraries, so identifying them requires manual analysis. To automate this process, researchers have proposed using extreme multi-label learning (XML) to identify libraries from vulnerability reports. We found, however, that previous studies did not consider the chronological order of reports, which can lead to a decline in performance when dealing with unseen libraries. To address this issue, we proposed [Chronos (ICSE'23)], which uses a zero-shot learning model, ZestXML, together with domain-specific data enhancement and time-aware adjustment to automatically identify affected libraries from vulnerability reports.
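
The effect of chronological order can be illustrated with a small sketch: when reports are split by publication date, some libraries in the test period never appear in the training period, which is exactly the case a zero-shot model has to handle. The records, library names, cutoff date, and function names below are made up for illustration and do not come from any real dataset or from the Chronos code.

```python
from datetime import date

# Toy (published, report id, affected libraries) records; all values are made up.
REPORTS = [
    (date(2019, 3, 1), "R1", {"lib-a"}),
    (date(2019, 8, 9), "R2", {"lib-b"}),
    (date(2020, 1, 15), "R3", {"lib-a", "lib-b"}),
    (date(2021, 6, 2), "R4", {"lib-c"}),
    (date(2021, 11, 20), "R5", {"lib-b", "lib-c"}),
]


def chronological_split(reports, cutoff):
    """Train on reports published before `cutoff`, test on the rest."""
    train = [r for r in reports if r[0] < cutoff]
    test = [r for r in reports if r[0] >= cutoff]
    return train, test


def unseen_libraries(train, test):
    """Libraries that appear in test labels but never in training labels."""
    seen = set().union(*(labels for _, _, labels in train))
    tested = set().union(*(labels for _, _, labels in test))
    return tested - seen


train, test = chronological_split(REPORTS, date(2021, 1, 1))
print(sorted(unseen_libraries(train, test)))  # ['lib-c']
```

A random split could place R5 in the training set, so `lib-c` would count as seen and the difficulty of unseen libraries would be hidden from the evaluation.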

Call graph analysis: Call graph analysis is crucial for analyzing how software vulnerabilities propagate through the software supply chain. Unfortunately, program analysis techniques for constructing call graphs are usually imprecise, causing a high false-positive rate in call graph analyses. To address this issue, we proposed a novel technique, [AutoPruner (ESEC/FSE'22)], that incorporates both structural information extracted from the original call graph and statistical semantic information extracted from a large language model, i.e., CodeBERT, to learn how to effectively prune false positives in call graphs.
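
As a rough sketch of why pruning matters for vulnerability propagation, the example below removes low-scoring edges from a toy call graph and then checks which functions remain reachable from an entry point. The edge scores are hand-written placeholders standing in for the output of a learned classifier; the function names, the threshold of 0.5, and the graph itself are assumptions made for illustration, not part of AutoPruner.

```python
from collections import deque

# Hypothetical static call graph: (caller, callee) -> score from some classifier.
# Scores are hand-written placeholders, not output from any real model.
EDGE_SCORES = {
    ("main", "parse"): 0.95,
    ("main", "render"): 0.90,
    ("parse", "unsafe_eval"): 0.10,  # treated here as a likely false positive
    ("render", "escape"): 0.85,
}


def prune(edge_scores, threshold=0.5):
    """Keep only edges whose score meets the threshold."""
    return {edge for edge, score in edge_scores.items() if score >= threshold}


def reachable(edges, start):
    """Breadth-first search over (caller, callee) edges from `start`."""
    successors = {}
    for caller, callee in edges:
        successors.setdefault(caller, []).append(callee)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in successors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


before = reachable(EDGE_SCORES, "main")
after = reachable(prune(EDGE_SCORES), "main")
print("unsafe_eval" in before, "unsafe_eval" in after)  # True False
```

With the spurious edge removed, the vulnerable function is no longer reported as reachable, so a downstream analysis would not raise an alert for it. The threshold trades missed true edges against remaining false positives.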

Related Publications

[TSE] MiDas: Multi-Granularity Detector for Vulnerability Fixes

Authors: Truong Giang Nguyen, Thanh Le-Cong, Hong Jin Kang, Ratnadira Widyasari, Chengran Yang, Zhipeng Zhao, Bowen Xu, Jiayuan Zhou, Xin Xia, Ahmed E. Hassan, Bach Le, and David Lo
Venue: IEEE Transactions on Software Engineering
One-line Abstract: Identifying vulnerability fixes by analyzing code changes at multiple granularities.
Integrated into an internal service of industry partners for managing vulnerabilities.

[ICSE'23] Chronos: Time-Aware Zero-Shot Identification of Libraries from Vulnerability Reports

Authors: Yunbo Lyu⁺, Thanh Le-Cong⁺, Hong Jin Kang, Ratnadira Widyasari, Zhipeng Zhao, Bach Le, Ming Li, and David Lo
Venue: IEEE/ACM 45th International Conference on Software Engineering (ICSE) 2023, Technical Track [Acceptance Rate: 26%]
One-line Abstract: Identifying vulnerable libraries from vulnerability reports via zero-shot learning and domain-specific mechanisms.

[ESEC/FSE'22] AutoPruner: Transformer-Based Call Graph Pruning

Authors: Thanh Le-Cong, Hong Jin Kang, Truong Giang Nguyen, Stefanus Agus Haryono, David Lo, Bach Le, and Thang Huynh Quyet
Venue: ACM 30th Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE) 2022, Research Track [Acceptance Rate: 22%]
One-line Abstract: Pruning false positives in static call graphs via code features learned by a large language model and structural features extracted from the original call graph.

[ESEC/FSE'22] VulCurator: A Vulnerability-Fixing Commit Detector

Authors: Truong Giang Nguyen, Thanh Le-Cong, Hong Jin Kang, Bach Le, and David Lo
Venue: ACM 30th Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE) 2022, Tool Demos Track [Acceptance Rate: 56%]
One-line Abstract: Identifying vulnerability-fixing commits by applying a large language model to multiple sources, including code changes, commit messages, and related issues.