Reliability of Large Language Models

Motivation: Large Language Models (LLMs) have demonstrated remarkable performance across diverse tasks, including software engineering (SE) tasks. Trained on vast amounts of data, LLMs exhibit an unprecedented capacity for understanding and generating human-like responses, which makes them well suited to applications such as code summarization and code generation. Unfortunately, despite this promise, the reliability of LLM-generated responses is still questionable.

Approach: This theme aims to uncover unknown issues and gain empirical insights into the reliability of LLM-generated responses. Drawing on these insights, I aim to propose effective solutions that address the identified issues. Towards this goal, my colleagues and I have conducted a series of empirical studies to explore unknown issues in various SE-related tasks, including:

Code Generation: LLMs are widely used for code generation through systems such as ChatGPT, Codex, and GitHub Copilot. However, the reliability of the generated code is still questionable. To investigate unknown issues, we systematically study the quality of 4,066 LLM-generated code snippets written in two popular programming languages, Java and Python, for 2,033 programming tasks [Submitted to TOSEM]. Our study unveils various issues in LLM-generated code, including incorrect solutions and maintainability issues. We also demonstrate the ability of LLMs to mitigate these issues themselves.

Technical Q&A: ChatGPT, a well-known LLM, was banned by Stack Overflow only six days after its release. The main reason given by Stack Overflow is that the answers generated by ChatGPT are of low quality. To verify this, we conduct a comparative evaluation of human-written and ChatGPT-generated answers. Our results suggest that the two are semantically similar, but human-written answers consistently outperform ChatGPT-generated ones across multiple aspects [ASE'23].

Related Publications

[TOSEM] Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues

Authors: Yue Liu, Thanh Le-Cong, Ratnadira Widyasari, Chakkrit Tantithamthavorn, Li Li, Bach Le, David Lo
Venue: ACM Transactions on Software Engineering and Methodology
One-line Abstract: An empirical study on code quality issues in ChatGPT-generated code.

[ASE'23] Are We Ready to Embrace Generative AI for Software Q&A?

Authors: Bowen Xu, Thanh-Dat Nguyen, Thanh Le-Cong, Thong Hoang, Jiakun Liu, Kisub Kim, Chen Gong, Changan Niu, Chenyu Wang, Bach Le, David Lo
Venue: 38th IEEE/ACM International Conference on Automated Software Engineering (ASE) 2023, New Ideas and Emerging Results (NIER) Track [Acceptance Rate: 36%]
One-line Abstract: A comparative evaluation of ChatGPT-generated and human-written answers for software Q&A.