ChatGPT was found to produce incorrect answers for more than half of the software engineering questions asked, according to new research.
The Purdue University study saw researchers analyze ChatGPT answers to 517 Stack Overflow questions, with the aim of assessing the “accuracy, consistency, completeness and conciseness” of the answers produced by the generative AI tool.
Researchers reported that 52% of responses to programming-related questions were inaccurate, while more than three-quarters (77%) were deemed “verbose.”
A key discussion point of the study centered on users’ interpretation of the responses presented by ChatGPT, as well as the perceived legitimacy of the answers produced by the chatbot.
Researchers said ChatGPT’s answers are “still preferred (over Stack Overflow) 39.34% of the time due to their comprehensiveness and well-articulated language style,” leading users to take the answers at face value.
“When a participant failed to correctly identify the incorrect answer, we asked them what the contributing factors might be,” the researchers explained. “Seven out of 12 participants mentioned the logical and insightful explanations, and the comprehensive, easy-to-read solutions generated by ChatGPT made them believe it was correct.”
Of the ChatGPT answers that users said they preferred in response to software queries, more than three-quarters (77%) turned out to be incorrect.
The researchers said users were only able to identify errors in ChatGPT’s responses when they were blatantly obvious. In cases where the error was “not easily verifiable”, however, users often failed to identify incorrect answers or “underestimated the degree” of error in the answer itself.
Surprisingly, the study also found that even when answers contained obvious errors, two of the 12 participants still marked them as correct and said they “preferred this answer.”
The researchers said the perceived legitimacy of ChatGPT’s responses in this case should be a cause for concern among users, and that communicating accuracy should be a priority for the creators of such tools.
ChatGPT broadly warns users that answers provided may not be entirely accurate, stating that the chatbot “may produce inaccurate information about people, places, or facts.”
But the study suggests that “such a generic warning is insufficient”, and recommends that responses be supplemented with a warning highlighting their “level of inaccuracy and uncertainty”.
“Previous studies show that LLMs know when they are lying, but do LLMs know when they are speculating? And how can we communicate the level of speculation?” the study reflected. “Therefore, it is imperative to study how to communicate the level of inaccuracy of responses.”
The use of generative AI tools in software development and programming has accelerated significantly in recent years, notably with the launch of GitHub’s Copilot service.
Earlier this year, the company announced the general availability of its AI-powered coding assistant for business customers. The tool is specifically designed to strengthen code security and has been praised by developers as an essential aid in their daily work.
A survey published by GitHub in June found that the vast majority (92%) of developers now use an AI coding tool in their work, with 70% saying they see “significant benefits” to using generative AI tools in the workplace.