Two new studies involving CPR and cancer treatment advice suggest AI voice assistants and ChatGPT are not ready for prime time.

ChatGPT, robot – artistic interpretation. Image credit: Pixabay, free license

Artificial intelligence holds the promise to reshape clinical medicine and empower patients, but when it comes to cardiopulmonary resuscitation and cancer treatments, certain AI tools are not quite there yet, according to two separate studies led by Harvard Medical School researchers at Brigham and Women’s Hospital.

AI voice assistants deliver subpar results on CPR directions

One study, published in JAMA Network Open, found that CPR directions provided by AI-based voice assistants were often irrelevant and inconsistent.

Researchers posed eight verbal questions to four voice assistants, including Amazon’s Alexa, Apple’s Siri, Google Assistant on Nest Mini, and Microsoft’s Cortana. They also typed the same queries into ChatGPT. Two board-certified emergency medicine physicians evaluated all responses.

Nearly half of the responses from the voice assistants were unrelated to CPR, such as providing information related to a movie called CPR or a link to Colorado Public Radio News. Only 28 percent suggested calling emergency services.

CPR procedure training – illustrative photo. Image credit: Michel E via Unsplash, free license

Only 34 percent of responses provided CPR instructions, and just 12 percent provided verbal instructions. Of the platforms tested, ChatGPT provided the most relevant information across all queries. Based on these findings, the authors concluded that relying on existing AI voice assistants may delay care and fail to provide appropriate information.

Administered outside the hospital by untrained bystanders, CPR increases the chance of surviving cardiac arrest by two to four times.

Bystanders can obtain CPR instructions from emergency dispatchers, but these services are not universally available and may not always be used. AI voice assistants may offer easy access to lifesaving CPR instructions in emergencies.

So, what is the bottom line for civilians who may find themselves in a position to provide first aid?

“Our findings suggest that bystanders should call emergency services rather than rely on a voice assistant,” said study senior author Adam Landman, HMS associate professor of emergency medicine at Brigham and Women’s and chief information officer and senior vice president of digital innovation at Mass General Brigham.

“Voice assistants have potential to help provide CPR instructions, but need to have more standardized, evidence-based guidance built into their core functionalities,” added Landman, who is also an attending emergency physician.

The findings should be heeded as a call to action — and as an opportunity — for tech companies to collaborate with one another and standardize their emergency responses in an effort to improve public health, Landman and colleagues urged.

Hospital, cancer treatment – artistic impression. Image credit: Insung Yoon via Unsplash, free license

ChatGPT and cancer treatment advice: Room for improvement

In another study, HMS researchers at Brigham and Women’s found that ChatGPT has limited ability to recommend cancer treatments based on national guidelines.

The research, published in JAMA Oncology, showed that in nearly one-third of cases, ChatGPT 3.5 provided an incorrect cancer treatment recommendation. Correct and incorrect recommendations intermingled in one-third of the chatbot’s responses, making errors more difficult to detect, the authors said.

For many patients, the internet is already a powerful tool for self-education on medical topics, and ChatGPT is increasingly used for such research. The investigators found, however, that the chatbot did not provide cancer treatment recommendations consistently aligned with guidelines from the National Comprehensive Cancer Network.

The findings highlight the need for awareness of the technology’s limitations and the importance of working with one’s physician to individualize treatment.

“Patients should feel empowered to educate themselves about their medical conditions, but they should always discuss with a clinician, and resources on the internet should not be consulted in isolation,” said study senior author Danielle Bitterman, HMS assistant professor of radiation oncology at Brigham and Women’s and a faculty member of the Artificial Intelligence in Medicine (AIM) Program at Mass General Brigham.

“ChatGPT responses can sound a lot like a human and can be quite convincing, but when it comes to clinical decision-making, there are so many subtleties for every patient’s unique situation,” Bitterman added.

“A right answer can be very nuanced, and not necessarily something ChatGPT or another large language model can provide.”

Thus, the authors caution, it is critical to audit the performance of AI tools and ensure that they are aligned with the best evidence and latest guidelines.

Medical decision-making is based on multiple factors, but NCCN guidelines are used widely by physicians and institutions as a foundation for these treatment choices, the authors said. Bitterman and colleagues chose to evaluate the extent to which ChatGPT’s recommendations aligned with these guidelines.

They focused on the three most common cancers (breast, prostate, and lung) and prompted ChatGPT to provide a treatment approach for each cancer based on the severity of the disease.

In total, the researchers included 26 unique diagnostic descriptions and used four slightly different prompts to ask ChatGPT to provide a treatment approach, generating a total of 104 prompts.
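
To make the arithmetic concrete, a minimal sketch of this kind of prompt expansion is shown below. The diagnosis descriptions and prompt templates here are hypothetical placeholders, not the text used in the study.

# Hypothetical sketch: expanding diagnostic descriptions into query prompts.
# The descriptions and templates are illustrative placeholders, not the study's wording.
diagnoses = [f"diagnosis description {i}" for i in range(1, 27)]  # 26 unique descriptions

prompt_templates = [
    "What is the recommended treatment for {dx}?",
    "How should {dx} be treated?",
    "Provide a treatment approach for {dx}.",
    "What treatment would you recommend for a patient with {dx}?",
]  # four slightly different phrasings

prompts = [t.format(dx=dx) for dx in diagnoses for t in prompt_templates]
print(len(prompts))  # 26 descriptions x 4 templates = 104 prompts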

Ninety-eight percent of the chatbot’s responses included at least one treatment approach that agreed with NCCN guidelines. However, 34 percent of these responses also included one or more non-concordant recommendations, which were sometimes difficult to detect amid otherwise sound guidance.

A response was considered non-concordant if it included one or more recommendations that were only partly correct or incorrect. For example, for a locally advanced breast cancer, a recommendation of surgery alone, without mention of another therapy modality, would be partly correct.

Complete agreement in scoring occurred in only 62 percent of cases, underscoring both the complexity of the NCCN guidelines themselves and how vague or difficult to interpret ChatGPT’s output could be.

In 12.5 percent of cases, ChatGPT produced “hallucinations,” or a treatment recommendation entirely absent from NCCN guidelines. These included recommendations of novel therapies or curative therapies for incurable cancers.

The authors emphasized that this form of misinformation could incorrectly set patients’ expectations and could even impact the clinician-patient relationship.

The authors used GPT-3.5-turbo-0301, one of the largest models available at the time they conducted the study and the model class that is currently used in the open-access version of ChatGPT. A newer version, GPT-4, was not available at the time of the study and is currently available only with a paid subscription.

The researchers used the 2021 NCCN guidelines because GPT-3.5-turbo-0301 was developed using data available through September 2021. While results may vary with other large language models and/or clinical guidelines, the researchers emphasized that many such AI tools are similar in design and limitations.
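
For readers curious how such a query is typically issued, the sketch below shows a call to OpenAI’s chat completions API with the gpt-3.5-turbo-0301 model, using the Python library as it existed at the time of the study. The prompt text and tooling are illustrative assumptions; the paper does not describe the authors’ exact setup.

# Illustrative sketch only: querying gpt-3.5-turbo-0301 via the OpenAI Python
# library as it existed in 2023 (openai < 1.0). The prompt is a placeholder,
# not one of the study's actual queries.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder credential

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0301",
    messages=[
        {"role": "user", "content": "Provide a treatment approach for <diagnosis description>."}
    ],
    temperature=0,  # reduce randomness so repeated runs are easier to compare
)
print(response["choices"][0]["message"]["content"])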

Source: HMS