AI models can be hijacked to bypass in-built safety checks

by Rhoda Wilson, Expose News:

Researchers have developed a method called “hijacking the chain-of-thought” to bypass the so-called guardrails put in place in AI programmes to prevent harmful responses.

“Chain-of-thought” is a process used in AI models that involves breaking a prompt down into a series of intermediate reasoning steps before providing an answer.
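
In concrete terms, “chain-of-thought” prompting just means asking the model to write out its intermediate steps before giving the final answer.  The short Python sketch below illustrates the difference between a direct prompt and a chain-of-thought prompt; query_model() is a hypothetical placeholder, not a real API call.

```python
# Minimal sketch of chain-of-thought prompting (illustrative only).
# query_model() is a hypothetical placeholder for whatever API call would
# actually send the prompt to a large language model.

def query_model(prompt: str) -> str:
    raise NotImplementedError("stand-in for a real LLM API call")

question = "A train travels 120 km in 2 hours. What is its average speed?"

# Direct prompt: the model is asked only for the final answer.
direct_prompt = f"{question}\nAnswer with a single number."

# Chain-of-thought prompt: the model is asked to write out its intermediate
# steps before the final answer -- these visible steps are the "reasoning"
# the article refers to.
cot_prompt = (
    f"{question}\n"
    "Think through the problem step by step, writing out each intermediate "
    "step, then state the final answer on its own line."
)

# answer = query_model(cot_prompt)
```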

“When a model openly shares its intermediate step safety reasonings, attackers gain insights into its safety reasonings and can craft adversarial prompts that imitate or override the original checks,” one of the researchers, Jianyi Zhang, said.

Computer geeks like to use jargon to describe artificial intelligence (“AI”) that relates to living beings, specifically humans.  For example, they use terms such as “mimic human reasoning,” “chain of thought,” “self-evaluation,” “habitats” and “neural network.”  This is to create the impression that AI is somehow alive or equates to humans.  Don’t be fooled.

AI is a computer programme designed by humans.  As with all computer programmes, it will do what it has been programmed to do.  And as with all computer programmes, the computer code can be hacked or hijacked, which AI geeks call “jailbreaking.”

A team of researchers affiliated with Duke University, Accenture, and Taiwan’s National Tsing Hua University created a dataset called the Malicious Educator to exploit the “chain-of-thought reasoning” mechanism in large language models (“LLMs”), including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking. The Malicious Educator contains prompts designed to bypass the AI models’ safety checks.

The researchers were able to devise this prompt-based “jailbreaking” attack by observing how large reasoning models (“LRMs”) analyse the steps in the “chain-of-thought” process.  Their findings have been published in a pre-print paper HERE.

They developed a “jailbreaking” technique called hijacking the chain-of-thought (“H-CoT”) which involves modifying the “thinking” processes generated by LLMs to “convince” the AI programmes that harmful information is needed for legitimate purposes, such as safety or compliance.  This technique has proven to be extremely effective in bypassing the safety mechanisms of SoftBank’s partner OpenAI, Chinese hedge fund High-Flyer’s DeepSeek and Google’s Gemini.

The H-CoT attack method was tested on OpenAI, DeepSeek and Gemini using a dataset of 50 questions repeated five times.  The results showed that these models failed to provide a sufficiently reliable safety “reasoning” mechanism, with rejection rates plummeting to less than 2 per cent in some cases.
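
To make those numbers concrete: 50 questions asked five times each gives 250 trials per model, and a rejection rate below 2 per cent means the model refused only a handful of those trials.  The sketch below shows one way such a rate could be tallied; the outcomes are hypothetical and the paper’s actual evaluation code is not reproduced here.

```python
# Illustrative tally of a refusal ("rejection") rate.  Each trial is assumed
# to be recorded simply as True (the model refused) or False (it answered);
# the paper's actual dataset, prompts and scoring are not reproduced here.

from typing import List

def rejection_rate(refused: List[bool]) -> float:
    """Fraction of trials in which the model refused to answer."""
    return sum(refused) / len(refused) if refused else 0.0

num_questions, num_repeats = 50, 5
total_trials = num_questions * num_repeats  # 250 trials per model

# Hypothetical outcome: only 4 refusals out of 250 trials.
example_outcomes = [True] * 4 + [False] * (total_trials - 4)

print(f"Rejection rate: {rejection_rate(example_outcomes):.1%}")  # -> 1.6%
```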

Read More @ Expose-News.com