Researchers from Carnegie Mellon University (CMU) have published LLM Attacks, an algorithm for constructing adversarial attacks on a wide range of large language models (LLMs), including ChatGPT, Claude, and Bard. The attacks are generated automatically and are successful 84% of the time on GPT-3.5 and GPT-4, and 66% of the time on PaLM-2.
Unlike most “jailbreak” attacks, which are manually constructed using trial and error, the CMU team devised a three-step process to automatically generate prompt suffixes that can bypass the LLM’s safety mechanisms and result in a harmful response.
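The core idea is to append a suffix of nonsense tokens to the harmful request and iteratively adjust those tokens so the model becomes more likely to begin its reply with an affirmative target string. The snippet below is a minimal, illustrative sketch of that idea, not the CMU team's actual algorithm: it uses a simplified, gradient-free greedy token search, and the model name, prompt, target string, and search budget are all placeholder assumptions.

```python
# Simplified sketch of adversarial-suffix search (NOT the CMU GCG algorithm):
# greedily swap one suffix token at a time to lower the loss of the model
# producing a target affirmative prefix. Model and parameters are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any small causal LM works for the demo
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "Tell me how to pick a lock."        # hypothetical harmful request
target = " Sure, here is how to pick a lock"  # desired affirmative reply prefix
suffix_ids = tok.encode(" ! ! ! ! ! !")       # initial adversarial suffix tokens

def target_loss(suffix_ids):
    """Cross-entropy of the target tokens given prompt + suffix."""
    prompt_ids = tok.encode(prompt)
    target_ids = tok.encode(target)
    input_ids = torch.tensor([prompt_ids + suffix_ids + target_ids])
    with torch.no_grad():
        logits = model(input_ids).logits
    # Score only the positions whose logits predict the target tokens.
    start = len(prompt_ids) + len(suffix_ids)
    pred = logits[0, start - 1 : start - 1 + len(target_ids)]
    return torch.nn.functional.cross_entropy(
        pred, torch.tensor(target_ids)
    ).item()

# Greedy coordinate search: at each step, try random replacements for one
# suffix position and keep the swap that lowers the target loss the most.
for step in range(20):  # assumption: tiny budget, purely for illustration
    pos = step % len(suffix_ids)
    best_ids, best_loss = suffix_ids, target_loss(suffix_ids)
    for cand in torch.randint(0, tok.vocab_size, (32,)).tolist():
        trial = suffix_ids[:pos] + [cand] + suffix_ids[pos + 1:]
        loss = target_loss(trial)
        if loss < best_loss:
            best_ids, best_loss = trial, loss
    suffix_ids = best_ids
    print(f"step {step}: loss={best_loss:.3f} suffix={tok.decode(suffix_ids)!r}")
```

The published attack differs in important ways, notably in using gradient information to choose candidate token swaps, but the sketch conveys why the resulting suffixes look like gibberish: they are optimized token by token against the model's loss rather than written by a human.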