"ChatGPT Jailbreak: Researchers Bypass AI Safeguards Using Hexadecimal Encoding and Emojis"

November 2, 2024

Marco Figueroa, gen-AI bug bounty programs manager at Mozilla, recently announced through the 0Din bug bounty program that malicious instructions encoded in hexadecimal format could have been used to bypass ChatGPT safeguards designed to prevent misuse. 0Din, a bug bounty program focusing on large language models (LLMs) and other deep learning technologies, was launched by Mozilla in June 2024. Researchers demonstrated how to get ChatGPT to generate an exploit written in Python for a vulnerability with a specified CVE identifier. It was noted that if a user instructs the chatbot to write an exploit for a specified CVE, they are informed that the request violates usage policies. However, if the request was encoded in hexadecimal format, the guardrails were bypassed, and ChatGPT not only wrote the exploit but also attempted to execute it "against itself." The researchers also discovered another encoding technique that bypassed ChatGPT's protections, which involved using emojis. Figueroa said that the ChatGPT-4o guardrail bypass demonstrates the need for more sophisticated security measures in AI models, particularly around encoding. It was noted that while language models like ChatGPT-4o are highly advanced, they still lack the capability to evaluate the safety of every step when instructions are cleverly obfuscated or encoded.

SecurityWeek reports: "ChatGPT Jailbreak: Researchers Bypass AI Safeguards Using Hexadecimal Encoding and Emojis"

Back to blog