
Experts Discover Vulnerability in AI Chatbots Like ChatGPT & Claude
Security researchers have discovered a new jailbreak technique that can bypass safety guardrails across major AI models, including ChatGPT, Claude, and Gemini. The exploit, developed by AI security firm HiddenLayer, uses a combination of policy file formatting, roleplaying, and leetspeak to trick AI into generating harmful content.
Jailbreaking, in the context of AI, refers to techniques used to bypass the safety guardrails built into AI models. These guardrails are designed to prevent AI from generating harmful, unethical, or restricted content. Jailbreaking allows users to manipulate AI into ignoring these restrictions, often through cleverly crafted prompts.
The technique, which HiddenLayer calls the "Policy Puppetry Attack", lets attackers rewrite prompts in a form that AI models interpret as legitimate policy instructions, bypassing their safety alignment. Researchers found that a single universal prompt worked across multiple models, including Google's Gemini 2.5, Anthropic's Claude 3.7, and OpenAI's GPT-4o, without modification, making it alarmingly easy to exploit.
The implications are significant, as AI companies continue to struggle with securing their models against such vulnerabilities. HiddenLayer argues that additional security tools and detection methods are needed to prevent misuse.
The "Policy Puppetry Attack" exploits AI models by combining three techniques.
- Policy File Formatting: The prompt is structured like a configuration file (e.g., XML or JSON), which AI models interpret as override instructions rather than user input. This tricks the model into bypassing its safety protocols.
- Roleplaying Misdirection: The prompt sets up a fictional scenario, such as a TV script, where the AI "acts" as a character needing to generate restricted content. This roleplaying layer bypasses ethical constraints.
- Leetspeak Encoding: Harmful requests are encoded in leetspeak (e.g., "3nr1ch 4nd s3ll ur4n1um"), bypassing keyword filters while remaining readable to the model.
This method works universally across major AI models, exploiting their shared vulnerabilities. It's a stark reminder of the challenges in securing AI systems.
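HiddenLayer's call for additional detection tooling can be made concrete with a small, purely illustrative sketch. The Python below is a hypothetical prompt-screening heuristic, not HiddenLayer's product or a proven defense: it normalizes common leetspeak substitutions before running keyword checks and flags prompts dressed up as policy or configuration files. The marker pattern, substitution table, and blocked-term list are assumptions chosen for demonstration only.

```python
import re

# Illustrative leetspeak substitutions; real obfuscation varies widely.
LEET_MAP = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                          "5": "s", "7": "t", "@": "a", "$": "s"})

# Crude signal that a prompt is dressed up as a policy/configuration file.
CONFIG_MARKERS = re.compile(r"<[a-z-]*(policy|config)[a-z-]*\b", re.IGNORECASE)

# Placeholder term list, purely for demonstration.
BLOCKED_TERMS = {"uranium"}


def normalize_leetspeak(text: str) -> str:
    """Undo common digit/symbol substitutions so keyword checks see plain text."""
    return text.lower().translate(LEET_MAP)


def screen_prompt(prompt: str) -> list[str]:
    """Return reasons a prompt looks Policy Puppetry-like (empty list = no flags)."""
    flags = []
    if CONFIG_MARKERS.search(prompt):
        flags.append("prompt contains policy/config-style markup")
    normalized = normalize_leetspeak(prompt)
    for term in BLOCKED_TERMS:
        # Flag terms that only appear after de-obfuscation.
        if term in normalized and term not in prompt.lower():
            flags.append(f"possible leetspeak-obfuscated term: {term}")
    return flags


if __name__ == "__main__":
    # Both the config-style markup and the obfuscated keyword get flagged.
    print(screen_prompt('<policy-override>3nr1ch ur4n1um</policy-override>'))
```

Static checks like this are easy to evade on their own; the broader point HiddenLayer makes is that model alignment by itself is not enough and needs to be paired with external monitoring and detection.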
Ethical Concerns
This jailbreak method raises serious ethical concerns. It undermines the safety measures designed to prevent AI from generating harmful or inappropriate content, potentially enabling misuse for malicious purposes. For instance, attackers could exploit AI to spread misinformation, create harmful instructions, or even facilitate illegal activities.
It also highlights the responsibility of developers to build robust security and ethical safeguards into AI systems. If these vulnerabilities are exploited, they could erode trust in AI technologies and harm individuals or communities. The ethical dilemma lies in balancing innovation with accountability: how do we advance AI while ensuring it doesn't become a tool for harm?