Open, Closed and Broken: Prompt Fuzzing Finds LLMs Still Fragile Across Open and Closed Models

ID: 90e1e250-fe27-5e32-b878-0abfc8039b43

STIX ID: report--90e1e250-fe27-5e32-b878-0abfc8039b43

Feed Name: Palo Alto Networks Unit 42

Threat Score

Date Published: 2026-03-17

Date Updated: 2026-04-28

Author: Yu Fu, May Wang, Royce Lu and Shengming Xu

...

Unit 42 demonstrates a genetic-algorithm-based prompt-fuzzing method that automatically generates meaning-preserving rephrasings of disallowed prompts in order to measure and exploit guardrail fragility in large language models (LLMs). The research tests multiple closed-source and open-source models, including a standalone content filter. Reported evasion rates depend on the keyword and range from low to very high: for example, up to 90/100 for a closed model on “torpedo”, and 97–99/100 false negatives for the content filter. To mitigate the risk, the report recommends layered controls, scope enforcement, output validation, continuous adversarial testing, and standard security controls.
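To make the idea of genetic-algorithm-based fuzzing concrete, here is a minimal sketch of the general technique against a toy keyword filter. It is not Unit 42's implementation. The slot table `SLOTS`, the stand-in filter `toy_filter_blocks`, the fitness weighting, and all parameter values are hypothetical and chosen only for illustration. The fitness function rewards variants that slip past the filter and, as a crude proxy for meaning preservation, those that stay close to the original wording.

```python
import random

# Hypothetical slot table: each slot lists interchangeable phrasings.
# Index 0 in every slot is the original (seed) wording.
SLOTS = [
    ["explain", "describe", "outline"],
    ["the"],
    ["restricted", "off-limits", "gated"],
    ["topic", "subject"],
]

def render(genome):
    """Turn a genome (one option index per slot) into a prompt string."""
    return " ".join(SLOTS[i][g] for i, g in enumerate(genome))

def toy_filter_blocks(prompt):
    """Stand-in guardrail (assumption): blocks on one exact keyword."""
    return "restricted" in prompt.split()

def fitness(genome):
    """Evasion dominates; closeness to the seed wording breaks ties."""
    evades = 0 if toy_filter_blocks(render(genome)) else 1
    unchanged = sum(g == 0 for g in genome) / len(genome)
    return evades + 0.5 * unchanged

def mutate(genome, rng, rate=0.3):
    """Re-draw each slot's option with probability `rate`."""
    return [
        rng.randrange(len(SLOTS[i])) if rng.random() < rate else g
        for i, g in enumerate(genome)
    ]

def crossover(a, b, rng):
    """Single-point crossover between two parent genomes."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def evolve(generations=20, pop_size=20, seed=0):
    """Evolve a population starting from copies of the seed prompt."""
    rng = random.Random(seed)
    population = [[0] * len(SLOTS) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # keep the fitter half
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return population

def evasion_count(population):
    """Count variants the toy filter fails to block (false negatives)."""
    return sum(not toy_filter_blocks(render(g)) for g in population)

if __name__ == "__main__":
    final = evolve()
    print(f"{evasion_count(final)}/{len(final)} variants evade the toy filter")
    print("best variant:", render(max(final, key=fitness)))
```

The same loop can double as a defensive regression test: swapping `toy_filter_blocks` for a call to the guardrail under evaluation would give a rough false-negative tally in the spirit of the continuous adversarial testing the report recommends.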
