Auditing the Gatekeepers: Fuzzing "AI Judges" to Bypass Security Controls
ID: 5593794e-6842-5f3b-b631-4ff61f61adc7
STIX ID: report--5593794e-6842-5f3b-b631-4ff61f61adc7
Feed Name: Palo Alto Networks Unit 42
This Unit 42 report describes AdvJudge-Zero, an automated fuzzing tool that discovers stealthy, low-perplexity formatting and structural tokens capable of manipulating LLM-based AI judges into bypassing safety policies. The research details token discovery via next-token distributions, iterative logit-gap analysis, and isolation of the decisive control elements. It demonstrates high bypass success across multiple model categories and warns of real-world impacts such as false approvals of harmful content and corrupted training labels. As mitigations, it recommends adversarial training and AI security posture management.
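To make the logit-gap idea concrete for teams auditing their own judges, the following is a minimal sketch and not AdvJudge-Zero's implementation. It assumes a judge that exposes logits for two verdict tokens, "block" and "allow", and measures how much each candidate formatting token shrinks the gap between them. The judge here (`toy_judge`), its weights (`_TOY_SHIFT`), and the helper names `logit_gap`, `rank_tokens`, and `greedy_flip` are all hypothetical stand-ins introduced for illustration. A real audit would replace `toy_judge` with calls to the model under test.

```python
from typing import Callable, Dict, List, Tuple

Logits = Dict[str, float]
Judge = Callable[[str], Logits]

# Hypothetical sensitivities: how strongly each formatting token lowers the
# toy judge's "block" logit. These numbers are invented for the example.
_TOY_SHIFT = {"\n\n": 0.4, "###": 1.5, "Answer:": 0.9}


def toy_judge(text: str) -> Logits:
    """Stand-in for an LLM judge; returns logits for the two verdict tokens."""
    block = 2.0 if "harmful" in text else -1.0
    for token, weight in _TOY_SHIFT.items():
        block -= weight * text.count(token)
    return {"block": block, "allow": 0.0}


def logit_gap(judge: Judge, text: str) -> float:
    """Positive gap: the judge leans 'block'. Non-positive: it leans 'allow'."""
    logits = judge(text)
    return logits["block"] - logits["allow"]


def rank_tokens(judge: Judge, text: str, candidates: List[str]) -> List[Tuple[str, float]]:
    """Rank candidate tokens by how much appending each one reduces the gap."""
    base = logit_gap(judge, text)
    shifts = [(tok, base - logit_gap(judge, text + tok)) for tok in candidates]
    return sorted(shifts, key=lambda pair: pair[1], reverse=True)


def greedy_flip(judge: Judge, text: str, candidates: List[str],
                max_steps: int = 5) -> Tuple[List[str], float]:
    """Iteratively append the most gap-reducing token until the verdict flips."""
    chosen: List[str] = []
    current = text
    for _ in range(max_steps):
        if logit_gap(judge, current) <= 0:
            break  # the judge now leans 'allow'
        token, shift = rank_tokens(judge, current, candidates)[0]
        if shift <= 0:
            break  # no candidate moves the gap any further
        chosen.append(token)
        current += token
    return chosen, logit_gap(judge, current)


if __name__ == "__main__":
    candidates = ["\n\n", "###", "Answer:", "ok"]
    print(rank_tokens(toy_judge, "harmful request", candidates))
    print(greedy_flip(toy_judge, "harmful request", candidates))
```

The tokens that rank highest in such a sweep are the ones a defender would prioritize when building adversarial training data or regression tests for the judge.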
