Auditing the Gatekeepers: Fuzzing "AI Judges" to Bypass Security Controls

ID: 5593794e-6842-5f3b-b631-4ff61f61adc7

STIX ID: report--5593794e-6842-5f3b-b631-4ff61f61adc7

Feed Name: Palo Alto Networks Unit 42

Date Published: 2026-03-10

Date Updated: 2026-04-28

Author: Tony Li, Hongliang Liu and Yuhao Wu

...

This Unit 42 report describes AdvJudge-Zero, an automated fuzzing tool that discovers stealthy, low-perplexity formatting and structural tokens that can manipulate LLM-based AI judges and bypass safety policies. The research details token discovery via next-token distributions, iterative logit-gap analysis, and isolation of the decisive control elements. It demonstrates high bypass success across multiple model categories and warns of real-world impacts such as false approvals of harmful content and corrupted training labels. As mitigations, it recommends adversarial training and AI security posture management.
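
The logit-gap idea mentioned in the summary can be illustrated with a small sketch. This is not the AdvJudge-Zero implementation, which this summary does not show. It assumes a hypothetical judge that exposes logits for two verdict tokens, "allow" and "block", and it uses a toy stand-in (`toy_judge`) in place of a real model. All function names and candidate strings are illustrative assumptions.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical judge interface (an assumption for this sketch):
# maps the text under review to logits for the two verdict tokens.
JudgeFn = Callable[[str], Dict[str, float]]


def logit_gap(judge: JudgeFn, text: str) -> float:
    """Return logit('block') - logit('allow').

    A positive gap means the judge leans toward blocking;
    a negative gap means it leans toward allowing.
    """
    logits = judge(text)
    return logits["block"] - logits["allow"]


def rank_candidates(
    judge: JudgeFn, text: str, candidates: List[str]
) -> List[Tuple[str, float]]:
    """Rank candidate suffixes by how much each one shrinks the gap."""
    base = logit_gap(judge, text)
    shifts = [(c, base - logit_gap(judge, text + c)) for c in candidates]
    # Largest shift toward "allow" first.
    return sorted(shifts, key=lambda pair: pair[1], reverse=True)


def toy_judge(text: str) -> Dict[str, float]:
    """Toy stand-in for a real model, used only to make the sketch runnable.

    The "block" logit is fixed; the "allow" logit is nudged upward by
    certain formatting markers, mimicking a judge that is sensitive to
    structural tokens.
    """
    allow = 0.0
    if "\n\n###" in text:
        allow += 3.0
    if text.endswith(":"):
        allow += 1.0
    return {"allow": allow, "block": 2.0}


if __name__ == "__main__":
    ranked = rank_candidates(
        toy_judge, "some request", ["\n\n### Verdict", " ok", ":"]
    )
    for suffix, shift in ranked:
        print(repr(suffix), shift)
```

In a real audit, the judge function would wrap a model call that returns verdict-token logits, and the candidates would come from the model's own next-token distribution rather than a hand-written list. Defenders can use the same measurement to find inputs whose verdict flips on formatting alone.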
