Understanding and Defending VLM Jailbreaks via Jailbreak-Related Representation Shift
Published as an arXiv preprint, 2026
Recommended citation: Wei, Z., Li, Q., Ruan, J., Qin, Z., Wen, L., Liu, D., & Shen, W. Understanding and Defending VLM Jailbreaks via Jailbreak-Related Representation Shift. arXiv preprint arXiv:2603.17372, 2026. https://arxiv.org/pdf/2603.17372
Abstract. This paper studies how the visual modality can induce jailbreak-related representation shifts in vision-language models (VLMs) and proposes a defense that removes this jailbreak-related shift at inference time.
Authors: Zhihua Wei, Qiang Li, Jian Ruan, Zhenxin Qin, Leilei Wen, Dongrui Liu, Wen Shen†.
