Understanding and Defending VLM Jailbreaks via Jailbreak-Related Representation Shift

Published in arXiv preprint, 2026

Recommended citation: Wei, Z., Li, Q., Ruan, J., Qin, Z., Wen, L., Liu, D., & Shen, W. Understanding and Defending VLM Jailbreaks via Jailbreak-Related Representation Shift. arXiv preprint arXiv:2603.17372, 2026. https://arxiv.org/pdf/2603.17372

Abstract. This paper studies how the visual modality can induce jailbreak-related representation shifts in vision-language models (VLMs) and proposes a defense that removes the jailbreak-related shift at inference time.

Authors: Zhihua Wei, Qiang Li, Jian Ruan, Zhenxin Qin, Leilei Wen, Dongrui Liu, Wen Shen†.

Download paper here: https://arxiv.org/pdf/2603.17372

Wen Shen (沈雯)