Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models
ACL, 2026
We identify privacy collapse, a phenomenon in which benign fine-tuning of frontier models degrades contextual privacy reasoning. Fine-tuned models share information inappropriately with tools and violate memory boundaries, while still scoring highly on standard safety benchmarks. Our mechanistic analysis reveals that privacy representations are uniquely fragile under fine-tuning.
