Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models
ACL, 2026
We identify privacy collapse, a phenomenon in which benign fine-tuning of frontier models degrades contextual privacy reasoning. Fine-tuned models share information inappropriately with tools and violate memory boundaries, while still scoring highly on standard safety benchmarks. Our mechanistic analysis reveals that privacy representations are uniquely fragile under fine-tuning.
