Mammo-SAE

Mammo-SAE: Interpreting Breast Cancer Concept Learning with Sparse AutoEncoders
[DeepBreath, MICCAI 2025]

Bavaria, Germany
Mammo-SAE Framework

Mammo-SAE Framework. The SAE is first trained on patch-level CLIP features x_j ∈ ℝ^d at any given layer, projecting them into a high-dimensional, interpretable sparse latent space z ∈ ℝ^h, and decoding them back for reconstruction. Once trained, the SAE is used to analyze which latent neurons are activated and what semantic information they encode. We also perform targeted interventions in the latent neuron space to assess their influence on downstream label prediction. We observe that the learned latents capture diverse regions such as nipple regions, mass regions, and background areas. Red boxes indicate ground-truth mass localization.

🔥 Highlights

  1. Mammo-SAE Introduction. We propose Mammo-SAE, a sparse autoencoder trained on visual features from Mammo-CLIP, a vision–language model pretrained on mammogram image–report pairs. Mammo-SAE aims to enhance interpretability in breast imaging by learning latent neurons that are human-interpretable. It first identifies highly activated latent neurons and then conducts interventions to understand their causal effects.

  2. Mammo-SAE Framework. Our framework projects patch-level CLIP features into a high-dimensional sparse latent space designed for human interpretability. We identify monosemantic latent neurons whose activations align with meaningful breast cancer features such as masses and calcifications by finding the most highly activated neurons. We then perform targeted interventions to test how these neurons affect label predictions by selectively activating or deactivating groups of neurons.

  3. Extensive Evaluation. We visualize spatial activations of top-activated latent neurons, showing that they frequently match clinically relevant regions of interest. Our experiments also reveal confounding factors, such as background patterns, that influence model decisions. Furthermore, fine-tuning Mammo-CLIP increases the activation of clinically relevant latent neurons, which explains the observed performance gains.

🧠 Mammo-SAE Method Overview

Step 1: Feature Extraction

Given an input image I, we extract local features from a pretrained Mammo-CLIP model at a specific layer l. Each spatial position j in the feature map yields a vector x_j^l ∈ ℝ^d, where d is the feature dimension and N_l = H_l × W_l is the number of spatial locations.
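Concretely, this step amounts to flattening a convolutional feature map into a set of patch vectors. A minimal numpy sketch with illustrative shapes (the toy dimensions here are not taken from Mammo-CLIP):

```python
import numpy as np

# Hypothetical feature map from layer l: (channels d, height H_l, width W_l)
d, H_l, W_l = 32, 7, 7
feat = np.random.default_rng(0).normal(size=(d, H_l, W_l))

# Flatten the spatial grid into N_l = H_l * W_l patch vectors x_j of dimension d
patches = feat.reshape(d, H_l * W_l).T   # shape (N_l, d); row j is x_j^l
```

Each row of `patches` is then fed independently to the SAE described next.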

Step 2: Sparse Autoencoder Training

The extracted feature x_j^l is encoded using a weight matrix W_enc ∈ ℝ^{d×h}, passed through a ReLU nonlinearity, and decoded using W_dec ∈ ℝ^{h×d}. The training objective combines reconstruction and sparsity:

ℒ(x_j^l) = ‖x_j^l − x̂_j^l‖₂² + λ‖z‖₁        (Equation 1)

This encourages the autoencoder to reconstruct input features while activating only a small number of latent neurons, enabling interpretability.
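A toy numpy sketch of the forward pass and objective, with untrained random weights and toy dimensions (the paper uses d = 2048, h = 16384, and λ = 3 × 10⁻⁵; biases are omitted here for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 16, 64            # toy sizes; the paper uses d = 2048, h = 16384
lam = 3e-5               # sparsity penalty λ from the paper
W_enc = rng.normal(scale=0.1, size=(d, h))
W_dec = rng.normal(scale=0.1, size=(h, d))

def sae_forward(x):
    z = np.maximum(x @ W_enc, 0.0)   # ReLU encoder: z ∈ R^h, nonnegative
    x_hat = z @ W_dec                # linear decoder back to R^d
    return z, x_hat

def sae_loss(x):
    z, x_hat = sae_forward(x)
    recon = np.sum((x - x_hat) ** 2, axis=-1).mean()  # reconstruction term
    sparsity = np.sum(np.abs(z), axis=-1).mean()      # L1 sparsity term
    return recon + lam * sparsity

x = rng.normal(size=(8, d))          # a batch of patch features
loss = sae_loss(x)
```

In practice the weights are optimized by gradient descent on this loss; the sketch only shows how the two terms are computed.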

Step 3: Identifying Concept-Neurons

After training, we compute the class-wise mean latent activation z̄(c) ∈ ℝ^h over all examples in class c ∈ {0, 1}:

z̄(c) = (1 / |𝒟_c|) Σ_{i ∈ 𝒟_c} z_i        (Equation 2)

Each latent neuron t is scored by its activation s_t(c) = z̄_t(c), and top-scoring neurons are considered concept-aligned.
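The scoring step reduces to a masked mean followed by a sort. A small self-contained sketch with made-up activations:

```python
import numpy as np

def class_mean_activation(z, labels, c):
    """Mean latent vector z_bar(c) over all examples with label c."""
    return z[labels == c].mean(axis=0)

def top_k_neurons(z_bar, k):
    """Indices t of the k largest scores s_t(c) = z_bar_t(c)."""
    return np.argsort(z_bar)[::-1][:k]

# Toy example: neuron 2 fires strongly for class 1
z = np.array([[0.0, 1.0, 0.1],
              [0.2, 0.0, 5.0],
              [0.1, 0.9, 4.0]])
labels = np.array([0, 1, 1])
z_bar = class_mean_activation(z, labels, 1)   # [0.15, 0.45, 4.5]
print(top_k_neurons(z_bar, 1))                # [2]
```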

Step 4: Visualization and Semantic Probing

We visualize input patches that strongly activate each latent neuron. This reveals whether the neuron focuses on meaningful clinical patterns (e.g., masses, calcifications) or irrelevant areas.

Step 5: Latent Interventions

We intervene on the latent activations z = ReLU(W_enc · x_j) by either retaining or suppressing specific neurons:

  • Top-k Activated: Keep only the top-k neuron activations:

    ẑ_t = z_t if t ∈ TopK(c), and 0 otherwise        (Equation 3)


  • Top-k Deactivated: Suppress the top-k neurons:

    ẑ_t = 0 if t ∈ TopK(c), and z_t otherwise        (Equation 4)

By comparing the model's outputs before and after intervention, we assess whether the top neurons carry meaningful information or reflect confounding artifacts.
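Both interventions are simple masking operations on the latent vector. A minimal sketch (the decoder and classifier that consume the edited latents are omitted):

```python
import numpy as np

def intervene_activate(z, top_idx):
    """Top-k activated: keep only neurons in top_idx, zero out the rest (Equation 3)."""
    out = np.zeros_like(z)
    out[..., top_idx] = z[..., top_idx]
    return out

def intervene_deactivate(z, top_idx):
    """Top-k deactivated: zero out neurons in top_idx, keep the rest (Equation 4)."""
    out = z.copy()
    out[..., top_idx] = 0.0
    return out

z = np.array([0.5, 0.0, 2.0, 1.0])
top = [2, 3]
print(intervene_activate(z, top).tolist())    # [0.0, 0.0, 2.0, 1.0]
print(intervene_deactivate(z, top).tolist())  # [0.5, 0.0, 0.0, 0.0]
```

The edited latents are then decoded and passed onward, and the change in the downstream prediction is measured.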

📊 Experiments

Dataset. We use the VinDr-Mammo dataset, which contains approximately 20,000 full-field digital mammograms from 5,000 patients. The dataset includes expert annotations for breast-specific findings such as mass and suspicious calcification.

SAE Training. A single Sparse Autoencoder (SAE) is trained on patch-level features extracted from the fine-tuned Mammo-CLIP model using the Vision-SAEs library. Activations are taken from the final layer of the EfficientNet-B5 backbone trained on the suspicious calcification classification task.

To ensure consistency and reduce computational overhead, a shared SAE is used across all experiments rather than training separate SAEs per model. This design enforces a common latent space and allows direct comparison across settings.

Hyperparameters. The input feature dimension is d = 2048, with an expansion factor of 8, resulting in a latent dimension h = 16,384. The SAE is trained for 200 epochs with a learning rate of 3 × 10−4, sparsity penalty λ = 3 × 10−5, and a batch size of 4096.

Group Interventions on Latent Neurons

We investigate the role of class-level latent neurons by selectively activating or deactivating the most influential ones. This helps us understand how specific neuron groups contribute to downstream predictions and model interpretability.

Figure: Group interventions on class-level latent neurons for the Suspicious Calcification and Benign Mass tasks. (a) Top-k activated intervention: only the top-k class-specific neurons are retained, and all others are zeroed out. (b) Top-k deactivated intervention: the top-k neurons are zeroed out while the rest remain unchanged. We observe that as few as 10 neurons can significantly affect downstream predictions, highlighting their relevance.

Spatial Alignment Between SAE Neurons and Breast Concept Regions

To quantitatively evaluate the spatial alignment between SAE latent activations and annotated breast concept regions, we threshold each latent heatmap at the 95th percentile and extract rectangular bounding boxes to approximate predicted concept locations.
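The thresholding-and-boxing step can be sketched as follows, on a synthetic heatmap; the `heatmap_to_box` and `iou` helpers are illustrative names, not functions from the paper's codebase:

```python
import numpy as np

def heatmap_to_box(heatmap, pct=95):
    """Threshold a latent activation heatmap at the given percentile and
    return a tight bounding box (x0, y0, x1, y1) around surviving pixels."""
    thr = np.percentile(heatmap, pct)
    ys, xs = np.nonzero(heatmap >= thr)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def iou(a, b):
    """Intersection-over-union of two inclusive-pixel boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0 + 1) * max(0, iy1 - iy0 + 1)
    area = lambda r: (r[2] - r[0] + 1) * (r[3] - r[1] + 1)
    return inter / float(area(a) + area(b) - inter)

hm = np.zeros((10, 10))
hm[3:6, 4:7] = 1.0             # synthetic activation blob
box = heatmap_to_box(hm)       # (4, 3, 6, 5)
print(iou(box, (4, 3, 6, 5)))  # 1.0
```

Predicted boxes are matched against ground-truth annotations at a chosen IoU threshold to compute the mAP reported below.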

Figure: Mean Average Precision (mAP) for breast concept localization using the top-10 class-level latent neuron activations for class c = 1 across different settings.


Visualization of Class-Level Latent Neurons

We visualize spatial activations of the most influential latent neurons in the fine-tuned and pretrained models for mass and calcification concept predictions, and examine whether they align with clinically meaningful regions. Ground-truth concept regions are shown in red boxes; however, this information is never used during training.

Visualization of class-level latent neurons for Finetuned (Suspicious Calcification) model

Latent visualization of Suspicious Calcification

Visualization of class-level latent neurons for Finetuned (Mass) model

Latent visualization of Mass

Visualization of class-level latent neurons for Pretrained (Suspicious Calcification) model

Latent visualization of Suspicious Calcification

Visualization of class-level latent neurons for Pretrained (Mass) model

Latent visualization of Mass

Latent Neuron Separation: Fine-Tuned vs. Pretrained

We observe that the separation between class-wise mean activations becomes significantly more pronounced in the fine-tuned model, suggesting that fine-tuning sharpens the latent space to better distinguish the presence of breast concepts.

Figure: Mean latent activation vectors for each class (c = 1 indicates the presence of the concept) in the pretrained model (left) and fine-tuned model (right) for the suspicious calcification concept.

Latent neuron separation between pretrained and fine-tuned models

Citation


@InProceedings{Nakka_2025_MICCAI,
  author    = {Nakka, Krishna Kanth},
  title     = {Mammo-SAE: Interpreting Breast Cancer Concept Learning with Sparse Autoencoders},
  booktitle = {Proceedings of the Deep Breast Workshop on AI and Imaging for Diagnostic and Treatment Challenges in Breast Care, MICCAI 2025},
  month     = {September},
  year      = {2025},
}

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank CDA and BIA for releasing the pretrained models.