NAT Logo

NAT: Learning to Attack Neurons for Enhanced Adversarial Transferability
WACV 2025

VITA Lab, EPFL, Switzerland

Introduction

NAT Teaser

Figure 1: Comparison of Attack Strategies. On the left, we illustrate how prior single-generator methods, such as LTP and BIA, attack the entire embedding but predominantly disrupt neurons related to a single concept (e.g., circular text patterns), leaving most other neurons largely unaffected. In contrast, our framework on the right trains multiple generators, each targeting an individual neuron that represents a distinct concept. By attacking neurons that represent low-level concepts, our method not only generates highly transferable perturbations but also produces diverse, complementary attack patterns.

🔥 Highlights

  1. NAT Introduction. The generation of transferable adversarial perturbations typically involves training a generator to maximize embedding separation between clean and adversarial images at a single mid-layer of a source model. In this work, we build on this approach and introduce Neuron Attack for Transferability (NAT), a method designed to target specific neurons within the embedding. Our approach is motivated by the observation that previous layer-level optimizations often disproportionately focus on a few neurons representing similar concepts, leaving other neurons within the attacked layer minimally affected. NAT shifts the focus from embedding-level separation to a more fundamental, neuron-specific approach. We find that targeting individual neurons effectively disrupts the core units of the neural network, providing a common basis for transferability across different models.

  2. NAT Framework. NAT trains a UNet-based perturbation generator that maps input images to adversarial images in a single forward pass at attack time. The generator is trained to maximize the $L_2$ separation between the clean and adversarial activations of a specific neuron (i.e., a channel of the feature map); a minimal sketch of this objective appears after this list.

  3. Extensive Evaluation. We conduct a rigorous evaluation on 41 ImageNet-pretrained models, covering 16 convolutional networks, 8 Transformer architectures, and 17 hybrid architectures. We also evaluate transferability on nine models trained on fine-grained datasets. We demonstrate that a single neuron-specific adversarial generator achieves over 14% improvement in transferability in the cross-model setting and 4% improvement in the cross-domain setting. Additionally, by leveraging the complementary attack capabilities of NAT's generators, we show that adversarial transferability can be significantly enhanced with fewer than 10 queries to the target model.
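
To make the training objective concrete, below is a minimal PyTorch sketch of the neuron-level $L_2$ separation loss. The source model, hooked layer, channel index, perturbation budget, and the tiny stand-in generator are illustrative assumptions rather than the released NAT code.

```python
# Minimal sketch of NAT's neuron-level training objective (PyTorch).
# NOTE: the source model, hooked layer, channel index, budget, and the
# tiny stand-in generator below are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models

# Frozen ImageNet-pretrained source model; we hook one mid-layer feature map.
source = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
for p in source.parameters():
    p.requires_grad_(False)

feats = {}
# Index echoes the "Layer 18" mentioned in Fig. 4; the paper's indexing may differ.
source.features[18].register_forward_hook(
    lambda mod, inp, out: feats.update({"map": out}))  # out: (B, C, H, W)

neuron_idx = 42   # hypothetical channel ("neuron") to attack
eps = 10 / 255    # assumed L_inf perturbation budget

# Tiny stand-in for the UNet-based generator; tanh keeps output in [-1, 1].
generator = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1), nn.Tanh())
opt = torch.optim.Adam(generator.parameters(), lr=1e-4)

def neuron_separation_loss(x):
    """Negative L2 distance between the clean and adversarial activations
    of the targeted channel (minimizing it maximizes separation)."""
    with torch.no_grad():                 # clean activations, no gradient
        source(x)
        clean = feats["map"][:, neuron_idx].clone()
    x_adv = torch.clamp(x + eps * generator(x), 0.0, 1.0)
    source(x_adv)                         # gradients flow back to the generator
    adv = feats["map"][:, neuron_idx]
    return -(adv - clean).flatten(1).norm(dim=1).mean()

# One illustrative training step on random data (ImageNet normalization
# and real data loading are omitted for brevity).
x = torch.rand(4, 3, 224, 224)
loss = neuron_separation_loss(x)
opt.zero_grad(); loss.backward(); opt.step()
```

In the full framework, one such generator is trained per targeted neuron, yielding a pool of complementary attackers.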

Method Overview

Our NAT framework trains multiple perturbation generators, each targeting a specific neuron within a chosen layer of the source model. During training, each generator learns to maximize the $L_2$ separation between the clean and adversarial activations at the targeted neuron. This neuron-specific approach lets NAT disrupt the distinct concepts represented by individual neurons, producing diverse and complementary adversarial patterns. At inference time, the generators can be employed independently or in combination to produce adversarial examples with high transferability across target models; a hypothetical sketch of such a combined attack follows Figure 2.

Figure 2: Overview of the NAT framework for training neuron-specific perturbation generators.
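
As a concrete illustration of combining generators, the hypothetical sketch below spends one query on the target model per generator and keeps the most effective perturbation. The generator pool, target model, and $L_\infty$ budget are assumed inputs, not part of the paper's released interface.

```python
# Hypothetical multi-query attack using a pool of NAT generators: one
# query to the target per generator, keeping the strongest perturbation.
# `generators`, `target_model`, and `eps` are assumed inputs.
import torch

@torch.no_grad()
def multi_query_attack(x, y_true, generators, target_model, eps=10 / 255):
    best_adv, best_fooled = x, -1
    for gen in generators:                        # each generator attacks one neuron
        x_adv = torch.clamp(x + eps * gen(x), 0.0, 1.0)
        pred = target_model(x_adv).argmax(dim=1)  # one query to the target
        fooled = (pred != y_true).sum().item()    # samples now misclassified
        if fooled > best_fooled:
            best_adv, best_fooled = x_adv, fooled
    return best_adv
```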


📊 Quantitative Results

Evaluation in the Cross-Model Setting: We present a comprehensive quantitative evaluation of our NAT method against the state-of-the-art baselines LTP and BIA across a diverse set of 41 ImageNet-pretrained models. A single neuron-specific generator achieves over 14% improvement in transferability in the cross-model setting and 4% improvement in the cross-domain setting.

Evaluation in the Cross-Domain Setting: We report adversarial accuracy (in %) across three fine-grained datasets. Our generators substantially outperform the baselines in deceiving the target networks in the single-query ($k = 1$) and multi-query ($k = 10$ and $k = 40$) settings.

Cross-Model Performance

Cross-Domain Transferability

Figure 3: Transferability Heatmap. Cross-architecture evaluation showing adversarial transferability from our neuron-specific generators (y-axis) across 41 target models (x-axis).

Generator-Model Heatmap

🎯 Qualitative Results

We present qualitative comparisons of adversarial examples generated by NAT against those produced by the state-of-the-art baselines LTP and BIA. The results highlight the superior transferability of NAT's perturbations across various target models, including ConvNeXt, DeiT, and BEiT. Notably, NAT generates perturbations that are more visually diverse and more effective at misleading different architectures, demonstrating its robustness and versatility.

Qualitative Results
Figure 4: Qualitative Comparison. The adversarial images shown are unbounded (i.e., generated without clipping) to visualize the raw perturbations. The specific neuron targeted within Layer 18 is indicated below each image. Best viewed in color and zoomed in.

Citation

@InProceedings{Nakka_2025_WACV,
    author    = {Nakka, Krishna Kanth and Alahi, Alexandre},
    title     = {NAT: Learning to Attack Neurons for Enhanced Adversarial Transferability},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {7582-7593}
}

Acknowledgement

This website is adapted from Nerfies, licensed under a CC BY-SA 4.0 License. We thank the authors of CDA and BIA for releasing their pretrained models.