FeatureMix: A General Adversarial Defense Method for Pretrained Language Models

Document Type

Conference Proceeding

Publication Date

1-1-2023

Abstract

Pretrained language models (PLMs), which are trained on large-scale data and then finetuned on downstream tasks, have achieved great success. However, they remain vulnerable to adversarial attacks. Adversarial training on both clean and adversarial data is a widely used technique for improving model robustness. In this paper, we propose FeatureMix, a straightforward yet effective adversarial defense strategy for PLMs that finetunes on both discrete adversarial examples and online virtual examples. During finetuning, we first augment the clean data with discrete attacks, then generate virtual examples in each finetuning epoch by randomly mixing local latent features in the hidden layers of augmented data pairs. The virtual examples serve as additional training signals, regularizing the PLMs to favor mixing of latent features between discrete augmented examples and thereby enhancing adversarial robustness. Experimental results show that FeatureMix outperforms prevailing baseline methods in robustness against adversarial attacks without significantly reducing generalization performance.
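
The mixing step described in the abstract can be illustrated with a small sketch. The abstract does not specify how local features are selected or how labels for virtual examples are formed, so everything below is an assumption made for illustration: the function name `mix_local_features`, the per-position random swap, the `mix_prob` parameter, and the proportional soft label are all hypothetical and may differ from the authors' method. Hidden states are represented as plain lists of per-token vectors so that the example runs without any deep learning library.

```python
import random


def mix_local_features(hidden_a, hidden_b, label_a, label_b,
                       mix_prob=0.3, rng=None):
    """Build one virtual example from a pair of augmented examples.

    Assumption (not from the paper): "local" mixing is modeled as
    replacing individual token positions of example A with the
    corresponding hidden vectors of example B, each with probability
    `mix_prob`. The virtual label interpolates the two label
    distributions by the fraction of positions kept from A.
    """
    rng = rng or random.Random()
    if not hidden_a or len(hidden_a) != len(hidden_b):
        raise ValueError("examples must be non-empty and of equal length")

    mixed = []
    taken_from_b = 0
    for vec_a, vec_b in zip(hidden_a, hidden_b):
        if rng.random() < mix_prob:
            mixed.append(list(vec_b))   # take this position from B
            taken_from_b += 1
        else:
            mixed.append(list(vec_a))   # keep this position from A

    # Fraction of positions that still come from example A.
    lam = 1.0 - taken_from_b / len(hidden_a)
    mixed_label = [lam * pa + (1.0 - lam) * pb
                   for pa, pb in zip(label_a, label_b)]
    return mixed, mixed_label


# Toy usage: two "sentences" of 4 tokens with 2-dimensional hidden states.
h_a = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
h_b = [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6], [1.7, 1.8]]
virtual_hidden, virtual_label = mix_local_features(
    h_a, h_b, label_a=[1.0, 0.0], label_b=[0.0, 1.0],
    mix_prob=0.5, rng=random.Random(0),
)
```

In an actual finetuning loop, a step like this would presumably be applied to a hidden layer of the PLM for pairs drawn from the attack-augmented data in each epoch, with the remaining layers run on the mixed representation; that integration is omitted here because the abstract gives no details about it.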
