
Adversarial Examples Are Not Bugs, They Are Features
Andrew Ilyas*, Shibani Santurkar*, Dimitris Tsipras*, Logan Engstrom*, Brandon Tran and Aleksander Madry

Massachusetts Institute of Technology madry-lab.ml

Adversarial Examples: A Challenge for ML Systems

Why are ML models so sensitive to small perturbations?

Prevailing theme: They stem from bugs/aberrations

A Simple Experiment

Adversarial perturbation towards “cat”

1. Make an adversarial example towards the other class.
2. Relabel the image as the target class.
3. Train on the new dataset, but test on the original test set.
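A minimal sketch of step 1 in PyTorch, assuming a trained classifier `model` and an L2 threat model; eps, step, and iters are placeholder values, not the paper's exact settings:

    import torch
    import torch.nn.functional as F

    def targeted_pgd(model, x, target, eps=0.5, step=0.1, iters=20):
        # L2-bounded targeted PGD: perturb a batch of images x towards the target class.
        delta = torch.zeros_like(x, requires_grad=True)
        for _ in range(iters):
            loss = F.cross_entropy(model(x + delta), target)
            grad, = torch.autograd.grad(loss, delta)
            # Descend on the target-class loss (targeted attack).
            grad_norm = grad.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-12
            delta = delta - step * grad / grad_norm
            # Project back onto the L2 ball of radius eps.
            delta_norm = delta.flatten(1).norm(dim=1).view(-1, 1, 1, 1)
            delta = delta * torch.clamp(eps / (delta_norm + 1e-12), max=1.0)
            delta = delta.detach().requires_grad_(True)
        return (x + delta).detach()

    # "Mislabeled" training set: perturb each (x, y) towards a new label t != y
    # and keep t as the label:  new_set.append((targeted_pgd(model, x, t), t))

In the paper, the target t is chosen either at random or deterministically (e.g. t = (y + 1) mod #classes); the sketch leaves that choice open.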

[Figure: a training-set “dog” image is adversarially perturbed towards “cat” and relabeled as “cat” to form the new training set; the model is trained on this set and evaluated on the original test set (dog, cat, car, ship, ...).]

So: We train on a totally “mislabeled” dataset but expect performance on a “correct” dataset

Training Data \ Dataset    CIFAR-10    ImageNet_R
Standard Dataset           95.3%       96.6%
“Mislabeled” Dataset       43.7%       64.4%

Result: nontrivial accuracy on the original task

The Robust Features Model

From the “max accuracy” view: All features are good
If non-robust features (NRFs) are (often) good: Models want to use them

Thus: Models use NRFs → adversarial examples
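For reference, the paper formalizes the distinction roughly as follows (paraphrased; here y ∈ {−1, +1}, f maps an input x to a real value, and Δ(x) is the set of allowed perturbations):

    \text{$\rho$-useful feature:}\qquad \mathbb{E}_{(x,y)}\big[\, y \cdot f(x) \,\big] \ge \rho
    \text{$\gamma$-robustly useful feature:}\qquad \mathbb{E}_{(x,y)}\Big[\, \inf_{\delta \in \Delta(x)} y \cdot f(x+\delta) \,\Big] \ge \gamma

A useful, non-robust feature (NRF) is ρ-useful for some ρ > 0 but not γ-robustly useful for any γ ≥ 0: it predicts the label well, yet a small perturbation can flip its correlation.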

The Simple Experiment: A Second Look

[Figure: an original training-set “dog” image has robust features: dog and non-robust features: dog; its adversarial example towards “cat” (relabeled “cat” in the new training set) has robust features: dog but non-robust features: cat.]

RFs misleading but NRFs suffice for generalization

Directly Manipulating Features

“Robust” Data: Standard training → robust models
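Concretely, the “robust” dataset is built by inverting the representation of an already-robust (adversarially trained) model. The sketch below is illustrative only; robust_features, the optimizer settings, and the choice of starting point x_init are assumptions, not the paper's exact recipe:

    import torch

    def robustify(robust_features, x, x_init, steps=1000, lr=0.1):
        # Find an image whose robust-model features match those of x.
        # robust_features: penultimate-layer map of an adversarially trained model
        #                  (used only as a fixed feature extractor).
        # x_init: starting point, e.g. a randomly chosen other training image.
        target = robust_features(x).detach()
        x_r = x_init.clone().requires_grad_(True)
        opt = torch.optim.SGD([x_r], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = (robust_features(x_r) - target).pow(2).sum()
            loss.backward()
            opt.step()
            with torch.no_grad():
                x_r.clamp_(0, 1)  # keep a valid image
        return x_r.detach()

    # "Robust" dataset: pairs (robustify(g, x, x_init), y) for each (x, y);
    # standard training on it yields nontrivially robust models.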

Robust Optimization: Makes NRFs useless for learning
→ Need more data to learn from only RFs (cf. [Schmidt et al., 2018])
→ Trade-off between robustness/accuracy (cf. [Tsipras et al., 2019])
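Robust optimization here means adversarial training: minimize the worst-case loss max_{||δ||_2 ≤ ε} L(x + δ, y; θ) instead of the clean loss. A minimal sketch of one training step, assuming a PyTorch model and optimizer and an L2 threat model (eps, step, iters are placeholder values):

    import torch
    import torch.nn.functional as F

    def adv_train_step(model, opt, x, y, eps=0.5, step=0.1, iters=7):
        # Inner maximization: untargeted L2 PGD on the training loss.
        delta = torch.zeros_like(x, requires_grad=True)
        for _ in range(iters):
            loss = F.cross_entropy(model(x + delta), y)
            grad, = torch.autograd.grad(loss, delta)
            grad_norm = grad.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-12
            delta = delta + step * grad / grad_norm
            delta_norm = delta.flatten(1).norm(dim=1).view(-1, 1, 1, 1)
            delta = (delta * torch.clamp(eps / (delta_norm + 1e-12), max=1.0))
            delta = delta.detach().requires_grad_(True)
        # Outer minimization: ordinary gradient step on the adversarial loss.
        opt.zero_grad()
        F.cross_entropy(model(x + delta.detach()), y).backward()
        opt.step()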

Implications

ML models do not work the way we expect them to

Adversarial examples: A "human-based" phenomenon?

Transfer Attacks: Models rely on similar NRFs

[Figure: transfer success rate (%) vs. test accuracy (%) when trained on D_{y+1}, for VGG-16, Inception-v3, ResNet-18, DenseNet, and ResNet-50.]
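A transfer attack is easy to measure: craft adversarial examples against one (source) model and check how often they fool an independently trained target model. A minimal sketch, where some_attack stands in for any attack run against the source model only:

    import torch

    @torch.no_grad()
    def transfer_success_rate(x_adv, y, target_model):
        # Fraction of source-crafted adversarial examples that the target
        # model (never used to craft them) also misclassifies.
        preds = target_model(x_adv).argmax(dim=1)
        return (preds != y).float().mean().item()

    # x_adv = some_attack(source_model, x, y)   # crafted on the source model
    # rate  = transfer_success_rate(x_adv, y, target_model)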

Interpretability: May need to be enforced at training time

A Theoretical Framework

→ We consider (robust) MLE classification between Gaussians
→ Vulnerability is misalignment between the data geometry and the adversary's (ℓ2) geometry
→ Shows that robust optimization better aligns these geometries
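In this framework (paraphrasing the paper's setup, not quoting it exactly), data is drawn as x ~ N(y·μ*, Σ*) with y uniform on {±1}, and the learner fits Gaussian parameters by standard or robust maximum likelihood:

    \text{standard:}\quad \min_{\mu,\,\Sigma}\; \mathbb{E}_{(x,y)}\big[\,\ell(x;\, y\mu,\, \Sigma)\,\big]
    \text{robust:}\quad \min_{\mu,\,\Sigma}\; \mathbb{E}_{(x,y)}\Big[\,\max_{\|\delta\|_2 \le \varepsilon} \ell(x+\delta;\, y\mu,\, \Sigma)\,\Big]

where ℓ is the Gaussian negative log-likelihood. The classifier's sensitive directions are governed by the Σ^{-1}-induced (Mahalanobis) metric, while the adversary works in ℓ2; increasing ε drives the learned Σ towards the ℓ2 geometry, which is the alignment illustrated in the panels below.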

[Figure: four panels in the (Feature x1, Feature x2) plane — the maximum likelihood estimate with the ℓ2 unit ball and the Σ^{-1}-induced metric unit ball, with samples from N(0, Σ); the true parameters (ε = 0) with samples from N(μ, Σ) and N(−μ, Σ); the robust parameters for ε = 1.0; and the robust parameters for ε = 10.0.]

Moving Forward

→ Do we want our models to rely on NRFs?
→ How should we think of interpretability?
→ Robustness as a goal beyond security/reliability?

Paper: arxiv:1905.02175
Blog: gradsci.org/adv
Python Library: MadryLab/robustness