For sensitive problems, such as medical imaging or fraud detection, neural network (NN) adoption has been slow due to concerns about their reliability, leading to a number of algorithms for explaining their decisions. NNs have also been found to be vulnerable to a class of imperceptible attacks, called adversarial examples, which arbitrarily alter the output of the network. Here we demonstrate both that these attacks can invalidate previous attempts to explain the decisions of NNs, and that with very robust networks, the attacks themselves may be leveraged as explanations with greater fidelity to the model. We also show that the introduction of a novel regularization technique inspired by the Lipschitz constraint, alongside other proposed improvements including a half-Huber activation function, greatly improves the resistance of NNs to adversarial examples. On the ImageNet classification task, we demonstrate a network with an accuracy-robustness area (ARA) of 0.0053, an ARA 2.4 times greater than the previous state-of-the-art value. Improving the mechanisms by which NN decisions are understood is an important direction for both establishing trust in sensitive domains and learning more about the stimuli to which NNs respond.
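The abstract names a half-Huber activation function but does not define it. As a minimal sketch, assuming the term means the positive half of a Huber function rescaled to unit slope (zero for non-positive inputs, quadratic up to a threshold, linear beyond it), the activation and its derivative might look like the following. The names `half_huber`, `half_huber_grad` and the threshold parameter `delta` are hypothetical and are not taken from the paper or its reference implementation; the exact form used there may differ.

```python
def half_huber(x: float, delta: float = 1.0) -> float:
    """Positive half of a Huber function, rescaled to slope 1 for large x.

    ASSUMPTION: this piecewise form is an illustrative reading of
    "half-Huber"; the paper's exact definition may differ.
    """
    if delta <= 0:
        raise ValueError("delta must be positive")
    if x <= 0.0:
        return 0.0                    # flat, like ReLU, for non-positive input
    if x <= delta:
        return x * x / (2.0 * delta)  # quadratic region: slope rises from 0 to 1
    return x - delta / 2.0            # linear region: slope is exactly 1


def half_huber_grad(x: float, delta: float = 1.0) -> float:
    """Derivative of half_huber: continuous everywhere and never above 1."""
    if x <= 0.0:
        return 0.0
    if x <= delta:
        return x / delta
    return 1.0


if __name__ == "__main__":
    # Print value and slope at a few points around the two breakpoints.
    for x in (-1.0, 0.0, 0.5, 1.0, 2.0):
        print(x, half_huber(x), half_huber_grad(x))
```

Under this assumed form, the slope has no jump at zero (unlike ReLU) and is bounded by 1, which would sit naturally alongside a Lipschitz-inspired regularizer, although that connection is an interpretation rather than a claim made in the abstract.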
A reference implementation of the techniques presented throughout this work, applied to the CIFAR-10 dataset, can be found at https://github.com/wwoods/adversarial-explanations-cifar.
Finlayson, S. G. et al. Adversarial attacks on medical machine learning. Science 363, 1287–1289 (2019).
Stilgoe, J. Machine learning, social learning and the governance of self-driving cars. Soc. Stud. Sci. 48, 25–56 (2018).
Tsao, H.-Y., Chan, P.-Y. & Su, E. C.-Y. Predicting diabetic retinopathy and identifying interpretable biomedical features using machine learning algorithms. BMC Bioinform. 19, 283 (2018).
Szegedy, C. et al. Intriguing properties of neural networks. Preprint at https://arxiv.org/abs/1312.6199 (2013).
Papernot, N., McDaniel, P. & Goodfellow, I. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. Preprint at https://arxiv.org/abs/1605.07277 (2016).
Selvaraju, R. R. et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision 618–626 (IEEE, 2017).
Ribeiro, M. T., Singh, S. & Guestrin, C. Why should I trust you? Explaining the predictions of any classifier. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 1135–1144 (ACM, 2016).
Simonyan, K., Vedaldi, A. & Zisserman, A. Deep inside convolutional networks: visualising image classification models and saliency maps. Preprint at https://arxiv.org/abs/1312.6034 (2013).
Landecker, W. Interpretable Machine Learning and Sparse Coding for Computer Vision. PhD thesis, Portland State Univ. (2014).
Bau, D., Zhou, B., Khosla, A., Oliva, A. & Torralba, A. Network dissection: quantifying interpretability of deep visual representations. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 6541–6549 (IEEE, 2017).
Bau, D. et al. Visualizing and understanding generative adversarial networks. In International Conference on Learning Representations (ICLR, 2019).
Murdoch, W. J., Singh, C., Kumbier, K., Abbasi-Asl, R. & Yu, B. Interpretable machine learning: definitions, methods, and applications. Preprint at https://arxiv.org/abs/1901.04592 (2019).
Hong, S., You, T., Kwak, S. & Han, B. Online tracking by learning discriminative saliency map with convolutional neural network. In International Conference on Machine Learning 597–606 (ICML, 2015).
Zeiler, M. D. & Fergus, R. Visualizing and understanding convolutional networks. In European Conference on Computer Vision 818–833 (Springer, 2014).
Olshausen, B. A. & Field, D. J. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381, 607–609 (1996).
Luo, W., Li, Y., Urtasun, R. & Zemel, R. Understanding the effective receptive field in deep convolutional neural networks. In Advances in Neural Information Processing Systems 4898–4906 (NIPS, 2016).
Jetley, S., Lord, N. A., Lee, N. & Torr, P. Learn to pay attention. In International Conference on Learning Representations (ICLR, 2018).
Li, K., Wu, Z., Peng, K.-C., Ernst, J. & Fu, Y. Tell me where to look: guided attention inference network. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 9215–9223 (IEEE, 2018).
Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems 5998–6008 (NIPS, 2017).
Cui, Y. et al. Attention-over-attention neural networks for reading comprehension. In Proc. 55th Annual Meeting of the Association for Computational Linguistics Vol. 1, 593–602 (ACL, 2017).
Ghorbani, A., Abid, A. & Zou, J. Interpretation of neural networks is fragile. In Proc. 33rd AAAI Conference on Artificial Intelligence 3681–3688 (AAAI, 2019).
Athalye, A., Engstrom, L., Ilyas, A. & Kwok, K. Synthesizing robust adversarial examples. In Proc. 35th International Conference on Machine Learning (eds Dy, J. & Krause, A.) 284–293 (PMLR, 2018).
Goodfellow, I., Shlens, J. & Szegedy, C. Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR, 2015).
Chakraborty, A., Alam, M., Dey, V., Chattopadhyay, A. & Mukhopadhyay, D. Adversarial attacks and defences: a survey. Preprint at https://arxiv.org/abs/1810.00069 (2018).
Athalye, A., Carlini, N. & Wagner, D. Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. In Proc. 35th International Conference on Machine Learning (eds Dy, J. & Krause, A.) 274–283 (PMLR, 2018).
Khoury, M. & Hadfield-Menell, D. On the geometry of adversarial examples. Preprint at https://arxiv.org/abs/1811.00525 (2018).
Tsipras, D., Santurkar, S., Engstrom, L., Turner, A. & Madry, A. Robustness may be at odds with accuracy. In International Conference on Learning Representations (ICLR, 2019).
Stutz, D., Hein, M. & Schiele, B. Disentangling adversarial robustness and generalization. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 6976–6987 (IEEE, 2019).
Weng, T.-W. et al. Evaluating the robustness of neural networks: an extreme value theory approach. In International Conference on Learning Representations (ICLR, 2018).
Cisse, M., Bojanowski, P., Grave, E., Dauphin, Y. & Usunier, N. Parseval networks: improving robustness to adversarial examples. In Proc. 34th International Conference on Machine Learning 854–863 (JMLR, 2017).
Behrmann, J., Grathwohl, W., Chen, R. T. Q., Duvenaud, D. & Jacobsen, J.-H. Invertible residual networks. In Proc. 36th International Conference on Machine Learning (eds Chaudhuri, K. & Salakhutdinov, R.) 573–582 (PMLR, 2019).
Madry, A., Makelov, A., Schmidt, L., Tsipras, D. & Vladu, A. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations (ICLR, 2018).
Russakovsky, O. et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vision (IJCV) 115, 211–252 (2015).
Carlini, N. & Wagner, D. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy 39–57 (IEEE, 2017).
Pei, K., Cao, Y., Yang, J. & Jana, S. DeepXplore: automated whitebox testing of deep learning systems. In Proc. 26th Symposium on Operating Systems Principles 1–18 (ACM, 2017).
Cohen, J., Rosenfeld, E. & Kolter, Z. Certified adversarial robustness via randomized smoothing. In Proc. 36th International Conference on Machine Learning (eds Chaudhuri, K. & Salakhutdinov, R.) 1310–1320 (PMLR, 2019).
Liao, F. et al. Defense against adversarial attacks using high-level representation guided denoiser. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 1778–1787 (IEEE, 2018).
Kurakin, A. et al. Adversarial attacks and defences competition. In The NIPS’17 Competition: Building Intelligent Systems 195–231 (Springer, 2018).
Tramèr, F. et al. Ensemble adversarial training: attacks and defenses. In International Conference on Learning Representations (ICLR, 2018).
Wong, E., Schmidt, F., Metzen, J. H. & Kolter, J. Z. Scaling provable adversarial defenses. In Advances in Neural Information Processing Systems 8400–8409 (NIPS, 2018).
Su, D. et al. Is robustness the cost of accuracy? A comprehensive study on the robustness of 18 deep image classification models. In European Conference on Computer Vision 631–648 (Springer, 2018).
Krizhevsky, A. & Hinton, G. Learning Multiple Layers of Features from Tiny Images. Technical report, Univ. Toronto (2009).
Rony, J. et al. Decoupling direction and norm for efficient gradient-based L2 adversarial attacks and defenses. In The IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2019).
Shiraishi, J. et al. Development of a digital image database for chest radiographs with and without a lung nodule: receiver operating characteristic analysis of radiologists’ detection of pulmonary nodules. Am. J. Roentgenol. 174, 71–74 (2000).
Lin, T.-Y. et al. Microsoft COCO: common objects in context. In European Conference on Computer Vision 740–755 (Springer, 2014).
Carlini, N. et al. On evaluating adversarial robustness. Preprint at https://arxiv.org/abs/1902.06705 (2019).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).
He, K., Zhang, X., Ren, S. & Sun, J. Identity mappings in deep residual networks. In European Conference on Computer Vision 630–645 (Springer, 2016).
He, T. et al. Bag of tricks for image classification with convolutional neural networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 558–567 (IEEE, 2019).
Paszke, A. et al. Automatic differentiation in PyTorch. In Proc. 31st Conference on Neural Information Processing Systems (NIPS, 2017).
Yamada, Y., Iwamura, M., Akiba, T. & Kise, K. Shakedrop regularization for deep residual learning. Preprint at https://arxiv.org/abs/1802.02375 (2018).
Huang, G., Sun, Y., Liu, Z., Sedra, D. & Weinberger, K. Q. Deep networks with stochastic depth. In European Conference on Computer Vision 646–661 (Springer, 2016).
This work was supported in part by the Center for Brain-Inspired Computing (C-BRIC), one of six centres in the Joint University Microelectronics Program (JUMP), a Semiconductor Research Corporation (SRC) programme sponsored by the Defense Advanced Research Projects Agency (DARPA). W.W. acknowledges additional funding from Defense Threat Reduction Agency (DTRA) (award no. HDTRA1-18-1-0009). J.C. acknowledges funding from the Maseeh College of Engineering & Computer Science’s Undergraduate Research and Mentoring Program and the SRC Education Alliance (award no. 2009-UR-2032G). W.W. and J.C. received funding from F. Maseeh. We thank A. Madry (refs. 27, 32) and J. Cohen (ref. 36) for helpful discussions and clarifications about their work. We thank FuR and A. Parise for assisting with the collection of photos for the examples throughout this work.
The authors declare no competing interests.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Woods, W., Chen, J. & Teuscher, C. Adversarial explanations for understanding image classification decisions and improved neural network robustness. Nat. Mach. Intell. 1, 508–516 (2019). doi:10.1038/s42256-019-0104-6