Zadeh et al. propose a layer similar to radial basis functions (RBFs) to increase a network’s robustness against adversarial examples by rejection. Based on a deep feature extractor, the RBF units compute
$d_k(x) = \|A_k^Tx + b_k\|_p^p$
with per-class parameters $A_k$ and $b_k$. The decision rule remains unchanged, but the outputs no longer resemble probabilities. The full network, i.e., feature extractor and RBF layer, is trained using an adapted loss resembling a max-margin loss:
$J = \sum_i \left(d_{y_i}(x_i) + \sum_{j \neq y_i} \max(0, \lambda - d_j(x_i))\right)$
where $(x_i, y_i)$ is a training example with its label. The loss essentially minimizes the output corresponding to the true class while increasing the outputs of all other classes up to the specified margin $\lambda$. Additionally, noise examples are injected during training. For these noise examples,
$\sum_j \max(0, \lambda - d_j(x))$
is minimized, pushing the distances to all classes above the margin, so that these examples are treated as negatives for every class. This yields a rejection setting in which samples not corresponding to the data distribution (or adversarial examples) can be rejected by the model. In experiments, the proposed method appears to be more robust against FGSM and iterative attacks (as evaluated using Foolbox).
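As an illustration, a minimal PyTorch sketch of such a distance-based layer on top of extracted features might look as follows; the class name, the per-class projection dimension, and the choice $p = 2$ are assumptions for illustration, not details from the paper.

```python
import torch
import torch.nn as nn


class RBFDistanceLayer(nn.Module):
    """Computes d_k(x) = ||A_k^T x + b_k||_p^p for every class k."""

    def __init__(self, feature_dim, num_classes, proj_dim, p=2):
        super().__init__()
        # One projection A_k (feature_dim x proj_dim) and offset b_k per class.
        self.A = nn.Parameter(0.01 * torch.randn(num_classes, feature_dim, proj_dim))
        self.b = nn.Parameter(torch.zeros(num_classes, proj_dim))
        self.p = p

    def forward(self, features):
        # features: (batch, feature_dim) from the deep feature extractor.
        # projected: (batch, num_classes, proj_dim) holding A_k^T x + b_k.
        projected = torch.einsum('bf,kfd->bkd', features, self.A) + self.b
        # d_k(x): sum of |.|^p over the projection dimension, shape (batch, num_classes).
        return projected.abs().pow(self.p).sum(dim=2)
```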
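The loss above, including the term for the injected noise examples, could then be sketched roughly like this (again in PyTorch); the function names, the margin value, and the randomly drawn features in the usage snippet are placeholders, not the paper's actual setup.

```python
import torch
import torch.nn.functional as F


def margin_loss(distances, labels, lam):
    # distances: (batch, num_classes) holding d_j(x_i); labels: (batch,).
    true_dist = distances[torch.arange(distances.size(0)), labels]
    # Hinge term pushes the distances of all other classes above the margin lam.
    hinge = torch.clamp(lam - distances, min=0.0)
    other = 1.0 - F.one_hot(labels, num_classes=distances.size(1)).float()
    return (true_dist + (hinge * other).sum(dim=1)).mean()


def noise_loss(noise_distances, lam):
    # Noise examples act as negatives for every class: all distances are
    # pushed above the margin so that such samples can later be rejected.
    return torch.clamp(lam - noise_distances, min=0.0).sum(dim=1).mean()


# Illustrative usage with the layer sketched above (shapes and values are made up):
features = torch.randn(8, 128)        # deep features of training examples
noise_features = torch.randn(8, 128)  # deep features of injected noise examples
labels = torch.randint(0, 10, (8,))
layer = RBFDistanceLayer(feature_dim=128, num_classes=10, proj_dim=64)
loss = (margin_loss(layer(features), labels, lam=1.0)
        + noise_loss(layer(noise_features), lam=1.0))
```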
Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).