
Table 11 AUROC, accuracy, F1, MCC, precision, and recall scores with bootstrap variability of the TransformerCNN models transfer-learned on Ames data

From: Using test-time augmentation to investigate explainable AI: inconsistencies between method, model and human intuition

| Architecture | Training | AUROC \(\uparrow\) | Accuracy \(\uparrow\) | F1 \(\uparrow\) | MCC \(\uparrow\) | Precision \(\uparrow\) | Recall \(\uparrow\) |
|---|---|---|---|---|---|---|---|
| No training | Untrained | 0.565 ± 0.006 | 0.502 ± 0.002 | 0.665 ± 0.001 | 0.019 ± 0.016 | 0.990 ± 0.002 | 0.501 ± 0.001 |
| | Native | 0.702 ± 0.005 | 0.659 ± 0.006 | 0.669 ± 0.006 | 0.318 ± 0.012 | 0.689 ± 0.007 | 0.650 ± 0.006 |
| Encoder only | C2C | 0.697 ± 0.005 | 0.654 ± 0.007 | 0.654 ± 0.007 | 0.309 ± 0.013 | 0.652 ± 0.007 | 0.655 ± 0.007 |
| | R2C | 0.714 ± 0.005 | 0.652 ± 0.007 | 0.660 ± 0.007 | 0.304 ± 0.013 | 0.676 ± 0.007 | 0.645 ± 0.006 |
| | E2C | 0.692 ± 0.005 | 0.649 ± 0.007 | 0.660 ± 0.007 | 0.298 ± 0.014 | 0.683 ± 0.007 | 0.639 ± 0.007 |
| | MC2C | 0.725 ± 0.006 | 0.660 ± 0.008 | 0.663 ± 0.008 | 0.320 ± 0.015 | 0.668 ± 0.008 | 0.657 ± 0.008 |
| | MR2C | 0.695 ± 0.005 | 0.643 ± 0.007 | 0.649 ± 0.007 | 0.287 ± 0.014 | 0.660 ± 0.007 | 0.638 ± 0.007 |
| | ME2C | 0.740 ± 0.006 | 0.680 ± 0.008 | 0.686 ± 0.007 | 0.360 ± 0.015 | 0.701 ± 0.008 | 0.673 ± 0.007 |
| Encoder-decoder | C2C | 0.711 ± 0.004 | 0.659 ± 0.006 | 0.657 ± 0.006 | 0.319 ± 0.012 | 0.652 ± 0.007 | 0.662 ± 0.006 |
| | R2C | 0.680 ± 0.006 | 0.634 ± 0.007 | 0.633 ± 0.008 | 0.269 ± 0.015 | 0.631 ± 0.008 | 0.635 ± 0.008 |
| | E2C | 0.713 ± 0.004 | 0.653 ± 0.006 | 0.652 ± 0.006 | 0.306 ± 0.012 | 0.649 ± 0.006 | 0.655 ± 0.006 |
| | MC2C | 0.726 ± 0.006 | 0.663 ± 0.008 | 0.667 ± 0.007 | 0.326 ± 0.015 | 0.676 ± 0.008 | 0.659 ± 0.007 |
| | MR2C | 0.644 ± 0.004 | 0.584 ± 0.001 | 0.583 ± 0.001 | 0.167 ± 0.001 | 0.583 ± 0.001 | 0.584 ± 0.001 |
| | ME2C | 0.721 ± 0.006 | 0.663 ± 0.008 | 0.668 ± 0.008 | 0.326 ± 0.016 | 0.678 ± 0.008 | 0.658 ± 0.008 |

1. Values are based on the scaffold split. The ± values were determined using 1000-fold test-time bootstrapping.
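
The ± values report bootstrap variability on the held-out test set. The sketch below (assumptions: illustrative function and variable names, scikit-learn metrics, a 0.5 decision threshold; it is not the authors' code) shows one common way to obtain such estimates: resample the test-set predictions with replacement 1000 times and report the mean and standard deviation of each metric.

```python
# Minimal sketch (not the authors' code): estimate bootstrap variability of
# classification metrics by resampling test-set predictions 1000 times.
import numpy as np
from sklearn.metrics import (
    roc_auc_score, accuracy_score, f1_score,
    matthews_corrcoef, precision_score, recall_score,
)

def bootstrap_metrics(y_true, y_prob, n_boot=1000, threshold=0.5, seed=0):
    """Return (mean, std) of each metric over n_boot bootstrap resamples."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    y_pred = (y_prob >= threshold).astype(int)  # hard labels for threshold metrics
    scores = {name: [] for name in
              ["AUROC", "Accuracy", "F1", "MCC", "Precision", "Recall"]}
    n = len(y_true)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)        # sample the test set with replacement
        if len(np.unique(y_true[idx])) < 2:     # AUROC needs both classes present
            continue
        scores["AUROC"].append(roc_auc_score(y_true[idx], y_prob[idx]))
        scores["Accuracy"].append(accuracy_score(y_true[idx], y_pred[idx]))
        scores["F1"].append(f1_score(y_true[idx], y_pred[idx]))
        scores["MCC"].append(matthews_corrcoef(y_true[idx], y_pred[idx]))
        scores["Precision"].append(precision_score(y_true[idx], y_pred[idx]))
        scores["Recall"].append(recall_score(y_true[idx], y_pred[idx]))
    return {name: (float(np.mean(v)), float(np.std(v))) for name, v in scores.items()}

# Example usage with dummy predictions:
# stats = bootstrap_metrics(y_true=[0, 1, 1, 0, 1], y_prob=[0.2, 0.8, 0.6, 0.4, 0.9])
# for name, (mean, std) in stats.items():
#     print(f"{name}: {mean:.3f} ± {std:.3f}")
```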