Using test-time augmentation to investigate explainable AI: inconsistencies between method, model and human intuition

Table 9 AUROC, accuracy, F1, MCC, precision and recall scores of the TransformerCNN models transfer-learned on Ames data

Architecture	Training	AUROC\(\uparrow\)	Accuracy\(\uparrow\)	F1\(\uparrow\)	MCC\(\uparrow\)	Precision\(\uparrow\)	Recall\(\uparrow\)
No training	Untrained	0.564	0.507	0.668	0.055	0.991	0.504
	Native	0.698	0.653	0.665	0.308	0.689	0.643
Encoder only	C2C	0.687	0.639	0.638	0.279	0.635	0.641
	R2C	0.706	0.643	0.652	0.287	0.668	0.636
	E2C	0.685	0.644	0.656	0.288	0.679	0.634
	MC2C	0.711	0.656	0.658	0.311	0.663	0.653
	MR2C	0.693	0.641	0.647	0.283	0.656	0.637
	ME2C	0.738	0.676	0.681	0.352	0.692	0.670
Encoder-decoder	C2C	0.711	0.663	0.662	0.327	0.658	0.665
	R2C	0.680	0.625	0.623	0.249	0.621	0.626
	E2C	0.719	0.650	0.650	0.301	0.649	0.651
	MC2C	0.715	0.643	0.647	0.287	0.655	0.640
	MR2C	0.652	0.583	0.583	0.167	0.582	0.583
	ME2C	0.709	0.650	0.657	0.300	0.671	0.644
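
For readers who want to relate the columns of Table 9 to one another, the following is a minimal sketch of the standard textbook definitions of accuracy, precision, recall, F1 and MCC for a binary classifier, computed from confusion-matrix counts. It is not the authors' evaluation code, which is not shown here; the function names `classification_metrics` and `f1_from_precision_recall` and the example counts are hypothetical. AUROC is omitted because it needs ranked prediction scores rather than counts.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    Hypothetical helper for illustration; undefined ratios fall back to 0.0.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = f1_from_precision_recall(precision, recall)
    # Matthews correlation coefficient: ranges from -1 to 1, 0 is chance level.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


def f1_from_precision_recall(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up counts, chosen only to exercise the formulas.
    print(classification_metrics(tp=50, fp=10, tn=30, fn=10))
    # Precision and recall of the "Untrained" row in Table 9.
    print(round(f1_from_precision_recall(0.991, 0.504), 3))
```

Applying the harmonic-mean formula to the reported precision and recall reproduces the reported F1 to three decimals for the rows checked, for example 0.668 for the untrained model and 0.681 for the encoder-only ME2C model.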

ISSN: 1758-2946