Using test-time augmentation to investigate explainable AI: inconsistencies between method, model and human intuition

Table 12 Mann-Whitney U test statistics of the difference in in-sample distance populations for each model and interpretation method

Architecture	Training	IG	SHAP	Att. Maps	Rollout	Grads	AttGrads	CAT	AttCAT
Encoder only	C2C	0.761	0.867	0.767	0.722	0.831	0.032	0.930	0.673
	R2C	0.121	0.114	0.507	0.339	0.316	0.036	0.376	0.469
	E2C	0.375	0.796	0.871	0.299	0.667	0.908	0.900	0.081
	MC2C	0.848	0.978	0.370	0.548	0.281	0.342	0.921	0.409
	MR2C	0.711	0.760	0.925	0.793	0.903	0.278	0.457	0.855
	ME2C	0.216	0.373	0.626	0.955	0.326	0.060	0.004	0.000
Encoder-decoder	C2C	0.414	0.202	0.794	0.794	0.537	0.001	0.060	0.261
	R2C	0.563	0.324	0.765	0.000	0.657	0.000	0.237	0.376
	E2C	0.066	0.348	0.000	0.046	0.422	0.167	0.814	0.536
	MC2C	0.004	0.002	0.651	0.932	0.447	0.521	0.625	0.362
	MR2C	0.055	0.398	0.960	0.861	0.668	0.798	0.154	0.356
	ME2C	0.590	0.381	0.955	0.971	0.498	0.186	0.377	0.376

Each test compares the in-sample distances of attributions for the canonical SMILES representation (or a random SMILES representation) with those for all other enumerated SMILES. Bold values are p > 0.005.
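As an illustration of the kind of comparison summarised in this table, the following is a minimal sketch of a two-sided Mann-Whitney U test between two populations of distances. It uses the normal approximation with tie and continuity corrections, which is an assumption: the table does not state which implementation or variant was used. The function names, the toy distance values and the use of 0.005 as a decision threshold (taken from the note above) are hypothetical choices for the example, not the authors' code or data.

```python
import math
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation).

    Returns (U statistic of x, approximate p-value).
    """
    n1, n2 = len(x), len(y)
    combined = list(x) + list(y)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2

    # Tie correction for the variance of U.
    counts = {}
    for v in combined:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t**3 - t for t in counts.values())
    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # every value identical: no evidence of a difference
        return u1, 1.0

    # Continuity-corrected z score, clipped at zero.
    z = max((abs(u1 - mean_u) - 0.5) / math.sqrt(var_u), 0.0)
    p = 2 * (1 - NormalDist().cdf(z))
    return u1, min(p, 1.0)


# Hypothetical in-sample distances (illustrative values only).
canonical_distances = [0.12, 0.18, 0.15, 0.22, 0.19, 0.14]
enumerated_distances = [0.21, 0.25, 0.17, 0.30, 0.28, 0.24, 0.26]

u_stat, p_value = mann_whitney_u(canonical_distances, enumerated_distances)
print(f"U = {u_stat:.1f}, p = {p_value:.4f}")
print("differs at 0.005" if p_value < 0.005 else "no difference at 0.005")
```

For small samples an exact test is usually preferable to the normal approximation; a library routine such as `scipy.stats.mannwhitneyu` offers both.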

ISSN: 1758-2946