
Table 12 Mann-Whitney U test statistics of the difference between in-sample distance populations for each model and interpretation method

From: Using test-time augmentation to investigate explainable AI: inconsistencies between method, model and human intuition

| Architecture | Training | IG | SHAP | Att. Maps | Rollout | Grads | AttGrads | CAT | AttCAT |
|---|---|---|---|---|---|---|---|---|---|
| Encoder only | C2C | 0.761 | 0.867 | 0.767 | 0.722 | 0.831 | 0.032 | 0.930 | 0.673 |
| | R2C | 0.121 | 0.114 | 0.507 | 0.339 | 0.316 | 0.036 | 0.376 | 0.469 |
| | E2C | 0.375 | 0.796 | 0.871 | 0.299 | 0.667 | 0.908 | 0.900 | 0.081 |
| | MC2C | 0.848 | 0.978 | 0.370 | 0.548 | 0.281 | 0.342 | 0.921 | 0.409 |
| | MR2C | 0.711 | 0.760 | 0.925 | 0.793 | 0.903 | 0.278 | 0.457 | 0.855 |
| | ME2C | 0.216 | 0.373 | 0.626 | 0.955 | 0.326 | 0.060 | 0.004 | 0.000 |
| Encoder-decoder | C2C | 0.414 | 0.202 | 0.794 | 0.794 | 0.537 | 0.001 | 0.060 | 0.261 |
| | R2C | 0.563 | 0.324 | 0.765 | 0.000 | 0.657 | 0.000 | 0.237 | 0.376 |
| | E2C | 0.066 | 0.348 | 0.000 | 0.046 | 0.422 | 0.167 | 0.814 | 0.536 |
| | MC2C | 0.004 | 0.002 | 0.651 | 0.932 | 0.447 | 0.521 | 0.625 | 0.362 |
| | MR2C | 0.055 | 0.398 | 0.960 | 0.861 | 0.668 | 0.798 | 0.154 | 0.356 |
| | ME2C | 0.590 | 0.381 | 0.955 | 0.971 | 0.498 | 0.186 | 0.377 | 0.376 |

  1. Each value compares the attributions for the canonical SMILES representation, or for a random SMILES representation, with the attributions for all other enumerated representations. Bold values are p > 0.005.
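
For readers who want to see what a Mann-Whitney U comparison of two distance populations involves, the following is a minimal sketch in plain Python. It is not the authors' implementation, which this table does not describe. The function name `mann_whitney_u` and its arguments are hypothetical, and the two-sided p-value uses the normal approximation with tie and continuity corrections, an assumption that may differ from the procedure behind the reported values.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U statistic of x, two-sided p-value from a normal approximation).

    Hypothetical helper for illustration only; x and y stand for two
    populations of attribution distances.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")

    # Rank the pooled sample, giving tied values their average rank.
    pooled = sorted((v, i) for i, v in enumerate(list(x) + list(y)))
    ranks = [0.0] * (n1 + n2)
    tie_term = 0.0
    start = 0
    while start < len(pooled):
        end = start
        while end + 1 < len(pooled) and pooled[end + 1][0] == pooled[start][0]:
            end += 1
        avg_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[pooled[k][1]] = avg_rank
        t = end - start + 1
        tie_term += t**3 - t  # used for the tie correction of the variance
        start = end + 1

    # U for the first sample: its rank sum minus the smallest possible rank sum.
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2

    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u1, 1.0  # every pooled value is identical

    # Continuity-corrected z score, clamped at zero.
    z = max((abs(u1 - mean_u) - 0.5) / math.sqrt(var_u), 0.0)
    p_value = math.erfc(z / math.sqrt(2))  # two-sided tail probability
    return u1, min(p_value, 1.0)
```

Calling `mann_whitney_u(canonical_distances, enumerated_distances)` on two lists of distances (both names are placeholders) returns the U statistic and an approximate p-value. For small samples or real analyses, an established routine such as `scipy.stats.mannwhitneyu`, which can also compute exact p-values, is the safer choice.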