Fig. 4 | Journal of Cheminformatics

Fig. 4

From: The influence of solid state information and descriptor selection on statistical models of temperature dependent aqueous solubility

The application of the CV = rt pseudo-cross-validation protocol to the same hypothetical dataset shown in Fig. 3. In the first step, 1 is transformed into 2 by removing the temperature suffix [T = x] from each ID, deleting all but one occurrence of each truncated ID, and assigning each truncated ID the arithmetic mean of the endpoint values associated with all corresponding original IDs. The transformation of 2 into 3 is simply the application of the standard cross-validation protocol. (In the current case, the nominal endpoint values were required because stratified sampling, based on the distribution of endpoint values, was employed for cross-validation.) Finally, the original dataset IDs are assigned the folds associated with their truncated IDs in 3, to give the CV = rt folds 4. This ensures that instance IDs corresponding to the same material, but with endpoint values measured at different temperatures, are always assigned to the same fold. (For this hypothetical dataset, this means [M1]_[T = 25] and [M1]_[T = 30] were both assigned to fold F1.) Consequently, they can never be split between a training set and its corresponding test set.
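
The three steps described in the caption can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the function names `truncate_id` and `rt_folds`, the endpoint values in `example`, and the rank-based stratification scheme are all assumptions made for the sketch, and folds are numbered from 0 rather than labelled F1, F2, and so on.

```python
from collections import defaultdict
from statistics import mean
import random


def truncate_id(instance_id):
    """Drop the temperature suffix: '[M1]_[T = 25]' -> '[M1]'."""
    return instance_id.rsplit("_[T", 1)[0]


def rt_folds(dataset, n_folds, seed=0):
    """Map each instance ID to a fold so that all temperatures of one
    material share a fold. `dataset` maps instance ID -> endpoint value."""
    # 1 -> 2: one entry per truncated ID, carrying the arithmetic mean
    # of the endpoint values of all its original IDs.
    grouped = defaultdict(list)
    for instance_id, value in dataset.items():
        grouped[truncate_id(instance_id)].append(value)
    mean_endpoint = {m: mean(values) for m, values in grouped.items()}

    # 2 -> 3: ordinary fold assignment on the truncated IDs. Assumed
    # stratification: rank materials by mean endpoint, then give each
    # consecutive block of n_folds materials a shuffled set of fold labels.
    rng = random.Random(seed)
    ranked = sorted(mean_endpoint, key=lambda m: (mean_endpoint[m], m))
    material_fold = {}
    for start in range(0, len(ranked), n_folds):
        block = ranked[start:start + n_folds]
        fold_ids = list(range(n_folds))
        rng.shuffle(fold_ids)
        for material, fold in zip(block, fold_ids):
            material_fold[material] = fold

    # 3 -> 4: every original ID inherits the fold of its truncated ID.
    return {i: material_fold[truncate_id(i)] for i in dataset}


# Hypothetical data; the endpoint values are invented for illustration.
example = {
    "[M1]_[T = 25]": -2.0,
    "[M1]_[T = 30]": -1.8,
    "[M2]_[T = 25]": -3.5,
    "[M3]_[T = 25]": -0.7,
    "[M3]_[T = 40]": -0.5,
    "[M4]_[T = 25]": -4.1,
}
folds = rt_folds(example, n_folds=2)
```

Because folds are assigned to materials rather than to individual measurements, `folds["[M1]_[T = 25]"]` and `folds["[M1]_[T = 30]"]` are equal by construction, whatever the random seed.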