This work divides the problem of interactive adaptation of the MPO objective function into two tasks that are implemented independently: In Task 1, depicted in Fig. 1, the high-level goal is to infer the parameters of the desirability function of each property in a MPO function: A chemist inputs a set of molecular properties they wish to optimize, and their weights. What is unknown is which values of the properties are good, i.e. the desirability functions. An initial guess about the good interval of each property is given by the chemist, but they are refined by the algorithm based on the chemist’s feedback. In Task 2, the goal is to build a chemist-specific scoring component for a molecular property for single parameter optimization; the same component can later act as part of the objective in an MPO. The chemist evaluates the score of molecules with respect to the property to be optimized. This feedback is used to learn a non-parametric model, which can be used during molecular optimization to generalize the chemist's feedback to new molecules.

The proposed method for Task 1 is outlined in Fig. 1: The goal of a chemist is encoded as a composite scoring function \({S}_{r,t}\left(x\right)\) for an MPO at round \(r\) of generating novel molecular designs and the \(t\):th iteration of online interaction with the chemist. The scoring function consists of \(K\) molecular properties \({c}_{k}\left(x\right)\) and score transformation functions \({\phi }_{r,t,k}\) that define desirability of the \(k\):th molecular property. A de novo molecular design system interacts with a chemist by selecting molecules to query, and the chemist then gives feedback of how well the molecules match their goal. The feedback is used to adapt scoring function \({S}_{r,t}\left(x\right)\) so that it predicts molecules’ score more accurately, which is achieved by fitting the desirability functions \({\phi }_{r,t,k}\).

### De novo* design tool without a human-in-the-loop component*

As a *de novo* design tool we use REINVENT, which is an open-source program [4]. REINVENT uses a deep generative model to generate small molecules in the SMILES [33] format. The generative model is used in a reinforcement learning scenario, where the main objective is to maximize the score of a composite scoring function. REINVENT generates molecules by sequentially adding tokens representing atoms and their connection to a SMILES string using the generative model, also referred to as 'agent' later in the text. In reinforcement learning mode, a batch of generated SMILES strings at each epoch are scored using the scoring function. The score is used as a reward to tune the weights of the agent and thus train it to produce more high-scoring molecules.

In the current system, without HITL, the user interacts with the tool by defining the learning objective by specifying the scoring function. The scoring function of REINVENT allows the user to combine the various objectives which can play a role in molecular design. The objectives include components such as predictive models, calculated properties, 2D and 3D similarity [34, 35] and molecular docking [36]. These components are normally combined either as a weighted product

$$S\left( x \right) = \left[ {\mathop \prod \limits_{k = 1}^{K} \phi_{k} \left( {c_{k} \left( x \right)} \right)^{{w_{k} }} } \right]^{{{\raise0.7ex\hbox{$1$} \!\mathord{\left/ {\vphantom {1 {\mathop \sum \nolimits_{k = 1}^{K} w_{k} }}}\right.\kern-\nulldelimiterspace} \!\lower0.7ex\hbox{${\mathop \sum \nolimits_{k = 1}^{K} w_{k} }$}}}}$$

(1)

or a weighted sum

$$S\left(x\right)= \frac{{\sum }_{k=1}^{K}{w}_{k} {\phi }_{k}\left({c}_{k}\left(x\right)\right)}{{\sum }_{k=1}^{K}{w}_{k}}$$

(2)

where the user-selected components are denoted as \({c}_{k}\) in both equations and the corresponding weights are denoted as \(w\), and *K* is the number of components. If the component outputs a continuous value, e.g. a regression model, the prediction outcome is scaled to [0, 1] using a score transformation \({\phi }_{k}\) that is the desirability function of the \(k\):th property. Weights can vary in the range [1, + ∞) while the score from each component \({\phi }_{k}\left({c}_{k}\left(x\right)\right)\) can vary in the range [0, 1], resulting in an overall score within a range of [0, 1]. Both the components and the score transformations of these components are defined by the user and are manually tuned to guide the idea generation in a direction the designer assumes to be relevant to the project’s objectives.

###
*Human-in-the-loop assisted *de novo* molecular design*

This section introduces two HITL methods for setting objectives in *de novo* molecular design. The first is applicable when relevant sub-objective properties are known and available as scoring components: it adapts the MPO function (Section “Adapting the MPO function using human-in-the-loop feedback (Task 1)”). The second is for cases where a specification of a scoring component for a molecular property does not exist, and we propose a method to learn a new predictive model that captures the medicinal chemist’s knowledge about the molecular property (Section "Building a new scoring component from human knowledge (Task 2)"). In both cases, the AI-assistant needs to solve an active learning problem of how to select the molecules to show to the chemist during the interaction. Different active query selection strategies are described in Section "Query selection strategies".

### Adapting the MPO function using human-in-the-loop feedback (Task 1)

This method adapts the MPO objective to match the chemist’s goal by estimating its parameters from iterative simple feedback, in the setup depicted in Fig. 1. We assume that a chemist inspects a molecule \(x\in \mathcal{M}\), where \(\mathcal{M}\) is the set of all valid molecules, evaluates it based on their tacit inner scoring function modelled with \(S\left(x\right)\), \(S:\mathcal{M}\to \left[\mathrm{0,1}\right]\), and gives binary feedback \(y\in \left\{\mathrm{0,1}\right\}\). Here \(y\)=1 means that the molecule is good for their purpose and \(y\)=0 that it is not. In addition, we make the simplifying assumptions that \(S\) is stationary and deterministic.

The adaptive MPO scoring function consists of \(K\) adaptive scoring components \({\phi }_{r,t,k}\left({c}_{k}\left(x\right)\right)\in \left[\mathrm{0,1}\right]\), \(k=1,\dots ,K\), each measuring the utility of a molecular property \({c}_{k}\left(x\right)\in {\mathbb{R}}\) that can be computed from a molecule \(x\). The MPO function is adapted by modifying the desirability functions \({\phi }_{r,t,k}\), also called score transformations, at rounds of molecule generation (\(r=1,\dots ,R\)) and at iterations of on-line interaction with a chemist (\(t=1,\dots ,T\)). Let \({\theta }_{r,t,k}\in {\mathbb{R}}^{{d}_{k}}\) denote the unknown parameters of \({\phi }_{r,t,k}\), and simplify notation by writing \(\phi_{k} \left( {c_{k} \left( x \right),\theta_{r,t,k} } \right): = \,\,\phi_{r,t,k} \left( {c_{k} \left( x \right)} \right)\) . The number of parameters \({d}_{k}\) depends on the model family of the transformation \({\phi }_{k}\), which is assumed to be known. In this work, we use a double sigmoid score transformation for each component, which defines a range where the generated molecules’ properties are desired to lay, with smooth thresholds. The double sigmoid transformation, illustrated in Fig. 1, is parameterized with four parameters: \({\varvec{\theta}}=\left[LOW,HIGH, {\alpha }_{1}, {\alpha }_{2}\right]\):

$$\phi \left(x,{\varvec{\theta}}\right)=\frac{{10}^{{\alpha }_{1}x}}{{10}^{{\alpha }_{1}x}+{10}^{{\alpha }_{1}LOW}}-\frac{{10}^{{\alpha }_{2}x}}{{10}^{{\alpha }_{2}x}+{10}^{{\alpha }_{2}HIGH}}$$

where \(\left[LOW, HIGH\right]\) defines the desired interval of the property value \(x\), \({\alpha }_{1}\) and \({\alpha }_{2}\) control the steepness of the rising and descending edge respectively.

The scores of the scoring components are aggregated using an aggregation method from Eq. (1) or (2), assuming known weights \({w}_{k}\), with a constraint that \({\sum }_{k=1}^{K}{w}_{k}=1\). The resulting adaptive scoring function \({S}_{r,t}\left(x\right)\) with Eq. (1) aggregation is

$$S_{r,t} \left( x \right): = \,\,S_{{{{\user2{\theta } }}_{r,t} }} \left( x \right) = \mathop \prod \limits_{k = 1}^{K} \phi_{k} \left( {c_{k} \left( x \right), \user2{\theta }_{r,t,k} } \right)^{{w_{k} }}$$

(3)

where \({{\user2{\theta } }_{r,t}=\left[{\user2{\theta } }_{r,t,1},\dots ,{\user2{\theta } }_{r,t,K}\right]}^{\mathrm{\top }}\) (bolded letter denotes a vector).

#### Task

Given \(K\) molecular properties, known score transformation function family parameterized with \(\user2{\theta }\), score aggregation type (Eq. (1) or (2)), score aggregation weights \(w\) and an initial guess \({\user2{\user2{\theta } } }_{0}\), learn \(\user2\theta\) by showing molecules to a chemist, recording their response and computing the posterior distribution of \(\user2\theta\) to adapt the MPO scoring function \({S}_{{{\user2{\theta }}}}\left(x\right).\)

#### Workflow

In the first round \(\left(r=1\right)\), an initial batch of molecules is generated using scoring function \({S}_{{{\varvec{\theta}}}_{0}}\) as a scoring function in REINVENT. Then an active query selection strategy sequentially selects molecules to be shown to a chemist, who gives feedback to them. This continues for \(T\) iterations, after which the next round begins (\(r=2\)) and a new batch of molecules is generated using \({S}_{{{\varvec{\theta}}}_{r-1}}\) as a scoring function, where the \({\user2{\theta } }_{r-1}={\user2{\theta } }_{r-1,T}\) is a vector of point estimates of the score transformation parameters from the last round.

#### Probabilistic model of the chemist’s score

The chemist’s unknown score is modelled using the Eq. (3), where the relevant components \({c}_{k}\) are known. We further assume that the chemist has (tacit) limits for desired values of the properties, therefore, there are two unknown parameters for each component \(\user2{\theta }_{{r,t,k}} = \left[ {HIGH_{{r,t,k}} ,~LOW_{{r,t,k}} } \right]\). The two steepness parameters of the double sigmoid are assumed to be fixed.

We assume the chemist gives feedback \(y\)=1 with the probability \(S\left(x\right)\); therefore, the observation model for the chemist’s response, given that they were shown a molecule query \({x}_{r,t}\), is

$$y|{x}_{r,t} \sim \mathrm{ Bernoulli}({S}_{r,t-1}({x}_{r,t}))$$

(4)

With Bayesian inference, we can then compute the posterior distribution of model parameters, conditioned on the observed data \({D}_{r,t}={\left\{\left({x}_{i},{y}_{i} \right)\right\}}_{i=1}^{{N}_{r,t}}\), where \({N}_{r,t}\) is the number of queries up to round \(r\) and iteration\(t\), as

$$p\left( {\user2{\theta } \mid D_{r,t} } \right) = \frac{{p\left( {D_{r,t} \mid \user2{\theta } } \right)p\left( \user2{\theta } \right)}}{{\smallint p\left( {D_{r,t} \mid \user2{\theta } } \right)p\left( \user2{\theta } \right)d\user2{\theta } }}$$

(5)

where \(p\left( {D_{r,t} \mid \user2\theta } \right)\) is the likelihood of observed data given parameters \(\user2\theta\), \(p\left(\user2\theta \right)\) is the prior distribution of \(\user2\theta ,\) and the denominator \(\int p\left({D}_{r,t}\mid \user2\theta \right)d\user2\theta\) which normalizes the distribution is called evidence. Given the observation model in Eq. (4) and assuming that the observations are independent and identically distributed (i.i.d.), the likelihood is

$$p\left( {D_{{r,t}} \mid \user2{\theta }} \right) = \prod\nolimits_{{i = 1}}^{{N_{{r,t}} }} {S_{\user2{\theta }} \left( {x_{i} } \right)^{{y_{i} }} \left( {1 - S_{\user2{\theta }} \left( {x_{i} } \right)} \right)^{{1 - y_{i} }} }$$

(6)

In case an active learning query strategy selects which observations to acquire, the observations are no longer i.i.d., which in a full treatment can be taken into account. Here we make a simplifying assumption and use (6), which may result in a bias in the model.

For specifying the prior distributions \(p\left(\user2\theta \right)\), the chemist provides initial values \({\user2\theta }_{0}={\left\{\left(HIG{H}_{0,k},LO{W}_{0,k}\right)\right\}}_{k=1}^{K}\) which are set to be the expected values of the prior distributions:

$$LO{W}_{k}\sim \mathrm{Normal}\left(LO{W}_{0,k},{\sigma }_{\theta ,k}^{2}\right\},$$

$$HIG{H}_{k}\sim \mathrm{Normal}\left(HIG{H}_{0,k}, {\sigma }_{\theta ,k}^{2}\right)$$

(7)

where \({\sigma }_{\theta ,k}=\frac{1}{8}\left(HIG{H}_{0,k}-LO{W}_{0,k}\right)\) is a hyperparameter that defines how likely the values are to differ from the initial guess, and it depends on the width of the prior belief about the desired range of the property.

Using the scoring function in REINVENT requires point estimates \({\user2\theta }_{r}\). We use the expectation of posterior distributions \(\user2\theta_{{\text{r}}} = \int \user2\theta p(\user2\theta \mid {\text{D}}_{{\text{r,T}}} )d\user2\theta\), which minimizes the mean squared error of \(\user2\theta\). The full algorithm is shown in Algorithm 1.

### Implementation

We use the probabilistic programming language Stan [37] to fit the model, and to compute posterior distributions and expectations of \(\user2\theta\). For computational reasons, we parametrize the model with \(\left(LOW, DELTA\right)\), \(DELTA>0\), so that \(HIGH=LOW+DELTA\). The code is publicly available at https://github.com/MolecularAI/reinvent-hitl.

#### Building a new scoring component from human knowledge (Task 2)

The second method we propose is applicable in cases where a pre-specified scoring component for a specific property is not available but, instead, the values for the molecular property of interest can be obtained via interaction with a chemist and in addition potentially in a small experimental dataset. The method learns a new predictive model from the chemist’s feedback based on the property values, and the resulting component can then subsequently be used as one of the objectives in MPO.

#### Setup

We assume a small initial dataset \({D}_{0}\) with molecules \(x\) and their scores \(y\) for the property of interest, either acquired beforehand from a chemist, or from experiments. In addition, there exists a pool of unlabeled molecules \(\mathcal{U}\), which can be shown to a chemist. The chemist’s feedback \(y\) is a score between [0,1] about the suitability of the molecule for the drug design task, with respect to the property of interest (0 = not good, 1 = very likely good). We assume a Gaussian likelihood of feedback: \(y\sim N\left({f}^{*}\left(x\right),{\sigma }_{0}^{2}\right)\), where \({f}^{*}\left(x\right)\) is the chemist’s evaluation of the property of interest, and \({\sigma }_{0}\) is the standard deviation of the noise in the chemist’s answers. This means that the chemist’s answers may be erroneous but are correct on average. For simplicity, we assume that the noise in the data generating process of \(y\) in \({D}_{0}\) is the same as in the chemist’s feedback. Molecules are represented by features, which in this work are descriptors such as physicochemical properties; \(x\in {\mathbb{R}}^{p}\), or Morgan fingerprints \(x\in {\left\{\mathrm{0,1}\right\}}^{d}\) [38], where \(d\) is the dimensionality of the features.

#### Task

Given initial dataset \({D}_{0}\), a pool of unlabeled molecules \(\mathcal{U}\), and a possibility to query \(T\) molecules from a chemist, learn a non-parametric model \(f(x\)) (“a chemist’s component”) to represent the chemist’s knowledge, so that the molecules generated using \(f\left(x\right)\) as a scoring function get a high chemist’s score \({f}^{*}\left(x\right)\).

#### Chemist’s component

In contrast to the previous Task 1, here in Task 2 we do not make any assumptions about the structure of the model of the chemist; instead, we use a Bayesian non-parametric model, Gaussian Process, to fit a flexible user model to \({D}_{0}\) and the feedback.

We place a Gaussian process prior on the chemist’s component, \(f \sim GP\left(0,k\left(x,x^{\prime}\right)\right)\), where \(k\left(x,{x}^{\prime}\right)\) is a kernel that measures the similarity of two molecules \(x\) and \(x^{\prime}\). The observations in data \({D}_{t}={\left\{\left({x}_{i},{y}_{i}\right)\right\}}_{i=1}^{{N}_{t}}\) include both \({D}_{0}\) and all feedback received up to iteration \(t\) (\(t=1,\dots ,T\)), so that \({N}_{t}\) is the sum of \({N}_{0}\) and the number of feedback queries so far. The posterior of the Gaussian process, at a test point \({x}_{*}\), is then characterized with the mean \({\overline{f} }_{*}\) and variance

$$\bar{f}_{*} = \user2{k}_{*}^{T} \left( {\user2{K}_{t} + \sigma _{0}^{2} I} \right)^{{ - 1}} \user2{y}$$

(8)

$$Var\left( {f_{*} } \right) = k\left( {x_{*} ,x_{*} } \right) - ~k_{*}^{T} \left( {\user2{K}_{t} + \sigma _{0}^{2} I} \right)^{{ - 1}} \user2{k}_{*}$$

(9)

where \(\user2{k}_{*}\) is a vector with elements \(k\left({x}_{*},{x}_{i}\right)\), \(i=1,\dots ,{N}_{t}\), and \({K}_{t}\) is a covariance matrix with entries \(k\left(x,x^{\prime}\right)\) for each \(x,x^{\prime}\in {D}_{t}\). The vector \(y\) contains all observations \({y}_{i}\), \(i=1,\dots ,{N}_{t}\). [39]

We apply two types of kernels: squared exponential for a case when \(x\) are physicochemical properties, and Tanimoto kernel [40] for Morgan fingerprint features. In our experiments, Morgan fingerprints resulted in better performance, and therefore, we focus on results with them. For completeness, the results with physicochemical properties are shown in the Additional file 1: Sect. 3.2.

#### Implementation

We use GPflow [41] to implement the chemist’s component using a standard Gaussian process regression model, and Tanimoto kernel implementation from [42].

#### Query selection strategies

Active learning can be used to select a molecule for a chemist to label, from the pool of unlabeled molecules \(\mathcal{U}\). In typical active learning settings, \(\mathcal{U}\) is available before training. In our work, \(\mathcal{U}\) either consists of molecules from a previous molecule generation, or molecules from public databases. The goal in active learning is to learn an accurate model that maps from molecules \(x\) to labels \(y\).

Our setup differs from standard active learning in that the model will subsequently be used as a scoring function for molecule generation (technically: a reward function in reinforcement learning), and, therefore, it is desired to have a model that can correctly identify high-scoring molecules. This leads to an exploration–exploitation trade-off in query selection: the system needs to trade off showing as many positive examples as possible, while ensuring that unknown areas are explored sufficiently to find new positive examples. We use a Bayesian optimization approach based on Thompson sampling [29] to solve this trade-off. Below we give brief summaries of the query selection strategies that we compare in this work: random sampling, uncertainty sampling, pure exploitation, and Thompson sampling. Each of them aims at selecting a next molecule \({x}^{*}\in \mathcal{U}\) to query from a chemist.

#### Random sampling

Sample \({x}^{*}\) uniformly randomly from \(\mathcal{U}\).

#### Uncertainty sampling

Select the molecule that the model is the most uncertain about: \(x^{*} = {\arg \max}_{x \in u} H_{\user2\theta } \left( {y \mid x} \right)\) where \({H}_{\user2\theta}\) denotes entropy when the model parameters are \(\user2\theta\), and \(y\mid x\) is the predicted score of molecule \(x\) in the model (\(y\in \left\{\mathrm{0,1}\right\}\) in Task 1 and \(y\in \left[\mathrm{0,1}\right]\) in Task 2).

#### Pure exploitation

Select the molecule that maximizes the expected expert score: \(x^{*} = \arg \max_{x \in U} \int {S_{\user2\theta } (x)p(\user2\theta \mid D_{r,t} )d\user2\theta }\) (Task 1), and \({x}^{*} = \arg \max_{x \in \mathcal{U}}f\left(x\right)\) (Task 2).

#### Thompson sampling

Select a molecule that greedily maximizes the expected score given a randomly drawn belief. In Task 1, this means drawing a sample \({\user2{\theta}}_{s}\) from the current posterior distribution \(p\left( {\user2{\theta} \mid D_{r,t} } \right)\), and then maximizing \({x}^{*}= \arg \max_{x \in \mathcal{U}}{S}_{{{{\user2\theta}}}_{s}}\left(x\right)\) (Algorithm 2). In Task 2, we sample one realization \({{\varvec{f}}}_{s}\) of the GP posterior at points \(x\in \mathcal{U}\), and select the \(x\) with the largest value of the sampled function mean \({\overline{{\varvec{f}}} }_{s}\left(x\right)\) (Algorithm in Additional file 1: Sect. 1.1).

### Human-in-the-loop experiments

We demonstrate the methods in two example tasks, with binary and continuous feedback. The goal in Task 1 is to adapt the scoring function consisting of physicochemical properties to generate molecules that score high in Quantitative Estimate of Drug-likeness (QED) [10], with binary feedback. In Task 2 we train a novel scoring component for capturing the chemist’s knowledge about DRD2 activity of the molecule, based on continuous-valued feedback.

To make the study reproducible, we use an oracle to simulate the responses of a chemist instead of including a real human in the loop. Nevertheless, we assume the budget of 200 active learning queries, which is close to the maximum feasible number of interactions with a human chemist.

For evaluating how well the generated molecules match the simulated chemist’s goal, we use the score from the oracle, coined ‘oracle score’ to distinguish it from the from the score of a molecule in a scoring function. To reduce computation time in the experiments, we select queries sequentially in batches, by greedily selecting the \({n}_{\text{batch}}\) best molecules according to a query strategy at iteration \(t\). As a result, the performance of other methods may be underestimated compared to random sampling, because they will not be able to optimize their selection during a batch.

#### Task 1: Adapting the parameters of the MPO function

We experimentally evaluate the method for adapting MPO in a task of generating molecules with a high QED-score [10], based on scoring components of physicochemical properties. We chose QED-score as the goal because it is inspired by how humans evaluate the drug-likeness of molecules. This makes it a suitable proxy to simulate a chemist’s intuition and, furthermore, there exists a publicly available method for approximating it [10]. We make a minor modification to the standard QED score, so that the modified score \({S}_{\text{mQED}}\left(x\right)\in \left[\mathrm{0,1}\right]\) favors smaller values of partition coefficient (logP), to make the task more difficult a priori (for the details of the modification of the desired value of logP from the original average 3 to average 1.5, see Additional file 1: Sect. 2.1).

The scoring components include the following seven physicochemical properties, calculated with RDKit [43]: molecular weight (MW), partition coefficient (SlogP), hydrogen bond donors (Lipinski) (HBD), hydrogen bond acceptors (Lipinski) (HBA), polar surface area (PSA), number of rotatable bonds and the number of aromatic rings. We assume that all properties are transformed with the double sigmoid function, with unknown HIGH and LOW parameters. The other two parameters of the double sigmoids are set to fixed values deemed good for each property based on prior knowledge, provided in Additional file 1: Section 2.2. We aggregate the scores using weighted geometric average (eq. (1)).

As a starting point in the first round, we use poor guesses on the parameters \({{\varvec{\theta}}}_{0}={\left\{\left(HIG{H}_{0,k},LO{W}_{0,k}\right)\right\}}_{k=1}^{7}\) to create a scoring function that gives high score to molecules with a wide range of molecular properties. The exact initial values are reported in Additional file 1: section 2.3 We use this scoring function in REINVENT and collect the high-scoring molecules generated during 300 epochs of training as the first unlabeled molecules \(\mathcal{U}\) (depending on the run, this results in the order of 1,000–10,000 molecules). The number of epochs was chosen so that in most cases a (local) maximum has been found, observed as flattening of the learning curve. We run the experiment for two rounds (initialization, and two rounds of feedback queries, \(R=2\)), and evaluate the performance as the average oracle score of the generated molecules at initialization and at the end of each round.

At each round, the user model is initialized with \(10\) randomly chosen molecules, and the priors of the user-model are defined by the previous round’s \({\theta }_{r-1}\) (\({\theta }_{0}\) in the first round). Then, we do \(10\) iterations of querying \({n}_\text{batch}=10\) molecules in batches. This means that we query in total \(T=110\) molecules from a simulated chemist, making the total query budget in the experiment 220. The simulated chemist gives feedback 1 randomly with probability of \({S}_\text{mQED}\left(x\right)\). For each query strategy described in "Task 2: Learn human knowledge about a molecular property as a separate component" Section we repeat the experiment ten times with different random seeds to quantify variance due to different \({D}_{0}\) and stochasticity in the simulated chemist’s answers.

#### Task 2: Learn human knowledge about a molecular property as a separate component

We test possibility to learn human knowledge using example of DRD2 activity. For reproducibility, we use an oracle model, instead of a human chemist. We derive human component *f(x)* using algorithm described in "Building a new scoring component from human knowledge (Task 2)" Section. To evaluate derived human component *f(x)*, we first use *f(x)* as a scoring function in REINVENT to train an agent; we then sample molecules from trained REINVENT agent, and evaluate sampled molecules using the oracle model.

To compare query strategies, we derive human component *f(x)* for each of the query strategies described in "Building a new scoring component from human knowledge (Task 2)" Section, and repeat the experiments 10 times with different random seeds.

For sensitivity analysis, we repeat the experiment, but this time we derive human component *f(x)* with noise added to the simulated chemist’s answers.

##### Training oracle model

We evaluate the Task 2 method in an example case of learning the DRD2 activity of molecules from feedback. For an oracle in this case, we use activity prediction model trained on a large publicly available dataset on DRD2 activity [44]. We used an activity prediction model trained on both the active and inactive compounds of the ExcapeDB DRD2 modulator set.^{Footnote 1} To train the model, stereochemistry was stripped from all compounds in the dataset, and they were represented in their canonical form by using RDKit [43]; the resulting duplicates were removed; data was split to test and training sets with a stratified split and the compounds were represented with ECFP6 fingerprint (radius 3) hashed to 2048 bits; a Scikit-learn [45] Random Forest Classifier model was trained to discriminate active from inactive compounds; Optuna [46] was used for finding the optimal hyperparameters with a 5-fold cross validation; the performance of the resulting model in terms of area under the curve (AUC) was 0.945. We use predicted positive class probabilities from the activity prediction model when answering the queries.

##### Deriving human component f(x)

As described in "Task 1: Adapting the parameters of the MPO function" Section, we derive human component as a Gaussian Process model. The DRD2 dataset consists of 275,768 molecules represented as SMILES strings. In the beginning, we sample randomly \({N}_{0}=10\) molecules to be the initial dataset \({D}_{0}\) and acquire their scores from the simulated chemist (oracle in the noise-free case). The rest of the molecules are used as unlabeled molecules \(\mathcal{U}\) in the simulated experiments. During interaction, we query in total 200 molecules from the simulated chemist in batches of 5 (\(T=40\)).

##### Sensitivity analysis

In addition to a noise-free case, we do a sensitivity analysis of the method with respect to the noise in the simulated chemist’s answers. For this, the oracle’s answers are corrupted with independent Gaussian noise with standard deviation \({\upsigma }_{0}=0.15\) (moderate noise) and \({\upsigma }_{0}=0.30\) (severe noise). For simplicity, feedback values are capped within range [0,1].

For evaluation, we set \(f\left(x\right)\) as the scoring function in REINVENT and train the agent for 300 epochs. After obtaining a trained agent, we sample 1024 molecules from the agent and evaluate sampled molecules using the oracle model. For each query strategy described in "Building a new scoring component from human knowledge (Task 2)" section, we repeat the experiments 10 times with different random seeds.

#### Deriving a DRD2 scoring function using a human

To show that the results of the simulated experiment in Task 2 are relevant, we exemplified the method with human feedback in a modified version of Task 2 where the chemist was queried directly. We let a medicinal chemist (who is also coauthor of the manuscript) interact with the system in the same DRD2 activity setup as described in "Task 2: Learn human knowledge about a molecular property as a separate component" Section. The system has a graphical user interface for interaction, shown in Fig. 2.