 Research article
 Open Access
 Published:
Algorithmsupported, mass and sequence diversityoriented random peptide library design
Journal of Cheminformatics volume 11, Article number: 25 (2019)
Abstract
Random peptide libraries that cover large search spaces are often used for the discovery of new binders, even when the target is unknown. To ensure an accurate population representation, there is a tendency to use large libraries. However, parameters such as the synthesis scale, the number of library members, the sequence deconvolution and peptide structure elucidation, are challenging when increasing the library size. To tackle these challenges, we propose an algorithmsupported approach to peptide library design based on molecular mass and amino acid diversity. The aim is to simplify the tedious permutation identification in complex mixtures, when mass spectrometry is used, by avoiding mass redundancy. For this purpose, we applied multi (two and three)objective genetic algorithms to discriminate between library members based on defined parameters. The optimizations led to diverse random libraries by maximizing the number of amino acid permutations and minimizing the mass and/or sequence overlapping. The algorithmsuggested designs offer to the user a choice of appropriate compromise solutions depending on the experimental needs. This implies that diversity rather than library size is the key element when designing peptide libraries for the discovery of potential novel biologically active peptides.
Introduction
Smallmolecule libraries are widely used in drug discovery to identify biologically active molecules [1]. Traditionally, small molecule library design is based on a known target structure or on known ligands. The last two decades have witnessed the application of docking calculations to the pharmaceutical field through the lead optimization of structurebased design of small molecules [2]. Moreover, molecular docking is the main approach in the structurebased peptidyldrug design studies, even though peptide based drugs are less explored than the small molecule ones. However, when the targets and/or ligands are unknown or “undruggable” this virtual screening approach fails to provide useful information [3]. Therefore, one has to rely on experimental screenings of random, large numbers of compounds in order to obtain further insight into possible identification of binders or disruptors of proteinprotein interactions (PPI) [4, 5].
Safer and more specific alternatives to small molecules, peptidebased drugs are emerging as a new paradigm in medicinal chemistry [6, 7]. Therefore, there is a need to develop combinatorial approaches to identify new peptide therapeutics [5]. In this context, phage display was the main approach to obtain a variety of random peptide sequences [8, 9]. Nowadays, the phage display technology is well established in the peptidebased drug discovery process [10, 11]. However, the limitation was that the library building blocks are limited to the 20 natural proteinogenic amino acids resulting in peptides with short halflives and susceptible to proteolytic cleavage. Progress with phage display and RNAdisplay technologies has allowed the use of nonproteinogenic amino acids in those techniques [12], but the onebeadonecompound (OBOC) method allows a more straightforward use of unnatural amino acids [13, 14]. It allows the introduction of stereochemical variability at the \(\alpha\)carbon, headtotail cyclization, disulfide bridges, etc. and their combinations without complex genetic manipulations [15]. Therefore, OBOC peptide libraries, have opened up the path to a larger search space for discovery of peptidebased drugs [15, 16].
Chemical and molecular diversity of library components are key for drug discovery [17]. Combinatorial libraries are tools that allow large amount of compounds to be screened at the same time [4]. Currently, the diversityoriented systems (DOS) for small molecules are taking over the static combinatorial approach, allowing the introduction of skeletal, structural and stereochemical complexity [18,19,20]. Similarly to DOS, one can think of introducing stereochemical and skeletal complexity into peptides through the use of Dstereoisomers, the combination of L and Dones, the retroenantio versions and cyclization. By this means, peptidebased libraries with increased stability could be generated. However, their analysis is challenging as similarity in structure and behavior of available peptide permutations would lead to the impossibility to distinguish between sequences using today’s available screening and/or analyzing techniques such as HPLC, UPLCMS, etc.
It is believed that larger the sample size, more accurate is the representation of a given population [21]. In these terms, random libraries are advantageous over the focused ones, as they cover a larger search space. However, parameters such as the synthesis scale to guarantee the synthesis of all possible compounds, the number of library members, the sequence deconvolution and peptide structure elucidation after screening, are challenging steps when increasing the library size [22]. Moreover, libraries may result in competition among candidates and biased hits identification due to the physicochemical similarity of the components [21]. This is particularly the case during the sequence deconvolution process of positive hits, where the identification of single peptides is not always possible. Often, families of peptides having identical masses are identified [23]. In order to address this issue, we set out to determine whether multiobjective genetic algorithms can aid and simplify the rational design of random libraries to increase diversity and minimize redundancy. The advantage of this tool is the reduction of the number of library members in a single screening while maintaining maximal chemical and mass diversity.
Recently, artificial intelligence has been applied to a variety of chemical problems to maximize the chance of successful and rapid solving of complex issues [24,25,26,27,28]. In our group, evolutionary algorithms have been successfully applied to address various challenges of peptide chemistry [29,30,31,32] related to the design and identification of peptides that cross the bloodbrainbarrier (BBB) [33], or that bind to the major histocompatibility complex (MHC) [34]. Recently, Cronin and coworkers used evolutionary algorithms in combination to machine learning to predict antimicrobial activity of peptide sequences [35]. Although the heuristic approach has been explored in the field of peptide design, it is mainly focused on the use of the oneobjective setting where one fitness function optimizes a specific peptide property (e.g., antimicrobial, anticancer activity) [36,37,38]. To date, multiobjective genetic algorithms have been applied to the design of small molecule combinatorial libraries [21, 39,40,41] but have not been explored for random peptide libraries yet. Herein, we present the framework to solving and/or simplifying a combinatorial challenge—whether we could rationally design a random peptide library and what are the rules that underline this algorithm supported process. For this purpose, we used an evolutionary computing approach based on a multiobjective genetic algorithm (GA) [42, 43] represented in scheme 1 to address the issue of maximizing the number of diverse amino acid permutations. In this way, sequence deconvolution and peptide sequence elucidation are simplified by avoiding sequence redundancy. Ultimately, the designer is able to make an algorithmsupported choice of an appropriate compromise solution among given optimized library designs, depending on the experimental needs.
Results and discussion
The 20 geneencoded amino acids, together with a variety of nonnatural ones, constitute a versatile toolbox for combinatorial library design (Scheme 1a). A high chemical and structural diversity can be achieved by designing libraries composed of sequences of amino acids (peptides) which total number is calculated using the formula \(R=m^r\), where r is the number of positions where the variability can be introduced and m the number of amino acids per position [22]. Therefore, the number of permutations grows exponentially with the peptide length, depending also on the number of amino acids per position. If all the 20 proteinogenic amino acids are used in r positions, there are \(R=20^r\) possible permutations. OBOC libraries prepared using the “split and mix” methodology result in redundant peptide mixtures consisting of permutations of amino acids having overlapping masses [23].
In our study, the introduction of complexity of the system refers to the molecular weight (\(M_W\)) diversity of the library components. In the case of peptides, made of combinations of various natural or nonnatural amino acids, the \(M_W\) diversity often corresponds to sequence diversity. Therefore, based on monoisotopic \(M_W\) values we set out to determine an algorithm supported approach to help the random design of systems that maintain maximum mass and sequence diversity. Ultimately, the objective of this paper is to simplify the tedious and sometimes difficult chromatographic and mass spectrometry based analyses of complex mixtures of peptides that have similar properties. We chose the \(M_W\) as the discriminant for increasing diversity and complexity of the system, consequently reducing the number of peptides in the mixture.
First, we developed a peptide mass calculator with the possibility to discriminate between masses of peptides in a given mixture, generating an output csv (commaseparated values) file containing all possible peptide permutations alongside with their masses in addition to a csv file containing the list of peptides with unique masses. In the latter, the permutations having the same amino acid composition but with varying locations within the sequence were omitted except one representative of each collision set. The same was applied to peptides having different amino acid composition but overlapping masses, with tolerance set to 1 (see Additional file 1: Fig. S7). Tolerance is an arbitrary parameter that is user defined and it can be tuned to any value, as described in the next subsection. A simplified schematic representation of the calculator output, using a small library of dipeptides is shown in Scheme 1b. The amino acids are divided in 5 different categories (colorcoded) depending on their polarity and charge (Scheme 1a). If each of the colors stands for one representative amino acid and all the possible dipeptides are made, there are \(5^2\) (r = 2, m = 5) = 25 permutations. After the calculator excludes the peptides with overlapping masses, we are left with a subset of 15 dipeptides of unique masses, as shown in Scheme 1b (right panel). In this way, we could simplify the chromatographic and mass spectrometry analysis of the mixture.
The following step was the development of the optimization tool based on the genetic algorithm (Scheme 1c) that is able to suggest several design strategies based on user’s needs and inputs (Scheme 1d). The input is the same as for the calculator, where the number of positions (r) and the amino acids (\(x_i\)) for each position have to be defined. The output is a Pareto front containing a distribution of solutions, based on how they scored during the GA selection rounds. The Pareto front is a set of nondominated solutions chosen as optimal by the GA. Subsequently, the user makes the choice of the preferred design suggestion based on the experimental expertise and the requirements of the library (Scheme 1d). We refer to this choice as algorithmsupported decision. Moreover, we explored the possibility of a three objective algorithm to include also the sequence diversity.
Library mass analysis with the combinatorial calculator
During the initial step of peptide library planning there are two parameters that need to be determined (Fig. 1a): (1) the number of positions r and (2) the list of \(m_i\) amino acids \(x_i=\left\{ \right\}\) that can appear at ith positions, where \(1 \le i \le r\). These constitute our input parameters. To simplify and automatize mass calculations of all the possible permutations in a userdefined peptide library, the peptide mass calculator algorithm was developed using the Matlab scripting language. An example is a pentapeptide library where the input was: r = 5, m = 6, for \(x_i\) = s, e, r, w, a, G (Fig. 1a). For this condition, the total number of peptide permutations given by \(m^r\) is 7776 (Fig. 1b).
The peptide mass calculator also compares the masses of all the possible permutations and locates the ones that have unique mass. The motivation behind this was to estimate whether the analysis of all the compounds with unique masses i.e., simplified libraries in terms of number of peptides in the mixture is feasible by using chromatography coupled mass spectrometry. Hence, the mass of a peptide M is considered unique if no other peptide has the mass within the range \(<M  T(\Delta mass), M + T(\Delta mass)>\). If there is another peptide in the mixture, having mass within this range, it is excluded from the list of permutations unique by mass, thus automatizing redundancy identification. The parameter \(T(\Delta mass)\) is referred to as the tolerance of mass discrimination. It is also a userdefined parameter, which may be tuned (according to user’s needs and the resolution of the machine used to perform the mass spectrometry). Throughout this study, the condition \(T(\Delta mass)=1\) was kept in all the calculations and permutations, unless stated otherwise. Using this configuration and the exemplary pentapeptide library from Fig. 1a, the peptide mass calculator computes 130 peptides of unique mass (Fig. 1b).
The calculator output provides a csv (commaseparated value) type file containing the full list of permutations with the corresponding \(M_W\) values, in the form of: (1) the average mass, (2) the monoisotopic mass, (3) the singly charged \([M+H]^+\) and (4) the doubly charged \([M+2H]^{2+}\) expected ions for mass spectrometry analysis (Fig. 1c). The algorithm also provides the full list of unique peptide masses (for simplicity only a screenshot is shown in the Fig. 1c).
Therefore, our algorithm gives the possibility to the user to explore all possible permutations as well as the ones baring unique masses, representative of the population. Here, peptide diversity is achieved solely through the introduction of functional group (side chain) diversity and leads to peptide collections with variable mass diversity. This tool is the basis for the optimization process used for the algorithmsupported library design described below.
Library design optimization with the genetic algorithm approach
Next, we developed an optimization decision support system based on the genetic algorithm (GA) approach. GAs are iterative heuristic processes of computational optimization that is based on selection, crossover and mutation of genes, analogous to the natural process of evolution where the best genes of individuals are preserved and passed to the next generation with the expectation that the fittest parents will give even fitter offspring. In computer science, GAs are mimicking this process by trial and error, gradually finding the best solution or solutions to a given problem and ending only if one of the stopping criteria is met (Fig. 1c): (1) the fitness function of an individual achieved the maximum level or a predefined threshold level, i.e., the algorithm has found the best solution or a solution that is good enough; (2) after a certain number of generations, the fitness function of best solutions is not improved, i.e., the algorithm cannot find another solution that is better than the one it has already found. In each iteration (generation) of a GA, a group (population) of candidate solutions (individuals) to a given optimization problem are considered simultaneously. Each solution is defined by parameters encoded in genes and its quality is quantified by fitness functions. The values of fitness functions are numerical indicators used to compare solutions and find the best ones.
In this paper, the second variant of the Nondominated Solutions Genetic Algorithm (NSGAII) was used to run peptide library optimizations. The goal was to identify optimal library designs that yield the greatest number of peptides with unique masses. The motivation was to find library designs with feasible analysis of all its components by chromatography coupled mass spectrometry. The design of a library is defined by the number of positions r and the variability (a number of possible amino acids) in each position \(x_i\) \((1< i < r)\). The variability is user defined and consists of a set of possible amino acids, which may vary for each position. The implemented algorithm transforms each set of possible amino acids into a bitstring, where the length of the bitstring is equal to the length of the set and each bit (1 or 0) represents the inclusion (1) or exclusion (0) of a given amino acid. That way, the algorithm is searching for an optimal subset of amino acids. The search is guided by two fitness functions that need to be maximized: (1) the total number of peptides for a given variability and (2) the number of peptides unique by mass. Both fitness functions are computed for each solution examined in this iterative search process by using the peptide mass calculator.
To illustrate the advantage of the optimization output over the single, userpredetermined design, we used the same input shown in Fig. 1a. In contrast to the peptide calculator, the optimization offers a wide range of random peptide library designs simplified to satisfy the maximum mass diversity condition. This can be seen from the Pareto front that contains all the best solutions (red dots) and the distribution of all the remaining solutions from the final population (blue dots) in Fig. 2. We focused our analysis to the 70% to 100% diversity region and reported on the three best solutions encircled in the zoomed Pareto front graph being BS 1 (100%), BS 2 (94%) and BS 3 (92%). Our implementation of the algorithm also allows the user to pick any point of interest and obtain a detailed analysis containing the output of the peptide mass calculator accompanied by the corresponding sequence logo graphs. All the sequence logos throughout the text are presented in the conventional N to Cterminus fashion.
The library design suggestion for BS 1 consists of 32 peptides having unique mass. The algorithm informs that this specific library can be obtained by simplifying the input from s, e, r, w, a, G in each position x_{i} to {a, G} for \(x_1\), {w, a} for \(x_2\), {r, G} for \(x_3\), {e, r} for \(x_4\) and {s, r} for \(x_5\) (Fig. 2). Similar design suggestions are shown in Fig. 2 for BS 2 and BS 3, where the number of peptides having unique masses increases as well as the mass overlapping. Depending on users requirements, in terms of the number of library members and the desired chemical diversity, the design strategy can shift towards the increasing number of possible peptide permutations and introduction of limited or extended mass overlapping. From the same optimization, BS 50%, BS 30% and BS 10% points were also analyzed. The zoom of the Pareto front in the 10% to 50% mass diversity range with the corresponding sequence logos can be found in supplementary information (Additional file 1: Fig. S1). Having increased the number of amino acids in each position \(x_{i}\) lead to the greater number of peptides unique by mass in BS 50%, BS 30% and BS 10%. However, the total number of peptides increased at a greater rate, thus reducing the overall mass diversity ratio of these libraries.
The size of search space is the number of all possible solutions for the binary encoded genome calculated as \(2^{(r * m)}\) (Fig. 3a). In our case, the complexity of the optimization task is determined by the search space size that is exponentially dependent on the product of m and r. In Fig. 3, we showed how the search space changes from \(10^9\) to \(10^{16}\) for a constant value of m (\(m=6\)) when increasing r from 5 to 10, respectively. The time needed to complete the optimization task is influenced by its complexity (i.e. balance of m and r) and the available computer resources. For peptide lengths from 5 to 10residues long with varying values of m the algorithm completed the optimization tasks in a reasonable time frame (hours to days) with a standard PC configuration. In contrast to the search space complexity for the given set of libraries (Fig. 3b), all the best solutions (BS 1) have only 32 peptides unique by mass, regardless of the increasing number of positions r (Fig. 3a). A set of sequence logo graphs given in Fig. 3c corresponding to 100% mass diversity shows a set of designs available to the user. Their similarity proves that having an algorithmsupported design strategy still requires carefully chosen inputs from the expert user.
To illustrate the utility of the twoobjective optimization, we used the example of an all Dheptapeptide with r = 5 and m = 7, for \(x_i = {s, e, r, w, a, G, i}\); having two fixed positions \(x_3=p\) and \(x_7=y\). The Pareto front for this optimization can be seen in Fig. 4a. When zoomed into the 70–100% mass diversity region, several design solutions appear. We chose to analyze five points being BS 1 (100%), BS 2 (98%), BS 3 (86%), BS 4 (77%) and BS 6 (70%) and show their sequence logos (Fig. 4b and c) to allow better visualization of the design suggestions and possible synthetic challenges. In our region of interest, the algorithm suggests possible designs to obtain maximally diverse random peptide libraries. Interestingly, BS 2 shows a high mass diversity of 98% and only one peptide with overlapping mass. With 47 peptides unique by mass it offers a design possibility with 33% higher total number of peptides and only 2% mass diversity reduction when compared to BS 1. Thus, this would be our design of choice to attempt the experimental validation. The complete list of peptides for this solution can be found in the Additional file 2: Table S1.
To show the versatility of the optimization process, we presented several other examples of library designs: (a) r = 5, m = 6 for \(x_i={s, e, r, w, a, G}\); \(x_6=y\) in Additional file 1: Fig. S2, (b) r = 7, m = 7 for \(x_i={w, r, e, p, s, i, a}\); \(x_7=y\) in Additional file 1: Fig. S3, (c) r = 5, m = 10 for \(x_i={h, f, r, a, n, e, s, y, w, i}\) in Additional file 1: Fig. S4 and (d) r = 6, m = 10 for \(x_i={h, f, r, a, n, e, s, y, w, i}\); \(x_6=G\) in Additional file 1: Fig. S5.
Threeobjective GA to maximize the complexity through sequence diversity
Among the permutations with overlapping masses taken from the examples examined in the previous section (Fig. 4 and Additional file 1: Figs. S1–S5), some entries exhibited different amino acid composition (Additional file 1: Fig. S7). In order to preserve these solutions in the library design, we introduced a third fitness function to distinguish the permutations by their composition. Consequently, the optimization implemented for combinatorial peptide library design computed the number of peptides of unique amino acid composition and maximized their sequence diversity. In addition, the introduction of sequence diversity as a fitness function enables the 3objective GA to find new solutions that may have been neglected in the 2objective setting because of lower mass diversity. An example is a heptapeptide library from the \(x_i={s,e,r,w,a,G,i}\), \(x_3={p}\), \(x_7={y}\) optimization where two permutations with different sequence, aGpGery and rGpGisy, show monoisotopic masses of 748.3505 and 748.3869, respectively (Additional file 1: Fig. S7). In this particular case, the monoisotopic mass of ‘ae’ dipeptide (218.116) overlaps with the mass of the ‘is’ dipeptide (218.079). These two peptides would have been excluded in the previous version of the algorithm with \(T(\Delta mass)=1\), but now they are included because of the increased sequence diversity within the library.
Using the 3objective GA, the optimization results are displayed in a threedimensional Pareto front, from which the 2D projections of both amino acid diversity and mass diversity can be extracted and analysed separately (Additional file 1: Fig. S6). In addition to the output obtained with the 2objective GA, the output of the 3objective optimization contains a separate csv file listing all the permutations unique by sequence. In this setting, we performed the optimization for two libraries: r = 5, m = 6 for \(x_i={s, e, r, w, a, G}\); \(x_6=y\) in Fig. 5a–c and r = 5, m = 7 for \(x_i={s,e,r,w,a,G,i}\); \(x_3=p\), \(x_7=y\) in Fig. 5d–f. We analyzed the output of these examples showing the difference between the sequence and mass diversity of the best solutions, followed by the analysis of the 70% to 100% diversity region of all solutions and finally presenting the sequence logo graphs of two prominent solutions.
The diversity analysis of the libraries (Fig. 5a, d) consists of a parallel representation of sequence and mass diversity distributions on the yaxis for all the best solutions, where each has a different total number of permutations on the xaxis. In these figures, three distinct regions can be observed:

1
for smaller library sizes, composed of 200 or less permutations, the optimization outputs of mass and sequence diversity overlap;

2
for medium sized libraries, containing a 1000 or more permutations, the sequence diversity exhibits a faster growth than the mass diversity;

3
for large libraries with more than 2000 permutations, both diversities saturate and exhibit almost no growth with the increase of library size.
The behavior of the optimization results is summarized in the optimal solutions graph (Additional file 1: Fig. S6d), where two areas of interest are labeled: the overlapping zone (1) where the diversity by sequence and by mass is very similar, and the diversity zone (2) where the diversity by sequence is greater than the diversity by mass. This underlines the overall advantage of the introduction of the third objective in the optimization process that offers a greater possibility to finetune the design and fulfill the users’ requirements. As expected, the number of peptides unique by sequence is higher than the number of peptides unique by mass, but both numbers have a theoretical maximum, which is considerably lower than the total number of permutations within the library.
Figure 5b, e present the 2D graphs of the best solutions (green dots) alongside all the solutions (blue dots) from the optimization representing mass diversity on the xaxis and sequence diversity on the yaxis, relative to the total number of permutations. This representation includes an overview of several other design options available that might be of interest to the user and could lead to highly diverse libraries. We focused our interest to the 70% to 100% diversity region and chose only solutions of sequence diversity higher than 95% even if their mass diversity was lower. Two prominent designs are marked and their sequence logo graphs presented in Fig. 5c, f. Both the examples show sequence diversity of 98% while their diversity by mass is 89% and 81%, respectively. Considering that the sequence diversity is a stronger indicator of chemical diversity within the library, it is evident that these two solutions cover a wide range of chemical properties and possibly offer alternative design choices worth of consideration.
In the current system, if the user is interested in making a hydrophobic library, the input should consist of preferentially hydrophobic amino acids. The algorithmassisted, threeobjective optimization offers different possibilities of peptide designs, shown for the high diversity region (70–100%) as well as for the 10–50% diversity region, in Additional file 1: Figs. S8 and S9.
Comparison of two and threeobjective settings and the effect of \(\mathbf T (\Delta \mathbf mass)\)
Although the computational cost for the threeobjective GA optimization is higher, the benefit of increasing the sequence diversity is visible when comparing the output results, i.e., the best solutions of the 3 and 2GA optimizations. Figure 6 presents this comparison in the 70% to 100% diversity region for two different inputs: (a) r = 5, m = 6 and \(x_i={s, e, r, w, a, G}\); \(x_6=y\) and (b) r = 5, m = 7 and \(x_i={s,e,r,w,a,G,i}\); \(x_3=p\), \(x_7=y\). It can be noticed that the best solutions from the 3objective optimization (red crosses) are more numerous than the 2objective ones (blue dots), indicating the coverage of a larger search space and greater design choices. However, the number of solutions with higher diversity is closely related to the chosen value of the tolerance of mass discrimination \(T(\Delta mass)\). As previously mentioned, we set this parameter to \(T(\Delta mass) = 1\) to discriminate the permutations by mass. Therefore, should a user require lower or higher tolerance values, the ratio between the number of permutations unique by sequence and unique by mass would change accordingly.
As expected, some of the proposed designs overlap in both optimizations as shown in Fig. 6a, b. This suggests that the choice of the optimization method will depend solely on the users’ requirements. When the mass diversity design is sufficient, the 2objective optimization will be the method of choice. Should a user require additional design suggestions based on sequence diversity, it can opt for the more costly but more informative 3objective method.
Next, we explored the effect of varying the tolerance input to values of \(T<1\) and \(T>1\) for the optimization described in Fig. 4 using both, the two (Additional file 1: Figs. S10–S14) and the threeobjective (Figs. 7, 8 and Additional file 1: S14) settings. When changing \(T(\Delta mass)\) to 2, 5, 0.5, 0.1, 0.01 and 0.001, the distribution of best solutions (Fig. 7) and the number of permutations for related mass diversities are affected (Fig. 8). The impact of varying the \(T(\Delta mass)\) from 0.001 to 2.5 on the number of best solutions is visible in the optimization results where a smaller tolerance window yields a lower number of best solutions. An example is the high diversity region (sequence diversity above 90%), where the number of best solutions is 10 for T = 0.5, 9 for T = 0.1, 5 for T = 0.01 and only 4 for T = 0.001 (Fig. 7). In addition, when lowering \(T(\Delta mass)\) to values close to zero such as 0.001 and 0.01, the algorithm behaves as sequence diversity was the measure of library diversity, i.e. it works according to the 3rd objective. This results in mass and sequence diversity being linearly correlated as seen in Fig. 7d.
In agreement with the spread of pareto fronts shown in Fig. 8, the number of permutations increases for decreasing values of \(T(\Delta mass)\) and hence, the pareto front shifts to the right accordingly. By comparing best solutions having 85(\(\pm 1\))% mass diversity (Fig. 8), it can be noticed that the number of permutations increases from 48 (T = 2.5) to 72 (T = 1), 108 (T = 0.5 and T = 0.1), 144 (T = 0.01) and 162 (T = 0.001). For \(T(\Delta mass)\) values below 1, i.e., 0.5, 0.1, 0.01 and 0.001, no differences are observed in terms of number of library components for BS1 (100% diversity). On the other hand, when increasing the \(T(\Delta mass)\) value to 1 and 2.5, BS 1 (100% diversity) shows a decreasing number of library components from 36 (BS1, \(T=1\)) to 32 (BS1, \(T=2.5\)).
Furthermore, in the 2objective setting, by lowering the tolerance window, the number of permutations with unique mass increases, to take into account those sequences that have different amino acid composition but mass difference lower than \(T(\Delta mass)\). As expected, there is little difference in optimization results between \(T(\Delta mass)\) values of 0.5 and 0.1. In the high diversity region BS1, BS2, BS3 and BS5 match while BS 4 shows one more peptide unique by mass for \(T=0.1\). When looking at \(T=0.01\), the differences are more pronounced in terms of mass diversity and number of library components seen for BS 2, BS 3, BS 4 and BS 5 when compared to \(T(\Delta mass)\) values of 0.5 and 0.1 (Additional file 1: Figs. S10–S13).
Conclusions
When dealing with collections of structurally similar compounds, such as peptides that have the same amino acid composition but differential positioning of residues in the sequence (permutations), similar or identical mass, and similar physicochemical properties, it is challenging to discriminate single permutations with high accuracy. Therefore, diversity rather than library size is the key element when designing random peptide libraries for the discovery of novel biologically active peptides.
In this paper, we have presented an algorithmsupported methodology for the design of random peptide libraries. Basing the methodology on the multiobjective genetic algorithms we achieved libraries with maximal number of peptides that seek to maximal mass and/or sequence diversity. In the twoobjective setting, where the goal was mass diversity, the tolerance parameter allows the library designer to define how different two peptides should be in terms of their mass. Moreover, sequence diversity was achieved in the threeobjective setting that offered additional library solutions whit similar mass but differential composition of amino acids. This system could be extended to several other fitness functions and their combinations. A possible future multiobjective GA could take into account the charge or the hydrophobicity/philicity of the library components.
Intelligent systems that operate under controlled conditions allow us to explore huge search spaces and offer a large number of possible solutions. It is up to the user to choose the solution of interest. Throughout a series of examples we highlighted the advantage of having numerous design suggestions before attempting any synthesis of complex mixtures. In this way, we could rationally design a library or different pools of smaller libraries—having desired properties and amino acid compositions—based on informed and careful user choices. Thus, this paper demonstrated the necessity of interlinking the advanced computing capabilities of genetic algorithms and the design of peptide libraries. The size of the search space makes heuristic algorithms the method of choice. In fact, the range of possible solutions is difficult to access to an expert user even for smaller scale inputs.
Materials and methods
The definition of input parameters is presented in Algorithm 1, where r is the number of positions, counter is the number of amino acids at each position and full list is the full amino acid list:
The core NSGAII algorithm is used from the Matlab builtin script gamultiobj. Its default parameters are modified as follows:

‘PopulationSize’ is set to 500,

‘ParetoFraction’ is set to 0.2,

‘PopulationType’ is set to ‘bitstring’.
The binary encoded genome for NSGAII is created as a bitstring of length:
The methodology used to compute the three fitness functions (\(FF_1\), \(FF_2\) and \(FF_3\)) for each Genome is presented graphically in Fig. 9.
Algorithm 2 (readGenome) is used to transform the genome of a solution obtained by NSGAII into a algorithm selected list of amino acids (selected_list):
The analysis of the library solution (selected_list) is followed by the computation of the permutations which constitute the Library. Algorithm permutations uses Matlab builtin function ndgrid to make the computation.
Algorithm (getMass) is used to compute the mass of every permutation within the Library using the following equation:
where M represents the mass of a peptide, \(m(a_i)\) represents the mass of amino acid a at position i within the peptide, and \(m(H_2 O)\) is the mass of a molecule of water. Slightly modifying the Eq. 2, the algorithm computes the average mass, the monoisotopic mass, the singly charged \([M+H]^+\) or the doubly charged \([M+2H]^{2+}\) mass of peptides.
The analysis of the Library is followed by the exclusion of peptides of similar mass implemented in Algorithm 3:
where \(M_i\) and \(M_j\) are masses of permutations at positions i and j within the Library, while \(T(\Delta mass)\) is the tolerance of mass discrimination. If the absolute value of mass difference between permutations at positions i and j is less than \(T(\Delta mass)\), then the permutation at position j is removed from the Library list. Algorithm Exclude similar mass repeats this process until all the permutations are compared and only the ones with unique mass remain in the list (Unique by mass).
The analysis of the Library is also followed by the exclusion of peptides of equal sequence implemented in Algorithm 4:
The algorithm Exclude equal sequence uses the algorithm getMass to speed up the search because peptides of similar mass are candidates for peptides of equal sequence. The candidate permutations within the Library need to be transformed into a list of characters and sorted alphabetically. If two peptides have equal list of sorted characters, one of them is excluded from the Library list. Algorithm Exclude equal sequence repeats this process until all the permutations are compared and only the ones with unique sequence remain in the list (Unique by sequence).
Abbreviations
 GA :

genetic algorithm
 NSGAII :

nondominated solutions genetic algorithm II
 csv :

commaseparated values
 BS :

best solution
 MW :

molecular weight
 OBOC :

onebeadonecompound
 DOS :

diversityoriented systems
 HPLC :

high performance liquid chromatography
 UPLCMS :

ultrahigh performance liquid chromatographymass spectrometry
 BBB :

bloodbrainbarrier
 MHC :

major histocompatibility complex
References
 1.
Hajduk P, Galloway WR, Spring DR (2011) A question of library design. Nature 480:42–43
 2.
Kitchen DB, Decornez H, Furr JR, Bajorath J (2004) Docking and scoring in virtual screening for drug discovery: methods and applications. Nat Rev Drug Discov 3(11):935–949
 3.
Dandapani S, Marcaurelle LA (2010) Accessing new chemical space for “undruggable” targets. Nat Chem Biol 6(12):861–863
 4.
Kodadek T (2011) The rise, fall and reinvention of combinatorial chemistry. Chem Commun 47:9757–9763
 5.
Falciani C, Lozzi L, Pini A, Bracci L (2005) Bioactive peptides from libraries. Chem Biol 14(4):417–426
 6.
Vlieghe P, Lisowski V, Martinez J, Khrestchatisky M (2010) Synthetic therapeutic peptides: science and market. Drug Discov Today 15(1–2):40–56
 7.
Sato AK, Viswanathan M, Kent RB, Wood CR (2006) Therapeutic peptides: technological advances driving peptides into development. Curr Opin Biotechnol 17(6):638–642
 8.
Ladner RC, Sato AK, Gorzelany J, De Souza M (2004) Phage displayderived peptides as therapeutic alternatives to antibodies. Drug Discov Today 9(12):525–529
 9.
Craik DJ, Fairlie DP, Liras S, Price D (2013) The future of peptidebased drugs. Chem Biol Drug Des 81(1):136–147
 10.
Molek P, Strukelj B, Bratkovic T (2011) Peptide phage display as a tool for drug discovery: targeting membrane receptors. Molecules 16(1):857–887
 11.
HamzehMivehroud M, Alizadeh AA, Morris MB, Bret Church W, Dastmalchi S (2013) Phage display as a technology delivering on the promise of peptide drug discovery. Drug Discov Today 18(23–24):1144–1157
 12.
Passioura T, Suga H (2017) A rapid way to discover nonstandard macrocyclic peptide modulators of drug targets. Chem Commun 53(12):1931–1940
 13.
Lam KS, Salmon SE, Hersh EM, Hruby VJ, Kazmierski WM, Knapp RJ (1991) A new type of synthetic peptide library for identifying ligandbinding activity. Nature 354(6348):82–84
 14.
Meldal MP (2002) The onebead twocompound assay for solid phase screening of combinatorial libraries. Biopolymers 66(2):93–100
 15.
Liu R, Li X, Xiao W, Lam KS (2017) Tumortargeting peptides from combinatorial libraries. Adv Drug Deliv Rev 110–111:13–37
 16.
Christensen C, Groth T, Bruun Schiødt C, Tækker Foged N, Meldal M (2003) Automated sorting of beads from a “onebeadtwocompounds” combinatorial library of metalloproteinase inhibitors. QSAR Comb Sci 22(7):737–744
 17.
Schreiber SL (2009) Organic chemistry: molecular diversity by design. Nature 457(4226):153–154
 18.
Galloway WRJD, IsidroLlobet A, Spring DR (2010) Diversityoriented synthesis as a tool for the discovery of novel biologically active small molecules. Nat Commun 1(6):1–13
 19.
O’ Connor CJ, Beckmann HSG, Spring DR (2012) Diversityoriented synthesis: producing chemical tools for dissecting biology. Chem Soc Rev 41(12):4444
 20.
Wang Y, Madsen AØ, Diness F, Meldal M (2017) Diversityoriented syntheses by combining cuaac and stereoselective incic reactions with peptides. Chem Eur J 23(56):13869–13874
 21.
Wright T, Gillet VJ, Green DV, Pickett SD (2003) Optimizing the size and configuration of combinatorial libraries. J Chem Inform Comput Sci 43(2):381–390
 22.
Zhao PL, Nachbar RB, Bolognese JA, Chapman K (1996) Two new criteria for choosing sample size in combinatorial chemistry. J Med Chem 39(2):350–352
 23.
Guixer B, Arroyo X, Belda I, Sabidó E, Teixidó M, Giralt E (2016) Chemically synthesized peptide libraries as a new source of bbb shuttles. Use of mass spectrometry for peptide identification. J Pept Sci 22:577–591
 24.
Granda JM, Donina L, Dragone V, Long DL, Cronin L (2018) Controlling an organic synthesis robot with machine learning to search for new reactivity. Nature 559:377–381
 25.
Graulich N, Hopfb H, Schreiner PR (2010) Heuristic thinking makes a chemist smart. Chem Soc Rev 39:1503–1512
 26.
Krivák R, Hoksza D (2018) P2rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure. J Cheminform 10(39):1–12
 27.
Trobe M, Burke MD (2018) The molecular industrial revolution: automated synthesis of small molecules. Angew Chem Int Ed 59(16):4192–4214
 28.
Gromski PS, Henson AB, Granda JM, Cronin L (2019) How to explore chemical space using algorithms and automation. Nat Rev Chem 3:119–128
 29.
Belda I, Madurga S, Llorà X, Martinell M, Tarragó T, Piqueras MG, Nicolás E, Giralt E (2005) Enpda: an evolutionary structurebased de novo peptide design algorithm. J Comput Aided Mol Des 19(8):585–601
 30.
Belda I, Llorà X, Giralt E (2006) Evolutionary algorithms and de novo peptide design. Soft Comput 10(4):295–304
 31.
Belda I, Madurga S, Tarragó T, Llorà X, Giralt E (2007) Evolutionary computation and multimodal search: a good combination to tackle molecular diversity in the field of peptide design. Mol Divers 11(1):7–21
 32.
Teixidó M, Belda I, Zurita E, Llorá X, Fabre M, Vilaró S, Albericio F, Giralt E (2005) Evolutionary combinatorial chemistry, a novel tool for sar studies on peptide transport across the bloodbrain barrier. Part 2. Design, synthesis and evaluation of a first generation of peptides. J Pept Sci 11(12):789–804
 33.
Teixidó M, Belda I, Roselló X, González S, Fabre M, Llorá X, Bacardit J, Garrell JM, Vilaró S, Albericio F et al (2003) Development of a genetic algorithm to design and identify peptides that can cross the bloodbrain barrier. QSAR Comb Sci 22(7):745–753
 34.
Madurga S, Belda I, Llorà X, Giralt E (2005) Design of enhanced agonists through the use of a new virtual screening method: application to peptides that bind class I major histocompatibility complex (mhc) molecules. Protein Sci 14(8):2069–2079
 35.
Yoshida M, Hinkley T, Tsuda S, AbulHaija YM, McBurney RT, Kulikov V, Mathieson JS, Galiñanes Reyes S, Castro MD, Cronin L (2018) Using evolutionary algorithms and machine learning to explore sequence space for the discovery of antimicrobial peptides. Chem 4:533–543
 36.
Fjell CD, Jenssen H, Cheung WA, Hancock RE, Cherkasov A (2011) Optimization of antibacterial peptides by genetic algorithms and cheminformatics. Chem Biol Drug Des 77(1):48–56
 37.
Porto WF, Irazazabal L, Alves ES, Ribeiro SM, Matos CO, Pires ÁS, Fensterseifer IC, Miranda VJ, Haney EF, Humblot V et al (2018) In silico optimization of a guava antimicrobial peptide enables combinatorial exploration for peptide design. Nat Commun 9(1):1490
 38.
Neuhaus CS, Gabernet G, Steuer C, Root K, Hiss JA, Zenobi R, Schneider G (2019) Simulated molecular evolution for anticancer peptide design. Angew Chem Int Ed 58(6):1674–1678
 39.
Gillet VJ, Khatib W, Willett P, Fleming PJ, Green DV (2002) Combinatorial library design using a multiobjective genetic algorithm. J Chem Inform Comput Sci 42(2):375–385
 40.
Meinl T, Ostermann C, Berthold MR (2011) Maximumscore diversity selection for early drug discovery. J Chem Inf Model 51(2):237–247
 41.
Nicolaou CA, Brown N (2013) Multiobjective optimization methods in drug design. Drug Discov Today Technol 10(3):427–435
 42.
Mauša G, Galinac Grbac T (2018) Coevolutionary multipopulation genetic programming for classification in software defect prediction: an empirical case study. Appl Soft Comput 55:331–351
 43.
Coello CAC, Lamont GB, Van Veldhuizen DA (2007) Evolutionary algorithms for solving multiobjective problems. Springer, Boston
Authors' contributions
DK conceived and designed the project. GM designed and implemented the algorithms. GM, DK, TT and EG analyzed the data. GM and DK wrote the paper. EG did a critical revision of the manuscript. All authors read and approved the final manuscript.
Acknowledgements
We would like to thank the ERASMUS+ staff mobility program 2017 which founded the mobility for GM and thus allowed the first contact between IRB, Barcelona and UNIRI, Rijeka. We would also like to thank Prof. Tihana Galinac Grbac and EVOSOFT Project (UIP2014097945) for their valuable support.
Competing interests
The authors declare that they have no competing interests.
Availability of data and materials
All the data on which the conclusions of the manuscript rely are thoroughly presented throughout the manuscript. The algorithms needed to reproduce the results are stated in “Materials and methods” section.
Funding
This study was funded by the Ministry of Economy and Competitiveness (MINECO) and the European Fund for Regional Development FEDER (BIO 201675327R, RTC201543361, PCIN2015052); the Generalitat de Catalunya (XRB and 2014SGR521); RecerCaixa Foundation and Fundación Ramón Areces. The authors would also like to thank the FP7 Marie Curie Actions (COFUND program; grant agreement no.IRBPostPro2.0 600404) for funding. IRB Barcelona is the recipient of a Severo Ochoa Award of Excellence from MINECO (Government of Spain). This work was supported in part by University of Rijeka under the project number 18.10.2.1.01.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Author information
Additional files
13321_2019_347_MOESM1_ESM.pdf
13321_2019_347_MOESM2_ESM.pdf
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Kalafatovic, D., Mauša, G., Todorovski, T. et al. Algorithmsupported, mass and sequence diversityoriented random peptide library design. J Cheminform 11, 25 (2019) doi:10.1186/s1332101903476
Received
Accepted
Published
DOI
Keywords
 Peptide libraries
 Onebeadonecompound
 Algorithmsupported design
 Genetic algorithm
 Optimization