11th German Conference on Chemoinformatics (GCC 2015)

This presentation will describe how structural biology, molecular pharmacology, and medicinal chemistry studies can be combined with molecular modeling and chemoinformatics analyses for a more accurate description and prediction of the structural determinants of protein-ligand binding, functional activity, and selectivity. The challenges and possibilities of structural chemogenomics studies will be discussed, including the integration of large volumes of heterogeneous pharmacological and chemical data for different protein targets and the development of structure-based virtual screening and computer-aided drug design approaches to discover novel small-molecule ligands with well-defined functional activity and protein selectivity profiles. The potential of molecular dynamics simulation methods to complement hybrid structural biology studies will be demonstrated for the investigation of the mechanisms of conformational selection and protein-ligand binding kinetics. In the final part of the presentation, structural protein-ligand interaction databases will be described that link structure-based protein-ligand interaction maps to protein ligand topology and can be used as structural chemogenomics tools to navigate medicinal chemistry space.

The Chemistry-Information-Computer (CIC) division [1] of the German Chemical Society (Gesellschaft Deutscher Chemiker e.V.) hosted the 11th German Conference on Chemoinformatics (GCC2015) from the 8th to the 10th of November 2015 in Fulda, Germany [2]. The conference reflected the new role of chemoinformatics in the modern information world. The topics discussed related to the use of computers in chemistry, pharmacy, material sciences and biology. The Program Committee took great care to compile a scientific program that covered a wide range of subjects: from chemo- and bioinformatics to molecular modelling, from fundamental academic research to industrial applications. The plenary sessions, which comprised a total of 28 oral presentations, started off with the newly introduced session Sunday Highlights, which featured three keynotes from high-profile speakers. Subsequent sessions were a mix of invited keynote speakers and selected contributions submitted by the community. The session Research Telegrams was dedicated exclusively to presentations on the current status of PhD students' work. In addition, 61 poster contributions were presented in two poster sessions. The more than 135 attendees from 12 nations demonstrated that the German Conference on Chemoinformatics is a well-established event in the international chemoinformatics and modelling community (Fig. 1). The CIC-Award for Computational Chemistry honors outstanding doctoral dissertations and master's theses in the scientific specialties of the Chemistry-Information-Computer division. In 2014, the German Conference on Chemoinformatics was held jointly with the International Conference on Chemical Structures (ICCS) [3], which is why the CIC-Award for Computational Chemistry was presented for both 2014 and 2015 at the GCC2015.
The CIC-Award for Computational Chemistry for the best dissertation of 2014 was granted to Achim Sandmann, who prepared his dissertation under the supervision of Dr. Harald Lahnig at the Friedrich-Alexander-Universität Erlangen-Nürnberg. Both Patrick Kibies, supervised by Prof. Dr. Stefan Kast from the TU Dortmund, and Manuel Ruff, supervised by Prof. Dr. Frank Böckler from the Eberhard Karls University Tübingen, were granted the CIC-Award for Computational Chemistry for their outstanding master's theses. Among the six oral contributions presented in the plenary session Research Telegrams, the work of Markus Zimmermann titled "ChemPLPXB: Implementation of QM-based terms for the recognition of halogen bonding in drug design" was selected as the best dissertation of 2015. Markus Zimmermann prepared his dissertation under the supervision of Prof. Dr. Frank Böckler at the Eberhard Karls University Tübingen (Fig. 2).
More than 20 years after their first peak in the 1980s and 1990s, therapeutic peptides and proteins are currently experiencing a renaissance as drug molecules, as they are believed to offer key advantages over traditional small molecules, such as fewer toxic side effects. Identification of compounds for clinical development still requires multiparameter optimization, as these modalities may suffer from potential liabilities in chemical stability, solubility and aggregation propensity. In this talk, some of the work in the area of structure-activity and structure-property modelling and its application in the design of new peptides and proteins will be highlighted. Traditionally, many computational prediction tools reported in the literature are based purely on sequence-based descriptors. Our strategy bridges the worlds of bioinformatics and chemoinformatics by using approaches that are based on 3D structures of individual amino acids as well as on the full 3D structure of peptides or proteins. Here, the applicability of such methods to unnatural or modified amino acids is of particular importance, as introducing those into the peptide sequence might significantly affect biophysical peptide properties. While the Z-scale descriptors introduced originally by Wold [1] are based on experimental values, we have applied computational amino acid descriptors [2] derived using the CoMFA method [3] for building QSAR models, which are amenable to general amino acids. Details will be shown on how the descriptors are derived, as well as approaches for the analysis and interpretation of the resulting models.
J Cheminform 2016, 8(Suppl 1):S18
of in vitro approaches to target deconvolution. However, these tend to be of lower throughput and better placed later in a screening cascade. So there is a real need for in silico based approaches that can be deployed early on in a drug discovery programme to identify potential MOAs.
Using publicly available data on the Published Kinase Inhibitor Set (PKIS) [2,3], we describe the application of Formal Concept Analysis (FCA), an association mining technique with roots in set theory, to the problem of deconvoluting a phenotypic screen. We describe each compound in the PKIS by the set of kinases it inhibits. We then construct a Galois lattice, in which each node corresponds to a set of compounds inhibiting a common set of kinases and two nodes are connected if the compound set of the child node is a subset of the compound set of the parent node. Lattice nodes enriched with compounds that promote neurite outgrowth in rat indicate which kinases should be targeted when seeking small molecules that encourage CNS axon repair following injury. The targets we identify using this unsupervised and interpretable approach are in line with those identified in [3], where the authors use a combination of support vector machines, considered a black-box method, and mutual information, and then confirm the targets in siRNA studies.
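The lattice construction described above can be illustrated with a minimal sketch. The compound and kinase names below are hypothetical toy data, not PKIS results, and the brute-force enumeration is only an assumption chosen for readability; it would not scale to a real screening set, and the abstract does not state which FCA algorithm the authors used.

```python
from itertools import combinations

# Hypothetical toy context: compound -> set of kinases it inhibits.
INHIBITION = {
    "cpd1": {"KIN_A", "KIN_B"},
    "cpd2": {"KIN_A", "KIN_B", "KIN_C"},
    "cpd3": {"KIN_B", "KIN_C"},
    "cpd4": {"KIN_C"},
}


def common_kinases(compounds, context):
    """Kinases inhibited by every compound in the set (all kinases if the set is empty)."""
    result = set().union(*context.values())
    for compound in compounds:
        result &= context[compound]
    return frozenset(result)


def compounds_inhibiting(kinases, context):
    """Compounds that inhibit every kinase in the given set."""
    return frozenset(c for c, ks in context.items() if kinases <= ks)


def formal_concepts(context):
    """Enumerate all (compound set, kinase set) concepts by closing every compound subset."""
    concepts = set()
    names = sorted(context)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            intent = common_kinases(subset, context)
            extent = compounds_inhibiting(intent, context)
            concepts.add((extent, intent))
    return concepts


def lattice_edges(concepts):
    """Parent -> child edges: the child's compound set is a proper subset of the
    parent's, with no other concept strictly in between."""
    edges = []
    for child in concepts:
        for parent in concepts:
            if child[0] < parent[0] and not any(
                child[0] < mid[0] < parent[0] for mid in concepts
            ):
                edges.append((parent, child))
    return edges


concepts = formal_concepts(INHIBITION)
edges = lattice_edges(concepts)
for extent, intent in sorted(concepts, key=lambda c: (len(c[0]), sorted(c[0]))):
    print(sorted(extent), "share", sorted(intent))
```

Each printed node pairs a compound set with the kinases all of its members inhibit. Enrichment of phenotypically active compounds could then be assessed per node, for example by comparing the fraction of actives in a node with the fraction in the whole set.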
Chemists pursue the discovery of new chemical reactions to meet the demand for transformations that are more efficient, are more environmentally benign or allow the synthesis of previously inaccessible molecules [1]. Usually, discovery starts with an idea derived from knowledge, which is subsequently tested in experiments. The outcome of the experiments then increases knowledge, leading to a cyclic process. In cognitive science, initial hypothesis generation is attributed to analogical reasoning, and creativity is described as a process of combining seemingly unrelated knowledge [2]. However, as published chemical knowledge increases rapidly, many interesting hypotheses may never be recognized. Computational approaches for the discovery of unprecedented reactions are therefore desirable, but they are seldom reported and not applicable to modern catalytic reactions [3,4]. We have developed a scalable model to obtain hypotheses for unprecedented catalytic chemical reactions based on a set of known reactions. Given just the starting materials as input, our model proposes hypotheses about the product and the reaction conditions by identifying analogies in the reactivity of the reactants. In a validation study, the model correctly assigns 94% of a reaction set unknown to the model as plausible reactions and provides the correct product structures. In further studies, we show that our model can infer correct products and accurate (or highly similar and reasonable) reaction conditions, including catalysts and reagents, for recently published reactions.
Crime scene investigation and evidence analysis are increasingly challenged by rapidly growing crime. Over the last decade the crime situation has changed dramatically, and forensic science must adapt as well as possible to current challenges (e.g. cybercrime). To keep forensics up to date, research projects for the development of new and powerful methods are essential. Over time, analytical advances have provided possibilities for casework analysis that could not be imagined a few years earlier, and forensic science must benefit from these advances. It needs to be evaluated regularly how new analytical methods can support forensic investigations. Previously unsolved problems can be addressed by introducing new techniques into forensics, provided their applicability in casework is considered; the feasibility of a method for casework must always be kept in mind. Besides the development of new techniques, established methods need to be optimized regularly to obtain the most reliable results for crime scene investigations. For these reasons, research and development is a major topic in forensics in general and for the forensic science institute of the Bundeskriminalamt (Germany) in particular. Selected research projects will be presented to give an overview of potential forensic methods for the future.
A basic challenge for biomolecular simulations is the necessity to study comparatively large and complex structures on the order of at least tens of nanometers for comparatively long times up to the microsecond scale. A mesoscopic coarse-grained approach neglects precise atomic interactions but tries to keep essential features of the complex system of interest, and thus may be helpful to adequately describe average collaborative interactions within large biomolecular ensembles and their effects.
The new mesoscopic approach comprises the Molecular Fragment Dynamics (MFD) variant of the established mesoscopic Dissipative Particle Dynamics (DPD) simulation technique, adequate fragmentation schemata for all biomolecular species of interest (e.g., phospholipids, amino acids, peptides and proteins), the construction of a versatile biomolecular fragment set, Cheminformatics concepts and tools for fragment structure representations and handling, concepts for peptide and protein 3D structure treatment like backbone potentials for structural rigidity/flexibility and an integrative software architecture for practical realization [1]. First simulation results for phospholipid membranes, peptides/proteins and their mutual interactions demonstrate opportunities and limitations [2].
We investigate the methodological development of coarse-grained 'mesoscale' simulations for biomolecular affinity calculations such as ligand-protein interactions, where the ligand is potentially of macromolecular dimension. The ambitious aim is to reach a thermodynamic accuracy of 1-3 kT in the prediction of the free energy of binding by including all relevant interactions and using soft potentials, thermodynamic integration, and calibration against engineering thermodynamics data. In this approach, small groups of 2-3 heteroatoms are combined into a single simulation unit. Such a bead, in typical mesoscale nomenclature, is to be considered a tiny molecular fragment. Since screening many thousands if not millions of compounds is the rule in drug discovery, the all-determining factor for the success of the approach is the fast automation of fragmentation and the ensuing parameterization (Automated Fragmentation Parameterization, AFP). We have developed an AFP protocol that combines elements from chemical informatics (for finding optimal fragments) and engineering thermodynamics (for parameterization through surface charge interactions). The first results look promising. We have calculated fragments for ~100k molecules (PubChem database) by rule-based simulated annealing. Classification of the fragments shows a frequency distribution that follows a typical Zipf regression. We estimate that five to ten thousand unique fragments are enough to cover all organic molecules in the CAS database. For the parameterization, we rely heavily on non-biological engineering thermodynamics data, such as vapor-liquid equilibrium diagrams from the NIST database (Thermodynamics Data Engine). This unique data set covers excess Gibbs energies for a wide range of binary fluid systems and is an ultimate test for any simulation that aims to produce results of thermodynamic accuracy.
So far, these data have been the bread and butter of ultrafast and accurate equation-based engineering thermodynamics models, such as UNIFAC or COSMO-RS. Simulation methods have until now been too slow and inaccurate for massive validation. Our mesoscale simulation approach reaches an accuracy of a fraction of a kT per fragment (1-3 kT per molecule), which is on par with or better than the engineering methods, with the enormous advantage that our approach is intrinsically three-dimensional. All of this is achieved within very reasonable calculation times on modest computational resources. We complement the presentation with results on logP prediction and on lignin, keratin and kinase modelling. The software development is sponsored by CULGI, a large consortium from oil, personal care and chemical industries.
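The Zipf behaviour reported for the fragment classification can be checked with a simple rank-frequency regression. The sketch below is illustrative only: the counts are synthetic values generated from an ideal Zipf law rather than the authors' PubChem fragment statistics, and an ordinary least-squares fit in log-log space is assumed as the regression method.

```python
import math


def zipf_fit(counts):
    """Least-squares fit of log(frequency) = intercept + slope * log(rank).

    A Zipf-like distribution gives a roughly straight line with a negative slope.
    Requires at least two positive counts.
    """
    freqs = sorted((c for c in counts if c > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic occurrence counts for 50 hypothetical fragment types (ideal Zipf law, exponent 1).
example_counts = [1200 / rank for rank in range(1, 51)]
slope, intercept = zipf_fit(example_counts)
print(f"fitted Zipf exponent: {-slope:.2f}")
```

With real fragment counts, the fitted exponent and the deviation of the tail from the straight line would indicate how quickly additional molecules stop contributing new fragment types.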
Solvation plays an essential role for all kinds of physiologically relevant systems and phenomena. Not only water as the key solvent for life but also electrolytes and osmolytes modulate molecular interactions with implications for structure, dynamics, and thermodynamics. Computational modeling of (bio-)molecular systems in solution requires taking into account the much larger number of solvent particles in comparison with solute species. This results in vast technical effort, e.g., to treat molecular dynamics by atomistic simulations in full detail although most of the explicitly considered solvent molecules have only statistical influence on biological function due to the separation of dynamical time scales of solute and solvent. An efficient way to account for solvent-mediated effects in atomistic simulations is to treat the solvent statistically, either by structureless dielectric continuum models or by computing solvent distribution functions around solute systems. In this talk, the basic concepts of solvation modeling frameworks will be presented and illustrated. After developing the hierarchy of models, particular attention will be paid to analytical theories based on the statistical-mechanical integral equation formalism. While originally developed for simple liquids, these methods have been adapted over the years to be applicable to complex biomolecular scenarios [1,2]. Due to the massive reduction of the number of degrees of freedom compared to atomistic simulations, such a framework can also be coupled to polarizable potential models as well as to a quantum-mechanical description of solute-solvent interactions, thus allowing for enhanced predictive capabilities [3,4]. Several example applications will be presented, including hydration patterns around biomolecules, chemistry under extreme thermodynamic conditions, and complicated problems of coupled protomeric, tautomeric [5] and conformational equilibria of drug-like molecules in solution.
Recent technological advancements in the field of health science have brought with them a deluge of data that is waiting to be interpreted. It is critical to understand the information that can be derived from this wealth of data, not just to identify the chemical, biological and/or genetic markers and patterns of a particular disease but also to use this knowledge to design more potent and selective medicines. However, partly driven by the urgency of this need, we are generating more data than we are able to analyze and interpret. Most diseases display a 'pathological footprint' on a systems biology level, even if the observed phenotype is localized to a tissue or organ. In order to understand and explain such activity, integration of biological and chemical information, such as chemical fingerprints with bioactivity profiles, gene expression patterns, pathway annotations and protein interaction networks, is vital. This study focuses on integrating experimentally derived chemical, biological and pathological data to identify and explain compound mode-of-action and its impact on disease regulation. Such analysis will not only help define a therapeutic approach towards more efficient and less promiscuous drugs, but also expedite this process by identifying currently approved drugs for repurposing that could potentially reverse disease signatures. Drug combinations, which have been shown previously to be a strong approach towards tackling the issues of drug resistance and specificity, will be the primary focus of this study. This talk will highlight in silico approaches towards next-generation drug discovery, along with relevant resources and datasets that are or can be integrated to achieve a 'systems biology' picture of disease progression and regulation.
Statistical modeling (also termed QSAR/QSPR) is a general name for a host of methods that correlate a specific activity for a set of compounds with their structure-derived descriptors by means of a mathematical model. The method has been widely applied in many fields including chemistry, biology, and environmental sciences, with a particular emphasis on drug design. In recent years less "traditional" QSAR studies have emerged, focusing for example on the design of food, cosmetics and oil products, high-energy materials and solar cells. These QSAR models are often referred to as M(material)QSAR and form part of the growing field of Material Informatics. This seminar will focus on the application of statistical modeling techniques in material sciences, discussing, through selected applications, methods typically used in this field (e.g., PCA, clustering, and linear and non-linear regression), material descriptors (e.g., material composition and material spectra, which are particularly useful when the exact composition or structure of the material is unknown) and the challenges in obtaining them. Special emphasis will be put on the application of MQSAR in the design of photovoltaic (PV) cells of various types. In particular, cells made entirely of metal oxides (MOs) have the potential to provide clean and affordable energy if their power conversion efficiencies are improved. Such improvements require the development of new MOs, which in turn could benefit from combining combinatorial material sciences for producing solar cell libraries with statistical tools to direct synthesis efforts. For this purpose we developed a QSAR workflow with several novel components [1,2] and applied it to the analysis of a diverse set of solar cell libraries [3].
Our results demonstrate that MQSAR models with good prediction statistics for multiple solar cell properties could be developed and that these models highlight important factors affecting these properties in accord with experimental findings. The resulting models are therefore suitable for designing better solar cells. We further demonstrate that the similar property principle commonly invoked in pharmaceutical design could be extended to PV cells.
Determination of similarities between protein binding pockets is an important challenge in computer-aided drug design. To this end, Cavbase was developed as a tool for the automated detection, storage, and classification of putative protein binding sites [1]. Binding sites are characterized as sets of pseudocenters, which denote surface-exposed physicochemical properties and can be used to enable mutual binding site comparisons. However, these comparisons tend to be computationally very demanding, so that the similarity measures are often slow to compute. In this contribution, improved and accelerated methods for the comparison of protein binding sites are presented. We propose a novel and efficient modeling formalism that does not increase the size of the graph model used in Cavbase, but leads to graphs containing considerably more information assigned to the nodes, due to the introduction of additional descriptors which consider local surface characteristics. Combined with a heuristic for the efficient detection of maximum common subgraphs, this leads to a gain of information and enables much faster but still very accurate comparisons between different structures. Moreover, another accelerated version is discussed which makes use of graph partitioning [3]. For this purpose, graphs are split into disjoint components with regard to pseudocenter types prior to their comparison. This leads to seven graphs that are much smaller than the original one and thus to another significant speed-up with only a small loss of accuracy. Finally, a program is introduced which allows for ultra-fast similarity comparisons, as protein binding sites are represented by sets of distance histograms that are both generated and compared with linear complexity [4]. This method attains a speed of more than 20,000 comparisons per second, which makes screenings across large datasets and even entire databases easily feasible.
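The histogram-based comparison can be sketched as follows. The abstract does not specify how the distance histograms are defined, so this example makes several assumptions that should not be read as the published method: distances are measured from each pseudocenter to the centroid of the binding site (which keeps generation linear in the number of pseudocenters), one histogram is built per pseudocenter type, and similarity is the mean normalized histogram intersection. The type labels, coordinates, bin width and bin count are hypothetical.

```python
import math
from collections import defaultdict


def distance_histograms(pseudocenters, bin_width=2.0, n_bins=10):
    """One histogram per pseudocenter type of distances to the site centroid.

    pseudocenters: non-empty list of (type_label, (x, y, z)) tuples.
    Two passes over the points (centroid, then binning), so the cost is linear.
    """
    n = len(pseudocenters)
    centroid = tuple(sum(p[1][axis] for p in pseudocenters) / n for axis in range(3))
    hists = defaultdict(lambda: [0] * n_bins)
    for label, coords in pseudocenters:
        d = math.dist(coords, centroid)
        # Distances beyond the last bin edge are collected in the last bin.
        hists[label][min(int(d / bin_width), n_bins - 1)] += 1
    return dict(hists)


def histogram_similarity(h1, h2):
    """Mean normalized histogram intersection over all types; 1.0 means identical."""
    labels = set(h1) | set(h2)
    scores = []
    for label in labels:
        a, b = h1.get(label), h2.get(label)
        if a is None or b is None:
            scores.append(0.0)  # type present in only one site
            continue
        overlap = sum(min(x, y) for x, y in zip(a, b))
        scores.append(overlap / max(sum(a), sum(b)))
    return sum(scores) / len(scores)


# Hypothetical binding sites with three pseudocenters each (coordinates in angstroms).
site_a = [("donor", (0.0, 0.0, 0.0)), ("acceptor", (4.0, 0.0, 0.0)), ("aromatic", (0.0, 3.0, 0.0))]
site_b = [("donor", (0.0, 0.0, 0.0)), ("acceptor", (10.0, 0.0, 0.0)), ("aromatic", (0.0, 3.0, 0.0))]

hist_a = distance_histograms(site_a)
hist_b = distance_histograms(site_b)
print(histogram_similarity(hist_a, hist_a))
print(histogram_similarity(hist_a, hist_b))
```

Because the comparison only walks over fixed-length histograms, its cost does not depend on solving a subgraph matching problem, which is the kind of trade-off between speed and accuracy the abstract describes.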
The practical use of the new methods is proven by a successful prospective virtual screening study that aimed at the identification of novel inhibitors of the NMDA receptor.
Fragment-based methodologies have become an alternative to conventional high-throughput screening during the last decade [1]. One of the advantages is the higher hit rate due to the much smaller fragment space. Identified hits not only serve as starting points for drug design but also represent sweet spots in chemical space. Further development can therefore be promoted by sampling the chemical space around the fragment hits. This was applied to evolve, via fragment growing, fragment hits identified through a 3D pharmacophore-based virtual screening [2] against the viral 3C protease. First, fragment growing was carried out through a de novo design [3] workflow in which the fragment hit core is combined with tangible chemical building blocks via organic synthesis reactions encoded as SMIRKS patterns to ensure synthetic feasibility later on. To select de novo designed fragment hit analogues for chemical synthesis, the generated library was analyzed through virtual screening for target binding along fragment growing vectors defined through structure-based design, and for chemical space population by principal component analysis. A subset with promising binding characteristics was selected for chemical synthesis and subsequent biological evaluation. In this work we show how fragment growing can be supported by efficient chemical space sampling to suggest promising de novo designed analogues, which are synthetically feasible and thus enable rapid fragment growing.
Modelling protein flexibility is still a highly challenging objective in various fields of computer-aided drug design. Many applications make use of structure ensembles as a convenient way of incorporating structural variability of proteins.
Due to the steadily growing number of available protein structures, alternative conformations can now be provided for a multitude of different targets. However, ensemble generation is mostly a time-consuming process which often requires manual interaction. In order to simplify the usage of protein ensembles, we developed a fully automated preprocessing workflow for the enrichment and the analysis of protein flexibility information [1]. In a first step, alternative conformations of a given binding site are extracted from an arbitrary selection of protein structures, e.g. the current version of the Protein Data Bank (PDB). Using an indexed database that has been specially geared to this purpose makes this step highly efficient. Afterwards, the resulting structures are aligned with our new active site alignment algorithm ASCONA [2], which has been developed with a focus on an accurate detection of protein backbone variations. Based on the active site alignment, a further step detects flexible regions within the binding site and uses the remaining rigid core for a superimposition of the generated ensemble. If required, a subsequent filtering step reduces the ensemble to a set of relevant protein conformations. Finally, the resulting ensemble structures are protonated with our hydrogen prediction tool Protoss [3], considering tautomerism and protonation states of both protein and ligand molecules. In summary, our preprocessing workflow constitutes a very convenient, reliable, and particularly fast way to generate structural ensembles for any application of interest.
Halogen bonding is rapidly gaining recognition in medicinal chemistry and related fields, with a concomitant need for reliable evaluation of the quality of the interaction [1]. Several MM parameterizations and QM/MM methods have recently been developed to facilitate the study of these interactions. We made extensive use of QM model calculations at the MP2/TZVPP level of theory to systematically map the relationship between strength and geometry of halogen bonds to different interaction partners (carbonyl backbone, sulfur contacts, nitrogen contacts, carboxylates, and π-systems) [2][3][4]. We evaluated the potential for molecular design of additional halogen bonds in existing protein-ligand complexes of the PDB by applying XBScore, our first QM-derived scoring function for the recognition of contacts to the carbonyl backbone [5]. This approach has been experimentally validated on the protein kinase p38α with halogenated ligands using DSF and FP assays. In addition, we used support vector regression to develop a QM-based scoring function for the recognition of halogen bonds targeting methionine. We integrated both scoring functions in ChemPLP and investigated their potential to improve pose retrieval for a test set of crystal structures featuring halogen bonds.
Computational approaches have become indispensable for drug design campaigns but also as an auxiliary tool for structural biology [1]. In the field of G protein-coupled receptors (GPCRs), the combination of in silico methods and pharmacological experiments represents a strong alliance for novel functional insights. Recent achievements in GPCR crystallography provide us with new structural data on GPCRs. However, these structures represent only single static views of highly flexible proteins. The combination of crystallographic data with state-of-the-art computer-driven simulations allows for a mechanistic view on this highly important class of drug targets. Taking muscarinic acetylcholine receptors (MAChRs) as representative examples, we explain how GPCRs can be modulated in a ligand-dependent and predictable manner. Starting from carefully developed homology models of MAChRs, extensive binding mode analyses were performed by means of protein-ligand docking and 3D pharmacophore modeling. In order to sample the flexibility and the dynamic properties of the receptor-ligand complexes, we carried out all-atom molecular dynamics simulations [2]. This combination led to mechanistic GPCR models that comprise both inactive and active-like receptor states. After characterization of the orthosteric binding pocket, we focused on dualsteric ligand binding [3]. Such ligands simultaneously bind to the orthosteric and the allosteric binding site and combine the high affinity of orthosteric ligands with the high specificity of the allosteric binding site. We show, for the first time, the structural basis for dualsteric GPCR modulation on a molecular level. Our functional GPCR models are able to rationalize subtype selectivity and biased signaling. Additionally, our models explain a novel concept for partial agonism, termed dynamic ligand binding. Taken together, our dynamic models illustrate how distinct conformational states can be stabilized in a ligand-dependent manner.
This offers the possibility to rationally design specific modulators for MAChRs but also for other GPCRs.
CO2 absorption in ionic liquids (ILs), which are suitable solvents with tunable properties, has attracted great interest from scientists in the last decade. Given the huge variety of possible ILs (~10¹⁵), the CO2 solubility can vary strongly with the selected ions. To save the time needed to select the most promising IL candidates experimentally, we suggest simple theoretical protocols to predict the CO2 absorption behaviour of ionic liquids [1]. It was found that strong interaction with the anions of the IL, i.e., the formation of carboxylates with O-C-O angles <140°, corresponds to chemical absorption, whereas weak interaction (O-C-O angle >170°) indicates physical absorption (Fig. 3). A predictive estimate with a clear distinction between physical and chemical absorption can be obtained simply by carrying out the geometry optimization in the presence of a solvation model instead of optimizing only in the gas phase, as has been done so far. From these solvated geometries, free energies are derived which compare very well with experiment [2]. We also correlate the calculated energies with experimental gas capacities (mol CO2 per mol IL) in order to estimate the potential capacity close to room temperature. Within the suggested protocol, the most promising anions, potentially useful for designing ionic liquids for reversible chemical absorption of CO2, are defined by a reaction Gibbs free energy in the range from −30 to 16 kJ mol⁻¹.
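The angle criterion above lends itself to a short worked example. The following sketch computes the O-C-O angle from Cartesian coordinates of an optimized geometry and applies the two thresholds stated in the abstract (<140° chemical, >170° physical). The coordinates are hypothetical illustrative values rather than results from the study, and the label "unassigned" for angles between the two thresholds is an assumption, since the abstract does not classify that range.

```python
import math


def angle_deg(a, b, c):
    """Angle a-b-c in degrees for three Cartesian points, with b as the vertex."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.sqrt(sum(x * x for x in v1)) * math.sqrt(sum(x * x for x in v2))
    # Clamp to [-1, 1] to guard against rounding errors before acos.
    cos_theta = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_theta))


def classify_absorption(oco_angle):
    """Apply the O-C-O angle thresholds quoted in the abstract."""
    if oco_angle < 140.0:
        return "chemical"
    if oco_angle > 170.0:
        return "physical"
    return "unassigned"  # assumption: the abstract does not classify 140-170 degrees


# Hypothetical geometries (angstroms): near-linear CO2 and a bent, carboxylate-like CO2.
carbon = (0.0, 0.0, 0.0)
linear_o1, linear_o2 = (-1.16, 0.0, 0.0), (1.16, 0.0, 0.0)
bent_o1 = (1.25, 0.0, 0.0)
bent_o2 = (1.25 * math.cos(math.radians(130.0)), 1.25 * math.sin(math.radians(130.0)), 0.0)

linear_angle = angle_deg(linear_o1, carbon, linear_o2)
bent_angle = angle_deg(bent_o1, carbon, bent_o2)
print(round(linear_angle, 1), classify_absorption(linear_angle))
print(round(bent_angle, 1), classify_absorption(bent_angle))
```

In a screening protocol of the kind described, such a geometric check would be applied to the solvation-model-optimized anion-CO2 complex before the more expensive free energy evaluation.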
Antagonizing the human M3 muscarinic receptor (hM3R) over a long time is a key feature of modern bronchodilating COPD drugs aiming at symptom relief. The long duration of action of the antimuscarinic drug tiotropium and its kinetic subtype selectivity over hM2R have been investigated by kinetic mapping of the binding site and the exit channel of hM3R. To this end, dissociation experiments have been performed with a set of molecular matched pairs of tiotropium on a large variety of mutated variants of hM3R. The exceedingly long half-life of tiotropium (more than 24 h) is attributed to interactions in the binding site; in particular, a highly directed interaction of the ligand's hydroxy group with an asparagine (N508^6.52) prevents rapid dissociation via a snap-lock mechanism. The kinetic selectivity over hM2R, however, is caused by differences in the electrostatics and in the flexibility of the extracellular vestibule. Extensive molecular dynamics (MD) simulations based on the M3R X-ray structure were performed, and various unexpected changes in receptor flexibility and solvent networks upon ligand binding were seen. Minor changes in the ligand structure can lead to breathing motions of the entry channel in hM3R on long time scales, and only the most optimized ligand (i.e. tiotropium) tends to freeze all large-scale movements. Similar effects are seen in MD simulations of mutated receptor variants. These observations may help to understand differences in the residence times of various ligands. Investigations like this may prove useful for qualitatively predicting differences in receptor residence times for structurally closely related ligands.
Putting together the pieces: building a reaction-centric electronic lab notebook for mobile devices.
Protein kinases are involved in a variety of diseases including cancer, inflammation, and autoimmune disorders. Thus, the development of new kinase inhibitors has been a major focus in pharmaceutical research over the last decades.
Although this has resulted in 25 FDA-approved drugs to date, only a small subset of kinases are established key targets, while most kinase inhibitors are unintentionally promiscuous. Using the wealth of available kinase structures (over 3600 PDB structures), we analysed the druggability of the entire human kinome and prioritized (as yet) untapped kinases for drug discovery efforts [1]. For this, representative kinase structures were selected and the missing part of the kinome was modelled via homology modelling. DoGSiteScorer [2] was used to calculate geometrical and physicochemical properties of the ATP pockets and to predict the potential of each kinase to be druggable. The results indicate that, from a structural perspective, around 75% of the kinome has binding site characteristics that should allow the design of drug-like compounds. Top-ranking structures comprise kinases that are primary targets of known approved drugs but additionally point to so far less explored kinases. Several aspects will be discussed, including the top-ranking kinases, the information gained from experimental hit rates [3], and our attempts to use structural information to enhance compound selectivity.

O21
A three-site mechanism for agonist/antagonist action on the vasopressin receptors
Noureldin Saleh1, Giorgio Saladino2, Francesco L. Gervasio2
Extensive classical molecular-dynamics simulations including metadynamics enhanced sampling reveal three distinct binding sites for arginine vasopressin (AVP) at its V2 receptor (V2R). Two of these, the vestibule and intermediate sites, block (antagonize) the receptor and the third is the orthosteric activation (agonist) site. The contacts found for the orthosteric site satisfy all the requirements deduced from mutagenesis experiments, including the involvement of residues near the extracellular N-terminus of the receptor. The biologically active conformation of AVP has been determined for each binding site. Metadynamics simulations on V2R and its V1aR analog give an excellent correlation with experimental binding free energies by assuming that the most stable binding site in the simulations corresponds to the experimentally determined binding free energy in each case. We extended our results to the β2-adrenergic receptor to study the co-operative mechanism of the ligand and G-protein on GPCR activation. The resulting three-site mechanism for both β2 and V2 activity is compatible with the ternary complex model.

Increased computational performance makes possible MD simulations on biologically relevant timescales exceeding 1 µs, which can be used to sample the conformational space of proteins. Of the available methods, the brute force method has the advantage of being highly analogous to the actual processes and of using little a priori knowledge [1]. In this work, we sought methods that can be useful for interpreting the vast amount of data associated with these kinds of simulations.

Analysis of µs MD simulations of the p53 core domain
The system under investigation, the p53 core domain, has several loop regions protruding from a stable β-barrel that B-factor plots suggest to be flexible. Cluster analysis of simulations several microseconds long showed that no structural convergence is reached for the system. Subsequently, the flexible regions were investigated individually. A set of cluster analyses were performed, in which only residues of one individual region were regarded in each. This vastly simplified data allowed reoccurring structures to be identified, while it also showed that no convergence is reached for the individual regions. Structural homogeneity within clusters, and structural differences between clusters with high RMSD differences of those individual regions, can be shown by hydrogen-bond analyses. They can also show thermodynamic explanations for sudden changes in the RMSD trajectory of the different regions. Clustering of the flexible loop regions using a DASH analysis [2], which is based on torsional angles, showed a good match to the results of Cartesian clustering based on atomic positions. Correlation analysis showed that flexible regions are mostly isolated by the β-barrel and move independently as long as they have no direct contacts and none of the regions makes a large move. These analyses show that regarding flexible regions individually can improve the evaluation of the incompletely sampled conformational space of a protein. This knowledge can then be used when it is compared with a similar system like a mutated or complex-bound version of the protein. Accurate assessment of the conformational space of drug-like molecules in free solution is a frequently underestimated, yet relevant ingredient of molecular design. In particular, predicting and controlling free ligand conformations is essential for minimizing the entropic penalty to reorganize a ligand's geometry upon binding to a protein.
Overcoming the deficiencies of common small molecule force fields represents a particular challenge due to the considerable computational cost of high-level quantum-chemical calculations for predicting the conformational manifold.
Here, we demonstrate the performance of a hierarchical filtering scheme that allows for the identification of dominant conformations together with their proper statistical weights, measured by their free energies in solution, with quantum-chemical accuracy. The automated workflow comprises a sequence of force field-based high-temperature molecular dynamics simulations using implicit solvent models, clustering and filtering steps, high-level geometry optimizations in solution employing the polarizable continuum model (PCM) as well as the embedded cluster reference interaction site model (EC-RISM) [1] for scoring, and calculation of theoretical NMR spectra [2] to be compared with experiments. We apply the workflow to variants of the protein kinase inhibitor WZ4002 and several novel EGFR inhibitors that are highly active against a drug-resistant mutant of the epidermal growth factor receptor (EGFR-T790M) [3][4][5]. The relative significance of conformational preconfiguration in comparison with modulation of direct protein-ligand interactions upon chemical substitution is discussed.

Halogen bonding is a rather new type of non-classical interaction. Recently, the field of halogen bonding has attracted much attention in life sciences and drug discovery [1][2][3][4], and various halogen bonding examples have been reported and shown to be viable. Therefore, halogen bonding needs to be integrated into the molecular and drug design process. This talk presents a new scoring function for the halogen bonding interaction between halogenated molecules and the side chain sulfur of methionine, based on quantum chemical calculations at the MP2/TZVPP level of theory. For this purpose, we have significantly extended our previously reported QM studies [5] by an exhaustive, systematic generation and evaluation of interaction geometries using small model systems.
From these data, we derived two separate scoring terms for the interaction using support vector regression and 4-fold cross validation: a σ-hole factor and a SphericalScore. The σ-hole factor describes the directionality of the interaction and the SphericalScore addresses the spatial position of the halogenated ligand around the sulfur atom. A combination of these two scoring terms yields the overall halogen bonding score. Validation was done through randomly generated interaction geometries not contained in the original data. Concomitant evaluation of these geometries through QM calculations and prediction through the scoring function showed very good correlations. The scoring function presented here is a blueprint for integration into empirical scoring functions and docking programs.

Water molecules play integral roles in the formation of many protein-ligand complexes, and recent computational efforts have been focused on predicting the thermodynamic properties of individual waters and how they may be exploited in rational drug design. However, when water molecules form highly coupled hydrogen bonding networks, there is, as yet, no method that can rigorously calculate the free energy to bind the entire network, or assess the degree of cooperativity between waters.

Recent progress in exploring the role of water in protein-ligand binding
In this work, we revisit the grand canonical Monte Carlo simulation technique [1] and show how it can be used to calculate efficiently the binding free energies of water networks of arbitrary size and complexity. Using a single set of simulations, our methods can locate waters, estimate their binding affinities, capture the cooperativity of the water network, and evaluate the hydration free energy of entire protein binding sites. Our techniques have been applied to multiple test systems and compare favourably to thermodynamic integration simulations and experimental data. The implications of these methods in drug design will be discussed. The complexity and diversity of chemical transformations involved in drug metabolism renders Site of Metabolism (SoM) prediction difficult [1]. One strategy to increase performance of machine-learning models is to develop appropriate molecular representations in terms of new descriptors. Quantum-chemically derived molecular descriptors encoding the reactivity of individual atoms appear to be an intuitive starting-point for model improvements. Here, we address the challenge of SoM prediction with descriptors based on atomic partial charges. To this end we assessed various partial charge schemes with respect to their dependence on quantum mechanical methods as well as their dependence on molecular conformation and charge state [2]. Based on the results of this study, we implemented various atomic reactivity descriptors derived from the atomic charge neighborhood information. These descriptors performed well for the task of predicting cytochrome P450 SoMs in drug molecules. The methodological concept is generally applicable to reactivity prediction and not limited to metabolism. Molecular materials are a crucial component of electrochemical storage devices like Lithium ion batteries [1]. 
Electrolyte solvents, for instance, have a strong influence on all relevant properties of the electrolyte, which in turn has a great impact on the performance of the storage device. Physical limitations of electrolyte solvents are more and more often found to be roadblocks for the further improvement of battery technology [1]. Despite this, only a tiny fraction of the large number of possible solvents has been experimentally investigated, because experimental high-throughput work is complicated and costly. Computational work at the atomic scale, on the other hand, is still in its early stages in this field [2]. Like many parts of materials science, battery research is dominated by Solid State Physics, though in the case of molecular materials, methods from Quantum Chemistry and Chemoinformatics offer many much-needed opportunities. We have made first steps to integrate existing computational approaches from these fields as well as Chemical Engineering within a toolbox for the design and optimization of liquid electrolyte systems as a filter for subsequent experimental work [3][4][5]. In addition, combinatorial Quantum Chemistry-based tools were developed to estimate complex properties related to the reactivity of electrolyte materials with each other and with electrode materials. We have then investigated the known chemical space and complete subspaces of the most promising compound classes for new electrolyte materials.

Integrating Chemoinformatics and Quantum
We have previously developed a 3D virtual screening library of >3,000 compounds derived from Central African flora [1,2] and evaluated its pharmacokinetic profile [3]. This work has been extended to include West Africa [4], North Africa and Southern Africa, as well as a small library of compounds from African flora with remarkable potencies [5], including activities against mycobacterial infections, malaria, onchocerciasis and cancer [6]. In this work, we present an SQL platform for searching natural products from African medicinal plants for drug discovery. Data have been collected from major natural products journals and PhD theses from university libraries. The unified library contains >5,000 unique structures from all regions of the African continent. The known uses of the plant species in African Traditional Medicine (ATM) have previously been related to the measured biological activities of the isolated plant metabolites [7]. Apart from the pan-African natural products library [8], each compound in this library is linked to known biological activities, as well as several physicochemical properties that can be used to evaluate drug-likeness. 3D structures are available for virtual screening purposes, and each compound is annotated with a chemical class, known biological activities and the plant species from which the compound was originally isolated. All compounds are available for download; thus the present data support computer-aided drug design (CADD) and virtual screening (VS) campaigns.
Cancer is among the most common disease-related causes of death, with ~7.6 million deaths in the human population, and the situation is expected to worsen within a few decades [1]. Previous studies have shown that about 48% of approved anticancer drugs were either natural products (NPs) or directly derived from NP lead compounds by semi-synthesis [2]. Our goal is to develop a small library of potent and less toxic NPs for virtual screening within the African setting. Computer-aided drug design methods have become a promising part of the drug discovery process. This study focused on generating a 3D structural library of potential anticancer compounds from the African flora and on evaluating the "drug-likeness" and toxicity of the new compound library. Virtual screening and in silico toxicity assessment were carried out in comparison with the dataset of Naturally Occurring Plant-based Anticancer Compound-Activity-Target (NPACT), comprising ~1,500 published naturally occurring plant-based compounds from around the world [3]. In this study, about 400 compounds have been identified from African flora and their drug-likeness properties evaluated in comparison with NPACT and DNP. Docking was carried out successfully for fourteen selected known anticancer drug targets [4]. Pharmacophore-based virtual screening and in silico toxicity assessment have also been done. Pharmacophore models were validated through receiver operating characteristic (ROC) and Güner-Henry (GH) scoring methods [5], indicating that several of the models generated could be useful for the identification of potential anticancer agents from natural product databases. The validated pharmacophore models were used as 3D search queries for virtual screening of the AfroCancer dataset, along with NPACT.
Additionally, the in silico assessment of the toxicity of the two datasets was carried out using 88 toxicity end points predicted by Lhasa's expert knowledge-based system (Derek) [6], showing that only an insignificant proportion of the promising anticancer agents would be likely to show high toxicity profiles. Diversity analysis of the AfroCancer and NPACT datasets was carried out using principal component analysis [7].
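The two validation measures named in this abstract, the ROC curve and the Güner-Henry (GH) score, can be computed from a screening result as sketched below. This is a minimal illustration using the commonly cited GH formula and a rank-based area under the ROC curve; the function and argument names are hypothetical, and nothing here reproduces the authors' actual validation set-up.

```python
def guner_henry_score(ha: int, ht: int, a: int, d: int) -> float:
    """Guner-Henry (goodness of hit list) score.

    ha: actives in the hit list, ht: total hits retrieved,
    a: actives in the database, d: total compounds in the database.
    """
    if ht <= 0 or a <= 0 or d <= a:
        raise ValueError("need ht > 0, a > 0 and d > a")
    # Weighted combination of yield of actives (ha/ht) and recall (ha/a).
    yield_and_recall = ha * (3 * a + ht) / (4 * ht * a)
    # Penalty for the fraction of decoys that ended up in the hit list.
    decoy_penalty = 1 - (ht - ha) / (d - a)
    return yield_and_recall * decoy_penalty


def roc_auc(scores: list[float], is_active: list[bool]) -> float:
    """Area under the ROC curve, computed as the probability that a randomly
    chosen active scores higher than a randomly chosen decoy (ties count 0.5)."""
    actives = [s for s, y in zip(scores, is_active) if y]
    decoys = [s for s, y in zip(scores, is_active) if not y]
    if not actives or not decoys:
        raise ValueError("need at least one active and one decoy")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in actives for n in decoys)
    return wins / (len(actives) * len(decoys))
```

A GH score of 1 corresponds to a hit list that contains all actives and no decoys; an AUC of 0.5 corresponds to a ranking no better than random.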
There is widespread concern about the toxicological effects of man-made chemicals in the environment, including those of pharmaceutical products and their metabolites. There are databases of, and computer models to predict, aquatic narcosis and, in some cases, excess aquatic toxicity, such as ECOSAR [1], but there remains a need for better sharing of human knowledge about the full range of environmental and human toxicological hazards. Computer programs such as Derek [2] are used to share human knowledge about mammalian toxicity, and a proof of concept application, Eco-Derek, using the same technology to share knowledge about toxicity to the ciliate Tetrahymena pyriformis and human skin sensitisation, has been described [3]. We are developing new alerts covering other species not included in the original knowledge base for Eco-Derek. The knowledge base currently contains 52 alerts, 207 patterns, 138 reasoning rules, 79 literature sources, 5 species and 6 toxicity endpoints. For mammalian toxicity, Derek reports confidence in predictions of activity, or likelihood that activity will be seen, using terms such as "probable" and "plausible". Eco-Derek reports the estimated potency of toxicity of a query chemical to T. pyriformis, using terms such as "High", "Moderate" and "Low". It takes into account both narcosis and mechanism-based toxicity (often called "excess toxicity" in cases where it is more severe than narcosis). The new knowledge-based system for predicting environmental toxicity is being built using the Nexus platform from Lhasa Limited [4]. It is intended that the system will report both potency and confidence in predictions (e.g. "Predicted toxicity to fish: Moderate. Confidence in the prediction: High"). Nexus currently only supports reporting of confidence (likelihood). So, for the present, knowledge about toxicity to T. pyriformis contained in Eco-Derek has been reworked and assessed in terms of confidence about its predictive reliability and incorporated into the new system. In addition to the knowledge base, a database covering the environmental toxicity of chemicals is being built, emphasising, but not restricted to, aquatic toxicity, with effort concentrated on data from African sources not previously easily accessible electronically. The data are being studied to develop new rules for the knowledge base of the prediction system. Species covered include Daphnia magna, Escherichia coli, Pseudokirchneriella subcapitata, Danio rerio and Pimephales promelas, among others. The prediction system will be demonstrated during the presentation if time allows, and will be available for demonstration during conference breaks. We have currently incorporated the Eco-Derek knowledge into Lhasa's Nexus platform. The purpose of this project is to facilitate the exchange of knowledge about environmental toxicity between scientists. It is a collaborative effort between disciplines and across continents. The authors invite and encourage other scientists to become involved.
Modelling the chemical stage of the radiobiological mechanism may be very helpful in studying the radiobiological effect of ionizing radiation, in which the water radicals formed by the densely ionizing ends of primary or secondary charged particles react with DNA molecules in living cells and damage them. These radicals arise in clusters that diffuse while the radicals react with each other or with other species (radiomodifiers) present in the water medium, oxygen and N2O being very important. The proposed mathematical model has been created with the help of Continuous Petri nets. It enables us to describe and study the influence of both main parallel processes: chemical reactions and diffusion of radical clusters. It is possible to study the change in concentration of individual radicals over time during this diffusion process, which may be very helpful when studying the efficiency of different substances present in the medium in causing DNA damage (the radiobiological effect). We started to study this problem earlier, describing the simultaneous effect of chemical reactions and cluster diffusion with the help of the corresponding differential equations. That model was applied to experimental data obtained for Co60 radiation (see [1]). Continuous Petri nets were then used to simulate the time dynamics of the chemical stage under anoxic conditions (see [2]). Oxygen (if present) may act in two different directions: at small concentrations the interaction with hydrogen radicals prevails and the final biological effect diminishes, while at higher concentrations additional efficient oxygen radicals may be formed. N2O molecules, in turn, react with hydrated electrons; the number of OH radicals increases, which results in greater damage to DNA molecules.

[1]. Studying materials with a complex phase structure as well as the prediction of the properties of materials obtained under different conditions are of particular interest.

Prediction of the functional properties of electroceramic materials using chemoinformatics approaches
In this study we propose methods and descriptors efficient for the assessment of "composition-structure-functional property" relationship. Involved approaches include complementary machine learning methods.
The representation of the relevant conformational space of small molecules (e.g. drug-like molecules) is a thoroughly studied, nontrivial problem. A wide variety of different methodologies have been explored, with the aim to devise an algorithm that achieves optimum performance with respect to accuracy, conformational ensemble size, and computing time.
Here, we present a comprehensive benchmarking study, in which we assess and compare the performance of a variety of free and commercial tools for computing conformer ensembles. Inspired by the Iridium-HT benchmarking dataset [1], we have devised a new, high-quality library of protein-bound ligand conformations from the PDB, with the aim to improve statistical significance by use of a substantially larger dataset, while adhering to strict quality criteria. Our benchmarking tests show that several freely available conformer ensemble generators (e.g. RDKit [2], Balloon [3], CONFECT [4]) achieve a level of accuracy which is comparable to that of commercial software (e.g. MOE [5]). However, substantial differences with respect to runtimes and the percentage of molecules for which the algorithms fail to produce any solutions can be observed. Also, longer computing times do not per se lead to better conformer ensembles. From this statistical analysis we devised guidelines on how to parameterize tools for best performance in different application scenarios (e.g. computation of small data sets with maximum accuracy vs. large-scale virtual screening applications).
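The quantities compared in this benchmark (accuracy, ensemble failures, and how they vary between tools) can be summarized per tool along the following lines. This is a minimal sketch under stated assumptions: accuracy is taken to be the minimum RMSD between any generated conformer and the protein-bound reference, the RMSD values are assumed to be precomputed by some external tool, and the 1.0 Å success threshold and all names are hypothetical rather than taken from the study.

```python
from statistics import median
from typing import Optional


def summarize_benchmark(min_rmsd: dict[str, Optional[float]],
                        threshold: float = 1.0) -> dict[str, Optional[float]]:
    """Summarize one conformer generator over a benchmark set.

    min_rmsd maps a molecule identifier to the lowest RMSD (in Angstrom)
    between any generated conformer and the bound reference conformation,
    or to None if the tool produced no conformer for that molecule.
    """
    total = len(min_rmsd)
    if total == 0:
        raise ValueError("empty benchmark set")
    solved = [r for r in min_rmsd.values() if r is not None]
    n_failed = total - len(solved)
    n_within = sum(1 for r in solved if r <= threshold)
    return {
        "failure_rate": n_failed / total,   # no ensemble produced at all
        "success_rate": n_within / total,   # reference reproduced within threshold
        "median_rmsd": median(solved) if solved else None,
    }
```

Reporting the failure rate separately from the RMSD statistics matters here, since a tool that silently skips difficult molecules would otherwise look more accurate than it is.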
Matched Molecular Pairs (MMPs) are a well-established concept in Medicinal Chemistry that has been successfully employed for the optimization of pharmacokinetic or physicochemical properties [1]. Several recent publications from different groups suggest the extension of the MMP concept towards structure-based lead optimization in the context of the protein environment [2,3]. We introduce a strategy to relate the substitution effect within MMPs to the atom environment within the co-crystallized protein-ligand complex, implemented within the VAMMPIRE database and the supplementary web interface (http://vammpire.pharmchem.uni-frankfurt.de) [4]. In this presentation we discuss different applications of the VAMMPIRE database. VAMMPIRE-LORD (lead optimization by rational design) describes an innovative strategy to improve the binding affinity of a defined lead compound using 3D-MMPs [5]. We demonstrate that the created model is able to extrapolate the knowledge of a chemical transformation and the associated effect on ligand affinity to any similar system. In a second application we discuss the use of a subset of the VAMMPIRE database for the validation of scoring functions. Co-crystallized 3D-MMPs with measured affinity data are suitable for the evaluation of scoring functions independent of the underlying docking algorithm. We present our findings on the performance of scoring functions on the validation dataset derived from the VAMMPIRE database.

One of the most widely used molecular similarity metrics for clustering large compound sets is the 2D fingerprint-based Tanimoto distance, due to its overall good balance between speed and effectiveness [1]. However, there are significant limitations in the ability of a 2D fingerprint-based method to capture the biological similarity between molecules, especially when conformationally flexible structures are involved.
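For reference, the 2D Tanimoto distance mentioned here can be sketched in a few lines. The example below assumes each fingerprint is represented as the set of its "on" bit indices, and it computes the distance only once per unordered pair of compounds; how the fingerprints are generated (e.g. ECFP4 from a cheminformatics toolkit) is outside this sketch, and the function names are hypothetical.

```python
from itertools import combinations


def tanimoto_similarity(fp_a: frozenset[int], fp_b: frozenset[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 1.0  # convention (assumption): two empty fingerprints are identical
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def pairwise_tanimoto_distances(
        fingerprints: list[frozenset[int]]) -> dict[tuple[int, int], float]:
    """Distance (1 - similarity) for every unordered pair (i < j)."""
    return {
        (i, j): 1.0 - tanimoto_similarity(fingerprints[i], fingerprints[j])
        for i, j in combinations(range(len(fingerprints)), 2)
    }
```

The number of pairs grows quadratically with the number of compounds, which is why this step dominates the cost of clustering large collections even before any 3D alignment is considered.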

Examining the diversity of large collections of building blocks in 3D
Structures which appear to differ widely in terms of 2D structure may give rise to quite similar steric/electrostatic properties (Fig. 4), which are what actually determine their recognition by biological macromolecules [2]. We were recently confronted with the task of clustering a collection of building blocks for drug discovery consisting of about 800K heterocyclic scaffolds with variable functional group decoration. The structural diversity of this collection was not adequately captured by 2D ECFP4 fingerprint Tanimoto distances, as shown by the rather flat distribution of 2D similarity values across the set, and by their lack of correlation with the 3D similarity metrics. The initial step of any clustering procedure is the computation of an upper triangular matrix holding similarity values between all pairs of compounds. This step becomes computationally demanding when using 3D methods, since an optimum alignment between the molecules needs to be found, taking into account multiple conformers. The talk will cover the methodological and technical solutions adopted to enable 3D clustering of such a large set of compounds. Selected examples will be presented to compare the quality and the informative content of 3D vs 2D clusters.

J Cheminform 2016, 8(Suppl 1):S18

Furthermore, identification by reference spectra libraries does not succeed in combination with contaminated and/or degraded microplastics. A new robust characterization method for high spatial resolution FTIR microscopy, using vibration band patterns to obtain specific characteristics for qualitative analysis, is introduced in the present study. Several polymer types (PE, PP, PA, biopolymers) in different sizes (macro and micro particles) and different qualities (new, recycled, plastic marine litter) were measured in transflection mode with an FTIR microscope (72 scans, R = 2 cm⁻¹, 5-25 repeats). Because huge amounts of plastics are sampled in practice, no specific sample preparation is performed.
The basis of the new method is the correlation between physico-chemical properties, molecular structure and patterns of vibration bands [2]. In data processing, all spectra are baseline corrected and fitted with asymmetric Voigt functions to obtain all the vibration band areas. All areas of one spectrum are divided by each other, which leads to a highly characteristic dataset. Measurement statistics allow these data to be filtered so that only analyte information is extracted and matrix information is separated. For characterization, these data are compared to references, which are treated in the same way, by multidimensional scaling combined with a novel distance function and k-means clustering. By this method all measured samples could be classified accurately by their similarities and differences. Even small differences such as those between HD- and LD-PE are identified correctly, which can be confirmed by reference literature [3]. The new method recognizes fundamental differences between synthetic and natural polymers and classifies even unknown synthetic and bio-based polyamides correctly.
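The band-ratio step described in this abstract (dividing all fitted band areas of one spectrum by each other) can be illustrated as follows. This is a minimal sketch: the band labels and areas are hypothetical, the areas are assumed to come from a prior peak fit, and only one ratio per unordered pair of bands is kept, since the reciprocal carries the same information.

```python
from itertools import combinations


def band_area_ratios(areas: dict[str, float]) -> dict[tuple[str, str], float]:
    """Return area(a) / area(b) for every unordered pair of fitted bands.

    areas maps a band label (e.g. its position in cm^-1) to the fitted area.
    Pairs are formed from the sorted labels so the output order is stable.
    """
    if any(value <= 0 for value in areas.values()):
        raise ValueError("band areas must be positive")
    return {
        (a, b): areas[a] / areas[b]
        for a, b in combinations(sorted(areas), 2)
    }
```

Because every feature is a ratio, multiplying all areas of a spectrum by a common factor (for instance, a thicker particle or a stronger overall signal) leaves the feature set unchanged, which is one plausible reason why such ratios suit particles of very different sizes.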
Cheminformatics is becoming an essential asset in the chemist's toolbox, especially when it comes to guiding experiments. However, the compendium of chemical and biological datasets generated by experimentalists and available in publicly accessible repositories is skyrocketing [1]. As a consequence, there is a general need for new cheminformatics technologies capable of handling, analyzing, and modeling Big Chemical Data [1]. In this presentation, we introduce three next-generation cheminformatics approaches that can: (i) Enumerate extremely large virtual libraries: we will introduce the PKS Enumerator technology that generates millions of macrolides with user-defined building blocks and constraints. This research is relevant for the development of novel antibiotics.
(ii) Screen very large libraries of virtual compounds using a GPU-accelerated platform: we will show how a GPU-powered ligand-based screening workflow can perform ultra-fast similarity searches among one billion molecules.
(iii) Use machine-learning techniques for analyzing molecular dynamics trajectories: we will discuss the rationale for examining both full-atom and coarse-grained molecular dynamics trajectories using machine-learning techniques, what additional knowledge can be extracted, and how next-generation QSAR models could benefit from such technology.

Reaction-driven de novo design uses a library of reactions and a database of reactants to perform a stochastic walk through "synthetic chemistry space." Guided by appropriate drug design "scoring" such as docking or ligand shape similarity, such an approach provides a means to generate novel ideas for drug candidates that include a proposed synthesis pathway. The methodology for reaction-driven molecular evolution we have developed is independent of how the design ideas are scored. We will present studies where reaction-driven de novo design software simulations are used to "re-invent" known drug molecules or close analogs. When successful, the simulations also provide a proposed synthesis pathway to the compound under study. Examples which compare the "in silico" synthesis with the known synthetic path will be presented. This approach could also provide a novel means of suggesting synthesis schemes for synthetic chemists.

Virtual screening based on high-throughput docking of millions of compounds is still not feasible due to the required computational power and time. In constraint-based docking, important protein-ligand interactions can be used to include additional knowledge, leading to a possible alternative to pharmacophore-based pre-filtering of huge libraries. This was successfully shown by Koch et al., who identified the first inhibitors of an important protein-protein interaction in M. tuberculosis [1] based on a high-throughput constraint docking campaign using GOLD [2].
The underlying workflow contains several steps, such as constrained docking with very fast settings, pose filtering and redocking using exhaustive settings. Up to now these processing steps had to be performed manually, since GOLD lacks a programming language API. Therefore, we developed a Python wrapper for processing the GOLD configuration file and creating Python-based automated virtual screening workflows. Using this wrapper, an automated constrained docking workflow was implemented and used to analyze the performance of this approach in comparison to pharmacophore mapping-based virtual screening. First studies showed promising results on Poly [ADP-ribose] polymerase 1 (PARP1), Janus Kinase 3 (JAK3) and Phosphoinositide-3-Kinase gamma (PI3Kg) from the DEKOIS2.0 dataset [3].

The development of novel drugs is an arduous procedure which can take more than a decade and can accumulate expenses in the triple-digit million euro range [1]. These facts remain true in spite of recent advances in virtual screening and computer-aided drug design [2]. The need for reasonably accurate and fast methods to predict and optimize the binding affinity, which can drive the design process of drug-like molecules, is therefore very high. We here present a novel methodology for computing binding free energies and their analytical dependence on chemical properties based on the so-called solute-solute equation of the reference interaction site model (RISM) [3]. This integral equation theory is derived from the molecular Ornstein-Zernike equation and enables us to calculate the potential of mean force (PMF) including physics-based solvation contributions with very high computational efficiency. The PMF is the key quantity for characterizing chemical and biological processes since it represents the free energy change along a given reaction coordinate from which the binding free energy can be computed.

Computation and optimization of complex formation thermodynamics by a solute-solute integral equation theory
We derive the conceptual framework and demonstrate various application areas of the model. PMF results from the solute-solute RISM approach are compared with explicit-solvent free energy perturbation molecular dynamics simulations in order to assess the approximations inherent to the theory for simple model systems. Optimal binding chemistry is defined by minimizing the free energy with respect to nonbonded complex interaction parameters. The results lead to suggestions for novel approaches to define protein-ligand scoring functions.
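As a rough illustration of how a binding free energy can be obtained from a PMF, the sketch below integrates a radial PMF w(r) over the bound region and converts the resulting association constant to a standard free energy. It is not the solute-solute RISM approach of the abstract: the square-well PMF, the cutoff radius and all function names are assumptions made for the example, and the textbook relation K = 4π ∫ r² exp(-w(r)/kT) dr is swapped in for the integral equation theory.

```python
import math

KT = 0.593       # kT in kcal/mol at about 298 K
V_STD = 1660.5   # volume per molecule at 1 mol/L, in cubic angstrom

def binding_free_energy(pmf, r_cut, n=20000):
    """Standard binding free energy from a radial PMF (midpoint rule).

    K = 4*pi * integral_0^r_cut r^2 exp(-w(r)/kT) dr, dG = -kT ln(K / V_STD).
    """
    dr = r_cut / n
    k = 0.0
    for i in range(n):
        r = (i + 0.5) * dr
        k += 4.0 * math.pi * r * r * math.exp(-pmf(r) / KT) * dr
    return -KT * math.log(k / V_STD)

def square_well(r, depth=-5.0, r_min=3.0, r_max=5.0):
    """Hypothetical toy PMF: repulsive core, flat attractive well, zero beyond."""
    if r < r_min:
        return 50.0
    if r <= r_max:
        return depth
    return 0.0

dg = binding_free_energy(square_well, r_cut=5.0)
```

For a real system the toy `square_well` would be replaced by a tabulated PMF, and the choice of `r_cut` (what counts as "bound") becomes part of the model.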
The emerging understanding of selective modulation of Gamma Amino Butyric Acid (GABA) receptors allows an increase in the development of clinically efficacious and subtype-selective drugs for these targets [1]. The development of various α-2/3 selective agonists for the treatment of anxiety disorders is in progress [2,3]; however, the binding modes of these selective modulators are still unknown. In our study, we propose a unified pipeline of chemogenomic analyses of GABA modulators and establish a link between their bioactivity, efficacy and mode of action on α-1 and α-2 receptors. We applied this protocol to a dataset of 5,440 GABA modulators and extracted a subset of compounds with the desired structure-activity relationship (SAR) and structure-efficacy relationship (SER). The similar behavior of candidate compounds in structure and activity space and their opposite behavior in functional efficacy space were assessed by the assay-related target similarity (ARTS) method. The dynamical behavior of GABA α-1/2 in the presence of selected compounds was analyzed using molecular dynamics simulations to further elucidate the binding modes. The results showed that sequence differences (G200E, I201T, V202I) in the loop C region of the GABA α-2 subunit are likely to be responsible for destabilizing the active site and hence decreasing its rigidity in the presence of modulators as compared to α-1. To summarize, the behavior of the active site in the presence of different modulators suggests that the difference in binding conformation of compounds could be attributed to structural differences in either the protein subunits or the compounds, and hence could be a reason for their selective functional efficacy profiles.
In silico Mechanism-of-Action (MoA) analysis protocols have been developed in recent years, comprising molecule bioactivity profiling, annotation of predicted targets with pathways and calculation of enrichment factors to highlight targets and pathways likely to be implicated in the studied phenotype [1,2]. Although such analyses report that pathway annotation improves the MoA information gained compared with target prediction alone, these experiments were conducted on small and inconsistent datasets (approximately 1,500 compounds [1]). In this work, we have combined an in silico target prediction tool, utilising over 9.5 million active and 600 million inactive datapoints, with automated pathway annotation from the NCBI BioSystems database. This MoA protocol has been applied to over 6,800 cytotoxic and 300,000 non-cytotoxic compounds from AstraZeneca cell viability screens in order to rationalise the cytotoxic MoA.
Post-processing of predictions involved the removal of targets with correlated activity due to overlapping bioactive compound annotations.
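The enrichment step can be pictured with a minimal sketch like the one below. The abstract does not specify the enrichment metric, so a simple frequency ratio with a pseudocount is assumed here; the function name, the pseudocount and the toy target sets are all hypothetical.

```python
from collections import Counter

def enrichment(active_preds, inactive_preds):
    """Ratio of how often a target is predicted for cytotoxic compounds
    versus non-cytotoxic compounds (assumed metric, with a +1 pseudocount
    on the inactive side to avoid division by zero)."""
    n_act, n_inact = len(active_preds), len(inactive_preds)
    act = Counter(t for targets in active_preds for t in targets)
    inact = Counter(t for targets in inactive_preds for t in targets)
    result = {}
    for target, count in act.items():
        f_act = count / n_act
        f_inact = (inact.get(target, 0) + 1) / (n_inact + 1)
        result[target] = f_act / f_inact
    return result

# Toy input: one set of predicted targets per compound (illustrative only).
cytotoxic = [{"TOP2A", "TUBB"}, {"TOP2A"}, {"TUBB", "HRH1"}]
non_cytotoxic = [{"HRH1"}, {"HRH1"}, {"TUBB"}, set()]
scores = enrichment(cytotoxic, non_cytotoxic)
```

Targets with a ratio well above 1 are the ones such a protocol would flag for pathway annotation; a statistical test would normally replace the bare ratio.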
Many of the enriched targets support known mechanisms of cytotoxicity, with multiple pathways clustering around processes responsible for the fidelity of cell cycle, cellular division and metabolism.
Fingerprints, a bit representation of compound chemical structure, have been widely used in cheminformatics for many years. The conversion of chemical structures into bit strings is based on either structural keys or graph representations. Although fingerprints with the highest resolution display high performance in virtual screening campaigns [1], the presence of a relatively high number of irrelevant bits introduces noise and makes their application more time-consuming.

Mean Information Content (MIC) algorithm: a new approach for fingerprint hybridization and reduction
Here, we present a new method for constructing hybrid reduced fingerprints: the MIC algorithm. The methodology was applied to fingerprints implemented in the PaDEL-Descriptor software [2]. The MIC algorithm was applied to ligands of cognate serotonin receptors (5-HT2AR, 5-HT2BR, 5-HT2CR, 5-HT5AR and 5-HT6R). In the study, the length and composition of the fingerprint are optimized separately for every single receptor by an algorithm which iteratively maximizes the amount of information included. The generated fingerprints, applied in random forest experiments, outperformed every raw (original) fingerprint. Moreover, a universal fingerprint for the whole set of receptors was also developed and applied in machine learning experiments carried out on 5-HT1A receptor ligands, which gave satisfactory results. The composition of the hybrid fingerprints highlights the structural features of the serotonin receptors' ligands that are most important in machine learning evaluation.
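The idea of keeping only informative bits from several fingerprints can be sketched as follows. This is one plausible reading, not the published MIC algorithm: per-bit Shannon entropy is assumed as the information measure, a single ranking replaces the iterative optimization, and the function names and toy data are hypothetical.

```python
import math

def bit_entropy(column):
    """Shannon entropy (in bits) of one fingerprint position across compounds."""
    p = sum(column) / len(column)
    if p in (0.0, 1.0):
        return 0.0  # constant bit carries no information
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def reduce_fingerprints(fps_by_type, k):
    """Pool bits from several fingerprint types and keep the k most informative.

    fps_by_type maps a fingerprint name to a list of bit lists (one per compound).
    Returns (name, bit index) pairs describing the hybrid reduced fingerprint.
    """
    scored = []
    for name, fps in fps_by_type.items():
        for j in range(len(fps[0])):
            scored.append((bit_entropy([fp[j] for fp in fps]), name, j))
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(name, j) for h, name, j in scored[:k] if h > 0]

toy = {
    "A": [[1, 0, 1], [1, 1, 0], [1, 0, 1], [1, 1, 0]],
    "B": [[0, 1], [0, 1], [0, 0], [0, 1]],
}
selected = reduce_fingerprints(toy, k=3)
```

A supervised variant would score bits against the activity labels of each receptor rather than by unconditional entropy.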
controlling the pharmacokinetics of a drug, hence being heavily studied.
Here, we present two decision tree-based, multi-label QSAR models able to simultaneously predict substrates and non-substrates in a total of 1493 compounds for 4 of the main ABC transporters [2] (BCRP1, MDR1, MRP1 and MRP2). Each of these transporters corresponds to a binary class variable, taking the value (class label) "substrate" or "non-substrate", and the simultaneous prediction of the four class variables is performed by a multi-label classification model, unlike the conventional (single-label) classification task where just one class variable is predicted. One of the models, a Classifier Chain, was built to consider transporter (class) interaction, and a Binary Relevance model was used as the baseline model. Training, optimization and testing were performed with separate portions of the dataset. Feature selection was optimized across 5 different methods, with the best method chosen for each transporter. Both models were validated against y-randomization. The applicability domain (AD) was characterized using the STD method [3], and activity cliffs and chemical space coverage were determined. Both multi-label models yielded 70% accuracy in the test set, but considering label interaction led to more balanced models. The AD was able to reliably map the predictive accuracies across the dataset, which means that the predictive reliability of new queries can be correctly determined. Moreover, various mispredictions coincided with activity cliff areas, which means that they could not have been picked up as outliers by any model using similar molecular descriptors. This multi-label classification approach is an appropriate alternative for addressing the prediction of ABC efflux, especially considering the unspecific nature of substrate recognition by the ABC family members.
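The difference between the two multi-label strategies can be shown with a small sketch. To stay self-contained it uses a 1-nearest-neighbour base learner instead of the decision trees of the abstract; the function names and the toy data are hypothetical.

```python
def nn_predict(train_x, train_y, x):
    """1-nearest-neighbour prediction by squared Euclidean distance."""
    best = min(range(len(train_x)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)))
    return train_y[best]

def binary_relevance(train_x, train_labels, x):
    """One independent classifier per label; label interaction is ignored."""
    n_labels = len(train_labels[0])
    return [nn_predict(train_x, [y[j] for y in train_labels], x)
            for j in range(n_labels)]

def classifier_chain(train_x, train_labels, x):
    """Classifier j also sees labels 0..j-1 as extra features:
    true labels during training, predicted labels for the query."""
    n_labels = len(train_labels[0])
    preds = []
    for j in range(n_labels):
        aug_train = [list(fx) + list(y[:j]) for fx, y in zip(train_x, train_labels)]
        aug_x = list(x) + preds
        preds.append(nn_predict(aug_train, [y[j] for y in train_labels], aug_x))
    return preds

# Toy data: one descriptor, two transporter labels (1 = substrate).
X = [[0.0], [1.0], [2.0], [3.0]]
Y = [[0, 0], [0, 1], [1, 1], [1, 0]]
br = binary_relevance(X, Y, [1.6])
cc = classifier_chain(X, Y, [1.6])
```

On such a tiny example both strategies agree; the chain only pays off when later labels correlate with earlier ones, and its result depends on the label order chosen.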
Since available crystal structures usually only cover a small portion of possible protein conformations, drug development processes based on them can only detect a limited number of all possible ligands. Ligand-based approaches, however, often cover a broader range of binding conformations, but only a small part of the possible binding site space of the protein, which is a reason to prefer structure-based methods. Additionally, proteins with very similar binding sites in available crystal structures can still adopt other conformations of high probability, which enables the design of selective ligands. In order to close this gap in ligand design, a new method for binding site shape clustering was developed based on the test case PTP1B/TC-PTP. PTP1B is a long-known validated target for type 2 diabetes and obesity, while inhibition of the closely related TC-PTP leads to intolerable side effects in mice [1].

Integration of chemical structure and full-text search: MarkLogic goes chemistry
To collect protein conformations, molecular dynamics simulations were performed. Since for TC-PTP no crystal structure with required starting conformation was available, a homology model was used instead. The collected molecular dynamics frames were aligned and for each frame the binding site was filled with a grid of points using the software POVME2 [2]. The binding site shape information derived from the grids was translated into a matrix containing binary data for presence or absence of each point inside the binding site of the corresponding frame. Different clustering algorithms were then used to extract the present information on binding site shape differences. This way, binding site shapes with high probability in PTP1B, but low probability in TC-PTP can be discovered and the evidence can be used to select favourable shapes and interaction points for selective PTP1B inhibitors.
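The conversion of per-frame binding site grids into a binary matrix can be sketched as below. The grid points are assumed to be already available as coordinate tuples (no POVME2 file parsing is attempted), and the Tanimoto distance is an assumed choice of metric for the subsequent clustering; all names are hypothetical.

```python
def presence_matrix(frames):
    """frames: list of sets of grid points (x, y, z), one set per MD frame.

    Returns the sorted list of all points and one 0/1 row per frame that
    marks which points lie inside the binding site of that frame.
    """
    points = sorted(set().union(*frames))
    rows = [[1 if p in frame else 0 for p in points] for frame in frames]
    return points, rows

def tanimoto_distance(a, b):
    """1 - |A and B| / |A or B| for two binary rows."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return 1.0 - both / either if either else 0.0

# Toy frames: two similar pocket shapes and one unrelated shape.
frames = [
    {(0, 0, 0), (1, 0, 0)},
    {(0, 0, 0), (1, 0, 0), (2, 0, 0)},
    {(5, 5, 5)},
]
points, matrix = presence_matrix(frames)
d01 = tanimoto_distance(matrix[0], matrix[1])
d02 = tanimoto_distance(matrix[0], matrix[2])
```

The resulting pairwise distances (or the rows themselves) can then be passed to any clustering algorithm.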
Molecular docking is one of the key tools in computer-aided drug design; it is mainly used in virtual screening of compound libraries and to study the mechanisms of ligand-target interaction. However, the problem of effective scoring and discrimination between active and inactive compounds is still not completely solved [1]. In this study, a novel protocol for the automatic evaluation of docking results is presented. In the first stage, the known active and inactive compounds are docked to the receptor binding site; then the molecular interactions from the resulting ligand-protein complexes are encoded into binary format by means of the SIFt algorithm [2]. In parallel, conformations of docked ligands are used to calculate 3D molecular descriptors (PaDEL software [3]), which are then concatenated with the SIFt vectors. The obtained hybrid fingerprints are used as input for the selected machine learning methods (e.g. SVM, RF, J48, Naïve Bayes) implemented in the WEKA software [4]. Evaluation using compounds active toward 5-HT6, 5-HT7 and 5-HT1B receptors proved that the methodology can be successfully used for supporting virtual screening protocols.
For a broader recognition of halogen bonding in molecular design, we have recently studied halogen bonding contacts with different interaction partners in protein binding sites [1][2][3][4]. One of our goals was to examine the accessibility of the sulfur atom in the amino acid methionine through halogen bonding [1]. We found an interesting target in c-Jun N-terminal kinase 3 (JNK3). In the binding pocket of JNK3 the sulfur of methionine 146 (MET146) appears to be targetable by halogen bonding (PDB 2p33) [5]. Here, a chlorine atom shows a very favourable distance (336 pm) and σ-hole angle (160.4°) to the sulfur of methionine. In panel a) of the picture below, spherical interaction energies of iodine toward sulfur are plotted onto MET146. Favourable interaction geometries are highlighted by red colouring. The placement and bond direction of chlorine in the depicted ligand suggest a reasonably good quality of the halogen bond (yellow area). DFT calculations of a meaningful part of the binding pocket (TPSS-D/SVP) indicated that an exchange of chlorine for bromine or iodine could increase the strength of the halogen bond (b+c). Individual synthetic access was developed for the heavier halide analogues, which were then characterized by various biophysical techniques (d+e). By solving the crystal structure of the iodine analogue in complex with JNK3, we have obtained very valuable insights into the possibilities and limitations of the application of halogen bonding in molecular design, which will be communicated in this talk [6].
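The two geometric quantities quoted above, the halogen-sulfur distance and the σ-hole angle (the C-X···S angle at the halogen), can be computed from atomic coordinates as in the sketch below. The coordinates are made up to reproduce the reported 336 pm and 160.4°, and the cutoffs in `is_halogen_bond` are illustrative assumptions, not values from the abstract.

```python
import math

def distance(a, b):
    """Euclidean distance between two 3D points (angstrom in, angstrom out)."""
    return math.dist(a, b)

def angle_deg(a, b, c):
    """Angle a-b-c in degrees, measured at the vertex b."""
    v1 = [p - q for p, q in zip(a, b)]
    v2 = [p - q for p, q in zip(c, b)]
    cosang = sum(x * y for x, y in zip(v1, v2)) / (math.hypot(*v1) * math.hypot(*v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosang))))

def is_halogen_bond(carbon, halogen, acceptor, max_dist=3.55, min_angle=140.0):
    """Crude geometric filter; both thresholds are assumed, illustrative values."""
    return (distance(halogen, acceptor) <= max_dist
            and angle_deg(carbon, halogen, acceptor) >= min_angle)

# Hypothetical coordinates (angstrom): C-Cl bond along x, sulfur placed
# 3.36 angstrom from Cl at a C-Cl...S angle of 160.4 degrees.
carbon = (0.0, 0.0, 0.0)
chlorine = (1.74, 0.0, 0.0)
tilt = math.radians(180.0 - 160.4)
sulfur = (chlorine[0] + 3.36 * math.cos(tilt), 3.36 * math.sin(tilt), 0.0)
```

With real structures the three points would be read from the PDB entry, and an ideal σ-hole interaction approaches 180°.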
It is widely accepted that the human ATP binding cassette transporters P-glycoprotein (P-gp, MDR1, ABCB1) and breast cancer resistance protein (BCRP, ABCG2) work in concert to restrict brain penetration of certain drugs. Known substrates and inhibitors of these transporters show great structural variety; however, there is some overlap in compound selectivity between the two. Because dual inhibition of P-gp/BCRP would lead to better oral bioavailability and CNS penetration of, e.g., anticancer drugs, information on determinants for specific or dual inhibition would be of great value in early phases of drug discovery. However, the collection of pharmacological data for a target of interest is a tedious task, especially if data from open data sources (like Open PHACTS or ChEMBL) is combined with literature data. If there are multiple targets involved, then the work becomes even harder. We tackled this research question by creating a KNIME [1] workflow for conveniently combining data from the Open PHACTS Discovery Platform [2] and other data sources, including pre-processing, filtering, annotation (e.g. creating a binary representation of the bioactivities), and visualization steps. In our use case, P-gp and BCRP inhibition data was collected and the compound overlap was determined. Different ways of solving multi-label problems were explored and compared: label powerset, binary relevance and classifier chains. Label powerset revealed important molecular features for selective or polyspecific inhibitory activity, while binary relevance and classifier chains allow for more predictive models. The models reveal important molecular features for specific or polyspecific inhibitory activity such as SlogP, the number of H-bond donors and acceptors, the number of aromatic atoms and the length of the maximum single bond chain.
Understanding the underlying processes upon ligand binding is a key parameter in drug design.
The early view of a lock-and-key principle has evolved to a more dynamic view of protein-ligand interactions. This requires an accurate treatment of both the ligand and the protein dynamics. Up to now, there are still no robust methods to describe the complex interplay between ligands and proteins in an efficient manner and often these methods are limited to highly specialized hardware.
Recently, a novel Monte Carlo based algorithm combined with side chain prediction was introduced which efficiently explores protein-ligand and protein-protein dynamics [1]. This Protein Energy Landscape Exploration (PELE) method has provided efficient and accurate induced fit results with respect to ligand migration and ligand binding events [2][3][4]. Originally, the protein backbone motion is limited to normal modes obtained by the anisotropic network model (ANM), a purely theoretical approach based on a distance matrix. Experimental parameters are not directly incorporated in these simulations, despite the fact that experimental data derived from X-ray or NMR contain valuable information regarding conformational space and protein dynamics.
To make use of public or in-house available experimental information, we incorporated driving modes based on X-ray structures. Testing the new approach using the mineralocorticoid receptor system showed an improvement of the general performance and the description of the protein motion. Currently, we are implementing NMR chemical shift information into PELE which guides the protein motion following experimentally observed chemical shifts. Overall, this will allow for efficient sampling of ligand-protein binding events under direct consideration of experimental data.
Molecular docking is the most widely used method in modern computer-aided drug design, whether in virtual screening or lead optimization. Molecular docking predicts the preferred orientation of one molecule (ligand) relative to a bigger molecule (protein) to form a stable complex. The most complex and important step in this docking process is scoring. In this process, a large number of binding poses are computationally generated and then evaluated using a scoring function (SF), which is a mathematical or predictive model that produces a score representing the binding stability of the pose. Generally, three main aspects define the quality of an SF: "docking power", "ranking power" and "scoring power" [1]. Reportedly, conventional SFs are able to predict binding modes but mostly fail to predict binding affinities. In the literature, SFs are typically classified as force-field-based, empirical, and knowledge-based. A first application of machine learning using Random Forest to predict binding affinities showed an increase of more than 20% in terms of Pearson's correlation coefficient on a generic benchmark set with 195 protein-ligand complexes [2]. Since then, a new class of SF has been intensively studied: the machine learning-based SF. Using the CASF databases [3], we evaluated 11 novel conventional SFs (e.g. GoldScore, DrugScore) and 10 machine learning-based SFs, and could show that the scoring and ranking power of machine learning-based SFs are superior to those of conventional SFs. In particular, ensemble learning classifiers like Rotation Forest, in combination with various modifications of our own to guarantee diversity and accuracy, achieve the best performance in all cases.
Applying high solvent pressure to biomolecules has substantial impact on their free energy surfaces that govern structure, function and dynamics. This poses a challenge to computational modeling approaches since the applicability of conventional empirical force fields is not known. As a step toward clarifying the situation, we need to account for high pressure in quantum-chemical calculations. A suitable methodology is provided by integral equation theories, in particular the "embedded cluster reference interaction site model" (EC-RISM) [2,3] that combines statistical-mechanical 3D RISM integral equation theory and quantum-chemical calculations self-consistently. In this context the impact of pressure is naturally accounted for since the solvent susceptibility function that enters the theory contains the pure solvent correlation functions at the pressure chosen. Here we illustrate the methodology for several benchmark applications in a pressure range of 1 bar up to 10 kbar, including the effect of pressure on molecular structure, the relevance of electronic polarizability under extreme conditions, and the pressure dependence of nuclear magnetic resonance parameters.
DACS (Database of Available Chemical Compounds) is a virtual compound collection of chemical substances prepared for analysis of the chemical space at the FMP. The database contains compounds offered since 2005 by more than 50 different vendors. Up to now we have collected altogether 375 million data records, which were extracted from nearly 3000 files. These data records represent more than 80 million unique chemical structures. Since July 2015, the beta release of open-dacs.de, combined with an active web interface, has been available to users. This release contains more than 19 million unique structures. In the next release we are planning to include 3D representations of the unique compounds as a complement to the 2D structures to facilitate direct application in virtual screening and docking studies. Furthermore, a first standard set of descriptors, such as the number of H-bond donors/acceptors (NHBA/D), PSA, AlogP and diverse solubility models [1][2][3], is available for filtering the search results. All chemical structures are tagged according to their content of reactive substructures following a list of SMARTS-based rules, which were developed in the context of several library design initiatives [4,5].
The ever-growing production of digital data as a basis for scientific research has led funding agencies to regard data preservation and reuse as transparency-enhancing and cost-effective policies. On the other hand, scientific communities recognized early the advantage of up-scaling and reusing scientific data, and some subject-specific repositories have emerged within the respective research fields, while interdisciplinary ones [1][2][3] still remain scarce. RADAR (Research Data Repository) [4,5] aims to make a significant contribution to the scenario described above, and to establish itself as a discipline-agnostic, sustainable, and cost-transparent infrastructure, guaranteeing long-term digital preservation and thereby enhancing the reuse of research data. An additional and key aim of the project is to serve as a platform to support the peer review of scientific articles, in which the underlying primary data is submitted jointly with the written manuscript. Specifically, our project offers two distinct services: I) a preservation service, in which datasets are digitally preserved and kept externally inaccessible, and II) a publication service, in which datasets receive a DOI and are accessible, free of charge, through our site [4,5]. The implementation of an OAI-PMH standard and an open API, for the harvesting of metadata and (published) datasets, respectively, is under development.
RADAR is a DFG-funded (2013-2016) initiative, consisting of an interdisciplinary team formed by the FIZ Karlsruhe, TIB Hannover, KIT/SCC, LMU Munich and the IPB Halle. We actively seek cooperation and feedback, not only from publishers and manuscript submission systems, but especially from the scientific community.
To this end class probability estimates can be used. Generally, many supervised learning algorithms output some sort of probability estimate, but they need not be well calibrated. However, when assessing the reliability of a prediction, well calibrated probabilities are definitely needed. Logistic regression is most commonly applied for calibrating class probability estimates and was originally used to calibrate support vector machine (SVM) outputs [1,2]. In this study, logistic regression was employed to calibrate the outputs of six classification techniques (random forest (RF), k-nearest neighbour (KNN), SVM, partial least squares discriminant analysis (PLSDA), linear discriminant analysis (LDA) and naïve Bayes classifier (NBC)). In addition, six regression techniques (random forest regression (RFR), support vector regression (SVR), sparse partial least squares (SPLS), lasso, elastic net and ridge regression) were also used to estimate class probabilities. The quality of the calibration is assessed with reliability diagrams [3]. In detail, the mean squared error of prediction (MSE) between predicted and actual class membership probabilities is computed to quantify the degree of improvement. Furthermore, the influence of accuracy, correlation structure of the predictor matrix, and dataset size on the quality of class probability estimation was analysed in several simulation studies as well as with real data sets.
From these studies, the following conclusions can be drawn: 1) within the accuracy range of 0.70 to 0.95, the probability estimates are reliable; 2) accuracy and correlation have the largest influence on the quality of class probability estimation; 3) the data set size has a minor influence.
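Calibration by logistic regression, often called Platt scaling, together with an MSE check and a probability threshold for the applicability domain, can be sketched as follows. The plain gradient-descent fit, the learning rate, the toy scores and all function names are assumptions made for the illustration; they are not taken from the study.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_platt(scores, labels, lr=0.1, epochs=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the log-loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def mse(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

def in_domain(prob, threshold=0.7):
    """Accept a prediction only if its class probability reaches the threshold."""
    return max(prob, 1.0 - prob) >= threshold

# Toy, non-separable classifier scores with true class labels.
scores = [-2.0, -1.5, -1.0, -0.5, -0.2, 0.2, 0.5, 1.0, 1.5, 2.0]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
a, b = fit_platt(scores, labels)
calibrated = [sigmoid(a * s + b) for s in scores]
```

In practice the calibration parameters would be fitted on held-out data rather than on the scores used to train the classifier.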
With calibrated class probability estimates, it is straightforward to define the applicability domain by setting a threshold for the class membership probability below which the prediction of an object is rejected. Calibration is advantageous in this context to ensure that the threshold coincides with the nominal probability.
While Traditional Chinese Medicine (TCM) has consistently been of great importance in China throughout the years, further scientific rationale is still needed for it to be accepted more widely in the West. Given the increasing data on TCM chemical ingredients and the availability of large-scale bioactivity data, we are now in a position to provide computational hypotheses for the modes of action (MOAs) of some TCM treatments, and to discover the relationships between the treatments. By using in silico target prediction algorithms, we generated a hierarchical clustering of 45 TCM therapeutic action (sub-)classes based on predicted bioactivity spaces, which were further annotated with KEGG pathways. The in silico target prediction of 10,079 TCM compounds showed a diverse bioactivity space, predicting a total of 409 unique targets and 171 unique pathways as well as 183 enriched targets and 99 enriched pathways based on Estimation Score ≤ 0 and ≥5% of compounds/targets in a (sub-)class. The TCM (sub-)classes were compiled into 14 clusters and the MOA of each (sub-)class was established by associating the top three enriched targets/pathways with its indications, with supporting literature. Overall, the most frequent top three enriched targets were immune-related targets such as tyrosine-protein phosphatase non-receptor type 2 (PTPN2) and the protein kinase C family, while the most frequent enriched pathways were related to the digestive system, such as mineral absorption and bile secretion.
For instance, compounds from "Hemostatic, stasis resolving" showed wound healing activity, which can be suggested from the action of compounds against PTPN2 [1,2] and the mineral absorption pathway [3]. The annotation of the protein families showed that the G-protein coupled receptor (GPCR) and protein kinase families contributed most to the diversity of the bioactivity space, and pathway mapping indicated that the digestive system was consistently annotated, which agrees with an important treatment concept of TCM, "the foundation of acquired constitution", which includes spleen and stomach. This global overview enables the observation of similarities and differences between (sub-)classes which are not apparent from the names given. Hence, this analysis hopefully helps to bridge the gap between TCM and Western medicine a bit further.

Global mapping of Traditional Chinese Medicine (TCM) into bioactivity space and pathways annotation improve mechanistic understanding and discovers relationships between therapeutic action (sub-)classes
The BioMAP systems are high-quality phenotypic screening assays and part of EPA's ToxCast effort to prioritise chemical testing. Based on a robust metric, the nScore, 370 compounds were grouped into 82 clusters. These clusters are the result of a common underlying mode-of-action, and 4 out of the 82 clusters were identified with specific molecular targets using a novel target prediction algorithm. Indeed, only these four clusters were composed of compounds which all had at least one target in common. Moreover, these four clusters are equally identified when using two different metrics as activity cutoff, further confirming their selective mode-of-action as compared to other clusters. Different activities were found for these four clusters. The first and third clusters seemed to be associated with targets involved in nuclear receptor transcription, and are composed of compounds associated with cardiovascular toxicity biomarkers. However, the third cluster is composed of compounds targeting proteins belonging to the family of retinoic acid receptors which additionally target lipid metabolism and adipocyte differentiation, while the first cluster is slightly more diverse and has compounds associated with fibrosis and restenosis biomarkers in addition to the cardiovascular inflammation biomarkers. The second cluster seemed to be associated with one specific target, the Sphingosine 1-phosphate receptor 2, involved in G alpha signalling events and lysosphingolipid activity, and compounds belonging to this cluster do not seem to be associated with specific biomarkers. The last cluster is associated with 7 targets active mostly in steroid hormone biosynthesis, and with cardiovascular and fibrosis biomarkers. The utility of this approach lies in the fact that compounds having the same target profile may be associated with one of these clusters to better understand their mode-of-action.
As an example, 16 compounds were associated with one of the 4 clusters, based on their predicted targets. For 3 of the 4 clusters, the biomarker profiles of the newly associated compounds were similar to those already in the cluster, confirming the relevance of this approach. This approach may enable a comprehensive understanding of compounds in terms of mechanism-of-action, associated pathways and phenotypic profile.
Generating quality 2D layout coordinates of structures from connection tables is a surprisingly difficult problem. Even if the general connectivity of a compound is mapped to a visually pleasing layout, this is not sufficient to result in a compound rendering which is immediately recognizable to the eye of the pattern-trained chemist. There are many implicit conventions in the layout of chemical structures, concerning standard orientations of ring systems as well as the placement of heteroatoms within a ring and substituents on it, or the selection of bonds with wedges to indicate stereochemistry. Textbooks generally agree on the layout of basic compounds, while most 2D chemical layout software has no notion of these conventions. The consequence is, for example, that one can find in online databases drawings of pyridine (a trivial layout case where a simple polygon ring drawing algorithm is sufficient) with the nitrogen in all six possible positions, while we were unable to find any textbook where it was not at the bottom. The Cactvs toolkit is the layout and rendering engine behind the PubChem database. For more than a decade, the software has been one of the few engines which use advanced layout rules implementing implicit conventions in addition to computing a reasonable 2D graph. This rule set has been significantly expanded, and includes improvements implemented in response to feedback from PubChem users and developers, as well as third-party vendors which have licensed the engine. We discuss details and examples of these developments.

P37
The Cactvs KNIME node compiler
Within a few short years, the KNIME [1] dataflow environment has become an indispensable tool in chemical data analysis. The basic software already ships with a reasonable collection of chemistry nodes. It also enjoys the support of many vendors which provide a rich set of additional nodes covering most standard data processing needs for typical chemistry studies. However, it frequently happens that there are minor gaps in the capabilities of the turnkey nodes. The development of custom nodes within KNIME is possible, but requires an intimate knowledge of its software architecture and Java. A tool which would allow the specification of custom chemistry-aware nodes on a high level, for example by encapsulating scripts of a chemical information processing toolkit, would be a welcome addition. While scripting nodes exist in the KNIME collection, they either lack chemistry awareness, or are very limited in which actions they can perform.
We have now developed a KNIME node compiler for the Cactvs Cheminformatics Toolkit as a new toolkit module. It directly generates a complete KNIME plug-in JAR file from a very terse and high-level node description and a custom script. This script can perform any operation the toolkit in its stand-alone form is capable of. The new feature is a significant addition to the capabilities of the toolkit. Examples are shown and the unique development process for these custom nodes outside of the KNIME software proper is explained.
Tuberculosis, caused by Mycobacterium tuberculosis, is one of the world's deadliest infectious diseases, affecting nine million people worldwide [1]. Due to emerging resistance against currently available antituberculotic drugs [2], novel potent inhibitors are urgently needed. M. tuberculosis thioredoxin reductase (MtTrxR) presents a promising drug target since it plays a crucial role in antioxidative defense, proliferation, and growth of M. tuberculosis, based on thioredoxin-dependent reduction of target proteins, e.g. ribonucleotide reductase, with downstream effects on DNA synthesis [3,4]. In a previous high-throughput docking approach, promising small molecule inhibitors of MtTrxR were identified that target the thioredoxin binding site [5]. In order to improve inhibition activity, extend the structure-activity relationship, and validate selected initial hits as real hits, we performed virtual screening campaigns for these scaffold classes.
We will present the results of this in silico analysis and the biochemical testing.

The histone deacetylases SIRT4 and SIRT5 are mitochondrial proteins that are considered metabolic sensors of the cell's energetic status. Both enzymes play important roles in several human diseases, such as cancer and diabetes. SIRT4 regulates glutamate dehydrogenase activity and insulin secretion. It also functions as a cellular lipoamidase that regulates pyruvate dehydrogenase (PDH); its catalytic efficiency for lipoyl and biotinyl lysine modifications is superior to its deacetylation activity [1][2][3]. SIRT5, in contrast, removes post-translational modifications such as lysine malonylation and succinylation [4]. Recent studies reported that SIRT4 seems to have a tumor-suppressive function [5,6] and may serve as a novel therapeutic target in colorectal cancer [7]. SIRT5 regulates urea production and reactive oxygen species (ROS) metabolism via regulation of carbamoyl phosphate synthetase (CPS1) [3]. Due to the absence of a crystal structure for human SIRT4, a homology model was generated using different templates and computational approaches. The template selection identified human SIRT5 as the most suitable template. MD simulations were carried out on the SIRT4 models in complex with the cofactor and different substrates, using the program AMBER 12, to understand the stability and conformational changes of the modeled proteins in holo and apo form. In addition, several MD simulations of available crystal structures of SIRT5, in complex and in apo form, were performed and compared with the results obtained for SIRT4. Finally, shape-based virtual screening for inhibitors and activators has been carried out.

P39
The quality and predictivity of most QSAR models used in drug design and development often depend on the particular tautomeric and valence structures used to represent the molecules of interest. This is because the location of hydrogens, bond orders, and formal charges affects the values of the atomic and molecular descriptors upon which the models are based. For descriptors such as partial charges, this dependence is typically due to basing atom typing on hybridization: e.g., sp3-hybridized oxygen in -OH is treated differently than its sp2-hybridized counterpart in the =O group. That same oxygen may well occur in both forms in different tautomers of the same molecule. In the aqueous environment of interest in drug design applications, representing a compound as any one of its tautomers is likely to distort QSAR models trained using that tautomer. Even worse is the danger of choosing a tautomer present at low abundance, which will bias the model-building process or the reliability of predictions or both. To address this problem, we have developed a descriptor generation method that is independent of tautomer and valence structure representation, and we will illustrate its application to the atomic descriptors used in S+pKa, our global model of protic ionization constants. Preliminary results will be shown and compared with the "traditional" approach, along with a discussion of advantages and potential pitfalls of the method.

Despite recent advancements in biological and medicinal chemistry research, the number of newly approved drugs is decreasing. More than 3000 potential drug targets have surfaced as an outcome of the genomics revolution [1], and the drug-like chemical space is estimated at nearly 10^33 compounds [2]. Profiling the interactions between these entities is much needed to purpose or repurpose chemical substances for therapeutic targets.
Since experimental profiling of these chemical substances is costly, time-consuming, and impossible on such a large scale, computational prediction becomes a necessary complement that offers valuable information efficiently. Amongst the many available approaches, proteochemometrics (PCM) is distinctly beneficial in exploiting the ever-increasing drug-target interaction data [3]. PCM is an extension of classical quantitative structure-activity relationship (QSAR) modeling, in which models are trained only on chemical data for one particular target. PCM models, by contrast, are trained on both biological and chemical data for multiple targets using various machine learning techniques.
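The difference between QSAR and PCM training data described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the descriptor values are made up, the names `pair_features` and `predict_1nn` are hypothetical, and a 1-nearest-neighbour rule stands in for the Random Forest and Support Vector Machine learners used in the study.

```python
from math import dist

def pair_features(ligand_desc, protein_desc):
    """Concatenate ligand and protein descriptors into one PCM feature vector."""
    return tuple(ligand_desc) + tuple(protein_desc)

def predict_1nn(train, query):
    """Return the label of the training pair closest to `query` (Euclidean).

    Stand-in learner for illustration only; not the method used in the abstract.
    """
    _, label = min(train, key=lambda item: dist(item[0], query))
    return label

# Toy descriptors (hypothetical values, not taken from sc-PDB).
ligands = {"lig_a": (0.9, 0.1), "lig_b": (0.1, 0.8)}
proteins = {"prot_x": (1.0, 0.0, 0.2), "prot_y": (0.0, 1.0, 0.7)}

# (ligand, protein, interacts?) observations spanning several targets.
observations = [
    ("lig_a", "prot_x", 1),
    ("lig_a", "prot_y", 0),
    ("lig_b", "prot_y", 1),
]

# One training row per ligand-target pair, not per ligand as in classical QSAR.
train = [(pair_features(ligands[l], proteins[p]), y) for l, p, y in observations]

# Query a pair that was never observed together.
query = pair_features(ligands["lig_b"], proteins["prot_x"])
prediction = predict_1nn(train, query)
```

Because each row describes a ligand-target pair, a single model can be queried for combinations that do not occur in the training data, which is the property that makes PCM attractive for interaction profiling.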
In this work, a systematic approach based on PCM is reported (Fig. 5), which can predict the interaction profile of a given chemical moiety against therapeutic targets. A total of 3036 interactions of 1473 proteins, extracted from the sc-PDB database, were used for training the models [4]. This study also compares a set of machine-learning algorithms (Random Forest and Support Vector Machine) and protein descriptors (structure-based and sequence-based) for the proposed task. The best results were obtained by a Random Forest model trained on structure-based protein descriptors, with accuracies of 99% and 86% for the test and external validation sets, respectively. The outcome of the study suggests that the proposed models could be used not only to elucidate the intricate mechanism of action of small molecules by identifying novel therapeutic targets, but also to reposition existing drugs for new therapeutic applications.

We offer you a solution that is based on Open Access software (Fig. 6).

The global regulator catabolite control protein A (CcpA) controls carbon metabolism in Bacillus subtilis. It does so by binding to the degenerate consensus site WTGNNARCGNWWWCAW [1]. To investigate how CcpA can bind to such diverse sequence motifs, we tried to identify contributions to binding selectivity. On the one hand, direct contacts to the nucleotide bases via hydrogen bonds and nonpolar interactions ('base readout') strongly influence the preference of proteins for specific sequences. On the other hand, the base composition also influences the shape and flexibility of the DNA, thereby modulating the strength of the interactions.
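To make the degeneracy of the consensus site concrete, the following Python sketch expands the standard IUPAC nucleotide ambiguity codes (W = A or T, R = A or G, N = any base) into a regular expression. It is an illustration only; the helper names and the example sequences are assumptions and do not come from the study.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def consensus_to_regex(consensus):
    """Translate an IUPAC consensus into a regex that also finds overlapping sites."""
    body = "".join(f"[{IUPAC[code]}]" for code in consensus.upper())
    return re.compile(f"(?=({body}))")

def degeneracy(consensus):
    """Number of concrete sequences covered by the consensus."""
    count = 1
    for code in consensus.upper():
        count *= len(IUPAC[code])
    return count

CONSENSUS = "WTGNNARCGNWWWCAW"
SITE_RE = consensus_to_regex(CONSENSUS)

def find_sites(sequence):
    """Return (start index, matched site) for every consensus match."""
    return [(m.start(), m.group(1)) for m in SITE_RE.finditer(sequence.upper())]

# Hypothetical example sequences: one matching site, and the same site
# with the central CG step swapped to GC.
matching = find_sites("GGATGAAAGCGCTTTCATCC")
swapped = find_sites("GGATGAAAGGCCTTTCATCC")
n_variants = degeneracy(CONSENSUS)
```

The consensus covers several thousand distinct 16-mers, yet the two central positions are fixed, so swapping the central CG step to GC is already enough to fall outside the motif.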

Different readout mechanisms for protein-DNA interactions investigated with MD simulations
The most strongly conserved bases of the consensus site are the central CG bases, at which the DNA is bent in the complex structure. The direct contacts to the bases cannot explain the selectivity with respect to the central CG bases. In addition, CG represents a pyrimidine-purine base step, which is known to facilitate kinks and might therefore favour shape readout at this site. To dissect the individual contributions of base and shape readout, the CcpA-DNA complex was simulated as wild type and compared to a mutant in which the central CG base step was replaced by GC. The CcpA-DNA hydrogen bonds remained stable throughout the simulation of the mutant, whereas the comparison of different MMPBSA analysis protocols showed a higher energy requirement to bend the mutant DNA sequence.

It has long been the holy grail of computational structure-based ligand design to accurately predict binding free energies for novel compounds. This task is of special importance in fragment-based drug design, where multiple rounds of potency improvement are typically necessary to generate highly active lead structures out of micromolar initial inhibitors. Molecular dynamics based free energy calculations (or FEP, for free energy perturbation) are among the most suitable methods to reach this goal, which would significantly impact the modern drug design process. Many of the issues previously encountered with FEP have been mitigated by our introduction of the FEP+ methodology (free energy perturbation plus REST, i.e. replica exchange with solute tempering), along with a state-of-the-art force field and the computational power offered by GPU computing. The lack of large-scale validation studies on diverse series of ligands is another obstacle for the practical application of FEP, owing to limited computational resources and the time-consuming process of simulation setup and analysis.
Recently, we have conducted a validation study of FEP results on more than 10 targets and more than 500 compounds, offering an order of magnitude more data than typical FEP studies and allowing statistically valid conclusions about their efficacy. Here, we extend this validation study to fragment hit series binding a variety of pharmaceutically relevant target systems. Results for seven diverse protein systems with about 100 ligands are presented. Relative binding free energies can be calculated with good accuracy, typically with R-squared values in the range of 0.5-0.8 and mean unsigned errors (MUE) of about 1 kcal/mol when compared to experimental data. FEP+ significantly outperforms alternative methods to determine binding free energies. This suggests that FEP+ binding energy predictions offer unsurpassed accuracy for fragments and lead precursors as well as for drug-like molecules and could become a regular part of FBDD projects.

The target-based identification and optimization of new drug leads relies heavily on computational techniques. Among these, virtual docking is a common method for the investigation and evaluation of interactions between small molecules and their receptors. Crucial questions for any docking study are: How well do the docking results agree with experimental data? Do the observed complexes occur in reality, and is there a relationship between the docking score and the binding affinity of a compound? It is common knowledge that the performance of docking procedures is generally target dependent. However, there may be many other factors that limit the predictive capabilities of docking. In our research, we utilize docking for drug discovery against neglected tropical diseases like Human African Trypanosomiasis and leishmaniasis, which threaten millions of people worldwide, who suffer from the limited number of currently available treatment options.
One enzyme involved in the parasitic folate metabolism, pteridine reductase 1 (PTR1), is unique to the parasites, which makes it a promising target for anti-parasitic compound design. We investigated different sets of chemically diverse PTR1 inhibitors with a small set of receptor template structures, while varying other aspects such as docking methodology, protonation states, and the presence of structural waters. Our results show some key aspects for the choice of the method and pinpoint pitfalls in the selection of a receptor template. As such effects are likely to impact other targets as well, the study should aid future target-based drug design efforts by highlighting important considerations beyond the selection of a receptor template for docking studies.
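One simple way to ask whether docking scores relate to binding affinities, as questioned above, is a rank correlation between the two. The sketch below computes Spearman's rho in plain Python; the choice of this statistic, the function names, and the example values are assumptions for illustration and are not taken from the PTR1 study.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up example: docking scores (more negative = better) and measured pKi.
docking_scores = [-9.1, -8.3, -7.5, -6.0]
pki_values = [8.2, 7.9, 6.5, 5.1]
rho = spearman(docking_scores, pki_values)
```

With scores where more negative means better, a strongly negative rho indicates that the docking ranking follows the experimental ranking. Comparing rho across protonation states, water placements, or receptor templates would be one way to quantify the effects the abstract describes.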

P45
Discriminative chemical patterns are used to distinguish molecules with different effects. Classifying the inhibiting, activating, or toxicological potential of an unknown molecule constitutes a central aspect of cheminformatics. The structural features of a molecule responsible for a certain effect can be represented with a chemical pattern. A chemical pattern is a substructure of the molecule, generalized or specialized with molecular features and logic operators. The SMARTS language [1] is the most frequently used representation for chemical patterns. While SMARTS strings constitute a powerful concept for the representation and processing of abstract chemical patterns, their manual generation remains a complex and time-consuming process. Here, we introduce SMARTSminer [2], a new algorithm for the automatic derivation of discriminative SMARTS patterns from preclassified molecule sets. Based on a specially adapted subgraph mining algorithm, SMARTSminer identifies structural features that are frequent in only one of the given molecule classes. In contrast to approaches restricted to elemental substructures, it also supports the consideration of general and specific SMARTS features. Furthermore, SMARTSminer is integrated into an interactive pattern editor named SMARTSeditor [3]. This allows for an intuitive visualization on the basis of the SMARTSviewer [4] concept as well as interactive adaptation and further improvement of the generated patterns. Additionally, a new molecular matching feature provides immediate feedback on a pattern's matching behavior across the molecule sets. SMARTSminer's functionality and its integration into the SMARTSeditor software are shown in different classification scenarios.
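The criterion "frequent in only one of the given molecule classes" can be illustrated with a simple support filter. The sketch below is not the SMARTSminer algorithm: it assumes that candidate patterns and their matches have already been computed by some cheminformatics toolkit, and the thresholds, molecule identifiers, and match tables are hypothetical.

```python
def discriminative_patterns(matches, class_a, class_b,
                            min_support=0.6, max_other=0.25):
    """Keep patterns that are frequent in one class and rare in the other.

    matches: dict mapping a pattern string to the set of molecule IDs it hits.
    class_a, class_b: sets of molecule IDs for the two classes.
    Returns (pattern, favoured class, support in A, support in B) tuples.
    """
    selected = []
    for pattern, hit_ids in matches.items():
        support_a = len(hit_ids & class_a) / len(class_a)
        support_b = len(hit_ids & class_b) / len(class_b)
        if support_a >= min_support and support_b <= max_other:
            selected.append((pattern, "A", support_a, support_b))
        elif support_b >= min_support and support_a <= max_other:
            selected.append((pattern, "B", support_a, support_b))
    return selected

# Hypothetical preclassified sets, e.g. actives (A) and inactives (B).
actives = {"m1", "m2", "m3", "m4", "m5"}
inactives = {"m6", "m7", "m8", "m9", "m10"}

# Made-up match table: which molecules each SMARTS pattern matches.
matches = {
    "[NX3;H2]": {"m1", "m2", "m3", "m4", "m6"},   # frequent in actives only
    "c1ccccc1": actives | inactives,               # frequent everywhere
    "C(=O)[OH]": {"m7", "m8", "m9", "m10"},        # frequent in inactives only
}

result = discriminative_patterns(matches, actives, inactives)
selected_names = [(pattern, cls) for pattern, cls, _, _ in result]
```

A pattern that matches most molecules in both classes, like the benzene ring in this toy table, is frequent but carries no discriminative information and is therefore dropped.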

P48
The use of force field and quantum chemistry based methods to overcome the lack of structural information in PDB structures with very low resolution
Christian Jäger 1, Vivien Wieczorek 1, Lance M. Westerhoff 2, Oleg Y.
For structure-based approaches, a reliable structural model of the target protein, ideally bound to one or more effectors, is an essential requirement. Besides internal industrial databases, a primary resource for such models is the Worldwide Protein Data Bank (wwPDB) [1], or PDB for short. Its repositories contain more than 113000 biological macromolecular structures of varying quality (as of August 2015). Most of these structures have a resolution below 3 Å. This value is a critical threshold for an appropriate refinement and structurally meaningful optimization that achieves the best possible fit between the observed experimental structure factors (e.g. electron densities) and the chosen target function. However, more than 7000 structures listed in the PDB were deposited with a resolution worse than 3 Å. The use of these structural models is questionable for structure-based approaches. Such low-resolution structures also often lack information on essential interactions between the effector molecule and the target protein because of unresolved side chains. For such cases, we present an extended refinement strategy to increase the structural information content of a crystal structure model. As an example, a crystal structure of the Vascular Endothelial Growth Factor Receptor 3 (VEGFR-3) with its ligand VEGF-C is used for further in silico modelling and binding studies. This structure has a resolution of 4.2 Å and lacks information about crystallographic water molecules. Therefore, a solvent analysis with 3D-RISM was applied to the complex. A combination of standard crystallographic model building in PHENIX and the structure modelling methodology of Rosetta, which utilizes an all-atom force field, was used [2].
We will demonstrate the impact of a semi-empirical, linear-scaling, quantum-mechanical method on the PHENIX refinement process [3]. The combination of all three approaches was optimized into a workflow that produces a reliable structural model, useful for a meaningful analysis of ligand-receptor interactions not seen in the original model. We believe such a workflow could be routinely established and may be important for several drug discovery programs.

Huntington's disease (HD) is associated with the expansion of the polyglutamine (polyQ) stretch of the huntingtin (htt) protein. Above a threshold of 37 glutamines, huntingtin exon 1 starts to aggregate in a nucleation-dependent manner. A 17-residue N-terminal fragment of exon 1 (N17) was shown to play a crucial role in modulating the aggregation propensity and toxicity of htt exon 1. We used molecular dynamics simulations to show that binding of CLR01 induces structural rearrangements within the N17 region of the htt exon 1 monomer that lead to a change in the aggregation pathway of htt.