Paths to cheminformatics: Q&A with Ann M. Richard

© The Author(s) 2023. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
Journal of Cheminformatics


What has been your path to where you are today?
AMR: First in my family to graduate from college (SUNY-Oswego) in 1978, I dual majored in Chemistry and Math. My future husband and I moved to North Carolina to each attend graduate school in Chemistry at UNC-Chapel Hill, hardly believing we would be paid to do this.
My graduate work was in Theoretical Physical Chemistry, bridging quantum and classical approaches to model small molecule energy transfer in the gas phase and at solid surface interfaces. Mainframe computers were just becoming accessible with remote terminals, personal computers were in their infancy, and the Internet was several years off by the time I graduated in 1983; additionally, there were very few women in my field. Shortly after, I was hired as a postdoc in EPA's National Health & Environmental Effects Research Lab working under Dr. James Rabinowitz. My research focused on extending a technique for efficiently computing electrostatic potentials to model chemical similarity of small molecules. I joined EPA as a Principal Investigator in 1987 and over the next decade established collaborations with toxicologists and experimentalists across EPA, applying computational chemistry approaches to elucidate mechanisms and build SAR models [2,3]. During that time, I came to appreciate the challenges and rewards of applying these approaches to real-world problems in toxicology. I gained recognition as an effective (and passionate) communicator to cross-disciplinary audiences, relating important details while mindful of the bigger picture, which opened doors and opportunities to me as a young scientist. Being a peacemaker by nature, I was also stepping into the role of neutral arbiter in evaluating and critiquing the new expert-based and computational global SAR models (mostly commercial software) being applied to predicting mutagenicity and carcinogenicity [4-6]. I came to understand the critical role of quality training sets for building such models, the challenges of modeling mechanistically complex toxicity endpoints (vs. physicochemical properties or a receptor target activity), and how the commercial "black box" software applications were built on business models that mined public data and then sequestered the resulting training sets as proprietary. I began to strongly advocate for a public data sharing approach to ensure model transparency, quality, and reproducibility, as well as to help the field progress.

At the same time, I became increasingly aware of the many siloed, independently maintained chemical lists lacking structures across EPA's Program Offices that were supporting regulatory programs such as the Clean Water Act, Clean Air Act, Superfund, etc. In the last slide of an invited presentation at the 2000 QSAR meeting in Burgas, Bulgaria, I proposed development of a public, web-based, standardized, and curated structure-toxicity data resource to serve the SAR toxicity modeling community; there were no publicly available structure databases at that time. With the help of a talented student contractor (C. Williams-Devane), I adopted the motto "Just do it" and published the initial DSSTox database and public website in 2004, helping to pierce some of EPA's early barriers to open science [7,8]. I was recruited to EPA's newly formed National Center for Computational Toxicology (NCCT) in 2005 and received a small grant to hire a full-time DSSTox chemical curator (M. Wolf) and develop EPA's first and only (until 2021) web-based, structure-similarity search capability in association with the DSSTox website. As one of the only chemists in NCCT, I took on contract manager responsibilities for procuring, solubilizing, plating, and data management of thousands of chemicals to support EPA's new ToxCast and Tox21 programs, which were adopting HTS technologies of the pharmaceutical industry to transform toxicology. I accepted this multi-year support role, convinced that DSSTox curation and strict chemical quality control measures would be necessary to ensure the success of toxicity prediction models using HTS data. I further helped to design and develop the cheminformatics infrastructure to support these programs, while advocating for an integration of SAR concepts and HTS data to advance predictive toxicology [9,10].

Dr. Chihae Yang, a long-time collaborator and supporter, was a leader in the area of toxicity data informatics and modeling during this time, not only advocating for, but doing the hard work herself of curating toxicity datasets for use in modeling [11,12]. She and her colleagues at Molecular Networks GmbH, with a grant from the U.S. Food & Drug Administration, publicly released an expert-derived, knowledge-informed set of public ToxPrint fingerprints for use in toxicity modeling in 2013 [13], which helped to launch the next chapter of my research. Employing ToxPrints, I developed a simple, standardized chemotype-enrichment approach to help ToxCast researchers better understand and elucidate chemical structure patterns within and across their HTS datasets and, in the process, came to better appreciate the complexity and challenges associated with these datasets [14,15]. Around the same time, partnering with the extraordinarily talented duo of Drs. Chris Grulke and Antony Williams in 2014 and 2015, the original, manually curated DSSTox database was migrated to MySQL and greatly expanded, becoming the underpinning for EPA's CompTox Chemicals Dashboard (https://comptox.epa.gov/dashboard/), launched in 2017 [16,17]. The latter currently hosts over 1.2 M DSSTox substances (>1 M structures) linked to hundreds of regulatory and data lists, spanning HTS, toxicity, hazard, and exposure, and is supporting EPA and environmental researchers worldwide.

What is your current research focus, and what are your plans for the future?
AMR: A few years prior to retiring earlier this year, I had shifted my research efforts to focus on PFAS (per- and polyfluoroalkyl substances), an emerging contaminant problem area for EPA. This entailed supervising the DSSTox curation of thousands of PFAS substances (represented by both defined structures and Markush) from public sources [18], procurement and plating of an actual PFAS test library of more than 400 substances being screened in phases in a subset of ToxCast assays, and, most recently, development of a customized set of PFAS fingerprints (an extension of the public ToxPrints) to support cheminformatics research and modeling in this chemical domain [19]. I also helped to integrate the ToxPrint chemotype-enrichment approach, as well as the results from global analysis of more than 1000 ToxCast assays, into a public, online tool (Cheminformatics Modules: https://www.epa.gov/chemical-research/cheminformatics). Through seminars and individual collaborations, I trained several of EPA's younger scientists in use of the approach, as well as use of the customized PFAS fingerprints. As an Emeritus, I plan to continue to support the thoughtful application of computational and cheminformatics approaches, advocate for quality DSSTox structure curation to complement testing and modeling efforts, and provide guidance and encouragement to EPA's next generation of researchers.

Which obstacles did you encounter during your career, and what experiences have helped you get to where you are today?
AMR: Starting my career at EPA in a multidisciplinary team of toxicologists, molecular biologists, and life scientists, my challenge early on was a steep learning curve and finding where I could best apply my training and skills to support fellow researchers and EPA's mission. The biggest obstacle was gaining the trust and respect of my experimental toxicology colleagues, and advocating for myself. This was a slow, iterative process, where I took the time to listen, learn, and understand enough of their experiments and data to gauge where and how a computational chemistry approach could contribute to further understanding [1]. Early on, the onus was most often on me to cross the learning bridge, articulate a hypothesis, and initiate the collaboration. However, each time a project met with success, I gained an enthusiastic collaborator and advocate who was eager to embrace these approaches in future collaborations. This success even extended to early interactions with a new journal editor (L. Marnett, Chemical Research in Toxicology), who was initially skeptical of structure-based toxicity prediction models, as they were often divorced from mechanistic interpretation. However, he came to appreciate the potential and value of models and computational experiments, when thoughtfully constructed, to elucidate mechanisms in toxicology [3]. I also put a lot of thought and effort into communicating computational approaches and results, in writing and in presentations, in a way that was respectful of diverse audiences, entertaining (colorful graphics, minimizing text), and easy to digest and grasp. To this day, the many hours I have devoted to creating information-dense, colorful graphics for conveying complex ideas and results in clear, simple terms have served me well [10,20]. Lastly, I want to say that pursuing a career in a government research laboratory, supporting a public health mission, has been very enabling and rewarding; I do not believe that I could have charted my research path or had the impact that I have had in my career elsewhere in industry or academia. In addition, my government job provided me the opportunity to be part of a community of extraordinarily committed and talented scientists, travel, teach, and pursue my scientific interests, all while enabling a work-life balance that included raising a family and twice being a caregiver to elderly parents. As a woman in the physical sciences, I feel that we have come a long way since 1983, but that balancing work and family caregiver responsibilities (for both young and old) is still challenging for women in science.

What advice would you give to your younger self?
AMR: Firstly, do not apologize for being emotional when talking about things that you care deeply about: it is not something you can control and is just who you are; if it makes managers uncomfortable, that is their problem; if they think less of you for it, push through and prove them wrong. Secondly, continue to strongly advocate for what you believe in and care about; passion and truth are your fuel and will help to overcome many obstacles and inspire others. The term cheminformatics will be coined and eventually recognized as a valid and transformative scientific discipline, and you will be joined in your data curation/informatics efforts and supported by a community of scientists who understand and share your vision and commitment. Stay the course in your efforts to change entrenched attitudes about the importance of quality chemistry databases to support modeling and just remember: the slow, deliberate turtle with the hard shell ultimately wins the race against the fast and careless rabbit.

What is a current challenge you are facing that should not be a challenge in the near future?
AMR: As more than one prior interviewee has said or implied, data is the biggest enabler and limiting factor for cheminformatics modelers. My career trajectory took a major turn once I recognized the pivotal role that quality chemical representations and their accurate association with activity data had to play in achieving predictive toxicology objectives. Although collecting, cleaning, and curating data and list associations is a tedious and time-consuming pursuit (and looked down on by some as not sufficiently scientific or innovative), it is something that almost every modeler has had to tackle at some point. In the process, one comes to understand the value and limitations of those data, and to realize that quality data, in turn, can improve the science of the modeling enterprise. As I shifted my focus to cleaning up datasets and chemical structure-identifier associations, I became alarmed at the scope of the problem across public datasets, which was undermining toxicity modeling efforts. Expert manual chemistry curation in association with bioactivity data, which only needs to be done right once and then publicly shared, is the most effective solution, which is why I took on this challenge. I am also convinced that if the toxicology research community does not get the chemistry right, we modelers are hobbled right out of the starting gate. Consider, e.g., a multi-million dollar, 2-year rodent carcinogenicity study published with a chemical name and CAS RN that point to different substances, or where the structure is wrong or important stereochemistry is missing or ambiguous. Or consider a chemical-activity HTS result where the chemical was originally misidentified by a major supplier (yes, this happens) or, unbeknownst to the experimenter, the chemical reacted, volatilized, or degraded under the testing conditions. Additionally, I have learned first-hand the extent to which testing artifacts (both chemical and biological) can obscure the desired target activity, resulting in misleading modeling outcomes. Too often, modelers scour the Internet for chemical-activity data sets, and apply the newest modeling approach without doing the work to clean and understand the data or endpoint being modeled. I also well understand the economic rationale behind the pharmaceutical and chemical industries' reluctance to publicly share toxicity data and knowledge that is intertwined with the pursuit of new drugs. However, failure to predict drug toxicity is a major impediment to new drug advances, just as it is a challenge from a public health standpoint. My hope for the near future is that public data resources supporting chemical-toxicity evaluation and regulation not only continue to expand, but that quality chemical curation becomes the norm and is demanded and expected by the scientific and regulatory communities.

What do you think the cheminformatics community could do to increase diversity and inclusion?
AMR: I would like to see increased opportunities for younger scientists to more actively participate and contribute to public scientific forums, not just giving presentations but in facilitated discussions. I continue to see too many younger, mostly women scientists (particularly minorities and Asians) seated on the periphery of conference rooms, seemingly afraid to speak up or ask questions, particularly when a small number of senior scientists (not always, but most often male) dominate the discussion. Until the scales are truly balanced to include diverse voices, these younger scientists need to be actively encouraged and nudged by those same senior scientists to sit at that table, ask questions, and contribute to the discussion. With the growing availability of "big data" and increasingly sophisticated machine-learning and AI methods, cheminformatics is a rapidly evolving discipline. However, it remains highly multi-disciplinary and requires engagement with, and understanding of, the data being modeled. Younger scientists have a lot to learn, but they also bring a fresh outlook and new skills to contribute to the field. They should be actively encouraged to engage in continuous learning to broaden their perspective, sit at the table, and speak up. It is the responsibility of the older generation to guide, encourage, and make this path easier.

What is your thought on ChatGPT/Large Language Models and how these might influence the way we do science?
AMR: I have started seeing colleagues investigate ChatGPT to compile short biographies and abstracts covering various scientific topics and fields of study. The results have been impressive, sometimes humorous, and mostly (but not always) accurate, largely due to the volume of publicly available scientific information in the form of open-access articles, PubMed abstracts, and proceedings of scientific meetings. I can also see large language models being potentially useful for improving context recognition in text data mining, if fed or pointed to appropriate and trustworthy resources. Additionally, they might be helpful for non-English speakers in writing and editing scientific manuscripts for submission to English journals. Some issues that need to be grappled with, however, are how journals and others will use, oversee, and require disclosure of AI-generated text. Without adequate oversight, it has already been shown that ChatGPT can create a convincing-looking scientific abstract or article, right down to the fake references (https://www.biorxiv.org/content/10.1101/2022.12.23.521610v1). This presents the unscrupulous with a ripe opportunity to spread misinformation with a scientific veneer. Within the cheminformatics community, chemistry is already our lingua franca and ChatGPT is just a next iteration of AI, which has been used in various forms (machine-learning algorithms, neural nets, etc.) in this area for many years now. Focusing on the positives, AI can be a powerful tool for advancing the science if harnessed not only to quality data, but to the iterative scientific enterprise to ensure that results inform, illuminate, and guide a path forward. These are powerful techniques that can serve our field, but they need to be tethered to critical review, understanding, and analysis. A chilling recent experiment with these methods points to unanticipated, possibly negative applications: Ekins and coworkers [21] reported reversing the drug discovery AI paradigm to discover potentially potent chemical weapons (the work is also featured in a new Netflix documentary titled "Unknown: Killer Robots"). As with researchers in many other fields grappling with the impact of these new approaches, the cheminformatics community needs to keep their eyes wide open to both the potential and pitfalls of the approaches.