AI-driven automated chemistry as a tool to accelerate drug discovery

Simeonov, Anton; Zakharov, Alexey

Posted: 15 May 2019 | Alexey Zakharov (National Center for Advancing Translational Sciences)

Typically, early drug discovery campaigns start with the screening of chemical libraries to identify candidate chemotypes modulating a particular target and/or phenotype. Success of the primary screening depends on multiple factors related to both biology and chemistry. These include the target’s druggability, sensitivity and specificity of assay system, composition and diversity of chemical libraries, number of screened compounds, etc.

Since every project has cost constraints, chemical libraries cannot cover the entire chemical space of up to 10⁶⁰ potentially drug-like small molecules.¹ Moreover, chemical library sizes often range from tens of thousands to hundreds of thousands of compounds, which further complicates the search for new prospective chemotypes in such a narrow chemical space. Recent advances in chemistry automation could help to overcome this limitation and establish a process for the discovery of new bioactive chemicals.

The automation of synthetic organic chemistry is a complex process, which requires the synchronised application of instruments, devices, and software algorithms, as well as the ability to execute different operations such as reagent mixing, filtration and purification. This process needs to be regulated by sophisticated algorithms, which can be notionally divided into infrastructure and reaction generator parts. The infrastructure software maps synthetic reactions into the specific steps the robots must execute, captures and stores the reactions’ conditions, parameters and yields, and controls the operation of hardware and devices. Recently, several studies have proposed different approaches to controlling the assembly of molecules in robotic systems. For example, Steiner et al.² proposed a chemical programming language that can direct the synthesis of organic compounds on a flexible modular platform. The authors called this language XDL (chemical descriptive language); it is based on the XML format and is capable of describing a synthetic procedure with all required information. Once a reaction is translated to XDL, it is further converted to a scripting language named ChASM (chemical assembly), which provides machine operation instructions and supports functions and variables. To navigate among physically connected robotic modules, the authors implemented a graph abstraction using the open source GraphML format. Thus, to automatically assemble a molecule, the authors developed, implemented, and integrated three different software elements into one system named Chemputer. As a proof of concept, Steiner et al.² successfully resynthesised three pharmaceutical compounds (Nytol, rufinamide, and sildenafil) using the proposed approach. In another study, Godfrey et al.³ used a custom software package as part of a fully integrated, globally accessible, automated chemical synthesis laboratory.
The architecture of the proposed software consists of different layers that perform separate tasks:

control of transport logic (Alchemii)
establishment, submission, and management of chemistry workflow (ALS software)
real-time inventory and trackers (web interface).

The authors applied the proposed system to automated medicinal chemistry, focusing on the generation of challenging heterocyclic libraries.
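
The graph abstraction that Steiner et al. used to navigate among physically connected modules can be illustrated with a minimal Python sketch. The module names and connections below are hypothetical, and a plain dictionary with a breadth-first search stands in for the GraphML representation; this is an illustration of the idea, not the Chemputer implementation.

```python
from collections import deque

# Hypothetical platform layout: these module names and connections are
# invented for illustration and do not come from the Chemputer publication.
CONNECTIONS = {
    "reagent_flask": ["pump_1"],
    "pump_1": ["valve_1"],
    "valve_1": ["reactor", "waste"],
    "reactor": ["valve_2"],
    "valve_2": ["filter", "waste"],
    "filter": ["rotary_evaporator"],
    "rotary_evaporator": [],
    "waste": [],
}


def find_route(graph, source, target):
    """Breadth-first search for the shortest chain of modules from source to target."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None  # the two modules are not physically connected


route = find_route(CONNECTIONS, "reagent_flask", "filter")
print(" -> ".join(route))
```

Representing the platform as a graph means that a liquid transfer only needs to name its source and destination; the route through pumps and valves is derived from the connectivity, so adding a module changes the graph rather than the procedure.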

Although progress has been made in infrastructure software for automated chemistry, solutions are still needed that are broadly applicable, scalable, flexible and capable of modular expansion. Another limitation of the described systems is that they rely on well-described input. For Chemputer, the input is a detailed reaction description, which is converted into XDL format using a natural language processing (NLP) approach. This description may be taken from a scientific publication or written down directly by the user. The input to the system described by Godfrey et al. comes directly from a chemist and is represented by a drawn reaction workflow.

Thus, to explore new chemical space, both systems need a reaction prediction component that can forecast synthetic routes with corresponding conditions and yields. Steady progress in machine learning and artificial intelligence (AI) provides a great opportunity to integrate cutting-edge solutions into this field. In recent decades, many studies have been conducted with the aim of predicting chemical reactions.⁴⁻¹⁰ In general, they can be divided into two major groups: those that require reaction atom mapping and those that take a mapping-free approach. The reaction atom mapping group can be sub-divided into rule-based and machine learning-based approaches.

Rule-based approaches mostly use SMARTS¹¹ and/or SMIRKS¹² notations for reaction transforms. Szymkuć et al. developed software named Chematica, which uses more than 20,000 reaction transforms represented in SMARTS notation.¹⁰ Since pure SMARTS transforms often do not contain specific information – eg, protecting group data, reaction conditions, incompatible groups, etc – the authors incorporated these data together with reaction SMARTS in expert-coded records that can be used as the “rules” for retrosynthesis planning.¹⁰ In addition to the protected and incompatible groups, the system calculates a delocalisation energy per atom to determine the position at which a particular reaction transformation can be applied.

Since multiple reaction transformations can be applied to some compounds, the authors proposed a scoring function based on pre-defined structure and reaction features – such as the number of rings, stereocentres, group incompatibilities, etc – to prioritise synthetic routes. The drawbacks of rule-based approaches are the need to apply every rule to a given compound, which slows down the prediction process, and a lack of statistical justification for the scoring functions. Indeed, since the proposed scoring functions depend on features and coefficients, their best combination needs to be statistically validated.
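
As a rough illustration of how a feature-based scoring function can prioritise synthetic routes, the Python sketch below ranks three candidate routes by a weighted sum of features. The feature names, weights and routes are all hypothetical; this is not the scoring function used in Chematica, only an example of the general form (features multiplied by coefficients).

```python
# Hypothetical penalty weights per feature, invented for illustration.
# In this sketch a lower score means a more attractive route.
WEIGHTS = {
    "n_steps": 1.0,
    "n_rings_formed": 0.5,
    "n_stereocentres": 0.75,
    "n_incompatible_groups": 2.0,
}


def route_score(features, weights=WEIGHTS):
    """Weighted sum of route features; missing features count as zero."""
    return sum(weights[name] * features.get(name, 0) for name in weights)


# Hypothetical candidate routes described only by their feature counts.
routes = {
    "route_A": {"n_steps": 3, "n_rings_formed": 1, "n_stereocentres": 0, "n_incompatible_groups": 1},
    "route_B": {"n_steps": 5, "n_rings_formed": 0, "n_stereocentres": 2, "n_incompatible_groups": 0},
    "route_C": {"n_steps": 2, "n_rings_formed": 2, "n_stereocentres": 1, "n_incompatible_groups": 0},
}

# Sort route names from best (lowest score) to worst.
ranked = sorted(routes, key=lambda name: route_score(routes[name]))
for name in ranked:
    print(name, route_score(routes[name]))
```

The ranking depends entirely on the chosen coefficients, which is the point made above: without statistical validation there is no guarantee that a given set of weights reflects real synthetic feasibility.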

To address some of these limitations, machine-learning approaches have been deployed.⁴,⁸ Instead of scoring functions, the reaction transform can be predicted by a model developed using machine-learning methods such as Random Forest,¹³ Support Vector Machines,¹⁴ Gradient Boosting,¹⁵ Deep Neural Nets¹⁶ and others. Such models take chemical descriptors calculated from the reactants as input and predict a label representing the reaction transformation in the form of SMARTS or SMIRKS as output. Once the label is predicted, the corresponding SMARTS/SMIRKS can be applied directly to the reactants to generate the reaction product.

Such an approach has the following strengths:

it can provide the probability for each reaction transform so they can be easily ranked, and
it can be used for the prediction of reaction conditions and agents.⁶
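
The ranking of reaction transforms by probability can be sketched in Python as follows. The transform labels and raw classifier scores are hypothetical placeholders for the output of a trained model, and the softmax step is one common way of turning such scores into probabilities, not necessarily the one used in the cited studies.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to one."""
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_transforms(labels, raw_scores, top_k=3):
    """Return the top_k (label, probability) pairs, most probable first."""
    probs = softmax(raw_scores)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Hypothetical transform labels and model outputs for one set of reactants.
labels = ["amide_coupling", "suzuki_coupling", "ester_hydrolysis", "reductive_amination"]
raw_scores = [2.0, 0.5, -1.0, 1.0]

for label, prob in rank_transforms(labels, raw_scores):
    print(f"{label}: {prob:.2f}")
```

In a full pipeline, each ranked label would map to a stored SMARTS/SMIRKS transform that is then applied to the reactants.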

Wei et al.⁴ combined Morgan and neural fingerprints as descriptors with a deep neural network (DNN) to build a model for the prediction of 16 reaction types of alkyl halides and alkenes. Segler et al.¹⁷ used Morgan fingerprints and a DNN to build a model for the prediction of 8,720 reaction transformations gathered from the Reaxys database. The authors showed that the machine-learning approach yielded significantly better results than the pure rule-based method. More examples can be found elsewhere.⁵,⁹,¹⁸ Although the aforementioned studies significantly advanced the field, the developed models ignore the chirality and stereochemistry of chemical compounds and can predict only the most frequent reaction types (those occurring at least 100 times). Also, since these methods use SMARTS/SMIRKS as a label, they still depend on proper atom mapping, which can be challenging for unbalanced reactions.

Recent advances in AI offer new solutions for this field. One of these has been borrowed from machine translation and treats reaction prediction as a sequence-to-sequence (seq2seq) modelling problem. In this case, reactants, products and agents are represented as SMILES strings and fed into an autoencoder to build the predictive model.¹⁹ An autoencoder is a type of deep learning architecture that consists of an encoder and a decoder. The encoder transforms the SMILES of reactants and agents into a latent space, and the decoder reconstructs the reaction product from the latent space. Different types of deep neural nets can be used to build autoencoders; the most commonly used are convolutional²⁰ and recurrent neural nets.²¹
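
Before a SMILES string can be fed to a seq2seq model it has to be split into tokens and mapped to integer indices. The Python sketch below shows one simple way to do this; the regular expression is an assumption written for this example and covers only common SMILES symbols, so it should not be read as the tokeniser used in the cited studies.

```python
import re

# Simplified token pattern (an assumption for this sketch): bracket atoms,
# two-letter halogens, two-digit ring closures, common organic-subset atoms,
# bond/branch/separator symbols, and single-digit ring closures.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[BCNOPSFIbcnops]|[=#\-\+\\/\(\)\.:~@\*\$>]|\d)"
)


def tokenize(smiles):
    """Split a SMILES string into tokens, failing loudly on unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognised characters in SMILES: {smiles!r}")
    return tokens


def build_vocab(token_lists):
    """Assign an integer index to every distinct token, in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


# Example reactants: acetyl chloride and benzylamine, separated by '.'
reactants = tokenize("CC(=O)Cl.NCc1ccccc1")
vocab = build_vocab([reactants])
encoded = [vocab[token] for token in reactants]
print(reactants)
print(encoded)
```

Treating `Cl` and bracket atoms as single tokens keeps chemically meaningful units together, which a character-by-character split would break apart.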

To achieve better performance, the teacher forcing strategy²² and the attention mechanism²³ have been proposed and widely used. In teacher forcing, the improvement is achieved during training by feeding the actual (ground-truth) previous token into the decoder instead of the decoder's own previous prediction. The attention mechanism allows the network to concentrate on the relevant parts of the input sequence by assigning them corresponding weights. Since the seq2seq approach does not use atom mapping, the model can be trained even on unbalanced reactions with multiple agents. Liu et al.⁵ developed a recurrent seq2seq model and validated it on a test set of 5,000 reactions belonging to 10 different reaction types. The authors showed that the performance of the seq2seq model was comparable with that of the rule-based approach. Schwaller et al.²⁴ developed a recurrent model with a similar architecture and validated it using a reaction dataset extracted from patents.²⁵ The resulting model showed superior performance compared to another deep learning approach.²⁶ In a follow-up study, Schwaller et al.²⁷ adopted the Transformer architecture originally proposed by Google Brain²⁸ and showed a significant improvement in the accuracy of reaction predictions.
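
The two techniques can be sketched in plain Python under simplifying assumptions. The first function shows how teacher forcing builds decoder inputs from the true product sequence shifted by one position; the second computes dot-product attention weights over encoder states. The special tokens, vectors and function names are hypothetical, and real models would use a deep learning framework with learned parameters.

```python
import math

# Hypothetical start and end markers for the product sequence.
BOS, EOS = "<bos>", "<eos>"


def teacher_forcing_pairs(target_tokens):
    """Build decoder inputs and targets for training with teacher forcing.

    At each step the decoder receives the true previous token (not its own
    earlier prediction) and is trained to output the next true token.
    """
    decoder_inputs = [BOS] + list(target_tokens)
    decoder_targets = list(target_tokens) + [EOS]
    return decoder_inputs, decoder_targets


def dot_attention(decoder_state, encoder_states):
    """Dot-product attention: weights over encoder states plus the context vector."""
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    dim = len(decoder_state)
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states)) for i in range(dim)]
    return weights, context


# Product tokens for ethanol (SMILES "CCO") as a toy target sequence.
inputs, targets = teacher_forcing_pairs(["C", "C", "O"])
print(inputs, targets)

# Toy two-dimensional states: the decoder state is most similar to the first encoder state.
weights, context = dot_attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(weights, context)
```

The attention weights sum to one, so the context vector is a weighted average of the encoder states, dominated by the positions most similar to the current decoder state.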

Although the described models demonstrated great performance, they disregarded the reaction conditions and yield, and sometimes predicted invalid SMILES, which led to inaccurate results. Thus, the combination of both atom mapping and mapping-free approaches, or the construction of a hybrid method that should consider reaction conditions, agents and yields, may lead to a dramatic improvement in computer-based synthetic planning.

With so much progress made in the infrastructure and reaction prediction fields, the most promising future direction seems to be the combination of both areas, which offers a great opportunity to explore uncharted chemical space. Indeed, new chemical compounds can be proposed by AI and physically synthesised by an automated robotic system controlled by infrastructure software. Taking account of these possibilities, the National Center for Advancing Translational Sciences (NCATS), part of the National Institutes of Health (NIH), initiated a programme titled ‘A Specialized Platform for Innovative Research Exploration (ASPIRE)’,²⁹,³⁰ whose goal is to help study unexplored biologically active chemical space by integrating automated synthetic chemistry, high-throughput biology, and artificial intelligence technologies. The NCATS ASPIRE programme is also designed to accelerate scientific solutions to the national opioid public health crisis under the NIH HEAL (Helping to End Addiction Long-term)³¹ initiative. Thus, the new AI approaches and infrastructure technologies developed under these programmes could help with the engineering of treatments for opioid misuse and addiction, as well as enhance the discovery of next-generation addiction-free analgesics.

Biographies

ALEXEY ZAKHAROV received his Ph.D. in Bioinformatics from the Institute of Biomedical Chemistry of Russian Academy of Medical Sciences. In 2011 he joined the Chemical Biology Laboratory at the National Cancer Institute, National Institutes of Health (NIH) where he continued to strengthen his expertise in the cheminformatics field, by applying and developing in silico methods to aid in the drug discovery and design projects of the CADD Group. In 2015 he joined the informatics group at the National Center for Advancing Translational Sciences (NCATS), NIH. Currently, Zakharov serves as the Informatics lead for an early therapeutic discovery project team and manages and coordinates the team efforts toward achieving the project deliverables and milestones. He works on various projects in the therapeutic areas of cancer, cardiovascular, neurological, viral and inflammatory diseases.

ANTON SIMEONOV is the scientific director of the Intramural Division of Preclinical Innovation at the National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH). The author or inventor on more than 150 peer-reviewed scientific publications and patents, Simeonov has a truly diverse background, ranging from bioorganic chemistry and molecular biology to clinical diagnostic research and development. He received a Ph.D. in bioorganic chemistry from the University of Southern California and a B.A. in chemistry from Concordia College. Simeonov then trained as a postdoctoral fellow at the Scripps Research Institute under Richard Lerner and Kim Janda. Prior to joining NIH in November 2004, Simeonov was a senior scientist at Caliper Life Sciences, and a leading developer of microfluidic technologies, where he was responsible for basic research on novel assay methodologies and development of microfluidic products for research and clinical diagnostics.

References

1. Bohacek RS, McMartin C, Guida WC. The art and practice of structure-based drug design: a molecular modeling perspective. Med Res Rev. 1996 Jan;16(1):3–50.

2. Steiner S, Wolf J, Glatzel S, Andreou A, Granda JM, Keenan G, et al. Organic synthesis in a modular robotic system driven by a chemical programming language. Science. 2018 Nov 29;eaav2211.

3. Godfrey AG, Masquelin T, Hemmerle H. A remote-controlled adaptive medchem lab: an innovative approach to enable drug discovery in the 21st Century. Drug Discov Today. 2013 Sep 1;18(17):795–802.

4. Wei JN, Duvenaud D, Aspuru-Guzik A. Neural Networks for the Prediction of Organic Chemistry Reactions. ACS Cent Sci. 2016 Oct 26;2(10):725–32.

5. Liu B, Ramsundar B, Kawthekar P, Shi J, Gomes J, Luu Nguyen Q, et al. Retrosynthetic Reaction Prediction Using Neural Sequence-to-Sequence Models. ACS Cent Sci. 2017 Oct 25;3(10):1103–13.

6. Gao H, Struble TJ, Coley CW, Wang Y, Green WH, Jensen KF. Using Machine Learning To Predict Suitable Conditions for Organic Reactions. ACS Cent Sci. 2018 Nov 28;4(11):1465–76.

7. Fooshee D, Mood A, Gutman E, Tavakoli M, Urban G, Liu F, et al. Deep learning for chemical reaction prediction. Mol Syst Des Eng. 2018;3(3):442–52.

8. Gimadiev TR, Madzhidov TI, Nugmanov RI, Baskin II, Antipin IS, Varnek A. Assessment of tautomer distribution using the condensed reaction graph approach. J Comput Aided Mol Des. 2018 Mar 1;32(3):401–14.

9. Segler MHS, Preuss M, Waller MP. Planning chemical syntheses with deep neural networks and symbolic AI. Nature. 2018 Mar;555(7698):604–10.

10. Szymkuć S, Gajewska EP, Klucznik T, Molga K, Dittwald P, Startek M, et al. Computer-Assisted Synthetic Planning: The End of the Beginning. Angew Chem Int Ed. 2016 May 10;55(20):5904–37.

11. Daylight Theory: SMARTS – A Language for Describing Molecular Patterns [Internet]. [cited 2019 Jan 4]. Available from: http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html

12. Daylight>SMIRKS Tutorial [Internet]. [cited 2019 Jan 4]. Available from: http://daylight.com/dayhtml_tutorials/languages/smirks/

13. Breiman L. Random Forests. Mach Learn. 2001 Oct 1;45(1):5–32.

14. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995 Sep 1;20(3):273–97.

15. Friedman JH. Greedy function approximation: A gradient boosting machine. Ann Stat. 2001 Oct;29(5):1189–232.

16. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015 May;521(7553):436–44.

17. Segler MHS, Waller MP. Neural-Symbolic Machine Learning for Retrosynthesis and Reaction Prediction. Chem – Eur J. 2017;23(25):5966–71.

18. Watson IA, Wang J, Nicolaou CA. A retrosynthetic analysis algorithm implementation. J Cheminformatics. 2019 Jan 3;11(1):1.

19. Liou C-Y, Cheng W-C, Liou J-W, Liou D-R. Autoencoder for words. Neurocomputing. 2014 Sep 2;139:84–96.

20. LeCun Y, Bengio Y. Convolutional Networks for Images, Speech, and Time-Series. 1995.

21. Hochreiter S, Schmidhuber J. Long Short-Term Memory. Neural Comput. 1997 Nov 1;9(8):1735–80.

22. Lamb A, Goyal A, Zhang Y, Zhang S, Courville A, Bengio Y. Professor Forcing: A New Algorithm for Training Recurrent Networks. ArXiv161009038 Cs Stat [Internet]. 2016 Oct 27 [cited 2019 Jan 4]; Available from: http://arxiv.org/abs/1610.09038

23. Luong M-T, Pham H, Manning CD. Effective Approaches to Attention-based Neural Machine Translation. ArXiv150804025 Cs [Internet]. 2015 Aug 17 [cited 2019 Jan 4]; Available from: http://arxiv.org/abs/1508.04025

24. Schwaller P, Gaudin T, Lanyi D, Bekas C, Laino T. “Found in Translation”: Predicting Outcomes of Complex Organic Chemistry Reactions using Neural Sequence-to-Sequence Models. ArXiv171104810 Cs Stat [Internet]. 2017 Nov 13 [cited 2018 Dec 28]; Available from: http://arxiv.org/abs/1711.04810

25. Lowe D. Chemical reactions from US patents (1976-Sep2016) [Internet]. 2017 [cited 2019 Jan 4]. Available from: https://figshare.com/articles/Chemical_reactions_from_US_patents_1976Sep2016_/5104873

26. Jin W, Coley CW, Barzilay R, Jaakkola T. Predicting Organic Reaction Outcomes with Weisfeiler-Lehman Network. ArXiv170904555 Cs Stat [Internet]. 2017 Sep 13 [cited 2019 Jan 4]; Available from: http://arxiv.org/abs/1709.04555

27. Schwaller P, Laino T, Gaudin T, Bolgar P, Bekas C, Lee AA. Molecular Transformer for Chemical Reaction Prediction and Uncertainty Estimation. ArXiv181102633 Phys [Internet]. 2018 Nov 6 [cited 2018 Dec 28]; Available from: http://arxiv.org/abs/1811.02633

28. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention Is All You Need. ArXiv170603762 Cs [Internet]. 2017 Jun 12 [cited 2019 Jan 4]; Available from: http://arxiv.org/abs/1706.03762

29. A Specialized Platform for Innovative Research Exploration (ASPIRE) [Internet]. National Center for Advancing Translational Sciences. 2018 [cited 2019 Jan 4]. Available from: https://ncats.nih.gov/aspire

30. Sittampalam GS, Rudnicki DD, Tagle DA, Simeonov A, Austin CP. Mapping biologically active chemical space to accelerate drug discovery. Nat Rev Drug Discov [Internet]. 2018 Nov 27 [cited 2019 Jan 11]; Available from: http://www.nature.com/articles/d41573-018-00007-2

31. HEAL Initiative [Internet]. National Institutes of Health (NIH). [cited 2019 Jan 4]. Available from: https://www.nih.gov/research-training/medical-research-initiatives/heal-initiative
