AI-driven automated chemistry as a tool to accelerate drug discovery

Typically, early drug discovery campaigns start with the screening of chemical libraries to identify candidate chemotypes that modulate a particular target and/or phenotype. The success of primary screening depends on multiple factors related to both biology and chemistry. These include the target’s druggability, the sensitivity and specificity of the assay system, the composition and diversity of the chemical libraries, the number of screened compounds, etc.

Since every project has cost constraints, chemical libraries cannot cover the entire chemical space of up to 10^60 potentially drug-like small molecules.1 Moreover, libraries typically contain only tens of thousands to hundreds of thousands of compounds, which further complicates the search for new prospective chemotypes in such a narrow region of chemical space. Recent advances in chemistry automation could help to overcome this limitation and establish a process for discovering new bioactive chemicals.

The automation of synthetic organic chemistry is a complex process, which requires the synchronised application of instruments, devices, and software algorithms, as well as the ability to execute different operations like reagent mixing, filtrations, purifications, etc. This complex process needs to be regulated by sophisticated algorithms, which can be notionally divided into infrastructure and reaction generator parts. The infrastructure software maps synthetic reactions onto the specific steps the robots must execute, captures and stores the reaction conditions, parameters and yields, and controls the operation of hardware and devices. Recently, several studies have proposed different approaches to control the assembly of molecules in robotic systems. For example, Steiner et al.2 proposed a chemical programming language that can direct the synthesis of organic compounds on a flexible modular platform. The authors called this language XDL (chemical descriptive language); it is based on the XML format and is capable of describing the synthetic procedure with all required information. Once a reaction is translated to XDL, it is further converted to a scripting language named ChASM (chemical assembly), which provides machine operation instructions and supports functions and variables. To navigate among physically connected robotics modules, the authors implemented a graph abstraction using the open-source GraphML format. Thus, to automatically assemble the molecule, the authors developed, implemented, and integrated three different software elements into one system named Chemputer. As a proof of concept, Steiner et al.2 successfully resynthesised three pharmaceutical compounds (Nytol, rufinamide, and sildenafil) using the proposed approach.

In another study, Godfrey et al.3 used a custom software package as part of a fully integrated, globally accessible, automated chemical synthesis laboratory. The architecture of the proposed software consists of different layers that perform separate tasks:

  •  control of transport logic (Alchemii)
  •  establishment, submission, and management of chemistry workflow (ALS software)
  •  real-time inventory and trackers (web-interface).

The authors applied the proposed system to automated medicinal chemistry, focusing on the generation of challenging heterocyclic libraries.
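The layered translation described above for the Chemputer, from a structured procedure description down to machine-level instructions, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the tag names (`Add`, `Stir`, `Filter`), their attributes and the output instruction strings are hypothetical and do not reproduce the actual XDL schema or ChASM syntax.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML procedure; NOT the real XDL schema.
PROCEDURE = """
<Procedure>
  <Add vessel="reactor" reagent="amine" volume_ml="5"/>
  <Stir vessel="reactor" time_min="30"/>
  <Filter vessel="reactor"/>
</Procedure>
"""


def expand_step(step):
    """Expand one high-level step into low-level instruction strings.

    The instruction names are made up for this sketch.
    """
    a = step.attrib
    if step.tag == "Add":
        return [f"MOVE {a['reagent']} -> {a['vessel']} {a['volume_ml']} mL"]
    if step.tag == "Stir":
        return [
            f"START_STIR {a['vessel']}",
            f"WAIT {a['time_min']} min",
            f"STOP_STIR {a['vessel']}",
        ]
    if step.tag == "Filter":
        return [f"FILTER {a['vessel']}"]
    raise ValueError(f"unsupported step: {step.tag}")


def compile_procedure(xml_text):
    """Parse the procedure and return the flat list of instructions."""
    root = ET.fromstring(xml_text)
    instructions = []
    for step in root:
        instructions.extend(expand_step(step))
    return instructions


if __name__ == "__main__":
    for line in compile_procedure(PROCEDURE):
        print(line)
```

The point of the split is that the procedure file stays hardware-agnostic, while only `expand_step` would need to know which physical operations a given platform supports.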

Despite these developments in infrastructure software for automated chemistry, solutions are still needed that are broadly applicable, scalable, flexible and capable of modular expansion. Another limitation of the described systems is that they rely on well-described input. Thus, for Chemputer, the input is a detailed reaction description, which is converted into XDL format using a natural language processing (NLP) approach. This description may be taken from a scientific publication or written down directly by the user. The input in the system described by Godfrey comes directly from a chemist and is represented by the drawn reaction workflow.

Thus, to explore new chemical space, both systems need a reaction prediction component that can forecast synthetic routes with the corresponding conditions and yields. The constant progress in machine learning and artificial intelligence (AI) systems provides a great opportunity to integrate cutting-edge solutions into this field. In recent decades, many studies have been conducted with the aim of predicting chemical reactions.4-10 In general, they can be divided into two major groups: those that require reaction atom mapping and those with a mapping-free approach. The reaction atom mapping group can be sub-divided into rule-based and machine learning-based approaches.

Rule-based approaches mostly use SMARTS11 and/or SMIRKS12 notations for reaction transforms. Szymkuć et al. developed software named Chematica, which uses more than 20,000 reaction transforms represented by SMARTS notations.10 Since pure SMARTS transforms often do not contain specific information – eg, groups’ protection data, reaction conditions, incompatible groups, etc – the authors incorporated these data together with reaction SMARTS in expert-coded records that can be used as the “rules” for retrosynthesis planning.10 In addition to the protected and incompatible groups, the system calculates a delocalisation energy per atom to determine the position at which a particular reaction transformation can be applied.

Since multiple reaction transformations can be applied to some compounds, the authors proposed a scoring function based on pre-defined structure and reaction features – such as the number of rings, stereocentres, group incompatibilities, etc – to prioritise synthetic routes. The drawbacks of rule-based approaches are the need to apply all rules to a given compound, which slows down the prediction process, and the lack of a statistical rationale behind the scoring functions. Indeed, since the proposed scoring functions depend on features and coefficients, their best combination needs to be statistically validated.
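A feature-and-coefficient scoring function of the kind described above can be sketched as a weighted sum. This is only an illustration, not Chematica's actual function: the feature names, the coefficient values and the convention that a lower penalty means a more attractive route are all assumptions.

```python
# Hypothetical coefficients; in practice these would need statistical validation.
WEIGHTS = {
    "n_rings": 1.0,
    "n_stereocentres": 2.0,
    "n_incompatible_groups": 5.0,
}


def route_penalty(features, weights=WEIGHTS):
    """Weighted sum of route features (assumed: lower is better)."""
    return sum(weights[name] * features.get(name, 0) for name in weights)


def prioritise(routes, weights=WEIGHTS):
    """Return route names ordered from lowest to highest penalty."""
    return sorted(routes, key=lambda name: route_penalty(routes[name], weights))


# Made-up candidate routes for illustration only.
ROUTES = {
    "route_a": {"n_rings": 2, "n_stereocentres": 1, "n_incompatible_groups": 0},
    "route_b": {"n_rings": 1, "n_stereocentres": 0, "n_incompatible_groups": 1},
    "route_c": {"n_rings": 3, "n_stereocentres": 0, "n_incompatible_groups": 0},
}

if __name__ == "__main__":
    for name in prioritise(ROUTES):
        print(name, route_penalty(ROUTES[name]))
```

Changing a single coefficient can reorder the routes, which is why the choice of features and weights has to be justified rather than hand-tuned.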

To address some of these limitations, machine-learning approaches have been deployed.8,4 Instead of scoring functions, the reaction transform can be predicted by a model developed using machine-learning methods such as Random Forest,13 Support Vector Machines,14 Gradient Boosting,15 Deep Neural Nets16 and others. Such models take chemical descriptors calculated from the reactants as input and predict, as output, a label representing the reaction transformation in the form of SMARTS or SMIRKS. Once the label is predicted, the corresponding SMARTS/SMIRKS can be applied directly to the reactants to generate the reaction product.

Such an approach has the following strengths:

  • it can provide the probability for each reaction transform so they can be easily ranked, and
  • it can be used for the prediction of reaction conditions and agents.6
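The ranking of reaction transforms by predicted probability can be sketched as follows. The raw scores stand in for the output of a trained classifier, and both the scores and the label names are hypothetical; in a real system each label would index a stored SMARTS/SMIRKS transform.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def rank_transforms(raw_scores, top_k=3):
    """Return the top_k (label, probability) pairs, most probable first."""
    labels = list(raw_scores)
    probs = softmax([raw_scores[label] for label in labels])
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Made-up classifier outputs for one set of reactants.
RAW_SCORES = {
    "amide_coupling": 2.0,
    "ester_hydrolysis": 1.0,
    "suzuki_coupling": 0.0,
    "reductive_amination": -1.0,
}

if __name__ == "__main__":
    for label, prob in rank_transforms(RAW_SCORES):
        print(f"{label}: {prob:.3f}")
```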

Wei et al.4 combined Morgan and neural fingerprints as descriptors with a deep neural network (DNN) to build a model for the prediction of 16 reaction types of alkyl halides and alkenes. Segler et al.17 used Morgan fingerprints and a DNN to build a model for the prediction of 8,720 reaction transformations gathered from the Reaxys database. The authors showed that the machine-learning approach yielded significantly better results than the pure rule-based method. More examples can be found elsewhere.9,5,18 Although the aforementioned studies significantly advanced the field, the developed models ignore the chirality and stereochemistry of chemical compounds and can predict only the most frequent reaction types (those occurring at least 100 times). Also, since these methods use SMARTS/SMIRKS as a label, they still depend on proper atom mapping, which can be challenging for unbalanced reactions.

Recent advances in AI offer new solutions for this field. One of these is borrowed from machine translation and treats reaction prediction as a sequence-to-sequence (seq2seq) modelling problem. In this case, reactants, products and agents are represented as SMILES strings and fed into an autoencoder to build the predictive model.19 An autoencoder is a type of deep learning architecture that consists of an encoder and a decoder. The encoder transforms the SMILES of reactants and agents into a latent space, and the decoder reconstructs the reaction product from the latent space. Different types of deep neural nets can be used to build autoencoders; the most commonly used are convolutional20 and recurrent neural nets.21
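Before SMILES can be fed to an encoder, they have to be turned into a sequence of integer token ids. The sketch below shows one plausible way to do this; the regular expression, the special tokens and the function names are assumptions made for illustration and are not taken from the cited studies.

```python
import re

# Simplified SMILES token pattern (an assumption; it covers common organic
# atoms, bracket atoms, bonds, branches, ring closures and the '.' separator).
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|[()=#+\-\\/.:~@?>*$]|%[0-9]{2}|[0-9]"
)

SPECIAL_TOKENS = ["<pad>", "<sos>", "<eos>"]  # hypothetical special tokens


def tokenize(smiles):
    """Split a SMILES string into tokens; fail if any character is left over."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"cannot tokenize: {smiles!r}")
    return tokens


def build_vocab(token_lists):
    """Map every token seen in the data to an integer id."""
    seen = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: idx for idx, tok in enumerate(SPECIAL_TOKENS + seen)}


def encode(smiles, vocab):
    """Encode a SMILES string as ids wrapped in start and end markers."""
    ids = [vocab[tok] for tok in tokenize(smiles)]
    return [vocab["<sos>"]] + ids + [vocab["<eos>"]]


if __name__ == "__main__":
    reactants = "CC(=O)Cl.NCc1ccccc1"
    vocab = build_vocab([tokenize(reactants)])
    print(tokenize(reactants))
    print(encode(reactants, vocab))
```

Treating `Cl`, `Br` and bracket atoms as single tokens keeps chemically meaningful units together, which a character-by-character split would not.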

To achieve better performance, the teacher forcing strategy22 and the attention mechanism23 have been proposed and widely used. In the teacher forcing strategy, training is improved by feeding the decoder the actual (ground-truth) previous token rather than its own previous prediction. The attention mechanism allows the network to focus on the relevant parts of the input sequence by assigning corresponding weights to them. Since the seq2seq approach ignores atom mapping, it enables the model to be trained even on unbalanced reactions with multiple agents. Liu et al.5 developed a recurrent seq2seq model and validated it on a test set of 5,000 reactions belonging to 10 different reaction types. The authors showed that the performance of the seq2seq model was comparable with that of the rule-based approach. Schwaller et al.24 developed a recurrent model with a similar architecture and validated it using a reaction dataset extracted from patents.25 The resulting model showed superior performance compared to another deep learning approach.26 In a follow-up study, Schwaller et al.27 adopted the Transformer architecture originally proposed by Google Brain28 and showed a significant improvement in the accuracy of reaction predictions.
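The weighting performed by an attention mechanism can be illustrated with a minimal dot-product variant. This is a sketch under simplifying assumptions: vectors are plain Python lists, the score is an unscaled dot product, and the example vectors are made up; it is not the exact formulation used in any of the cited models.

```python
import math


def softmax(scores):
    """Normalise scores into weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def dot_attention(decoder_state, encoder_states):
    """Return (weights, context) for one decoder step.

    weights: one weight per encoder position, from dot-product scores.
    context: weighted average of the encoder states.
    """
    scores = [
        sum(d * h for d, h in zip(decoder_state, state))
        for state in encoder_states
    ]
    weights = softmax(scores)
    dim = len(decoder_state)
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


if __name__ == "__main__":
    # Two made-up encoder states; the decoder state aligns with the first one.
    encoder_states = [[1.0, 0.0], [0.0, 1.0]]
    weights, context = dot_attention([10.0, 0.0], encoder_states)
    print(weights, context)
```

In this example nearly all of the weight falls on the first encoder position, so the context vector passed to the decoder is dominated by that position.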

Although the described models demonstrated strong performance, they disregarded reaction conditions and yield, and sometimes predicted invalid SMILES, which led to inaccurate results. Thus, combining the atom mapping and mapping-free approaches, or constructing a hybrid method that considers reaction conditions, agents and yields, may lead to a dramatic improvement in computer-based synthetic planning.

With so much progress made in the infrastructure and reaction prediction fields, the most promising future direction seems to be the combination of both areas, which offers a great opportunity to explore uncharted chemical space. Indeed, new chemical compounds can be proposed by AI and physically synthesised by an automated robotic system controlled by infrastructure software. Taking account of these possibilities, the National Center for Advancing Translational Sciences (NCATS), part of the National Institutes of Health (NIH), initiated a programme titled ‘A Specialized Platform for Innovative Research Exploration (ASPIRE)’,29,30 whose goal is to help study unexplored biologically active chemical space by integrating automated synthetic chemistry, high-throughput biology, and artificial intelligence technologies. The NCATS ASPIRE programme is also designed to accelerate scientific solutions to the national opioid public health crisis under the NIH HEAL (Helping to End Addiction Long-term)31 initiative. Thus, the new AI approaches and infrastructure technologies developed under these programmes could help with the engineering of treatments for opioid misuse and addiction, as well as enhance the discovery of next-generation addiction-free analgesics.


ALEXEY ZAKHAROV received his Ph.D. in Bioinformatics from the Institute of Biomedical Chemistry of Russian Academy of Medical Sciences. In 2011 he joined the Chemical Biology Laboratory at the National Cancer Institute, National Institutes of Health (NIH) where he continued to strengthen his expertise in the cheminformatics field, by applying and developing in silico methods to aid in the drug discovery and design projects of the CADD Group. In 2015 he joined the informatics group at the National Center for Advancing Translational Sciences (NCATS), NIH. Currently, Zakharov serves as the Informatics lead for an early therapeutic discovery project team and manages and coordinates the team efforts toward achieving the project deliverables and milestones. He works on various projects in the therapeutic areas of cancer, cardiovascular, neurological, viral and inflammatory diseases.

ANTON SIMEONOV is the scientific director of the Intramural Division of Preclinical Innovation at the National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH). The author or inventor on more than 150 peer-reviewed scientific publications and patents, Simeonov has a truly diverse background, ranging from bioorganic chemistry and molecular biology to clinical diagnostic research and development. He received a Ph.D. in bioorganic chemistry from the University of Southern California and a B.A. in chemistry from Concordia College. Simeonov then trained as a postdoctoral fellow at the Scripps Research Institute under Richard Lerner and Kim Janda. Prior to joining NIH in November 2004, Simeonov was a senior scientist at Caliper Life Sciences, and a leading developer of microfluidic technologies, where he was responsible for basic research on novel assay methodologies and development of microfluidic products for research and clinical diagnostics.


1. Bohacek RS, McMartin C, Guida WC. The art and practice of structure-based drug design: a molecular modeling perspective. Med Res Rev. 1996 Jan;16(1):3–50.

2. Steiner S, Wolf J, Glatzel S, Andreou A, Granda JM, Keenan G, et al. Organic synthesis in a modular robotic system driven by a chemical programming language. Science. 2018 Nov 29;eaav2211.

3. Godfrey AG, Masquelin T, Hemmerle H. A remote-controlled adaptive medchem lab: an innovative approach to enable drug discovery in the 21st Century. Drug Discov Today. 2013 Sep 1;18(17):795–802.

4. Wei JN, Duvenaud D, Aspuru-Guzik A. Neural Networks for the Prediction of Organic Chemistry Reactions. ACS Cent Sci. 2016 Oct 26;2(10):725–32.

5. Liu B, Ramsundar B, Kawthekar P, Shi J, Gomes J, Luu Nguyen Q, et al. Retrosynthetic Reaction Prediction Using Neural Sequence-to-Sequence Models. ACS Cent Sci. 2017 Oct 25;3(10):1103–13.

6. Gao H, Struble TJ, Coley CW, Wang Y, Green WH, Jensen KF. Using Machine Learning To Predict Suitable Conditions for Organic Reactions. ACS Cent Sci. 2018 Nov 28;4(11):1465–76.

7. Fooshee D, Mood A, Gutman E, Tavakoli M, Urban G, Liu F, et al. Deep learning for chemical reaction prediction. Mol Syst Des Eng. 2018;3(3):442–52.

8. Gimadiev TR, Madzhidov TI, Nugmanov RI, Baskin II, Antipin IS, Varnek A. Assessment of tautomer distribution using the condensed reaction graph approach. J Comput Aided Mol Des. 2018 Mar 1;32(3):401–14.

9. Segler MHS, Preuss M, Waller MP. Planning chemical syntheses with deep neural networks and symbolic AI. Nature. 2018 Mar;555(7698):604–10.

10. Szymkuć S, Gajewska EP, Klucznik T, Molga K, Dittwald P, Startek M, et al. Computer-Assisted Synthetic Planning: The End of the Beginning. Angew Chem Int Ed. 2016 May 10;55(20):5904–37.

11. Daylight Theory: SMARTS – A Language for Describing Molecular Patterns [Internet]. [cited 2019 Jan 4]. Available from:

12. Daylight>SMIRKS Tutorial [Internet]. [cited 2019 Jan 4]. Available from:

13. Breiman L. Random Forests. Mach Learn. 2001 Oct 1;45(1):5–32.

14. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995 Sep 1;20(3):273–97.

15. Friedman JH. Greedy function approximation: A gradient boosting machine. Ann Stat. 2001 Oct;29(5):1189–232.

16. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015 May;521(7553):436–44.

17. Segler MHS, Waller MP. Neural-Symbolic Machine Learning for Retrosynthesis and Reaction Prediction. Chem – Eur J. 2017;23(25):5966–71.

18. Watson IA, Wang J, Nicolaou CA. A retrosynthetic analysis algorithm implementation. J Cheminformatics. 2019 Jan 3;11(1):1.

19. Liou C-Y, Cheng W-C, Liou J-W, Liou D-R. Autoencoder for words. Neurocomputing. 2014 Sep 2;139:84–96.

20. LeCun Y, Bengio Y. Convolutional Networks for Images, Speech, and Time-Series. 1995.

21. Hochreiter S, Schmidhuber J. Long Short-Term Memory. Neural Comput. 1997 Nov 1;9(8):1735–80.

22. Lamb A, Goyal A, Zhang Y, Zhang S, Courville A, Bengio Y. Professor Forcing: A New Algorithm for Training Recurrent Networks. ArXiv161009038 Cs Stat [Internet]. 2016 Oct 27 [cited 2019 Jan 4]; Available from:

23. Luong M-T, Pham H, Manning CD. Effective Approaches to Attention-based Neural Machine Translation. ArXiv150804025 Cs [Internet]. 2015 Aug 17 [cited 2019 Jan 4]; Available from:

24. Schwaller P, Gaudin T, Lanyi D, Bekas C, Laino T. “Found in Translation”: Predicting Outcomes of Complex Organic Chemistry Reactions using Neural Sequence-to-Sequence Models. ArXiv171104810 Cs Stat [Internet]. 2017 Nov 13 [cited 2018 Dec 28]; Available from:

25. Lowe D. Chemical reactions from US patents (1976-Sep2016) [Internet]. 2017 [cited 2019 Jan 4]. Available from:

26. Jin W, Coley CW, Barzilay R, Jaakkola T. Predicting Organic Reaction Outcomes with Weisfeiler-Lehman Network. ArXiv170904555 Cs Stat [Internet]. 2017 Sep 13 [cited 2019 Jan 4]; Available from:

27. Schwaller P, Laino T, Gaudin T, Bolgar P, Bekas C, Lee AA. Molecular Transformer for Chemical Reaction Prediction and Uncertainty Estimation. ArXiv181102633 Phys [Internet]. 2018 Nov 6 [cited 2018 Dec 28]; Available from:

28. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention Is All You Need. ArXiv170603762 Cs [Internet]. 2017 Jun 12 [cited 2019 Jan 4]; Available from:

29. A Specialized Platform for Innovative Research Exploration (ASPIRE) [Internet]. National Center for Advancing Translational Sciences. 2018 [cited 2019 Jan 4]. Available from:

30. Sittampalam GS, Rudnicki DD, Tagle DA, Simeonov A, Austin CP. Mapping biologically active chemical space to accelerate drug discovery. Nat Rev Drug Discov [Internet]. 2018 Nov 27 [cited 2019 Jan 11]; Available from:

31. HEAL Initiative [Internet]. National Institutes of Health (NIH). [cited 2019 Jan 4]. Available from: