Applying AI language translation to the creation of new pharmaceuticals

Posted: 16 August 2019 | Dr Sam Genway

Designing new drug molecules is crucial to R&D. Dr Sam Genway suggests that one way to improve and speed up this process is using AI inspired by language translation.

At the heart of Pharma R&D is the development of new drug molecules, which are effective against particular disease targets. There is considerable research going into technologies that take the number of explorable molecules from the millions towards the billions.

Efficacy is only one aim; molecules also need a host of other properties, from non-toxicity, solubility and stability to being synthesisable and patentable. Developing these complex drug molecules is an iterative process that narrows thousands of candidate molecules down to the most suitable.

This involves huge numbers of experiments, predictive models and expertise, applied across several rounds of optimisation. Each round modifies the candidates in search of a better set of potential molecules. Design improvements at each iteration might involve switching out parts of a molecule for others that are predicted to give better properties.

A common approach to drug design is to use a higher-level description of the needed molecular shape. One such description is the ‘reduced graph’, which involves specifying what structure the molecule should have: for example “an aromatic ring connected to a linker, which in turn is connected to an aliphatic ring acceptor, which in turn will potentially be connected to several other molecular substructures with different characterisations.”

This high-level description is useful because it limits the search for molecules to those that meet specified criteria, ie, having a similar structure to a known active compound. Creating a reduced graph for a known molecule is not difficult; the bigger challenge is the opposite process – finding suitable potential molecules which match the desired reduced graph. It is comparable to buying a house: if the criterion is “any house”, you will never find what you are looking for. But if you specify the location, how many bedrooms and the price, you have a better chance of success. Specifying the reduced graph of a molecule is like providing a detailed layout of the house you would like to own. However, while there are a million or so property ads online in the UK, the number of molecules in the chemical space available for drug design is around 10⁶⁰, with the overwhelming majority never having been synthesised in a laboratory.
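
To give a feel for the numbers in the comparison above, here is a rough back-of-the-envelope sketch. The two sizes come from the text; the screening rate of one billion molecules per second is purely an assumption chosen for illustration, not a figure from any real system.

```python
# Figures quoted in the article.
CHEMICAL_SPACE = 10**60   # molecules available for drug design
PROPERTY_ADS = 10**6      # "a million or so" UK property ads

# Assumption for illustration only: a screen that evaluates
# one billion molecules every second.
ASSUMED_RATE_PER_SECOND = 10**9
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

# How many molecules there are for every single property ad.
molecules_per_ad = CHEMICAL_SPACE // PROPERTY_ADS

# How long a brute-force pass over the whole space would take.
years_to_enumerate = CHEMICAL_SPACE // (ASSUMED_RATE_PER_SECOND * SECONDS_PER_YEAR)

print(f"Molecules per property ad: 10^{len(str(molecules_per_ad)) - 1}")
print(f"Years to enumerate everything: about {years_to_enumerate:.1e}")
```

Even under this generous assumption, exhaustive search is out of reach, which is why a specification such as a reduced graph is needed to narrow the search.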

Cheminformatics – computational and mathematical techniques that analyse collections of molecules and their properties – is used routinely in drug development on the path to finding a novel drug candidate. These computational, or in silico, drug modelling techniques have long relied on machine learning.

With the recent boom in artificial intelligence (AI), many are now asking how the breakthroughs in AI will transform drug design.

AI language translation, a solution to predicting molecules for new drugs

It isn’t immediately obvious, but generating a set of candidate molecules from a reduced graph description of the ‘right kind of molecule’ is a challenge that AI can help with.

Remarkably, we found that this problem can be related to a separate AI challenge: translating languages.

Language translation has been transformed in recent years through cutting-edge developments in neural networks such as ‘sequence-to-sequence learning’ and ‘attention mechanisms’.

Sequence-to-sequence learning takes a sequence of words, eg, a sentence in English, and outputs another sequence of words, eg, a translation in French. Languages have very different structures, which is why successful machine learning approaches consider sentences in their entirety and generate a new sentence that captures the whole meaning of the first.

It is also useful to know that particular words in each language relate to each other and this is where the ‘attention mechanism’ comes in. Attention mechanisms allow the model to focus on particular words in the input sentence when generating particular words in the output.
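
As a minimal sketch of the idea, the snippet below implements plain dot-product attention for a single output word. It is one common variant and not necessarily the one used in the work described here; the three-dimensional word encodings and the `query` vector are made-up numbers, and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Dot-product attention for one output step.

    Each input word is scored against the query, the scores become
    weights, and the weights blend the values into one context vector.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    width = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(width)]
    return weights, context

# Toy encodings for an English input sentence (made-up numbers).
source_words = ["the", "black", "cat"]
encodings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Hypothetical decoder state at the moment it is about to emit "noir".
query = [0.0, 3.0, 0.5]

weights, context = attend(query, encodings, encodings)
for word, weight in zip(source_words, weights):
    print(f"{word:>5}: {weight:.2f}")
```

In this toy setup the largest weight falls on "black", which is the input word the model would focus on while producing its French counterpart.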

Overall, this approach allows translations which are accurate locally, meaning that the correct words are selected, but also capture the overall meaning in the translation.

Unlike many problems in machine learning, there is often no single right answer with language translation. When asking lots of human linguists to translate a sentence from English to French we get multiple, equally valid answers. The same is true of the AI translation system. We can get multiple correct answers from the system if we ask it to translate the same sentence multiple times.
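
The sketch below shows why repeated requests can return different, equally valid answers: the system assigns probabilities to each possible next word and samples from them. The tiny hand-written table `NEXT_WORD` stands in for a trained model and is an assumption for illustration only; a real system would compute these probabilities with a neural network.

```python
import random

# Hypothetical stand-in for a trained model translating "I am happy":
# for each partial output, the probability of each possible next word.
NEXT_WORD = {
    (): {"je": 1.0},
    ("je",): {"suis": 1.0},
    ("je", "suis"): {"content": 0.5, "heureux": 0.5},
    ("je", "suis", "content"): {"<end>": 1.0},
    ("je", "suis", "heureux"): {"<end>": 1.0},
}

def sample_translation(rng):
    """Build one translation by sampling a word at a time until <end>."""
    words = []
    while True:
        dist = NEXT_WORD[tuple(words)]
        word = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        if word == "<end>":
            return " ".join(words)
        words.append(word)

rng = random.Random(0)  # fixed seed so the run is repeatable
translations = {sample_translation(rng) for _ in range(50)}
print(sorted(translations))
```

The same mechanism is what later allows a single molecular outline to yield many different candidate molecules.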

Generating molecules to match specification

So, what does this have to do with creating molecules?

A molecule can be represented as a text sequence using a code called a SMILES string. The same is true of the high-level reduced graph capturing the outline of what the molecule should look like.
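
To make the "molecule as text" idea concrete, here is a small sketch that splits a SMILES string into the tokens a translation model could treat as words. The regular expression is a simplified assumption that covers bracket atoms, the two-letter halogens Cl and Br, and two-digit ring closures; it is not a full SMILES parser, and the function name is hypothetical.

```python
import re

# Bracket atoms, two-letter halogens and two-digit ring closures are
# kept whole; everything else is treated as a single-character token.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

def tokenize_smiles(smiles):
    """Split a SMILES string into a list of tokens."""
    return SMILES_TOKEN.findall(smiles)

aspirin = "CC(=O)Oc1ccccc1C(=O)O"
tokens = tokenize_smiles(aspirin)
print(tokens)
print(tokenize_smiles("ClCCBr"))
```

Joining the tokens back together reproduces the original string, so no information is lost by treating the molecule as a sentence of tokens.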

We found we could apply the same basic principles of language translation to “translate” the outline of a molecule into a novel molecule that matches that outline and therefore our requirements.

All that was required was a dataset with hundreds of thousands of molecules and their equivalent reduced graph outline to train the AI system. Fortunately, there are huge datasets of molecules available and generating high-level descriptions of a complete molecule is relatively simple. For any given reduced graph outlining a new molecule, the AI system can propose thousands of novel molecules that match the specification, which chemists can then use to guide their search for the next drug candidate.

How well does the AI work?

Having shown that new molecules can be generated with this technique, the AI needs to be tested to ensure it is doing something useful.

Full validation will need time, with expert chemists using AI tools in real discovery programmes, allowing the approach to be contrasted with existing methods. However, there are some tests that can be performed immediately by making use of historical data for proven drug molecules.

Certain molecules and their reduced graphs were removed completely from the dataset used to train the AI and set aside. This made it possible to give the system high-level reduced graphs of drug candidates from published literature that it had never seen before. If the AI system could take these high-level descriptions and generate a known active compound, this would be a great indication of its value in future discovery programmes.
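
One way such a hold-out could be arranged is sketched below: whole reduced graphs are set aside together with every molecule that maps to them, so nothing about them leaks into training. The function name, the 25% fraction and the labels such as `RG_A` are placeholders for illustration, not the published procedure or real reduced graph strings.

```python
import random

def split_holdout(pairs, holdout_fraction, seed=0):
    """Set aside whole reduced graphs, with all of their molecules."""
    graphs = sorted({graph for graph, _ in pairs})
    rng = random.Random(seed)
    rng.shuffle(graphs)
    n_holdout = max(1, int(len(graphs) * holdout_fraction))
    held_out = set(graphs[:n_holdout])
    train = [(g, smi) for g, smi in pairs if g not in held_out]
    test = [(g, smi) for g, smi in pairs if g in held_out]
    return train, test

# (reduced graph label, SMILES) pairs; the labels are placeholders.
pairs = [
    ("RG_A", "CCO"),
    ("RG_A", "CCCO"),
    ("RG_B", "c1ccccc1O"),
    ("RG_C", "CC(=O)O"),
    ("RG_C", "CCC(=O)O"),
    ("RG_D", "CN"),
]

train, test = split_holdout(pairs, holdout_fraction=0.25)
print(len(train), "training pairs,", len(test), "held-out pairs")
```

Splitting by reduced graph rather than by individual pair is the design choice that matters here: a random split of pairs could leave a held-out outline represented in training through a sibling molecule.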

In work published in the Journal of Chemical Information and Modeling, we performed this test with 20 different known active molecules, which had not been processed by the AI system. In most cases, a known active compound was generated. In all cases, there were molecules generated that were similar to a known active compound. Many of the thousands of molecules generated by the AI system will never have been synthesised in any lab, so there is no certainty surrounding their properties without making and testing them. However, the set of AI-generated molecules was diverse, and an AI system able to propose a variety of molecules in this way is valuable for scientists searching for possible molecules on the way to a drug candidate.
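
The two checks described above, exact recovery and similarity to a known active, can be sketched as follows. The Tanimoto coefficient (shared features divided by total distinct features) is a standard similarity measure, but the character trigram "fingerprint" used here is a crude stand-in chosen only to keep the example self-contained; in practice structural fingerprints and canonical SMILES from a cheminformatics toolkit would be used. The "generated" strings are invented for illustration and are not results from the study.

```python
def ngram_fingerprint(smiles, n=3):
    """Crude stand-in fingerprint: the set of character n-grams."""
    return {smiles[i:i + n] for i in range(len(smiles) - n + 1)}

def tanimoto(a, b):
    """Tanimoto coefficient of two feature sets, between 0 and 1."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def evaluate(known_active, generated):
    """Return (was the active recovered exactly?, best similarity)."""
    recovered = known_active in generated
    reference = ngram_fingerprint(known_active)
    best = max(tanimoto(reference, ngram_fingerprint(g)) for g in generated)
    return recovered, best

known_active = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, used as a stand-in
generated = ["CC(=O)Oc1ccccc1C(=O)OC", "CCO"]  # invented model outputs

recovered, best_similarity = evaluate(known_active, generated)
print("recovered exactly:", recovered)
print(f"best similarity: {best_similarity:.2f}")
```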

Being creative and collaborative

Establishing a connection between apparently unrelated problems – in drug discovery and language translation – may seem like a chance occurrence.

However, many successful applications of machine learning and analytics come from identifying related problems in other domains and understanding how to extend and specialise them for new challenges. It is only by combining broad expertise across AI techniques with deep subject matter expertise that it is possible to identify opportunities from seemingly unrelated techniques that could be used to solve R&D problems.

About the author

Dr Sam Genway joined Tessella in 2014 and is the Principal AI Solutions Engineer. He helps organisations exploit innovations in AI and develop novel capabilities. He has a PhD in Theoretical Physics from Imperial College London and worked as a Research Fellow at The University of Nottingham. Sam works across drug discovery, clinical development and pharmaceutical manufacturing, to identify transformative opportunities for data-driven decision-making.