RNA and machine learning: rational design for multidimensional biomarkers

Modern-day oncology therapies have seen significant innovation in the last decade. It is high time we committed to using biomarkers that are driven by rational design and the latest computational methods.

In an earlier age of medicine, new therapies were often discovered by ‘accident’. There was little technical knowledge of structure or function to guide the process of developing curative treatments. Trial and error dictated progress, resulting in slow and unpredictable successes. As our knowledge of small molecules, proteins and their structural relationships grew, we entered the era of rational drug design. Rational drug design has made a significant impact in the field of oncology, where we have gathered a deep knowledge of ligand binding and biochemical pathways. Modern-day drug strategies use frameworks of rational drug design, driven by computational experimentation, to accelerate the identification of potential therapies.

In the early 2000s, for example, there was an unexpected race for a small-molecule inhibitor of the type I TGFβ receptor (TβRI) kinase. Two groups, one led by Scott Sawyer at Eli Lilly and the other by Juswinder Singh at Biogen-Idec, discovered an identical molecule via separate efforts.1,2 The Lilly team used conventional high-throughput screening (HTS) enzyme and cell assays, which were costly and time consuming. Independently, Singh’s team streamlined the discovery by employing computational methods to perform a ‘virtual screening’. This approach was faster and less costly, and it gave Biogen-Idec an edge over Lilly. It was an early demonstration that computationally guided design had the potential to prioritise or even replace expensive chemical and biological assays, minimising limitations and time to market. Since then, databases of results from both low- and high-throughput studies have continued to grow rapidly, further enhancing our ability to rationally develop not only monotherapies, but also bispecific therapies and combination therapies.

The case for predictive biomarkers

“…to achieve the goal of precision medicine, we need continued investment in rational biomarker design”

The evolution of biomarker design is not so different from the evolution of drug design. Even with the most efficacious therapies, not all patients respond. Furthermore, when the process of matching patients with certain therapies goes wrong, adverse events can be costly and even deadly. For some time, the industry has worked to find biomarkers that provide predictive insight for matching patients to the right treatments. Historically, this meant identifying specific patient populations that should receive, or not receive, a therapy.

Early on, macroscale pathological characteristics were used to make treatment decisions for patients, including those with cancer. Tumour grade, size and location were documented, and statistics from the clinical results of many patients were used to generalise about who should receive therapy and who should not. Histology, once available, provided additional insight, taking us one step closer to a molecular-level understanding of why certain patients respond and others do not. However, the world of medicine changed drastically with the completion of the Human Genome Project and the advent of genomic medicine.

The era of genomic medicine

The outcome of the Human Genome Project was not merely a static reference sequence, as is often cited. Rather, the advancements made during the milestone effort and shortly after its completion resulted in the birth of genomic medicine. Genomic medicine represents a major breakthrough and a significant driver towards what we know as precision medicine, often defined as the right patient receiving the right treatment at the right time. Since the completion of the Human Genome Project, the technology known as high-throughput sequencing or next-generation sequencing (NGS) has generated trillions of genomic sequences from cancer patients’ tumour tissue.

Unfortunately, early attempts at using this data for rational biomarker design were not as effective as hoped. The field has relied heavily on DNA data. Linking observations in DNA to the downstream biological implications of epigenetics, along with transcriptional and translational modifications, has proved challenging. Enormous datasets have been mined to identify both drug targets and biomarkers in DNA, but the utility of single, static mutations has fallen short. There are a number of mutations whose biological implications we understand, such as BRAF V600E, but the presence of these mutations does not predict response to certain drugs as accurately as we would hope.3

Advancements in both the molecular and computational tools used to generate and analyse high-throughput RNA data have created a new and promising avenue for biomarker discovery. Because RNA is one step closer to the downstream biology occurring at the protein level, yet can be measured by the same technologies developed for high-throughput DNA sequencing, it can provide a rich and dynamic view of a patient’s molecular profile. High-throughput RNA sequencing has been used to confirm the expression of a mutation or fusion transcript, which affords significant clinical value. However, single-analyte biomarkers such as gene fusions or mutations at the RNA level limit analysis in many of the same ways as DNA. They seek to predict patient response based on one facet of biology, which is a considerable oversimplification.

Multidimensional biomarkers

Taking a page out of the rational drug design book, the logical next step for rational biomarker design is to increase dimensionality. Just as bispecific and combination therapies have evolved to target multiple disease points, biomarkers should also seek to capture and utilise as much information about molecular profiles as possible. Early efforts to accomplish this in colorectal cancer resulted in a new system of molecular subtyping.4 Researchers also began to build “signatures” of RNA, which consisted of ranked gene-lists to better classify patients.5 Today, by leveraging machine-learning tools, researchers can filter out vast levels of noise and identify only the most useful data signals to build what are known as RNA models.

Multidimensional RNA models built by machine learning make better predictive biomarkers than single-analyte biomarkers. Beyond the molecular advantages described above, machine learning provides a rational, data-driven method for building these models, and its output is the combination of signals that best fits the available data.
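To make the idea concrete, the following is a toy sketch in Python rather than the method behind any biomarker discussed in this article. It assumes a simulated cohort in which only a handful of genes carry a weak response signal, ranks genes by the difference in mean expression between responders and non-responders, and combines the top-ranked genes into a single score. All names, cohort sizes and effect sizes here are hypothetical and chosen purely for illustration.

```python
import random
import statistics


def make_cohort(n_patients=200, n_genes=50, n_informative=5, shift=0.8, seed=0):
    """Simulate expression profiles; only the first n_informative genes track response."""
    rng = random.Random(seed)
    profiles, labels = [], []
    for i in range(n_patients):
        responder = i % 2  # alternate labels so the two classes are balanced
        row = [rng.gauss(shift * responder if g < n_informative else 0.0, 1.0)
               for g in range(n_genes)]
        profiles.append(row)
        labels.append(responder)
    return profiles, labels


def gene_effects(profiles, labels):
    """Per-gene difference in mean expression, responders minus non-responders."""
    effects = []
    for g in range(len(profiles[0])):
        pos = [p[g] for p, y in zip(profiles, labels) if y == 1]
        neg = [p[g] for p, y in zip(profiles, labels) if y == 0]
        effects.append(statistics.mean(pos) - statistics.mean(neg))
    return effects


def fit_model(profiles, labels, top_k):
    """Keep the top_k genes by absolute effect and learn a threshold on their signed sum."""
    effects = gene_effects(profiles, labels)
    ranked = sorted(range(len(effects)), key=lambda g: abs(effects[g]), reverse=True)
    genes = ranked[:top_k]
    signs = {g: (1.0 if effects[g] >= 0 else -1.0) for g in genes}
    scores = [sum(signs[g] * p[g] for g in genes) for p in profiles]
    pos = statistics.mean(s for s, y in zip(scores, labels) if y == 1)
    neg = statistics.mean(s for s, y in zip(scores, labels) if y == 0)
    # Classify at the midpoint between the two class means
    return {"genes": genes, "signs": signs, "threshold": (pos + neg) / 2}


def predict(model, profile):
    """Return 1 (predicted responder) if the combined score exceeds the threshold."""
    score = sum(model["signs"][g] * profile[g] for g in model["genes"])
    return 1 if score > model["threshold"] else 0


def accuracy(model, profiles, labels):
    """Fraction of patients whose predicted label matches the true label."""
    hits = sum(predict(model, p) == y for p, y in zip(profiles, labels))
    return hits / len(labels)


# Fit on one simulated cohort, evaluate on an independently simulated one
train_x, train_y = make_cohort(seed=0)
test_x, test_y = make_cohort(seed=1)
single = fit_model(train_x, train_y, top_k=1)  # single-analyte analogue
multi = fit_model(train_x, train_y, top_k=5)   # multidimensional analogue
print("single-gene accuracy:", accuracy(single, test_x, test_y))
print("five-gene accuracy:", accuracy(multi, test_x, test_y))
```

In this simulation, each informative gene is only weakly predictive on its own, so a score that pools several of them would be expected to separate responders from non-responders more cleanly than any one gene. Real RNA models involve far more careful normalisation, validation and model selection than this sketch suggests.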

This approach requires researchers to put patient profiles at the centre, capturing a myriad of signals that represent disease, immune response, therapy response, etc. The resulting biomarkers have shown impressive improvements in predictive accuracy over single-analyte approaches in exploratory studies, even in the early stages of this new era of biomarkers built using predictive models.

What’s next?

In this narrative on modelling disease, one might assume we are talking only about the disease or the cancer cells themselves. However, it has become increasingly apparent that factors such as immune response to disease can be highly predictive of patient survival, response to traditional therapies and, of course, response to some of the most advanced therapies currently available, such as immunotherapies. Multidimensional predictive biomarker models of the immune system are built using an approach that has been described as Predictive Immune Modelling. These models, like many, serve to capture biological complexity and use data to predict patient response. In the future, building comprehensive biomarkers will require curating highly standardised databases of multidimensional biomarkers, complete with metadata, clinical data, outcomes data, etc., so that researchers may begin to draw conclusions from what they already know to be true.

How will today’s biomarkers, built to stratify one patient population, be able to inform another patient population with similar molecular profiles or select for a therapy with a similar mechanism of action? Without a doubt, to achieve the goal of precision medicine, we will need continued investment in rational biomarker design using the most informative molecular and computational tools available today, including RNA sequencing and machine-learning tools.

About the author

Dr Jarret Glasscock is a geneticist and computational biologist. He is the founder and CEO of Cofactor Genomics. Prior to founding the company, Jarret was a faculty member in the Department of Genetics at Washington University and part of The Genome Institute.


  1. Sawyer J, Anderson B, Beight D, Campbell R, Jones M, Herron D et al. Synthesis and Activity of New Aryl- and Heteroaryl-Substituted Pyrazole Inhibitors of the Transforming Growth Factor-β Type I Receptor Kinase Domain. Journal of Medicinal Chemistry [Internet]. 2003 [cited 23 December 2019];46(19):3953-3956. Available from:
  2. Singh J, Chuaqui C, Boriack-Sjodin P, Lee W, Pontz T, Corbley M et al. Successful shape-Based virtual screening: The discovery of a potent inhibitor of the type I TGFβ receptor kinase (TβRI). Bioorganic & Medicinal Chemistry Letters [Internet]. 2003 [cited 23 December 2019];13(24):4355-4359. Available from:
  3. Bonanno L, Zulato E, Attili I, Pavan A, Del Bianco P, Nardo G et al. 1830O Liquid biopsy as tool to monitor and predict clinical benefit from chemotherapy (CT) and immunotherapy (IT) in advanced non-small cell lung cancer (aNSCLC): A prospective study. Annals of Oncology. 2018;29(suppl_8).
  4. Menter D, Davis J, Broom B, Overman M, Morris J, Kopetz S. Back to the Colorectal Cancer Consensus Molecular Subtype Future. Current Gastroenterology Reports. 2019;21(2).
  5. Alexander E, Kennedy G, Baloch Z, Cibas E, Chudova D, Diggans J et al. Preoperative Diagnosis of Benign Thyroid Nodules with Indeterminate Cytology. New England Journal of Medicine. 2012;367(8):705-715.