AI in Bioactivity Prediction to Aid Drug Discovery

The current cost of developing a new drug is approximately US$2.5 billion. This number has been steadily on the rise, nearly doubling every nine years. According to Frost and Sullivan, AI has the potential to improve outcomes for the medical imaging and diagnosis process by 30-40% and reduce the costs of treatments by as much as 50%, with a greater impact on patients as a result of earlier diagnosis.

Hence, one of the primary goals of the deployment of AI and machine learning in drug discovery is to assist in reducing the cost of drug development and in creating an efficient process for getting drugs to phase 1 of clinical trials. Currently, 90% of drugs never enter phase 1 of clinical trials.

A technique used to identify lead compounds that may be used in the production of drugs is called virtual screening.

This refers to a range of in-silico, computer-assisted drug design techniques in which the processing power of computers is deployed to search large compound databases in order to select a smaller number of compounds for biological testing. Virtual screening can be used to:

· Select compounds for screening from in-house databases

· Choose compounds to purchase from external suppliers

· Decide which compounds to synthesise next.
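The selection step described above can be sketched in a few lines of Python. This is a minimal illustration only: the compound ids, the scores and the `select_for_testing` helper are hypothetical placeholders, and a real pipeline would obtain its scores from a docking program or a trained model.

```python
# Hypothetical in-house library: compound id -> predicted activity score (0 to 1).
# The scores are made-up placeholders, not real predictions.
library = {
    "CMPD-001": 0.12,
    "CMPD-002": 0.87,
    "CMPD-003": 0.45,
    "CMPD-004": 0.91,
    "CMPD-005": 0.30,
}


def select_for_testing(scores, top_n):
    """Return the ids of the top_n highest-scoring compounds, best first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n]


shortlist = select_for_testing(library, top_n=2)
print(shortlist)
```

The same ranking could equally be applied to a supplier catalogue or to a list of compounds proposed for synthesis.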

In machine learning, there are many methods that can be applied to optimise virtual drug screening. For example:

Support Vector Machines (SVMs): Use a kernel function to project data, for example candidate compounds, into a feature space in which the data points can be linearly separated into two classes, such as active and inactive compounds.
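The idea of projecting compounds into a feature space can be illustrated with a toy example. This is not an SVM implementation: the single descriptor, the data, the explicit `feature_map` and the hand-picked hyperplane are all assumptions made for illustration, whereas a real SVM learns the separating hyperplane from data and applies the kernel implicitly. Here the active compounds sit in the middle of the descriptor range, so no single cut-off on the raw value separates them, but a straight line does once each value x is projected to (x, x^2).

```python
# Toy data: one made-up descriptor value per compound; label 1 = active, 0 = inactive.
compounds = [(0.0, 0), (1.0, 0), (2.0, 1), (3.0, 1), (4.0, 1), (5.0, 0), (6.0, 0)]


def feature_map(x):
    """Explicitly project the 1D descriptor into the 2D space (x, x^2)."""
    return (x, x * x)


# Hand-picked (not learned) hyperplane in the projected space: 6*x - x^2 - 6.75 = 0.
WEIGHTS = (6.0, -1.0)
BIAS = -6.75


def classify(x):
    """Return 1 (active) if the projected point lies on the positive side of the hyperplane."""
    z = feature_map(x)
    score = WEIGHTS[0] * z[0] + WEIGHTS[1] * z[1] + BIAS
    return 1 if score > 0 else 0


predictions = [classify(x) for x, _ in compounds]
print(predictions)
```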

Decision Trees: A tree data structure whose internal nodes contain test conditions on molecular descriptors and whose leaf nodes are assigned to various molecular characteristics.
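A decision tree of this kind can be sketched as nested dictionaries. The descriptor names, thresholds and leaf labels below are illustrative assumptions, not values taken from a trained model.

```python
# Internal nodes test one descriptor against a threshold; leaves hold a class label.
# Descriptor names and thresholds are illustrative assumptions only.
tree = {
    "descriptor": "molecular_weight",
    "threshold": 500.0,
    "below": {
        "descriptor": "logp",
        "threshold": 5.0,
        "below": "active",
        "above": "inactive",
    },
    "above": "inactive",
}


def predict(node, molecule):
    """Walk from the root to a leaf, following the test result at each internal node."""
    while isinstance(node, dict):
        value = molecule[node["descriptor"]]
        node = node["below"] if value <= node["threshold"] else node["above"]
    return node


print(predict(tree, {"molecular_weight": 320.0, "logp": 2.1}))
print(predict(tree, {"molecular_weight": 650.0, "logp": 2.1}))
```

In practice the tests and thresholds would be learned from compounds with known activity rather than written by hand.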

Machine learning, and deep learning in particular, may be a candidate technology to optimise drug discovery because it uses large neural networks that learn to automate tasks that would normally be carried out by humans.

But How Does it Really Work?

In drug discovery the goal is to find a lead compound that can be optimised as a potential drug candidate. What optimisation means, in the context of drug discovery, is the use of a process called chemical synthesis to modify the lead molecule in order to improve the chances of it being transformed into a successful drug.

According to the American Chemical Society, chemical synthesis is the skill set at the heart of discovering new drugs. Chemists in the pharmaceutical industry craft complex compounds in a carefully planned step-by-step process to make original molecules from available precursors.

However, the challenge chemists face is that the chemical space is vast:

· There are 30 million known compounds

· A typical pharmaceutical compound collection contains approximately 1 million compounds

AI and machine learning algorithms can be trained to analyse large volumes of data. However, there is still the question of how to utilise neural networks in the discovery of new drugs, through the development of tools and techniques that are more efficient in terms of:

reducing the overall costs of drug discovery,
accuracy in identifying lead compounds that deliver the necessary therapeutic outcomes and
the speed to market

Drawbacks of Virtual Screening Techniques

Virtual screening techniques are becoming an important tool in the discovery of lead molecules (active chemicals) that contribute to positive therapeutic outcomes. They are used either in conjunction with high-throughput screening (HTS) technologies or as standalone techniques, to economically aid the critical selection of molecules for the initial screening process.

However, because the technology is not yet fully mature, there are drawbacks to high-throughput virtual screening. These include limited or missing structural information, an imprecise understanding of the properties of drug-like molecules, the inability to map 3D properties onto 2D structures, poor scoring functions, and incorrect assessment of structure activity relationship (SAR) data.

Machine Learning Algorithms

AI technologies, such as natural language processing, neural networks and deep learning, are being applied to unstructured data for more accurate and faster diagnosis. See the images below.

In deep learning, computational processing power is a limiting factor, in terms of both cost and training time. Neural networks are typically trained on graphics processing units (GPUs), which are more expensive than central processing units (CPUs).

Also, the time necessary to train a neural network can range from a couple of hours to months, depending on the amount of data and the number of layers in the network. Popular deep learning frameworks include TensorFlow, PyTorch, Keras, DL4J, Caffe and Microsoft Cognitive Toolkit.

Key Drivers for AI in Diagnostics Include:

Earlier detection of pathogenesis: AI has the potential to detect signs of cancer in mammograms much earlier than human clinicians can
Earlier identification of pandemics and disease progression across populations
Machine learning can enable doctors to make more personalised treatment decisions based on specific records.

Also, diagnostic centres within hospitals are leveraging deep learning and pattern recognition to reduce diagnosis turnaround time and to improve pathology workflow efficiency and accuracy of diagnosis.

Structure Activity Relationships (SAR)

Structure activity relationships are key to many aspects of drug discovery, ranging from the primary screening of chemicals to identify the most promising molecules through to lead optimisation.

The structure activity relationship can be described as a method used to predict the relationship between the biological activity of a molecule and its molecular structure.

This information enables chemists to modify the molecular structure of a collection of molecules in order to optimise some property or activity, improve the potency of the active element of the molecule, reduce toxicity and ensure sufficient bioavailability. In pharmacology, bioavailability refers to the proportion of an administered drug that reaches the body’s circulatory system, and the rate at which it does so.

High-throughput screening (HTS) experimental techniques are deployed to aid medicinal chemists in the identification of drugs with structure activity relationships that have the highest probability of delivering the right therapeutic outcomes.

However, when using HTS a chemist may be provided with hundreds of chemical series, and the data generated can overwhelm the capabilities of a single chemist. The chemist will need to rapidly identify the most promising chemical series out of the hundreds produced by the HTS.

Computational processing may be deployed to prioritise the most promising series and to build a variety of models that capture and encode SARs, so that the biological activities of new molecules can be predicted with a higher degree of accuracy.
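One simple way to prioritise chemical series is to group the HTS hits by series and rank the series by average measured activity. The sketch below assumes this ranking criterion, and the series names and activity values are invented; a real workflow would likely weigh further factors such as series size or synthetic accessibility.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical HTS hits: (chemical series, measured activity), higher is better.
hts_hits = [
    ("series_A", 6.2),
    ("series_B", 7.9),
    ("series_A", 6.8),
    ("series_C", 5.1),
    ("series_B", 8.1),
]


def rank_series(hits):
    """Group hits by series and return the series names, most active on average first."""
    by_series = defaultdict(list)
    for series, activity in hits:
        by_series[series].append(activity)
    return sorted(by_series, key=lambda name: mean(by_series[name]), reverse=True)


ranking = rank_series(hts_hits)
print(ranking)
```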

Modelling Structure Activity Relationships

This is carried out via two methods:

Statistical - regression models
Data Mining - pharmacophore models

The choice of modelling technique can influence the extent to which a SAR can be explored. 3D approaches are the preferable method, as the information on the ligand-receptor interactions is more comprehensive.

Ligands are molecules that bind to a receptor; this binding is a very precise and reversible process.

Creating Models

The model is created by training on a set of molecules with known activities, in order to predict the activity of new molecules. As most biological systems contain multiple complex processes, it would be unrealistic to expect linear relationships. Hence modern non-linear methods, such as neural networks and support vector machines, are extensively used due to their high level of accuracy.

Train model with a set of molecules with known SAR structures
Use the model to predict the biological activity of new molecules
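The two steps above can be sketched with a deliberately simple model. The example assumes each molecule is represented as a set of substructure ids (a hypothetical fingerprint) and uses a nearest-neighbour rule with Tanimoto similarity in place of the neural networks or support vector machines mentioned earlier; the `NearestNeighbourModel` class and all of the data are invented for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of feature ids."""
    union = fp_a | fp_b
    if not union:
        return 0.0
    return len(fp_a & fp_b) / len(union)


class NearestNeighbourModel:
    """Predicts the activity of the most similar training molecule."""

    def fit(self, fingerprints, activities):
        # Step 1: "training" here simply stores the molecules with known activity.
        self.training = list(zip(fingerprints, activities))
        return self

    def predict(self, fingerprint):
        # Step 2: look up the closest training molecule and reuse its activity.
        _, activity = max(self.training, key=lambda pair: tanimoto(fingerprint, pair[0]))
        return activity


# Made-up training set: fingerprints and their known activity labels.
train_fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}]
train_labels = ["active", "active", "inactive"]

model = NearestNeighbourModel().fit(train_fps, train_labels)
print(model.predict({1, 2, 3, 5}))
print(model.predict({8, 9}))
```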

Using linear regression and random forest models, one can ascertain the effects of specific structural features of a molecule on its observed biological activity.

Although statistical models and machine learning algorithms can be used to learn from SAR data and then predict the activities of new molecules, such predictions are not necessarily reliable.

Each model is associated with a “domain of applicability”. This is a measure of the level of uncertainty associated with predicting the biological activity of a new molecule, based upon the similarity of the molecules used to train the predictive model to the structure of the new molecules. It is essentially a measure of the reliability of the predictive model.
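As a minimal sketch of the linear regression case only, the example below fits a straight line by ordinary least squares to one made-up structural feature (a count of some substituent) against a made-up activity value. The fitted slope is then read as the estimated effect of adding one more unit of that feature. The data and the `fit_line` helper are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: feature count per molecule and the observed activity.
feature_counts = [1, 2, 3, 4]
activities = [5.0, 5.5, 6.0, 6.5]

slope, intercept = fit_line(feature_counts, activities)
print(f"estimated effect per feature unit: {slope:.2f}, baseline activity: {intercept:.2f}")
```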
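One common way to approximate such a check is to compare a new molecule with its most similar training molecule and to treat predictions below a similarity cut-off as unreliable. The sketch below assumes fingerprints represented as sets of feature ids, Tanimoto similarity and an arbitrary threshold of 0.5; none of these choices comes from a specific platform.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of feature ids."""
    union = fp_a | fp_b
    if not union:
        return 0.0
    return len(fp_a & fp_b) / len(union)


def nearest_similarity(query_fp, training_fps):
    """Similarity between the query and its closest training molecule."""
    return max(tanimoto(query_fp, fp) for fp in training_fps)


def in_domain(query_fp, training_fps, threshold=0.5):
    """Treat a prediction as reliable only if the query resembles the training set.

    The 0.5 threshold is an arbitrary assumption, not a recommended value.
    """
    return nearest_similarity(query_fp, training_fps) >= threshold


# Made-up fingerprints of the molecules used to train a model.
training_fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}]

print(in_domain({1, 2, 3, 5}, training_fps))  # resembles the training molecules
print(in_domain({20, 21}, training_fps))      # shares no features with them
```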

AI Platforms used for Drug Discovery

Exscientia’s artificial-intelligence (AI) platform is being used to hunt for metabolic-disease therapies, and Roche subsidiary Genentech is using an AI system from GNS Healthcare in Cambridge, Massachusetts, to help drive the multinational company’s search for cancer treatments.

Antidote focuses on matching patients and medical researchers in clinical trials. It enables patients to find the most suitable clinical trials and facilitates the streaming of the latest study information to millions of patients.

Atomwise uses supercomputers to reduce the cost of medicine development by predicting, with a high degree of accuracy, the molecules that are most likely to work.

TurbineAI built an AI solution over six years to design personalised treatments for any cancer type or patient faster than traditional healthcare services.

Next month FluxInsights will cover the application of neural networks in more depth.