How to add AI to Clinicaltrials.gov
by Josva Engmose Jensen (contact@me-ta.dk)
Posted on December 26, 2018
Updated on January 23, 2019 with
predictions using TensorFlow.js
Clinical trials are designed to answer specific questions about biomedical or behavioral
interventions in clinical research in the pharmaceutical industry. Clinicaltrials.gov is a
huge database, with around 500 new studies uploaded every week from around the world. With
access to all this data, why not try to feed it into a computer and see if we can do
something clever with it? Artificial Intelligence (AI) applications are emerging and
perform many tasks in the world we live in today. But how can AI be applied to simple
tasks in the context of clinical trials? In this paper I will walk you through an
example of a rather narrow task in which Deep Learning can help researchers manage clinical
trial workflows.
Problem description
The type of task I will guide you through is Supervised Learning, and the data is
downloaded from Clinicaltrials.gov. The goal is to predict the type of interventional model
in a study given a summary text and a title from the study. This is a multi-class
classification problem, meaning that there are more than two classes to be predicted. After
reading this article, you should be able to implement and develop your own LSTM network for
your own prediction problems. The process is divided into the following steps:
- Download trials from link: https://clinicaltrials.gov/AllPublicXML.zip
- Reading XML files into .csv files
- Reading .csv files into pandas dataframe in python
- Pre-process the text data
- Modelling and training
- Evaluation and prediction
The steps can be described and visualized as:
Import Classes and Functions
I will use a Python 3 Anaconda environment together with Jupyter Notebook, and I will use
the TensorFlow backend. TensorFlow is an open-source library typically used in
machine learning for neural networks.
Here are all the required libraries:
import numpy as np # Linear algebra
# Data processing, CSV file I/O (e.g. pd.read_csv)
import pandas as pd
from pandas import DataFrame
import xml.etree.ElementTree as ET # Reading xml files
# For plotting
import matplotlib.pyplot as plt
import pydot
import pydotplus
import graphviz
from tensorflow.keras.utils import plot_model
from sklearn.manifold import TSNE
# For Modelling
import tensorflow as tf
from tensorflow.keras import layers, models, preprocessing, callbacks, optimizers
print(tf.VERSION)
print(tf.keras.__version__)
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Embedding, Input, Add
from tensorflow.keras.layers import LSTM, Bidirectional, GlobalMaxPool1D, Dropout
from tensorflow.keras.preprocessing import text, sequence
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Conv2D, MaxPooling2D, BatchNormalization
from tensorflow.keras.layers import concatenate
from tensorflow.keras.metrics import categorical_accuracy
# For Pre-processing
import string
from string import digits
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from tqdm import tqdm
import re
# Other useful modules
import h5py
from statistics import mode
import os
import datetime
import warnings
warnings.filterwarnings('ignore')
Download Clinical Trials
The first step is to download all the clinical trials from Clinicaltrials.gov and unpack
them into your working directory. After you have done this, we are ready to work with the
XML files. A good idea is to investigate the XML files first; you can do this in several
editors, and I personally prefer Visual Studio Code. Get an overview of an XML file, see how
it is structured, and find the information of interest to you.
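To give you an idea of what to look for, here is a trimmed sketch of a study record showing just the elements we will extract (illustrative only; real files contain many more elements, and the exact nesting can vary):
<clinical_study>
  <id_info>
    <nct_id>NCT00000000</nct_id>
  </id_info>
  <brief_title>Study to Evaluate ...</brief_title>
  <brief_summary>
    <textblock>This clinical trial will ...</textblock>
  </brief_summary>
  <phase>Phase 3</phase>
  <study_design_info>
    <intervention_model>Parallel Assignment</intervention_model>
  </study_design_info>
</clinical_study>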
Working with XML and .csv
XML is an inherently hierarchical data format, and the most natural way to represent it is
with a tree. Python has a module for parsing and creating xml data called
**xml.etree.ElementTree**. The ElementTree (ET) represents the whole XML document as a
tree. We now want to get the contents from our xml files we are interested in. In our case
we wish to look at *'nct_id'*, *'brief_summary'*, *'brief_title'* and
*'intervention_model'*. The text from these roots are the text that er going into our .csv
file. As we have many xml files, we create a function which iterates through the different
files and return the text from the 4 different roots. We are not interested in all the xml
files, we only look at Phase 2 or 3 and Interventional studies. Here is my code for doing
this:
def csv_row(xml_file):
    tree = ET.parse(xml_file)
    root = tree.getroot()
    nct_text = ""
    sum_text = ""
    model_text = ""
    ph_text = ""
    title_text = ""
    for ph in root.iter('phase'):
        ph_text = ph.text
    # Only extracts text from Phase 2 and 3 studies
    if ph_text == "Phase 2" or ph_text == "Phase 3":
        # This bit finds the nct_id element, which is a child of id_info
        for nct in root.findall('id_info'):
            nct_text = nct.find('nct_id').text
        # This bit finds the brief summary text
        for s in root.findall('brief_summary'):
            sum_text = s.find('textblock').text
            sum_text = sum_text.replace('\n', ' ')  # Replaces newlines with a whitespace
            sum_text = re.sub(' +', ' ', sum_text)  # Compresses multiple whitespaces to only one
        # Gets the brief title of the study
        for t in root.iter('brief_title'):
            title_text = t.text
        # Gets the type of intervention_model
        for y in root.iter('intervention_model'):
            model_text = y.text
    # Returns Nct_id, brief_summary, title and type of intervention model in the ';'-separated form we intended
    total_text = "\"" + nct_text + "\"" + ";" + "\"" + sum_text + "\"" + ";" + "\"" + title_text + "\"" + ";" + "\"" + model_text + "\""
    return total_text

csv_row("search_result\\NCT00496392.xml")  # This is for checking that the function works
We now have a function that returns the text of interest, separated by ';', which is one of
the standard separators in .csv files. As you can see from this text, it contains symbols
which we are not interested in when training our model, but more about that later :) We now
wish to write this text into 2 different .csv files, as shown below:
rdir = 'Subset_data'  # Folder in the directory where all the xml folders are placed
with open('train_data.csv', 'w', encoding="utf-8") as csvfile:  # Opens a blank csv file
    with open('test_data.csv', 'w', encoding="utf-8") as csvfile1:
        for _, dirs, _ in os.walk(rdir):
            for dir in dirs:  # Looks at all the xml folders
                if dir < 'NCT0012':  # This is around 80% of the folders
                    for subdir, _, files in os.walk(os.path.join(rdir, dir)):
                        for file in files:
                            name = os.path.join(subdir, file)
                            csvfile.write(csv_row(name))  # Writes total_text as a row in train_data.csv
                            csvfile.write("\n")  # Skips to the next line and does the same
                else:  # This is the remaining 20% of the folders
                    for subdir, _, files in os.walk(os.path.join(rdir, dir)):
                        for file in files:
                            name = os.path.join(subdir, file)
                            csvfile1.write(csv_row(name))  # Writes total_text as a row in test_data.csv
                            csvfile1.write("\n")
As you can see from my code above, we condition on the folder names (a simple string
comparison) to make an appropriate division of train and test data, approximately 80%
training data and 20% test data. Another approach could be to load the data into a pandas
dataframe and use the function train_test_split from the sklearn module, where you
specify the test_size. An 80/20 split is very common in machine learning
*(Bronshtein, A. 2017, Train/Test Split and Cross Validation in Python.
[https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6])*.
We now have 2 .csv files containing the data of our interest.
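For reference, a minimal sketch of that alternative could look like this (the file all_data.csv is hypothetical and assumes you wrote every row to a single .csv instead of two):
from sklearn.model_selection import train_test_split
df = pd.read_csv("all_data.csv", sep=';', header=None)  # hypothetical file with all rows
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)  # 80/20 split
print(train_df.shape, test_df.shape)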
Reading into Pandas
We now wish to read our 2 .csv files into dataframes in Python, and here we can use the
module called pandas. The pandas module has a simple function for doing this, **pd.read_csv**,
and here is how you use it:
# Earlier we saw that the returned text from our function was separated by ';', so we use this as the separator when reading in the files
tr_df = pd.read_csv("train_data.csv", sep=';', header=None, error_bad_lines=False, warn_bad_lines=False)
t_df = pd.read_csv("test_data.csv", sep=';', header=None, error_bad_lines=False, warn_bad_lines=False)
# Give the data sets appropriate column names
tr_df.columns = ['Nct_id', 'Summary', 'Title', 'Model']
t_df.columns = ['Nct_id', 'Summary', 'Title', 'Model']
# We drop all the observations containing NaN's (missing values)
train = tr_df.dropna()
test = t_df.dropna()
This gives us 2 dataframes with rows and columns corresponding to our .csv files. As
you can see in my code, I have specified the column names myself. It is now a good idea to
visualize your response variable and plot its distribution. In our case it is a categorical
variable, so this will be a bar plot. In the figure below I have plotted the different
types of models from a small subset of our data, and the corresponding code is shown here:
import seaborn as sns
sns.set(style="darkgrid")
ax1 = sns.countplot(x="Model", data=train, order=train['Model'].value_counts().index)
ax1.set_title("Barplot for types of Intervention models")
for item in ax1.get_xticklabels():
    item.set_rotation(45)  # Rotate the labels so the category names do not overlap
As you can see in the plot, there are many Parallel and Single Group study designs
compared to the others. This is only a small subset of the data from Clinicaltrials.gov, so
we should still be able to train a computer to predict the other categories. However, we are
not particularly interested in Factorial or Sequential Assignment, so we will collect these
into one category we call 'Other'. This means that our response variable can now take 4
different values.
Some machine learning algorithms support categorical values, but there are many cases
where the algorithms do not. The data analyst is therefore faced with the challenge of
turning these text attributes into numerical values for further processing. There are many
different ways of encoding categorical variables; one approach is 'Label Encoding'. This
simply converts the different values in our response to a number, and it is easily done in
Python:
# We want 4 categories: Crossover, Parallel, Single Group and Other
train.loc[(train['Model'] == 'Factorial Assignment'), 'Model'] = 'Other'
train.loc[(train['Model'] == 'Sequential Assignment'), 'Model'] = 'Other'
test.loc[(test['Model'] == 'Factorial Assignment'), 'Model'] = 'Other'
test.loc[(test['Model'] == 'Sequential Assignment'), 'Model'] = 'Other'
# Convert from object to category
train['Model'] = train['Model'].astype('category')
test['Model'] = test['Model'].astype('category')
#Label encoding
train["Model_type"] = train["Model"].cat.codes
test["Model_type"] = test["Model"].cat.codes
train.head() # Prints the first 5 rows of the data
The index on the left side of the dataframe just corresponds to the rows from the .csv
file. By default, pandas will encode the categorical values in alphabetical order, so they
will be encoded as:
- 0: Crossover Assignment
- 1: Other
- 2: Parallel Assignment
- 3: Single Group Assignment
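You can verify this mapping directly from the category index; a small check on the train dataframe from above:
print(dict(enumerate(train['Model'].cat.categories)))
# {0: 'Crossover Assignment', 1: 'Other', 2: 'Parallel Assignment', 3: 'Single Group Assignment'}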
As you can see, Label Encoding is straightforward and easily implemented, but it has the
disadvantage that the numeric values can be "misinterpreted" by the algorithms. The value 3
is obviously larger than the value 1, but the category *'Single Group Assignment'* is not
larger than *'Crossover Assignment'*. This is the case whenever your categorical variable
is nominal, meaning there is no natural ordering between the categories.
Another common approach is called one hot encoding *(Francois Chollet, 2018. Deep Learning
with Python. Manning Publications Co., Shelter Island, NY, 361 pp.)*. The basic idea is to
convert each category into a new column as a dummy variable (0/1). This will not weight the
values as in label encoding, but it does add more columns to the data set. Pandas has a
feature which supports this:
# One Hot Encoding
train_dummy = pd.get_dummies(train, columns=['Model'], prefix =['Model'])
test_dummy = pd.get_dummies(test, columns=['Model'], prefix =['Model'])
print("Train shape:",train_dummy.shape)
print("Test shape:",test_dummy.shape)
I suggest, as I do in my code, printing the shape of the data sets or the first 5 rows to
verify the result. Before we had only 4 columns and now we have 8, where the last 4
correspond to the dummy variables for each of the categories. In our case we only have
these 4 mentioned categories, but this can be challenging to manage if you have a large
number of categories.
Text Preprocessing
When dealing with textual data, it needs to be cleaned and encoded to numerical values
before feeding them into machine learning models, this process of cleaning and encoding is
called Text Preprocessing.
I will perform basic cleaning steps on the two features 'Title' and 'Summary', so it will
be ready to be fed into a classifier. *(Siddiqi, S. 2018, Text Preprocessing for Beginners
- Data Cleaning.
[https://www.kaggle.com/sabasiddiqi/workbook-1-text-pre-processing-for-beginners])*
The following steps will be performed:
- Removal of punctuation
- Removal of newline symbols (\n)
- Removal of digits
- Splitting combined words
- Converting words to lowercase
- Splitting each sentence using delimiter
- Converting words to base form
Here is my code for the above pre-processing steps:
# This needs to be downloaded for the lemmatization (converting to base form)
nltk.download("wordnet")

def text_cleaner(dataframe_org):
    dataframe = dataframe_org.copy()
    columns = ['Summary', 'Title']
    for col in columns:
        dataframe[col] = dataframe[col].str.translate(str.maketrans(' ', ' ', string.punctuation))  # Remove punctuation
        dataframe[col] = dataframe[col].str.translate(str.maketrans(' ', ' ', '\n'))  # Remove newlines
        dataframe[col] = dataframe[col].str.translate(str.maketrans(' ', ' ', digits))  # Remove digits
        dataframe[col] = dataframe[col].apply(lambda tweet: re.sub(r'([a-z])([A-Z])', r'\1 \2', tweet))  # Split combined words
        dataframe[col] = dataframe[col].str.lower()  # Convert to lowercase
        dataframe[col] = dataframe[col].str.split()  # Split each sentence using delimiter
    # This part converts every word to its base form
    lemmatizer = WordNetLemmatizer()
    sum_l = []
    tit_l = []
    for y in tqdm(dataframe[columns[0]]):  # tqdm is just a progress bar, and this loop only looks at summaries
        sum_new = []
        for x in y:  # Looks at words in every summary text
            z = lemmatizer.lemmatize(x)
            z = lemmatizer.lemmatize(z, 'v')  # The 'v' means that if it is in doubt whether a word is a noun or a verb, it treats it as a verb
            sum_new.append(z)
        sum_l.append(sum_new)
    for w in tqdm(dataframe[columns[1]]):  # Looks at titles
        tit_new = []
        for x in w:  # Every word in the titles
            z = lemmatizer.lemmatize(x)
            z = lemmatizer.lemmatize(z, 'v')
            tit_new.append(z)
        tit_l.append(tit_new)
    # This joins the words back into strings as in the original data, just pre-processed
    sum_l2 = [' '.join(words) for words in sum_l]
    tit_l2 = [' '.join(words) for words in tit_l]
    # The data obtained after lemmatization is in list form and is converted to a DataFrame in the next step
    sum_data = pd.DataFrame(np.array(sum_l2), index=dataframe.index, columns=[columns[0]])
    tit_data = pd.DataFrame(np.array(tit_l2), index=dataframe.index, columns=[columns[1]])
    merged = pd.concat([sum_data, tit_data], axis=1)
    return merged

def create_tok(train_data, MAX_FEATURES):
    clean_data = text_cleaner(train_data)
    tokenizer_sum = text.Tokenizer(num_words=MAX_FEATURES)  # Keep the 20,000 most frequent words
    tokenizer_tit = text.Tokenizer(num_words=MAX_FEATURES)
    # Summary text
    tokenizer_sum.fit_on_texts(list(clean_data['Summary']))  # Builds the word index
    # Title text
    tokenizer_tit.fit_on_texts(list(clean_data['Title']))
    return tokenizer_sum, tokenizer_tit

def pre_process(dataframe, tokenizer, col, MAXLEN):
    clean_data = text_cleaner(dataframe)
    tokenized_list = tokenizer.texts_to_sequences(clean_data[col])
    X = sequence.pad_sequences(tokenized_list, maxlen=MAXLEN)
    return X
I will now try to explain some of the above code, beyond the comments I have made along
the way.
We need to keep in mind that deep learning models do not truly understand text in a human
sense, but they can be great for solving simple textual tasks. Like all neural networks,
deep learning models cannot be fed raw text as input; text data must be encoded as
numbers. Keras provides the Tokenizer class for preparing text for deep learning *(Francois
Chollet, 2018. Deep Learning with Python. Manning Publications Co., Shelter Island, NY, 361
pp.)*. This Tokenizer is fitted on the text data, as you can see in my code.
The text.Tokenizer() chops the sequences of text into sequences of **'tokens'**, where each
token is an integer index into a word dictionary. The **'tokenized_list'** will be a list of
lists containing only integers representing the words, and **'X'** will likewise be a list
of lists.
The .pad_sequences() function is used to ensure that all sequences in a list have the same
length, MAXLEN. This is done by padding 0's at the beginning of each sequence that is
shorter than MAXLEN (longer sequences are truncated). So if MAXLEN is 300 and a sequence
has length 200, the first 100 elements will be filled with 0's.
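To make this concrete, here is a tiny toy example, unrelated to the trial data (the exact integer indices depend on word frequencies):
toy_tok = text.Tokenizer(num_words=10)
toy_tok.fit_on_texts(["study drug safety", "study placebo"])
seqs = toy_tok.texts_to_sequences(["study drug", "placebo"])
print(seqs)  # e.g. [[1, 2], [4]] - each integer indexes a word in the learned dictionary
print(sequence.pad_sequences(seqs, maxlen=5))
# [[0 0 0 1 2]
#  [0 0 0 0 4]] - zeros are padded at the front so every row has length 5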
We can now run these functions on our train and test data sets:
MAX_FEATURES = 20000  # Size of vocabulary
MAXLEN = 300  # Size of each text sequence; you can tune this depending on the mean length of your text sequences
tok_sum, tok_tit = create_tok(train_dummy, MAX_FEATURES)
# The following are used for model.fit
X_sum = pre_process(train_dummy, tok_sum, 'Summary', MAXLEN)
X_tit = pre_process(train_dummy, tok_tit, 'Title', MAXLEN)
# This is used for prediction
X_sum_test = pre_process(test_dummy, tok_sum, 'Summary', MAXLEN)
X_tit_test = pre_process(test_dummy, tok_tit, 'Title', MAXLEN)
list_classes = ["Model_Crossover Assignment", "Model_Other", "Model_Parallel Assignment", "Model_Single Group Assignment"] # The 4 categories
y = train_dummy[list_classes].values
# y_test is used for model.evaluate later on
y_test = test_dummy[list_classes].values
We now have everything we need to train a deep recurrent neural network. But before doing
that, I will take some time to briefly explain LSTM networks.
Long short-term memory
The Long Short-Term Memory network, or LSTM network, is a type of recurrent neural network
used in deep learning. It was developed to address the shortcomings of standard RNNs, and it
can solve the long-term dependency problem because it uses gates to control the memorizing
process. LSTMs are typically used for language modeling, sentiment analysis and text
prediction. They have the ability to forget, remember and update information, and this
pushes them one step ahead of plain RNNs. If you want to perform Supervised Learning with
sequences as input, you want to use a gated recurrent net such as LSTM or GRU. If your
input is, for example, images or data with another grid-like topological structure, the
best approach would be a convolutional network.
*(Goodfellow, I. and Bengio, Y. and Courville, A. 2016. Deep Learning. MIT Press. 781 pp.
[https://www.deeplearningbook.org/])*
Below you see a figure that visualizes how an LSTM network is structured.
I will now briefly explain the figure.
There are 3 main components of LSTM units (see the NumPy sketch after this list):
- 1. It can forget unnecessary information. A sigmoid layer, which outputs a number
     between 0 and 1, is used to forget or remember information. It looks at the current
     input (x(t)) and the previous output (h(t-1)), and decides which part of the previous
     cell state should be removed (removed when the sigmoid returns a 0). This we call the
     forget gate, f(t), and its contribution is f(t) * c(t-1), where c(t-1) is the memory
     from the last LSTM unit.
- 2. Then it needs to decide which information to store from the new input x(t). A
     sigmoid layer decides whether the information should be updated or ignored, and a tanh
     layer creates a vector of candidate values for the new input. Multiplying these 2
     gives the update to the cell state. The new memory from these 2 layers is added to the
     old memory (c(t-1)) to give us c(t).
- 3. The last step is to decide what the output should be. A sigmoid layer decides
     which parts of the cell state (c(t)) we will output. Then the cell state is put
     through a tanh, which squashes the values, and is multiplied with the output from the
     sigmoid gate. So the output is only the parts we decide to output.
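As promised, here is a minimal NumPy sketch of a single LSTM step. The parameter names (W, U, b) and their dictionary layout are my own illustration; Keras manages all of these weights internally in its LSTM layer:
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b each hold one weight set per gate: forget (f), input (i), output (o) and candidate (g)
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])  # Step 1: forget gate
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])  # Step 2: input gate
    g = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])  # Step 2: candidate values
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])  # Step 3: output gate
    c_t = f * c_prev + i * g  # New memory: keep part of the old memory and add part of the new
    h_t = o * np.tanh(c_t)    # New output: expose only the chosen parts of the memory
    return h_t, c_t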
So, in a few words, an LSTM model will not only learn from the immediate dependency; it
will also learn from long-term dependencies. *(Sinha, M. 2018. Understanding LSTM and its
quick implementation in Keras for sentiment analysis.
[https://towardsdatascience.com/understanding-lstm-and-its-quick-implementation-in-keras-for-sentiment-analysis-af410fd85b47])*
Building a model
Now that we have an idea of what an LSTM network is, we will try to build one.
My code for this is shown below, and I will explain it afterwards.
def get_con_model():
    embed_size = 50  # How big each word vector should be
    inp_sum = Input(shape=(MAXLEN, ))
    inp_title = Input(shape=(MAXLEN, ))
    total_inp = concatenate([inp_sum, inp_title])  # Merge the 2 inputs
    embed_layer = Embedding(MAX_FEATURES, embed_size)(total_inp)
    lstm_layer = LSTM(50)(embed_layer)
    layer1 = Dropout(0.1)(lstm_layer)  # Regularization method, has the effect of reducing overfitting
    layer2 = Dense(50, activation="relu")(layer1)  # The relu function can return very large values
    layer3 = Dropout(0.1)(layer2)  # Again regularization
    layer4 = BatchNormalization()(layer3)  # Maintains the mean activation close to 0 and the activation standard deviation close to 1
    layer5 = Dense(4, activation="softmax")(layer4)  # Only outputs values between 0 and 1; this is the final layer
    model_con = Model(inputs=[inp_sum, inp_title], outputs=layer5)
    model_con.compile(loss='categorical_crossentropy',  # Loss function used for multi-class classification
                      optimizer='rmsprop',  # Algorithm that updates the network weights iteratively based on the training data
                      metrics=['accuracy'])  # This is our statistical measure
    return model_con
con_model = get_con_model()
# Gets information about the layers in the model, including output, input and number of parameters:
con_model.summary()
I suggest always looking at the summary and visualizing your model. Here are a few
reasons for that:
**To confirm the layer order**. It is easy to make mistakes and add layers in the
wrong order, and the plot of the model can help you confirm that it is done right.
**To confirm parameters**. The number of parameters is given in the model summary, and
it can help you spot layers where the number of parameters could be reduced.
**To confirm the output shape of the layers**. In complex networks it can be
difficult to specify the shape of the input data. The summary and the plot can help you
confirm that the shapes are as you intended.
We will now visualize our network. To do so you need to make sure you have
Graphviz installed separately on your system, not just in your project directory. It can be
downloaded from https://www.graphviz.org/download/.
Here is the code for plotting our model:
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/release/bin/'
plot_model(con_model, to_file='model_plot.png', show_shapes=True, show_layer_names=True)
If we walk through the steps, the first thing we see is the 2 input layers; for simplicity
we do not use different MAXLEN (input) values for the summary and title text. Next we
concatenate the 2 input layers before embedding, and this is primarily done to reduce the
number of parameters in the model, because the number of parameters from an embedding layer
is **MAX_FEATURES** times **embed_size**, which in this case is 1 million parameters. Then
comes the Embedding layer, which takes the concatenated input layer as input. The point of
word embedding is to map human language into a geometric space. In this space we would like
synonyms to be embedded into similar word vectors, such that the geometric distance between
2 word vectors relates to the semantic distance. Word representations with word embeddings
become relatively low-dimensional and dense, because they are learned from data. This is a
clear benefit compared to one-hot word vectors, which are high-dimensional and hardcoded.
*(Francois Chollet, 2018. Deep Learning with Python. Manning Publications Co., Shelter
Island, NY, 361 pp.)*
To make word embedding a little more concrete, I will now try to visualize it:
In the figure above, we have 4 words embedded in 2D: *Man*, *King*, *Woman* and *Queen*.
With vector representations, the semantic relationships between the words can be encoded as
geometric transformations. In this case the same vector allows us to go from *Man* to
*King* and from *Woman* to *Queen*. We could interpret this vector as *"From gender to
royal status"*. In the same way we could also consider the vector which allows us to go
from *King* to *Queen* and from *Man* to *Woman*.
The type of word embedding depends on your problem, and you have to consider the context in
which it is used. There exist pretrained embedding files on the internet which may be
useful for your task (e.g. GloVe). We chose to use the embedding layer from keras.layers
and learn the embedding from our own data.
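If you wanted to try pretrained vectors instead, a sketch could look like the following. It assumes you have downloaded the GloVe file glove.6B.50d.txt, which we did not use in this article:
embedding_index = {}
with open('glove.6B.50d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        embedding_index[values[0]] = np.asarray(values[1:], dtype='float32')  # word -> 50-dim vector
# Build a matrix where row i holds the GloVe vector for the word with index i in our tokenizer
embedding_matrix = np.zeros((MAX_FEATURES, 50))
for word, i in tok_sum.word_index.items():
    if i < MAX_FEATURES and word in embedding_index:
        embedding_matrix[i] = embedding_index[word]
# The Embedding layer could then be initialized with these weights instead of random ones:
# Embedding(MAX_FEATURES, 50, weights=[embedding_matrix], trainable=False)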
The next layer in our network is an LSTM layer, where the output from the **embed_layer**
is given as input. Here the number of LSTM units is set to 50, which is the dimensionality
of the output space of this layer. After the LSTM layer we have a Dropout layer. Dropout is
a common regularization technique *(Francois Chollet, 2018. Deep Learning with Python.
Manning Publications Co., Shelter Island, NY, 361 pp.)* which randomly selects neurons to
be ignored during training, so that weight updates are not applied to those neurons. The
dropout rate is in this case set to 0.1, corresponding to 10%. You can tune this
hyperparameter while experimenting with building your own model.
After the first dropout layer we have a Dense layer, which is just a regular layer of
neurons in a neural network. Each neuron receives input from all the neurons in the
previous layer, hence densely connected. In this layer we use the activation function
**'relu'**, whose range is [0, inf), so this layer can produce very large outputs, which
the network can find challenging to handle. We again use a dropout layer, with the same 10%
dropout rate. The next layer is **BatchNormalization**, which is a way of maintaining the
mean activation close to 0 and the standard deviation close to 1; this handles the possibly
large outputs from the dense layer with our relu function *(Keras Documentation,
[https://keras.io/layers/normalization/])*. Our final layer uses **softmax** as activation
function, which outputs probabilities: each value lies between 0 and 1, and the sum of all
the probabilities equals one. The reason we set the number of units in the final Dense
layer to 4 is that we have 4 categories, and the model returns a probability for each of
them, where the predicted class is the one with the highest probability.
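A quick sketch of what softmax does (the scores here are made up):
scores = np.array([2.0, 0.5, 3.1, 1.0])  # hypothetical raw outputs for the 4 categories
probs = np.exp(scores) / np.sum(np.exp(scores))
print(probs, probs.sum())  # the highest score gets the highest probability, and the probabilities sum to 1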
Before we train our model, there are a few parameters we should specify. We need to give
model.fit a **batch_size**, which is the number of samples that will be propagated through
the network at a time. We have set this to 32, which means that the algorithm will take the
first 32 samples and train the network, then take the next 32 samples and train the network
again, and continue this process until all samples in our data have been propagated through
the network.
A benefit of using a batch size smaller than the number of samples is that it requires less
memory, which can be very profitable when you have a large data set. Our network will also
train faster, because we update the network parameters after each batch; if we used all the
samples at once, we would only update once per epoch *(2015. What is batch size in neural
network?,
[https://stats.stackexchange.com/questions/153531/what-is-batch-size-in-neural-network])*.
We also need to specify something called **epochs**. An epoch is simply one pass over the
entire data set. This number can vary a lot, but for a start we will set it to 10, and
later on we will set it lower, but more about that in a minute. In our case we have
3625 samples, and since we chose our batch size to be 32, it will take 3625/32 ≈ 114
iterations to complete one epoch.
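The arithmetic, if you want to verify it:
import math
print(math.ceil(3625 / 32))  # 114 - the last batch simply contains fewer than 32 samples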
We will now train our model, and I will explain the rest of the code below after the
training.
batch_size = 32  # Number of samples that will be propagated through the network
epochs = 10  # Number of passes over the entire data set
file_path = "weights_base.hdf5"
checkpoint = ModelCheckpoint(file_path, monitor='val_loss', verbose=1, save_best_only=True, mode='min')  # verbose=1 prints a message whenever an improved model is saved
early = EarlyStopping(monitor="val_loss", mode="min", patience=3)  # EarlyStopping should only be included when tuning your model
callbacks_list = [checkpoint, early]
history = con_model.fit([X_sum, X_tit], y, batch_size=batch_size, epochs=epochs, validation_split=0.1, callbacks=callbacks_list, verbose=2)  # Model fit
You can safely ignore any warning here; it is a preemptive warning from TensorFlow when it
cannot be certain of the size of the generated tensor.
The model has now been trained on our training data, and I will now explain what is
actually written in my code and why I have done it the way I did.
A common problem in machine learning is overfitting. If we train our model, we will see
that the training loss decreases with every epoch and the training accuracy increases. But
a model that performs better on training data will not necessarily be a model that does
better on completely new data. As a way of preventing the model from overfitting, we have
made something called a validation split.
*(Brownlee, J. 2016. Overfitting and Underfitting With Machine Learning Algorithms.
[https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/])*
In this split we hold out 10% of our training data as validation data. As you can see in my
code, I have made the validation split in model.fit and specified an early stopping. The
early stopping is set to avoid continuing to train the model when we no longer see an
improvement in validation loss. Sometimes local minima can occur, and this is why you give
early stopping some patience; in our code we set it to 3, which means that if we do not see
an improvement for 3 epochs in a row, the model stops training.
We now visualize the model accuracy and loss from our training:
# Visualize the training history of the model
%matplotlib inline
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train','val'], loc='upper left')
plt.show()
%matplotlib inline
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train','val'], loc='upper left')
plt.show()
Based on the above plots, we set the number of epochs to 4, remove the early stopping and
validation split, and now use all of the training data for training.
batch_size = 32
epochs = 4
file_path="weights_base.hdf5"
checkpoint = ModelCheckpoint(file_path, monitor='loss', save_best_only=True, mode='min')
history = con_model.fit([X_sum, X_tit], y, batch_size=batch_size, epochs=epochs, callbacks=[checkpoint], verbose=2)
Epoch 1/4
- 48s - loss: 0.9551 - acc: 0.6696
Epoch 2/4
- 44s - loss: 0.6527 - acc: 0.7686
Epoch 3/4
- 44s - loss: 0.5618 - acc: 0.7972
Epoch 4/4
- 51s - loss: 0.4942 - acc: 0.8222
We see that this gives us a training accuracy of around 82%, but let's evaluate the model
and see how well it handles completely new data (the test data):
con_model.load_weights(file_path)
con_model.evaluate([X_sum_test, X_tit_test], y_test, verbose=2) # Returns loss value and the metric specified, so in this case, model accuracy
[0.7706552325145077, 0.7159199235096105]
We now get an accuracy of almost 72%, and this is not a bad result at all. We have only
trained the model on a small subset of the total data set (only 4000 studies), and it seems
that the model is doing quite okay and generalizes to completely unseen data.
Prediction
We have now trained and evaluated our LSTM network. But to get a feeling for how it
actually works, and to demonstrate how it can be used for new clinical trials, I will now
give some examples with new data and make predictions from it.
In my code below I have made a function which takes a Title and a Summary and puts them
into a pandas dataframe. This data goes through the same pre-processing as our train and
test data sets, so it is ready to be fed into our model.
def my_pred(Title, Summary):
    original_data = pd.DataFrame({'Summary': [Summary],
                                  'Title': [Title]})
    # Clean and tokenize the data exactly like the training data
    X_pred_sum = pre_process(original_data, tok_sum, 'Summary', MAXLEN)
    X_pred_tit = pre_process(original_data, tok_tit, 'Title', MAXLEN)
    con_model.load_weights(file_path)
    prediction = con_model.predict([X_pred_sum, X_pred_tit])
    return prediction
We have taken a study which is not part of our train data, and we will now try to predict
the type of intervention model from the summary text and title. First, as you see below, we
specify the full summary text and the full title:
Study_sum = "This clinical trial will be performed in previously untreated patients with metastatic
colorectal cancer. The study will evaluate the safety, tolerability and efficacy of the study drug,
CT-011, in combination with FOLFOX chemotherapy (FOLFOX4 or mFOLFOX6) compared with treatment by
FOLFOX alone."
Study_tit = "Study to Evaluate the Safety, Tolerability and Efficacy of FOLFOX + CT-011 Versus
FOLFOX Alone"
my_pred(Study_tit, Study_sum)
array([[0.01352272, 0.01536347, 0.8943102 , 0.07680359]], dtype=float32)
We get a prediction of 89% that this study is a Parallel Assignment, and this is also the
case. But what if we just use the title as predictor and give the model an empty summary?
Let's look at that!
empty_sum = ""
my_pred(Study_tit, empty_sum)
array([[0.01352273, 0.01536347, 0.8943102 , 0.07680362]], dtype=float32)
In this case we get virtually the same prediction as with the full summary text. The title
is almost always a short and compact version of the summary, so it contains many of the
same words, which our model most likely finds informative. It therefore makes sense that
the prediction is nearly identical. Let's now try to give the model just a few key words
from the title, still with an empty summary, and see what happens.
key_tit="Study Evaluate Safety, Tolerability Versus Alone"
my_pred(key_tit, empty_sum)
array([[0.01274031, 0.01423935, 0.8709181 , 0.10210223]], dtype=float32)
We see that our model now predicts 87%, so just a little lower than before. So the model's
prediction depends on a few key words, and it stays accurate even when the title or summary
does not explicitly state whether the study is, for example, Parallel or Crossover.
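If you want a class name rather than raw probabilities, a small helper using the list_classes order from earlier could look like this:
pred = my_pred(Study_tit, Study_sum)
print(list_classes[np.argmax(pred)])  # -> 'Model_Parallel Assignment' for this study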
Summary
In this article you have learned how to build an LSTM recurrent neural network for
categorical prediction in Python using the TensorFlow (Keras) deep learning library.
The most important points:
- Preparing textual data to feed into a neural network
- How LSTM can be useful in the context of clinical trials
- How to create an LSTM network for categorical prediction
- How to evaluate a model
- How to make predictions from a model
Finally, let's try it out!
Below you will see two textboxes which I have already filled out with information from a
random study from clinicaltrials.gov, in this case a Parallel Assignment.
To get the predictions, just click on the "Get Prediction" button, and 4 probabilities will
appear in the table. You can erase my text and try with your own study.
Thanks for reading! Enjoy!
Interested in learning how to go from building a model to implementing it on a website?
Read this article.