Northernlands 2 - Open Data Can Save Lives; If You Have the Tools to Mine It
Description
Core Life Analytics
Transcript
This transcript comes from the captions associated with the video above. It is "as spoken".
Good afternoon, my name is David Egan and I'm the CEO of Core Life
Analytics. Core Life Analytics is a company that helps biologists
to analyse their own complex data. Now I'd like to thank the
organizers for inviting me to speak, ODI Leeds and the Embassy
of the Netherlands, but the first thing I have to start with
actually is an abject apology, because I realized that in the
3 1/2 years or so that Core Life Analytics has been active,
I've never actually had a business trip to the North of England.
They've all been either to the obvious Oxford, Cambridge, London
triangle in the South or to Scotland in the North. So yeah, I
realize there is some very good work going on in the North, and
I promise this is something that will be addressed when we're
allowed to travel again.
My interest in open data stems from the fact that during my
postdoc in California, back in the late 90s, I got interested
in a relatively data-intensive area of biology, and this was
the use of automation for biology. What we were doing in the
lab in California was using robotic systems for setting up
experiments in plates like this, and so this is a 384-well
plate. And what you can see is that you can set up an
individual experiment in each of the wells of these plates.
This is technology that is heavily used in industry and also
in academia, and the use of this technology turned into a
career in what's known as high-throughput screening, where I
actually ended up delivering these automation services to
groups in industry and in academic centers.
My last position before founding Core Life Analytics was as
manager of a screening facility called the Cell Screening Core
at the University Medical Center in Utrecht. There we had a
robotic system for carrying out high-throughput screening
assays in microplates, but what we also had was an automated
microscope. Essentially this is just a microscope in a box
that allows you to take images of cells in those 96- and 384-well
plates that I described. Now automated microscopy gives you
access to a technology called high content screening.
And this just describes how it works. Basically, you can take
images of fluorescently labelled cells using your automated
microscope, and then with automated image analysis software,
which generally comes packaged with the instrument, or with
open source software, you can extract numeric descriptors of
the cells that you've been taking images of. So for example,
if we look at this nucleus, we could extract numbers that
describe the area, the shape, the perimeter, the intensity of
the nucleus; similarly with the cell, we can identify shape,
we can identify intensity of various labels, fibers, spots,
etc. And of course all of these numbers create a profile
that's essentially describing the actual morphology: what
this cell looks like.
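As a rough illustration of what that feature extraction step can look like in practice, here is a minimal sketch in Python using scikit-image; the file name, the simple thresholding step and the specific measurements are just placeholders for the kind of per-nucleus descriptors described above, not the exact software used in the talk.

    # Minimal sketch of per-object feature extraction, assuming a single-channel
    # image of fluorescently labelled nuclei ("nuclei.tif" is a placeholder name).
    import numpy as np
    from skimage import io, filters, measure

    img = io.imread("nuclei.tif")                     # grayscale nuclear stain
    mask = img > filters.threshold_otsu(img)          # very simple segmentation
    labels = measure.label(mask)                      # one label per nucleus

    # Numeric descriptors per nucleus: area, perimeter, shape, intensity
    props = measure.regionprops_table(
        labels, intensity_image=img,
        properties=("area", "perimeter", "eccentricity", "mean_intensity"),
    )
    profile = {k: np.mean(v) for k, v in props.items()}   # per-image profile
    print(profile)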
The power of this technology is that that profile is actually
related to the mechanism of action of some chemical or
genetic reagent that you might be testing, and so how it's used
is that large libraries, or smaller numbers, of chemicals or
genetic reagents can be tested in these plates. And then you can
generate profiles that are related to the activity of the chemicals.
Now one problem we had at the Cell Screening Core
is that while we had all the technology required to
generate the images and extract the numeric features
from the images, we didn't have any tools to help our
clients analyse the data, and so essentially they
couldn't make full use of this technology and do what we call
multiparametric analysis.
In order to solve this problem, I hired a graduate student,
Wienand Omta, who subsequently became my co-founder at Core Life
Analytics, and what we did was build a web-based tool called
HC StratoMineR. The idea of this tool was that it would allow
biologists to upload their data from these high content
screening experiments, and then it would walk them through a
data analytics workflow. So it would allow them to isolate the
metadata, carry out variable selection (essentially throw out
the garbage), then do QC on the data, things that are specific
to these types of experiments in micro-well plates: plate
normalization, transforming the data, scaling the data, and then
on to what's known as data reduction or dimensionality reduction.
This would reduce the very large numbers of features
to a smaller number of relevant scores. We could then use these
to identify the outliers based on these profiles of numbers, and
so essentially these were the ones that looked different from
something else or looked similar to something else. Then we can
cluster these to separate out the different types of profiles,
and then if we found something interesting, we could build
machine learning models and then apply these to the whole data set.
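To make that workflow a little more concrete, here is a minimal sketch of the same general steps (variable selection, per-plate normalization, scaling, dimensionality reduction) in Python with pandas and scikit-learn. This is only an illustration under assumed column names, not the actual StratoMineR implementation.

    # Sketch of a select -> plate-normalize -> scale -> reduce pipeline for
    # well-level high content data. Column names ("plate", "well", "compound")
    # and the input file are assumptions.
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    df = pd.read_csv("well_features.csv")             # hypothetical feature table
    meta_cols = ["plate", "well", "compound"]
    features = [c for c in df.columns if c not in meta_cols]

    # Variable selection: drop near-constant ("garbage") features
    features = [c for c in features if df[c].std() > 1e-6]

    # Plate normalization: centre each feature on its per-plate median
    norm = df.groupby("plate")[features].transform(lambda x: x - x.median())

    # Scale, then reduce hundreds of features to a handful of scores
    scores = PCA(n_components=9).fit_transform(StandardScaler().fit_transform(norm))
    print(scores.shape)                               # (n_wells, 9)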
So essentially this meant that what we were doing was allowing
biologists to analyse their own data by turning data science
into data technology. Initially we were able to test and
validate HC StratoMineR using data generated at the Cell
Screening Core, but we quickly found that we needed larger and
more complex datasets in order to really push the envelope
and to show the value of the platform.
Fortunately, in the high content and image-based screening
fields there are a number of good open data repositories
available. The first one is the IDR, or Image Data Repository,
that was built by Jason Swedlow and his group at Dundee, and
the other is the Broad Bioimage Benchmark Collection that was
developed at the Broad Institute in Cambridge, MA. So let's
have a quick look at those.
The Broad Bioimage Benchmark Collection is quite straightforward.
Basically, the idea is that it is a collection of high quality
image sets that can be used for benchmarking, and most of them
have some sort of ground truth built in. So here you can see
the collection of data sets, and what you can do is, if you
open a particular data set, you can get an introduction to the
experiment that was done, a reference, and then the images
and whatever metadata is available.
But it is simply a repository; there isn't much you can do to
view or browse the data. You simply get some information
about it, and then you can download the images and the
metadata that's available.
The IDR, on the other hand, is far more elaborate. Not only is
it a repository for many screens, it gives a lot more
functionality as regards browsing the data.
So here, for example, you can search for a particular data set.
Then it opens up here, and you can see on the left all of the
individual plates, and when you click on one of these, the
actual plate pops up and you can see a kind of plate-based map
of all the images. If you want to look at one of the images
more closely, then you click on it and up pops this elaborate
browser. This is really cool. You can turn on and off the
various channels here.
You can adjust the scaling. You can actually copy and paste
the scaling so that you can compare multiple images using the
same scaling. It's all quite elaborate, and when you go back
to the actual screen you can get information about the metadata
that's related to the well, the compound structure, for
example, the compound name.
Also, at a higher level of the screen, there are files of
metadata that you can download. It is also possible to
download images from the repository, but it's not as simple as
the Broad repository: you need to interact with an API, but it
is doable, and you can get some help online to figure out how
to do that.
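For anyone curious what interacting with that API might look like, here is a rough sketch in Python using the requests library against the IDR's JSON API. The endpoints, response fields and the image ID are assumptions based on the OMERO/IDR web API, so the IDR documentation is the place to check the details.

    # Rough sketch of pulling data from the IDR web API (endpoints assumed;
    # consult the IDR documentation for the authoritative paths).
    import requests

    BASE = "https://idr.openmicroscopy.org"

    # List available screens (paginated JSON)
    screens = requests.get(f"{BASE}/api/v0/m/screens/").json()
    for s in screens.get("data", [])[:5]:
        print(s["@id"], s["Name"])

    # Download a rendered JPEG of one image by its ID (placeholder ID)
    image_id = 123456
    jpeg = requests.get(f"{BASE}/webclient/render_image/{image_id}/")
    with open(f"idr_image_{image_id}.jpg", "wb") as f:
        f.write(jpeg.content)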
One very interesting screen that was present in both the Broad
collection and the IDR was based on a paper that was published
by Neil Carragher from the University of Edinburgh back in
2010. This is interesting to us because it was a small
molecule, high content screen where Neil and his group tested
102 drugs at eight different concentrations against a number of
different cell lines, and the images and data from one of these
cell lines, MCF-7, were uploaded into the repositories. So
these are the various compounds that were tested, and what was
of interest to us was that they were from a wide variety of
different mechanisms of action, and these mechanisms of action
were well annotated in the metadata, along with the
concentration and the various structures of the chemicals
themselves.
So what we did was we collaborated with Jonny Sexton
at the University of Michigan in the US, and Jonny's group
downloaded the images and ran them through CellProfiler.
CellProfiler is, as I mentioned earlier, open source image
analysis software, which is extremely powerful.
We were then able to take the numeric data that Jonny
extracted, and this was 479 individual numbers, and this was
done at cell level, meaning we had 479 features for every cell
in the data set. And so we were able to run this rich, large
data set through StratoMineR, and again, as I described
earlier on, we were able to upload the data, identify the
metadata and then carry on through the process of
normalization, transformation and scaling until dimensionality
reduction, and at dimensionality reduction we were able to
reduce these hundreds of features down to 9 robust individual
scores called principal components. And StratoMineR helps the
user to decide how many different scores they should calculate
in the principal component analysis.
Let's jump into the StratoMineR platform and I can show you a
little bit about how this works and what we did after this
dimensionality reduction step, because this starts to get very
interesting from this point.
So here we are in the StratoMineR web-based platform, and
here on the left what you can see is the workflow that I
described earlier, and now we're at this step here:
the dimensionality reduction. What I've done here already is
I've processed the data and we've taken 336 features, and
then we've reduced them down to 9 different, what we call
principal components, and the platform makes it very easy to
investigate these different components. You can see here what
sort of features are loading on the different components, and
then these individual scores. You can see how different types
of chemicals actually trigger different scores and are related
to different principal components, and so this is essentially
the biology coming out in the data reduction.
Now, as you can see here, this is kind of an indication of how
hundreds of features are reduced to 9 individual scores, and
then we can take these individual scores and carry on and use
them to actually do hit picking in the next step. So what we do
is we hit save and continue.
And then, just to demonstrate how we do this hit picking, we
can generate here a 3-dimensional plot using just three of
these... of these nine principal components.
So let's have a look at this now. Here we have all these
different classes of wells. These are all of the wells in the
data that we've processed here, and let's turn off these ones,
the samples, and let's just look at some pretty obvious
controls here. Now all of these are color coded, and so these
negative controls here, the ones where we know nothing should
happen, are labeled in red, and then we have various different
types of controls labeled in different colors. The green ones
are positive controls, where we know there's going to be a
phenotype. Now if you look at these positive controls, you can
see that they are far away from the negative controls in
three-dimensional space. Here we're just plotting three of the
components, and that's essentially how we define our phenotypic
hits. These are far away because the cells in these wells look
different, and what we can actually do is measure the distance
from the negative controls to each of these, and that's
indicative of how different they look. And that's what we do:
we actually calculate a geometric distance, what's called the
Euclidean distance, for each well, and then that can determine
whether it's actually a significant outlier. So let's do that.
We'll select all of our components for this, what we call the
hit picking step, and then we're just going to calculate the
distance for every well from the negative controls, and then
what we get is a list of hits.
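As a rough sketch of that distance-based hit picking, assuming a table of per-well principal component scores with a "well_type" column marking the negative controls (all names and the cut-off rule are illustrative assumptions), it might look like this:

    # Sketch of Euclidean-distance hit picking in principal component space.
    # Assumes a table with columns PC1..PC9 plus a "well_type" column.
    import numpy as np
    import pandas as pd

    scores = pd.read_csv("well_pc_scores.csv")        # hypothetical file
    pc_cols = [f"PC{i}" for i in range(1, 10)]

    neg = scores[scores["well_type"] == "negative_control"]
    center = neg[pc_cols].mean()                      # centroid of negative controls

    # Euclidean distance of every well from the negative-control centroid
    scores["distance"] = np.sqrt(((scores[pc_cols] - center) ** 2).sum(axis=1))

    # Call hits relative to the spread of the negative controls (example cut-off)
    neg_dist = scores.loc[neg.index, "distance"]
    cutoff = neg_dist.mean() + 3 * neg_dist.std()
    hits = scores[scores["distance"] > cutoff].sort_values("distance", ascending=False)
    print(hits[["well_type", "distance"]].head())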
Then all of our distances are visualised, and here we can see
our negative controls and then our positive controls.
Then these are all the samples that we're testing, and we can
see that there are a number of hits here, things that are
significantly different from the negative controls.
These are then dumped into a list, and we can sort this list,
filter it and so on. Now that we've isolated these hits, what
we can actually do is go back to the original 9 principal
component scores, those nine reduced scores that we generated,
and then we can do clustering based on those scores.
And here you can see the clustering.
And now this is the beauty of high content analysis, because
now what you can see is that the chemicals here in the hits
are actually clustering based on their mechanism of action. So
what we can see here is cytochalasin, latrunculin and
cytochalasin, latrunculin; these are all actin inhibitors.
Farther down here what we can see is taxol, taxol, epothilone B;
these are all microtubule compounds.
And then over here what we can see is compounds that are DNA
damaging agents, and so this is the real high content paradigm,
where we can take images, extract numbers and then isolate
chemicals that are giving us an effect. And then we can
separate them according to mechanism of action, all based on a
cell-based assay.
So this is just another look at that clustering, and you can
see here our DNA replication and damage agents that are
clustering, and then down here are actin inhibitors and then
our microtubule modifiers.
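A bare-bones version of that kind of hierarchical clustering of hit profiles, again using placeholder file and column names rather than StratoMineR's own implementation, could look like this:

    # Sketch of hierarchical clustering of hit wells on their principal component scores.
    import pandas as pd
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    hits = pd.read_csv("hit_pc_scores.csv")           # hypothetical hit list with PC1..PC9
    pc_cols = [f"PC{i}" for i in range(1, 10)]

    # Ward linkage on the 9 reduced scores; compounds with similar profiles
    # (similar mechanisms of action) end up in the same branches.
    Z = linkage(hits[pc_cols].values, method="ward")
    dendrogram(Z, labels=hits["compound"].tolist(), leaf_rotation=90)
    plt.tight_layout()
    plt.show()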
And another thing we can do with this distance score that we
used is that we can actually plot it against the concentration
of the chemical involved. What happens then is that as you get
an increase in concentration, if it's a bioactive, you get this
nice sigmoidal curve. This makes it very easy to identify
chemicals that are giving a biological effect. So if we look at
that a little bit more closely here with one chemical called
docetaxel, which is a microtubule inhibitor, what you can see
is that as that distance score increases, we see the phenotype
coming out: here with a low distance these look very similar to
the negatives, then an intermediate distance, and now here a
high distance, and you see this is a very strong phenotype. If
we look at latrunculin, which is a chemical or drug with a
different type of mechanism of action (it affects actin), what
you can see is again that at low distance it looks very similar
to the negatives, at medium distance you see the phenotype
starting to come out, and then with a high distance score you
can see there's a very strong phenotype. But it's very obvious
that this is a very different phenotype from docetaxel.
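One way to turn that distance-versus-concentration relationship into numbers is to fit a standard four-parameter logistic (Hill) curve; the sketch below does this with SciPy on made-up example values, as one common approach rather than the specific method used in the talk.

    # Sketch of fitting a sigmoidal dose-response curve to distance scores.
    import numpy as np
    from scipy.optimize import curve_fit

    def hill(conc, bottom, top, ec50, slope):
        """Four-parameter logistic (Hill) dose-response curve."""
        return bottom + (top - bottom) / (1.0 + (ec50 / conc) ** slope)

    # Hypothetical data: 8 concentrations (uM) and their well distance scores
    conc = np.array([0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0, 30.0])
    dist = np.array([0.5, 0.6, 0.9, 2.1, 4.8, 7.9, 9.2, 9.5])

    params, _ = curve_fit(hill, conc, dist, p0=[0.5, 10.0, 1.0, 1.0])
    print(f"Estimated EC50: {params[2]:.2f} uM")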
And so this is why they cluster separately on that hierarchical
clustering diagram. Then, because we have data at cell
resolution, we can actually build AI models, and this is
another way of doing the analysis within StratoMineR. We can
isolate interesting wells that we have found during our
clustering analysis, label those, and then use them to build AI
models. Here we've built a random forest machine learning model
based on these different reagents, and then what we can do is
ask the question: OK, what wells look very similar to either
docetaxel, latrunculin, doxorubicin, or AZ-I? And then what we
can see is that they actually pull out wells that have a
similar mechanism of action.
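As a sketch of what such a model might look like in code, here is a random forest trained on labelled cell-level features and then applied across the data set; the file and column names are hypothetical, and scikit-learn is used here as a stand-in for whatever StratoMineR does internally.

    # Sketch of a random forest classifier trained on labelled cell-level features,
    # then used to call unlabelled wells by predicted mechanism of action.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    cells = pd.read_csv("cell_features_labelled.csv")   # hypothetical training data
    feature_cols = [c for c in cells.columns if c.startswith("feat_")]

    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(cells[feature_cols], cells["label"])       # e.g. docetaxel / latrunculin / ...

    # Apply the model to every cell in the full data set, then summarise per well
    all_cells = pd.read_csv("cell_features_all.csv")     # hypothetical full data set
    all_cells["predicted"] = model.predict(all_cells[feature_cols])
    well_calls = all_cells.groupby("well")["predicted"].agg(lambda s: s.mode().iloc[0])
    print(well_calls.head())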
One thing this use of the StratoMineR platform with open data
highlights is that what we can do is actually iterative data
mining. So we can load our open data that we've extracted from
images from the IDR or from the Broad collection, and then,
using StratoMineR, we can extract new knowledge from that. This
new knowledge can be turned into metadata that we can merge
with our original data, and then do another round of analysis,
and so this is essentially iterative data mining. It's the same
way that people iterate and do multiple experiments based on
the answers they find in experiments; we can do the same thing
with the analysis of our data.
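In practice, that "turn new knowledge into metadata and merge it back" step is often just a table join; a minimal pandas sketch, with assumed file and column names, is below.

    # Sketch of iterative data mining: merge derived annotations (e.g. cluster or
    # predicted mechanism of action per well) back onto the original feature table.
    import pandas as pd

    original = pd.read_csv("well_features.csv")           # hypothetical original data
    derived = pd.read_csv("well_annotations.csv")         # e.g. columns: well, cluster, predicted_moa

    # Join on the well identifier; the enriched table feeds the next round of analysis
    enriched = original.merge(derived, on="well", how="left")
    enriched.to_csv("well_features_round2.csv", index=False)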
One thing that our work with these open data also shows is the
actual challenges of using open data.
In certain cases you end up with data that is in unusual or
weird formats; some of the files can be very large and hard to
handle. Also, some of the metadata can be split between
different files, and so this is a critical point where you have
to merge them, you have to join them, and sometimes this can be
a challenge. As you saw with our images, downloading the images
is possible, but then you do need advanced tools in order to
extract the features from the images. And of course you need
some tool like StratoMineR for a biologist to be able to get
some useful information out of the numeric data.
Another problem, of course, that has been addressed earlier is
that there is a lack of standardisation. Now, will it be
possible for everything to be done in a completely standardised
fashion? I don't think that's realistic. But one thing that
really helps with this is proper tools that are flexible enough
to allow people to pull in data from various different sources.
One thing you notice once you browse the various datasets in
the image repositories is that they're all quite different:
different types of biology, different cell systems, even
different species, and so they're hard to compare like with
like. Now, what if there was a standard assay that people could
test their chemicals against, and so compare different
experiments? Well, there is a platform that might make that
possible. It's called Cell Painting.
Now Cell Painting is a fairly straightforward protocol that
uses 5 different dyes to label different parts of the cell,
eight different cellular compartments, and what this generates
is a very rich, high content assay that can be used to profile
genetic reagents or various different chemicals. The idea is
that if you build up a kind of library of these profiles, maybe
that would be useful for comparing with and clustering with new
chemicals or hits from new screens, and possibly you could
compare between different experiments, and so this really would
make open data from high content experiments extremely useful.
There are a number of consortia who are actually working on
this idea, generating these libraries of profiles using Cell
Painting. There's also a lot of interest in Cell Painting for
things like predictive toxicology, where you could generate
profiles of unwanted phenotypes and then use those to predict
unwanted phenotypes in novel chemicals that are coming out of
screens.
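To give a feel for how such a profile library might be used, here is a small sketch that ranks reference compounds by how well their profiles correlate with a new compound's profile; the file names and the choice of similarity measure are illustrative assumptions, not a description of any specific consortium's pipeline.

    # Sketch of querying a library of morphological profiles: rank reference
    # compounds by Pearson correlation with a new compound's profile.
    import pandas as pd

    library = pd.read_csv("reference_profiles.csv", index_col="compound")   # hypothetical
    query = pd.read_csv("new_compound_profile.csv", index_col="compound").iloc[0]

    # Correlate the query profile against every reference profile
    similarity = library.apply(lambda row: row.corr(query), axis=1)
    print(similarity.sort_values(ascending=False).head(10))   # most similar mechanisms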
So I hope that gives you an idea of how we've been
using open data and the possibilities and challenges
of using open data, and I'd like to thank you for your
attention and I'm happy to answer any of your questions.
-
David Egan
Co-founder and CEO
Core Life Analytics
© David Egan 2020
David Egan is the co-founder and CEO of Core Life Analytics, a Netherlands-based technology company that gives biologists the ability to analyze their own data. While working at the Salk Institute in La Jolla, CA, David became interested in the use of automation for drug discovery. This led to a career in delivering high-throughput screening services in the pharmaceutical industry and academia. Core Life Analytics was founded by David with Wienand Omta in 2016, based on the StratoMineR platform developed by David & Wienand at the Cell Screening Core at the University Medical Center Utrecht.
The StratoMineR platform is being successfully used by customers globally, at companies such as Pfizer, AstraZeneca & Janssen, as well as academic centers of excellence such as the University of Michigan and Oxford University. In 2020 the founders secured a Series A investment of €1m to further expand the marketing, sales, & development of the platform.
Sponsors
Northernlands 2 is a collaboration between ODI Leeds and The Kingdom of the Netherlands, the start of activity to create, support, and amplify the cultural links between The Netherlands and the North of England. It is with their generous and vigorous support, and the support of other energetic organisations, that Northernlands can be delivered.