Northernlands 2 - Open Data Can Save Lives; If You Have the Tools to Mine It


Core Life Analytics


This transcript comes from the captions associated with the video above. It is "as spoken".

Good afternoon, my name is David Egan and I'm the CEO of Core Life Analytics. Core Life Analytics is a company that helps biologists to analyse their own complex data. Now I'd like to thank the organizers for inviting me to speak, ODI Leeds and the Embassy of the Netherlands, but the first thing I have to start with actually is an abject apology, because I realized that in the 3 1/2 years or so that Core Life Analytics has been active, I've never actually had a business trip to the North of England. They've all been either to the obvious Oxford, Cambridge, London triangle in the South or Scotland in the North. So yeah, I realized there is some very good work going on in the North, and I promise this is something that will be addressed when we're allowed to travel again.

My interest in open data stems from the fact that during my postdoc in California, back in the late 90s, I got interested in a relatively data intensive area of biology, and this was the use of automation for biology. What we were doing in the lab in California was using robotic systems for setting up experiments in plates like this, and so this is a 384 well plate. And what you can see is that you can set up an individual experiment in each of these wells.

And so this is technology that is heavily used in industry and also in academia, and use of this technology turned into a career in what's known as high throughput screening, where I actually ended up delivering these automation services to groups in industry and in academic centers.

My last position before founding Core Life Analytics was as manager of a screening facility called the Cell Screening Core at the University Medical Center in Utrecht. There we had a robotic system for carrying out high throughput screening assays in microplates, but what we also had was an automated microscope. Essentially this is just a microscope in a box that allows you to take images of cells in those 96 and 384 well plates that I described. Now the automated microscopy gives you access to a technology called high content screening.

And this just describes how it works. Basically you can take images of fluorescently labelled cells using your automated microscope, and then with automated image analysis software, which generally comes packaged with the instrument or as open source software, you can extract numeric descriptors of the cells that you've been taking images of. So for example, if we look at this nucleus, we could extract numbers that describe the area, the shape, the perimeter, the intensity of the nucleus. Similarly with the cell, we can identify shape, we can identify intensity of various labels, fibers, spots, etc. And of course all of these numbers create a profile that's essentially describing the actual morphology: what this cell looks like.
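
In practice this extraction is done with dedicated tools like CellProfiler, but purely as an illustration of what a numeric descriptor is, here is a minimal sketch (the mask and intensity arrays are made up) that computes a few per-object features from a segmented nucleus with NumPy:

```python
import numpy as np

def nucleus_features(mask, intensity):
    """Toy per-object descriptors from a binary mask and an intensity image."""
    area = int(mask.sum())  # number of pixels inside the object
    # crude perimeter: object pixels with at least one background 4-neighbour
    padded = np.pad(mask, 1)
    neighbours = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
                  padded[1:-1, :-2] + padded[1:-1, 2:])
    perimeter = int(((mask == 1) & (neighbours < 4)).sum())
    mean_intensity = float(intensity[mask == 1].mean())
    return {"area": area, "perimeter": perimeter, "mean_intensity": mean_intensity}

# a 3x3 square "nucleus" in a 5x5 field, uniform brightness
mask = np.zeros((5, 5), dtype=int)
mask[1:4, 1:4] = 1
intensity = np.full((5, 5), 10.0)
print(nucleus_features(mask, intensity))  # area 9, perimeter 8, mean 10.0
```

Each cell's handful of such numbers, stacked together, is the morphological profile the talk describes.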

The power of this technology is that that profile is actually related to the mechanism of action of some chemical or genetic reagent that you might be testing, and so how it's used is that large libraries, or smaller numbers of chemicals or genetic reagents, can be tested in these plates. And then you can generate profiles that are related to the activity of the chemicals.

Now one problem we had at the Cell Screening Core is that while we had all the technology required to generate the images and extract the numeric features from the images, we didn't have any tools to help our clients analyse the data, and so essentially they couldn't make full use of this technology and do what we call multiparametric analysis.

In order to solve this problem, I hired a graduate student, Wienand Omta, who subsequently became my cofounder at Core Life Analytics, and what we did was we built a web based tool called HC StratoMineR. The idea of this tool was that it would allow biologists to upload their data from these high content screening experiments, and then it would walk them through a data analytics workflow. So it would allow them to isolate the metadata, carry out variable selection (essentially throw out the garbage), then do QC on the data, things that are specific to these types of experiments in micro well plates: plate normalization, transforming the data, scaling the data, and then on to what's known as data reduction or dimensionality reduction. This would reduce the very large numbers of features to a smaller number of relevant scores. We could then use these to identify the outliers based on these profiles of numbers, so essentially the ones that looked different from something else or looked similar to something else. Then we can cluster these to separate out the different types of profiles, and then if we found something interesting, we could build machine learning models and then apply these to the whole data set.
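
As a minimal sketch of the kind of plate normalization step in that workflow (the numbers, and the use of a plain z-score against the plate's negative controls, are illustrative assumptions, not the platform's actual method):

```python
from statistics import mean, stdev

def normalize_plate(values, neg_control_wells):
    """Z-score every well against the plate's negative controls,
    so wells from plates measured on different days become comparable."""
    neg = [values[w] for w in neg_control_wells]
    mu, sigma = mean(neg), stdev(neg)
    return {well: (v - mu) / sigma for well, v in values.items()}

# one toy feature per well; A01/A02/B01 are negative controls
plate = {"A01": 10.0, "A02": 12.0, "B01": 11.0, "B02": 30.0}
scores = normalize_plate(plate, neg_control_wells=["A01", "A02", "B01"])
print(round(scores["B02"], 1))  # B02 stands far from the negative controls
```

The same idea is applied per feature; transformation and scaling then put all features on a comparable footing before dimensionality reduction.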

So essentially this meant that what we were doing was allowing biologists to analyse their own data by turning data science into data technology. Initially we were able to test and validate HC StratoMineR using data generated at the Cell Screening Core, but we quickly found that we needed larger and more complex datasets in order to really push the envelope and to show the value of the platform.

Fortunately, in the high content and image-based screening fields there are a number of good open data repositories available. The first one is the IDR, or Image Data Resource, that was built by Jason Swedlow and his group at Dundee, and the other is the Broad Bioimage Benchmark Collection that was developed at the Broad Institute in Cambridge, MA. So let's have a quick look at those.

The Broad Bioimage Benchmark Collection is quite straightforward. Basically, the idea is that it is a collection of high quality image sets that can be used for benchmarking, and most of them have some sort of ground truth built in. So here you can see the collection of data sets, and what you can do is, if you open a particular data set, you can get an introduction to the experiment that was done, a reference, and then the images and whatever metadata is available. But it's simply a repository: you can't view the data or browse the data. You simply get some information about it, and then you can download the images and download the metadata that's available.

The IDR, on the other hand, is far more elaborate. Not only is it a repository for many screens, it gives a lot more functionality as regards browsing the data. So here for example you can search for a particular data set, and then it opens up here, and then you can see on the left all of the individual plates. When you click on one of these, the actual plate pops up and you can see a kind of plate-based map of all the images. If you want to look at one of the images more closely, then you click on it and up pops this elaborate browser. This is really cool. You can turn on and off the various channels here. You can adjust the scaling; you can actually copy and paste the scaling so that you compare multiple images using the same scaling. It's all quite elaborate, and when you go back to the actual screen you can get information about the metadata that's related to the well: the compound structure, for example, or the compound name.

Also, at a higher level of the screen, there are files of metadata here that you can download. It is also possible to download images from the repository, but it's not as simple as with the Broad repository: you need to interact with an API. But it is doable, and you can get some help online to figure out how to do that.

One very interesting screen that was present in both the Broad collection and the IDR was based on a paper that was published by Neil Carragher from the University of Edinburgh back in 2010. This is interesting to us because it was a small molecule, high content screen where Neil and his group tested 102 drugs at eight different concentrations against a number of different cell lines, and the image data from one of these cell lines, MCF-7, was uploaded into the repositories. So these are the various compounds that were tested, and what was of interest to us was that they were from a wide variety of different mechanisms of action, and these mechanisms of action were well annotated in the metadata, along with the concentration and the various structures of the chemicals themselves.

So what we did was we collaborated with Jonny Sexton at the University of Michigan in the US, and Jonny's group downloaded the images and ran them through CellProfiler. CellProfiler is, as I described earlier, open source image analysis software which is extremely powerful. So then we were able to take the numeric data that Jonny extracted, and this was 479 individual numbers, and this was done at cell level, meaning we had 479 features for every cell in the data set. This rich, large data set we were able to run through StratoMineR and, again as I described earlier on, we were able to upload the data, identify the metadata and then carry on through the process of normalization, transformation and scaling until dimensionality reduction. At dimensionality reduction we were able to reduce these hundreds of features down to 9 robust individual scores called principal components, and StratoMineR helps the user to decide how many different scores they should calculate in the principal component analysis.
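
As a rough sketch of what that dimensionality reduction step does, here is a plain principal component analysis via the covariance eigendecomposition (the platform's actual implementation may differ; the data here is random noise, just to show the shapes involved):

```python
import numpy as np

def pca_scores(X, n_components):
    """Project feature vectors onto the top principal components.
    X: (n_samples, n_features) matrix of per-well (or per-cell) features."""
    Xc = X - X.mean(axis=0)                    # centre each feature
    cov = np.cov(Xc, rowvar=False)             # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :n_components]   # highest-variance directions first
    return Xc @ top                            # reduced scores

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))                  # 50 wells, 20 toy features
scores = pca_scores(X, n_components=9)
print(scores.shape)  # (50, 9)
```

Hundreds of correlated features collapse into a handful of uncorrelated scores per well, which is what makes the later distance and clustering steps tractable.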

Let's jump into the StratoMineR platform and I can show you a little bit about how this works, and then what we did after this dimensionality reduction step, because this starts to get very interesting from this point.

So here we are in the StratoMineR web based platform, and here on the left what you can see is the workflow that I described earlier, and now we're at this step here: the dimensionality reduction. What I've done here already is I've processed the data, and we've taken 336 features and reduced them down to 9 different, what we call principal components, and the platform makes it very easy to investigate these different components. You can see here what sort of features are loading on the different components, and then in these individual scores you can see how different types of chemicals actually trigger different scores and are related to different principal components, and so this is essentially the biology coming out in the data reduction.

Now, as you can see here, this is kind of an indication of how the larger number of hundreds of features is reduced to 9 individual scores, and then we can take these individual scores and carry on and use them to actually do hit picking in the next step. So what we do is we hit save and continue. And then, just to demonstrate how we do this hit picking, we can generate here a 3 dimensional plot using just three of these nine principal components.

So let's have a look at this now. Here we have all these different classes of wells. These are all of the wells in the data that we've processed here, and let's turn off these ones, the samples, and let's just look at some pretty obvious controls here. Now all of these are color coded, and so these negative controls here, the ones where we know nothing should happen, are labeled in red, and then we have various different types of controls labeled in different colors. The green ones are positive controls, where we know there's going to be a phenotype. Now if you look at these positive controls, you can see that they are far away from the negative controls in three dimensional space. Here we're just plotting three of the components, and that's actually how we define our phenotypic hits: these are far away because they look different from the cells in the negative control wells. What we can actually do is measure the distance from the negative controls to each of these, and that's indicative of how different they look. And that's what we do: we calculate a geometric distance, what's called the Euclidean distance, for each well, and that can determine whether it's actually a significant outlier. So let's do that. We'll select all of our components for this, what we call the hit picking step, and then we're just going to calculate the distance for every well from the negative controls, and then what we get is a list of hits.
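
A minimal sketch of that hit picking step, assuming (hypothetically) that each well is a vector of principal component scores and that a hit is anything beyond a fixed distance cutoff from the centroid of the negative controls; the real platform uses a significance test rather than a hand-picked cutoff:

```python
from math import dist  # Euclidean distance, Python 3.8+

def pick_hits(wells, neg_wells, cutoff):
    """wells: {well_id: profile vector}; neg_wells: ids of negative controls."""
    # centroid of the negative control profiles
    neg = [wells[w] for w in neg_wells]
    centroid = [sum(col) / len(neg) for col in zip(*neg)]
    distances = {w: dist(v, centroid) for w, v in wells.items()}
    hits = [w for w, d in distances.items() if d > cutoff]
    return distances, hits

wells = {
    "neg1": (0.1, 0.0, -0.1), "neg2": (-0.1, 0.0, 0.1),  # negative controls
    "cmpdA": (4.0, 3.0, 0.0),                            # strong phenotype
    "cmpdB": (0.2, -0.1, 0.0),                           # looks like a negative
}
distances, hits = pick_hits(wells, ["neg1", "neg2"], cutoff=2.0)
print(hits)  # ['cmpdA']
```

The distance itself is kept, not just the hit/no-hit call, because it is reused later for the dose-response curves.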

Then all of our distances are visualised, and here we can see our negative controls and then our positive controls. Then these are all the samples that we're testing, and we can see that there are a number of hits here: things that are significantly different from the negative controls. These are dumped into a list, and then we can sort this list, we can filter it, and so on. Now that we've isolated these hits, what we can actually do is go back to the original 9 principal component scores, those nine reduced scores that we generated, and then we can do clustering based on those scores. And here you can see the clustering.

And now this is the beauty of high content analysis, because now what you can see is that the chemicals here in the hits are actually clustering based on their mechanism of action. So what we can see here is cytochalasin, latrunculin, cytochalasin, latrunculin: these are all actin inhibitors. Farther down here what we can see is taxol, taxol, epothilone B: these are all microtubule compounds. And then over here what we can see is compounds that are DNA damaging agents. So this is the real high content paradigm, where we can take images, extract numbers and then isolate chemicals that are giving us an effect, and then we can separate them according to mechanism of action, all based on a cell based assay.

So this is just another look at that clustering, and you can see here are our DNA replication and damage agents that are clustering, and then down here are actin inhibitors and then our microtubule modifiers.
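
That clustering by mechanism can be sketched with a toy single-linkage agglomerative clustering over made-up reduced profiles; the linkage and distance choices of the actual platform aren't stated in the talk:

```python
from math import dist

def single_linkage(profiles, n_clusters):
    """Greedy agglomerative clustering: repeatedly merge the two clusters
    whose closest members are nearest, until n_clusters remain."""
    clusters = [[name] for name in profiles]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(profiles[a], profiles[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return [sorted(c) for c in clusters]

# toy 2-D "profiles": the two actin inhibitors sit near each other,
# the two microtubule compounds sit near each other
profiles = {
    "cytochalasin": (0.0, 0.1), "latrunculin": (0.2, 0.0),
    "taxol": (5.0, 5.0), "epothilone B": (5.1, 5.1),
}
print(single_linkage(profiles, n_clusters=2))
```

Compounds with similar mechanisms end up in the same branch, which is exactly the pattern visible in the dendrogram described above.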

And another thing we can do with this distance score is that we can actually plot it against the concentration of the chemical involved. What happens is that as you get an increase in concentration, if it's a bioactive, you get this nice sigmoidal curve. This makes it very easy to identify chemicals that are giving a biological effect. So if we look at that a little bit more closely here, with one chemical called docetaxel, a microtubule inhibitor, what you can see is that as that distance score increases, we see the phenotype coming out: here, with a low distance, these are very similar to the negatives; here at an intermediate distance; and now here at a high distance, you see this is a very strong phenotype. If we look at latrunculin, which is a chemical or drug with a different type of mechanism of action (it affects actin), what you can see is that at a low distance it also looks very similar to the negatives, at a medium distance you see the phenotype starting to come out, and then with a high distance score you can see there's a very strong phenotype. But it's very obvious that this is a very different phenotype from docetaxel.
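
The sigmoidal relationship mentioned above is commonly modelled with a four-parameter logistic (Hill) curve; here is a small sketch evaluating one, with parameter values invented for illustration rather than fitted to the screen's data:

```python
def four_pl(conc, bottom, top, ec50, hill):
    """Four-parameter logistic: distance score as a function of concentration."""
    return bottom + (top - bottom) / (1.0 + (ec50 / conc) ** hill)

# invented parameters: baseline distance 0.5, plateau 12, EC50 = 1 uM, slope 1.5
curve = [four_pl(c, bottom=0.5, top=12.0, ec50=1.0, hill=1.5)
         for c in (0.01, 0.1, 1.0, 10.0, 100.0)]
print([round(v, 2) for v in curve])  # rises monotonically with concentration
```

Fitting such a curve to each compound's distance-versus-concentration points is what makes bioactives with clean dose responses stand out.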

And so this is why they cluster separately on that hierarchical clustering diagram. Then, because we have data at cell resolution, we can actually build AI models, and this is another way of doing the analysis within StratoMineR. We can isolate interesting wells that we have found during our clustering analysis, label those, and then use them to build AI models. Here we've built a random forest machine learning model based on these different reagents, and then what we can do is ask the question: OK, what wells look very similar to either docetaxel, latrunculin, doxorubicin, or AZ-I? And then what we can see is that they actually pull out wells that have a similar mechanism of action.
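
As a sketch of that kind of model, here is scikit-learn's random forest trained on made-up profiles; the features, labels and settings are all invented for illustration and are not the platform's actual configuration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# invented training data: 9-score profiles for two labelled mechanisms,
# each mechanism forming its own cloud in profile space
microtubule = rng.normal(loc=3.0, scale=0.3, size=(30, 9))
actin = rng.normal(loc=-3.0, scale=0.3, size=(30, 9))
X = np.vstack([microtubule, actin])
y = ["microtubule"] * 30 + ["actin"] * 30

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# ask which labelled mechanism an unlabelled well resembles
unknown_well = np.full((1, 9), 2.8)
print(model.predict(unknown_well))  # lands in the microtubule cloud
```

Applied to the whole data set, such a model pulls out the wells whose profiles resemble the labelled reference compounds.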

One thing that this use of the StratoMineR platform with open data highlights is that we can actually do iterative data mining. We can load our open data that we've extracted from images from the IDR or from the Broad collection, and then using StratoMineR we can extract new knowledge from that. This new knowledge can be turned into metadata that we can merge with our original data, and then do another round of analysis, and so this is essentially iterative data mining. In the same way that people iterate and do multiple experiments based on answers they find in experiments, we can do the same thing with the analysis of our data. One thing that our work with these open data also shows is the actual challenges of using open data.

In certain cases you end up with data that is in unusual or weird formats; some of the files can be very large and can be hard to handle. Also, some of the metadata can be split between different files, and so this is a critical point where you have to merge them, you have to join them, and sometimes this can be a challenge. As you saw with our images, downloading the images is possible, but then you do need advanced tools in order to extract the features from the images. And of course you need some tool like StratoMineR for a biologist to be able to get some useful information out of the numeric data.
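
That metadata merging step can be sketched as a simple join keyed on well ID; the file layout and column names here are invented, and real repository metadata rarely lines up this neatly:

```python
import csv, io

# two metadata "files" that each describe the same wells in part
plate_map = "well,compound\nA01,DMSO\nA02,docetaxel\n"
annotations = "well,mechanism\nA02,microtubule inhibitor\nA01,negative control\n"

def read_keyed(text):
    """Index a CSV on its 'well' column."""
    return {row["well"]: row for row in csv.DictReader(io.StringIO(text))}

# join: combine the columns of both files per well
left, right = read_keyed(plate_map), read_keyed(annotations)
merged = {w: {**left[w], **right.get(w, {})} for w in left}
print(merged["A02"]["compound"], "/", merged["A02"]["mechanism"])
```

The awkward part in practice is that the join keys themselves (well naming, plate barcodes) often differ between files and have to be reconciled first.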

Another problem, of course, that has been addressed earlier is that there is a lack of standardisation. Now, will it be possible for everything to be done in a completely standardised fashion? I don't think that's realistic. But one thing that can really help with this is proper tools that are flexible enough to allow people to pull in data from various different sources.

One thing you notice once you browse the various datasets in the image repositories is that they're all quite different: different types of biology, different cell systems, even different species, and so they're hard to compare like with like. Now, what if there was a standard assay that people could test their chemicals against, and so compare different experiments? Well, there is a platform that might make that possible. It's called Cell Painting.

Now, Cell Painting is a fairly straightforward protocol that uses five different dyes that label different parts of the cell, eight different cellular compartments, and what this generates is a very rich, high content assay that can be used to profile genetic reagents or various different chemicals. The idea is that if you build up a kind of library of these profiles, maybe that would be useful for comparing with and clustering with new chemicals or hits from new screens, and possibly you could compare between different experiments, and so this really would make open data from high content experiments extremely useful. There are a number of consortia who are actually working on this idea, generating these libraries of profiles using Cell Painting. There's also a lot of interest in Cell Painting for things like predictive toxicology, where you could generate profiles of unwanted phenotypes and then use those to predict unwanted phenotypes in novel chemicals that are coming out of screens.

So I hope that gives you an idea of how we've been using open data, and of the possibilities and challenges of using open data. I'd like to thank you for your attention, and I'm happy to answer any of your questions.

  • David Egan

    Co-founder and CEO
    Core Life Analytics

    © David Egan 2020

    David Egan is the co-founder and CEO of Core Life Analytics, a Netherlands-based technology company that gives biologists the ability to analyze their own data. While working at the Salk Institute in La Jolla, CA, David became interested in the use of automation for drug discovery. This led to a career in delivering high throughput screening services in the pharmaceutical industry and academia. Core Life Analytics was founded by David with Wienand Omta in 2016, based on the StratoMineR platform developed by David & Wienand at the Cell Screening Core at the University Medical Center Utrecht.

    The StratoMineR platform is being successfully used by customers globally, at companies such as Pfizer, AstraZeneca & Janssen, as well as academic centers of excellence such as the University of Michigan and Oxford University. In 2020 the founders secured a Series A investment of €1m to further expand the marketing, sales, & development of the platform.


Northernlands 2 is a collaboration between ODI Leeds and The Kingdom of the Netherlands, the start of activity to create, support, and amplify the cultural links between The Netherlands and the North of England. It is with their generous and vigorous support, and the support of other energetic organisations, that Northernlands can be delivered.

  • Kingdom of the Netherlands