Northernlands 2 - The FAIR guiding principles in times of crisis
Description
Professor Barend Mons, who founded GO FAIR and is President of CODATA, will share the principles of FAIR
Transcript
This transcript comes from the captions associated with the video above. It is "as spoken".
Good day to everyone.
Sorry that I cannot be there in person, but I would like to
tell you today about VODAN, the Virus Outbreak Data Network
that we are running
in times of crisis.
And I would like to explain
that the title is "Open data saves lives".
I would like to argue coming back to it later in
this presentation that FAIR, not only open, data saves lives.
One of the major issues we're dealing with in Corona is that
the data that we need to get our answers cannot be opened. They are
patient data. They don't leave the country, certainly not in a
highly politicised environment as now with COVID-19.
So we need to
work with data that are findable, accessible,
interoperable, and reusable, but not necessarily open, and I will
hope to convince you in this talk that this is the way to go.
So first of all we have set up this virus
outbreak data network - VODAN
And it is based on so-called FAIR data points as a service
for data driven research;
distributed analytics. It is a pressure cooker use case for our
approach because now everybody is panicking. But of course we
can wait for the other outbreaks to come so we don't do this just
for covid. We do it for the general virus - in this case -
outbreaks and what you see here is that four major international
organizations - CODATA, GO FAIR, RDA and the World Data System - I'm
not going to explain all of them
right here. You can look this all up. This is on the Web.
They have made a joint statement
that confirms this statement I made in the beginning that we
cannot have a central database. We cannot have all the data
open, so we have to find additional and alternative
methods to do this and our approach is as follows:
We see a globally distributed
network of FAIR data points from China to the US.
And Africa, already several have been installed in Africa
actually under the VODAN project. That's the green dots
and the orange dots are stations with lots of established
knowledge and reference data, biomedical knowledge, to interpret
what we see in the green dots. So that is the basic dream, and
maybe some of you would say "Woah! This is way out there".
No. This is running in a small setting already, and this is
also our belief of how, for example, the European Open
Science Cloud or any other
system should work and to make this happen we have to construct
everything on the web
and make it understandable for machines. So I now summarize
FAIR as the machine knows what I mean, which will now also
address this openness versus fairness. So when you have any
FAIR digital record, whether it's a single identifier, an assertion,
a graph, a database, a spreadsheet,
whatever, it's a digital object. It needs to be a FAIR
digital object, which means a number of things. Let's not go
into the details at the bottom of the slide: that it needs to
have a globally unique, persistent and resolvable
identifier so that the computer knows where it is. It needs
metadata. It points to a resource. All these things are a
bit technical, but at the
highest level the computer needs to know "what is this?"
So it needs to have a type. Is this a triple? Is this a graph?
Is this a database, an Excel sheet?
Based on that it can answer question 2 on the left hand side:
what can I technically do with this digital object? And
then the third question is: what am I allowed to do? And here we
get the link to open.
Because it may find in the
metadata of the FAIR digital
object, that this data
is highly relevant for me, but I'm a company or virtual
machine from a company in our
setting and I'm not allowed to see these data and others are.
That is not the same as open, but it's still FAIR.
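
The three questions can be sketched in code. This is a minimal illustration, not the GO FAIR specification: the class name, the field names, the identifier and the role names are all assumptions made up for this example.

```python
from dataclasses import dataclass

# Hypothetical mapping from object type to technical operations (question 2).
OPERATIONS_BY_TYPE = {
    "graph": ("query",),
    "spreadsheet": ("read_rows", "aggregate"),
}


@dataclass(frozen=True)
class FairDigitalObject:
    identifier: str           # globally unique, persistent, resolvable (made-up value below)
    object_type: str          # question 1: what is this?
    allowed_roles: frozenset  # question 3: who is allowed to use the data?


def what_can_i_do(obj: FairDigitalObject) -> tuple:
    """Question 2: operations that are technically possible for this type."""
    return OPERATIONS_BY_TYPE.get(obj.object_type, ())


def am_i_allowed(obj: FairDigitalObject, role: str) -> bool:
    """Question 3: the metadata is visible to everyone, the data only to some."""
    return role in obj.allowed_roles


record = FairDigitalObject(
    identifier="https://example.org/fdo/123",  # placeholder identifier
    object_type="spreadsheet",
    allowed_roles=frozenset({"hospital_researcher"}),
)

print(what_can_i_do(record))
print(am_i_allowed(record, "hospital_researcher"))
print(am_i_allowed(record, "company_vm"))
```

The record is findable and typed for every visitor, but only some roles may touch the data: FAIR without being open.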
So the idea of the personal health train that we are setting
up in the Netherlands and also of the trusted world of
Corona hotels here on the right hand side, is essentially that a
train, which is a virtual machine, learning algorithm or analytics
algorithm, visits data stations that have FAIR data. It does not
take any of the original patient data with it.
It can also visit an established data station to interpret the data,
but the only thing it takes with it is for example how many
people in your hospital that went into the intensive care
unit had an elevated prostaglandin E or whatever
cytokine IL6 level.
And it collects all these data without ever having access to
the patient data themselves. So this hospital, let's say it's
us in LUMC. We say we had 15 patients. Yes, all of them had
increased IL6 and prostaglandin E. In the end that
tells you whether the idea you were coming in with holds - that's why it's
analytics not necessarily learning because you have a
hypothesis. You can do that without ever taking any
of the data outside their safe
silos. Which is critical for the crisis that we face today.
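
The train idea can be sketched as follows. This is a toy illustration and not the Personal Health Train implementation: the station names, the patient records and the IL6 cut-off are invented for the example, and a real station would enforce far stricter checks.

```python
from dataclasses import dataclass, field

IL6_THRESHOLD = 7.0  # hypothetical cut-off, for illustration only


@dataclass
class DataStation:
    name: str
    _records: list = field(repr=False)  # patient data never leaves the station

    def receive(self, train):
        """Run the visiting algorithm locally and release only aggregate counts."""
        result = train(self._records)
        if not all(isinstance(value, int) for value in result.values()):
            raise ValueError("train may only return aggregate counts")
        return result


def icu_il6_train(records):
    """The 'train': counts ICU patients and how many had elevated IL6."""
    icu = [r for r in records if r["icu"]]
    return {
        "icu_patients": len(icu),
        "elevated_il6": sum(1 for r in icu if r["il6"] > IL6_THRESHOLD),
    }


# Made-up records; each station keeps its own.
stations = [
    DataStation("station_a", [
        {"icu": True, "il6": 42.0},
        {"icu": True, "il6": 3.0},
        {"icu": False, "il6": 2.0},
    ]),
    DataStation("station_b", [{"icu": True, "il6": 15.5}]),
]

totals = {"icu_patients": 0, "elevated_il6": 0}
for station in stations:
    for key, value in station.receive(icu_il6_train).items():
        totals[key] += value

print(totals)
```

Only the counts travel back with the train; a train that tries to return the raw records is rejected by the station.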
How we do this is that applications, data sources, and
the necessary infrastructure like compute power is
distributed all over the world.
We need to facilitate human to
machine connections, machine to human and machine to machine. It
should be based on the FAIR guiding principles and then it
can work because then we can give each application FAIR
metadata so it can instruct any computer: I'm looking for A, B & C.
Let's say I'm looking for CT scans of covid patients. They
have to be in DICOM format. Then it is searched on the metadata
and you say: can I visit with my algorithm to learn on these
pictures whether a particular structure that I always seem to
associate with severe disease is in the pictures or not?
You don't have to bring the pictures together.
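
A search on metadata alone might look like the sketch below. The catalogue, its field names and the dataset identifiers are hypothetical, chosen only to illustrate the CT scan example; real FAIR data points publish much richer metadata.

```python
# Hypothetical metadata catalogue: descriptions only, no images.
CATALOG = [
    {"id": "dataset-1", "modality": "CT", "format": "DICOM",
     "condition": "COVID-19", "visit_allowed": True},
    {"id": "dataset-2", "modality": "CT", "format": "PNG",
     "condition": "COVID-19", "visit_allowed": True},
    {"id": "dataset-3", "modality": "CT", "format": "DICOM",
     "condition": "COVID-19", "visit_allowed": False},
]


def find_visitable(catalog, **criteria):
    """Return ids of datasets that match every criterion and accept visiting algorithms."""
    return [
        record["id"]
        for record in catalog
        if record.get("visit_allowed")
        and all(record.get(key) == value for key, value in criteria.items())
    ]


matches = find_visitable(CATALOG, modality="CT", format="DICOM", condition="COVID-19")
print(matches)
```

The algorithm would then be sent to each matching station, so the pictures themselves stay where they are.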
Then you can construct beautiful things like this. And of course
people that have constructed these pictures, like the
mutations of the virus, the maps or even this thing that came
recently about all the genes that may be involved in
susceptibility of people for
covid. Some lucky few people have a lot of data at their
hands, in the UK Biobank or wherever, like this Precisionlife
picture, and they can make these kinds of pictures. But you
need a lot of data and in most cases this is totally impossible
and we have lost a lot of lives because it took far too long
before we had enough data to see the pattern that now
gives us the impression that we probably did not treat the
first wave of severe patients correctly.
So what we did was based on FAIR data, but we had to
painstakingly find them in publications. You know, there's
250 publications/day coming out on covid so it's a crazy
pandemonium of information here. Then we have case report forms,
like this, the WHO case report form. Measurements: we
measure 96 cytokines in LUMC at the moment for every patient
that is severe. You have apps and
self reporting. All these data we call real world
observations. We need them to see what's really happening.
Then you have all kinds of hypotheses on the right
hand side, in the yellow panel. We know which viral proteins
interact with the human proteome. Meanwhile we
have the receptors that the virus uses and that can be
disturbed. We know that it can cause a cytokine storm by
disturbing also the RAS system. You get thrombosis and vascular
leakage and we can have any other hypothesis. You want to
test those. But you can only do it with the computer because all
the pictures I showed you before can only be made and interpreted
from pattern recognition by computers. So the data
needs to be machine readable. With a Dutch company we can make
a disease model for any of those hypotheses, and now we can
actually start rationalising drugs or interventions and see
if we add this drug to this model what is likely to happen?
And we can do that for many drugs. And here in Leiden
that's just one example, but in other data hotels as we now
call them, you would see the same data, but you may have
other wet lab possibilities to test things. And in Leiden
we can actually pass the plasma of severe patients versus
controls through micro vessels that are intact with endothelial
cells and everything. And you can see whether the effect of
the plasma that you expect based on your hypotheses,
in this case causing vascular leakage is actually
observed in vitro, so you have the real world observations, the
hypotheses, and right now it is a nightmare to get this done.
So nobody should tell me "oh it's easy to get to the data", no, but
they can also not be open.
So this is the data model we developed in five phases,
healthy patients or sick but not yet having severe lung problems.
And in the end you can see here that people basically die from
lungs that are completely dysfunctional, and they got
actually multiple organ failure, but the cytokine storm and a
number of other things that we have here with a lot of genes.
47 genes seemed to be involved
in driving hypercoagulation and vascular leakage,
which is of course a deadly combination when it comes
into a particular organ. Could be your heart, could be your
brain, could be your lung and of course you get
different manifestations and you die sometimes from
multiple organ failure, but it is always in essence the
same system underlying this, in broad strokes.
So the virus interacts with a number of proteins in the
human proteome that can cause cytokine storm totally
different from the hypothesis that it works via ACE2 and RAS.
But in both cases, most people get better even
before they go to the hospital. They could still take aspirin or
some things to help them a
little bit. Then some people get into a full blown
cytokine storm and some of those developed leakage and
thrombosis and they get severely ill and they go into
the intensive care and then of course maybe here it makes no
sense to put people on a ventilator anymore because the
lungs are completely collapsed with lymph that has been
pushed out of the microvessels in the lungs.
So this model can be now used to test all the drugs that we just
took here from Wikipedia or Wikidata and we threw them
theoretically in silico in this
Petri dish
and saw, for example, that hydroxychloroquine has hardly
any connections in the model, of course to heart failure,
but some other drugs that you see here have very strong
connections and have a much higher chance to
mitigate some of the stuff that is going on here, causing
disease, than others.
For example, aspirin has a lot of effects on many of the
genes that are involved, including IL6, and it is
potentially possible that it can prevent, you know, at not
a very strong level, this switch to a cytokine storm or an
aberrant immune system because it also affects IL6 for
example, and in some cases you are much more severe. You need
monoclonals that very specifically inhibit IL6 for
example, like tocilizumab from Roche, but if you are
already in a full blown cytokine storm or the virus causes
thrombosis and vascular leakage without a cytokine storm
it doesn't really make any sense to give these patients you know
an IL6 inhibitor.
We have to be aware that the covid patient is not a covid
patient. You have to really look at the state at which they
are and only slowly by measuring all these cytokines in large
numbers of patients these patterns become clear and
dexamethasone and heparin have already saved a lot of lives.
Drugs that we use all the time. If you give them at the right
moment to the right patients.
Here you see the dexamethasone that just came out, inhibits
prostaglandin E2, which is one of the major intermediaries
leading to vascular leakage. So we can now finally see in a very
complicated analysis that I cannot explain to you right here
how this might work under the hood.
But if, and that's the final part of my presentation, if we
do not invest serious funds
in ensuring that data are reusable,
this is impossible. So this distributed analytics system
where the data stay where they are in China, in the United
States, in Iran, in Italy, with all the political hassle that is
going around covid at the moment.
Nobody, none of these countries is going to send their data
across the pond or even to WHO
But if we spent about 5% of every research project on
proper data stewardship, and we make the data FAIR and visible
under well defined conditions, then we could have made the
pictures that I just showed you
months earlier. And I don't have to explain to anyone in this
audience how many lives that
could have potentially saved. So, yes, open data saves lives.
But I would rather specify that FAIR data saves lives.
Thank you.
-
Professor Barend Mons
President at CODATA, Founder of GO FAIR
© Barend Mons 2020
Barend Mons is a global expert on FAIR principles and he led the 5 day long early meeting in January 2014 (Leiden) where the principles were first defined. Originally a molecular biologist with 15 years of basic research experience on malaria parasites and vaccines, he refocused in 2000 on semantic technologies and later on Open Science. He has thus been in this field from the very beginning and started various early movements for open science ‘avant la lettre’ (a.o. Wiki professional, Concept Web Alliance). Mons published over 100 peer reviewed articles and more recently a handbook named: Data Stewardship for Open Science. He was the senior author on the now widely cited FAIR principles paper in Nature’s Scientific Data in 2016. In 2015, Barend was appointed Chair of the High Level Expert Group (HLEG) for the European Open Science Cloud, and the group published its report, which marked a critical step towards realising the aspiration of the EOSC. After leaving the HLEG he continued to be active towards the practical realisation of the EOSC, defined in the report as the Internet of FAIR data and services. Three countries (The Netherlands, Germany and France) took the early initiative to create a Global, Open approach to the implementation of FAIR principles in practice, called GO FAIR, with the aim to kick-start the developments towards EOSC in a global, open science and innovation context. Mons was appointed director of the Dutch International Support and Coordination Office of the initiative with sister offices in Germany and France. He is also the elected president of CODATA, the standing committee on research data related issues of the International Science Council.
Sponsors
Northernlands 2 is a collaboration between ODI Leeds and The Kingdom of the Netherlands, the start of activity to create, support, and amplify the cultural links between The Netherlands and the North of England. It is with their generous and vigorous support, and the support of other energetic organisations, that Northernlands can be delivered.