Northernlands 2 - The FAIR guiding principles in times of crisis

Description

Professor Barend Mons, who founded GO FAIR and is President of CODATA, will share the principles of FAIR data.

Transcript

This transcript comes from the captions associated with the video above. It is "as spoken".

Good day to everyone.

Sorry that I cannot be there in person, but I would like to

tell you today about VODAN, the Virus Outbreak Data Network,

that we are running

in times of crisis.

And I would like to explain

that the title is "open data saves lives".

I would like to argue, coming back to it later in

this presentation, that FAIR data, not only open data, saves lives.

One of the major issues we're dealing with in the corona crisis is that

the data that we need to get our answers cannot be opened. They are

patient data. They don't leave the country, certainly not in a

highly politicised environment such as now with COVID-19.

So we need to

work with data that are findable, accessible,

interoperable, and reusable, but not necessarily open, and I

hope to convince you in this talk that this is the way to go.

So first of all we have set up this virus

outbreak data network - VODAN

And it is based on so-called FAIR data points as a service

for data-driven research and

distributed analytics. It is a pressure-cooker use case for our

approach, because right now everybody is panicking. But of course

other outbreaks will come, so we don't do this just

for covid; we do it for virus

outbreaks in general. And what you see here is that four major international

organizations - CODATA, GO FAIR, RDA, and the World Data System - I'm

not going to explain all of them

right here. You can look this all up on the Web.

They have made a joint statement

that confirms the statement I made in the beginning that we

cannot have a central database. We cannot have all the data

open, so we have to find additional and alternative

methods to do this and our approach is as follows:

We see a globally distributed

network of FAIR data points from China to the US

and Africa; several have already been installed in Africa,

actually under the VODAN project. Those are the green dots,

and the orange dots are stations with lots of established

knowledge and reference data, like medical knowledge, to interpret

what we see in the green dots. So that is the basic dream, and

maybe some of you would say "Woah! This is way out there".

No. This is running in a small setting already, and this is

also our belief of how, for example, the European Open

Science Cloud or any other

system should work. To make this happen we have to construct

everything on the web

and make it understandable for machines. So I now summarise

FAIR as "the machine knows what I mean", which will also

address this openness versus FAIRness. So when you have any

FAIR digital record, whether it's a single identifier, an assertion,

a graph, a database, a

spreadsheet, whatever: it is a digital object, and it needs to be a FAIR

digital object, which means a number of things. Let's not go

into the details at the bottom of the slide: it needs to

have a globally unique, persistent and resolvable

identifier so that the computer knows where it is. It needs

metadata. It points to a resource. All these things are a

bit technical, but at the

highest level the computer needs to know "what is this?"

So it needs to have a type. Is this a triple? Is this a graph?

Is this a database, an Excel sheet?

Based on that it can answer question 2 on the left-hand side:

what can I technically do with this digital object? And

then the third question is: what am I allowed to do? And here we

get the link to open.

Because it may find in the

metadata of the FAIR digital

object that these data

are highly relevant for me, but I'm a company, or a virtual

machine from a company, in our

setting, and I'm not allowed to see these data while others are.

That is not the same as open, but it's still FAIR.
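A purely illustrative sketch in Python of this point - all field names, values, and the access rule here are invented for illustration, not VODAN's actual schema - showing how metadata could let a machine answer the three questions, including "what am I allowed to do?":

```python
# Hypothetical FAIR digital object metadata (invented fields, not a real schema).
fair_digital_object = {
    # Globally unique, persistent, resolvable identifier (a made-up PID).
    "pid": "https://example.org/fdo/covid-crf-0001",
    # "What is this?" -- the type of the digital object.
    "type": "dataset/spreadsheet",
    # "What can I technically do with it?" -- protocol and format.
    "access_protocol": "https",
    "media_type": "text/csv",
    # "What am I allowed to do?" -- FAIR, but not open to everyone.
    "licence": "restricted",
    "allowed_agents": ["academic-research"],
}

def may_access(fdo: dict, agent_kind: str) -> bool:
    """FAIR but not open: access depends on who (or what machine) is asking."""
    return agent_kind in fdo["allowed_agents"]

print(may_access(fair_digital_object, "academic-research"))  # True
print(may_access(fair_digital_object, "company-vm"))         # False
```

A company's virtual machine can discover from the metadata that the data exist and are relevant, yet still be refused access: findable and accessible under conditions, not open.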

So the idea of the personal health train that we are setting

up in the Netherlands and also of the trusted world of

Corona hotels here on the right-hand side, is essentially that a

train, which is a virtual machine carrying a learning or analytics

algorithm, visits data stations that have FAIR data. It does not

take any of the original patient data with it.

It can also visit an established data station to interpret the data,

but the only thing it takes with it is, for example, how many

people in your hospital who went into the intensive care

unit had an elevated prostaglandin E or

cytokine IL-6 level.

And it collects all these data without ever having access to

the patient data themselves. So this hospital, let's say it's

us in LUMC. We say we had 15 patients. Yes, all of them had

increased IL-6 and prostaglandin E. In the end that

gives you the answer to the question you came in with - that's why it's

analytics, not necessarily learning, because you have a

hypothesis. You can do that without ever taking any

of the data outside their safe

silos. Which is critical for the crisis that we face today.
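A minimal sketch of this "train visits the stations" idea - the station names, record fields, and query are all invented for illustration: each station computes its own count locally, and only the aggregate, never a patient record, leaves the station.

```python
# Toy data stations with local patient records (entirely invented).
stations = {
    "LUMC": [
        {"icu": True, "il6_elevated": True},
        {"icu": True, "il6_elevated": True},
        {"icu": False, "il6_elevated": False},
    ],
    "StationB": [
        {"icu": True, "il6_elevated": False},
        {"icu": True, "il6_elevated": True},
    ],
}

def local_count(records, predicate):
    """Runs inside the station: only the count leaves, not the records."""
    return sum(1 for r in records if predicate(r))

def run_train(stations, predicate):
    """The 'train' visits each station and collects only aggregates."""
    return {name: local_count(records, predicate)
            for name, records in stations.items()}

# "How many ICU patients had elevated IL-6?" -- asked of every station.
result = run_train(stations, lambda r: r["icu"] and r["il6_elevated"])
print(result)  # {'LUMC': 2, 'StationB': 1}
```

The hypothesis is tested across all stations, yet no patient-level data ever crosses a border.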

How we do this is that applications, data sources, and

the necessary infrastructure, like compute power, are

distributed all over the world.

We need to facilitate human-to-machine

connections, machine-to-human and machine-to-machine. It

should be based on the FAIR guiding principles, and then it

can work, because then we can give each application FAIR

metadata so it can instruct any computer: "I'm looking for A, B & C".

Let's say I'm looking for CT scans of covid patients. They

have to be in DICOM format. Then the metadata is searched

and you ask: can I visit with my algorithm, to learn from these

pictures whether a particular structure that I always seem to

associate with severe disease is in the pictures or not?

You don't have to bring the pictures together.
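This metadata-level discovery step can be sketched as follows - the catalogue entries and field names are invented for illustration; in practice the metadata would live in FAIR data points, but the idea is the same: the algorithm matches on metadata alone and the images themselves are never transferred.

```python
# An invented metadata catalogue of what each data station holds.
catalogue = [
    {"station": "A", "modality": "CT",  "format": "DICOM", "cohort": "covid"},
    {"station": "B", "modality": "MRI", "format": "DICOM", "cohort": "covid"},
    {"station": "C", "modality": "CT",  "format": "DICOM", "cohort": "flu"},
]

def find_stations(catalogue, **criteria):
    """Match on metadata only; no images are moved at this stage."""
    return [entry["station"] for entry in catalogue
            if all(entry.get(k) == v for k, v in criteria.items())]

# "Where are covid CT scans in DICOM format?"
print(find_stations(catalogue, modality="CT", format="DICOM", cohort="covid"))
# ['A']
```

Only after this match would the algorithm travel to station A to learn on the pictures in place.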

Then you can construct beautiful things like this. And of course

the people that have constructed these pictures - like the

maps of the mutations of the virus, or even this thing that came out

recently about all the genes that may be involved in

people's susceptibility to

covid - are some lucky few who have a lot of data at their

hands, in the UK Biobank or wherever, like this Precision Life

picture, and they can make these kinds of pictures. But you

need a lot of data, and in most cases this is totally impossible,

and we have lost a lot of lives because it took far too long

before we had enough data to see the pattern that now

gives us the impression that we probably did not treat the

first wave of severe patients correctly.

So this is what we did, based on FAIR data, but we had to

painstakingly find them in publications. You know, there are

250 publications a day coming out on covid, so it's a crazy

pandemonium of information here. Then we have case report forms,

like this WHO case report form, and measurements: we

measure 96 cytokines in LUMC at the moment for every patient

that is severe. You have apps and

self-reporting. All these data we call real world

observations. We need them to see what's really happening.

Then you have all kinds of hypotheses on the right-hand

side, in the yellow panel. We know which viral proteins

interact with the human proteome. Meanwhile we

have the receptors that the virus uses and that can be

disturbed. We know that it can cause a cytokine storm by

also disturbing the RAS system. You get thrombosis and vascular

leakage, and we can have any other hypothesis. You want to

test those. But you can only do it with the computer, because all

the pictures I showed you before can only be made and interpreted

from pattern recognition by computers. So the data

need to be machine readable. With a Dutch company we can make

a disease model for any of those hypotheses, and now we can

actually start rationalising drugs or interventions and see:

if we add this drug to this model, what is likely to happen?

And we can do that for many drugs. And here in Leiden

that's just one example, but in other data hotels, as we now

call them, you would see the same data, but you may have

other wet lab possibilities to test things. And in Leiden

we can actually pass the plasma of severe patients versus

controls through microvessels that are intact, with endothelial

cells and everything. And you can see whether the effect of

the plasma that you expect based on your hypotheses,

in this case causing vascular leakage, is actually

observed in vitro. So you have the real world observations, the

hypotheses, and right now it is a nightmare to get this done.

So nobody should tell me "oh, it's easy to get to the data". No. But

they can also not be open.

So this is the data model we developed, in five phases:

healthy, or sick but not yet having severe lung problems, and so on.

And in the end you can see here that people basically die from

lungs that are completely dysfunctional, and they actually get

multiple organ failure, through the cytokine storm and a

number of other things that we have here, involving a lot of genes:

47 genes seemed to be involved

in driving hyper-coagulation and vascular leakage,

which is of course a deadly combination when it comes

into a particular organ. Could be your heart, could be your

brain, could be your lung and of course you get

different manifestations and you die sometimes from

multiple organ failure, but it is always in essence the

same system underlying this, in broad strokes.

So the virus interacts with a number of proteins in the

human proteome, which can cause a cytokine storm - totally

different from the hypothesis that it works via ACE2 and RAS.

But in both cases, most people get better even

before they go to the hospital. They could still take aspirin or

something to help them a

little bit. Then some people get into a full-blown

cytokine storm, and some of those develop leakage and

thrombosis; they get severely ill and they go into

intensive care, and then of course maybe here it makes no

sense to put people on a ventilator anymore, because the

lungs are completely collapsed with lymph that has been

pushed out of the microvessels in the lungs.

So this model can now be used to test all the drugs that we just

took here from Wikipedia or Wikidata. We threw them,

theoretically, in silico, into this

Petri dish

and saw, for example, that hydroxychloroquine has hardly

any connections in the model - except, of course, to heart failure -

but some other drugs that you see here have very strong

connections and have a much higher chance to

mitigate some of the stuff that is going on here, causing

disease, than others.
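This kind of in-silico screen can be pictured, very roughly, as counting how many genes in the disease model each drug is known to affect - the gene sets below are invented placeholders, not the real model, and real tools weight interactions far more carefully than a simple overlap count:

```python
# Invented stand-in for the disease-model genes (IL6 is from the talk;
# the rest are placeholders, not the actual 47-gene set).
disease_model_genes = {"IL6", "PTGS2", "F2", "ACE2", "TNF"}

# Invented drug-target sets for illustration only.
drug_targets = {
    "aspirin": {"PTGS2", "IL6", "TNF"},
    "hydroxychloroquine": {"TLR9"},       # hardly any overlap with the model
    "dexamethasone": {"PTGS2", "IL6"},
}

def rank_drugs(model_genes, drug_targets):
    """Score each drug by how many model genes it affects, best first."""
    scores = {drug: len(targets & model_genes)
              for drug, targets in drug_targets.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank_drugs(disease_model_genes, drug_targets))
# [('aspirin', 3), ('dexamethasone', 2), ('hydroxychloroquine', 0)]
```

Drugs with many strong connections into the model rank as better candidates to mitigate the disease process than drugs with hardly any.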

For example, aspirin has a lot of effects on many of the

genes that are involved, including IL-6, and it is

possible that it can prevent - you know, not at

a very strong level - this switch to a cytokine storm or an

aberrant immune system, because it also affects IL-6, for

example. And in some cases you are much more severe: you need

monoclonals that very specifically inhibit IL-6, for

example, like those initial map from RAS. But if you are

already in a full-blown cytokine storm, or the virus causes

thrombosis and vascular leakage without a cytokine storm,

it doesn't really make any sense to give these patients, you know,

an IL-6 inhibitor.

We have to be aware that one covid patient is not another covid

patient. You have to really look at the state that they

are in, and only slowly, by measuring all these cytokines in large

numbers of patients, do these patterns become clear. And

dexamethasone and heparin have already saved a lot of lives -

drugs that we use all the time - if you give them at the right

moment to the right patients.

Here you see that dexamethasone - the news just came out - inhibits

prostaglandin E2, which is one of the major intermediaries

leading to vascular leakage. So we can now finally see, in a very

complicated analysis that I cannot explain to you right here,

how this might work under the hood.

But if, and that's the final part of my presentation, if we

do not invest serious funds

in ensuring that data are reusable,

this is impossible. So in this distributed analytics system,

the data stay where they are: in China, in the United

States, in Iran, in Italy, with all the political hassle that is

going on around covid at the moment.

Nobody - none of these countries - is going to send their data

across the pond, or even to the WHO.

But if we spent about 5% of every research project on

proper data stewardship, and we made the data FAIR and visible

under well-defined conditions, then we could have made the

pictures that I just showed you

months earlier. And I don't have to explain to anyone in this

audience how many lives that

could have potentially saved. So, yes, open data saves lives.

But I would rather specify that FAIR data saves lives.

Thank you.

  • Professor Barend Mons

    President at CODATA, Founder of GO FAIR

    © Barend Mons 2020

    Barend Mons is a global expert on FAIR principles and he led the five-day meeting in January 2014 in Leiden where the principles were first defined. Originally a molecular biologist with 15 years of basic research experience on malaria parasites and vaccines, he refocused in 2000 on semantic technologies and later on Open Science. He has thus been in this field from the very beginning and started various early movements for open science ‘avant la lettre’ (among others WikiProfessional and the Concept Web Alliance). Mons has published over 100 peer-reviewed articles and, more recently, a handbook named Data Stewardship for Open Science. He was the senior author on the now widely cited FAIR principles paper in Nature’s Scientific Data in 2016. In 2015, Barend was appointed Chair of the High Level Expert Group (HLEG) for the European Open Science Cloud, and the group published its report, which marked a critical step towards realising the aspiration of the EOSC. After leaving the HLEG he continued to be active in the practical realisation of the EOSC, defined in the report as the Internet of FAIR data and services. Three countries (the Netherlands, Germany and France) took the early initiative to create a global, open approach to the implementation of FAIR principles in practice, called GO FAIR, with the aim to kick-start the developments towards the EOSC in a global, open science and innovation context. Mons was appointed director of the Dutch International Support and Coordination Office of the initiative, with sister offices in Germany and France. He is also the elected president of CODATA, the standing committee on research data related issues of the International Science Council.

Sponsors

Northernlands 2 is a collaboration between ODI Leeds and The Kingdom of the Netherlands, the start of activity to create, support, and amplify the cultural links between The Netherlands and the North of England. It is with their generous and vigorous support, and the support of other energetic organisations, that Northernlands can be delivered.

  • Kingdom of the Netherlands