This document discusses myths and facts about big data in healthcare and proposes an innovation in healthcare IT standards called Multilevel Healthcare Information Modeling (MLHIM) to address some limitations of traditional standards. MLHIM uses XML schemas rather than ADL to define clinical concept constraints in a bottom-up way. This allows for multiple definitions of a concept and makes the standards more adaptable to big data. Tools are being developed to generate, edit, and work with MLHIM clinical models to facilitate reliable big data collection and interchange.
3. MYTH #1: "BIG DATA" HAS A UNIVERSALLY ACCEPTED, CLEAR DEFINITION
There is no consensus in the scientific literature or in the specialized blogosphere about the definition of Big Data
The various definitions have the 3Vs in common:
• Volume: existence of gigantic amounts of data
• Variability: coexistence of structured, non-structured, machine-generated and other kinds of data
• Velocity: data is produced, and has to be processed and consumed, very fast
Two of these aspects are of particular concern in healthcare: Variability and Velocity
4. MYTH #2: BIG DATA IS NEW
Collecting, processing and analyzing vast amounts of data is not a new activity for mankind
• Example: medieval monks and their concordances (correlations of every single word in the Bible)
What is new is the volume of data and the speed at which it can be processed and analyzed
5. MYTH #3: BIGGER DATA IS BETTER
In biomedical science, this is partially a fact: the bigger the sample size, the more precise the estimates are
However, large sample sizes with bad quality data are dangerously misleading
In healthcare, precision and reliability are equally important
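The trade-off above can be made concrete with a small calculation. This is a minimal sketch, assuming a measurement whose sample mean has a random error that shrinks with the sample size and a systematic error (bias) that does not. The function name and all numbers are illustrative assumptions, not data from any study.

```python
import math


def rmse_of_mean(n: int, sd: float, bias: float) -> float:
    """Root-mean-square error of a sample mean that carries a systematic bias."""
    standard_error = sd / math.sqrt(n)  # random error: shrinks as n grows
    return math.sqrt(bias ** 2 + standard_error ** 2)  # bias does not shrink


# Illustrative numbers only (assumed, not measured).
small_clean = rmse_of_mean(n=400, sd=15.0, bias=0.0)
large_biased = rmse_of_mean(n=100_000, sd=15.0, bias=5.0)

print(f"small, clean sample : {small_clean:.2f}")   # prints 0.75
print(f"large, biased sample: {large_biased:.2f}")  # prints 5.00
```

Under these assumptions the sample that is 250 times larger is still far less accurate, because no amount of additional data removes the systematic error.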
6. MYTH #4: BIG DATA MEANS BIG MARKETING
There is no evidence that analyzing Big Data increases the number of customers
In any case, customer acquisition has little relevance in healthcare, especially in universal healthcare systems
Big Data is useful when it helps actionable insights emerge (e.g., a previously unknown relationship between a gene and a disease)
7. HOW TO GET RELIABLE BIG DATA?
TRADITIONAL STANDARDS VS. INNOVATION
8. THE TRADITIONAL HEALTHCARE IT STANDARDS
HL7, openEHR, ISO 13606
• Primary focus on message exchange among EMRs
• All of them predate the emergence of Big Data and the Semantic Web
• Top-down data modeling approach: not prepared to deal with the 3Vs of Big Data
SNOMED-CT, LOINC, ICD
• Controlled vocabularies
• Also predate Big Data and the Semantic Web
• Main focus on pre-coordination (top-down approach)
In other words: the traditional healthcare IT standards are not prepared to deal with Big Data
9. A DEVELOPMENT ABOUT OPENEHR
The current version of the Archetype Definition Language (ADL) is 1.4
It requires an archetype to be the maximal data set for a given concept
By the book, this means that there can be just one archetype for each concept in the whole world
In practice, several archetypes are being developed in isolation, without being submitted to the proper governance tool (the CKM)
The ADL 1.5 spec promises that the “maximal data model” requirement will be removed
11. A BIG DATA-AWARE HEALTHCARE IT STANDARD IS:
Compliant with Semantic Web technologies
Respectful of the different points of view coming from different medical schools
Welcoming to all healthcare professionals (and their concepts)
Not limited to EMR data modeling
Prepared to deal with emerging mHealth and the Internet of Things
13. THE BACKGROUND - 1
The typical application design locks up semantics in the database structure and application source code
When the semantics are missing, different use cases in different scenarios often interpret seemingly similar data differently
Multilevel modelling provides a way to share semantics about any medical (healthcare) concept between distributed and independent applications
14. THE BACKGROUND - 2
MLHIM is based on the core modelling concepts of openEHR to provide semantics external to applications
• From openEHR, MLHIM inherited the multilevel model principles
MLHIM also uses certain conceptual principles from HL7 v3
• From HL7, MLHIM inherited the XML-based implementation
15. THE IMPLEMENTATION
MLHIM simplifies the openEHR Reference Model
• It is called a ‘minimalistic’ multilevel model
MLHIM uses XML instead of ADL so that ubiquitous tooling and training are available
• The whole Semantic Web is based on XML technologies
Because MLHIM is based on the XML Schema data model, there is no loss of information between model semantics and serialization in XML instances
• Such loss is a problem when serializing ADL into XML (see next)
16. A NOTE ON ADL VS. XML
There is a loss of information when moving between an object model (the Archetype Object Model, AOM) and XML Schema
dADL is the proper instance serialization for the AOM
However, in practice implementers are serializing openEHR/ISO 13606 data in XML
17. ADL VS. XML: A COMPARISON
ADL | XML
The openEHR test suite includes approximately 1600 total files, with known independent validations of its files | The XML Schema test suite contains more than 40,000 independently validated tests
OpenEHR tools are developed by one company and there is one open source reference model | There are more than 30 XML editors, open source and proprietary, from as many companies. There are additional tools in the XML family: XSLT, XQuery, XLink and XProc
The FOSS Java RM has not been thoroughly tested and validated | There are at least 3 widely used XML parsers/validators, open source and proprietary, from different companies and communities
The only ADL courses are from Ocean Informatics and a few startup courses taught by non-experts | XML is taught in all computer science courses as well as online
There are zero books on ADL | O'Reilly has 54 books on XML; Amazon has 11,890 results for Books: "xml"
20. CLINICAL KNOWLEDGE MODELING: FUNDAMENTALS
Modeling clinical data is a complex task
• Requires deep knowledge of the specific clinical domain
• Requires at least an intermediate understanding of data types
Modeling clinical data is a core activity in healthcare IT
• It is the only way to produce Big Data in healthcare responsibly
• Even well-designed clinical data models in conventional software are not interoperable
• Multilevel model software is interoperable, and it requires thoughtful clinical knowledge modeling
21. CLINICAL MODELS IN MULTILEVEL MODELING
In multilevel modeling, the information ecosystem is structured in (at least) two levels:
• The Reference Model: generic information model shared by the ecosystem
• The Domain Model: definition of constraints on the Reference Model for each medical concept
Multilevel Model | openEHR | MLHIM
Domain Model | Archetype | Concept Constraint Definition (CCD)
Language | ADL | XML Schema 1.1
Number of Domain Models per concept | 1 | n
Governance | Top-down, consensus | Bottom-up, merit
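A minimal sketch of the two levels may help here. It assumes a generic quantity type standing in for the Reference Model and a constraint object standing in for a Domain Model. The class names, units string and numeric ranges are hypothetical illustrations, not MLHIM or openEHR definitions and not clinical guidance.

```python
from dataclasses import dataclass
from typing import Optional


# Level 1: reference model. Generic, knows nothing about any clinical concept.
@dataclass(frozen=True)
class Quantity:
    magnitude: float
    units: str


# Level 2: domain model. Constrains the generic type for one concept.
@dataclass(frozen=True)
class QuantityConstraint:
    label: str
    units: str
    minimum: Optional[float] = None
    maximum: Optional[float] = None

    def accepts(self, value: Quantity) -> bool:
        """Return True when the generic value satisfies this constraint."""
        if value.units != self.units:
            return False
        if self.minimum is not None and value.magnitude < self.minimum:
            return False
        if self.maximum is not None and value.magnitude > self.maximum:
            return False
        return True


# Several independent definitions of one concept can coexist (n per concept).
systolic_a = QuantityConstraint("Systolic BP, definition A", "mm[Hg]", 0, 300)
systolic_b = QuantityConstraint("Systolic BP, definition B", "mm[Hg]", 0, 150)

reading = Quantity(180.0, "mm[Hg]")
print(systolic_a.accepts(reading))  # True
print(systolic_b.accepts(reading))  # False
```

The application code only depends on `Quantity`; each concept definition lives outside the application and can be exchanged on its own.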
22. CONCEPT CONSTRAINT DEFINITION (CCD)
In MLHIM, CCDs are XML Schemas that define constraints on the Reference Model in order to model clinical concepts
CCDs can be validated against the corresponding MLHIM Reference Model by third-party applications
The CCD schema informs the application developer of the structure of a valid data instance for each concept modeled for that system
If the CCD is made public, any recipient of a data instance coming from this application can store, validate, query and otherwise process that data instance
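To illustrate how a recipient could use a published schema, the sketch below reads a numeric range out of an XML Schema restriction and checks a data instance against it. The schema fragment and the element name `el-magnitude` are hypothetical and are not taken from the MLHIM Reference Model. The code handles only this one pattern and is not a general XML Schema validator; a real system would use a schema-aware validator.

```python
import xml.etree.ElementTree as ET
from typing import Tuple

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical CCD-like fragment (assumption, not an actual MLHIM CCD).
CCD_FRAGMENT = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="el-magnitude">
    <xs:simpleType>
      <xs:restriction base="xs:decimal">
        <xs:minInclusive value="0"/>
        <xs:maxInclusive value="300"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
</xs:schema>
"""


def read_range(schema_text: str, element_name: str) -> Tuple[float, float]:
    """Extract the inclusive bounds declared for one element in the schema."""
    root = ET.fromstring(schema_text)
    for element in root.iter(f"{XS}element"):
        if element.get("name") == element_name:
            restriction = element.find(f"{XS}simpleType/{XS}restriction")
            low = restriction.find(f"{XS}minInclusive").get("value")
            high = restriction.find(f"{XS}maxInclusive").get("value")
            return float(low), float(high)
    raise KeyError(element_name)


def is_valid(instance_text: str, schema_text: str, element_name: str) -> bool:
    """Check that the instance is the named element and its value is in range."""
    instance = ET.fromstring(instance_text)
    if instance.tag != element_name:
        return False
    low, high = read_range(schema_text, element_name)
    return low <= float(instance.text) <= high


print(is_valid("<el-magnitude>120</el-magnitude>", CCD_FRAGMENT, "el-magnitude"))  # True
print(is_valid("<el-magnitude>450</el-magnitude>", CCD_FRAGMENT, "el-magnitude"))  # False
```

The receiving side needs nothing from the sending application except the instance and the public schema.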
23. CCD HIGH LEVEL STRUCTURE
CCD
  Care, Demographic or AdminEntry
    Cluster
      DvAdapter (or Cluster)
        DataType
25. MLHIM ELEMENTS: PRINCIPLES
The elements of a CCD do not carry any semantics
Since element names are structural identifiers, this is in keeping with the best practices of
healthcare knowledge artifact identifiers, as first proposed by Dr. James Cimino (circa 1988)
Characteristic #3 - Dumb Identifiers
An identifier itself should not have meaning. If an identifier is comprised of other identifiers that have
been combined, then the composite identifier is inherently unstable. If the circumstances that related the
composite identifiers together in the first place change, the resulting identifier must also change.
26. MLHIM CCDS: TECHNICAL ASPECTS
CCDs are the equivalent of an archetype in CEN13606 and openEHR
• They may be defined at any level, for any application use
• complexType definitions may be reused in multiple CCDs
• CCDs persist for all time and are not versioned; this is essential for data integrity across time
• All element names are unique identifiers (Type 4 UUIDs)
With the exceptions:
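Leaving those exceptions aside, the UUID-based naming can be sketched as follows. An XML name cannot begin with a digit, so the sketch places a fixed prefix before the UUID. The `el-` prefix and both function names are assumptions made for illustration, not a statement of the MLHIM naming rules.

```python
import uuid

# Assumption: a fixed prefix keeps the name a legal XML name even when the
# UUID starts with a digit. The "el-" value is illustrative only.
PREFIX = "el-"


def new_element_name() -> str:
    """Create a meaningless, globally unique element name from a type 4 UUID."""
    return f"{PREFIX}{uuid.uuid4()}"


def is_type4_name(name: str) -> bool:
    """Return True when the name is the prefix followed by a version 4 UUID."""
    if not name.startswith(PREFIX):
        return False
    try:
        parsed = uuid.UUID(name[len(PREFIX):])
    except ValueError:
        return False
    return parsed.version == 4


name = new_element_name()
print(name)
print(is_type4_name(name))                 # True
print(is_type4_name("el-blood-pressure"))  # False: carries meaning, not a UUID
```

Because the name is random, it says nothing about the concept it identifies, which is the "dumb identifier" property described on the previous slide.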
27. CCD GOVERNANCE MODEL
Artifact governance in MLHIM consists of maintaining a copy of the CCDs and Reference Models
• This can be on the web at the specified location, or kept locally and referenced using the standard XML Catalog tools
Because of the naming conventions, changes to the MLHIM reference model do not impact previously defined CCDs or data
• This maintains accurate semantics for all time
29. MLHIM REFERENCE MODEL
The release version is available at www.mlhim.org
The development version is available at www.github.com/mlhim
30. CCD GENERATOR (CCD-GEN)
CCD editor maintained by the MLHIM Laboratory at www.ccdgen.com
Produces CCDs according to the corresponding MLHIM Reference Model
CCDs are automatically validated
Other products include:
• A sample data instance
• JSON serialization of the data instance
• A sample HTML form
• Modules for the R programming language to pull MLHIM data into R data frames for processing and analysis
31. OTHER MLHIM TOOLS
MLHIM Application Platform & Learning Environment (MAPLE)
• An MLHIM repository using an SQL DB for persistence, with a browser and a REST interface
MLHIM XML Instance Converter (MXIC)
• Utility to convert MLHIM CCD XML instances to use shortuuids, and to convert them to JSON and back again to XML
• It is intended to demonstrate how mobile apps can pass smaller data files over the wire to an API that expects these formats and can convert them back to full XML instances for validation
Form2CCD
• Web application to build a form and create a CCD from it (work in progress)
Constraint Definition Designer (CDD)
• FOSS CCD editor (work in progress)
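The idea of shortening UUIDs for transport and restoring them afterwards can be sketched with the Python standard library. The encoding MXIC actually uses is not described here, so this example uses URL-safe Base64 as a stand-in; the function names are hypothetical.

```python
import base64
import uuid


def shorten(value: uuid.UUID) -> str:
    """Encode the 16 UUID bytes as 22 URL-safe characters (padding removed)."""
    return base64.urlsafe_b64encode(value.bytes).rstrip(b"=").decode("ascii")


def expand(short: str) -> uuid.UUID:
    """Restore the full UUID from its shortened form."""
    padded = short + "=" * (-len(short) % 4)  # put back the stripped padding
    return uuid.UUID(bytes=base64.urlsafe_b64decode(padded))


original = uuid.uuid4()
short = shorten(original)

print(len(str(original)), len(short))  # prints: 36 22
print(expand(short) == original)       # True
```

The conversion is lossless, so a receiving API can expand every identifier and validate the full XML instance against its CCD.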
33. MLHIM IS BIG DATA READY
MLHIM uses standard XML technologies and embedded RDF to define the syntax and semantics
The semantics are in the CCD and can be easily exchanged or referenced via the web
The embedded RDF can be queried, analyzed and linked using standard tools
MLHIM data can be stored in SQL or NoSQL databases
• Examples are on GitHub for eXist-DB (XML) and SQLite3 (these can easily be ported to PostgreSQL, MySQL, Oracle, etc.)
• We also have experience with MLHIM data in a MarkLogic NoSQL cloud cluster environment
• In addition to native XML DBs, the small, document-oriented nature of MLHIM data is a perfect fit for document databases such as MongoDB and CouchDB
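As a sketch of the SQL option, the example below stores one small XML instance as text in an in-memory SQLite database and reads it back. It is not the code from the GitHub examples mentioned above; the table layout, the `ccd-example` identifier and the instance content are assumptions made for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical data instance (assumption, not generated from a real CCD).
INSTANCE = (
    "<el-example>"
    "<el-magnitude>120</el-magnitude>"
    "<el-units>mm[Hg]</el-units>"
    "</el-example>"
)

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE instances ("
    "id INTEGER PRIMARY KEY, ccd_id TEXT NOT NULL, xml TEXT NOT NULL)"
)

# Each instance is stored whole, next to the identifier of the CCD it follows.
connection.execute(
    "INSERT INTO instances (ccd_id, xml) VALUES (?, ?)",
    ("ccd-example", INSTANCE),
)
connection.commit()

row = connection.execute(
    "SELECT xml FROM instances WHERE ccd_id = ?", ("ccd-example",)
).fetchone()

restored = ET.fromstring(row[0])
print(restored.find("el-magnitude").text)  # prints 120
```

Keeping the document intact and indexing it by CCD identifier is the same shape of record a document database would hold, which is why the two storage styles are interchangeable here.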
MLHIM XML data can easily be round-trip converted to JSON for permanent storage and/or as an exchange serialization via REST APIs
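A round trip of this kind can be sketched with the standard library alone. The JSON layout below (tag, attributes, text, children) is one assumed mapping chosen for simplicity, not the serialization MLHIM tools produce, and it ignores mixed content and namespaces.

```python
import json
import xml.etree.ElementTree as ET


def to_dict(element: ET.Element) -> dict:
    """Map an element to a plain dictionary, recursing into its children."""
    return {
        "tag": element.tag,
        "attrib": dict(element.attrib),
        "text": element.text,
        "children": [to_dict(child) for child in element],
    }


def from_dict(node: dict) -> ET.Element:
    """Rebuild an element from the dictionary produced by to_dict."""
    element = ET.Element(node["tag"], node["attrib"])
    element.text = node["text"]
    for child in node["children"]:
        element.append(from_dict(child))
    return element


def xml_to_json(xml_text: str) -> str:
    return json.dumps(to_dict(ET.fromstring(xml_text)))


def json_to_xml(json_text: str) -> str:
    return ET.tostring(from_dict(json.loads(json_text)), encoding="unicode")


# Hypothetical instance (assumption, not generated from a real CCD).
SOURCE = '<el-example id="1"><el-magnitude>120</el-magnitude></el-example>'

as_json = xml_to_json(SOURCE)
print(as_json)
print(json_to_xml(as_json) == SOURCE)  # True
```

Because the XML can be rebuilt exactly, an instance received as JSON can still be validated against its CCD after conversion.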
34. OUR VISION OF THE FUTURE
There is already a sense inside the healthcare IT world that conventional EMRs are inadequate for collecting reliable data at the point of care
The real Big Data in healthcare will come from purpose-specific applications modeled by the domain experts
The hardware platform of choice for those apps is mobile computing
The other source of Big Data in healthcare will be the Internet of Things
All data that is MLHIM compliant will participate in a semantically interoperable health information ecosystem