Presentation on the relevance of big data analytics to scholarly publishers, given at the annual AAP/PSP conference. It focuses on the "product side" of big data, examines how advances in new models for evaluating medical evidence will affect medical publishers, and offers recommendations on how to prepare for new developments in data-driven, evidence-based medicine.
1. BIG DATA:
HELPING SCHOLARLY PUBLISHERS
CUT THROUGH THE HYPE
Janice McCallum
Health Content Advisors
Association of American Publishers
2013 PSP Annual Conference, Washington, DC
February 6-8, 2013
2. Focus for this talk
Is Big Data Overhyped?
Let’s start with some definitions
Relevance to scholarly publishers
What should you be doing to profit from--and avoid being
disrupted by--Big Data upstarts?
5. Other Headline Grabbers from Retail
and Financial Services
Analyzing customer behavior to predict purchasing or payment
patterns:
• Target recognizing likely pregnancy of shopper
• Credit card companies knowing when you’re likely to be late with
payments
• Gathering real-time behavior data (EyeSee mannequins)
6. There is More to Big Data than Watson… Is It All Hype?
Comprehensive List of Big Data Statistics
(“Comprehensive” should be in quotes; sources and magnitudes are constantly changing.)
7. What’s the Big Deal with Big Data?
Growth in computing power/reduction in cost: Moore’s Law.
• Why Big Data is So Big
New Sources of Data
• Mobile devices and sensor data hugely expand the amount of data generated.
• Social media: Facebook and Twitter
But keep in mind:
What Executives Don’t Understand About Big Data
• Schrage: The question shouldn’t be “how do we get more data?”; it should be “what marriage of data and algorithms” achieves our desired outcome.
The value of Big Data lies in unlocking patterns and insights that would not have been possible without the combination of computing power, tools, and data.
Alternative definition for Big Data:
Advanced Analytics for Complex Problem Solving
9. The 4 Vs of Big Data
• Volume: Large datasets, too big for standard enterprise database applications.
• Variety: Combining structured and “unstructured” data sources. Think “union” of different sources, not big vs. small or structured vs. unstructured.
• Velocity: Big data systems can integrate and process near real-time data from mobile devices and sensors.
• Veracity: Data management best practices still matter. It’s not about trading off size vs. quality; it’s about combining the best of both worlds.
Big Data is an umbrella term that can encompass infinite use cases. The ability to incorporate large, diverse data sources into an analytic model is paramount.
“Controlling data volume, velocity, and variety… to improve internal and external collaboration.”
--Doug Laney, 2001
10. RELEVANCE OF BIG DATA TO
SCHOLARLY PUBLISHERS
With Examples from Medical Publishing
11. Expectations of Researchers
Have Changed
Scientific and Medical Researchers Need:
Easier, faster access to data sets;
Ability to trace data provenance;
Central repositories or better discovery
options for data sets;
Business models for accessing, sharing,
and adding value to the base of
knowledge.
“Raw Data, Now!”
--Tim Berners-Lee, 2009
“[Biomedicine is] going to have to become more dynamic, more computational.”
--Stephen Wolfram, 2006
12. Expectations of Clinicians Have Changed
“…the problem is no longer getting access to
data, whether it's a genome sequence or
whether it's a glucose sensor, but how do you
process that data in an efficient way…”
--Eric Topol, 2012
13. Partial Set of Data Sources in Medical Research
• Clinical research
• Patient registries/outcomes data
• Rx data
• Sensor data/exercise tracking
• OTC & food purchases
• Disease registries
• Genomic data
Almost all medical research currently occurs on data types displayed above the fold.
14. Big Data Uses in Healthcare
Are you prepared to play in this fast-growing, fast-changing segment?
15. Getting Started in Big Data
First: recognize that you are all data publishers. If
the content is digital, it’s data.
Create standard formats for data sets that are
submitted with articles.
Plan for collaborating with Big Data analytics
companies.
Develop expertise in new, more complex models of
medical evidence.
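The second recommendation above, standard formats for data sets submitted with articles, can be made concrete with a small sketch. The record below is a hypothetical, minimal metadata format; every field name here is an illustrative assumption, not an existing standard.

```python
# A hypothetical, minimal metadata record for a data set submitted with an
# article. Field names are illustrative assumptions, not any real standard.
record = {
    "dataset_id": "ds-0001",
    "title": "Glucose sensor readings, trial cohort A",
    "article_doi": "10.1000/example.doi",   # placeholder DOI
    "format": "CSV",
    "license": "CC-BY-4.0",
    "provenance": ["collected 2012", "cleaned per protocol v2"],
}

# Fields a submission system might require before accepting a data set.
REQUIRED = {"dataset_id", "title", "article_doi", "format", "license"}

def is_valid(rec):
    """Check that a record carries every required field."""
    return REQUIRED.issubset(rec)

print(is_valid(record))            # True
print(is_valid({"title": "x"}))    # False: missing required fields
```

A standard like this is what makes data sets discoverable and traceable (provenance) across publishers, which is exactly what the researcher expectations on the previous slide call for.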
16. Accelerated Pace of Data Flows
Layers of the evidence base (top to bottom):
• Evidence base
• Software, models
• Analysis, insights
• Data sets, registries, directories
• Curated news, textual content, summaries
Big Data isn’t about structured vs. unstructured data. It’s about building upon the existing base of knowledge, with the ability to constantly update the evidence base with new data that either reinforce or replace currently accepted knowledge.
A strong foundation remains essential and requires multi-directional data flows.
17. What Role Will Your Organization Play in the Big Data Era? Some possibilities:
• Evidence base ← Disseminate latest evidence-based guidelines
• Software, models ← Provide software platform that incorporates latest algorithms and integrates data
• Analysis, insights ← Employ analysts, data scientists, researchers to conduct studies and report results
• Data sets, registries, directories ← Provide master data management services; become clearinghouse for data exchange
• Curated news, textual content, summaries ← Publish curated scientific research results
Need I say more? For this audience in particular, the NLP orientation of IBM Watson and its experiments in the healthcare field with WellPoint and Mt. Sinai Hospital in NY speak volumes.
Watson mines scholarly articles as its primary base of evidence and assigns probabilities to query results.
Note my use of the word “likely”. Predictions are based on probabilities.
Big Data systems are “learning systems”. See Hilary Mason’s (CTO of Bitly) talk on machine learning:
http://www.hilarymason.com/presentations-2/devs-love-bacon-everything-you-need-to-know-about-machine-learning-in-30-minutes-or-less/
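The “learning system” idea can be shown in miniature: an estimate that is refined incrementally as each new observation arrives, rather than recomputed from scratch. This is a deliberately minimal sketch of my own, not anything from Mason’s talk; real machine-learning systems are far richer.

```python
# A minimal "learning system": fold each new observation into a running
# estimate as it arrives. (Illustrative sketch only.)

def update(estimate, n, observation):
    """Incrementally fold one new observation into a running mean."""
    n += 1
    estimate += (observation - estimate) / n
    return estimate, n

estimate, n = 0.0, 0
# e.g. a stream of treatment outcomes: success (1) or failure (0)
for outcome in [1, 0, 1, 1]:
    estimate, n = update(estimate, n, outcome)

print(estimate)  # 0.75 -- the system's current probability of success
```

The point for publishers: the estimate is never “final”; each new data point reinforces or revises it, just as new evidence reinforces or replaces currently accepted medical knowledge.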
Let’s move on to some less dramatic material.
1) Growth in information – and availability of new sources of information from devices/sensors. Everything’s being digitized; everything digital can be tracked; mobile devices and sensors are connecting physical & digital worlds (and creating huge amounts of new data).
Terabytes, petabytes, exabytes, zettabytes, yottabytes. You’ve heard the terms and get the gist.
Couple of questions to the audience:
How many of you consider yourselves data publishers?
How many have any idea what Hadoop and MapReduce are?
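For those who haven’t met Hadoop and MapReduce: the core idea is that a “map” step emits key/value pairs, the framework groups the pairs by key, and a “reduce” step combines each group. The canonical word-count example can be sketched on a single machine; the function names below are illustrative, and real Hadoop distributes these same steps across a cluster.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values into a final count."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data is big", "data about data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"])   # 2
print(counts["data"])  # 3
```

The same three-step shape scales from two toy strings to petabytes of text, which is why it became the workhorse pattern for large-scale analytics.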
Picture of the Sequoia supercomputer at Lawrence Livermore National Laboratory (DOE).
“We have gone through about 32 doublings in computer power” since the start of the computer age, Brynjolfsson points out. That is the first half of the chessboard. The second half, “the next decade of exponential growth, is going to be far more impactful. … The ability to take advantage of this explosion in computer power and storage capacity is going to be key to business success.”
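The chessboard metaphor is easy to verify with back-of-the-envelope arithmetic: 32 doublings is a growth factor of 2^32, and each further doubling adds as much capacity as all the previous doublings combined.

```python
# Back-of-the-envelope arithmetic for "32 doublings in computer power".
doublings = 32
growth_factor = 2 ** doublings
print(growth_factor)  # 4294967296: roughly a 4.3 billion-fold increase

# The "second half of the chessboard" effect: the 33rd doubling alone
# adds as much capacity as the first 32 doublings combined.
next_doubling_gain = 2 ** 33 - 2 ** 32
print(next_doubling_gain == growth_factor)  # True
```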
2) It’s not really about big datasets; it’s about the tools that have been developed to manage & mine very large datasets and integrate different types of data to extract information that wouldn’t have been possible when data were kept in silos.
3) Think “Advanced Analytics for Complex Problem Solving” as a better term for “Big Data”. The really big computational tools may be the latest and greatest, but it’s the union of all of the data management and analytics tools that we have that makes Big Data a Big Deal.
Before proceeding, let’s step back and define the term “big data”. The most common definition, which includes the 3 (or 4) Vs, has its origins in a report written by Gartner analyst Doug Laney (then at Meta Group).
Doug Laney’s original article describing the 3 “Vs” from 2001: http://blogs.gartner.com/doug-laney/deja-vvvue-others-claiming-gartners-volume-velocity-variety-construct-for-big-data/
It’s important to emphasize that size, or “volume”, isn’t the only defining characteristic of big data. In my opinion, the combination of volume and variety is a necessary condition for earning the “big data” moniker. The ability to handle high-velocity data can be important, depending on the use case.
Veracity, the term I’ve adopted from others to represent good data hygiene and data management practices, is of fundamental importance. Without quality data—and a good understanding of the limitations of your data—you won’t get good predictive capabilities.
Tim Berners-Lee talks frequently about the importance of raw data and linked data. Data in silos isn’t going to produce the progress we need.
Stephen Wolfram, the founder of Mathematica and well known for founding Wolfram|Alpha, the computational search engine, is now focusing on biomedicine and plans to introduce a medical version of Wolfram|Alpha.
Eric Topol, one of the leading voices for rapid change in medical research to facilitate much faster “bench to bedside” dissemination of new discoveries. Also, “personalized medicine” requires a big data model.
And, I haven’t even included payment data, which is used heavily now, because other sources, especially outcomes data, are not widely available. Nor did I include social media.
If you’re not going to become an analytics company, you are going to have to learn how to license your content (data) to analytics companies. What terms will be acceptable?
How many in the room are working with IBM Watson on a pilot basis? How many have a clear plan for commercial licensing terms for IBM Watson?
Wrapping up, this diagram illustrates the data and analysis that underlie the “evidence base” in medicine—in a simplified manner.
In current era, the bottom layer continues to be the dominant means for disseminating new discoveries.
In the future, more and more discoveries will enter the evidence base directly via machine learning. But, we have a ways to go before we have figured out the right data sources and algorithms so that we can safely apply Big Data to medicine.
Just some possibilities listed here. Let’s start at the bottom. I think you all understand the bottom role.
In my view, there will always be a role for specialty publishers to disseminate information about recent scientific discoveries to interested audiences.
However, I predict that the scientists involved in research will be sharing information in different forms in the future. For instance, new algorithms will be shared. More direct means of sharing new discoveries will be employed. New discoveries will go directly into the models and be reflected in the updated evidence base.
Okay, enough prognosticating from me. I’ve covered a lot of material and hope that I’ve left you with some greater insight into “Big Data” and what it means for scholarly publishers.
I’m happy to take a few questions now. You can find my contact information on the next page.