How LinkedIn Uses Apache Samza
Apache Samza is a stream processor LinkedIn recently open-sourced. In his presentation, Samza: Real-time Stream Processing at LinkedIn, Chris Riccomini discusses Samza’s feature set, how Samza integrates with YARN and Kafka, how it’s used at LinkedIn, and what’s next on the roadmap. PAGE 17
Hadoop
eMag Issue 13 - May 2014
FACILITATING THE SPREAD OF KNOWLEDGE AND INNOVATION IN PROFESSIONAL SOFTWARE DEVELOPMENT
INTRODUCTION P. 3
BUILDING APPLICATIONS WITH HADOOP P. 5
WHAT IS APACHE TEZ? P. 9
MODERN HEALTHCARE ARCHITECTURES BUILT WITH HADOOP P. 14
Contents
Introduction	 Page 3
Apache Hadoop is an open-source framework that runs applications on large clustered
hardware (servers). It is designed to scale from a single server to thousands of machines, with
a very high degree of fault tolerance.
Building Applications With Hadoop 	 Page 5
When building applications using Hadoop, it is common to have input data from various
sources coming in various formats. In his presentation, “New Tools for Building Applications
on Apache Hadoop”, Eli Collins overviews how to build better products with Hadoop and
various tools that can help, such as Apache Avro, Apache Crunch, Cloudera ML and the
Cloudera Development Kit.
What is Apache Tez? 	 Page 9
Apache Tez is a new distributed execution framework that is targeted towards data-
processing applications on Hadoop. But what exactly is it? How does it work? In the
presentation, “Apache Tez: Accelerating Hadoop Query Processing”, Bikas Saha and Arun
Murthy discuss Tez’s design, highlight some of its features and share initial results obtained
by making Hive use Tez instead of MapReduce.
Modern Healthcare Architectures Built with Hadoop 	 Page 14
This article explores some specific use cases where Hadoop can play a major role in the health
care industry, as well as a possible reference architecture.
How LinkedIn Uses Apache Samza 	 Page 17
Apache Samza is a stream processor LinkedIn recently open-sourced. In his presentation,
Samza: Real-time Stream Processing at LinkedIn, Chris Riccomini discusses Samza’s feature
set, how Samza integrates with YARN and Kafka, how it’s used at LinkedIn, and what’s next
on the roadmap.
Introduction
by Roopesh Shenoy and Boris Lublinsky
content extracted from this InfoQ news post
We are living in the era of “big data”. With today’s technology powering increases
in computing power, electronic devices, and accessibility to the Internet, more data
than ever is being transmitted and collected. Organizations are producing data
at an astounding rate. Facebook alone collects 250 terabytes a day. According to
Thomson Reuters News Analytics, digital data production has more than doubled
from almost one zettabyte (a zettabyte is equal to 1 million petabytes) in 2009 and
is expected to reach 7.9 zettabytes in 2015, and 35 zettabytes in 2020.
As organizations have begun collecting and
producing massive amounts of data, they have
started to recognize the advantages of data analysis,
but they are also struggling to manage the massive
amounts of information they have. According to
Alistair Croll, “Companies that have massive amounts
of data without massive amounts of clue are going to
be displaced by startups that have less data but more
clue.…”
Unless your business understands the data it has,
it will not be able to compete with businesses that
do. Businesses realize that there are tremendous
benefits to be gained in analyzing big data related
to business competition, situational awareness,
productivity, science, and innovation – and most see
Hadoop as a main tool for analyzing their massive
amounts of information and mastering the big-data
challenges.
Apache Hadoop is an open-source framework that
runs applications on large clustered hardware
(servers). It is designed to scale from a single server
to thousands of machines, with a very high degree
of fault tolerance. Rather than relying on high-end
hardware, the reliability of these clusters comes from
the software’s ability to detect and handle failures of
its own.
According to a Hortonworks survey, many
large, mainstream organizations (50% of survey
respondents were from organizations with over
$500M in revenues) currently deploy Hadoop
across many industries including high-tech,
healthcare, retail, financial services, government, and
manufacturing.
In the majority of cases, Hadoop does not replace
existing data-processing systems but rather
complements them. It is typically used to supplement
existing systems to tap into additional business data
and a more powerful analytics system in order to
get a competitive advantage through better insights
into business information. Some 54% of respondents
are utilizing Hadoop to capture new types
of data, while 48% are planning to do the
same. The main new data types include the
following:
•	 Server logs data enabling IT
departments to better manage their
infrastructure (64% of respondents are
already doing it, while 28% plan to).
•	 Clickstream data enabling better
understanding of how customers
are using applications (52.3% of
respondents are already doing it, while
37.4% plan to).
•	 Social-media data enabling
understanding of the public’s
perception of the company (36.5% of
respondents are already doing it, while
32.5% plan to).
•	 Geolocation data enabling analysis of
travel patterns (30.8% of respondents
are already doing it, while 26.8% plan
to).
•	 Machine data enabling analysis of
machine usage (29.3% of respondents
are already doing it, while 33.3% plan
to).
According to the survey, traditional data
grows at an average rate of about 8%
a year but new data types are growing at
a rate exceeding 85% and, as a result, it is
virtually impossible to collect and process it
without Hadoop.
While Version 1 of Hadoop came with
the MapReduce processing model, the
recently released Hadoop Version 2
comes with YARN, which separates cluster
management from the MapReduce job
manager, making the Hadoop architecture
more modular. One of the side-effects is
that MapReduce is now just one of the ways
to leverage a Hadoop cluster; a number of
new projects such as Tez and Stinger can
also process data in the Hadoop Distributed
File System (HDFS) without the constraint
of the batch-job execution model of
MapReduce.
Building Applications With Hadoop
When building applications using Hadoop, it is common to have input data from
various sources coming in various formats. In his presentation, “New Tools for
Building Applications on Apache Hadoop”, Eli Collins, tech lead for Cloudera’s
Platform Team, overviews how to build better products with Hadoop and various
tools that can help, such as Apache Avro, Apache Crunch, Cloudera ML and the
Cloudera Development Kit.

Presentation transcript edited by Roopesh Shenoy
Avro
Avro is a project for data serialization; it is
similar to Thrift or Protocol Buffers. It’s expressive:
you can deal in terms of records, arrays, unions, and
enums. It’s efficient, with a compact binary
representation. One of the benefits of logging in
Avro is that you get much smaller data files. All the
traditional aspects of Hadoop data formats, like
compressible or splittable data, are true of Avro.
One of the reasons Doug Cutting (founder of the
Hadoop project) created the Avro project was that
a lot of the formats in Hadoop were Java only. It’s
important for Avro to be interoperable with a lot of
different languages like Java, C, C++, C#, Python,
Ruby, etc. – and to be usable by a lot of tools.
One of the goals for Avro is a set of formats and
serialization that’s usable throughout the data
platform that you’re using, not just in a subset of
the components. So MapReduce, Pig, Hive, Crunch,
Flume, Sqoop, etc. all support Avro.
Avro is dynamic and one of its neat features is that
you can read and write data without generating any
code. It will use reflection and look at the schema
that you’ve given it to create classes on the fly. That’s
called Avro-generic formats. You can also specify
formats for which Avro will generate optimal code.
Avro was designed with the expectation that you would
change your schema over time. That’s an important
attribute in a big-data system because you generate
lots of data, and you don’t want to constantly
reprocess it. You’re going to generate data at one
time and have tools process that data maybe two,
three, or four years down the line. Avro has the
ability to negotiate differences between schemata so
that new tools can read old data and vice versa.
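To make the generic (no code generation) path and this schema negotiation concrete, here is a minimal, self-contained sketch using the Avro Java API; the PageView record, its fields, and the added referrer field are invented for the example.

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroEvolutionSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical "old" writer schema and "new" reader schema; the new field has a default.
    Schema writerSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"PageView\",\"fields\":["
        + "{\"name\":\"url\",\"type\":\"string\"}]}");
    Schema readerSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"PageView\",\"fields\":["
        + "{\"name\":\"url\",\"type\":\"string\"},"
        + "{\"name\":\"referrer\",\"type\":\"string\",\"default\":\"unknown\"}]}");

    // Write a record generically, with no generated classes.
    GenericRecord oldRecord = new GenericData.Record(writerSchema);
    oldRecord.put("url", "/home");
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(writerSchema).write(oldRecord, encoder);
    encoder.flush();

    // Read the old bytes with the new schema; the missing field is filled with its default.
    GenericDatumReader<GenericRecord> reader =
        new GenericDatumReader<GenericRecord>(writerSchema, readerSchema);
    GenericRecord upgraded =
        reader.read(null, DecoderFactory.get().binaryDecoder(out.toByteArray(), null));
    System.out.println(upgraded); // {"url": "/home", "referrer": "unknown"}
  }
}

Because the reader schema carries a default for the new field, tools built against the newer schema can still consume data written years earlier, which is the property described above.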
Avro forms an important basis for the following
projects.
Crunch
You’re probably familiar with Pig and Hive and how
to process data with them and integrate valuable
tools. However, not all data formats that you use will
fit Pig and Hive.
Pig and Hive are great for a lot of logged data or
relational data, but other data types don’t fit as well.
You can still process poorly fitting data with Pig and
Hive, which don’t force you to a relational model or a
log structure, but you have to do a lot of work around
it. You might find yourself writing unwieldy user-
defined functions or doing things that are not natural
in the language. People sometimes just give up and
start writing raw Java MapReduce programs because
that’s easier.
Crunch was created to fill this gap. It’s a higher-level
API than MapReduce. It’s in Java. It’s lower level
than, say, Pig, Hive, Cascading, or other frameworks
you might be used to. It’s based on a paper that
Google published called FlumeJava. It’s a very
similar API. Crunch has you combine a small number
of primitives with a small number of types, effectively
allowing the user to create really lightweight
UDFs, which are just Java methods and classes, to
create complex data pipelines.
Crunch has a number of advantages.
•	 It’s just Java. You have access to a full
programming language.
•	 You don’t have to learn Pig.
•	 The type system is well-integrated. You can use
Java POJOs, but there’s also native support for
Hadoop Writables and Avro. There’s no impedance
mismatch between the Java code you’re writing
and the data that you’re analyzing.
•	 It’s built as a modular library for reuse. You can
capture your pipelines in Crunch code in Java and
then combine them with an arbitrary machine-learning
program later, so that someone else can reuse
that algorithm.
The fundamental structure is a parallel collection:
a distributed, unordered collection of elements.
This collection has a parallel do operator which you
can imagine turns into a MapReduce job. So if you
had a bunch of data that you want to operate in
parallel, you can use a parallel collection.
And there’s something called the parallel table,
which is a subinterface of the collection, and it’s
a distributed, sorted map. It also has a group-by
operator you can use to aggregate all the values for
a given key. We’ll go through an example that shows
how that works.
Finally, there’s a pipeline class and pipelines are really
for coordinating the execution of the MapReduce
jobs that will actually do the back-end processing for
this Crunch program.
Let’s take an example for which you’ve probably seen all
the Java code before, word count, and see what it looks
like in Crunch.
Crunch – word count

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.lib.Aggregate;
import org.apache.crunch.types.writable.Writables;

public class WordCount {
  public static void main(String[] args) throws Exception {
    Pipeline pipeline = new MRPipeline(WordCount.class);
    // A parallel collection of all the lines in the input file.
    PCollection<String> lines = pipeline.readTextFile(args[0]);
    // Split each line into words with a lightweight DoFn.
    PCollection<String> words = lines.parallelDo("my splitter", new DoFn<String, String>() {
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());
    // Aggregate the count for each word and write the results out.
    PTable<String, Long> counts = Aggregate.count(words);
    pipeline.writeTextFile(counts, args[1]);
    // Nothing executes until the pipeline is run.
    pipeline.run();
  }
}
It’s a lot smaller and simpler. The first line creates
a pipeline. We create a parallel collection of all the
lines from a given file by using the pipeline class. And
then we get a collection of words by running the
parallel do operator on these lines.
We’ve defined an anonymous function here that
processes the input: it splits each line into words
and emits each word from the map task.
Finally, we want to aggregate the counts for each
word and write them out. There’s a line at the
bottom, pipeline.run(). Crunch’s planner does lazy
evaluation; it won’t create and run the
MapReduce jobs until we’ve gotten a full pipeline
together.
If you’re used to programming Java and you’ve seen
the Hadoop examples for writing word count in Java,
you can tell that this is a more natural way to express
that. This is among the simplest pipelines you can
create, and you can imagine you can do many more
complicated things.
If you want to go even one step easier than this,
there’s a wrapper for Scala. This is a very similar idea to
Cascade, which was built on Google FlumeJava. Since
Scala runs on the JVM, it’s an obvious natural fit.
Scala’s type inference actually ends up being really
powerful in the context of Crunch.
The listing below is the same
program written in Scala. We have the pipeline and
we can use Scala’s built-in functions that map really
nicely to Crunch – so word count becomes a one-line
program. It’s pretty cool and very powerful if you’re
writing Java code already and want to do complex
pipelines.

Scrunch – Scala wrapper

class WordCountExample {
  val pipeline = new Pipeline[WordCountExample]
  def wordCount(fileName: String) = {
    pipeline.read(from.textFile(fileName))
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(!_.isEmpty())
      .count
  }
}
Cloudera ML
Cloudera ML (machine learning) is an open-source
library and set of tools to help data scientists perform
day-to-day tasks, from data preparation to
model evaluation.
With built-in commands for summarizing, sampling,
normalizing, and pivoting data, Cloudera ML has
recently added a built-in clustering algorithm
for k-means, based on an algorithm that was just
developed a year or two back. There are a couple of
other implementations as well. It’s a home for tools
you can use so you can focus on data analysis and
modeling instead of on building or wrangling the
tools.
It’s built using Crunch. It leverages a lot of existing
projects. For example, the vector formats: a lot of
ML involves transforming raw data that’s in a record
format to vector formats for machine-learning
algorithms. It leverages Mahout’s vector interface
and classes for that purpose. The record format is
just a thin wrapper around Avro and HCatalog record
and schema formats so you can easily integrate with
existing data sources.
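As a rough illustration of that record-to-vector step (this is not Cloudera ML’s actual code, and the record fields are invented), Mahout’s vector classes can be used directly like this:

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class RecordToVectorSketch {
  public static void main(String[] args) {
    // Hypothetical record: (age, height in cm, weight in kg) for one observation.
    double[] fields = {34.0, 178.0, 72.5};

    // Mahout's Vector interface is what the clustering algorithms consume.
    Vector v = new DenseVector(fields);

    // A trivial bit of data preparation: scale features by the largest value.
    Vector scaled = v.divide(v.maxValue());
    System.out.println(scaled);
  }
}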
For more information on Cloudera ML, visit the
project’s GitHub page; there are a bunch of examples
with datasets that can get you started.
Cloudera Development Kit
Like Cloudera ML, the Cloudera Development Kit (CDK) is a set
of open-source libraries and tools that make writing
applications on Hadoop easier. Unlike ML though,
it’s not focused on machine learning and the data-
scientist workflow. It’s directed at developers trying to build
applications on Hadoop. It’s really the plumbing of
a lot of different frameworks and pipelines and the
integration of a lot of different components.
The purpose of the CDK is to provide higher level
APIs on top of the existing Hadoop components
in the CDH stack that codify a lot of patterns in
common use cases.
CDK is prescriptive, has an opinion on the way to
do things, and tries to make it easy for you to do
the right thing by default, but its architecture is a
system of loosely coupled modules. You can use
modules independently of each other. It’s not an
uber-framework that you have to adopt whole. You
can adopt it piecemeal. It doesn’t force you into any
particular programming paradigms. It doesn’t force
you to adopt a ton of dependencies. You can adopt
only the dependencies of the particular modules you
want.
Let’s look at an example. The first module in
CDK is the data module, and the goal of the data
module is to make it easier for you to work with
datasets on Hadoop file systems. There are a lot
of gory details to clean up to make this work in
practice; you have to worry about serialization,
deserialization, compression, partitioning, directory
layout, communicating that directory layout and
partitioning to other people who want to consume
the data, etc.
The CDK data module handles all this for you. It
automatically serializes and deserializes data from
Java POJOs, if that’s what you have, or Avro records
if you use them. It has built-in compression, and
built-in policies around file and directory layouts so
that you don’t have to repeat a lot of these decisions
and you get smart policies out of the box. It will
automatically partition data within those layouts.
It lets you focus on working on a dataset on HDFS
instead of all the implementation details. It also has
plugin providers for existing systems.
Imagine you’re already using Hive and HCatalog
as a metadata repository, and you’ve already got a
schema for what these files look like. CDK integrates
with that. It doesn’t require you to define all of
your metadata for your entire data repository from
scratch. It integrates with existing systems.
You can learn more about the various CDK modules
and how to use them in the documentation.
In summary, working with data from various sources,
preparing and cleansing data and processing them
via Hadoop involves a lot of work. Tools such as
Crunch, Cloudera ML and CDK make it easier to do
this and leverage Hadoop more effectively.
ABOUT THE SPEAKER
Eli Collins is the tech lead for
Cloudera’s Platform team, an active
contributor to Apache Hadoop and
member of its project management
committee (PMC) at the Apache
Software Foundation. Eli holds
Bachelor’s and Master’s degrees in
Computer Science from New York
University and the University of
Wisconsin-Madison, respectively.
WATCH THE FULL
PRESENTATION ON InfoQ
What is Apache Tez?
You might have heard of Apache Tez, a new distributed execution framework that
is targeted towards data-processing applications on Hadoop. But what exactly is
it? How does it work? Who should use it and why? In their presentation, “Apache
Tez: Accelerating Hadoop Query Processing”, Bikas Saha and Arun Murthy discuss
Tez’s design, highlight some of its features and share some of the initial results
obtained by making Hive use Tez instead of MapReduce.
Tez generalizes the MapReduce paradigm to a
more powerful framework based on expressing
computations as a dataflow graph. Tez is not meant
directly for end-users – in fact it enables developers
to build end-user applications with much better
performance and flexibility. Hadoop has traditionally
been a batch-processing platform for large amounts
of data. However, there are a lot of use cases for
near-real-time performance of query processing.
There are also several workloads, such as Machine Learning, which do not fit well into the MapReduce paradigm. Tez helps Hadoop address these use cases.
The Tez project aims to be highly customizable
so that it can meet a broad spectrum of use cases
without forcing people to go out of their way to
make things work; projects such as Hive and Pig are
seeing significant improvements in response times
when they use Tez instead of MapReduce as the
backbone for data processing. Tez is built on top
of YARN, which is the new resource-management
framework for Hadoop.
Presentation transcript edited by Roopesh Shenoy
Design Philosophy
The main reason for Tez to exist is to get around
limitations imposed by MapReduce. Other than
being limited to writing mappers and reducers, there
are other inefficiencies in force-fitting all kinds of
computations into this paradigm – e.g., HDFS
is used to store temporary data between multiple
MR jobs, which is an overhead. (In Hive, this is
common when queries require multiple shuffles on
keys without correlation, such as with join - grp by -
window function - order by.)
The key elements forming the design philosophy
behind Tez are:
•	 Empowering developers (and hence end users) to
do what they want in the most efficient manner
•	 Better execution performance
Some of the things that help Tez achieve these goals
are:
•	 Expressive Dataflow APIs - The Tez team wants
to have an expressive-dataflow-definition API so
that you can describe the Directed Acyclic Graph
(DAG) of computation that you want to run. For
this, Tez has a structural kind of API in which you
add all processors and edges and visualize what
you are actually constructing.
•	 Flexible Input-Processor-Output runtime model
– can construct runtime executors dynamically
by connecting different inputs, processors and
outputs.
•	 Data type agnostic – only concerned with
movement of data, not with the data format (key-
value pairs, tuple oriented formats, etc)
•	 Dynamic Graph Reconfiguration
•	 Simple Deployment – Tez is completely a client-
side application; it leverages YARN local resources
and the distributed cache. There’s no need to deploy
anything on your cluster as far as using Tez is
concerned. You just upload the relevant Tez
libraries to HDFS, then use your Tez client to
submit jobs with those libraries.
You can even have two copies of the libraries on
your cluster. One would be a production copy,
which is the stable version and which all your
production jobs use. Your users can experiment
with a second copy, the latest version of Tez. And
they will not interfere with each other.
Tez can run any MR job without any modification.
This allows for stage-wise migration of tools that
currently depend on MR.
Exploring the Expressive Dataflow APIs in detail –
what can you do with this? For example, instead of using
multiple MapReduce jobs, you can use the MRR
pattern, such that a single map has multiple reduce
stages; this can allow streaming of data from one
processor to another to another, without writing
anything to HDFS (it will be written to disk only for
check-pointing), leading to much better performance.
The diagrams below demonstrate this:
The first diagram demonstrates a process that has
multiple MR jobs, each storing intermediate results
to the HDFS – the reducers of the previous step
feeding the mappers of the next step. The second
diagram shows how with Tez, the same processing
can be done in just one job, with no need to access
HDFS in between.
Tez’s flexibility means that it requires a bit more
effort than MapReduce to start consuming; there’s a
bit more API and a bit more processing logic that you
need to implement. This is fine since it is not an end-
user application like MapReduce; it is designed to let
developers build end-user applications on top of it.
Given that overview of Tez and its broad goals, let’s
try to understand the actual APIs.
Tez API
The Tez API has the following components:
•	 DAG (Directed Acyclic Graph) – defines the
overall job. One DAG object corresponds to one
job
•	 Vertex – defines the user logic along with the
resources and the environment needed to
execute the user logic. One Vertex corresponds
to one step in the job
•	 Edge – defines the connection between producer
and consumer vertices.
Edges need to be assigned properties; these
properties are essential for Tez to be able to
expand that logical graph at runtime into the
physical set of tasks that can be done in parallel
on the cluster. There are several such properties:
•	 The data-movement property defines how
data moves from a producer to a consumer.
•	 Scheduling properties (sequential or
concurrent) help define when the
producer and consumer tasks can be
scheduled relative to each other.
•	 The data-source property (persisted, reliable, or
ephemeral) defines the lifetime or durability
of the output produced by our task so that
we can determine when we can terminate it.
You can view this Hortonworks article to see an
example of the API in action, more detail about these
properties and how the logical graph expands at run-
time.
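As a purely conceptual sketch (these are not the real Tez classes; the names and enums simply mirror the structure described above), a logical DAG with typed edges might be modeled like this:

import java.util.ArrayList;
import java.util.List;

// Conceptual model only: a DAG of vertices connected by edges that carry the
// data-movement, scheduling, and data-source properties described above.
public class LogicalDagSketch {

  enum DataMovement { ONE_TO_ONE, BROADCAST, SCATTER_GATHER }
  enum Scheduling { SEQUENTIAL, CONCURRENT }
  enum DataSource { PERSISTED, RELIABLE, EPHEMERAL }

  static class Vertex {
    final String name; // one step of the job, e.g. a map or reduce stage
    Vertex(String name) { this.name = name; }
  }

  static class Edge {
    final Vertex producer;
    final Vertex consumer;
    final DataMovement movement;
    final Scheduling scheduling;
    final DataSource source;

    Edge(Vertex producer, Vertex consumer,
         DataMovement movement, Scheduling scheduling, DataSource source) {
      this.producer = producer;
      this.consumer = consumer;
      this.movement = movement;
      this.scheduling = scheduling;
      this.source = source;
    }
  }

  public static void main(String[] args) {
    // A map -> reduce -> reduce (MRR) style logical graph.
    Vertex map = new Vertex("Map");
    Vertex reduce1 = new Vertex("Reduce-1");
    Vertex reduce2 = new Vertex("Reduce-2");

    List<Edge> edges = new ArrayList<Edge>();
    edges.add(new Edge(map, reduce1,
        DataMovement.SCATTER_GATHER, Scheduling.SEQUENTIAL, DataSource.PERSISTED));
    edges.add(new Edge(reduce1, reduce2,
        DataMovement.SCATTER_GATHER, Scheduling.SEQUENTIAL, DataSource.PERSISTED));

    System.out.println("Logical DAG with " + edges.size() + " edges defined");
  }
}

In the real API, the DAG, Vertex, and Edge objects play these roles, and Tez expands such a logical graph into the physical set of parallel tasks at runtime using exactly these edge properties.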
The runtime API is based on an input-processor-
output model which allows all inputs and outputs
to be pluggable. To facilitate this, Tez uses an event-
based model in order to communicate between tasks
and the system, and between various components.
Events are used to pass information such as task
failures to the required components, to describe the flow of
data from an Output to an Input (such as the location of the
data it generates), to enable run-time changes to the
DAG execution plan, etc.
Tez also comes with various Input and Output
processors out-of-the-box.
The expressive API allows higher-level language (such as
Hive) writers to elegantly transform their queries
into Tez jobs.
Tez Scheduler
The Tez scheduler considers a lot of things when
deciding on task assignments – task-locality
requirements, compatibility of containers, total
available resources on the cluster, priority of
pending task requests, automatic parallelization,
freeing up resources that the application cannot
use anymore (because the data is not local to it) etc.
It also maintains a connection pool of pre-warmed
JVMs with shared registry objects. The application
can choose to store different kinds of pre-computed
information in those shared registry objects so that
they can be reused without having to recompute
them later on, and this shared set of connections and
container-pool resources can run those tasks very
fast.
You can read more about reusing of containers in
Apache Tez.
Flexibility
Overall, Tez provides a great deal of flexibility for
developers to deal with complex processing logic.
This can be illustrated with one example of how Hive
is able to leverage Tez.
Let’s take this typical TPC-DS query pattern in which
you are joining multiple tables with a fact table. Most
optimizers and query systems can do what is there
in the top-right corner: if the dimension tables are
small, then they can broadcast-join all of them with
the large fact table, and you can do that same thing
on Tez.
But what if these broadcasts have user-defined
functions that are expensive to compute? You may
not be able to do all of that this way. You may have to
break up your tasks into different stages, and that’s
what the left-side topology shows you. The first
dimension table is broadcast-joined with the fact
table. The result is then broadcast-joined with the
second dimension table.
Here, the third dimension table is not broadcastable
because it is too large. You can choose to do a shuffle
join, and Tez can efficiently navigate the topology
without falling over just because you can’t do the
top-right one.
The two benefits for this kind of Hive query with Tez
are:
•	 it gives you full DAG support and does a lot
automatically on the cluster so that it can fully
utilize the parallelism that is available in the
cluster; as already discussed above, this means
there is no need for reading/writing from HDFS
between multiple MR jobs, all the computation
can be done in a single Tez job.
•	 it provides sessions and reusable containers so
that you have low latency and can avoid
recomputation as much as possible.
This particular Hive query is seeing a performance
improvement of more than 100% with the new Tez
engine.
Roadmap
•	 Richer DAG support. For example, can Samza
use Tez as a substrate on which to build the
application? It needs some support in order
for Tez to handle Samza’s core scheduling and
streaming requirements. The Tez team wants
to explore how we would enable those kinds of
connection patterns in our DAGs. They also want
more fault-tolerance support, more efficient data
transfer for further performance optimization,
and improved session performance.
•	 Given that these DAGs can get arbitrarily
complex, we need a lot of automatic tooling to
help the users understand their performance
bottlenecks.
Summary
Tez is a distributed execution framework that works
on computations represented as dataflow graphs. It
maps naturally to higher-level declarative languages
like Hive, Pig, Cascading, etc. It’s designed to have a
highly customizable execution architecture so that
we can make dynamic performance optimizations
at runtime based on real information about the data
and the resources. The framework itself automatically
determines a lot of the hard stuff, allowing it to work
right out-of-the-box.
You get good performance and efficiency out-of-the-
box. Tez aims to address the broad spectrum of use
cases in the data-processing domain in Hadoop,
ranging from latency to complexity of the execution. It
is an open-source project. Tez works, Saha and Murthy
suggest, and is already being used by Hive and Pig.
ABOUT THE SPEAKERS
Arun Murthy is the lead of the
MapReduce project in Apache
Hadoop where he has been a full-time
contributor to Apache Hadoop since
its inception in 2006. He is a long-time
committer and member of the Apache
Hadoop PMC and jointly holds the
current world sorting record using
Apache Hadoop. Prior to co-founding
Hortonworks, Arun was responsible for
all MapReduce code and configuration
deployed across the 42,000+ servers at
Yahoo!
Bikas Saha has been working on
Apache Hadoop for over a year and is a
committer on the project. He has been
a key contributor in making Hadoop run
natively on Windows and has focused
on YARN and the Hadoop compute
stack. Prior to Hadoop, he has worked
extensively on the Dryad distributed
data processing framework that runs on
some of the world’s largest clusters as
part of Microsoft’s Bing infrastructure.
WATCH THE FULL
PRESENTATION ON InfoQ
Modern Healthcare Architectures
Built with Hadoop
We have heard plenty in the news lately about healthcare challenges and the
difficult choices faced by hospital administrators, technology and pharmaceutical
providers, researchers, and clinicians. At the same time, consumers are
experiencing increased costs without a corresponding increase in health security or
in the reliability of clinical outcomes.
One key obstacle in the healthcare market is data
liquidity (for patients, practitioners and payers) and
some are using Apache Hadoop to overcome this
challenge, as part of a modern data architecture.
This post describes some healthcare use cases, a
healthcare reference architecture and how Hadoop
can ease the pain caused by poor data liquidity.
New Value Pathways for Healthcare
In January 2013, McKinsey & Company published
a report named “The ‘Big Data’ Revolution in
Healthcare”. The report points out how big data
is creating value in five “new value pathways”
allowing data to flow more freely. Below we present
a summary of these five new value pathways and
an example of how Hadoop can be used to address
each. Thanks to the Clinical Informatics Group at UC
Irvine Health for many of the use cases, described in
their UCIH case study.
by Justin Sears
Right Living
Benefit: Patients can build value by taking an active role in their own treatment, including disease prevention.
Hadoop Use Case: Predictive Analytics: Heart patients weigh themselves at home with scales that transmit data wirelessly to their health center. Algorithms analyze the data and flag patterns that indicate a high risk of readmission, alerting a physician.

Right Care
Benefit: Patients get the most timely, appropriate treatment available.
Hadoop Use Case: Real-time Monitoring: Patient vital statistics are transmitted from wireless sensors every minute. If vital signs cross certain risk thresholds, staff can attend to the patient immediately.

Right Provider
Benefit: Provider skill sets matched to the complexity of the assignment – for instance, nurses or physicians’ assistants performing tasks that do not require a doctor. Also the specific selection of the provider with the best outcomes.
Hadoop Use Case: Historical EMR Analysis: Hadoop reduces the cost to store data on clinical operations, allowing longer retention of data on staffing decisions and clinical outcomes. Analysis of this data allows administrators to promote individuals and practices that achieve the best results.
Right Value
Benefit: Ensure cost-effectiveness of care, such as tying provider reimbursement to patient outcomes, or eliminating fraud, waste, or abuse in the system.
Hadoop Use Case: Medical Device Management: For biomedical device maintenance, use geolocation and sensor data to manage medical equipment. The biomedical team can know where all the equipment is, so they don’t waste time searching for an item. Over time, determine the usage of different devices, and use this information to make rational decisions about when to repair or replace equipment.

Right Innovation
Benefit: The identification of new therapies and approaches to delivering care, across all aspects of the system. Also improving the innovation engines themselves.
Hadoop Use Case: Research Cohort Selection: Researchers at teaching hospitals can access patient data in Hadoop for cohort discovery, then present the anonymous sample cohort to their Internal Review Board for approval, without ever having seen uniquely identifiable information.
Source: The ‘Big Data’ Revolution in Healthcare. McKinsey & Company, January 2013.
At Hortonworks, we see our healthcare customers ingest and analyze data from many sources. The following
reference architecture is an amalgam of Hadoop data patterns that we’ve seen with our customers’ use of
Hortonworks Data Platform (HDP). Components shaded green are part of HDP.
Sources of Healthcare Data
Source data comes from:
•	 Legacy Electronic Medical Records (EMRs)
•	 Transcriptions
•	 PACS
•	 Medication Administration
•	 Financial
•	 Laboratory (e.g. SunQuest, Cerner)
•	 RTLS (for locating medical equipment & patient
throughput)
•	 Bio Repository
•	 Device Integration (e.g. iSirona)
•	 Home Devices (e.g. scales and heart monitors)
•	 Clinical Trials
•	 Genomics (e.g. 23andMe, Cancer Genomics Hub)
•	 Radiology (e.g. RadNet)
•	 Quantified Self Sensors (e.g. Fitbit, SmartSleep)
•	 Social Media Streams (e.g. FourSquare, Twitter)
Loading Healthcare Data
Apache Sqoop is included in Hortonworks Data
Platform as a tool to transfer data between external
structured data stores (such as Teradata, Netezza,
MySQL, or Oracle) and HDFS or related systems like
Hive and HBase. We also see our customers using
other tools or standards for loading healthcare data
into Hadoop. Some of these are:
•	 Health Level 7 (HL7) International Standards
•	 Apache UIMA
•	 Java ETL rules
Processing Healthcare Data
Depending on the use case, healthcare organizations
process data in batch (using Apache Hadoop
MapReduce and Apache Pig); interactively (with
Apache Hive); online (with Apache HBase) or
streaming (with Apache Storm).
Analyzing Healthcare Data
Once data is stored and processed in Hadoop it
can either be analyzed in the cluster or exported to
relational data stores for analysis there. These data
stores might include:
•	 Enterprise data warehouse
•	 Quality data mart
•	 Surgical data mart
•	 Clinical info data mart
•	 Diagnosis data mart
•	 Neo4j graph database
Many data analysis and visualization applications
can also work with the data directly in Hadoop.
Hortonworks healthcare customers typically use the
following business intelligence and visualization tools
to inform their decisions:
•	 Microsoft Excel
•	 Tableau
•	 RESTful Web Services
•	 EMR Real-time analytics
•	 Metric Insights
•	 Patient Scorecards
•	 Research Portals
•	 Operational Dashboard
•	 Quality Dashboards
The following diagram shows how healthcare
organizations can integrate Hadoop into their
existing data architecture to create a modern data
architecture that is interoperable and familiar, so that
the same team of analysts and practitioners can use
their existing skills in new ways:
As more and more healthcare organizations adopt
Hadoop to disseminate data to their teams and
partners, they empower caregivers to combine their
training, intuition, and professional experience with
big data to make data-driven decisions that cure
patients and reduce costs.
Watch our blog in the coming weeks as we share
reference architectures for other industry verticals.
READ THIS ARTICLE ON
Hortonworks.com
ABOUT THE AUTHOR
Justin Sears is an experienced
marketing manager with sixteen years
leading teams to create and position
enterprise software, risk-controlled
consumer banking products, desktop
and mobile web properties, and
services for Latino customers in the US
and Latin America. Expert in enterprise
big data use cases for Apache Hadoop.
How LinkedIn Uses Apache Samza
Apache Samza is a stream processor LinkedIn recently open-sourced. In
his presentation, Samza: Real-time Stream Processing at LinkedIn, Chris
Riccomini discusses Samza’s feature set, how Samza integrates with YARN and
Kafka, how it’s used at LinkedIn, and what’s next on the roadmap.
The bulk of the processing that happens at LinkedIn is RPC-
style data processing, where one expects a very fast
response. On the other end of their response-latency
spectrum, they have batch processing, for which they
use Hadoop quite a bit. Hadoop processing and
batch processing typically happen after the fact,
often hours later.
There’s this gap between synchronous RPC
processing, where the user is actively waiting for a
response, and this Hadoop-style processing, which,
despite efforts to shrink it, still takes a long time to
run through.
Presentation transcript edited by Roopesh Shenoy
That’s where Samza fits in. This is where we can
process stuff asynchronously, but we’re also not
waiting for hours. It typically operates in the order of
milliseconds to minutes. The idea is to process stuff
relatively quickly and get the data back to wherever
it needs to be, whether that’s a downstream system
or some real-time service.
Chris mentions that right now, this stream processing
is the worst-supported in terms of tooling and
environment.
LinkedIn sees a lot of use cases for this type of
processing –
•	 Newsfeed displays when people move to another
company, when they like an article, when they
join a group, et cetera.
News is latency-sensitive and if you use Hadoop to
batch-compute it, you might be getting responses
hours or maybe even a day later. It is important to get
trending articles in News pretty quickly.
•	 Advertising – getting relevant advertisements, as
well as tracking and monitoring ad display, clicks
and other metrics
•	 Sophisticated monitoring that allows performing
complex queries like “the top five slowest pages
for the last minute.”
Existing Ecosystem at LinkedIn
The existing ecosystem at LinkedIn has had a huge
influence on the motivation behind Samza as well
as its architecture. Hence it is important to have at
least a glimpse of what this looks like before diving
into Samza.
Kafka is an open-source project that LinkedIn
released a few years ago. It is a messaging system
that fulfills two needs – message-queuing and log
aggregation. All of LinkedIn’s user activity, all the
metrics and monitoring data, and even database
changes go into this.
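For a flavor of what “going into Kafka” looks like from application code, here is a minimal sketch using the Kafka Java producer client; the broker address, topic name, and payload are invented for the example.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ActivityEventProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    // Hypothetical broker address; keys and values are sent as plain strings here.
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");

    Producer<String, String> producer = new KafkaProducer<String, String>(props);
    // Keying by member id sends all of a member's events to the same partition.
    producer.send(new ProducerRecord<String, String>(
        "user-activity", "member-42", "{\"event\":\"page_view\",\"url\":\"/home\"}"));
    producer.close();
  }
}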
LinkedIn also has a specialized system
called Databus, which models all of their databases
as a stream. It is like a database with the latest data
for each key-value pair. But as this database mutates,
you can actually model that set of mutations as a
stream. Each individual change is a message in that
stream.
Because LinkedIn has Kafka and because they’ve
integrated with it for the past few years, a lot of data
at LinkedIn, almost all of it, is available in a stream
format as opposed to a data format or on Hadoop.
Motivation for Building Samza
Chris mentions that when they began doing stream
processing, with Kafka and all this data in their
system, they started with something like a web
service that would start up, read messages from
Kafka and do some processing, and then write the
messages back out.
As they did this, they realized that there were a lot of
problems that needed to be solved in order to make
it really useful and scalable. Things like partitioning:
how do you partition your stream? How do you
partition your processor? How do you manage state,
where state is defined essentially as something that
you maintain in your processor between messages,
or things like count if you’re incrementing a counter
every time a message comes in. How do you re-
process?
With failure semantics, you get at least once, at most
once, exactly once messaging. There is also non-
determinism. If your stream processor is interacting
with another system, whether it’s a database or it’s
depending on time or the ordering of messages,
how do you deal with the things that actually determine the
output that you will end up sending?
Samza tries to address some of these problems.
Samza Architecture
The most basic element of Samza is a stream. The
stream definition for Samza is much more rigid and
heavyweight than you would expect from other
stream processing systems. Other processing
systems, such as Storm, tend to have very lightweight
stream definitions to reduce latency, everything
from, say, UDP to a straight-up TCP connection.
Samza goes the other direction. It wants its streams
to be, for starters, partitioned. It wants them to be
ordered. If you read Message 3 and then Message
4, you are never going to get those inverted within
a single partition. It also wants them to be replayable,
which means you should be able to go back to reread
a message at a later date. It wants them to be fault-
tolerant. If a host from Partition 1 disappears, it
should still be readable on some other hosts. Also,
the streams are usually infinite. Once you get to the
end – say, Message 6 of Partition 0 – you would just
try to reread the next message when it’s available.
It’s not the case that you’re finished.
This definition maps very well to Kafka, which
LinkedIn uses as the streaming infrastructure for
Samza.
There are many concepts to understand within
Samza. In brief, they are:
•	 Streams – Samza processes streams. A stream
is composed of immutable messages of a similar
type or category. The actual implementation
can be provided via a messaging system such
as Kafka (where each topic becomes a Samza
Stream) or a database (table) or even Hadoop (a
directory of files in HDFS)
•	 Things like message ordering and batching are
handled via streams.
•	 Jobs – a Samza job is code that performs logical
transformation on a set of input streams to
append messages to a set of output streams
•	 Partitions – For scalability, each stream is broken
into one or more partitions. Each partition is a
totally ordered sequence of messages
•	 Tasks – again for scalability, a job is distributed
by breaking it into multiple tasks. The task
consumes data from one partition for each of the
job’s input streams
•	 Containers – whereas partitions and tasks are
logical units of parallelism, containers are the unit
of physical parallelism. Each container is a Unix
process (or Linux cgroup) and runs one or more
tasks.
•	 TaskRunner – the TaskRunner is Samza’s stream-
processing container. It manages the startup,
execution, and shutdown of one or more
StreamTask instances (a minimal StreamTask
sketch follows this list).
•	 Checkpointing – Checkpointing is generally done
to enable failure recovery. If a TaskRunner goes
down for some reason (e.g., a hardware failure),
when it comes back up, it should start consuming
messages where it left off – this is achieved
via checkpointing.
•	 State management – Data that needs to be
passed between processing of different messages
can be called state – this can be something as
simple as keeping a count or something a lot
more complex. Samza allows tasks to maintain
persistent, mutable, queryable state that is
physically co-located with each task. The state
is highly available: in the event of a task failure
it will be restored when the task fails over to
another machine.
This datastore is pluggable, but Samza comes with a
key-value store out-of-the-box.
•	 YARN (Yet Another Resource Negotiator) is
Hadoop v2’s biggest improvement over v1 – it
separates the MapReduce job tracker from
resource management and enables MapReduce
alternatives to use the same resource manager.
Samza utilizes YARN to do cluster management,
tracking failures, etc.
Samza provides a YARN ApplicationMaster and a
YARN job runner out of the box.
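To make the task and state concepts above concrete, here is a minimal sketch of a Samza StreamTask that keeps a count per key in the local key-value store; the stream names and the store name are invented and would have to match the job’s configuration.

import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class PageViewCounterTask implements StreamTask, InitableTask {
  // Hypothetical output stream; "kafka" is the system name defined in the job config.
  private static final SystemStream OUTPUT = new SystemStream("kafka", "page-view-counts");

  private KeyValueStore<String, Integer> counts;

  @Override
  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    // The store name must match a store declared in the job's configuration.
    counts = (KeyValueStore<String, Integer>) context.getStore("page-view-counts");
  }

  @Override
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    // One task instance handles one partition; process() is called per message.
    String page = (String) envelope.getMessage();
    Integer current = counts.get(page);
    int updated = (current == null ? 0 : current) + 1;

    // Persist the new count locally and emit it downstream.
    counts.put(page, updated);
    collector.send(new OutgoingMessageEnvelope(OUTPUT, page, updated));
  }
}

The job configuration names the task class, declares the input streams, and defines the store; the TaskRunner then instantiates one task per input partition, calls process() for every message, and checkpoints offsets as described above.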
You can understand how the various components
(YARN, Kafka and Samza API) interact by
looking at the detailed architecture. Also read
the overall documentation to understand each
component in detail.
Possible Improvements
One of the advantages of using something like YARN
with Samza is that it enables you to potentially run
Samza on the same grid that you already run your
draft tasks, test tasks, and MapReduce tasks. You
could use the same infrastructure for all of that.
However, LinkedIn currently does not run Samza in a
multi-framework environment because the existing
setup itself is quite experimental.
In order to get into a more multi-framework
environment, Chris says that the process isolation
would have to get a little better.
Conclusion
Samza is a relatively young project incubating at
Apache so there’s a lot of room to get involved. A good
way to get started is with the hello-samza project,
which is a little thing that will get you up and running
in about five minutes. It will let you play with a real-
time change log from the Wikipedia servers to let you
figure out what’s going on and give you a stream of
stuff to play with.
The other stream-processing project built on top of
Hadoop is Storm. You can see a comparison between
Samza and Storm.
ABOUT THE SPEAKER
Chris Riccomini is a Staff Software
Engineer at LinkedIn, where he is
currently working as a committer
and PMC member for Apache Samza.
He’s been involved in a wide range of
projects at LinkedIn, including, “People
You May Know”, REST.li, Hadoop,
engineering tooling, and OLAP systems.
Prior to LinkedIn, he worked on data
visualization and fraud modeling at
PayPal.
WATCH THE FULL
PRESENTATION ON InfoQ

Hadoop data-lake-white-paperSupratim Ray
 
Comparison among rdbms, hadoop and spark
Comparison among rdbms, hadoop and sparkComparison among rdbms, hadoop and spark
Comparison among rdbms, hadoop and sparkAgnihotriGhosh2
 
Bigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampBigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampSpotle.ai
 
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...Imam Raza
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHitendra Kumar
 
Big data: Descoberta de conhecimento em ambientes de big data e computação na...
Big data: Descoberta de conhecimento em ambientes de big data e computação na...Big data: Descoberta de conhecimento em ambientes de big data e computação na...
Big data: Descoberta de conhecimento em ambientes de big data e computação na...Rio Info
 
Where does hadoop come handy
Where does hadoop come handyWhere does hadoop come handy
Where does hadoop come handyPraveen Sripati
 

Semelhante a Hadoop (20)

Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data science
 
Unstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelUnstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus Model
 
Case study on big data
Case study on big dataCase study on big data
Case study on big data
 
Hadoop Overview
Hadoop OverviewHadoop Overview
Hadoop Overview
 
Hadoop data-lake-white-paper
Hadoop data-lake-white-paperHadoop data-lake-white-paper
Hadoop data-lake-white-paper
 
Big data with java
Big data with javaBig data with java
Big data with java
 
Comparison among rdbms, hadoop and spark
Comparison among rdbms, hadoop and sparkComparison among rdbms, hadoop and spark
Comparison among rdbms, hadoop and spark
 
Hadoop
HadoopHadoop
Hadoop
 
INFO491FinalPaper
INFO491FinalPaperINFO491FinalPaper
INFO491FinalPaper
 
Bigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampBigdata and Hadoop Bootcamp
Bigdata and Hadoop Bootcamp
 
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
 
CSB_community
CSB_communityCSB_community
CSB_community
 
00 hadoop welcome_transcript
00 hadoop welcome_transcript00 hadoop welcome_transcript
00 hadoop welcome_transcript
 
Big data
Big dataBig data
Big data
 
Hadoop
HadoopHadoop
Hadoop
 
BIG DATA HADOOP
BIG DATA HADOOPBIG DATA HADOOP
BIG DATA HADOOP
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log Processing
 
Big data: Descoberta de conhecimento em ambientes de big data e computação na...
Big data: Descoberta de conhecimento em ambientes de big data e computação na...Big data: Descoberta de conhecimento em ambientes de big data e computação na...
Big data: Descoberta de conhecimento em ambientes de big data e computação na...
 
Where does hadoop come handy
Where does hadoop come handyWhere does hadoop come handy
Where does hadoop come handy
 
tools
toolstools
tools
 

Mais de Manuel Vargas

Mi estrategias aprendizaje (lo) ok
Mi estrategias aprendizaje (lo) okMi estrategias aprendizaje (lo) ok
Mi estrategias aprendizaje (lo) okManuel Vargas
 
Mi lecturas complementarias_ok
Mi lecturas complementarias_okMi lecturas complementarias_ok
Mi lecturas complementarias_okManuel Vargas
 
Mii criterios de_selección_recursos_digitales (lo)
Mii criterios de_selección_recursos_digitales (lo)Mii criterios de_selección_recursos_digitales (lo)
Mii criterios de_selección_recursos_digitales (lo)Manuel Vargas
 
Mii guia buenas_practicas (lo)
Mii guia buenas_practicas (lo)Mii guia buenas_practicas (lo)
Mii guia buenas_practicas (lo)Manuel Vargas
 
Mii lección 1 ok(1)
Mii lección 1 ok(1)Mii lección 1 ok(1)
Mii lección 1 ok(1)Manuel Vargas
 
Mii lección 2 ok(1)
Mii lección 2 ok(1)Mii lección 2 ok(1)
Mii lección 2 ok(1)Manuel Vargas
 
Mii criterios de_selección_recursos_digitales (lo)(1)
Mii criterios de_selección_recursos_digitales (lo)(1)Mii criterios de_selección_recursos_digitales (lo)(1)
Mii criterios de_selección_recursos_digitales (lo)(1)Manuel Vargas
 
Mi guia buenas_practicas
Mi guia buenas_practicasMi guia buenas_practicas
Mi guia buenas_practicasManuel Vargas
 
Mi criterios de_selección_recursos_digitales
Mi criterios de_selección_recursos_digitalesMi criterios de_selección_recursos_digitales
Mi criterios de_selección_recursos_digitalesManuel Vargas
 
Mi guia buenas_practicas
Mi guia buenas_practicasMi guia buenas_practicas
Mi guia buenas_practicasManuel Vargas
 
Mi lecturas complementarias_ok
Mi lecturas complementarias_okMi lecturas complementarias_ok
Mi lecturas complementarias_okManuel Vargas
 
Mi ejemplo actividad_n°2
Mi ejemplo actividad_n°2Mi ejemplo actividad_n°2
Mi ejemplo actividad_n°2Manuel Vargas
 

Mais de Manuel Vargas (20)

M0 glosario ok
M0 glosario okM0 glosario ok
M0 glosario ok
 
Mi lección 1 ok
Mi lección 1 okMi lección 1 ok
Mi lección 1 ok
 
Mi lección 2 ok
Mi lección 2 okMi lección 2 ok
Mi lección 2 ok
 
Mi estrategias aprendizaje (lo) ok
Mi estrategias aprendizaje (lo) okMi estrategias aprendizaje (lo) ok
Mi estrategias aprendizaje (lo) ok
 
Mi lecturas complementarias_ok
Mi lecturas complementarias_okMi lecturas complementarias_ok
Mi lecturas complementarias_ok
 
Matriz actividad 1
Matriz actividad 1Matriz actividad 1
Matriz actividad 1
 
Mii lección 1 ok
Mii lección 1 okMii lección 1 ok
Mii lección 1 ok
 
Mii lección 2 ok
Mii lección 2 okMii lección 2 ok
Mii lección 2 ok
 
Mii criterios de_selección_recursos_digitales (lo)
Mii criterios de_selección_recursos_digitales (lo)Mii criterios de_selección_recursos_digitales (lo)
Mii criterios de_selección_recursos_digitales (lo)
 
Mii guia buenas_practicas (lo)
Mii guia buenas_practicas (lo)Mii guia buenas_practicas (lo)
Mii guia buenas_practicas (lo)
 
Mii lección 1 ok(1)
Mii lección 1 ok(1)Mii lección 1 ok(1)
Mii lección 1 ok(1)
 
Mii lección 2 ok(1)
Mii lección 2 ok(1)Mii lección 2 ok(1)
Mii lección 2 ok(1)
 
Mii criterios de_selección_recursos_digitales (lo)(1)
Mii criterios de_selección_recursos_digitales (lo)(1)Mii criterios de_selección_recursos_digitales (lo)(1)
Mii criterios de_selección_recursos_digitales (lo)(1)
 
Mi guia buenas_practicas
Mi guia buenas_practicasMi guia buenas_practicas
Mi guia buenas_practicas
 
Mi lección 1 ok
Mi lección 1 okMi lección 1 ok
Mi lección 1 ok
 
Mi lección 2.1
Mi lección 2.1Mi lección 2.1
Mi lección 2.1
 
Mi criterios de_selección_recursos_digitales
Mi criterios de_selección_recursos_digitalesMi criterios de_selección_recursos_digitales
Mi criterios de_selección_recursos_digitales
 
Mi guia buenas_practicas
Mi guia buenas_practicasMi guia buenas_practicas
Mi guia buenas_practicas
 
Mi lecturas complementarias_ok
Mi lecturas complementarias_okMi lecturas complementarias_ok
Mi lecturas complementarias_ok
 
Mi ejemplo actividad_n°2
Mi ejemplo actividad_n°2Mi ejemplo actividad_n°2
Mi ejemplo actividad_n°2
 

Último

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 

Último (20)

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 

Hadoop

Rather than relying on high-end hardware, Hadoop clusters get their reliability from the software's ability to detect and handle failures on its own.

According to a Hortonworks survey, many large, mainstream organizations (50% of survey respondents were from organizations with over $500M in revenues) currently deploy Hadoop across many industries, including high-tech, healthcare, retail, financial services, government, and manufacturing. In the majority of cases, Hadoop does not replace existing data-processing systems but rather complements them. It is typically used to supplement existing systems to tap into additional business data and a more powerful analytics system in order to gain a competitive advantage through better insight into business information.

Some 54% of respondents are utilizing Hadoop to capture new types of data, while 48% are planning to do the same. The main new data types include the following:

• Server-log data, enabling IT departments to better manage their infrastructure (64% of respondents are already doing it, while 28% plan to).
• Clickstream data, enabling better understanding of how customers are using applications (52.3% of respondents are already doing it, while 37.4% plan to).
• Social-media data, enabling understanding of the public's perception of the company (36.5% of respondents are already doing it, while 32.5% plan to).
• Geolocation data, enabling analysis of travel patterns (30.8% of respondents are already doing it, while 26.8% plan to).
• Machine data, enabling analysis of machine usage (29.3% of respondents are already doing it, while 33.3% plan to).

According to the survey, traditional data grows at an average rate of about 8% a year, but new data types are growing at a rate exceeding 85%; as a result, it is virtually impossible to collect and process them without Hadoop.

While Version 1 of Hadoop came with the MapReduce processing model, the recently released Hadoop Version 2 comes with YARN, which separates cluster management from the MapReduce job manager, making the Hadoop architecture more modular. One of the side effects is that MapReduce is now just one of the ways to leverage a Hadoop cluster; a number of new projects such as Tez and Stinger can also process data in the Hadoop Distributed File System (HDFS) without the constraint of MapReduce's batch-job execution model.

by Roopesh Shenoy and Boris Lublinsky – content extracted from this InfoQ news post
Building Applications With Hadoop

Presentation transcript edited by Roopesh Shenoy

When building applications using Hadoop, it is common to have input data from various sources coming in various formats. In his presentation, "New Tools for Building Applications on Apache Hadoop", Eli Collins, tech lead for Cloudera's Platform Team, overviews how to build better products with Hadoop and various tools that can help, such as Apache Avro, Apache Crunch, Cloudera ML, and the Cloudera Development Kit.

Avro

Avro is a data-serialization project, similar to Thrift or Protocol Buffers. It's expressive: you can deal in terms of records, arrays, unions, and enums. It's efficient, with a compact binary representation. One of the benefits of logging in Avro is that you get much smaller data files. All the traditional aspects of Hadoop data formats, like compressible or splittable data, are true of Avro.

One of the reasons Doug Cutting (founder of the Hadoop project) created the Avro project was that a lot of the formats in Hadoop were Java only. It's important for Avro to be interoperable – with a lot of different languages like Java, C, C++, C#, Python, Ruby, etc. – and to be usable by a lot of tools. One of the goals for Avro is a set of formats and serialization that's usable throughout the data platform that you're using, not just in a subset of the components. So MapReduce, Pig, Hive, Crunch, Flume, Sqoop, etc. all support Avro.

Avro is dynamic, and one of its neat features is that you can read and write data without generating any code. It will use reflection and look at the schema that you've given it to create classes on the fly. This is called the Avro generic format. You can also specify formats for which Avro will generate optimal code.

Avro was designed with the expectation that you would change your schema over time. That's an important attribute in a big-data system, because you generate lots of data and you don't want to constantly reprocess it. You're going to generate data at one time and have tools process that data maybe two, three, or four years down the line. Avro has the ability to negotiate differences between schemata so that new tools can read old data and vice versa.

Avro forms an important basis for the following projects.
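Before moving on, here is roughly what the generic read/write path described above looks like. This is a minimal sketch for illustration, not code from the talk; the PageView schema and file name are made up.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroGenericExample {
  public static void main(String[] args) throws Exception {
    // Parse a schema at runtime; no code generation involved.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"PageView\",\"fields\":["
        + "{\"name\":\"url\",\"type\":\"string\"},"
        + "{\"name\":\"latencyMs\",\"type\":\"long\"}]}");

    // Build a record generically and write it to an Avro data file.
    GenericRecord view = new GenericData.Record(schema);
    view.put("url", "/home");
    view.put("latencyMs", 42L);

    File file = new File("pageviews.avro");
    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
    writer.create(schema, file);   // the schema is embedded in the file
    writer.append(view);
    writer.close();

    // Read it back; the reader discovers the schema from the file itself.
    DataFileReader<GenericRecord> reader =
        new DataFileReader<GenericRecord>(file, new GenericDatumReader<GenericRecord>());
    for (GenericRecord record : reader) {
      System.out.println(record.get("url") + " " + record.get("latencyMs"));
    }
    reader.close();
  }
}

Because the schema travels with the data, a tool written years later can still read these files, which is the schema-evolution property the talk emphasizes.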
Crunch

You're probably familiar with Pig and Hive, how to process data with them, and how to integrate valuable tools. However, not all data formats that you use will fit Pig and Hive. Pig and Hive are great for a lot of logged data or relational data, but other data types don't fit as well. You can still process poorly fitting data with Pig and Hive, which don't force you into a relational model or a log structure, but you have to do a lot of work around it. You might find yourself writing unwieldy user-defined functions or doing things that are not natural in the language. People sometimes just give up and start writing raw Java MapReduce programs because that's easier.

Crunch was created to fill this gap. It's a higher-level API than MapReduce. It's in Java. It's lower level than, say, Pig, Hive, Cascading, or other frameworks you might be used to. It's based on a paper that Google published called FlumeJava, and it's a very similar API. Crunch has you combine a small number of primitives with a small number of types, effectively allowing the user to create really lightweight UDFs, which are just Java methods and classes, to create complex data pipelines.

Crunch has a number of advantages:

• It's just Java. You have access to a full programming language.
• You don't have to learn Pig.
• The type system is well integrated. You can use Java POJOs, but there's also native support for Hadoop Writables and Avro. There's no impedance mismatch between the Java code you're writing and the data that you're analyzing.
• It's built as a modular library for reuse. You can capture your pipelines in Crunch code in Java and then combine them with an arbitrary machine-learning program later, so that someone else can reuse that algorithm.

The fundamental structure is a parallel collection: a distributed, unordered collection of elements. This collection has a parallel-do operator, which you can imagine turns into a MapReduce job. So if you have a bunch of data that you want to operate on in parallel, you can use a parallel collection. There's also something called the parallel table, which is a subinterface of the collection; it's a distributed sorted map, and it has a group-by operator you can use to aggregate all the values for a given key. We'll go through an example that shows how that works. Finally, there's a pipeline class; pipelines are for coordinating the execution of the MapReduce jobs that will actually do the back-end processing for the Crunch program.

Let's take an example for which you've probably seen all the Java code before, word count, and see what it looks like in Crunch.

Crunch – word count

public class WordCount {
  public static void main(String[] args) throws Exception {
    Pipeline pipeline = new MRPipeline(WordCount.class);
    PCollection<String> lines = pipeline.readTextFile(args[0]);
    PCollection<String> words = lines.parallelDo("my splitter", new DoFn<String, String>() {
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());
    PTable<String, Long> counts = Aggregate.count(words);
    pipeline.writeTextFile(counts, args[1]);
    pipeline.run();
  }
}
It's a lot smaller and simpler. The first line creates a pipeline. We create a parallel collection of all the lines from a given file by using the pipeline class, and then we get a collection of words by running the parallel-do operator on those lines. We've defined an anonymous function here that processes the input; word count splits each line on whitespace and emits each word for each map task. Finally, we want to aggregate the counts for each word and write them out. There's a line at the bottom, pipeline run: Crunch's planner does lazy evaluation, so it won't create and run the MapReduce jobs until we've put a full pipeline together. If you're used to programming Java and you've seen the Hadoop examples for writing word count in Java, you can tell that this is a more natural way to express that. This is among the simplest pipelines you can create, and you can imagine doing many more complicated things.

If you want to go even one step easier than this, there's a wrapper for Scala. This is a very similar idea to Cascade, which was built on Google FlumeJava. Since Scala runs on the JVM, it's an obvious, natural fit, and Scala's type inference ends up being really powerful in the context of Crunch. The listing below is the same program written in Scala. We have the pipeline, and we can use Scala's built-in functions, which map really nicely to Crunch – so word count becomes a one-line program. It's pretty cool and very powerful if you're writing Java code already and want to do complex pipelines.

Scrunch – Scala wrapper

class WordCountExample {
  val pipeline = new Pipeline[WordCountExample]

  def wordCount(fileName: String) = {
    pipeline.read(from.textFile(fileName))
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(!_.isEmpty())
      .count
  }
}
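The parallel table and its group-by operator mentioned above are worth a concrete look as well. The sketch below is not from the talk: it assumes a tab-separated "user, bytes" input format, and the Aggregators helper and table-typed parallelDo overload are used as I understand the Crunch API of that era, so treat the details as approximate.

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pair;
import org.apache.crunch.Pipeline;
import org.apache.crunch.fn.Aggregators;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class BytesPerUser {
  public static void main(String[] args) throws Exception {
    Pipeline pipeline = new MRPipeline(BytesPerUser.class);
    PCollection<String> lines = pipeline.readTextFile(args[0]);

    // Parse each line into a (user, bytes) pair, producing a distributed table.
    PTable<String, Long> bytesByUser = lines.parallelDo("parse",
        new DoFn<String, Pair<String, Long>>() {
          public void process(String line, Emitter<Pair<String, Long>> emitter) {
            String[] parts = line.split("\t");
            emitter.emit(Pair.of(parts[0], Long.parseLong(parts[1])));
          }
        }, Writables.tableOf(Writables.strings(), Writables.longs()));

    // Group all values for a key and sum them: the group-by operator
    // described earlier, expressed as ordinary Java.
    PTable<String, Long> totals =
        bytesByUser.groupByKey().combineValues(Aggregators.SUM_LONGS());

    pipeline.writeTextFile(totals, args[1]);
    pipeline.done();
  }
}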
Cloudera ML

Cloudera ML (machine learning) is an open-source library and set of tools to help data scientists perform their day-to-day tasks, primarily from data preparation to model evaluation. With built-in commands for summarizing, sampling, normalizing, and pivoting data, Cloudera ML has recently added a built-in clustering algorithm for k-means, based on an algorithm that was developed just a year or two back. There are a couple of other implementations as well. It's a home for tools you can use so you can focus on data analysis and modeling instead of on building or wrangling the tools.

It's built using Crunch, and it leverages a lot of existing projects. For example, the vector formats: a lot of ML involves transforming raw data that's in a record format into vector formats for machine-learning algorithms. It leverages Mahout's vector interface and classes for that purpose. The record format is just a thin wrapper around Avro and HCatalog record and schema formats, so you can easily integrate with existing data sources.

For more information on Cloudera ML, visit the project's GitHub page; there's a bunch of examples with datasets that can get you started.

Cloudera Development Kit

Like Cloudera ML, the Cloudera Development Kit is a set of open-source libraries and tools that make writing applications on Hadoop easier. Unlike ML, though, it's not focused on machine learning for data scientists. It's directed at developers trying to build applications on Hadoop. It's really the plumbing of a lot of different frameworks and pipelines and the integration of a lot of different components.

The purpose of the CDK is to provide higher-level APIs on top of the existing Hadoop components in the CDH stack that codify a lot of patterns in common use cases. CDK is prescriptive, has an opinion on the way to do things, and tries to make it easy for you to do the right thing by default, but its architecture is a system of loosely coupled modules. You can use modules independently of each other. It's not an uber-framework that you have to adopt whole; you can adopt it piecemeal. It doesn't force you into any particular programming paradigms, and it doesn't force you to adopt a ton of dependencies. You can adopt only the dependencies of the particular modules you want.

Let's look at an example. The first module in CDK is the data module, and the goal of the data module is to make it easier for you to work with datasets on Hadoop file systems. There are a lot of gory details to get right to make this work in practice; you have to worry about serialization, deserialization, compression, partitioning, directory layout, and communicating that directory layout and partitioning to other people who want to consume the data. The CDK data module handles all this for you. It automatically serializes and deserializes data from Java POJOs, if that's what you have, or Avro records if you use them. It has built-in compression, and built-in policies around file and directory layouts, so that you don't have to repeat a lot of these decisions and you get smart policies out of the box. It will automatically partition data within those layouts. It lets you focus on working with a dataset on HDFS instead of on all the implementation details.

It also has plugin providers for existing systems. Imagine you're already using Hive and HCatalog as a metadata repository and you've already got a schema for what these files look like. CDK integrates with that. It doesn't require you to define all of your metadata for your entire data repository from scratch; it integrates with existing systems. You can learn more about the various CDK modules and how to use them in the documentation.

In summary, working with data from various sources, preparing and cleansing the data, and processing it via Hadoop involves a lot of work. Tools such as Crunch, Cloudera ML, and CDK make it easier to do this and to leverage Hadoop more effectively.

ABOUT THE SPEAKER

Eli Collins is the tech lead for Cloudera's Platform team, an active contributor to Apache Hadoop, and a member of its project management committee (PMC) at the Apache Software Foundation. Eli holds Bachelor's and Master's degrees in Computer Science from New York University and the University of Wisconsin-Madison, respectively.

WATCH THE FULL PRESENTATION ON InfoQ
What is Apache Tez?

Presentation transcript edited by Roopesh Shenoy

You might have heard of Apache Tez, a new distributed execution framework that is targeted towards data-processing applications on Hadoop. But what exactly is it? How does it work? Who should use it and why? In their presentation, "Apache Tez: Accelerating Hadoop Query Processing", Bikas Saha and Arun Murthy discuss Tez's design, highlight some of its features, and share some of the initial results obtained by making Hive use Tez instead of MapReduce.

Tez generalizes the MapReduce paradigm to a more powerful framework based on expressing computations as a dataflow graph. Tez is not meant directly for end users – in fact, it enables developers to build end-user applications with much better performance and flexibility. Hadoop has traditionally been a batch-processing platform for large amounts of data. However, there are a lot of use cases that call for near-real-time query processing, and there are also several workloads, such as machine learning, which do not fit well into the MapReduce paradigm. Tez helps Hadoop address these use cases.

The Tez project aims to be highly customizable so that it can meet a broad spectrum of use cases without forcing people to go out of their way to make things work; projects such as Hive and Pig are seeing significant improvements in response times when they use Tez instead of MapReduce as the backbone for data processing. Tez is built on top of YARN, which is the new resource-management framework for Hadoop.
Design Philosophy

The main reason for Tez to exist is to get around limitations imposed by MapReduce. Other than being limited to writing mappers and reducers, there are other inefficiencies in force-fitting all kinds of computations into this paradigm – e.g., HDFS is used to store temporary data between multiple MR jobs, which is an overhead. (In Hive, this is common when queries require multiple shuffles on keys without correlation, such as with join – group by – window function – order by.)

The key elements forming the design philosophy behind Tez are:

• Empowering developers (and hence end users) to do what they want in the most efficient manner.
• Better execution performance.

Some of the things that help Tez achieve these goals are:

• Expressive dataflow APIs – The Tez team wants an expressive dataflow-definition API so that you can describe the directed acyclic graph (DAG) of the computation that you want to run. For this, Tez has a structural kind of API in which you add all processors and edges and visualize what you are actually constructing.
• Flexible input-processor-output runtime model – Tez can construct runtime executors dynamically by connecting different inputs, processors, and outputs.
• Data-type agnostic – Tez is only concerned with the movement of data, not with the data format (key-value pairs, tuple-oriented formats, etc.).
• Dynamic graph reconfiguration.
• Simple deployment – Tez is completely a client-side application; it leverages YARN local resources and the distributed cache. There's no need to deploy anything on your cluster as far as using Tez is concerned: you just upload the relevant Tez libraries to HDFS, then use your Tez client to submit with those libraries. You can even have two copies of the libraries on your cluster. One would be a production copy, the stable version that all your production jobs use; your users can experiment with a second copy, the latest version of Tez, and the two will not interfere with each other.

Tez can run any MR job without any modification. This allows for stage-wise migration of tools that currently depend on MR.

Exploring the expressive dataflow APIs in more detail, what can you do with them? For example, instead of using multiple MapReduce jobs, you can use the MRR pattern, in which a single map has multiple reduce stages; this allows streaming of data from one processor to the next without writing anything to HDFS (it will be written to disk only for check-pointing), leading to much better performance. The diagrams below demonstrate this.
The first diagram demonstrates a process that has multiple MR jobs, each storing intermediate results to HDFS – the reducers of the previous step feeding the mappers of the next step. The second diagram shows how, with Tez, the same processing can be done in just one job, with no need to access HDFS in between.

Tez's flexibility means that it requires a bit more effort than MapReduce to start consuming; there's a bit more API and a bit more processing logic that you need to implement. This is fine, since it is not an end-user application like MapReduce; it is designed to let developers build end-user applications on top of it. Given that overview of Tez and its broad goals, let's try to understand the actual APIs.

Tez API

The Tez API has the following components:

• DAG (directed acyclic graph) – defines the overall job. One DAG object corresponds to one job.
• Vertex – defines the user logic along with the resources and the environment needed to execute that logic. One vertex corresponds to one step in the job.
• Edge – defines the connection between producer and consumer vertices. Edges need to be assigned properties; these properties are essential for Tez to be able to expand the logical graph at runtime into the physical set of tasks that can run in parallel on the cluster. There are several such properties:
  • The data-movement property defines how data moves from a producer to a consumer.
  • The scheduling property (sequential or concurrent) defines when the producer and consumer tasks can be scheduled relative to each other.
  • The data-source property (persisted, reliable, or ephemeral) defines the lifetime or durability of the output produced by a task, so that Tez can determine when it can terminate the task.

You can view this Hortonworks article to see an example of the API in action, more detail about these properties, and how the logical graph expands at runtime.

The runtime API is based on an input-processor-output model which allows all inputs and outputs to be pluggable. To facilitate this, Tez uses an event-based model to communicate between tasks and the system, and between various components. Events are used to pass information such as task failures to the required components, the flow of data from an Output to an Input (such as the location of the data it generates), run-time changes to the DAG execution plan, etc. Tez also comes with various Input and Output processors out of the box. The expressive API allows writers of higher-level languages (such as Hive) to elegantly transform their queries into Tez jobs.

Tez Scheduler

The Tez scheduler considers a lot of things when deciding on task assignments: task-locality requirements, compatibility of containers, total available resources on the cluster, priority of pending task requests, automatic parallelization, freeing up resources that the application cannot use anymore (because the data is not local to it), etc. It also maintains a connection pool of pre-warmed JVMs with shared registry objects. The application can choose to store different kinds of pre-computed information in those shared registry objects so that it can be reused without having to be recomputed later on, and this shared set of connections and container-pool resources can run those tasks very fast. You can read more about the reuse of containers in Apache Tez.
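To make the DAG, vertex, and edge-property model above easier to picture, here is a small, self-contained Java sketch. These are illustrative classes written for this article, not the actual Tez API types; the enum values simply mirror the edge properties discussed above, and the processor names are invented.

import java.util.ArrayList;
import java.util.List;

public class LogicalDag {
  enum DataMovement { ONE_TO_ONE, BROADCAST, SCATTER_GATHER }
  enum Scheduling { SEQUENTIAL, CONCURRENT }
  enum DataSource { PERSISTED, RELIABLE, EPHEMERAL }

  // One vertex = one step in the job, wrapping some user processing logic.
  static class Vertex {
    final String name;
    final String processorLogic;
    Vertex(String name, String processorLogic) {
      this.name = name;
      this.processorLogic = processorLogic;
    }
  }

  // One edge = a producer-consumer connection annotated with the three
  // property types Tez uses to expand the logical graph into physical tasks.
  static class Edge {
    final Vertex producer, consumer;
    final DataMovement movement;
    final Scheduling scheduling;
    final DataSource source;
    Edge(Vertex producer, Vertex consumer,
         DataMovement movement, Scheduling scheduling, DataSource source) {
      this.producer = producer;
      this.consumer = consumer;
      this.movement = movement;
      this.scheduling = scheduling;
      this.source = source;
    }
  }

  final List<Vertex> vertices = new ArrayList<Vertex>();
  final List<Edge> edges = new ArrayList<Edge>();

  public static void main(String[] args) {
    // A map -> reduce -> reduce (MRR) pipeline expressed as one DAG.
    LogicalDag dag = new LogicalDag();
    Vertex map = new Vertex("map", "TokenizerProcessor");
    Vertex reduce1 = new Vertex("partial-agg", "PartialAggProcessor");
    Vertex reduce2 = new Vertex("final-agg", "FinalAggProcessor");
    dag.vertices.add(map);
    dag.vertices.add(reduce1);
    dag.vertices.add(reduce2);
    // A shuffle between stages, scheduled sequentially, intermediate data persisted.
    dag.edges.add(new Edge(map, reduce1,
        DataMovement.SCATTER_GATHER, Scheduling.SEQUENTIAL, DataSource.PERSISTED));
    dag.edges.add(new Edge(reduce1, reduce2,
        DataMovement.SCATTER_GATHER, Scheduling.SEQUENTIAL, DataSource.PERSISTED));
    System.out.println("vertices=" + dag.vertices.size() + ", edges=" + dag.edges.size());
  }
}

The point of the sketch is simply that the whole pipeline is one job described up front, with the edge annotations telling the framework how to schedule and move data between steps.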
Flexibility

Overall, Tez provides a great deal of flexibility for developers to deal with complex processing logic. This can be illustrated with one example of how Hive is able to leverage Tez. Consider a typical TPC-DS query pattern in which you join multiple dimension tables with a fact table. Most optimizers and query systems can do what is shown in the top-right corner of the diagram: if the dimension tables are small, they can broadcast-join all of them with the large fact table, and you can do that same thing on Tez.
But what if those broadcasts include user-defined functions that are expensive to compute? You may not be able to do all of it that way. You may have to break up your tasks into different stages, and that's what the left-side topology shows. The first dimension table is broadcast-joined with the fact table. The result is then broadcast-joined with the second dimension table. Here, the third dimension table is not broadcastable because it is too large, so you can choose to do a shuffle join, and Tez can efficiently navigate the topology without falling over just because you can't use the top-right plan.

The two benefits of this kind of Hive query with Tez are:

• It gives you full DAG support and does a lot automatically on the cluster, so that it can fully utilize the parallelism that is available; as discussed above, this means there is no need to read from and write to HDFS between multiple MR jobs – all the computation can be done in a single Tez job.
• It provides sessions and reusable containers, so that you have low latency and can avoid recomputation as much as possible.

This particular Hive query saw a performance improvement of more than 100% with the new Tez engine.

Roadmap

• Richer DAG support. For example, could Samza use Tez as a substrate on which to build its application? Tez would need some support to handle Samza's core scheduling and streaming requirements, and the Tez team wants to explore how to enable those kinds of connection patterns in Tez DAGs. They also want more fault-tolerance support, more efficient data transfer for further performance optimization, and improved session performance.
• Given that these DAGs can get arbitrarily complex, a lot of automatic tooling is needed to help users understand their performance bottlenecks.
Summary

Tez is a distributed execution framework that works on computations represented as dataflow graphs. It maps naturally to higher-level declarative languages like Hive, Pig, Cascading, etc. It is designed to have a highly customizable execution architecture so that dynamic performance optimizations can be made at runtime, based on real information about the data and the resources. The framework itself automatically determines a lot of the hard stuff, so it works well right out of the box, and you get good performance and efficiency out of the box. Tez aims to address the broad spectrum of use cases in the data-processing domain in Hadoop, ranging from latency to complexity of execution. It is an open-source project. Tez works, Saha and Murthy suggest, and is already being used by Hive and Pig.

ABOUT THE SPEAKERS

Arun Murthy is the lead of the MapReduce project in Apache Hadoop, where he has been a full-time contributor since its inception in 2006. He is a long-time committer and member of the Apache Hadoop PMC and jointly holds the current world sorting record using Apache Hadoop. Prior to co-founding Hortonworks, Arun was responsible for all MapReduce code and configuration deployed across the 42,000+ servers at Yahoo!

Bikas Saha has been working on Apache Hadoop for over a year and is a committer on the project. He has been a key contributor in making Hadoop run natively on Windows and has focused on YARN and the Hadoop compute stack. Prior to Hadoop, he worked extensively on the Dryad distributed data-processing framework, which runs on some of the world's largest clusters as part of the Microsoft Bing infrastructure.

WATCH THE FULL PRESENTATION ON InfoQ
Modern Healthcare Architectures Built with Hadoop

by Justin Sears

We have heard plenty in the news lately about healthcare challenges and the difficult choices faced by hospital administrators, technology and pharmaceutical providers, researchers, and clinicians. At the same time, consumers are experiencing increased costs without a corresponding increase in health security or in the reliability of clinical outcomes.

One key obstacle in the healthcare market is data liquidity (for patients, practitioners, and payers), and some are using Apache Hadoop to overcome this challenge as part of a modern data architecture. This post describes some healthcare use cases, a healthcare reference architecture, and how Hadoop can ease the pain caused by poor data liquidity.

New Value Pathways for Healthcare

In January 2013, McKinsey & Company published a report named "The 'Big Data' Revolution in Healthcare". The report points out how big data is creating value in five "new value pathways" allowing data to flow more freely. Below we present a summary of these five new value pathways and an example of how Hadoop can be used to address each. Thanks to the Clinical Informatics Group at UC Irvine Health for many of the use cases, described in their UCIH case study.

Right Living
Benefit: Patients can build value by taking an active role in their own treatment, including disease prevention.
Hadoop use case – Predictive analytics: Heart patients weigh themselves at home with scales that transmit data wirelessly to their health center. Algorithms analyze the data and flag patterns that indicate a high risk of readmission, alerting a physician.

Right Care
Benefit: Patients get the most timely, appropriate treatment available.
Hadoop use case – Real-time monitoring: Patient vital statistics are transmitted from wireless sensors every minute. If vital signs cross certain risk thresholds, staff can attend to the patient immediately.

Right Provider
Benefit: Provider skill sets are matched to the complexity of the assignment – for instance, nurses or physicians' assistants performing tasks that do not require a doctor – along with specific selection of the provider with the best outcomes.
Hadoop use case – Historical EMR analysis: Hadoop reduces the cost to store data on clinical operations, allowing longer retention of data on staffing decisions and clinical outcomes. Analysis of this data allows administrators to promote individuals and practices that achieve the best results.
Right Value
Benefit: Ensure cost-effectiveness of care, such as tying provider reimbursement to patient outcomes, or eliminating fraud, waste, or abuse in the system.
Hadoop use case – Medical device management: For biomedical device maintenance, use geolocation and sensor data to manage medical equipment. The biomedical team can know where all the equipment is, so they don't waste time searching for an item. Over time, determine the usage of different devices, and use this information to make rational decisions about when to repair or replace equipment.

Right Innovation
Benefit: The identification of new therapies and approaches to delivering care, across all aspects of the system, as well as improvement of the innovation engines themselves.
Hadoop use case – Research cohort selection: Researchers at teaching hospitals can access patient data in Hadoop for cohort discovery, then present the anonymous sample cohort to their Institutional Review Board for approval, without ever having seen uniquely identifiable information.

Source: The 'Big Data' Revolution in Healthcare. McKinsey & Company, January 2013.

At Hortonworks, we see our healthcare customers ingest and analyze data from many sources. The following reference architecture is an amalgam of Hadoop data patterns that we've seen with our customers' use of Hortonworks Data Platform (HDP). Components shaded green in the diagram are part of HDP.

Sources of Healthcare Data

Source data comes from:

• Legacy electronic medical records (EMRs)
• Transcriptions
• PACS
• Medication administration
• Financial systems
• Laboratory (e.g. SunQuest, Cerner)
• RTLS (for locating medical equipment and tracking patient throughput)
• Bio repositories
• Device integration (e.g. iSirona)
• Home devices (e.g. scales and heart monitors)
• Clinical trials
• Genomics (e.g. 23andMe, Cancer Genomics Hub)
• Radiology (e.g. RadNet)
• Quantified-self sensors (e.g. Fitbit, SmartSleep)
• Social-media streams (e.g. FourSquare, Twitter)
Loading Healthcare Data

Apache Sqoop is included in Hortonworks Data Platform as a tool to transfer data between external structured data stores (such as Teradata, Netezza, MySQL, or Oracle) and HDFS or related systems like Hive and HBase. We also see our customers using other tools or standards for loading healthcare data into Hadoop. Some of these are:

• Health Level 7 (HL7) international standards
• Apache UIMA
• Java ETL rules

Processing Healthcare Data

Depending on the use case, healthcare organizations process data in batch (using Apache Hadoop MapReduce and Apache Pig), interactively (with Apache Hive), online (with Apache HBase), or streaming (with Apache Storm).

Analyzing Healthcare Data

Once data is stored and processed in Hadoop, it can either be analyzed in the cluster or exported to relational data stores for analysis there. These data stores might include:

• Enterprise data warehouse
• Quality data mart
• Surgical data mart
• Clinical info data mart
• Diagnosis data mart
• Neo4j graph database

Many data-analysis and visualization applications can also work with the data directly in Hadoop. Hortonworks healthcare customers typically use the following business-intelligence and visualization tools to inform their decisions:

• Microsoft Excel
• Tableau
• RESTful web services
• EMR real-time analytics
• Metric Insights
• Patient scorecards
• Research portals
• Operational dashboards
• Quality dashboards

The following diagram shows how healthcare organizations can integrate Hadoop into their existing data architecture to create a modern data architecture that is interoperable and familiar, so that the same team of analysts and practitioners can use their existing skills in new ways. As more and more healthcare organizations adopt Hadoop to disseminate data to their teams and partners, they empower caregivers to combine their training, intuition, and professional experience with big data to make data-driven decisions that cure patients and reduce costs.

Watch our blog in the coming weeks as we share reference architectures for other industry verticals.

READ THIS ARTICLE ON Hortonworks.com

ABOUT THE AUTHOR

Justin Sears is an experienced marketing manager with sixteen years leading teams to create and position enterprise software, risk-controlled consumer banking products, desktop and mobile web properties, and services for Latino customers in the US and Latin America. He is an expert in enterprise big-data use cases for Apache Hadoop.
How LinkedIn Uses Apache Samza

Apache Samza is a stream processor LinkedIn recently open-sourced. In his presentation, Samza: Real-time Stream Processing at LinkedIn, Chris Riccomini discusses Samza's feature set, how Samza integrates with YARN and Kafka, how it's used at LinkedIn, and what's next on the roadmap.

The bulk of the processing that happens at LinkedIn is RPC-style data processing, where one expects a very fast response. At the other end of the response-latency spectrum is batch processing, for which LinkedIn uses Hadoop quite a bit. Hadoop-style batch processing typically happens after the fact, often hours later. There is a gap between synchronous RPC processing, where the user is actively waiting for a response, and Hadoop-style processing, which, despite efforts to shrink it, still takes a long time to run.

Presentation transcript edited by Roopesh Shenoy
That's where Samza fits in: it processes data asynchronously, but without waiting for hours. It typically operates on the order of milliseconds to minutes. The idea is to process data relatively quickly and get it back to wherever it needs to be, whether that's a downstream system or some real-time service. Chris mentions that, right now, this kind of stream processing is the worst-supported in terms of tooling and environment.

LinkedIn sees a lot of use cases for this type of processing:
• Newsfeed displays when people move to another company, when they like an article, when they join a group, et cetera. News is latency-sensitive, and if you use Hadoop to batch-compute it, you might be getting responses hours or maybe even a day later. It is important to surface trending articles in the news feed quickly.
• Advertising: serving relevant advertisements, as well as tracking and monitoring ad displays, clicks, and other metrics.
• Sophisticated monitoring that allows complex queries like "the top five slowest pages for the last minute".

Existing Ecosystem at LinkedIn

The existing ecosystem at LinkedIn has had a huge influence on the motivation behind Samza as well as on its architecture, so it is important to have at least a glimpse of what it looks like before diving into Samza.

Kafka is an open-source project that LinkedIn released a few years ago. It is a messaging system that fulfills two needs: message queuing and log aggregation. All of LinkedIn's user activity, all the metrics and monitoring data, and even database changes go into it.

LinkedIn also has a specialized system called Databus, which models all of their databases as a stream. A database holds the latest data for each key-value pair, but as the database mutates, you can model that set of mutations as a stream, with each individual change becoming a message in that stream.

Because LinkedIn has Kafka and has been integrating with it for the past few years, a lot of the data at LinkedIn, almost all of it, is available in stream form rather than only in a database or on Hadoop.

Motivation for Building Samza

Chris mentions that when they began doing stream processing, with Kafka and all this data in their system, they started with something like a web service that would start up, read messages from Kafka, do some processing, and then write the messages back out. As they did this, they realized that a lot of problems needed to be solved to make it really useful and scalable.

Things like partitioning: how do you partition your stream? How do you partition your processor? How do you manage state, where state is essentially anything that you maintain in your processor between messages, such as a counter you increment every time a message comes in? How do you reprocess data? Failure semantics determine whether you get at-least-once, at-most-once, or exactly-once messaging. There is also non-determinism: if your stream processor interacts with another system, whether that's a database or a dependency on time or on the ordering of messages, how do you deal with the things that actually determine the output you end up sending?

Samza tries to address some of these problems.

Samza Architecture

The most basic element of Samza is a stream. The stream definition for Samza is much more rigid and heavyweight than you would expect from other stream-processing systems.
Other processing systems, such as Storm, tend to have very lightweight stream definitions to reduce latency, using anything from, say, UDP to a straight-up TCP connection. Samza goes in the other direction: it wants its streams to be, for starters, partitioned. It wants them to be ordered: if you read Message 3 and then Message 4, you are never going to get those inverted within a single partition. It also wants them to be replayable, which means you should be able to go back and reread a message at a later date. It wants them to be fault-tolerant: if a host serving Partition 1 disappears, the partition should still be readable on some other host. Also, the streams are usually infinite. Once you get to the
end – say, Message 6 of Partition 0 – you simply try to read the next message when it becomes available; it's not the case that you're finished. This definition maps very well to Kafka, which LinkedIn uses as the streaming infrastructure for Samza.

There are several concepts to understand within Samza. In a nutshell, they are:
• Streams: Samza processes streams. A stream is composed of immutable messages of a similar type or category. The actual implementation can be provided by a messaging system such as Kafka (where each topic becomes a Samza stream), a database (a table), or even Hadoop (a directory of files in HDFS). Concerns like message ordering and batching are handled by streams.
• Jobs: a Samza job is code that performs a logical transformation on a set of input streams in order to append messages to a set of output streams.
• Partitions: for scalability, each stream is broken into one or more partitions. Each partition is a totally ordered sequence of messages.
• Tasks: again for scalability, a job is distributed by breaking it into multiple tasks. Each task consumes data from one partition of each of the job's input streams.
• Containers: whereas partitions and tasks are logical units of parallelism, containers are units of physical parallelism. Each container is a Unix process (or Linux cgroup) and runs one or more tasks.
• TaskRunner: the TaskRunner is Samza's stream-processing container. It manages the startup, execution, and shutdown of one or more StreamTask instances.
• Checkpointing: checkpointing enables failure recovery. If a TaskRunner goes down for some reason (a hardware failure, for example), it should resume consuming messages where it left off when it comes back up; this is achieved via checkpointing.
• State management: data that needs to be carried over between the processing of different messages is called state; this can be something as simple as keeping a count or something a lot more complex. Samza allows tasks to maintain persistent, mutable, queryable state that is physically co-located with each task. The state is highly available: in the event of a task failure, it will be restored when the task fails over to another machine. The datastore is pluggable, but Samza comes with a key-value store out of the box.
• YARN (Yet Another Resource Negotiator) is Hadoop v2's biggest improvement over v1: it separates the MapReduce job tracker from resource management and enables MapReduce alternatives to use the same resource manager. Samza uses YARN for cluster management, tracking failures, and so on. Samza provides a YARN ApplicationMaster and a YARN job runner out of the box.

You can understand how the various components (YARN, Kafka, and the Samza API) interact by looking at the detailed architecture, and you can read the overall documentation to understand each component in detail. A minimal StreamTask sketch shown below, just before the conclusion, illustrates the shape of the task-level API.

Possible Improvements

One of the advantages of using something like YARN with Samza is that it potentially lets you run Samza on the same grid on which you already run your other tasks, such as test tasks and MapReduce tasks. You could use the same infrastructure for all of that. However, LinkedIn currently does not run Samza in a multi-framework environment, because the existing setup itself is quite experimental. To get to a more multi-framework environment, Chris says, the process isolation would have to get a little better.
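To make the concepts above concrete, here is a minimal sketch of a task written against Samza's low-level Java API (StreamTask, InitableTask, and the built-in key-value store). The class name, the topic names, and the store name "page-view-counts" are hypothetical, and the store and its serdes are assumed to be declared in the job's configuration; this is an illustrative sketch under those assumptions, not code from LinkedIn's systems.

```java
import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

/**
 * Hypothetical task that counts page views per member and emits the
 * running count to an output Kafka topic. Illustrative only.
 */
public class MemberPageViewCounter implements StreamTask, InitableTask {

  // Output stream: the "kafka" system name and the topic name are
  // assumptions that would have to match the job's configuration.
  private static final SystemStream OUTPUT =
      new SystemStream("kafka", "member-page-view-counts");

  // Local, key-value state co-located with the task (Samza's built-in store).
  private KeyValueStore<String, Integer> counts;

  @Override
  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    // "page-view-counts" is assumed to be declared as a store in the job config.
    counts = (KeyValueStore<String, Integer>) context.getStore("page-view-counts");
  }

  @Override
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    // Assume the upstream topic is keyed by member ID.
    String memberId = (String) envelope.getKey();

    // Update the per-member count in the local store.
    Integer current = counts.get(memberId);
    int updated = (current == null ? 0 : current) + 1;
    counts.put(memberId, updated);

    // Send the new count downstream; partition assignment and
    // checkpointing are handled by the framework, as described above.
    collector.send(new OutgoingMessageEnvelope(OUTPUT, memberId, updated));
  }
}
```

In a real job, a class like this would be referenced from the job's properties file (for example via task.class and task.inputs), and YARN would run one task instance per input partition inside the containers described above.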
Conclusion

Samza is a relatively young project incubating at Apache, so there is a lot of room to get involved. A good way to get started is with the hello-samza project, a small project that will get you up and running in about five minutes. It lets you play with a real-time change log from the Wikipedia servers, so you can see what's going on and have a stream of data to experiment with. Another stream-processing project that can run on Hadoop infrastructure is Apache Storm; you can also see a comparison between Samza and Storm.

ABOUT THE SPEAKER
Chris Riccomini is a Staff Software Engineer at LinkedIn, where he is currently working as a committer and PMC member for Apache Samza. He has been involved in a wide range of projects at LinkedIn, including "People You May Know", REST.li, Hadoop, engineering tooling, and OLAP systems. Prior to LinkedIn, he worked on data visualization and fraud modeling at PayPal.

WATCH THE FULL PRESENTATION ON InfoQ