How LinkedIn Uses Apache Samza
Apache Samza is a stream processor LinkedIn recently open-sourced. In his presentation, Samza: Real-time Stream Processing at LinkedIn, Chris Riccomini discusses Samza’s feature set, how Samza integrates with YARN and Kafka, how it’s used at LinkedIn, and what’s next on the roadmap. PAGE 17
Hadoop
eMag Issue 13 - May 2014
FACILITATING THE SPREAD OF KNOWLEDGE AND INNOVATION IN PROFESSIONAL SOFTWARE DEVELOPMENT
INTRODUCTION P. 3
BUILDING APPLICATIONS WITH HADOOP P. 5
WHAT IS APACHE TEZ? P. 9
MODERN HEALTHCARE ARCHITECTURES BUILT WITH HADOOP P. 14
Contents
Introduction	 Page 3
Apache Hadoop is an open-source framework that runs applications on large clustered
hardware (servers). It is designed to scale from a single server to thousands of machines, with
a very high degree of fault tolerance.
Building Applications With Hadoop 	 Page 5
When building applications using Hadoop, it is common to have input data from various
sources coming in various formats. In his presentation, “New Tools for Building Applications
on Apache Hadoop”, Eli Collins overviews how to build better products with Hadoop and
various tools that can help, such as Apache Avro, Apache Crunch, Cloudera ML and the
Cloudera Development Kit.
What is Apache Tez? 	 Page 9
Apache Tez is a new distributed execution framework that is targeted towards data-
processing applications on Hadoop. But what exactly is it? How does it work? In the
presentation, “Apache Tez: Accelerating Hadoop Query Processing”, Bikas Saha and Arun
Murthy discuss Tez’s design, highlight some of its features and share initial results obtained
by making Hive use Tez instead of MapReduce.
Modern Healthcare Architectures Built with Hadoop 	 Page 14
This article explores some specific use cases where Hadoop can play a major role in the health
care industry, as well as a possible reference architecture.
How LinkedIn Uses Apache Samza 	 Page 17
Apache Samza is a stream processor LinkedIn recently open-sourced. In his presentation,
Samza: Real-time Stream Processing at LinkedIn, Chris Riccomini discusses Samza’s feature
set, how Samza integrates with YARN and Kafka, how it’s used at LinkedIn, and what’s next
on the roadmap.
Introduction
by Roopesh Shenoy and Boris Lublinsky
content extracted from this InfoQ news post
We are living in the era of “big data”. With today’s technology powering increases
in computing power, electronic devices, and accessibility to the Internet, more data
than ever is being transmitted and collected. Organizations are producing data
at an astounding rate. Facebook alone collects 250 terabytes a day. According to
Thomson Reuters News Analytics, digital data production has more than doubled
from almost one zettabyte (a zettabyte is equal to 1 million petabytes) in 2009 and
is expected to reach 7.9 zettabytes in 2015, and 35 zettabytes in 2020.
As organizations have begun collecting and
producing massive amounts of data, they have
started to recognize the advantages of data analysis,
but they are also struggling to manage the massive
amounts of information they have. According to
Alistair Croll, “Companies that have massive amounts
of data without massive amounts of clue are going to
be displaced by startups that have less data but more
clue.…”
Unless your business understands the data it has,
it will not be able to compete with businesses that
do. Businesses realize that there are tremendous
benefits to be gained in analyzing big data related
to business competition, situational awareness,
productivity, science, and innovation – and most see
Hadoop as a main tool for analyzing their massive
amounts of information and mastering the big-data
challenges.
Apache Hadoop is an open-source framework that
runs applications on large clustered hardware
(servers). It is designed to scale from a single server
to thousands of machines, with a very high degree
of fault tolerance. Rather than relying on high-end
hardware, the reliability of these clusters comes from
the software’s ability to detect and handle failures of
its own.
According to a Hortonworks survey, many
large, mainstream organizations (50% of survey
respondents were from organizations with over
$500M in revenues) currently deploy Hadoop
across many industries including high-tech,
healthcare, retail, financial services, government, and
manufacturing.
In the majority of cases, Hadoop does not replace
existing data-processing systems but rather
complements them. It is typically used to supplement
existing systems to tap into additional business data
and a more powerful analytics system in order to
get a competitive advantage through better insights
into business information. Some 54% of respondents
are utilizing Hadoop to capture new types
of data, while 48% are planning to do the
same. The main new data types include the
following:
•	 Server logs data enabling IT
departments to better manage their
infrastructure (64% of respondents are
already doing it, while 28% plan to).
•	 Clickstream data enabling better
understanding of how customers
are using applications (52.3% of
respondents are already doing it, while
37.4% plan to).
•	 Social-media data enabling
understanding of the public’s
perception of the company (36.5% of
respondents are already doing it, while
32.5% plan to).
•	 Geolocation data enabling analysis of
travel patterns (30.8% of respondents
are already doing it, while 26.8% plan
to).
•	 Machine data enabling analysis of
machine usage (29.3% of respondents
are already doing it, while 33.3% plan
to).
According to the survey, traditional data
grows at an average rate of about 8%
a year but new data types are growing at
a rate exceeding 85% and, as a result, it is
virtually impossible to collect and process it
without Hadoop.
While Version 1 of Hadoop came with
the MapReduce processing model, the
recently released Hadoop Version 2
comes with YARN, which separates cluster
management from the MapReduce job
manager, making the Hadoop architecture
more modular. One of the side-effects is
that MapReduce is now just one of the ways
to leverage a Hadoop cluster; a number of
new projects such as Tez and Stinger can
also process data in the Hadoop Distributed
File System (HDFS) without the constraint
of the batch-job execution model of
MapReduce.
Building Applications With Hadoop
When building applications using Hadoop, it is common to have input data from
various sources coming in various formats. In his presentation, “New Tools for
Building Applications on Apache Hadoop”, Eli Collins, tech lead for Cloudera’s
Platform Team, overviews how to build better products with Hadoop and various
tools that can help, such as Apache Avro, Apache Crunch, Cloudera ML and the
Cloudera Development Kit.

Presentation transcript edited by Roopesh Shenoy
Avro
Avro is a project for data serialization; it is
similar to Thrift or Protocol Buffers. It’s expressive:
you can deal in terms of records, arrays, unions, and
enums. It’s efficient, with a compact binary
representation. One of the benefits of logging in
Avro is that you get much smaller data files. All the
traditional aspects of Hadoop data formats, like
compressible or splittable data, are true of Avro.
One of the reasons Doug Cutting (founder of the
Hadoop project) created the Avro project was that
a lot of the formats in Hadoop were Java only. It’s
important for Avro to be interoperable with a lot of
different languages like Java, C, C++, C#, Python,
Ruby, etc. – and to be usable by a lot of tools.
One of the goals for Avro is a set of formats and
serialization that’s usable throughout the data
platform that you’re using, not just in a subset of
the components. So MapReduce, Pig, Hive, Crunch,
Flume, Sqoop, etc. all support Avro.
Avro is dynamic and one of its neat features is that
you can read and write data without generating any
code. It will use reflection and look at the schema
that you’ve given it to create classes on the fly. That’s
called Avro-generic formats. You can also specify
formats for which Avro will generate optimal code.
Avro was designed with the expectation that you would
change your schema over time. That’s an important
attribute in a big-data system because you generate
lots of data, and you don’t want to constantly
reprocess it. You’re going to generate data at one
time and have tools process that data maybe two,
three, or four years down the line. Avro has the
ability to negotiate differences between schemata so
that new tools can read old data and vice versa.
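To make the generic (no code generation) path and this schema negotiation concrete, here is a minimal, self-contained sketch using the Avro Java API; the PageView record, its fields, and the added referrer field are invented for the example.

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroEvolutionSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical "old" writer schema and "new" reader schema; the new field has a default.
    Schema writerSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"PageView\",\"fields\":["
        + "{\"name\":\"url\",\"type\":\"string\"}]}");
    Schema readerSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"PageView\",\"fields\":["
        + "{\"name\":\"url\",\"type\":\"string\"},"
        + "{\"name\":\"referrer\",\"type\":\"string\",\"default\":\"unknown\"}]}");

    // Write a record generically, with no generated classes.
    GenericRecord oldRecord = new GenericData.Record(writerSchema);
    oldRecord.put("url", "/home");
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(writerSchema).write(oldRecord, encoder);
    encoder.flush();

    // Read the old bytes with the new schema; the missing field is filled with its default.
    GenericDatumReader<GenericRecord> reader =
        new GenericDatumReader<GenericRecord>(writerSchema, readerSchema);
    GenericRecord upgraded =
        reader.read(null, DecoderFactory.get().binaryDecoder(out.toByteArray(), null));
    System.out.println(upgraded); // {"url": "/home", "referrer": "unknown"}
  }
}

Because the reader schema carries a default for the new field, tools built against the newer schema can still consume data written years earlier, which is the property described above.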
Avro forms an important basis for the following
projects.
Crunch
You’re probably familiar with Pig and Hive and how
to process data with them and integrate valuable
tools. However, not all data formats that you use will
fit Pig and Hive.
Pig and Hive are great for a lot of logged data or
relational data, but other data types don’t fit as well.
You can still process poorly fitting data with Pig and
Hive, which don’t force you to a relational model or a
log structure, but you have to do a lot of work around
it. You might find yourself writing unwieldy user-
defined functions or doing things that are not natural
in the language. People sometimes just give up and
start writing raw Java MapReduce programs because
that’s easier.
Crunch was created to fill this gap. It’s a higher-level
API than MapReduce. It’s in Java. It’s lower level
than, say, Pig, Hive, Cascading, or other frameworks
you might be used to. It’s based on a paper that
Google published called FlumeJava. It’s a very
similar API. Crunch has you combine a small number
of primitives with a small number of types, effectively
allowing the user to create really lightweight
UDFs, which are just Java methods and classes, to
create complex data pipelines.
Crunch has a number of advantages.
•	 It’s just Java. You have access to a full
programming language.
•	 You don’t have to learn Pig.
•	 The type system is well-integrated. You can use
Java POJOs, but there’s also native support for
Hadoop Writables and Avro. There’s no impedance
mismatch between the Java code you’re writing
and the data that you’re analyzing.
•	 It’s built as a modular library for reuse. You can
capture your pipelines in Crunch code in Java and
then combine them with an arbitrary machine-learning
program later, so that someone else can reuse
that algorithm.
The fundamental structure is a parallel collection:
a distributed, unordered collection of elements.
This collection has a parallel do operator which you
can imagine turns into a MapReduce job. So if you
had a bunch of data that you want to operate in
parallel, you can use a parallel collection.
And there’s something called the parallel table,
which is a subinterface of the collection, and it’s
a distributed, sorted map. It also has a group-by
operator you can use to aggregate all the values for
a given key. We’ll go through an example that shows
how that works.
Finally, there’s a pipeline class and pipelines are really
for coordinating the execution of the MapReduce
jobs that will actually do the back-end processing for
this Crunch program.
Let’s take an example for which you’ve probably seen all
the Java code before, word count, and see what it looks
like in Crunch.
Crunch – word count

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.lib.Aggregate;
import org.apache.crunch.types.writable.Writables;

public class WordCount {
  public static void main(String[] args) throws Exception {
    Pipeline pipeline = new MRPipeline(WordCount.class);
    // A parallel collection of all the lines in the input file.
    PCollection<String> lines = pipeline.readTextFile(args[0]);
    // Split each line into words with a lightweight DoFn.
    PCollection<String> words = lines.parallelDo("my splitter", new DoFn<String, String>() {
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());
    // Aggregate the count for each word and write the results out.
    PTable<String, Long> counts = Aggregate.count(words);
    pipeline.writeTextFile(counts, args[1]);
    // Nothing executes until the pipeline is run.
    pipeline.run();
  }
}
It’s a lot smaller and simpler. The first line creates
a pipeline. We create a parallel collection of all the
lines from a given file by using the pipeline class. And
then we get a collection of words by running the
parallel do operator on these lines.
We’ve defined an anonymous function here that
processes the input: it splits each line into words
and emits each word from the map task.
Finally, we want to aggregate the counts for each
word and write them out. There’s a line at the
bottom, pipeline.run(). Crunch’s planner does lazy
evaluation; it won’t create and run the
MapReduce jobs until we’ve gotten a full pipeline
together.
If you’re used to programming Java and you’ve seen
the Hadoop examples for writing word count in Java,
you can tell that this is a more natural way to express
that. This is among the simplest pipelines you can
create, and you can imagine you can do many more
complicated things.
If you want to go even one step easier than this,
there’s a wrapper for Scala. This is a very similar idea to
Cascade, which was built on Google FlumeJava. Since
Scala runs on the JVM, it’s an obvious natural fit.
Scala’s type inference actually ends up being really
powerful in the context of Crunch.
The listing below is the same
program written in Scala. We have the pipeline and
we can use Scala’s built-in functions that map really
nicely to Crunch – so word count becomes a one-line
program. It’s pretty cool and very powerful if you’re
writing Java code already and want to do complex
pipelines.

Scrunch – Scala wrapper

class WordCountExample {
  val pipeline = new Pipeline[WordCountExample]
  def wordCount(fileName: String) = {
    pipeline.read(from.textFile(fileName))
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(!_.isEmpty())
      .count
  }
}
Cloudera ML
Cloudera ML (machine learning) is an open-source
library and set of tools to help data scientists perform
day-to-day tasks, from data preparation to
model evaluation.
With built-in commands for summarizing, sampling,
normalizing, and pivoting data, Cloudera ML has
recently added a built-in clustering algorithm
for k-means, based on an algorithm that was just
developed a year or two back. There are a couple of
other implementations as well. It’s a home for tools
you can use so you can focus on data analysis and
modeling instead of on building or wrangling the
tools.
It’s built using Crunch. It leverages a lot of existing
projects. For example, the vector formats: a lot of
ML involves transforming raw data that’s in a record
format to vector formats for machine-learning
algorithms. It leverages Mahout’s vector interface
and classes for that purpose. The record format is
just a thin wrapper around Avro and HCatalog record
and schema formats so you can easily integrate with
existing data sources.
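As a rough illustration of that record-to-vector step (this is not Cloudera ML’s actual code, and the record fields are invented), Mahout’s vector classes can be used directly like this:

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class RecordToVectorSketch {
  public static void main(String[] args) {
    // Hypothetical record: (age, height in cm, weight in kg) for one observation.
    double[] fields = {34.0, 178.0, 72.5};

    // Mahout's Vector interface is what the clustering algorithms consume.
    Vector v = new DenseVector(fields);

    // A trivial bit of data preparation: scale features by the largest value.
    Vector scaled = v.divide(v.maxValue());
    System.out.println(scaled);
  }
}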
For more information on Cloudera ML, visit the
project’s GitHub page; there are a bunch of examples
with datasets that can get you started.
Cloudera Development Kit
Like Cloudera ML, the Cloudera Development Kit (CDK) is a set
of open-source libraries and tools that make writing
applications on Hadoop easier. Unlike ML though,
it’s not focused on machine learning and the data-
scientist workflow. It’s directed at developers trying to build
applications on Hadoop. It’s really the plumbing of
a lot of different frameworks and pipelines and the
integration of a lot of different components.
The purpose of the CDK is to provide higher level
APIs on top of the existing Hadoop components
in the CDH stack that codify a lot of patterns in
common use cases.
CDK is prescriptive, has an opinion on the way to
do things, and tries to make it easy for you to do
the right thing by default, but its architecture is a
system of loosely coupled modules. You can use
modules independently of each other. It’s not an
uber-framework that you have to adopt whole. You
can adopt it piecemeal. It doesn’t force you into any
particular programming paradigms. It doesn’t force
you to adopt a ton of dependencies. You can adopt
only the dependencies of the particular modules you
want.
Let’s look at an example. The first module in
CDK is the data module, and the goal of the data
module is to make it easier for you to work with
datasets on Hadoop file systems. There are a lot
of gory details to clean up to make this work in
practice; you have to worry about serialization,
deserialization, compression, partitioning, directory
layout, communicating that directory layout and
partitioning to other people who want to consume
the data, etc.
The CDK data module handles all this for you. It
automatically serializes and deserializes data from
Java POJOs, if that’s what you have, or Avro records
if you use them. It has built-in compression, and
built-in policies around file and directory layouts so
that you don’t have to repeat a lot of these decisions
and you get smart policies out of the box. It will
automatically partition data within those layouts.
It lets you focus on working on a dataset on HDFS
instead of all the implementation details. It also has
plugin providers for existing systems.
Imagine you’re already using Hive and HCatalog
as a metadata repository, and you’ve already got a
schema for what these files look like. CDK integrates
with that. It doesn’t require you to define all of
your metadata for your entire data repository from
scratch. It integrates with existing systems.
You can learn more about the various CDK modules
and how to use them in the documentation.
In summary, working with data from various sources,
preparing and cleansing data and processing them
via Hadoop involves a lot of work. Tools such as
Crunch, Cloudera ML and CDK make it easier to do
this and leverage Hadoop more effectively.
ABOUT THE SPEAKER
Eli Collins is the tech lead for
Cloudera’s Platform team, an active
contributor to Apache Hadoop and
member of its project management
committee (PMC) at the Apache
Software Foundation. Eli holds
Bachelor’s and Master’s degrees in
Computer Science from New York
University and the University of
Wisconsin-Madison, respectively.
WATCH THE FULL
PRESENTATION ON InfoQ
What is Apache Tez?
You might have heard of Apache Tez, a new distributed execution framework that
is targeted towards data-processing applications on Hadoop. But what exactly is
it? How does it work? Who should use it and why? In their presentation, “Apache
Tez: Accelerating Hadoop Query Processing”, Bikas Saha and Arun Murthy discuss
Tez’s design, highlight some of its features and share some of the initial results
obtained by making Hive use Tez instead of MapReduce.
Tez generalizes the MapReduce paradigm to a
more powerful framework based on expressing
computations as a dataflow graph. Tez is not meant
directly for end-users – in fact it enables developers
to build end-user applications with much better
performance and flexibility. Hadoop has traditionally
been a batch-processing platform for large amounts
of data. However, there are a lot of use cases for
near-real-time performance of query processing.
There are also several workloads, such as Machine Learning, which do not fit well into the MapReduce paradigm. Tez helps Hadoop address these use cases.
The Tez project aims to be highly customizable
so that it can meet a broad spectrum of use cases
without forcing people to go out of their way to
make things work; projects such as Hive and Pig are
seeing significant improvements in response times
when they use Tez instead of MapReduce as the
backbone for data processing. Tez is built on top
of YARN, which is the new resource-management
framework for Hadoop.
Presentation transcript edited by Roopesh Shenoy
Design Philosophy
The main reason for Tez to exist is to get around
limitations imposed by MapReduce. Other than
being limited to writing mappers and reducers, there
are other inefficiencies in force-fitting all kinds of
computations into this paradigm – e.g., HDFS
is used to store temporary data between multiple
MR jobs, which is an overhead. (In Hive, this is
common when queries require multiple shuffles on
keys without correlation, such as with join - grp by -
window function - order by.)
The key elements forming the design philosophy
behind Tez are:
•	 Empowering developers (and hence end users) to
do what they want in the most efficient manner
•	 Better execution performance
Some of the things that help Tez achieve these goals
are:
•	 Expressive Dataflow APIs - The Tez team wants
to have an expressive-dataflow-definition API so
that you can describe the Directed Acyclic Graph
(DAG) of computation that you want to run. For
this, Tez has a structural kind of API in which you
add all processors and edges and visualize what
you are actually constructing.
•	 Flexible Input-Processor-Output runtime model
– can construct runtime executors dynamically
by connecting different inputs, processors and
outputs.
•	 Data type agnostic – only concerned with
movement of data, not with the data format (key-
value pairs, tuple oriented formats, etc)
•	 Dynamic Graph Reconfiguration
•	 Simple Deployment – Tez is completely a client-
side application; it leverages YARN local resources
and the distributed cache. There’s no need to deploy
anything on your cluster as far as using Tez is
concerned. You just upload the relevant Tez
libraries to HDFS, then use your Tez client to
submit jobs with those libraries.
You can even have two copies of the libraries on
your cluster. One would be a production copy,
which is the stable version and which all your
production jobs use. Your users can experiment
with a second copy, the latest version of Tez. And
they will not interfere with each other.
Tez can run any MR job without any modification.
This allows for stage-wise migration of tools that
currently depend on MR.
Exploring the Expressive Dataflow APIs in detail –
what can you do with this? For example, instead of using
multiple MapReduce jobs, you can use the MRR
pattern, such that a single map has multiple reduce
stages; this can allow streaming of data from one
processor to another to another, without writing
anything to HDFS (it will be written to disk only for
check-pointing), leading to much better performance.
The diagrams below demonstrate this:
The first diagram demonstrates a process that has
multiple MR jobs, each storing intermediate results
to the HDFS – the reducers of the previous step
feeding the mappers of the next step. The second
diagram shows how with Tez, the same processing
can be done in just one job, with no need to access
HDFS in between.
Tez’s flexibility means that it requires a bit more
effort than MapReduce to start consuming; there’s a
bit more API and a bit more processing logic that you
need to implement. This is fine since it is not an end-
user application like MapReduce; it is designed to let
developers build end-user applications on top of it.
Given that overview of Tez and its broad goals, let’s
try to understand the actual APIs.
Tez API
The Tez API has the following components:
•	 DAG (Directed Acyclic Graph) – defines the
overall job. One DAG object corresponds to one
job
•	 Vertex – defines the user logic along with the
resources and the environment needed to
execute the user logic. One Vertex corresponds
to one step in the job
•	 Edge – defines the connection between producer
and consumer vertices.
Edges need to be assigned properties; these
properties are essential for Tez to be able to
expand that logical graph at runtime into the
physical set of tasks that can be done in parallel
on the cluster. There are several such properties:
•	 The data-movement property defines how
data moves from a producer to a consumer.
•	 Scheduling properties (sequential or
concurrent) help define when the
producer and consumer tasks can be
scheduled relative to each other.
•	 The data-source property (persisted, reliable, or
ephemeral) defines the lifetime or durability
of the output produced by our task so that
we can determine when we can terminate it.
You can view this Hortonworks article to see an
example of the API in action, more detail about these
properties and how the logical graph expands at run-
time.
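As a purely conceptual sketch (these are not the real Tez classes; the names and enums simply mirror the structure described above), a logical DAG with typed edges might be modeled like this:

import java.util.ArrayList;
import java.util.List;

// Conceptual model only: a DAG of vertices connected by edges that carry the
// data-movement, scheduling, and data-source properties described above.
public class LogicalDagSketch {

  enum DataMovement { ONE_TO_ONE, BROADCAST, SCATTER_GATHER }
  enum Scheduling { SEQUENTIAL, CONCURRENT }
  enum DataSource { PERSISTED, RELIABLE, EPHEMERAL }

  static class Vertex {
    final String name; // one step of the job, e.g. a map or reduce stage
    Vertex(String name) { this.name = name; }
  }

  static class Edge {
    final Vertex producer;
    final Vertex consumer;
    final DataMovement movement;
    final Scheduling scheduling;
    final DataSource source;

    Edge(Vertex producer, Vertex consumer,
         DataMovement movement, Scheduling scheduling, DataSource source) {
      this.producer = producer;
      this.consumer = consumer;
      this.movement = movement;
      this.scheduling = scheduling;
      this.source = source;
    }
  }

  public static void main(String[] args) {
    // A map -> reduce -> reduce (MRR) style logical graph.
    Vertex map = new Vertex("Map");
    Vertex reduce1 = new Vertex("Reduce-1");
    Vertex reduce2 = new Vertex("Reduce-2");

    List<Edge> edges = new ArrayList<Edge>();
    edges.add(new Edge(map, reduce1,
        DataMovement.SCATTER_GATHER, Scheduling.SEQUENTIAL, DataSource.PERSISTED));
    edges.add(new Edge(reduce1, reduce2,
        DataMovement.SCATTER_GATHER, Scheduling.SEQUENTIAL, DataSource.PERSISTED));

    System.out.println("Logical DAG with " + edges.size() + " edges defined");
  }
}

In the real API, the DAG, Vertex, and Edge objects play these roles, and Tez expands such a logical graph into the physical set of parallel tasks at runtime using exactly these edge properties.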
The runtime API is based on an input-processor-
output model which allows all inputs and outputs
to be pluggable. To facilitate this, Tez uses an event-
based model in order to communicate between tasks
and the system, and between various components.
Events are used to pass information such as task
failures to the required components, to describe the flow of
data from an Output to an Input (such as the location of the
data it generates), to enable run-time changes to the
DAG execution plan, etc.
Tez also comes with various Input and Output
processors out-of-the-box.
The expressive API allows higher-level language (such as
Hive) writers to elegantly transform their queries
into Tez jobs.
Tez Scheduler
The Tez scheduler considers a lot of things when
deciding on task assignments – task-locality
requirements, compatibility of containers, total
available resources on the cluster, priority of
pending task requests, automatic parallelization,
freeing up resources that the application cannot
use anymore (because the data is not local to it) etc.
It also maintains a connection pool of pre-warmed
JVMs with shared registry objects. The application
can choose to store different kinds of pre-computed
information in those shared registry objects so that
they can be reused without having to recompute
them later on, and this shared set of connections and
container-pool resources can run those tasks very
fast.
You can read more about reusing of containers in
Apache Tez.
Flexibility
Overall, Tez provides a great deal of flexibility for
developers to deal with complex processing logic.
This can be illustrated with one example of how Hive
is able to leverage Tez.
Let’s take this typical TPC-DS query pattern in which
you are joining multiple tables with a fact table. Most
optimizers and query systems can do what is there
in the top-right corner: if the dimension tables are
small, then they can broadcast-join all of them with
the large fact table, and you can do that same thing
on Tez.
But what if these broadcasts have user-defined
functions that are expensive to compute? You may
not be able to do all of that this way. You may have to
break up your tasks into different stages, and that’s
what the left-side topology shows you. The first
dimension table is broadcast-joined with the fact
table. The result is then broadcast-joined with the
second dimension table.
Here, the third dimension table is not broadcastable
because it is too large. You can choose to do a shuffle
join, and Tez can efficiently navigate the topology
without falling over just because you can’t do the
top-right one.
The two benefits for this kind of Hive query with Tez
are:
•	 it gives you full DAG support and does a lot
automatically on the cluster so that it can fully
utilize the parallelism that is available in the
cluster; as already discussed above, this means
there is no need for reading/writing from HDFS
between multiple MR jobs, all the computation
can be done in a single Tez job.
•	 it provides sessions and reusable containers so
that you have low latency and can avoid
recomputation as much as possible.
This particular Hive query is seeing a performance
improvement of more than 100% with the new Tez
engine.
Roadmap
•	 Richer DAG support. For example, can Samza
use Tez as a substrate on which to build the
application? It needs some support in order
for Tez to handle Samza’s core scheduling and
streaming requirements. The Tez team wants
to explore how we would enable those kinds of
connection patterns in our DAGs. They also want
more fault-tolerance support, more efficient data
transfer for further performance optimization,
and improved session performance.
•	 Given that these DAGs can get arbitrarily
complex, we need a lot of automatic tooling to
help the users understand their performance
bottlenecks.
Summary
Tez is a distributed execution framework that works
on computations represented as dataflow graphs. It
maps naturally to higher-level declarative languages
like Hive, Pig, Cascading, etc. It’s designed to have a
highly customizable execution architecture so that
we can make dynamic performance optimizations
at runtime based on real information about the data
and the resources. The framework itself automatically
determines a lot of the hard stuff, allowing it to work
right out-of-the-box.
You get good performance and efficiency out-of-the-
box. Tez aims to address the broad spectrum of use
cases in the data-processing domain in Hadoop,
ranging from latency to complexity of the execution. It
is an open-source project. Tez works, Saha and Murthy
suggest, and is already being used by Hive and Pig.
ABOUT THE SPEAKERS
Arun Murthy is the lead of the
MapReduce project in Apache
Hadoop where he has been a full-time
contributor to Apache Hadoop since
its inception in 2006. He is a long-time
committer and member of the Apache
Hadoop PMC and jointly holds the
current world sorting record using
Apache Hadoop. Prior to co-founding
Hortonworks, Arun was responsible for
all MapReduce code and configuration
deployed across the 42,000+ servers at
Yahoo!
Bikas Saha has been working on
Apache Hadoop for over a year and is a
committer on the project. He has been
a key contributor in making Hadoop run
natively on Windows and has focused
on YARN and the Hadoop compute
stack. Prior to Hadoop, he has worked
extensively on the Dryad distributed
data processing framework that runs on
some of the world’s largest clusters as
part of Microsoft’s Bing infrastructure.
WATCH THE FULL
PRESENTATION ON InfoQ
Modern Healthcare Architectures
Built with Hadoop
We have heard plenty in the news lately about healthcare challenges and the
difficult choices faced by hospital administrators, technology and pharmaceutical
providers, researchers, and clinicians. At the same time, consumers are
experiencing increased costs without a corresponding increase in health security or
in the reliability of clinical outcomes.
One key obstacle in the healthcare market is data
liquidity (for patients, practitioners and payers) and
some are using Apache Hadoop to overcome this
challenge, as part of a modern data architecture.
This post describes some healthcare use cases, a
healthcare reference architecture and how Hadoop
can ease the pain caused by poor data liquidity.
New Value Pathways for Healthcare
In January 2013, McKinsey & Company published
a report named “The ‘Big Data’ Revolution in
Healthcare”. The report points out how big data
is creating value in five “new value pathways”
allowing data to flow more freely. Below we present
a summary of these five new value pathways and
an example of how Hadoop can be used to address
each. Thanks to the Clinical Informatics Group at UC
Irvine Health for many of the use cases, described in
their UCIH case study.
by Justin Sears
Right Living
Benefit: Patients can build value by taking an active role in their own treatment, including disease prevention.
Hadoop Use Case: Predictive Analytics: Heart patients weigh themselves at home with scales that transmit data wirelessly to their health center. Algorithms analyze the data and flag patterns that indicate a high risk of readmission, alerting a physician.

Right Care
Benefit: Patients get the most timely, appropriate treatment available.
Hadoop Use Case: Real-time Monitoring: Patient vital statistics are transmitted from wireless sensors every minute. If vital signs cross certain risk thresholds, staff can attend to the patient immediately.

Right Provider
Benefit: Provider skill sets matched to the complexity of the assignment – for instance, nurses or physicians’ assistants performing tasks that do not require a doctor. Also the specific selection of the provider with the best outcomes.
Hadoop Use Case: Historical EMR Analysis: Hadoop reduces the cost to store data on clinical operations, allowing longer retention of data on staffing decisions and clinical outcomes. Analysis of this data allows administrators to promote individuals and practices that achieve the best results.
Right Value
Benefit: Ensure cost-effectiveness of care, such as tying provider reimbursement to patient outcomes, or eliminating fraud, waste, or abuse in the system.
Hadoop Use Case: Medical Device Management: For biomedical device maintenance, use geolocation and sensor data to manage medical equipment. The biomedical team can know where all the equipment is, so they don’t waste time searching for an item. Over time, determine the usage of different devices, and use this information to make rational decisions about when to repair or replace equipment.

Right Innovation
Benefit: The identification of new therapies and approaches to delivering care, across all aspects of the system. Also improving the innovation engines themselves.
Hadoop Use Case: Research Cohort Selection: Researchers at teaching hospitals can access patient data in Hadoop for cohort discovery, then present the anonymous sample cohort to their Internal Review Board for approval, without ever having seen uniquely identifiable information.
Source: The ‘Big Data’ Revolution in Healthcare. McKinsey & Company, January 2013.
At Hortonworks, we see our healthcare customers ingest and analyze data from many sources. The following
reference architecture is an amalgam of Hadoop data patterns that we’ve seen with our customers’ use of
Hortonworks Data Platform (HDP). Components shaded green are part of HDP.
Sources of Healthcare Data
Source data comes from:
•	 Legacy Electronic Medical Records (EMRs)
•	 Transcriptions
•	 PACS
•	 Medication Administration
•	 Financial
•	 Laboratory (e.g. SunQuest, Cerner)
•	 RTLS (for locating medical equipment & patient
throughput)
•	 Bio Repository
•	 Device Integration (e.g. iSirona)
•	 Home Devices (e.g. scales and heart monitors)
•	 Clinical Trials
•	 Genomics (e.g. 23andMe, Cancer Genomics Hub)
•	 Radiology (e.g. RadNet)
•	 Quantified Self Sensors (e.g. Fitbit, SmartSleep)
•	 Social Media Streams (e.g. FourSquare, Twitter)
Loading Healthcare Data
Apache Sqoop is included in Hortonworks Data
Platform as a tool to transfer data between external
structured data stores (such as Teradata, Netezza,
MySQL, or Oracle) and HDFS or related systems like
Hive and HBase. We also see our customers using
other tools or standards for loading healthcare data
into Hadoop. Some of these are:
•	 Health Level 7 (HL7) International Standards
•	 Apache UIMA
•	 Java ETL rules
Processing Healthcare Data
Depending on the use case, healthcare organizations
process data in batch (using Apache Hadoop
MapReduce and Apache Pig); interactively (with
Apache Hive); online (with Apache HBase) or
streaming (with Apache Storm).
Analyzing Healthcare Data
Once data is stored and processed in Hadoop it
can either be analyzed in the cluster or exported to
relational data stores for analysis there. These data
stores might include:
•	 Enterprise data warehouse
•	 Quality data mart
•	 Surgical data mart
•	 Clinical info data mart
•	 Diagnosis data mart
•	 Neo4j graph database
Many data analysis and visualization applications
can also work with the data directly in Hadoop.
Hortonworks healthcare customers typically use the
following business intelligence and visualization tools
to inform their decisions:
•	 Microsoft Excel
•	 Tableau
•	 RESTful Web Services
•	 EMR Real-time analytics
•	 Metric Insights
•	 Patient Scorecards
•	 Research Portals
•	 Operational Dashboard
•	 Quality Dashboards
The following diagram shows how healthcare
organizations can integrate Hadoop into their
existing data architecture to create a modern data
architecture that is interoperable and familiar, so that
the same team of analysts and practitioners can use
their existing skills in new ways:
As more and more healthcare organizations adopt
Hadoop to disseminate data to their teams and
partners, they empower caregivers to combine their
training, intuition, and professional experience with
big data to make data-driven decisions that cure
patients and reduce costs.
Watch our blog in the coming weeks as we share
reference architectures for other industry verticals.
READ THIS ARTICLE ON
Hortonworks.com
ABOUT THE AUTHOR
Justin Sears is an experienced
marketing manager with sixteen years
leading teams to create and position
enterprise software, risk-controlled
consumer banking products, desktop
and mobile web properties, and
services for Latino customers in the US
and Latin America. Expert in enterprise
big data use cases for Apache Hadoop.
How LinkedIn Uses Apache Samza
Apache Samza is a stream processor LinkedIn recently open-sourced. In
his presentation, Samza: Real-time Stream Processing at LinkedIn, Chris
Riccomini discusses Samza’s feature set, how Samza integrates with YARN and
Kafka, how it’s used at LinkedIn, and what’s next on the roadmap.
The bulk of the processing that happens at LinkedIn is RPC-
style data processing, where one expects a very fast
response. On the other end of their response-latency
spectrum, they have batch processing, for which they
use Hadoop quite a bit. Hadoop processing and
batch processing typically happen after the fact,
often hours later.
There’s this gap between synchronous RPC
processing, where the user is actively waiting for a
response, and this Hadoop-style processing, which,
despite efforts to shrink it, still takes a long time to
run through.
Presentation transcript edited by Roopesh Shenoy
That’s where Samza fits in. This is where we can
process stuff asynchronously, but we’re also not
waiting for hours. It typically operates in the order of
milliseconds to minutes. The idea is to process stuff
relatively quickly and get the data back to wherever
it needs to be, whether that’s a downstream system
or some real-time service.
Chris mentions that right now, this stream processing
is the worst-supported in terms of tooling and
environment.
LinkedIn sees a lot of use cases for this type of
processing –
•	 Newsfeed displays when people move to another
company, when they like an article, when they
join a group, et cetera.
News is latency-sensitive and if you use Hadoop to
batch-compute it, you might be getting responses
hours or maybe even a day later. It is important to get
trending articles in News pretty quickly.
•	 Advertising – getting relevant advertisements, as
well as tracking and monitoring ad display, clicks
and other metrics
•	 Sophisticated monitoring that allows performing
complex queries like “the top five slowest pages
for the last minute.”
Existing Ecosystem at LinkedIn
The existing ecosystem at LinkedIn has had a huge
influence on the motivation behind Samza as well
as its architecture. Hence it is important to have at
least a glimpse of what this looks like before diving
into Samza.
Kafka is an open-source project that LinkedIn
released a few years ago. It is a messaging system
that fulfills two needs – message-queuing and log
aggregation. All of LinkedIn’s user activity, all the
metrics and monitoring data, and even database
changes go into this.
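For a flavor of what “going into Kafka” looks like from application code, here is a minimal sketch using the Kafka Java producer client; the broker address, topic name, and payload are invented for the example.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ActivityEventProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    // Hypothetical broker address; keys and values are sent as plain strings here.
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");

    Producer<String, String> producer = new KafkaProducer<String, String>(props);
    // Keying by member id sends all of a member's events to the same partition.
    producer.send(new ProducerRecord<String, String>(
        "user-activity", "member-42", "{\"event\":\"page_view\",\"url\":\"/home\"}"));
    producer.close();
  }
}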
LinkedIn also has a specialized system
called Databus, which models all of their databases
as a stream. It is like a database with the latest data
for each key-value pair. But as this database mutates,
you can actually model that set of mutations as a
stream. Each individual change is a message in that
stream.
Because LinkedIn has Kafka and because they’ve
integrated with it for the past few years, a lot of data
at LinkedIn, almost all of it, is available in a stream
format as opposed to a data format or on Hadoop.
Motivation for Building Samza
Chris mentions that when they began doing stream
processing, with Kafka and all this data in their
system, they started with something like a web
service that would start up, read messages from
Kafka and do some processing, and then write the
messages back out.
As they did this, they realized that there were a lot of
problems that needed to be solved in order to make
it really useful and scalable. Things like partitioning:
how do you partition your stream? How do you
partition your processor? How do you manage state,
where state is defined essentially as something that
you maintain in your processor between messages,
or things like count if you’re incrementing a counter
every time a message comes in. How do you re-
process?
With failure semantics, you get at least once, at most
once, exactly once messaging. There is also non-
determinism. If your stream processor is interacting
with another system, whether it’s a database or it’s
depending on time or the ordering of messages,
how do you deal with the things that actually determine the
output that you will end up sending?
Samza tries to address some of these problems.
Samza Architecture
The most basic element of Samza is a stream. The
stream definition for Samza is much more rigid and
heavyweight than you would expect from other
stream processing systems. Other processing
systems, such as Storm, tend to have very lightweight
stream definitions to reduce latency, everything
from, say, UDP to a straight-up TCP connection.
Samza goes the other direction. It wants its streams
to be, for starters, partitioned. It wants them to be
ordered. If you read Message 3 and then Message
4, you are never going to get those inverted within
a single partition. It also wants them to be replayable,
which means you should be able to go back to reread
a message at a later date. It wants them to be fault-
tolerant. If a host from Partition 1 disappears, it
should still be readable on some other hosts. Also,
the streams are usually infinite. Once you get to the
end – say, Message 6 of Partition 0 – you would just
try to reread the next message when it’s available.
It’s not the case that you’re finished.
This definition maps very well to Kafka, which
LinkedIn uses as the streaming infrastructure for
Samza.
There are many concepts to understand within
Samza. In brief, they are:
•	 Streams – Samza processes streams. A stream
is composed of immutable messages of a similar
type or category. The actual implementation
can be provided via a messaging system such
as Kafka (where each topic becomes a Samza
Stream) or a database (table) or even Hadoop (a
directory of files in HDFS)
•	 Things like message ordering and batching are
handled via streams.
•	 Jobs – a Samza job is code that performs logical
transformation on a set of input streams to
append messages to a set of output streams
•	 Partitions – For scalability, each stream is broken
into one or more partitions. Each partition is a
totally ordered sequence of messages
•	 Tasks – again for scalability, a job is distributed
by breaking it into multiple tasks. The task
consumes data from one partition for each of the
job’s input streams
•	 Containers – whereas partitions and tasks are
logical units of parallelism, containers are the unit
of physical parallelism. Each container is a Unix
process (or Linux cgroup) and runs one or more
tasks.
•	 TaskRunner – the TaskRunner is Samza’s stream-
processing container. It manages the startup,
execution, and shutdown of one or more
StreamTask instances (a minimal StreamTask
sketch follows this list).
•	 Checkpointing – Checkpointing is generally done
to enable failure recovery. If a TaskRunner goes
down for some reason (e.g., a hardware failure),
when it comes back up, it should start consuming
messages where it left off – this is achieved
via checkpointing.
•	 State management – Data that needs to be
passed between processing of different messages
can be called state – this can be something as
simple as keeping a count or something a lot
more complex. Samza allows tasks to maintain
persistent, mutable, queryable state that is
physically co-located with each task. The state
is highly available: in the event of a task failure
it will be restored when the task fails over to
another machine.
This datastore is pluggable, but Samza comes with a
key-value store out-of-the-box.
•	 YARN (Yet Another Resource Negotiator) is
Hadoop v2’s biggest improvement over v1 – it
separates the MapReduce job tracker from
resource management and enables MapReduce
alternatives to use the same resource manager.
Samza utilizes YARN to do cluster management,
tracking failures, etc.
Samza provides a YARN ApplicationMaster and a
YARN job runner out of the box.
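To make the task and state concepts above concrete, here is a minimal sketch of a Samza StreamTask that keeps a count per key in the local key-value store; the stream names and the store name are invented and would have to match the job’s configuration.

import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class PageViewCounterTask implements StreamTask, InitableTask {
  // Hypothetical output stream; "kafka" is the system name defined in the job config.
  private static final SystemStream OUTPUT = new SystemStream("kafka", "page-view-counts");

  private KeyValueStore<String, Integer> counts;

  @Override
  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    // The store name must match a store declared in the job's configuration.
    counts = (KeyValueStore<String, Integer>) context.getStore("page-view-counts");
  }

  @Override
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    // One task instance handles one partition; process() is called per message.
    String page = (String) envelope.getMessage();
    Integer current = counts.get(page);
    int updated = (current == null ? 0 : current) + 1;

    // Persist the new count locally and emit it downstream.
    counts.put(page, updated);
    collector.send(new OutgoingMessageEnvelope(OUTPUT, page, updated));
  }
}

The job configuration names the task class, declares the input streams, and defines the store; the TaskRunner then instantiates one task per input partition, calls process() for every message, and checkpoints offsets as described above.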
You can understand how the various components
(YARN, Kafka and Samza API) interact by
looking at the detailed architecture. Also read
the overall documentation to understand each
component in detail.
Possible Improvements
One of the advantages of using something like YARN
with Samza is that it enables you to potentially run
Samza on the same grid that you already run your
draft tasks, test tasks, and MapReduce tasks. You
could use the same infrastructure for all of that.
However, LinkedIn currently does not run Samza in a
multi-framework environment because the existing
setup itself is quite experimental.
In order to get into a more multi-framework
environment, Chris says that the process isolation
would have to get a little better.
Conclusion
Samza is a relatively young project incubating at
Apache so there’s a lot of room to get involved. A good
way to get started is with the hello-samza project,
which is a little thing that will get you up and running
in about five minutes. It will let you play with a real-
time change log from the Wikipedia servers to let you
figure out what’s going on and give you a stream of
stuff to play with.
The other stream-processing project built on top of
Hadoop is Storm. You can see a comparison between
Samza and Storm.
ABOUT THE SPEAKER
Chris Riccomini is a Staff Software
Engineer at LinkedIn, where he is
currently working as a committer
and PMC member for Apache Samza.
He’s been involved in a wide range of
projects at LinkedIn, including, “People
You May Know”, REST.li, Hadoop,
engineering tooling, and OLAP systems.
Prior to LinkedIn, he worked on data
visualization and fraud modeling at
PayPal.
WATCH THE FULL
PRESENTATION ON InfoQ

Hadoop data-lake-white-paperSupratim Ray
 
Comparison among rdbms, hadoop and spark
Comparison among rdbms, hadoop and sparkComparison among rdbms, hadoop and spark
Comparison among rdbms, hadoop and sparkAgnihotriGhosh2
 
Bigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampBigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampSpotle.ai
 
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...Imam Raza
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHitendra Kumar
 
Big data: Descoberta de conhecimento em ambientes de big data e computação na...
Big data: Descoberta de conhecimento em ambientes de big data e computação na...Big data: Descoberta de conhecimento em ambientes de big data e computação na...
Big data: Descoberta de conhecimento em ambientes de big data e computação na...Rio Info
 
Where does hadoop come handy
Where does hadoop come handyWhere does hadoop come handy
Where does hadoop come handyPraveen Sripati
 

Semelhante a Hadoop (20)

Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data science
 
Unstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelUnstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus Model
 
Case study on big data
Case study on big dataCase study on big data
Case study on big data
 
Hadoop Overview
Hadoop OverviewHadoop Overview
Hadoop Overview
 
Hadoop data-lake-white-paper
Hadoop data-lake-white-paperHadoop data-lake-white-paper
Hadoop data-lake-white-paper
 
Big data with java
Big data with javaBig data with java
Big data with java
 
Comparison among rdbms, hadoop and spark
Comparison among rdbms, hadoop and sparkComparison among rdbms, hadoop and spark
Comparison among rdbms, hadoop and spark
 
Hadoop
HadoopHadoop
Hadoop
 
INFO491FinalPaper
INFO491FinalPaperINFO491FinalPaper
INFO491FinalPaper
 
Bigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampBigdata and Hadoop Bootcamp
Bigdata and Hadoop Bootcamp
 
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
 
CSB_community
CSB_communityCSB_community
CSB_community
 
00 hadoop welcome_transcript
00 hadoop welcome_transcript00 hadoop welcome_transcript
00 hadoop welcome_transcript
 
Big data
Big dataBig data
Big data
 
Hadoop
HadoopHadoop
Hadoop
 
BIG DATA HADOOP
BIG DATA HADOOPBIG DATA HADOOP
BIG DATA HADOOP
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log Processing
 
Big data: Descoberta de conhecimento em ambientes de big data e computação na...
Big data: Descoberta de conhecimento em ambientes de big data e computação na...Big data: Descoberta de conhecimento em ambientes de big data e computação na...
Big data: Descoberta de conhecimento em ambientes de big data e computação na...
 
Where does hadoop come handy
Where does hadoop come handyWhere does hadoop come handy
Where does hadoop come handy
 
tools
toolstools
tools
 

Mais de Manuel Vargas

Mi estrategias aprendizaje (lo) ok
Mi estrategias aprendizaje (lo) okMi estrategias aprendizaje (lo) ok
Mi estrategias aprendizaje (lo) okManuel Vargas
 
Mi lecturas complementarias_ok
Mi lecturas complementarias_okMi lecturas complementarias_ok
Mi lecturas complementarias_okManuel Vargas
 
Mii criterios de_selección_recursos_digitales (lo)
Mii criterios de_selección_recursos_digitales (lo)Mii criterios de_selección_recursos_digitales (lo)
Mii criterios de_selección_recursos_digitales (lo)Manuel Vargas
 
Mii guia buenas_practicas (lo)
Mii guia buenas_practicas (lo)Mii guia buenas_practicas (lo)
Mii guia buenas_practicas (lo)Manuel Vargas
 
Mii lección 1 ok(1)
Mii lección 1 ok(1)Mii lección 1 ok(1)
Mii lección 1 ok(1)Manuel Vargas
 
Mii lección 2 ok(1)
Mii lección 2 ok(1)Mii lección 2 ok(1)
Mii lección 2 ok(1)Manuel Vargas
 
Mii criterios de_selección_recursos_digitales (lo)(1)
Mii criterios de_selección_recursos_digitales (lo)(1)Mii criterios de_selección_recursos_digitales (lo)(1)
Mii criterios de_selección_recursos_digitales (lo)(1)Manuel Vargas
 
Mi guia buenas_practicas
Mi guia buenas_practicasMi guia buenas_practicas
Mi guia buenas_practicasManuel Vargas
 
Mi criterios de_selección_recursos_digitales
Mi criterios de_selección_recursos_digitalesMi criterios de_selección_recursos_digitales
Mi criterios de_selección_recursos_digitalesManuel Vargas
 
Mi guia buenas_practicas
Mi guia buenas_practicasMi guia buenas_practicas
Mi guia buenas_practicasManuel Vargas
 
Mi lecturas complementarias_ok
Mi lecturas complementarias_okMi lecturas complementarias_ok
Mi lecturas complementarias_okManuel Vargas
 
Mi ejemplo actividad_n°2
Mi ejemplo actividad_n°2Mi ejemplo actividad_n°2
Mi ejemplo actividad_n°2Manuel Vargas
 

Mais de Manuel Vargas (20)

M0 glosario ok
M0 glosario okM0 glosario ok
M0 glosario ok
 
Mi lección 1 ok
Mi lección 1 okMi lección 1 ok
Mi lección 1 ok
 
Mi lección 2 ok
Mi lección 2 okMi lección 2 ok
Mi lección 2 ok
 
Mi estrategias aprendizaje (lo) ok
Mi estrategias aprendizaje (lo) okMi estrategias aprendizaje (lo) ok
Mi estrategias aprendizaje (lo) ok
 
Mi lecturas complementarias_ok
Mi lecturas complementarias_okMi lecturas complementarias_ok
Mi lecturas complementarias_ok
 
Matriz actividad 1
Matriz actividad 1Matriz actividad 1
Matriz actividad 1
 
Mii lección 1 ok
Mii lección 1 okMii lección 1 ok
Mii lección 1 ok
 
Mii lección 2 ok
Mii lección 2 okMii lección 2 ok
Mii lección 2 ok
 
Mii criterios de_selección_recursos_digitales (lo)
Mii criterios de_selección_recursos_digitales (lo)Mii criterios de_selección_recursos_digitales (lo)
Mii criterios de_selección_recursos_digitales (lo)
 
Mii guia buenas_practicas (lo)
Mii guia buenas_practicas (lo)Mii guia buenas_practicas (lo)
Mii guia buenas_practicas (lo)
 
Mii lección 1 ok(1)
Mii lección 1 ok(1)Mii lección 1 ok(1)
Mii lección 1 ok(1)
 
Mii lección 2 ok(1)
Mii lección 2 ok(1)Mii lección 2 ok(1)
Mii lección 2 ok(1)
 
Mii criterios de_selección_recursos_digitales (lo)(1)
Mii criterios de_selección_recursos_digitales (lo)(1)Mii criterios de_selección_recursos_digitales (lo)(1)
Mii criterios de_selección_recursos_digitales (lo)(1)
 
Mi guia buenas_practicas
Mi guia buenas_practicasMi guia buenas_practicas
Mi guia buenas_practicas
 
Mi lección 1 ok
Mi lección 1 okMi lección 1 ok
Mi lección 1 ok
 
Mi lección 2.1
Mi lección 2.1Mi lección 2.1
Mi lección 2.1
 
Mi criterios de_selección_recursos_digitales
Mi criterios de_selección_recursos_digitalesMi criterios de_selección_recursos_digitales
Mi criterios de_selección_recursos_digitales
 
Mi guia buenas_practicas
Mi guia buenas_practicasMi guia buenas_practicas
Mi guia buenas_practicas
 
Mi lecturas complementarias_ok
Mi lecturas complementarias_okMi lecturas complementarias_ok
Mi lecturas complementarias_ok
 
Mi ejemplo actividad_n°2
Mi ejemplo actividad_n°2Mi ejemplo actividad_n°2
Mi ejemplo actividad_n°2
 

Último

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 

Último (20)

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 

Hadoop

Rather than relying on high-end hardware, Hadoop clusters get their reliability from the software's ability to detect and handle failures on its own.

According to a Hortonworks survey, many large, mainstream organizations (50% of survey respondents were from organizations with over $500M in revenues) currently deploy Hadoop across many industries, including high-tech, healthcare, retail, financial services, government, and manufacturing. In the majority of cases, Hadoop does not replace existing data-processing systems but rather complements them. It is typically used to supplement existing systems to tap into additional business data and a more powerful analytics system in order to gain a competitive advantage through better insight into business information.

Some 54% of respondents are utilizing Hadoop to capture new types of data, while 48% are planning to do the same. The main new data types include the following:

• Server-log data, enabling IT departments to better manage their infrastructure (64% of respondents are already doing it, while 28% plan to).
• Clickstream data, enabling better understanding of how customers are using applications (52.3% of respondents are already doing it, while 37.4% plan to).
• Social-media data, enabling understanding of the public's perception of the company (36.5% of respondents are already doing it, while 32.5% plan to).
• Geolocation data, enabling analysis of travel patterns (30.8% of respondents are already doing it, while 26.8% plan to).
• Machine data, enabling analysis of machine usage (29.3% of respondents are already doing it, while 33.3% plan to).

According to the survey, traditional data grows at an average rate of about 8% a year, but new data types are growing at a rate exceeding 85%; as a result, it is virtually impossible to collect and process them without Hadoop.

While Version 1 of Hadoop came with the MapReduce processing model, the recently released Hadoop Version 2 comes with YARN, which separates cluster management from the MapReduce job manager, making the Hadoop architecture more modular. One of the side effects is that MapReduce is now just one of the ways to leverage a Hadoop cluster; a number of new projects such as Tez and Stinger can also process data in the Hadoop Distributed File System (HDFS) without the constraint of MapReduce's batch-job execution model.

by Roopesh Shenoy and Boris Lublinsky – content extracted from this InfoQ news post
Building Applications With Hadoop

Presentation transcript edited by Roopesh Shenoy

When building applications using Hadoop, it is common to have input data from various sources coming in various formats. In his presentation, "New Tools for Building Applications on Apache Hadoop", Eli Collins, tech lead for Cloudera's Platform Team, overviews how to build better products with Hadoop and various tools that can help, such as Apache Avro, Apache Crunch, Cloudera ML, and the Cloudera Development Kit.

Avro

Avro is a data-serialization project, similar to Thrift or Protocol Buffers. It's expressive: you can deal in terms of records, arrays, unions, and enums. It's efficient, with a compact binary representation. One of the benefits of logging in Avro is that you get much smaller data files. All the traditional aspects of Hadoop data formats, like compressible or splittable data, are true of Avro.

One of the reasons Doug Cutting (founder of the Hadoop project) created the Avro project was that a lot of the formats in Hadoop were Java only. It's important for Avro to be interoperable – with a lot of different languages like Java, C, C++, C#, Python, Ruby, etc. – and to be usable by a lot of tools. One of the goals for Avro is a set of formats and serialization that's usable throughout the data platform that you're using, not just in a subset of the components. So MapReduce, Pig, Hive, Crunch, Flume, Sqoop, etc. all support Avro.

Avro is dynamic, and one of its neat features is that you can read and write data without generating any code. It will use reflection and look at the schema that you've given it to create classes on the fly. This is called the Avro generic format. You can also specify formats for which Avro will generate optimal code.

Avro was designed with the expectation that you would change your schema over time. That's an important attribute in a big-data system, because you generate lots of data and you don't want to constantly reprocess it. You're going to generate data at one time and have tools process that data maybe two, three, or four years down the line. Avro has the ability to negotiate differences between schemata so that new tools can read old data and vice versa.

Avro forms an important basis for the following projects.
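Before moving on, here is roughly what the generic read/write path described above looks like. This is a minimal sketch for illustration, not code from the talk; the PageView schema and file name are made up.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroGenericExample {
  public static void main(String[] args) throws Exception {
    // Parse a schema at runtime; no code generation involved.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"PageView\",\"fields\":["
        + "{\"name\":\"url\",\"type\":\"string\"},"
        + "{\"name\":\"latencyMs\",\"type\":\"long\"}]}");

    // Build a record generically and write it to an Avro data file.
    GenericRecord view = new GenericData.Record(schema);
    view.put("url", "/home");
    view.put("latencyMs", 42L);

    File file = new File("pageviews.avro");
    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
    writer.create(schema, file);   // the schema is embedded in the file
    writer.append(view);
    writer.close();

    // Read it back; the reader discovers the schema from the file itself.
    DataFileReader<GenericRecord> reader =
        new DataFileReader<GenericRecord>(file, new GenericDatumReader<GenericRecord>());
    for (GenericRecord record : reader) {
      System.out.println(record.get("url") + " " + record.get("latencyMs"));
    }
    reader.close();
  }
}

Because the schema travels with the data, a tool written years later can still read these files, which is the schema-evolution property the talk emphasizes.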
Crunch

You're probably familiar with Pig and Hive, how to process data with them, and how to integrate valuable tools. However, not all data formats that you use will fit Pig and Hive. Pig and Hive are great for a lot of logged data or relational data, but other data types don't fit as well. You can still process poorly fitting data with Pig and Hive, which don't force you into a relational model or a log structure, but you have to do a lot of work around it. You might find yourself writing unwieldy user-defined functions or doing things that are not natural in the language. People sometimes just give up and start writing raw Java MapReduce programs because that's easier.

Crunch was created to fill this gap. It's a higher-level API than MapReduce. It's in Java. It's lower level than, say, Pig, Hive, Cascading, or other frameworks you might be used to. It's based on a paper that Google published called FlumeJava, and it's a very similar API. Crunch has you combine a small number of primitives with a small number of types, effectively allowing the user to create really lightweight UDFs, which are just Java methods and classes, to create complex data pipelines.

Crunch has a number of advantages:

• It's just Java. You have access to a full programming language.
• You don't have to learn Pig.
• The type system is well integrated. You can use Java POJOs, but there's also native support for Hadoop Writables and Avro. There's no impedance mismatch between the Java code you're writing and the data that you're analyzing.
• It's built as a modular library for reuse. You can capture your pipelines in Crunch code in Java and then combine them with an arbitrary machine-learning program later, so that someone else can reuse that algorithm.

The fundamental structure is a parallel collection: a distributed, unordered collection of elements. This collection has a parallel-do operator, which you can imagine turns into a MapReduce job. So if you have a bunch of data that you want to operate on in parallel, you can use a parallel collection. There's also something called the parallel table, which is a subinterface of the collection; it's a distributed sorted map, and it has a group-by operator you can use to aggregate all the values for a given key. We'll go through an example that shows how that works. Finally, there's a pipeline class; pipelines are for coordinating the execution of the MapReduce jobs that will actually do the back-end processing for the Crunch program.

Let's take an example for which you've probably seen all the Java code before, word count, and see what it looks like in Crunch.

Crunch – word count

public class WordCount {
  public static void main(String[] args) throws Exception {
    Pipeline pipeline = new MRPipeline(WordCount.class);
    PCollection<String> lines = pipeline.readTextFile(args[0]);
    PCollection<String> words = lines.parallelDo("my splitter", new DoFn<String, String>() {
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());
    PTable<String, Long> counts = Aggregate.count(words);
    pipeline.writeTextFile(counts, args[1]);
    pipeline.run();
  }
}
It's a lot smaller and simpler. The first line creates a pipeline. We create a parallel collection of all the lines from a given file by using the pipeline class, and then we get a collection of words by running the parallel-do operator on those lines. We've defined an anonymous function here that processes the input; word count splits each line on whitespace and emits each word for each map task. Finally, we want to aggregate the counts for each word and write them out. There's a line at the bottom, pipeline run: Crunch's planner does lazy evaluation, so it won't create and run the MapReduce jobs until we've put a full pipeline together. If you're used to programming Java and you've seen the Hadoop examples for writing word count in Java, you can tell that this is a more natural way to express that. This is among the simplest pipelines you can create, and you can imagine doing many more complicated things.

If you want to go even one step easier than this, there's a wrapper for Scala. This is a very similar idea to Cascade, which was built on Google FlumeJava. Since Scala runs on the JVM, it's an obvious, natural fit, and Scala's type inference ends up being really powerful in the context of Crunch. The listing below is the same program written in Scala. We have the pipeline, and we can use Scala's built-in functions, which map really nicely to Crunch – so word count becomes a one-line program. It's pretty cool and very powerful if you're writing Java code already and want to do complex pipelines.

Scrunch – Scala wrapper

class WordCountExample {
  val pipeline = new Pipeline[WordCountExample]

  def wordCount(fileName: String) = {
    pipeline.read(from.textFile(fileName))
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(!_.isEmpty())
      .count
  }
}
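The parallel table and its group-by operator mentioned above are worth a concrete look as well. The sketch below is not from the talk: it assumes a tab-separated "user, bytes" input format, and the Aggregators helper and table-typed parallelDo overload are used as I understand the Crunch API of that era, so treat the details as approximate.

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pair;
import org.apache.crunch.Pipeline;
import org.apache.crunch.fn.Aggregators;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class BytesPerUser {
  public static void main(String[] args) throws Exception {
    Pipeline pipeline = new MRPipeline(BytesPerUser.class);
    PCollection<String> lines = pipeline.readTextFile(args[0]);

    // Parse each line into a (user, bytes) pair, producing a distributed table.
    PTable<String, Long> bytesByUser = lines.parallelDo("parse",
        new DoFn<String, Pair<String, Long>>() {
          public void process(String line, Emitter<Pair<String, Long>> emitter) {
            String[] parts = line.split("\t");
            emitter.emit(Pair.of(parts[0], Long.parseLong(parts[1])));
          }
        }, Writables.tableOf(Writables.strings(), Writables.longs()));

    // Group all values for a key and sum them: the group-by operator
    // described earlier, expressed as ordinary Java.
    PTable<String, Long> totals =
        bytesByUser.groupByKey().combineValues(Aggregators.SUM_LONGS());

    pipeline.writeTextFile(totals, args[1]);
    pipeline.done();
  }
}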
Cloudera ML

Cloudera ML (machine learning) is an open-source library and set of tools to help data scientists perform their day-to-day tasks, primarily from data preparation to model evaluation. With built-in commands for summarizing, sampling, normalizing, and pivoting data, Cloudera ML has recently added a built-in clustering algorithm for k-means, based on an algorithm that was developed just a year or two back. There are a couple of other implementations as well. It's a home for tools you can use so you can focus on data analysis and modeling instead of on building or wrangling the tools.

It's built using Crunch, and it leverages a lot of existing projects. For example, the vector formats: a lot of ML involves transforming raw data that's in a record format into vector formats for machine-learning algorithms. It leverages Mahout's vector interface and classes for that purpose. The record format is just a thin wrapper around Avro and HCatalog record and schema formats, so you can easily integrate with existing data sources.

For more information on Cloudera ML, visit the project's GitHub page; there's a bunch of examples with datasets that can get you started.

Cloudera Development Kit

Like Cloudera ML, the Cloudera Development Kit is a set of open-source libraries and tools that make writing applications on Hadoop easier. Unlike ML, though, it's not focused on machine learning for data scientists. It's directed at developers trying to build applications on Hadoop. It's really the plumbing of a lot of different frameworks and pipelines and the integration of a lot of different components.

The purpose of the CDK is to provide higher-level APIs on top of the existing Hadoop components in the CDH stack that codify a lot of patterns in common use cases. CDK is prescriptive, has an opinion on the way to do things, and tries to make it easy for you to do the right thing by default, but its architecture is a system of loosely coupled modules. You can use modules independently of each other. It's not an uber-framework that you have to adopt whole; you can adopt it piecemeal. It doesn't force you into any particular programming paradigms, and it doesn't force you to adopt a ton of dependencies. You can adopt only the dependencies of the particular modules you want.

Let's look at an example. The first module in CDK is the data module, and the goal of the data module is to make it easier for you to work with datasets on Hadoop file systems. There are a lot of gory details to get right to make this work in practice; you have to worry about serialization, deserialization, compression, partitioning, directory layout, and communicating that directory layout and partitioning to other people who want to consume the data. The CDK data module handles all this for you. It automatically serializes and deserializes data from Java POJOs, if that's what you have, or Avro records if you use them. It has built-in compression, and built-in policies around file and directory layouts, so that you don't have to repeat a lot of these decisions and you get smart policies out of the box. It will automatically partition data within those layouts. It lets you focus on working with a dataset on HDFS instead of on all the implementation details.

It also has plugin providers for existing systems. Imagine you're already using Hive and HCatalog as a metadata repository and you've already got a schema for what these files look like. CDK integrates with that. It doesn't require you to define all of your metadata for your entire data repository from scratch; it integrates with existing systems. You can learn more about the various CDK modules and how to use them in the documentation.

In summary, working with data from various sources, preparing and cleansing the data, and processing it via Hadoop involves a lot of work. Tools such as Crunch, Cloudera ML, and CDK make it easier to do this and to leverage Hadoop more effectively.

ABOUT THE SPEAKER

Eli Collins is the tech lead for Cloudera's Platform team, an active contributor to Apache Hadoop, and a member of its project management committee (PMC) at the Apache Software Foundation. Eli holds Bachelor's and Master's degrees in Computer Science from New York University and the University of Wisconsin-Madison, respectively.

WATCH THE FULL PRESENTATION ON InfoQ
What is Apache Tez?

Presentation transcript edited by Roopesh Shenoy

You might have heard of Apache Tez, a new distributed execution framework that is targeted towards data-processing applications on Hadoop. But what exactly is it? How does it work? Who should use it and why? In their presentation, "Apache Tez: Accelerating Hadoop Query Processing", Bikas Saha and Arun Murthy discuss Tez's design, highlight some of its features, and share some of the initial results obtained by making Hive use Tez instead of MapReduce.

Tez generalizes the MapReduce paradigm to a more powerful framework based on expressing computations as a dataflow graph. Tez is not meant directly for end users – in fact, it enables developers to build end-user applications with much better performance and flexibility. Hadoop has traditionally been a batch-processing platform for large amounts of data. However, there are a lot of use cases that call for near-real-time query processing, and there are also several workloads, such as machine learning, which do not fit well into the MapReduce paradigm. Tez helps Hadoop address these use cases.

The Tez project aims to be highly customizable so that it can meet a broad spectrum of use cases without forcing people to go out of their way to make things work; projects such as Hive and Pig are seeing significant improvements in response times when they use Tez instead of MapReduce as the backbone for data processing. Tez is built on top of YARN, which is the new resource-management framework for Hadoop.
Design Philosophy

The main reason for Tez to exist is to get around limitations imposed by MapReduce. Other than being limited to writing mappers and reducers, there are other inefficiencies in force-fitting all kinds of computations into this paradigm – e.g., HDFS is used to store temporary data between multiple MR jobs, which is an overhead. (In Hive, this is common when queries require multiple shuffles on keys without correlation, such as with join – group by – window function – order by.)

The key elements forming the design philosophy behind Tez are:

• Empowering developers (and hence end users) to do what they want in the most efficient manner.
• Better execution performance.

Some of the things that help Tez achieve these goals are:

• Expressive dataflow APIs – The Tez team wants an expressive dataflow-definition API so that you can describe the directed acyclic graph (DAG) of the computation that you want to run. For this, Tez has a structural kind of API in which you add all processors and edges and visualize what you are actually constructing.
• Flexible input-processor-output runtime model – Tez can construct runtime executors dynamically by connecting different inputs, processors, and outputs.
• Data-type agnostic – Tez is only concerned with the movement of data, not with the data format (key-value pairs, tuple-oriented formats, etc.).
• Dynamic graph reconfiguration.
• Simple deployment – Tez is completely a client-side application; it leverages YARN local resources and the distributed cache. There's no need to deploy anything on your cluster as far as using Tez is concerned: you just upload the relevant Tez libraries to HDFS, then use your Tez client to submit with those libraries. You can even have two copies of the libraries on your cluster. One would be a production copy, the stable version that all your production jobs use; your users can experiment with a second copy, the latest version of Tez, and the two will not interfere with each other.

Tez can run any MR job without any modification. This allows for stage-wise migration of tools that currently depend on MR.

Exploring the expressive dataflow APIs in more detail, what can you do with them? For example, instead of using multiple MapReduce jobs, you can use the MRR pattern, in which a single map has multiple reduce stages; this allows streaming of data from one processor to the next without writing anything to HDFS (it will be written to disk only for check-pointing), leading to much better performance. The diagrams below demonstrate this.
The first diagram demonstrates a process that has multiple MR jobs, each storing intermediate results to HDFS – the reducers of the previous step feeding the mappers of the next step. The second diagram shows how, with Tez, the same processing can be done in just one job, with no need to access HDFS in between.

Tez's flexibility means that it requires a bit more effort than MapReduce to start consuming; there's a bit more API and a bit more processing logic that you need to implement. This is fine, since it is not an end-user application like MapReduce; it is designed to let developers build end-user applications on top of it. Given that overview of Tez and its broad goals, let's try to understand the actual APIs.

Tez API

The Tez API has the following components:

• DAG (directed acyclic graph) – defines the overall job. One DAG object corresponds to one job.
• Vertex – defines the user logic along with the resources and the environment needed to execute that logic. One vertex corresponds to one step in the job.
• Edge – defines the connection between producer and consumer vertices. Edges need to be assigned properties; these properties are essential for Tez to be able to expand the logical graph at runtime into the physical set of tasks that can run in parallel on the cluster. There are several such properties:
  • The data-movement property defines how data moves from a producer to a consumer.
  • The scheduling property (sequential or concurrent) defines when the producer and consumer tasks can be scheduled relative to each other.
  • The data-source property (persisted, reliable, or ephemeral) defines the lifetime or durability of the output produced by a task, so that Tez can determine when it can terminate the task.

You can view this Hortonworks article to see an example of the API in action, more detail about these properties, and how the logical graph expands at runtime.

The runtime API is based on an input-processor-output model which allows all inputs and outputs to be pluggable. To facilitate this, Tez uses an event-based model to communicate between tasks and the system, and between various components. Events are used to pass information such as task failures to the required components, the flow of data from an Output to an Input (such as the location of the data it generates), run-time changes to the DAG execution plan, etc. Tez also comes with various Input and Output processors out of the box. The expressive API allows writers of higher-level languages (such as Hive) to elegantly transform their queries into Tez jobs.

Tez Scheduler

The Tez scheduler considers a lot of things when deciding on task assignments: task-locality requirements, compatibility of containers, total available resources on the cluster, priority of pending task requests, automatic parallelization, freeing up resources that the application cannot use anymore (because the data is not local to it), etc. It also maintains a connection pool of pre-warmed JVMs with shared registry objects. The application can choose to store different kinds of pre-computed information in those shared registry objects so that it can be reused without having to be recomputed later on, and this shared set of connections and container-pool resources can run those tasks very fast. You can read more about the reuse of containers in Apache Tez.
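To make the DAG, vertex, and edge-property model above easier to picture, here is a small, self-contained Java sketch. These are illustrative classes written for this article, not the actual Tez API types; the enum values simply mirror the edge properties discussed above, and the processor names are invented.

import java.util.ArrayList;
import java.util.List;

public class LogicalDag {
  enum DataMovement { ONE_TO_ONE, BROADCAST, SCATTER_GATHER }
  enum Scheduling { SEQUENTIAL, CONCURRENT }
  enum DataSource { PERSISTED, RELIABLE, EPHEMERAL }

  // One vertex = one step in the job, wrapping some user processing logic.
  static class Vertex {
    final String name;
    final String processorLogic;
    Vertex(String name, String processorLogic) {
      this.name = name;
      this.processorLogic = processorLogic;
    }
  }

  // One edge = a producer-consumer connection annotated with the three
  // property types Tez uses to expand the logical graph into physical tasks.
  static class Edge {
    final Vertex producer, consumer;
    final DataMovement movement;
    final Scheduling scheduling;
    final DataSource source;
    Edge(Vertex producer, Vertex consumer,
         DataMovement movement, Scheduling scheduling, DataSource source) {
      this.producer = producer;
      this.consumer = consumer;
      this.movement = movement;
      this.scheduling = scheduling;
      this.source = source;
    }
  }

  final List<Vertex> vertices = new ArrayList<Vertex>();
  final List<Edge> edges = new ArrayList<Edge>();

  public static void main(String[] args) {
    // A map -> reduce -> reduce (MRR) pipeline expressed as one DAG.
    LogicalDag dag = new LogicalDag();
    Vertex map = new Vertex("map", "TokenizerProcessor");
    Vertex reduce1 = new Vertex("partial-agg", "PartialAggProcessor");
    Vertex reduce2 = new Vertex("final-agg", "FinalAggProcessor");
    dag.vertices.add(map);
    dag.vertices.add(reduce1);
    dag.vertices.add(reduce2);
    // A shuffle between stages, scheduled sequentially, intermediate data persisted.
    dag.edges.add(new Edge(map, reduce1,
        DataMovement.SCATTER_GATHER, Scheduling.SEQUENTIAL, DataSource.PERSISTED));
    dag.edges.add(new Edge(reduce1, reduce2,
        DataMovement.SCATTER_GATHER, Scheduling.SEQUENTIAL, DataSource.PERSISTED));
    System.out.println("vertices=" + dag.vertices.size() + ", edges=" + dag.edges.size());
  }
}

The point of the sketch is simply that the whole pipeline is one job described up front, with the edge annotations telling the framework how to schedule and move data between steps.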
Flexibility

Overall, Tez provides a great deal of flexibility for developers to deal with complex processing logic. This can be illustrated with one example of how Hive is able to leverage Tez. Consider a typical TPC-DS query pattern in which you join multiple dimension tables with a fact table. Most optimizers and query systems can do what is shown in the top-right corner of the diagram: if the dimension tables are small, they can broadcast-join all of them with the large fact table, and you can do that same thing on Tez.
But what if those broadcasts include user-defined functions that are expensive to compute? You may not be able to do all of it that way. You may have to break up your tasks into different stages, and that's what the left-side topology shows. The first dimension table is broadcast-joined with the fact table. The result is then broadcast-joined with the second dimension table. Here, the third dimension table is not broadcastable because it is too large, so you can choose to do a shuffle join, and Tez can efficiently navigate the topology without falling over just because you can't use the top-right plan.

The two benefits of this kind of Hive query with Tez are:

• It gives you full DAG support and does a lot automatically on the cluster, so that it can fully utilize the parallelism that is available; as discussed above, this means there is no need to read from and write to HDFS between multiple MR jobs – all the computation can be done in a single Tez job.
• It provides sessions and reusable containers, so that you have low latency and can avoid recomputation as much as possible.

This particular Hive query saw a performance improvement of more than 100% with the new Tez engine.

Roadmap

• Richer DAG support. For example, could Samza use Tez as a substrate on which to build its application? Tez would need some support to handle Samza's core scheduling and streaming requirements, and the Tez team wants to explore how to enable those kinds of connection patterns in Tez DAGs. They also want more fault-tolerance support, more efficient data transfer for further performance optimization, and improved session performance.
• Given that these DAGs can get arbitrarily complex, a lot of automatic tooling is needed to help users understand their performance bottlenecks.
Summary

Tez is a distributed execution framework that works on computations represented as dataflow graphs. It maps naturally to higher-level declarative languages like Hive, Pig, Cascading, etc. It is designed to have a highly customizable execution architecture so that dynamic performance optimizations can be made at runtime, based on real information about the data and the resources. The framework itself automatically determines a lot of the hard stuff, so it works well right out of the box, and you get good performance and efficiency out of the box. Tez aims to address the broad spectrum of use cases in the data-processing domain in Hadoop, ranging from latency to complexity of execution. It is an open-source project. Tez works, Saha and Murthy suggest, and is already being used by Hive and Pig.

ABOUT THE SPEAKERS

Arun Murthy is the lead of the MapReduce project in Apache Hadoop, where he has been a full-time contributor since its inception in 2006. He is a long-time committer and member of the Apache Hadoop PMC and jointly holds the current world sorting record using Apache Hadoop. Prior to co-founding Hortonworks, Arun was responsible for all MapReduce code and configuration deployed across the 42,000+ servers at Yahoo!

Bikas Saha has been working on Apache Hadoop for over a year and is a committer on the project. He has been a key contributor in making Hadoop run natively on Windows and has focused on YARN and the Hadoop compute stack. Prior to Hadoop, he worked extensively on the Dryad distributed data-processing framework, which runs on some of the world's largest clusters as part of the Microsoft Bing infrastructure.

WATCH THE FULL PRESENTATION ON InfoQ
Modern Healthcare Architectures Built with Hadoop

by Justin Sears

We have heard plenty in the news lately about healthcare challenges and the difficult choices faced by hospital administrators, technology and pharmaceutical providers, researchers, and clinicians. At the same time, consumers are experiencing increased costs without a corresponding increase in health security or in the reliability of clinical outcomes.

One key obstacle in the healthcare market is data liquidity (for patients, practitioners, and payers), and some are using Apache Hadoop to overcome this challenge as part of a modern data architecture. This post describes some healthcare use cases, a healthcare reference architecture, and how Hadoop can ease the pain caused by poor data liquidity.

New Value Pathways for Healthcare

In January 2013, McKinsey & Company published a report named "The 'Big Data' Revolution in Healthcare". The report points out how big data is creating value in five "new value pathways" allowing data to flow more freely. Below we present a summary of these five new value pathways and an example of how Hadoop can be used to address each. Thanks to the Clinical Informatics Group at UC Irvine Health for many of the use cases, described in their UCIH case study.

Right Living
Benefit: Patients can build value by taking an active role in their own treatment, including disease prevention.
Hadoop use case – Predictive analytics: Heart patients weigh themselves at home with scales that transmit data wirelessly to their health center. Algorithms analyze the data and flag patterns that indicate a high risk of readmission, alerting a physician.

Right Care
Benefit: Patients get the most timely, appropriate treatment available.
Hadoop use case – Real-time monitoring: Patient vital statistics are transmitted from wireless sensors every minute. If vital signs cross certain risk thresholds, staff can attend to the patient immediately.

Right Provider
Benefit: Provider skill sets are matched to the complexity of the assignment – for instance, nurses or physicians' assistants performing tasks that do not require a doctor – along with specific selection of the provider with the best outcomes.
Hadoop use case – Historical EMR analysis: Hadoop reduces the cost to store data on clinical operations, allowing longer retention of data on staffing decisions and clinical outcomes. Analysis of this data allows administrators to promote individuals and practices that achieve the best results.
Right Value
Benefit: Ensure cost-effectiveness of care, such as tying provider reimbursement to patient outcomes, or eliminating fraud, waste, or abuse in the system.
Hadoop use case – Medical device management: For biomedical device maintenance, use geolocation and sensor data to manage medical equipment. The biomedical team can know where all the equipment is, so they don't waste time searching for an item. Over time, determine the usage of different devices, and use this information to make rational decisions about when to repair or replace equipment.

Right Innovation
Benefit: The identification of new therapies and approaches to delivering care, across all aspects of the system, as well as improvement of the innovation engines themselves.
Hadoop use case – Research cohort selection: Researchers at teaching hospitals can access patient data in Hadoop for cohort discovery, then present the anonymous sample cohort to their Institutional Review Board for approval, without ever having seen uniquely identifiable information.

Source: The 'Big Data' Revolution in Healthcare. McKinsey & Company, January 2013.

At Hortonworks, we see our healthcare customers ingest and analyze data from many sources. The following reference architecture is an amalgam of Hadoop data patterns that we've seen with our customers' use of Hortonworks Data Platform (HDP). Components shaded green in the diagram are part of HDP.

Sources of Healthcare Data

Source data comes from:

• Legacy electronic medical records (EMRs)
• Transcriptions
• PACS
• Medication administration
• Financial systems
• Laboratory (e.g. SunQuest, Cerner)
• RTLS (for locating medical equipment and tracking patient throughput)
• Bio repositories
• Device integration (e.g. iSirona)
• Home devices (e.g. scales and heart monitors)
• Clinical trials
• Genomics (e.g. 23andMe, Cancer Genomics Hub)
• Radiology (e.g. RadNet)
• Quantified-self sensors (e.g. Fitbit, SmartSleep)
• Social-media streams (e.g. FourSquare, Twitter)
Loading Healthcare Data

Apache Sqoop is included in Hortonworks Data Platform as a tool to transfer data between external structured data stores (such as Teradata, Netezza, MySQL, or Oracle) and HDFS or related systems like Hive and HBase. We also see our customers using other tools or standards for loading healthcare data into Hadoop. Some of these are:

• Health Level 7 (HL7) international standards
• Apache UIMA
• Java ETL rules

Processing Healthcare Data

Depending on the use case, healthcare organizations process data in batch (using Apache Hadoop MapReduce and Apache Pig), interactively (with Apache Hive), online (with Apache HBase), or streaming (with Apache Storm).

Analyzing Healthcare Data

Once data is stored and processed in Hadoop, it can either be analyzed in the cluster or exported to relational data stores for analysis there. These data stores might include:

• Enterprise data warehouse
• Quality data mart
• Surgical data mart
• Clinical info data mart
• Diagnosis data mart
• Neo4j graph database

Many data-analysis and visualization applications can also work with the data directly in Hadoop. Hortonworks healthcare customers typically use the following business-intelligence and visualization tools to inform their decisions:

• Microsoft Excel
• Tableau
• RESTful web services
• EMR real-time analytics
• Metric Insights
• Patient scorecards
• Research portals
• Operational dashboards
• Quality dashboards

The following diagram shows how healthcare organizations can integrate Hadoop into their existing data architecture to create a modern data architecture that is interoperable and familiar, so that the same team of analysts and practitioners can use their existing skills in new ways. As more and more healthcare organizations adopt Hadoop to disseminate data to their teams and partners, they empower caregivers to combine their training, intuition, and professional experience with big data to make data-driven decisions that cure patients and reduce costs.

Watch our blog in the coming weeks as we share reference architectures for other industry verticals.

READ THIS ARTICLE ON Hortonworks.com

ABOUT THE AUTHOR

Justin Sears is an experienced marketing manager with sixteen years leading teams to create and position enterprise software, risk-controlled consumer banking products, desktop and mobile web properties, and services for Latino customers in the US and Latin America. He is an expert in enterprise big-data use cases for Apache Hadoop.
How LinkedIn Uses Apache Samza

Apache Samza is a stream processor LinkedIn recently open-sourced. In his presentation, Samza: Real-time Stream Processing at LinkedIn, Chris Riccomini discusses Samza's feature set, how Samza integrates with YARN and Kafka, how it's used at LinkedIn, and what's next on the roadmap.

The bulk of the processing that happens at LinkedIn is RPC-style data processing, where one expects a very fast response. At the other end of the response-latency spectrum is batch processing, for which LinkedIn uses Hadoop quite a bit. Hadoop-style batch processing typically happens after the fact, often hours later. There is a gap between synchronous RPC processing, where the user is actively waiting for a response, and Hadoop-style processing, which, despite efforts to shrink it, still takes a long time to run.

Presentation transcript edited by Roopesh Shenoy
That's where Samza fits in: it processes data asynchronously, but without waiting for hours. It typically operates on the order of milliseconds to minutes. The idea is to process data relatively quickly and get it back to wherever it needs to be, whether that's a downstream system or some real-time service. Chris mentions that, right now, this kind of stream processing is the worst-supported in terms of tooling and environment.

LinkedIn sees a lot of use cases for this type of processing:
• Newsfeed displays when people move to another company, when they like an article, when they join a group, et cetera. News is latency-sensitive, and if you use Hadoop to batch-compute it, you might be getting responses hours or maybe even a day later. It is important to surface trending articles in the news feed quickly.
• Advertising: serving relevant advertisements, as well as tracking and monitoring ad displays, clicks, and other metrics.
• Sophisticated monitoring that allows complex queries like "the top five slowest pages for the last minute".

Existing Ecosystem at LinkedIn

The existing ecosystem at LinkedIn has had a huge influence on the motivation behind Samza as well as on its architecture, so it is important to have at least a glimpse of what it looks like before diving into Samza.

Kafka is an open-source project that LinkedIn released a few years ago. It is a messaging system that fulfills two needs: message queuing and log aggregation. All of LinkedIn's user activity, all the metrics and monitoring data, and even database changes go into it.

LinkedIn also has a specialized system called Databus, which models all of their databases as a stream. A database holds the latest data for each key-value pair, but as the database mutates, you can model that set of mutations as a stream, with each individual change becoming a message in that stream.

Because LinkedIn has Kafka and has been integrating with it for the past few years, a lot of the data at LinkedIn, almost all of it, is available in stream form rather than only in a database or on Hadoop.

Motivation for Building Samza

Chris mentions that when they began doing stream processing, with Kafka and all this data in their system, they started with something like a web service that would start up, read messages from Kafka, do some processing, and then write the messages back out. As they did this, they realized that a lot of problems needed to be solved to make it really useful and scalable.

Things like partitioning: how do you partition your stream? How do you partition your processor? How do you manage state, where state is essentially anything that you maintain in your processor between messages, such as a counter you increment every time a message comes in? How do you reprocess data? Failure semantics determine whether you get at-least-once, at-most-once, or exactly-once messaging. There is also non-determinism: if your stream processor interacts with another system, whether that's a database or a dependency on time or on the ordering of messages, how do you deal with the things that actually determine the output you end up sending?

Samza tries to address some of these problems.

Samza Architecture

The most basic element of Samza is a stream. The stream definition for Samza is much more rigid and heavyweight than you would expect from other stream-processing systems.
Other processing systems, such as Storm, tend to have very lightweight stream definitions to reduce latency, using anything from, say, UDP to a straight-up TCP connection. Samza goes in the other direction: it wants its streams to be, for starters, partitioned. It wants them to be ordered: if you read Message 3 and then Message 4, you are never going to get those inverted within a single partition. It also wants them to be replayable, which means you should be able to go back and reread a message at a later date. It wants them to be fault-tolerant: if a host serving Partition 1 disappears, the partition should still be readable on some other host. Also, the streams are usually infinite. Once you get to the
end – say, Message 6 of Partition 0 – you simply try to read the next message when it becomes available; it's not the case that you're finished. This definition maps very well to Kafka, which LinkedIn uses as the streaming infrastructure for Samza.

There are several concepts to understand within Samza. In a nutshell, they are:
• Streams: Samza processes streams. A stream is composed of immutable messages of a similar type or category. The actual implementation can be provided by a messaging system such as Kafka (where each topic becomes a Samza stream), a database (a table), or even Hadoop (a directory of files in HDFS). Concerns like message ordering and batching are handled by streams.
• Jobs: a Samza job is code that performs a logical transformation on a set of input streams in order to append messages to a set of output streams.
• Partitions: for scalability, each stream is broken into one or more partitions. Each partition is a totally ordered sequence of messages.
• Tasks: again for scalability, a job is distributed by breaking it into multiple tasks. Each task consumes data from one partition of each of the job's input streams.
• Containers: whereas partitions and tasks are logical units of parallelism, containers are units of physical parallelism. Each container is a Unix process (or Linux cgroup) and runs one or more tasks.
• TaskRunner: the TaskRunner is Samza's stream-processing container. It manages the startup, execution, and shutdown of one or more StreamTask instances.
• Checkpointing: checkpointing enables failure recovery. If a TaskRunner goes down for some reason (a hardware failure, for example), it should resume consuming messages where it left off when it comes back up; this is achieved via checkpointing.
• State management: data that needs to be carried over between the processing of different messages is called state; this can be something as simple as keeping a count or something a lot more complex. Samza allows tasks to maintain persistent, mutable, queryable state that is physically co-located with each task. The state is highly available: in the event of a task failure, it will be restored when the task fails over to another machine. The datastore is pluggable, but Samza comes with a key-value store out of the box.
• YARN (Yet Another Resource Negotiator) is Hadoop v2's biggest improvement over v1: it separates the MapReduce job tracker from resource management and enables MapReduce alternatives to use the same resource manager. Samza uses YARN for cluster management, tracking failures, and so on. Samza provides a YARN ApplicationMaster and a YARN job runner out of the box.

You can understand how the various components (YARN, Kafka, and the Samza API) interact by looking at the detailed architecture, and you can read the overall documentation to understand each component in detail. A minimal StreamTask sketch shown below, just before the conclusion, illustrates the shape of the task-level API.

Possible Improvements

One of the advantages of using something like YARN with Samza is that it potentially lets you run Samza on the same grid on which you already run your other tasks, such as test tasks and MapReduce tasks. You could use the same infrastructure for all of that. However, LinkedIn currently does not run Samza in a multi-framework environment, because the existing setup itself is quite experimental. To get to a more multi-framework environment, Chris says, the process isolation would have to get a little better.
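To make the concepts above concrete, here is a minimal sketch of a task written against Samza's low-level Java API (StreamTask, InitableTask, and the built-in key-value store). The class name, the topic names, and the store name "page-view-counts" are hypothetical, and the store and its serdes are assumed to be declared in the job's configuration; this is an illustrative sketch under those assumptions, not code from LinkedIn's systems.

```java
import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

/**
 * Hypothetical task that counts page views per member and emits the
 * running count to an output Kafka topic. Illustrative only.
 */
public class MemberPageViewCounter implements StreamTask, InitableTask {

  // Output stream: the "kafka" system name and the topic name are
  // assumptions that would have to match the job's configuration.
  private static final SystemStream OUTPUT =
      new SystemStream("kafka", "member-page-view-counts");

  // Local, key-value state co-located with the task (Samza's built-in store).
  private KeyValueStore<String, Integer> counts;

  @Override
  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    // "page-view-counts" is assumed to be declared as a store in the job config.
    counts = (KeyValueStore<String, Integer>) context.getStore("page-view-counts");
  }

  @Override
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    // Assume the upstream topic is keyed by member ID.
    String memberId = (String) envelope.getKey();

    // Update the per-member count in the local store.
    Integer current = counts.get(memberId);
    int updated = (current == null ? 0 : current) + 1;
    counts.put(memberId, updated);

    // Send the new count downstream; partition assignment and
    // checkpointing are handled by the framework, as described above.
    collector.send(new OutgoingMessageEnvelope(OUTPUT, memberId, updated));
  }
}
```

In a real job, a class like this would be referenced from the job's properties file (for example via task.class and task.inputs), and YARN would run one task instance per input partition inside the containers described above.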
Conclusion

Samza is a relatively young project incubating at Apache, so there is a lot of room to get involved. A good way to get started is with the hello-samza project, a small project that will get you up and running in about five minutes. It lets you play with a real-time change log from the Wikipedia servers, so you can see what's going on and have a stream of data to experiment with. Another stream-processing project that can run on Hadoop infrastructure is Apache Storm; you can also see a comparison between Samza and Storm.

ABOUT THE SPEAKER
Chris Riccomini is a Staff Software Engineer at LinkedIn, where he is currently working as a committer and PMC member for Apache Samza. He has been involved in a wide range of projects at LinkedIn, including "People You May Know", REST.li, Hadoop, engineering tooling, and OLAP systems. Prior to LinkedIn, he worked on data visualization and fraud modeling at PayPal.

WATCH THE FULL PRESENTATION ON InfoQ