An overview of the current big data technology scope, prepared for the V.I.Tech and Wellcentive companies. It answers, at a very high level, why we chose these products and what we actually do with them.
2. Any real big data is just about the DIGITAL LIFE FOOTPRINT
www.vitech.com.ua 2
3. BIG DATA is not about the data. It is about OUR ABILITY TO HANDLE IT.
4. Our stack
● What is our stack of big data technologies?
● Some of our specifics. But we are always special, aren't we?
● A couple of buzzwords: arguments for meetings with management ;-)
5. YARN
● Linear scalability: 2 times more power costs 2 times more money.
● No natural keys, so load balancing is perfect.
● No 'special' hardware, so staging is closer to production.
7. What is HADOOP?
● Hadoop is an open source framework for big data: both distributed storage and processing.
● Hadoop is reliable and fault tolerant, without relying on hardware for these properties.
● Hadoop has unique horizontal scalability: currently from a single computer up to thousands of cluster nodes.
8. What is HADOOP INDEED? Why hadoop?
(Diagram: BIG DATA = BIG + DATA? x MAX; a wall of repeated BIG DATA blocks.)
9. SIMPLE BUT RELIABLE
● A really big amount of data, stored in a reliable manner.
● Storage is simple, recoverable and (relatively) cheap.
● The same goes for processing power.
10. COMPLEX INSIDE, SIMPLE OUTSIDE
● Complexity is buried inside: most of the really complex operations are handled by the engine.
● The interface is remote and compatible between versions, so clients are relatively safe against implementation changes.
11. DECENTRALIZED
● No single point of failure (almost).
● Scalable, as close to linear as possible.
● No manual actions needed to recover from failures.
12. Hadoop historical top view
● HDFS serves as the file system layer.
● MapReduce originally served as the distributed processing framework.
● The native client API is Java, but there are lots of alternatives.
● This is only the initial architecture; it is now more complex.
13. HDFS top view. HDFS is... scalable
● The namenode is the 'management' component. It keeps a 'directory' of which file blocks are stored where.
● Actual work is performed by the data nodes.
14. HDFS is... reliable
● Files are stored in large enough blocks. Every block is replicated to several data nodes.
● Replication is tracked by the namenode. Clients only locate blocks through the namenode; the actual load is taken by the datanodes.
● A datanode failure triggers replication recovery. The namenode can be backed by a standby scheme.
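The block replication and recovery scheme above can be sketched as a toy model. `ToyNamenode`, `BLOCK_SIZE` and the node names are illustrative assumptions, not the real HDFS implementation:

```python
import itertools

BLOCK_SIZE = 4          # toy block size; real HDFS defaults to 128 MB blocks
REPLICATION = 3         # the common HDFS default replication factor

class ToyNamenode:
    """Keeps only the 'directory': which block lives on which datanodes."""
    def __init__(self, datanodes):
        self.datanodes = list(datanodes)
        self.block_map = {}            # block id -> set of datanode names
        self._rr = itertools.cycle(self.datanodes)

    def store(self, data):
        """Split data into blocks and replicate each block to several datanodes."""
        blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
        for block_id, _ in enumerate(blocks):
            targets = set()
            while len(targets) < min(REPLICATION, len(self.datanodes)):
                targets.add(next(self._rr))
            self.block_map[block_id] = targets
        return blocks

    def fail(self, dead_node):
        """A datanode failure triggers replication recovery on healthy nodes."""
        self.datanodes.remove(dead_node)
        for nodes in self.block_map.values():
            nodes.discard(dead_node)
            for candidate in self.datanodes:      # re-replicate lost copies
                if len(nodes) >= REPLICATION:
                    break
                nodes.add(candidate)

nn = ToyNamenode(["dn1", "dn2", "dn3", "dn4"])
nn.store("0123456789abcdef")                      # 16 bytes -> 4 blocks
nn.fail("dn2")                                    # recovery keeps 3 replicas
assert all(len(nodes) == REPLICATION for nodes in nn.block_map.values())
```

As on the slide, clients would ask the namenode only for block locations; the data itself would flow to and from the datanodes.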
16. MapReduce is...
● A two-step data processing model: transform (map) and then reduce. Really nice for doing things in a distributed manner.
● A large class of jobs can be adapted to it, but not all of them.
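The two-step model can be illustrated with the classic word count, here simulated in plain Python; in a real cluster the map and reduce tasks would run on different nodes and the shuffle would move data between them:

```python
from collections import defaultdict

def map_phase(records):
    """Map: transform each input record into (key, value) pairs."""
    for line in records:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group values by key (done by the framework between phases)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: fold each key's values into a single result."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big", "data os"]
counts = reduce_phase(shuffle(map_phase(lines)))
assert counts == {"big": 2, "data": 2, "os": 1}
```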
17. BIG DATA processing requires DISTRIBUTION. LOAD HAS TO BE SHARED
● Work is to be balanced.
● Work can be shared in accordance with data placement.
● Work is to be balanced to reflect the resource balance.
18. DATA LOCALITY. TOOLS ARE TO BE CLOSE TO THE WORK PLACE
● With MapReduce, process data on the same nodes it is stored on.
● Distributed storage means distributed processing.
19. DISTRIBUTION + LOCALITY
(Diagram: YOUR DATA is split into BIG DATA partitions; share it, do it locally, and together they go: the WORK TO DO becomes a JOINED RESULT.)
Data partitioning drives work sharing. Good partitioning means good scalability.
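A minimal sketch of partition-driven work sharing; the hash function, the three-partition split and the `user*` keys are illustrative assumptions:

```python
NUM_PARTITIONS = 3

def partition_of(key):
    """Good partitioning spreads keys evenly; here a simple stable hash."""
    return sum(key.encode()) % NUM_PARTITIONS

records = [("user1", 10), ("user2", 20), ("user1", 5), ("user3", 7)]

# Share it: route each record to the partition that owns its key.
partitions = [[] for _ in range(NUM_PARTITIONS)]
for key, value in records:
    partitions[partition_of(key)].append((key, value))

# Do it locally: each worker aggregates only its own partition...
local_results = []
for part in partitions:
    totals = {}
    for key, value in part:
        totals[key] = totals.get(key, 0) + value
    local_results.append(totals)

# ...and together they go: merge the partial results.
joined = {}
for totals in local_results:
    joined.update(totals)      # keys never cross partitions, so no clashes
assert joined == {"user1": 15, "user2": 20, "user3": 7}
```

Because a key always lands in the same partition, each worker's partial result is final for its keys, which is what makes the merge step trivial and the scheme scalable.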
20. Now with resource management
● A new component (YARN) forms the resource management layer and completes a real distributed data OS.
● MapReduce is from now on only one among other YARN applications.
21. Why is YARN SO important?
● Better resource balance for heterogeneous clusters and multiple applications.
● Dynamic applications over static services.
● A much wider application model than plain MapReduce: things like Spark or Tez.
22. The world's first DATA OS
A 10,000-node computer... Recent technology changes are focused on higher scale: better resource usage and control, lower MTTR, higher security, redundancy and fault tolerance.
24. Choose your destiny! We did.
● HortonWorks is 'barely open source'. Innovative, but 'running too fast': most of their key technologies are not so mature yet.
● Cloudera is stable enough but not stale: Hadoop 2.3 with YARN, HBase 0.98.x. Balance. Spark 1.x is a bold move!
● MapR focuses on performance per node, but they are slightly outdated in terms of functionality, and their distribution costs money. For cases where node performance is a high priority.
25. HBase motivation. But Hadoop is...
● Designed for throughput, not for latency.
● HDFS blocks are expected to be large; there is an issue with lots of small files.
● A write once, read many times ideology.
● MapReduce is not so flexible, and neither is any database built on top of it.
● How about realtime?
26. HBase motivation. BUT WE OFTEN NEED... LATENCY, SPEED and all the Hadoop properties.
27. (Stack diagram: high layer applications on top of the YARN resource management layer, on top of the distributed file system.)
28. Logical data model
● Data is placed in tables.
● Tables are split into regions based on row key ranges.
● Every table row is identified by a unique row key.
● Every row consists of columns.
● Columns are grouped into families.
(Diagram: a Table split into Regions; each Row has a Key plus columns under Family #1, Family #2, ...)
29. Real data model
● Data is stored in HFiles; families are stored on disk in separate files.
● Row keys are indexed in memory.
● A column entry includes key, qualifier, value and timestamp.
● No column limit. Storage is block based.
● A delete is just another marker record.
● Periodic compaction is required.
(Diagram: Region rows mapped to one HFile per family, each holding Row key / Column / Value / TS entries.)
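The delete-marker and compaction behaviour can be sketched as an append-only store. `ToyFamilyStore` and its cell layout are an illustration of the idea, not the real HFile format:

```python
import time

class ToyFamilyStore:
    """Append-only cells, like one HFile per column family."""
    def __init__(self):
        self.cells = []    # (row_key, qualifier, value, timestamp, is_delete)

    def put(self, row_key, qualifier, value):
        self.cells.append((row_key, qualifier, value, time.time(), False))

    def delete(self, row_key, qualifier):
        # A delete is just another marker record, not an in-place removal.
        self.cells.append((row_key, qualifier, None, time.time(), True))

    def get(self, row_key, qualifier):
        """Read the newest cell; a tombstone hides older versions."""
        latest = None
        for cell in self.cells:
            if cell[0] == row_key and cell[1] == qualifier:
                if latest is None or cell[3] >= latest[3]:
                    latest = cell
        return None if latest is None or latest[4] else latest[2]

    def compact(self):
        """Periodic compaction: keep only the newest live cell per column."""
        survivors = {}
        for cell in self.cells:
            key = (cell[0], cell[1])
            if key not in survivors or cell[3] >= survivors[key][3]:
                survivors[key] = cell
        self.cells = [c for c in survivors.values() if not c[4]]

store = ToyFamilyStore()
store.put("row1", "name", "Alice")
store.put("row1", "name", "Bob")     # a newer version shadows the old one
store.delete("row1", "name")         # tombstone marker appended
assert store.get("row1", "name") is None
store.compact()
assert store.cells == []             # compaction drops shadowed and deleted cells
```

This is why the slide says compaction is required: until it runs, shadowed versions and tombstones keep occupying disk space.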
30. HBase: infrastructure view
● Zookeeper coordinates the distributed elements and is the primary contact point for clients.
● The Master server keeps metadata and manages data distribution over the Region servers.
● Region servers manage the data table regions.
● Clients locate the master through ZooKeeper, then the needed regions through the master, and communicate directly with region servers for data.
(Diagram: Client, Zookeeper, Master with META DATA, and a row of RS nodes.)
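The last step of the lookup flow, mapping a row key to the region server that owns its key range, can be sketched with a sorted region map; the boundaries and server names here are made up:

```python
import bisect

# What the master would hand out: regions as sorted row-key start boundaries.
# Region i covers the key range [start_keys[i], start_keys[i + 1]).
start_keys = ["", "g", "p"]                       # "" marks the table start
region_servers = ["rs1.example", "rs2.example", "rs3.example"]

def locate_region(row_key):
    """Find which region (and so which region server) owns a row key."""
    index = bisect.bisect_right(start_keys, row_key) - 1
    return region_servers[index]

# The client caches this map and then talks to region servers directly.
assert locate_region("apple") == "rs1.example"
assert locate_region("grape") == "rs2.example"
assert locate_region("zebra") == "rs3.example"
```

Because the map is small and cached, ZooKeeper and the master sit outside the hot data path, which is what keeps them from becoming a bottleneck.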
31. Together with HDFS
● Zookeeper coordinates the distributed elements and is the primary contact point for clients.
● The Master server keeps metadata and manages data distribution over the Region servers; region servers manage the data table regions.
● Actual data storage, including replication, is served by the HDFS data nodes.
● Clients locate the master through ZooKeeper, then the needed regions through the master, and communicate directly with region servers for data.
(Diagram: Client, Zookeeper, Master, NameNode, and racks of paired RS / DN nodes.)
32. DATA LAKE
Take as much data about your business processes as you can. The more data you have, the more value you can get from it.
33. Apache ZooKeeper
... because coordinating distributed systems is a Zoo
34. Apache ZooKeeper
We use this guy:
● As part of the Hadoop / HBase infrastructure
● To coordinate MapReduce job tasks
35. Apache Spark
● A better MapReduce, with at least some MapReduce elements reusable.
● Dynamic, faster to start up, and doesn't need anything from the cluster.
● New job models: not only Map and Reduce.
● Results, including the final one, can be passed through memory.
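The "results passed through memory" and "not only Map and Reduce" points can be sketched as a lazily chained pipeline; `ToyRDD` is a plain-Python illustration of the idea, not the Spark API:

```python
class ToyRDD:
    """A lazy chain of transformations, evaluated only when collected."""
    def __init__(self, items):
        self._items = items            # generators keep data flowing in memory

    def map(self, fn):
        return ToyRDD(fn(x) for x in self._items)

    def filter(self, pred):
        return ToyRDD(x for x in self._items if pred(x))

    def collect(self):
        """An action: only now does the whole chain actually run."""
        return list(self._items)

# Arbitrary chains of transformations, with no disk writes between steps.
result = (ToyRDD(range(10))
          .filter(lambda x: x % 2 == 0)
          .map(lambda x: x * x)
          .collect())
assert result == [0, 4, 16, 36, 64]
```

The contrast with classic MapReduce is that each intermediate step here stays in memory as a generator, instead of being materialized to HDFS between jobs.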
36. SOLR is just about search
An index update request is analyzed, tokenized, transformed... and the same goes for queries.
● SOLR indexes documents. What is stored in the SOLR index is not what you index. SOLR is NOT A STORAGE, ONLY AN INDEX.
● But it can index ANYTHING. A search result is a document ID.
(Diagram: INDEX UPDATE and INDEX QUERY flows, returning search responses.)
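The "index, not storage" point can be sketched with a toy inverted index: documents go in, only token-to-ID postings are kept, and a query returns document IDs. The analyzer and document IDs are illustrative, not Solr internals:

```python
from collections import defaultdict

def analyze(text):
    """Analyze/tokenize/transform: the same pipeline for updates and queries."""
    return [token.lower().strip(".,!?") for token in text.split()]

index = defaultdict(set)               # token -> set of document IDs

def index_update(doc_id, text):
    for token in analyze(text):
        index[token].add(doc_id)       # the original text is NOT stored

def index_query(text):
    """AND-query: IDs of documents containing every query token."""
    postings = [index[token] for token in analyze(text)]
    return set.intersection(*postings) if postings else set()

index_update("doc1", "Big data needs search.")
index_update("doc2", "Search is about indexes, not storage.")
assert index_query("search") == {"doc1", "doc2"}
assert index_query("big search") == {"doc1"}
```

As on the slide, the caller gets document IDs back and must fetch the actual documents from wherever they really live (in our stack, HBase).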
37. ● HBase handles online user data change requests.
● The NGData Lily indexer handles the stream of changes and transforms it into SOLR index change requests.
● Indexes are built in SOLR, so HBase data is searchable.
38. ENTERPRISE DATA HUB
Don't ruin your existing data warehouse. Just extend it with a new, centralized big data storage through a data migration solution.
39. HBase: Data and search integration
● The client just puts (or deletes) data; HBase regions handle the data update on top of HDFS, which serves the low level file system.
● REPLICATION from the HBase cluster can be set up down to column family level.
● The Lily HBase NRT indexer translates data changes into SOLR index updates.
● SOLR cloud finally provides search; search requests and responses go over HTTP.
● Apache Zookeeper does all the coordination.
(Diagram: Client, HBase cluster with regions on HDFS, Lily indexer, SOLR cloud, Zookeeper.)