Roman Nikitchenko, 09.10.2014 
BIG.DATA 
Technology scope
Any real big data is 
just about 
DIGITAL LIFE 
FOOTPRINT 
www.vitech.com.ua 2
BIG DATA is not about the 
data. It is about OUR ABILITY 
TO HANDLE THEM. 
Our stack
What is our stack of big data technologies?
Some of our specifics
But we are always special, aren't we?
A couple of buzzwords
Arguments for meetings with management ;-)
YARN
● Linear scalability: 2 times more power costs 2 times more money.
● No natural keys, so load balancing is perfect.
● No 'special' hardware, so staging is closer to production.
HADOOP magic is here! 
What is HADOOP?
● Hadoop is an open source framework for big data, covering both distributed storage and processing.
● Hadoop is reliable and fault tolerant without relying on hardware for these properties.
● Hadoop has unique horizontal scalability: currently from a single computer up to thousands of cluster nodes.
What is HADOOP INDEED?
Why Hadoop
[Figure: diagram built from repeated BIG DATA blocks with the labels "=", "+", "?" and "x MAX"]
SIMPLE BUT RELIABLE
● A really big amount of data is stored in a reliable manner.
● Storage is simple, recoverable and (relatively) cheap.
● The same applies to processing power.
COMPLEX INSIDE, SIMPLE OUTSIDE
● Complexity is buried inside. Most of the really complex operations are handled by the engine.
● The interface is remote and compatible between versions, so clients are relatively safe against implementation changes.
DECENTRALIZED
● No single point of failure (almost).
● Scalable as close to linear as possible.
● No manual actions to recover in case of failures.
Hadoop historical top view
● HDFS serves as the file system layer.
● MapReduce originally served as the distributed processing framework.
● The native client API is Java, but there are a lot of alternatives.
● This is only the initial architecture; it is now more complex.
HDFS top view
HDFS is... scalable
● The namenode is the 'management' component. It keeps a 'directory' of what file blocks are stored where.
● Actual work is performed by data nodes.
HDFS is... reliable
● Files are stored in large enough blocks. Every block is replicated to several data nodes.
● Replication is tracked by the namenode. Clients only locate blocks through the namenode; the actual load is taken by the datanodes.
● A datanode failure triggers replication recovery. The namenode can be backed by a standby scheme.
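The namenode bookkeeping described on this slide can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the HDFS API: the class name `ToyNameNode`, the replication factor of 3 and the node names are all made up for illustration.

```python
class ToyNameNode:
    """Toy model of namenode bookkeeping; not the real HDFS API."""

    def __init__(self, datanodes, replication=3):
        self.live = set(datanodes)      # datanodes currently alive
        self.replication = replication  # target replica count (assumed value)
        self.block_map = {}             # block id -> set of datanodes holding it

    def _load(self, node):
        # Number of block replicas currently placed on the node.
        return sum(node in holders for holders in self.block_map.values())

    def _by_load(self, nodes):
        # Least loaded first; the node name breaks ties deterministically.
        return sorted(nodes, key=lambda n: (self._load(n), n))

    def add_block(self, block_id):
        targets = self._by_load(self.live)[: self.replication]
        self.block_map[block_id] = set(targets)
        return targets

    def locate(self, block_id):
        # Clients only ask where a block lives; data is read from datanodes.
        return sorted(self.block_map[block_id])

    def datanode_failed(self, node):
        # Forget the node, then restore the replica count of affected blocks.
        self.live.discard(node)
        for holders in self.block_map.values():
            holders.discard(node)
            missing = max(0, self.replication - len(holders))
            holders.update(self._by_load(self.live - holders)[:missing])


namenode = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
namenode.add_block("b1")
namenode.add_block("b2")
namenode.datanode_failed("dn1")
print(namenode.locate("b1"))
```

The point of the sketch is that recovery needs no manual step: losing a datanode only changes the block map, and the missing replicas are placed on the remaining nodes.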
NO BACKUPS 
MapReduce is...
● A 2-step data processing model: transform (map) and then reduce. Really nice for doing things in a distributed manner.
● A large class of jobs can be adapted to it, but not all of them.
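The two-step model can be illustrated with the classic word count. The sketch below runs in a single process and only mimics the shape of a MapReduce job; the function names are hypothetical and this is not the Hadoop API.

```python
from collections import defaultdict


def map_phase(records, mapper):
    # Apply the mapper to every record and collect (key, value) pairs.
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    # Group values by key; a real framework does this between the two steps.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    # Collapse every group of values into a single result per key.
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    return [(word.lower(), 1) for word in line.split()]


def count_reducer(word, counts):
    return sum(counts)


def word_count(lines):
    return reduce_phase(shuffle(map_phase(lines, word_mapper)), count_reducer)


print(word_count(["big data", "Big deal"]))
```

Because the mapper looks at one record at a time and the reducer at one key at a time, both steps can be spread over many nodes without changing the job code.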
BIG DATA processing requires DISTRIBUTION
LOAD HAS TO BE SHARED
● Work is to be balanced.
● Work can be shared in accordance with data placement.
● Work is to be balanced to reflect resource balance.
DATA LOCALITY
TOOLS ARE TO BE CLOSE TO WORK PLACE
● With MapReduce, process data on the same nodes it is stored on.
● Distributed storage means distributed processing.
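A minimal sketch of locality-aware task placement follows. It assumes a simple slot model (each node accepts a fixed number of tasks) and a hypothetical `schedule` function; real Hadoop schedulers are considerably more involved.

```python
def schedule(block_locations, free_slots):
    """Assign one task per block, preferring nodes that store the block.

    block_locations: block id -> list of nodes holding a replica
    free_slots: node -> number of tasks the node can still accept
    Returns block id -> (node, is_local).
    """
    slots = dict(free_slots)
    plan = {}
    for block, holders in sorted(block_locations.items()):
        local = [n for n in holders if slots.get(n, 0) > 0]
        if local:
            # Run where the data is; pick the holder with most free slots.
            node, is_local = max(local, key=lambda n: (slots[n], n)), True
        else:
            # No holder has capacity: fall back to any node with a free slot.
            remote = [n for n in sorted(slots) if slots[n] > 0]
            if not remote:
                raise RuntimeError("no free slots left")
            node, is_local = max(remote, key=lambda n: slots[n]), False
        slots[node] -= 1
        plan[block] = (node, is_local)
    return plan


plan = schedule(
    {"b1": ["n1", "n2"], "b2": ["n1"], "b3": ["n1"]},
    {"n1": 1, "n2": 1, "n3": 1},
)
print(plan)
```

Only the third block has to be read over the network here, because both nodes that hold its data are already busy.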
DISTRIBUTION + LOCALITY
TOGETHER THEY GO!
[Figure: YOUR DATA is split into three BIG DATA partitions; WORK TO DO is shared across them ("Share it"), each partition is processed in place ("Do it locally"), and the outputs form a JOINED RESULT]
Data partitioning drives work sharing. Good partitioning means good scalability.
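One common way to partition data is to hash the record key. The sketch below is an illustration under that assumption (the slides do not say which partitioning scheme is used); `partition_of`, `partition_sizes` and `skew` are hypothetical helper names.

```python
import zlib
from collections import Counter


def partition_of(key, partitions):
    # crc32 is stable across runs and machines, unlike Python's built-in hash().
    return zlib.crc32(key.encode("utf-8")) % partitions


def partition_sizes(keys, partitions):
    counts = Counter(partition_of(k, partitions) for k in keys)
    return [counts.get(p, 0) for p in range(partitions)]


def skew(sizes):
    # Largest partition relative to a perfectly even share; 1.0 is ideal.
    return max(sizes) / (sum(sizes) / len(sizes))


keys = [f"user-{i}" for i in range(1000)]
sizes = partition_sizes(keys, 4)
print(sizes, round(skew(sizes), 2))
```

The skew value is a rough measure of how well work can be shared: the slowest worker is the one with the largest partition, so a skew of 2.0 means the job takes about twice as long as an even split would.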
Now with resource management
● A new component (YARN) forms the resource management layer and completes a real distributed data OS.
● From now on, MapReduce is only one among other YARN applications.
Why is YARN SO important?
● Better resource balance for heterogeneous clusters and multiple applications.
● Dynamic applications over static services.
● A much wider application model than simple MapReduce. Things like Spark or Tez.
The world's first ever DATA OS
A 10,000-node computer...
Recent technology changes are focused on higher scale: better resource usage and control, lower MTTR, higher security, redundancy, fault tolerance.
Hadoop: don't do it yourself
Choose your destiny! We did.
● HortonWorks are 'barely open source'. Innovative, but 'running too fast'. Most of their key technologies are not so mature yet.
● Cloudera is stable enough but not stale. Hadoop 2.3 with YARN, HBase 0.98.x. Balance. Spark 1.x is a bold move!
● MapR focuses on performance per node, but they are slightly outdated in terms of functionality and their distribution costs money. For cases where node performance is a high priority.
HBase motivation
But Hadoop is...
● Designed for throughput, not for latency.
● HDFS blocks are expected to be large. There is an issue with lots of small files.
● Write once, read many times ideology.
● MapReduce is not very flexible, and neither is any database built on top of it.
● How about realtime?
HBase motivation
BUT WE OFTEN NEED...
LATENCY, SPEED and all Hadoop properties.
High layer applications 
Resource management 
YARN 
Distributed file system 
Logical data model
● Data is placed in tables.
● Tables are split into regions based on row key ranges.
● Every table row is identified by a unique row key.
● Every row consists of columns.
● Columns are grouped into families.
Table > Region > Row: Row Key | Family #1 (Column, Column, ...) | Family #2 (...) | ...
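The logical model maps naturally onto nested dictionaries plus a sorted list of split keys. The sketch below is a toy illustration, not the HBase client API; the class name `ToyTable`, the split keys and the sample rows are assumptions.

```python
import bisect


class ToyTable:
    """Toy model of the logical layout; not the HBase client API."""

    def __init__(self, split_keys):
        # Region i holds row keys in [split_keys[i - 1], split_keys[i]).
        self.split_keys = sorted(split_keys)
        self.regions = [dict() for _ in range(len(self.split_keys) + 1)]

    def region_index(self, row_key):
        # Binary search over split keys finds the region for a row key.
        return bisect.bisect_right(self.split_keys, row_key)

    def put(self, row_key, family, qualifier, value):
        row = self.regions[self.region_index(row_key)].setdefault(row_key, {})
        row.setdefault(family, {})[qualifier] = value

    def get(self, row_key):
        return self.regions[self.region_index(row_key)].get(row_key, {})


table = ToyTable(split_keys=["g", "p"])
table.put("apple", "info", "color", "red")
table.put("zebra", "info", "legs", "4")
print(table.region_index("apple"), table.region_index("zebra"))
print(table.get("apple"))
```

Since regions are contiguous row key ranges, the choice of row key decides which region, and therefore which server, receives a given read or write.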
Real data model
● Data is stored in HFile.
● Families are stored on disk in separate files.
● Row keys are indexed in memory.
● Column includes key, qualifier, value and timestamp.
● No column limit.
● Storage is block based.
● Delete is just another marker record.
● Periodic compaction is required.
Table > Region > Row: Row Key | Family #1 (Column, Column, ...) | Family #2 (...) | ...
HFile: family #1
Row key | Column | Value | TS
...     | ...    | ...   | ...
HFile: family #2
Row key | Column | Value | TS
...     | ...    | ...   | ...
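Why a delete is "just another marker record", and why compaction is then needed, can be shown with an append-only store for one column family. This is a sketch under simplifying assumptions (one version per cell survives compaction, everything lives in memory); it does not reproduce the HFile format, and `ToyFamilyStore` is a hypothetical name.

```python
DELETE = object()  # marker value standing in for a delete record


class ToyFamilyStore:
    """Append-only cell store for one column family; a sketch, not HFile."""

    def __init__(self):
        self.cells = []  # (row_key, qualifier, timestamp, value)

    def put(self, row_key, qualifier, timestamp, value):
        self.cells.append((row_key, qualifier, timestamp, value))

    def delete(self, row_key, qualifier, timestamp):
        # Nothing is erased; a marker record is appended instead.
        self.cells.append((row_key, qualifier, timestamp, DELETE))

    def get(self, row_key, qualifier):
        # The newest timestamp wins; a marker hides older values.
        matching = [c for c in self.cells if c[:2] == (row_key, qualifier)]
        if not matching:
            return None
        newest = max(matching, key=lambda c: c[2])
        return None if newest[3] is DELETE else newest[3]

    def compact(self):
        # Keep the newest cell per (row, qualifier) and drop delete markers.
        newest = {}
        for cell in self.cells:
            key = cell[:2]
            if key not in newest or cell[2] >= newest[key][2]:
                newest[key] = cell
        survivors = [c for c in newest.values() if c[3] is not DELETE]
        self.cells = sorted(survivors, key=lambda c: c[:3])


store = ToyFamilyStore()
store.put("r1", "name", 1, "a")
store.put("r1", "name", 2, "b")
store.put("r2", "name", 1, "x")
store.delete("r2", "name", 2)
print(len(store.cells), store.get("r2", "name"))
store.compact()
print(store.cells)
```

Before compaction the store holds four records for what is logically a single live value, which is the space and read cost that periodic compaction removes.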
HBase: infrastructure view
[Figure: a Client, Zookeeper, the Master (META DATA) and four region servers (RS)]
● Zookeeper coordinates distributed elements and is the primary contact point for clients.
● Master server keeps metadata and manages data distribution over region servers.
● Region servers manage data table regions.
● Clients directly communicate with region servers for data.
● Clients locate the master through ZooKeeper, then the needed regions through the master.
Together with HDFS
[Figure: a Client, Zookeeper, the Master (META DATA), the NameNode and three racks, each holding two region servers (RS) and two HDFS data nodes (DN)]
● Zookeeper coordinates distributed elements and is the primary contact point for clients.
● Master server keeps metadata and manages data distribution over region servers.
● Region servers manage data table regions.
● Actual data storage service, including replication, is on HDFS data nodes.
● Clients directly communicate with region servers for data.
● Clients locate the master through ZooKeeper, then the needed regions through the master.
DATA LAKE 
Take as much data 
about your business 
processes as you can 
take. The more data 
you have the more 
value you could get 
from it. 
Apache ZooKeeper
… because coordinating distributed systems is a Zoo
Apache 
ZooKeeper 
We use this guy: 
● As a part of Hadoop / 
HBase infrastructure 
● To coordinate MapReduce 
job tasks 
Apache Spark
● A better MapReduce, with at least some MapReduce elements able to be reused.
● Dynamic, faster to start up and does not need anything from the cluster.
● New job models. Not only Map and Reduce.
● Results, including the final one, can be passed through memory.
SOLR is just about search
[Figure labels: INDEX UPDATE, INDEX QUERY, Search responses]
An index update request is analyzed, tokenized, transformed... and the same goes for queries.
● SOLR indexes documents. What is stored in the SOLR index is not what you index. SOLR is NOT A STORAGE, ONLY AN INDEX.
● But it can index ANYTHING. The search result is a document ID.
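The idea that updates and queries go through the same analysis, and that a search returns only document IDs, can be sketched with a minimal inverted index. This is an illustration only, not the SOLR API; `analyze` and `ToyIndex` are hypothetical names and the analysis chain here is just lowercasing plus tokenizing.

```python
import re
from collections import defaultdict


def analyze(text):
    # The same analysis is used for indexing and querying.
    return re.findall(r"[a-z0-9]+", text.lower())


class ToyIndex:
    """Minimal inverted index; a sketch of the idea, not the SOLR API."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of documents with it

    def update(self, doc_id, text):
        # Only terms and ids are kept; the original text is not stored.
        for term in analyze(text):
            self.postings[term].add(doc_id)

    def query(self, text):
        # Ids of documents that contain every query term.
        terms = analyze(text)
        if not terms:
            return []
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return sorted(result)


index = ToyIndex()
index.update("doc1", "Hadoop is a Big Data framework")
index.update("doc2", "HBase stores big tables")
print(index.query("BIG"), index.query("big data"))
```

The query "BIG" matches both documents only because it is lowercased the same way the indexed text was; the caller still has to fetch the documents themselves from wherever they are stored.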
● HBase handles online user data change requests.
● NGData Lily indexer handles the stream of changes and transforms them into SOLR index change requests.
● Indexes are built on SOLR, so HBase data are searchable.
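The translation step in the middle of this pipeline can be sketched as a pure function from change events to index requests. The event and request shapes below are hypothetical, chosen only for illustration; they are not the Lily indexer's or SOLR's actual formats.

```python
def to_index_requests(changes):
    """Translate row change events into index update requests.

    changes: iterable of ("put", row_key, {column: value}) or
             ("delete", row_key, None) tuples (hypothetical event shape).
    """
    requests = []
    for kind, row_key, columns in changes:
        if kind == "put":
            # Index the column values; the row key becomes the document id.
            text = " ".join(str(v) for _, v in sorted(columns.items()))
            requests.append({"op": "add", "id": row_key, "text": text})
        elif kind == "delete":
            requests.append({"op": "delete", "id": row_key})
        else:
            raise ValueError(f"unknown change kind: {kind}")
    return requests


stream = [
    ("put", "r1", {"b": "world", "a": "hello"}),
    ("delete", "r2", None),
]
print(to_index_requests(stream))
```

Using the row key as the document ID is what lets a search hit be resolved back to the stored row.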
ENTERPRISE DATA HUB
Don't ruin your existing data warehouse. Just extend it with new, centralized big data storage through a data migration solution.
HBase: Data and search integration
[Figure: a Client, the HBase cluster (HBase regions on HDFS), REPLICATION, the Lily HBase NRT indexer, SOLR cloud and Apache Zookeeper]
● Client: the user just puts (or deletes) data (data update) and sends search requests (HTTP).
● HDFS: serves as the low level file system.
● REPLICATION: can be set up to column family level.
● Lily HBase NRT indexer: translates data changes into SOLR index updates.
● SOLR cloud: finally provides search (search responses).
● Apache Zookeeper does all coordination.
Questions and discussion 

Big data: current technology scope.

  • 1. Roman Nikitchenko, 09.10.2014 BIG.DATA Technology scope
  • 2. Any real big data is just about DIGITAL LIFE FOOTPRINT www.vitech.com.ua 2
  • 3. BIG DATA is not about the data. It is about OUR ABILITY TO HANDLE THEM.
  • 4. Our stack: What is our stack of big data technologies? Some of our specifics: But we are always special, aren't we? A couple of buzzwords: Arguments for meetings with management ;-)
  • 5. YARN ● Linear scalability: 2 times more power costs 2 times more money. ● No natural keys, so load balancing is perfect. ● No 'special' hardware, so staging is closer to production.
  • 6. HADOOP magic is here!
  • 7. What is HADOOP? ● Hadoop is an open source framework for big data, covering both distributed storage and processing. ● Hadoop is reliable and fault tolerant without relying on hardware for these properties. ● Hadoop has unique horizontal scalability: currently from a single computer up to thousands of cluster nodes.
  • 8. What is HADOOP INDEED? Why Hadoop? [Figure: repeated BIG DATA blocks combined, labeled "x MAX"]
  • 9. SIMPLE BUT RELIABLE ● A really big amount of data is stored in a reliable manner. ● Storage is simple, recoverable and (relatively) cheap. ● The same applies to processing power.
  • 10. COMPLEX INSIDE, SIMPLE OUTSIDE ● Complexity is buried inside. Most of the really complex operations are handled by the engine. ● The interface is remote and compatible between versions, so clients are relatively safe against implementation changes.
  • 11. DECENTRALIZED ● No single point of failure (almost). ● Scalable as close to linear as possible. ● No manual actions are needed to recover from failures.
  • 12. Hadoop historical top view ● HDFS serves as the file system layer. ● MapReduce originally served as the distributed processing framework. ● The native client API is Java, but there are a lot of alternatives. ● This is only the initial architecture; it is now more complex.
  • 13. HDFS top view: HDFS is... scalable ● The NameNode is the 'management' component. It keeps a 'directory' of which file blocks are stored where. ● Actual work is performed by the DataNodes.
  • 14. HDFS is... reliable ● Files are stored in large enough blocks. Every block is replicated to several DataNodes. ● Replication is tracked by the NameNode. Clients only locate blocks using the NameNode; the actual load is taken by the DataNodes. ● A DataNode failure leads to replication recovery. The NameNode can be backed by a standby scheme.
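The replication behaviour on slide 14 can be illustrated with a small toy model. This is only a sketch under stated assumptions: the block size, the replication factor of 3, the round-robin placement and all function names are hypothetical and chosen for illustration; real HDFS placement is rack-aware and considerably more involved.

```python
BLOCK_SIZE = 128   # hypothetical toy block size in bytes
REPLICATION = 3    # hypothetical replication factor


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Toy 'NameNode directory': block id -> list of DataNodes holding a replica.

    Assumes replication <= len(datanodes) so replicas land on distinct nodes.
    """
    n = len(datanodes)
    return {
        block_id: [datanodes[(block_id + r) % n] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def handle_datanode_failure(placement, failed, datanodes, replication=REPLICATION):
    """Drop the failed node and re-replicate under-replicated blocks to live nodes."""
    live = [d for d in datanodes if d != failed]
    for holders in placement.values():
        holders[:] = [d for d in holders if d != failed]
        for candidate in live:
            if len(holders) >= replication:
                break
            if candidate not in holders:
                holders.append(candidate)
    return placement
```

In this model the client would ask the directory where a block lives and then read it from one of the listed nodes; losing a node only triggers copying, not data loss, as long as at least one replica survives.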
  • 16. MapReduce is... ● A two-step data processing model: transform (map) and then reduce. A really nice way to do things in a distributed manner. ● A large class of jobs can be adapted to it, but not all of them.
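The two-step model on slide 16 is easiest to see on the classic word count. The sketch below runs in a single process and only imitates the data flow (map, group by key, reduce); the function names are illustrative and this is not the Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Transform step: emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between the steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Because each line is mapped independently and each key is reduced independently, both steps can be spread over many nodes, which is what makes the model attractive for distributed processing.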
  • 17. BIG DATA processing requires DISTRIBUTION: LOAD HAS TO BE SHARED ● Work has to be balanced. ● Work can be shared in accordance with data placement. ● Work has to be balanced to reflect resource balance.
  • 18. DATA LOCALITY: TOOLS ARE TO BE CLOSE TO THE WORK PLACE ● With MapReduce, process data on the same nodes it is stored on. ● Distributed storage, distributed processing.
  • 19. DISTRIBUTION + LOCALITY: TOGETHER THEY GO! Share it, do it locally. [Diagram: YOUR DATA is split into partitions; the WORK TO DO is shared across the partitions and combined into a JOINED RESULT.] Data partitioning drives work sharing. Good partitioning means good scalability.
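The idea on slide 19, that partitioning drives work sharing, can be sketched as follows. Hash partitioning with CRC32 is an assumption made for this example (the slides do not name a partitioning scheme), and all function names are hypothetical.

```python
import zlib


def partition_for(key, num_partitions):
    """Stable hash partitioning: the same key always lands in the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_records(records, num_partitions):
    """Split (key, value) records into partitions by key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


def process_partition(partition):
    """Local work on one partition; here simply the sum of its values."""
    return sum(value for _, value in partition)


def run(records, num_partitions):
    """Do the work per partition, then join the partial results."""
    partitions = partition_records(records, num_partitions)
    partial_results = [process_partition(p) for p in partitions]
    return sum(partial_results), [len(p) for p in partitions]
```

The second value returned by `run` is the size of each partition. If those sizes are far apart, some workers sit idle while others are overloaded, which is the practical meaning of "good partitioning means good scalability".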
  • 20. Now with resource management ● A new component (YARN) forms the resource management layer and completes a real distributed data OS. ● MapReduce is now just one among other YARN applications.
  • 21. Why is YARN SO important? ● Better resource balance for heterogeneous clusters and multiple applications. ● Dynamic applications over static services. ● A much wider application model than simple MapReduce: things like Spark or Tez.
  • 22. World's first ever DATA OS: a 10,000-node computer... Recent technology changes are focused on higher scale: better resource usage and control, lower MTTR, higher security, redundancy, fault tolerance.
  • 23. Hadoop: don't do it yourself
  • 24. Choose your destiny! We did. ● Hortonworks is 'barely open source'. Innovative, but 'running too fast'. Most of their key technologies are not so mature yet. ● Cloudera is stable enough but not stale. Hadoop 2.3 with YARN, HBase 0.98.x. Balance. Spark 1.x is a bold move! ● MapR focuses on performance per node, but they are slightly outdated in terms of functionality, and their distribution costs money. For cases where node performance is a high priority.
  • 25. HBase motivation: But Hadoop is... ● Designed for throughput, not for latency. ● HDFS blocks are expected to be large. There is an issue with lots of small files. ● Write once, read many times ideology. ● MapReduce is not so flexible, and neither is any database built on top of it. ● How about realtime?
  • 26. HBase motivation: BUT WE OFTEN NEED... LATENCY, SPEED and all Hadoop properties.
  • 27. [Stack diagram, top to bottom: high layer applications; resource management (YARN); distributed file system.]
  • 28. Logical data model ● Data is placed in tables. ● Tables are split into regions based on row key ranges. ● Every table row is identified by a unique row key. ● Every row consists of columns. ● Columns are grouped into families. [Diagram: a Table split into Regions; each row has a Row Key and columns grouped under Family #1, Family #2, ...]
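The logical model on slide 28 (tables split into regions by row key range, columns grouped into families) can be mimicked with a small in-memory structure. This is a sketch only: the `Table` class, its methods and the split keys are hypothetical and are not the HBase client API. It assumes a region covers its start key inclusively and its end key exclusively.

```python
import bisect


class Table:
    """Toy table: rows are routed to regions by comparing row keys to split keys."""

    def __init__(self, split_keys):
        # N sorted split keys define N + 1 regions covering the whole key space.
        self.split_keys = sorted(split_keys)
        self.regions = [dict() for _ in range(len(self.split_keys) + 1)]

    def region_index(self, row_key):
        """Find the region whose key range contains row_key."""
        return bisect.bisect_right(self.split_keys, row_key)

    def put(self, row_key, family, qualifier, value):
        """Store a value under row key -> family -> column qualifier."""
        row = self.regions[self.region_index(row_key)].setdefault(row_key, {})
        row.setdefault(family, {})[qualifier] = value

    def get(self, row_key):
        """Return the whole row, or None if the row key is unknown."""
        return self.regions[self.region_index(row_key)].get(row_key)
```

For example, with split keys `["g", "p"]` the table has three regions, and a row keyed `"kiwi"` is routed to the middle one. Since routing depends only on the row key, the choice of row keys decides how evenly load spreads over regions.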
  • 29. Real data model ● Data is stored in HFiles. ● Families are stored on disk in separate files. ● Row keys are indexed in memory. ● A column includes key, qualifier, value and timestamp. ● No column limit. ● Storage is block based. ● Delete is just another marker record. ● Periodic compaction is required. [Diagram: within a Region, each family has its own HFile (HFile: family #1, HFile: family #2) with the columns Row key, Column, Value, TS.]
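Two points from slide 29, "delete is just another marker record" and "periodic compaction is required", are closely linked, and a toy append-only store shows why. The `Cell` type and both functions are hypothetical; the sketch keeps only the newest version per column, which is a simplification of what HBase actually does with versions and delete markers.

```python
from typing import NamedTuple, Optional


class Cell(NamedTuple):
    row: str
    qualifier: str
    timestamp: int
    value: Optional[bytes]  # None stands for a delete marker


def read(cells, row, qualifier):
    """Return the newest value for a column, or None if absent or deleted."""
    versions = [c for c in cells if c.row == row and c.qualifier == qualifier]
    if not versions:
        return None
    return max(versions, key=lambda c: c.timestamp).value


def compact(cells):
    """Rewrite the store keeping only the newest live cell per column."""
    latest = {}
    for cell in cells:
        key = (cell.row, cell.qualifier)
        if key not in latest or cell.timestamp > latest[key].timestamp:
            latest[key] = cell
    # Columns whose newest record is a delete marker disappear entirely.
    return sorted(c for c in latest.values() if c.value is not None)
```

Writes and deletes only ever append records, so old versions and markers pile up until a compaction rewrites the file; that is the price of keeping writes cheap on a write-once file system.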
  • 30. HBase: infrastructure view ● ZooKeeper coordinates distributed elements and is the primary contact point for clients. ● The Master server keeps metadata and manages data distribution over Region servers. ● Region servers (RS) manage data table regions. ● Clients communicate directly with region servers for data. ● Clients locate the master through ZooKeeper, then the needed regions through the master. [Diagram: Client, ZooKeeper, Master with META DATA, and four Region servers.]
  • 31. Together with HDFS ● ZooKeeper coordinates distributed elements and is the primary contact point for clients. ● The Master server keeps metadata and manages data distribution over Region servers. ● Region servers (RS) manage data table regions. ● The actual data storage service, including replication, is on HDFS DataNodes (DN). ● Clients communicate directly with region servers for data. ● Clients locate the master through ZooKeeper, then the needed regions through the master. [Diagram: three racks, each with region servers and DataNodes side by side, plus NameNode, Master with META DATA, ZooKeeper and Client.]
  • 32. DATA LAKE Take as much data about your business processes as you can. The more data you have, the more value you can get from it.
  • 33. Apache ZooKeeper … because coordinating distributed systems is a Zoo
  • 34. Apache ZooKeeper We use this guy: ● As a part of Hadoop / HBase infrastructure ● To coordinate MapReduce job tasks
  • 35. Apache Spark ● A better MapReduce, with at least some MapReduce elements able to be reused. ● Dynamic, faster to start up, and does not need anything from the cluster. ● New job models, not only Map and Reduce. ● Results, including the final one, can be passed through memory.
  • 36. SOLR is just about search ● An index update request is analyzed, tokenized, transformed... and the same goes for queries. ● SOLR indexes documents. What is stored in the SOLR index is not what you index. SOLR is NOT A STORAGE, ONLY AN INDEX. ● But it can index ANYTHING. The search result is a document ID. [Diagram: INDEX UPDATE and INDEX QUERY paths, with search responses.]
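Slide 36 makes two points that a tiny inverted index illustrates well: documents and queries go through the same analysis, and a search returns document IDs rather than the documents themselves. The `analyze` function and `TinyIndex` class below are hypothetical and far simpler than SOLR's real analysis chain; this is a sketch of the principle only.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Inverted index: token -> set of document IDs. The text is not kept."""

    def __init__(self):
        self.postings = defaultdict(set)

    def index(self, doc_id, text):
        for token in analyze(text):
            self.postings[token].add(doc_id)

    def search(self, query):
        """Return IDs of documents containing every token of the query."""
        tokens = analyze(query)
        if not tokens:
            return set()
        result = set(self.postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self.postings.get(token, set())
        return result
```

Because only tokens and IDs are kept, the original document has to be fetched from the real storage by ID after a search, which matches the slide's point that the index is not a storage.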
  • 37. ● HBase handles online requests that change user data. ● The NGData Lily indexer handles the stream of changes and transforms them into SOLR index change requests. ● Indexes are built on SOLR, so HBase data is searchable.
  • 38. ENTERPRISE DATA HUB Don't ruin your existing data warehouse. Just extend it with a new, centralized big data storage through a data migration solution.
  • 39. HBase: Data and search integration ● Client: the user just puts (or deletes) data. ● HBase cluster (HBase regions): replication can be set up to column family level. ● HDFS: serves the low level file system. ● Lily HBase NRT indexer: translates data changes into SOLR index updates. ● SOLR cloud: finally provides search, taking search requests (HTTP) and returning search responses. ● Apache ZooKeeper does all the coordination.
  • 40. Questions and discussion