Situation: Assume there are 10 devices. Each device writes to a .csv log file 10 times a second. With
this setup, there are approximately 1 million rows/day added. This data must be bulk loaded in real time
to a Hadoop cluster. The data must also be aggregated daily and monthly for analytics. For each of the
big data platforms: What are the minimum hardware requirements, recommended cluster size, and
approximate cost? What is the setup of dashboards to use for analytics? Are there any methods of
adding additional data streams? If the number of devices is increased to 100 or 1000, how easy is it to
scale the system?
The Hadoop platform also includes the Hadoop Distributed File System (HDFS), which is designed for scalability and
fault tolerance. HDFS stores large files by dividing them into blocks (usually 64 or 128 MB) and replicating each
block on three or more servers. HDFS clusters in production use today reliably hold petabytes of data on thousands
of nodes.
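Taking the scenario's stated figure of roughly one million rows per day, a back-of-envelope calculation shows how the block size and replication factor above translate into daily storage. The 100-bytes-per-row figure is an assumption for illustration, not part of the scenario:

```python
# Back-of-envelope sizing for the stated workload.
ROWS_PER_DAY = 1_000_000         # figure given in the scenario
BYTES_PER_ROW = 100              # assumption: small CSV rows
REPLICATION = 3                  # HDFS default replication factor
BLOCK_SIZE = 128 * 1024 * 1024   # one 128 MB HDFS block

raw_bytes_per_day = ROWS_PER_DAY * BYTES_PER_ROW
stored_bytes_per_day = raw_bytes_per_day * REPLICATION
days_per_block = BLOCK_SIZE / raw_bytes_per_day

print(f"raw/day:    {raw_bytes_per_day / 1e6:.0f} MB")
print(f"stored/day: {stored_bytes_per_day / 1e6:.0f} MB (x{REPLICATION} replication)")
print(f"~{days_per_block:.1f} days of raw data fill one 128 MB block")
```

At this rate the cluster stores on the order of a few hundred megabytes per day after replication, which is why small files and block packing matter more than raw capacity for this workload.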
What are the minimum hardware requirements?
Compute, memory, storage, network, and software are the primary drivers that establish hardware requirements.
The main features to weigh in each component when deciding on hardware are presented below:
Computing:
Integrated I/O, which reduces I/O latency and increases I/O bandwidth, is particularly beneficial when
processing large data sets. Advanced Encryption Standard New Instructions (AES-NI) accelerate common
cryptographic functions, eliminating the performance penalty typically associated with encrypting and decrypting
big-data files.
Memory:
Hadoop typically requires 48 GB to 96 GB of RAM per server, and 64 GB is optimal in most cases. Memory errors
are one of the most common causes of server failure and data corruption, so error-correcting code (ECC) memory is
highly recommended.
Storage:
Each server in a Hadoop cluster requires a relatively large number of storage drives to avoid I/O bottlenecks. A
single solid-state drive (SSD) per core can deliver higher I/O throughput, reduced latency, and better overall cluster
performance. Higher-RPM hard drives provide a good balance between cost and performance.
Network:
A fast network not only allows data to be imported and exported quickly, but can also improve performance in the
shuffle phase of MapReduce applications. A faster network (e.g., 10-gigabit Ethernet) provides a simple, cost-effective solution.
Recommended cluster size and approximate cost
The rule of thumb for Hadoop is to throw more nodes at the problem.
Both cost and speed scale approximately linearly with the number of nodes, so budget and performance can be
traded off simply by adding or removing servers.
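Under that linear model, cluster size for this workload can be estimated from storage alone. The per-node drive count, retention period, and minimum cluster size below are illustrative assumptions, not recommendations:

```python
import math

# All figures are illustrative assumptions, not vendor recommendations.
RETENTION_DAYS = 3 * 365   # keep three years of history
REPLICATION = 3            # HDFS default
NODE_RAW_TB = 12 * 4       # assume 12 x 4 TB drives per worker
USABLE_FRAC = 0.75         # leave headroom for shuffle and temp space
MIN_NODES = 3              # assumed floor for a production cluster

def nodes_needed(daily_raw_gb):
    """Workers needed to hold the retained, replicated data."""
    needed_tb = daily_raw_gb * RETENTION_DAYS * REPLICATION / 1024
    return max(MIN_NODES, math.ceil(needed_tb / (NODE_RAW_TB * USABLE_FRAC)))

# 10, 100, and 1000 devices (~0.1, 1, and 10 GB raw per day, assumed):
for daily_gb in (0.1, 1.0, 10.0):
    print(daily_gb, "GB/day ->", nodes_needed(daily_gb), "nodes")
```

Under these assumptions, storage never pushes the cluster past the minimum size even at 1000 devices, which supports the scenario's claim that scaling this workload is mainly a matter of adding nodes when needed.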
HDFS itself is based on a master/slave architecture with two main components: the NameNode / Secondary
NameNode and the DataNodes. These are critical components that need a lot of memory to store file metadata
such as attributes, block locations, directory structure, and names, and to process data.
The memory needed by the NameNode to manage the HDFS cluster metadata in memory and the memory needed for
the OS must be added together. Typically, the Secondary NameNode should be given the same amount of memory as the
NameNode.
Memory amount = HDFS cluster management memory + NameNode memory + OS memory
The DataNode memory amount is also easy to determine, but this time it depends on the number of physical CPU
cores installed in each DataNode.
Memory amount = memory per CPU core * number of CPU cores + DataNode process memory +
DataNode TaskTracker memory + OS memory
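The two formulas above can be sketched directly in code. The per-component gigabyte figures passed in below are assumed placeholders for illustration only:

```python
def namenode_memory_gb(cluster_mgmt_gb, namenode_proc_gb, os_gb):
    """NameNode (and Secondary NameNode) sizing:
    HDFS cluster management memory + NameNode memory + OS memory."""
    return cluster_mgmt_gb + namenode_proc_gb + os_gb

def datanode_memory_gb(gb_per_core, cores, datanode_gb, tasktracker_gb, os_gb):
    """DataNode sizing:
    memory per core * cores + DataNode + TaskTracker + OS memory."""
    return gb_per_core * cores + datanode_gb + tasktracker_gb + os_gb

# Assumed figures, purely for illustration:
print(namenode_memory_gb(2, 4, 4))        # GB for the NameNode host
print(datanode_memory_gb(4, 8, 2, 2, 4))  # GB for an 8-core DataNode
```

With these placeholder figures the 8-core DataNode lands at 40 GB, comfortably inside the 48-96 GB per-server range quoted earlier once some headroom is added.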
Methods of adding additional data streams
The following frameworks are commonly used for streaming data:
Apache Storm
Apache Samza
Apache Spark
Apache Flink
The best fit for this situation will depend heavily on the state of the data to process, how time-bound your
requirements are, and what kind of results you are interested in.
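Whichever framework is chosen, adding a new stream amounts to attaching a source that emits CSV rows and micro-batching them for the bulk loader. A framework-agnostic sketch (all names and the sample feed are illustrative):

```python
import csv
import io

def batch_rows(stream, batch_size=500):
    """Group incoming CSV rows into batches for bulk loading (e.g. into HDFS)."""
    batch = []
    for row in csv.reader(stream):
        batch.append(row)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# Hypothetical feed: three device readings arriving as CSV lines.
feed = io.StringIO("dev1,1.2\ndev2,3.4\ndev3,5.6\n")
batches = list(batch_rows(feed, batch_size=2))
print(len(batches))  # 2 batches: one of 2 rows, one of 1 row
```

Storm, Samza, Spark, and Flink each wrap this same source-then-batch pattern in their own source/sink abstractions, so an additional device feed plugs in as just another source.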
What is the setup of dashboards to use for analytics?
Using Hive to extract data regularly from HDFS and storing the output in a "more real-time" SQL RDBMS
is a realistic approach for analytics. Dashboards can then be built from analytical reports on that data,
e.g. counts of different types of objects and trends over time.
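The daily aggregate feeding such a dashboard is a plain GROUP BY; the HiveQL extraction query would look much the same. Here SQLite stands in for the RDBMS target, and the schema and sample rows are assumed for illustration:

```python
import sqlite3

# In-memory stand-in for the dashboard's RDBMS.
# Assumed schema: readings(device TEXT, day TEXT, value REAL)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (device TEXT, day TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [("dev1", "2015-06-01", 1.0),
     ("dev1", "2015-06-01", 3.0),
     ("dev2", "2015-06-02", 2.0)],
)

# Daily aggregate per device: row counts and average reading.
rows = conn.execute(
    "SELECT device, day, COUNT(*), AVG(value) "
    "FROM readings GROUP BY device, day ORDER BY device, day"
).fetchall()
print(rows)  # [('dev1', '2015-06-01', 2, 2.0), ('dev2', '2015-06-02', 1, 2.0)]
```

The monthly rollup is the same query grouped on a month expression instead of the day column, and the dashboard simply charts the resulting counts and averages.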
How easy is it to scale the system?
Hadoop solves the hard scaling problems caused by large amounts of complex data.
As the amount of data in a cluster grows, new servers can be added incrementally and inexpensively to store and
analyze it.
Scaling up a Hadoop architecture is achieved by adhering to the following:
Ensure there is free space in the data center near the Hadoop cluster.
Ensure the network can cope with more servers.
Add more memory to the master servers.
Sources:
http://hadoop.intel.com
http://blog.cloudera.com/wp-content/uploads/2011/03/ten_common_hadoopable_problems_final.pdf
https://www.packtpub.com/books/content/sizing-and-configuring-your-hadoop-cluster
https://www.digitalocean.com/community/tutorials/hadoop-storm-samza-spark-and-flink-big-data-frameworks-compared
