1. Situation: Assume there are 10 devices. Each device writes to a .csv log file 10 times a second. With
this setup, approximately 1 million rows are added per day. This data must be bulk loaded in real time
to a Hadoop cluster. The data must also be aggregated daily and monthly for analytics. For each of the
big data platforms: What are the minimum hardware requirements, recommended cluster size, and
approximate cost? What is the setup of dashboards to use for analytics? Are there any methods of
adding additional data streams? If the number of devices is increased to 100 or 1000, how easy is it to
scale the system?
The Hadoop platform also includes the Hadoop Distributed File System (HDFS), which is designed for scalability and
fault tolerance. HDFS stores large files by dividing them into blocks (usually 64 or 128 MB) and replicating the
blocks on three or more servers. HDFS clusters in production use today reliably hold petabytes of data on thousands
of nodes.
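As a rough illustration of how block size and replication translate into raw storage, the short Python sketch below estimates how many HDFS blocks and how much raw capacity one day of logs would occupy; the average row size, block size, and replication factor are assumed values for illustration, not figures from the scenario.

```python
# Rough HDFS capacity estimate for one day of logs.
# All input values are illustrative assumptions.
import math

rows_per_day = 1_000_000          # from the scenario
bytes_per_row = 200               # assumed average CSV row size
block_size = 128 * 1024**2        # assumed 128 MB HDFS block
replication = 3                   # default HDFS replication factor

daily_bytes = rows_per_day * bytes_per_row
blocks = math.ceil(daily_bytes / block_size)
raw_bytes = daily_bytes * replication

print(f"Daily data: {daily_bytes / 1024**2:.1f} MB")
print(f"HDFS blocks (128 MB each): {blocks}")
print(f"Raw storage with replication: {raw_bytes / 1024**2:.1f} MB")
```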
What are the minimum hardware requirements?
Computing, memory, storage, network, and software are the primary drivers that establish the hardware requirements.
The main features required of each component when deciding on hardware are presented below:-
Computing:-
Integrated I/O, which reduces I/O latency and increases I/O bandwidth, is particularly beneficial when
processing large data sets. Advanced Encryption Standard New Instructions (AES-NI) accelerate common
cryptographic functions, reducing the performance penalty typically associated with encrypting and decrypting
big-data files.
Memory:-
Hadoop typically requires 48 GB to 96 GB of RAM per server, and 64 GB is optimal in most cases. Memory errors
are one of the most common causes of server failure and data corruption, so error-correcting code (ECC) memory is
highly recommended.
Storage:-
Each server in a Hadoop cluster requires a relatively large number of storage drives to avoid I/O bottlenecks. A
single solid-state drive (SSD) per core can deliver higher I/O throughput, reduced latency, and better overall cluster
performance. Higher-RPM hard drives provide a good balance between cost and performance.
Network:-
A fast network not only allows data to be imported and exported quickly, but can also improve performance for the
shuffle phase of MapReduce applications. A faster (multi-gigabit) Ethernet network provides a simple, cost-effective solution.
Recommended cluster size and approximate cost
Throwing more nodes at the problem is the rule of thumb for Hadoop.
Cost and performance both scale roughly linearly with the number of nodes: the more nodes in the cluster, the higher
the aggregate processing speed, at a proportionally higher cost.
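As a back-of-the-envelope sizing sketch, the Python snippet below estimates how many DataNodes would be needed to hold a given retention window of log data; the row size, retention period, per-node usable capacity, and overhead factor are assumptions chosen only for illustration, not vendor recommendations.

```python
# Back-of-the-envelope cluster sizing sketch.
# All inputs below are illustrative assumptions.
import math

rows_per_day = 1_000_000              # from the scenario
bytes_per_row = 200                   # assumed average CSV row size
replication = 3                       # HDFS replication factor
retention_days = 365                  # assumed retention window
usable_tb_per_node = 8                # assumed usable disk per DataNode (TB)
overhead = 1.3                        # assumed headroom for temp/shuffle data

raw_tb = (rows_per_day * bytes_per_row * replication
          * retention_days * overhead) / 1024**4
data_nodes = max(3, math.ceil(raw_tb / usable_tb_per_node))

print(f"Raw storage needed: {raw_tb:.2f} TB")
print(f"Estimated DataNodes: {data_nodes}")
```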
HDFS itself is based on a master/slave architecture with two main components: the NameNode (plus a Secondary
NameNode) and the DataNodes. These are critical components and need a large amount of memory to store the file
system's metadata, such as file attributes, block locations, directory structure, and file names, and to process data.
The memory needed by the NameNode to manage the HDFS cluster metadata in memory and the memory needed by
the OS must be added together. Typically, the memory needed by the Secondary NameNode should be identical to
that of the NameNode.
NameNode memory amount = HDFS cluster management memory + NameNode process memory + OS memory
The DataNode memory amount is also easy to determine, but this time it depends on the number of physical CPU
cores installed on each DataNode.
DataNode memory amount = memory per CPU core * number of CPU cores + DataNode process memory +
DataNode TaskTracker memory + OS memory
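To make the two formulas concrete, here is a small worked example in Python; every input value (metadata footprint, per-core memory, process and OS allowances, core count) is an assumption for illustration, not a Hadoop requirement.

```python
# Worked example of the NameNode / DataNode memory formulas above.
# All figures are illustrative assumptions, not sizing recommendations.

# NameNode: cluster-metadata memory + NameNode process memory + OS memory
hdfs_metadata_gb = 2          # assumed metadata footprint for a small cluster
namenode_process_gb = 2       # assumed NameNode JVM heap
os_gb = 4                     # assumed OS reservation
namenode_total_gb = hdfs_metadata_gb + namenode_process_gb + os_gb

# DataNode: per-core memory * cores + DataNode process + TaskTracker + OS
memory_per_core_gb = 4        # assumed memory budget per CPU core
cpu_cores = 12                # assumed cores per DataNode
datanode_process_gb = 2
tasktracker_gb = 2
datanode_total_gb = (memory_per_core_gb * cpu_cores
                     + datanode_process_gb + tasktracker_gb + os_gb)

print(f"NameNode memory: {namenode_total_gb} GB")   # 8 GB
print(f"DataNode memory: {datanode_total_gb} GB")   # 56 GB
```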
Methods of adding additional data streams
The following frameworks can be used to add and process additional data streams:-
Apache Storm
Apache Samza
Apache Spark
Apache Flink
The best fit for the situation will depend heavily upon the state of the data to be processed, how time-bound the
requirements are, and what kind of results are of interest.
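As one example, a minimal Spark Structured Streaming job could watch the directory where the devices drop their .csv files and continuously append the rows to HDFS for later aggregation; the schema, paths, and trigger interval below are assumptions for illustration only.

```python
# Minimal Spark Structured Streaming sketch: ingest device CSV logs into HDFS.
# Paths, schema, and trigger interval are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("device-log-ingest").getOrCreate()

schema = StructType([
    StructField("device_id", StringType()),
    StructField("event_time", TimestampType()),
    StructField("value", DoubleType()),
])

# Read new CSV files as they appear in the landing directory.
logs = (spark.readStream
        .schema(schema)
        .csv("hdfs:///landing/device_logs/"))

# Append the raw rows to a Parquet dataset on HDFS for later aggregation.
query = (logs.writeStream
         .format("parquet")
         .option("path", "hdfs:///warehouse/device_logs/")
         .option("checkpointLocation", "hdfs:///checkpoints/device_logs/")
         .trigger(processingTime="1 minute")
         .start())

query.awaitTermination()
```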
What is the setup of dashboards to use for analytics?
Using Hive to extract and aggregate data regularly from HDFS and store the output in a "more real-time" SQL
RDBMS is a realistic setup for analytics. Dashboards can then be created over that database with analytical reports
on its contents, e.g. counts of different types of objects and trends over time.
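As a hedged sketch of that pipeline, the snippet below runs a daily aggregation in Hive (via the PyHive client) and copies the result into a small SQLite table that a dashboard could query; the hostname, table names, and column layout are assumptions, not part of the original setup.

```python
# Sketch: pull a daily aggregate from Hive into an RDBMS table for dashboards.
# Hostname, table names, and columns are illustrative assumptions.
import sqlite3
from pyhive import hive  # assumes a HiveServer2 endpoint is available

hive_conn = hive.Connection(host="hive-server.example.com", port=10000)
cursor = hive_conn.cursor()

# Daily aggregation over the raw device logs stored in HDFS/Hive.
cursor.execute("""
    SELECT device_id, to_date(event_time) AS day,
           COUNT(*) AS events, AVG(value) AS avg_value
    FROM device_logs
    GROUP BY device_id, to_date(event_time)
""")
rows = cursor.fetchall()

# Store the aggregate in a small SQL database that the dashboard reads.
db = sqlite3.connect("analytics.db")
db.execute("""CREATE TABLE IF NOT EXISTS daily_stats
              (device_id TEXT, day TEXT, events INTEGER, avg_value REAL)""")
db.executemany("INSERT INTO daily_stats VALUES (?, ?, ?, ?)", rows)
db.commit()
```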
How easy is it to scale the system?
Hadoop solves the hard scaling problems caused by large amounts of complex data.
As the amount of data in a cluster grows, new servers can be added incrementally and inexpensively to store and
analyze it.
Scaling out the Hadoop architecture is achieved by adhering to the following:-
Ensure there is potential free space in the data center near the Hadoop cluster.
A network that can cope with more servers
More memory in the master servers
Source:-
http://hadoop.intel.com
http://blog.cloudera.com/wp-content/uploads/2011/03/ten_common_hadoopable_problems_final.pdf
https://www.packtpub.com/books/content/sizing-and-configuring-your-hadoop-cluster
https://www.digitalocean.com/community/tutorials/hadoop-storm-samza-spark-and-flink-big-data-frameworks-compared