This paper surveys the tools and technologies involved in a classic Big Data setting. Readers, especially enterprise architects, should find it helpful when choosing among Big Data database technologies in a Hadoop architecture.
Key Aspects of Big Data Storage and its Architecture
Rahul Chaturvedi
Rchatur2@hawk.iit.edu
Illinois Institute of Technology
Abstract
This paper should help the reader gain insight into Big Data from various perspectives, such as the need for Big Data, storage requirements, and compute. The content starts from understanding what Big Data is and proceeds to a smooth landing from the perspective of a network engineer. The idea is to understand what it takes to achieve this new age of computing at a completely different level. People develop compute-greedy applications, and this appetite is not satisfied by in-memory processing alone; there are different aspects at every nook and corner. By the end of the paper, the reader should find this a useful source for the basic questions that arise while architecting a Big Data network solution.
Key Aspects of Big Data Storage and its Architecture
What is big data?
Data is something truly raw; information is born from data. When this information is structured so that it can be put to use in real life, we call it knowledge. Raw data is collected from disparate sources such as video, documents, social media, sensors, and transmitters. Data is generated at every moment, and it usually forms patterns. We need to analyze this data and connect the dots to find our answers.
Rise of Big data
Big Data was pioneered by web companies that indexed the web and let people search for things on the internet. A very basic example: Google's PageRank algorithm ranks web pages by the link structure among them and returns the most relevant search results to users. As part of this, companies had to manage vast amounts of data and develop ways and algorithms to extract knowledge from it. In the present-day world, the devices around us generate data all the time, and we need to store it for analysis. The characteristic of this data is that it is distributed, unstructured, possibly unrelated, and quickly growing. Anyone who plans to store and analyze it is bound to address the issues below, which pertain to the "cloud" as well:
What are the data access methods?
Which file system should your storage use to manage the files of Big data?
How to keep the data secure?
What are the ways to achieve maximum throughput and bandwidth?
Needless to say, this list is very basic. Let us look at the best techniques that can be used in the different phases of the data's journey from source to target.
1. Data Access Methods
REST and WebDAV are two well-known ways of extracting data from the web. Assessing how the data will be extracted is important with respect to the design of the application and, in turn, the design of the storage architecture. There are various interfaces for accessing information over the internet, but REST, which stands for Representational State Transfer, is the most recent and most widely chosen one. It provides a kind of two-way functionality: data can be fetched from the remote server, or REST can be used as an interface to perform several functions on the server side on behalf of the client.
REST relies on a stateless, client-server, cacheable communications protocol, and in virtually all cases the HTTP protocol is used. The idea is that instead of using a complex communication protocol such as CORBA or RPC to connect two or more machines hosting a networked application, we use HTTP, which is easy to use and reliable. The fact that the World Wide Web itself uses HTTP boosts confidence in its reliability and ease of use. HTTP requests are generated between machines to read, delete, and post data; thus HTTP helps implement the CRUD operations (create, read, update, and delete). REST is not a standard, which means there are no guidelines from a consortium such as the W3C. Another advantage of working with REST is that you can roll your own frameworks and libraries in languages such as Java, Perl, or C#. It can be used easily even when the network has firewalls. Its best feature is that the service is platform independent, unlike WebDAV, which is not supported on all kinds of platforms. The drawbacks are minimal; one is that the service raises security concerns among stakeholders because it is open and networked, but breaches of security can be mitigated by using secure sockets and HTTPS (secure HTTP).
Username/password tokens may also be used. Another area of concern is that the cookies forwarded in a request contain sensitive information, since all the information is passed as parameters in the request. For example, if someone had to find the name of the student whose ID was 10 and whose location was Chicago, the request would be posted as below:
“www.studenttgudieatiit.com/student/Studentdetails?id=10&location=chicago”
The above is a read/GET request. GET requests should always be used to read data, while POST is used for tasks that change state on the server.
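As a sketch of how such a parameterized GET request can be assembled, the following Python snippet builds the example URL above using only the standard library. The host and path come from the illustrative example in the text, not from a real service:

```python
from urllib.parse import urlencode, urlunsplit

# Build the query string and full URL for the example request above.
# Any REST endpoint with query parameters is assembled the same way.
def student_details_url(student_id, location):
    query = urlencode({"id": student_id, "location": location})
    return urlunsplit(("http", "www.studenttgudieatiit.com",
                       "/student/Studentdetails", query, ""))

print(student_details_url(10, "chicago"))
# → http://www.studenttgudieatiit.com/student/Studentdetails?id=10&location=chicago
```

Building the query string with `urlencode` also takes care of escaping characters that are not safe to place in a URL.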
The other strategy for fetching data is WebDAV: Web-based Distributed Authoring and Versioning. It is a way to collaborate on and edit files over the network on disparate web servers. The purpose of WebDAV is to make reading from and writing to a server easy; this kind of file sharing is analogous to file management on the local computer. With respect to Big Data, some of its features are useful: the user gets to maintain the properties of a file, such as author, modification date, namespaces, collections, overwrite protection, and override permissions.
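To make WebDAV's extra HTTP verbs concrete, here is a minimal Python sketch that prepares a PROPFIND request, the WebDAV method for reading a resource's properties such as its author or modification date. The server URL is a placeholder, and the request is only constructed, never sent:

```python
import urllib.request

# PROPFIND is a WebDAV extension method; the Depth header limits the
# request to the resource itself (depth 0) rather than a whole
# collection. The URL below is purely illustrative.
req = urllib.request.Request(
    "http://example.com/dav/report.txt",
    headers={"Depth": "0"},
    method="PROPFIND",
)

print(req.get_method())          # → PROPFIND
print(req.get_header("Depth"))   # → 0
```

In a real exchange the server would answer with an XML body listing the requested properties.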
2. Data storage types and applications to manage them – The database storage technique is chosen according to the kind of data the business deals with. This section covers the various database storage techniques and the applications for each that are present in the market.
2.1 Wide Column Store/Column Families -
This might be considered a clone of Google's Bigtable. Bigtable was built as a proprietary technology by Google on top of its other technologies, such as MapReduce, GFS (Google File System), Chubby file locking, and SSTable.
The design of Bigtable is such that the data is stored in a single associative array, which helps scale storage horizontally. It maps a row value to its corresponding key, attaches a timestamp, and stores the result in the associative array. It is designed to scale into the petabyte range across hundreds or thousands of servers, with the ability to easily add servers to scale resources (compute and storage). Consequently, the focus is on performance and scalability at the expense of other aspects, such as hardware cost.
This kind of column-oriented database is faster than a row-oriented database for typical analytical queries. The idea is that the front-end user usually fires a query whose result set contains many rows but only a subset of the columns, while a row-oriented database is scanned row by row; fetching a subset of columns row-wise therefore takes a long time. A column-oriented database is thus beneficial for huge data sets or tables with a huge number of fields.
From Wikipedia, a column oriented database may be defined as –
"Column-oriented databases are typically more efficient when an aggregate
function needs to be computed over many rows but only for a subset of all of the
columns of data because reading that smaller subset of data can be faster than
reading all data"
It is also efficient when certain data needs to be updated across all rows, because the new values for a column can be written in one pass over that column's storage rather than touching every row.
There are several examples of column-oriented databases or data stores:
BigTable
Aster Data
SAP HANA
Cassandra
HyperTable
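The row-versus-column trade-off described above can be sketched in a few lines of Python. This is purely illustrative and not modeled on any particular product: the same table is held row-wise and column-wise, and an aggregate over one column only has to touch that column's array in the columnar layout.

```python
# The same small table in a row-oriented layout...
rows = [
    {"id": 1, "city": "Chicago", "score": 10},
    {"id": 2, "city": "Boston",  "score": 20},
    {"id": 3, "city": "Chicago", "score": 30},
]

# ...and in a column-oriented layout: one contiguous list per column.
columns = {
    "id":    [r["id"] for r in rows],
    "city":  [r["city"] for r in rows],
    "score": [r["score"] for r in rows],
}

# Aggregating a single column reads only columns["score"], never
# the id or city data; a row store would scan every whole row.
total_score = sum(columns["score"])
print(total_score)  # → 60
```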
2.2 Document Store –
The name of this type does not mean that the storage contains Word documents, PDFs, or similar files. It is similar to the key-value store covered in the next section, except that the database here typically understands the kind of data it holds. For example, data might be stored in an XML format in which each tag of the document describes the data within it, much like metadata.
Data can be read in large reads, emphasizing its streaming nature. On the other hand, if you access small amounts of data, the storage sees access patterns that are rather random and IOPS-driven.
Examples of Document Stores for Big Data include:
MongoDB
CouchDB
TerraStore
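As an illustrative sketch (not modeled on any particular product), the store below keeps schemaless JSON documents and can query on fields inside them, which is what distinguishes a document store from a plain key-value store. All class and field names here are invented:

```python
import json

class DocumentStore:
    def __init__(self):
        self._docs = {}

    def insert(self, doc_id, document):
        # Documents are kept as JSON text, as in many real stores.
        self._docs[doc_id] = json.dumps(document)

    def find(self, field, value):
        # Unlike a plain key-value store, we can match on fields
        # *inside* the documents, because the store understands them.
        return [json.loads(d) for d in self._docs.values()
                if json.loads(d).get(field) == value]

store = DocumentStore()
store.insert("s1", {"name": "Alice", "location": "Chicago"})
store.insert("s2", {"name": "Bob", "location": "Boston"})
print(store.find("location", "Chicago"))
# → [{'name': 'Alice', 'location': 'Chicago'}]
```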
2.3 Key-Value / Tuple storage
This is analogous to a hash table from a functionality perspective. The idea is to avoid duplicate keys in the database so as to achieve higher performance. These stores, also known as tuple stores, are very popular in the Big Data world. Binary objects such as BLOBs can also be stored against a key. One drawback of storing values against keys is that the store treats the values as opaque, so the user must write code defining the structure of the data being stored; otherwise the returned data cannot be interpreted.
Examples of key-value databases that are used in Big Data are:
BigTable
LevelDB
Couchbase Server
BerkeleyDB
Voldemort
Cassandra
MemcacheDB
Amazon DynamoDB
Dynomite
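The opaque-value drawback mentioned above can be sketched in a few lines of Python. The store itself only moves bytes around; the application supplies the serialization and deserialization code. The class and key names are illustrative:

```python
import json

class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value_bytes):
        self._data[key] = value_bytes   # the store never inspects this

    def get(self, key):
        return self._data[key]

kv = KeyValueStore()
# The *application* decides the encoding (here, JSON encoded as UTF-8).
kv.put("student:10", json.dumps({"name": "Alice"}).encode("utf-8"))

# Reading back requires the same application-side knowledge; without
# it, the returned bytes cannot be interpreted.
record = json.loads(kv.get("student:10").decode("utf-8"))
print(record["name"])  # → Alice
```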
2.4 Graph Databases
These databases emphasize building the relationships between data items first and then storing them. Data is mapped using the objects of a graph: nodes, edges, and properties. Each element contains pointers to other elements, so there is no real lookup; you start from a node and follow an edge to reach another node. In graph databases, the nodes contain the information you want to keep track of, and the properties are pertinent information that relates to a node. Edges are the lines that connect nodes to nodes or nodes to properties, and they represent the relationship between the two. Data access patterns can be determined by examining these connections between the nodes.
Graph databases are becoming popular in Big Data because the edges can be used to gain more information (insight) into the data.
Examples of graph databases are:
InfoGrid
HyperGraphDB (uses BerkeleyDB for data storage)
AllegroGraph
BigData
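The follow-the-pointer access pattern described above can be sketched as follows; the node and relationship names are invented purely for illustration:

```python
# Nodes hold the data you track; edges hold labeled relationships.
nodes = {
    "alice": {"kind": "person"},
    "bob":   {"kind": "person"},
    "iit":   {"kind": "school"},
}
edges = {
    "alice": [("knows", "bob"), ("studies_at", "iit")],
    "bob":   [("studies_at", "iit")],
}

def neighbors(node, relation):
    # No index lookup: start at a node and walk its outgoing edges,
    # keeping only those with the requested relationship label.
    return [dst for rel, dst in edges.get(node, []) if rel == relation]

print(neighbors("alice", "knows"))       # → ['bob']
print(neighbors("alice", "studies_at"))  # → ['iit']
```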
2.5 Multi-model databases
This class of database supports multiple techniques at once; for example, a single database could be both a document store and a graph database. Examples of these databases are:
AllegroGraph
BigData
Virtuoso
2.6 Object Databases
This kind of database applies object-oriented concepts to storage: it is a combination of a database with objects as they are used in object-oriented programming. Because the database implements this capability alongside an object-oriented language, it tends to be tightly coupled to the language used to access it.
Examples of object databases for Big Data are the following:
Versant
db4o
3. Storing the data: HDF5 and Hadoop as a file system
3.1 HDF5 -
HDF5 is basically a technology suite with various applications to manage, store, manipulate, and analyze data. The way we arrange the data is important, because the access methods must be written accordingly; how the data is stored and managed also depends on the applications that access and store it.
HDF5 is a comprehensive data model, library, and file format for storing and managing data. Its greatest advantage is that it supports virtually any kind of data type. It is designed for flexible and efficient storage of high volumes of complex data, with portability and scalability as its key features. The HDF5 technology suite includes the following key features:
A robust data model that represents very complex data objects and a wide variety of metadata.
A file format that supports files of any size and kind.
A software library that runs on various computational platforms, from laptops to massively parallel systems.
A high-level API with C, C++, Fortran 90, and Java interfaces for working with the data.
A set of features that allow for access-time optimization and storage-space efficiency.
HDF5 is open source, which means the whole application set can be used and tweaked as the requirements demand. The application set covers the API, library, data model, and file format, and all of these tools are open to the world.
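The heart of the HDF5 data model is a hierarchy of groups (which act like directories) containing datasets (which act like files holding arrays). The pure-Python sketch below mimics that model only to illustrate it; real programs would use an HDF5 library binding, and every name here is invented:

```python
class Group:
    """Illustrative stand-in for an HDF5 group (a directory-like node)."""
    def __init__(self, name):
        self.name = name
        self.members = {}

    def create_group(self, name):
        g = Group(name)
        self.members[name] = g
        return g

    def create_dataset(self, name, data):
        # A dataset holds an array of values under the group.
        self.members[name] = list(data)
        return self.members[name]

root = Group("/")
exp = root.create_group("experiment1")
exp.create_dataset("temperature", [21.5, 22.0, 21.8])

# Navigate the hierarchy much as one would an HDF5 path
# such as /experiment1/temperature.
print(root.members["experiment1"].members["temperature"][0])  # → 21.5
```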
3.2 HyperScale
Hyperscale storage differs from traditional enterprise storage. Storage volume can be extended up to petabytes, whereas conventional storage holds up to terabytes. This kind of storage typically serves more concurrent users and fewer applications, whereas traditional storage served more applications and fewer users.
The purpose of hyperscale is to maximize raw storage and minimize cost, which is done in part by removing redundant content from the data. This kind of storage is software defined, minimizing human involvement by covering all the required tasks in software. EMC's ECS appliance is one product that provides hyperscale storage.
3.3 Object-based storage
This kind of storage is used when the number of files is huge, forming a deep directory structure that must be traversed to reach the desired file; because of this, latency and disk seek time increase drastically. Object-based storage gets around this by giving each file a unique ID and indexing the set of IDs so a file can be accessed quickly. With such a mechanism the storage can expand to a very large number of files. Products like EMC's Atmos and Centera provide object-based storage solutions.
3.4 Hadoop –
A number of the tools mentioned in the second section have connectors for using Hadoop as a storage platform. Hadoop is not a file system; it is a software framework that supports data-intensive distributed applications, such as the ones discussed here and in the previous parts.
Hadoop is used along with MapReduce, and this has proved to be a very effective solution for data-intensive applications. Its storage layer is known as the Hadoop Distributed File System (HDFS), an open-source file system derived from Google's GFS (Google File System), which is proprietary to Google. Hadoop is written in Java, and HDFS is actually a meta file system: it sits on top of another file system and knows everything about it. This design makes the system fault tolerant, allowing copies to be stored at various locations within the file system, which helps recover from a corrupt copy of the data. These copies can also be used to achieve parallelism.
3.4.1 Physical architecture of Hadoop:
The fundamental building block of Hadoop is known as a data node: a combination of a server - with some storage - and networking. The storage can be embedded in the server or be a storage device directly attached to it. Data nodes transfer data to the destination over Ethernet, but Hadoop defines its own transfer protocol on top of it, which moves data in blocks rather than packets, yielding maximum throughput for the large reads of a data-intensive application. The data nodes are spread across multiple racks, and each data node is identified by a unique ID on its rack. A metadata server known as the NameNode holds the data about the data nodes and can be considered the management node for HDFS. There is also a secondary NameNode, which is not a failover mechanism but is instead used for other file-system tasks, such as taking snapshots of the database and directory information. Regardless, because there is only one NameNode, it can become a potential bottleneck and a single point of failure for HDFS.
PCIe SSD technology is the latest development in solid-state drives. Traditional hard drives have many mechanical parts, which increases seek time. One technique to mitigate slow SATA drives is to combine the storage, HBA, and SATA controller onto the same drive. PCIe is the fastest option because of the number of channels present on the bus for data transfer: Fusion-io's PCIe SSD device has 25 internal channels, whereas a SATA-based SSD such as Intel's uses a 10-channel controller. A PCIe SSD can deliver speeds up to 2.1 Gbps, making it easier to cache and buffer. There are a few drawbacks to this bus, which arise from interoperability issues: unlike an established storage interface such as SATA, there are no standard storage commands for controllers with a PCIe bus at their interface. Tasks such as booting from the media and running operating systems get tough because there is no standard set of commands or disk controls. Such devices can also be power hungry.
3.5 Connecting to Hadoop for storage –
Based on the various applications that each kind of database supports, there are various ways to connect with Hadoop and access the data. It is simple and easy to access and retrieve the data using the Java API, the Thrift API, the CLI (command-line interface), or by browsing through the HDFS web UI over HTTP. It is not possible to use Hadoop by directly mounting HDFS on an OS; the usual workaround is a Linux FUSE client that allows you to establish this connection.
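One of the HTTP paths mentioned above is the WebHDFS REST interface, where each file operation is expressed as a URL (for example, op=OPEN reads a file). The sketch below only assembles such a URL with the standard library; the NameNode host, port, file path, and user name are all placeholders:

```python
from urllib.parse import urlencode, urlunsplit

# Build a WebHDFS read URL of the form
#   http://<namenode>:<port>/webhdfs/v1/<path>?op=OPEN&user.name=<user>
def webhdfs_open_url(namenode, port, hdfs_path, user):
    query = urlencode({"op": "OPEN", "user.name": user})
    return urlunsplit(("http", f"{namenode}:{port}",
                       "/webhdfs/v1" + hdfs_path, query, ""))

print(webhdfs_open_url("namenode.example.com", 50070,
                       "/data/logs/part-0000", "hadoop"))
# → http://namenode.example.com:50070/webhdfs/v1/data/logs/part-0000?op=OPEN&user.name=hadoop
```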
It was stated earlier that Hadoop was derived from the Google File System, which was designed specifically to support the column-oriented Bigtable. Hence, Hadoop most naturally supports column-oriented databases, but many of the other tools have developed interfaces for storing data in Hadoop. Below is a list of the applications that integrate with Hadoop –
Wide Column Store/Column Families
Aster Data
SAP HANA
HBase
Cassandra
HyperTable
Key-Value/Tuple Store
Couchbase Server
Voldemort
Cassandra
MemcacheDB
Amazon DynamoDB
Document Store
MongoDB
CouchDB
Graph Database
InfoGrid
Applications that do not integrate directly with Hadoop use Hive and Pig, interfaces that have been developed by the Hadoop community for working with Hadoop. Hive is a data-warehousing package. Hadoop is essentially a NoSQL store, so Hive was developed to send queries to it: it can summarize the data and run ad hoc queries against it. The queries are written in an SQL-like language known as HiveQL, which allows you to perform basic searches as well as plug in your own mapping or reduction code. The second tool is Pig. It is not just an interface to query the data but also adds analysis capabilities. Pig was designed for parallelism, like Hadoop itself. Scripts are written in a high-level language known as Pig Latin, which is compiled into a series of MapReduce programs that run on Hadoop.
4. Data Analysis Tools to access the data in the storage
There are a variety of tools available in the market today. They can be broadly classified as follows:
Custom-coded applications
Analytics-oriented tools (e.g., R)
Database-oriented tools based on the concepts of NoSQL
Let us get into each of these in detail. The first, as the name suggests, is an application custom coded to the requirements: custom code to get data into and out of the storage, along with custom storage, custom APIs, and custom applications. These applications are frequently written in Java.
The second set consists of analytical applications that help analyze an array of data formats using heavy mathematical or statistical techniques. Statistics help convey a great deal pictorially and support critical business decisions; it is critical to extract information from the data and put that information - knowledge - to use. One of the most common tools in this regard is R, a programming language based on the language S. Its open-source implementation is widely used for statistical analysis. It is a powerful tool and can be connected to visualization tools such as Tableau to present the analysis in pictorial forms such as scatter plots, bar graphs, and pie charts, giving a quick view of the scenario. Work is under way on a parallel version of R, which should offer massively parallel capabilities, bring compute times down dramatically, and be connectable to Hadoop. Data in this application is accessed using methods written in R.
There are other analytical languages as well, such as MATLAB and SciPy, but R gives the best performance for statistical analysis of data.
The third class is the NoSQL class. It is the largest class of applications, with many tool options depending on the data and the relationships among its parts. As the name suggests, NoSQL systems do not use the standard SQL commands to query the database. Properties of some of the NoSQL databases are given below:
They do not use SQL commands to query the database.
They do not comply with ACID (atomicity, consistency, isolation, and durability).
They have a fault-tolerant, distributed architecture.
5. Achieving parallelism in Hadoop-
5.1 Multiple Data Nodes - Typically, the data is copied over three data nodes: two of them on the same rack and the third on a separate rack. This helps recover from mishaps in which the whole rack goes down, and it can also be used for parallelism. When the data is requested, only one of the data nodes responds with the requested data: it fetches the data and serves it to the user. This is known as moving the job to the data rather than moving the data to the job. It is a faster implementation, similar in spirit to infrastructure as a service (IaaS). This kind of architecture also reduces the load on the network: once the job to fetch the data is triggered, there is no striping across the nodes or use of multiple servers for data access; the access is provided locally by the data node. Parallelism is achieved at the application level, where each instance of the application is split across various nodes.
Moreover, increased performance can be achieved because there are three copies of the data. In the background, however, the data nodes communicate with each other using RPC to perform a number of tasks: balancing capacity, obeying replication rules, comparing files and copies to maintain redundancy, and checking the number of data copies and making additional copies if necessary.
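The placement rule described above (two replicas on one rack, the third on another) can be sketched as a small function. The rack map and node names are invented for illustration:

```python
# Pick three replica locations: two nodes from the first rack,
# one node from a different rack, following the rule in the text.
def place_replicas(racks, first_rack):
    """racks: dict mapping rack name -> list of data-node names."""
    local = racks[first_rack][:2]               # two replicas, same rack
    other_rack = next(r for r in racks if r != first_rack)
    return local + [racks[other_rack][0]]       # third replica elsewhere

racks = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn4", "dn5"],
}
print(place_replicas(racks, "rack1"))  # → ['dn1', 'dn2', 'dn4']
```

Losing all of rack1 still leaves the copy on dn4, which is the whole point of spreading replicas across racks.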
5.2 The MapReduce framework used by the requesting application –
From the perspective of the application that is trying to reach the database, the request is broken into parts and sent to the database; the data nodes serve the data, which is rejoined and provided to the user. This implies a large number of runs over different large data sets.
MapReduce coordinates and manages all these runs. It is a framework for massively parallel computations that use potentially large data sets and a large number of nodes. As the name suggests, MapReduce performs two tasks: "Map" takes the input, breaks it into smaller sub-problems, and distributes them to worker nodes, which process them and send the results back; "Reduce" takes the results from the worker nodes and combines them to create the output. MapReduce works best for large data sets, because for small jobs dividing the tasks among the worker nodes does not pay off and can take longer than computing directly. Using Hadoop as the file system provides fault tolerance: ordinarily, if a data node fails in the cluster, we need to stop the job, check the file system and database, and restart. MapReduce instead provides fault tolerance when a job fails by rescheduling the job while the data is still available. This means that with MapReduce the system can recover from the failure of a data node and complete the job.
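The Map and Reduce phases described above can be illustrated with the classic word-count example. This is a minimal single-process sketch: in a real Hadoop job these two functions would run on many worker nodes, with a shuffle phase between them.

```python
from collections import defaultdict

def map_phase(chunk):
    # Map: break an input chunk into (key, value) pairs.
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    # Reduce: combine all values that share a key into the output.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

chunks = ["big data big storage", "big data"]   # two input splits
pairs = [kv for chunk in chunks for kv in map_phase(chunk)]  # shuffle omitted
print(reduce_phase(pairs))  # → {'big': 3, 'data': 2, 'storage': 1}
```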
However, if MapReduce cannot be used, applications interact using some other API. Using MapReduce should still be the priority, because it couples the multiple data copies with parallelism and produces a very scaled-out, distributed solution. Another advantage of using Hadoop is that when a new data node is added, Hadoop and MapReduce quickly recognize it and take advantage of it.
5.3 Parallelism in applications –
When an application sends a request, it is mapped by HDFS to the appropriate storage node, which in turn reaches out to a particular data node on the server. The data node can be a RAID device or anything else. Because hardware efficiency is involved in achieving a higher degree of parallelism, the interface between the server and the RAID can become a bottleneck. To avoid this dependency on the attached nodes, the user can add more hardware to act as data nodes. The alternative is to move the parallelism into the application itself, which is exactly what HPC (high-performance computing) does: the need is fulfilled by parallelizing the requests at the file system and at the API accessing the file system. Both serial and parallel applications can be parallelized here. In addition to a parallel API, we can use parallel file systems such as GPFS, Lustre, and Panasas.
6. Problems with MapReduce –
In today's world people also use Big Data for serial applications on local storage; an example is a music site that always streams serial data to the user. Such applications use MapReduce, but at some point this parallelism hits a limit, and alternate ways to perform parallel tasks are needed. This can be achieved by writing the application code to interact with Hadoop using one of the parallel programming platforms provided by vendors in the market today.
a. IBM X10
This is an example of a programming language provided by IBM to achieve parallelism. It was developed by IBM as part of the DARPA program on Productive, Easy-to-use, Reliable Computing Systems (PERCS).
X10 achieves its underlying parallelism through a Partitioned Global Address Space (PGAS). It uses a global name space, but the space can be logically partitioned, with a portion of the space local to each processor; each partition has its own thread, which can run locally. Big Data continues to grow, and so does the importance of using languages such as X10. It can interoperate with Java, which makes it easy to write applications that interact with NoSQL and Hadoop.
b. SAS using MPI (Message Passing Interface)
7. Designing a Hadoop architecture –
Earlier, this paper covered the various kinds of databases, the applications used to store and manage them, and the file systems used by such systems. This gives us enough information to pick the right fit for an application and join the pieces to architect a data-warehousing solution. Anyone considering architecting a solution using Hadoop should keep the following in mind:
The data comes in huge amounts, and enough disks are needed at each data node to satisfy the I/O demands. The questions to address are the size and number of IOPS and how the data is to be laid out: is the application database read-intensive or write-intensive?
The degree of parallelism to be achieved can be estimated from the kind of application. Data nodes can then be added to the array, and the application can be split into smaller parts across nodes to achieve the desired throughput. For example, if the number of users reading and writing the data is high, then multiple copies of the data need to be added across data nodes.
If there are not enough data nodes, the load-balancing tasks performed by Hadoop - or, to be more precise, by the NameNode - take extra network bandwidth, as the data nodes communicate with each other using RPC to transfer data to another data node. In the process, the NameNode also tells the data nodes to delete unneeded data or update copies. Remember, it is because the number of data nodes is low that this load is transferred onto the system; it can be mitigated either by adding nodes or by completing more tasks in parallel at the application level.
Hadoop does not use striping. Therefore, each data node should be the size of the largest file expected: each file would reside on a single node, although multiple files can reside on a single data-node server.
The rest concerns hardware: use the best devices, a PCIe bus for transfers from storage to processor, Ethernet from the processor to the user, fallback mechanisms using RAID, SSDs, and the best management software to manage the nodes. Virtualization may be needed if Hadoop is to be used at its best efficacy, as it saves the extra capacity that idle threads would consume.
Bibliography:
Layton, Jeffrey (2014). "Top of the Stack." Enterprise Storage Forum. http://www.enterprisestorageforum.com/storage-management.html
The HDF Group. Hierarchical Data Format, version 5, 1997-2014. http://www.hdfgroup.org/HDF5/
Wikipedia. http://www.wikipedia.org/