Running head: Key Aspects of Big Data Storage and its Architecture
Key Aspects of Big Data Storage and its Architecture
Rahul Chaturvedi
Rchatur2@hawk.iit.edu
Illinois Institute of Technology
Abstract
This paper should help the reader gain insight into Big Data from several perspectives, such as the need for Big Data, its storage requirements, and its compute requirements. The content starts from an understanding of what big data is and lands smoothly at the perspective of a network engineer. The idea is to understand what it takes to achieve this new age of computing at a completely different level. People develop compute-greedy applications, and this craving is not satisfied by in-memory processing alone; there are different aspects at every nook and corner. By the end of the paper, the reader should find this a useful source of answers to basic questions that arise while architecting a Big Data network solution.
Key Aspects of Big Data Storage and its Architecture
What is big data?
Data is raw. Information is born from data, and when that information is structured so that it can be put to use in real life, we call it knowledge. Raw data is collected from disparate sources such as video, documents, social media, sensors, transmitters, and so on. Data is being generated at every moment, and it usually forms a pattern. We need to analyze this data and connect the dots to find our answers.
Rise of Big data
Big data was pioneered by web companies that indexed the web and let people search for things on the internet. A very basic example: Google's PageRank algorithm ranks web pages by the link structure of the web and returns the most relevant results to users. To do this, such companies had to manage vast amounts of data and develop algorithms to extract knowledge from it. In the present-day world, the devices around us generate data all the time, and we need to store it for analysis. This data is distributed, unstructured, possibly unrelated, and quickly growing. Anyone who plans to store and analyze it is bound to address the following issues, which pertain to the “cloud” as well:
 What are the data access methods?
 Which file system should your storage use to manage the files of Big data?
 How to keep the data secure?
 What are the ways to achieve maximum throughput and bandwidth?
Needless to say, this list is very basic. Let us look at the techniques best suited to each phase of the data's journey from source to target.
1. Data Access Methods
ReST and WebDAV are two well-known ways of extracting data from the web. Assessing how the data will be extracted matters for the design of the application and, in turn, the design of the storage architecture. There are various interfaces for accessing information over the internet, but REST (Representational State Transfer) is the most widely chosen. It provides a kind of two-way functionality: data can be fetched from the remote server, or REST can be used as an interface to perform functions on the server side on behalf of the client.
ReST relies on a stateless, client-server, cacheable communications protocol, and in virtually all cases that protocol is HTTP. The idea is that instead of using a complex communication protocol such as CORBA or RPC to connect two or more machines hosting a networked application, we use HTTP, which is easy to use and reliable. The fact that the World Wide Web itself runs on HTTP boosts confidence in its reliability and ease of use. HTTP requests are exchanged between machines to read, post, and delete data; HTTP thus implements the CRUD operations (Create, Read, Update, Delete). ReST is not a standard, which means there are no guidelines from a consortium such as the W3C. Another advantage of working with ReST is that you can roll out your own frameworks and libraries in languages such as Java, Perl, or C#. It works easily across networks that have firewalls. Its best feature is that the service is platform independent, unlike WebDAV, which is not supported on all platforms. Drawbacks are minimal; one is that such an open, networked service raises security concerns among stakeholders. These can be mitigated by using secure sockets and HTTPS (secure HTTP).
Username/password tokens may also be used. Another area of concern is that the cookies forwarded with a request may contain sensitive information. All the information is passed as parameters in the request; for example, to find the name of the student whose ID is 10 and whose location is Chicago, the request would be posted as below:
“www.studenttgudieatiit.com/student/Studentdetails?id=10&location=chicago”
The above is a read/GET request. GET requests should always be used to read data, while POST is used for other tasks.
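As a minimal sketch of issuing such a GET request from code, the Java example below uses the JDK's built-in java.net.http.HttpClient (Java 11 and later); the host and query parameters are the hypothetical ones from the URL above, not a real service.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class StudentLookup {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // A GET request carries the lookup keys as query parameters;
        // the host below is the hypothetical one from the example URL.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://www.studenttgudieatiit.com/student/Studentdetails?id=10&location=chicago"))
                .GET()
                .build();

        // The server's representation of the student comes back in the body.
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}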
Another strategy for fetching data is WebDAV (Web-based Distributed Authoring and Versioning). It is a way to collaborate on and edit files over the network on disparate web servers. The purpose of WebDAV is to make reading from and writing to a server easy; this kind of file sharing is analogous to file management on a local computer. With respect to big data, some of its features are useful: the user can maintain file properties such as author, modification date, namespaces, collections, overwrite protection, and override permissions.
2. Data storage types and applications to manage them – Depending on the kind of data the business handles, an appropriate database storage technique is chosen. This section covers the various database storage techniques and the products in the market that implement them.
2.1 Wide Column Store/Column Families -
This category might be considered a clone of Google's “Big Table”. Big Table was built as a proprietary technology by Google on top of its other technologies, such as Map-Reduce, GFS (Google File System), Chubby file locking, and SSTable.
Big Table is designed so that the data is stored in a single associative array, which helps scale storage horizontally. It maps a (row key, column key, timestamp) triple to the corresponding value in the associative array. It is designed to scale into the petabyte range across hundreds or thousands of servers, with the ability to add servers easily to scale resources (compute and storage). Consequently, the focus is on performance and scalability at the expense of other aspects, such as hardware cost.
This kind of column-oriented database is faster than a row-oriented database. The reasoning is that a front-end query usually returns many rows but only a subset of the columns, while a row-oriented database is scanned row by row; fetching a small subset of columns row-wise therefore reads every full row and takes a long time. It is thus beneficial to have a column-oriented database for huge data sets or tables with a huge number of fields.
From Wikipedia, a column oriented database may be defined as –
"Column-oriented databases are typically more efficient when an aggregate
function needs to be computed over many rows but only for a subset of all of the
columns of data because reading that smaller subset of data can be faster than
reading all data"
Column storage is also efficient when certain data needs to be updated across all rows: because a column is stored contiguously, the change amounts to a single pass over that one column rather than an operation touching every row.
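To make the layout difference concrete, here is a small illustrative Java sketch (all class and field names are hypothetical) that computes an average over one field in a row-oriented and in a column-oriented layout; the column version reads only the single array it needs.

import java.util.List;

public class LayoutDemo {
    // Row-oriented: each record stores all of its fields together.
    record StudentRow(int id, String name, String city, double gpa) {}

    // Column-oriented: each field is stored contiguously in its own array.
    static class StudentColumns {
        int[] ids;
        String[] names;
        String[] cities;
        double[] gpas;
    }

    // Row scan: every full record is read just to reach one field.
    static double avgGpaRows(List<StudentRow> rows) {
        double sum = 0;
        for (StudentRow r : rows) sum += r.gpa();
        return sum / rows.size();
    }

    // Column scan: only the gpa array is touched; other columns are never read.
    static double avgGpaColumns(StudentColumns cols) {
        double sum = 0;
        for (double g : cols.gpas) sum += g;
        return sum / cols.gpas.length;
    }

    public static void main(String[] args) {
        List<StudentRow> rows = List.of(
                new StudentRow(10, "Alice", "Chicago", 3.6),
                new StudentRow(11, "Bob", "Boston", 3.2));
        StudentColumns cols = new StudentColumns();
        cols.gpas = new double[] {3.6, 3.2};
        System.out.println(avgGpaRows(rows) + " " + avgGpaColumns(cols));
    }
}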
There are several examples of column-oriented databases or data stores:
 BigTable
 Aster Data
 SAP HANA
 Cassandra
 HyperTable
2.2 Document Store –
The name of this type does not mean that the store contains Word documents, PDFs, or anything similar. It is similar to the key-value store described next, except that the database here typically understands the format of the data it holds. For example, data might be stored in XML, where each tag of the document describes the data within it, much like metadata.
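As a small hypothetical illustration, the student record from the earlier REST example might be stored as the following self-describing XML document, where the tags act as metadata (the name value is invented for the example):

<student id="10">
  <name>John Doe</name>
  <location>Chicago</location>
</student>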
Data can be read in large sequential reads, emphasizing its streaming nature. If, on the other hand, the application touches only small amounts of data at a time, the storage sees access patterns that are rather random and IOPS-driven.
Examples of Document Stores for Big Data include:
 MongoDB
 CouchDB
 TerraStore
2.3 Key-Value / Tuple storage
This is analogous to a hash table from a functionality perspective. The idea is to avoid duplicate keys in the database so as to achieve higher performance. These are also known as tuple stores and are very popular in the big data world. Binary objects such as BLOBs can also be stored against a key. One drawback of storing values against keys is that the user needs to write code defining the structure of the stored data; otherwise, the returned data cannot be interpreted.
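The illustrative Java sketch below (class and key names are hypothetical) mimics this behavior with an in-memory map: the store treats every value as opaque bytes, and it is the caller's code that defines how to interpret them.

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class TupleStoreDemo {
    // The store itself knows nothing about the values: keys map to raw bytes.
    private final Map<String, byte[]> store = new HashMap<>();

    void put(String key, byte[] value) { store.put(key, value); }
    byte[] get(String key) { return store.get(key); }

    public static void main(String[] args) {
        TupleStoreDemo db = new TupleStoreDemo();

        // The application, not the store, decides this value is UTF-8 text.
        db.put("student:10:location", "Chicago".getBytes(StandardCharsets.UTF_8));

        // Without that knowledge, the returned bytes could not be interpreted.
        String location = new String(db.get("student:10:location"),
                StandardCharsets.UTF_8);
        System.out.println(location);
    }
}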
Examples of key-value databases that are used in Big Data are:
 BigTable
 LevelDB
 Couchbase Server
 BerkeleyDB
 Voldemort
 Cassandra
 MemcacheDB
 Amazon DynamoDB
 Dynomite
2.4 Graph Databases
These databases emphasize building the relationships between data items first and then storing them. Data is mapped using the objects of a graph: nodes, edges, and properties. Each element contains pointers to other elements, so there is no real lookup; you start from a node and follow an edge to reach another node. In graph databases, the nodes contain the information you want to keep track of, and the properties are pertinent information that relates to the node. Edges are lines that connect nodes to nodes or nodes to properties, and they represent the relationship between the two. Data access patterns can be determined by examining these connections between the nodes.
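A minimal illustrative Java sketch of this model (all names are hypothetical) keeps properties on each node and direct pointers to neighboring nodes, so traversal simply follows edges rather than performing lookups.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GraphDemo {
    static class Node {
        // Properties: pertinent information that relates to the node.
        final Map<String, String> properties = new HashMap<>();
        // Edges: direct pointers to neighbors, so no index lookup is needed.
        final List<Node> neighbors = new ArrayList<>();
    }

    public static void main(String[] args) {
        Node alice = new Node();
        alice.properties.put("name", "Alice");
        Node bob = new Node();
        bob.properties.put("name", "Bob");

        // An edge representing the relationship between the two nodes.
        alice.neighbors.add(bob);

        // Traversal: start from a node and go along an edge to another node.
        for (Node friend : alice.neighbors) {
            System.out.println(friend.properties.get("name"));
        }
    }
}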
Graph databases are becoming popular in Big Data because the edges allow more information (insight) to be gained from the data.
Examples of graph databases are:
 InfoGrid
 HyperGraphDB (uses BerkeleyDB for data storage)
 AllegroGraph
 BigData
2.5 Multi model databases
This class of database supports multiple storage techniques at once; for example, a database could be both a document store and a graph database. Examples of these databases are:
 AllegroGraph
 BigData
 Virtuoso
2.6 Object Databases
This kind of database applies object-oriented concepts to storage: it combines database capabilities with the objects used in object-oriented programming. The database therefore appears tightly coupled to the language used to access it.
Examples of Object databases for Big Data are the following:
 Versant
 db4o
3. Storing the data: HDF5 and Hadoop as a file system
3.1 HDF5 -
HDF5 is essentially a technology suite with various applications to manage, store, manipulate, and analyze data. How we arrange the data is important, since the methods to access it must be written accordingly; how the data is stored and managed also depends on the applications that access and store it.
HDF5 is a comprehensive data model, library, and file format for storing and managing data. Its greatest advantage is that it supports virtually any kind of data type. It is designed for flexible and efficient storage of high volumes of complex data, with portability and scalability as its key features. The key features of the HDF5 technology suite include:
 A robust data model that can represent very complex data objects and a wide variety of metadata.
 A file format that supports files of any size and kind.
 A software library that runs on a range of computational platforms, from laptops to massively parallel systems.
 A high-level API with C, C++, Fortran 90, and Java interfaces for communicating with the big data.
 A set of features that allow for access-time optimization and storage-space efficiency.
HDF5 is open source, which means the whole application set (API, library, data model, and file format) can be used and tweaked as the requirements demand; all of these tools are open to the world.
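HDF5 organizes a file hierarchically: groups act like directories, and datasets hold typed multidimensional arrays with attached attributes. The Java sketch below models that structure with plain collections purely as a conceptual illustration; it is not the actual HDF5 library API.

import java.util.HashMap;
import java.util.Map;

public class Hdf5ModelSketch {
    // A dataset: a typed multidimensional array plus descriptive attributes.
    static class Dataset {
        double[][] values;
        final Map<String, String> attributes = new HashMap<>();
    }

    // A group acts like a directory: it holds datasets and further groups.
    static class Group {
        final Map<String, Group> groups = new HashMap<>();
        final Map<String, Dataset> datasets = new HashMap<>();
    }

    public static void main(String[] args) {
        Group root = new Group();                    // "/" in an HDF5 file
        Group sensors = new Group();
        root.groups.put("sensors", sensors);         // "/sensors"

        Dataset temperature = new Dataset();
        temperature.values = new double[][] {{21.4, 21.9}, {22.1, 22.3}};
        temperature.attributes.put("units", "celsius");
        sensors.datasets.put("temperature", temperature); // "/sensors/temperature"

        System.out.println(
                sensors.datasets.get("temperature").attributes.get("units"));
    }
}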
3.2 HyperScale
Hyperscale storage differs from traditional enterprise storage. Its volume can be extended up to petabytes, whereas conventional storage holds up to terabytes. Hyperscale storage typically serves more concurrent users and fewer applications, whereas traditional storage serves more applications and fewer users.
The purpose of hyperscale is to achieve maximum raw storage while minimizing cost, which is done by removing redundant content from the data. This kind of storage is software-defined, minimizing human involvement by handling all the required tasks in software. EMC's ECS appliance provides hyperscale storage.
3.3 Object based storage
This kind of storage is used when the number of files is huge and a deep directory structure must be traversed to reach the desired file, which drastically increases latency and disk seek time. Object-based storage gets around this by giving each file a unique ID and indexing those IDs so that any file can be accessed quickly. With this mechanism the storage can expand to a very large number of files. EMC products such as Atmos and Centera provide object-based storage solutions.
3.4 Hadoop –
A number of the tools mentioned in the second section have connectors for using hadoop as a storage platform. Hadoop is not itself a file system; it is a software framework that supports data-intensive distributed applications, such as the ones discussed here and in the previous parts.
Hadoop can be used along with MapReduce, and this has proved to be a very effective solution for data-intensive applications. Its storage layer is the Hadoop Distributed File System (HDFS), an open source file system derived from Google's proprietary GFS (Google File System). HDFS is written in Java and is actually a meta file system: it sits on top of another file system and knows everything about it. This design makes the system fault tolerant, allowing copies of data to be stored at various locations within the file system, which helps recover from a corrupt copy. These copies can also be used to achieve parallelism.
3.4.1 Physical architecture of Hadoop:
The fundamental building block of hadoop is the data node: a combination of a server, some storage, and networking. The storage can be embedded in the server or directly attached to it. Data nodes transfer data over Ethernet to its destination, and on top of the transport Hadoop defines its own transfer protocol, which moves data in blocks rather than packets, yielding maximum throughput for the large reads of a data-intensive application. The data nodes are spread across multiple racks, and each data node is identified by a unique ID on its rack. A metadata server known as the Namenode holds the data about the data nodes; it can be considered the management node for HDFS. There is also a secondary Namenode, which is not a failover mechanism but is used for other file system tasks, such as taking snapshots of the database and directory information. Because there is only one Namenode, it can become a bottleneck and a single point of failure for HDFS.
PCIe SSD technology is the latest development in solid-state drives. Traditional hard drives have many mechanical parts, which increases seek time. One technique for mitigating slow SATA drives is to combine the storage, the HBA, and the SATA controller on the same drive. PCIe is fastest because of the number of channels present on the bus for data transfer: Fusion-io's PCIe SSD device has 25 internal channels, whereas a SATA-based SSD such as Intel's has a 10-channel controller. A PCIe SSD can deliver speeds of up to 2.1 Gbps, making it easier to cache and buffer. The bus has a few drawbacks, which arise from interoperability issues: unlike established storage interfaces such as SATA, there are no standard storage commands for controllers sitting on a PCIe bus, so tasks such as booting from the media and running operating systems become difficult. The supporting applications are also power hungry and slow.
3.5 Connecting to Hadoop for storage –
Based on the applications that each kind of database supports, there are various ways to connect with hadoop and access data. It is simple to access and retrieve data using the Java API, the Thrift API, the CLI (command-line interface), or by browsing the HDFS-UI webpage over HTTP. It is not possible to use hadoop by directly mounting HDFS on an OS; the only way to do so is through a Linux FUSE client that establishes the connection.
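As a brief sketch of the Java API route, the snippet below opens a file on HDFS with Hadoop's FileSystem class and streams it to standard output; the Namenode URI and file path are illustrative placeholders.

import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // The Namenode resolves the path to the data nodes holding the blocks;
        // host, port, and path here are placeholders.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        try (InputStream in = fs.open(new Path("/data/events.log"))) {
            // Block data is streamed directly from the data nodes.
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}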
It was stated earlier that hadoop derives from the Google File System, which was designed specifically to support the column-oriented Big Table. Hence, Hadoop naturally favors column-oriented databases, but many of the other tools have developed interfaces to store data using Hadoop. Below is a list of applications that integrate with Hadoop –
Wide Column Store/Column Families
 AsterData
 SAP HANA
 HBase
 Cassandra
 HyperTable
Key-Value/Tuple Store
 Couchbase Server
 Voldemort
 Cassandra
 MemcacheDB
 Amazon DynamoDB
Document Store
 MongoDB
 CouchDB
Graph Database
 InfoGrid
Applications that do not integrate directly with Hadoop use Hive and Pig, interfaces developed by the Hadoop community to work with Hadoop. Hive is a data warehousing package. Since Hadoop is essentially a NoSQL database, Hive was developed to send queries to it; it can summarize the data and run ad hoc queries against it. Queries are written in a SQL-like language known as HiveQL, which allows you to perform basic searches as well as plug in custom mapping or reduction code. The second tool is Pig. It is not just an interface to query the data; it also adds analysis capabilities. Pig was designed for parallelism, like Hadoop itself. It is written in a high-level language known as Pig Latin, which is compiled to produce a series of MapReduce programs that run on Hadoop.
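As a minimal sketch, a Java application can submit HiveQL through Hive's standard JDBC driver (HiveServer2); the connection URL, table, and columns below are illustrative placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // Load Hive's JDBC driver (older drivers require this explicit step).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 speaks JDBC; host, port, and credentials are placeholders.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "user", "");
             Statement stmt = conn.createStatement();
             // An ad hoc HiveQL aggregate; Hive compiles it to MapReduce jobs.
             ResultSet rs = stmt.executeQuery(
                     "SELECT location, COUNT(*) FROM students GROUP BY location")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}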
4. Data Analysis Tools to access the data in the storage
There are a variety of tools available in the market today. They can be broadly classified as
follows:
 Custom code applications
 Analytics oriented (ex :- R)
 Database-oriented, based on the concepts of NoSQL
Let us get into each of these in detail. The first, as the name suggests, is an application custom-coded to the requirements: custom code to get data into and out of the storage, custom storage, custom APIs, and custom applications. Such applications are frequently written in Java.
The second set comprises analytical applications that help analyze an array of data formats. These tools rely on heavy mathematical or statistical analysis. Statistics help us understand a great deal pictorially and support critical business decisions; it is essential to extract information from the data and put that information, which is knowledge, to use. One of the most common tools in this regard is “R”, a programming language based on the language “S”. There is an open source implementation of this language, which people use for statistical analysis. It is a powerful tool and can be connected to visualization tools such as Tableau to represent the analysis in pictorial forms such as scatter plots, bar graphs, and pie charts, giving a quick view of the scenario. In the future, higher versions of the tool are expected to offer massively parallel capabilities; work is under way on a parallel version of R, which should bring compute times down to near milliseconds and should be connectable to hadoop. Data in this application is accessed using methods written in R.
There are other analytical environments as well, such as MATLAB and SciPy, but R gives the best performance for statistical analysis of data.
The third class is NoSQL, the largest class of applications, with many tool options to suit the data and the relationships among its parts. As the name suggests, NoSQL does not use standard SQL commands to query the database. Properties of some NoSQL databases are given below:
 They do not use SQL commands to query the database.
 They do not comply with the ACID properties (atomicity, consistency, isolation, and durability).
 They have a fault-tolerant and distributed architecture.
5. Achieving parallelism in Hadoop-
5.1 Multiple Data Nodes - Typically, the data is copied onto three datanodes: two on the same rack and the third on a separate rack. This helps recover from mishaps in which a whole rack goes down, and it can also be used for parallelism. When data is requested, only one of the data nodes responds with the requested data; that node fetches the data and serves it to the user. This is known as moving the job to the data rather than moving the data to the job. It is a faster implementation, similar in spirit to infrastructure as a service (IaaS), and it also reduces the load on the network. Once the job to fetch the data is triggered, there is no striping across nodes or use of multiple servers for data access; the access is provided locally by the data node. Parallelism is achieved at the application level, where each instance of the application is split across various nodes.
Moreover, increased performance can be achieved because there are three copies of the data. In the background, however, data nodes communicate with each other using RPC to perform a number of tasks: balancing capacity, obeying replication rules, comparing files and copies to maintain redundancy, and checking the number of data copies and making additional copies if necessary.
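As a brief illustration, this replication factor is exposed to Java clients through Hadoop's configuration. The property name dfs.replication is real; the value shown is simply the HDFS default described above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ReplicationConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // dfs.replication controls how many data nodes hold a copy of each
        // block; 3 is the HDFS default (two on one rack, one on another).
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);
        System.out.println("Default replication: " + fs.getDefaultReplication());
    }
}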
5.2 MapReduce Framework used by the requesting application –
From the perspective of an application trying to reach the database, the request is broken into parts and sent to the database; the data nodes serve the data, which is re-joined and provided to the user. This implies a large number of runs over different large data sets.
MapReduce exists to coordinate and manage all these runs. It is a framework for massively parallel computations over potentially large data sets and a large number of nodes. As the name suggests, MapReduce performs two tasks: “Map”, which takes the input, breaks it into smaller sub-problems, and distributes them to worker nodes, which process them and send the results back; and “Reduce”, which takes the results from the worker nodes and combines them to create the output. MapReduce works best for large data sets, because for small tasks dividing the work among worker nodes does not make sense and takes longer than computing directly. Using Hadoop as the file system provides fault tolerance: where a data node failure would usually mean stopping the job, checking the file system and database, and restarting, MapReduce instead provides fault tolerance by rescheduling a failed job while the data is still available. This means that with MapReduce the system can recover from the failure of a datanode and complete the job.
However, if MapReduce cannot be used, applications interact through some other API. Using MapReduce should nevertheless be the priority, as it couples the multiple data copies with parallelism and produces a highly scaled-out, distributed solution. Another advantage of Hadoop is that when a new data node is added, Hadoop and MapReduce quickly recognize it and take advantage of it.
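The canonical illustration of this framework is word count. The sketch below follows the well-known Hadoop example: the mapper breaks input lines into per-word sub-problems, and the reducer combines the workers' partial counts into the final output; input and output paths are supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: break the input into smaller sub-problems (one count per word).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: combine the workers' partial results into the output.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}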
5.3 Parallelism in applications –
When an application sends a request, the file system (HDFS) maps it to the appropriate storage node, which in turn reaches a particular data node on the server. The data node may be backed by a RAID device or other storage. Because a higher degree of parallelism leans on hardware efficiency, the interface between the server and the RAID can become a bottleneck. To avoid this dependency on the attached storage, the user can add more hardware to act as data nodes. The alternative is to move the parallelism into the application itself, which is exactly what HPC (high-performance computing) does: the need is fulfilled by parallelizing requests at the file system and at the API accessing the file system. Both serial and parallel applications can be parallelized here. In addition to a parallel API, we can use parallel file systems such as GPFS, Lustre, and Panasas.
6. Problem with MapReduce –
In today's world, people use big data for serial applications on local storage; an example is a music site that always streams serial data to the user. Such applications use MapReduce, but at some point this parallelism hits a limit and alternate ways to perform parallel tasks are needed. This can be achieved by writing application code that interacts with Hadoop through one of the parallel programming platforms provided by vendors in the market today.
a. IBM X10
This is an example of a programming language provided by IBM to achieve parallelism. It was developed by IBM as part of the DARPA project on Productive, Easy-to-use, Reliable Computing System (PERCS).
X10 achieves its underlying parallelism through a Partitioned Global Address Space (PGAS). It uses a global name space, but the space can be logically partitioned, with a portion of the space local to each processor; each partition has its own threads, which run locally. Big Data continues to grow, and so does the importance of languages such as X10. It can interoperate with Java, which makes it easy to write applications that interact with NoSQL and Hadoop.
b. SAS using MPI (Message Passing Interface)
7. Designing a Hadoop architecture –
Earlier, this paper covered the various kinds of databases, the applications used to store and manage them, and the file systems used for such systems. That gives us enough information to pick the right fit for an application and join the pieces to architect a data warehousing solution. Anyone considering architecting a solution using Hadoop should keep the following in mind:
 The data volume is huge, and each data node needs enough disks to store it and satisfy the I/O demands. The questions to address are the size and number of IOPS required and how the data is to be placed: is the application database read-intensive or write-intensive?
 Depending on the kind of application, the achievable degree of parallelism can be estimated. Data nodes can then be added to the array, and the application split into smaller parts across nodes, to reach the desired throughput. For example, if the number of users reading and writing the data is high, multiple copies of the data need to be spread across data nodes.
 If there are not enough data nodes, the load-balancing tasks are performed by Hadoop, or by the namenode to be more precise. This consumes extra network bandwidth, since the datanodes communicate with each other using RPC to transfer data to other datanodes, and in the process the name node also instructs data nodes to delete unneeded data or update copies. Remember, it is because the number of data nodes is too small that this load gets transferred onto the system. It can be mitigated either by adding nodes or by completing more tasks in parallel at the application level.
 Hadoop does not use striping. Therefore, each data node should be sized for the largest file expected; each file resides on a single node, although multiple files can reside on a single data node server.
 The rest concerns hardware good practice: use the best devices, a PCIe bus for transfers from storage to processor, Ethernet from the processor to the user, fallback mechanisms using RAID, SSDs, and the best management software to manage the nodes. Virtualization may be needed if hadoop is to be used to its best efficacy; it saves the extra capacity that idle threads would otherwise consume.
Bibliography:
 Layton, Jeffrey (2014). "Top of the Stack." Enterprise Storage Forum. http://www.enterprisestorageforum.com/storage-management.html
 The HDF Group (1997-2014). Hierarchical Data Format, version 5. http://www.hdfgroup.org/HDF5/
 Wikipedia. http://www.wikipedia.org/
Mais conteúdo relacionado

Mais procurados

A Survey on Big Data Analytics
A Survey on Big Data AnalyticsA Survey on Big Data Analytics
A Survey on Big Data AnalyticsBHARATH KUMAR
 
Chapter 8(designing of documnt databases)no sql for mere mortals
Chapter 8(designing of documnt databases)no sql for mere mortalsChapter 8(designing of documnt databases)no sql for mere mortals
Chapter 8(designing of documnt databases)no sql for mere mortalsnehabsairam
 
Redis Cashe is an open-source distributed in-memory data store.
Redis Cashe is an open-source distributed in-memory data store.Redis Cashe is an open-source distributed in-memory data store.
Redis Cashe is an open-source distributed in-memory data store.Artan Ajredini
 
Chapter 6(introduction to documnet databse) no sql for mere mortals
Chapter 6(introduction to documnet databse) no sql for mere mortalsChapter 6(introduction to documnet databse) no sql for mere mortals
Chapter 6(introduction to documnet databse) no sql for mere mortalsnehabsairam
 
Big Data Processing with Hadoop : A Review
Big Data Processing with Hadoop : A ReviewBig Data Processing with Hadoop : A Review
Big Data Processing with Hadoop : A ReviewIRJET Journal
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introductionhktripathy
 
Using Regular Expressions in Document Management Data Capture and Indexing
Using Regular Expressions in Document Management Data Capture and IndexingUsing Regular Expressions in Document Management Data Capture and Indexing
Using Regular Expressions in Document Management Data Capture and IndexingSandy Schiele
 

Mais procurados (16)

Big data
Big dataBig data
Big data
 
No sql databases
No sql databasesNo sql databases
No sql databases
 
Video 1 big data
Video 1 big dataVideo 1 big data
Video 1 big data
 
Hadoop
HadoopHadoop
Hadoop
 
NoSQL Type, Bigdata, and Analytics
NoSQL Type, Bigdata, and AnalyticsNoSQL Type, Bigdata, and Analytics
NoSQL Type, Bigdata, and Analytics
 
A Survey on Big Data Analytics
A Survey on Big Data AnalyticsA Survey on Big Data Analytics
A Survey on Big Data Analytics
 
Chapter 8(designing of documnt databases)no sql for mere mortals
Chapter 8(designing of documnt databases)no sql for mere mortalsChapter 8(designing of documnt databases)no sql for mere mortals
Chapter 8(designing of documnt databases)no sql for mere mortals
 
Digital Types
Digital TypesDigital Types
Digital Types
 
Redis Cashe is an open-source distributed in-memory data store.
Redis Cashe is an open-source distributed in-memory data store.Redis Cashe is an open-source distributed in-memory data store.
Redis Cashe is an open-source distributed in-memory data store.
 
Chapter 6(introduction to documnet databse) no sql for mere mortals
Chapter 6(introduction to documnet databse) no sql for mere mortalsChapter 6(introduction to documnet databse) no sql for mere mortals
Chapter 6(introduction to documnet databse) no sql for mere mortals
 
Big Data Processing with Hadoop : A Review
Big Data Processing with Hadoop : A ReviewBig Data Processing with Hadoop : A Review
Big Data Processing with Hadoop : A Review
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introduction
 
Using Regular Expressions in Document Management Data Capture and Indexing
Using Regular Expressions in Document Management Data Capture and IndexingUsing Regular Expressions in Document Management Data Capture and Indexing
Using Regular Expressions in Document Management Data Capture and Indexing
 
Big data technologies with Case Study Finance and Healthcare
Big data technologies with Case Study Finance and HealthcareBig data technologies with Case Study Finance and Healthcare
Big data technologies with Case Study Finance and Healthcare
 
Metadata in Business Intelligence
Metadata in Business IntelligenceMetadata in Business Intelligence
Metadata in Business Intelligence
 
The Big Metadata
The Big MetadataThe Big Metadata
The Big Metadata
 

Destaque

Shee-Yen Tay, MD Chest1-2(2011年)
Shee-Yen Tay, MD Chest1-2(2011年)Shee-Yen Tay, MD Chest1-2(2011年)
Shee-Yen Tay, MD Chest1-2(2011年)TMUIREL
 
CONSEJOS PARA CONSEGUIR SU MENSAJE SE NOTARON EN FACEBOOK
CONSEJOS PARA CONSEGUIR SU MENSAJE SE NOTARON EN FACEBOOKCONSEJOS PARA CONSEGUIR SU MENSAJE SE NOTARON EN FACEBOOK
CONSEJOS PARA CONSEGUIR SU MENSAJE SE NOTARON EN FACEBOOKfacebookenespanol
 
Shee-Yen Tay, MD Gi1-7(2011年)
Shee-Yen Tay, MD Gi1-7(2011年)Shee-Yen Tay, MD Gi1-7(2011年)
Shee-Yen Tay, MD Gi1-7(2011年)TMUIREL
 
Karkas интеллектуальный анализ
Karkas интеллектуальный анализKarkas интеллектуальный анализ
Karkas интеллектуальный анализVladimir Burdaev
 
Unified Small Business Alliance Proposal by Startup Elite
Unified Small Business Alliance Proposal by Startup EliteUnified Small Business Alliance Proposal by Startup Elite
Unified Small Business Alliance Proposal by Startup EliteMarkus Biegel
 
Jornal opção 187
Jornal opção 187Jornal opção 187
Jornal opção 187Alair Arruda
 
Administración del capital humano
Administración del capital humanoAdministración del capital humano
Administración del capital humanoAndrés Bravo
 
Raising the stakes
Raising the stakes Raising the stakes
Raising the stakes Horizons RG
 

Destaque (9)

Shee-Yen Tay, MD Chest1-2(2011年)
Shee-Yen Tay, MD Chest1-2(2011年)Shee-Yen Tay, MD Chest1-2(2011年)
Shee-Yen Tay, MD Chest1-2(2011年)
 
CONSEJOS PARA CONSEGUIR SU MENSAJE SE NOTARON EN FACEBOOK
CONSEJOS PARA CONSEGUIR SU MENSAJE SE NOTARON EN FACEBOOKCONSEJOS PARA CONSEGUIR SU MENSAJE SE NOTARON EN FACEBOOK
CONSEJOS PARA CONSEGUIR SU MENSAJE SE NOTARON EN FACEBOOK
 
Shee-Yen Tay, MD Gi1-7(2011年)
Shee-Yen Tay, MD Gi1-7(2011年)Shee-Yen Tay, MD Gi1-7(2011年)
Shee-Yen Tay, MD Gi1-7(2011年)
 
Karkas интеллектуальный анализ
Karkas интеллектуальный анализKarkas интеллектуальный анализ
Karkas интеллектуальный анализ
 
Unified Small Business Alliance Proposal by Startup Elite
Unified Small Business Alliance Proposal by Startup EliteUnified Small Business Alliance Proposal by Startup Elite
Unified Small Business Alliance Proposal by Startup Elite
 
Jornal opção 187
Jornal opção 187Jornal opção 187
Jornal opção 187
 
Consumerisation of IT
Consumerisation of ITConsumerisation of IT
Consumerisation of IT
 
Administración del capital humano
Administración del capital humanoAdministración del capital humano
Administración del capital humano
 
Raising the stakes
Raising the stakes Raising the stakes
Raising the stakes
 

Semelhante a Key aspects of big data storage and its architecture

The Proliferation And Advances Of Computer Networks
The Proliferation And Advances Of Computer NetworksThe Proliferation And Advances Of Computer Networks
The Proliferation And Advances Of Computer NetworksJessica Deakin
 
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap IT Strategy Group
 
Lecture4 big data technology foundations
Lecture4 big data technology foundationsLecture4 big data technology foundations
Lecture4 big data technology foundationshktripathy
 
Data Ware House System in Cloud Environment
Data Ware House System in Cloud EnvironmentData Ware House System in Cloud Environment
Data Ware House System in Cloud EnvironmentIJERA Editor
 
Master Meta Data
Master Meta DataMaster Meta Data
Master Meta DataDigikrit
 
A survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbA survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbAlexander Decker
 
BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...
BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...
BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...dbpublications
 
A Review Paper on Big Data and Hadoop for Data Science
A Review Paper on Big Data and Hadoop for Data ScienceA Review Paper on Big Data and Hadoop for Data Science
A Review Paper on Big Data and Hadoop for Data Scienceijtsrd
 
Study on potential capabilities of a nodb system
Study on potential capabilities of a nodb systemStudy on potential capabilities of a nodb system
Study on potential capabilities of a nodb systemijitjournal
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataVipin Batra
 
8 Guiding Principles to Kickstart Your Healthcare Big Data Project
8 Guiding Principles to Kickstart Your Healthcare Big Data Project8 Guiding Principles to Kickstart Your Healthcare Big Data Project
8 Guiding Principles to Kickstart Your Healthcare Big Data ProjectCitiusTech
 
hadoop seminar training report
hadoop seminar  training reporthadoop seminar  training report
hadoop seminar training reportSarvesh Meena
 
Big data analytics: Technology's bleeding edge
Big data analytics: Technology's bleeding edgeBig data analytics: Technology's bleeding edge
Big data analytics: Technology's bleeding edgeBhavya Gulati
 

Semelhante a Key aspects of big data storage and its architecture (20)

The Proliferation And Advances Of Computer Networks
The Proliferation And Advances Of Computer NetworksThe Proliferation And Advances Of Computer Networks
The Proliferation And Advances Of Computer Networks
 
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
 
Big data analysis concepts and references
Big data analysis concepts and referencesBig data analysis concepts and references
Big data analysis concepts and references
 
Big data architecture
Big data architectureBig data architecture
Big data architecture
 
Lecture4 big data technology foundations
Lecture4 big data technology foundationsLecture4 big data technology foundations
Lecture4 big data technology foundations
 
Data Ware House System in Cloud Environment
Data Ware House System in Cloud EnvironmentData Ware House System in Cloud Environment
Data Ware House System in Cloud Environment
 
Master Meta Data
Master Meta DataMaster Meta Data
Master Meta Data
 
A survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbA survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo db
 
BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...
BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...
BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...
 
A Review Paper on Big Data and Hadoop for Data Science
A Review Paper on Big Data and Hadoop for Data ScienceA Review Paper on Big Data and Hadoop for Data Science
A Review Paper on Big Data and Hadoop for Data Science
 
Study on potential capabilities of a nodb system
Study on potential capabilities of a nodb systemStudy on potential capabilities of a nodb system
Study on potential capabilities of a nodb system
 
Report 1.0.docx
Report 1.0.docxReport 1.0.docx
Report 1.0.docx
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
8 Guiding Principles to Kickstart Your Healthcare Big Data Project
8 Guiding Principles to Kickstart Your Healthcare Big Data Project8 Guiding Principles to Kickstart Your Healthcare Big Data Project
8 Guiding Principles to Kickstart Your Healthcare Big Data Project
 
Big Data
Big DataBig Data
Big Data
 
Big data and oracle
Big data and oracleBig data and oracle
Big data and oracle
 
hadoop seminar training report
hadoop seminar  training reporthadoop seminar  training report
hadoop seminar training report
 
Big data analytics: Technology's bleeding edge
Big data analytics: Technology's bleeding edgeBig data analytics: Technology's bleeding edge
Big data analytics: Technology's bleeding edge
 
E018142329
E018142329E018142329
E018142329
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 

Último

Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 

Último (20)

Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 

Key aspects of big data storage and its architecture

  • 1. Running Head: Key Aspects of Big Data Storage and its Architecture Page 1 Key Aspects of Big Data Storage and its Architecture Rahul Chaturvedi Rchatur2@hawk.iit.edu Illinois Institute of Technology
  • 2. Key Aspects of Big Data Storage and its Architecture 2 Abstract This paper should help the reader gain an insight of the Big Data offering various perspectives such as need of Big Data, Storage requirements, Compute. The content of this paper starts right from understanding what big data is to a smooth landing from the perspective of network engineer. The idea is to understand what it takes to achieve that new age computing at a completely different level. People develop compute greedy applications and this crave is not only fulfilled by processing the data in-memory. There are different aspects at every nook and corner. By the end of the paper the reader should be able to find this as a useful source for basic questions while architecting a Big Data network solution.
  • 3. Key Aspects of Big Data Storage and its Architecture 3 Key Aspects of Big Data Storage and its Architecture What is big data? Data is something really raw. Information takes birth from data. When this information is structured in a way that it can be put to use in real life then we call this as knowledge. This raw data is collected from disparate sources such as video, documents, social media, sensors, transmitters etc. Data is being generated at every moment and this data usually forms a pattern. We need to analyze this data and connect the dots to find our answers. Rise of Big data Big data was initially started by web companies who indexed the web and provided the people to search for things on the internet. A very basic example - Google uses page ranking algorithm which ranks all the web pages in order of number of hits and provides the latest relevant search results to users. As part of this companies had to manage vast amounts of data and develop ways or algorithms to extract knowledge from this data. In present day world devices around us are generating data all the time and we need to store this for analysis. The characteristic of this data is that it is distributed, unstructured, possibly unrelated, and quickly growing. Someone who plans to store and analyze this is bound to address below issues which pertain to the “cloud” as well:  What are the data access methods?  Which file system should your storage use to manage the files of Big data?  How to keep the data secure?  What are the ways to achieve maximum throughput and bandwidth? Needless to mention, this list is very basic. Let us try and look at the best techniques which can be used different phases of the data from source to the target.
  • 4. Key Aspects of Big Data Storage and its Architecture 4 1. Data Access Methods ReST and WebDAV are two of the well-known ways of extracting data from the web. Assessing the ways to extract the data is important with respect to the design of application and hence forth designs of the storage architecture. There are various interfaces to access the information over the internet but REST, which stands for Representational state transfer is the latest and widely chosen one. It provides a kind of two way functionality – data can be fetched from the remote server or REST can be used as an interface to perform several functions at the server side on behalf of the client. ReST relies on a stateless client-server, cacheable communication less protocol – and in virtually all cases, the HTTP protocol is used. The idea here is that instead of using any of the complex communication protocols such as CORBA or RPC to connect two or more machines hosting a networked application we use HTTP which is easy to use and reliable. The fact that “world wide web” itself uses HTTP gives a boost in confidence when considering its reliability and easiness. HTTP request is generated between machines to read data, delete data and post data. Thus HTTP helps implement CRUD operations (Create, update, read and delete). ReST is not a standard which means that there are no guidelines by consortium such as W3C. Another advantage of working with ReST is that you can roll out your own frameworks and libraries in languages such as java, perl, C#. It can be easily used when the network is having firewalls. The best feature is that this service is platform independent unlike WebDAV which is not supported by all kind of platforms. There are minimal drawbacks and one of them is using this service raises security concerns amongst the stake holders as it is open and networked. But there are various ways to mitigate breach of security by using secure sockets and following HTTPS (secure HTTP).
Username/password tokens may also be used. Another area of concern is that the cookies forwarded with a request can contain sensitive information. All of the information is passed as parameters in the request itself. For example, if someone had to find the name of the student whose ID is 10 and whose location is Chicago, the request would be posted as below:

"www.studenttgudieatiit.com/student/Studentdetails?id=10&location=chicago"

The above is a read/GET request. GET requests should only ever be used to read data, while POST is used for tasks that change state.

The other strategy for fetching data is WebDAV – Web-based Distributed Authoring and Versioning. It is a way to collaborate on and edit files over the network on disparate web servers. The purpose of WebDAV is to make reading from and writing to a server easy; this kind of file sharing is analogous to file management on the local computer. With respect to big data, some of its features are particularly useful: the user gets to maintain the properties of a file, such as author, modification date, namespaces, collections, overwrite protection, and override permissions.

2. Data storage types and applications to manage them

Depending on the kind of data in the business, the database storage technique is chosen appropriately. This section covers the various database storage techniques and the respective applications present in the market.

2.1 Wide Column Store/Column Families

This might be considered a clone of Google's "BigTable". BigTable was built as a proprietary technology by Google on top of its other technologies such as MapReduce, GFS (Google File System), Chubby file locking, and SSTable.
The design of BigTable is such that the data is stored in a single associative array, which helps scale storage horizontally. It maps a row value to its corresponding key value, attaches a timestamp, and stores this in the associative array. It is designed to scale into the petabyte range across hundreds or thousands of servers, with the ability to easily add servers to scale resources (compute and storage). Consequently, the focus is on performance and scalability at the expense of other aspects such as hardware cost.

This kind of column-oriented database is faster than a row-oriented database for analytical queries. The idea behind a column-oriented database is that a front-end query typically returns many rows but only a subset of the columns, while a row-oriented database must be scanned row by row; fetching a subset of columns row-wise therefore takes a long time. It is thus beneficial to have a column-oriented database for huge data sets or tables with a large number of fields. From Wikipedia, a column-oriented database may be characterized as follows: "Column-oriented databases are typically more efficient when an aggregate function needs to be computed over many rows but only for a subset of all of the columns of data because reading that smaller subset of data can be faster than reading all data." A column store is also efficient when new values of one column are supplied for all rows at once, because that column can be written in a single contiguous operation.

There are several examples of column-oriented databases or data stores:
• BigTable
• AsterData
• SAP HANA
• Cassandra
• HyperTable
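To illustrate why the columnar layout wins for such queries, the sketch below (plain Java, not any product's API) stores the same small table both ways; summing one column in the column-oriented form touches only that column's array.

    // Illustrative only: a three-column table (id, name, salary) in both layouts.
    public class LayoutSketch {
        // Row-oriented: each record is stored together, so scanning one
        // column still drags every row's other fields through memory.
        static class RowStore {
            Object[][] rows = {
                {1, "alice", 100_000.0},
                {2, "bob",    90_000.0},
            };
            double sumSalary() {
                double sum = 0;
                for (Object[] row : rows) sum += (Double) row[2];
                return sum;
            }
        }

        // Column-oriented: each column is stored contiguously, so an
        // aggregate over "salary" reads only that array.
        static class ColumnStore {
            int[] id = {1, 2};
            String[] name = {"alice", "bob"};
            double[] salary = {100_000.0, 90_000.0};
            double sumSalary() {
                double sum = 0;
                for (double s : salary) sum += s;
                return sum;
            }
        }

        public static void main(String[] args) {
            System.out.println(new RowStore().sumSalary());    // 190000.0
            System.out.println(new ColumnStore().sumSalary()); // 190000.0
        }
    }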
2.2 Document Store

The name of this type doesn't mean that the storage contains Word documents, PDFs, or anything similar. It is similar to a key-value store except that the database here typically understands the format of the data it stores. For example, data might be stored in XML, where each tag of the XML document describes the data within it, much like metadata. Data can be read in large sequential reads, emphasizing the streaming nature of the data; on the other hand, if you access small amounts of data, the access pattern becomes random and IOPS-driven.

Examples of document stores for Big Data include:
• MongoDB
• CouchDB
• Terrastore

2.3 Key-Value/Tuple Store

This is analogous to a hash table from a functionality perspective. The idea here is to avoid duplicates in the database so as to achieve higher performance. These are also known as tuple stores and are very popular in the big data world. Binary objects such as BLOBs can also be stored against a key. One drawback of storing opaque values against keys is that the user needs to write code that defines the structure of the data being stored; otherwise the data returned cannot be interpreted.
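A minimal Java sketch of the key-value idea follows. It is illustrative only – a HashMap standing in for a real tuple store – and shows why the caller must supply the code that interprets the bytes stored against a key.

    import java.nio.charset.StandardCharsets;
    import java.util.HashMap;
    import java.util.Map;

    public class KeyValueSketch {
        public static void main(String[] args) {
            // The store maps an opaque key to an opaque value (a BLOB).
            Map<String, byte[]> store = new HashMap<>();

            // Writer side: the application decides how values are encoded.
            store.put("student:10", "Alice,Chicago".getBytes(StandardCharsets.UTF_8));

            // Reader side: the store returns raw bytes; without knowing the
            // encoding (here, a UTF-8 comma-separated pair) the value is unusable.
            byte[] raw = store.get("student:10");
            String[] fields = new String(raw, StandardCharsets.UTF_8).split(",");
            System.out.println("name=" + fields[0] + ", location=" + fields[1]);
        }
    }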
Examples of key-value databases that are used in Big Data are:
• BigTable
• LevelDB
• Couchbase Server
• BerkeleyDB
• Voldemort
• Cassandra
• MemcacheDB
• Amazon DynamoDB
• Dynomite

2.4 Graph Databases

These databases emphasize building the relations between data sets first and then storing them. Data is mapped using the objects of a graph – nodes, edges, and properties – and each element contains pointers to its adjacent elements, so there is no real lookup: you start from a node and follow an edge to reach another node. The nodes contain the information you want to keep track of, and the properties are pertinent information that relates to a node. Edges are the lines that connect nodes to nodes or nodes to properties, and they represent the relationship between the two. Data access patterns can be determined by examining the connections between nodes. Graph databases are becoming popular in Big Data because the edges can be used to gain more information (insight) from the data.

Examples of graph databases are:
• InfoGrid
• HyperGraphDB (uses BerkeleyDB for data storage)
• AllegroGraph
• BigData
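The pointer-chasing behaviour described above can be sketched in plain Java (this is an illustration, not a real graph-database API): each node holds direct references to its edges, so traversal follows references rather than performing index lookups.

    import java.util.ArrayList;
    import java.util.List;

    public class GraphSketch {
        static class Node {
            String label;                          // the information tracked
            List<Edge> edges = new ArrayList<>();  // direct pointers, no lookup
            Node(String label) { this.label = label; }
        }

        static class Edge {
            String relationship;                   // e.g. "FRIEND_OF"
            Node target;
            Edge(String relationship, Node target) {
                this.relationship = relationship;
                this.target = target;
            }
        }

        public static void main(String[] args) {
            Node alice = new Node("Alice");
            Node bob = new Node("Bob");
            alice.edges.add(new Edge("FRIEND_OF", bob));

            // Traversal: start at a node and walk its edges.
            for (Edge e : alice.edges) {
                System.out.println(alice.label + " -" + e.relationship + "-> " + e.target.label);
            }
        }
    }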
2.5 Multi-Model Databases

This class of database supports multiple storage models at once; for example, a single database could be both a document store and a graph database. Examples of these databases are:
• AllegroGraph
• BigData
• Virtuoso

2.6 Object Databases

This kind of database applies object-oriented concepts to storage: the objects persisted are the same objects used in object-oriented programming. Because the database implements this capability alongside the programming model, it appears tightly coupled to the language used to access it. Examples of object databases for Big Data are the following:
• Versant
• db4o
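As a hedged sketch of the object-database style, the snippet below is written against the db4o embedded API from memory (class and method names such as Db4oEmbedded.openFile and queryByExample may differ between versions); the point is that application objects are stored and queried directly, with no mapping layer between objects and tables.

    import com.db4o.Db4oEmbedded;
    import com.db4o.ObjectContainer;
    import com.db4o.ObjectSet;

    public class ObjectDbSketch {
        static class Student {
            String name;
            Student(String name) { this.name = name; }
        }

        public static void main(String[] args) {
            // Open (or create) an embedded object-database file.
            ObjectContainer db = Db4oEmbedded.openFile(
                    Db4oEmbedded.newConfiguration(), "students.db4o");
            try {
                db.store(new Student("Alice"));   // persist the object as-is

                // Query by example: a template object whose null fields
                // match any stored value for those fields.
                ObjectSet<Student> result = db.queryByExample(new Student(null));
                while (result.hasNext()) {
                    System.out.println(result.next().name);
                }
            } finally {
                db.close();
            }
        }
    }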
3. Storing the Data: HDF5 and Hadoop as a File System

3.1 HDF5

HDF5 is essentially a technology suite with various applications to manage, store, manipulate, and analyze data. The way data is arranged matters because the access methods must be written accordingly, and how data is stored and managed also depends on the applications that access it. HDF5 is a comprehensive data model, library, and file format for storing and managing data. Its greatest advantage is that it supports virtually any kind of data type, and it is designed for flexible and efficient storage of high volumes of complex data; portability and scalability are its key features.

The HDF5 technology suite includes the following key features:
• A robust data model that can represent very complex data objects and a wide variety of metadata.
• A file format with no limit on the size or kind of files it can hold.
• A software library that runs on a wide range of computational platforms, from laptops to massively parallel systems.
• A high-level API, so applications can be written against the C, C++, Fortran 90, or Java interfaces to communicate with the data.
• A set of features that allow for access-time optimization and storage-space efficiency.

HDF5 is open source, which means that the whole application set – API, library, data model, and file format – can be used and tweaked as the requirements demand.
3.2 Hyperscale

Hyperscale storage differs from traditional enterprise storage. Storage volume can be extended up to petabytes, whereas conventional storage tops out at terabytes. This kind of storage typically serves more concurrent users and fewer applications, whereas traditional storage serves more applications and fewer users. The purpose of hyperscale is to maximize raw storage and minimize cost, which is done in part by removing redundant content from the data. Such storage is software defined, aiming for minimal human involvement by covering all the required tasks in software. EMC's ECS appliance is one product that provides hyperscale storage.

3.3 Object-Based Storage

This kind of storage is used when the number of files is huge, forming a deep directory structure that must be walked to reach the desired file, which drastically increases latency and disk seek time. Object-based storage gets around this by giving each file a unique ID and indexing that set of IDs so a file can be accessed quickly. With such a mechanism the storage can expand to a very large number of files. Products like EMC's Atmos and Centera provide object-based storage solutions.

3.4 Hadoop

A number of the tools mentioned in the second section can use Hadoop as a storage platform. Hadoop itself is not a file system; it is a software framework that supports data-intensive distributed applications, such as the ones discussed here and in the previous sections.
Hadoop can be used along with MapReduce, and this has proved to be a very effective solution for data-intensive applications. The framework's storage layer is known as the Hadoop Distributed File System (HDFS), an open-source file system derived from Google's proprietary GFS (Google File System). Hadoop is written in Java, and HDFS is actually a meta file system: it sits on top of another file system and knows everything about it. This design makes the system fault tolerant, allowing copies of data to be stored at various locations within the file system, which helps recover from a corrupt copy of the data; the copies can also be used to achieve parallelism.

3.4.1 Physical Architecture of Hadoop

The fundamental building block of Hadoop is the data node: a combination of a server – with some storage – and networking. The storage can be embedded in the server or directly attached to it. Data nodes transfer data over Ethernet to the destination, but rather than relying on standard packet-by-packet semantics, Hadoop defines its own transfer protocol that moves data in large blocks rather than small packets, yielding maximum throughput for the large reads of a data-intensive application. The data nodes are spread across multiple racks, and each data node is identified by a unique ID on its rack.

There is a metadata server known as the Namenode, which holds the data about the data nodes; it can be considered the management node for HDFS. There is also a Secondary Namenode, which is not a failover mechanism but instead handles other file-system tasks, such as taking snapshots of the namespace and directory information. Regardless, because there is only one Namenode, it can become a bottleneck and a single point of failure for HDFS.
PCIe SSD technology is the latest development in solid-state drives. Traditional rotating drives have many mechanical parts, which increases seek time. Another technique to mitigate slow SATA drives is to combine the storage, the HBA, and the SATA controller onto the same device. PCIe is fastest because of the number of channels present on the bus for data transfer: Fusion-io's PCIe SSD device has 25 internal channels, whereas a SATA-based SSD such as Intel's uses a 10-channel controller. A PCIe SSD can deliver speeds of up to 2.1 Gbps, making it easier to cache and buffer. There are a few drawbacks to this bus, which arise from interoperability issues: unlike established storage interfaces such as SATA, there is no standard command set for controllers sitting on a PCIe bus, so tasks such as booting from the media and running operating systems become difficult. Without such standards, drivers and applications can also be power hungry and slow.

3.5 Connecting to Hadoop for Storage

Based on the applications that each kind of database supports, there are various ways to connect to Hadoop and access the data. It is simple to access and retrieve data using the Java API, the Thrift API, the CLI (command-line interface), or by browsing the HDFS web UI over HTTP. It is not possible to use Hadoop by directly mounting HDFS on an operating system; the workaround is a Linux FUSE client that establishes this connection. As stated earlier, Hadoop derives from the Google File System, which was designed specially to support the column-oriented BigTable; hence Hadoop most naturally supports column-oriented databases, but many of the other tools have developed interfaces to store data in Hadoop.
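As a hedged sketch of the Java API route just mentioned, the snippet below reads a file through Hadoop's standard org.apache.hadoop.fs.FileSystem interface; the Namenode URI and file path are hypothetical placeholders.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical Namenode address; the Namenode maps the path
            // to the data nodes that actually hold the blocks.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(new Path("/data/input.txt"))))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);  // block data streamed from the data nodes
                }
            }
            fs.close();
        }
    }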
Below is a list of the applications that integrate with Hadoop:

Wide Column Store/Column Families
• AsterData
• SAP HANA
• HBase
• Cassandra
• HyperTable

Key-Value/Tuple Store
• Couchbase Server
• Voldemort
• Cassandra
• MemcacheDB
• Amazon DynamoDB

Document Store
• MongoDB
• CouchDB

Graph Database
• InfoGrid

Applications that do not integrate directly with Hadoop use Hive and Pig, two interfaces developed by the Hadoop community to work with Hadoop. Hive is a data warehousing package. Since Hadoop storage is essentially a NoSQL database, Hive was developed to send queries to it: it can summarize data and run ad hoc queries, written in a SQL-like language known as HiveQL, which allows basic searches as well as plugging in custom map or reduce code. The second tool, Pig, is not just an interface to query the data; it also adds analysis capabilities. Pig was designed for parallelism, just like Hadoop, and its programs are written in a high-level language known as Pig Latin, which is compiled into a series of MapReduce jobs that run on Hadoop.
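As a hedged illustration of the Hive route, an application can submit HiveQL through Hive's JDBC driver; the connection URL, credentials, and table below are assumptions that vary by deployment.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuerySketch {
        public static void main(String[] args) throws Exception {
            // HiveServer2 driver class; loading it explicitly is common practice.
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            // Hypothetical HiveServer2 endpoint and database.
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://hiveserver:10000/default", "user", "");
            try (Statement stmt = conn.createStatement()) {
                // A HiveQL query; Hive compiles this into MapReduce jobs.
                ResultSet rs = stmt.executeQuery(
                        "SELECT location, COUNT(*) FROM students GROUP BY location");
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            } finally {
                conn.close();
            }
        }
    }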
4. Data Analysis Tools to Access the Data in Storage

There are a variety of tools available in the market today. They can be broadly classified as follows:
• Custom-coded applications
• Analytics-oriented tools (e.g., R)
• Database-oriented tools based on the concepts of NoSQL

Let's look at each of these in detail. The first, as the name suggests, is an application custom-coded to the requirements: custom code to move data into and out of the storage, custom APIs, and custom applications, frequently written in Java.

The second set comprises analytical applications that help analyze an array of data formats, relying on heavy mathematical or statistical analysis. Statistics help one understand a great deal pictorially and drive critical business decisions; it is critical to extract information from the data and turn that information into knowledge. One of the most common tools in this regard is "R", a programming language based on the language "S". There is an open-source implementation of this language which people use for statistical analysis. It is a powerful tool and can be connected to presentation tools such as Tableau to render the analysis in pictorial forms such as scatter plots, bar graphs, and pie charts, giving a quick view of the scenario. In the future, newer versions of the tool are expected to offer massively parallel capabilities: studies are currently under way to release a parallel version of R, which should bring compute times down dramatically and be connectable to Hadoop. Data in this application is accessed using methods written in R.
There are other analytical languages as well, such as MATLAB and SciPy, but R gives the best performance for statistical analysis of data.

The third class is the NoSQL class, the largest class of applications, with many tool options to choose from depending on the data and the relationships among its parts. As the name suggests, NoSQL does not use the standard SQL commands to query the database. Some properties of NoSQL databases are given below:
• They do not use SQL commands to query the database.
• They do not comply with ACID (atomicity, consistency, isolation, durability) guarantees.
• They have a fault-tolerant and distributed architecture.

5. Achieving Parallelism in Hadoop

5.1 Multiple Data Nodes

Typically, the data is copied over three data nodes: two on the same rack and the third on a separate rack. This helps recover from mishaps in which a whole rack goes down, and it can also be used for parallelism. When data is requested, only one of the data nodes responds with the requested data; that node fetches the data locally and serves it to the user. This is known as moving the job to the data rather than moving the data to the job. It is a faster implementation, similar in spirit to infrastructure as a service (IaaS), and it also reduces the load on the network. Once the job to fetch the data is triggered, there is no striping across nodes or use of multiple servers for data access: the access is provided locally by the data node. Parallelism is instead achieved at the application level, where each instance of the application is split across various nodes.
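As a small, hedged sketch of how the three-copy behaviour is controlled, client code can set the real HDFS property dfs.replication before writing a file; the value and path below are examples only.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Three copies per block: two on one rack and one on another,
            // following the default HDFS placement policy described above.
            conf.set("dfs.replication", "3");

            FileSystem fs = FileSystem.get(conf);
            try (FSDataOutputStream out = fs.create(new Path("/data/output.txt"))) {
                out.writeUTF("replicated across three data nodes");
            }
            fs.close();
        }
    }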
Moreover, increased read performance can be achieved because there are three copies of the data. In the background, the data nodes communicate with each other using RPC to perform a number of tasks: balancing capacity, obeying replication rules, comparing files and copies to maintain redundancy, and checking the number of data copies and making additional copies if necessary.

5.2 The MapReduce Framework Used by the Requesting Application

From the perspective of an application trying to reach the database, the request is broken into parts and sent to the database; the data nodes serve the data, which is rejoined and provided to the user. This implies a large number of runs over different large data sets, and MapReduce exists to coordinate and manage all these runs. MapReduce is a framework for massively parallel computations over potentially large data sets and a large number of nodes. As the name says, MapReduce performs two tasks: "Map" takes the input, breaks it into smaller sub-problems, and distributes them to worker nodes, which process the data and send their results back; "Reduce" takes the results from the worker nodes and combines them to create the output. MapReduce works best for large data sets, because for small ones dividing tasks among worker nodes doesn't pay off and takes longer than a straightforward computation.

Using Hadoop as a file system also provides fault tolerance. Ordinarily, if a data node fails in a cluster, we would need to stop the job, check the file system and database, and restart. MapReduce instead provides fault tolerance by rescheduling a failed job wherever the data is still available, which means the system can recover from the failure of a data node and still complete the job.
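To make the Map and Reduce phases concrete, here is a minimal word-count sketch against the standard Hadoop MapReduce Java API (org.apache.hadoop.mapreduce); the job-driver boilerplate (input/output paths, job submission) is omitted for brevity.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountSketch {
        // Map: split each input line into words and emit (word, 1).
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts emitted for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }
    }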
If MapReduce cannot be used, applications interact through some other API, but MapReduce should be the first choice because it couples the multiple data copies with parallelism and produces a very scaled-out, distributed solution. Another advantage of Hadoop is that when a new data node is added, Hadoop and MapReduce quickly recognize it and put it to use.

5.3 Parallelism in Applications

When an application sends a request, it is mapped by HDFS to the appropriate storage node, which in turn reaches a particular data node on the server; the data node may be a RAID device or any other storage arrangement. Because hardware efficiency is involved in achieving a higher degree of parallelism, the interface between the server and the RAID can become a bottleneck. To avoid this dependency on attached storage, the user can add more hardware to act as data nodes. The alternative is to move the parallelism into the application itself, which is exactly what HPC (high-performance computing) does: the need is fulfilled by parallelizing requests at the file system and at the API accessing the file system. Both serial and parallel applications can be parallelized this way. In addition to a parallel API, we can use parallel file systems such as GPFS, Lustre, and Panasas.
6. Problems with MapReduce

Today, people also use big data for serial applications on local storage; an example would be a music site that always streams data to the user serially. Such applications use MapReduce, but at some point this parallelism hits a limit and alternate ways to perform parallel tasks are needed. This can be achieved by writing application code that interacts with Hadoop using one of the parallel programming platforms provided by vendors in the market today.

a. IBM X10

X10 is an example of a programming language provided by IBM to achieve parallelism. It was developed by IBM as part of the DARPA project on Productive, Easy-to-Use, Reliable Computing Systems (PERCS). X10 achieves its underlying parallelism through a Partitioned Global Address Space (PGAS): it uses a global name space, but the space can be logically partitioned so that a portion of it is local to each processor, and each partition has its own threads that run locally. Big Data continues to grow, and so does the importance of using languages such as X10. It can interoperate with Java, which makes it easy to write applications that interact with NoSQL stores and Hadoop.

b. SAS using MPI (Message Passing Interface)
7. Designing a Hadoop Architecture

Earlier, this paper covered the various kinds of databases, the applications used to store and manage them, and the file systems used for such systems. This gives us enough information to pick the right fit for an application and join the pieces into a data-warehousing solution. Someone architecting a solution with Hadoop should keep the following in mind:

• The data arrives in huge amounts, and enough disks are needed at each data node to satisfy the I/O demands. The questions to address are the size and number of IOPS, how the data is to be laid out, and whether the application's workload is read-intensive or write-intensive.

• Depending on the kind of application, the degree of parallelism to be achieved can be estimated. Data nodes can be added accordingly, and the application can be split into smaller parts across nodes to achieve the desired throughput. For example, if the number of users reading and writing the data is high, then multiple copies of the data should be spread across data nodes.

• If there are not enough data nodes, load-balancing tasks are performed by Hadoop – by the Namenode, to be precise. But this consumes extra network bandwidth, as the data nodes communicate with each other over RPC to transfer data between themselves, and the Namenode also instructs data nodes to delete unneeded data or update copies. Remember, it is because the number of data nodes is low that this load lands on the system; it can be mitigated either by adding nodes or by completing more tasks in parallel at the application level.
• Hadoop does not use striping, so each data node should be sized for the largest file expected: each file resides wholly on a single node, although multiple files can reside on a single data node server.

• The rest concerns hardware best practice: using the best devices, a PCIe bus for transfers from storage to processor, Ethernet from the processor to the user, fallback mechanisms using RAID, SSDs, and good management software to manage the nodes. Virtualization may be needed if Hadoop is to be used at its best efficiency, since it saves the extra capacity that idle threads would otherwise consume.
Bibliography

• Layton, J. (2014). Top of the Stack. Enterprise Storage Forum. http://www.enterprisestorageforum.com/storage-management.html
• The HDF Group (1997–2014). Hierarchical Data Format, version 5. http://www.hdfgroup.org/HDF5/
• Wikipedia. http://www.wikipedia.org/