BIG DATA
What, Why, Where, When and How
Senthil Sundaresan
BI/SQL/Data Visualization Evangelist
Abstract
In this paper, we discuss what big data is, its growing prevalence, the opportunities, the challenges, and an
architectural framework that facilitates the delivery of those opportunities while addressing the
challenges.
The architecture for big data management is demonstrated through Hadoop, its MapReduce framework, and its
open-source ecosystem.
Author’s Page
I am Senthil, a BI/SQL/data visualization evangelist.
I have donned many roles during my short career of 13+ years such as Analyst, Developer, Lead, Project Manager, Principal
Data and Visualization Architect, Consultant, DB Administrator, Unix Administrator to name a few.
My BI and Visualization skills are SAP BO/BODS, TABLEAU, QLIKVIEW, MSBI, ESSBASE, R, OMNISCOPE, SQLSERVER, SYBASE
IQ, SYBASE, TERADATA (again) to name a few.
Having been in this industry, and especially in BI, for so many years, it was imperative for me to understand the nuances and
intricacies of the Big Data tech stack. That was the trigger for writing this paper, and in doing so I have started exploring big
data more deeply.
I hope this paper serves as a stepping stone for those wondering whether big data is possible, or plausible, for them.
Thanks for reading!
My blogs:
sensungit.wordpress.com
lifeofbi.blog.com
sites.googles.com/site/youtechies
1. INTRODUCTION
“Big data” is a big, buzzing phrase in the IT and business world right now – and there is a dizzying array of opinions on just
what these two simple words really mean. Technology vendors in the legacy database or data warehouse spaces say “big
data” simply refers to a traditional data warehousing scenario involving data volumes in either the single or multi-terabyte
range. Others disagree with this by saying that “big data” isn’t limited to traditional data warehouse situations, but includes
real-time or operational data stores used as the primary data foundation for online applications that power key external or
internal business systems.
In 2011, the world created 1.8 zettabytes of data, and this volume is increasing exponentially every year. This ever-increasing
data contains information that could give rise to many business opportunities. A few of the business drivers of big data are:
Finance: Better and deeper understanding of risk to avoid credit crisis – Basel III
Telecommunications: More reliable network where we can predict and prevent failure
Media: More content that is lined up with your personal preferences
Life science: Better targeted medicines with fewer complications and side effects
Retail: A personal experience with products and offers that are just what you need
Government: Government services that are based on hard data, not just gut.
Big Data is here. Analysts and research organizations have made it clear that mining machine-generated data is essential to
future success. Embracing new technologies and techniques is always challenging, but as architects, you are expected to
provide a fast, reliable path to business adoption.
Big data characteristics, architecture capabilities, technologies, market vendors, and a sample implementation are explained
in the subsequent sections.
2. WHAT IS BIG DATA?
Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast,
or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way
to process it.
2.1 Characteristics of Big Data
Big data has the following characteristics.
Very large, distributed aggregations of loosely structured data, often incomplete and inaccessible:
 Petabytes/exabytes of data
 Billions/Trillions of records
 Loosely-structured and often distributed data
 Flat schemas with few complex interrelationships
 Often involving time-stamped events
 Often made up of incomplete data
 Often including connections between data elements that must be probabilistically inferred
Applications that involve big data can be:
 Transactional (e.g., Facebook, Photobox)
 Analytic (e.g., ClickFox, Merced Applications)
Fig 1: Big Data Evolution
According to a new global report from IBM and the Saïd Business School at the University of Oxford, less than half of the
organizations engaged in active Big Data initiatives are currently analyzing external sources of data, like social media.
2.2 Key Metrics: The Three V’s
As a catch-all term, “big data” can be pretty nebulous, in the same way that the term “cloud” covers diverse technologies.
Input data to big data systems could be chatter from social networks, web server logs, traffic flow sensors, satellite imagery,
broadcast audio streams, banking transactions, MP3s of rock music, the content of web pages, scans of government
documents, GPS trails, telemetry from automobiles, financial market data, the list goes on. Are these all really the same thing?
To clarify matters, the three Vs of volume, velocity, and variety are commonly used to categorize different aspects of big
data. They are a helpful lens through which to view and understand the nature of the data and the software platforms that
are available to exploit it.
2.2.1 Volume
Terabytes of records, transactions, tables, files.
A Boeing jet engine spews out 10TB of operational data for every 30 minutes it runs, so a four-engine jumbo jet can create
640TB on a single Atlantic crossing. Multiply that by the more than 25,000 flights flown each day and you get the picture.
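As a quick back-of-the-envelope check of these figures, here is a minimal sketch; the eight-hour crossing time is an assumption, not a figure quoted above.

# Back-of-the-envelope check of the jet-engine volumes quoted above.
# Assumption: an Atlantic crossing lasts roughly 8 hours (16 half-hour windows).
TB_PER_ENGINE_PER_HALF_HOUR = 10
ENGINES = 4
HALF_HOURS_PER_CROSSING = 16      # assumed 8-hour crossing
FLIGHTS_PER_DAY = 25_000

per_crossing_tb = TB_PER_ENGINE_PER_HALF_HOUR * ENGINES * HALF_HOURS_PER_CROSSING
daily_pb = per_crossing_tb * FLIGHTS_PER_DAY / 1_000   # 1 PB = 1,000 TB

print(f"One crossing: {per_crossing_tb} TB")           # 640 TB
print(f"All flights, one day: {daily_pb:,.0f} PB")     # 16,000 PB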
2.2.2 Velocity
Batch, near-time, real-time, streams.
Today’s online ad serving requires a decision within 40 ms. Financial services need close to 1 ms to calculate customer
scoring probabilities. Streaming data, such as movies, needs to travel at high speed for proper rendering.
2.2.3 Variety
Structured, semi-structured, unstructured; text, image, audio, video, records, and all of the above in a mix.
WalMart processes 1M customer transactions per hour and feeds the information into a database estimated at 2.5PB (petabytes).
There are old and new data sources alike: RFID, sensors, mobile payments, in-vehicle tracking, etc.
[Chart: data variety and complexity grow with storage volume – from ERP (megabytes) and CRM (gigabytes) through Web (terabytes) to Big Data (petabytes).]
Fig 2: Volume, Velocity and Variety
3. BIG DATA PROCESSING
Before big data, traditional analysis involved crunching data in a traditional database. This was based on the relational
database model, where data and the relationship between the data were stored in tables. The data was processed and stored
in rows.
Databases have progressed over the years, however, and are now using massively parallel processing (MPP) to break data up
into smaller lots and process it on multiple machines simultaneously, enabling faster processing. Instead of storing the data
in rows, the databases can also employ columnar architectures, which enable the processing of only the columns that have
the data needed to answer the query and enable the storage of unstructured data.
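A toy contrast of the two layouts (a sketch with illustrative data, not any particular product's storage format): with a columnar layout, a query that needs one column scans only that column's values.

# Row-oriented storage: each record is stored together, so a query over
# one column still walks every full row.
rows = [
    {"id": 1, "region": "EMEA", "amount": 120},
    {"id": 2, "region": "APAC", "amount": 75},
    {"id": 3, "region": "EMEA", "amount": 30},
]
total_from_rows = sum(row["amount"] for row in rows)

# Column-oriented storage: each column is stored contiguously, so the
# same query scans only the "amount" column.
columns = {
    "id": [1, 2, 3],
    "region": ["EMEA", "APAC", "EMEA"],
    "amount": [120, 75, 30],
}
total_from_columns = sum(columns["amount"])

assert total_from_rows == total_from_columns == 225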
Fig 3: Big Data Architecture
MapReduce
MapReduce is the combination of two functions used to better process data. First, the map function distributes work over multiple
nodes, where the data is processed in parallel. The reduce function then combines the results of the calculations into a set of
responses.
Google used MapReduce to index the web, and has been granted a patent for its MapReduce framework. However, the
MapReduce method has now become commonly used, with the most famous implementation being in an open-source
project called Hadoop.
Bridging the Gap – The Key-Value Pair
The key-value pair is the data model underlying MapReduce (and thus Hadoop), and it is the fundamental driver of
performance. A file of key-value pairs has exactly two columns. One is structured – the key. The other, the value, is
unstructured – at least as far as the system is concerned. The mapper lets you move (or split) data between the structured
and unstructured sections at will. The reducer then lets data be collated and aggregated provided it shares an identical key.
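To make these mechanics concrete, here is a minimal pure-Python sketch of the model just described: a map function that emits key-value pairs from unstructured values, a shuffle that groups pairs by key, and a reduce function that collates values sharing an identical key. This simulates the flow on a single machine; a real Hadoop job distributes the same steps across many nodes.

from collections import defaultdict
from itertools import chain

# Input records: (key, value) pairs whose values are unstructured text.
records = [("doc1", "big data moves fast"),
           ("doc2", "big data is big")]

def map_fn(key, value):
    # Emit a (word, 1) pair for every word in the unstructured value.
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Collate all values that share an identical key.
    return (key, sum(values))

# Map phase: in Hadoop this runs in parallel across nodes.
mapped = chain.from_iterable(map_fn(k, v) for k, v in records)

# Shuffle phase: group intermediate pairs by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: one call per distinct key.
print(sorted(reduce_fn(k, vs) for k, vs in groups.items()))
# [('big', 3), ('data', 2), ('fast', 1), ('is', 1), ('moves', 1)]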
Massively parallel processing (MPP)
Like MapReduce, MPP processes data by distributing it across a number of nodes, which each process an allocation of data
in parallel. The output is then assembled to create a result.
However, MPP products are queried with SQL, while MapReduce is natively controlled via Java code. MPP is also generally
used on expensive specialized hardware (sometimes referred to as big-data appliances), while MapReduce is deployed on
commodity hardware.
4. BIG DATA ARCHITECTURE
In this section, we will take a closer look at the overall architecture for big data.
Traditional Information Architecture Capabilities
To understand the high-level architecture aspects of Big Data, let’s first review a well-formed logical information
architecture for structured data. In the illustration, you see two data sources that use integration (ELT/ETL/Change Data
Capture) techniques to transfer data into a DBMS data warehouse or operational data store, which then offers a wide variety
of analytical capabilities to explore the data. Some of these analytic
capabilities include: dashboards, reporting, EPM/BI applications, summary and statistical query, semantic interpretations for
textual data, and visualization tools for high-density data. In addition, some organizations have applied oversight and
standardization across projects, and perhaps have matured the information architecture capability through managing it at
the enterprise level.
Fig 4: Traditional Capabilities – Courtesy Oracle
The key information architecture principles include treating data as an asset through a value, cost, and risk lens, and ensuring
timeliness, quality, and accuracy of data. The EA oversight responsibility is to establish and maintain a balanced
governance approach, including the use of a center of excellence for standards management and training.
Adding Big Data Capabilities
The defining processing capabilities for big data architecture are to meet the volume, velocity, variety, and value
requirements. Unique distributed (multi-node) parallel processing architectures have been created to parse these large data
sets. There are differing technology strategies for real-time and batch processing requirements. For real-time needs, key-value
data stores (a class of NoSQL database) allow for high-performance, index-based retrieval. For batch processing, a technique
known as MapReduce filters data according to a specific data discovery strategy. After the filtered data is discovered, it can be
analyzed directly, loaded into other unstructured databases, sent to mobile devices, or merged into a traditional data
warehousing environment and correlated with structured data.
Fig 5: Big Data Capabilities – Courtesy Oracle
In addition to new unstructured data realms, there are two key differences for big data. First, due to the size of the data sets,
we don’t move the raw data directly to a data warehouse. However, after MapReduce processing we may integrate the
“reduction result” into the data warehouse environment so that we can
leverage conventional BI reporting, statistical, semantic, and correlation capabilities. It is ideal to have analytic capabilities
that combine a conventional BI platform along with big data visualization and query capabilities. And second, to facilitate
analysis in the Hadoop environment, sandbox environments can be created.
For many use cases, big data needs to capture data that is continuously changing and unpredictable. And to analyze that data,
a new architecture is needed. In retail, a good example is capturing real time foot traffic with the intent of delivering in-store
promotion. To track the effectiveness of floor displays and promotions, customer movement and behavior must be
interactively explored with visualization or query tools.
In other use cases, the analysis cannot be complete until you correlate it with other enterprise data - structured data. In the
example of consumer sentiment analysis, capturing a positive or negative social media comment has some value, but
associating it with your most or least profitable customer makes it far more valuable. So, the capability needed for Big Data
BI is context and understanding. Using powerful statistical and semantic tools allows you to find the needle in the haystack,
and helps you predict the future.
In summary, the Big Data architecture challenge is to meet the rapid use and rapid data interpretation requirements while at
the same time correlating it with other data.
5. STEPS TO BIG DATA
Before you go down the path of big data, it's important to be prepared and approach an implementation in an organized
manner, following these steps.
 What do you wish you knew?
 This is where you decide what you expect big data to deliver that your current systems cannot.
 If the answer is nothing, then perhaps big data isn't the right thing to use.
 What are the current data assets?
 Can the data be cross referenced to produce insights?
 Is it possible to build new data products on top of the current assets?
 If not, what needs to be implemented to do so?
Once the above are understood, it's time to prioritize. Select the potentially most valuable opportunity for using big-data
techniques and technology, and prepare a business case for a proof of concept, keeping in mind the skill sets you'll need to
do it. You will need to talk to the owners of the data assets to get the full picture.
Another example of applying architecture principles differently is data governance. The quality and accuracy requirements of
big data can vary tremendously. Using strict data precision rules on user sentiment data might filter out too much useful
information, whereas data standards and common definitions are still going to be critical for fraud-detection scenarios.
Start the proof of concept, and make sure that there's a clear end point, so that you can evaluate what the proof of concept
has achieved. This might also be the time to ask the owner of the data assets to take responsibility for the project.
Once your proof of concept has been completed, evaluate whether it worked. Are you getting real insights delivered? Is the
work that went in to the concept bearing fruit? Could it be extended to other parts of the organization? Is there other data
that could be included? This will help you to discover whether to expand your implementation or revamp it.
Once the evaluation is done and the need for big data is established, it’s imperative to choose the vendors and
technologies.
5.1 Architecture Decisions
Information architecture is perhaps the most complex area of IT, and it is where the ultimate investment payoff lies. Today’s
economic environment demands that business be driven by useful, accurate, and timely information. The world of Big Data adds
another dimension to the problem. However, there are always business and IT tradeoffs in getting to data and information in the
most cost-effective way.
 Key Drivers to Consider
Here is a summary of various business and IT drivers you need to consider when making these architecture choices.
Fig 6: Key Drivers
Planning Big Data architecture is not just about understanding what is different. It’s also about how to integrate what’s new
with what you already have – from database-and-BI infrastructure to IT tools and end-user applications.
5.2 Technologies
To derive real business value from big data, you need the right tools to capture and organize a wide variety of data types from
different sources, and to be able to easily analyze it within the context of all your enterprise data.
Here is a brief outline of Big Data capabilities and their primary technologies:
5.2.1 Hadoop
Derived from Google's MapReduce technology, Hadoop is an open-source framework for processing large amounts of data over
multiple nodes in parallel, running on inexpensive hardware.
Data is split into sections and loaded into a file store — for example, the Hadoop Distributed File System (HDFS), which is
made up of multiple redundant nodes on cheap storage. A name node keeps track of which data is on which nodes. The data
is replicated over more than one node, so that even if a node fails, there's still a copy of the data.
The data can then be analyzed using MapReduce, which discovers from the name node where the data needed for calculations
resides. Processing is then done at the node in parallel. The results are aggregated to determine the answer to the query and
then loaded onto a node, where they can be further analyzed using other tools. Alternatively, the data can be loaded into traditional
data warehouses for use with transactional processing.
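A toy illustration of the bookkeeping just described – a sketch, not the actual HDFS implementation: a name node records which data nodes hold each block, and each block is replicated so that losing a single node loses no data.

import itertools

REPLICATION = 3  # HDFS keeps multiple copies of each block; three by default

class ToyNameNode:
    # Tracks which data nodes hold which block, as described above.
    def __init__(self, data_nodes):
        self.block_map = {}                       # block_id -> list of node names
        self._cycle = itertools.cycle(data_nodes)

    def store(self, block_id):
        # Place REPLICATION copies on distinct nodes, round-robin.
        nodes = []
        while len(nodes) < REPLICATION:
            node = next(self._cycle)
            if node not in nodes:
                nodes.append(node)
        self.block_map[block_id] = nodes

    def locate(self, block_id, failed=()):
        # Replicas surviving a node failure: the data is still readable.
        return [n for n in self.block_map[block_id] if n not in failed]

nn = ToyNameNode(["node1", "node2", "node3", "node4"])
nn.store("blk_0")
print(nn.locate("blk_0"))                    # ['node1', 'node2', 'node3']
print(nn.locate("blk_0", failed={"node2"}))  # ['node1', 'node3']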
The Apache Software Foundation’s own distribution is considered the most noteworthy Hadoop distribution.
Fig 7: Hadoop in the Enterprise
5.2.1.1 RDBMS and Hadoop
Here is a comparison of the overall differences between the RDBMS and MapReduce-based systems such as Hadoop:
Fig 8: RDBMS vs. Hadoop
5.2.2 Hive
Data stores like Hadoop's file store make ad hoc query and analysis difficult, because writing the required map/reduce
functions can be hard. Facebook realized this while working with Hadoop and created Hive, which converts SQL queries
into map/reduce jobs to be executed using Hadoop.
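The translation Hive performs can be pictured with a small sketch (the table, columns, and query here are invented for illustration): a GROUP BY query becomes a map step that emits the grouping column as the key and a reduce step that aggregates per key.

from collections import defaultdict

# Hypothetical query Hive might receive:
#   SELECT region, COUNT(*) FROM sales GROUP BY region;
# Hive compiles it into roughly the map and reduce steps below.
sales_rows = [{"region": "EMEA", "amount": 120},
              {"region": "APAC", "amount": 75},
              {"region": "EMEA", "amount": 30}]

def map_fn(row):
    yield (row["region"], 1)      # GROUP BY column as key, count of 1 as value

def reduce_fn(region, counts):
    return (region, sum(counts))  # COUNT(*) per key

groups = defaultdict(list)
for row in sales_rows:
    for key, value in map_fn(row):
        groups[key].append(value)

print(sorted(reduce_fn(k, v) for k, v in groups.items()))
# [('APAC', 1), ('EMEA', 2)]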
5.2.3 Pig
Pig is a procedural data-processing language designed for Hadoop, in which you specify a series of steps to perform on the data.
It is often described as “the duct tape of Big Data” for its general usefulness, and it is often combined with custom streaming
code written in a scripting language for more general operations.
5.2.4 Social Network and Hadoop
Twitter uses Hadoop to store and process tweets, log files, and many other types of data generated across Twitter. It
uses Cloudera's CDH2 distribution of Hadoop, and stores all data as compressed LZO files.
 It uses both Scala and Java to access Hadoop's MapReduce APIs.
 It uses Pig heavily for both scheduled and ad-hoc jobs, due to its ability to accomplish a lot with few statements.
 It employs committers on Pig, Avro, Hive, and Cassandra, and contributes much of its internal Hadoop work to open
source.
Facebook uses Hadoop to store copies of internal log and dimension data sources and uses it as a source for reporting/analytics
and machine learning.
 Currently Facebook has two major clusters:
 An 1,100-machine cluster with 8,800 cores and about 12 PB of raw storage.
 A 300-machine cluster with 2,400 cores and about 3 PB of raw storage.
 Each (commodity) node has 8 cores and 12 TB of storage.
 Facebook makes heavy use of both the streaming and the Java APIs. Using these features, it has built a higher-level data
warehousing framework called Hive, and it has also developed a FUSE implementation over HDFS.
5.2.5 NoSQL
NoSQL database-management systems are unlike relational database-management systems in that they do not use SQL as
their query language. The idea behind these systems is that they are better for handling data that doesn't fit easily into
tables. They dispense with the overhead of indexing, schemas and ACID transactional properties to create large, replicated
data stores for running analytics on inexpensive hardware, which is useful for dealing with unstructured data.
5.2.5.1 Types of NoSQL Databases
The following are the main types of NoSQL databases (a small sketch after the figure illustrates each):
 Key-value stores
 Document databases
 Column-oriented databases
Fig 9: NoSQL Types
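To make the distinction concrete, here is a sketch of how the same record might be represented under each model (all names and values are illustrative):

# 1. Key-value store: an opaque value retrieved by its key.
kv_store = {
    "customer:42": '{"name": "Ada", "city": "London"}',
}

# 2. Document database: the value is a structured, queryable document.
doc_store = {
    "customers": [
        {"_id": 42, "name": "Ada", "city": "London",
         "orders": [{"sku": "A1", "qty": 2}]},
    ],
}

# 3. Column-oriented store: values grouped by column, so a query that
#    needs only "city" can scan that column alone.
column_store = {
    "name": {42: "Ada"},
    "city": {42: "London"},
}

print(kv_store["customer:42"])
print(doc_store["customers"][0]["orders"])
print(column_store["city"][42])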
5.2.5.2 Cassandra
Cassandra is a NoSQL database that can serve as an alternative to Hadoop's HDFS.
5.3 Sample Implementation
Big-data projects involve a number of layers of abstraction, from abstraction of the data through to running analytics
against the abstracted data. Fig 10 shows the common components of analytical big-data stacks and their relationships to each
other. The higher-level components help make big-data projects easier and more productive. Hadoop (an Apache project,
written in Java and built and used by a global community of contributors) is often at the center of big-data projects, but
it is not a prerequisite.
 Packaging and support of Hadoop by organizations such as Cloudera, including MapReduce – essentially the compute
layer of big data.
 File systems such as the Hadoop Distributed File System (HDFS), which manages the retrieval and storage of data
and the metadata required for computation. Other file systems or databases such as HBase (a NoSQL tabular store) or
Cassandra (a NoSQL eventually-consistent key-value store) can also be used.
 Instead of writing in Java, higher-level languages such as Pig (part of the Hadoop ecosystem) can be used, simplifying
the writing of computations.
 Hive is a Data Warehouse layer built on top of Hadoop, developed by Facebook programmers.
 Cascading is a thin Java library that sits on top of Hadoop that allows suites of MapReduce jobs to be run and
managed as a unit. It is widely used to develop special tools.
 Semi-automated modeling tools such as CR-X allow models to be developed interactively at great speed, and can help set
up the database that will run the analytics.
 Specialized scale-out analytic databases such as Greenplum or Netezza, with very fast loading, load and reload the data
for the analytic models.
 ISV big data analytical packages such as ClickFox and Merced run against the database to help address the business
issues (e.g., the customer satisfaction issues mentioned in the introduction).
 Transactional Big-data projects cannot use Hadoop, as it is not real-time. For transactional systems that do not need
a database with ACID guarantees, NoSQL databases can be used, though there are constraints such as weak
consistency guarantees (e.g., eventual consistency) or transactions restricted to a single data item. For big-data
transactional SQL databases that need ACID guarantees, the choices are limited. Traditional scale-up databases
are usually too costly for very large-scale deployment, and don't scale out very well; most social media databases
have had to hand-craft solutions. Recently a new breed of scale-out SQL databases has emerged with architectures
that move the processing next to the data (in the same way as Hadoop), such as Clustrix. These allow greater scale-out
ability.
Fig 10: Sample Implementation
This area is extremely fast growing, with many new entrants into the market expected over the next few years.
5.4 Vendors
There is scarcely a vendor that doesn't have a big-data plan in train, with many companies combining their proprietary
database products with the open-source Hadoop technology as their strategy to tackle velocity, variety and volume.
Many of the early big-data technologies came out of open source, posing a threat to traditional IT vendors that have packaged
their software and kept their intellectual property close to their chests. However, the open-source nature of the trend has
also provided an opportunity for traditional IT vendors, because enterprise and government often find open-source tools off-
putting.
Therefore, traditional vendors have welcomed Hadoop with open arms, packaging it in to their own proprietary systems so
they can sell the result to enterprise as more comfortable and familiar packaged solutions.
Below are the plans of some of the larger vendors.
5.4.1 Cloudera
Cloudera was founded in 2008 by employees who worked on Hadoop at Yahoo and Facebook. It contributes to the Hadoop
open-source project, offering its own distribution of the software for free. It also sells a subscription-based, Hadoop-based
distribution for the enterprise, which includes production support and tools to make it easier to run Hadoop.
5.4.2 Hortonworks
Cloudera rival Hortonworks was founded by key architects from the Yahoo Hadoop software engineering team. In June 2012,
the company launched a high-availability version of Apache Hadoop, the Hortonworks Data Platform, on which it collaborated
with VMware, with the goal of targeting companies deploying Hadoop on VMware's vSphere.
Teradata has also partnered with Hortonworks to create products that "help customers solve business problems in new and
better ways".
5.4.3 Teradata
Teradata made its move out of the "old-world" data-warehouse space by buying Aster Data Systems and Aprimo in 2011.
Teradata wanted Aster's ability to manage "a variety of diverse data that is not structured", such as web applications, sensor
networks, social networks, genomics, video and photographs.
Teradata has now gone to market with the Aster Data nCluster, a database using MPP and MapReduce. Visualization and
analysis are enabled through the Aster Data visual-development environment and suite of analytic modules. A Hadoop
connector, enabled by its agreement with Cloudera, allows information to be transferred between nCluster and Hadoop.
5.4.4 Oracle
Oracle made its big-data appliance available earlier this year— a full rack of 18 Oracle Sun servers with 864GB of main
memory; 216 CPU cores; 648TB of raw disk storage; 40Gbps InfiniBand connectivity between nodes and engineered systems;
and 10Gbps Ethernet connectivity.
The system includes Cloudera's Apache Hadoop distribution and manager software, as well as an Oracle NoSQL database and
a distribution of R (an open-source statistical computing and graphics environment).
It integrates with Oracle's 11g database, with the idea being that customers can use Hadoop MapReduce to create optimized
datasets to load and analyze in the database.
5.4.5 IBM
IBM combined Hadoop and its own patents to create IBM InfoSphere BigInsights and IBM InfoSphere Streams as the core
technologies for its big-data push.
The BigInsights product, which enables the analysis of large-scale structured and unstructured data, "enhances" Hadoop to
"withstand the demands of your enterprise", according to IBM. It adds administrative, workflow, provisioning and security
features into the open-source distribution. Meanwhile, streams analysis has a more complex event-processing focus, allowing
the continuous analysis of streaming data so that companies can respond to events.
IBM has partnered with Cloudera to integrate its Hadoop distribution and Cloudera Manager with IBM BigInsights. Like Oracle's
big-data product, IBM's BigInsights links to IBM DB2, its Netezza data warehouse, its InfoSphere Warehouse, and its Smart
Analytics System.
5.4.6 SAP
At the core of SAP's big-data strategy sits a high-performance analytic appliance (HANA) data-warehouse appliance,
unleashed in 2011. It exploits in-memory computing, processing large amounts of data in the main memory of a server to
provide real-time results for analysis and transactions. Business applications, like SAP's Business Objects, can sit on the HANA
platform to receive a real-time boost.
SAP has integrated HANA with Hadoop, enabling customers to move data between Hive and Hadoop's Distributed File System
and SAP HANA or SAP Sybase IQ server. It has also set up a "big-data" partner council, which will work to provide products
that make use of HANA and Hadoop. One of the key partners is Cloudera. SAP wants it to be easy to connect to data, whether
it's in SAP software or software from another vendor.
5.4.7 Microsoft
Microsoft is integrating Hadoop into its current products. It has been working with Hortonworks to make Hadoop available
on its cloud platform Azure, and on Windows Servers. The former is available in developer preview. It already has connectors
between Hadoop, SQL Server and SQL Server Parallel Data Warehouse, as well as the ability for customers to move data from
Hive into Excel and Microsoft BI tools, such as PowerPivot.
5.4.8 EMC
EMC has centered its big-data strategy on technology it acquired when it bought Greenplum in 2010. It offers a unified
analytics platform that handles web, social, document, mobile, machine and multimedia data using Hadoop's MapReduce
and HDFS, while ERP, CRM and POS data is put into SQL stores. Data mining, neural-net and statistical analysis is carried
out using data from both sets, which is fed into dashboards.
6. VALUE TO AN ORGANIZATION
The value of Big Data falls into two categories:
1. Analytical use
2. Enabling new markets/products
Big data analytics can reveal insights previously hidden by data too costly to process, such as peer influence among
customers, revealed by analyzing shoppers' transactions together with social and geographical data.
The past decade’s successful web startups are prime examples of big data used as an enabler of new products and services.
For example, by combining a large number of signals from a user’s actions and those of their friends, Facebook has been able
to craft a highly personalized user experience and create a new kind of advertising business. It’s no coincidence that the lion’s
share of ideas and tools underpinning big data has emerged from Google, Yahoo, Amazon and Facebook.
7. FIRMS AND BIG DATA
Now that there are products that make use of big data, what are companies' plans in the space? We've outlined some of
them below.
7.1 Ford
Ford is experimenting with Hadoop to better understand how its cars operate and how consumers use them, and to feed
that information back into the design process to help optimize the future user experience. It also aims to gain value from
the data generated by its business operations, vehicle research and even its customers' cars.
7.2 HCF
HCF has adopted IBM's big-data analytics solution, including the Netezza big-data appliance, to better analyze claims as they
are made in real time. This helps to more easily detect fraud and provide ailing members with information they might need
to stay fit and healthy.
7.3 Klout
Klout's job is to create insights from the vast amounts of data coming in from the 100 million social-network users indexed
by the company, and to provide those insights to customers. For example, Klout might provide information on how certain
people's influence on social networks (or Klout score) might affect word-of-mouth advertising, or provide information on
changes in demand. To deliver the analysis on a shoestring, Klout built custom infrastructure on Apache Hadoop, with a
separate data silo for each social network.
7.4 Mitsui Knowledge Industry
Mitsui analyses genomes for cancer research. Using HANA, R and Hadoop to pre-process DNA sequences, the company was
able to shorten genome-analysis time from several days to 20 minutes.
7.5 Nokia
Nokia is using Apache Hadoop and Cloudera's CDH to pull the unstructured data (generated by its phones around the world)
into a structured environment to create 3D maps that show traffic, inclusive of speed categories, elevation, current events
and video.
7.6 WalMart
WalMart uses a product it bought, called Muppet, as well as Hadoop to analyze social-media data from Twitter, Facebook,
Foursquare and other sources. Among other things, this allows WalMart to analyze in real time which stores will have the
biggest crowds, based on Foursquare check-ins.
8. BIG DATA – CHANGING WORLD
Computers are leaner, meaner and cheaper than ever before. With computing power no longer at a premium, we're
swimming in numbers that describe everything from how a small town in Minnesota behaves during rush hour to the
probability of a successful drone strike in Yemen.
The advent of so-called "big data" means that companies, governments and organizations can collect, interpret and wield
huge stores of data to an amazing breadth of ends. From shoe shopping to privacy concerns, here's a look at five ways "big
data" is changing the world:
8.1 Data as a deadly weapon
The traditional battlefield has dissolved into thin air. In the big data era, information is the deadliest weapon and leveraging
massive amounts of it is this era's arms race. But current military tech is buckling under the sheer weight of data collected
from satellites, unmanned aircraft, and more traditional means.
As part of the Obama administration's "Big Data Initiative," the Department of Defense launched XDATA, a program that
intends to invest $25 million toward systems that analyze massive data sets in record time. With more efficient number
crunching, the U.S. military can funnel petabytes of data toward cutting edge advances, like making unmanned drones
smarter and more deadly than ever.
8.2 Saving the Earth
Beyond powering predator drones and increasing retail revenue, big data can do a literal world of good. Take Google Earth
Engine, an open-source big-data platform that allowed researchers to produce the first high-resolution map of Mexico's forests.
The map would have taken a traditional computer over three years to construct, but using Google Earth Engine's massive
data cloud it was completed in the course of a day.
Massive sets of data like this can help us understand environmental threats on a systemic level. The more data we have about
the changing face of the earth's ecosystems and weather patterns, the better we can model future environmental shifts --
and how to stop them while we still can.
8.3 Watching you shop
Big data can mean big profits. By understanding what you want to buy today, companies large and small can figure out what
you'll want to buy tomorrow -- maybe even before you do. Online retailers like Amazon scoop up information about our
shopping and e-window shopping habits on a huge scale, but even brick and mortar retailers are starting to catch on. A clever
company called RetailNext helps companies like Brookstone and American Apparel record video of shoppers as they browse
and buy. By transforming a single shopper's path into as many as 10,000 data points, companies can see how customers move
through a store, where they pause, and how that tracks with sales.
8.4 Scientific research in overdrive
Data has long been the cornerstone of scientific discovery, and with big data -- and the big computing power necessary to
process it -- research can move at an exponentially faster clip.
Take the Human Genome Project, widely considered to be one of the landmark scientific accomplishments in human history.
Over the course of the $3 billion project, researchers analyzed and sequenced the roughly 25,000 genes that make up the
human genome in 13 years. With today's modern methods of data collection and analysis, the same process can be completed
in hours -- all by a device the size of a USB memory stick and for less than $1,000.
8.5 Big data, bigger privacy concerns
You might just be a number in the grand scheme of things, but that adage isn't as reassuring as it used to be. It's true that big
data is about breadth, but it's about depth, too.
Web mega-companies like Facebook and Google not only scoop up data on a huge number of users -- 955 million, in
Facebook's case -- but they collect an incredible depth of data as well. From what you search and where you click to who you
know (and who they know, and who they know), the web's biggest players own data stockpiles so robust that they border on
omniscient.
Where technological power, cultural advancement and profit intersect, one thing's clear: with big data comes even bigger
responsibility.
9. DEPLOYMENT CONSIDERATIONS
We have explored the nature of big data, and surveyed the landscape of big data from a high level. As usual, when it comes
to deployment there are dimensions to consider over and above tool selection.
9.1 Cloud or In-house
The majority of big data solutions are now provided in three forms: software-only, as an appliance, or cloud-based.
The decision about which route to take will depend, among other things, on issues of data locality, privacy and regulation,
human resources and project requirements. Many organizations opt for a hybrid solution: using on-demand cloud resources
to supplement in-house deployments.
9.1.1 Cloud Computing and Big Data
Experts in the IT industry, including those in cloud computing and big data, agree that a flexible and fast IT infrastructure is
needed to support Big Data. The cloud removes the infrastructure challenges, provides the necessary speed and adds scalability.
However, four areas must still be investigated more deeply: storage and processing, stewardship, sense-making and security.
9.1.2 Significant Change in Cloud Computing
Traditionally, cloud computing operates in three primary layers: Software as a Service, Platform as a Service and Infrastructure
as a Service. However, the architecture of Big Data adds another layer to the stack, one concerned with analyzing and
managing Big Data. It includes binding concepts like lineage, pedigree and provenance. Big Data is complex and
comes with daunting challenges. Phenomenal corporate balance is required for success. For organizations to harness Big Data
effectively, they must change their business processes, implement multiple technologies and give their workforce relevant
training.
9.2 Skills shortages
Even if a company decides to go down the big‐data path, it may be difficult to hire the right people. The data scientist requires
a unique blend of skills, including a strong statistical and mathematical background, a good command of statistical tools such
as SAS, SPSS or the open‐source R and an ability to detect patterns in data (like a data‐mining specialist), all backed by the
domain knowledge and communications skills to understand what to look for and how to deliver it.
9.3 Privacy
Tracking individuals' data in order to be able to sell to them better will be attractive to a company, but not necessarily to the
consumer who is being sold the products. Not everyone wants to have an analysis carried out on their lives, and depending
on how privacy regulations develop, which is likely to vary from country to country, companies will need to be careful with
how invasive they are with big-data efforts, including how they collect data. Regulations could lead to fines for invasive
policies, but perhaps the greater risk is loss of trust.
9.4 Security
Individuals trust companies to keep their data safe. However, because big data is such a new area, products haven't been
built with security in mind, despite the fact that the large volumes of data stored mean that there is more at stake than ever
before if data goes missing.
9.5 Big Data is messy
It’s not all about infrastructure. Big data practitioners consistently report that 80% of the effort involved in dealing with data
is cleaning it up in the first place.
9.6 Big Data is big
It is a fundamental fact that data that is too big to process conventionally is also too big to transport anywhere. Even if the
data isn’t too big to move, locality can still be an issue, especially with rapidly updating data.
9.7 Culture
The phenomenon of big data is closely tied to the emergence of data science, a discipline that combines math, programming
and scientific instinct. Benefiting from big data means investing in teams with this skill set, and surrounding them with an
organizational willingness to understand and use data for advantage.
9.8 Pitfalls
9.8.1 Do you know where your data is?
It's no use setting up a big-data product for analysis only to realize that critical data is spread across the organization in
inaccessible and possibly unknown locations.
9.8.2 A lack of direction
"Collecting and analyzing the data is not enough; it must be presented in a timely fashion, so that decisions are made as a
direct consequence that has a material impact on the productivity, profitability or efficiency of the organization. Most
organizations are ill prepared to address both the technical and management challenges posed by big data; as a direct result,
few will be able to effectively exploit this trend for competitive advantage."
Unless firms know what questions they want to answer and what business objectives they hope to achieve, big-data projects
just won't bear fruit.
10. CONCLUSION
Finally, remember that big data is no panacea. You can find patterns and clues in your data, but first you must decide what
problem you want to solve.
If you pick a real business problem, such as how to change your advertising strategy to increase spend per customer, it
will guide your implementation. While big data work benefits from an enterprising spirit, it also benefits strongly from a
concrete goal.
As you explore the ‘what’s new’ across the spectrum of Big Data capabilities, we suggest that you think about their integration
into your existing infrastructure and BI investments. As examples, align new operational and management capabilities with
standard IT, build for enterprise scale and resilience, unify your database and development paradigms as you embrace Open
Source, and share metadata wherever possible for both integration and analytics.
Last but not least, expand IT governance to include a Big Data center of excellence to ensure business alignment, grow
your skills, manage Open Source tools and technologies, share knowledge, establish standards, and manage best practices.
Fig 11: McKinsey Survey
Corporates vs. Big Data
“Experience Certainty” - big data is imperative for Corporates to face the future.
Scale-Out Storage Systems – Hadoop Technology Stack and Services
Corporates need strong partnerships with storage vendors and need to be involved in the architecture of large data centers
with Big Data storage requirements. Most scale-out storage solutions today include Hadoop as part of the stack.
BI, Advanced and Predictive Analytics
Corporates need strong capability in Business Intelligence, Data Warehousing and Advanced Analytics. This experience
centers on industry-leading products and advanced and predictive analytics solutions, as in the case of a “Listening
Platform for Social Media” and “Supply Chain Predictive Analytics”.
Vertical Domain Experience
Corporates need deep knowledge of the business imperatives of semiconductor, computer platform, consumer
electronics and software product companies. This knowledge in turn helps set the right patterns for advanced analytics
and define the correct rules for Big Data analytics.
What can be done?
The scarcity of Big Data and Hadoop knowledge creates a gap between requirements and resource availability. This gap
can be closed by selecting interested associates and training them properly, in order to create a larger pool of associates
with big data expertise for the future.
11. REFERENCES
[1] Edd Dumbill, http://strata.oreilly.com
[2] David Floyer, http://wikibon.org/wiki/v/Enterprise_Big-data
[3] http://en.wikipedia.org/wiki/Big_data
[4] Taylor Hatmaker, http://www.entrepreneur.com/article/224582
[5] Scott Jarr, http://voltdb.com/company/blog/big-data-value-continuum
[6] Oracle white paper on Enterprise Architecture
[7] McKinsey Global Institute analysis
[8] Victor Daily, http://www.techzost.com/2012/11/where-does-cloudcomputing-and-big-data.html
[9] http://wiki.apache.org/hadoop/PoweredBy
[10] TCS Hadoop and Data Xplode
[11] http://www.tcs.com/resources/white_papers/Pages/Big-Data-Storage-Solutions.aspx
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice.
To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee.
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructureitnewsafrica
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 

Último (20)

Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 

Big data - what, why, where, when and how

to process it.

2.1 Characteristics of Big Data
Big data typically takes the form of very large, distributed aggregations of loosely structured data that are often incomplete and hard to access:
- Petabytes/exabytes of data
- Billions/trillions of records
- Loosely structured and often distributed data
- Flat schemas with few complex interrelationships
- Often involving time-stamped events
- Often made up of incomplete data
- Often including connections between data elements that must be probabilistically inferred

Applications that involve big data can be:
- Transactional (e.g., Facebook, Photobox)
- Analytic (e.g., ClickFox, Merced Applications)
Fig 1: Big Data Evolution (data variety and complexity versus storage volume: ERP, CRM and Web systems in the megabyte-to-terabyte range, big data in the petabyte range)

According to a new global report from IBM and the Saïd Business School at the University of Oxford, less than half of the organizations engaged in active big data initiatives are currently analyzing external sources of data, such as social media.

2.2 Key Metrics: The Three V's
As a catch-all term, "big data" can be pretty nebulous, in the same way that the term "cloud" covers diverse technologies. Input data to big data systems could be chatter from social networks, web server logs, traffic flow sensors, satellite imagery, broadcast audio streams, banking transactions, MP3s of rock music, the content of web pages, scans of government documents, GPS trails, telemetry from automobiles, financial market data; the list goes on. Are these all really the same thing? To clarify matters, the three V's of volume, velocity and variety are commonly used to categorize different aspects of big data. They are a helpful lens through which to view and understand the nature of the data and the software platforms that are available to exploit it.

2.2.1 Volume
Terabytes of records, transactions, tables and files. A Boeing jet engine produces 10 TB of operational data for every 30 minutes it runs, so a four-engine jumbo jet can create 640 TB on one Atlantic crossing (4 engines × 10 TB per half hour × an 8-hour flight). Multiply that by the roughly 25,000 flights flown each day and you get the picture.

2.2.2 Velocity
Batch, near-time, real-time, streams. Today's online ad serving requires a decision within about 40 ms. Financial services need latencies near 1 ms to calculate customer scoring probabilities. Stream data, such as movies, needs to travel at high speed for proper rendering.

2.2.3 Variety
Structured, semi-structured and unstructured; text, image, audio, video, records, and all of the above in a mix. Walmart processes one million customer transactions per hour and feeds the information to a database estimated at 2.5 PB (petabytes). And there are old and new data sources alike: RFID, sensors, mobile payments, in-vehicle tracking, and so on.
Fig 2: Volume, Velocity and Variety

3. BIG DATA PROCESSING
Before big data, traditional analysis involved crunching data in a conventional database built on the relational model, where data and the relationships between data items are stored in tables and processed row by row. Databases have progressed over the years, however, and now use massively parallel processing (MPP) to break data up into smaller lots and process it on multiple machines simultaneously, enabling faster processing. Instead of storing the data in rows, databases can also employ columnar architectures, which allow a query to read only the columns that hold the data it actually needs, and which make it easier to store unstructured data.

Fig 3: Big Data Architecture

MapReduce
MapReduce combines two functions to process data at scale. First, the map function separates data over multiple nodes, which then process it in parallel. The reduce function then combines the results of those calculations into a set of responses. Google used MapReduce to index the web and has been granted a patent on its MapReduce framework. The method has since become commonplace, the most famous implementation being the open-source project Hadoop.

Bridging the Gap – The Key-Value Pair
The key-value pair is the data model underlying MapReduce (and thus Hadoop), and it is the fundamental driver of performance. A file of key-value pairs has exactly two columns. One, the key, is structured; the other, the value, is unstructured, at least as far as the system is concerned. The mapper lets you move (or split) data between the structured and unstructured sections at will, and the reducer then collates and aggregates all values that share an identical key, as the sketch below illustrates.
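To make the two phases concrete, here is a minimal single-machine sketch of the model in Python; it illustrates the programming model only, not Hadoop itself, and the function names and toy input are invented for the example. The mapper emits key-value pairs, a shuffle step groups values by identical key, and the reducer aggregates each group.

from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line.
    for word in line.lower().split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle phase: group all values that share an identical key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce phase: collapse each group of values into a single result.
    return (key, sum(values))

if __name__ == "__main__":
    documents = ["big data is big", "data moves fast"]
    pairs = [pair for doc in documents for pair in mapper(doc)]
    results = [reducer(key, values) for key, values in shuffle(pairs).items()]
    print(sorted(results))  # [('big', 2), ('data', 2), ('fast', 1), ('is', 1), ('moves', 1)]

In a real cluster the same mapper and reducer contracts apply, but the shuffle happens over the network between nodes.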
Massively parallel processing (MPP)
Like MapReduce, MPP processes data by distributing it across a number of nodes, each of which processes its allocation of data in parallel; the outputs are then assembled to produce the final result. The practical differences are in interface and hardware: MPP products are queried with SQL, while MapReduce is natively controlled via Java code, and MPP generally runs on expensive specialized hardware (sometimes referred to as big data appliances), while MapReduce is deployed on commodity hardware. The partition-process-assemble pattern itself is sketched below.
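As a rough single-machine analogue of that pattern, the following Python sketch partitions a data set across worker processes, lets each compute a partial aggregate in parallel, and assembles the partials into a final answer. A real MPP database does this across physical nodes and exposes SQL on top; the node count and data here are purely illustrative.

from multiprocessing import Pool

def partial_sum(chunk):
    # Each "node" aggregates only its own allocation of the data.
    return sum(chunk)

def split(data, nodes):
    # Distribute rows across nodes in roughly equal chunks.
    size = (len(data) + nodes - 1) // nodes
    return [data[i:i + size] for i in range(0, len(data), size)]

if __name__ == "__main__":
    sales = list(range(1, 1001))                 # stand-in for a fact-table column
    chunks = split(sales, nodes=4)
    with Pool(processes=4) as pool:
        partials = pool.map(partial_sum, chunks)  # each allocation processed in parallel
    print(sum(partials))                          # assemble the result: 500500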
4. BIG DATA ARCHITECTURE
In this section, we take a closer look at the overall architecture for big data.

Traditional Information Architecture Capabilities
To understand the high-level architecture aspects of big data, let's first review a well-formed logical information architecture for structured data. In the illustration, two data sources use integration techniques (ELT/ETL/change data capture) to transfer data into a DBMS data warehouse or operational data store, which then feeds a wide variety of analytical capabilities: dashboards, reporting, EPM/BI applications, summary and statistical queries, semantic interpretation of textual data, and visualization tools for high-density data. In addition, some organizations have applied oversight and standardization across projects, and perhaps have matured the information architecture capability by managing it at the enterprise level.

Fig 4: Traditional Capabilities – Courtesy Oracle

The key information architecture principles include treating data as an asset through a value, cost and risk lens, and ensuring the timeliness, quality and accuracy of data. The EA oversight responsibility is to establish and maintain a balanced governance approach, including a center of excellence for standards management and training.

Adding Big Data Capabilities
The defining processing capabilities of a big data architecture are to meet the volume, velocity, variety and value requirements. Distinctive distributed (multi-node) parallel processing architectures have been created to parse these large data sets, and there are differing technology strategies for real-time and batch processing requirements. For real time, key-value data stores such as NoSQL allow for high-performance, index-based retrieval. For batch processing, MapReduce filters data according to a specific data discovery strategy. After the filtered data is discovered, it can be analyzed directly, loaded into other unstructured databases, sent to mobile devices, or merged into a traditional data warehousing environment and correlated to structured data.

Fig 5: Big Data Capabilities – Courtesy Oracle

In addition to the new unstructured data realms, there are two key differences with big data. First, due to the size of the data sets, we don't move the raw data directly to a data warehouse; instead, after MapReduce processing, we may integrate the "reduction result" into the data warehouse environment so that we can leverage conventional BI reporting, statistical, semantic and correlation capabilities. It is ideal to have analytic capabilities that combine a conventional BI platform with big data visualization and query capabilities. Second, to facilitate analysis in the Hadoop environment, sandbox environments can be created.

For many use cases, big data needs to capture data that is continuously changing and unpredictable, and analyzing it requires a new architecture. In retail, a good example is capturing real-time foot traffic with the intent of delivering in-store promotions: to track the effectiveness of floor displays and promotions, customer movement and behavior must be interactively explored with visualization or query tools. In other use cases, the analysis is not complete until it is correlated with other enterprise, structured data. In consumer sentiment analysis, for example, capturing a positive or negative social media comment has some value, but associating it with your most or least profitable customer makes it far more valuable. The needed capability with big data BI is therefore context and understanding: powerful statistical and semantic tools let you find the needle in the haystack and help you predict the future. In summary, the big data architecture challenge is to meet the rapid-use and rapid-interpretation requirements while correlating the new data with other enterprise data.

5. STEPS TO BIG DATA
Before you go down the path of big data, it is important to be prepared and to approach an implementation in an organized manner, following these steps.
- What do you wish you knew? Decide what you expect from big data that you can't get from your current systems. If the answer is nothing, then perhaps big data isn't the right thing to use.
- What are the current data assets? Can the data be cross-referenced to produce insights?
- Is it possible to build new data products on top of the current assets? If not, what needs to be implemented to do so?

Once the above are understood, it's time to prioritize. Select the potentially most valuable opportunity for using big data techniques and technology, and prepare a business case for a proof of concept, keeping in mind the skill sets you'll need to do it. You will need to talk to the owners of the data assets to get the full picture.

Start the proof of concept, and make sure that there is a clear end point, so that you can evaluate what it has achieved. This might also be the time to ask the owner of the data assets to take responsibility for the project.

Note that architecture principles may apply differently here; data governance is one example. The quality and accuracy requirements of big data can vary tremendously: using strict data precision rules on user sentiment data might filter out too much useful information, whereas data standards and common definitions are still going to be critical for fraud detection scenarios.
Once your proof of concept has been completed, evaluate whether it worked. Are real insights being delivered? Is the work that went into the concept bearing fruit? Could it be extended to other parts of the organization? Is there other data that could be included? The answers will tell you whether to expand your implementation or revamp it. Once the evaluation is done and the need for big data is clear, it is time to choose the vendors and technologies.

5.1 Architecture Decisions
Information architecture is perhaps the most complex area of IT, and it is the ultimate investment payoff. Today's economic environment demands that business be driven by useful, accurate and timely information, and the world of big data adds another dimension to the problem. There are always business and IT trade-offs in getting to data and information in the most cost-effective way.

Key Drivers to Consider
Here is a summary of the various business and IT drivers you need to consider when making these architecture choices.

Fig 6: Key Drivers

Planning a big data architecture is not just about understanding what is different. It is also about how to integrate what's new with what you already have, from database-and-BI infrastructure to IT tools and end-user applications.

5.2 Technologies
To derive real business value from big data, you need the right tools to capture and organize a wide variety of data types from different sources, and to be able to analyze them easily within the context of all your enterprise data. Here is a brief outline of big data capabilities and their primary technologies.

5.2.1 Hadoop
Derived from MapReduce technology, Hadoop is an open-source framework for processing large amounts of data over multiple nodes in parallel, running on inexpensive hardware. Data is split into sections and loaded into a file store, for example the Hadoop Distributed File System (HDFS), which is made up of multiple redundant nodes on cheap storage. A name node keeps track of which data is on which nodes, and the data is replicated over more than one node, so that even if a node fails, a copy of the data survives. The data can then be analyzed using MapReduce, which discovers from the name node where the data needed for a calculation resides; processing is done at the nodes in parallel, and the results are aggregated to answer the query, after which they can be analyzed further with other tools or loaded into traditional data warehouses for transactional processing. The Apache distribution itself is considered the most noteworthy Hadoop distribution; a short streaming-style sketch follows.
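Hadoop also ships a Streaming utility that lets the map and reduce logic be written in any language that reads standard input and writes standard output, which makes a quick sketch possible without Java. The two hypothetical scripts below, mapper.py and reducer.py, count hits per URL in web-server logs and follow the streaming convention of tab-separated key-value lines; the framework sorts the mapper output by key before the reducer sees it. The assumed log format (with the URL as the third whitespace-separated field) is an illustration only; such scripts would typically be submitted with the hadoop-streaming JAR that ships with the distribution.

#!/usr/bin/env python
# mapper.py -- emit one (url, 1) pair per log line
# (assumed line format: "ip timestamp url status"; illustrative only)
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) >= 3:
        print("%s\t1" % fields[2])   # tab-separated key-value, per the streaming convention

#!/usr/bin/env python
# reducer.py -- input arrives sorted by key, so counts can be
# accumulated over each run of identical keys
import sys

current_url, count = None, 0
for line in sys.stdin:
    url, value = line.rstrip("\n").split("\t")
    if url != current_url:
        if current_url is not None:
            print("%s\t%d" % (current_url, count))   # flush the finished key
        current_url, count = url, 0
    count += int(value)
if current_url is not None:
    print("%s\t%d" % (current_url, count))           # flush the last key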
Fig 7: Hadoop in the Enterprise

5.2.1.1 RDBMS and Hadoop
Here is a comparison of the overall differences between RDBMSs and MapReduce-based systems such as Hadoop:

Fig 8: RDBMS vs. Hadoop

5.2.2 Hive
Stores like Hadoop's file system make ad hoc query and analysis difficult, because writing the required map/reduce functions is hard. Realizing this while working with Hadoop, Facebook created Hive, which converts SQL queries into map/reduce jobs that are executed on Hadoop; a brief client-side sketch follows below.

5.2.3 Pig
Pig is a procedural data processing language designed for Hadoop, in which you specify a series of steps to perform on the data. It is often described as "the duct tape of big data" for its usefulness, and it is often combined with custom streaming code written in a scripting language for more general operations.
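To give a feel for how Hive hides MapReduce behind SQL, here is a hedged sketch using PyHive, one of several Python clients for HiveServer2. The endpoint, the page_views table and its columns are assumptions for illustration; Hive compiles the query below into map/reduce jobs behind the scenes.

from pyhive import hive  # one of several Python clients for HiveServer2

# Assumed: a HiveServer2 endpoint on localhost:10000 and a hypothetical
# page_views table with url and view_date columns.
conn = hive.connect(host="localhost", port=10000)
cursor = conn.cursor()
cursor.execute(
    "SELECT url, COUNT(*) AS hits "
    "FROM page_views "
    "WHERE view_date = '2012-06-01' "
    "GROUP BY url "
    "ORDER BY hits DESC "
    "LIMIT 10"
)
for url, hits in cursor.fetchall():
    print(url, hits)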
5.2.4 Social Networks and Hadoop
Twitter uses Hadoop to store and process tweets, log files and many other types of data generated across Twitter. It uses Cloudera's CDH2 distribution of Hadoop and stores all data as compressed LZO files.
- It uses both Scala and Java to access Hadoop's MapReduce APIs.
- It uses Pig heavily for both scheduled and ad hoc jobs, because Pig accomplishes a lot with few statements.
- It employs committers on Pig, Avro, Hive and Cassandra, and contributes much of its internal Hadoop work to open source.

Facebook uses Hadoop to store copies of internal log and dimension data sources and uses it as a source for reporting, analytics and machine learning.
- Facebook currently has two major clusters: an 1,100-machine cluster with 8,800 cores and about 12 PB of raw storage, and a 300-machine cluster with 2,400 cores and about 3 PB of raw storage.
- Each (commodity) node has 8 cores and 12 TB of storage.
- Facebook is a heavy user of both streaming and the Java APIs. It has built a higher-level data warehousing framework on these features, called Hive, and has also developed a FUSE implementation over HDFS.

5.2.5 NoSQL
NoSQL database management systems differ from relational database management systems in that they do not use SQL as their query language. The idea behind these systems is that they are better suited to data that doesn't fit easily into tables. They dispense with the overhead of indexing, schemas and ACID transactional properties to create large, replicated data stores for running analytics on inexpensive hardware, which is useful for dealing with unstructured data.

5.2.5.1 Types of NoSQL Databases
The main types of NoSQL databases, contrasted in the sketch below, are:
- Key-value stores
- Document databases
- Column-oriented databases

Fig 9: NoSQL Types

5.2.5.2 Cassandra
Cassandra is a NoSQL database whose storage layer can serve as an alternative to Hadoop's HDFS.
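To contrast the three models, the sketch below uses plain Python structures as stand-ins for actual stores and shows how the same customer record might be shaped in each; the record and field names are invented for illustration.

# Key-value store: an opaque value retrieved by key; the store knows
# nothing about the value's internal structure.
kv_store = {
    "customer:1001": '{"name": "Ada", "city": "Chennai", "ltv": 4200}',
}

# Document database: the value is a structured document whose fields
# can be queried (e.g. all customers in a given city).
doc_store = {
    "customers": [
        {"_id": 1001, "name": "Ada", "city": "Chennai", "ltv": 4200},
    ],
}

# Column-oriented store: data is organized by column, so a scan that
# needs only "ltv" never touches the other columns.
column_store = {
    "name": {1001: "Ada"},
    "city": {1001: "Chennai"},
    "ltv":  {1001: 4200},
}

print(kv_store["customer:1001"])                                     # fetch by key
print([c for c in doc_store["customers"] if c["city"] == "Chennai"]) # query a field
print(column_store["ltv"][1001])                                     # read one column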
5.3 Sample Implementation
Big data projects involve a number of layers of abstraction, from abstraction of the data through to running analytics against the abstracted data. Figure 10 shows the common components of an analytical big data stack and their relationships to each other. The higher-level components help make big data projects easier and more productive. Hadoop (an Apache project, written in Java and built and used by a global community of contributors) is often at the center of big data projects, but it is not a prerequisite.
- Packaging and support of Hadoop by organizations such as Cloudera, including MapReduce, essentially the compute layer of big data.
- File systems such as the Hadoop Distributed File System (HDFS), which manages the retrieval and storage of the data and metadata required for computation. Other file systems or databases, such as HBase (a NoSQL tabular store) or Cassandra (a NoSQL eventually consistent key-value store), can also be used.
- Higher-level languages such as Pig (part of Hadoop), which can be used instead of writing in Java, simplifying the writing of computations.
- Hive, a data warehouse layer built on top of Hadoop, developed by Facebook programmers.
- Cascading, a thin Java library that sits on top of Hadoop and allows suites of MapReduce jobs to be run and managed as a unit. It is widely used to develop specialized tools.
- Semi-automated modeling tools such as CR-X, which allow models to be developed interactively at great speed and can help set up the database that will run the analytics.
- Specialized scale-out analytic databases such as Greenplum or Netezza, with very fast loading, which load and reload the data for the analytic models.
- ISV big data analytical packages such as ClickFox and Merced, which run against the database to help address business issues (for example, the customer satisfaction issues mentioned in the introduction).

Transactional big data projects cannot use Hadoop, as it is not real-time. For transactional systems that do not need a database with ACID guarantees, NoSQL databases can be used, though there are constraints such as weak consistency guarantees (e.g., eventual consistency) or transactions restricted to a single data item. For big data transactional SQL databases that do need ACID guarantees, the choices are limited: traditional scale-up databases are usually too costly for very large-scale deployment and don't scale out very well, and most social media companies have had to hand-craft solutions. Recently a new breed of scale-out SQL databases has emerged with architectures that move the processing next to the data (in the same way as Hadoop), such as Clustrix, allowing greater scale-out ability.

Fig 10: Sample Implementation

This area is growing extremely fast, with many new entrants into the market expected over the next few years.

5.4 Vendors
There is scarcely a vendor without a big data plan in train, and many companies are combining their proprietary database products with the open-source Hadoop technology as their strategy for tackling velocity, variety and volume. Many of the early big data technologies came out of open source, posing a threat to traditional IT vendors that have packaged their software and kept their intellectual property close to their chests. However, the open-source nature of the trend has also provided an opportunity for those vendors, because enterprises and governments often find open-source tools off-putting. Traditional vendors have therefore welcomed Hadoop with open arms, packaging it into their own proprietary systems so they can sell the result to enterprises as more comfortable and familiar packaged solutions. Below are the plans of some of the larger vendors.

5.4.1 Cloudera
Cloudera was founded in 2008 by employees who had worked on Hadoop at Yahoo and Facebook. It contributes to the Apache Hadoop open-source project and offers its own distribution of the software for free. It also sells a subscription-based Hadoop distribution for the enterprise, which includes production support and tools that make Hadoop easier to run.
5.4.2 Hortonworks
Cloudera rival Hortonworks was born out of key architects from the Yahoo Hadoop software engineering team. In June 2012 the company launched a high-availability version of Apache Hadoop, the Hortonworks Data Platform, on which it collaborated with VMware with the goal of targeting companies deploying Hadoop on VMware's vSphere. Teradata has also partnered with Hortonworks to create products that "help customers solve business problems in new and better ways".

5.4.3 Teradata
Teradata moved beyond the "old-world" data warehouse space by buying Aster Data Systems and Aprimo in 2011. Teradata wanted Aster's ability to manage "a variety of diverse data that is not structured", such as web applications, sensor networks, social networks, genomics, video and photographs. Teradata has since gone to market with the Aster Data nCluster, a database using MPP and MapReduce. Visualization and analysis are enabled through the Aster Data visual development environment and a suite of analytic modules. A Hadoop connector, enabled by its agreement with Cloudera, allows information transfer between nCluster and Hadoop.

5.4.4 Oracle
Oracle made its big data appliance available earlier this year: a full rack of 18 Oracle Sun servers with 864 GB of main memory; 216 CPU cores; 648 TB of raw disk storage; 40 Gbps InfiniBand connectivity between nodes and engineered systems; and 10 Gbps Ethernet connectivity. The system includes Cloudera's Apache Hadoop distribution and manager software, as well as an Oracle NoSQL database and a distribution of R (an open-source statistical computing and graphics environment). It integrates with Oracle's 11g database, the idea being that customers can use Hadoop MapReduce to create optimized data sets to load and analyze in the database.

5.4.5 IBM
IBM combined Hadoop with its own patents to create IBM InfoSphere BigInsights and IBM InfoSphere Streams as the core technologies of its big data push. The BigInsights product, which enables the analysis of large-scale structured and unstructured data, "enhances" Hadoop to "withstand the demands of your enterprise", according to IBM, adding administrative, workflow, provisioning and security features to the open-source distribution. Streams analysis, meanwhile, has a more complex event-processing focus, allowing the continuous analysis of streaming data so that companies can respond to events. IBM has partnered with Cloudera to integrate Cloudera's Hadoop distribution and Cloudera Manager with BigInsights. Like Oracle's big data product, IBM's BigInsights links to IBM DB2, the Netezza data warehouse, InfoSphere Warehouse and the Smart Analytics System.

5.4.6 SAP
At the core of SAP's big data strategy sits the high-performance analytic appliance (HANA) data warehouse appliance, unleashed in 2011. It exploits in-memory computing, processing large amounts of data in the main memory of a server to provide real-time results for analysis and transactions. Business applications, like SAP's BusinessObjects, can sit on the HANA platform to receive a real-time boost. SAP has integrated HANA with Hadoop, enabling customers to move data between Hive or Hadoop's Distributed File System and SAP HANA or the SAP Sybase IQ server. It has also set up a big data partner council, which will work to provide products that make use of HANA and Hadoop; one of the key partners is Cloudera.
SAP wants it to be easy to connect to data, whether it lives in SAP software or in software from another vendor.

5.4.7 Microsoft
Microsoft is integrating Hadoop into its current products. It has been working with Hortonworks to make Hadoop available on its cloud platform Azure and on Windows Server; the former is available in developer preview. It already has connectors between Hadoop, SQL Server and SQL Server Parallel Data Warehouse, as well as the ability for customers to move data from Hive into Excel and Microsoft BI tools such as PowerPivot.

5.4.8 EMC
EMC has centered its big data offering on technology it acquired when it bought Greenplum in 2010. It offers a unified analytics platform that handles web, social, document, mobile, machine and multimedia data using Hadoop's MapReduce and HDFS, while ERP, CRM and POS data goes into SQL stores. Data mining, neural nets and statistical analysis are carried out on data from both sets, which is then fed into dashboards.

6. VALUE TO AN ORGANIZATION
The value of big data falls into two categories:
1. Analytical use
2. Enabling new markets/products

Big data analytics can reveal insights previously hidden by data that was too costly to process, such as peer influence among customers, revealed by analyzing shoppers' transactional, social and geographical data. The past decade's successful web startups are prime examples of big data used as an enabler of new products and services. For example, by combining a large number of signals from a user's actions and those of their friends, Facebook has been able to craft a highly personalized user experience and create a new kind of advertising business. It is no coincidence that the lion's share of the ideas and tools underpinning big data have emerged from Google, Yahoo, Amazon and Facebook.

7. FIRMS AND BIG DATA
Now that there are products that make use of big data, what are companies' plans in the space? Some of them are outlined below.

7.1 Ford
Ford is experimenting with Hadoop to better understand how its cars operate and how consumers use the vehicles, and to feed that information back into its design process to optimize the user's experience in the future, as well as to extract value from the data generated by its business operations, vehicle research and even its customers' cars.

7.2 HCF
HCF has adopted IBM's big data analytics solution, including the Netezza big data appliance, to better analyze claims in real time as they are made. This makes it easier to detect fraud and to provide ailing members with the information they may need to stay fit and healthy.

7.3 Klout
Klout's job is to create insights from the vast amounts of data coming in from the 100 million social network users indexed by the company, and to provide those insights to customers. For example, Klout might provide information on how certain people's influence on social networks (their Klout score) affects word-of-mouth advertising, or information on changes in demand. To deliver the analysis on a shoestring, Klout built custom infrastructure on Apache Hadoop, with a separate data silo for each social network.

7.4 Mitsui Knowledge Industry
Mitsui analyzes genomes for cancer research. Using HANA, R and Hadoop to pre-process DNA sequences, the company was able to shorten genome analysis time from several days to 20 minutes.

7.5 Nokia
Nokia is using Apache Hadoop and Cloudera's CDH to pull unstructured data (generated by its phones around the world) into a structured environment to create 3D maps that show traffic, including speed categories, elevation, current events and video.
7.6 Walmart
Walmart uses a product it bought, called Muppet, as well as Hadoop to analyze social media data from Twitter, Facebook, Foursquare and other sources. Among other things, this allows Walmart to analyze in real time which stores will have the biggest crowds, based on Foursquare check-ins.

8. BIG DATA – CHANGING THE WORLD
Computers are leaner, meaner and cheaper than ever before. With computing power no longer at a premium, we are swimming in numbers that describe everything from how a small town in Minnesota behaves during rush hour to the probability of a successful drone strike in Yemen. The advent of so-called "big data" means that companies, governments and organizations can collect, interpret and wield huge stores of data to an amazing breadth of ends. From shoe shopping to privacy concerns, here is a look at five ways big data is changing the world.

8.1 Data as a deadly weapon
The traditional battlefield has dissolved into thin air. In the big data era, information is the deadliest weapon, and leveraging massive amounts of it is this era's arms race. But current military technology is buckling under the sheer weight of data collected from satellites, unmanned aircraft and more traditional means. As part of the Obama administration's Big Data Initiative, the Department of Defense launched XDATA, a program that intends to invest $25 million in systems that analyze massive data sets in record time. With more efficient number crunching, the U.S. military can funnel petabytes of data toward cutting-edge advances, like making unmanned drones smarter and more deadly than ever.

8.2 Saving the Earth
Beyond powering Predator drones and increasing retail revenue, big data can do a literal world of good. Take Google Earth Engine, a big data platform that allowed researchers to produce the first high-resolution map of Mexico's forests. The map would have taken a traditional computer over three years to construct, but using Google Earth Engine's massive data cloud it was completed in the course of a day. Massive data sets like this can help us understand environmental threats on a systemic level: the more data we have about the changing face of the earth's ecosystems and weather patterns, the better we can model future environmental shifts, and work out how to stop them while we still can.

8.3 Watching you shop
Big data can mean big profits. By understanding what you want to buy today, companies large and small can figure out what you'll want to buy tomorrow, maybe even before you do. Online retailers like Amazon scoop up information about our shopping and e-window-shopping habits on a huge scale, but even brick-and-mortar retailers are starting to catch on. A clever company called RetailNext helps companies like Brookstone and American Apparel record video of shoppers as they browse and buy. By transforming a single shopper's path into as many as 10,000 data points, companies can see how shoppers move through a store, where they pause, and how that tracks with sales.

8.4 Scientific research in overdrive
Data has long been the cornerstone of scientific discovery, and with big data, and the big computing power needed to process it, research can move at an exponentially faster clip. Take the Human Genome Project, widely considered one of the landmark scientific accomplishments in human history. Over the course of the $3 billion project, researchers took 13 years to analyze and sequence the roughly 25,000 genes that make up the human genome. With today's methods of data collection and analysis, the same process can be completed in hours, by a device the size of a USB memory stick, for less than $1,000.
8.5 Big data, bigger privacy concerns
You might just be a number in the grand scheme of things, but that adage isn't as reassuring as it used to be. It's true that big data is about breadth, but it's about depth, too. Web mega-companies like Facebook and Google not only scoop up data on a huge number of users (955 million, in Facebook's case) but collect an incredible depth of data as well. From what you search and where you click to who you know (and who they know, and who they know), the web's biggest players own data stockpiles so robust that they border on omniscient. Where technological power, cultural advancement and profit intersect, one thing is clear: with big data comes even bigger responsibility.

9. DEPLOYMENT CONSIDERATIONS
We have explored the nature of big data and surveyed the landscape from a high level. As usual, when it comes to deployment there are dimensions to consider over and above tool selection.

9.1 Cloud or In-House
Most big data solutions are now offered in three forms: software only, as an appliance, or cloud based. The choice of route will depend, among other things, on issues of data locality, privacy and regulation, human resources, and project requirements. Many organizations opt for a hybrid solution, using on-demand cloud resources to supplement in-house deployments.

9.1.1 Cloud Computing and Big Data
Experts in the IT industry, including in cloud computing and big data, agree that a flexible and fast IT infrastructure is needed to support big data. The cloud removes the infrastructure challenges, provides the necessary speed and adds scalability. However, four areas must still be investigated more deeply: storage and processing, stewardship, sense-making, and security.

9.1.2 A Significant Change in Cloud Computing
Traditionally, cloud computing operates in three primary layers: Software as a Service, Platform as a Service and Infrastructure as a Service. The architecture of big data adds another layer to the stack, concerned with analyzing and managing big data; it includes binding concepts such as lineage, pedigree and provenance. Big data is complex and comes with daunting challenges, and considerable corporate balance is required for success. For organizations to harness big data effectively, they must change their business processes, implement multiple technologies and give their workforce relevant training.

9.2 Skills shortages
Even if a company decides to go down the big data path, it may be difficult to hire the right people. The data scientist requires a unique blend of skills: a strong statistical and mathematical background, a good command of statistical tools such as SAS, SPSS or the open-source R, and an ability to detect patterns in data (like a data mining specialist), all backed by the domain knowledge and communication skills to understand what to look for and how to deliver it.

9.3 Privacy
Tracking individuals' data in order to sell to them better is attractive to a company, but not necessarily to the consumer being sold the products. Not everyone wants an analysis carried out on their lives, and depending on how privacy regulations develop, which is likely to vary from country to country, companies will need to be careful about how invasive their big data efforts are, including how they collect data. Regulations could lead to fines for invasive policies, but perhaps the greater risk is loss of trust.
9.4 Security
Individuals trust companies to keep their data safe. However, because big data is such a new area, products haven't been built with security in mind, even though the large volumes of data being stored mean there is more at stake than ever before if data goes missing.

9.5 Big data is messy
It's not all about infrastructure. Big data practitioners consistently report that 80% of the effort involved in dealing with data is cleaning it up in the first place; the sketch below gives a feel for what that first pass looks like.
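As a small illustration of that cleanup burden, here is a pandas sketch of a typical first pass over a raw extract: deduplication, type coercion, and dropping rows that cannot be used. The file name and columns are hypothetical.

import pandas as pd

# Hypothetical raw extract with the usual defects: duplicated rows,
# malformed timestamps and non-numeric amounts.
raw = pd.read_csv("transactions_raw.csv")

clean = (
    raw.drop_duplicates()   # remove rows repeated by double loads
       .assign(
           ts=lambda d: pd.to_datetime(d["ts"], errors="coerce"),         # bad dates become NaT
           amount=lambda d: pd.to_numeric(d["amount"], errors="coerce"),  # bad numbers become NaN
       )
       .dropna(subset=["ts", "amount"])   # drop rows that cannot be used
)
print(len(raw), "raw rows ->", len(clean), "usable rows")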
9.6 Big data is big
It is a fundamental fact that data too big to process conventionally is also too big to transport anywhere. Even when the data isn't too big to move, locality can still be an issue, especially with rapidly updating data.

9.7 Culture
The phenomenon of big data is closely tied to the emergence of data science, a discipline that combines math, programming and scientific instinct. Benefiting from big data means investing in teams with this skill set, and surrounding them with an organizational willingness to understand and use data for advantage.

9.8 Pitfalls
9.8.1 Do you know where your data is?
It's no use setting up a big data product for analysis only to realize that critical data is spread across the organization in inaccessible and possibly unknown locations.

9.8.2 A lack of direction
"Collecting and analyzing the data is not enough; it must be presented in a timely fashion, so that decisions are made as a direct consequence that have a material impact on the productivity, profitability or efficiency of the organization. Most organizations are ill prepared to address both the technical and management challenges posed by big data; as a direct result, few will be able to effectively exploit this trend for competitive advantage." Unless firms know what questions they want to answer and what business objectives they hope to achieve, big data projects just won't bear fruit.

10. CONCLUSION
Finally, remember that big data is no panacea. You can find patterns and clues in your data, but first decide what problem you want to solve. If you pick a real business problem, such as how to change your advertising strategy to increase spend per customer, it will guide your implementation. While big data work benefits from an enterprising spirit, it benefits just as strongly from a concrete goal ahead.

As you explore what's new across the spectrum of big data capabilities, we suggest that you think about their integration into your existing infrastructure and BI investments. For example, align new operational and management capabilities with standard IT, build for enterprise scale and resilience, unify your database and development paradigms as you embrace open source, and share metadata wherever possible for both integration and analytics. Last but not least, expand IT governance to include a big data center of excellence to ensure business alignment, grow skills, manage open-source tools and technologies, share knowledge, establish standards, and manage best practices.

Fig 11: McKinsey Survey

Corporates vs. Big Data
"Experience certainty": big data is imperative for corporates facing the future.

Scale-Out Storage Systems: the Hadoop Technology Stack and Services
Corporates need strong partnerships with storage vendors and involvement in the architecture of large data centers with big data storage requirements. Most scale-out storage solutions today include Hadoop as part of the stack.

BI, Advanced and Predictive Analytics
Corporates need strong capability in business intelligence, data warehousing and advanced analytics. This experience centers on industry-leading products and on advanced and predictive analytics solutions, as in the cases of a "Listening Platform for Social Media" and "Supply Chain Predictive Analytics".

Vertical Domain Experience
Corporates need deep knowledge of the business imperatives of semiconductor, computer platform, consumer electronics and software product companies. This knowledge in turn helps set the right patterns for advanced analytics and define the correct rules for big data analytics.

What can be done?
The scarcity of big data and Hadoop knowledge creates a gap between requirements and resource availability. It can be closed by selecting interested associates and training them properly, creating a larger pool of associates with big data expertise for the future.
11. REFERENCES
[1] Edd Dumbill, http://strata.oreilly.com
[2] David Floyer, http://wikibon.org/wiki/v/Enterprise_Big-data
[3] http://en.wikipedia.org/wiki/Big_data
[4] Taylor Hatmaker, http://www.entrepreneur.com/article/224582
[5] Scott Jarr, http://voltdb.com/company/blog/big-data-value-continuum
[6] Oracle white paper on Enterprise Architecture
[7] McKinsey Global Institute analysis
[8] Victor Daily, http://www.techzost.com/2012/11/where-does-cloudcomputing-and-big-data.html
[9] http://wiki.apache.org/hadoop/PoweredBy
[10] TCS, Hadoop and Data Xplode
[11] http://www.tcs.com/resources/white_papers/Pages/Big-Data-Storage-Solutions.aspx

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee, provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice. To copy otherwise, to republish, to post on servers or to redistribute to lists requires prior specific permission and/or a fee.