A Non-Geek’s Big Data Playbook
Hadoop and the Enterprise Data Warehouse
by Tamara Dull
a SAS Best Practices white paper
A Non-Geek’s Big Data Playbook: Hadoop and the Enterprise Data Warehouse
table of contents
Introduction
terms & diagrams used in this playbook
	The Enterprise Data Warehouse
	Big Data and Hadoop
play 1: stage structured data
play 2: process structured data
play 3: process non-integrated & unstructured data
play 4: archive all data
play 5: access all data via the EDW
play 6: access all data via Hadoop
conclusion
Introduction

You: “No, we don’t need Hadoop. We don’t have any big data.”

Big Data Geek: “Wah wah wawah.”

You: “Seriously, even if we did have big data, that would be the least of our worries. We’re having enough trouble getting our ‘small’ data stored, processed and analyzed in a timely manner. First things first.”

Big Data Geek: “Wah wah wah wawah wah wah wawah.”

You: “What’s wrong with you? Do you even speak English?”

Big Data Geek: “Wawah.”

You: “Right. Anyway, the other challenge we’ve been having is that our data is growing faster than our data warehouse and our budget. Can Hadoop help with that?”

Big Data Geek: “Wah wah wawah wah wah wah wah wawah wah wawah!”

You: “Whatever, dude.”

We’ve all been there. We’re having lunch with a few colleagues, and the conversation shifts to big data. And then Charlie Brown’s teacher shows up. We try not to look at our watches.

If you’re interested in getting past the “wah wah wawah” of big data, then this white paper is for you. It’s designed to be a visual playbook for the non-geek yet technically savvy business professional who is still trying to understand how big data, specifically Hadoop, impacts the enterprise data game we’ve all been playing for years.

Specifically, this Big Data Playbook demonstrates in six common “plays” how Apache Hadoop, the open source poster child for big data technologies, supports and extends the enterprise data warehouse (EDW) ecosystem. It begins with a simple, popular play and progresses to more complex, integrated plays.
terms & diagrams used in this playbook

This reference section defines the key terms used in the play diagrams.

The Enterprise Data Warehouse

Figure 1 portrays a simplified, traditional EDW setup:
Each object in the diagram represents a key component in the EDW ecosystem:
•	 Structured data sources: This is the data creation component. Typically,
these are applications that capture transactional data that gets stored in a
relational database. Example sources include: ERP, CRM, financial data, POS
data, trouble tickets, e-commerce and legacy apps.
•	 Enterprise data warehouse (EDW): This is the data storage component. The EDW is a repository of integrated data from multiple structured data sources, used for reporting and data analysis. Data integration tools, such as ETL, are typically used to extract, transform and load structured data into a relational or column-oriented DBMS. Example storage components include: operational warehouse, analytical warehouse (or sandbox), data mart, operational data store (ODS) and data warehouse appliance.
•	 BI/analytics: This is the data action component. These are the applications, tools and utilities designed for users to access, interact with, analyze and make decisions using data in relational databases and warehouses. It’s important to note that many traditional vendors have also extended their BI/analytics products to support Hadoop. Example applications include: operational reporting, ad hoc queries, OLAP, descriptive analytics, predictive analytics, prescriptive analytics and data visualization.
Big Data and Hadoop
Figure 2 represents a simple standalone Hadoop setup:
Each object in this diagram represents key Hadoop-related components:
•	 Unstructured data sources: This is the data creation component. Typically, this is data that is not, or cannot be, stored in a structured, relational database; it includes both semi-structured and unstructured data sources. Example sources include: email, social data, XML data, videos, audio files, photos, GPS, satellite images, sensor data, spreadsheets, web log data, mobile data, RFID tags and PDF docs.
•	 Hadoop (HDFS): The Hadoop Distributed File System (HDFS) is the data
storage component of the open source Apache Hadoop project. It can
store any type of data – structured, semi-structured and unstructured. It is
designed to run on low-cost commodity hardware and is able to scale out
quickly and cheaply across thousands of machines.
•	 Big data apps: This is the data action component. These are the applications, tools and utilities natively built for users to access, interact with, analyze and make decisions using data in Hadoop and other nonrelational storage systems. It does not include traditional BI/analytics applications or tools that have been extended to support Hadoop.
Not represented directly in Figure 2 is MapReduce, the resource management and processing component of Hadoop. MapReduce allows Hadoop developers to write optimized programs that can process large volumes of data, structured and unstructured, in parallel across clusters of machines in a reliable and fault-tolerant manner. For instance, a programmer can use MapReduce to find friends or calculate the average number of contacts in a social network application, or process web access log stats to analyze web traffic volume and patterns.
Another benefit of MapReduce is that it processes the data where it resides (in HDFS)
instead of moving it around, as is sometimes the case in a traditional EDW system. It
also comes with a built-in recovery system – so if one machine goes down, MapReduce
knows where to go to get another copy of the data.
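For the curious, the map-and-reduce flow just described can be sketched in miniature with plain Python. This toy sketch counts page hits from web access log lines; the log format and the in-memory “shuffle” step are simplifications for illustration, and a real job would be written against Hadoop’s MapReduce APIs and run across a cluster:

```python
from collections import defaultdict

def mapper(line):
    # Each hypothetical log line looks like: "timestamp url status".
    # The map step emits one (url, 1) pair per hit.
    _, url, _ = line.split()
    yield url, 1

def reducer(key, values):
    # The reduce step sums the counts for one URL.
    yield key, sum(values)

def run(lines):
    # The "shuffle" step: group mapper output by key, as Hadoop does
    # between the map and reduce phases (here, just an in-memory dict).
    groups = defaultdict(list)
    for line in lines:
        for k, v in mapper(line):
            groups[k].append(v)
    return dict(kv for k, vs in sorted(groups.items()) for kv in reducer(k, vs))

logs = [
    "2014-01-01T10:00 /home 200",
    "2014-01-01T10:01 /pricing 200",
    "2014-01-01T10:02 /home 200",
]
print(run(logs))  # {'/home': 2, '/pricing': 1}
```

The same three phases – map, shuffle, reduce – are what Hadoop distributes across thousands of machines, with the data staying put in HDFS.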
Although MapReduce processing is lightning fast when compared to more traditional methods, its jobs must be run in batch mode. This has proven to be a limitation for organizations that need to process data more frequently and/or closer to real time. The good news is that with the release of Hadoop 2.0, the resource management functionality has been packaged separately (it’s called YARN) so that MapReduce doesn’t get bottlenecked and can stay focused on what it does best: processing data.
play 1
stage structured data
Use Hadoop as a data staging platform for your EDW.
With growing volumes of data and increasing requirements to process and analyze data even faster, organizations are faced with three options these days:

1.	Add more hardware and/or horsepower to their existing EDW and operational systems.
2.	Consider alternative ways to manage their data.
3.	Do nothing.

While option 1 is viable yet expensive, and option 3 could be the kiss of death for some, option 2 is where Hadoop steps in.
Consider this: What if you used Hadoop as a data staging platform to load your EDW?
You could write MapReduce jobs to bring the application data into the HDFS, transform
it and then send the transformed data on its way to your EDW. Two key benefits of this
approach would be:
•	 Storage costs. Because of the low cost of Hadoop storage, you could store both versions of the data in the HDFS: the “before” application data and the “after” transformed data. Your data would then all be in one place, making it easier to manage, reprocess (if needed) and analyze at a later date.
•	 Processing power. Processing data in Hadoop frees up EDW resources and gets data processed, transformed and into your EDW quicker so that the analysis work can begin.
Back in the early days of Hadoop, some went so far as to call Hadoop the “ETL killer,” putting ETL vendors at risk and on the defensive. Fortunately, these vendors quickly responded with new HDFS connectors, making it easier for organizations to optimize their ETL investments in this new Hadoop world.
If you’re experiencing rapid application data growth and/or you’re having trouble getting
all your ETL jobs to finish in a timely manner, consider handing off some of this work to
Hadoop – using your ETL vendor’s Hadoop/HDFS connector or MapReduce – and get
ahead of your data, not behind it.
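To make the “transform” step of this play concrete, here is a hypothetical sketch in Python: raw application records are staged as-is (the “before” data), cleansed into load-ready rows (the “after” data), and only then sent on to the EDW. The field names and cleansing rules are invented for illustration; in practice this logic would run as a MapReduce job or through your ETL vendor’s Hadoop/HDFS connector:

```python
import csv
import io

# Hypothetical raw application extract, staged untouched in HDFS
# (the "before" data) -- note the messy whitespace and casing.
RAW = """order_id,amount,currency
1001, 19.99 ,usd
1002,5.00,EUR
"""

def transform(record):
    """Cleanse one raw staged record into a load-ready EDW row:
    trim whitespace, cast the amount, normalize currency codes."""
    return {
        "order_id": int(record["order_id"]),
        "amount": float(record["amount"].strip()),
        "currency": record["currency"].strip().upper(),
    }

# The transformed rows (the "after" data) would also be kept in HDFS
# before loading into the EDW, so both versions live in one place.
rows = [transform(r) for r in csv.DictReader(io.StringIO(RAW))]
print(rows[0])  # {'order_id': 1001, 'amount': 19.99, 'currency': 'USD'}
```

The point of the play is where this runs: on low-cost Hadoop nodes rather than on EDW horsepower.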
play 2
process structured data
Use Hadoop to update data in your EDW and/or operational systems.
Contrary to popular belief, you don’t need “big” data to take advantage of the power of Hadoop. Not only can you use Hadoop to ease the ETL burden of processing your “small” data for your EDW (see Play 1), you can also use it to offload some of the processing work you’re currently asking your EDW to do.
Take, for example, social networks like Facebook and LinkedIn that maintain a list of mutual friends and contacts you have with each of your connections. This mutual list data is stored in a data warehouse and needs to be updated periodically to reflect your current state of connections. As you can imagine, this is a data-heavy, resource-intensive job for your EDW – a job that could easily be handled by Hadoop in a fraction of the time and cost.
Figure 4 shows this play in action: Send the data to be updated to Hadoop, let MapReduce do its thing, and then send the updated data back to your EDW. This applies not only to your EDW data, but also to any data being maintained in your operational and analytical systems.
Take advantage of Hadoop’s low-cost, high-speed processing power so that your EDW
and operational systems are freed up to do what they do best.
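The mutual-friends job mentioned above is a classic MapReduce exercise. The sketch below simulates it in plain Python: the map step emits each person’s friend list keyed by friendship pair, and the reduce step intersects the two lists it receives for each pair. The sample network is made up for illustration; a production version would run as a distributed MapReduce job:

```python
from itertools import groupby

def map_phase(friend_lists):
    """For each (person, friends) record, emit one key-value pair per
    friendship edge. The key is the sorted pair of names; the value is
    the person's full friend list."""
    for person, friends in friend_lists.items():
        for friend in friends:
            yield tuple(sorted((person, friend))), set(friends)

def reduce_phase(mapped):
    """Group by pair and intersect the two friend lists: anyone on
    both lists is a mutual friend of the pair."""
    mapped = sorted(mapped, key=lambda kv: kv[0])  # the "shuffle" step
    result = {}
    for key, group in groupby(mapped, key=lambda kv: kv[0]):
        lists = [friends for _, friends in group]
        result[key] = sorted(set.intersection(*lists) - set(key))
    return result

# A tiny, symmetric sample network (hypothetical data).
friends = {
    "alice": ["bob", "carol", "dave"],
    "bob":   ["alice", "carol"],
    "carol": ["alice", "bob", "dave"],
    "dave":  ["alice", "carol"],
}
mutual = reduce_phase(map_phase(friends))
print(mutual[("alice", "carol")])  # ['bob', 'dave']
```

Because each pair can be computed independently, the real job parallelizes cleanly across a Hadoop cluster – exactly the property that makes it a poor fit for the EDW and a good fit for this play.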
play 3
process non-integrated & unstructured data

Use Hadoop to take advantage of data that’s currently unavailable in your EDW.

This play focuses on two categories of data: (1) structured data sources that have not been integrated into your EDW and (2) unstructured data sources. More generally, it’s any data that’s not part of your current EDW ecosystem but could be providing additional insight into your customers, products and services.

Notwithstanding, this raises a higher-level, more philosophical question: Is all data valuable, and should our goal be to capture all of it so that we can process, analyze and discover greater insights in it? At many companies, such discussions are informed by a structured process for data governance, which is increasingly incorporating policy-making processes specific to Hadoop-resident data. It’s important to note that even if the answer is “yes, collect it all,” Hadoop is well-suited to help get us to this desired state.
In Figure 5, data is coming into Hadoop (HDFS) from both structured and unstructured
data sources. With the data in the HDFS, you have a couple of options:
•	 Process and keep the data in Hadoop (HDFS). You can then analyze the data
using big data apps or BI/analytics tools. (See Play 6 for more information.)
•	 Process and store the data in Hadoop (HDFS) and, optionally, push relevant
data into your EDW so that it can be analyzed with existing data. Be aware
that not all unstructured data can or should be structured for the EDW.
Because Hadoop can store any data, it complements the EDW well. For the data your
EDW cannot or doesn’t handle well, you can use Hadoop to pick up the slack.
play 4
archive all data

Use Hadoop to archive all your data on-premises or in the cloud.

This play is straightforward and one of the more popular plays. Since Hadoop runs on commodity hardware that scales out easily and quickly, organizations can now store and archive a lot more data at a much lower cost. To reduce costs even more, organizations can also archive their data in the cloud, thus freeing up more resources in the data center.

This is good news for IT, but it should also be music to the business professional’s ears. No longer does data need to be destroyed after its regulatory life to save on storage costs. No longer does the business analyst or data scientist need to limit his data analysis to the last three, five or seven years. Because Hadoop is open source software running on commodity hardware, and because its performance can outstrip that of traditional databases, decades of data can now be stored more easily and cost-effectively.
As demonstrated in Figure 6, archived data in Hadoop can be accessed and analyzed
using big data tools or, alternatively, using traditional BI/analytics tools that have been
extended to work with Hadoop. Ideally, the best tools to use will be those that are most
familiar to the data professionals and have been designed to handle the volume and
variety of archived data.
play 5
access all data via the EDW

Use Hadoop to extend your EDW as the center of your organization’s data universe.

This play is geared towards organizations that want to keep the EDW as the de facto system of record – at least for now. Hadoop is used to process and integrate structured and unstructured data to load into the EDW. The organization can continue using its BI/analytics tools to access data in the EDW and, alternatively, in Hadoop.
Figure 7 demonstrates how this works:
1.	 This step focuses on structured data that meets two criteria: (1) Data that is
currently being integrated (or can be integrated) into the EDW via ETL tools
and (2) data that does not need to be integrated with other non-integrated or
unstructured data sources (see Play 3).
2.	 This is structured data you want to integrate with other data, structured or
unstructured. The best place to bring it together is Hadoop.
3.	 This is any unstructured data that you want to process and possibly integrate
with other structured or unstructured data sources. This data can be stored
in Hadoop in its raw state.
4.	As outlined in Play 3, this is data that Hadoop has processed and integrated, and that can be loaded into the EDW. Given that this play is about making the EDW the central player, figuring out which data goes to the EDW will be key.
5.	All EDW data – from structured data sources and Hadoop – can now be accessed and analyzed via BI/analytics tools.
6.	Alternatively, BI/analytics tools that have been extended to work with Hadoop can access the data in Hadoop directly. Be aware that the Hadoop view, by design, will not be a complete view of the data, given that the EDW is the de facto system of record in this play.
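The routing decision behind the first three steps can be boiled down to a small rule. The sketch below is a hypothetical Python illustration – the dataset attributes are invented labels, not part of any real tool:

```python
def route(dataset):
    """Decide where a dataset lands in this play's flow.

    Hypothetical rules distilled from the numbered steps above:
    structured data needing no integration goes straight to the EDW;
    structured data needing integration goes to Hadoop first;
    unstructured data lands in Hadoop in its raw state.
    """
    if dataset["structured"] and not dataset["needs_integration"]:
        return "EDW via ETL"       # step 1
    if dataset["structured"]:
        return "Hadoop, then EDW"  # step 2
    return "Hadoop (raw)"          # step 3

print(route({"structured": True, "needs_integration": False}))  # EDW via ETL
```

However an organization encodes it, making this rule explicit is what keeps the EDW authoritative while Hadoop handles the messy integration work.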
play 6
access all data via Hadoop

Use Hadoop as the landing platform for all data and exploit the strengths of both the EDW and Hadoop.

This play is a paradigm shift in data management. Organizations have spent years building up and out their EDW ecosystems, and along comes Hadoop to upset the apple cart. This play is focused on exploiting the strengths of both technologies, EDW and Hadoop, so that organizations can extract even more value and insight from one of their greatest strategic assets – data.
Figure 8 shows how it could work:
1.	 Structured data is initially processed and stored in Hadoop. It is no longer
loaded directly into the EDW via ETL tools.
2.	 Unstructured data is collected, processed and integrated in Hadoop with
other structured and unstructured data sources as needed.
3.	 Hadoop then becomes the primary source to load clean, integrated data into
the EDW. The EDW also uses Hadoop to keep its data updated (see Play 2).
4.	 BI/analytics tools are then used to access and analyze data in both the EDW
and Hadoop, depending on the business requirement.
5.	Big data apps can and should be used to access and analyze data in Hadoop, now the repository for all data.
One advantage of capturing data in Hadoop is that it can be stored in its raw, native state. It does not need to be formatted upfront as with traditional, structured data stores; it can be formatted at the time of the data request. This process of formatting the data at the time of the query is called “late binding” and is a growing practice for companies that want to avoid protracted data transformation when loading the data. With late binding, the data request itself supplies the context for how the data is formatted. Thus, Hadoop programmers can save months of programming by loading data in its native state.
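Here is a minimal sketch of late binding (also called schema-on-read) in Python, assuming hypothetical JSON-formatted raw events: the data is stored exactly as it arrived, and each request applies only the schema it needs at query time.

```python
import json

# Raw events are stored exactly as they arrived -- no upfront schema.
# (Hypothetical event format for illustration.)
RAW_EVENTS = [
    '{"user": "u1", "action": "click", "ts": "2014-01-05"}',
    '{"user": "u2", "action": "view"}',
]

def query(raw_lines, fields):
    """Apply a schema at read time: each request names the fields it
    needs, and fields absent from a raw event come back as None."""
    for line in raw_lines:
        event = json.loads(line)
        yield {f: event.get(f) for f in fields}

# Two requests, two "schemas", one copy of the raw data.
clicks = list(query(RAW_EVENTS, ["user", "action"]))
audit = list(query(RAW_EVENTS, ["user", "ts"]))
print(audit[1])  # {'user': 'u2', 'ts': None}
```

Contrast this with the EDW approach, where the schema would have to be fixed – and every transformation written – before the first byte is loaded.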
Even though Figure 8 makes this data paradigm shift look simple, the road to get there will not be easy. Big data technologies, including Hadoop, are relatively new and still maturing, especially for the enterprise market. Additionally, you can expect to see significant growth and maturation in the development of big data applications. Given that data creation is growing exponentially, it follows that we’ll need better applications and tools to consume data and make it more meaningful, insightful and useful.
conclusion

You: “I always thought you needed big data if you were going to use Hadoop. That’s clearly not the case.”

Big Data Geek: “Wah wah wawah.”

You: “That’s exactly right, and that’s what I told my boss.”

Big Data Geek: “Wah wah wah wawah wah wah wawah.”

You: “Really? I’m going to check with IT and see how many years of data we’re currently archiving. Maybe we can bump it up if we implement Hadoop.”

Big Data Geek: “Wawah.”

You: “Good point. I think we’re currently archiving our data on-premises. Maybe we should look into moving some of this data off-site into the cloud.”

Big Data Geek: “Wah wah wawah wah wah wah wah wawah wah wawah!”

You: “Whatever, dude.”

Don’t fall into the trap of believing that Hadoop is a big-data-only solution. What’s important for, indeed incumbent on, the tech-savvy business professional is to identify the improvement opportunities that Hadoop can help address. Often, technical practitioners and programmers are so focused on getting the data in that they lose sight of the promised efficiencies, not to mention the business-specific constraints that might be lifted as a result of using Hadoop.

Whether or not you speak open source, and whether or not you’re database-fluent, you now have the business perspective to identify the optimal play(s) from the Big Data Playbook. In so doing, you can and will be the bridge between the problem and the solution.
[ you have been reading a SAS Best Practices white paper. thank you. ]
SAS Institute Inc.
100 SAS Campus Drive
Cary, NC 27513-2414
USA
Phone: 919-677-8000
Fax: 919-677-4444
about the author
Tamara Dull has over 25 years of technology services experience, with a strong foundation in data analysis, design and development. This has served as a natural bridge to Tamara’s current role as SAS’ thought leadership guru of big data, guiding the conversation on this emerging trend as she examines everything from basic principles to architectures and delivery best practices.

A pioneer in the development of social media and online strategy, Tamara has established dynamic marketing efforts and fostered robust online collaboration and community interaction. At Lyzasoft, she helmed a major release of the startup company’s enterprise collaboration software suite. Tamara also established an online community in her role as co-founder of Semper Vita, a non-profit charity website. As VP of Marketing at Baseline Consulting, she provided strategic leadership and online media expertise for key marketing efforts and branding.
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA
and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.
106947_S120094.0114

Hadoop introduction , Why and What is Hadoop ?sudhakara st
 
Machine Learning Hadoop
Machine Learning HadoopMachine Learning Hadoop
Machine Learning HadoopAletheLabs
 

Semelhante a Non-geek's big data playbook - Hadoop & EDW - SAS Best Practices (20)

Hadoop Business Cases
Hadoop Business CasesHadoop Business Cases
Hadoop Business Cases
 
Learn About Big Data and Hadoop The Most Significant Resource
Learn About Big Data and Hadoop The Most Significant ResourceLearn About Big Data and Hadoop The Most Significant Resource
Learn About Big Data and Hadoop The Most Significant Resource
 
A Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - IntroductionA Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - Introduction
 
Big data
Big dataBig data
Big data
 
Big data
Big dataBig data
Big data
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Harnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution TimesHarnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution Times
 
Hadoop Based Data Discovery
Hadoop Based Data DiscoveryHadoop Based Data Discovery
Hadoop Based Data Discovery
 
paper
paperpaper
paper
 
View on big data technologies
View on big data technologiesView on big data technologies
View on big data technologies
 
Hadoop for Finance - sample chapter
Hadoop for Finance - sample chapterHadoop for Finance - sample chapter
Hadoop for Finance - sample chapter
 
Hadoop & Data Warehouse
Hadoop & Data Warehouse Hadoop & Data Warehouse
Hadoop & Data Warehouse
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouse
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouse
 
finap ppt conference.pptx
finap ppt conference.pptxfinap ppt conference.pptx
finap ppt conference.pptx
 
Introduction-to-Big-Data-and-Hadoop.pptx
Introduction-to-Big-Data-and-Hadoop.pptxIntroduction-to-Big-Data-and-Hadoop.pptx
Introduction-to-Big-Data-and-Hadoop.pptx
 
Big data abstract
Big data abstractBig data abstract
Big data abstract
 
Big data-analytics-cpe8035
Big data-analytics-cpe8035Big data-analytics-cpe8035
Big data-analytics-cpe8035
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Machine Learning Hadoop
Machine Learning HadoopMachine Learning Hadoop
Machine Learning Hadoop
 

Último

Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaManalVerma4
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxHaritikaChhatwal1
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 

Último (20)

Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in India
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 

Non-geek's big data playbook - Hadoop & EDW - SAS Best Practices

  • 1. A Non-Geek’s Big Data Playbook Hadoop and the Enterprise Data Warehouse by Tamara dull best practices T H O U G H T P R O V O K I N G B U S I N E S S big DATA a SAS Best Practices white paper
  • 2. A Non-Geek’s Big Data Playbook: Hadoop and the Enterprise Data Warehouse big data 3 Introduction....................................................................................... 4 terms & diagrams used in this playbook.................................. 5 The Enterprise Data Warehouse............................................................... 5 Big Data and Hadoop.............................................................................. 6 play 1: stage structured data...................................................... 8 play 2: process structured data.............................................. 10 play 3: process non-integrated & unstructured data..... 11 play 4: archive all data.................................................................. 13 play 5: access all data via the EDW........................................... 14 play 6: access all data via Hadoop............................................ 16 conclusion......................................................................................... 18 table of contents
faster than our data warehouse and our budget. Can Hadoop help with that?"

Big Data Geek: "Wah wah wawah wah wah wah wah wawah wah wawah!"

You: "Whatever, dude."

We've all been there. We're having lunch with a few colleagues, and the conversation shifts to big data. And then Charlie Brown's teacher shows up. We try not to look at our watches.

If you're interested in getting past the "wah wah wawah" of big data, then this white paper is for you. It's designed to be a visual playbook for the non-geek yet technically savvy business professional who is still trying to understand how big data, specifically Hadoop, impacts the enterprise data game we've all been playing for years.

Specifically, this Big Data Playbook demonstrates in six common "plays" how Apache Hadoop, the open source poster child for big data technologies, supports and extends the enterprise data warehouse (EDW) ecosystem. It begins with a simple, popular play and progresses to more complex, integrated plays.
terms & diagrams used in this playbook

This reference section defines the key terms used in the play diagrams.

The Enterprise Data Warehouse

Figure 1 portrays a simplified, traditional EDW setup. Each object in the diagram represents a key component in the EDW ecosystem:

• Structured data sources: This is the data creation component. Typically, these are applications that capture transactional data that gets stored in a relational database. Example sources include: ERP, CRM, financial data, POS data, trouble tickets, e-commerce and legacy apps.

• Enterprise data warehouse (EDW): This is the data storage component. The EDW is a repository of integrated data from multiple structured data sources used for reporting and data analysis. Data integration tools, such as ETL, are typically used to extract, transform and load structured data into a relational or column-oriented DBMS. Example storage components include: operational warehouse, analytical warehouse (or sandbox), data mart, operational data store (ODS) and data warehouse appliance.

• BI/analytics: This is the data action component. These are the applications, tools and utilities designed for users to access, interact with, analyze and make decisions using data in relational databases and warehouses. It's important to note that many traditional vendors have also extended their BI/analytics products to support Hadoop. Example applications include: operational reporting, ad hoc queries, OLAP, descriptive analytics, predictive analytics, prescriptive analytics and data visualization.
Big Data and Hadoop

Figure 2 represents a simple standalone Hadoop setup. Each object in this diagram represents a key Hadoop-related component:

• Unstructured data sources: This is the data creation component. Typically, this is data that is not, or cannot be, stored in a structured, relational database. It includes both semi-structured and unstructured data sources. Example sources include: email, social data, XML data, videos, audio files, photos, GPS, satellite images, sensor data, spreadsheets, web log data, mobile data, RFID tags and PDF docs.

• Hadoop (HDFS): The Hadoop Distributed File System (HDFS) is the data storage component of the open source Apache Hadoop project. It can store any type of data: structured, semi-structured and unstructured. It is designed to run on low-cost commodity hardware and is able to scale out quickly and cheaply across thousands of machines.

• Big data apps: This is the data action component. These are the applications, tools and utilities that have been natively built for users to access, interact with, analyze and make decisions using data in Hadoop and other nonrelational storage systems. This category does not include traditional BI/analytics applications or tools that have been extended to support Hadoop.
Not represented directly in Figure 2 is MapReduce, the resource management and processing component of Hadoop. MapReduce allows Hadoop developers to write optimized programs that can process large volumes of data, structured and unstructured, in parallel across clusters of machines in a reliable and fault-tolerant manner. For instance, a programmer can use MapReduce to find friends or calculate the average number of contacts in a social network application, or process web access log stats to analyze web traffic volume and patterns.

Another benefit of MapReduce is that it processes the data where it resides (in HDFS) instead of moving it around, as is sometimes the case in a traditional EDW system. It also comes with a built-in recovery system: if one machine goes down, MapReduce knows where to go to get another copy of the data.

Although MapReduce processing is lightning fast when compared to more traditional methods, its jobs must be run in batch mode. This has proven to be a limitation for organizations that need to process data more frequently and/or closer to real time. The good news is that with the release of Hadoop 2.0, the resource management functionality has been packaged separately (it's called YARN) so that MapReduce doesn't get bottlenecked and can stay focused on what it does best: processing data.
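The web log example above can be sketched in a few lines of plain Python. This is a local simulation of the map, shuffle and reduce phases, not actual Hadoop code, and the log lines are made up for illustration:

```python
from collections import defaultdict

# Hypothetical web access log records: "client_ip url status"
LOG_LINES = [
    "10.0.0.1 /index.html 200",
    "10.0.0.2 /products 200",
    "10.0.0.1 /index.html 200",
    "10.0.0.3 /index.html 404",
]

def map_phase(line):
    """Map: emit a (url, 1) pair for every log record."""
    _ip, url, _status = line.split()
    yield (url, 1)

def reduce_phase(url, counts):
    """Reduce: sum all the counts emitted for one url."""
    return (url, sum(counts))

def run_job(lines):
    # Shuffle: group mapped values by key, as Hadoop does between phases.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_phase(line):
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())

print(run_job(LOG_LINES))  # {'/index.html': 3, '/products': 1}
```

In a real cluster, the map and reduce functions run in parallel on the machines that hold the data, which is where the speed comes from.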
play 1: stage structured data

Use Hadoop as a data staging platform for your EDW.

With growing volumes of data and increasing requirements to process and analyze data even faster, organizations are faced with three options these days:

1. Add more hardware and/or horsepower to their existing EDW and operational systems.
2. Consider alternative ways to manage their data.
3. Do nothing.

While option 1 is viable yet expensive, and option 3 could be the kiss of death for some, option 2 is where Hadoop steps in.
Consider this: What if you used Hadoop as a data staging platform to load your EDW? You could write MapReduce jobs to bring the application data into HDFS, transform it and then send the transformed data on its way to your EDW. Two key benefits of this approach would be:

• Storage costs. Because of the low cost of Hadoop storage, you could store both versions of the data in HDFS: the before application data and the after transformed data. Your data would then all be in one place, making it easier to manage, reprocess (if needed) and analyze at a later date.

• Processing power. Processing data in Hadoop frees up EDW resources and gets data processed, transformed and into your EDW quicker so that the analysis work can begin.

Back in the early days of Hadoop, some went so far as to call Hadoop the "ETL killer," putting ETL vendors at risk and on the defensive. Fortunately, these vendors quickly responded with new HDFS connectors, making it easier for organizations to optimize their ETL investments in this new Hadoop world.

If you're experiencing rapid application data growth and/or you're having trouble getting all your ETL jobs to finish in a timely manner, consider handing off some of this work to Hadoop (using your ETL vendor's Hadoop/HDFS connector or MapReduce) and get ahead of your data, not behind it.
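The staging idea boils down to a transform step that runs before the warehouse load. A minimal sketch in Python, with hypothetical field names and a toy CSV standing in for raw application data landed in HDFS:

```python
import csv
import io

# Hypothetical raw application export, staged "as landed" in Hadoop.
RAW_CSV = """order_id,amount,currency
1001, 19.99 ,usd
1002, 5.00 ,USD
"""

def transform(raw_text):
    """Clean the staged rows before they are loaded into the EDW:
    trim whitespace, type the amount, normalize currency codes."""
    rows = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        rows.append({
            "order_id": int(row["order_id"]),
            "amount": float(row["amount"].strip()),
            "currency": row["currency"].strip().upper(),
        })
    return rows

clean = transform(RAW_CSV)
print(clean[0])  # {'order_id': 1001, 'amount': 19.99, 'currency': 'USD'}
```

In this play, both the raw text and the cleaned rows would stay in HDFS; only the cleaned version moves on to the warehouse.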
play 2: process structured data

Use Hadoop to update data in your EDW and/or operational systems.

Contrary to popular belief, you don't need "big" data to take advantage of the power of Hadoop. Not only can you use Hadoop to ease the ETL burden of processing your "small" data for your EDW (see Play 1), you can also use it to offload some of the processing work you're currently asking your EDW to do.

Take, for example, social networks like Facebook and LinkedIn that maintain a list of mutual friends and contacts you have with each of your connections. This mutual list data is stored in a data warehouse and needs to be updated periodically to reflect your current state of connections. As you can imagine, this is a data-heavy, resource-intensive job for your EDW, and one that could easily be handled by Hadoop in a fraction of the time and cost.

Figure 4 shows this play in action: Send the data to be updated to Hadoop, let MapReduce do its thing, and then send the updated data back to your EDW. This applies not only to your EDW data, but also to any data that is being maintained in your operational and analytical systems. Take advantage of Hadoop's low-cost, high-speed processing power so that your EDW and operational systems are freed up to do what they do best.
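The mutual-contacts job described above is easy to picture in code. A toy Python sketch with made-up users (in the actual play, the connection lists would be shipped from the EDW to Hadoop, recomputed there in parallel, and sent back):

```python
# Hypothetical connection lists keyed by user.
CONNECTIONS = {
    "ann": {"bob", "cat", "dan"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "bob", "dan"},
}

def mutual_contacts(connections):
    """For each pair of connected users, compute the contacts they share."""
    result = {}
    for user, friends in connections.items():
        for friend in friends:
            if friend in connections:
                pair = tuple(sorted((user, friend)))
                # Set intersection: contacts both users have in common.
                result[pair] = connections[user] & connections[friend]
    return result

print(mutual_contacts(CONNECTIONS)[("ann", "bob")])  # {'cat'}
```

Recomputing every pair is exactly the kind of brute-force, embarrassingly parallel work that suits a Hadoop cluster better than a warehouse.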
play 3: process non-integrated & unstructured data

Use Hadoop to take advantage of data that's currently unavailable in your EDW.

This play focuses on two categories of data: (1) structured data sources that have not been integrated into your EDW and (2) unstructured data sources. More generally, it's any data that's not part of your current EDW ecosystem but could be providing additional insight into your customers, products and services.

Notwithstanding, this raises a higher-level, more philosophical question: Is all data valuable, and should our goal be to capture all of it so that we can process, analyze and discover greater insights in it? At many companies, such discussions are informed by a structured process for data governance, which is increasingly incorporating policy-making processes specific to Hadoop-resident data. It's important to note that even if the answer is "yes, collect it all," Hadoop is well-suited to help get us to this desired state.
In Figure 5, data is coming into Hadoop (HDFS) from both structured and unstructured data sources. With the data in HDFS, you have a couple of options:

• Process and keep the data in Hadoop (HDFS). You can then analyze the data using big data apps or BI/analytics tools. (See Play 6 for more information.)

• Process and store the data in Hadoop (HDFS) and, optionally, push relevant data into your EDW so that it can be analyzed with existing data. Be aware that not all unstructured data can or should be structured for the EDW.

Because Hadoop can store any data, it complements the EDW well. For the data your EDW cannot or doesn't handle well, you can use Hadoop to pick up the slack.
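A tiny sketch of the second option, with hypothetical data and plain Python standing in for a Hadoop job: the raw records stay untouched in HDFS, and only a small structured extract is pushed to the EDW.

```python
import json

# Hypothetical raw social posts, stored in their native state in HDFS.
RAW_POSTS = [
    '{"user": "ann", "text": "Loving the new phone! #happy"}',
    '{"user": "bob", "text": "Battery died again #fail #fail"}',
]

def extract_for_edw(raw_posts):
    """Distill each unstructured post into a small structured row
    (user plus a hashtag count) suitable for loading into the EDW."""
    rows = []
    for line in raw_posts:
        post = json.loads(line)
        tags = [w for w in post["text"].split() if w.startswith("#")]
        rows.append({"user": post["user"], "hashtags": len(tags)})
    return rows

print(extract_for_edw(RAW_POSTS))
# [{'user': 'ann', 'hashtags': 1}, {'user': 'bob', 'hashtags': 2}]
```

The full text never has to fit the warehouse schema; only the distilled rows do, which is the point of the play.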
play 4: archive all data

Use Hadoop to archive all your data on-premises or in the cloud.

This play is straightforward and one of the more popular plays. Since Hadoop runs on commodity hardware that scales out easily and quickly, organizations are now able to store and archive a lot more data at a much lower cost. To reduce costs even more, organizations can also archive their data in the cloud, thus freeing up more resources in the data center.

This is good news for IT, but it should also be music to the business professional's ears. No longer does data need to be destroyed after its regulatory life to save on storage costs. No longer does the business analyst or data scientist need to limit the data analysis to the last three, five or seven years. Because Hadoop is open source software running on commodity hardware, and because performance can outstrip that of traditional databases, decades of data can now be stored more easily and cost-effectively.

As demonstrated in Figure 6, archived data in Hadoop can be accessed and analyzed using big data tools or, alternatively, using traditional BI/analytics tools that have been extended to work with Hadoop. Ideally, the best tools to use will be those that are most familiar to the data professionals and have been designed to handle the volume and variety of archived data.
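One common way to lay such an archive out (a sketch with hypothetical paths; a real cluster would create these directories with HDFS tooling) is to partition files by year, so an analysis job can prune decades of data down to just the range it needs:

```python
# Hypothetical year-partitioned archive layout in HDFS.
ARCHIVE_PATHS = [
    "/archive/orders/year=1998/part-0000",
    "/archive/orders/year=2005/part-0000",
    "/archive/orders/year=2013/part-0000",
]

def paths_for_range(paths, first_year, last_year):
    """Keep only the partitions whose year= component falls in the
    requested range, so a job never reads decades it doesn't need."""
    selected = []
    for p in paths:
        year = int(p.split("year=")[1].split("/")[0])
        if first_year <= year <= last_year:
            selected.append(p)
    return selected

print(paths_for_range(ARCHIVE_PATHS, 2000, 2010))
# ['/archive/orders/year=2005/part-0000']
```

Partition pruning like this is what makes "keep everything" affordable to query, not just affordable to store.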
play 5: access all data via the EDW

Use Hadoop to extend your EDW as the center of your organization's data universe.

This play is geared towards organizations that want to keep the EDW as the de facto system of record, at least for now. Hadoop is used to process and integrate structured and unstructured data to load into the EDW. The organization can continue using its BI/analytics tools to access data in the EDW and, alternatively, in Hadoop.

Figure 7 demonstrates how this works:

1. This step focuses on structured data that meets two criteria: (1) data that is currently being integrated (or can be integrated) into the EDW via ETL tools and (2) data that does not need to be integrated with other non-integrated or unstructured data sources (see Play 3).

2. This is structured data you want to integrate with other data, structured or unstructured. The best place to bring it together is Hadoop.

3. This is any unstructured data that you want to process and possibly integrate with other structured or unstructured data sources. This data can be stored in Hadoop in its raw state.

4. As outlined in Play 3, this is data that Hadoop has processed and integrated, and that can be loaded into the EDW. Given that this play is about making the EDW the central player, figuring out which data goes to the EDW will be key.

5. All EDW data, from structured data sources and Hadoop, can now be accessed and analyzed via BI/analytics tools.

6. Alternatively, BI/analytics tools that have been extended to work with Hadoop can access the data in Hadoop directly. Be aware that the Hadoop view, by design, will not be a complete view of the data, given that the EDW is the de facto system of record in this play.
play 6: access all data via Hadoop

Use Hadoop as the landing platform for all data and exploit the strengths of both the EDW and Hadoop.

This play is a paradigm shift in data management. Organizations have spent years building their EDW ecosystems up and out, and along comes Hadoop to upset the apple cart. This play is focused on exploiting the strengths of both technologies, EDW and Hadoop, so that organizations can extract even more value and insight from one of their greatest strategic assets: data.

Figure 8 shows how it could work:

1. Structured data is initially processed and stored in Hadoop. It is no longer loaded directly into the EDW via ETL tools.

2. Unstructured data is collected, processed and integrated in Hadoop with other structured and unstructured data sources as needed.

3. Hadoop then becomes the primary source for loading clean, integrated data into the EDW. The EDW also uses Hadoop to keep its data updated (see Play 2).

4. BI/analytics tools are then used to access and analyze data in both the EDW and Hadoop, depending on the business requirement.

5. Big data apps can and should be used to access and analyze data in Hadoop, now the repository for all data.
One advantage of capturing data in Hadoop is that it can be stored in its raw, native state. It does not need to be formatted upfront as with traditional, structured data stores; it can be formatted at the time of the data request. This process of formatting the data at the time of the query is called “late binding” and is a growing practice for companies that want to avoid protracted data transformation when loading the data. Late binding ensures that there is context to the data formats depending on the data request itself. Thus, Hadoop programmers can save months of programming by loading data in its native state.

Even though Figure 8 makes this data paradigm shift look simple, the road to get there will not be easy. Big data technologies, including Hadoop, are relatively new and are still maturing, especially for the enterprise market. Additionally, you can expect to see significant growth and maturation in the development of big data applications. Given that data creation is growing exponentially, it follows that we’ll need better applications and tools to consume data and make it more meaningful, insightful and useful.
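Late binding – often called “schema-on-read” – can be illustrated with a few lines of Python. The sample records and field names below are made up; in a real cluster the raw files would sit in HDFS and a tool such as Hive would apply the schema at query time:

```python
import json

# Raw events stored in their native state: no schema or typing imposed at load time.
raw_events = [
    '{"user": "ana", "amount": "19.99", "ts": "2014-01-06"}',
    '{"user": "ben", "amount": "5.00",  "ts": "2014-01-07"}',
]

def query_total_amount(raw_lines):
    """Late binding: structure and types are applied only when the query runs."""
    total = 0.0
    for line in raw_lines:
        record = json.loads(line)          # structure applied at read time
        total += float(record["amount"])   # type applied at read time
    return total

print(round(query_total_amount(raw_events), 2))  # → 24.99
```

Nothing about `amount` was declared when the events were loaded; the decision that it is a number belongs to this particular query. A different query could read the very same raw lines with a different interpretation, which is the flexibility the “context to the data formats” sentence above refers to.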
conclusion

You: “I always thought you needed big data if you were going to use Hadoop. That’s clearly not the case.”
Big Data Geek: “Wah wah wawah.”
You: “That’s exactly right, and that’s what I told my boss.”
Big Data Geek: “Wah wah wah wawah wah wah wawah.”
You: “Really? I’m going to check with IT and see how many years of data we’re currently archiving. Maybe we can bump it up if we implement Hadoop.”
Big Data Geek: “Wawah.”
You: “Good point. I think we’re currently archiving our data on-premises. Maybe we should look into moving some of this data off-site into the cloud.”
Big Data Geek: “Wah wah wawah wah wah wah wah wawah wah wawah!”
You: “Whatever, dude.”

Don’t fall into the trap of believing that Hadoop is a big-data-only solution. What’s important for, indeed incumbent on, the tech-savvy business professional is to identify the improvement opportunities that Hadoop can help address. Often, technical practitioners and programmers are so focused on getting the data in that they lose sight of the promised efficiencies, not to mention business-specific constraints that might be lifted as a result of using Hadoop.

Whether or not you speak open source and whether or not you’re database-fluent, you now have the business perspective to identify the optimal play(s) from the Big Data Playbook. In so doing, you can and will be the bridge between the problem and the solution.
[ you have been reading a SAS Best Practices white paper. thank you. ]
about the author

Tamara Dull has over 25 years of technology services experience, with a strong foundation in data analysis, design and development. This has served as a natural bridge to Tamara’s current role as SAS’ thought leadership guru of big data, guiding the conversation on this emerging trend as she examines everything from basic principles to architectures and delivery best practices.

A pioneer in the development of social media and online strategy, Tamara has established dynamic marketing efforts and fostered robust online collaboration and community interaction. At Lyzasoft, she helmed a major release of the startup company’s enterprise collaboration software suite. Tamara also established an online community in her role as the co-founder of Semper Vita, a non-profit charity website. Her role as the VP of Marketing at Baseline Consulting saw her provide strategic leadership and online media expertise for key marketing efforts and branding.

SAS Institute Inc. 100 SAS Campus Drive Cary, NC 27513-2414 USA Phone: 919-677-8000 Fax: 919-677-4444

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies. 106947_S120094.0114