1. Introduction: Cloud Computing and Big Data - Hadoop
Presented By:
Nagarjuna D.N
SAP CTL
AT&T, Bengaluru
Date: 14-07-2015
2. Overview
• Cloud Computing Evolution
• Why is Cloud Computing needed?
• Cloud Computing Models
• Cloud Solutions
• Cloud Job Opportunities
• Criteria for Big Data
• Big Data Challenges
• Technologies to process Big Data - Hadoop
• Hadoop History and Architecture
• Hadoop Ecosystem
• Hadoop Real-time Use Cases
• Hadoop Job Opportunities
• Hadoop and SAP HANA Integration
• Summary
3. Internet of Things (IoT) and Big Data
“One of the reasons behind both is Cloud Computing!”
4. Cloud Computing
(an evolution of the Internet; the infrastructure is hidden from the end user)
• Infrastructure is maintained somewhere with shared computing
resources (servers, storage, and networking), all delivered over the Internet.
• The Cloud delivers a hosting environment that is:
- immediate,
- flexible,
- scalable,
- secure,
- available,
and that saves corporations money, time, and resources.
5. Cloud Computing (Cont.)
• In addition, the platform provides on-demand services, i.e.,
always on: anywhere, anytime, and in any place.
• “Pay for what you use”: billed on a metered basis.
• It is based on utility computing and virtualization.
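The “pay-for-what-you-use” model can be sketched as a simple metered bill. The resource types and rates below are made-up placeholders for illustration, not any real provider's pricing:

```python
# Hypothetical illustration of "pay-for-what-you-use" metered billing.
# All rates below are invented examples, not a real provider's prices.

def metered_cost(cpu_hours, gb_stored, gb_transferred,
                 cpu_rate=0.05, storage_rate=0.02, transfer_rate=0.01):
    """Return the total bill: each resource is charged only for actual usage."""
    return (cpu_hours * cpu_rate
            + gb_stored * storage_rate
            + gb_transferred * transfer_rate)

# A server that used 100 CPU-hours, stored 50 GB, and transferred 20 GB:
# 100*0.05 + 50*0.02 + 20*0.01 = 6.2
print(metered_cost(100, 50, 20))
```

Idle resources cost nothing under this model, which is the contrast with buying and maintaining capital hardware up front.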
19. Enterprise Cloud Solutions
1. Test / Development / QA Platform
o Use cloud infrastructure servers as a test and development
platform
2. Disaster Recovery
o Keep images of servers on cloud infrastructure, ready to
go in case of a disaster
3. Cloud File Storage
o Back up or archive company data to cloud file storage
4. Load Balancing
o Use cloud infrastructure for overflow management during
peak usage times
20. Enterprise Cloud Solutions (Cont.)
5. Overhead Control
o Lower overhead costs and make bids more competitive
6. Distributed Network Control and Cost Reporting
o Create an individual private network (VPC) for each
subsidiary or contract
7. Rapid Deployment
o Turn up servers immediately to meet project timelines
8. Functional IT Labor Shift
o Refocus IT labor expense on revenue-producing activities
21. Preparing for Future Cloud IT Jobs
A sampling of IT skills likely to be in demand in the future:
o Functional application development and support
e.g., Oracle, SAP, SQL, linking hardware to software
o Leveraging data to make strategic business decisions
e.g., Business Intelligence: applying sales forecasts to inventory and
manufacturing decisions
o Mobile apps
Android, iPhone, Windows Mobile
o Wi-Fi engineers
The Universal Service Fund (USF) is expanding to include broadband
communications (LTE replaces GSM/CDMA)
o Optical engineers
Optical offers the highest bandwidth today (PON, CWDM, DWDM)
o Virtualization specialists
Economies of scale require virtualization (server, storage, client, …)
o IP engineers
o Network security specialists
o Web developers
o Social media developers
o Business Intelligence application development and support
24. “Big Data - the Big Thing”
• Big Data is much like a Rubik’s cube: it has many different solutions.
• Take five Rubik’s cubes, scramble them all the same way, and give one to
each of five different experts.
• Each expert will solve the cube in a matter of seconds.
• But if you watch closely, you will notice that even though the final
outcome is the same, the route taken to solve the cube is not.
• Every expert starts at a different place (color) and tries to solve it
with different methods.
• It is nearly impossible for two experts to take exactly the same route.
Beginning Big Data
26. Big Data: a General Definition
• Big Data is a collection of data sets that are large and complex in
nature.
• They comprise both structured and unstructured data that grows so
fast that it is no longer manageable by traditional relational
database systems (RDBMS).
27. Big Data, Technically
i. Volume
Petabytes or zettabytes.
ii. Velocity
Batch or real-time (stream) processing.
iii. Variety
Structured, semi-structured, and
unstructured.
It is estimated that 80% of the world’s data
is unstructured, and the rest is
semi-structured and structured.
iv. Veracity
The quality of the data being captured
can vary greatly.
Fig. Big Data, based on Doug Laney’s 3Vs model
28. Variety of Data
1. Structured data: data that is identifiable because it is organized in a
structure (a standard, defined format).
E.g.: databases, data warehouses, and electronic spreadsheets.
2. Semi-structured data: data that is neither raw data nor typed data in
a conventional database system.
E.g.: wiki pages, tweets, Facebook data, and instant messages.
3. Unstructured data: data that has no standard, defined structure.
E.g.: data files, audio files, video, graphics, and multimedia.
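As a toy illustration of the three varieties, the snippet below handles one tiny sample of each (the formats and values are chosen purely for the example):

```python
# Toy examples of the three data varieties: structured, semi-structured,
# and unstructured (illustrative sample data only).
import csv
import io
import json

# Structured: fixed schema, e.g. a CSV table with known columns.
structured = list(csv.DictReader(io.StringIO("id,name\n1,Alice\n2,Bob\n")))

# Semi-structured: self-describing, but with no rigid schema (e.g. JSON).
semi = json.loads('{"user": "bob", "tags": ["hadoop", "cloud"]}')

# Unstructured: free text; any structure must be inferred, e.g. by tokenizing.
unstructured = "Hadoop processes large volumes of raw text."
tokens = unstructured.lower().rstrip(".").split()

print(structured[0]["name"])  # Alice
print(semi["tags"][0])        # hadoop
print(len(tokens))            # 7
```

The structured rows can be queried by column name directly, the JSON needs its shape inspected at runtime, and the free text yields nothing until it is parsed; this gap is what makes mixed-variety data hard for a traditional RDBMS.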
29. Traditional Data vs. Big Data
Attribute          | Traditional Data           | Big Data
Volume             | Gigabytes to terabytes     | Petabytes to zettabytes
Organization       | Centralized                | Distributed
Structure          | Structured                 | Semi-structured & unstructured
Data model         | Strict, schema-based       | Flat schema
Data relationships | Complex interrelationships | Almost flat, with few relationships
30. Criteria of Big Data
1. 272 hours of video are uploaded to YouTube every minute, and
over 3 billion hours of video are watched every month.
2. Radio-frequency ID (RFID) systems generate up to 1,000 times
more data than conventional bar-code systems.
3. 340 million tweets are sent every day, amounting to about 7 TB of
data.
4. The social networking site Facebook processes over 10 TB of data
every day.
5. Over 5 billion people use cell phones to call, send SMS, email,
browse the Internet, and interact via social networking sites.
6. The international Square Kilometre Array project is designed to
receive around 700 TB of data per second.
31. Challenges with Big Data
1. Scaling is costly.
2. A strategy must be in place before you hit the limits of a single
computer.
3. Most enterprises respond to scalability needs only when they start
facing problems of poor response times and low throughput.
4. Adding hardware to an existing system is labour-intensive and
hence error-prone.
5. Mixed data types (structured and unstructured) make scaling even
harder.
41. Technology to process Big Data - Hadoop
(an open-source software framework written in Java)
• Open-source software: it is free to download, though more and
more commercial distributions of Hadoop are becoming available.
• Framework: everything you need to develop and run software
applications is provided: programs, connections, etc.
• Distributed storage: the Hadoop framework breaks big data into
blocks, which are stored on clusters of commodity hardware.
• Processing power: Hadoop concurrently processes large amounts
of data using multiple low-cost computers for fast results.
• Hadoop is a distributed file system (DFS), not a database; it is
designed for information in many forms.
• The open-source project was started by Doug Cutting, then an
employee of Yahoo!; “Hadoop” was the name of his son’s toy elephant.
• It is now an Apache Software Foundation project: Apache Hadoop.
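The block-splitting and replication of the distributed storage layer can be sketched in miniature. This is a simplified illustration, not Hadoop code: real HDFS in Hadoop 1.x uses 64 MB blocks and a default replication factor of 3 with rack-aware placement, whereas here the block size is scaled down to 64 bytes and placement is plain round-robin:

```python
# Minimal sketch of HDFS-style block splitting and replica placement.
# Numbers are scaled down for illustration: 64-byte blocks stand in for
# HDFS's 64 MB default; replication factor 3 matches the HDFS default.

def split_into_blocks(data: bytes, block_size: int):
    """Split a byte string into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin style.

    Real HDFS placement is rack-aware; round-robin is used here only to
    show that every block lives on several machines at once.
    """
    placement = {}
    for b in range(len(blocks)):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

data = b"x" * 200
blocks = split_into_blocks(data, block_size=64)
placement = place_replicas(blocks, ["node1", "node2", "node3", "node4"])
print(len(blocks))  # 4 blocks: 64 + 64 + 64 + 8 bytes
print(placement)
```

Because each block exists on three nodes, losing any single machine leaves every block still available, which is the basis of the fault tolerance discussed later in the deck.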
43. Hadoop Architecture
Hadoop core has two major components (daemons):
1. HDFS
a. NameNode
b. Secondary NameNode
c. DataNode
2. MapReduce Engine (distributed data processing framework)
a. JobTracker
b. TaskTracker
44. What components make up Hadoop?
• Hadoop Common – the libraries and utilities used by other Hadoop
modules.
• Hadoop Distributed File System (HDFS) – the Java-based
scalable system that stores data across multiple machines without
prior organization.
• MapReduce – a software programming model for processing large
sets of data in parallel.
• YARN – resource management framework for scheduling and
handling resource requests from distributed applications. (YARN is
an acronym for Yet Another Resource Negotiator.)
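The MapReduce programming model named above can be simulated in a few lines of plain Python. This is only a conceptual sketch of the map, shuffle, and reduce phases applied to word counting; it is not the Hadoop API (real jobs are typically written in Java against the MapReduce interfaces, and the framework performs the shuffle across the cluster):

```python
# Conceptual simulation of MapReduce word counting in plain Python.
# Not Hadoop code: it only mirrors the map -> shuffle -> reduce data flow.
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big hadoop", "hadoop big"]
mapped = [pair for line in lines for pair in map_phase(line)]
result = reduce_phase(shuffle(mapped))
print(result)  # {'big': 3, 'data': 1, 'hadoop': 2}
```

Because every line can be mapped independently and every key reduced independently, both phases parallelize naturally across the cluster's TaskTrackers.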
51. Benefits of Hadoop
• Scalable - new nodes can be added without needing to change
data formats.
• Cost-effective - Hadoop brings massively parallel computing to
commodity hardware.
• Flexible - Hadoop is schema-less and can absorb any type of data,
structured or not, from any number of sources.
• Fault-tolerant - when you lose a node, the system redirects work to
another location of the data and continues processing without
missing a beat.
• Programming languages - Java (the default) or Python.
• Last but not least - it’s free (open source)!
52. Hadoop is not Suitable for All Kinds of
Applications
Hadoop is not suitable for:
• real-time, stream-based processing, where data is
processed immediately upon its arrival.
• online access where low latency is required.
54. Real-Time Hadoop Use Cases
1. Risk modeling (how can banks
understand customers and markets?)
2. Customer churn analysis (why do
companies really lose customers?)
3. Ad targeting (how can companies
increase campaign efficiency?)
4. Point-of-sale transaction analysis (how do retailers
target promotions guaranteed to make you buy?)
5. Search quality
(what’s in your search?)
Public cloud
The cloud infrastructure is made available to the public on a commercial basis by a cloud service provider. This enables a consumer to develop and deploy a service in the cloud with very little financial outlay compared to the capital expenditure normally associated with other deployment options.
Private cloud
The cloud infrastructure is deployed, maintained, and operated for a specific organization. The operation may be in-house or handled by a third party on the premises.
Hybrid cloud
A hybrid cloud environment consists of some portion of computing resources on-site (on-premise) and off-site (public cloud). By integrating public cloud services, users can leverage cloud solutions for specific functions that are too costly to maintain on-premise, such as virtual-server disaster recovery, backups, and test/development environments.
Community cloud
A community cloud is formed when several organizations with similar requirements share common infrastructure. Costs are spread over fewer users than a public cloud, but over more than a single tenant.
Pricing models: on-demand, reserved, and bid (spot) IT infrastructure.
Reference: http://saphanatutorial.com/what-is-hadoop/
Default web UI ports of the Hadoop 1.x daemons:
NameNode - 50070
DataNode - 50075
Secondary NameNode - 50090
JobTracker - 50030
TaskTracker - 50060