2. Big Data, Hadoop, NoSQL and more …
Varad Meru
Software Development Engineer, Orzota, Inc.
varad@orzota.com
in.linkedin.com/in/vmeru
@vrdmr
© Orzota, Inc. 2013 2
3. Mission: Make big data easy for consumption
Offers Big Data/Hadoop Solutions and Software
Services to companies
Develops Software to help companies consume Big
Data
Founded in March 2012
Headquartered in Silicon Valley, California
Offshore offices in Chennai, India
About Orzota
4. We work on
o Big Data
o Hadoop
o Cloud Technologies
o Data Science
o Products and Services
o Everything that it takes to be a valued Player.
About Orzota (contd.)
5. Community Development
Occasional seminars by Architects, Engineers,
Managers.
We invite professionals and aspiring professionals to
join Big Data / Hadoop communities in their
geographies.
Pune Hadoop User Group – Participant + Organizer.
Chennai Hadoop User Group – Participant + Sponsor.
About Orzota (contd.)
6. About Me
• Orzota, Inc.
• Currently working with
Hadoop, Mahout, Cloud, etc.
• Past Work Experience
• Persistent Systems – Search,
Recommendation Engines and
User Behavior Analytics.
• Area of Interest
• Data Science, Information
Retrieval
• Distributed Systems
8. Agenda
• Introduction to BigData
• Technologies and Domain
• Hadoop EcoSystem
• Introduction to MapReduce
• Architecture – HDFS + MapReduce.
• NoSQL Databases
• CAP Theorem
• Different NoSQL Databases
• Other Trends
10. • What is Big Data?
• What does it mean to me?
• Why so much fuss in the industry?
• Who uses these technologies?
• How are they used in the Industry and Academia?
• When to start using them?
• How to learn them?
Big Data
11. • Volume - Amassing terabytes—even petabytes—of information.
• 12 terabytes of Tweets created each day.
• 350 billion annual meter readings.
• Velocity - Sometimes 2 minutes is too late.
• Scrutinize 5 million trade events.
• 500 million daily call detail records
• Variety - Big data is any type of data.
• 80% data growth in images, video and documents.
“Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of
processing to enable enhanced decision making, insight discovery and process optimization.”
– Douglas Laney. "The Importance of 'Big Data': A Definition"
Big Data – 3 Vs
12. Problem
• Store and Process Data for -
• Search Engines,
• Recommendations Engines,
• Fraud Detection,
• Aadhaar (Govt. of India),
• Spam Detection, etc.
• Also, in some cases Real-time (e.g. Facebook)
13. Solutions ?
• Classical Solutions
• Database + Programming Language (Java-Oracle, C#-
SQL Server)
• Data Warehouses – Teradata, Netezza, Microsoft PDW
• Legacy Network Systems
• Novel
• CORBA
• Java RMI – RPC
14. Problems of the Solutions
• Problems with Classical Solutions
• CAP Theorem, by Prof. Eric Brewer (Berkeley) –
• Choose any 2 of
Consistency, Availability and Partition tolerance
• ACID Properties
• For a small number of Transactions, the cumulative overhead is still
manageable.
• For a very large number of Transactions – Facebook Posts?
• Very High Licensing Fees.
• Closed Source – Stick with the Company’s Eco-System.
15. Solution to the Problems of the
Solutions
• Focus on Problem Domain
• What’s more important for your Solution?
• Consistency, Availability, or Partition tolerance
• Which Industry/Company already faces similar
Problems?
• How/Where to Collect Data?
• Technology Fields – Internet Companies
• Hadoop, NoSQL Datastores
• Open Source, Free and with Friendly Licenses.
17. Introduction
• Started by Doug Cutting and Mike Cafarella for Nutch –
Open Search Engine.
• Further Developed at Yahoo! and Facebook, with contributions
from people at many companies.
• Named after a Little Toy Elephant owned by Doug’s Son.
• Inspired by 2 research papers from Google
• The Google File System – 2003
• MapReduce – 2004
18. Introduction (contd.)
• Contains 3 modules
• Distributed File System
• MapReduce
• Common (A Java library containing common functions
used by both DFS and MapReduce)
• Apache Top Level Project
• Hadoop’s Website – hadoop.apache.org
• Two Parallel Release Cycles – 1.x and 2.x
19. • A Rich Eco-System built around Hadoop
• Hive – Large Scale Data Warehouse
• HBase – NoSQL Database
• Pig – A Data-flow language on top of Hadoop
• Flume – Log Management for Hadoop
• Oozie – Workflow framework
• Mahout – Machine Learning Library on top of Hadoop
• Vaidya – Performance diagnostic tool for MapReduce jobs.
• MRUnit – Unit testing framework for MapReduce Programs.
• And many more …
Introduction (contd.)
20. MapReduce in 2 minutes –
Problem Statement – Sum of the doubles of a set of numbers.
Input array: 1 3 4 5 6 8 9 11 17 21
The intermediate array after processing: 2 6 8 10 12 16 18 22 34 42
MapReduce
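As a minimal sketch of the map step in this example, written in plain Python rather than against the Hadoop API (the names `numbers` and `double` are illustrative assumptions), applying f(x) = 2x to every entry independently gives the intermediate array listed on the slide:

```python
# Input array from the slide.
numbers = [1, 3, 4, 5, 6, 8, 9, 11, 17, 21]


def double(x):
    """The per-entry function f(x) that the map step applies."""
    return 2 * x


# The "map" step: f(x) is applied to each entry on its own, so the
# work could be spread over many machines without coordination.
intermediate = [double(x) for x in numbers]
print(intermediate)
```

Because each call to `double` depends only on its own entry, the entries can be processed in any order and on any node.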
21. Introduction – contd.
Mapping Phase
• Splitting the input
• Sending the slaves (datanodes) the mapping code - f(x).
• Apply the f(x) method on the data split
[Figure: Mapping Phase. The Master Node contains the code of the function to be applied on individual entries of the array, written in the map() method in Hadoop. The code f(x) is sent to the Slave Nodes, which apply the logic to each data piece; in our case a data piece is an entry from the array. Entries on the Slave Nodes: 1, 1, 9, 8, 6, 11, 4, 3, 17, 21.]
22. Introduction – contd.
Spill Phase
• Masternode directs the Mappers to send the processed f(x) output data to an intermediate location.
• Shuffle and Sorting
[Figure: Spill Phase (Shuffle and Sort). The results of the processed data from the Slave Nodes are given to a specific node, where the reducer function runs. Values on the Slave Nodes: 2, 2, 18, 16, 12, 22, 8, 6, 34, 42.]
23. Introduction – contd.
Reduce Phase
• MasterNode (JobTracker) invokes the Reduce task once the spilling is over.
• Get the location of the Spill output from the MasterNode (JobTracker).
[Figure: Reducer Phase. The results of the processed data from the Slave Nodes are given to a specific node, where the reducer function runs and produces g(x)=162.]
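The three phases can be sketched end to end in plain Python. This is a single-process simulation under stated assumptions, not Hadoop code: the round-robin `split` helper, the choice of three simulated slave nodes, and the names `f` and `g` are all illustrative. The entries are the ones shown in the Mapping Phase figure.

```python
from functools import reduce

# Entries shown on the Mapping Phase slide.
entries = [1, 1, 9, 8, 6, 11, 4, 3, 17, 21]


def split(data, n_nodes):
    """Deal entries out round-robin, one chunk per simulated slave node."""
    return [data[i::n_nodes] for i in range(n_nodes)]


def f(x):
    """Map function: double a single entry."""
    return 2 * x


def g(values):
    """Reduce function: add up all mapped values."""
    return reduce(lambda a, b: a + b, values, 0)


splits = split(entries, 3)                               # splitting the input
mapped = [[f(x) for x in chunk] for chunk in splits]     # mapping phase, per node
shuffled = sorted(v for chunk in mapped for v in chunk)  # shuffle and sort
total = g(shuffled)                                      # reduce phase
print(total)  # 162
```

The result does not depend on how many nodes the input is split across, because addition is associative and commutative.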
24. Steps involved in writing a MapReduce program
• Write the Mapper
• Write the Reducer
• Write the Driver
Life’s Simple until you start customizing and work on
Data Cleansing
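To illustrate how the three pieces fit together, here is a hedged sketch in plain Python. Real Hadoop jobs are usually written in Java against the MapReduce API; the `mapper`, `reducer`, and `driver` functions below and the word-count task are illustrative assumptions, not taken from the talk.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def reducer(key, values):
    """Combine all values that share a key into one output pair."""
    return key, sum(values)


def driver(lines):
    """Wire the job together: map, sort and group by key, then reduce."""
    pairs = [pair for line in lines for pair in mapper(line)]
    pairs.sort(key=itemgetter(0))  # stands in for shuffle and sort
    return dict(
        reducer(key, (value for _, value in group))
        for key, group in groupby(pairs, key=itemgetter(0))
    )


counts = driver(["big data", "Big Data and Hadoop"])
print(counts)
```

The mapper and reducer know nothing about each other; only the driver decides what runs and in which order, which is why customization tends to concentrate there.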
MapReduce Programming
25. Hadoop – Bird’s Eye View
[Figure: a Name Node and a Job Tracker connected to many slave nodes, each running a DataNode (DN) and a TaskTracker (TT). Legend: DFS Message Path; MapReduce Processing Msg.]
27. Non-Relational Databases
• Data Model not bound by a Schema.
• No Predetermined Schema, Run-Time Columns
• Sample Data
• Twitter Streams
• Web Forms
• Sensor Networks
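A small sketch of what "no predetermined schema, run-time columns" can look like, using plain Python dictionaries as stand-ins for stored records. All field names and values here are made up for illustration and do not come from any particular database.

```python
# Three records in one collection; no schema is declared up front.
records = [
    {"source": "twitter", "user": "@example", "text": "hello"},
    {"source": "web_form", "name": "A. Student", "email": "a@example.com"},
    {"source": "sensor", "sensor_id": 7, "temperature_c": 21.5},
]

# A new column can appear at run time on a single record.
records[2]["humidity_pct"] = 40

# The set of "columns" is discovered from the data, not fixed in advance.
columns = sorted({field for record in records for field in record})
print(columns)
```

In a relational table the same change would need a schema migration that touches every row; here only the one record that has the new field carries it.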
Introduction
29. Business Requirements
• High Writes, Low Reads – Sensor Networks, Large Hadron
Collider, Click Logging.
• High Reads, Low Writes – Archival Storage.
• Don’t have any fixed Schema.
Open Question - Where Else?
30. NoSQL Types
• Key-Value Pair
• Riak, Voldemort, etc.
• Document Oriented
• CouchDB, MongoDB, etc.
• BigTable Implementations
• Cassandra, Hypertable, HBase, etc.
• Graph oriented
• Neo4j, etc.
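The four families differ mainly in the shape of the data they store. The sketch below models each shape with plain Python structures; it is an illustration under simplifying assumptions, not the API or storage format of any of the products named above, and every key, field, and the `neighbours` helper are hypothetical.

```python
# Key-value pair: an opaque value looked up by key.
key_value = {"session:42": b"serialized-session-bytes"}

# Document oriented: a nested, self-describing record.
document = {"_id": 1, "name": "alice", "skills": ["Hadoop", "Mahout"]}

# BigTable style: row key -> column family -> column -> value.
wide_column = {"row1": {"profile": {"name": "alice"}, "stats": {"logins": 3}}}

# Graph oriented: nodes plus labelled edges between them.
graph = {"nodes": {"a", "b"}, "edges": [("a", "WORKS_WITH", "b")]}


def neighbours(g, node):
    """Follow outgoing edges from a node (hypothetical helper)."""
    return [dst for src, _, dst in g["edges"] if src == node]


print(neighbours(graph, "a"))
```

Which shape fits best depends on the dominant access pattern: lookup by key, retrieval of whole records, sparse columns per row, or traversal of relationships.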
31. Introduction
Source: http://techcrunch.com/2012/10/27/big-data-right-now-five-trendy-open-source-technologies/
32. Wake up - Conclusion Time
• BigData on the Rise
• Technology and the Domain
• Smart Engineers needed, with BigData skills
• Chance to develop niche areas of Expertise even before
stepping into the Industry
• 3rd Year Students – Select your final year projects very
carefully, with the tools mentioned in this Seminar
• 4th Year Students – Equip yourself with the necessary
skills for better industry opportunities.
33. Recommendations
• I recommend aspiring professionals and young
professionals read:
• How to Solve it by Computer – RG Dromey
• Code Complete 2 – Steve McConnell
• Advanced Programming in the Unix Environment – Richard
Stevens
• Many Books on Hadoop, NoSQL Datastores, and Big Data
in general.
… and many more
35. Contact Us at –
Thank You
Linkedin.com/company/orzota-inc-
Twitter.com/orzota