Scaling up with Hadoop & Parallel Processing Framework (Banyan)
LatentView Analytics
2015
Agenda
1 Introducing LatentView Analytics
2 Data Processing Frameworks and a brief history of Hadoop
3 Solving the Big Data Problem with Hadoop, Spark & Storm
4 The Unstructured Maze
5 Banyan – A Parallel Processing Framework
6 Demo
1 Introducing LatentView Analytics
LatentView Analytics
• Developed solutions for 25 Fortune 500 firms
• Over 500 people strong
• With more than 1000 years of combined experience in analytics
• Engaging 30K followers on social media
LatentView in the News
• LatentView won the Deloitte Technology Fast 50 India awards for 6 consecutive years (2009 – 13)
• ‘Top Innovator’ awarded to LatentView by Developer Week (Conference & Festival 2013)
• LatentView was a Top Finalist in the ‘We Love Our Workplace 2013’ category, reflecting global recognition of our workplace culture
• LatentView is an Advanced Consulting Partner with Amazon Web Services
• LatentView is an Alliance Partner with Tableau
Services provided by LatentView
• Build Reporting and Analytics Centers of Excellence (COEs)
• Analyze business problems both qualitatively & quantitatively and provide actionable insights
• Onsite-Offshore Global Delivery model that helps in-house teams do more with less
• Provide Thought Leadership in Data Science
Industry Specific Analysis: Market Basket Analysis, Campaign Analytics, Fraud Detection, Survey Analytics, Customer Lifetime Value, Demand Forecasting, Price Optimization, Social Media Analysis
Work @ LatentView
Different Data Sources & Formats: Mobile, PC, Tablet, Signal & Wireless Data, Servers & Cloud, Social, User Profile, Surveys & Reviews, Travel & Location, Performance, System Logs & Database data, Unstructured Data
Technology & Predictive Analysis Tool Kits: Data Engineering & Advanced Analytics, Infrastructure, Databases, Predictive Modelling, CXO Dashboards & Visualization
2 Data Processing Frameworks and a brief history of Hadoop
Data Processing Frameworks
Distributed Processing vs. Parallel Processing
Distributed Processing Characteristics
• Master/Slave or Peer-to-Peer architecture
• Data Replication and redundancy
• Fault tolerant, Shared Memory
• Centralized Job distribution
• Efficient Job scheduling
• Coordinated Resource Management
• Process Structured & Semi-Structured data
• Examples: Hadoop, Spark, Storm
Parallel Processing Characteristics
• Shared Nothing Massively Parallel architecture
• Common or Independent Storage
• Independent Memory & Processor space
• Random Job distribution
• Self-managed resources & workers
• Dynamic load-balanced cluster
• Process Unstructured data
• Examples: Banyan
The Search Context
SMART (information retrieval system)
• Gerard Salton, Father of Modern Search Technology
• Salton’s Magic Automatic Retriever of Text
• Inverse Document Frequency (IDF), Term Frequency (TF), term discrimination values
Project Xanadu
• Ted Nelson – coined the term hypertext
• Create a computer network with a simple UI to solve social problems like attribution
• Inspired creation of WWW
ARPANet
• Advanced Research Projects Agency Network
• Led to the Internet
• First implementation of the TCP/IP stack
Archie Query Form
• Document Search & Find Tool
• Script-based data gatherer with a regular expression matcher for retrieving files
• A database of web filenames which it would match with the users' queries
FTP & WWW
• Enter Tim Berners-Lee
• httpd, TCP, DNS – connected it all
Ask, AltaVista, Yahoo, Google, Bing
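The term frequency (TF) and inverse document frequency (IDF) weights mentioned for SMART can be illustrated with a small sketch. The toy corpus, the function name `tf_idf`, and the specific weighting variant (count divided by document length, multiplied by log(N/df)) are assumptions chosen for illustration; many variants of the formula exist.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    TF  = occurrences of the term / number of terms in the document
    IDF = log(number of documents / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus, for illustration only.
corpus = [
    "hadoop stores data",
    "hadoop processes data",
    "banyan processes images",
]
scores = tf_idf(corpus)
```

In this corpus "stores" occurs in only one document, so it gets a higher weight in the first document than "hadoop", which occurs in two. Rare terms discriminate between documents better than common ones.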
What is the biggest problem that the search engines of the last two decades solve?
Building Lucene
• Project Lucene was written by Doug Cutting in 1999. It was written purely in Java
• It was written with the intention of helping to create an open source web search engine
• Lucene is just an indexing and search library and does not contain crawling and HTML parsing functionality
Building Nutch
• Lucene was not able to crawl or parse HTML by itself, so a sub-project called Nutch was developed under it
• Doug Cutting & Mike Cafarella
• Highly modular architecture that allows developers to create plug-ins for media-type parsing, data retrieval, querying and clustering
The Heart of Hadoop – Distributed File System & Map Reduce
• Google File System paper was presented
• NDFS was developed based on the paper
• Google released another paper, on MapReduce, that revolutionized Hadoop development
• MapReduce tries to collocate the data with the compute node, so data access is fast since it is local. This is known as Data Locality
Algorithms in Hadoop
• Ported Nutch algorithms to Hadoop
• Yahoo hires Doug Cutting!
• Apache Hadoop comes into the picture to support Map Reduce & HDFS
• Yahoo’s Grid team adopts Hadoop
• Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours
Hadoop Benchmarks!
• Yahoo! set up a Hadoop research cluster (300 nodes). Also, Sort benchmark run on 500 nodes in 42 hours (better hardware than the April benchmark)
• Research cluster upgraded to 600 nodes
• In 2008, won the 1 terabyte sort benchmark in 209 seconds on 900 nodes
• Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster
Hadoop In Action!
• As of 2008, loading 10 terabytes of data per day onto research clusters
• 17 clusters with a total of 24,000 nodes
• In 2009, won the minute sort by sorting 500 GB in 59 seconds (on 1,400 nodes) and the 100 TB sort in 173 minutes (on 3,400 nodes)
• Last.fm, Facebook, New York Times
A Brief History of Hadoop (timeline and key challenges addressed)
• Year: 1999 – Efficient indexing of the results for easy retrieval
• Year: 2002 – Efficient crawling of the World Wide Web at scale
• Year: 2003 – File System
• Year: 2005
• Year: 2006 – Algorithms in Hadoop: cost & time efficient hardware and software
• Year: 2008 – Hadoop in Action!: distributed processing, data warehousing & analysis; UI & tools
Hadoop Services Ecosystem
• Hadoop Distributions: Apache Hadoop, Apache Bigtop
• Hadoop as a Platform: Cloudera, Hortonworks, MapR
• Hadoop as a Service: GoGrid, Qubole, Altiscale, AWS EMR, Azure HDInsight, IBM BigInsights
Elephant it may be, but it has its limitations!
• Master Slave Architecture
• Batch vs. Real-Time Stream processing
• Relational database vs. NoSQL database
• Data fragmentation and management
• Parallel processing job requirements
• Efficient energy management
Applications of Big Data Processing
Predict Galaxy types and shapes
Analyzing Life Forms
Weather Forecast
Traffic Management
Disaster Recovery
Personal Health Care
Areas: Science & Engineering, Environmental Management, Intelligent Devices & IoT
3 Solving the Big Data Problem with Hadoop, Spark & Storm
Apache Hadoop & Family
Identify the Apache Hadoop Components!
The Apache Hadoop Stack
• Hadoop User Experience (HUE)
• Pig (Scripting), Hive (SQL), Mahout (ML), Oozie (Workflow)
• HBase (Columnar data store)
• ZooKeeper (Coordination)
• Sqoop (Data Exchange), Flume (Log Control)
• YARN / MapReduce v2
• Hadoop Distributed File System
Walking the Talk with Hadoop – Let’s Architect…
People you may know on LinkedIn.
You might know me, if people that you know, know me!
foreach u in UserList:
    foreach x in Connections(u):
        foreach y in Connections(x):
            if (y != u and y not in Connections(u)):
                Count(u, y)++;
    Sort (u, y) in descending order of Count(u, y);
    Choose Top 3 y;
    Store (u, {y0, y1, y2}) for serving;
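The pseudocode above can be sketched as runnable Python. The function name `people_you_may_know`, the dictionary-of-sets graph representation, and the toy graph are assumptions made for illustration. The sketch runs on a single machine and is not how LinkedIn implements the feature.

```python
from collections import Counter

def people_you_may_know(connections, top_n=3):
    """Suggest friends-of-friends, ranked by number of mutual connections.

    connections: dict mapping each user to the set of users they know.
    """
    suggestions = {}
    for u, friends in connections.items():
        counts = Counter()
        for x in friends:
            for y in connections.get(x, ()):
                # Skip u itself and anyone u already knows.
                if y != u and y not in friends:
                    counts[y] += 1
        # Sort by count descending, then by name so ties are deterministic.
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        suggestions[u] = [y for y, _ in ranked[:top_n]]
    return suggestions

# Hypothetical toy graph (symmetric connections).
graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan", "eve"},
    "dan": {"bob", "cat"},
    "eve": {"cat"},
}
pymk = people_you_may_know(graph)
print(pymk["ann"])  # ['dan', 'eve']
```

In MapReduce terms, the inner loops resemble a map step that emits ((u, y), 1) pairs, and the counting resembles a reduce step that sums them per pair.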
Simplest ever Map-Reduce example
What is a Map?
Mapper is a function that transforms the input data into the required format, without aggregating.
Mapped_List = Mapper(Input_List)
Ex:
Input_List = (1, 2, 3, 4, 5, 6, 7, 8, 9)
Mapper = Square()
Mapped_List = Square(Input_List)
Mapped_List = Square(1, 2, 3, 4, 5, 6, 7, 8, 9)
Mapped_List = (1, 4, 9, 16, 25, 36, 49, 64, 81)
What is a Reduce?
Reducer is a function that aggregates the input data into the required format.
Output_List = Reducer(Mapped_List)
Ex:
Mapped_List = (1, 4, 9, 16, 25, 36, 49, 64, 81)
Reducer = Sum()
Output_List = Sum(Mapped_List)
Output_List = Sum(1, 4, 9, 16, 25, 36, 49, 64, 81)
Output_List = 285
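The square-then-sum example can be tried directly with Python's built-in `map` and `functools.reduce`. This is a single-process illustration of the two concepts, not Hadoop code; the variable names are chosen to mirror the slide.

```python
from functools import reduce
from operator import add

input_list = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Map: transform each element independently, with no aggregation.
mapped_list = list(map(lambda x: x * x, input_list))

# Reduce: aggregate the mapped values into a single result.
output = reduce(add, mapped_list)

print(mapped_list)  # [1, 4, 9, 16, 25, 36, 49, 64, 81]
print(output)       # 285
```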
Characteristics of Map Reduce
• Map is an inherently parallel process, where each list element is processed independently
• Reduce is inherently sequential, unless multiple lists are processed at a time, in parallel
• Grouping is done to produce multiple lists to avail parallelism
Input → Partition → Map → Sort → Shuffle → Reduce → Output
Native MapReduce, Hadoop Streaming
Simulating Map Reduce
mc:~$ cat /var/log/auth.log* | grep "session opened" | cut -f11 -d' ' | sort | uniq -c
What does each of the above commands do? What is the output?
Feb 1 18:17:01 ip-10-218-136-14 CRON[21353]: pam_unix(cron:session): session opened for user root by (uid=0)
Feb 1 18:30:01 ip-10-218-136-14 CRON[21373]: pam_unix(cron:session): session opened for user ubuntu by (uid=0)
Feb 1 18:39:01 ip-10-218-136-14 CRON[21387]: pam_unix(cron:session): session opened for user root by (uid=0)
Feb 1 19:09:01 ip-10-218-136-14 CRON[21427]: pam_unix(cron:session): session opened for user root by (uid=0)
mc:~$ cat /var/log/auth.log* | grep "session opened" | less
mc:~$ cat /var/log/auth.log* | grep "session opened" | cut -f11 -d' ' | sort | uniq
mc:~$ cat /var/log/auth.log* | grep "session opened" | cut -f11 -d' ' | sort | uniq -c
28321 root
86 ubuntu
47635 user
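One way to read the pipeline above in MapReduce terms: `grep` filters, `cut` maps each line to a key, `sort` plays the role of the shuffle, and `uniq -c` reduces each group to a count. The sketch below mirrors those steps in Python on the four sample lines from the slide. The function name `count_sessions` is an assumption for illustration.

```python
from collections import Counter

# Sample lines copied from the slide's auth.log excerpt.
LOG_LINES = [
    "Feb 1 18:17:01 ip-10-218-136-14 CRON[21353]: pam_unix(cron:session): session opened for user root by (uid=0)",
    "Feb 1 18:30:01 ip-10-218-136-14 CRON[21373]: pam_unix(cron:session): session opened for user ubuntu by (uid=0)",
    "Feb 1 18:39:01 ip-10-218-136-14 CRON[21387]: pam_unix(cron:session): session opened for user root by (uid=0)",
    "Feb 1 19:09:01 ip-10-218-136-14 CRON[21427]: pam_unix(cron:session): session opened for user root by (uid=0)",
]

def count_sessions(lines):
    """Count 'session opened' events per user name."""
    # grep "session opened": keep only matching lines (filter).
    matching = (line for line in lines if "session opened" in line)
    # cut -f11 -d' ': take the 11th single-space-delimited field (map).
    users = (line.split(" ")[10] for line in matching)
    # sort | uniq -c: group identical keys and count them (shuffle + reduce).
    return Counter(users)

counts = count_sessions(LOG_LINES)
for user, n in sorted(counts.items()):
    print(n, user)
```

Like `cut -d' '`, `split(" ")` treats every single space as a delimiter, so a line with two consecutive spaces shifts the field positions. That may be why the slide's output contains a `user` row, although the slide itself does not say.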
The MapReduce Process
Map: Input --- (Filtering, Transformation) --- Output
In → (Key1, Value1)
Out → List(Key2, Value2)
Shuffle: Movement / copy of data
In → (Key2, Value2)
Out → Sort(Partition(Key2, List(Value2)))
Reduce: Aggregation
In → List(Key2, List(Value2))
Out → List(Key3, Value3)
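The key/value signatures of the three phases can be made concrete with a single-process word-count sketch. The helper names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_mapper`, `sum_reducer`) and the sample records are assumptions for illustration. A real Hadoop job distributes these phases across nodes and partitions the keys among reducers, which this sketch does not attempt.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply mapper to each (key1, value1); collect the (key2, value2) pairs."""
    pairs = []
    for key1, value1 in records:
        pairs.extend(mapper(key1, value1))
    return pairs

def shuffle_phase(pairs):
    """Group values by key and sort by key: (key2, [value2, ...])."""
    groups = defaultdict(list)
    for key2, value2 in pairs:
        groups[key2].append(value2)
    return sorted(groups.items())

def reduce_phase(grouped, reducer):
    """Apply reducer to each (key2, [value2, ...]) to get (key3, value3)."""
    return [reducer(key2, values) for key2, values in grouped]

# Hypothetical word-count job.
def word_mapper(line_no, line):
    return [(word, 1) for word in line.split()]

def sum_reducer(word, ones):
    return (word, sum(ones))

records = [(0, "big data big cluster"), (1, "data locality")]
result = reduce_phase(shuffle_phase(map_phase(records, word_mapper)), sum_reducer)
print(result)  # [('big', 2), ('cluster', 1), ('data', 2), ('locality', 1)]
```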
The MapReduce Process with a Deck of Cards!
Map in Parallel → Shuffle/Group → Reduce (one Sum() per group)
Hadoop Security
Authentication: Provides Kerberos-based authentication. Kerberos can be connected to corporate LDAP environments to centrally provision user information.
Authorization: Ensures users have access only to the data permitted by corporate policies. Provides fine-grained authorization via file permissions in HDFS and resource-level access control for YARN & MapReduce.
Audit: Centralized framework for collecting access audit history and easy reporting on the data.
Data Protection: Supports encrypting data both in transit and at rest, and masking capabilities for desensitizing PII.
Centralized Security Administration: Security requirements are consistently applied across the platform and can be managed centrally with a single interface.
Difference between Authentication & Authorization?
A Brief note on Spark & Storm
What do you think is the most time consuming aspect of Hadoop Processes?
How to improve the I/O Limitation?
Result: Faster Analytics
How to achieve event-driven real-time analytics?
Result: Highly customized service response
Hadoop Distributions & Service Providers
4 The Unstructured Maze
Data Deluge – A big problem
Sources: PC, Tablet, Mobile, Social, Search & Mail, E-Commerce
Tree based Unstructured Feature Extraction
Inputs: Panel & Web Logs, Social
• Tweets
• Comments
• Likes
• Shares
• Blogs
• Reviews
• Clickstream
• HTML
• Images
• Audio*
• Video*
Components: Data Parser, Rules Engine, Tree Based Parser

Feature    Type   Detail
Feature 1  Image  600*400
Feature 2  Link   #
Feature 3  Price  200$
Feature 4  Star   3.5

Tweet    Time   View
Tweet 1  12:00  Positive
Tweet 2  12:05  Neutral
5 Banyan – A Parallel Processing Framework
Banyan – Parallel processing framework at scale
• Is your data unstructured?
Ex: HTML, Images, URLs, Audio, Video, Documents, Text
• Is processing each input independent of processing the other inputs?
Ex: Compressing one image is independent of compressing the next image
• Do you need to solve the two problems above at web scale?
Ex: Say, 1 million documents to be processed in less than 1 hour
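The "each input is independent" property in the checklist above is what makes a job embarrassingly parallel. The sketch below illustrates the idea with Python's standard-library thread pool on one machine. It does not use or represent Banyan's API; the `compress` function and the in-memory `documents` are hypothetical stand-ins for real images or pages.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def compress(item):
    """Process one input on its own; no other input is consulted."""
    name, payload = item
    return name, zlib.compress(payload)

# Hypothetical in-memory "documents" standing in for images or HTML pages.
documents = [(f"doc-{i}", (f"payload {i} " * 200).encode()) for i in range(8)]

# Each task is independent, so workers need no coordination with each other.
with ThreadPoolExecutor(max_workers=4) as pool:
    compressed = dict(pool.map(compress, documents))

# Aggregation happens separately, after the independent tasks finish.
total_in = sum(len(payload) for _, payload in documents)
total_out = sum(len(blob) for blob in compressed.values())
print(total_in, total_out)
```

Because no task reads another task's input or output, the worker count can be raised or lowered without changing the result. The totals are computed only after processing, in line with the deck's point about keeping processing and aggregation decoupled.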
We handle what Hadoop can’t handle!
Rather, we handle what Hadoop isn’t supposed to handle: parallel processing & unstructured data!
Banyan is a parallel processing framework well integrated with the cloud platform of your choice!
Follow us here:
Banyan – Embarrassingly Parallel Processing Framework LinkedIn Group
http://www.growbanyan.com
Email us: runparallel@latentview.com
Banyan Vs Hadoop (Yes or No type of comparison)
Characteristics Banyan Hadoop
Job Type Embarrassingly Parallel Processing Distributed Processing
Master Slave Architecture
Shared Nothing Architecture
Data Replication
Fault tolerance
Coordinated Job Distribution
Dynamic Load Balancing
Rescheduling Job Failures
Process Structured data
Process Unstructured data
Note: The core advantage of Banyan is best utilized when data processing and analysis (aggregation) are executed in a decoupled fashion for jobs that can be processed in parallel.

Mais conteúdo relacionado

Mais procurados

Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemCloudera, Inc.
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop EcosystemLior Sidi
 
Introduction to Hadoop - The Essentials
Introduction to Hadoop - The EssentialsIntroduction to Hadoop - The Essentials
Introduction to Hadoop - The EssentialsFadi Yousuf
 
20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introductionXuan-Chao Huang
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Uwe Printz
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo pptPhil Young
 
Big Data and Hadoop Introduction
 Big Data and Hadoop Introduction Big Data and Hadoop Introduction
Big Data and Hadoop IntroductionDzung Nguyen
 
Top Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for FresherTop Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for FresherJanBask Training
 
Spark Application Development Made Easy
Spark Application Development Made EasySpark Application Development Made Easy
Spark Application Development Made EasyDataWorks Summit
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideDanairat Thanabodithammachari
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystemsunera pathan
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache HadoopChristopher Pezza
 
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Uwe Printz
 
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaWhat are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaEdureka!
 
Presentation on Hadoop Technology
Presentation on Hadoop TechnologyPresentation on Hadoop Technology
Presentation on Hadoop TechnologyOpenDev
 
Overview of stinger interactive query for hive
Overview of stinger   interactive query for hiveOverview of stinger   interactive query for hive
Overview of stinger interactive query for hiveDavid Kaiser
 

Mais procurados (20)

Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
 
Introduction to Hadoop - The Essentials
Introduction to Hadoop - The EssentialsIntroduction to Hadoop - The Essentials
Introduction to Hadoop - The Essentials
 
SQOOP - RDBMS to Hadoop
SQOOP - RDBMS to HadoopSQOOP - RDBMS to Hadoop
SQOOP - RDBMS to Hadoop
 
20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
 
Big Data and Hadoop Introduction
 Big Data and Hadoop Introduction Big Data and Hadoop Introduction
Big Data and Hadoop Introduction
 
Top Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for FresherTop Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for Fresher
 
Spark Application Development Made Easy
Spark Application Development Made EasySpark Application Development Made Easy
Spark Application Development Made Easy
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guide
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystem
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache Hadoop
 
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
 
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaWhat are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
 
Presentation on Hadoop Technology
Presentation on Hadoop TechnologyPresentation on Hadoop Technology
Presentation on Hadoop Technology
 
Overview of stinger interactive query for hive
Overview of stinger   interactive query for hiveOverview of stinger   interactive query for hive
Overview of stinger interactive query for hive
 

Destaque

Introduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemIntroduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemMahabubur Rahaman
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
Apache Hadoop ecosystem - March 2013
Apache Hadoop ecosystem - March 2013Apache Hadoop ecosystem - March 2013
Apache Hadoop ecosystem - March 2013hadoopsphere
 
Experimentation Platform on Hadoop
Experimentation Platform on HadoopExperimentation Platform on Hadoop
Experimentation Platform on HadoopDataWorks Summit
 
Hadoop data access layer v4.0
Hadoop data access layer v4.0Hadoop data access layer v4.0
Hadoop data access layer v4.0SpringPeople
 
Wex grade5 level-c:book1
Wex grade5 level-c:book1Wex grade5 level-c:book1
Wex grade5 level-c:book1Paul Solarz
 
Hadoop Cluster Configuration and Data Loading - Module 2
Hadoop Cluster Configuration and Data Loading - Module 2Hadoop Cluster Configuration and Data Loading - Module 2
Hadoop Cluster Configuration and Data Loading - Module 2Rohit Agrawal
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introductionDong Ngoc
 
SUSE, Hadoop and Big Data Update. Stephen Mogg, SUSE UK
SUSE, Hadoop and Big Data Update. Stephen Mogg, SUSE UKSUSE, Hadoop and Big Data Update. Stephen Mogg, SUSE UK
SUSE, Hadoop and Big Data Update. Stephen Mogg, SUSE UKhuguk
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component rebeccatho
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerancePallav Jha
 

Destaque (20)

Introduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemIntroduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop Ecosystem
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Apache Hadoop ecosystem - March 2013
Apache Hadoop ecosystem - March 2013Apache Hadoop ecosystem - March 2013
Apache Hadoop ecosystem - March 2013
 
Experimentation Platform on Hadoop
Experimentation Platform on HadoopExperimentation Platform on Hadoop
Experimentation Platform on Hadoop
 
Hadoop-Introduction
Hadoop-IntroductionHadoop-Introduction
Hadoop-Introduction
 
Apache Hadoop: DFS and Map Reduce
Apache Hadoop: DFS and Map ReduceApache Hadoop: DFS and Map Reduce
Apache Hadoop: DFS and Map Reduce
 
Hadoop data access layer v4.0
Hadoop data access layer v4.0Hadoop data access layer v4.0
Hadoop data access layer v4.0
 
Wex grade5 level-c:book1
Wex grade5 level-c:book1Wex grade5 level-c:book1
Wex grade5 level-c:book1
 
Big Data
Big DataBig Data
Big Data
 
Hadoop Cluster Configuration and Data Loading - Module 2
Hadoop Cluster Configuration and Data Loading - Module 2Hadoop Cluster Configuration and Data Loading - Module 2
Hadoop Cluster Configuration and Data Loading - Module 2
 
Big Data and Hadoop - An Introduction
Big Data and Hadoop - An IntroductionBig Data and Hadoop - An Introduction
Big Data and Hadoop - An Introduction
 
Amazon Elastic Computing 2
Amazon Elastic Computing 2Amazon Elastic Computing 2
Amazon Elastic Computing 2
 
Taller hadoop
Taller hadoopTaller hadoop
Taller hadoop
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Hadoop administration
Hadoop administrationHadoop administration
Hadoop administration
 
SUSE, Hadoop and Big Data Update. Stephen Mogg, SUSE UK
SUSE, Hadoop and Big Data Update. Stephen Mogg, SUSE UKSUSE, Hadoop and Big Data Update. Stephen Mogg, SUSE UK
SUSE, Hadoop and Big Data Update. Stephen Mogg, SUSE UK
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
 
Hadoop Trends
Hadoop TrendsHadoop Trends
Hadoop Trends
 
Spark mhug2
Spark mhug2Spark mhug2
Spark mhug2
 

Semelhante a Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy

Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft PlatformJesus Rodriguez
 
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku
 
Architecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchArchitecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchHortonworks
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry PerspectiveCloudera, Inc.
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environmentDelhi/NCR HUG
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Chris Baglieri
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerMark Kromer
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big pictureJ S Jodha
 
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)Jeff Magnusson
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataOfir Manor
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataPaco Nathan
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache HadoopSteve Watt
 
THE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATATHE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATATarak Tar
 
THE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATATHE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATATarak Tar
 

Semelhante a Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy (20)

Hadoop
HadoopHadoop
Hadoop
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft Platform
 
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
 
Architecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchArchitecting the Future of Big Data and Search
Architecting the Future of Big Data and Search
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry Perspective
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environment
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
hadoop
hadoophadoop
hadoop
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big picture
 
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroData
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big Data
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache Hadoop
 
THE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATATHE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATA
 
THE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATATHE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATA
 

Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy

  • 1. Scaling up with Hadoop & Parallel Processing Framework (Banyan) LatentView Analytics 2015
  • 2. Agenda 1 Introducing LatentView Analytics 2 Data Processing Frameworks and a brief history of Hadoop 3 Solving the Big Data Problem with Hadoop, Spark & Storm 4 The Unstructured Maze 5 Banyan – A Parallel Processing Framework 6 Demo
  • 3. Agenda Introducing LatentView Analytics 2 Data Processing Frameworks and a brief history of Hadoop 3 Solving the Big Data Problem with Hadoop, Spark & Storm 4 The Unstructured Maze 5 Banyan – A Parallel Processing Framework 6 Demo
  • 4. LatentView Analytics: developed solutions for 25 Fortune 500 firms; over 500 people strong; more than 1000 years of combined experience in analytics; engaging 30K followers on social media.
      LatentView in the News:
      - LatentView won the Deloitte Technology Fast 50 India awards for 6 consecutive years (2009 – 13)
      - 'Top Innovator' awarded to LatentView by Developer Week (Conference & Festival 2013)
      - LatentView was a Top Finalist in the 'We Love Our Workplace 2013' category, reflecting global recognition of our workplace culture
      - LatentView is an Advanced Consulting Partner with Amazon Web Services
      - LatentView is an Alliance Partner with Tableau
      Services provided by LatentView:
      - Build Reporting and Analytics Centers of Excellence (COEs)
      - Analyze business problems both qualitatively & quantitatively and provide actionable insights
      - Onsite-Offshore Global Delivery model that helps in-house teams do more with less
      - Provide thought leadership in Data Science
  • 5. Work @ LatentView
      Industry Specific Analysis: Market Basket Analysis, Campaign Analytics, Fraud Detection, Survey Analytics, Customer Lifetime Value, Demand Forecasting, Price Optimization, Social Media Analysis
      Different Data Sources & Formats: Mobile, PC, Tablet, Signal & Wireless Data, Servers & Cloud, Social, User Profile, Surveys & Reviews, Travel & Location, Performance, System Logs & Database data, Unstructured Data
      Technology & Predictive Analysis Tool Kits: Data Engineering & Advanced Analytics Infrastructure, Databases, Predictive Modelling, CXO Dashboards & Visualization
  • 6. Agenda 1 Introducing LatentView Analytics Data Processing Frameworks and a brief history of Hadoop 3 Solving the Big Data Problem with Hadoop, Spark & Storm 4 The Unstructured Maze 5 Banyan – A Parallel Processing Framework 6 Demo
  • 8. Data Processing Frameworks: Distributed Processing vs. Parallel Processing
      Distributed Processing characteristics:
      - Master/Slave or Peer-to-Peer architecture
      - Data replication and redundancy
      - Fault tolerant, shared memory
      - Centralized job distribution
      - Efficient job scheduling
      - Coordinated resource management
      - Processes structured & semi-structured data
      - Examples: Hadoop, Spark, Storm
      Parallel Processing characteristics:
      - Shared-nothing, massively parallel architecture
      - Common or independent storage
      - Independent memory & processor space
      - Random job distribution
      - Self-managed resources & workers
      - Dynamically load-balanced cluster
      - Processes unstructured data
      - Examples: Banyan
  • 9. The Search Context
      SMART (information retrieval system):
      - Gerard Salton, father of modern search technology
      - Salton's Magic Automatic Retriever of Text
      - Inverse Document Frequency (IDF), Term Frequency (TF), term discrimination values
      Project Xanadu:
      - Ted Nelson coined the term "hypertext"
      - Aimed to create a computer network with a simple UI to solve social problems like attribution
      - Inspired the creation of the WWW
      ARPANet:
      - Advanced Research Projects Agency Network
      - Led to the Internet
      - First implementation of the TCP/IP stack
      Archie Query Form:
      - Document search & find tool
      - Script-based data gatherer with a regular expression matcher for retrieving files
      - A database of web filenames which it would match with the users' queries
      FTP & WWW:
      - Enter Tim Berners-Lee
      - httpd, TCP, DNS connected it all
      Ask, AltaVista, Yahoo, Google, Bing
      What is the biggest problem that the search engines of the last two decades solve?
  • 10. A Brief History of Hadoop
      Building Lucene:
      - Project Lucene was written by Doug Cutting in 1999, purely in Java
      - It was written with the intention of helping to create an open source web search engine
      - Lucene is just an indexing and search library and does not contain crawling and HTML parsing functionality
      Building Nutch:
      - Lucene was not able to crawl or parse HTML by itself, so a sub-project called Nutch was developed under it
      - Doug Cutting & Mike Cafarella
      - Highly modular architecture that allows developers to create plug-ins for media-type parsing, data retrieval, querying and clustering
      The Heart of Hadoop – Distributed File System & Map Reduce:
      - The Google File System paper was presented
      - NDFS was developed based on the paper
      - Google released another paper, on MapReduce, that revolutionized Hadoop development
      - MapReduce tries to collocate the data with the compute node, so data access is fast since it is local. This is known as Data Locality
      Algorithms in Hadoop:
      - Ported Nutch algorithms to Hadoop
      - Yahoo hires Doug Cutting!
      - Apache Hadoop comes into the picture to support Map Reduce & HDFS
      - Yahoo's Grid team adopts Hadoop
      - Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours
      Hadoop Benchmarks!
      - Yahoo! set up a Hadoop research cluster of 300 nodes. Also, sort benchmark run on 500 nodes in 42 hours (better hardware than the April benchmark)
      - Research cluster upgraded to 600 nodes
      - In 2008: won the 1 terabyte sort benchmark in 209 seconds on 900 nodes
      - Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster
      Hadoop In Action!
      - As of 2008: loading 10 terabytes of data per day onto research clusters
      - 17 clusters with a total of 24,000 nodes
      - In 2009: won the minute sort by sorting 500 GB in 59 seconds (on 1,400 nodes) and the 100 TB sort in 173 minutes (on 3,400 nodes)
      - Last.fm, Facebook, New York Times
      Key Challenges Addressed (by year on the timeline):
      - 1999: Efficient indexing of the results for easy retrieval
      - 2002: Efficient crawling of the World Wide Web at scale
      - 2003: File system
      - 2005, 2006 (Algorithms in Hadoop): Cost- & time-efficient hardware and software
      - 2008 (Hadoop in Action!): Distributed processing, data warehousing & analysis; UI & tools
      Hadoop Services Ecosystem:
      - Hadoop Distributions: Apache Hadoop, Apache Bigtop
      - Hadoop as a Platform: Cloudera, Hortonworks, MapR
      - Hadoop as a Service: GoGrid, Qubole, Altiscale, AWS EMR, Azure HDInsight, IBM BigInsights
      Elephant though it is, it has its limitations!
      - Master Slave architecture
      - Batch vs. real-time stream processing
      - Relational database vs. NoSQL database
      - Data fragmentation and management
      - Parallel processing job requirements
      - Efficient energy management
  • 11. Applications of Big Data Processing Predict Galaxy types and shapes Analyzing Life Forms Weather Forecast Traffic Management Disaster Recovery Personal Health Care Science & Engineering Environmental Management Intelligent Devices & IoT
  • 12. Agenda 1 Introducing LatentView Analytics 2 Data Processing Frameworks and a brief history of Hadoop Solving the Big Data Problem with Hadoop, Spark & Storm 4 The Unstructured Maze 5 Banyan – A Parallel Processing Framework 6 Demo
  • 13. Apache Hadoop & Family Identify the Apache Hadoop Components!
  • 14. The Apache Hadoop Stack
      - Hadoop Distributed File System (HDFS)
      - YARN / Map Reduce V2
      - Pig: scripting
      - Hive: SQL
      - Mahout: ML
      - Oozie: workflow
      - HBase: columnar data store
      - ZooKeeper: coordination
      - Sqoop: data exchange
      - Flume: log control
      - Hadoop User Experience (HUE)
  • 15. Walking the Talk with Hadoop – Let's Architect… People you may know on LinkedIn. You might know me, if people that you know, know me!
      foreach u in UserList:
          foreach x in Connections(u):
              foreach y in Connections(x):
                  if (y not in Connections(u)):
                      Count(u, y)++;
          Sort (u, y) in descending order of Count(u, y);
          Choose Top 3 y;
          Store (u, {y0, y1, y2..}) for serving;
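The pseudocode on this slide can be tried out on a single machine before thinking about how to distribute it. The following is a minimal sketch, not the deck's implementation: the function name `people_you_may_know`, the `connections` dictionary and the sample names are all illustrative assumptions. It also adds one check the slide's pseudocode leaves out (skipping `u` itself as a candidate), and breaks ties alphabetically so the result is deterministic.

```python
from collections import Counter

def people_you_may_know(connections, top_n=3):
    """For each user, rank non-connections by number of mutual connections.

    `connections` maps a user to the set of users they are connected to
    (hypothetical in-memory stand-in for UserList / Connections(u)).
    """
    suggestions = {}
    for u, friends in connections.items():
        counts = Counter()
        for x in friends:                        # foreach x in Connections(u)
            for y in connections.get(x, ()):     # foreach y in Connections(x)
                # Assumption: also skip u itself, which the slide omits.
                if y != u and y not in friends:
                    counts[y] += 1               # Count(u, y)++
        # Sort by descending count, then by name to make ties deterministic.
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        suggestions[u] = [y for y, _ in ranked[:top_n]]   # Choose Top 3 y
    return suggestions

# Illustrative data only.
connections = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan"},
    "cat": {"ann", "dan", "eve"},
    "dan": {"bob", "cat"},
    "eve": {"cat"},
}
suggestions = people_you_may_know(connections)
print(suggestions["ann"])
```

In a MapReduce setting the inner two loops would naturally become the map step (emit a pair for every friend-of-friend) and the counting and top-3 selection the reduce step.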
  • 16. Simplest ever Map-Reduce example
      What is a Map?
      A Mapper is a function that transforms the input data into the required format, without aggregating.
          Mapped_List = Mapper(Input_List)
          Ex: Input_List = (1, 2, 3, 4, 5, 6, 7, 8, 9)
          Mapper = Square()
          Mapped_List = Square(Input_List)
          Mapped_List = Square(1, 2, 3, 4, 5, 6, 7, 8, 9)
          Mapped_List = (1, 4, 9, 16, 25, 36, 49, 64, 81)
      What is a Reduce?
      A Reducer is a function that aggregates the input data into the required format.
          Output_List = Reducer(Mapped_List)
          Ex: Mapped_List = (1, 4, 9, 16, 25, 36, 49, 64, 81)
          Reducer = Sum()
          Output_List = Sum(Mapped_List)
          Output_List = Sum(1, 4, 9, 16, 25, 36, 49, 64, 81)
          Output_List = 285
      Characteristics of Map Reduce:
      - Map is an inherently parallel process, where each list element is processed independently
      - Reduce is inherently sequential, unless multiple lists are processed at a time, in parallel
      - Grouping is done to produce multiple lists so that Reduce can also run in parallel
      Input → Partition → Map → Sort → Shuffle → Reduce → Output
      Native MapReduce, Hadoop Streaming
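The square-and-sum example from this slide maps directly onto Python's built-in `map` and `functools.reduce`. This is only a single-process sketch of the idea; the grouping by parity at the end is an illustrative assumption, added to show how grouping yields several lists that could each be reduced independently.

```python
from functools import reduce
from operator import add

input_list = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Map: each element is transformed independently of the others.
mapped_list = list(map(lambda n: n * n, input_list))

# Reduce: the mapped values are aggregated into one result.
output = reduce(add, mapped_list)

# Grouping (assumed key: parity) produces several lists, so one
# reducer per group could run in parallel.
groups = {"odd": [], "even": []}
for value in mapped_list:
    groups["odd" if value % 2 else "even"].append(value)
group_sums = {key: reduce(add, values) for key, values in groups.items()}

print(mapped_list)
print(output)
print(group_sums)
```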
  • 17. Simulating Map Reduce
      mc:~$ cat /var/log/auth.log* | grep "session opened" | cut -f11 -d' ' | sort | uniq -c
      What does each of the above commands do? What is the output?
      Sample log lines:
          Feb 1 18:17:01 ip-10-218-136-14 CRON[21353]: pam_unix(cron:session): session opened for user root by (uid=0)
          Feb 1 18:30:01 ip-10-218-136-14 CRON[21373]: pam_unix(cron:session): session opened for user ubuntu by (uid=0)
          Feb 1 18:39:01 ip-10-218-136-14 CRON[21387]: pam_unix(cron:session): session opened for user root by (uid=0)
          Feb 1 19:09:01 ip-10-218-136-14 CRON[21427]: pam_unix(cron:session): session opened for user root by (uid=0)
      Building up the pipeline step by step:
          mc:~$ cat /var/log/auth.log* | grep "session opened" | less
          mc:~$ cat /var/log/auth.log* | grep "session opened" | cut -f11 -d' ' | sort | uniq
          mc:~$ cat /var/log/auth.log* | grep "session opened" | cut -f11 -d' ' | sort | uniq -c
      Output:
          28321 root
             86 ubuntu
          47635 user
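To make the analogy explicit, the same pipeline can be written as three small steps in Python: `grep` and `cut` play the role of the map, `sort` the shuffle, and `uniq -c` the reduce. This is a sketch run against the four sample lines from the slide only; the function name `count_sessions_by_user` is an assumption for illustration.

```python
from itertools import groupby

# The four sample lines shown on the slide.
SAMPLE_LOG = [
    "Feb 1 18:17:01 ip-10-218-136-14 CRON[21353]: pam_unix(cron:session): session opened for user root by (uid=0)",
    "Feb 1 18:30:01 ip-10-218-136-14 CRON[21373]: pam_unix(cron:session): session opened for user ubuntu by (uid=0)",
    "Feb 1 18:39:01 ip-10-218-136-14 CRON[21387]: pam_unix(cron:session): session opened for user root by (uid=0)",
    "Feb 1 19:09:01 ip-10-218-136-14 CRON[21427]: pam_unix(cron:session): session opened for user root by (uid=0)",
]

def count_sessions_by_user(lines):
    # Map, part 1 (grep): keep only the lines of interest.
    matched = [line for line in lines if "session opened" in line]
    # Map, part 2 (cut -f11 -d' '): split on every single space, take field 11.
    users = [line.split(" ")[10] for line in matched]
    # Shuffle (sort): bring identical keys next to each other.
    users.sort()
    # Reduce (uniq -c): count each run of identical keys.
    return [(len(list(run)), user) for user, run in groupby(users)]

counts = count_sessions_by_user(SAMPLE_LOG)
for n, user in counts:
    print(f"{n:7d} {user}")
```

Like `cut -d' '`, `split(" ")` treats every single space as a separator. A syslog timestamp that pads a single-digit day with an extra space therefore shifts all fields by one, which may be why the slide's output contains a `user` row.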
  • 18. The MapReduce Process
      Map (filtering, transformation): Input → Output
          In: (Key1, Value1)
          Out: List(Key2, Value2)
      Shuffle (movement / copy of data):
          In: (Key2, Value2)
          Out: Sort(Partition(Key2, List(Value2)))
      Reduce (aggregation):
          In: List(Key2, List(Value2))
          Out: List(Key3, Value3)
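The key/value signatures on this slide can be followed end to end in a few lines. The sketch below is a single-process illustration under stated assumptions, not Hadoop's API: `map_reduce`, `word_mapper` and `count_reducer` are hypothetical names, and word counting is chosen only because it is the customary example.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Run map, shuffle and reduce over (key1, value1) records in one process."""
    # Map: (Key1, Value1) -> List(Key2, Value2)
    intermediate = []
    for key1, value1 in records:
        intermediate.extend(mapper(key1, value1))
    # Shuffle: group values by Key2; keys are visited in sorted order below.
    groups = defaultdict(list)
    for key2, value2 in intermediate:
        groups[key2].append(value2)
    # Reduce: (Key2, List(Value2)) -> List(Key3, Value3)
    output = []
    for key2 in sorted(groups):
        output.extend(reducer(key2, groups[key2]))
    return output

def word_mapper(line_number, line):
    # Emit (word, 1) for every word; the input key is not needed here.
    return [(word.lower(), 1) for word in line.split()]

def count_reducer(word, ones):
    return [(word, sum(ones))]

# Illustrative input only.
records = [
    (1, "map shuffle reduce"),
    (2, "map in parallel"),
    (3, "reduce after shuffle"),
]
result = map_reduce(records, word_mapper, count_reducer)
print(result)
```

In a real cluster the map calls run on many nodes at once and the shuffle physically moves data between them; here all three phases are simply loops.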
  • 19. The MapReduce Process with a Deck of Cards! Map in Parallel Shuffle/Group Reduce Sum() Sum() Sum() Sum() Sum()
  • 20. Hadoop Security
      - Audit: Centralized framework for collecting access audit history and easy reporting on the data.
      - Authentication: Provides Kerberos-based authentication. Kerberos can be connected to corporate LDAP environments to centrally provision user information.
      - Data Protection: Supports encrypting data in transit and at rest, and masking capabilities for desensitizing PII.
      - Authorization: Ensures users have access only to data permitted by corporate policies. Provides fine-grained authorization via file permissions in HDFS and resource-level access control for YARN & MapReduce.
      - Centralized Security Administration: Security requirements are consistently applied across the platform and can be managed centrally with a single interface.
      Difference between Authentication & Authorization?
  • 21. A Brief note on Spark & Storm What do you think is the most time consuming aspect of Hadoop Processes? How to improve the I/O Limitation? Result: Faster Analytics How to achieve event driven real time analytics? Result: Highly customized service response
  • 22. Hadoop Distributions & Service Providers
  • 23. Agenda 1 Introducing LatentView Analytics 2 Data Processing Frameworks and a brief history of Hadoop 3 Solving the Big Data Problem with Hadoop, Spark & Storm The Unstructured Maze 5 Banyan – A Parallel Processing Framework 6 Demo
  • 24. Data Deluge – A big problem: PC, Tablet, Mobile, Social, Search & Mail, E-Commerce
  • 25. Tree based Unstructured Feature Extraction
      Sources: Panel & Web Logs, Social
      Components: Rules Engine, Data Parser, Tree Based Parser
      Inputs: Tweets, Comments, Likes, Shares, Blogs, Reviews, Clickstream, HTML, Images, Audio*, Video*
      Extracted features:
          Feature     Type    Detail
          Feature 1   Image   600*400
          Feature 2   Link    #
          Feature 3   Price   200$
          Feature 4   Star    3.5
      Extracted tweets:
          Tweet     Time    View
          Tweet 1   12:00   Positive
          Tweet 2   12:05   Neutral
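The idea of walking a parsed document tree and applying rules to each node can be sketched with Python's standard `xml.etree.ElementTree`. This is an illustration under assumptions, not the parser described in the deck: the rules, the `class` names `price` and `star`, the sample markup and the function name `extract_features` are all hypothetical, and the input is assumed to be well-formed markup.

```python
import xml.etree.ElementTree as ET

def extract_features(markup):
    """Walk the element tree in document order and apply simple rules.

    Returns a list of (type, detail) pairs. The rules are assumptions
    chosen to mirror the feature table on the slide.
    """
    root = ET.fromstring(markup)          # requires well-formed markup
    features = []
    for node in root.iter():              # depth-first, document order
        if node.tag == "img":
            features.append(("Image", f"{node.get('width')}*{node.get('height')}"))
        elif node.tag == "a":
            features.append(("Link", node.get("href")))
        elif node.get("class") in ("price", "star"):
            features.append((node.get("class").capitalize(), (node.text or "").strip()))
    return features

# Illustrative product snippet only.
page = (
    '<div>'
    '<img src="p.png" width="600" height="400"/>'
    '<a href="#">more</a>'
    '<span class="price">200$</span>'
    '<span class="star">3.5</span>'
    '</div>'
)
features = extract_features(page)
for number, (kind, detail) in enumerate(features, start=1):
    print(f"Feature {number}  {kind}  {detail}")
```

Real-world HTML is rarely well-formed, so a production parser would need a more tolerant tree builder; the rule-per-node structure would stay the same.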
  • 26. Agenda 1 Introducing LatentView Analytics 2 Data Processing Frameworks and a brief history of Hadoop 3 Solving the Big Data Problem with Hadoop, Spark & Storm 4 The Unstructured Maze Banyan – A Parallel Processing Framework 6 Demo
  • 27. Banyan – Parallel processing framework at scale
      - Is your data unstructured? Ex: HTML, Images, URLs, Audio, Video, Documents, Text
      - Is processing each input independent of processing other inputs? Ex: Compressing one image is independent of compressing the next image
      - Do you need to solve the two problems above at web scale? Ex: say 1 million documents to be processed in less than 1 hour
      We handle what Hadoop can't handle! Rather, we handle what Hadoop isn't supposed to handle: parallel processing & unstructured data!
      Banyan is a parallel processing framework well integrated with the cloud platform of your choice!
      Follow us here: Banyan – Embarrassingly Parallel Processing Framework LinkedIn Group; http://www.growbanyan.com
      Email us: runparallel@latentview.com
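"Embarrassingly parallel" means that each input can be handled with no knowledge of any other input, so the work can simply be fanned out to as many workers as are available. The sketch below shows that pattern with Python's standard `concurrent.futures`; it is not Banyan's API, and the `process` function, the synthetic documents and the worker count are assumptions for illustration.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def process(document: bytes) -> int:
    """Compress one document and return the compressed size.

    Each call depends only on its own input, which is what makes
    the job embarrassingly parallel.
    """
    return len(zlib.compress(document))

# Synthetic stand-ins for independent inputs (documents, images, ...).
documents = [f"document {i} ".encode() * 100 for i in range(8)]

# Fan the independent inputs out to a pool of workers; map() returns
# results in input order regardless of which worker finishes first.
with ThreadPoolExecutor(max_workers=4) as pool:
    sizes = list(pool.map(process, documents))

print(sizes)
```

Because no worker waits on another, adding workers (or machines) scales throughput without any coordination step, which is the property the three questions on this slide are probing for.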
  • 28. Banyan vs. Hadoop (Yes or No type of comparison)
      Job Type: Banyan is Embarrassingly Parallel Processing; Hadoop is Distributed Processing
      Characteristics compared: Master Slave Architecture, Shared Nothing Architecture, Data Replication, Fault Tolerance, Coordinated Job Distribution, Dynamic Load Balancing, Rescheduling Job Failures, Process Structured Data, Process Unstructured Data
      Note: The core advantage of Banyan is best utilized when Data Processing & Analysis (Aggregation) are executed in a decoupled fashion for jobs that can be processed in parallel
  • 29. Follow us here! http://www.growbanyan.com Banyan – Embarrassingly Parallel Processing Framework Linked In Group Email us : runparallel@latentview.com