SlideShare uma empresa Scribd logo
1 de 46
Workshop on Parallel, Cluster and
Cloud Computing on Multi-core & GPU
(PCCCMG - 2015)
Workshop Conducted by
Computer Society of India In Association with
Dept. of CSE, VNIT and
Persistence System Ltd, Nagpur
4th – 6th Sep’15
Big-Data Cluster Computing
Advance tools & technologies
Jagadeesan A S
Software Engineer
Persistent Systems Limited
www.github.com/jagadeesanas2
www.linkedin.com/in/jagadeesanas2
ContentContent
Overview of Big Data
• Data clustering concepts
• Clustering vs Classification
• Data Journey
Advance tools and technologies
• Apache Hadoop
• Apache Spark
Future of analytics
• Demo - Spark RDD in Intellij IDEA
Big-Data is similar to Small-Data , but bigger in size and complexity.
What is Big-Data ?
Definition from Wikipedia:
Big data is the term for a collection of data sets so large and complex that
it becomes difficult to process using on-hand database management tools
or traditional data processing applications.
Characterization of Big Data: 4V’s
Veracity
Characterization of Big Data: 4V’s
Now big question ????
Why we need
Big Data ?
What to do with
those Data ?
And the answer is very clear…!!
What is a Cluster ?
A group of the same or similar elements gathered or occurring closely together.
Clustering is the key to Big Data problem
• Not feasible to “label” large collection of objects
• No prior knowledge of the number and nature of groups (clusters) in data
• Clusters may evolve over time
• Clustering provides efficient browsing, search, recommendation and organization of data
Difference between Clustering & classification
Clustering data on
Clustering videos on
Clustering Algorithms
Hundreds of Clustering algorithms are available.
• K-Means
• Kernel K-means
• Nearest neighbour
• Gaussian mixture
• Fuzzy Clustering
• OPTICS algorithm
Data Journey
Advance tools
&
Technologies
Large-Scale Data Analytics
MapReduce computing paradigm vs. Traditional database systems
Database
Many enterprises are turned to Hadoop
Especially applications generating big data, Web applications, social networks, scientific applications
APACHE HADOOP (Disk Based Computing)
open-source software framework written in Java for distributed storage and distributed processing
Design Principles of Hadoop
• Need to process big data
• Need to parallelize computation across thousands of nodes
• Commodity hardware
• Large number of low-end cheap machines working in parallel to solve
a computing problem
• Small number of high-end expensive machines
Hadoop cluster architecture
A Hadoop cluster can be divided into two abstract entities:
MapReduce engine + distributed file system =
What is SPARK
Why SPARK
How to configure SPARK
APACHE SPARK
Open-source cluster computing framework
APACHE SPARK (Memory Based Computing)
open-source software framework written in Java for distributed storage and distributed processing
• Fast cluster computing system for large-scale data processing
compatible with Apache Hadoop
• Improves efficiency through:
• In-memory computing primitives
• General computation graphs
• Improves usability through:
• Rich APIs in Java, Scala, Python
• Interactive shell
Up to 100× faster
Often 2-10× less code
Spark OverviewSpark Overview
Spark Shell Spark applications
• Interactive shell for learning
or data exploration
• Python or Scala
• It provides a preconfigured
Spark context called sc.
• For large scale data processing
• Python, Java, Scala and R
• Every spark application requires
a spark Context. It is the main
entry point to the Spark API.
Scala Interactive shell Python Interactive shell
Spark Overview
Resilient distributed datasets (RDDs)
 Immutable collections of objects spread across a cluster
 Built through parallel transformations (map, filter, etc)
 Automatically rebuilt on failure
 Controllable persistence (e.g. caching in RAM) for reuse
 Shared variables that can be used in parallel operations
Work with distributed collections as we would with local ones
Resilient Distributed Datasets (RDDs)
Two types of RDD operation
• Transformation – define new RDDs based on the current one
Example: Filter, map, reduce
• Action – return values.
Example : count, take(n)
Resilient Distributed Datasets (RDDs)
I have never seen the horror movies.
I never hope to see one;
But I can tell you, anyhow,
I had rather see than be one.
File: movie.txt RDD: mydata
I have never seen the horror movies.
I never hope to see one;
But I can tell you, anyhow,
I had rather see than be one.
Resilient Distributed Datasets (RDDs)
map and filter Transformation
I have never seen the horror movies.
I never hope to see one;
But I can tell you, anyhow,
I had rather see than be one.
I HAVE NEVER SEEN THE HORROR MOVIES.
I NEVER HOPE TO SEE ONE;
BUT I CAN TELL YOU, ANYHOW,
I HAD RATHER SEE THAN BE ONE.
I HAVE NEVER SEEN THE HORROR MOVIES.
I NEVER HOPE TO SEE ONE;
I HAD RATHER SEE THAN BE ONE.
Map(lambda line : line.upper())
Filter(lambda line: line.startswith(‘I’))
Map(line => line.toUpperCase())
Filter(line => line.startsWith(‘I’))
Spark Stack
• Spark SQL :
--- For SQL and unstructured data processing
• Spark Streaming :
--- Stream processing of live data streams
• MLib:
--- For machine learning algorithm
• GraphX:
--- Graph processing
Why Spark ?
 Core engine with SQL, Streaming, machine learning and graph processing
modules.
 Can run today’s most advanced algorithms.
 Alternative to Map Reduce for certain applications.
 APIs in Java, Scala and Python
 Interactive shells in Scala and Python
 Runs on Yarn, Mesos and Standalone.
Spark’s major use cases over Hadoop
• Iterative Algorithms in Machine Learning
• Interactive Data Mining and Data Processing
• Spark is a fully Apache Hive-compatible data warehousing system that
can run 100x faster than Hive.
• Stream processing: Log processing and Fraud detection in live streams
for alerts, aggregates and analysis
• Sensor data processing: Where data is fetched and joined from
multiple sources, in-memory dataset really helpful as they are easy and
fast to process.
MapReduce Example: Word Count
MapReduce Example: Word Count
MapReduce Example: Word Count
MapReduce Example: Word Count
Example : Page Rank
A way of analyzing websites based on their link relationships
• Good example of a more complex algorithm
• Multiple stages of map & reduce
• Benefits from Spark’s in-memory caching
• Multiple iterations over the same data
Basic Idea
Give pages ranks (scores) based on links to them
• Links from many pages  high rank
• Link from a high-rank page  high rank
PageRank Performance
171
80
23
14
0
20
40
60
80
100
120
140
160
180
200
30 60
ITERATIONTIME(S)
NUMBER OF MACHINES
Hadoop Spark
NOTE : Less Iteration Time denotes high Performance
Other Iterative Algorithms
0.96
110
0 25 50 75 100 125
Logistic Regression
4.1
155
0 30 60 90 120 150 180
K-Means Clustering
Hadoop Spark
TIME PER ITERATION(S)
NOTE : Less Iteration Time denotes high Performance
Spark Installation
(For end-user side)
Download Spark distribution from https://spark.apache.org/downloads.html
which pre-build of hadoop 2.4 or later.
Spark Installation
Clone from apache https://github.com/apache/spark GitHub repository
(For developer side)
Spark Installation (continue)
Build the source code using maven and hadoop
<SPARK_HOME>#build/mvn –Pyarn –Phadoop –Phaddop-2.4 -Dhadoop.version=2.6.0
How to run Spark ?
(Standalone mode )
Once the build is completed. Go to your bin directory which is inside Spark home
directory in a terminal and invoke Spark Shell
<SPARK_HOME>/bin#./spark-shell
To start all Spark’s Master and slave nodes:
To execute following terminal inside sbin directory side spark home directory.
<SPARK_HOME>/sbin#./start-all.sh
Spark Master at Spark (Browser view):
localhost:8080
To stop all Spark’s Master and slave nodes:
To execute following terminal inside sbin directory side spark home directory.
<SPARK_HOME>/sbin#./stop-all.sh
Future of analytics
Analytics in the Cloud
https://www.youtube.com/watch?v=JfqJTQnVZvA
• IBM is making Spark available as a cloud service on its
Bluemix cloud platform.
• 3,500 IBM researchers and developers to work on Spark-
related projects at more than a dozen labs worldwide.
Demo - Spark RDD in
Intellij IDEA

Mais conteúdo relacionado

Mais procurados

Cloud architecture
Cloud architectureCloud architecture
Cloud architectureAdeel Javaid
 
CS6010 Social Network Analysis Unit I
CS6010 Social Network Analysis Unit ICS6010 Social Network Analysis Unit I
CS6010 Social Network Analysis Unit Ipkaviya
 
Lecture6 introduction to data streams
Lecture6 introduction to data streamsLecture6 introduction to data streams
Lecture6 introduction to data streamshktripathy
 
IoT Physical Devices and End Points.pdf
IoT Physical Devices and End Points.pdfIoT Physical Devices and End Points.pdf
IoT Physical Devices and End Points.pdfGVNSK Sravya
 
M2M - Machine to Machine Technology
M2M - Machine to Machine TechnologyM2M - Machine to Machine Technology
M2M - Machine to Machine TechnologySamip jain
 
Lossless predictive coding in Digital Image Processing
Lossless predictive coding in Digital Image ProcessingLossless predictive coding in Digital Image Processing
Lossless predictive coding in Digital Image Processingpriyadharshini murugan
 
Image Restoration (Digital Image Processing)
Image Restoration (Digital Image Processing)Image Restoration (Digital Image Processing)
Image Restoration (Digital Image Processing)Kalyan Acharjya
 
CS6010 Social Network Analysis Unit II
CS6010 Social Network Analysis   Unit IICS6010 Social Network Analysis   Unit II
CS6010 Social Network Analysis Unit IIpkaviya
 
Virtual Machine provisioning and migration services
Virtual Machine provisioning and migration servicesVirtual Machine provisioning and migration services
Virtual Machine provisioning and migration servicesANUSUYA T K
 
Challenges of Conventional Systems.pptx
Challenges of Conventional Systems.pptxChallenges of Conventional Systems.pptx
Challenges of Conventional Systems.pptxGovardhanV7
 
3.2 partitioning methods
3.2 partitioning methods3.2 partitioning methods
3.2 partitioning methodsKrish_ver2
 
final presentation fake news detection.pptx
final presentation fake news detection.pptxfinal presentation fake news detection.pptx
final presentation fake news detection.pptxRudraSaraswat6
 
Cloud Computing Design Considerations
Cloud Computing Design ConsiderationsCloud Computing Design Considerations
Cloud Computing Design ConsiderationsMike Kavis
 

Mais procurados (20)

Cloud architecture
Cloud architectureCloud architecture
Cloud architecture
 
Google App Engine
Google App EngineGoogle App Engine
Google App Engine
 
CS6010 Social Network Analysis Unit I
CS6010 Social Network Analysis Unit ICS6010 Social Network Analysis Unit I
CS6010 Social Network Analysis Unit I
 
Lecture6 introduction to data streams
Lecture6 introduction to data streamsLecture6 introduction to data streams
Lecture6 introduction to data streams
 
IoT Physical Devices and End Points.pdf
IoT Physical Devices and End Points.pdfIoT Physical Devices and End Points.pdf
IoT Physical Devices and End Points.pdf
 
M2M - Machine to Machine Technology
M2M - Machine to Machine TechnologyM2M - Machine to Machine Technology
M2M - Machine to Machine Technology
 
IoT and m2m
IoT and m2mIoT and m2m
IoT and m2m
 
Lossless predictive coding in Digital Image Processing
Lossless predictive coding in Digital Image ProcessingLossless predictive coding in Digital Image Processing
Lossless predictive coding in Digital Image Processing
 
Image Restoration (Digital Image Processing)
Image Restoration (Digital Image Processing)Image Restoration (Digital Image Processing)
Image Restoration (Digital Image Processing)
 
Image compression models
Image compression modelsImage compression models
Image compression models
 
CS6010 Social Network Analysis Unit II
CS6010 Social Network Analysis   Unit IICS6010 Social Network Analysis   Unit II
CS6010 Social Network Analysis Unit II
 
Virtual Machine provisioning and migration services
Virtual Machine provisioning and migration servicesVirtual Machine provisioning and migration services
Virtual Machine provisioning and migration services
 
Challenges of Conventional Systems.pptx
Challenges of Conventional Systems.pptxChallenges of Conventional Systems.pptx
Challenges of Conventional Systems.pptx
 
Cloud Computing Architecture
Cloud Computing ArchitectureCloud Computing Architecture
Cloud Computing Architecture
 
3.2 partitioning methods
3.2 partitioning methods3.2 partitioning methods
3.2 partitioning methods
 
On demand provisioning
On demand provisioningOn demand provisioning
On demand provisioning
 
final presentation fake news detection.pptx
final presentation fake news detection.pptxfinal presentation fake news detection.pptx
final presentation fake news detection.pptx
 
Big Data & The Cloud
Big Data & The CloudBig Data & The Cloud
Big Data & The Cloud
 
PAC Learning
PAC LearningPAC Learning
PAC Learning
 
Cloud Computing Design Considerations
Cloud Computing Design ConsiderationsCloud Computing Design Considerations
Cloud Computing Design Considerations
 

Semelhante a Big data clustering

Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupNed Shawa
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsDr. Mirko Kämpf
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsDr. Mirko Kämpf
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdfMaheshPandit16
 
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014cdmaxime
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the SurfaceJosi Aranda
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopAmanda Casari
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonChristian Perone
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkCloudera, Inc.
 
Apache Spark 101 - Demi Ben-Ari - Panorays
Apache Spark 101 - Demi Ben-Ari - PanoraysApache Spark 101 - Demi Ben-Ari - Panorays
Apache Spark 101 - Demi Ben-Ari - PanoraysDemi Ben-Ari
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...Chetan Khatri
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...DB Tsai
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoMapR Technologies
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxRahul Borate
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelinprajods
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerMark Kromer
 

Semelhante a Big data clustering (20)

Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific Applications
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific Applciations
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdf
 
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache Spark
 
Apache Spark 101 - Demi Ben-Ari - Panorays
Apache Spark 101 - Demi Ben-Ari - PanoraysApache Spark 101 - Demi Ben-Ari - Panorays
Apache Spark 101 - Demi Ben-Ari - Panorays
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Stratio big data spain
Stratio   big data spainStratio   big data spain
Stratio big data spain
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelin
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 

Último

Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Pooja Nehwal
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...amitlee9823
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...amitlee9823
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsJoseMangaJr1
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangaloreamitlee9823
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...karishmasinghjnh
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...amitlee9823
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 

Último (20)

Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 

Big data clustering

  • 1. Workshop on Parallel, Cluster and Cloud Computing on Multi-core & GPU (PCCCMG - 2015) Workshop Conducted by Computer Society of India In Association with Dept. of CSE, VNIT and Persistence System Ltd, Nagpur 4th – 6th Sep’15
  • 2. Big-Data Cluster Computing Advance tools & technologies Jagadeesan A S Software Engineer Persistent Systems Limited www.github.com/jagadeesanas2 www.linkedin.com/in/jagadeesanas2
  • 3. ContentContent Overview of Big Data • Data clustering concepts • Clustering vs Classification • Data Journey Advance tools and technologies • Apache Hadoop • Apache Spark Future of analytics • Demo - Spark RDD in Intellij IDEA
  • 4. Big-Data is similar to Small-Data , but bigger in size and complexity. What is Big-Data ? Definition from Wikipedia: Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
  • 5. Characterization of Big Data: 4V’s Veracity
  • 6. Characterization of Big Data: 4V’s
  • 7. Now big question ???? Why we need Big Data ? What to do with those Data ?
  • 8. And the answer is very clear…!!
  • 9. What is a Cluster ? A group of the same or similar elements gathered or occurring closely together. Clustering is the key to Big Data problem • Not feasible to “label” large collection of objects • No prior knowledge of the number and nature of groups (clusters) in data • Clusters may evolve over time • Clustering provides efficient browsing, search, recommendation and organization of data
  • 10. Difference between Clustering & classification
  • 13. Clustering Algorithms Hundreds of Clustering algorithms are available. • K-Means • Kernel K-means • Nearest neighbour • Gaussian mixture • Fuzzy Clustering • OPTICS algorithm
  • 16. Large-Scale Data Analytics MapReduce computing paradigm vs. Traditional database systems Database Many enterprises are turned to Hadoop Especially applications generating big data, Web applications, social networks, scientific applications
  • 17. APACHE HADOOP (Disk Based Computing) open-source software framework written in Java for distributed storage and distributed processing Design Principles of Hadoop • Need to process big data • Need to parallelize computation across thousands of nodes • Commodity hardware • Large number of low-end cheap machines working in parallel to solve a computing problem • Small number of high-end expensive machines
  • 18. Hadoop cluster architecture A Hadoop cluster can be divided into two abstract entities: MapReduce engine + distributed file system =
  • 19. What is SPARK Why SPARK How to configure SPARK APACHE SPARK Open-source cluster computing framework
  • 20. APACHE SPARK (Memory Based Computing) open-source software framework written in Java for distributed storage and distributed processing • Fast cluster computing system for large-scale data processing compatible with Apache Hadoop • Improves efficiency through: • In-memory computing primitives • General computation graphs • Improves usability through: • Rich APIs in Java, Scala, Python • Interactive shell Up to 100× faster Often 2-10× less code
  • 21. Spark OverviewSpark Overview Spark Shell Spark applications • Interactive shell for learning or data exploration • Python or Scala • It provides a preconfigured Spark context called sc. • For large scale data processing • Python, Java, Scala and R • Every spark application requires a spark Context. It is the main entry point to the Spark API. Scala Interactive shell Python Interactive shell
  • 22. Spark Overview Resilient distributed datasets (RDDs)  Immutable collections of objects spread across a cluster  Built through parallel transformations (map, filter, etc)  Automatically rebuilt on failure  Controllable persistence (e.g. caching in RAM) for reuse  Shared variables that can be used in parallel operations Work with distributed collections as we would with local ones
  • 23. Resilient Distributed Datasets (RDDs) Two types of RDD operation • Transformation – define new RDDs based on the current one Example: Filter, map, reduce • Action – return values. Example : count, take(n)
  • 24. Resilient Distributed Datasets (RDDs) I have never seen the horror movies. I never hope to see one; But I can tell you, anyhow, I had rather see than be one. File: movie.txt RDD: mydata I have never seen the horror movies. I never hope to see one; But I can tell you, anyhow, I had rather see than be one.
  • 25. Resilient Distributed Datasets (RDDs) map and filter Transformation I have never seen the horror movies. I never hope to see one; But I can tell you, anyhow, I had rather see than be one. I HAVE NEVER SEEN THE HORROR MOVIES. I NEVER HOPE TO SEE ONE; BUT I CAN TELL YOU, ANYHOW, I HAD RATHER SEE THAN BE ONE. I HAVE NEVER SEEN THE HORROR MOVIES. I NEVER HOPE TO SEE ONE; I HAD RATHER SEE THAN BE ONE. Map(lambda line : line.upper()) Filter(lambda line: line.startswith(‘I’)) Map(line => line.toUpperCase()) Filter(line => line.startsWith(‘I’))
  • 26. Spark Stack • Spark SQL : --- For SQL and unstructured data processing • Spark Streaming : --- Stream processing of live data streams • MLib: --- For machine learning algorithm • GraphX: --- Graph processing
  • 27. Why Spark ?  Core engine with SQL, Streaming, machine learning and graph processing modules.  Can run today’s most advanced algorithms.  Alternative to Map Reduce for certain applications.  APIs in Java, Scala and Python  Interactive shells in Scala and Python  Runs on Yarn, Mesos and Standalone.
  • 28. Spark’s major use cases over Hadoop • Iterative Algorithms in Machine Learning • Interactive Data Mining and Data Processing • Spark is a fully Apache Hive-compatible data warehousing system that can run 100x faster than Hive. • Stream processing: Log processing and Fraud detection in live streams for alerts, aggregates and analysis • Sensor data processing: Where data is fetched and joined from multiple sources, in-memory dataset really helpful as they are easy and fast to process.
  • 33. Example : Page Rank A way of analyzing websites based on their link relationships • Good example of a more complex algorithm • Multiple stages of map & reduce • Benefits from Spark’s in-memory caching • Multiple iterations over the same data Basic Idea Give pages ranks (scores) based on links to them • Links from many pages  high rank • Link from a high-rank page  high rank
  • 34. PageRank Performance 171 80 23 14 0 20 40 60 80 100 120 140 160 180 200 30 60 ITERATIONTIME(S) NUMBER OF MACHINES Hadoop Spark NOTE : Less Iteration Time denotes high Performance
  • 35. Other Iterative Algorithms 0.96 110 0 25 50 75 100 125 Logistic Regression 4.1 155 0 30 60 90 120 150 180 K-Means Clustering Hadoop Spark TIME PER ITERATION(S) NOTE : Less Iteration Time denotes high Performance
  • 36. Spark Installation (For end-user side) Download Spark distribution from https://spark.apache.org/downloads.html which pre-build of hadoop 2.4 or later.
  • 37. Spark Installation Clone from apache https://github.com/apache/spark GitHub repository (For developer side)
  • 38. Spark Installation (continue) Build the source code using maven and hadoop <SPARK_HOME>#build/mvn –Pyarn –Phadoop –Phaddop-2.4 -Dhadoop.version=2.6.0
  • 39. How to run Spark ? (Standalone mode ) Once the build is completed. Go to your bin directory which is inside Spark home directory in a terminal and invoke Spark Shell <SPARK_HOME>/bin#./spark-shell
  • 40. To start all Spark’s Master and slave nodes: To execute following terminal inside sbin directory side spark home directory. <SPARK_HOME>/sbin#./start-all.sh
  • 41. Spark Master at Spark (Browser view): localhost:8080
  • 42. To stop all Spark’s Master and slave nodes: To execute following terminal inside sbin directory side spark home directory. <SPARK_HOME>/sbin#./stop-all.sh
  • 45. https://www.youtube.com/watch?v=JfqJTQnVZvA • IBM is making Spark available as a cloud service on its Bluemix cloud platform. • 3,500 IBM researchers and developers to work on Spark- related projects at more than a dozen labs worldwide.
  • 46. Demo - Spark RDD in Intellij IDEA