Workshop on Parallel, Cluster and
Cloud Computing on Multi-core & GPU
(PCCCMG - 2015)
Workshop Conducted by
Computer Society of India In Association with
Dept. of CSE, VNIT and
Persistent Systems Ltd, Nagpur
4th – 6th Sep’15
Big-Data Cluster Computing
Advanced tools & technologies
Jagadeesan A S
Software Engineer
Persistent Systems Limited
www.github.com/jagadeesanas2
www.linkedin.com/in/jagadeesanas2
Content
Overview of Big Data
• Data clustering concepts
• Clustering vs Classification
• Data Journey
Advanced tools and technologies
• Apache Hadoop
• Apache Spark
Future of analytics
• Demo - Spark RDD in Intellij IDEA
What is Big-Data?
Big-Data is similar to Small-Data, but bigger in size and complexity.
Definition from Wikipedia:
Big data is the term for a collection of data sets so large and complex that
it becomes difficult to process using on-hand database management tools
or traditional data processing applications.
Characterization of Big Data: the 4 V's
Volume, Velocity, Variety and Veracity
Now the big question:
Why do we need Big Data?
What do we do with all that data?
And the answer is very clear!
What is a Cluster?
A group of the same or similar elements gathered or occurring closely together.
Clustering is key to the Big Data problem
• It is not feasible to “label” a large collection of objects
• No prior knowledge of the number and nature of groups (clusters) in data
• Clusters may evolve over time
• Clustering provides efficient browsing, search, recommendation and organization of data
Difference between Clustering & Classification
Clustering data on
Clustering videos on
Clustering Algorithms
Hundreds of clustering algorithms are available. A few common ones (K-Means is sketched after this list):
• K-Means
• Kernel K-means
• Nearest neighbour
• Gaussian mixture
• Fuzzy Clustering
• OPTICS algorithm
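Of the algorithms above, K-Means is the easiest to illustrate. Below is a minimal, illustrative K-Means loop in plain Python; the data points, k and iteration count are made up for illustration, and a real workload would use a library such as Spark MLlib or scikit-learn.

import random

def kmeans(points, k, iterations=10):
    # Start from k randomly chosen points as the initial centroids.
    centroids = random.sample(points, k)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

print(kmeans([1.0, 1.2, 0.8, 5.0, 5.3, 9.7, 10.1, 10.4], k=3))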
Data Journey
Advanced Tools & Technologies
Large-Scale Data Analytics
MapReduce computing paradigm vs. Traditional database systems
Many enterprises have turned to Hadoop, especially for applications that generate big data: Web applications, social networks and scientific applications.
APACHE HADOOP (Disk Based Computing)
An open-source software framework, written in Java, for distributed storage and distributed processing of very large data sets.
Design Principles of Hadoop
• Need to process big data
• Need to parallelize computation across thousands of nodes
• Commodity hardware: a large number of low-end, cheap machines working in parallel to solve
a computing problem, rather than a small number of high-end, expensive machines
Hadoop cluster architecture
A Hadoop cluster can be divided into two abstract entities:
the MapReduce engine and the Hadoop Distributed File System (HDFS).
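To make the MapReduce side of that picture concrete, here is a hedged sketch of a word-count job for Hadoop Streaming written in Python; the script names mapper.py and reducer.py are illustrative, not from the slides. Hadoop runs many mapper copies in parallel across the cluster, sorts the intermediate key/value pairs by key, and streams them into the reducers.

# mapper.py -- emit (word, 1) for every word read from standard input
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

# reducer.py -- sum the counts per word (input arrives sorted by key)
import sys

current, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word == current:
        count += int(value)
    else:
        if current is not None:
            print("%s\t%d" % (current, count))
        current, count = word, int(value)
if current is not None:
    print("%s\t%d" % (current, count))

These scripts would typically be submitted with the Hadoop Streaming jar (hadoop jar hadoop-streaming-*.jar -mapper mapper.py -reducer reducer.py -input ... -output ...), with HDFS holding the input and output.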
What is Spark?
Why Spark?
How to configure Spark?
APACHE SPARK
Open-source cluster computing framework
APACHE SPARK (Memory Based Computing)
An open-source cluster computing engine, written primarily in Scala, for fast, large-scale distributed data processing.
• Fast cluster computing system for large-scale data processing
compatible with Apache Hadoop
• Improves efficiency through:
• In-memory computing primitives
• General computation graphs
• Improves usability through:
• Rich APIs in Java, Scala, Python
• Interactive shell
Up to 100× faster
Often 2-10× less code
Spark Overview
Spark Shell
• Interactive shell for learning or data exploration
• Python or Scala
• Provides a preconfigured Spark context called sc

Spark applications
• For large-scale data processing
• Python, Java, Scala and R
• Every Spark application requires a SparkContext; it is the main entry point to the Spark API

Scala interactive shell and Python interactive shell (screenshots)
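As a minimal sketch of the application side (the app name and master setting are placeholders, not taken from the slides), a standalone PySpark program creates its own SparkContext rather than relying on the shell's preconfigured sc:

from pyspark import SparkConf, SparkContext

# Configure and create the context -- the main entry point to the Spark API.
conf = SparkConf().setAppName("ExampleApp").setMaster("local[*]")
sc = SparkContext(conf=conf)

rdd = sc.parallelize(range(1, 1001))   # distribute a local collection
print(rdd.sum())                       # an action triggers the computation

sc.stop()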
Spark Overview
Resilient distributed datasets (RDDs)
• Immutable collections of objects spread across a cluster
• Built through parallel transformations (map, filter, etc.)
• Automatically rebuilt on failure
• Controllable persistence (e.g. caching in RAM) for reuse
• Shared variables that can be used in parallel operations
Work with distributed collections as we would with local ones
Resilient Distributed Datasets (RDDs)
Two types of RDD operation
• Transformations – define a new RDD based on the current one
  Examples: map, filter
• Actions – return a value to the driver
  Examples: count, take(n), reduce
Resilient Distributed Datasets (RDDs)
File: movie.txt
I have never seen the horror movies.
I never hope to see one;
But I can tell you, anyhow,
I had rather see than be one.

RDD: mydata (the same four lines, loaded as an RDD of strings)
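The step implied above, turning movie.txt into the RDD mydata, is a single line in the Spark shell (a sketch; the file path is illustrative):

mydata = sc.textFile("movie.txt")   # lazily creates an RDD of lines
mydata.count()                      # action: returns 4, the number of lines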
Resilient Distributed Datasets (RDDs)
map and filter Transformation
Original lines (mydata):
I have never seen the horror movies.
I never hope to see one;
But I can tell you, anyhow,
I had rather see than be one.

After map (upper-case every line):
I HAVE NEVER SEEN THE HORROR MOVIES.
I NEVER HOPE TO SEE ONE;
BUT I CAN TELL YOU, ANYHOW,
I HAD RATHER SEE THAN BE ONE.

After filter (keep only lines starting with 'I'):
I HAVE NEVER SEEN THE HORROR MOVIES.
I NEVER HOPE TO SEE ONE;
I HAD RATHER SEE THAN BE ONE.

Python:
map(lambda line: line.upper())
filter(lambda line: line.startswith('I'))

Scala:
map(line => line.toUpperCase())
filter(line => line.startsWith("I"))
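Chaining the two transformations and ending with an action gives a complete, runnable version of this example in PySpark (a sketch; the file path is illustrative):

mydata = sc.textFile("movie.txt")

result = (mydata
          .map(lambda line: line.upper())              # transformation
          .filter(lambda line: line.startswith("I")))  # transformation
result.cache()                # optionally keep the filtered RDD in memory for reuse

for line in result.collect(): # action: materializes the filtered lines
    print(line)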
Spark Stack
• Spark SQL: for SQL and structured data processing
• Spark Streaming: stream processing of live data streams
• MLlib: machine learning algorithms
• GraphX: graph processing
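All four modules run on the same core engine and share the same SparkContext. As a small sketch using the Spark 1.x API that was current at the time of this workshop (the JSON file and table name are illustrative), Spark SQL can be used alongside RDD code like so:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)                # built on top of the existing SparkContext
df = sqlContext.read.json("people.json")   # load structured data as a DataFrame
df.registerTempTable("people")
adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.show()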
Why Spark?
• Core engine with SQL, streaming, machine learning and graph processing modules
• Can run today's most advanced algorithms
• An alternative to MapReduce for certain applications
• APIs in Java, Scala and Python
• Interactive shells in Scala and Python
• Runs on YARN, Mesos, or in standalone mode
Spark’s major use cases over Hadoop
• Iterative Algorithms in Machine Learning
• Interactive Data Mining and Data Processing
• Data warehousing: Spark's Apache Hive-compatible SQL layer can run queries up to 100x faster than Hive
• Stream processing: log processing and fraud detection in live streams, for alerts, aggregates and analysis
• Sensor data processing: where data is fetched and joined from multiple sources, in-memory datasets are really helpful because they are easy and fast to process
MapReduce Example: Word Count
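The original slides walk through the word-count data flow in diagrams; expressed as Spark RDD code, the whole job is only a few lines (a sketch with illustrative input and output paths):

text = sc.textFile("hdfs:///input/books")          # illustrative input path

counts = (text
          .flatMap(lambda line: line.split())      # map: one record per word
          .map(lambda word: (word, 1))             # key/value pairs
          .reduceByKey(lambda a, b: a + b))        # reduce: sum counts per word

counts.saveAsTextFile("hdfs:///output/wordcounts") # illustrative output path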
Example : Page Rank
A way of analyzing websites based on their link relationships
• Good example of a more complex algorithm
• Multiple stages of map & reduce
• Benefits from Spark’s in-memory caching
• Multiple iterations over the same data
Basic Idea
Give pages ranks (scores) based on links to them
• Links from many pages → high rank
• A link from a high-rank page → high rank
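A hedged sketch of the iterative PageRank loop in PySpark, adapted from the well-known Spark example (the link data and iteration count are illustrative):

# links: (page, [pages it links to]); cached because it is reused every iteration.
links = sc.parallelize([
    ("a", ["b", "c"]), ("b", ["c"]), ("c", ["a"]), ("d", ["a", "c"]),
]).cache()
ranks = links.mapValues(lambda _: 1.0)     # start every page with rank 1.0

for _ in range(10):                        # illustrative number of iterations
    # Each page splits its current rank evenly among the pages it links to.
    contribs = links.join(ranks).flatMap(
        lambda kv: [(dest, kv[1][1] / len(kv[1][0])) for dest in kv[1][0]])
    # New rank = damped sum of incoming contributions.
    ranks = contribs.reduceByKey(lambda a, b: a + b) \
                    .mapValues(lambda total: 0.15 + 0.85 * total)

print(sorted(ranks.collect()))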
PageRank Performance
Iteration time in seconds for Hadoop vs. Spark (bar chart in the original slide):
• 30 machines: Hadoop 171 s, Spark 23 s
• 60 machines: Hadoop 80 s, Spark 14 s
NOTE: lower iteration time denotes higher performance
Other Iterative Algorithms
Time per iteration in seconds (bar charts in the original slide):
• Logistic Regression: Hadoop 110 s, Spark 0.96 s
• K-Means Clustering: Hadoop 155 s, Spark 4.1 s
NOTE: lower iteration time denotes higher performance
Spark Installation
(For end-user side)
Download a Spark distribution from https://spark.apache.org/downloads.html, choosing a package pre-built for Hadoop 2.4 or later.
Spark Installation
(For the developer side)
Clone the Apache Spark GitHub repository: https://github.com/apache/spark
Spark Installation (continue)
Build the source code with Maven for your Hadoop version:
<SPARK_HOME># build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -DskipTests clean package
How to run Spark? (Standalone mode)
Once the build has completed, go to the bin directory inside the Spark home directory in a terminal and invoke the Spark shell:
<SPARK_HOME>/bin#./spark-shell
To start the Spark master and all slave nodes, execute the following in a terminal from the sbin directory inside the Spark home directory:
<SPARK_HOME>/sbin#./start-all.sh
Spark Master web UI (browser view): http://localhost:8080
To stop the Spark master and all slave nodes, execute the following in a terminal from the sbin directory inside the Spark home directory:
<SPARK_HOME>/sbin#./stop-all.sh
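Once the master and workers are up, an application can be submitted to the standalone cluster with spark-submit (a sketch; the master URL, port and script name assume the default localhost setup):

<SPARK_HOME>/bin# ./spark-submit --master spark://localhost:7077 example_app.py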
Future of analytics
Analytics in the Cloud
https://www.youtube.com/watch?v=JfqJTQnVZvA
• IBM is making Spark available as a cloud service on its
Bluemix cloud platform.
• More than 3,500 IBM researchers and developers will work on Spark-related projects at more than a dozen labs worldwide.
Demo - Spark RDD in
Intellij IDEA