SlideShare uma empresa Scribd logo
1 de 13
The 10 Apache Spark Features 
You (Unlikely) Didn't Hear About 
Roger Brinkley 
Technical Evangelist
The 10 Apache Stack Features You 
(Unlikely) Didn't Hear About 
• 10 minutes – 10 slides 
• Ignite Format 
• No stopping! 
• No going back! 
• Questions? Sure, but only if and until time 
remains on slide (otherwise, save for later) 
• Hire me, I’ll find 45 more
It’s Fast Really Fast 
• 10 - 100x faster than MapReduce 
• 10 – 100x faster than Hive 
• Historical perspective 
MapReduce is Listed as the Last Most 
Important Software Innovation 
– JRuby 2-3x Faster with InvokedDynamic JVM 
– Hardware rarely gets greater than 10x/year
It’s Pure Open Source 
• Commons-based Peer Production 
– Apache Software Foundation Top Level Project 
– 200 people from 50 OrganizationsContributing 
– 12 Organizations Committing 
– Peer Governance 
– Participative Decision Making 
The very essence of a free software 
consists in considering contributing 
roles as public trusts, bestowed for the 
good of the community, and not for 
the benefit of an individual or a party 
The very essence of a free government 
consists in considering offices as public 
trusts, bestowed for the good of the 
country, and not for the benefit of an 
individual or a party 
Modern John C. FOSS Calhoun John 2/C. 13/Calhoun 
1835
Strong Enterprise Relationships 
• Spark is in every major Hadoop distributor 
• Vertical enterprise use 
– Internet companies, government, financials 
– Churn analysis, fraud detection, risk analytics 
• Used in other data stores 
– Datastax (Cassandra) 
– MongoDB 
• Databricks has a cloud based implementation
Enhances Other Big Data 
Implementations 
• Hadoop – Replacement of Map Reduce 
• Cassandara – Analytics 
• Hive – Faster SQL processing 
• SAP Hana – Faster interactive analysis
API Stability 
• Guaranteed stability of its core API for 1.X 
• Spark has always been conservative with API 
changes 
• Clearly defined annotations for future APIs 
– Experimental 
– Alpha 
– Developer
Don’t Need to Learn a New Language 
• Scala 
• Java – 25% 
• Python – 30% 
• And soon R
Java 8 Lambda Support 
JavaRDD<String> String> lines = sc.textFile("lines hdfs://= sc.log.textFile("txt"); 
hdfs://log.txt"); 
JavaRDD<// Map each line String> to multiple words 
words = 
JavaRDD<lines.String> flatMap(words = lines.line flatMap( 
new FlatMapFunction<String, String>() -> Arrays.{ 
asList(line.split(" "))); 
JavaPairRDD<public Iterable<String> String, call(String Integer> line) { 
counts = 
words.mapToPair(w -> new Tuple2<String, Integer>(w, 1)) 
return Arrays.asList(line.split(" ")); 
} 
}); 
// Turn the words into (word, 1) pairs 
JavaPairRDD<String, Integer> ones = words.mapToPair( 
new PairFunction<String, String, Integer>() { 
public Tuple2<String, Integer> call(String w) { 
.reduceByKey((x, y) -> x + y); 
counts.saveAsTextFile("hdfs://counts.txt"); 
return new Tuple2<String, Integer>(w, 1); 
} 
}); 
// Group up and add the pairs by key to produce counts 
JavaPairRDD<String, Integer> counts = ones.reduceByKey( 
new Function2<Integer, Integer, Integer>() { 
public Integer call(Integer i1, Integer i2) { 
return i1 + i2; 
} 
}); 
counts.saveAsTextFile("hdfs://counts.txt");
Real Time Stream Process 
val ssc = new StreamingContext(args(0), 
"NetworkHashCount", Seconds(10), 
file = sc.textFile("hdfs://.../pagecounts-*.gz") 
val counts = file.flatMap(line => line.split(" ")) 
System.getenv("SPARK_HOME"), 
.map(word => (word, 1)) 
.reduceByKey(_ + _) 
Seq(System.getenv("SPARK_EXAMPLES_JAR"))) 
val lines = ssc.socketTextStream("localhost", 9999) 
val words = lines.flatMap(_.split(" 
")).filter(_.startsWith("#")) 
val wordCounts = words.map(x => (x, 
1)).reduceByKey(_ + _) 
wordCounts.print() 
ssc.start() 
counts.saveAsTextFile("hdfs://.../word-count")
Caching Interactive Algorithms 
val points = 
sc.textFile("...").map(parsePoint).cache() 
var w = Vector.random(D) //current separating 
plane 
for (i <- 1 to ITERATIONS) { 
val gradient = points.map(p => 
(1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x 
).reduce(_ + _) 
w -= gradient 
} 
println("Final separating plane: " + w)
New Security Integration 
• Complete Integration with Haddop/YARN Security 
Model 
– Authenticate Job Submissions 
– Securely transfer HDFS credentials 
– Authenticate communication between component 
• Other deployments supported 
val conf = new SparkConf 
conf.set("spark.authenticate", "true") 
conf.set("spark.authenticate.secret", "good")
And Lots More 
• Apache Spark Website 
• Databricks – making big data easy 
– Introduction to Apache Spark 
• Jul 28 – Austin, TX - More Info & Registration 
• Aug 25 – Chicago, IL - More Info & Registration

Mais conteúdo relacionado

Mais procurados

Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Spark Summit
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun JeongSpark Summit
 
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...Spark Summit
 
Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017Zhenxiao Luo
 
Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...
Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...
Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...Spark Summit
 
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...Alex Zeltov
 
Building Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta LakeBuilding Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta LakeDatabricks
 
Advanced Analytics and Big Data (August 2014)
Advanced Analytics and Big Data (August 2014)Advanced Analytics and Big Data (August 2014)
Advanced Analytics and Big Data (August 2014)Thomas W. Dinsmore
 
Hadoop at ayasdi
Hadoop at ayasdiHadoop at ayasdi
Hadoop at ayasdiMohit Jaggi
 
Building a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowBuilding a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowDremio Corporation
 
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...Spark Summit
 
Redshift Introduction
Redshift IntroductionRedshift Introduction
Redshift IntroductionDataKitchen
 
Spark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkSpark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkMatt Ingenthron
 
Building an ETL pipeline for Elasticsearch using Spark
Building an ETL pipeline for Elasticsearch using SparkBuilding an ETL pipeline for Elasticsearch using Spark
Building an ETL pipeline for Elasticsearch using SparkItai Yaffe
 
Building Big data solutions in Azure
Building Big data solutions in AzureBuilding Big data solutions in Azure
Building Big data solutions in AzureMostafa
 
Trends for Big Data and Apache Spark in 2017 by Matei Zaharia
Trends for Big Data and Apache Spark in 2017 by Matei ZahariaTrends for Big Data and Apache Spark in 2017 by Matei Zaharia
Trends for Big Data and Apache Spark in 2017 by Matei ZahariaSpark Summit
 
Introduction and HDInsight best practices
Introduction and HDInsight best practicesIntroduction and HDInsight best practices
Introduction and HDInsight best practicesAshish Thapliyal
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudQubole
 

Mais procurados (20)

Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun Jeong
 
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
 
Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017
 
Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...
Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...
Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...
 
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
 
Building Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta LakeBuilding Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta Lake
 
Advanced Analytics and Big Data (August 2014)
Advanced Analytics and Big Data (August 2014)Advanced Analytics and Big Data (August 2014)
Advanced Analytics and Big Data (August 2014)
 
Hadoop at ayasdi
Hadoop at ayasdiHadoop at ayasdi
Hadoop at ayasdi
 
Amazon EMR
Amazon EMRAmazon EMR
Amazon EMR
 
Building a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowBuilding a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache Arrow
 
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
 
Redshift Introduction
Redshift IntroductionRedshift Introduction
Redshift Introduction
 
Azure HDInsight
Azure HDInsightAzure HDInsight
Azure HDInsight
 
Spark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkSpark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with Spark
 
Building an ETL pipeline for Elasticsearch using Spark
Building an ETL pipeline for Elasticsearch using SparkBuilding an ETL pipeline for Elasticsearch using Spark
Building an ETL pipeline for Elasticsearch using Spark
 
Building Big data solutions in Azure
Building Big data solutions in AzureBuilding Big data solutions in Azure
Building Big data solutions in Azure
 
Trends for Big Data and Apache Spark in 2017 by Matei Zaharia
Trends for Big Data and Apache Spark in 2017 by Matei ZahariaTrends for Big Data and Apache Spark in 2017 by Matei Zaharia
Trends for Big Data and Apache Spark in 2017 by Matei Zaharia
 
Introduction and HDInsight best practices
Introduction and HDInsight best practicesIntroduction and HDInsight best practices
Introduction and HDInsight best practices
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public Cloud
 

Destaque

Test your english 1
Test your english 1Test your english 1
Test your english 1gomesr
 
IMCSummite 2016 Breakout - Nikita Ivanov - Apache Ignite 2.0 Towards a Conver...
IMCSummite 2016 Breakout - Nikita Ivanov - Apache Ignite 2.0 Towards a Conver...IMCSummite 2016 Breakout - Nikita Ivanov - Apache Ignite 2.0 Towards a Conver...
IMCSummite 2016 Breakout - Nikita Ivanov - Apache Ignite 2.0 Towards a Conver...In-Memory Computing Summit
 
Jeemain 2015 question-paper_solution
Jeemain 2015 question-paper_solutionJeemain 2015 question-paper_solution
Jeemain 2015 question-paper_solutionSahil Gaurav
 
Access grammar 3+
Access grammar 3+Access grammar 3+
Access grammar 3+OlgaYurchuk
 
Click on test_booklet_3_with_key
Click on test_booklet_3_with_keyClick on test_booklet_3_with_key
Click on test_booklet_3_with_keyTuvshin Tuvshuu
 
Click on 2 test booklet with keys
Click on 2   test booklet with keysClick on 2   test booklet with keys
Click on 2 test booklet with keysYulia Karpenko
 
Summative test in English 6
Summative test in English 6Summative test in English 6
Summative test in English 6Jerlyn Zara
 
JCConf 2016 - Cloud Computing Applications - Hazelcast, Spark and Ignite
JCConf 2016 - Cloud Computing Applications - Hazelcast, Spark and IgniteJCConf 2016 - Cloud Computing Applications - Hazelcast, Spark and Ignite
JCConf 2016 - Cloud Computing Applications - Hazelcast, Spark and IgniteJoseph Kuo
 
Fast, In-Memory SQL on Apache Cassandra with Apache Ignite (Rachel Pedreschi,...
Fast, In-Memory SQL on Apache Cassandra with Apache Ignite (Rachel Pedreschi,...Fast, In-Memory SQL on Apache Cassandra with Apache Ignite (Rachel Pedreschi,...
Fast, In-Memory SQL on Apache Cassandra with Apache Ignite (Rachel Pedreschi,...DataStax
 
IMC Summit 2016 Breakout - Roman Shtykh - Apache Ignite as a Data Processing Hub
IMC Summit 2016 Breakout - Roman Shtykh - Apache Ignite as a Data Processing HubIMC Summit 2016 Breakout - Roman Shtykh - Apache Ignite as a Data Processing Hub
IMC Summit 2016 Breakout - Roman Shtykh - Apache Ignite as a Data Processing HubIn-Memory Computing Summit
 
K to 12 ENGLISH Grade 2 (4th Quarter 1st Summative Test)
K to 12 ENGLISH Grade 2 (4th Quarter 1st Summative Test)K to 12 ENGLISH Grade 2 (4th Quarter 1st Summative Test)
K to 12 ENGLISH Grade 2 (4th Quarter 1st Summative Test)LiGhT ArOhL
 
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...Yahoo Developer Network
 

Destaque (14)

Test your english 1
Test your english 1Test your english 1
Test your english 1
 
IMCSummite 2016 Breakout - Nikita Ivanov - Apache Ignite 2.0 Towards a Conver...
IMCSummite 2016 Breakout - Nikita Ivanov - Apache Ignite 2.0 Towards a Conver...IMCSummite 2016 Breakout - Nikita Ivanov - Apache Ignite 2.0 Towards a Conver...
IMCSummite 2016 Breakout - Nikita Ivanov - Apache Ignite 2.0 Towards a Conver...
 
Jeemain 2015 question-paper_solution
Jeemain 2015 question-paper_solutionJeemain 2015 question-paper_solution
Jeemain 2015 question-paper_solution
 
Access grammar 3+
Access grammar 3+Access grammar 3+
Access grammar 3+
 
Access 1-test-booklet
Access 1-test-bookletAccess 1-test-booklet
Access 1-test-booklet
 
Click on test_booklet_3_with_key
Click on test_booklet_3_with_keyClick on test_booklet_3_with_key
Click on test_booklet_3_with_key
 
Click on 2 test booklet with keys
Click on 2   test booklet with keysClick on 2   test booklet with keys
Click on 2 test booklet with keys
 
Summative test in English 6
Summative test in English 6Summative test in English 6
Summative test in English 6
 
JCConf 2016 - Cloud Computing Applications - Hazelcast, Spark and Ignite
JCConf 2016 - Cloud Computing Applications - Hazelcast, Spark and IgniteJCConf 2016 - Cloud Computing Applications - Hazelcast, Spark and Ignite
JCConf 2016 - Cloud Computing Applications - Hazelcast, Spark and Ignite
 
Fast, In-Memory SQL on Apache Cassandra with Apache Ignite (Rachel Pedreschi,...
Fast, In-Memory SQL on Apache Cassandra with Apache Ignite (Rachel Pedreschi,...Fast, In-Memory SQL on Apache Cassandra with Apache Ignite (Rachel Pedreschi,...
Fast, In-Memory SQL on Apache Cassandra with Apache Ignite (Rachel Pedreschi,...
 
IMC Summit 2016 Breakout - Roman Shtykh - Apache Ignite as a Data Processing Hub
IMC Summit 2016 Breakout - Roman Shtykh - Apache Ignite as a Data Processing HubIMC Summit 2016 Breakout - Roman Shtykh - Apache Ignite as a Data Processing Hub
IMC Summit 2016 Breakout - Roman Shtykh - Apache Ignite as a Data Processing Hub
 
K to 12 ENGLISH Grade 2 (4th Quarter 1st Summative Test)
K to 12 ENGLISH Grade 2 (4th Quarter 1st Summative Test)K to 12 ENGLISH Grade 2 (4th Quarter 1st Summative Test)
K to 12 ENGLISH Grade 2 (4th Quarter 1st Summative Test)
 
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
 
Fce practice test (book 3)
Fce practice test (book 3)Fce practice test (book 3)
Fce practice test (book 3)
 

Semelhante a 10 Things About Spark

Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on HadoopMapR Technologies
 
Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?rhatr
 
OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"Giivee The
 
Real-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache SparkReal-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache SparkGuido Schmutz
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingTill Rohrmann
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosEuangelos Linardos
 
Bellevue Big Data meetup: Dive Deep into Spark Streaming
Bellevue Big Data meetup: Dive Deep into Spark StreamingBellevue Big Data meetup: Dive Deep into Spark Streaming
Bellevue Big Data meetup: Dive Deep into Spark StreamingSantosh Sahoo
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Uwe Printz
 
Introduction to the hadoop ecosystem by Uwe Seiler
Introduction to the hadoop ecosystem by Uwe SeilerIntroduction to the hadoop ecosystem by Uwe Seiler
Introduction to the hadoop ecosystem by Uwe SeilerCodemotion
 
Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Uwe Printz
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingHari Shreedharan
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Cloudera, Inc.
 

Semelhante a 10 Things About Spark (20)

Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 
963
963963
963
 
Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?
 
OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"
 
Apache Spark & Hadoop
Apache Spark & HadoopApache Spark & Hadoop
Apache Spark & Hadoop
 
Real-Time Analytics with Apache Cassandra and Apache Spark,
Real-Time Analytics with Apache Cassandra and Apache Spark,Real-Time Analytics with Apache Cassandra and Apache Spark,
Real-Time Analytics with Apache Cassandra and Apache Spark,
 
Real-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache SparkReal-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache Spark
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
Yahoo compares Storm and Spark
Yahoo compares Storm and SparkYahoo compares Storm and Spark
Yahoo compares Storm and Spark
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
 
Bellevue Big Data meetup: Dive Deep into Spark Streaming
Bellevue Big Data meetup: Dive Deep into Spark StreamingBellevue Big Data meetup: Dive Deep into Spark Streaming
Bellevue Big Data meetup: Dive Deep into Spark Streaming
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)
 
Introduction to the hadoop ecosystem by Uwe Seiler
Introduction to the hadoop ecosystem by Uwe SeilerIntroduction to the hadoop ecosystem by Uwe Seiler
Introduction to the hadoop ecosystem by Uwe Seiler
 
Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
 

Último

How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfCionsystems
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendArshad QA
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfkalichargn70th171
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 

Último (20)

How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdf
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and Backend
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 

10 Things About Spark

  • 1. The 10 Apache Spark Features You (Unlikely) Didn't Hear About Roger Brinkley Technical Evangelist
  • 2. The 10 Apache Stack Features You (Unlikely) Didn't Hear About • 10 minutes – 10 slides • Ignite Format • No stopping! • No going back! • Questions? Sure, but only if and until time remains on slide (otherwise, save for later) • Hire me, I’ll find 45 more
  • 3. It’s Fast Really Fast • 10 - 100x faster than MapReduce • 10 – 100x faster than Hive • Historical perspective MapReduce is Listed as the Last Most Important Software Innovation – JRuby 2-3x Faster with InvokedDynamic JVM – Hardware rarely gets greater than 10x/year
  • 4. It’s Pure Open Source • Commons-based Peer Production – Apache Software Foundation Top Level Project – 200 people from 50 OrganizationsContributing – 12 Organizations Committing – Peer Governance – Participative Decision Making The very essence of a free software consists in considering contributing roles as public trusts, bestowed for the good of the community, and not for the benefit of an individual or a party The very essence of a free government consists in considering offices as public trusts, bestowed for the good of the country, and not for the benefit of an individual or a party Modern John C. FOSS Calhoun John 2/C. 13/Calhoun 1835
  • 5. Strong Enterprise Relationships • Spark is in every major Hadoop distributor • Vertical enterprise use – Internet companies, government, financials – Churn analysis, fraud detection, risk analytics • Used in other data stores – Datastax (Cassandra) – MongoDB • Databricks has a cloud based implementation
  • 6. Enhances Other Big Data Implementations • Hadoop – Replacement of Map Reduce • Cassandara – Analytics • Hive – Faster SQL processing • SAP Hana – Faster interactive analysis
  • 7. API Stability • Guaranteed stability of its core API for 1.X • Spark has always been conservative with API changes • Clearly defined annotations for future APIs – Experimental – Alpha – Developer
  • 8. Don’t Need to Learn a New Language • Scala • Java – 25% • Python – 30% • And soon R
  • 9. Java 8 Lambda Support JavaRDD<String> String> lines = sc.textFile("lines hdfs://= sc.log.textFile("txt"); hdfs://log.txt"); JavaRDD<// Map each line String> to multiple words words = JavaRDD<lines.String> flatMap(words = lines.line flatMap( new FlatMapFunction<String, String>() -> Arrays.{ asList(line.split(" "))); JavaPairRDD<public Iterable<String> String, call(String Integer> line) { counts = words.mapToPair(w -> new Tuple2<String, Integer>(w, 1)) return Arrays.asList(line.split(" ")); } }); // Turn the words into (word, 1) pairs JavaPairRDD<String, Integer> ones = words.mapToPair( new PairFunction<String, String, Integer>() { public Tuple2<String, Integer> call(String w) { .reduceByKey((x, y) -> x + y); counts.saveAsTextFile("hdfs://counts.txt"); return new Tuple2<String, Integer>(w, 1); } }); // Group up and add the pairs by key to produce counts JavaPairRDD<String, Integer> counts = ones.reduceByKey( new Function2<Integer, Integer, Integer>() { public Integer call(Integer i1, Integer i2) { return i1 + i2; } }); counts.saveAsTextFile("hdfs://counts.txt");
  • 10. Real Time Stream Process val ssc = new StreamingContext(args(0), "NetworkHashCount", Seconds(10), file = sc.textFile("hdfs://.../pagecounts-*.gz") val counts = file.flatMap(line => line.split(" ")) System.getenv("SPARK_HOME"), .map(word => (word, 1)) .reduceByKey(_ + _) Seq(System.getenv("SPARK_EXAMPLES_JAR"))) val lines = ssc.socketTextStream("localhost", 9999) val words = lines.flatMap(_.split(" ")).filter(_.startsWith("#")) val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _) wordCounts.print() ssc.start() counts.saveAsTextFile("hdfs://.../word-count")
  • 11. Caching Interactive Algorithms val points = sc.textFile("...").map(parsePoint).cache() var w = Vector.random(D) //current separating plane for (i <- 1 to ITERATIONS) { val gradient = points.map(p => (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x ).reduce(_ + _) w -= gradient } println("Final separating plane: " + w)
  • 12. New Security Integration • Complete Integration with Haddop/YARN Security Model – Authenticate Job Submissions – Securely transfer HDFS credentials – Authenticate communication between component • Other deployments supported val conf = new SparkConf conf.set("spark.authenticate", "true") conf.set("spark.authenticate.secret", "good")
  • 13. And Lots More • Apache Spark Website • Databricks – making big data easy – Introduction to Apache Spark • Jul 28 – Austin, TX - More Info & Registration • Aug 25 – Chicago, IL - More Info & Registration

Notas do Editor

  1. There are lot of features that you probably don’t know about but you can find them at the Apache Spark Website or at Databricks, the company where a number of the leading conributors to Apache Spark work. Also be aware that Databricks offers an Introduction to Apache Spark with events coming up on July 28 in Austin and August 25 in Chicago.