SlideShare a Scribd company logo
1 of 40
Download to read offline
Jump Start into 

Apache Spark

Seattle Spark Meetup – 1/12/2016
Denny Lee,
Technology Evangelist
Seattle Meetup
Join us at
Spark Summit East
February 16-18, 2016 | New York City
Code: SeattleMeetupEast for 20% Discount
Apply to the Academic
Partners Program
databricks.com/academic
Thanks!
March 28th – 31st
Code: UGSEASPRK to save 20%
Upcoming Sessions
Event Date Location
Seattle Scalability Meetup - Eastside event 1/27/2016 Eastside (Microsoft
Building 41)
Exploratory Analysis of Large Data with R and
Spark
2/10/2016 Seattle (Fred Hutchinson
Cancer Research Center)
SparkCLR and Kafka+Spark 2/25/2016 Eastside (Microsoft City
Center)
A Primer into Jupyter, Spark on HDInsight, and
Office 365 Analytics with Spark
3/9/2016 Eastside (Microsoft City
Center)
Jump into Spark Streaming 4/13/2016 TBD
About Me: Denny Lee
Technology Evangelist, Databricks (Working with Spark since v0.5)
Former:
• Senior Director of Data Sciences Engineering at Concur (now
part of SAP)
• Principal Program Manager at Microsoft
Hands-on Data Engineer, Architect more than 15y developer
internet-scale infrastructure for both on-premises and cloud
including Bing’s Audience Insights, Yahoo’s 24TB SSAS cube, and
Isotope Incubation Team (HDInsight)
We are Databricks, the company behind Spark
Founded by the creators of
Apache Spark in 2013
Share of Spark code
contributed by Databricks
in 2014
75%
8
Data Value
Created Databricks on top of Spark to make big data simple.
Spark Survey 2015
Highlights
Spark adoption is
growing rapidly
Spark use is growing
beyond Hadoop
Spark is increasing
access to big data
Spark Survey Report 2015 Highlights
TOP 3 APACHE SPARK TAKEAWAYS
Quick Start
Quick Start Using Python | Quick Start Using Scala
Quick Start with Python
textFile	=	sc.textFile("/mnt/tardis6/docs/README.md")	
textFile.count()
Quick Start with Scala
textFile	=	sc.textFile("/mnt/tardis6/docs/README.md")	
textFile.count()
RDDs
• RDDs have actions, which return values, and transformations, which return
pointers to new RDDs.
• Transformations are lazy and executed when an action is run
• Transformations: map(),	flatMap(),	filter(),	mapPartitions(),	mapPartitionsWithIndex(),	
sample(),	union(),	distinct(),	groupByKey(),	reduceByKey(),	sortByKey(),	join(),	cogroup(),	
pipe(),	coalesce(),	repartition(),	partitionBy(),	...
• Actions: reduce(),	collect(),	count(),	first(),	take(),	takeSample(),	takeOrdered(),	
saveAsTextFile(),	saveAsSequenceFile(),	saveAsObjectFile(),	countByKey(),	foreach(),	...
• Persist (cache) distributed data in memory or disk
Ad Tech Example
AdTech Sample Notebook (Part 1)
…
Apache Spark Engine
Spark Core
Spark
Streaming
Spark SQL MLlib GraphX
Unified engine across diverse workloads & environments
Scale out, fault tolerant
Python, Java, Scala, and R APIs
Standard libraries
Create External Table with RegEx
CREATE	EXTERNAL	TABLE	accesslog	(			
		ipaddress	STRING,	
		...	
)	
ROW	FORMAT	
		SERDE	'org.apache.hadoop.hive.serde2.RegexSerDe'	
WITH	SERDEPROPERTIES	(			
		"input.regex"	=	'^(S+)	(S+)	(S+)	[([w:/]+s[+-]d{4})]		"(
S+)	(S+)	(S+)"	(d{3})	(d+)	"(.*)"	"(.*)"	(S+)	"(S+),	
(S+),	(S+),	(S+)"’	
)	
LOCATION	
		"/mnt/mdl/accesslogs/"
External Web Service Call via Mapper
#	Obtain	the	unique	agents	from	the	accesslog	table	
ipaddresses	=	sqlContext.sql("select	distinct	ip1	from		
	accesslog	where	ip1	is	not	null").rdd	
#	getCCA2:	Obtains	two	letter	country	code	based	on	IP	address	
def	getCCA2(ip):			
		url	=	'http://freegeoip.net/csv/'	+	ip	
		str	=	urllib2.urlopen(url).read()			
		return	str.split(",")[1]	
#	Loop	through	distinct	IP	addresses	and	obtain	two-letter	country	codes	
mappedIPs	=	ipaddresses.map(lambda	x:	(x[0],	getCCA2(x[0])))
Join DataFrames and Register Temp Table
#	Join	countrycodes	with	mappedIPsDF	so	we	can	have	IP	address	and		
#	three-letter	ISO	country	codes	
mappedIP3	=	mappedIP2		
		.join(countryCodesDF,	mappedIP2.cca2	==	countryCodesDF.cca2,	"left_outer")		
		.select(mappedIP2.ip,	mappedIP2.cca2,	countryCodesDF.cca3,	countryCodesDF.cn)	
#	Register	the	mapping	table	
mappedIP3.registerTempTable("mappedIP3")
Add Columns to DataFrames with UDFs
from	user_agents	import	parse	
from	pyspark.sql.types	import	StringType	
from	pyspark.sql.functions	import	udf	
#	Create	UDFs	to	extract	out	Browser	Family	information	
def	browserFamily(ua_string)	:	return	xstr(parse(xstr(ua_string)).browser.family)	
udfBrowserFamily	=	udf(browserFamily,	StringType())	
#	Obtain	the	unique	agents	from	the	accesslog	table	
userAgentTbl	=	sqlContext.sql("select	distinct	agent	from	accesslog")	
#	Add	new	columns	to	the	UserAgentInfo	DataFrame	containing	browser	information	
userAgentInfo	=	userAgentTbl.withColumn('browserFamily',			
			udfBrowserFamily(userAgentTbl.agent))
Use Python UDFs with Spark SQL
#	Define	function	(converts	Apache	web	log	time)	
def	weblog2Time(weblog_timestr):	...	
#	Define	and	Register	UDF	
udfWeblog2Time	=	udf(weblog2Time,	DateType())	
sqlContext.registerFunction("udfWeblog2Time",	lambda	x:	weblog2Time(x))	
#	Create	DataFrame		
accessLogsPrime	=	sqlContext.sql("select	hash(a.ip1,	a.agent)	as	UserId,	
m.cca3,	udfWeblog2Time(a.datetime),...")udfWeblog2Time(a.datetime)
Spark API Performance
History of Spark APIs
RDD
(2011)
DataFrame
(2013)
Distribute collection
of JVM objects
Functional Operators (map,
filter, etc.)
Distribute collection
of Row objects
Expression-based operations
and UDFs
Logical plans and optimizer
Fast/efficient internal
representations
DataSet
(2015)
Internally rows, externally JVM
objects
“Best of both worlds”
type safe + fast
Benefit of Logical Plan:

Performance Parity Across Languages
SQL
R
Python
Java/Scala
Python
Java/Scala
Runtime for an example aggregation workload (secs)
0 2.5 5 7.5 10
DataFrame
RDD
NYC Taxi Dataset
Spark	1.3.1,	1.4,	and	1.5	for	9	queries
0
75
150
225
300
Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9
1.5Run	A1 1.5Run	A2 1.5Run	B1 1.5Run	B2 1.4Run	A1 1.4Run	A2 1.4Run	B1 1.4Run	B2 1.3.1Run	A1
1.3.1Run	A2 1.3.1Run	B1 1.3.1Run	B2
Dataset API in Spark 1.6
Typed interface over DataFrames / Tungsten
case	class	Person(name:	String,	age:	Long)

val	dataframe	=	read.json(“people.json”)	
val	ds:	Dataset[Person]	=	dataframe.as[Person]

ds.filter(p	=>	p.name.startsWith(“M”))	

		.toDF()

		.groupBy($“name”)

		.avg(“age”)
Dataset
“Encoder” converts from JVM Object into
a Dataset Row
Checkout [SPARK-9999]
JVM Object
Dataset
Row
encoder
Tungsten Execution
PythonSQL R Streaming
DataFrame (& Dataset)
Advanced
Analytics
References
• Spark DataFrames: Simple and Fast Analysis on Structured Data
[Michael Armbrust]
• Apache Spark 1.6 presented by Databricks co-founder Patrick
Wendell
• Announcing Spark 1.6
• Introducing Spark Datasets
• Spark SQL Data Sources API: Unified Data Access for the Spark
Platform
Join us at
Spark Summit East
February 16-18, 2016 | New York City
Apply to the Academic
Partners Program
databricks.com/academic
Thanks!

More Related Content

What's hot

Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityDatabricks
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkDatabricks
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIDatabricks
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks
 
Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)Databricks
 
How Concur uses Big Data to get you to Tableau Conference On Time
How Concur uses Big Data to get you to Tableau Conference On TimeHow Concur uses Big Data to get you to Tableau Conference On Time
How Concur uses Big Data to get you to Tableau Conference On TimeDenny Lee
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsMiklos Christine
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Databricks
 
Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Databricks
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Databricks
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksDatabricks
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesDatabricks
 
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...Databricks
 
Spark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonSpark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonDatabricks
 
Advanced Natural Language Processing with Apache Spark NLP
Advanced Natural Language Processing with Apache Spark NLPAdvanced Natural Language Processing with Apache Spark NLP
Advanced Natural Language Processing with Apache Spark NLPDatabricks
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
Spark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkSpark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkDatabricks
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
 

What's hot (20)

Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark community
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache Spark
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark API
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)
 
How Concur uses Big Data to get you to Tableau Conference On Time
How Concur uses Big Data to get you to Tableau Conference On TimeHow Concur uses Big Data to get you to Tableau Conference On Time
How Concur uses Big Data to get you to Tableau Conference On Time
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks
 
Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML Pipelines
 
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
 
Spark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonSpark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science London
 
Advanced Natural Language Processing with Apache Spark NLP
Advanced Natural Language Processing with Apache Spark NLPAdvanced Natural Language Processing with Apache Spark NLP
Advanced Natural Language Processing with Apache Spark NLP
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Spark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkSpark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with Spark
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 

Viewers also liked

Introduction to scala for a c programmer
Introduction to scala for a c programmerIntroduction to scala for a c programmer
Introduction to scala for a c programmerGirish Kumar A L
 
Introduction to Azure DocumentDB
Introduction to Azure DocumentDBIntroduction to Azure DocumentDB
Introduction to Azure DocumentDBDenny Lee
 
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...Databricks
 
Scala - A Scalable Language
Scala - A Scalable LanguageScala - A Scalable Language
Scala - A Scalable LanguageMario Gleichmann
 
Scala: Pattern matching, Concepts and Implementations
Scala: Pattern matching, Concepts and ImplementationsScala: Pattern matching, Concepts and Implementations
Scala: Pattern matching, Concepts and ImplementationsMICHRAFY MUSTAFA
 
Scala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in ScalaScala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in ScalaAlexander Dean
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebookragho
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopHakka Labs
 
Spark Meetup TensorFrames
Spark Meetup TensorFramesSpark Meetup TensorFrames
Spark Meetup TensorFramesJen Aman
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Casesnzhang
 
A Basic Hive Inspection
A Basic Hive InspectionA Basic Hive Inspection
A Basic Hive InspectionLinda Tillman
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big DataDataWorks Summit
 
Scala Talk at FOSDEM 2009
Scala Talk at FOSDEM 2009Scala Talk at FOSDEM 2009
Scala Talk at FOSDEM 2009Martin Odersky
 
Spark Sql for Training
Spark Sql for TrainingSpark Sql for Training
Spark Sql for TrainingBryan Yang
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesDatabricks
 

Viewers also liked (20)

Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012
 
Introduction to scala for a c programmer
Introduction to scala for a c programmerIntroduction to scala for a c programmer
Introduction to scala for a c programmer
 
Introduction to Azure DocumentDB
Introduction to Azure DocumentDBIntroduction to Azure DocumentDB
Introduction to Azure DocumentDB
 
Apache hive
Apache hiveApache hive
Apache hive
 
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
 
Python to scala
Python to scalaPython to scala
Python to scala
 
Scala - A Scalable Language
Scala - A Scalable LanguageScala - A Scalable Language
Scala - A Scalable Language
 
Indexed Hive
Indexed HiveIndexed Hive
Indexed Hive
 
Fun[ctional] spark with scala
Fun[ctional] spark with scalaFun[ctional] spark with scala
Fun[ctional] spark with scala
 
Scala: Pattern matching, Concepts and Implementations
Scala: Pattern matching, Concepts and ImplementationsScala: Pattern matching, Concepts and Implementations
Scala: Pattern matching, Concepts and Implementations
 
Scala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in ScalaScala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in Scala
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebook
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL Workshop
 
Spark Meetup TensorFrames
Spark Meetup TensorFramesSpark Meetup TensorFrames
Spark Meetup TensorFrames
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
 
A Basic Hive Inspection
A Basic Hive InspectionA Basic Hive Inspection
A Basic Hive Inspection
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big Data
 
Scala Talk at FOSDEM 2009
Scala Talk at FOSDEM 2009Scala Talk at FOSDEM 2009
Scala Talk at FOSDEM 2009
 
Spark Sql for Training
Spark Sql for TrainingSpark Sql for Training
Spark Sql for Training
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
 

Similar to Jump Start into Apache Spark (Seattle Spark Meetup)

Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"IT Event
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)Paul Chao
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...BigDataEverywhere
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyRohit Kulkarni
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax EnablementVincent Poncet
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introductionHektor Jacynycz García
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksDatabricks
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in SparkPaco Nathan
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQLYousun Jeong
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesPaco Nathan
 
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionChetan Khatri
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015dhiguero
 
Scrap Your MapReduce - Apache Spark
 Scrap Your MapReduce - Apache Spark Scrap Your MapReduce - Apache Spark
Scrap Your MapReduce - Apache SparkIndicThreads
 
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SFTed Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SFMLconf
 

Similar to Jump Start into Apache Spark (Seattle Spark Meetup) (20)

Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015
 
Scrap Your MapReduce - Apache Spark
 Scrap Your MapReduce - Apache Spark Scrap Your MapReduce - Apache Spark
Scrap Your MapReduce - Apache Spark
 
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SFTed Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
 

More from Denny Lee

Azure Cosmos DB: Globally Distributed Multi-Model Database Service
Azure Cosmos DB: Globally Distributed Multi-Model Database ServiceAzure Cosmos DB: Globally Distributed Multi-Model Database Service
Azure Cosmos DB: Globally Distributed Multi-Model Database ServiceDenny Lee
 
Spark to DocumentDB connector
Spark to DocumentDB connectorSpark to DocumentDB connector
Spark to DocumentDB connectorDenny Lee
 
SQL Server Integration Services Best Practices
SQL Server Integration Services Best PracticesSQL Server Integration Services Best Practices
SQL Server Integration Services Best PracticesDenny Lee
 
SQL Server Reporting Services: IT Best Practices
SQL Server Reporting Services: IT Best PracticesSQL Server Reporting Services: IT Best Practices
SQL Server Reporting Services: IT Best PracticesDenny Lee
 
Introduction to Microsoft's Big Data Platform and Hadoop Primer
Introduction to Microsoft's Big Data Platform and Hadoop PrimerIntroduction to Microsoft's Big Data Platform and Hadoop Primer
Introduction to Microsoft's Big Data Platform and Hadoop PrimerDenny Lee
 
Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)
Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)
Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)Denny Lee
 
Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together
Yahoo!, Big Data, and Microsoft BI: Bigger and Better TogetherYahoo!, Big Data, and Microsoft BI: Bigger and Better Together
Yahoo!, Big Data, and Microsoft BI: Bigger and Better TogetherDenny Lee
 
SQL Server Reporting Services Disaster Recovery webinar
SQL Server Reporting Services Disaster Recovery webinarSQL Server Reporting Services Disaster Recovery webinar
SQL Server Reporting Services Disaster Recovery webinarDenny Lee
 
Building and Deploying Large Scale SSRS using Lessons Learned from Customer D...
Building and Deploying Large Scale SSRS using Lessons Learned from Customer D...Building and Deploying Large Scale SSRS using Lessons Learned from Customer D...
Building and Deploying Large Scale SSRS using Lessons Learned from Customer D...Denny Lee
 
Designing, Building, and Maintaining Large Cubes using Lessons Learned
Designing, Building, and Maintaining Large Cubes using Lessons LearnedDesigning, Building, and Maintaining Large Cubes using Lessons Learned
Designing, Building, and Maintaining Large Cubes using Lessons LearnedDenny Lee
 
SQLCAT - Data and Admin Security
SQLCAT - Data and Admin SecuritySQLCAT - Data and Admin Security
SQLCAT - Data and Admin SecurityDenny Lee
 
SQLCAT: Addressing Security and Compliance Issues with SQL Server 2008
SQLCAT: Addressing Security and Compliance Issues with SQL Server 2008SQLCAT: Addressing Security and Compliance Issues with SQL Server 2008
SQLCAT: Addressing Security and Compliance Issues with SQL Server 2008Denny Lee
 
SQLCAT: A Preview to PowerPivot Server Best Practices
SQLCAT: A Preview to PowerPivot Server Best PracticesSQLCAT: A Preview to PowerPivot Server Best Practices
SQLCAT: A Preview to PowerPivot Server Best PracticesDenny Lee
 
Deploying and Managing PowerPivot for SharePoint
Deploying and Managing PowerPivot for SharePointDeploying and Managing PowerPivot for SharePoint
Deploying and Managing PowerPivot for SharePointDenny Lee
 
SQLCAT: Tier-1 BI in the World of Big Data
SQLCAT: Tier-1 BI in the World of Big DataSQLCAT: Tier-1 BI in the World of Big Data
SQLCAT: Tier-1 BI in the World of Big DataDenny Lee
 
Big Data, Bigger Brains
Big Data, Bigger BrainsBig Data, Bigger Brains
Big Data, Bigger BrainsDenny Lee
 
SQL Server Reporting Services Disaster Recovery Webinar
SQL Server Reporting Services Disaster Recovery WebinarSQL Server Reporting Services Disaster Recovery Webinar
SQL Server Reporting Services Disaster Recovery WebinarDenny Lee
 
Ensuring compliance of patient data with big data and bi [bdii 301-m] - (4078)
Ensuring compliance of patient data with big data and bi [bdii 301-m] - (4078)Ensuring compliance of patient data with big data and bi [bdii 301-m] - (4078)
Ensuring compliance of patient data with big data and bi [bdii 301-m] - (4078)Denny Lee
 
Yahoo! TAO Case Study Excerpt
Yahoo! TAO Case Study ExcerptYahoo! TAO Case Study Excerpt
Yahoo! TAO Case Study ExcerptDenny Lee
 
SQL Server Reporting Services: IT Best Practices
SQL Server Reporting Services: IT Best PracticesSQL Server Reporting Services: IT Best Practices
SQL Server Reporting Services: IT Best PracticesDenny Lee
 

More from Denny Lee (20)

Azure Cosmos DB: Globally Distributed Multi-Model Database Service
Azure Cosmos DB: Globally Distributed Multi-Model Database ServiceAzure Cosmos DB: Globally Distributed Multi-Model Database Service
Azure Cosmos DB: Globally Distributed Multi-Model Database Service
 
Spark to DocumentDB connector
Spark to DocumentDB connectorSpark to DocumentDB connector
Spark to DocumentDB connector
 
SQL Server Integration Services Best Practices
SQL Server Integration Services Best PracticesSQL Server Integration Services Best Practices
SQL Server Integration Services Best Practices
 
SQL Server Reporting Services: IT Best Practices
SQL Server Reporting Services: IT Best PracticesSQL Server Reporting Services: IT Best Practices
SQL Server Reporting Services: IT Best Practices
 
Introduction to Microsoft's Big Data Platform and Hadoop Primer
Introduction to Microsoft's Big Data Platform and Hadoop PrimerIntroduction to Microsoft's Big Data Platform and Hadoop Primer
Introduction to Microsoft's Big Data Platform and Hadoop Primer
 
Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)
Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)
Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)
 
Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together
Yahoo!, Big Data, and Microsoft BI: Bigger and Better TogetherYahoo!, Big Data, and Microsoft BI: Bigger and Better Together
Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together
 
SQL Server Reporting Services Disaster Recovery webinar
SQL Server Reporting Services Disaster Recovery webinarSQL Server Reporting Services Disaster Recovery webinar
SQL Server Reporting Services Disaster Recovery webinar
 
Building and Deploying Large Scale SSRS using Lessons Learned from Customer D...
Building and Deploying Large Scale SSRS using Lessons Learned from Customer D...Building and Deploying Large Scale SSRS using Lessons Learned from Customer D...
Building and Deploying Large Scale SSRS using Lessons Learned from Customer D...
 
Designing, Building, and Maintaining Large Cubes using Lessons Learned
Designing, Building, and Maintaining Large Cubes using Lessons LearnedDesigning, Building, and Maintaining Large Cubes using Lessons Learned
Designing, Building, and Maintaining Large Cubes using Lessons Learned
 
SQLCAT - Data and Admin Security
SQLCAT - Data and Admin SecuritySQLCAT - Data and Admin Security
SQLCAT - Data and Admin Security
 
SQLCAT: Addressing Security and Compliance Issues with SQL Server 2008
SQLCAT: Addressing Security and Compliance Issues with SQL Server 2008SQLCAT: Addressing Security and Compliance Issues with SQL Server 2008
SQLCAT: Addressing Security and Compliance Issues with SQL Server 2008
 
SQLCAT: A Preview to PowerPivot Server Best Practices
SQLCAT: A Preview to PowerPivot Server Best PracticesSQLCAT: A Preview to PowerPivot Server Best Practices
SQLCAT: A Preview to PowerPivot Server Best Practices
 
Deploying and Managing PowerPivot for SharePoint
Deploying and Managing PowerPivot for SharePointDeploying and Managing PowerPivot for SharePoint
Deploying and Managing PowerPivot for SharePoint
 
SQLCAT: Tier-1 BI in the World of Big Data
SQLCAT: Tier-1 BI in the World of Big DataSQLCAT: Tier-1 BI in the World of Big Data
SQLCAT: Tier-1 BI in the World of Big Data
 
Big Data, Bigger Brains
Big Data, Bigger BrainsBig Data, Bigger Brains
Big Data, Bigger Brains
 
SQL Server Reporting Services Disaster Recovery Webinar
SQL Server Reporting Services Disaster Recovery WebinarSQL Server Reporting Services Disaster Recovery Webinar
SQL Server Reporting Services Disaster Recovery Webinar
 
Ensuring compliance of patient data with big data and bi [bdii 301-m] - (4078)
Ensuring compliance of patient data with big data and bi [bdii 301-m] - (4078)Ensuring compliance of patient data with big data and bi [bdii 301-m] - (4078)
Ensuring compliance of patient data with big data and bi [bdii 301-m] - (4078)
 
Yahoo! TAO Case Study Excerpt
Yahoo! TAO Case Study ExcerptYahoo! TAO Case Study Excerpt
Yahoo! TAO Case Study Excerpt
 
SQL Server Reporting Services: IT Best Practices
SQL Server Reporting Services: IT Best PracticesSQL Server Reporting Services: IT Best Practices
SQL Server Reporting Services: IT Best Practices
 

Recently uploaded

(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escortsranjana rawat
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduitsrknatarajan
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Christo Ananth
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxpranjaldaimarysona
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSRajkumarAkumalla
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlysanyuktamishra911
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)simmis5
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordAsst.prof M.Gokilavani
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 

Recently uploaded (20)

(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 

Jump Start into Apache Spark (Seattle Spark Meetup)

  • 1. Jump Start into 
 Apache Spark
 Seattle Spark Meetup – 1/12/2016 Denny Lee, Technology Evangelist
  • 3. Join us at Spark Summit East February 16-18, 2016 | New York City Code: SeattleMeetupEast for 20% Discount
  • 4. Apply to the Academic Partners Program databricks.com/academic
  • 5. Thanks! March 28th – 31st Code: UGSEASPRK to save 20%
  • 6. Upcoming Sessions Event Date Location Seattle Scalability Meetup - Eastside event 1/27/2016 Eastside (Microsoft Building 41) Exploratory Analysis of Large Data with R and Spark 2/10/2016 Seattle (Fred Hutchinson Cancer Research Center) SparkCLR and Kafka+Spark 2/25/2016 Eastside (Microsoft City Center) A Primer into Jupyter, Spark on HDInsight, and Office 365 Analytics with Spark 3/9/2016 Eastside (Microsoft City Center) Jump into Spark Streaming 4/13/2016 TBD
  • 7. About Me: Denny Lee Technology Evangelist, Databricks (Working with Spark since v0.5) Former: • Senior Director of Data Sciences Engineering at Concur (now part of SAP) • Principal Program Manager at Microsoft Hands-on Data Engineer, Architect more than 15y developer internet-scale infrastructure for both on-premises and cloud including Bing’s Audience Insights, Yahoo’s 24TB SSAS cube, and Isotope Incubation Team (HDInsight)
  • 8. We are Databricks, the company behind Spark Founded by the creators of Apache Spark in 2013 Share of Spark code contributed by Databricks in 2014 75% 8 Data Value Created Databricks on top of Spark to make big data simple.
  • 10. Spark adoption is growing rapidly Spark use is growing beyond Hadoop Spark is increasing access to big data Spark Survey Report 2015 Highlights TOP 3 APACHE SPARK TAKEAWAYS
  • 11.
  • 12.
  • 13.
  • 14.
  • 15. Quick Start Quick Start Using Python | Quick Start Using Scala
  • 16. Quick Start with Python textFile = sc.textFile("/mnt/tardis6/docs/README.md") textFile.count()
  • 17. Quick Start with Scala textFile = sc.textFile("/mnt/tardis6/docs/README.md") textFile.count()
  • 18. RDDs • RDDs have actions, which return values, and transformations, which return pointers to new RDDs. • Transformations are lazy and executed when an action is run • Transformations: map(), flatMap(), filter(), mapPartitions(), mapPartitionsWithIndex(), sample(), union(), distinct(), groupByKey(), reduceByKey(), sortByKey(), join(), cogroup(), pipe(), coalesce(), repartition(), partitionBy(), ... • Actions: reduce(), collect(), count(), first(), take(), takeSample(), takeOrdered(), saveAsTextFile(), saveAsSequenceFile(), saveAsObjectFile(), countByKey(), foreach(), ... • Persist (cache) distributed data in memory or disk
  • 19. Ad Tech Example AdTech Sample Notebook (Part 1)
  • 20. … Apache Spark Engine Spark Core Spark Streaming Spark SQL MLlib GraphX Unified engine across diverse workloads & environments Scale out, fault tolerant Python, Java, Scala, and R APIs Standard libraries
  • 21. Create External Table with RegEx CREATE EXTERNAL TABLE accesslog ( ipaddress STRING, ... ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ( "input.regex" = '^(S+) (S+) (S+) [([w:/]+s[+-]d{4})] "( S+) (S+) (S+)" (d{3}) (d+) "(.*)" "(.*)" (S+) "(S+), (S+), (S+), (S+)"’ ) LOCATION "/mnt/mdl/accesslogs/"
  • 22. External Web Service Call via Mapper # Obtain the unique agents from the accesslog table ipaddresses = sqlContext.sql("select distinct ip1 from accesslog where ip1 is not null").rdd # getCCA2: Obtains two letter country code based on IP address def getCCA2(ip): url = 'http://freegeoip.net/csv/' + ip str = urllib2.urlopen(url).read() return str.split(",")[1] # Loop through distinct IP addresses and obtain two-letter country codes mappedIPs = ipaddresses.map(lambda x: (x[0], getCCA2(x[0])))
  • 23. Join DataFrames and Register Temp Table # Join countrycodes with mappedIPsDF so we can have IP address and # three-letter ISO country codes mappedIP3 = mappedIP2 .join(countryCodesDF, mappedIP2.cca2 == countryCodesDF.cca2, "left_outer") .select(mappedIP2.ip, mappedIP2.cca2, countryCodesDF.cca3, countryCodesDF.cn) # Register the mapping table mappedIP3.registerTempTable("mappedIP3")
  • 24. Add Columns to DataFrames with UDFs from user_agents import parse from pyspark.sql.types import StringType from pyspark.sql.functions import udf # Create UDFs to extract out Browser Family information def browserFamily(ua_string) : return xstr(parse(xstr(ua_string)).browser.family) udfBrowserFamily = udf(browserFamily, StringType()) # Obtain the unique agents from the accesslog table userAgentTbl = sqlContext.sql("select distinct agent from accesslog") # Add new columns to the UserAgentInfo DataFrame containing browser information userAgentInfo = userAgentTbl.withColumn('browserFamily', udfBrowserFamily(userAgentTbl.agent))
  • 25. Use Python UDFs with Spark SQL # Define function (converts Apache web log time) def weblog2Time(weblog_timestr): ... # Define and Register UDF udfWeblog2Time = udf(weblog2Time, DateType()) sqlContext.registerFunction("udfWeblog2Time", lambda x: weblog2Time(x)) # Create DataFrame accessLogsPrime = sqlContext.sql("select hash(a.ip1, a.agent) as UserId, m.cca3, udfWeblog2Time(a.datetime),...")udfWeblog2Time(a.datetime)
  • 26.
  • 27.
  • 28.
  • 29.
  • 31. History of Spark APIs RDD (2011) DataFrame (2013) Distribute collection of JVM objects Functional Operators (map, filter, etc.) Distribute collection of Row objects Expression-based operations and UDFs Logical plans and optimizer Fast/efficient internal representations DataSet (2015) Internally rows, externally JVM objects “Best of both worlds” type safe + fast
  • 32. Benefit of Logical Plan:
 Performance Parity Across Languages SQL R Python Java/Scala Python Java/Scala Runtime for an example aggregation workload (secs) 0 2.5 5 7.5 10 DataFrame RDD
  • 33. NYC Taxi Dataset Spark 1.3.1, 1.4, and 1.5 for 9 queries 0 75 150 225 300 Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 1.5Run A1 1.5Run A2 1.5Run B1 1.5Run B2 1.4Run A1 1.4Run A2 1.4Run B1 1.4Run B2 1.3.1Run A1 1.3.1Run A2 1.3.1Run B1 1.3.1Run B2
  • 34. Dataset API in Spark 1.6 Typed interface over DataFrames / Tungsten case class Person(name: String, age: Long)
 val dataframe = read.json(“people.json”) val ds: Dataset[Person] = dataframe.as[Person]
 ds.filter(p => p.name.startsWith(“M”)) 
 .toDF()
 .groupBy($“name”)
 .avg(“age”)
  • 35. Dataset “Encoder” converts from JVM Object into a Dataset Row Checkout [SPARK-9999] JVM Object Dataset Row encoder
  • 36. Tungsten Execution PythonSQL R Streaming DataFrame (& Dataset) Advanced Analytics
  • 37. References • Spark DataFrames: Simple and Fast Analysis on Structured Data [Michael Armbrust] • Apache Spark 1.6 presented by Databricks co-founder Patrick Wendell • Announcing Spark 1.6 • Introducing Spark Datasets • Spark SQL Data Sources API: Unified Data Access for the Spark Platform
  • 38. Join us at Spark Summit East February 16-18, 2016 | New York City
  • 39. Apply to the Academic Partners Program databricks.com/academic