IBM | spark.tc
Advanced Apache Spark Meetup
Spark SQL + DataFrames + Catalyst + Data Sources API
Chris Fregly, Principal Data Solutions Engineer
IBM Spark Technology Center
Oct 6, 2015
Power of data. Simplicity of design. Speed of innovation.
Meetup Housekeeping

Announcements
Steve Beier, Boss Man
IBM Spark Technology Center

CAP Theorem Adapted to Hiring
Parochial, Collaborative, Awesome
Spelling Bee Champion
First Chair, Chess Club
Math-lete, 1st Place

Who am I?
Streaming Data Engineer, Netflix Open Source Committer
Data Solutions Engineer, Apache Contributor
Principal Data Solutions Engineer, IBM Technology Center

Last Meetup (Spark Wins 100 TB Daytona GraySort)
On-disk only, in-memory caching disabled
sortbenchmark.org/ApacheSpark2014.pdf

Upcoming Advanced Apache Spark Meetups
  Project Tungsten Data Structs/Algos for CPU/Memory Optimization (Nov 12th, 2015)
  Text-based Advanced Analytics and Machine Learning (Jan 14th, 2016)
  ElasticSearch-Spark Connector w/ Costin Leau (Elastic.co) & Me (Feb 16th, 2016)
  Spark Internals Deep Dive (Mar 24th, 2016)
  Spark SQL Catalyst Optimizer Deep Dive (Apr 21st, 2016)

Meetup Metrics
Total Spark Experts: 1100+
Donations: $0
“Your money is no good here.” (Lloyd from The Shining, eek!)

Meetup Updates
Talking with other Spark Meetup Groups
Potential mergers and/or hostile takeovers!
New Sponsors!!
Connected with Organizer of Bangalore Spark Meetup
Madhukara Phatak (Technical Deep Dives)
We’re trying out new PowerPoint Animations
Please be patient!
We got our first Spam comment!

Constructive Criticism from Previous Attendees
“Chris, you’re like a fat version of an already-fat Erlich from Silicon Valley - except not funny.”
“Chris, your voice is so annoying that it keeps waking me up from sleep induced by your boring content.”

Recent Events
Cassandra Summit 2015
  Real-time Advanced Analytics w/ Spark & Cassandra
Strata NYC 2015
  Practical Data Science w/ Spark: Recommender Systems
Available on Slideshare
  http://slideshare.net/cfregly

Freg-a-palooza Upcoming World Tour
  London Spark Meetup (Oct 12th)
  Scotland Data Science Meetup (Oct 13th)
  Dublin Spark Meetup (Oct 15th)
  Barcelona Spark Meetup (Oct 20th)
  Madrid Spark Meetup (Oct 22nd)
  Paris Spark Summit (Oct 26th)
  Amsterdam Spark Summit (Oct 27th – Oct 29th)
  Delft Dutch Data Science Meetup (Oct 29th)
  Brussels Spark Meetup (Oct 30th)
  Zurich Big Data Developers Meetup (Nov 2nd)
High probability I’ll end up in jail or married?!
Spark SQL + DataFrames

Catalyst + Data Sources API

Topics of this Talk
  DataFrames
  Catalyst Optimizer and Query Plans
  Data Sources API
  Creating and Contributing Custom Data Source
  Partitions, Pruning, Pushdowns
  Native + Third-Party Data Source Impls
  Spark SQL Performance Tuning

DataFrames
Inspired by R and Pandas DataFrames
Cross language support
  SQL, Python, Scala, Java, R
Levels performance of Python, Scala, Java, and R
  Generates JVM bytecode vs serialize/pickle objects to Python
DataFrame is Container for Logical Plan
  Transformations are lazy and represented as a tree
  Catalyst Optimizer creates physical plan
  DataFrame.rdd returns the underlying RDD if needed
Custom UDF using registerFunction()
New, experimental UDAF support
Use DataFrames instead of RDDs!!

Catalyst Optimizer
Converts logical plan to physical plan
Manipulate & optimize DataFrame transformation tree
  Subquery elimination – use aliases to collapse subqueries
  Constant folding – replace expression with constant
  Simplify filters – remove unnecessary filters
  Predicate/filter pushdowns – avoid unnecessary data load
  Projection collapsing – avoid unnecessary projections
Hooks for custom rules
  Rules = Scala Case Classes
  val newPlan = MyFilterRule(analyzedPlan)
  Implements o.a.s.sql.catalyst.rules.Rule
  Apply to any plan stage
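
The idea of a rule that rewrites a tree, such as constant folding, can be sketched in a few lines of plain Scala. This is a toy model only: `Expr`, `Lit`, `Col`, `Add` and `ConstantFolding` are hypothetical names made up for illustration and are not Catalyst classes. The rule is invoked as a function on a tree, in the same style as `MyFilterRule(analyzedPlan)` above.

```scala
// Toy expression tree; illustrative types, not Spark's Catalyst classes.
sealed trait Expr
case class Lit(value: Int) extends Expr
case class Col(name: String) extends Expr
case class Add(left: Expr, right: Expr) extends Expr

object ConstantFolding {
  // Rewrite bottom-up: fold the children first, then collapse
  // Add(Lit, Lit) into a single Lit. Anything else is left as-is.
  def apply(e: Expr): Expr = e match {
    case Add(l, r) =>
      (apply(l), apply(r)) match {
        case (Lit(a), Lit(b)) => Lit(a + b)
        case (fl, fr)         => Add(fl, fr)
      }
    case other => other
  }
}
```

For example, `ConstantFolding(Add(Col("x"), Add(Lit(1), Lit(2))))` yields `Add(Col("x"), Lit(3))`: the constant subtree is replaced while the column reference is untouched.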

Plan Debugging
gendersCsvDF.select($"id", $"gender").filter("gender != 'F'").filter("gender != 'M'").explain(true)
Requires explain(true)
DataFrame.queryExecution.logical
DataFrame.queryExecution.analyzed
DataFrame.queryExecution.optimizedPlan
DataFrame.queryExecution.executedPlan

Plan Visualization & Join/Aggregation Metrics
Effectiveness of Filter
Cost-based Optimization is Applied
Peak Memory for Joins and Aggs
Optimized CPU-cache-aware Binary Format Minimizes GC & Improves Join Perf (Project Tungsten)
New in Spark 1.5!

Data Sources API
Relations (o.a.s.sql.sources.interfaces.scala)
  BaseRelation (abstract class): Provides schema of data
    TableScan (impl): Read all data from source, construct rows
    PrunedFilteredScan (impl): Read with column pruning & predicate pushdowns
    InsertableRelation (impl): Insert or overwrite data based on SaveMode enum
  RelationProvider (trait/interface): Handles user options, creates BaseRelation
Execution (o.a.s.sql.execution.commands.scala)
  RunnableCommand (trait/interface)
    ExplainCommand (impl: case class)
    CacheTableCommand (impl: case class)
Filters (o.a.s.sql.sources.filters.scala)
  Filter (abstract class for all filter pushdowns for this data source)
    EqualTo (impl)
    GreaterThan (impl)
    StringStartsWith (impl)

Creating a Custom Data Source
Study Existing Native and Third-Party Data Source Impls
Native: JDBC (o.a.s.sql.execution.datasources.jdbc)
  class JDBCRelation extends BaseRelation
    with PrunedFilteredScan
    with InsertableRelation
Third-Party: Cassandra (o.a.s.sql.cassandra)
  class CassandraSourceRelation extends BaseRelation
    with PrunedFilteredScan
    with InsertableRelation

Contributing a Custom Data Source
spark-packages.org
  Managed by
  Contains links to externally-managed github projects
  Ratings and comments
  Spark version requirements of each package
Examples
  https://github.com/databricks/spark-csv
  https://github.com/databricks/spark-avro
  https://github.com/databricks/spark-redshift
Partitions, Pruning, Pushdowns

Demo Dataset (from previous Spark After Dark talks)
RATINGS
========
UserID,ProfileID,Rating (1-10)
GENDERS
========
UserID,Gender (M,F,U)
Totally Anonymous

Partitions
Partition based on data usage patterns
  /genders.parquet/gender=M/…
                  /gender=F/…   <-- Use case: access users by gender
                  /gender=U/…
Partition Discovery
  On read, infer partitions from organization of data (e.g. gender=F)
Dynamic Partitions
  Upon insert, dynamically create partitions
  Specify field to use for each partition (e.g. gender)
  SQL: INSERT INTO TABLE genders PARTITION (gender) SELECT …
  DF: gendersDF.write.format("parquet").partitionBy("gender").save(…)
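
Partition discovery amounts to reading `key=value` directory names back out of a path. The sketch below is a plain-Scala approximation of that idea, not Spark's implementation; `PartitionDiscovery`, `partitionValues` and the file name `part-00000.parquet` are hypothetical, and it deliberately ignores edge cases such as file names that themselves contain `=`.

```scala
object PartitionDiscovery {
  // Return the key=value directory segments found in a path, in order.
  // Segments without '=' (or with an empty key) are skipped.
  def partitionValues(path: String): Seq[(String, String)] =
    path.split("/").toSeq.flatMap { segment =>
      segment.split("=", 2) match {
        case Array(key, value) if key.nonEmpty => Some(key -> value)
        case _                                 => None
      }
    }
}
```

Calling `PartitionDiscovery.partitionValues("/genders.parquet/gender=F/part-00000.parquet")` returns a single pair, `("gender", "F")`, which is the column value every row under that directory shares.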

Pruning
Partition Pruning
  Filter out entire partitions of rows on partitioned data
  SELECT id, gender FROM genders WHERE gender = 'U'
Column Pruning
  Filter out entire columns for all rows if not required
  Extremely useful for columnar storage formats
    Parquet, ORC
  SELECT id, gender FROM genders
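
Both kinds of pruning can be illustrated with an in-memory stand-in for a table partitioned by gender. This is a rough sketch under stated assumptions: `PruningSketch`, its `partitions` map and the three sample rows are hypothetical and exist only to show which data is touched; nothing here is Spark code.

```scala
object PruningSketch {
  type Row = Map[String, String]

  // Hypothetical data: each key plays the role of a gender=<value> directory.
  val partitions: Map[String, Seq[Row]] = Map(
    "M" -> Seq(Map("id" -> "1", "gender" -> "M")),
    "F" -> Seq(Map("id" -> "2", "gender" -> "F")),
    "U" -> Seq(Map("id" -> "3", "gender" -> "U"))
  )

  // Partition pruning: only the partition whose key matches is read.
  // Column pruning: only the requested columns are kept from each row.
  def scan(gender: String, columns: Seq[String]): Seq[Row] =
    partitions.getOrElse(gender, Seq.empty)
      .map(row => row.filter { case (k, _) => columns.contains(k) })
}
```

`PruningSketch.scan("U", Seq("id"))` looks only at the `U` partition and returns rows containing only the `id` column; the `M` and `F` partitions are never visited.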

Pushdowns
Predicate (aka Filter) Pushdowns
  Predicate returns {true, false} for a given function/condition
  Filters rows as deep into the data source as possible
  Data Source must implement PrunedFilteredScan
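
A source that accepts pushed-down filters applies them itself, so non-matching rows never leave it. The sketch below models that contract in plain Scala. The names `Filter`, `EqualTo`, `GreaterThan` and `buildScan` echo those mentioned in this talk, but the definitions here are simplified stand-ins written for illustration, and the sample rows are hypothetical.

```scala
// Simplified stand-ins for illustration; not Spark's classes.
sealed trait Filter
case class EqualTo(attribute: String, value: Int) extends Filter
case class GreaterThan(attribute: String, value: Int) extends Filter

object ToyRatingsSource {
  type Row = Map[String, Int]

  // Hypothetical rows shaped like the RATINGS demo dataset.
  private val data: Seq[Row] = Seq(
    Map("userId" -> 1, "profileId" -> 10, "rating" -> 3),
    Map("userId" -> 1, "profileId" -> 11, "rating" -> 9),
    Map("userId" -> 2, "profileId" -> 10, "rating" -> 7)
  )

  private def matches(row: Row, f: Filter): Boolean = f match {
    case EqualTo(a, v)     => row.get(a).contains(v)
    case GreaterThan(a, v) => row.get(a).exists(_ > v)
  }

  // The source applies every filter and the projection itself,
  // so the caller never sees non-matching rows or unused columns.
  def buildScan(requiredColumns: Seq[String], filters: Seq[Filter]): Seq[Row] =
    data
      .filter(row => filters.forall(f => matches(row, f)))
      .map(row => row.filter { case (k, _) => requiredColumns.contains(k) })
}
```

With `buildScan(Seq("profileId"), Seq(EqualTo("userId", 1), GreaterThan("rating", 5)))` only the second sample row survives, and only its `profileId` column is returned.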
Native Spark SQL Data Sources

Spark SQL Native Data Sources - Source Code

JSON Data Source
DataFrame
  val ratingsDF = sqlContext.read.format("json")
    .load("file:/root/pipeline/datasets/dating/ratings.json.bz2")
  -- or (convenience method) --
  val ratingsDF = sqlContext.read.json("file:/root/pipeline/datasets/dating/ratings.json.bz2")
SQL Code
  CREATE TABLE genders USING json
  OPTIONS
  (path "file:/root/pipeline/datasets/dating/genders.json.bz2")

JDBC Data Source
Add Driver to Spark JVM System Classpath
  $ export SPARK_CLASSPATH=<jdbc-driver.jar>
DataFrame
  val jdbcConfig = Map("driver" -> "org.postgresql.Driver",
    "url" -> "jdbc:postgresql://hostname:port/database",
    "dbtable" -> "schema.tablename")
  sqlContext.read.format("jdbc").options(jdbcConfig).load()
SQL
  CREATE TABLE genders USING jdbc
  OPTIONS (url, dbtable, driver, …)

Parquet Data Source
Configuration
  spark.sql.parquet.filterPushdown=true
  spark.sql.parquet.mergeSchema=true
  spark.sql.parquet.cacheMetadata=true
  spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo]
DataFrames
  val gendersDF = sqlContext.read.format("parquet")
    .load("file:/root/pipeline/datasets/dating/genders.parquet")
  gendersDF.write.format("parquet").partitionBy("gender")
    .save("file:/root/pipeline/datasets/dating/genders.parquet")
SQL
  CREATE TABLE genders USING parquet
  OPTIONS
  (path "file:/root/pipeline/datasets/dating/genders.parquet")

ORC Data Source
Configuration
  spark.sql.orc.filterPushdown=true
DataFrames
  val gendersDF = sqlContext.read.format("orc")
    .load("file:/root/pipeline/datasets/dating/genders")
  gendersDF.write.format("orc").partitionBy("gender")
    .save("file:/root/pipeline/datasets/dating/genders")
SQL
  CREATE TABLE genders USING orc
  OPTIONS
  (path "file:/root/pipeline/datasets/dating/genders")
Third-Party Data Sources

spark-packages.org

CSV Data Source (Databricks)
Github
  https://github.com/databricks/spark-csv
Maven
  com.databricks:spark-csv_2.10:1.2.0
Code
  val gendersCsvDF = sqlContext.read
    .format("com.databricks.spark.csv")
    .load("file:/root/pipeline/datasets/dating/gender.csv.bz2")
    .toDF("id", "gender") // toDF() defines column names

Avro Data Source (Databricks)
Github
  https://github.com/databricks/spark-avro
Maven
  com.databricks:spark-avro_2.10:2.0.1
Code
  val df = sqlContext.read
    .format("com.databricks.spark.avro")
    .load("file:/root/pipeline/datasets/dating/gender.avro")

ElasticSearch Data Source (Elastic.co)
Github
  https://github.com/elastic/elasticsearch-hadoop
Maven
  org.elasticsearch:elasticsearch-spark_2.10:2.1.0
Code
  val esConfig = Map("pushdown" -> "true", "es.nodes" -> "<hostname>",
    "es.port" -> "<port>")
  df.write.format("org.elasticsearch.spark.sql").mode(SaveMode.Overwrite)
    .options(esConfig).save("<index>/<document>")

Cassandra Data Source (DataStax)
Github
  https://github.com/datastax/spark-cassandra-connector
Maven
  com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1
Code
  ratingsDF.write
    .format("org.apache.spark.sql.cassandra")
    .mode(SaveMode.Append)
    .options(Map("keyspace" -> "<keyspace>",
      "table" -> "<table>")).save(…)

Cassandra Pushdown Rules
Determines which filter predicates can be pushed down to Cassandra.
  1. Only push down no-partition key column predicates with =, >, <, >=, <= predicate
  2. Only push down primary key column predicates with = or IN predicate.
  3. If there are regular columns in the pushdown predicates, they should have at least one EQ expression on an indexed column and no IN predicates.
  4. All partition column predicates must be included in the predicates to be pushed down, only the last part of the partition key can be an IN predicate. For each partition column, only one predicate is allowed.
  5. For cluster column predicates, only last predicate can be non-EQ predicate including IN predicate, and preceding column predicates must be EQ predicates. If there is only one cluster column predicate, the predicates could be any non-IN predicate.
  6. There is no pushdown predicates if there is any OR condition or NOT IN condition.
  7. We're not allowed to push down multiple predicates for the same column if any of them is equality or IN predicate.
spark-cassandra-connector/…/o.a.s.sql.cassandra.PredicatePushDown.scala
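
As a worked reading of rule 5 alone, the check on clustering-column predicates might look like the sketch below. It is a literal transcription of the rule as worded above and is not taken from the connector's source; `Op`, `EqOp`, `InOp`, `RangeOp` and `ClusteringRule` are hypothetical names, and the other six rules are not modelled.

```scala
// Hypothetical predicate model for illustration only.
sealed trait Op
case object EqOp extends Op
case object InOp extends Op
case object RangeOp extends Op // stands for >, <, >=, <=

object ClusteringRule {
  // preds: operators on clustering columns, in clustering-column order.
  def pushable(preds: Seq[Op]): Boolean = preds match {
    case Seq()       => true                        // nothing to check
    case Seq(single) => single != InOp              // lone predicate: any non-IN
    case many        => many.init.forall(_ == EqOp) // all but the last must be EQ
  }
}
```

Under this reading, `Seq(EqOp, RangeOp)` passes, while `Seq(RangeOp, EqOp)` fails because a non-EQ predicate precedes another predicate.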

Special Thanks to DataStax!!!
Russell Spitzer
@RussSpitzer
(He created the following few slides)
(These guys built a lot of the connector.)

Spark-Cassandra Architecture
Spark-Cassandra Data Locality
Spark-Cassandra Configuration: input.page.row.size
Spark-Cassandra Configuration: grouping.key
Spark-Cassandra Configuration: size.rows/bytes
Spark-Cassandra Configuration: batch.buffer.size
Spark-Cassandra Configuration: concurrent.writes
Spark-Cassandra Configuration: throughput_mb/s

Redshift Data Source (Databricks)
Github
  https://github.com/databricks/spark-redshift
Maven
  com.databricks:spark-redshift:0.5.0
Code
  val df: DataFrame = sqlContext.read
    .format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://<hostname>:<port>/<database>…")
    .option("query", "select x, count(*) from my_table group by x")
    .option("tempdir", "s3n://tmpdir")
    .load(...)
Copies to S3 for fast, parallel reads vs single Redshift Master bottleneck

Cloudant Data Source (IBM)
Github
  http://spark-packages.org/package/cloudant/spark-cloudant
Maven
  com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1
Code
  ratingsDF.write.format("com.cloudant.spark")
    .mode(SaveMode.Append)
    .options(Map("cloudant.host" -> "<account>.cloudant.com",
      "cloudant.username" -> "<username>",
      "cloudant.password" -> "<password>"))
    .save("<filename>")

DB2 and BigSQL Data Sources (IBM)
Coming Soon!
https://github.com/SparkTC/spark-db2
https://github.com/SparkTC/spark-bigsql

REST Data Source (Databricks)
Coming Soon!
https://github.com/databricks/spark-rest?
Michael Armbrust
Spark SQL Lead @ Databricks

SparkSQL Performance Tuning (o.a.s.sql.SQLConf)
spark.sql.inMemoryColumnarStorage.compressed=true
  Automatically selects column codec based on data
spark.sql.inMemoryColumnarStorage.batchSize
  Increase as much as possible without OOM – improves compression and GC
spark.sql.inMemoryPartitionPruning=true
  Enable partition pruning for in-memory partitions
spark.sql.tungsten.enabled=true
  Code Gen for CPU and Memory Optimizations (Tungsten aka Unsafe Mode)
spark.sql.shuffle.partitions
  Increase from default 200 for large joins and aggregations
spark.sql.autoBroadcastJoinThreshold
  Increase to tune this cost-based, physical plan optimization
spark.sql.hive.metastorePartitionPruning
  Predicate pushdown into the metastore to prune partitions early
spark.sql.planner.sortMergeJoin
  Prefer sort-merge (vs. hash join) for large joins
spark.sql.sources.partitionDiscovery.enabled & spark.sql.sources.parallelPartitionDiscovery.threshold
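
To make the effect of spark.sql.autoBroadcastJoinThreshold concrete, here is a deliberately simplified sketch of a size-based join choice. It assumes the Spark 1.5 default of 10 MB for the threshold; `BroadcastChoice` and `chooseJoin` are hypothetical names, and the real planner weighs more than a single size estimate.

```scala
object BroadcastChoice {
  // Assumed default for spark.sql.autoBroadcastJoinThreshold (10 MB).
  val defaultThresholdBytes: Long = 10L * 1024 * 1024

  // Simplified decision: broadcast the smaller side when its estimated
  // size fits under the threshold, otherwise fall back to a shuffle join.
  def chooseJoin(smallerSideBytes: Long,
                 thresholdBytes: Long = defaultThresholdBytes): String =
    if (smallerSideBytes <= thresholdBytes) "broadcast" else "shuffle"
}
```

Raising the threshold moves more joins onto the broadcast path: a 50 MB table is shuffled under the assumed default but broadcast once the threshold is set to 64 MB.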

Related Links
https://github.com/datastax/spark-cassandra-connector
http://blog.madhukaraphatak.com/anatomy-of-spark-dataframe-api/
https://github.com/phatak-dev/anatomy_of_spark_dataframe_api
https://databricks.com/blog/
https://www.youtube.com/watch?v=uxuLRiNoDio
http://www.slideshare.net/RussellSpitzer
@cfregly
IBM Spark Tech Center is Hiring!
Only Fun, Collaborative People - No Erlichs!
Sign up for our newsletter at
Thank You!
Power of data. Simplicity of design. Speed of innovation.
Power of data. Simplicity of design. Speed of innovation.
IBM Spark

Mais conteúdo relacionado

Mais procurados

Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015Chris Fregly
 
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...Chris Fregly
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...Chris Fregly
 
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016Chris Fregly
 
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...Chris Fregly
 
USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016
USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016
USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016Chris Fregly
 
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5Chris Fregly
 
Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...
Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...
Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...Chris Fregly
 
Helsinki Spark Meetup Nov 20 2015
Helsinki Spark Meetup Nov 20 2015Helsinki Spark Meetup Nov 20 2015
Helsinki Spark Meetup Nov 20 2015Chris Fregly
 
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015Chris Fregly
 
Toronto Spark Meetup Dec 14 2015
Toronto Spark Meetup Dec 14 2015Toronto Spark Meetup Dec 14 2015
Toronto Spark Meetup Dec 14 2015Chris Fregly
 
Copenhagen Spark Meetup Nov 25, 2015
Copenhagen Spark Meetup Nov 25, 2015Copenhagen Spark Meetup Nov 25, 2015
Copenhagen Spark Meetup Nov 25, 2015Chris Fregly
 
Boston Spark Meetup May 24, 2016
Boston Spark Meetup May 24, 2016Boston Spark Meetup May 24, 2016
Boston Spark Meetup May 24, 2016Chris Fregly
 
Sydney Spark Meetup Dec 08, 2015
Sydney Spark Meetup Dec 08, 2015Sydney Spark Meetup Dec 08, 2015
Sydney Spark Meetup Dec 08, 2015Chris Fregly
 
DC Spark Users Group March 15 2016 - Spark and Netflix Recommendations
DC Spark Users Group March 15 2016 - Spark and Netflix RecommendationsDC Spark Users Group March 15 2016 - Spark and Netflix Recommendations
DC Spark Users Group March 15 2016 - Spark and Netflix RecommendationsChris Fregly
 
Spark Summit East NYC Meetup 02-16-2016
Spark Summit East NYC Meetup 02-16-2016  Spark Summit East NYC Meetup 02-16-2016
Spark Summit East NYC Meetup 02-16-2016 Chris Fregly
 
Melbourne Spark Meetup Dec 09 2015
Melbourne Spark Meetup Dec 09 2015Melbourne Spark Meetup Dec 09 2015
Melbourne Spark Meetup Dec 09 2015Chris Fregly
 
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016Chris Fregly
 
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...Chris Fregly
 
Singapore Spark Meetup Dec 01 2015
Singapore Spark Meetup Dec 01 2015Singapore Spark Meetup Dec 01 2015
Singapore Spark Meetup Dec 01 2015Chris Fregly
 

Mais procurados (20)

Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
 
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cass...
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
 
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
 
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...
Spark, Similarity, Approximations, NLP, Recommendations - Boulder Denver Spar...
 
USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016
USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016USF Seminar Series:  Apache Spark, Machine Learning, Recommendations Feb 05 2016
USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016
 
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5
 
Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...
Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...
Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...
 
Helsinki Spark Meetup Nov 20 2015
Helsinki Spark Meetup Nov 20 2015Helsinki Spark Meetup Nov 20 2015
Helsinki Spark Meetup Nov 20 2015
 
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015
 
Toronto Spark Meetup Dec 14 2015
Toronto Spark Meetup Dec 14 2015Toronto Spark Meetup Dec 14 2015
Toronto Spark Meetup Dec 14 2015
 
Copenhagen Spark Meetup Nov 25, 2015
Copenhagen Spark Meetup Nov 25, 2015Copenhagen Spark Meetup Nov 25, 2015
Copenhagen Spark Meetup Nov 25, 2015
 
Boston Spark Meetup May 24, 2016
Boston Spark Meetup May 24, 2016Boston Spark Meetup May 24, 2016
Boston Spark Meetup May 24, 2016
 
Sydney Spark Meetup Dec 08, 2015
Sydney Spark Meetup Dec 08, 2015Sydney Spark Meetup Dec 08, 2015
Sydney Spark Meetup Dec 08, 2015
 
DC Spark Users Group March 15 2016 - Spark and Netflix Recommendations
DC Spark Users Group March 15 2016 - Spark and Netflix RecommendationsDC Spark Users Group March 15 2016 - Spark and Netflix Recommendations
DC Spark Users Group March 15 2016 - Spark and Netflix Recommendations
 
Spark Summit East NYC Meetup 02-16-2016
Spark Summit East NYC Meetup 02-16-2016  Spark Summit East NYC Meetup 02-16-2016
Spark Summit East NYC Meetup 02-16-2016
 
Melbourne Spark Meetup Dec 09 2015
Melbourne Spark Meetup Dec 09 2015Melbourne Spark Meetup Dec 09 2015
Melbourne Spark Meetup Dec 09 2015
 
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
 
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...
 
Singapore Spark Meetup Dec 01 2015
Singapore Spark Meetup Dec 01 2015Singapore Spark Meetup Dec 01 2015
Singapore Spark Meetup Dec 01 2015
 

Destaque

Introduction to vSphere APIs Using pyVmomi
Introduction to vSphere APIs Using pyVmomiIntroduction to vSphere APIs Using pyVmomi
Introduction to vSphere APIs Using pyVmomiMichael Rice
 
Explorez vos données avec apache zeppelin
Explorez vos données avec apache zeppelinExplorez vos données avec apache zeppelin
Explorez vos données avec apache zeppelinBruno Bonnin
 
Apache zeppelin the missing component for the big data ecosystem
Apache zeppelin the missing component for the big data ecosystemApache zeppelin the missing component for the big data ecosystem
Apache zeppelin the missing component for the big data ecosystemDuyhai Doan
 
CAS, OpenID, Shibboleth, SAML : concepts, différences et exemples
CAS, OpenID, Shibboleth, SAML : concepts, différences et exemplesCAS, OpenID, Shibboleth, SAML : concepts, différences et exemples
CAS, OpenID, Shibboleth, SAML : concepts, différences et exemplesClément OUDOT
 
Apache Zeppelin + Livy: Bringing Multi Tenancy to Interactive Data Analysis
Apache Zeppelin + Livy: Bringing Multi Tenancy to Interactive Data AnalysisApache Zeppelin + Livy: Bringing Multi Tenancy to Interactive Data Analysis
Apache Zeppelin + Livy: Bringing Multi Tenancy to Interactive Data AnalysisDataWorks Summit/Hadoop Summit
 
Seven Habits of Highly Effective Jenkins Users (2014 edition!)
Seven Habits of Highly Effective Jenkins Users (2014 edition!)Seven Habits of Highly Effective Jenkins Users (2014 edition!)
Seven Habits of Highly Effective Jenkins Users (2014 edition!)Andrew Bayer
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelinprajods
 

Destaque (8)

Introduction to vSphere APIs Using pyVmomi
Introduction to vSphere APIs Using pyVmomiIntroduction to vSphere APIs Using pyVmomi
Introduction to vSphere APIs Using pyVmomi
 
Explorez vos données avec apache zeppelin
Explorez vos données avec apache zeppelinExplorez vos données avec apache zeppelin
Explorez vos données avec apache zeppelin
 
Apache zeppelin the missing component for the big data ecosystem
Apache zeppelin the missing component for the big data ecosystemApache zeppelin the missing component for the big data ecosystem
Apache zeppelin the missing component for the big data ecosystem
 
Apache Zeppelin Helium and Beyond
Apache Zeppelin Helium and BeyondApache Zeppelin Helium and Beyond
Apache Zeppelin Helium and Beyond
 
CAS, OpenID, Shibboleth, SAML : concepts, différences et exemples
CAS, OpenID, Shibboleth, SAML : concepts, différences et exemplesCAS, OpenID, Shibboleth, SAML : concepts, différences et exemples
CAS, OpenID, Shibboleth, SAML : concepts, différences et exemples
 
Apache Zeppelin + Livy: Bringing Multi Tenancy to Interactive Data Analysis
Apache Zeppelin + Livy: Bringing Multi Tenancy to Interactive Data AnalysisApache Zeppelin + Livy: Bringing Multi Tenancy to Interactive Data Analysis
Apache Zeppelin + Livy: Bringing Multi Tenancy to Interactive Data Analysis
 
Seven Habits of Highly Effective Jenkins Users (2014 edition!)
Seven Habits of Highly Effective Jenkins Users (2014 edition!)Seven Habits of Highly Effective Jenkins Users (2014 edition!)
Seven Habits of Highly Effective Jenkins Users (2014 edition!)
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelin
 

Semelhante a Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark 1.5.1 Zeppelin 0.6.0

Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Chris Fregly
 
London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015Chris Fregly
 
How to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsHow to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsJulien Le Dem
 
Build a deep learning pipeline on apache spark for ads optimization
Build a deep learning pipeline on apache spark for ads optimizationBuild a deep learning pipeline on apache spark for ads optimization
Build a deep learning pipeline on apache spark for ads optimizationCraig Chao
 
Istanbul Spark Meetup Nov 28 2015
Istanbul Spark Meetup Nov 28 2015Istanbul Spark Meetup Nov 28 2015
Istanbul Spark Meetup Nov 28 2015Chris Fregly
 
Dallas DFW Data Science Meetup Jan 21 2016
Dallas DFW Data Science Meetup Jan 21 2016Dallas DFW Data Science Meetup Jan 21 2016
Dallas DFW Data Science Meetup Jan 21 2016Chris Fregly
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark 1.5.1 Zeppelin 0.6.0

  • 1. IBM | spark.tc Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst + Data Sources API Chris Fregly, Principal Data Solutions Engineer IBM Spark Technology Center Oct 6, 2015 Power of data. Simplicity of design. Speed of innovation.
  • 3. Announcements: Steve Beier, Boss Man, IBM Spark Technology Center.
  • 4. CAP Theorem Adapted to Hiring: Parochial, Collaborative, Awesome. Spelling Bee Champion; First Chair, Chess Club; Math-lete, 1st Place.
  • 5. Who am I? Streaming Data Engineer, Netflix Open Source Committer! Data Solutions Engineer, Apache Contributor. Principal Data Solutions Engineer, IBM Technology Center.
  • 6. Last Meetup (Spark Wins 100 TB Daytona GraySort): On-disk only, in-memory caching disabled! sortbenchmark.org/ApacheSpark2014.pdf
  • 7. Upcoming Advanced Apache Spark Meetups: Project Tungsten Data Structs/Algos for CPU/Memory Optimization (Nov 12th, 2015); Text-based Advanced Analytics and Machine Learning (Jan 14th, 2016); ElasticSearch-Spark Connector w/ Costin Leau (Elastic.co) & Me (Feb 16th, 2016); Spark Internals Deep Dive (Mar 24th, 2016); Spark SQL Catalyst Optimizer Deep Dive (Apr 21st, 2016).
  • 8. Meetup Metrics: Total Spark Experts: 1100+. Donations: $0. “Your money is no good here.” (Lloyd from The Shining) <--- eek!
  • 9. Meetup Updates: Talking with other Spark Meetup groups. Potential mergers and/or hostile takeovers! New sponsors! Connected with the organizer of the Bangalore Spark Meetup, Madhukara Phatak (technical deep dives). We’re trying out new PowerPoint animations. Please be patient! We got our first spam comment!
  • 10. Constructive Criticism from Previous Attendees: “Chris, you’re like a fat version of an already-fat Erlich from Silicon Valley - except not funny.” “Chris, your voice is so annoying that it keeps waking me up from sleep induced by your boring content.”
  • 11. Recent Events: Cassandra Summit 2015: Real-time Advanced Analytics w/ Spark & Cassandra. Strata NYC 2015: Practical Data Science w/ Spark: Recommender Systems. Available on Slideshare: http://slideshare.net/cfregly
  • 12. Freg-a-palooza Upcoming World Tour: London Spark Meetup (Oct 12th); Scotland Data Science Meetup (Oct 13th); Dublin Spark Meetup (Oct 15th); Barcelona Spark Meetup (Oct 20th); Madrid Spark Meetup (Oct 22nd); Paris Spark Summit (Oct 26th); Amsterdam Spark Summit (Oct 27th – Oct 29th); Delft Dutch Data Science Meetup (Oct 29th); Brussels Spark Meetup (Oct 30th); Zurich Big Data Developers Meetup (Nov 2nd). High probability I’ll end up in jail or married?!
  • 13. Spark SQL + DataFrames Catalyst + Data Sources API
  • 14. Topics of this Talk: DataFrames; Catalyst Optimizer and Query Plans; Data Sources API; Creating and Contributing Custom Data Source; Partitions, Pruning, Pushdowns; Native + Third-Party Data Source Impls; Spark SQL Performance Tuning.
  • 15. DataFrames: Inspired by R and Pandas DataFrames. Cross-language support: SQL, Python, Scala, Java, R. Levels performance of Python, Scala, Java, and R: generates JVM bytecode vs. serializing/pickling objects to Python. A DataFrame is a container for a logical plan: transformations are lazy and represented as a tree, and the Catalyst Optimizer creates the physical plan. DataFrame.rdd returns the underlying RDD if needed. Custom UDFs using registerFunction(). New, experimental UDAF support. Use DataFrames instead of RDDs!
  • 16. Catalyst Optimizer: Converts logical plan to physical plan. Manipulates & optimizes the DataFrame transformation tree: subquery elimination (use aliases to collapse subqueries); constant folding (replace expression with constant); simplify filters (remove unnecessary filters); predicate/filter pushdowns (avoid unnecessary data load); projection collapsing (avoid unnecessary projections). Hooks for custom rules: Rules = Scala case classes; val newPlan = MyFilterRule(analyzedPlan); implements o.a.s.sql.catalyst.rules.Rule; apply to any plan stage.
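The constant-folding idea on the slide above can be illustrated with a toy tree rewrite. This is only a sketch: the `Expr` and `Plan` types and the `foldConstants` function are hypothetical stand-ins written for illustration, and they are not Spark's Catalyst classes or its `Rule` interface.

```scala
// Toy model of a rule-based optimizer. All types here are hypothetical;
// they only mirror the idea of rewriting a plan tree, not Catalyst itself.
object ToyCatalyst {
  sealed trait Expr
  case class Lit(value: Int) extends Expr
  case class Col(name: String) extends Expr
  case class Add(left: Expr, right: Expr) extends Expr
  case class Gt(left: Expr, right: Expr) extends Expr

  sealed trait Plan
  case class Scan(table: String) extends Plan
  case class Filter(cond: Expr, child: Plan) extends Plan

  // Constant folding: an Add over two literals is replaced by its value.
  def fold(e: Expr): Expr = e match {
    case Add(l, r) =>
      (fold(l), fold(r)) match {
        case (Lit(a), Lit(b)) => Lit(a + b)
        case (fl, fr)         => Add(fl, fr)
      }
    case Gt(l, r) => Gt(fold(l), fold(r))
    case other    => other
  }

  // A "rule" in this sketch is a plain function from plan to plan
  // that walks the tree and rewrites the expressions it finds.
  def foldConstants(plan: Plan): Plan = plan match {
    case Filter(cond, child) => Filter(fold(cond), foldConstants(child))
    case scan: Scan          => scan
  }
}
```

Applied to a filter such as `rating > 2 + 3` over a scan, the rule returns the same tree with the condition rewritten to `rating > 5`, so the addition is not evaluated once per row.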
  • 17. Plan Debugging: gendersCsvDF.select($"id", $"gender").filter("gender != 'F'").filter("gender != 'M'").explain(true) Requires explain(true). Plan stages: DataFrame.queryExecution.logical; DataFrame.queryExecution.analyzed; DataFrame.queryExecution.optimizedPlan; DataFrame.queryExecution.executedPlan.
  • 18. Plan Visualization & Join/Aggregation Metrics: Effectiveness of Filter. Cost-based Optimization is Applied. Peak Memory for Joins and Aggs. Optimized CPU-cache-aware Binary Format Minimizes GC & Improves Join Perf (Project Tungsten). New in Spark 1.5!
  • 19. Data Sources API: Relations (o.a.s.sql.sources.interfaces.scala): BaseRelation (abstract class): provides schema of data; TableScan (impl): read all data from source, construct rows; PrunedFilteredScan (impl): read with column pruning & predicate pushdowns; InsertableRelation (impl): insert or overwrite data based on SaveMode enum; RelationProvider (trait/interface): handles user options, creates BaseRelation. Execution (o.a.s.sql.execution.commands.scala): RunnableCommand (trait/interface); ExplainCommand (impl: case class); CacheTableCommand (impl: case class). Filters (o.a.s.sql.sources.filters.scala): Filter (abstract class for all filter pushdowns for this data source); EqualTo (impl); GreaterThan (impl); StringStartsWith (impl).
  • 20. Creating a Custom Data Source: Study existing native and third-party data source impls. Native: JDBC (o.a.s.sql.execution.datasources.jdbc): class JDBCRelation extends BaseRelation with PrunedFilteredScan with InsertableRelation. Third-party: Cassandra (o.a.s.sql.cassandra): class CassandraSourceRelation extends BaseRelation with PrunedFilteredScan with InsertableRelation.
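The shape of a relation that supports column pruning and filter pushdown can be sketched without Spark on the classpath. Everything below is a simplified, hypothetical model: the trait and filter names are borrowed from the slides, but these are local definitions using `Seq` and `Map`, whereas Spark's own interfaces work with `Array` arguments and return an `RDD[Row]`.

```scala
// Simplified, hypothetical model of a pruned/filtered scan.
// These are local stand-ins, not the o.a.s.sql.sources types.
object ToyDataSource {
  sealed trait Filter
  case class EqualTo(attribute: String, value: Any) extends Filter
  case class GreaterThan(attribute: String, value: Int) extends Filter
  case class StringStartsWith(attribute: String, prefix: String) extends Filter

  type Row = Map[String, Any]

  trait BaseRelation { def schema: Seq[String] }

  trait PrunedFilteredScan {
    def buildScan(requiredColumns: Seq[String], filters: Seq[Filter]): Seq[Row]
  }

  class InMemoryRelation(rows: Seq[Row], val schema: Seq[String])
      extends BaseRelation with PrunedFilteredScan {

    // Evaluate one pushed-down filter against one row.
    private def matches(row: Row, filter: Filter): Boolean = filter match {
      case EqualTo(a, v) => row.get(a).contains(v)
      case GreaterThan(a, v) =>
        row.get(a).exists { case i: Int => i > v; case _ => false }
      case StringStartsWith(a, p) =>
        row.get(a).exists { case s: String => s.startsWith(p); case _ => false }
    }

    // Apply the filters inside the source, then return only the requested columns.
    def buildScan(requiredColumns: Seq[String], filters: Seq[Filter]): Seq[Row] =
      rows
        .filter(row => filters.forall(matches(row, _)))
        .map(row => row.filter { case (name, _) => requiredColumns.contains(name) })
  }
}
```

The point of the design is that both the filter and the projection happen inside `buildScan`, so rows and columns the query does not need never leave the source.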
  • 21. Contributing a Custom Data Source: spark-packages.org. Managed by [logo]. Contains links to externally-managed GitHub projects. Ratings and comments. Spark version requirements of each package. Examples: https://github.com/databricks/spark-csv, https://github.com/databricks/spark-avro, https://github.com/databricks/spark-redshift
  • 23. Demo Dataset (from previous Spark After Dark talks)
      RATINGS
      ========
      UserID,ProfileID,Rating (1-10)
      GENDERS
      ========
      UserID,Gender (M,F,U)
      <-- Totally Anonymous -->
  • 24. Partitions
      Partition based on data usage patterns
        /genders.parquet/gender=M/…
                        /gender=F/…   <-- Use case: access users by gender
                        /gender=U/…
      Partition Discovery
        On read, infer partitions from organization of data (i.e. gender=F)
      Dynamic Partitions
        Upon insert, dynamically create partitions
        Specify field to use for each partition (i.e. gender)
        SQL: INSERT INTO TABLE genders PARTITION (gender) SELECT …
        DF: gendersDF.write.format("parquet").partitionBy("gender").save(…)
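To make the `gender=F` directory convention concrete, here is a minimal stand-alone sketch in plain Scala (no Spark involved). `PartitionPaths`, `partitionValues` and `partitionDir` are hypothetical helpers written for this note, and the file name `part-00000.parquet` is made up; Spark's own partition discovery handles much more (type inference, nested layouts, validation).

```scala
object PartitionPaths {
  // Partition discovery, simplified: pull the key=value directory segments
  // out of a path, in the order they appear.
  def partitionValues(path: String): Seq[(String, String)] =
    path.split("/").toSeq.flatMap { segment =>
      segment.split("=", 2) match {
        case Array(key, value) if key.nonEmpty => Some(key -> value)
        case _                                 => None // not a key=value segment
      }
    }

  // Dynamic partitioning, simplified: compute the directory a row would be
  // written to, given the columns to partition by.
  def partitionDir(base: String, row: Map[String, String], partitionBy: Seq[String]): String =
    partitionBy.foldLeft(base)((dir, column) => s"$dir/$column=${row(column)}")
}
```

For example, `PartitionPaths.partitionValues("/genders.parquet/gender=F/part-00000.parquet")` recovers the pair `gender -> F`, and `PartitionPaths.partitionDir("/genders.parquet", Map("id" -> "1", "gender" -> "U"), Seq("gender"))` produces the `gender=U` directory from the slide.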
  • 25. Pruning
      Partition Pruning
        Filter out entire partitions of rows on partitioned data
        SELECT id, gender FROM genders WHERE gender = 'U'
      Column Pruning
        Filter out entire columns for all rows if not required
        Extremely useful for columnar storage formats (Parquet, ORC)
        SELECT id, gender FROM genders
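Both kinds of pruning can be illustrated without Spark. The following is a rough sketch in plain Scala under simplifying assumptions: rows are `Map[String, String]`, partitions are directory names, and `Pruning`, `prunePartitions` and `pruneColumns` are hypothetical names. The `age` column in the usage note is also invented for the example.

```scala
object Pruning {
  type Row = Map[String, String]

  // Partition pruning: keep only the directories whose key=value segment
  // matches the predicate, so the other partitions are never read.
  def prunePartitions(dirs: Seq[String], column: String, value: String): Seq[String] =
    dirs.filter(_.split("/").contains(s"$column=$value"))

  // Column pruning: keep only the columns the query actually needs.
  def pruneColumns(rows: Seq[Row], required: Seq[String]): Seq[Row] =
    rows.map(row => row.filter { case (name, _) => required.contains(name) })
}
```

Given the three `gender=M/F/U` directories, `Pruning.prunePartitions(dirs, "gender", "U")` keeps just the `gender=U` one, which is the effect of `WHERE gender = 'U'` on partitioned data. `Pruning.pruneColumns` applied to a row with `id`, `gender` and `age` and the required columns `id, gender` drops `age`; a columnar format gets the same result without ever reading the dropped column from disk.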
  • 26. Pushdowns
      Predicate (aka Filter) Pushdowns
        Predicate returns {true, false} for a given function/condition
        Filters rows as deep into the data source as possible
        Data Source must implement PrunedFilteredScan
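One way to picture a pushdown: the source receives the required columns plus a list of filters and turns whatever it can into its own query language. The sketch below does this for a SQL-speaking source in plain Scala. The filter case classes reuse the names from the Data Sources API slide but are local stand-ins defined here, not Spark's classes, and `Pushdown`, `compile` and `buildQuery` are hypothetical. Treating `StringStartsWith` as unsupported is an arbitrary choice for the example.

```scala
// Local stand-ins that mirror the filter names on the slide. NOT Spark's classes.
sealed trait SourceFilter
case class EqualTo(attribute: String, value: Any) extends SourceFilter
case class GreaterThan(attribute: String, value: Any) extends SourceFilter
case class StringStartsWith(attribute: String, prefix: String) extends SourceFilter

object Pushdown {
  // Quote string literals (doubling embedded single quotes); leave others as-is.
  private def literal(value: Any): String = value match {
    case s: String => "'" + s.replace("'", "''") + "'"
    case other     => other.toString
  }

  // Translate a filter into a SQL condition, or None if this (hypothetical)
  // source cannot evaluate it and the caller must filter afterwards.
  def compile(filter: SourceFilter): Option[String] = filter match {
    case EqualTo(attribute, value)     => Some(s"$attribute = ${literal(value)}")
    case GreaterThan(attribute, value) => Some(s"$attribute > ${literal(value)}")
    case _: StringStartsWith           => None
  }

  // Column pruning goes into the SELECT list, supported filters into WHERE.
  def buildQuery(table: String, requiredColumns: Seq[String], filters: Seq[SourceFilter]): String = {
    val conditions = filters.flatMap(f => compile(f))
    val where = if (conditions.isEmpty) "" else conditions.mkString(" WHERE ", " AND ", "")
    s"SELECT ${requiredColumns.mkString(", ")} FROM $table$where"
  }
}
```

With the filters `EqualTo("gender", "U")` and `GreaterThan("id", 100)`, `Pushdown.buildQuery("genders", Seq("id", "gender"), filters)` produces a query that selects only `id, gender` and carries both conditions in its WHERE clause. A filter that `compile` rejects is simply left out, so the rows it would have removed must be filtered after the scan.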
  • 27. Native Spark SQL Data Sources
  • 28. Spark SQL Native Data Sources - Source Code
  • 29. JSON Data Source
      DataFrame
        val ratingsDF = sqlContext.read.format("json")
          .load("file:/root/pipeline/datasets/dating/ratings.json.bz2")
        -- or --
        val ratingsDF = sqlContext.read.json("file:/root/pipeline/datasets/dating/ratings.json.bz2")   <-- Convenience Method
      SQL
        CREATE TABLE genders USING json
        OPTIONS (path "file:/root/pipeline/datasets/dating/genders.json.bz2")
  • 30. JDBC Data Source
      Add Driver to Spark JVM System Classpath
        $ export SPARK_CLASSPATH=<jdbc-driver.jar>
      DataFrame
        val jdbcConfig = Map("driver" -> "org.postgresql.Driver",
          "url" -> "jdbc:postgresql://hostname:port/database",
          "dbtable" -> "schema.tablename")
        sqlContext.read.format("jdbc").options(jdbcConfig).load()
      SQL
        CREATE TABLE genders USING jdbc
        OPTIONS (url, dbtable, driver, …)
  • 31. Parquet Data Source
      Configuration
        spark.sql.parquet.filterPushdown=true
        spark.sql.parquet.mergeSchema=true
        spark.sql.parquet.cacheMetadata=true
        spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo]
      DataFrames
        val gendersDF = sqlContext.read.format("parquet")
          .load("file:/root/pipeline/datasets/dating/genders.parquet")
        gendersDF.write.format("parquet").partitionBy("gender")
          .save("file:/root/pipeline/datasets/dating/genders.parquet")
      SQL
        CREATE TABLE genders USING parquet
        OPTIONS (path "file:/root/pipeline/datasets/dating/genders.parquet")
  • 32. ORC Data Source
      Configuration
        spark.sql.orc.filterPushdown=true
      DataFrames
        val gendersDF = sqlContext.read.format("orc")
          .load("file:/root/pipeline/datasets/dating/genders")
        gendersDF.write.format("orc").partitionBy("gender")
          .save("file:/root/pipeline/datasets/dating/genders")
      SQL
        CREATE TABLE genders USING orc
        OPTIONS (path "file:/root/pipeline/datasets/dating/genders")
  • 34. CSV Data Source (Databricks)
      Github
        https://github.com/databricks/spark-csv
      Maven
        com.databricks:spark-csv_2.10:1.2.0
      Code
        val gendersCsvDF = sqlContext.read
          .format("com.databricks.spark.csv")
          .load("file:/root/pipeline/datasets/dating/gender.csv.bz2")
          .toDF("id", "gender")   <-- toDF() defines column names
  • 35. Avro Data Source (Databricks)
      Github
        https://github.com/databricks/spark-avro
      Maven
        com.databricks:spark-avro_2.10:2.0.1
      Code
        val df = sqlContext.read
          .format("com.databricks.spark.avro")
          .load("file:/root/pipeline/datasets/dating/gender.avro")
  • 36. ElasticSearch Data Source (Elastic.co)
      Github
        https://github.com/elastic/elasticsearch-hadoop
      Maven
        org.elasticsearch:elasticsearch-spark_2.10:2.1.0
      Code
        val esConfig = Map("pushdown" -> "true", "es.nodes" -> "<hostname>",
          "es.port" -> "<port>")
        df.write.format("org.elasticsearch.spark.sql").mode(SaveMode.Overwrite)
          .options(esConfig).save("<index>/<document>")
  • 37. Cassandra Data Source (DataStax)
      Github
        https://github.com/datastax/spark-cassandra-connector
      Maven
        com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1
      Code
        ratingsDF.write
          .format("org.apache.spark.sql.cassandra")
          .mode(SaveMode.Append)
          .options(Map("keyspace"->"<keyspace>", "table"->"<table>")).save(…)
  • 38. Cassandra Pushdown Rules
      Determines which filter predicates can be pushed down to Cassandra.
        1. Only push down no-partition key column predicates with =, >, <, >=, <= predicate
        2. Only push down primary key column predicates with = or IN predicate.
        3. If there are regular columns in the pushdown predicates, they should have
           at least one EQ expression on an indexed column and no IN predicates.
        4. All partition column predicates must be included in the predicates to be pushed down,
           only the last part of the partition key can be an IN predicate. For each partition column,
           only one predicate is allowed.
        5. For cluster column predicates, only last predicate can be non-EQ predicate
           including IN predicate, and preceding column predicates must be EQ predicates.
           If there is only one cluster column predicate, the predicates could be any non-IN predicate.
        6. There is no pushdown predicates if there is any OR condition or NOT IN condition.
        7. We're not allowed to push down multiple predicates for the same column if any of them
           is equality or IN predicate.
      Source: spark-cassandra-connector/…/o.a.s.sql.cassandra.PredicatePushDown.scala
  • 39. Special Thanks to DataStax!!!
      Russell Spitzer
      @RussSpitzer
      (He created the following few slides)
      (These guys built a lot of the connector.)
  • 42. Spark-Cassandra Configuration: input.page.row.size
  • 43. Spark-Cassandra Configuration: grouping.key
  • 44. Spark-Cassandra Configuration: size.rows/bytes
  • 45. Spark-Cassandra Configuration: batch.buffer.size
  • 46. Spark-Cassandra Configuration: concurrent.writes
  • 47. Spark-Cassandra Configuration: throughput_mb/s
  • 48. Redshift Data Source (Databricks)
      Github
        https://github.com/databricks/spark-redshift
      Maven
        com.databricks:spark-redshift:0.5.0
      Code
        val df: DataFrame = sqlContext.read
          .format("com.databricks.spark.redshift")
          .option("url", "jdbc:redshift://<hostname>:<port>/<database>…")
          .option("query", "select x, count(*) from my_table group by x")
          .option("tempdir", "s3n://tmpdir")
          .load(...)
      Copies to S3 for fast, parallel reads vs single Redshift Master bottleneck
  • 49. Cloudant Data Source (IBM)
      Spark Packages
        http://spark-packages.org/package/cloudant/spark-cloudant
      Maven
        The slide repeats the Cassandra connector coordinates here (com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1); see the package page above for the Cloudant coordinates.
      Code
        ratingsDF.write.format("com.cloudant.spark")
          .mode(SaveMode.Append)
          .options(Map("cloudant.host"->"<account>.cloudant.com",
            "cloudant.username"->"<username>",
            "cloudant.password"->"<password>"))
          .save("<filename>")
  • 50. DB2 and BigSQL Data Sources (IBM)
      Coming Soon!
        https://github.com/SparkTC/spark-db2
        https://github.com/SparkTC/spark-bigsql
  • 51. REST Data Source (Databricks)
      Coming Soon!
        https://github.com/databricks/spark-rest ?
      Michael Armbrust, Spark SQL Lead @ Databricks
  • 52. SparkSQL Performance Tuning (o.a.s.sql.SQLConf)
      spark.sql.inMemoryColumnarStorage.compressed=true
        Automatically selects column codec based on data
      spark.sql.inMemoryColumnarStorage.batchSize
        Increase as much as possible without OOM; improves compression and GC
      spark.sql.inMemoryColumnarStorage.partitionPruning=true
        Enable partition pruning for in-memory partitions
      spark.sql.tungsten.enabled=true
        Code Gen for CPU and Memory Optimizations (Tungsten aka Unsafe Mode)
      spark.sql.shuffle.partitions
        Increase from default 200 for large joins and aggregations
      spark.sql.autoBroadcastJoinThreshold
        Increase to tune this cost-based, physical plan optimization
      spark.sql.hive.metastorePartitionPruning
        Predicate pushdown into the metastore to prune partitions early
      spark.sql.planner.sortMergeJoin
        Prefer sort-merge (vs. hash join) for large joins
      spark.sql.sources.partitionDiscovery.enabled
        & spark.sql.sources.parallelPartitionDiscovery.threshold
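These settings can be passed at launch time as `--conf key=value` pairs on `spark-submit`. The small helper below, in plain Scala, shows one way to keep them in one place and render the flags. `TuningFlags` and `toSubmitArgs` are hypothetical names, and the values are illustrative assumptions only, not recommendations from the talk.

```scala
object TuningFlags {
  // Illustrative values only (assumptions); tune against your own workload.
  val settings: Seq[(String, String)] = Seq(
    "spark.sql.shuffle.partitions"                -> "400",      // up from the default 200
    "spark.sql.autoBroadcastJoinThreshold"        -> "52428800", // in bytes (50 MB)
    "spark.sql.inMemoryColumnarStorage.batchSize" -> "20000"
  )

  // Render each setting as a "--conf key=value" pair for spark-submit.
  def toSubmitArgs(conf: Seq[(String, String)]): Seq[String] =
    conf.flatMap { case (key, value) => Seq("--conf", s"$key=$value") }
}
```

`TuningFlags.toSubmitArgs(TuningFlags.settings).mkString(" ")` gives a flag string that can be appended to a `spark-submit` command line, which makes it easy to compare runs with different values.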
  • 53. Related Links
      https://github.com/datastax/spark-cassandra-connector
      http://blog.madhukaraphatak.com/anatomy-of-spark-dataframe-api/
      https://github.com/phatak-dev/anatomy_of_spark_dataframe_api
      https://databricks.com/blog/
      https://www.youtube.com/watch?v=uxuLRiNoDio
      http://www.slideshare.net/RussellSpitzer
  • 54. Thank You!
      @cfregly
      IBM Spark Tech Center is Hiring! Only Fun, Collaborative People - No Erlichs!
      IBM | spark.tc
      Sign up for our newsletter at
  • 55. Power of data. Simplicity of design. Speed of innovation. IBM Spark