SlideShare uma empresa Scribd logo
1 de 33
Baixar para ler offline
Jean Georges Perrin, @jgperrin, Oplo
Extending Spark’s
Ingestion: Build Your Own
Java Data Source
#EUdev6
2#EUdev6
JGP
@jgperrin
Jean Georges Perrin
Chapel Hill, NC, USA
I 🏗SW
#Knowledge =
𝑓 ( ∑ (#SmallData, #BigData), #DataScience)
& #Software
#IBMChampion x9 #KeepLearning
@ http://jgp.net
2#EUdev6
I💚#
A Little Help
Hello Ruby,
Nathaniel,
Jack,
and PN!
3@jgperrin - #EUdev6
And What About You?
• Who is a Java programmer?
• Who swears by Scala?
• Who has tried to build a custom data source?
• In Java?
• Who has succeeded?
4@jgperrin - #EUdev6
My Problem…
5@jgperrin - #EUdev6
CSV
JSON
RDBMS REST
Other
Solution #1
6@jgperrin - #EUdev6
• Look in the Spark Packages library
• https://spark-packages.org/
• 48 data sources listed (some dups)
• Some with Source Code
package
net.jgp.labs.spark.datasources.l100_
photo_datasource;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import
org.apache.spark.sql.SparkSession;
public class
PhotoMetadataIngestionApp {
public static void main(String[]
args) {
PhotoMetadataIngestionApp
app = new
PhotoMetadataIngestionApp();
app.start();
}
private boolean start() {
SparkSession spark =
SparkSession.builder()
.appName("EXIF to
Dataset")
.master("local[*]").
getOrCreate();
String importDirectory =
"/Users/jgp/Pictures/All
Photos/2010-2019/2016";
Solution #2
7@jgperrin - #EUdev6
• Write it yourself
• In Java
What is ingestion anyway?
• Loading
– CSV, JSON, RDBMS…
– Basically any data
• Getting a Dataframe (Dataset<Row>) with your
data
8@jgperrin - #EUdev6
9@jgperrin - #EUdev6
Our lab
Import some of
the metadata
of our photos
and start
analysis…
Code
Dataset<Row> df = spark.read()
.format("exif")
.option("recursive", "true")
.option("limit", "80000")
.option("extensions", "jpg,jpeg")
.load(importDirectory);
#EUdev6
Spark Session
Short name for
your data source
Options
Where to start
/jgperrin/net.jgp.labs.spark.datasources
Short name
• Optional: can specify a full class name
• An alias to a class name
• Needs to be registered
• Not recommended during development phase
11@jgperrin - #EUdev6
Project
12#EUdev6
Information for
registration of
the data source
Data source’s
short name
Data source
The app
Relation
Utilities
An existing
library you can
embed/link to
/jgperrin
/net.jgp.labs.spark.datasources
Library code
• A library you already use, you developed or
acquired
• No need for anything special
13@jgperrin - #EUdev6
/jgperrin
/net.jgp.labs.spark.datasources
Scala 2 Java
conversion
(sorry!) – All the
options
Data Source Code
14@jgperrin - #EUdev6
Provides a
relation
Needed by
Spark
Needs to be
passed to the
relation
public class ExifDirectoryDataSource implements RelationProvider {
@Override
public BaseRelation createRelation(
SQLContext sqlContext,
Map<String, String> params) {
java.util.Map<String, String> javaMap =
mapAsJavaMapConverter(params).asJava();
ExifDirectoryRelation br = new ExifDirectoryRelation();
br.setSqlContext(sqlContext);
...
br.setPhotoLister(photoLister);
return br;
}
}
The relation to
be exploited
Relation
• Plumbing between
– Spark
– Your existing library
• Mission
– Returns the schema as a StructType
– Returns the data as a RDD<Row>
15@jgperrin - #EUdev6
/jgperrin
/net.jgp.labs.spark.datasources
TableScan is the
key, other more
specialized Scan
available
Relation
16@jgperrin - #EUdev6
The schema:
may/will be
called first
The data…
SQL Context
public class ExifDirectoryRelation extends BaseRelation
implements Serializable, TableScan {
private static final long serialVersionUID = 4598175080399877334L;
@Override
public RDD<Row> buildScan() {
...
return rowRDD.rdd();
}
@Override
public StructType schema() {
...
return schema.getSparkSchema();
}
@Override
public SQLContext sqlContext() {
return this.sqlContext;
}
...
}
/jgperrin
/net.jgp.labs.spark.datasources
A utility function
that introspect a
Java bean and
turn it into a
”Super” schema,
which contains
the required
StructType for
Spark
Relation – Schema
17@jgperrin - #EUdev6
@Override
public StructType schema() {
if (schema == null) {
schema =
SparkBeanUtils.getSchemaFromBean(PhotoMetadata.class);
}
return schema.getSparkSchema();
}
/jgperrin
/net.jgp.labs.spark.datasources
Collect the data
Relation – Data
18@jgperrin - #EUdev6
@Override
public RDD<Row> buildScan() {
schema();
List<PhotoMetadata> table = collectData();
JavaSparkContext sparkContext = new JavaSparkContext(sqlContext.sparkContext());
JavaRDD<Row> rowRDD = sparkContext.parallelize(table)
.map(photo -> SparkBeanUtils.getRowFromBean(schema, photo));
return rowRDD.rdd();
}
private List<PhotoMetadata> collectData() {
List<File> photosToProcess = this.photoLister.getFiles();
List<PhotoMetadata> list = new ArrayList<PhotoMetadata>();
PhotoMetadata photo;
for (File photoToProcess : photosToProcess) {
photo = ExifUtils.processFromFilename(photoToProcess.getAbsolutePath());
list.add(photo);
}
return list;
}
Scans the files,
extract EXIF
information: the
interface to your
library…
Creates the RDD
by parallelizing
the list of photos
/jgperrin
/net.jgp.labs.spark.datasources
Application
19@jgperrin - #EUdev6
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
public class PhotoMetadataIngestionApp {
public static void main(String[] args) {
PhotoMetadataIngestionApp app = new PhotoMetadataIngestionApp();
app.start();
}
private boolean start() {
SparkSession spark = SparkSession.builder()
.appName("EXIF to Dataset").master("local[*]").getOrCreate();
Dataset<Row> df = spark.read()
.format("exif")
.option("recursive", "true")
.option("limit", "80000")
.option("extensions", "jpg,jpeg")
.load("/Users/jgp/Pictures");
df = df
.filter(df.col("GeoX").isNotNull())
.filter(df.col("GeoZ").notEqual("NaN"))
.orderBy(df.col("GeoZ").desc());
System.out.println("I have imported " + df.count() + " photos.");
df.printSchema();
df.show(5);
return true;
}
}
Normal imports,
no reference to
our data source
Local mode
Classic read
Standard
dataframe API:
getting my
”highest” photos!
Standard output
mechanism
Output
+--------------------+-------+---------+----------+--------------------+-------------------+-------------------+--------------------+--------------------+-----------+-------------------+---------+----------+------+-----+
| Name| Size|Extension| MimeType| Directory| FileCreationDate| FileLastAccessDate|FileLastModifiedDate| Filename| GeoY| Date| GeoX| GeoZ|Height|Width|
+--------------------+-------+---------+----------+--------------------+-------------------+-------------------+--------------------+--------------------+-----------+-------------------+---------+----------+------+-----+
| IMG_0177.JPG|1129296| JPG|image/jpeg|/Users/jgp/Pictur...|2015-04-14 08:14:42|2017-10-24 11:20:31| 2015-04-14 08:14:42|/Users/jgp/Pictur...| -111.17341|2015-04-14 15:14:42| 36.38115| 9314.743| 1936| 2592|
| IMG_0176.JPG|1206265| JPG|image/jpeg|/Users/jgp/Pictur...|2015-04-14 09:14:36|2017-10-24 11:20:31| 2015-04-14 09:14:36|/Users/jgp/Pictur...| -111.17341|2015-04-14 16:14:36| 36.38115| 9314.743| 1936| 2592|
| IMG_1202.JPG|1799552| JPG|image/jpeg|/Users/jgp/Pictur...|2015-05-05 17:52:15|2017-10-24 11:20:28| 2015-05-05 17:52:15|/Users/jgp/Pictur...| -98.471405|2015-05-05 15:52:15|29.528196| 9128.197| 2448| 3264|
| IMG_1203.JPG|2412908| JPG|image/jpeg|/Users/jgp/Pictur...|2015-05-05 18:05:25|2017-10-24 11:20:28| 2015-05-05 18:05:25|/Users/jgp/Pictur...| -98.4719|2015-05-05 16:05:25|29.526403| 9128.197| 2448| 3264|
| IMG_1204.JPG|2261133| JPG|image/jpeg|/Users/jgp/Pictur...|2015-05-05 15:13:25|2017-10-24 11:20:29| 2015-05-05 15:13:25|/Users/jgp/Pictur...| -98.47248|2015-05-05 16:13:25|29.526756| 9128.197| 2448| 3264|
+--------------------+-------+---------+----------+--------------------+-------------------+-------------------+--------------------+--------------------+-----------+-------------------+---------+----------+------+-----+
only showing top 5 rows
20@jgperrin - #EUdev6
+--------------------+-------+-----------+-------------------+---------+----------+
| Name| Size| GeoY| Date| GeoX| GeoZ|
+--------------------+-------+-----------+-------------------+---------+----------+
| IMG_0177.JPG|1129296| -111.17341|2015-04-14 15:14:42| 36.38115| 9314.743|
| IMG_0176.JPG|1206265| -111.17341|2015-04-14 16:14:36| 36.38115| 9314.743|
| IMG_1202.JPG|1799552| -98.471405|2015-05-05 15:52:15|29.528196| 9128.197|
| IMG_1203.JPG|2412908| -98.4719|2015-05-05 16:05:25|29.526403| 9128.197|
| IMG_1204.JPG|2261133| -98.47248|2015-05-05 16:13:25|29.526756| 9128.197|
+--------------------+-------+-----------+-------------------+---------+----------+
only showing top 5 rows
dsfsd
21#EUdev6
IMG_0176.JPG
Rijsttafel in Rotterdam, NL
2.6m (9ft) above sea level
dsfsd
22#EUdev6
IMG_0176.JPG
Hoover Dam & Lake Mead, AZ
9314.7m (30560ft) above sea level
@jgperrin - #EUdev6
A Little Extra
23
Data
Metadata
(Schema)
Schema, Beans, and Annotations
24@jgperrin - #EUdev6
Schema
Bean
StructType
SparkBeanUtils.
getSchemaFromBean()
SparkBeanUtils.
getRowFromBean()
Schema.
getSparkSchema()
RowColumn
Column
Column
Column
Can be augmented via
the @SparkColumn
annotation
Mass production
• Easily import from any Java Bean
• Conversion done by utility functions
• Schema is a superset of StructType
25#EUdev6@jgperrin - #EUdev6
Conclusion
• Reuse the Bean to schema and Bean to data in
your project
• Building custom data source to REST server or non
standard format is easy
• No need for a pricey conversion to CSV or JSON
• There is always a solution in Java
• On you: check for parallelism, optimization, extend
schema (order of columns)
26@jgperrin - #EUdev6
Going Further
• Check out the code (fork/like)
https://github.com/jgperrin/net.jgp.labs.spark.dataso
urces
• Follow me @jgperrin
• Watch for my Java + Spark book, coming out soon!
• (If you come in the RTP area in NC, USA, come for
a Spark meet-up and let’s have a drink!)
27#EUdev6
Go raibh maith agaibh.
Jean Georges “JGP” Perrin
@jgperrin
Don’t forget
to rate
Backup Slides & Other Resources
30Discover CLEGO @ http://bit.ly/spark-clego
More Reading & Resources
• My blog: http://jgp.net
• This code on GitHub:
https://github.com/jgperrin/net.jgp.labs.spark.dat
asources
• Java Code on GitHub:
https://github.com/jgperrin/net.jgp.labs.spark
/jgperrin/net.jgp.labs.spark.datasources
Photo & Other Credits
• JGP @ Oplo, Problem, Solution 1&2, Relation,
Beans: Pexels
• Lab, Hoover Dam, Clego: © Jean Georges
Perrin
• Rijsttafel: © Nathaniel Perrin
Abstract
EXTENDING APACHE SPARK'S INGESTION: BUILDING YOUR OWN JAVA DATA SOURCE
By Jean Georges Perrin (@jgperrin, Oplo)
Apache Spark is a wonderful platform for running your analytics jobs. It has great ingestion
features from CSV, Hive, JDBC, etc. however, you may have your own data sources or formats
you want to use. Your solution could be to convert your data in a CSV or JSON file and then
ask Spark to do ingest it through its built-in tools. However, for enhanced performance, we will
explore the way to build a data source, in Java, to extend Spark’s ingestion capabilities. We will
first understand how Spark works for ingestion, then walk through the development of this data
source plug-in.
Targeted audience: Software and data engineers who need to expand Spark’s ingestion
capability.
Key takeaways:
• Requirements, needs & architecture – 15%.
• Build the required tool set in Java – 85%.
Session hashtag: #EUdev6

Mais conteúdo relacionado

Mais procurados

Observability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageObservability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageDatabricks
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingTill Rohrmann
 
Parallelization of Structured Streaming Jobs Using Delta Lake
Parallelization of Structured Streaming Jobs Using Delta LakeParallelization of Structured Streaming Jobs Using Delta Lake
Parallelization of Structured Streaming Jobs Using Delta LakeDatabricks
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...Altinity Ltd
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBRavi Teja
 
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...HostedbyConfluent
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAAdam Doyle
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdfDeep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdfAltinity Ltd
 
Spark Summit EU talk by Ross Lawley
Spark Summit EU talk by Ross LawleySpark Summit EU talk by Ross Lawley
Spark Summit EU talk by Ross LawleySpark Summit
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
 
Power of the Log: LSM & Append Only Data Structures
Power of the Log: LSM & Append Only Data StructuresPower of the Log: LSM & Append Only Data Structures
Power of the Log: LSM & Append Only Data Structuresconfluent
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Capture the Streams of Database Changes
Capture the Streams of Database ChangesCapture the Streams of Database Changes
Capture the Streams of Database Changesconfluent
 
Apache Pulsar Development 101 with Python
Apache Pulsar Development 101 with PythonApache Pulsar Development 101 with Python
Apache Pulsar Development 101 with PythonTimothy Spann
 

Mais procurados (20)

Observability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageObservability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineage
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
Parallelization of Structured Streaming Jobs Using Delta Lake
Parallelization of Structured Streaming Jobs Using Delta LakeParallelization of Structured Streaming Jobs Using Delta Lake
Parallelization of Structured Streaming Jobs Using Delta Lake
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdfDeep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
 
Spark Summit EU talk by Ross Lawley
Spark Summit EU talk by Ross LawleySpark Summit EU talk by Ross Lawley
Spark Summit EU talk by Ross Lawley
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Power of the Log: LSM & Append Only Data Structures
Power of the Log: LSM & Append Only Data StructuresPower of the Log: LSM & Append Only Data Structures
Power of the Log: LSM & Append Only Data Structures
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Capture the Streams of Database Changes
Capture the Streams of Database ChangesCapture the Streams of Database Changes
Capture the Streams of Database Changes
 
Apache Pulsar Development 101 with Python
Apache Pulsar Development 101 with PythonApache Pulsar Development 101 with Python
Apache Pulsar Development 101 with Python
 

Destaque

From Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterFrom Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterDatabricks
 
Strata Beijing - Deep Learning in Production on Spark
Strata Beijing - Deep Learning in Production on SparkStrata Beijing - Deep Learning in Production on Spark
Strata Beijing - Deep Learning in Production on SparkAdam Gibson
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...Spark Summit
 
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Summit
 
Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelin...
Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelin...Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelin...
Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelin...Spark Summit
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
 
Building Custom ML PipelineStages for Feature Selection with Marc Kaminski
Building Custom ML PipelineStages for Feature Selection with Marc KaminskiBuilding Custom ML PipelineStages for Feature Selection with Marc Kaminski
Building Custom ML PipelineStages for Feature Selection with Marc KaminskiSpark Summit
 
A Developer's View Into Spark's Memory Model with Wenchen Fan
A Developer's View Into Spark's Memory Model with Wenchen FanA Developer's View Into Spark's Memory Model with Wenchen Fan
A Developer's View Into Spark's Memory Model with Wenchen FanDatabricks
 
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Databricks
 

Destaque (11)

From Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterFrom Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim Hunter
 
Strata Beijing - Deep Learning in Production on Spark
Strata Beijing - Deep Learning in Production on SparkStrata Beijing - Deep Learning in Production on Spark
Strata Beijing - Deep Learning in Production on Spark
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
 
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard Maas
 
Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelin...
Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelin...Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelin...
Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelin...
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 
Building Custom ML PipelineStages for Feature Selection with Marc Kaminski
Building Custom ML PipelineStages for Feature Selection with Marc KaminskiBuilding Custom ML PipelineStages for Feature Selection with Marc Kaminski
Building Custom ML PipelineStages for Feature Selection with Marc Kaminski
 
A Developer's View Into Spark's Memory Model with Wenchen Fan
A Developer's View Into Spark's Memory Model with Wenchen FanA Developer's View Into Spark's Memory Model with Wenchen Fan
A Developer's View Into Spark's Memory Model with Wenchen Fan
 
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
 

Semelhante a Extending Spark's Ingestion: Build Your Own Java Data Source with Jean Georges Perrin

A Machine Learning Approach to SPARQL Query Performance Prediction
A Machine Learning Approach to SPARQL Query Performance PredictionA Machine Learning Approach to SPARQL Query Performance Prediction
A Machine Learning Approach to SPARQL Query Performance PredictionRakebul Hasan
 
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph DatabaseBringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph DatabaseJimmy Angelakos
 
Adding To the Leaf Pile
Adding To the Leaf PileAdding To the Leaf Pile
Adding To the Leaf Pilestuporglue
 
Building Deep Learning Workflows with DL4J
Building Deep Learning Workflows with DL4JBuilding Deep Learning Workflows with DL4J
Building Deep Learning Workflows with DL4JJosh Patterson
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Databricks
 
Deploying Data Science Engines to Production
Deploying Data Science Engines to ProductionDeploying Data Science Engines to Production
Deploying Data Science Engines to ProductionMostafa Majidpour
 
JDD2015: Thorny path to Data Mining projects - Alexey Zinoviev
JDD2015: Thorny path to Data Mining projects - Alexey Zinoviev JDD2015: Thorny path to Data Mining projects - Alexey Zinoviev
JDD2015: Thorny path to Data Mining projects - Alexey Zinoviev PROIDEA
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptSanket Shikhar
 
Web-Scale Graph Analytics with Apache Spark with Tim Hunter
Web-Scale Graph Analytics with Apache Spark with Tim HunterWeb-Scale Graph Analytics with Apache Spark with Tim Hunter
Web-Scale Graph Analytics with Apache Spark with Tim HunterDatabricks
 
A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...Holden Karau
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Holden Karau
 
SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)wqchen
 
Processing Large Graphs
Processing Large GraphsProcessing Large Graphs
Processing Large GraphsNishant Gandhi
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Spark Summit
 
Managing Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic OptimizingManaging Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic OptimizingDatabricks
 
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Wee Hyong Tok
 
Infrastructure for Deep Learning in Apache Spark
Infrastructure for Deep Learning in Apache SparkInfrastructure for Deep Learning in Apache Spark
Infrastructure for Deep Learning in Apache SparkDatabricks
 
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...Chester Chen
 

Semelhante a Extending Spark's Ingestion: Build Your Own Java Data Source with Jean Georges Perrin (20)

A Machine Learning Approach to SPARQL Query Performance Prediction
A Machine Learning Approach to SPARQL Query Performance PredictionA Machine Learning Approach to SPARQL Query Performance Prediction
A Machine Learning Approach to SPARQL Query Performance Prediction
 
Spark tutorial
Spark tutorialSpark tutorial
Spark tutorial
 
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph DatabaseBringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
 
Adding To the Leaf Pile
Adding To the Leaf PileAdding To the Leaf Pile
Adding To the Leaf Pile
 
Building Deep Learning Workflows with DL4J
Building Deep Learning Workflows with DL4JBuilding Deep Learning Workflows with DL4J
Building Deep Learning Workflows with DL4J
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
 
Spark Worshop
Spark WorshopSpark Worshop
Spark Worshop
 
Deploying Data Science Engines to Production
Deploying Data Science Engines to ProductionDeploying Data Science Engines to Production
Deploying Data Science Engines to Production
 
JDD2015: Thorny path to Data Mining projects - Alexey Zinoviev
JDD2015: Thorny path to Data Mining projects - Alexey Zinoviev JDD2015: Thorny path to Data Mining projects - Alexey Zinoviev
JDD2015: Thorny path to Data Mining projects - Alexey Zinoviev
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
 
Web-Scale Graph Analytics with Apache Spark with Tim Hunter
Web-Scale Graph Analytics with Apache Spark with Tim HunterWeb-Scale Graph Analytics with Apache Spark with Tim Hunter
Web-Scale Graph Analytics with Apache Spark with Tim Hunter
 
A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016
 
SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)
 
Processing Large Graphs
Processing Large GraphsProcessing Large Graphs
Processing Large Graphs
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
 
Managing Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic OptimizingManaging Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic Optimizing
 
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425
 
Infrastructure for Deep Learning in Apache Spark
Infrastructure for Deep Learning in Apache SparkInfrastructure for Deep Learning in Apache Spark
Infrastructure for Deep Learning in Apache Spark
 
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
 

Mais de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Mais de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Último

Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxSimranPal17
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingsocarem879
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfSubhamKumar3239
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...KarteekMane1
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 

Último (20)

Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptx
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processing
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdf
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 

Extending Spark's Ingestion: Build Your Own Java Data Source with Jean Georges Perrin

  • 1. Jean Georges Perrin, @jgperrin, Oplo Extending Spark’s Ingestion: Build Your Own Java Data Source #EUdev6
  • 2. 2#EUdev6 JGP @jgperrin Jean Georges Perrin Chapel Hill, NC, USA I 🏗SW #Knowledge = 𝑓 ( ∑ (#SmallData, #BigData), #DataScience) & #Software #IBMChampion x9 #KeepLearning @ http://jgp.net 2#EUdev6 I💚#
  • 3. A Little Help Hello Ruby, Nathaniel, Jack, and PN! 3@jgperrin - #EUdev6
  • 4. And What About You? • Who is a Java programmer? • Who swears by Scala? • Who has tried to build a custom data source? • In Java? • Who has succeeded? 4@jgperrin - #EUdev6
  • 5. My Problem… 5@jgperrin - #EUdev6 CSV JSON RDBMS REST Other
  • 6. Solution #1 6@jgperrin - #EUdev6 • Look in the Spark Packages library • https://spark-packages.org/ • 48 data sources listed (some dups) • Some with Source Code
  • 7. package net.jgp.labs.spark.datasources.l100_ photo_datasource; import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; import org.apache.spark.sql.SparkSession; public class PhotoMetadataIngestionApp { public static void main(String[] args) { PhotoMetadataIngestionApp app = new PhotoMetadataIngestionApp(); app.start(); } private boolean start() { SparkSession spark = SparkSession.builder() .appName("EXIF to Dataset") .master("local[*]"). getOrCreate(); String importDirectory = "/Users/jgp/Pictures/All Photos/2010-2019/2016"; Solution #2 7@jgperrin - #EUdev6 • Write it yourself • In Java
  • 8. What is ingestion anyway? • Loading – CSV, JSON, RDBMS… – Basically any data • Getting a Dataframe (Dataset<Row>) with your data 8@jgperrin - #EUdev6
  • 9. 9@jgperrin - #EUdev6 Our lab Import some of the metadata of our photos and start analysis…
  • 10. Code Dataset<Row> df = spark.read() .format("exif") .option("recursive", "true") .option("limit", "80000") .option("extensions", "jpg,jpeg") .load(importDirectory); #EUdev6 Spark Session Short name for your data source Options Where to start /jgperrin/net.jgp.labs.spark.datasources
  • 11. Short name • Optional: can specify a full class name • An alias to a class name • Needs to be registered • Not recommended during development phase 11@jgperrin - #EUdev6
  • 12. Project 12#EUdev6 Information for registration of the data source Data source’s short name Data source The app Relation Utilities An existing library you can embed/link to /jgperrin /net.jgp.labs.spark.datasources
  • 13. Library code • A library you already use, you developed or acquired • No need for anything special 13@jgperrin - #EUdev6
  • 14. /jgperrin /net.jgp.labs.spark.datasources Scala 2 Java conversion (sorry!) – All the options Data Source Code 14@jgperrin - #EUdev6 Provides a relation Needed by Spark Needs to be passed to the relation public class ExifDirectoryDataSource implements RelationProvider { @Override public BaseRelation createRelation( SQLContext sqlContext, Map<String, String> params) { java.util.Map<String, String> javaMap = mapAsJavaMapConverter(params).asJava(); ExifDirectoryRelation br = new ExifDirectoryRelation(); br.setSqlContext(sqlContext); ... br.setPhotoLister(photoLister); return br; } } The relation to be exploited
  • 15. Relation • Plumbing between – Spark – Your existing library • Mission – Returns the schema as a StructType – Returns the data as a RDD<Row> 15@jgperrin - #EUdev6
  • 16. /jgperrin /net.jgp.labs.spark.datasources TableScan is the key, other more specialized Scan available Relation 16@jgperrin - #EUdev6 The schema: may/will be called first The data… SQL Context public class ExifDirectoryRelation extends BaseRelation implements Serializable, TableScan { private static final long serialVersionUID = 4598175080399877334L; @Override public RDD<Row> buildScan() { ... return rowRDD.rdd(); } @Override public StructType schema() { ... return schema.getSparkSchema(); } @Override public SQLContext sqlContext() { return this.sqlContext; } ... }
  • 17. /jgperrin /net.jgp.labs.spark.datasources A utility function that introspect a Java bean and turn it into a ”Super” schema, which contains the required StructType for Spark Relation – Schema 17@jgperrin - #EUdev6 @Override public StructType schema() { if (schema == null) { schema = SparkBeanUtils.getSchemaFromBean(PhotoMetadata.class); } return schema.getSparkSchema(); }
  • 18. /jgperrin /net.jgp.labs.spark.datasources Collect the data Relation – Data 18@jgperrin - #EUdev6 @Override public RDD<Row> buildScan() { schema(); List<PhotoMetadata> table = collectData(); JavaSparkContext sparkContext = new JavaSparkContext(sqlContext.sparkContext()); JavaRDD<Row> rowRDD = sparkContext.parallelize(table) .map(photo -> SparkBeanUtils.getRowFromBean(schema, photo)); return rowRDD.rdd(); } private List<PhotoMetadata> collectData() { List<File> photosToProcess = this.photoLister.getFiles(); List<PhotoMetadata> list = new ArrayList<PhotoMetadata>(); PhotoMetadata photo; for (File photoToProcess : photosToProcess) { photo = ExifUtils.processFromFilename(photoToProcess.getAbsolutePath()); list.add(photo); } return list; } Scans the files, extract EXIF information: the interface to your library… Creates the RDD by parallelizing the list of photos
  • 19. /jgperrin /net.jgp.labs.spark.datasources Application 19@jgperrin - #EUdev6 import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; import org.apache.spark.sql.SparkSession; public class PhotoMetadataIngestionApp { public static void main(String[] args) { PhotoMetadataIngestionApp app = new PhotoMetadataIngestionApp(); app.start(); } private boolean start() { SparkSession spark = SparkSession.builder() .appName("EXIF to Dataset").master("local[*]").getOrCreate(); Dataset<Row> df = spark.read() .format("exif") .option("recursive", "true") .option("limit", "80000") .option("extensions", "jpg,jpeg") .load("/Users/jgp/Pictures"); df = df .filter(df.col("GeoX").isNotNull()) .filter(df.col("GeoZ").notEqual("NaN")) .orderBy(df.col("GeoZ").desc()); System.out.println("I have imported " + df.count() + " photos."); df.printSchema(); df.show(5); return true; } } Normal imports, no reference to our data source Local mode Classic read Standard dataframe API: getting my ”highest” photos! Standard output mechanism
  • 20. Output +--------------------+-------+---------+----------+--------------------+-------------------+-------------------+--------------------+--------------------+-----------+-------------------+---------+----------+------+-----+ | Name| Size|Extension| MimeType| Directory| FileCreationDate| FileLastAccessDate|FileLastModifiedDate| Filename| GeoY| Date| GeoX| GeoZ|Height|Width| +--------------------+-------+---------+----------+--------------------+-------------------+-------------------+--------------------+--------------------+-----------+-------------------+---------+----------+------+-----+ | IMG_0177.JPG|1129296| JPG|image/jpeg|/Users/jgp/Pictur...|2015-04-14 08:14:42|2017-10-24 11:20:31| 2015-04-14 08:14:42|/Users/jgp/Pictur...| -111.17341|2015-04-14 15:14:42| 36.38115| 9314.743| 1936| 2592| | IMG_0176.JPG|1206265| JPG|image/jpeg|/Users/jgp/Pictur...|2015-04-14 09:14:36|2017-10-24 11:20:31| 2015-04-14 09:14:36|/Users/jgp/Pictur...| -111.17341|2015-04-14 16:14:36| 36.38115| 9314.743| 1936| 2592| | IMG_1202.JPG|1799552| JPG|image/jpeg|/Users/jgp/Pictur...|2015-05-05 17:52:15|2017-10-24 11:20:28| 2015-05-05 17:52:15|/Users/jgp/Pictur...| -98.471405|2015-05-05 15:52:15|29.528196| 9128.197| 2448| 3264| | IMG_1203.JPG|2412908| JPG|image/jpeg|/Users/jgp/Pictur...|2015-05-05 18:05:25|2017-10-24 11:20:28| 2015-05-05 18:05:25|/Users/jgp/Pictur...| -98.4719|2015-05-05 16:05:25|29.526403| 9128.197| 2448| 3264| | IMG_1204.JPG|2261133| JPG|image/jpeg|/Users/jgp/Pictur...|2015-05-05 15:13:25|2017-10-24 11:20:29| 2015-05-05 15:13:25|/Users/jgp/Pictur...| -98.47248|2015-05-05 16:13:25|29.526756| 9128.197| 2448| 3264| +--------------------+-------+---------+----------+--------------------+-------------------+-------------------+--------------------+--------------------+-----------+-------------------+---------+----------+------+-----+ only showing top 5 rows 20@jgperrin - #EUdev6 +--------------------+-------+-----------+-------------------+---------+----------+ | Name| Size| GeoY| Date| GeoX| GeoZ| +--------------------+-------+-----------+-------------------+---------+----------+ | IMG_0177.JPG|1129296| -111.17341|2015-04-14 15:14:42| 36.38115| 9314.743| | IMG_0176.JPG|1206265| -111.17341|2015-04-14 16:14:36| 36.38115| 9314.743| | IMG_1202.JPG|1799552| -98.471405|2015-05-05 15:52:15|29.528196| 9128.197| | IMG_1203.JPG|2412908| -98.4719|2015-05-05 16:05:25|29.526403| 9128.197| | IMG_1204.JPG|2261133| -98.47248|2015-05-05 16:13:25|29.526756| 9128.197| +--------------------+-------+-----------+-------------------+---------+----------+ only showing top 5 rows
  • 22. dsfsd 22#EUdev6 IMG_0176.JPG Hoover Dam & Lake Mead, AZ 9314.7m (30560ft) above sea level
  • 23. @jgperrin - #EUdev6 A Little Extra 23 Data Metadata (Schema)
  • 24. Schema, Beans, and Annotations 24@jgperrin - #EUdev6 Schema Bean StructType SparkBeanUtils. getSchemaFromBean() SparkBeanUtils. getRowFromBean() Schema. getSparkSchema() RowColumn Column Column Column Can be augmented via the @SparkColumn annotation
  • 25. Mass production • Easily import from any Java Bean • Conversion done by utility functions • Schema is a superset of StructType 25#EUdev6@jgperrin - #EUdev6
  • 26. Conclusion • Reuse the Bean to schema and Bean to data in your project • Building custom data source to REST server or non standard format is easy • No need for a pricey conversion to CSV or JSON • There is always a solution in Java • On you: check for parallelism, optimization, extend schema (order of columns) 26@jgperrin - #EUdev6
  • 27. Going Further • Check out the code (fork/like) https://github.com/jgperrin/net.jgp.labs.spark.dataso urces • Follow me @jgperrin • Watch for my Java + Spark book, coming out soon! • (If you come in the RTP area in NC, USA, come for a Spark meet-up and let’s have a drink!) 27#EUdev6
  • 28. Go raibh maith agaibh. Jean Georges “JGP” Perrin @jgperrin Don’t forget to rate
  • 29. Backup Slides & Other Resources
  • 30. 30Discover CLEGO @ http://bit.ly/spark-clego
  • 31. More Reading & Resources • My blog: http://jgp.net • This code on GitHub: https://github.com/jgperrin/net.jgp.labs.spark.dat asources • Java Code on GitHub: https://github.com/jgperrin/net.jgp.labs.spark /jgperrin/net.jgp.labs.spark.datasources
  • 32. Photo & Other Credits • JGP @ Oplo, Problem, Solution 1&2, Relation, Beans: Pexels • Lab, Hoover Dam, Clego: © Jean Georges Perrin • Rijsttafel: © Nathaniel Perrin
  • 33. Abstract EXTENDING APACHE SPARK'S INGESTION: BUILDING YOUR OWN JAVA DATA SOURCE By Jean Georges Perrin (@jgperrin, Oplo) Apache Spark is a wonderful platform for running your analytics jobs. It has great ingestion features from CSV, Hive, JDBC, etc. however, you may have your own data sources or formats you want to use. Your solution could be to convert your data in a CSV or JSON file and then ask Spark to do ingest it through its built-in tools. However, for enhanced performance, we will explore the way to build a data source, in Java, to extend Spark’s ingestion capabilities. We will first understand how Spark works for ingestion, then walk through the development of this data source plug-in. Targeted audience: Software and data engineers who need to expand Spark’s ingestion capability. Key takeaways: • Requirements, needs & architecture – 15%. • Build the required tool set in Java – 85%. Session hashtag: #EUdev6