Paris Spark Meetup - Trifacta - 03_04_2017

Data Wrangling on Hadoop with Spark
Paris Spark Meetup 03/04/17
Victor Coustenoble
Technical regional manager EMEA
victor@trifacta.com

What is Data Wrangling?
QUESTION, ANALYZE, INSIGHT
DATA WRANGLING: DISCOVER, STRUCTURE, CLEANSE, ENRICH, VALIDATE, PUBLISH
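The wrangling steps named on this slide (structure, cleanse, enrich, validate, publish) can be illustrated with a minimal, standard-library Python sketch. This is not Trifacta code: the sample data, the `SEGMENTS` lookup and all function names are made up for illustration, and the exploratory "discover" step is not shown.

```python
import csv
import io

# Made-up raw input: inconsistent casing, stray spaces, one missing amount.
RAW = """customer;amount
 alice ;10.5
BOB;
carol;7
"""

# Hypothetical lookup table used by the enrich step.
SEGMENTS = {"alice": "gold", "carol": "silver"}


def structure(text):
    """Structure: parse semicolon-delimited text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter=";"))


def cleanse(rows):
    """Cleanse: normalise names, convert amounts, drop rows with no amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue
        cleaned.append({"customer": row["customer"].strip().lower(),
                        "amount": float(amount)})
    return cleaned


def enrich(rows):
    """Enrich: add a segment column from the lookup table."""
    return [dict(row, segment=SEGMENTS.get(row["customer"], "unknown"))
            for row in rows]


def validate(rows):
    """Validate: fail fast if any amount is negative."""
    for row in rows:
        if row["amount"] < 0:
            raise ValueError(f"negative amount for {row['customer']}")
    return rows


def publish(rows):
    """Publish: serialise the result as comma-separated text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["customer", "amount", "segment"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


result = publish(validate(enrich(cleanse(structure(RAW)))))
print(result)
```

Each step is a plain function over a list of rows, so the steps can be reordered or tested separately.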

Company Overview
Background
➔ Headquartered in San Francisco, with offices in Boston, London, Berlin and Paris
➔ 100+ Employees
➔ Created in 2012
Focus
➔ 100% focused on Data Wrangling and Data Preparation
➔ Accelerate time to value and business use of Big Data
➔ Visual, interactive and Self-Service Data Preparation

Self-service access for business analysts to raw data operated under IT control
Data sources: Business System Data, Machine Generated Data, Third Party Data
Users: LOB, IT
Wrangling steps: Explore, Structure, Clean, Enrich, Validate, Publish
Platform: Distributed Data Platform
Uses: Reporting / BI, Data Visualization; Predictive Analytics / Data Science; Machine Data / Enterprise Processes; Applications / processes; Reporting / Data driven decision; Recommendations / Data Mining

Trifacta Key Differentiators
➔ Interactive & Visual
➔ Predictive & Suggestions
➔ Interoperable

Interoperable: Reduces Total Cost of Ownership
➔ Interoperability with metadata repositories enables discoverability and lineage for compliance & audit
➔ Interoperability with existing security models prevents administering another app
➔ Intelligent Execution* ensures Trifacta is highly performant both now and in the future
*Predictive Interaction for Data Transformation – Heer, Hellerstein & Kandel; Stanford University & University of California, Berkeley (2015)

Intelligent Execution Architecture
Automatically selects the right execution engine for the data set being transformed
➔ In-memory: optimized processing for data not needing parallel processing
➔ Intelligent Execution
➔ Future Technologies
[Chart: Execution Latency (Immediate, Interactive, Batch) against Data Volume (MBs, GBs, TBs, PBs)]
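As a rough illustration of choosing an execution engine from the data volume, the Python sketch below routes small inputs to an in-memory engine and everything else to a distributed one. The slides give no numeric threshold and do not describe Trifacta's actual selection logic, so the 1 GiB cut-off, the engine labels and the function name are all assumptions.

```python
# Hypothetical cut-off: the slides do not state where the boundary lies.
IN_MEMORY_LIMIT_BYTES = 1 * 1024 ** 3  # assumed 1 GiB


def choose_engine(dataset_bytes, limit=IN_MEMORY_LIMIT_BYTES):
    """Return an engine label for a data set of the given size.

    Small data sets are handled in memory on a single node; larger
    ones go to a distributed engine (for example Spark on YARN).
    """
    if dataset_bytes < 0:
        raise ValueError("dataset size cannot be negative")
    return "in-memory" if dataset_bytes <= limit else "distributed"


print(choose_engine(50 * 1024 ** 2))   # a 50 MiB sample
print(choose_engine(5 * 1024 ** 4))    # a 5 TiB data set
```

A real selector would likely weigh more than size, for instance the kind of transformation and the cluster load, but a size threshold is the simplest reading of the volume/latency chart.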

Trifacta Workflow in Hadoop
1. Predictive Interaction
2. Consume
3. Schedule
Diagram labels: Identify/Register Data, Sample, Refine, Results, Scale Up, Schedulers, CLI, Monitor and Adjust, Visualization & Analysis
Secure Access: Kerberos, LDAP…

How Does Trifacta’s Spark Work?
§ YARN resource manager
§ Cluster deployment mode

Trifacta executes its own version of Spark in “cluster deployment mode”, using the Hadoop cluster’s YARN resource manager.
§ Trifacta’s Spark job runs in its own YARN container, separate from other Spark jobs running on the same cluster.
Trifacta submits the following to YARN for execution across the cluster:
§ Spark v2.1.0 libraries
§ Trifacta Transformation & Profiling libraries
§ Transformation logic (DAG)
§ Libraries are distributed & cached by YARN after initial load.
Spark job parameters (can be set per user):
§ Executor parameters (memory size, number of vcores).
§ Dynamic allocation (enabled by default): the number of executors varies with the YARN resources available.
§ Jobs can be assigned to a specific YARN queue.
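The job parameters listed above map onto standard Spark-on-YARN settings. The slides do not show how Trifacta submits its jobs, so the Python sketch below only assembles an equivalent stock `spark-submit` command line; the helper name, the jar name `wrangle-job.jar`, the queue name `analysts` and the default values are hypothetical.

```python
def build_spark_submit(app_jar, executor_memory="4g", executor_cores=2,
                       queue=None, dynamic_allocation=True):
    """Build a spark-submit argument list for YARN cluster deploy mode."""
    args = [
        "spark-submit",
        "--master", "yarn",            # use the cluster's YARN resource manager
        "--deploy-mode", "cluster",    # the driver also runs inside YARN
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
    ]
    if queue:
        # Route the job to a specific YARN queue.
        args += ["--queue", queue]
    if dynamic_allocation:
        # Let Spark grow and shrink the executor count with available resources.
        # On YARN with Spark 2.x this also needs the external shuffle service.
        args += ["--conf", "spark.dynamicAllocation.enabled=true",
                 "--conf", "spark.shuffle.service.enabled=true"]
    args.append(app_jar)
    return args


cmd = build_spark_submit("wrangle-job.jar", queue="analysts")
print(" ".join(cmd))
```

Returning a list rather than a single string makes it straightforward to pass the command to `subprocess.run` without shell quoting issues.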

Trifacta Selected as OEM Partner for Google Cloud Dataprep Service
Trifacta Interface & Photon Engine Integrated within Google Cloud Ecosystem
● Access & publish data from/to Google Cloud Storage & BigQuery
● Compile recipes to Google Cloud Dataflow for fully-managed auto-scaling execution
Google Cloud Dataprep
Input: Cloud Storage, BigQuery
Processing: Cloud Dataprep, Dataflow
Output: Cloud Storage, BigQuery
https://cloud.google.com/dataprep/

Demonstration: Predict and Avoid Churn – Customer 360
Data sources: Customer Data, Account Activity, Social Media, CRM, Contact / Status, Voice, Text Data, Tweets, Handles
3rd Party: Experian, Nielsen, FICO…
Data lake: Storage, Ingestion, Processing
Wrangling steps: Discovering, Structuring, Cleaning, Enriching, Validating, Publishing
Users: IT, LOB
Output: Analysis & Visualization

Trifacta: The Global Leader in Data Wrangling
➔ No. 1 by Analysts: #1 End User Data Preparation Vendor (2015); Leader in Forrester Wave for Data Preparation Tools (2017)
➔ No. 1 by Users
➔ No. 1 by Customers
➔ No. 1 by Partners
[Chart: axis from 0 to 50 000; Oct 2015, Oct 2016, Oct 2017; further year labels on the slide: 2016, 2017]

Thank you
Questions?
Download Trifacta Wrangler
trifacta.com/start-wrangling
victor@trifacta.com
@vizanalytics
