SlideShare a Scribd company logo
1 of 16
Download to read offline
1
Data Wrangling sur Hadoop avec Spark
Paris Spark Meetup 03/04/17
Victor Coustenoble
Technical regional manager EMEA
victor@trifacta.com
DATA WRANGLING
2
QUESTION ANALYZE INSIGHTDISCOVER STRUCTURE CLEANSE ENRICH VALIDATE PUBLISH
What is Data Wrangling?
3
DATA
Company Overview
Background
➔ Headquartered in San Francisco, with offices in Boston,
London, Berlin, Paris
➔ >100+ Employees
➔ Created in 2012
Focus
➔ 100% focused on Data Wrangling and Data Preparation
➔ Accelerate time to value and business use of Big Data
➔ Visual, interactive and Self-Service Data Preparation
4
5
Business System Data Machine Generated Data Third Party Data
Reporting / BI
Data Visualization
LOB IT
Explore Structure Clean Enrich Validate Publish
Distributed Data Platform
Predictive Analytics /
Data Science
Machine Data /
Enterprise Processes
Applications
/ processes
Reporting /
Data driven decision
Recommendations /
Data Mining
Self-service access for business analysts to raw
data operated under IT control
6
INTERACTIVE &
VISUAL
PREDICTIVE &
SUGGESTIONS
INTEROPERABLE
Trifacta Key Differentiators
Interoperable: Reduces Total Cost of Ownership
7
Interoperability with
metadata repositories
enables discoverability
and lineage for
compliance & audit
Interoperability with
existing security models
prevents administering
another app
*Predictive Interaction for Data Transformation – Heer, Hellerstein & Kandel; Stanford University & University of California, Berkeley
(2015)
Intelligent Execution* ensures
Trifacta is highly performant
both now and in the future
8
Execution Architecture
Optimized processing for data not needing parallel processing
Future Technologies
Intelligent Execution
In-memory
MBs GBs TBs PBs
Data Volume
ExecutionLatency
Immediate
Interactive
Batch
Intelligent Execution Architecture
Automatically selects the right execution engine for the data set being transformed
TRIFACTA
Trifacta Workflow in Hadoop
Sample Scale Up
Refine
Sample
Results
Identify/Register Data
1
. Predictive Interaction
2
.
Consume
Schedulers
Monitor and Adjust
3
.
Schedule
Visualization & Analysis
Secure Access
Kerberos, LDAP…
CLI
How Does Trifacta’s Spark Work?
§ Yarn ressource manager
§ Cluster deployment mode
12
Trifacta executes our own version of Spark in a “Cluster Deployment Mode” using
the Hadoop cluster’s YARN resource manager.
§ Trifacta’s Spark job lives in its own YARN container, separate from other Spark
jobs running on the same cluster.
Trifacta submits the following to YARN for execution across cluster:
§ Spark v2.1.0 libraries
§ Trifacta Transformation & Profiling libraries
§ Transformation logic (DAG)
§ Libraries are distributed & cached by YARN after initial load.
Spark jobs parameters (possible per user) :
§ Executor parameters (memory size, nb vcores).
§ Dynamic allocation (by default) for dynamic nb of executors depending of
YARN available ressources.
§ Possible to assign jobs to specific YARN queue
How Does Trifacta’s Spark Work?
Trifacta Selected as OEM Partner for Google Cloud Dataprep Service
Trifacta Interface & Photon Engine Integrated within Google Cloud Ecosystem
● Access & publish data from/to Google Cloud Storage & BigQuery
● Compile recipes to Google Cloud Dataflow for fully-managed auto-scaling execution
Google Cloud Dataprep
Cloud Storage
BigQuery
Dataflow
Cloud Storage
BigQuery
Cloud Dataprep
INPUT OUTPUT
https://cloud.google.com/dataprep/
Storage
3rd Party
Experian,
Nielson,
FICO…
v
IT
LOB
Discovering Structuring Cleaning Enriching Validating Publishing
Ingestion Processing
DATA LAKE
Demonstration : Predict and Avoid Churn – Customer 360
Customer Data
Account Activity
Social Media
CRM
Contact / Status
Voice
Text Data
Tweets
Handles
ANALYSIS & VISUALIZATION
Trifacta: The Global Leader in Data Wrangling
No. 1 by Analysts
#1 End User Data
Preparation Vendor
2015
Leader in Forrester Wave
for Data Preparation Tools
2017
0
50 000
No. 1 by Users
No. 1 by Customers
No. 1 by Partners
2016
Oct 2015 Oct 2016 Oct 2017
2017
Merci
Questions?
Télécharger Trifacta Wrangler
trifacta.com/start-wrangling
victor@trifacta.com
@vizanalytics

More Related Content

What's hot

Give sense to your Big Data w/ Apache TinkerPop™ & property graph databases
Give sense to your Big Data w/ Apache TinkerPop™ & property graph databasesGive sense to your Big Data w/ Apache TinkerPop™ & property graph databases
Give sense to your Big Data w/ Apache TinkerPop™ & property graph databases
DataStax
 

What's hot (20)

How to migrate to GraphDB in 10 easy to follow steps
How to migrate to GraphDB in 10 easy to follow steps How to migrate to GraphDB in 10 easy to follow steps
How to migrate to GraphDB in 10 easy to follow steps
 
Big Data Ecosystem
Big Data EcosystemBig Data Ecosystem
Big Data Ecosystem
 
Give sense to your Big Data w/ Apache TinkerPop™ & property graph databases
Give sense to your Big Data w/ Apache TinkerPop™ & property graph databasesGive sense to your Big Data w/ Apache TinkerPop™ & property graph databases
Give sense to your Big Data w/ Apache TinkerPop™ & property graph databases
 
AnzoGraph DB - SPARQL 101
AnzoGraph DB - SPARQL 101AnzoGraph DB - SPARQL 101
AnzoGraph DB - SPARQL 101
 
Smart data for a predictive bank
Smart data for a predictive bankSmart data for a predictive bank
Smart data for a predictive bank
 
Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!
 
Definitive Guide to Select Right Data Warehouse (2020)
Definitive Guide to Select Right Data Warehouse (2020)Definitive Guide to Select Right Data Warehouse (2020)
Definitive Guide to Select Right Data Warehouse (2020)
 
The Business Case for Semantic Web Ontology & Knowledge Graph
The Business Case for Semantic Web Ontology & Knowledge GraphThe Business Case for Semantic Web Ontology & Knowledge Graph
The Business Case for Semantic Web Ontology & Knowledge Graph
 
Big Data Day LA 2016/ NoSQL track - Architecting Real Life IoT Architecture, ...
Big Data Day LA 2016/ NoSQL track - Architecting Real Life IoT Architecture, ...Big Data Day LA 2016/ NoSQL track - Architecting Real Life IoT Architecture, ...
Big Data Day LA 2016/ NoSQL track - Architecting Real Life IoT Architecture, ...
 
Big Data in the Cloud with Azure Marketplace Images
Big Data in the Cloud with Azure Marketplace ImagesBig Data in the Cloud with Azure Marketplace Images
Big Data in the Cloud with Azure Marketplace Images
 
Graph-based Product Lifecycle Management
Graph-based Product Lifecycle ManagementGraph-based Product Lifecycle Management
Graph-based Product Lifecycle Management
 
Spark Summit EU 2015: Matei Zaharia keynote
Spark Summit EU 2015: Matei Zaharia keynoteSpark Summit EU 2015: Matei Zaharia keynote
Spark Summit EU 2015: Matei Zaharia keynote
 
Azure cafe marketplace with looker data analytics
Azure cafe marketplace with looker data analyticsAzure cafe marketplace with looker data analytics
Azure cafe marketplace with looker data analytics
 
Manage tracability with Apache Atlas, a flexible metadata repository
Manage tracability with Apache Atlas, a flexible metadata repositoryManage tracability with Apache Atlas, a flexible metadata repository
Manage tracability with Apache Atlas, a flexible metadata repository
 
Cloud and Big Data trends
Cloud and Big Data trendsCloud and Big Data trends
Cloud and Big Data trends
 
Geo-Analytics with Apache Spark and In-Memory Data Grids
Geo-Analytics with Apache Spark and In-Memory Data GridsGeo-Analytics with Apache Spark and In-Memory Data Grids
Geo-Analytics with Apache Spark and In-Memory Data Grids
 
Big data landscape
Big data landscapeBig data landscape
Big data landscape
 
Machine Data Analytics
Machine Data AnalyticsMachine Data Analytics
Machine Data Analytics
 
Managed Cluster Services
Managed Cluster ServicesManaged Cluster Services
Managed Cluster Services
 
Big Data Landscape 2016
Big Data Landscape 2016Big Data Landscape 2016
Big Data Landscape 2016
 

Similar to Paris Spark Meetup - Trifacta - 03_04_2017

2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3 2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3
Chester Chen
 
zData BI & Advanced Analytics Platform + 8 Week Pilot Programs
zData BI & Advanced Analytics Platform + 8 Week Pilot ProgramszData BI & Advanced Analytics Platform + 8 Week Pilot Programs
zData BI & Advanced Analytics Platform + 8 Week Pilot Programs
zData Inc.
 
Adding structure to your streaming pipelines: moving from Spark streaming to ...
Adding structure to your streaming pipelines: moving from Spark streaming to ...Adding structure to your streaming pipelines: moving from Spark streaming to ...
Adding structure to your streaming pipelines: moving from Spark streaming to ...
DataWorks Summit
 

Similar to Paris Spark Meetup - Trifacta - 03_04_2017 (20)

Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
 
Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)
 
TechEvent Databricks on Azure
TechEvent Databricks on AzureTechEvent Databricks on Azure
TechEvent Databricks on Azure
 
What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
 
2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3 2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3
 
Azure Notebooks - Jupyter for the Cloud
Azure Notebooks - Jupyter for the CloudAzure Notebooks - Jupyter for the Cloud
Azure Notebooks - Jupyter for the Cloud
 
Session 8 - Creating Data Processing Services | Train the Trainers Program
Session 8 - Creating Data Processing Services | Train the Trainers ProgramSession 8 - Creating Data Processing Services | Train the Trainers Program
Session 8 - Creating Data Processing Services | Train the Trainers Program
 
Oracle Data Integration - Overview
Oracle Data Integration - OverviewOracle Data Integration - Overview
Oracle Data Integration - Overview
 
Architecting an Open Source AI Platform 2018 edition
Architecting an Open Source AI Platform   2018 editionArchitecting an Open Source AI Platform   2018 edition
Architecting an Open Source AI Platform 2018 edition
 
PCM Vision 2019 Breakout: Quest Software
PCM Vision 2019 Breakout: Quest SoftwarePCM Vision 2019 Breakout: Quest Software
PCM Vision 2019 Breakout: Quest Software
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure Databricks
 
Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland Leusden
Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland LeusdenTestistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland Leusden
Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland Leusden
 
Day 13 - Creating Data Processing Services | Train the Trainers Program
Day 13 - Creating Data Processing Services | Train the Trainers ProgramDay 13 - Creating Data Processing Services | Train the Trainers Program
Day 13 - Creating Data Processing Services | Train the Trainers Program
 
Lambda architecture with Spark
Lambda architecture with SparkLambda architecture with Spark
Lambda architecture with Spark
 
zData BI & Advanced Analytics Platform + 8 Week Pilot Programs
zData BI & Advanced Analytics Platform + 8 Week Pilot ProgramszData BI & Advanced Analytics Platform + 8 Week Pilot Programs
zData BI & Advanced Analytics Platform + 8 Week Pilot Programs
 
Real time analytics
Real time analyticsReal time analytics
Real time analytics
 
Adding structure to your streaming pipelines: moving from Spark streaming to ...
Adding structure to your streaming pipelines: moving from Spark streaming to ...Adding structure to your streaming pipelines: moving from Spark streaming to ...
Adding structure to your streaming pipelines: moving from Spark streaming to ...
 
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
 
ACDKOCHI19 - Next Generation Data Analytics Platform on AWS
ACDKOCHI19 - Next Generation Data Analytics Platform on AWSACDKOCHI19 - Next Generation Data Analytics Platform on AWS
ACDKOCHI19 - Next Generation Data Analytics Platform on AWS
 

More from Modern Data Stack France

More from Modern Data Stack France (20)

Stash - Data FinOPS
Stash - Data FinOPSStash - Data FinOPS
Stash - Data FinOPS
 
Vue d'ensemble Dremio
Vue d'ensemble DremioVue d'ensemble Dremio
Vue d'ensemble Dremio
 
From Data Warehouse to Lakehouse
From Data Warehouse to LakehouseFrom Data Warehouse to Lakehouse
From Data Warehouse to Lakehouse
 
Talend spark meetup 03042017 - Paris Spark Meetup
Talend spark meetup 03042017 - Paris Spark MeetupTalend spark meetup 03042017 - Paris Spark Meetup
Talend spark meetup 03042017 - Paris Spark Meetup
 
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
 
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
 
Hadoop France meetup Feb2016 : recommendations with spark
Hadoop France meetup  Feb2016 : recommendations with sparkHadoop France meetup  Feb2016 : recommendations with spark
Hadoop France meetup Feb2016 : recommendations with spark
 
Hug janvier 2016 -EDF
Hug   janvier 2016 -EDFHug   janvier 2016 -EDF
Hug janvier 2016 -EDF
 
HUG France - 20160114 industrialisation_process_big_data CanalPlus
HUG France -  20160114 industrialisation_process_big_data CanalPlusHUG France -  20160114 industrialisation_process_big_data CanalPlus
HUG France - 20160114 industrialisation_process_big_data CanalPlus
 
Hugfr SPARK & RIAK -20160114_hug_france
Hugfr  SPARK & RIAK -20160114_hug_franceHugfr  SPARK & RIAK -20160114_hug_france
Hugfr SPARK & RIAK -20160114_hug_france
 
HUG France : HBase in Financial Industry par Pierre Bittner (Scaled Risk CTO)
HUG France : HBase in Financial Industry par Pierre Bittner (Scaled Risk CTO)HUG France : HBase in Financial Industry par Pierre Bittner (Scaled Risk CTO)
HUG France : HBase in Financial Industry par Pierre Bittner (Scaled Risk CTO)
 
Apache Flink par Bilal Baltagi Paris Spark Meetup Dec 2015
Apache Flink par Bilal Baltagi Paris Spark Meetup Dec 2015Apache Flink par Bilal Baltagi Paris Spark Meetup Dec 2015
Apache Flink par Bilal Baltagi Paris Spark Meetup Dec 2015
 
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...
 
Record linkage, a real use case with spark ml - Paris Spark meetup Dec 2015
Record linkage, a real use case with spark ml  - Paris Spark meetup Dec 2015Record linkage, a real use case with spark ml  - Paris Spark meetup Dec 2015
Record linkage, a real use case with spark ml - Paris Spark meetup Dec 2015
 
Spark dataframe
Spark dataframeSpark dataframe
Spark dataframe
 
June Spark meetup : search as recommandation
June Spark meetup : search as recommandationJune Spark meetup : search as recommandation
June Spark meetup : search as recommandation
 
Spark ML par Xebia (Spark Meetup du 11/06/2015)
Spark ML par Xebia (Spark Meetup du 11/06/2015)Spark ML par Xebia (Spark Meetup du 11/06/2015)
Spark ML par Xebia (Spark Meetup du 11/06/2015)
 
Spark meetup at viadeo
Spark meetup at viadeoSpark meetup at viadeo
Spark meetup at viadeo
 
Paris Spark meetup : Extension de Spark (Tachyon / Spark JobServer) par jlamiel
Paris Spark meetup : Extension de Spark (Tachyon / Spark JobServer) par jlamielParis Spark meetup : Extension de Spark (Tachyon / Spark JobServer) par jlamiel
Paris Spark meetup : Extension de Spark (Tachyon / Spark JobServer) par jlamiel
 
Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX
Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REXHadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX
Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX
 

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Recently uploaded (20)

MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 

Paris Spark Meetup - Trifacta - 03_04_2017

  • 1. 1 Data Wrangling sur Hadoop avec Spark Paris Spark Meetup 03/04/17 Victor Coustenoble Technical regional manager EMEA victor@trifacta.com
  • 2. DATA WRANGLING 2 QUESTION ANALYZE INSIGHTDISCOVER STRUCTURE CLEANSE ENRICH VALIDATE PUBLISH What is Data Wrangling?
  • 4. Company Overview Background ➔ Headquartered in San Francisco, with offices in Boston, London, Berlin, Paris ➔ >100+ Employees ➔ Created in 2012 Focus ➔ 100% focused on Data Wrangling and Data Preparation ➔ Accelerate time to value and business use of Big Data ➔ Visual, interactive and Self-Service Data Preparation 4
  • 5. 5 Business System Data Machine Generated Data Third Party Data Reporting / BI Data Visualization LOB IT Explore Structure Clean Enrich Validate Publish Distributed Data Platform Predictive Analytics / Data Science Machine Data / Enterprise Processes Applications / processes Reporting / Data driven decision Recommendations / Data Mining Self-service access for business analysts to raw data operated under IT control
  • 7. Interoperable: Reduces Total Cost of Ownership 7 Interoperability with metadata repositories enables discoverability and lineage for compliance & audit Interoperability with existing security models prevents administering another app *Predictive Interaction for Data Transformation – Heer, Hellerstein & Kandel; Stanford University & University of California, Berkeley (2015) Intelligent Execution* ensures Trifacta is highly performant both now and in the future
  • 8. 8 Execution Architecture Optimized processing for data not needing parallel processing Future Technologies Intelligent Execution In-memory
  • 9. MBs GBs TBs PBs Data Volume ExecutionLatency Immediate Interactive Batch Intelligent Execution Architecture Automatically selects the right execution engine for the data set being transformed
  • 10. TRIFACTA Trifacta Workflow in Hadoop Sample Scale Up Refine Sample Results Identify/Register Data 1 . Predictive Interaction 2 . Consume Schedulers Monitor and Adjust 3 . Schedule Visualization & Analysis Secure Access Kerberos, LDAP… CLI
  • 11. How Does Trifacta’s Spark Work? § Yarn ressource manager § Cluster deployment mode
  • 12. 12 Trifacta executes our own version of Spark in a “Cluster Deployment Mode” using the Hadoop cluster’s YARN resource manager. § Trifacta’s Spark job lives in its own YARN container, separate from other Spark jobs running on the same cluster. Trifacta submits the following to YARN for execution across cluster: § Spark v2.1.0 libraries § Trifacta Transformation & Profiling libraries § Transformation logic (DAG) § Libraries are distributed & cached by YARN after initial load. Spark jobs parameters (possible per user) : § Executor parameters (memory size, nb vcores). § Dynamic allocation (by default) for dynamic nb of executors depending of YARN available ressources. § Possible to assign jobs to specific YARN queue How Does Trifacta’s Spark Work?
  • 13. Trifacta Selected as OEM Partner for Google Cloud Dataprep Service Trifacta Interface & Photon Engine Integrated within Google Cloud Ecosystem ● Access & publish data from/to Google Cloud Storage & BigQuery ● Compile recipes to Google Cloud Dataflow for fully-managed auto-scaling execution Google Cloud Dataprep Cloud Storage BigQuery Dataflow Cloud Storage BigQuery Cloud Dataprep INPUT OUTPUT https://cloud.google.com/dataprep/
  • 14. Storage 3rd Party Experian, Nielson, FICO… v IT LOB Discovering Structuring Cleaning Enriching Validating Publishing Ingestion Processing DATA LAKE Demonstration : Predict and Avoid Churn – Customer 360 Customer Data Account Activity Social Media CRM Contact / Status Voice Text Data Tweets Handles ANALYSIS & VISUALIZATION
  • 15. Trifacta: The Global Leader in Data Wrangling No. 1 by Analysts #1 End User Data Preparation Vendor 2015 Leader in Forrester Wave for Data Preparation Tools 2017 0 50 000 No. 1 by Users No. 1 by Customers No. 1 by Partners 2016 Oct 2015 Oct 2016 Oct 2017 2017