SlideShare uma empresa Scribd logo
1 de 18
Baixar para ler offline
Pentaho Data Integration/Kettle
● Dan Moore
● 8z Real Estate
● Kettle user for two years
Questions
● Who has used a relational database?
● Who has written scripts or java code to munge
data from one source and load it to another?
– What did you use?
– Scripts
– Custom java code
– ETL tool
What is Kettle?
● Batch data integration and processing tool
written in Java
● Exists to retrieve, process and load data
● ETL
– Extract, transform and load
● PDI synonomous
What is Kettle good for
● Mirroring data from master to slave
● Syncing two data sources
● Processing data retrieved from multiple sources
and pushed to multiple destinations
● Loading data to RDBMS
● Datamart/data warehouse
– Dimension lookup/update step
● Graphical manipulation of data
Alternatives
● Code
– Custom java
– Spring batch
● Scripts
– perl, python, shell, etc
– Possibly + db loader tool and cron
● Commercial ETL tools
– Oracle Warehouse Builder
– Datastage
– Informatica
– SQL Server Integration services
● Open source ETL tools:
– Talend
– KETL
– Clover.ETL
● Special case tools
– SymmetricDS
– Db replication
Why Kettle is better
● Higher level than code
– Graphical interface
– No connection pooling to worry about
– No DDL to write
– Validation/business rules
● Well tested full suite of components
● Data analysis tools
– Preview
– Data profiling with data cleaner (add on)
● Free (as in beer and speech)
– Two editions
– GPLv2
● Performant?
– Developer vs computer performant
– Depends, right?
– Sitemailsame job 10k rows/second for 125M rows
● Leverage java
– jvm tuning skills
– java libraries and logic (in jars)
Data sources
● Files
● Databases
● No SQL
● REST
● XML
● Hadoop/HBase
● JSON
● Excel
● EDI
● RSS
● Google Analytics
Kettle concepts
● Repository
● Rows/Stream
● Steps
● Job
● Transformation
Demo 1: one way sync
● Sync tables
Demo 2: processing
● Process data from one table and replace some
values, filter some values
● Lookup table
Demo 3: log file processing
● Load apache logs for analysis
What it is not good for
● User interfaces/user interaction
● Small data sets
– 500 (from experience)
● Web applications
● One off processes?
– One off becomes regular
Who uses
● Survey results
– ~20 people
● Number of downloads: 110K downloads of Kettle 4.4
– Since Nov 2012
● Our specific use
– MLS data
● Different data source formats and types (jdbc, local csv, ftp)
– Public records data
● Fixed width files
Larger picture
● Kettle 10 years old
– joined Pentaho about 7 years ago
● Open source, at version 4.4
– GPLv2 license
– EE edition available
● BI suite
– Reporting
– Analytics
– Dashboards
– Machine Learning (weka)
Kettle tools
● Spoon
● Kitchen
● Pan
● Carte
– Clustering tool
Advanced topics
● Existing java logic
– Embedded
– Polygon example
– Demo 4
● Deployment
– Variables Config files are your friend
● Mapping/Parameterization
– Subroutines of logic
Advanced Topics Continued
● Testing
– Who tests
● Version control
– Who uses version control
● Error handling
– Email
– Log files
Getting started
● Download
– sourceforge
● Includes over 150 example transformations
– Mysql 3.14 jdbc driver
● Helpful sites
– Forums: http://forums.pentaho.com/forumdisplay.php?135-Pentaho-Data-Integration-
Kettle
– Wiki: http://wiki.pentaho.com/display/EAI/Pentaho+Data+Integration+Steps
– Testing: http://www.mooreds.com/wordpress/pentaho-kettle-testing
● Helpful books
– Pentaho Kettle Solutions: Casters, Bouman, van Dongen
● Barely scratched surface
● Don't like tools that turn me into a mechanic

Mais conteúdo relacionado

Mais procurados

Talend Open Studio Data Integration
Talend Open Studio Data IntegrationTalend Open Studio Data Integration
Talend Open Studio Data IntegrationRoberto Marchetto
 
Intro to Talend Open Studio for Data Integration
Intro to Talend Open Studio for Data IntegrationIntro to Talend Open Studio for Data Integration
Intro to Talend Open Studio for Data IntegrationPhilip Yurchuk
 
Growing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RSGrowing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RSDatabricks
 
From Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETLFrom Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETLCloudera, Inc.
 
Airflow at lyft for Airflow summit 2020 conference
Airflow at lyft for Airflow summit 2020 conferenceAirflow at lyft for Airflow summit 2020 conference
Airflow at lyft for Airflow summit 2020 conferenceTao Feng
 
From Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache ApexFrom Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache ApexDataWorks Summit
 
Big Data Ingestion @ Flipkart Data Platform
Big Data Ingestion @ Flipkart Data PlatformBig Data Ingestion @ Flipkart Data Platform
Big Data Ingestion @ Flipkart Data PlatformNavneet Gupta
 
Open Source ETL using Talend Open Studio
Open Source ETL using Talend Open StudioOpen Source ETL using Talend Open Studio
Open Source ETL using Talend Open Studiosantosluis87
 
Spark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.comSpark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.comSyed Hadoop
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Databricks
 
KnowIT, semantic informatics knowledge base
KnowIT, semantic informatics knowledge baseKnowIT, semantic informatics knowledge base
KnowIT, semantic informatics knowledge baseLaurent Alquier
 
final_proj_Implementation of the ETL system
final_proj_Implementation of the ETL systemfinal_proj_Implementation of the ETL system
final_proj_Implementation of the ETL systemR-uturaj R-aval
 
Sql Server 2005 Business Inteligence
Sql Server 2005 Business InteligenceSql Server 2005 Business Inteligence
Sql Server 2005 Business Inteligenceabercius24
 
Datastage to ODI
Datastage to ODIDatastage to ODI
Datastage to ODINagendra K
 
Modularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache SparkModularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Mais procurados (20)

Talend Open Studio Data Integration
Talend Open Studio Data IntegrationTalend Open Studio Data Integration
Talend Open Studio Data Integration
 
A deep dive into neuton
A deep dive into neutonA deep dive into neuton
A deep dive into neuton
 
Intro to Talend Open Studio for Data Integration
Intro to Talend Open Studio for Data IntegrationIntro to Talend Open Studio for Data Integration
Intro to Talend Open Studio for Data Integration
 
Growing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RSGrowing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RS
 
From Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETLFrom Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETL
 
Airflow at lyft for Airflow summit 2020 conference
Airflow at lyft for Airflow summit 2020 conferenceAirflow at lyft for Airflow summit 2020 conference
Airflow at lyft for Airflow summit 2020 conference
 
Tableau Architecture
Tableau ArchitectureTableau Architecture
Tableau Architecture
 
From Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache ApexFrom Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache Apex
 
Big Data Ingestion @ Flipkart Data Platform
Big Data Ingestion @ Flipkart Data PlatformBig Data Ingestion @ Flipkart Data Platform
Big Data Ingestion @ Flipkart Data Platform
 
Data provenance in Hopsworks
Data provenance in HopsworksData provenance in Hopsworks
Data provenance in Hopsworks
 
Open Source ETL using Talend Open Studio
Open Source ETL using Talend Open StudioOpen Source ETL using Talend Open Studio
Open Source ETL using Talend Open Studio
 
Spark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.comSpark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.com
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
 
KnowIT, semantic informatics knowledge base
KnowIT, semantic informatics knowledge baseKnowIT, semantic informatics knowledge base
KnowIT, semantic informatics knowledge base
 
final_proj_Implementation of the ETL system
final_proj_Implementation of the ETL systemfinal_proj_Implementation of the ETL system
final_proj_Implementation of the ETL system
 
Sql Server 2005 Business Inteligence
Sql Server 2005 Business InteligenceSql Server 2005 Business Inteligence
Sql Server 2005 Business Inteligence
 
Datastage to ODI
Datastage to ODIDatastage to ODI
Datastage to ODI
 
Modularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache SparkModularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Datastage ppt
Datastage pptDatastage ppt
Datastage ppt
 

Destaque

Pentaho Data Integration Introduction
Pentaho Data Integration IntroductionPentaho Data Integration Introduction
Pentaho Data Integration Introductionmattcasters
 
Moving and Transforming Data with Pentaho Data Integration 5.0 CE (aka Kettle)
Moving and Transforming Data with Pentaho Data Integration 5.0 CE (aka Kettle)Moving and Transforming Data with Pentaho Data Integration 5.0 CE (aka Kettle)
Moving and Transforming Data with Pentaho Data Integration 5.0 CE (aka Kettle)Roland Bouman
 
Pentaho technical whitepaper-1-6
Pentaho technical whitepaper-1-6Pentaho technical whitepaper-1-6
Pentaho technical whitepaper-1-6skonda
 
Open Source ETL vs Commercial ETL
Open Source ETL vs Commercial ETLOpen Source ETL vs Commercial ETL
Open Source ETL vs Commercial ETLJonathan Levin
 
Pentaho-BI
Pentaho-BIPentaho-BI
Pentaho-BIEdureka!
 
SQL Tutorial - Basic Commands
SQL Tutorial - Basic CommandsSQL Tutorial - Basic Commands
SQL Tutorial - Basic Commands1keydata
 
DATA WAREHOUSING
DATA WAREHOUSINGDATA WAREHOUSING
DATA WAREHOUSINGKing Julian
 

Destaque (11)

Pentaho Data Integration Introduction
Pentaho Data Integration IntroductionPentaho Data Integration Introduction
Pentaho Data Integration Introduction
 
Moving and Transforming Data with Pentaho Data Integration 5.0 CE (aka Kettle)
Moving and Transforming Data with Pentaho Data Integration 5.0 CE (aka Kettle)Moving and Transforming Data with Pentaho Data Integration 5.0 CE (aka Kettle)
Moving and Transforming Data with Pentaho Data Integration 5.0 CE (aka Kettle)
 
Introduction To Pentaho
Introduction To PentahoIntroduction To Pentaho
Introduction To Pentaho
 
Pentaho technical whitepaper-1-6
Pentaho technical whitepaper-1-6Pentaho technical whitepaper-1-6
Pentaho technical whitepaper-1-6
 
Introduction To Pentaho
Introduction To PentahoIntroduction To Pentaho
Introduction To Pentaho
 
Open Source ETL vs Commercial ETL
Open Source ETL vs Commercial ETLOpen Source ETL vs Commercial ETL
Open Source ETL vs Commercial ETL
 
Pentaho-BI
Pentaho-BIPentaho-BI
Pentaho-BI
 
SQL Basics
SQL BasicsSQL Basics
SQL Basics
 
SQL Tutorial - Basic Commands
SQL Tutorial - Basic CommandsSQL Tutorial - Basic Commands
SQL Tutorial - Basic Commands
 
Sql ppt
Sql pptSql ppt
Sql ppt
 
DATA WAREHOUSING
DATA WAREHOUSINGDATA WAREHOUSING
DATA WAREHOUSING
 

Semelhante a Introduction To Pentaho Kettle

WhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
WhereHows: Taming Metadata for 150K Datasets Over 9 Data PlatformsWhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
WhereHows: Taming Metadata for 150K Datasets Over 9 Data PlatformsMars Lan
 
Data analysis with Pandas and Spark
Data analysis with Pandas and SparkData analysis with Pandas and Spark
Data analysis with Pandas and SparkFelix Crisan
 
Introduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQLIntroduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQLdatamantra
 
Apache Spark sql
Apache Spark sqlApache Spark sql
Apache Spark sqlaftab alam
 
A compute infrastructure for data scientists
A compute infrastructure for data scientistsA compute infrastructure for data scientists
A compute infrastructure for data scientistsStitch Fix Algorithms
 
Introduction to SQL Alchemy - SyPy June 2013
Introduction to SQL Alchemy - SyPy June 2013Introduction to SQL Alchemy - SyPy June 2013
Introduction to SQL Alchemy - SyPy June 2013Roger Barnes
 
Proud to be polyglot
Proud to be polyglotProud to be polyglot
Proud to be polyglotTugdual Grall
 
Spark Workflow Management
Spark Workflow ManagementSpark Workflow Management
Spark Workflow ManagementRomi Kuntsman
 
Data processing with spark in r & python
Data processing with spark in r & pythonData processing with spark in r & python
Data processing with spark in r & pythonMaloy Manna, PMP®
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformDatabricks
 
Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...
Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...
Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...Lviv Startup Club
 
BlackRay - The open Source Data Engine
BlackRay - The open Source Data EngineBlackRay - The open Source Data Engine
BlackRay - The open Source Data Enginefschupp
 
SQL or NoSQL - how to choose
SQL or NoSQL - how to chooseSQL or NoSQL - how to choose
SQL or NoSQL - how to chooseLars Thorup
 
Build an Open Source Data Lake For Data Scientists
Build an Open Source Data Lake For Data ScientistsBuild an Open Source Data Lake For Data Scientists
Build an Open Source Data Lake For Data ScientistsShawn Zhu
 
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSpark Summit
 
Blackray @ SAPO CodeBits 2009
Blackray @ SAPO CodeBits 2009Blackray @ SAPO CodeBits 2009
Blackray @ SAPO CodeBits 2009fschupp
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonDremio Corporation
 

Semelhante a Introduction To Pentaho Kettle (20)

NoSQL
NoSQLNoSQL
NoSQL
 
WhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
WhereHows: Taming Metadata for 150K Datasets Over 9 Data PlatformsWhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
WhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
 
Data analysis with Pandas and Spark
Data analysis with Pandas and SparkData analysis with Pandas and Spark
Data analysis with Pandas and Spark
 
Introduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQLIntroduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQL
 
Are we there yet?
Are we there yet?Are we there yet?
Are we there yet?
 
Apache Spark sql
Apache Spark sqlApache Spark sql
Apache Spark sql
 
A compute infrastructure for data scientists
A compute infrastructure for data scientistsA compute infrastructure for data scientists
A compute infrastructure for data scientists
 
Introduction to SQL Alchemy - SyPy June 2013
Introduction to SQL Alchemy - SyPy June 2013Introduction to SQL Alchemy - SyPy June 2013
Introduction to SQL Alchemy - SyPy June 2013
 
Proud to be polyglot
Proud to be polyglotProud to be polyglot
Proud to be polyglot
 
Spark Workflow Management
Spark Workflow ManagementSpark Workflow Management
Spark Workflow Management
 
Data processing with spark in r & python
Data processing with spark in r & pythonData processing with spark in r & python
Data processing with spark in r & python
 
What's new in SQL on Hadoop and Beyond
What's new in SQL on Hadoop and BeyondWhat's new in SQL on Hadoop and Beyond
What's new in SQL on Hadoop and Beyond
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis Platform
 
Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...
Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...
Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...
 
BlackRay - The open Source Data Engine
BlackRay - The open Source Data EngineBlackRay - The open Source Data Engine
BlackRay - The open Source Data Engine
 
SQL or NoSQL - how to choose
SQL or NoSQL - how to chooseSQL or NoSQL - how to choose
SQL or NoSQL - how to choose
 
Build an Open Source Data Lake For Data Scientists
Build an Open Source Data Lake For Data ScientistsBuild an Open Source Data Lake For Data Scientists
Build an Open Source Data Lake For Data Scientists
 
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
 
Blackray @ SAPO CodeBits 2009
Blackray @ SAPO CodeBits 2009Blackray @ SAPO CodeBits 2009
Blackray @ SAPO CodeBits 2009
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in London
 

Mais de Boulder Java User's Group (8)

Spring insight what just happened
Spring insight   what just happenedSpring insight   what just happened
Spring insight what just happened
 
Big Data and OSS at IBM
Big Data and OSS at IBMBig Data and OSS at IBM
Big Data and OSS at IBM
 
Json at work overview and ecosystem-v2.0
Json at work   overview and ecosystem-v2.0Json at work   overview and ecosystem-v2.0
Json at work overview and ecosystem-v2.0
 
Restful design at work v2.0
Restful design at work v2.0Restful design at work v2.0
Restful design at work v2.0
 
Introduction To JavaFX 2.0
Introduction To JavaFX 2.0Introduction To JavaFX 2.0
Introduction To JavaFX 2.0
 
55 New Features in Java 7
55 New Features in Java 755 New Features in Java 7
55 New Features in Java 7
 
Watson and Open Source Tools
Watson and Open Source ToolsWatson and Open Source Tools
Watson and Open Source Tools
 
Intro to Redis
Intro to RedisIntro to Redis
Intro to Redis
 

Último

[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 

Último (20)

[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 

Introduction To Pentaho Kettle

  • 1. Pentaho Data Integration/Kettle ● Dan Moore ● 8z Real Estate ● Kettle user for two years
  • 2. Questions ● Who has used a relational database? ● Who has written scripts or java code to munge data from one source and load it to another? – What did you use? – Scripts – Custom java code – ETL tool
  • 3. What is Kettle? ● Batch data integration and processing tool written in Java ● Exists to retrieve, process and load data ● ETL – Extract, transform and load ● PDI synonomous
  • 4. What is Kettle good for ● Mirroring data from master to slave ● Syncing two data sources ● Processing data retrieved from multiple sources and pushed to multiple destinations ● Loading data to RDBMS ● Datamart/data warehouse – Dimension lookup/update step ● Graphical manipulation of data
  • 5. Alternatives ● Code – Custom java – Spring batch ● Scripts – perl, python, shell, etc – Possibly + db loader tool and cron ● Commercial ETL tools – Oracle Warehouse Builder – Datastage – Informatica – SQL Server Integration services ● Open source ETL tools: – Talend – KETL – Clover.ETL ● Special case tools – SymmetricDS – Db replication
  • 6. Why Kettle is better ● Higher level than code – Graphical interface – No connection pooling to worry about – No DDL to write – Validation/business rules ● Well tested full suite of components ● Data analysis tools – Preview – Data profiling with data cleaner (add on) ● Free (as in beer and speech) – Two editions – GPLv2 ● Performant? – Developer vs computer performant – Depends, right? – Sitemailsame job 10k rows/second for 125M rows ● Leverage java – jvm tuning skills – java libraries and logic (in jars)
  • 7. Data sources ● Files ● Databases ● No SQL ● REST ● XML ● Hadoop/HBase ● JSON ● Excel ● EDI ● RSS ● Google Analytics
  • 8. Kettle concepts ● Repository ● Rows/Stream ● Steps ● Job ● Transformation
  • 9. Demo 1: one way sync ● Sync tables
  • 10. Demo 2: processing ● Process data from one table and replace some values, filter some values ● Lookup table
  • 11. Demo 3: log file processing ● Load apache logs for analysis
  • 12. What it is not good for ● User interfaces/user interaction ● Small data sets – 500 (from experience) ● Web applications ● One off processes? – One off becomes regular
  • 13. Who uses ● Survey results – ~20 people ● Number of downloads: 110K downloads of Kettle 4.4 – Since Nov 2012 ● Our specific use – MLS data ● Different data source formats and types (jdbc, local csv, ftp) – Public records data ● Fixed width files
  • 14. Larger picture ● Kettle 10 years old – joined Pentaho about 7 years ago ● Open source, at version 4.4 – GPLv2 license – EE edition available ● BI suite – Reporting – Analytics – Dashboards – Machine Learning (weka)
  • 15. Kettle tools ● Spoon ● Kitchen ● Pan ● Carte – Clustering tool
  • 16. Advanced topics ● Existing java logic – Embedded – Polygon example – Demo 4 ● Deployment – Variables Config files are your friend ● Mapping/Parameterization – Subroutines of logic
  • 17. Advanced Topics Continued ● Testing – Who tests ● Version control – Who uses version control ● Error handling – Email – Log files
  • 18. Getting started ● Download – sourceforge ● Includes over 150 example transformations – Mysql 3.14 jdbc driver ● Helpful sites – Forums: http://forums.pentaho.com/forumdisplay.php?135-Pentaho-Data-Integration- Kettle – Wiki: http://wiki.pentaho.com/display/EAI/Pentaho+Data+Integration+Steps – Testing: http://www.mooreds.com/wordpress/pentaho-kettle-testing ● Helpful books – Pentaho Kettle Solutions: Casters, Bouman, van Dongen ● Barely scratched surface ● Don't like tools that turn me into a mechanic