The document discusses Kettle, an open source ETL tool from Pentaho. It provides an introduction to the ETL process and describes Kettle's major components: Spoon for designing transformations and jobs, Pan for executing transformations, and Kitchen for executing jobs. Transformations in Kettle perform tasks like data filtering, field manipulation, lookups and more. Jobs are used to call and sequence multiple transformations. The document also covers recent Kettle releases and how it can help address challenges in data integration projects.
2. Agenda
Introduction
− ETL Process
− Pentaho's Kettle
Data Integration Challenges
Prerequisites and Recent Releases
Pentaho DI Components
Spoon
− Transformations
− Jobs
3. Introduction – ETL Process
Major Components
− Extracting
Gathering raw data from source systems and storing it in the ETL staging environment
Data Profiling
Identifying data that changed since last load.
− Transforming: Cleaning and Conforming
Processing data to improve its quality, format it, merge it from multiple sources, and enforce conformed dimensions
Data cleansing
Recording error events
Audit dimensions
Creating and maintaining conformed dimensions and facts
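The cleansing and error-event items above can be sketched in a few lines of Python. This is only an illustration, not how Kettle implements them: the field names (`name`, `country`), the two validation rules, and the `CleanResult` container are assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class CleanResult:
    """Holds rows that passed cleansing and an error event for each row that did not."""
    clean_rows: list = field(default_factory=list)
    error_events: list = field(default_factory=list)


def clean_customers(rows):
    """Trim and normalize each row; record an error event instead of loading bad rows."""
    result = CleanResult()
    for line_no, row in enumerate(rows, start=1):
        name = (row.get("name") or "").strip()
        country = (row.get("country") or "").strip().upper()
        if not name:
            result.error_events.append({"line": line_no, "reason": "missing name"})
            continue
        if len(country) != 2:  # hypothetical rule: two-letter country codes only
            result.error_events.append({"line": line_no, "reason": "bad country code"})
            continue
        result.clean_rows.append({"name": name.title(), "country": country})
    return result
```

Keeping rejected rows as error events, rather than silently dropping them, is what lets an audit dimension report on data quality later.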
4. Introduction – ETL Process
− Loading
Loading data into data warehouse tables
Managing hierarchies in dimensions
Managing special dimensions such as date and time, junk, mini, shrunken, small static, and user-maintained dimensions
Fact table loading
Building and maintaining bridge dimension tables
Handling late arriving data
Management of conformed dimensions
Administration of fact tables
Building aggregations
Building OLAP cubes
Transferring DW data to other environments for specific purposes
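As one example of the special dimensions listed above, a date dimension is usually generated rather than extracted. The sketch below assumes a hypothetical column layout (`date_key`, `full_date`, `year`, `month`, `day_of_week`, `is_weekend`); a real warehouse would pick its own attributes.

```python
from datetime import date, timedelta


def date_dimension(start, end):
    """Yield one dimension row per calendar day from start to end, inclusive."""
    day = start
    while day <= end:
        yield {
            "date_key": int(day.strftime("%Y%m%d")),  # e.g. 20080425
            "full_date": day.isoformat(),
            "year": day.year,
            "month": day.month,
            "day_of_week": day.isoweekday(),  # 1 = Monday ... 7 = Sunday
            "is_weekend": day.isoweekday() >= 6,
        }
        day += timedelta(days=1)
```

Calling `list(date_dimension(date(2008, 1, 1), date(2008, 12, 31)))` would produce the rows for one year, ready to be loaded into a dimension table.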
5. Data Transformation and Integration Examples
Data filtering
− Is not null, greater than, less than, includes
Field manipulation
− Trimming, padding, upper and lowercase conversion
Data calculations
− Addition, subtraction, multiplication, division, average, absolute value, arctangent, natural logarithm
Date manipulation
− First day of month, Last day of month, add months, week of year, day of year
Data type conversion
− String to number, number to string, date to number
Merging fields & splitting fields
Looking up data
− Look up in a database, in a text file, an excel sheet, …
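To make the categories above concrete, here is a minimal Python sketch that applies several of them to a single row: filtering, field manipulation, type conversion, date manipulation and a lookup. It is not Kettle code. The row layout (`customer`, `country`, `amount`, `order_date`) and the in-memory `country_lookup` dictionary are assumptions for illustration; in Kettle each of these would be a separate step.

```python
import calendar
from datetime import date


def first_day_of_month(d):
    return d.replace(day=1)


def last_day_of_month(d):
    return d.replace(day=calendar.monthrange(d.year, d.month)[1])


def add_months(d, months):
    """Add months, clamping the day to the length of the target month."""
    index = d.year * 12 + (d.month - 1) + months
    year, month = divmod(index, 12)
    month += 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def transform(row, country_lookup):
    """Return the transformed row, or None when the row is filtered out."""
    if row.get("amount") is None:  # data filtering: "is not null"
        return None
    code = row["country"].strip().upper()  # field manipulation: trim, uppercase
    order_date = row["order_date"]
    return {
        "customer": row["customer"].strip().title(),
        "country": country_lookup.get(code, "Unknown"),  # lookup
        "amount": float(row["amount"]),  # type conversion: string to number
        "month_start": first_day_of_month(order_date),  # date manipulation
        "week_of_year": order_date.isocalendar()[1],
        "day_of_year": order_date.timetuple().tm_yday,
    }
```

Note that `add_months(date(2008, 1, 31), 1)` returns 29 February 2008 because the day is clamped; other tools may choose a different rule for month ends.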
6. Introduction – Pentaho Kettle
Kettle – Kettle Extraction, Transformation, Transportation & Loading Environment
It is an open source tool for powerful data integration, part of the business intelligence suite from Pentaho. Pentaho was founded in 2004.
Products of Pentaho
− Mondrian – OLAP server written in Java
− Kettle – ETL tool
7. Data Integration - Challenges
Data is everywhere
Data is inconsistent
− Records are different in each system
Performance issues
− Running queries that summarize data over a long stipulated period ties up the operational system for the duration of the task
Data is never all in the data warehouse
− Excel sheet, acquisition, new application
8. Prerequisites and Recent Releases
Prerequisites
− Java Runtime Environment 1.5 and above
− Compatible with almost any platform
− Compatible with a wide range of database technologies
Recent Releases
4/25 Data Integration 3.0.3 GA
4/18 Data Integration 3.1 Milestone
2/8 Data Integration 3.0.2 GA
12/12 Data Integration 3.0.1 GA
11/15 Data Integration 3.0 GA
10/31 Data Integration 3.0 RC2
10/24 Data Integration 2.5.2 GA
10/08 Data Integration 3.0 RC1
08/24 Data Integration 2.5.1 GA
9. Pentaho Components
Spoon
− GUI that allows you to design transformations and jobs that can
be run with the Kettle tools — Pan and Kitchen
− Transformations and jobs can be saved as XML files or stored in a Kettle database repository.
− Spoon is available as both a shell script and a batch file, so the tool can be used in heterogeneous environments.
Pan
− A program that executes transformations designed in Spoon and stored as XML or in a database repository.
− Usually, transformations are scheduled in batch mode to be run automatically at regular intervals.
Kitchen
− A program that executes jobs designed in Spoon and stored as XML or in a database repository.
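Because Pan and Kitchen are command-line programs, a scheduler or wrapper script typically just assembles their arguments. The sketch below builds such argument lists in Python. The script locations (`./pan.sh`, `./kitchen.sh`) and the file paths in the usage note are assumptions; `-file` and `-level` are standard Pan and Kitchen options, but check the option list printed by your own version before relying on them.

```python
def pan_command(ktr_path, level="Basic", script="./pan.sh"):
    """Build the argument list that runs one transformation file with Pan."""
    return [script, f"-file={ktr_path}", f"-level={level}"]


def kitchen_command(kjb_path, level="Basic", script="./kitchen.sh"):
    """Build the argument list that runs one job file with Kitchen."""
    return [script, f"-file={kjb_path}", f"-level={level}"]
```

A wrapper could pass the result to `subprocess.run(pan_command("/etl/load_sales.ktr"), check=True)` from a cron job, where `/etl/load_sales.ktr` is a hypothetical path.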
10. Repository Connection Establishment
Auto login
− By manually setting the KETTLE_REPOSITORY, KETTLE_USER and KETTLE_PASSWORD environment variables.
Login
− By default, PDI provides admin as both the login username and the password.
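The auto login variables can be prepared by whatever process launches Spoon. A minimal Python sketch, assuming a hypothetical repository name (`dev_repo`) and the default admin credentials mentioned above:

```python
import os


def kettle_login_env(repository, user, password, base_env=None):
    """Return a copy of the environment with the three auto login variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(
        {
            "KETTLE_REPOSITORY": repository,
            "KETTLE_USER": user,
            "KETTLE_PASSWORD": password,
        }
    )
    return env
```

The returned dictionary can be handed to `subprocess.run(..., env=kettle_login_env("dev_repo", "admin", "admin"))` when starting Spoon. Keeping a real password in an environment variable exposes it to other processes run by the same user, so this is best limited to development machines.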
14. Transformation
− Value: a value is part of a row and can contain any type of data
− Row: a row consists of 0 or more values
− Output stream: an output stream is a stack of rows that leaves a step
− Input stream: an input stream is a stack of rows that enters a step
− Hop: a hop is a graphical representation of one or more data streams between 2 steps
− Note: a note is a piece of information that can be added to a transformation
A transformation is an engine capable of performing a multitude of functions such as reading, manipulating and writing data to and from various data sources.
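The vocabulary above maps naturally onto a small pipeline. The following Python sketch is a toy model, not Kettle's engine: each generator function plays the role of a step, each dictionary is a row of values, and handing one generator to the next stands in for a hop. The step names and the `sku` and `qty` fields are made up for the example.

```python
def read_rows(records):
    """Input step: emits one row (a dict of values) per source record."""
    for record in records:
        yield dict(record)


def filter_rows(rows, predicate):
    """Filter step: only rows that satisfy the predicate reach the output stream."""
    for row in rows:
        if predicate(row):
            yield row


def uppercase_field(rows, name):
    """Field manipulation step: uppercases one value in every row."""
    for row in rows:
        yield {**row, name: row[name].upper()}


def run_transformation(records):
    # Each reassignment below acts as a hop: the output stream of one step
    # becomes the input stream of the next.
    stream = read_rows(records)
    stream = filter_rows(stream, lambda row: row["qty"] > 0)
    stream = uppercase_field(stream, "sku")
    return list(stream)
```

Because generators are lazy, rows flow through all steps one at a time instead of each step finishing before the next one starts.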
15. Jobs
− Job Entry: a job entry is one part of a job and performs a certain task
− Hop: a hop is a graphical representation of the link between 2 job entries; it determines the order in which they are executed
− Note: a note is a piece of information that can be added to a job
A job is a way of calling transformations and controlling the sequence of their execution. Usually jobs are scheduled in batch mode to be run automatically at regular intervals.
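The sequencing role of a job can be illustrated with a toy runner in Python. This is a sketch under stated assumptions, not Kettle's job engine: every entry is a callable that reports success or failure, and each entry names which entry follows on success and which on failure. The entry names (`load_staging`, `load_warehouse`, `send_alert`) are hypothetical.

```python
def run_job(entries, start):
    """Run job entries in sequence and return the names that were executed.

    entries maps a name to (action, next_on_success, next_on_failure);
    a next value of None ends the job.
    """
    executed = []
    current = start
    while current is not None:
        action, on_success, on_failure = entries[current]
        executed.append(current)
        try:
            ok = bool(action())
        except Exception:
            ok = False  # an entry that raises counts as a failed entry
        current = on_success if ok else on_failure
    return executed
```

With entries wired so that a failed warehouse load leads to an alert, the runner visits the staging load, the warehouse load, and then the alert, which is the kind of control flow a job describes.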