ETL tools extract data from various sources, transform it for reporting and analysis, cleanse errors, and load it into a data warehouse. They save time and money compared to manual coding by automating this process. Popular open-source ETL tools include Pentaho Kettle and Talend, while Informatica is a leading commercial tool. A comparison found that Pentaho Kettle uses a graphical interface and standalone engine, has a large user community, and includes data quality features, while Talend generates code to run ETL jobs.
4. Overview
What do ETL tools do?
An ETL tool is a tool that:
Extracts data from various data sources (usually legacy)
Transforms data
from -> being optimized for transactions
to -> being optimized for reporting and analysis
synchronizes the data coming from different databases
cleanses the data to remove errors
Loads data into a data warehouse
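The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not how any of the tools in this deck work internally: two in-memory SQLite databases stand in for the legacy source and the warehouse, and the table names `orders` and `sales_by_customer` are hypothetical.

```python
import sqlite3


def extract(conn):
    """Extract raw rows from the (hypothetical) transactional table."""
    return conn.execute("SELECT customer, amount FROM orders").fetchall()


def transform(rows):
    """Cleanse bad rows, normalise names and aggregate for reporting."""
    totals = {}
    for customer, amount in rows:
        if customer is None or amount is None:
            continue  # cleansing: skip incomplete records
        name = customer.strip().title()  # synchronise spelling variants
        if not name:
            continue
        totals[name] = totals.get(name, 0.0) + float(amount)
    return sorted(totals.items())


def load(conn, rows):
    """Load the reporting-friendly rows into the (hypothetical) warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_by_customer "
        "(customer TEXT PRIMARY KEY, total REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO sales_by_customer VALUES (?, ?)", rows)
    conn.commit()


def run_demo():
    """Build a tiny source database, run the pipeline, return the warehouse rows."""
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    source.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(" alice ", 10.0), ("ALICE", 5.0), ("bob", 7.5), (None, 3.0), ("carol", None)],
    )
    warehouse = sqlite3.connect(":memory:")
    load(warehouse, transform(extract(source)))
    return warehouse.execute(
        "SELECT customer, total FROM sales_by_customer ORDER BY customer"
    ).fetchall()
```

Calling `run_demo()` merges the two spellings of "alice" into one warehouse row and drops the two incomplete records.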
5. Overview
Why use an ETL tool?
ETL tools save time and money when developing a data warehouse by
removing the need for “hand-coding”.
“Hand-coding” is still the most common way of integrating data today. It
requires many hours of development and considerable expertise to create a
Business Intelligence system.
It is very difficult for database administrators to connect
different brands of databases without using an external tool.
In the event that databases are altered or new databases need to be
integrated, a lot of “hand-coded” work needs to be completely redone.
7. Tools
ETL Tools
Pentaho Kettle
Pentaho is a commercial open-source BI suite that has a product called
Kettle for data integration.
It uses an innovative metadata-driven approach and has a strong and very
easy-to-use GUI
The company started around 2001
It has a strong community of 13,500 registered users
It uses a stand-alone Java engine that processes the tasks for moving data
between many different databases and files
8. Tools
ETL Tools
Talend
Talend is an open-source data integration tool
It uses a code-generating approach and has a GUI (implemented in
Eclipse RCP)
It started around October 2006
It has a much smaller community than Pentaho, but is supported by two
finance companies
It generates Java code or Perl code which can later be run on a server
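The difference between a code-generating tool and the meta-driven, engine-interpreted style the slides attribute to Kettle can be illustrated with a toy sketch. Everything here is hypothetical (the `JOB` format, the operation names, the helper functions); neither Talend nor Kettle uses this format, and Talend emits Java or Perl rather than Python.

```python
# Hypothetical job description: an ordered list of (operation, argument) steps.
JOB = [("strip", None), ("upper", None), ("prefix", "ID-")]

# Engine-side implementations, looked up while the job is running.
OPS = {
    "strip": lambda value, arg: value.strip(),
    "upper": lambda value, arg: value.upper(),
    "prefix": lambda value, arg: arg + value,
}

# Source-code templates, used once when the job is "compiled".
TEMPLATES = {
    "strip": "    value = value.strip()",
    "upper": "    value = value.upper()",
    "prefix": "    value = {arg!r} + value",
}


def run_interpreted(job, value):
    """Meta-driven style: a generic engine reads the job description at run time."""
    for op, arg in job:
        value = OPS[op](value, arg)
    return value


def generate_source(job, func_name="generated_job"):
    """Code-generating style: emit standalone source code for the job."""
    lines = [f"def {func_name}(value):"]
    for op, arg in job:
        lines.append(TEMPLATES[op].format(arg=arg))
    lines.append("    return value")
    return "\n".join(lines)


def compile_job(job):
    """Turn the generated source into a callable that no longer needs the engine."""
    namespace = {}
    exec(generate_source(job), namespace)
    return namespace["generated_job"]
```

Both paths give the same result for the same job. The practical difference is that the generated source can be shipped and run on a server without the design tool, while the interpreted job needs the engine wherever it runs.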
9. Tools
ETL Tools
Informatica PowerCenter
Informatica has a very good commercial data integration suite
It was founded in 1993
It is the market share leader in data integration (Gartner Dataquest)
It has 2600 customers. These include Fortune 100 companies,
companies listed on the Dow Jones and government organizations
The company's sole focus is data integration
It offers a large package that lets enterprises integrate their systems and
cleanse their data, and that can connect to a vast number of current and legacy
systems
10. Open Source Tools
ETL Tools
Inaplex Inaport
InaPlex is a small UK company
InaPlex is a producer of Customer Data Integration products for
mid-market CRM solutions
InaPlex mainly focuses on providing simple solutions for its customers to
integrate their data into CRM and accounting software like Sage and
GoldMine
12. Type
IBM (InfoSphere Information Server platform)
Advantages:
strongest vision on the market, flexibility
progress towards common metadata platform
high level of satisfaction from clients and a variety of initiatives
Disadvantages:
steep learning curve
long implementation cycles
became very heavy (lots of GBs) with version 8.x and requires a lot of
processing power
13. Type
Informatica PowerCenter
Advantages:
largest size and resources among data integration
tool vendors
consistent track record, solid technology, straightforward learning
curve, ability to address real-time data integration schemes
Informatica is highly specialized in ETL and Data Integration and
focuses on those topics, not on BI as a whole
focus on B2B data exchange
Disadvantages:
several partnerships diminishing the value of technologies
limited experience in the field.
14. Type
Microsoft (SQL Server Integration Services)
Advantages:
broad documentation and support, best practices for data warehouses
ease and speed of implementation
standardized data integration
real-time, message-based capabilities
relatively low cost - excellent support and distribution model
Disadvantages:
problems in non-Windows environments. Inherits all Microsoft
Windows limitations.
unclear vision and strategy
15. Type
Oracle (OWB and ODI)
Advantages:
based on Oracle Warehouse Builder and Oracle Data Integrator – two
very powerful tools;
tight connection to all Oracle data warehousing applications;
tendency to integrate all tools into one application and one environment.
Disadvantages:
focus on ETL solutions, rather than on an open context of data
management;
tools are used mostly for batch-oriented work and transformation rather
than real-time processes or federated data delivery;
long-awaited bond between OWB and ODI brought only promises:
customers are confused about the functionality and the future is uncertain
16. Type
SAP BusinessObjects (Data Integrator / Data Services)
Advantages:
integration with SAP
The combination of SAP and Business Objects created a strong company
determined to stir up the market;
Good data modeling and data-management support;
SAP Business Objects provides tools for data mining, quality and
profiling thanks to many acquisitions of other companies.
Quick learning curve and ease of use
Disadvantages:
SAP Business Objects is seen as two different companies
Uncertain future. Controversy over deciding which method of delivering
data integration to use (SAP BW or BODI).
BusinessObjects Data Integrator (Data Services) may not be seen as a
17. Types
SAS
Advantages:
experienced company, great support and above all a very powerful data
integration tool with lots of multi-management features
can work on many operating systems and gather data from a number of
sources; very flexible
great support for enterprise-class companies as well as for medium-sized
and small ones
Disadvantages:
misplaced sales force, company is not well recognized
SAS has to extend its influence to reach the non-BI community
Costly
18. Types
Sun Microsystems
Advantages:
Data integration tools are part of the huge Java Composite Application
Platform Suite; very flexible, with ongoing development of the products
'Single-view' services draw together data from a variety of sources; one of
a small set of vendors with a strong vision
Disadvantages:
relative weakness in bulk data movement
limited mindshare in the market
support and services rated below adequate
19. Types
Sybase
Advantages:
assembled a range of capabilities to be able to address a multitude of
data delivery styles
size and global presence of Sybase create opportunities in the market
pragmatic near-term strategy, better aligned with current market demand
broad partnerships with other data quality and data integration tool
vendors
Disadvantages:
falls behind market leaders and large vendors
gaps in many aspects of data management
20. Types
Syncsort
Advantages:
well-known brand on the market (40 years' experience); loyal customer and
experience base;
easy implementation, strong performance, targeted functionality and
lower costs
Disadvantages:
struggles to gain mindshare in the market
lack of support for delivery styles other than ETL
dissatisfaction with the lack of professional services capability
21. Types
Tibco Software
Advantages:
message-oriented application integration; capabilities based on common
SOA structures;
support for federated views; easy implementation, support
and performance
Disadvantages:
scarce references from customers; not widely enough recognised for
data integration competencies
lacking in data quality capabilities.
22. Comparison
Pentaho Kettle vs Talend
Pentaho
Pentaho is a commercial open-source BI suite that has a product called
Kettle for data integration.
It uses an innovative metadata-driven approach and has a strong and very
easy-to-use GUI.
The company started around 2001 (2002 was when Kettle was integrated
into it).
It has a strong community of 13,500 registered users.
It has a stand-alone Java engine that processes the jobs and tasks for
moving data between many different databases and files.
It can schedule tasks (but you need an external scheduler for that, such as cron).
It can run remote jobs on "slave servers" on other machines.
It has data quality features: through its own GUI, custom
SQL queries, JavaScript and regular expressions.
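As an illustration of the kind of regular-expression data quality check mentioned above, here is a minimal sketch in plain Python. It is not a Kettle step or Kettle's API; the `email` field, the deliberately simple pattern and the function name `split_valid_invalid` are all assumptions made for the example.

```python
import re

# Deliberately simple pattern: something@something.something, no whitespace.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def split_valid_invalid(rows, field="email"):
    """Route rows into a clean stream and a reject stream based on one field."""
    valid, invalid = [], []
    for row in rows:
        value = (row.get(field) or "").strip()
        if EMAIL_RE.match(value):
            # Normalise the accepted value before it moves downstream.
            valid.append({**row, field: value.lower()})
        else:
            # Keep rejected rows unchanged so they can be inspected later.
            invalid.append(row)
    return valid, invalid
```

Splitting rows into an accepted stream and a reject stream, rather than silently dropping bad records, keeps the cleansing step auditable.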
23. Conclusion
Informatica and Pentaho have very good products.
Informatica has a far more extensive range of products, but compared
to Pentaho is very expensive.
Pentaho has proved that it can handle small- to large-scale systems.
Pentaho is quickly gaining momentum with businesses that would not have
considered using open-source products before.