Intro to Talend Open Studio for Data Integration
Philip Yurchuk
http://philip.yurchuk.com
What is Talend?
• Eclipse-based visual programming editor
• Generates executable Java code
• Jobs can run standalone or embedded (no special server)
• Batch or interactive (user input)
What is ETL?
• Extract: suck up data
• Transform: mess with it
• Load: blow it out
• Batch, integration, migration, etc.
Extract from/load to where?
• Over 600 components
• Over 450 connectors
• Allows multiple inputs/outputs in a single job
Connectors
• Flat files
  • Delimited (tab, CSV…)
  • XML
  • JSON
  • Excel
  • Positional
  • Apache HTTP logs, HL7...
• Applications/Platforms
  • Alfresco
  • Microsoft Dynamics (CRM, AX)
  • SAP
  • Sage ERP X3
  • Salesforce
  • SugarCRM
Connectors (continued)
• Relational Databases
  • MySQL
  • PostgreSQL
  • MS SQL
  • Oracle
  • Many more
• NoSQL/Columnar/OLAP/Other
  • Amazon Redshift
  • Greenplum
  • Hive
  • OLAP cubes
  • LDAP
  • VectorWise
  • Teradata
  • More in the Big Data edition
How do we transport data?
• File system
• FTP
• SFTP/SCP
• Web service (SOAP, REST)
• HTTP
• Mail, POP
• XML-RPC, Sockets, JMS, RSS...
Other Components
• Process data: join, filter, aggregate
• Flow control: loops, job invocation
• Logs, statistics
• Code: Java, Groovy
  • On row data or standalone
  • Can load libraries
Demo
Nifty Components
• FuzzyMatch - calculate Levenshtein distance or phonetic similarity
• IntervalMatch - perform lookup/join based on values falling within an interval
• Replace, ReplaceList - search and replace, substitution
• UniqRow - output distinct rows based on defined key columns
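To make the FuzzyMatch bullet concrete, here is a minimal sketch of the Levenshtein distance it refers to. The class and method names are hypothetical, and this is a textbook dynamic-programming version, not Talend's own implementation.

```java
// Hypothetical helper illustrating Levenshtein distance; not Talend code.
class LevenshteinSketch {

    // Returns the minimum number of single-character insertions,
    // deletions, or substitutions needed to turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning the empty prefix of a into the first j chars of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i chars of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int substitute = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insert, delete), substitute);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // prints 3
    }
}
```

A fuzzy lookup would typically accept a match when the distance falls under some threshold; the threshold is a tuning choice rather than anything stated in these slides.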
More Nifty Components
• XMLMap - allows joins, column or row filtering, transformations, and multiple outputs
• Normalize/Denormalize - split a delimited string into separate rows, or join values from multiple rows into one delimited string
• AggregateRow - GROUP BY; min, max, sum, and other functions that aggregate rows on a column
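For readers who think in code rather than in components, the GROUP BY behavior described for AggregateRow can be sketched with Java streams. The `Sale` record and its fields are made up for illustration, and the sketch assumes Java 16 or later for records.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical example of GROUP BY with a SUM aggregate; not Talend-generated code.
class AggregateSketch {

    // One input row: a grouping key and a value to aggregate.
    record Sale(String region, int amount) {}

    // Equivalent in spirit to: SELECT region, SUM(amount) ... GROUP BY region
    static Map<String, Integer> sumByRegion(List<Sale> rows) {
        return rows.stream().collect(
                Collectors.groupingBy(Sale::region, TreeMap::new,
                        Collectors.summingInt(Sale::amount)));
    }

    public static void main(String[] args) {
        List<Sale> rows = List.of(
                new Sale("east", 10), new Sale("west", 5), new Sale("east", 7));
        System.out.println(sumByRegion(rows)); // prints {east=17, west=5}
    }
}
```

Swapping `summingInt` for another downstream collector (for example `Collectors.counting()`) gives the other aggregate functions the slide mentions.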
Tips and Tricks
• Use CamelCase job names for embedded jobs.
• Or prefix job names with the ETL phase and order of execution.
• Whenever appropriate (esp. for inserting data), use the schema from the repository.
• When connecting, propagating changes to a DB component will change it to a built-in schema, which won't get updated after repository changes.
Tips and Tricks
• Propagating changes to a DB component will change it to a built-in schema, which won't get updated after repo changes.
• On the other hand, remember that for lookup/join (i.e., SELECT) queries you can modify the query to select only the fields you need. Propagating the schema is useful then.
Tips and Tricks
• Failure handling subjob:
  • It’s an unconnected subjob (no triggers point to it).
  • Use LogCatcher to catch and record component failures.
  • Record the failure in a DB, file, email, etc.
  • Add a rollback component to undo DB changes if necessary. You may need to do this in the job itself if strategic placement is needed.
Tips and Tricks
• In Java expressions, use methods, not operators. E.g., concat(String) instead of the dot operator, equals(Object) instead of ==.
• Technical components (like hash maps) are hidden by default. See: http://www.talendforge.org/forum/viewtopic.php?pid=110860
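The advice about equals(Object) versus == in Java expressions is easy to see in a few lines. This is a standalone sketch with hypothetical names; `new String(...)` stands in for a value that arrives at runtime, such as a column read from a row.

```java
// Hypothetical demo of reference comparison vs. content comparison for Strings.
class StringCompareSketch {

    // == on objects compares references, not characters.
    static boolean sameReference(String a, String b) {
        return a == b;
    }

    // equals compares the characters; the null check guards against a null left side.
    static boolean sameContent(String a, String b) {
        return a != null && a.equals(b);
    }

    public static void main(String[] args) {
        String fromRow = new String("NY"); // simulates a value built at runtime
        String literal = "NY";
        System.out.println(sameReference(fromRow, literal)); // prints false
        System.out.println(sameContent(fromRow, literal));   // prints true
        System.out.println("New ".concat("York"));           // prints New York
    }
}
```

Two strings with identical text can still be different objects, which is why a filter or join condition written with == may silently match nothing.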
Tips and Tricks
• Use a context for job variables.
  • Note that you can specify a type for each variable.
  • You can read the values from a file or database, or pass in a context if the job is embedded in a Java application.
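As a rough illustration of typed context variables read from a file, the sketch below parses key=value text with `java.util.Properties` and converts each value to a declared type. The class name, the variable names (`dbHost`, `dbPort`, `dryRun`), and the defaults are all assumptions made for the example; Talend generates its own context handling.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical typed view over key=value context settings; not Talend-generated code.
class ContextSketch {
    final String dbHost;
    final int dbPort;
    final boolean dryRun;

    ContextSketch(Properties p) {
        // Each variable is converted once, so type errors surface at startup.
        dbHost = p.getProperty("dbHost", "localhost");
        dbPort = Integer.parseInt(p.getProperty("dbPort", "5432"));
        dryRun = Boolean.parseBoolean(p.getProperty("dryRun", "false"));
    }

    // Parses properties-style text; a file reader could be used the same way.
    static ContextSketch parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new ContextSketch(p);
    }

    public static void main(String[] args) {
        ContextSketch ctx = parse("dbHost=db.test\ndbPort=3306\ndryRun=true\n");
        System.out.println(ctx.dbHost + ":" + ctx.dbPort + " dryRun=" + ctx.dryRun);
    }
}
```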
Tips and Tricks
• For multi-host deployment:
  • Export the job with a “bootstrap” context that has all variables but populates only a context config location that is the same for all machines.
  • The context config file has all values required for that host, e.g., the test DB connection for the test machine.
  • You can rely on the fact that Windows will interpret root as the main system drive, so “/Data/” will translate to C:\Data.
  • Be mindful of file permissions for sensitive context data (e.g., DB password).
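The bootstrap idea above can be sketched as a two-step load: the exported job knows only where the host-specific file lives, and everything else comes from that file. The property name `configLocation` and the class name are hypothetical, and the sketch assumes Java 11 or later.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Hypothetical bootstrap loader; illustrates the pattern, not Talend's mechanism.
class BootstrapContextSketch {

    // The bootstrap context carries only the location of the per-host config file.
    static Properties loadHostContext(Properties bootstrap) {
        String location = bootstrap.getProperty("configLocation");
        if (location == null || location.isBlank()) {
            throw new IllegalStateException("bootstrap context must set configLocation");
        }
        Properties merged = new Properties();
        merged.putAll(bootstrap);
        // Values from the host file are layered on top of the bootstrap values.
        try (Reader reader = Files.newBufferedReader(Path.of(location))) {
            merged.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return merged;
    }

    public static void main(String[] args) throws IOException {
        // A temp file stands in for the per-host config, e.g. one under /Data/.
        Path hostFile = Files.createTempFile("host-context", ".properties");
        Files.writeString(hostFile, "dbUrl=jdbc:example://test-db\n");
        Properties bootstrap = new Properties();
        bootstrap.setProperty("configLocation", hostFile.toString());
        System.out.println(loadHostContext(bootstrap).getProperty("dbUrl"));
        Files.delete(hostFile);
    }
}
```

Because the same exported job runs everywhere, only the file at that fixed location differs per machine, which is also the file whose permissions need attention when it holds a password.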
Tips and Tricks
• Use “Bulk” output components when possible.
• For transactional behavior:
  • Start the job with a DB connection component.
  • Check “Use existing connection” in all relevant components.
  • Check “Die on error” in all relevant components.
  • End the job with a commit component.
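In plain JDBC terms, the transactional recipe above amounts to one shared connection with auto-commit turned off, a single commit at the end, and a rollback if any step fails. The sketch below shows that shape; the `Step` interface and class name are hypothetical, and it is not the code Talend generates.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.util.List;

// Hypothetical sketch of all-or-nothing execution over one shared connection.
class TransactionalJobSketch {

    // One unit of work that uses the shared connection ("use existing connection").
    @FunctionalInterface
    interface Step {
        void run(Connection conn) throws SQLException;
    }

    // Returns true if every step ran and the commit succeeded, false after a rollback.
    static boolean runAll(Connection conn, List<Step> steps) {
        try {
            conn.setAutoCommit(false);      // start of the transaction
            for (Step step : steps) {
                step.run(conn);             // any failure aborts the rest ("die on error")
            }
            conn.commit();                  // the final commit component
            return true;
        } catch (SQLException e) {
            try {
                conn.rollback();            // undo everything done so far
            } catch (SQLException rollbackFailure) {
                e.addSuppressed(rollbackFailure);
            }
            return false;                   // a real job would also log e
        }
    }
}
```

Calling `runAll(connection, List.of(step1, step2))` with a real JDBC connection either applies both steps or neither, which is the behavior the checklist is meant to produce.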
Room for Improvement
• UI stability
• Documentation
Books
• Getting Started with Talend Open Studio for Data Integration by Jonathan Bowen
• Talend Open Studio Cookbook by Rick Daniel Barton
• Big Data book coming…
Talend Forge
• http://www.talendforge.org/
• Forum - super helpful
• Exchange - free community components!
• Tutorials
• Bug tracker
• Source code
Talend Resources
• http://www.talend.com/resources
• Help Center
• Knowledge Base
• Webinars, screencasts
• Tutorials
• Docs are on the download page
  • And available by pressing F1 on a component
Questions?
Compliments?
Consulting gigs?
• Contact me:
  • philip@yurchuk.com
  • http://philip.yurchuk.com
  • http://www.linkedin.com/in/philipyurchuk/
Thank You!
