Intro to Talend Open Studio for Data Integration
Philip Yurchuk
http://philip.yurchuk.com
What is Talend?
 Eclipse-based visual programming editor
 Generates executable Java code
 Jobs can run standalone or embedded (no special server)
 Batch or interactive (user input)
What is ETL?
 Extract: suck up data
 Transform: mess with it
 Load: blow it out
 Batch, integration, migration, etc.
Extract from/load to where?
 Over 600 components
 Over 450 connectors
 Allows multiple inputs/outputs in a single job
Connectors
 Flat files
   Delimited (tab, CSV…)
   XML
   JSON
   Excel
   Positional
   Apache HTTP logs, HL7...
 Applications/Platforms
   Alfresco
   Microsoft Dynamics (CRM, AX)
   SAP
   Sage ERP X3
   Salesforce
   SugarCRM
Connectors (continued)
 Relational Databases
   MySQL
   PostgreSQL
   MS SQL
   Oracle
   Many more
 NoSQL/Columnar/OLAP/Other
   Amazon Redshift
   Greenplum
   Hive
   OLAP cubes
   LDAP
   VectorWise
   Teradata
   More in the Big Data edition
How do we transport data?
 File system
 FTP
 SFTP/SCP
 Web service (SOAP, REST)
 HTTP
 Mail, POP
 XMLRPC, Sockets, JMS, RSS...
Other Components
 Process data: join, filter, aggregate
 Flow control: loops, job invocation
 Logs, statistics
 Code: Java, Groovy
   On row data or standalone
   Can load libraries
Demo
Nifty Components
 FuzzyMatch - calculate Levenshtein distance or phonetic similarity
 IntervalMatch - perform lookup/join based on values falling within an interval
 Replace, ReplaceList - search and replace, substitution
 UniqRow - output distinct rows based on defined key columns
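To make the FuzzyMatch idea concrete, here is a minimal Java sketch of Levenshtein distance (the number of single-character inserts, deletes, and substitutions needed to turn one string into another). The class and method names are made up for illustration; this is not the code the component generates.

```java
final class LevenshteinSketch {

    // Classic dynamic-programming edit distance, keeping only two rows of the table.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // delete all i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int substitute = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insert, delete), substitute);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // prints 3
    }
}
```

A fuzzy lookup would then accept a match when the distance falls under some threshold chosen for the data.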
More Nifty Components
 XMLMap - allows joins, column or row filtering, transformations, and multiple outputs
 Normalize/Denormalize - split a delimited string into multiple rows, or join values from multiple rows into one delimited string
 AggregateRow - GROUP BY; min, max, sum, and other functions used to aggregate rows on a column
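What these components do to rows can be sketched in plain Java. This is only an illustration under assumed names (RowOpsSketch, normalize, denormalize, sumByKey); it shows the row-level effect, not how Talend implements it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class RowOpsSketch {

    // Normalize: one delimited value becomes several values, one per output row.
    static List<String> normalize(String value, String delimiter) {
        return Arrays.stream(value.split(Pattern.quote(delimiter)))
                .map(String::trim)
                .collect(Collectors.toList());
    }

    // Denormalize: values from several rows are joined into one delimited string.
    static String denormalize(List<String> values, String delimiter) {
        return String.join(delimiter, values);
    }

    // Aggregate: the equivalent of SELECT key, SUM(amount) ... GROUP BY key.
    static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> rows) {
        return rows.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey,
                TreeMap::new,
                Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        System.out.println(normalize("red, green,blue", ","));
        System.out.println(denormalize(List.of("red", "green", "blue"), ";"));
        System.out.println(sumByKey(List.of(
                Map.entry("east", 10), Map.entry("west", 5), Map.entry("east", 7))));
    }
}
```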
Tips and Tricks
 Use CamelCase job names for embedded jobs, or prefix names with the ETL phase and order of execution.
 Whenever appropriate (esp. for inserting data), use the schema from the repository.
 When connecting, propagating changes to a DB component will change it to a built-in schema, which won't get updated after later repository changes.
Tips and Tricks
 Propagating changes to a DB component will change it to a built-in schema, which won't get updated after repo changes.
 On the other hand, remember that for lookup/join (i.e., SELECT) queries you can modify the query to select only the fields you need. Propagating the schema is useful then.
Tips and Tricks
 Failure handling subjob:
   It’s an unconnected subjob (no triggers point to it).
   Use LogCatcher to catch and record component failures.
   Record the failure in a DB, file, email, etc.
   Add a rollback component to undo DB changes if necessary. You may need to do this in the main job instead if the rollback has to happen at a specific point.
Tips and Tricks
 In Java expressions, use methods, not operators. E.g., concat(String) instead of the + operator, equals(Object) instead of ==.
 Technical components (like hash maps) are hidden by default. See: http://www.talendforge.org/forum/viewtopic.php?pid=110860
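Why the equals(Object) tip matters, as a small Java sketch: == on strings compares object references, so two strings with the same characters can still compare unequal. The class and method names here are made up for illustration.

```java
final class StringCompareSketch {

    // Reference comparison: true only if both variables point at the same object.
    static boolean sameReference(String a, String b) {
        return a == b;
    }

    // Content comparison: true if both strings hold the same characters.
    static boolean sameContent(String a, String b) {
        return a != null && a.equals(b);
    }

    public static void main(String[] args) {
        String literal = "NY";
        // new String(...) stands in for a value built at runtime, e.g. read from a row.
        String fromRow = new String("NY");

        System.out.println(sameReference(literal, fromRow)); // prints false
        System.out.println(sameContent(literal, fromRow));   // prints true
        System.out.println("New ".concat("York"));           // prints New York
    }
}
```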
Tips and Tricks
 Use a context for job variables.
 Note that you can specify a type for each variable.
 You can read the context from a file or database, or pass one in when the job is embedded in a Java application.
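A rough sketch of the idea behind typed context variables loaded from a key=value file, in plain Java. The class name and the keys (db_host, db_port, dry_run) are hypothetical, and this is not the code Talend generates for a context.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class ContextSketch {
    final String dbHost;   // hypothetical variable: String
    final int dbPort;      // hypothetical variable: int
    final boolean dryRun;  // hypothetical variable: boolean

    private ContextSketch(String dbHost, int dbPort, boolean dryRun) {
        this.dbHost = dbHost;
        this.dbPort = dbPort;
        this.dryRun = dryRun;
    }

    // Reads key=value pairs and converts each one to its declared type,
    // falling back to a default when a key is missing.
    static ContextSketch load(Reader source) {
        Properties props = new Properties();
        try {
            props.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new ContextSketch(
                props.getProperty("db_host", "localhost"),
                Integer.parseInt(props.getProperty("db_port", "5432")),
                Boolean.parseBoolean(props.getProperty("dry_run", "false")));
    }

    public static void main(String[] args) {
        ContextSketch ctx = load(new StringReader("db_host=test-db\ndb_port=5433\ndry_run=true\n"));
        System.out.println(ctx.dbHost + ":" + ctx.dbPort + " dryRun=" + ctx.dryRun);
    }
}
```

In practice the Reader would wrap a file on disk; a StringReader is used here only to keep the example self-contained.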
Tips and Tricks
 For multi-host deployment:
   Export the job with a “bootstrap” context that defines all variables but populates only the location of a context config file, which is the same on all machines.
   The context config file holds all values required for that host, e.g. the test DB connection for the test machine.
   Windows resolves a leading slash to the root of the current drive (normally the main system drive), so “/Data/” translates to C:\Data.
   Be mindful of file permissions for sensitive context data (e.g., DB password).
Tips and Tricks
 Use “Bulk” output components when possible.
 For transactional behavior:
   Start the job with a DB connection component.
   Check “Use existing connection” in all relevant components.
   Check “Die on error” in all relevant components.
   End the job with a commit component.
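In plain JDBC terms, the transactional pattern above corresponds roughly to the sketch below: one shared connection with auto-commit off, every statement run on it, a commit at the end, and a rollback if anything fails. The class and method names are hypothetical, and this is an analogy rather than the code a Talend job generates.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

final class TransactionSketch {

    // Runs all statements on one shared connection as a single unit of work.
    static void runAll(Connection conn, List<String> statements) throws SQLException {
        conn.setAutoCommit(false); // like starting the job with one connection
        try (Statement st = conn.createStatement()) {
            for (String sql : statements) {
                st.executeUpdate(sql); // every step reuses the existing connection
            }
            conn.commit(); // like ending the job with a commit component
        } catch (SQLException e) {
            conn.rollback(); // a failure stops the run and undoes the changes
            throw e;
        }
    }
}
```

The caller supplies the connection, for example from DriverManager.getConnection with a JDBC URL for the target database, and closes it afterwards.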
Room for Improvement
 UI stability

 Documentation
Books
 Getting Started with Talend Open Studio for Data Integration by Jonathan Bowen
 Talend Open Studio Cookbook by Rick Daniel Barton
 Big Data book coming…
Talend Forge
 http://www.talendforge.org/
 Forum – super helpful
 Exchange – free community components!
 Tutorials
 Bug tracker
 Source code
Talend Resources
 http://www.talend.com/resources
 Help Center
 Knowledge Base
 Webinars, screencasts
 Tutorials
 Docs are on the download page, and available by pressing F1 on a component
Questions?
Compliments?
Consulting gigs?
 Contact me:
 philip@yurchuk.com
 http://philip.yurchuk.com
 http://www.linkedin.com/in/philipyurchuk/
Thank You!
