1.
Successful AI/ML Projects with
End-to-End Cloud Data Engineering
Louis Polycarpou
Technical Director
Cloud, Data Engineering, and Data Integration
2. © Informatica. Proprietary and Confidential.
AI/ML Projects in the Enterprise Today
Only 1% of AI/ML projects are successful
*Source: Databricks research, 2018
3. Why are AI/ML Projects so Difficult?
• Data scientists spend 80% of their time preparing data and only 20% on modeling
• Data challenges: data arrives at high volume and high velocity from a variety of sources
• Enterprise data cannot be provisioned if it lacks governance or is hidden
• Productivity is lost to repetitive data pipelines that move and prepare data
• Data engineers spend too much time on capacity planning for big data processing
End-to-End Data Engineering holds the key!
4. End-to-End Data Engineering is Key to ML Projects
ANY DATA · ANY REGULATION · ANY USER · ANY LATENCY
ANY CLOUD / ANY TECHNOLOGY · HYBRID
METADATA · GOVERNANCE
Modern data integration patterns:
INGEST · STREAM · INTEGRATE · CLEANSE · PREPARE · ENRICH · DEFINE · CATALOG · RELATE · PROTECT · DELIVER
5. Informatica Data Engineering Integration
Informatica + Databricks: Accelerate Data Engineering Pipelines for AI & Analytics
• Informatica Cloud Data Integration
• Informatica Enterprise Data Catalog
• Reliable data lakes at scale
• Data discovery, audit, and lineage
• Data pipeline development
• Data ingestion from hybrid sources
6. Informatica Enterprise Data Catalog
• Comprehensive discovery of data assets for accurate machine learning models
• Easily find and discover trusted data for building machine learning models
• Explore holistic data relationships
• End-to-end data lineage through the analytics process
• Integrated business glossary
• Crowd-sourced curation of data assets
• Machine-learning-based semantic inference and recommendations
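End-to-end lineage can be pictured as a directed graph from sources through transformations to models and reports. A minimal pure-Python sketch of answering "what feeds this ML model?" — the asset names are hypothetical and this is not the Enterprise Data Catalog API, just an illustration of the idea:

```python
# Minimal lineage sketch: assets form a directed graph (edges point
# downstream), and upstream lineage is a reverse reachability query.
# Asset names are hypothetical, for illustration only.
edges = {  # producer -> consumers
    "crm_db.customers":     ["lake.customers_clean"],
    "web.clickstream":      ["lake.sessions"],
    "lake.customers_clean": ["features.churn"],
    "lake.sessions":        ["features.churn"],
    "features.churn":       ["model.churn_v1"],
}

def upstream(asset):
    """Return every asset that directly or indirectly feeds `asset`."""
    parents = {src for src, dsts in edges.items() if asset in dsts}
    found = set(parents)
    for p in parents:
        found |= upstream(p)
    return found

# Every source feeding the churn model, direct or transitive:
print(sorted(upstream("model.churn_v1")))
# ['crm_db.customers', 'features.churn', 'lake.customers_clean', 'lake.sessions', 'web.clickstream']
```

In a real catalog the graph is harvested automatically from scanners and pipeline metadata rather than declared by hand; the query shape is the same.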
7. Informatica Data Engineering Portfolio
The industry’s most comprehensive data engineering solution for multi-cloud and hybrid environments, with Spark in “true” serverless mode
• Data Engineering Integration (DEI): intelligently manage data pipelines for faster insights; data ingestion and processing
• Data Engineering Streaming (DES): turn volumes of streaming and IoT data into trusted insights
• Data Engineering Quality (DEQ): govern all your data on Spark, in the cloud and other environments, to ensure it is trusted and relevant
• Data Engineering Masking (DEM): de-identify, de-sensitize, and anonymize sensitive data against unauthorized access for app users, BI, and AI & analytics
9. No Code: Leverage the Power of an Easy-to-Use Interface

SQL Query:

select l_orderkey,
       sum(l_extendedprice * (1 - l_discount)) as revenue,
       o_orderdate, o_shippriority
from CUSTOMER, ORDERS, LINEITEM
where c_mktsegment = 'AUTOMOBILE'
  and c_custkey = o_custkey
  and l_orderkey = o_orderkey
  and o_orderdate < date '1995-03-13'
  and l_shipdate > date '1995-03-13'
group by l_orderkey, o_orderdate, o_shippriority
order by revenue desc, o_orderdate
limit 10;

Spark Code:

package main.scala

import org.apache.spark.SparkContext
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{sum, udf}

/** Query 3 */
class Q03 extends TpchQuery {

  override def execute(sc: SparkContext, schemaProvider: TpchSchemaProvider): DataFrame = {
    // this is used to implicitly convert an RDD to a DataFrame
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._
    import schemaProvider._

    val decrease = udf { (x: Double, y: Double) => x * (1 - y) }

    val fcust = customer.filter($"c_mktsegment" === "BUILDING")
    val forders = order.filter($"o_orderdate" < "1995-03-15")
    val flineitems = lineitem.filter($"l_shipdate" > "1995-03-15")

    fcust.join(forders, $"c_custkey" === forders("o_custkey"))
      .select($"o_orderkey", $"o_orderdate", $"o_shippriority")
      .join(flineitems, $"o_orderkey" === flineitems("l_orderkey"))
      .select($"l_orderkey",
        decrease($"l_extendedprice", $"l_discount").as("volume"),
        $"o_orderdate", $"o_shippriority")
      .groupBy($"l_orderkey", $"o_orderdate", $"o_shippriority")
      .agg(sum($"volume").as("revenue"))
      .sort($"revenue".desc, $"o_orderdate")
      .limit(10)
  }
}

DEI Mapping: future-proof your investments; design once and run on a best-of-breed engine
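The relational logic both snippets encode (TPC-H Query 3: filter, join, aggregate, sort, limit) can be sketched on toy data in pure Python, without a Spark cluster. The table contents below are made up for illustration; the dates and market segment follow the SQL variant shown above:

```python
# Pure-Python sketch of TPC-H Q3: revenue of unshipped orders for one
# market segment. Toy rows, for illustration only.
from datetime import date
from collections import defaultdict

customer = [{"c_custkey": 1, "c_mktsegment": "AUTOMOBILE"},
            {"c_custkey": 2, "c_mktsegment": "BUILDING"}]
orders = [{"o_orderkey": 10, "o_custkey": 1,
           "o_orderdate": date(1995, 3, 1), "o_shippriority": 0},
          {"o_orderkey": 11, "o_custkey": 2,
           "o_orderdate": date(1995, 3, 20), "o_shippriority": 0}]
lineitem = [{"l_orderkey": 10, "l_extendedprice": 100.0,
             "l_discount": 0.1, "l_shipdate": date(1995, 3, 20)},
            {"l_orderkey": 10, "l_extendedprice": 200.0,
             "l_discount": 0.0, "l_shipdate": date(1995, 3, 25)},
            {"l_orderkey": 11, "l_extendedprice": 50.0,
             "l_discount": 0.05, "l_shipdate": date(1995, 3, 14)}]

cutoff = date(1995, 3, 13)
# Filter customers by segment, orders by date, then join on keys.
auto_keys = {c["c_custkey"] for c in customer
             if c["c_mktsegment"] == "AUTOMOBILE"}
open_orders = {o["o_orderkey"]: o for o in orders
               if o["o_custkey"] in auto_keys and o["o_orderdate"] < cutoff}

# Aggregate discounted revenue per (orderkey, orderdate, shippriority).
revenue = defaultdict(float)
for li in lineitem:
    o = open_orders.get(li["l_orderkey"])
    if o and li["l_shipdate"] > cutoff:
        key = (li["l_orderkey"], o["o_orderdate"], o["o_shippriority"])
        revenue[key] += li["l_extendedprice"] * (1 - li["l_discount"])

# Sort by revenue descending, then order date, and keep the top 10.
top10 = sorted(revenue.items(), key=lambda kv: (-kv[1], kv[0][1]))[:10]
print(top10)  # [((10, datetime.date(1995, 3, 1), 0), 290.0)]
```

A no-code mapping tool generates the equivalent of the Scala above from a visual pipeline; the point of the sketch is only that the underlying dataflow is a handful of filters, joins, and a grouped aggregate.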
10. No Code: Schema Drift Handling
Handle complex structures and their changes for both batch and streaming data
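Schema drift means incoming records gain, lose, or change fields over time. A minimal pure-Python sketch of the core idea, absorbing drifting record shapes into the union schema (illustrative only, not Informatica's actual mechanism):

```python
# Minimal schema-drift absorption: records whose fields change over time
# are normalized onto the ordered union of all fields seen so far.
# Hypothetical sensor data, for illustration only.
def unify(records):
    """Return (schema, rows): schema is the ordered union of all keys,
    rows are the records padded with None for fields they lack."""
    schema = []
    for rec in records:
        for key in rec:
            if key not in schema:
                schema.append(key)  # a new column appeared: extend the schema
    rows = [{col: rec.get(col) for col in schema} for rec in records]
    return schema, rows

batch = [
    {"id": 1, "temp": 21.5},                  # original shape
    {"id": 2, "temp": 22.0, "humidity": 40},  # drift: new 'humidity' field
    {"id": 3, "humidity": 38},                # drift: 'temp' dropped
]
schema, rows = unify(batch)
print(schema)   # ['id', 'temp', 'humidity']
print(rows[2])  # {'id': 3, 'temp': None, 'humidity': 38}
```

In streaming pipelines the same reconciliation has to happen continuously as new shapes arrive, which is why hand-coded pipelines tend to break on drift while a tool that propagates schema changes does not.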
11.
No Ops: Azure Databricks Support
Leverage the compute power of Databricks
on Azure for big data processing
12.
No Ops: Advanced Spark Support
Take advantage of latest innovation,
performance, and scaling benefits
13.
No Ops: Operational Insights
Deliver predictive operational insights about
your data engineering environments
14.
No Limits on Data: Ingest Any Data in Real-time & Batch
Mass ingestion of streaming/
IoT data, files, and databases
15.
No Limits on Data: High-Speed Mass Ingestion
Rely on easy to use, fast, and scalable
approach—no hand-coding
16.
No Limits on Data: Spark Structured Streaming Support
Handle streaming data based on event
time instead of processing time
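Event-time processing groups records by when they occurred, not when they arrived. A small pure-Python sketch of tumbling event-time windows (illustrative only; Spark Structured Streaming does this with its `window()` function plus watermarks, and the timestamps below are made up):

```python
# Tumbling event-time windows: each event is bucketed by its embedded
# event timestamp, so late-arriving data still lands in the window in
# which it actually happened.
from collections import defaultdict

WINDOW = 60  # window length in seconds

events = [            # (event_time_epoch_s, value), listed in ARRIVAL order
    (100, 1),
    (130, 2),
    (190, 5),
    (110, 3),         # late arrival: still belongs to the 60-120s window
]

totals = defaultdict(int)
for event_time, value in events:
    window_start = (event_time // WINDOW) * WINDOW  # floor to window boundary
    totals[window_start] += value

print(dict(totals))  # {60: 4, 120: 2, 180: 5}
```

Keyed by processing time instead, the late event at t=110 would have been counted in whichever window was open when it arrived, skewing the result; that is the distinction the slide is making.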
17. Cloud-Ready Reference Architecture: Informatica + Azure Databricks
Data sources: RELATIONAL · DEVICE DATA · WEBLOGS
Pipeline stages: ACQUIRE · INGEST · PARSE · PREPARE · CATALOG · SECURE · GOVERN · ACCESS · CONSUME
Catalog capabilities: CATALOG · SEARCH · LINEAGE · MATCH · RECOMMENDATIONS
Azure components: Storage Blob · ADLS / Blob · Azure Databricks · ADLS / Blob · SQL Data Warehouse
18. Takeda Technical Architecture
[Architecture diagram] Data sources (including MARKET and CENTER) feed Informatica Data Engineering Integration (DEI) and IICS [IaaS] and streaming ingestion [PaaS] into staged storage zones (STAGE, LAKE, HUB, MART). Databricks [PaaS], Hadoop [PaaS], and PaaS storage handle processing; consumption runs through self-service analytics [PaaS] and data visualization [IaaS and SaaS] across analytics domains (COMM, CORP, GMS, …).
19. Critical Success Factors for your AI/ML Projects
1. Find and discover data across all enterprise systems
2. Accelerate movement of data to Databricks
3. Prepare and enrich the data before you start modeling
4. Increase productivity with a no-code UI for data engineering
5. Go serverless by processing data pipelines on Databricks
20. Learn More
1. Stop by the Informatica booth #90 for a custom demo
2. Hear more about AI-Powered Streaming Analytics for Real-Time Customer Experience: tomorrow, 11:00am, Room E102
3. Visit http://www.informatica.com/databricks
4. Sign up for Hands-on Workshops on Serverless Cloud Data Lakes