EXPLAINABLE AI AND THE MODERN DATA PIPELINE
Gary Allemann
Master Data Management
@mdm_za
Meet Matt…
3000 donuts a day for 30 years…
What don’t you know?
54% 42%
Data delivers competitive advantage
“Compared with their peers, high
performers report a greater variety
of actions to monetize data – with
greater revenue impact”
- McKinsey Global Survey: Fueling growth through data
monetization
73.2%
Percentage of executives whose firms
have achieved measurable results from
Big Data and AI investments
- NewVantage Partners Big Data Executive Survey 2018
$1.8 Trillion
Projected annual revenue for
insights-driven businesses by 2021
- “Insights-Driven Businesses Set the Pace for Global
Growth,” Forrester, October 19, 2018
85%
Firms that leverage customer behavioral
insights outperform peers by 85 percent
in sales growth and 25 percent in gross
margin
- McKinsey Global Survey: Capturing value from your
customer data
Common machine learning applications
• Anti-money laundering
• Fraud detection
• Cybersecurity
• Targeted marketing
• Recommendation engine
• Next best action
• Customer churn prevention
• Know your customer
Why do you have a data lake?
Syncsort 2019 data trends survey
Analytics Use Cases
Drive Data Lakes
and Enterprise
Data Hubs
Most organisations not getting full value
Syncsort 2019 data trends survey
91% of organizations
have not yet reached
a “transformational”
level of maturity in
data and analytics
- Gartner
68% of IT professionals
state that data silos
negatively impact their
organization’s ability to
get value from their data
• Every part of the
business demands
sophisticated data
analysis
• Departments need
access to the
company’s many data
sets, combined in
different ways
• IT can’t be a bottleneck
• Data has outgrown the
data warehouse
• Data lakes can be
polluted and chaotic
• Data is inconsistent
across data marts
Key challenges
Syncsort 2019 data trends survey
only 9% “very effective” in
getting value from data
IT decision makers waste 2 hours
daily looking for relevant data
3-pronged approach
Make data easier to find and understand
• Data governance
• Data catalog
Flexible data pipelines
• Batch and streaming
• Legacy, big data and cloud
Debug your data
• Manage bias
• Manage data quality at scale
• Governance / Traceability
Data Architecture
Metadata/Data Modelling
Data Security
Data
Integration
MDM/ReferenceData
DataQuality
DataGovernance
Business
Intelligence
DataWarehouse
BigData
AIandML
Business-driven
IT-driven
MAKING DATA EASIER TO FIND AND UNDERSTAND
Data Governance and Catalog
AI, Big Data, and Data Governance // Stan Christiaens, Collibra
(FirstMark's Data Driven)
• The differentiator for #AI is DATA
• Bias is like “a snake in the data grass”
• Finding data is a “people and process” problem
• Data (if you treat it as a strategic asset) should
have its own business process
BUILDING A QUALITY DATA PIPELINE
Data Governance and Catalog
Data Scientist
• Expert in statistical analysis, machine
learning techniques, finding answers to
business questions buried in datasets.
• Does NOT want to spend 50 – 90% of their
time tinkering with data, getting it into
good shape to train models – but
frequently does, especially if there’s no
data engineer on their team.
• When machine learning model is trained,
tested, and proven it will accomplish the
goal, turns it over to data engineer to
productionize. Not skilled at taking the
model from a test sandbox into
production, especially not at large scale.
Data Engineer
• Expert in data structures, data
manipulation, and constructing production
data pipelines.
• WANTS to spend all of their time working
with data, but usually has more on their
plate than they can keep up with. Anything
that will speed up their work is helpful.
• In most successful companies, is involved
from the beginning. First gathers, cleans
and standardizes data, helps data scientist
with feature engineering, provides top
notch data, ready to train models.
• After model is tested, builds robust high
scale, data pipelines to feed the models
the data they need in the correct format in
production to provide ongoing business
value.
Data Engineer to the rescue
Identify and onboard all relevant data
Data Lake or Cloud
Raw Landing Zone
Access & Onboard – Elect to include data to understand
• What you don’t know CAN hurt you – e.g. bias
• If you’ve left it out, you cannot know it exists
• Data sets have more power to predict when combined
Ensure the quality
Data Lake or Cloud
Raw Landing Zone
Refined Zone
Refine – cleanse, enrich, de-duplicate
• What data needs refinement? – use cases will determine
• Each data set should be refined once – don’t repeat work
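The refine step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record fields (`name`, `email`), the choice of email as the de-duplication key, and the helper names `cleanse` and `refine` are hypothetical and do not come from the slides or from any specific product.

```python
import re

def cleanse(record):
    """Collapse repeated whitespace, normalise name casing, lower-case the email."""
    name = re.sub(r"\s+", " ", record["name"]).strip().title()
    email = record["email"].strip().lower()
    return {"name": name, "email": email}

def refine(raw_records):
    """Cleanse every record, then keep only the first record seen per email."""
    seen = set()
    refined = []
    for record in raw_records:
        clean = cleanse(record)
        if clean["email"] in seen:
            continue  # duplicate of a record already in the refined zone
        seen.add(clean["email"])
        refined.append(clean)
    return refined

# Hypothetical contents of the raw landing zone
raw_zone = [
    {"name": "  matt   SMITH ", "email": "Matt@Example.com "},
    {"name": "Matt Smith", "email": "matt@example.com"},
    {"name": "ann jones", "email": "ann@example.com"},
]
refined_zone = refine(raw_zone)
```

Writing the result to a separate refined zone, rather than overwriting the raw data, is what lets each data set be refined once and then reused.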
Understand provenance
Data Lake or Cloud
Raw Landing Zone
Refined Zone
Track Provenance
• Data lineage documentation is necessary for establishing that data can be
trusted, and for auditing and regulatory compliance
• Also, useful for reproducing steps in production machine learning
data pipelines
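One way to picture provenance tracking is a pipeline that records a lineage entry for every transformation it runs. The sketch below is a minimal illustration, not the mechanism used by any tool named in this deck; the functions `fingerprint` and `apply_step`, the lineage fields, and the sample data are all assumptions.

```python
import hashlib
import json

def fingerprint(rows):
    """Stable SHA-256 over the rows, so any change in content is detectable."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def apply_step(rows, step_name, func, lineage):
    """Run one transformation and append a lineage entry describing it."""
    result = func(rows)
    lineage.append({
        "step": step_name,
        "input_hash": fingerprint(rows),
        "output_hash": fingerprint(result),
        "rows_in": len(rows),
        "rows_out": len(result),
    })
    return result

def drop_missing_amount(rows):
    """Remove rows that have no amount."""
    return [r for r in rows if r.get("amount") is not None]

def to_cents(rows):
    """Convert the amount from currency units to integer cents."""
    return [{**r, "amount": round(r["amount"] * 100)} for r in rows]

lineage = []
source = [{"id": 1, "amount": 12.5}, {"id": 2, "amount": None}]
step1 = apply_step(source, "drop_missing_amount", drop_missing_amount, lineage)
final = apply_step(step1, "to_cents", to_cents, lineage)
```

Because each entry stores the hash of its input and output, an auditor can check that the output of one step is exactly the input of the next, and the same ordered list of steps can be replayed in production.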
Enrich and grow
Data Lake or Cloud
Raw Landing Zone
Refined Zone
Shop for data sets, features & validate against your questions
• Analyst, data scientist shops for data
• What do I need for my purpose?
• Quality is already assured, provenance documented
• Improves trust, saves time
1. Scattered and Difficult to Access Datasets
Much of the necessary data is trapped in mainframes or streams in from POS,
web clicks, etc. all in incompatible formats, making it difficult to gather and
prepare the data for model training.
2. Data Cleansing at Scale
Data quality cleansing and preparation routines have to be reproduced at
scale. Most data quality tools are not designed to work on that scale of data.
3. Entity Resolution
Distinguishing matches across massive datasets that indicate a single specific
entity (person, company, product, etc.) requires sophisticated multi-field
matching algorithms and a lot of compute power. Essentially everything has to
be compared to everything else.
4. Tracking Lineage from the Source
Data changes made to help train models have to be exactly duplicated in
production, in order for models to accurately make predictions on new data,
and for required audit trails. Capture of complete lineage, from source to end
point is needed.
Challenges of Engineering
Modern Data Pipelines
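The entity resolution challenge above ("everything has to be compared to everything else") is commonly reduced with blocking: records are grouped by a cheap key and only compared within a group. The following sketch uses Python's standard `difflib` for multi-field fuzzy matching. The blocking key (postcode), the 0.85 threshold, the field names and the sample customers are assumptions for illustration, and this is not the matching algorithm of any product mentioned in the deck.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def block_key(record):
    """Assumed blocking key: only records sharing a postcode are compared."""
    return record["postcode"]

def find_matches(records, threshold=0.85):
    """Return id pairs whose average name/street similarity meets the threshold."""
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)
    matches = []
    for block in blocks.values():
        # Pairwise comparison happens only inside a block, not across all data
        for left, right in combinations(block, 2):
            score = (similarity(left["name"], right["name"])
                     + similarity(left["street"], right["street"])) / 2
            if score >= threshold:
                matches.append((left["id"], right["id"]))
    return matches

customers = [
    {"id": 1, "name": "Jon Smith", "street": "12 Main Road", "postcode": "2196"},
    {"id": 2, "name": "John Smith", "street": "12 Main Rd", "postcode": "2196"},
    {"id": 3, "name": "Mary Naidoo", "street": "4 Oak Avenue", "postcode": "2196"},
    {"id": 4, "name": "Jon Smith", "street": "12 Main Road", "postcode": "8001"},
]
matches = find_matches(customers)
```

Blocking trades recall for speed: record 4 is never compared with record 1 because the postcodes differ, which is why the choice of blocking key matters as much as the matching rule.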
Onboard any data
Data
Onboard data, modify
on-the-fly to match
cloud storage models,
or store unchanged for
archive compliance.
Access data from
streaming and batch
sources outside
cluster.
Data Sources Data Lake
Data drift is a major issue
Dimensional Research
Hybrid and Multi-
Cloud
Strategies
• Ensure seamless data flow
to/from cloud, and among clouds
• Maximize choice for workload
optimization and interoperability
• Design once, deploy anywhere –
on premise and in the cloud
• Optimize cloud infrastructure for
cost and efficiency
• Minimize disruption and risk
• Build new skills to handle
different and emerging portfolios
Challenges
• Managing multiple clouds and
vendors
• Integrating data and applications
on-premise to cloud, across clouds
• Avoiding cloud lock-in
• Lack of skills to handle hybrid
multi-cloud world
• Cloud native or cloud first
for new applications
• Scalability and elasticity
• Hybrid: on-premises systems
and public and private
clouds
• Multi-cloud
• Cloud increases focus on
business process from tech
details
Seamlessly flow data to, from
and among clouds
Design Once, Deploy Anywhere – Public cloud, Private Cloud, Multi-Cloud, Hybrid or On-Prem
• Build a modern data pipeline with flexibility, agility
and elasticity
• Simplify accessing, integrating, governing your data
in a single software environment
• Get the most from the Cloud – no silos, no lock-in, no
re-work
• Move to/from on-premise to Cloud, or between
Clouds with no re-design, re-compile, no re-work
ever!
• Get excellent performance every time – without
tuning, load balancing, etc.
• Future-proof your applications
• Cleanse, enrich, de-duplicate
• What data needs refinement? – use
cases will determine
• Matching across massive datasets that
indicate a single specific entity
(person, company, product, etc.)
How dirty data hampers AI
Dimensional Research
Only 35% of senior
executives have a
high level of trust in
the accuracy of
their Big Data
Analytics*
92% of executives are
concerned about
the negative impact
of data and
analytics on
corporate
reputation*
Cost of poor data
quality rose by 50%
in 2017
(Gartner)
84% of CEOs
are concerned
about
the quality of the
data they’re basing
decisions on*
• Decision making – Trust the
data that drives your
business
• Machine learning & AI –
Train your models on
accurate data
• Customer centricity – Get a
single, complete and
accurate view of customer
for better sales, marketing
and service
• Compliance – Know your
data, and ensure its
accuracy to meet industry
and government regulations
The Modern Data Pipeline Needs Data Quality
*http://kpmg.com/guardiansoftrust
Common Data Quality Problems
• Many data records with different
layouts
• Lack of standardization of the
different fields
• Misspellings
• Data sourced from third parties
does not contain all the necessary
fields
• Inconsistent data formats
(measurements, languages,
postal conventions and dates)
• Names spelled differently
• Different number formatting
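Two of the problems listed above, inconsistent date formats and different number formatting, lend themselves to a short standardization sketch. The list of accepted date formats and the rule for interpreting commas are assumptions chosen for this example; a real pipeline would derive them from profiling the actual sources.

```python
from datetime import datetime

# Assumed set of date layouts seen in the sources; extend after profiling
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")

def standardize_date(text):
    """Return the date in ISO 8601 form, or None if no known format fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # leave unparseable values for manual review

def standardize_number(text):
    """Accept '1,234.50', '1 234,50' or '1234.5'; return a float or None."""
    cleaned = text.strip().replace(" ", "")
    if "," in cleaned and "." in cleaned:
        cleaned = cleaned.replace(",", "")    # assume comma is a thousands separator
    elif "," in cleaned:
        cleaned = cleaned.replace(",", ".")   # assume comma is the decimal separator
    try:
        return float(cleaned)
    except ValueError:
        return None

dates = [standardize_date(d) for d in ("19/10/2018", " 2018-10-19 ", "next week")]
amounts = [standardize_number(n) for n in ("1,234.50", "1 234,50", "n/a")]
```

Returning `None` instead of guessing keeps bad values visible, so they can be counted and reported as a data quality measure rather than silently coerced. Note that a value such as "1,234" is ambiguous under the comma rule assumed here.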
Common Data Quality Problems at Scale
Common
Challenges
• Big Data projects require:
Massive scalability
Low latency
Many data sources for a
complete view
• Data Quality processing
using a standalone server
can’t keep up
Millions of business
transactions a day are
now common
Standalone quality projects
may take several hours;
unlikely to meet end user
SLAs and/or key success
factors
Solution
Trillium Quality for Big Data
enables you to leverage the
power and scalability of Big
Data frameworks like
Spark, MapReduce
Performs data quality jobs
natively on the cluster
Leverages Intelligent Execution
– design once, deploy
anywhere – cloud, multi-
cloud, hybrid or on prem
No need to move/copy data for
quality processing; Big Data
remains in place
No coding or tuning; jobs are
automatically optimized
Benefits
• Data Pipeline delivers trusted
data for analytics
• Robust data quality processing
at Big Data scale to meet SLAs,
support use cases like Anti-
Money Laundering or
Customer 360
• No coding or tuning saves
time and resources – and
helps address Big Data skills
shortages
• Save time and network
resources by keeping data in
place
Cleanse data in Hadoop / Cloud
Transform, join, cleanse
and enhance data in
cluster with Spark or
MapReduce. Excellent
performance every time.
Data
Onboard data, modify
on-the-fly to match
cloud storage models,
or store unchanged for
archive compliance.
Access data from
streaming and batch
sources outside
cluster.
Data Sources Data Lake
Get end-to-end data lineage
Data Sources
Navigator or Atlas
gathers any other
changes made to
data on cluster.
Pass source-to-
cluster data
lineage info to
REST API and
Navigator or Atlas.
Data Lake
Data
Data Lineage
REST
API
Onboard data, modify
on-the-fly to match
cloud storage models,
or store unchanged for
archive compliance.
Access data from
streaming and batch
sources outside
cluster.
Transform, join, cleanse
and enhance data in
cluster with Spark or
MapReduce. Excellent
performance every time
Data changes
separately made
by MapReduce,
Spark, HiveQL.
Syncsort Published Lineage in Cloudera
Analysts Get Complete Picture with Trusted Data Provenance
Data Sources
Auditors
get end-to-
end data
lineage.
Analytics,
visualizations, and
machine learning
algorithms get
clean, complete
data.
Data Lake
Analytics,
Visualization,
Machine
Learning
Data changes
separately made
by MapReduce,
Spark, HiveQL.
Data
Data Lineage
Clean,
Complete
Data
REST API
Onboard data, modify
on-the-fly to match
cloud storage models,
or store unchanged for
archive compliance.
Access data from
streaming and
batch sources
outside cluster.
Navigator or Atlas
gathers any other
changes made to
data on cluster.
Pass source-to-
cluster data
lineage info to
REST API and
Navigator or
Atlas.
Transform, join, cleanse
and enhance data in
cluster with Spark or
MapReduce. Excellent
performance every time
Forrester Research
The path to enterprise AI is full of twists
and turns, false starts, and lessons to
learn.
Surely without data quality, AI and
other advanced technologies can not
live up to their expectations.
What don’t you know?
• Gary Allemann
• +27 83 632 1591
• gary@masterdata.co.za
• www.masterdata.co.za
Questions

Mais conteúdo relacionado

Mais procurados

Practical Explainable AI: How to build trustworthy, transparent and unbiased ...
Practical Explainable AI: How to build trustworthy, transparent and unbiased ...Practical Explainable AI: How to build trustworthy, transparent and unbiased ...
Practical Explainable AI: How to build trustworthy, transparent and unbiased ...Raheel Ahmad
 
Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...
Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...
Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...Sri Ambati
 
Explainable AI in Industry (AAAI 2020 Tutorial)
Explainable AI in Industry (AAAI 2020 Tutorial)Explainable AI in Industry (AAAI 2020 Tutorial)
Explainable AI in Industry (AAAI 2020 Tutorial)Krishnaram Kenthapadi
 
Explainability and bias in AI
Explainability and bias in AIExplainability and bias in AI
Explainability and bias in AIBill Liu
 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsDerek Kane
 
Popular Text Analytics Algorithms
Popular Text Analytics AlgorithmsPopular Text Analytics Algorithms
Popular Text Analytics AlgorithmsPromptCloud
 
Data Science - Part I - Sustaining Predictive Analytics Capabilities
Data Science - Part I - Sustaining Predictive Analytics CapabilitiesData Science - Part I - Sustaining Predictive Analytics Capabilities
Data Science - Part I - Sustaining Predictive Analytics CapabilitiesDerek Kane
 
How ml can improve purchase conversions
How ml can improve purchase conversionsHow ml can improve purchase conversions
How ml can improve purchase conversionsSudeep Shukla
 
Building trust through Explainable AI
Building trust through Explainable AIBuilding trust through Explainable AI
Building trust through Explainable AIPeet Denny
 
Patrick Hall, H2O.ai - The Case for Model Debugging - H2O World 2019 NYC
Patrick Hall, H2O.ai - The Case for Model Debugging - H2O World 2019 NYCPatrick Hall, H2O.ai - The Case for Model Debugging - H2O World 2019 NYC
Patrick Hall, H2O.ai - The Case for Model Debugging - H2O World 2019 NYCSri Ambati
 
Guide to end end machine learning projects
Guide to end end machine learning projectsGuide to end end machine learning projects
Guide to end end machine learning projectsSkyl.ai
 
Applied Artificial Intelligence Unit 3 Semester 3 MSc IT Part 2 Mumbai Univer...
Applied Artificial Intelligence Unit 3 Semester 3 MSc IT Part 2 Mumbai Univer...Applied Artificial Intelligence Unit 3 Semester 3 MSc IT Part 2 Mumbai Univer...
Applied Artificial Intelligence Unit 3 Semester 3 MSc IT Part 2 Mumbai Univer...Madhav Mishra
 
Data Driven Engineering 2014
Data Driven Engineering 2014Data Driven Engineering 2014
Data Driven Engineering 2014Roger Barga
 
Building a Predictive Model
Building a Predictive ModelBuilding a Predictive Model
Building a Predictive ModelDKALab
 
1030 track 2 barrett_using our laptop
1030 track 2 barrett_using our laptop1030 track 2 barrett_using our laptop
1030 track 2 barrett_using our laptopRising Media, Inc.
 
Barga ACM DEBS 2013 Keynote
Barga ACM DEBS 2013 KeynoteBarga ACM DEBS 2013 Keynote
Barga ACM DEBS 2013 KeynoteRoger Barga
 

Mais procurados (20)

Practical Explainable AI: How to build trustworthy, transparent and unbiased ...
Practical Explainable AI: How to build trustworthy, transparent and unbiased ...Practical Explainable AI: How to build trustworthy, transparent and unbiased ...
Practical Explainable AI: How to build trustworthy, transparent and unbiased ...
 
Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...
Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...
Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...
 
Explainable AI in Industry (AAAI 2020 Tutorial)
Explainable AI in Industry (AAAI 2020 Tutorial)Explainable AI in Industry (AAAI 2020 Tutorial)
Explainable AI in Industry (AAAI 2020 Tutorial)
 
Explainability and bias in AI
Explainability and bias in AIExplainability and bias in AI
Explainability and bias in AI
 
Predictive analytics
Predictive analytics Predictive analytics
Predictive analytics
 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text Analytics
 
Analytics in Online Retail
Analytics in Online RetailAnalytics in Online Retail
Analytics in Online Retail
 
Popular Text Analytics Algorithms
Popular Text Analytics AlgorithmsPopular Text Analytics Algorithms
Popular Text Analytics Algorithms
 
Data Science - Part I - Sustaining Predictive Analytics Capabilities
Data Science - Part I - Sustaining Predictive Analytics CapabilitiesData Science - Part I - Sustaining Predictive Analytics Capabilities
Data Science - Part I - Sustaining Predictive Analytics Capabilities
 
How ml can improve purchase conversions
How ml can improve purchase conversionsHow ml can improve purchase conversions
How ml can improve purchase conversions
 
Building trust through Explainable AI
Building trust through Explainable AIBuilding trust through Explainable AI
Building trust through Explainable AI
 
Patrick Hall, H2O.ai - The Case for Model Debugging - H2O World 2019 NYC
Patrick Hall, H2O.ai - The Case for Model Debugging - H2O World 2019 NYCPatrick Hall, H2O.ai - The Case for Model Debugging - H2O World 2019 NYC
Patrick Hall, H2O.ai - The Case for Model Debugging - H2O World 2019 NYC
 
Guide to end end machine learning projects
Guide to end end machine learning projectsGuide to end end machine learning projects
Guide to end end machine learning projects
 
Applied Artificial Intelligence Unit 3 Semester 3 MSc IT Part 2 Mumbai Univer...
Applied Artificial Intelligence Unit 3 Semester 3 MSc IT Part 2 Mumbai Univer...Applied Artificial Intelligence Unit 3 Semester 3 MSc IT Part 2 Mumbai Univer...
Applied Artificial Intelligence Unit 3 Semester 3 MSc IT Part 2 Mumbai Univer...
 
Data Driven Engineering 2014
Data Driven Engineering 2014Data Driven Engineering 2014
Data Driven Engineering 2014
 
Machine learning
Machine learningMachine learning
Machine learning
 
Building a Predictive Model
Building a Predictive ModelBuilding a Predictive Model
Building a Predictive Model
 
1030 track 2 barrett_using our laptop
1030 track 2 barrett_using our laptop1030 track 2 barrett_using our laptop
1030 track 2 barrett_using our laptop
 
Barga ACM DEBS 2013 Keynote
Barga ACM DEBS 2013 KeynoteBarga ACM DEBS 2013 Keynote
Barga ACM DEBS 2013 Keynote
 
Data analysis
Data analysisData analysis
Data analysis
 

Semelhante a EXPLAINABLE AI AND THE MODERN DATA PIPELINE

ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...DATAVERSITY
 
Operationalize analytics through modern data strategy
Operationalize analytics through modern data strategyOperationalize analytics through modern data strategy
Operationalize analytics through modern data strategyNagarro
 
What Data Do You Have and Where is It?
What Data Do You Have and Where is It? What Data Do You Have and Where is It?
What Data Do You Have and Where is It? Caserta
 
Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...
Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...
Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...Precisely
 
Introducing Trillium DQ for Big Data: Powerful Profiling and Data Quality for...
Introducing Trillium DQ for Big Data: Powerful Profiling and Data Quality for...Introducing Trillium DQ for Big Data: Powerful Profiling and Data Quality for...
Introducing Trillium DQ for Big Data: Powerful Profiling and Data Quality for...Precisely
 
Your AI and ML Projects Are Failing – Key Steps to Get Them Back on Track
Your AI and ML Projects Are Failing – Key Steps to Get Them Back on TrackYour AI and ML Projects Are Failing – Key Steps to Get Them Back on Track
Your AI and ML Projects Are Failing – Key Steps to Get Them Back on TrackPrecisely
 
Balancing Data Governance and Innovation
Balancing Data Governance and InnovationBalancing Data Governance and Innovation
Balancing Data Governance and InnovationCaserta
 
Achieving a Single View of Business – Critical Data with Master Data Management
Achieving a Single View of Business – Critical Data with Master Data ManagementAchieving a Single View of Business – Critical Data with Master Data Management
Achieving a Single View of Business – Critical Data with Master Data ManagementDATAVERSITY
 
Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Caserta
 
Data Profiling: The First Step to Big Data Quality
Data Profiling: The First Step to Big Data QualityData Profiling: The First Step to Big Data Quality
Data Profiling: The First Step to Big Data QualityPrecisely
 
Gse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-sharedGse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-sharedcedrinemadera
 
Big Data Matching - How to Find Two Similar Needles in a Really Big Haystack
Big Data Matching - How to Find Two Similar Needles in a Really Big HaystackBig Data Matching - How to Find Two Similar Needles in a Really Big Haystack
Big Data Matching - How to Find Two Similar Needles in a Really Big HaystackPrecisely
 
Data Virtualization for Compliance – Creating a Controlled Data Environment
Data Virtualization for Compliance – Creating a Controlled Data EnvironmentData Virtualization for Compliance – Creating a Controlled Data Environment
Data Virtualization for Compliance – Creating a Controlled Data EnvironmentDenodo
 
The Bigger They Are The Harder They Fall
The Bigger They Are The Harder They FallThe Bigger They Are The Harder They Fall
The Bigger They Are The Harder They FallTrillium Software
 
Big data
Big dataBig data
Big dataRiya
 
Building Your Enterprise Data Marketplace with DMX-h
Building Your Enterprise Data Marketplace with DMX-hBuilding Your Enterprise Data Marketplace with DMX-h
Building Your Enterprise Data Marketplace with DMX-hPrecisely
 
The New Trillium DQ: Big Data Insights When and Where You Need Them
The New Trillium DQ: Big Data Insights When and Where You Need ThemThe New Trillium DQ: Big Data Insights When and Where You Need Them
The New Trillium DQ: Big Data Insights When and Where You Need ThemPrecisely
 
ADV Slides: Data Pipelines in the Enterprise and Comparison
ADV Slides: Data Pipelines in the Enterprise and ComparisonADV Slides: Data Pipelines in the Enterprise and Comparison
ADV Slides: Data Pipelines in the Enterprise and ComparisonDATAVERSITY
 

Semelhante a EXPLAINABLE AI AND THE MODERN DATA PIPELINE (20)

ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
 
Operationalize analytics through modern data strategy
Operationalize analytics through modern data strategyOperationalize analytics through modern data strategy
Operationalize analytics through modern data strategy
 
What Data Do You Have and Where is It?
What Data Do You Have and Where is It? What Data Do You Have and Where is It?
What Data Do You Have and Where is It?
 
Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...
Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...
Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...
 
Introducing Trillium DQ for Big Data: Powerful Profiling and Data Quality for...
Introducing Trillium DQ for Big Data: Powerful Profiling and Data Quality for...Introducing Trillium DQ for Big Data: Powerful Profiling and Data Quality for...
Introducing Trillium DQ for Big Data: Powerful Profiling and Data Quality for...
 
Your AI and ML Projects Are Failing – Key Steps to Get Them Back on Track
Your AI and ML Projects Are Failing – Key Steps to Get Them Back on TrackYour AI and ML Projects Are Failing – Key Steps to Get Them Back on Track
Your AI and ML Projects Are Failing – Key Steps to Get Them Back on Track
 
Balancing Data Governance and Innovation
Balancing Data Governance and InnovationBalancing Data Governance and Innovation
Balancing Data Governance and Innovation
 
Achieving a Single View of Business – Critical Data with Master Data Management
Achieving a Single View of Business – Critical Data with Master Data ManagementAchieving a Single View of Business – Critical Data with Master Data Management
Achieving a Single View of Business – Critical Data with Master Data Management
 
Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)
 
Data Profiling: The First Step to Big Data Quality
Data Profiling: The First Step to Big Data QualityData Profiling: The First Step to Big Data Quality
Data Profiling: The First Step to Big Data Quality
 
Gse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-sharedGse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-shared
 
Big Data Matching - How to Find Two Similar Needles in a Really Big Haystack
Big Data Matching - How to Find Two Similar Needles in a Really Big HaystackBig Data Matching - How to Find Two Similar Needles in a Really Big Haystack
Big Data Matching - How to Find Two Similar Needles in a Really Big Haystack
 
Data Virtualization for Compliance – Creating a Controlled Data Environment
Data Virtualization for Compliance – Creating a Controlled Data EnvironmentData Virtualization for Compliance – Creating a Controlled Data Environment
Data Virtualization for Compliance – Creating a Controlled Data Environment
 
Trends in data analytics
Trends in data analyticsTrends in data analytics
Trends in data analytics
 
The Bigger They Are The Harder They Fall
The Bigger They Are The Harder They FallThe Bigger They Are The Harder They Fall
The Bigger They Are The Harder They Fall
 
Big data
Big dataBig data
Big data
 
Big data
Big dataBig data
Big data
 
Building Your Enterprise Data Marketplace with DMX-h
Building Your Enterprise Data Marketplace with DMX-hBuilding Your Enterprise Data Marketplace with DMX-h
Building Your Enterprise Data Marketplace with DMX-h
 
The New Trillium DQ: Big Data Insights When and Where You Need Them
The New Trillium DQ: Big Data Insights When and Where You Need ThemThe New Trillium DQ: Big Data Insights When and Where You Need Them
The New Trillium DQ: Big Data Insights When and Where You Need Them
 
ADV Slides: Data Pipelines in the Enterprise and Comparison
ADV Slides: Data Pipelines in the Enterprise and ComparisonADV Slides: Data Pipelines in the Enterprise and Comparison
ADV Slides: Data Pipelines in the Enterprise and Comparison
 

Mais de Gary Allemann

Effective data governance for customer intelligence
Effective data governance for customer intelligenceEffective data governance for customer intelligence
Effective data governance for customer intelligenceGary Allemann
 
Cs2017 gary allemann presentation
Cs2017 gary allemann presentationCs2017 gary allemann presentation
Cs2017 gary allemann presentationGary Allemann
 
Avoiding compliance pitfalls
Avoiding compliance pitfallsAvoiding compliance pitfalls
Avoiding compliance pitfallsGary Allemann
 
Insurance summit making the shift from product to customer centric
Insurance summit   making the shift from product to customer centricInsurance summit   making the shift from product to customer centric
Insurance summit making the shift from product to customer centricGary Allemann
 
The shift to data driven marketing
The shift to data driven marketingThe shift to data driven marketing
The shift to data driven marketingGary Allemann
 
Moving from passive to active data governance
Moving from passive to active data governanceMoving from passive to active data governance
Moving from passive to active data governanceGary Allemann
 
Using gis to enhance customer experience
Using gis to enhance customer experienceUsing gis to enhance customer experience
Using gis to enhance customer experienceGary Allemann
 
Chief data-officer-to-big-data-officer
Chief data-officer-to-big-data-officerChief data-officer-to-big-data-officer
Chief data-officer-to-big-data-officerGary Allemann
 
Big data myths busted
Big data myths bustedBig data myths busted
Big data myths bustedGary Allemann
 
Governance beyond master data
Governance beyond master dataGovernance beyond master data
Governance beyond master dataGary Allemann
 
Big data, big revenue
Big data, big revenueBig data, big revenue
Big data, big revenueGary Allemann
 

Mais de Gary Allemann (12)

Effective data governance for customer intelligence
Effective data governance for customer intelligenceEffective data governance for customer intelligence
Effective data governance for customer intelligence
 
Cs2017 gary allemann presentation
Cs2017 gary allemann presentationCs2017 gary allemann presentation
Cs2017 gary allemann presentation
 
Avoiding compliance pitfalls
Avoiding compliance pitfallsAvoiding compliance pitfalls
Avoiding compliance pitfalls
 
Insurance summit making the shift from product to customer centric
Insurance summit   making the shift from product to customer centricInsurance summit   making the shift from product to customer centric
Insurance summit making the shift from product to customer centric
 
The shift to data driven marketing
The shift to data driven marketingThe shift to data driven marketing
The shift to data driven marketing
 
Moving from passive to active data governance
Moving from passive to active data governanceMoving from passive to active data governance
Moving from passive to active data governance
 
Using gis to enhance customer experience
Using gis to enhance customer experienceUsing gis to enhance customer experience
Using gis to enhance customer experience
 
Chief data-officer-to-big-data-officer
Chief data-officer-to-big-data-officerChief data-officer-to-big-data-officer
Chief data-officer-to-big-data-officer
 
Big data myths busted
Big data myths bustedBig data myths busted
Big data myths busted
 
Governance beyond master data
Governance beyond master dataGovernance beyond master data
Governance beyond master data
 
Big data, big revenue
Big data, big revenueBig data, big revenue
Big data, big revenue
 
Bridging the gap
Bridging the gapBridging the gap
Bridging the gap
 

EXPLAINABLE AI AND THE MODERN DATA PIPELINE

  • 1. EXPLAINABLE AI Gary Allemann Master Data Management @mdm_za
  • 3. 3000 donuts a day for 30 years…
  • 4. What don’t you know? 54% 42%
  • 5. Data delivers competitive advantage. “Compared with their peers, high performers report a greater variety of actions to monetize data – with greater revenue impact” - McKinsey Global Survey: Fueling growth through data monetization. 73.2%: Percentage of executives whose firms have achieved measurable results from Big Data and AI investments - NewVantage Partners Big Data Executive Survey 2018. $1.8 Trillion: Projected annual revenue for insights-driven businesses by 2021 - “Insights-Driven Businesses Set the Pace for Global Growth,” Forrester, October 19, 2018. 85%: Firms that leverage customer behavioral insights outperform peers by 85 percent in sales growth and 25 percent in gross margin - McKinsey Global Survey: Capturing value from your customer data
  • 6. Common machine learning applications • Anti-money laundering • Fraud detection • Cybersecurity • Targeted marketing • Recommendation engine • Next best action • Customer churn prevention • Know your customer
  • 7. Why do you have a data lake? Syncsort 2019 data trends survey Analytics Use Cases Drive Data Lakes and Enterprise Data Hubs
  • 8. Most organisations not getting full value Syncsort 2019 data trends survey 91% of organizations have not yet reached a “transformational” level of maturity in data and analytics - Gartner 68% of IT professionals state that data silos negatively impact their organization’s ability to get value from their data • Every part of the business demands sophisticated data analysis • Departments need access to the company’s many data sets, combined in different ways • IT can’t be a bottleneck • Data has outgrown the data warehouse • Data lakes can be polluted and chaotic • Data is inconsistent across data marts
  • 9. Key challenges Syncsort 2019 data trends survey only 9% “very effective” in getting value from data IT decision makers waste 2 hours daily looking for relevant data
  • 10. 3-pronged approach: Make data easier to find and understand; Flexible data pipelines; Debug your data • Manage bias • Manage data quality at scale • Governance / Traceability • Batch and streaming • Legacy, big data and cloud • Data governance • Data catalog
  • 11. Data Architecture • Metadata/Data Modelling • Data Security • Data Integration • MDM/Reference Data • Data Quality • Data Governance • Business Intelligence • Data Warehouse • Big Data • AI and ML • Business-driven • IT-driven
  • 12. MAKING DATA EASIER TO FIND AND UNDERSTAND Data Governance and Catalog
  • 13. Data Governance and Catalog AI, Big Data, and Data Governance // Stan Christiaens, Collibra (FirstMark's Data Driven)
  • 14. Data Governance and Catalog AI, Big Data, and Data Governance // Stan Christiaens, Collibra (FirstMark's Data Driven) • The differentiator for #AI is DATA • Bias is like “a snake in the data grass” • Finding data is a “people and process” problem • Data (if you treat it as a strategic asset) should have its own business process
  • 15. BUILDING A QUALITY DATA PIPELINE Data Governance and Catalog
  • 16. Data Engineer to the rescue. Data Scientist • Expert in statistical analysis, machine learning techniques, finding answers to business questions buried in datasets. • Does NOT want to spend 50 – 90% of their time tinkering with data, getting it into good shape to train models – but frequently does, especially if there’s no data engineer on their team. • When the machine learning model is trained, tested, and proven to accomplish the goal, turns it over to the data engineer to productionize. Not skilled at taking the model from a test sandbox into production, especially not at large scale. Data Engineer • Expert in data structures, data manipulation, and constructing production data pipelines. • WANTS to spend all of their time working with data, but usually has more on their plate than they can keep up with. Anything that will speed up their work is helpful. • In most successful companies, is involved from the beginning. First gathers, cleans and standardizes data, helps the data scientist with feature engineering, provides top-notch data, ready to train models. • After the model is tested, builds robust, high-scale data pipelines to feed the models the data they need in the correct format in production to provide ongoing business value.
  • 17. Identify and onboard all relevant data Data Lake or Cloud Raw Landing Zone Access & Onboard – Elect to include data to understand • What you don’t know CAN hurt you – e.g. bias • If you’ve left it out, you cannot know it exists • Data sets have more power to predict when combined
  • 18. Ensure the quality Data Lake or Cloud Raw Landing Zone Refined Zone Refine – cleanse, enrich, de-duplicate • What data needs refinement? – use cases will determine • Each data set should be refined once – don’t repeat work
  • 19. Understand provenance Data Lake or Cloud Raw Landing Zone Refined Zone Track Provenance • Data lineage documentation is necessary for establishing that data can be trusted, and for auditing and regulatory compliance • Also useful for reproducing steps in production machine learning data pipelines
  • 20. Enrich and grow Data Lake or Cloud Raw Landing Zone Refined Zone Shop for data sets, features & validate against your questions • Analyst, data scientist shops for data • What do I need for my purpose? • Quality is already assured, provenance documented • Improves trust, saves time
  • 21. Challenges of Engineering Modern Data Pipelines. 1. Scattered and Difficult to Access Datasets: Much of the necessary data is trapped in mainframes or streams in from POS, web clicks, etc., all in incompatible formats, making it difficult to gather and prepare the data for model training. 2. Data Cleansing at Scale: Data quality cleansing and preparation routines have to be reproduced at scale. Most data quality tools are not designed to work on that scale of data. 3. Entity Resolution: Distinguishing matches across massive datasets that indicate a single specific entity (person, company, product, etc.) requires sophisticated multi-field matching algorithms and a lot of compute power. Essentially everything has to be compared to everything else. 4. Tracking Lineage from the Source: Data changes made to help train models have to be exactly duplicated in production, in order for models to accurately make predictions on new data, and for required audit trails. Capture of complete lineage, from source to end point, is needed.
  • 22. Onboard any data. Onboard data, modify on-the-fly to match cloud storage models, or store unchanged for archive compliance. Access data from streaming and batch sources outside cluster. Diagram labels: Data Sources, Data, Data Lake
  • 23. Data drift is a major issue Dimensional Research
  • 24. Hybrid and Multi-Cloud Strategies • Ensure seamless data flow to/from cloud, and among clouds • Maximize choice for workload optimization and interoperability • Design once, deploy anywhere – on premise and in the cloud • Optimize cloud infrastructure for cost and efficiency • Minimize disruption and risk • Build new skills to handle different and emerging portfolios Challenges • Managing multiple clouds and vendors • Integrating data and applications on-premise to cloud, across clouds • Avoiding cloud lock-in • Lack of skills to handle hybrid multi-cloud world • Cloud native or cloud first for new applications • Scalability and elasticity • Hybrid: on-premises systems and public and private clouds • Multi-cloud • Cloud increases focus on business process from tech details
  • 25. Seamlessly flow data to, from and among clouds Design Once, Deploy Anywhere – Public cloud, Private Cloud, Multi-Cloud, Hybrid or On-Prem • Build a modern data pipeline with flexibility, agility and elasticity • Simplify accessing, integrating, governing your data in a single software environment • Get the most from the Cloud – no silos, no lock-in, no re-work • Move to/from on-premise to Cloud, or between Clouds with no re-design, re-compile, no re-work ever! • Get excellent performance every time – without tuning, load balancing, etc. • Future-proof your applications
  • 26. • Cleanse, enrich, de-duplicate • What data needs refinement? – use cases will determine • Matching across massive datasets that indicate a single specific entity (person, company, product, etc.) How dirty data hampers AI Dimensional Research
  • 27. Only 35% of senior executives have a high level of trust in the accuracy of their Big Data Analytics* 92% of executives are concerned about the negative impact of data and analytics on corporate reputation* Cost of poor data quality rose by 50% in 2017 (Gartner) 84% of CEOs are concerned about the quality of the data they’re basing decisions on* • Decision making – Trust the data that drives your business • Machine learning & AI – Train your models on accurate data • Customer centricity – Get a single, complete and accurate view of customer for better sales, marketing and service • Compliance – Know your data, and ensure its accuracy to meet industry and government regulations The Modern Data Pipeline Needs Data Quality *http://kpmg.com/guardiansoftrust
  • 28. Common Data Quality Problems • Many data records with different layouts • Lack of standardization of the different fields • Misspellings • Data sourced from third parties does not contain all the necessary fields • Inconsistent data formats (measurements, languages, postal conventions and dates) • Names spelled differently • Different number formatting
  • 29. Common Data Quality Problems at Scale. Common Challenges • Big Data projects require massive scalability, low latency, and many data sources for a complete view • Data Quality processing using a standalone server can’t keep up: millions of business transactions a day are now common, and standalone quality projects may take several hours, making them unlikely to meet end user SLAs and/or key success factors. Solution • Trillium Quality for Big Data enables you to leverage the power and scalability of Big Data frameworks like Spark, MapReduce • Performs data quality jobs natively on the cluster • Leverages Intelligent Execution – design once, deploy anywhere – cloud, multi-cloud, hybrid or on prem • No need to move/copy data for quality processing; Big Data remains in place • No coding or tuning; jobs are automatically optimized. Benefits • Data Pipeline delivers trusted data for analytics • Robust data quality processing at Big Data scale to meet SLAs, support use cases like Anti-Money Laundering or Customer 360 • No coding or tuning saves time and resources – and helps address Big Data skills shortages • Save time and network resources by keeping data in place
  • 30. Cleanse data in Hadoop / Cloud Transform, join, cleanse and enhance data in cluster with Spark or MapReduce. Excellent performance every time. Data Onboard data, modify on-the-fly to match cloud storage models, or store unchanged for archive compliance. Access data from streaming and batch sources outside cluster. Data Sources Data Lake
  • 31. Get end-to-end data lineage. Onboard data, modify on-the-fly to match cloud storage models, or store unchanged for archive compliance. Access data from streaming and batch sources outside cluster. Transform, join, cleanse and enhance data in cluster with Spark or MapReduce. Excellent performance every time. Pass source-to-cluster data lineage info to REST API and Navigator or Atlas. Navigator or Atlas gathers any other changes made to data on cluster (data changes separately made by MapReduce, Spark, HiveQL). Diagram labels: Data Sources, Data, Data Lake, Data Lineage, REST API
  • 33. Analysts Get Complete Picture with Trusted Data Provenance. Onboard data, modify on-the-fly to match cloud storage models, or store unchanged for archive compliance. Access data from streaming and batch sources outside cluster. Transform, join, cleanse and enhance data in cluster with Spark or MapReduce. Excellent performance every time. Pass source-to-cluster data lineage info to REST API and Navigator or Atlas. Navigator or Atlas gathers any other changes made to data on cluster (data changes separately made by MapReduce, Spark, HiveQL). Auditors get end-to-end data lineage. Analytics, visualizations, and machine learning algorithms get clean, complete data. Diagram labels: Data Sources, Data, Data Lake, Data Lineage, REST API, Clean, Complete Data, Analytics, Visualization, Machine Learning
  • 34. Forrester Research The path to enterprise AI is full of twists and turns, false starts, and lessons to learn. Surely without data quality, AI and other advanced technologies can not live up to their expectations.
  • 36. • Gary Allemann • +27 83 632 1591 • gary@masterdata.co.za • www.masterdata.co.za Questions
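
The data quality problems on slide 28 (misspellings, names spelled differently, inconsistent formats) and the entity resolution challenge on slide 21 ("everything has to be compared to everything else") can be illustrated with a small sketch. This is not taken from the deck and does not represent how Trillium or any other product performs matching. The function names, the 0.85 threshold, and the sample records are all hypothetical, and the string similarity comes from Python's standard `difflib` module rather than a production matching algorithm.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name: str) -> str:
    """Standardize a name: lowercase, drop punctuation, collapse whitespace."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two names after standardization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def candidate_matches(records, threshold=0.85):
    """Compare every record with every other record (quadratic in record count).

    The threshold is an illustrative assumption, not a recommended value.
    """
    matches = []
    for (id_a, name_a), (id_b, name_b) in combinations(records, 2):
        score = similarity(name_a, name_b)
        if score >= threshold:
            matches.append((id_a, id_b, score))
    return matches


# Hypothetical records: two spellings of one company plus an unrelated one.
records = [
    (1, "Acme Holdings (Pty) Ltd"),
    (2, "ACME  Holdngs Pty Ltd."),
    (3, "Apex Logistics"),
]

for id_a, id_b, score in candidate_matches(records):
    print(f"records {id_a} and {id_b} may be the same entity (score {score:.2f})")
```

Because every pair is compared, the work grows with the square of the number of records, which is why the slide stresses compute power and why standalone tools struggle at big data scale.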

Editor's Notes

  1. The Refined Zone may be another cluster, another part of the same cluster, a Cloud, an analytic database, wherever the data sets can be easily stored and found by the people who need them. Select data sets based on use cases. Start with a use case that requires relatively few data sets and/or has relatively high business value. Get immediate ROI for that use case, then move to the next. Once a data set has been refined, it’s there for other use cases that might need the same data. Build on that by refining additional data sets for the next use case. And so on.
  2. That’s a data marketplace, and why you need one.
  3. IT is transforming to handle a combination of on premise, infrastructure-as-a-service, platform-as-a-service, and software-as-a-service. The best architecture will make choices affordable so an architecture with multiple cloud vendors is just as easy and powerful as using a single cloud. Going all-in on one cloud architecture puts IT in the same weak, single source position that many customers of companies such as Oracle find themselves in today. No matter what the current management of those vendors say, future managers will exploit this weakness to increase revenue. It is crucial that you do as much of the detailed work of handling complex programming, rules, transformations, and other forms of coding in ways that protect you from changes in the underlying infrastructure. The ideal form of expression of coding is in a system that could operate on-premises or in any cloud.
  4. Syncsort Connect for Big Data is specifically designed to simplify the process of accessing, integrating, governing and securing all your enterprise data – batch and streaming – in a single software environment. With Connect for Big Data you can: visually design your jobs once, and deploy them anywhere (MapReduce, Spark, Linux, Unix, Windows), on premise or in the cloud, with no changes or tuning required; easily move applications from standalone server environments and from MapReduce to Spark, as easy as clicking on a drop-down menu; future-proof job designs for emerging compute frameworks; avoid tuning, because Intelligent Execution dynamically plans for applications at run-time based on the chosen compute framework; insulate your users from the underlying complexities of Hadoop and use existing ETL skills; cut development time in half.
  5. Cloudera Navigator
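
Slides 19 and 21 argue that data changes made while training a model must be reproducible in production and auditable from source to end point. The sketch below shows one minimal way to record that kind of lineage. It is an illustration only and does not use or imitate the Cloudera Navigator or Apache Atlas APIs. The `TrackedDataset` class, the `fingerprint` helper, the step names, and the sample rows are hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, field


def fingerprint(rows) -> str:
    """Short content hash so each lineage entry can be checked later."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


@dataclass
class TrackedDataset:
    """Rows plus an ordered log of every transformation applied to them."""
    rows: list
    lineage: list = field(default_factory=list)

    def apply(self, step_name, func):
        # Record what went in and what came out of each named step.
        new_rows = func(self.rows)
        entry = {
            "step": step_name,
            "input": fingerprint(self.rows),
            "output": fingerprint(new_rows),
        }
        return TrackedDataset(new_rows, self.lineage + [entry])


def drop_duplicates(rows):
    """Keep the first occurrence of each identical row."""
    seen, unique = set(), []
    for row in rows:
        key = json.dumps(row, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def uppercase_country(rows):
    """Standardize the country code field to upper case."""
    return [{**row, "country": row["country"].upper()} for row in rows]


# Hypothetical raw landing zone data with one duplicate row.
raw = TrackedDataset([
    {"id": 1, "country": "za"},
    {"id": 1, "country": "za"},
    {"id": 2, "country": "uk"},
])

refined = (
    raw.apply("drop_duplicates", drop_duplicates)
       .apply("uppercase_country", uppercase_country)
)

for entry in refined.lineage:
    print(entry)
```

Replaying the same named steps on the same input yields the same fingerprints, which is the property an auditor or a production pipeline would rely on.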