EXPLAINABLE AI AND THE MODERN DATA PIPELINE

EXPLAINABLE AI
Gary Allemann
Master Data Management
@mdm_za

3000 donuts a day for 30 years…

What don’t you know?
54% 42%

Data delivers competitive advantage
“Compared with their peers, high
performers report a greater variety
of actions to monetize data – with
greater revenue impact”
- McKinsey Global Survey: Fueling growth through data
monetization
73.2%
Percentage of executives whose firms
have achieved measurable results from
Big Data and AI investments
- NewVantage Partners Big Data Executive Survey 2018
$1.8 Trillion
Projected annual revenue for
insights-driven businesses by 2021
- “Insights-Driven Businesses Set the Pace for Global
Growth,” Forrester, October 19, 2018
85%
Firms that leverage customer behavioral
insights outperform peers by 85 percent
in sales growth and 25 percent in gross
margin
- McKinsey Global Survey: Capturing value from your
customer data

Common machine learning applications
• Anti-money laundering
• Fraud detection
• Cybersecurity
• Targeted marketing
• Recommendation engine
• Next best action
• Customer churn prevention
• Know your customer

Why do you have a data lake?
Syncsort 2019 data trends survey
Analytics Use Cases
Drive Data Lakes
and Enterprise
Data Hubs

Most organisations not getting full value
91% of organizations
have not yet reached
a “transformational”
level of maturity in
data and analytics
- Gartner
68% of IT professionals
state that data silos
negatively impact their
organization’s ability to
get value from their data
• Every part of the
business demands
sophisticated data
analysis
• Departments need
access to the
company’s many data
sets, combined in
different ways
• IT can’t be a bottleneck
• Data has outgrown the
data warehouse
• Data lakes can be
polluted and chaotic
• Data is inconsistent
across data marts

Key challenges
only 9% “very effective” in
getting value from data
IT decision makers waste 2 hours
daily looking for relevant data

3-pronged approach
Make data easier to find and understand
• Data governance
• Data catalog
Flexible data pipelines
• Batch and streaming
• Legacy, big data and cloud
Debug your data
• Manage bias
• Manage data quality at scale
• Governance / Traceability

Data Architecture
Metadata/Data Modelling
Data Security
Data Integration
MDM/Reference Data
Data Quality
Data Governance
Business Intelligence
Data Warehouse
Big Data
AI and ML
Business-driven
IT-driven

MAKING DATA EASIER TO FIND AND UNDERSTAND
Data Governance and Catalog

AI, Big Data, and Data Governance // Stan Christiaens, Collibra
(FirstMark's Data Driven)

• The differentiator for #AI is DATA
• Bias is like “a snake in the data grass”
• Finding data is a “people and process” problem
• Data (if you treat it as a strategic asset) should
have its own business process

BUILDING A QUALITY DATA PIPELINE

Data Scientist
• Expert in statistical analysis, machine
learning techniques, finding answers to
business questions buried in datasets.
• Does NOT want to spend 50 – 90% of their
time tinkering with data, getting it into
good shape to train models – but
frequently does, especially if there’s no
data engineer on their team.
• When machine learning model is trained,
tested, and proven it will accomplish the
goal, turns it over to data engineer to
productionize. Not skilled at taking the
model from a test sandbox into
production, especially not at large scale.
Data Engineer
• Expert in data structures, data
manipulation, and constructing production
data pipelines.
• WANTS to spend all of their time working
with data, but usually has more on their
plate than they can keep up with. Anything
that will speed up their work is helpful.
• In most successful companies, is involved
from the beginning. First gathers, cleans
and standardizes data, helps data scientist
with feature engineering, provides
top-notch data, ready to train models.
• After model is tested, builds robust,
high-scale data pipelines to feed the models
the data they need in the correct format in
production to provide ongoing business
value.
Data Engineer to the rescue

Identify and onboard all relevant data
Data Lake or Cloud
Raw Landing Zone
Access & Onboard – Elect to include data to understand
• What you don’t know CAN hurt you – e.g. bias
• If you’ve left it out, you cannot know it exists
• Data sets have more power to predict when combined

Ensure the quality
Data Lake or Cloud
Raw Landing Zone
Refined Zone
Refine – cleanse, enrich, de-duplicate
• What data needs refinement? – use cases will determine
• Each data set should be refined once – don’t repeat work

Understand provenance
Data Lake or Cloud
Raw Landing Zone
Refined Zone
Track Provenance
• Data lineage documentation is necessary for establishing that data can be
trusted, and for auditing and regulatory compliance
• Also, useful for reproducing steps in production machine learning
data pipelines
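
A minimal sketch of provenance tracking, assuming the pipeline is a sequence of plain Python functions applied to lists of records. Each applied step is logged with row counts and a hash of its output, and the recorded steps can be replayed on new data. The names `LineageTracker`, `strip_names`, `drop_missing_age` and the source label `crm_export` are hypothetical and are not taken from any product mentioned in this deck.

```python
import hashlib
import json


def fingerprint(rows):
    """Return a short, stable hash of a list of dict records."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


class LineageTracker:
    """Applies transformation steps in order and logs what each one did."""

    def __init__(self, source_name):
        self.source_name = source_name
        self.steps = []  # functions, in the order they were applied
        self.log = []    # one lineage entry per applied step

    def apply(self, rows, step):
        out = step(rows)
        self.steps.append(step)
        self.log.append({
            "source": self.source_name,
            "step": step.__name__,
            "rows_in": len(rows),
            "rows_out": len(out),
            "output_hash": fingerprint(out),
        })
        return out

    def replay(self, rows):
        """Re-run the recorded steps, for example on new production data."""
        for step in self.steps:
            rows = step(rows)
        return rows


# Hypothetical cleansing steps, for illustration only.
def strip_names(rows):
    return [{**r, "name": r["name"].strip()} for r in rows]


def drop_missing_age(rows):
    return [r for r in rows if r.get("age") is not None]


raw = [{"name": " Ann ", "age": 41}, {"name": "Bob", "age": None}]
tracker = LineageTracker("crm_export")
data = tracker.apply(raw, strip_names)
data = tracker.apply(data, drop_missing_age)

for entry in tracker.log:
    print(entry)
```

Keeping the step list and the log together means the audit trail and the production replay come from the same record, so they are less likely to diverge.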

Enrich and grow
Data Lake or Cloud
Raw Landing Zone
Refined Zone
Shop for data sets, features & validate against your questions
• Analyst, data scientist shops for data
• What do I need for my purpose?
• Quality is already assured, provenance documented
• Improves trust, saves time
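
As a rough illustration of "shopping" for data, the sketch below assumes a catalog is simply a list of entries that carry a zone, a quality flag and documented lineage. The `CatalogEntry` fields, the `shop` function and the sample data set names are all hypothetical; a real data catalog product will expose its own model and search.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    description: str
    zone: str                                     # "raw" or "refined"
    quality_checked: bool = False
    lineage: list = field(default_factory=list)   # ordered processing steps


def shop(catalog, keyword, refined_only=True):
    """Return entries whose name or description mentions the keyword."""
    keyword = keyword.lower()
    hits = []
    for entry in catalog:
        text = f"{entry.name} {entry.description}".lower()
        if keyword not in text:
            continue
        # By default, only offer data that is refined and quality-checked.
        if refined_only and not (entry.zone == "refined" and entry.quality_checked):
            continue
        hits.append(entry)
    return hits


catalog = [
    CatalogEntry("pos_transactions_raw", "Point-of-sale transactions as landed", "raw"),
    CatalogEntry("pos_transactions", "Cleansed, de-duplicated point-of-sale transactions",
                 "refined", True, ["onboard", "cleanse", "de-duplicate"]),
    CatalogEntry("web_clicks", "Refined web click stream",
                 "refined", True, ["onboard", "cleanse"]),
]

for entry in shop(catalog, "point-of-sale"):
    print(entry.name, entry.lineage)
```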

1. Scattered and Difficult to Access Datasets
Much of the necessary data is trapped in mainframes or streams in from POS,
web clicks, etc. all in incompatible formats, making it difficult to gather and
prepare the data for model training.
2. Data Cleansing at Scale
Data quality cleansing and preparation routines have to be reproduced at
scale. Most data quality tools are not designed to work on that scale of data.
3. Entity Resolution
Distinguishing matches across massive datasets that indicate a single specific
entity (person, company, product, etc.) requires sophisticated multi-field
matching algorithms and a lot of compute power. Essentially everything has to
be compared to everything else.
4. Tracking Lineage from the Source
Data changes made to help train models have to be exactly duplicated in
production, in order for models to accurately make predictions on new data,
and for required audit trails. Capture of complete lineage, from source to end
point is needed.
Challenges of Engineering
Modern Data Pipelines
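
The entity resolution challenge above can be sketched in a few lines, assuming records are dicts with `name`, `street` and `postcode` fields (hypothetical field names). To avoid comparing everything to everything else, the sketch uses blocking: only records that share a postal code are compared, and similarity comes from the standard library's `difflib.SequenceMatcher`. The 0.85 threshold is an arbitrary assumption, and this is an illustration of multi-field matching in general, not the matching algorithm of any specific tool.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences don't matter."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Similarity ratio in [0, 1] from the standard library's SequenceMatcher."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def block_key(record):
    """Hypothetical blocking key: only records sharing a postcode are compared."""
    return record["postcode"]


def find_matches(records, threshold=0.85):
    """Return pairs of record ids whose name and street both look alike."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec)

    matches = []
    for group in blocks.values():
        for a, b in combinations(group, 2):
            name_score = similarity(a["name"], b["name"])
            street_score = similarity(a["street"], b["street"])
            if min(name_score, street_score) >= threshold:
                matches.append((a["id"], b["id"]))
    return matches


records = [
    {"id": 1, "name": "Jon Smith", "street": "12 Main Road", "postcode": "2196"},
    {"id": 2, "name": "John Smith", "street": "12 Main Rd", "postcode": "2196"},
    {"id": 3, "name": "Mary Jones", "street": "7 Oak Avenue", "postcode": "2196"},
    {"id": 4, "name": "John Smith", "street": "12 Main Road", "postcode": "8001"},
]
print(find_matches(records))
```

Blocking trades recall for speed: record 4 is never compared with records 1 and 2 because its postcode differs, so a wrong or changed blocking field can hide a true match.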

Onboard any data
Data
Onboard data, modify
on-the-fly to match
cloud storage models,
or store unchanged for
archive compliance.
Access data from
streaming and batch
sources outside
cluster.
Data Sources Data Lake

Data drift is a major issue
Dimensional Research
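
One simple form of data drift is a change in the structure of incoming records. The sketch below assumes batches arrive as lists of dicts and compares a new batch against a baseline, reporting added, removed and retyped fields. The function names and the sample fields are hypothetical, and drift in value distributions would need separate statistical checks that are not shown here.

```python
def infer_schema(rows):
    """Map each field name to the set of Python type names seen for it."""
    schema = {}
    for row in rows:
        for field_name, value in row.items():
            schema.setdefault(field_name, set()).add(type(value).__name__)
    return schema


def schema_drift(baseline_rows, new_rows):
    """Compare two batches and report added, removed and retyped fields."""
    old, new = infer_schema(baseline_rows), infer_schema(new_rows)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "type_changed": sorted(
            f for f in old.keys() & new.keys() if old[f] != new[f]
        ),
    }


baseline = [{"customer_id": 1, "amount": 19.99, "currency": "ZAR"}]
incoming = [{"customer_id": "1", "amount": 19.99, "channel": "web"}]

report = schema_drift(baseline, incoming)
print(report)
```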

Hybrid and Multi-Cloud
Strategies
• Ensure seamless data flow
to/from cloud, and among clouds
• Maximize choice for workload
optimization and interoperability
• Design once, deploy anywhere –
on premise and in the cloud
• Optimize cloud infrastructure for
cost and efficiency
• Minimize disruption and risk
• Build new skills to handle
different and emerging portfolios
Challenges
• Managing multiple clouds and
vendors
• Integrating data and applications
on-premise to cloud, across clouds
• Avoiding cloud lock-in
• Lack of skills to handle hybrid
multi-cloud world
• Cloud native or cloud first
for new applications
• Scalability and elasticity
• Hybrid: on-premises systems
and public and private
clouds
• Multi-cloud
• Cloud shifts focus from tech
details to business process

Seamlessly flow data to, from
and among clouds
Design Once, Deploy Anywhere – Public cloud, Private Cloud, Multi-Cloud, Hybrid or On-Prem
• Build a modern data pipeline with flexibility, agility
and elasticity
• Simplify accessing, integrating, governing your data
in a single software environment
• Get the most from the Cloud – no silos, no lock-in, no
re-work
• Move between on-premise and Cloud, or between
Clouds, with no re-design, no re-compile, no re-work
ever!
• Get excellent performance every time – without
tuning, load balancing, etc.
• Future-proof your applications

• Cleanse, enrich, de-duplicate
• What data needs refinement? – use
cases will determine
• Matching across massive datasets that
indicate a single specific entity
(person, company, product, etc.)
How dirty data hampers AI
Dimensional Research

Only 35% of senior
executives have a
high level of trust in
the accuracy of
their Big Data
Analytics*
92% of executives are
concerned about
the negative impact
of data and
analytics on
corporate
reputation*
Cost of poor data
quality rose by 50%
in 2017
(Gartner)
84% of CEOs
are concerned
about
the quality of the
data they’re basing
decisions on*
• Decision making – Trust the
data that drives your
business
• Machine learning & AI –
Train your models on
accurate data
• Customer centricity – Get a
single, complete and
accurate view of customer
for better sales, marketing
and service
• Compliance – Know your
data, and ensure its
accuracy to meet industry
and government regulations
The Modern Data Pipeline Needs Data Quality
*http://kpmg.com/guardiansoftrust

Common Data Quality Problems
• Many data records with different
layouts
• Lack of standardization of the
different fields
• Misspellings
• Data sourced from third parties
does not contain all the necessary
fields
• Inconsistent data formats
(measurements, languages,
postal conventions and dates)
• Names spelled differently
• Different number formatting
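
Several of the problems listed above (inconsistent date formats, different number formatting, names spelled with stray spacing or casing) can be reduced by standardizing fields before matching or analysis. The sketch below is a minimal example under stated assumptions: the accepted date formats, the rule that a lone comma is a decimal mark, and the record fields `name`, `joined` and `balance` are all hypothetical choices for illustration.

```python
from datetime import datetime

# Assumed input formats; extend this tuple for other sources.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y.%m.%d")


def standardize_date(text):
    """Return an ISO date string, or None if no known format fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def standardize_amount(text):
    """Parse '1,234.50' or '1 234,50' style numbers into a float."""
    cleaned = text.strip().replace(" ", "")
    if "," in cleaned and "." in cleaned:
        cleaned = cleaned.replace(",", "")   # comma used as thousands separator
    else:
        cleaned = cleaned.replace(",", ".")  # assumption: lone comma is a decimal mark
    return float(cleaned)


def standardize_name(text):
    """Collapse whitespace and apply title case."""
    return " ".join(text.split()).title()


def standardize(record):
    return {
        "name": standardize_name(record["name"]),
        "joined": standardize_date(record["joined"]),
        "balance": standardize_amount(record["balance"]),
    }


rows = [
    {"name": "  thandi   NKOSI ", "joined": "03/02/2019", "balance": "1 234,50"},
    {"name": "Pieter Botha", "joined": "2019.02.03", "balance": "1,234.50"},
]
cleaned = [standardize(r) for r in rows]
for row in cleaned:
    print(row)
```

Returning `None` for an unparseable date, rather than guessing, keeps bad values visible so they can be counted and reviewed.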

Common Data Quality Problems at Scale
Common Challenges
• Big Data projects require:
Massive scalability
Low latency
Many data sources for a
complete view
• Data Quality processing
using a standalone server
can’t keep up
Millions of business
transactions a day are
now common
Standalone quality projects
may take several hours;
unlikely to meet end user
SLAs and/or key success
factors
Solution
Trillium Quality for Big Data
enables you to leverage the
power and scalability of Big
Data frameworks like
Spark, MapReduce
Performs data quality jobs
natively on the cluster
Leverages Intelligent Execution
– design once, deploy
anywhere – cloud, multi-cloud,
hybrid or on prem
No need to move/copy data for
quality processing; Big Data
remains in place
No coding or tuning; jobs are
automatically optimized
Benefits
• Data Pipeline delivers trusted
data for analytics
• Robust data quality processing
at Big Data scale to meet SLAs,
support use cases like
Anti-Money Laundering or
Customer 360
• No coding or tuning saves
time and resources – and
helps address Big Data skills
shortages
• Save time and network
resources by keeping data in
place

Cleanse data in Hadoop / Cloud
Transform, join, cleanse
and enhance data in
cluster with Spark or
MapReduce. Excellent
performance every time.
Data Sources Data Lake

Get end-to-end data lineage
Diagram labels: Data Sources; Data Lake; Data Lineage; REST API
Pass source-to-cluster data lineage info to REST API and Navigator or Atlas.
Data changes separately made by MapReduce, Spark, HiveQL.
Navigator or Atlas gathers any other changes made to data on cluster.

Syncsort Published Lineage in Cloudera

Analysts Get Complete Picture with Trusted Data Provenance
Diagram labels: Data Sources; Data Lake; Data Lineage; REST API; Clean, Complete Data; Analytics, Visualization, Machine Learning
Auditors get end-to-end data lineage.
Analytics, visualizations, and machine learning algorithms get clean, complete data.
Pass source-to-cluster data lineage info to REST API and Navigator or Atlas.
Data changes separately made by MapReduce, Spark, HiveQL.
Navigator or Atlas gathers any other changes made to data on cluster.

Forrester Research
The path to enterprise AI is full of twists
and turns, false starts, and lessons to
learn.
Surely without data quality, AI and
other advanced technologies can not
live up to their expectations.

• Gary Allemann
• +27 83 632 1591
• gary@masterdata.co.za
• www.masterdata.co.za
Questions
