Trillium Software System:
New Features and
Big Data Matching
Paige Roberts, Product Marketing Manager
Steve Shissler, Director, Sales Engineering
Agenda
1 Syncsort
2 New Features in TSS
3 Big Data Matching Principles
4 Big Data Matching Case Study
5 Demo
6 Questions
Who is
Syncsort?
>7,000 customers
84 of the Fortune 100
Customers in >100 countries
Headquarters: Pearl River, NY
U.S. LOCATIONS
• Burlington, MA; Irvine, CA;
Oakbrook Terrace, IL; Rochester, MN
GLOBAL PRESENCE
• U.K., France, Germany, Netherlands,
Israel, Hong Kong & Japan
Big Iron to Big Data is a fast-growing
market segment composed of solutions
that optimize traditional data systems
and deliver mission-critical data from
these systems to next-generation
analytic environments.
Global leader in
Big Iron to Big Data
Syncsort’s Trillium Software System:
New Features
Collibra Integration
Collibra can define and manage data quality
rules, but it cannot enforce those rules on the
data or measure compliance with them.
Goal:
• Make data accessible, traceable and
meaningful to business users.
• Automatically pass Collibra rules into Trillium
Discovery and pass rule compliance data
back to Collibra.
Requirements:
• Bi-directional near real-time integration
between Trillium Discovery and Collibra DGC
for quality measurement and monitoring
• Trillium business rule analysis results / data
quality metrics shown in Collibra dashboards.
• Data Stewards can quickly identify issues and
take corrective action when data quality
standards are not met.
Closing the Loop
Collibra Data Governance Center
• Enables non-technical users to define
business policies and data quality rules
in plain language
• Makes data quality performance
available to all users
Trillium Discovery
• Imports DGC business rules so technical users
can convert them to executable data quality rules
• Constantly runs data quality metrics on a near
real-time basis and passes results back to
Collibra dashboards
Rulebooks to Rules
Quality test Results
Bi-directional connectivity Constant sync
Metrics falling below
thresholds can
trigger a case in
Collibra Issue
Management
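The threshold check described on this slide can be sketched in a few lines. This is only an illustration of the idea, not the Collibra or Trillium Discovery API: the `RuleResult` class, the rule names, the counts and the 0.95 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RuleResult:
    """Outcome of one data quality rule run (hypothetical structure)."""
    rule_name: str
    passed: int
    failed: int

    @property
    def pass_rate(self) -> float:
        total = self.passed + self.failed
        # Treat a rule with no rows evaluated as fully compliant.
        return self.passed / total if total else 1.0


def rules_below_threshold(results, threshold):
    """Return the names of rules whose pass rate is under the threshold."""
    return [r.rule_name for r in results if r.pass_rate < threshold]


# Illustrative numbers only.
results = [
    RuleResult("email_is_populated", passed=980, failed=20),
    RuleResult("postal_code_is_valid", passed=900, failed=100),
]
breaches = rules_below_threshold(results, threshold=0.95)
print(breaches)
```

In a real integration, each name in `breaches` would be the trigger for opening a case in the governance tool's issue management.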
Trillium Quality for Big Data
Trillium Quality =
Best-of-breed data quality
solution.
Leader in Gartner Data
Quality Tools MQ 12 years
running.
Intelligent Execution =
Artificially intelligent
dynamic performance
optimizer for cluster
execution in MapReduce,
Amazon EMR, or Spark.
Trillium Quality +
Intelligent Execution =
High performance
industry-leading data
quality on Big Data and
Cloud platforms.
• Build data quality processes that
ensure high-quality data that
meets such key business needs as:
o Single customer view (SCV)
o Standardized product data
o Standardization for fraud detection
Trillium Quality – Powerful Data Cleansing
• Consolidate data sources on input
• Match on party, household, business, etc.
• Develop workflows to transform, parse,
standardize, match and survive best record
• Manage “householding” issues associated with
multiple physical addresses under a single account
KEY FUNCTIONALITY:
• Global address validation with individual country postal rules
• Enrich missing postal information, latitude/longitude and other reference data
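"Survive best record" means building one consolidated record from a group of records already matched to the same entity. A minimal sketch of one possible rule follows, assuming that the most complete record wins and that the remaining records only fill its gaps. The function names and sample values are illustrative, and this is not necessarily the rule Trillium Quality applies.

```python
def filled_count(record):
    """Number of non-empty fields in a record."""
    return sum(1 for value in record.values() if value)


def survive_best_record(matched_records):
    """Merge matched records, letting the most complete record win per field."""
    ranked = sorted(matched_records, key=filled_count, reverse=True)
    survivor = {}
    for record in ranked:
        for field, value in record.items():
            # Only fill fields that no higher-ranked record has supplied.
            if value and not survivor.get(field):
                survivor[field] = value
    return survivor


# Illustrative group of records assumed to be matched already.
matched = [
    {"name": "Nick Saunders", "address": "", "city": "", "dob": "12/04/1971"},
    {"name": "Nicholas Saunders", "address": "22 Shady Lane", "city": "Mystic", "dob": ""},
]
survivor = survive_best_record(matched)
print(survivor)
```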
Design Once, Deploy Anywhere
Intelligent Execution - Insulate your organization from underlying complexities of Hadoop.
Get excellent performance every time
without tuning, load balancing, etc.
No re-design, no re-compile, no re-work, ever
• Future-proof job designs for emerging
compute frameworks, e.g. Spark 2.x
• Move from dev to test to production
• Move from on-premise to Cloud
• Move from one Cloud to another
Use existing ETL skills
No parallel programming – Java, MapReduce, Spark …
No worries about:
• Mappers, Reducers
• Big side or small side of joins …
Design Once
in visual GUI
Deploy Anywhere!
On-Premise,
Cloud
MapReduce, Spark,
Future Platforms
Windows, Unix,
Linux
Batch,
Streaming
Single Node,
Cluster
Trillium Quality for Big Data
• Deploy data quality workflows as native, parallel MapReduce or Spark
processes for optimal efficiency.
• Process hundreds of millions of records of data.
• Standardize, enhance, and match international data sets with postal and
country-code validation.
• Integrate, parse, standardize, and match new and legacy customer data
from multiple disparate sources.
• Increase processing efficiency.
• Support failover through Hadoop’s fault-tolerant design; during a node
failure, processing is redirected to another node.
Two Ways to Get Postal Updates
Trillium Postal Download Web Service
Trillium Postal Download Web Service is an
automated download service introduced in
TSS v15.7. The download service allows you
to check the status of your postal license and
download the postal directories from a
browser-based application.
TSS Download Center (File Portal) FTP website
TSS Download Center allows you to manually download
postal directories through Trillium Software’s secure
website. See the Trillium Software System Installation
Guide for procedures on downloading postal directories
through this website.
And more …
• Trillium Discovery REST APIs are installed with the TSS
server and documented in the Help file, for easy
integration with other applications such as ASG Data
Intelligence
• Unique ID (UUID) Function
• Trillium Language Pack Locale Setting
• Apache Tomcat Upgrade to v8.5.32
• Australian (AU) Postal Directories and AU Postal
Matcher changes in accordance with Australia Post
licensing terms
• And more …
Example:
German locale setting in config.txt
key rest_api {
value locale "de"
}
Big Data Matching
Finding Similar Needles in a Really Big Haystack
Nobody wants a data swamp instead of a data lake!
“This sure looked a lot nicer on the
whiteboard…”
Only 35% of senior
executives have a high
level of trust in the
accuracy of their Big
Data Analytics
92% of executives are
concerned about the
negative impact of data
and analytics on
corporate reputation
Cost of poor data quality
rose by 50% in 2017
(Gartner)
84% of CEOs
are concerned about
the quality of the data
they’re basing
decisions on
The importance of data
quality in the enterprise:
• Decision making – Trust the data
that drives your business
• Customer centricity – Get a
single, complete and accurate
view of your customer for better
sales, marketing and customer
service
• Compliance – Know your data,
and ensure its accuracy to meet
industry and government
regulations
• Machine learning & AI – Train
your models on accurate data
The Data Lake
Needs Data
Quality
“The magic of machine learning is that you build a
statistical model based on the most valid dataset for
the domain of interest.
If the data is junk, then you’ll be building a junk
model that will not be able to do its job.”
James Kobielus
SiliconANGLE Wikibon
Lead Analyst for Data Science, Deep Learning, App Development
2018
Common Machine Learning Applications
• Anti-money laundering
• Fraud detection
• Cybersecurity
• Targeted marketing
• Recommendation engine
• Next best action
• Customer churn prevention
• Know your customer
De-Bugging Your Data
Incorrect, Incomplete, Mis-Formatted “Dirty Data” –
Mistakes and errors are almost never the patterns you’re
looking for in a data set. Correcting and standardizing will
tend to boost the signal.
Multiple copies – If your data comes from many sources, as it
often does, it may contain multiple records of information
about the same person, company, product or other entity.
Removing duplicates and enhancing the overall depth and
accuracy of knowledge about a single entity can make a huge
difference.
Enrichment – Enriching data with other data sets, such as
geospatial, demographics, or firmographics data can provide
new depths of analysis. For example, adding latitude and
longitude may enable identification of geospatial patterns.
Correcting data problems vastly increases a data set’s usefulness for machine learning.
Traditional data quality processes are
an effective method to remove defects.
However, traditional data quality software
is designed to work on smaller data sets.
Data Quality Challenges of Enabling Machine Learning
1. Data Cleansing at Scale
• Data quality cleansing and preparation routines have to be reproduced at scale, both to get the data ready to train
machine learning models, and to comply with business regulations.
• Other data quality tools are not designed to work on that scale of data.
• Programming data cleansing workflows from scratch in Java MapReduce or Scala for Spark requires specialized skills
and takes at least twice as long as designing the same workflows in graphical point-and-click tools.
• Tuning those MapReduce or Spark workflows to get decent performance on a cluster takes even longer, and will
have to be re-done if the job is moved to a bigger or smaller cluster, or from an on-premise data center to the Cloud.
Harvard Business Review - 2018
“If your data is bad, your machine
learning tools are useless.”
Anonymous Computer Scientist - 1957
“Garbage in, garbage out.”
Common Data Quality Problems
• Many data records with different
layouts
• Lack of standardization of the
different fields
• Misspellings
• Data sourced from third parties does
not contain all the necessary fields
• Inconsistent data formats
(measurements, languages, postal
conventions and dates)
• Names spelled differently
• Different number formatting
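Several of these problems (inconsistent formats, different number formatting) are handled by standardizing each field before any comparison. The sketch below shows the idea for two fields. It assumes UK-style postcodes, where the last three characters form the inward code, and the function names are hypothetical rather than part of any product.

```python
import re


def normalize_postcode(raw: str) -> str:
    """Uppercase, drop punctuation and spaces, then re-insert the space
    before the last three characters (UK-style assumption)."""
    compact = re.sub(r"[^A-Za-z0-9]", "", raw).upper()
    if len(compact) < 5:
        return compact  # too short to split safely; return it unsplit
    return f"{compact[:-3]} {compact[-3:]}"


def normalize_phone(raw: str) -> str:
    """Keep digits only, so spacing and punctuation differences disappear."""
    return re.sub(r"\D", "", raw)


print(normalize_postcode("S667EN"))
print(normalize_postcode("s66 7en"))
print(normalize_phone("01189 407 600"))
```

Once both sides of a comparison have been through the same normalization, an exact match is meaningful again.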
“But I have a lot of data…” is not an excuse for non-compliance.
To comply with GDPR, companies must know the
answers to the following questions:
• What do we know about a given customer?
• Where is our customer data?
• Is our customer contact information current?
• How are we processing customer data?
And supply those answers in the form of business
processes that provide evidence of compliance.
Data Quality is Critical for GDPR Compliance
Data Quality Challenges of Enabling Machine Learning
2. Entity Resolution
• Distinguishing matches that relate to a single specific entity (a person, a company, a part, etc.) requires sophisticated
multi-field matching algorithms
• Distinguishing matches across massive datasets requires a lot of compute power. Essentially everything has to be
compared to everything else, multiple times in multiple ways.
• Other data quality tools cannot find and combine records of the same entity at that scale.
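To see why comparing "everything to everything else" needs so much compute, consider the number of distinct record pairs in a single matching pass, which is n(n-1)/2 for n records. The short calculation below is plain arithmetic, not a description of how any product schedules the work.

```python
from math import comb


def pairwise_comparisons(n: int) -> int:
    """Number of distinct record pairs among n records: n * (n - 1) / 2."""
    return comb(n, 2)


for n in (1_000, 1_000_000, 1_000_000_000):
    print(f"{n:>13,} records -> {pairwise_comparisons(n):,} pairs per pass")
```

The pair count grows with the square of the record count, and each additional match pass multiplies it again.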
Entity Resolution at Scale
I have billions of records. How do I identify the same entity?
Are these two businesses owned by the same person?
Are these two accounts in the same building?
• Exact match + 36 different fuzzy matching
comparison algorithms
• Weighted decision trees
• Match scoring for confidence thresholds
• Multi-field matching, multi-pass and array
matching
• Transitive matching with multiple
different match criteria
A=B, B=C therefore A=B=C
• High performance everything-to-everything
comparison across any cluster in MapReduce
or Spark
Is that you, Bob?
Each source record has the fields Name, Address1, Address2, City, Postal Code, Phone and Email. Only populated fields are shown.
Customer Service
Name: ROB SMITH
Address1: 3 DAVY DRIVE
Postal Code: S66 7EN
Phone: 01189407600
Email: bob.smith@hotmail.com
Web Login
Name: Dr Bob Smith
Email: bob.smith@hotmail.com
Transfer
Name: Mr Robert Smith
Address1: 3 Davey Drive
Address2: # 16
City: Rotherham
Postal Code: S667EN
Phone: 01189 407 600
Purchase
Name: Bob Smith DR
Address1: 3 Davy Dr #16
City: Rotherham
Postal Code: S667EN
Phone: 01189 407 600
ATM Transaction
Name: Dr. B. Smith
Address1: 3 Davy Dryve 16
City: MALtby
Postal Code: S66 7EN
Phone: 01189 407 600
Email: bsmith@gmail.com
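The transitive rule (A=B, B=C therefore A=B=C) can be sketched with a union-find structure over the five Bob Smith records above. This sketch makes two simplifying assumptions: it uses exact comparison after normalization instead of the fuzzy algorithms and weighted scoring listed on the slide, and its two match criteria (same email, or same phone plus same postal code) are chosen only for illustration.

```python
import re
from itertools import combinations

# Email, phone and postal code as they appear in the five source records.
records = {
    "customer_service": {"email": "bob.smith@hotmail.com", "phone": "01189407600", "postcode": "S66 7EN"},
    "web_login": {"email": "bob.smith@hotmail.com", "phone": "", "postcode": ""},
    "transfer": {"email": "", "phone": "01189 407 600", "postcode": "S667EN"},
    "purchase": {"email": "", "phone": "01189 407 600", "postcode": "S667EN"},
    "atm_transaction": {"email": "bsmith@gmail.com", "phone": "01189 407 600", "postcode": "S66 7EN"},
}


def compact(value):
    """Lowercase and strip spaces and punctuation other than '@' and '.'."""
    return re.sub(r"[^a-z0-9@.]", "", value.lower())


def direct_match(a, b):
    """Illustrative criteria: same email, or same phone and same postal code."""
    same_email = bool(a["email"]) and compact(a["email"]) == compact(b["email"])
    same_phone = bool(a["phone"]) and compact(a["phone"]) == compact(b["phone"])
    same_postcode = bool(a["postcode"]) and compact(a["postcode"]) == compact(b["postcode"])
    return same_email or (same_phone and same_postcode)


# Union-find: every record starts as its own cluster.
parent = {key: key for key in records}


def find(key):
    while parent[key] != key:
        parent[key] = parent[parent[key]]  # path compression
        key = parent[key]
    return key


def union(a, b):
    parent[find(a)] = find(b)


# Everything-to-everything comparison; each direct match merges two clusters.
for a, b in combinations(records, 2):
    if direct_match(records[a], records[b]):
        union(a, b)

clusters = {}
for key in records:
    clusters.setdefault(find(key), []).append(key)

print(list(clusters.values()))
```

The Web Login and Transfer records share no populated field, so they never match directly. They still end up in the same cluster because each matches the Customer Service record.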
Anti-Money Laundering on Hadoop at Global Bank
Challenge: Meet AML transaction monitoring and Financial Conduct
Authority (FCA) compliance demands
• Data too large and too widely scattered to analyze
• Disparate data sources – Mainframe, RDBMS, Cloud, etc.
Requirements:
• Consolidate, clean, and verify data for all analytics and
reporting.
• MUST be secure: Kerberos and LDAP integration
required
• Need unmodified copy of
mainframe data stored on
Hadoop for backup, and
compliance archive
• MUST have complete, detailed data
lineage from origin to end point
Impact of Entity Resolution
Anti-Money Laundering on Hadoop at Global Bank
Bank must monitor transactions to detect money laundering for FCA compliance.
Machine learning can detect patterns, but it requires large amounts of current, clean data.
• Must be secure – Kerberos, LDAP
• Must have lineage – data origin to end point
• Massive data volumes
• Scattered data – Mainframe, RDBMS, Cloud, …
• Must archive unaltered mainframe data
Solution:
• Syncsort DMX-h
• Syncsort’s Trillium Quality for Big Data
• Syncsort DMX Change Data Capture
• Hortonworks HDP
• Full end-to-end data lineage supplied to Apache Atlas and ASG Data Intelligence
• Cluster-native data verification, enrichment, and demanding multi-field entity resolution on Spark
• Unmodified mainframe “Golden Records” stored on Hadoop
Full Anti-Money Laundering regulatory compliance with financial crimes data lake –
high performance results at massive scale.
“For want of a nail, the kingdom was lost.
For want of a data cleansing and integration tool,
the whole AI superstructure can fall down.”
James Kobielus
SiliconANGLE Wikibon
Lead Analyst for Data Science, Deep Learning, App Development
2018
Demo: Big Data Matching
With Trillium Quality for Big Data
Trillium Quality for Big Data – Data Cleansing at Scale
Boost the effectiveness of machine learning and AI with complete, standardized data.
1. Visually create and test data
quality processes locally
2. Execute in MapReduce or Spark
On premise or in the Cloud
Identity management
Name                   Address          City     State  Zip    DOB
Nicholas Saunders      22 Shady Lane    Mystic   CT     06355  04/12/1971
N.M Saunders Jnr       Crooked Trail    Trenton  NJ     08604  12/04/1971
Nick Saunders          22 Shady Street  Mystic   CT     06355  12/04/1971
Saunders, Nicholas M.  22 Shady Lane    Mystic   CT     06355  n/a
Nicholas Sanders       Crooked Road     Trenton  NJ     08604  04/12/1971
Nicholas Saunders      22 Shady Street  Mystic   NJ     08604  12/04/1971
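The DOB column above shows a typical matching trap: 04/12/1971 and 12/04/1971 may be the same date written day-first in one system and month-first in another. A hedged sketch of one way to compare such values follows. It treats two strings as a possible match if any day/month interpretation agrees. The function names are hypothetical, and the rule is an assumption for illustration, not the demo's actual logic.

```python
from datetime import datetime


def parse_both_ways(raw):
    """Return every date a numeric string could mean under
    day/month/year and month/day/year ordering."""
    candidates = set()
    for fmt in ("%d/%m/%Y", "%m/%d/%Y"):
        try:
            candidates.add(datetime.strptime(raw, fmt).date())
        except ValueError:
            pass  # not a valid date in this ordering (or not a date at all)
    return candidates


def dob_could_match(a, b):
    """True if the two strings share at least one possible interpretation."""
    return bool(parse_both_ways(a) & parse_both_ways(b))


print(dob_could_match("04/12/1971", "12/04/1971"))
print(dob_could_match("04/12/1971", "n/a"))
```

A missing value such as "n/a" yields no candidate dates, so it never counts as agreement. In a multi-field match the decision would then rest on the other fields.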
CUSTOMERS VENDORS ACCOUNTS
360º View
Questions?
Big Data Matching - How to Find Two Similar Needles in a Really Big Haystack
Unlock Efficiency With Your Address Data Today For a Smarter Tomorrow
 
Navigating Cloud Trends in 2024 Webinar Deck
Navigating Cloud Trends in 2024 Webinar DeckNavigating Cloud Trends in 2024 Webinar Deck
Navigating Cloud Trends in 2024 Webinar Deck
 

Último

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 

Último (20)

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 

Big Data Matching - How to Find Two Similar Needles in a Really Big Haystack

  • 1. Trillium Software System: New Features and Big Data Matching Paige Roberts, Product Marketing Manager Steve Shissler, Director, Sales Engineering
  • 2. Agenda 1 Syncsort 2 New Features in TSS 3 Big Data Matching Principles 4 Big Data Matching Case Study 5 Demo 6 Questions
  • 3. Who is Syncsort? >7,000 customers 84 of the Fortune 100 Customers in >100 countries Headquarters: Pearl River, NY U.S. LOCATIONS • Burlington, MA; Irvine, CA; Oakbrook Terrace, IL; Rochester, MN GLOBAL PRESENCE • U.K., France, Germany, Netherlands, Israel, Hong Kong & Japan Big Iron to Big Data is a fast-growing market segment composed of solutions that optimize traditional data systems and deliver mission-critical data from these systems to next-generation analytic environments. Global leader in Big Iron to Big Data
  • 4. Syncsort’s Trillium Software System: New Features
  • 5. Collibra Integration Collibra can define and manage data quality rules, but cannot enforce the rules on the data or measure compliance with them. Goal: • Make data accessible, traceable and meaningful to business users. • Automatically pass Collibra rules into Trillium Discovery and get rule compliance data passed back to Collibra. Requirements: • Bi-directional near real-time integration between Trillium Discovery and Collibra DGC for quality measurement and monitoring • Trillium business rule analysis results / data quality metrics shown in Collibra dashboards. • Data Stewards can quickly identify issues and take corrective action when data quality standards are not met.
  • 6. Closing the Loop Collibra Data Governance Center • Enables non-technical users to define business policies and data quality rules in plain language • Makes data quality performance available to all users Trillium Discovery • Imports DGC business rules so technical users can convert them to executable data quality rules • Constantly runs data quality metrics on a near real-time basis and passes results back to Collibra dashboards Diagram labels: Rulebooks to rules; quality test results; bi-directional connectivity; constant sync. Metrics falling below thresholds can trigger a case in Collibra Issue Management
  • 7. Trillium Quality for Big Data Trillium Quality = Best-of-breed data quality solution. Leader in Gartner Data Quality Tools MQ 12 years running. Intelligent Execution = Artificially intelligent dynamic performance optimizer for cluster execution in MapReduce, Amazon EMR, or Spark. Trillium Quality + Intelligent Execution = High performance industry-leading data quality on Big Data and Cloud platforms.
  • 8. • Build data quality processes that ensure high-quality data that meets such key business needs as: o Single customer view (SCV) o Standardized product data o Standardization for fraud detection Trillium Quality – Powerful Data Cleansing • Consolidate data sources on input • Match on party, household, business, etc. • Develop workflows to transform, parse, standardize, match and survive best record • Manage “householding” issues associated with multiple physical addresses under a single account KEY FUNCTIONALITY: • Global address validation with individual country postal rules • Enrich missing postal information, latitude/longitude and other reference data
  • 9. Design Once, Deploy Anywhere Intelligent Execution - Insulate your organization from underlying complexities of Hadoop. Get excellent performance every time without tuning, load balancing, etc. No re-design, re-compile, no re-work ever • Future-proof job designs for emerging compute frameworks, e.g. Spark 2.x • Move from dev to test to production • Move from on-premise to Cloud • Move from one Cloud to another Use existing ETL skills No parallel programming – Java, MapReduce, Spark … No worries about: • Mappers, Reducers • Big side or small side of joins … Design Once in visual GUI Deploy Anywhere! On-Premise, Cloud MapReduce, Spark, Future Platforms Windows, Unix, Linux Batch, Streaming Single Node, Cluster
  • 10. Trillium Quality for Big Data • Deploy data quality workflows as native, parallel MapReduce or Spark processes for optimal efficiency. • Process hundreds of millions of records of data. • Standardize, enhance, and match international data sets with postal and country-code validation. • Integrate, parse, standardize, and match new and legacy customer data from multiple disparate sources. • Increase processing efficiency. • Support failover through Hadoop’s fault-tolerant design; during a node failure, processing is redirected to another node.
  • 11. Two Ways to Get Postal Updates Trillium Postal Download Web Service Trillium Postal Download Web Service is an automated download service introduced in TSS v15.7. The download service allows you to check the status of your postal license and download the postal directories from a browser-based application. TSS Download Center (File Portal) FTP website TSS Download Center allows you to manually download postal directories through Trillium Software’s secure website. See the Trillium Software System Installation Guide for procedures on downloading postal directories through this website.
  • 12. And more … • Trillium Discovery REST APIs installed with TSS server, documentation in Help file for easy integration with other applications like ASG Data Intelligence • Unique ID (UUID) Function • Trillium Language Pack Locale Setting • Apache Tomcat Upgrade to v8.5.32 • Australian (AU) Postal Directories and AU Postal Matcher changes in accordance with Australia Post licensing terms • And more … Example: German locale setting in config.txt key rest_api { value locale "de" }
  • 13. Big Data Matching Finding Similar Needles in a Really Big Haystack
  • 14. Nobody wants a data swamp instead of a data lake! “This sure looked a lot nicer on the whiteboard…”
  • 15. Only 35% of senior executives have a high level of trust in the accuracy of their Big Data Analytics 92% of executives are concerned about the negative impact of data and analytics on corporate reputation Cost of poor data quality rose by 50% in 2017 (Gartner) 84% of CEOs are concerned about the quality of the data they’re basing decisions on The importance of data quality in the enterprise: • Decision making – Trust the data that drives your business • Customer centricity – Get a single, complete and accurate view of your customer for better sales, marketing and customer service • Compliance – Know your data, and ensure its accuracy to meet industry and government regulations • Machine learning & AI – Train your models on accurate data The Data Lake Needs Data Quality
  • 16. “The magic of machine learning is that you build a statistical model based on the most valid dataset for the domain of interest. If the data is junk, then you’ll be building a junk model that will not be able to do its job.” James Kobielus, SiliconANGLE Wikibon Lead Analyst for Data Science, Deep Learning, App Development, 2018
  • 17. Common Machine Learning Applications • Anti-money laundering • Fraud detection • Cybersecurity • Targeted marketing • Recommendation engine • Next best action • Customer churn prevention • Know your customer
  • 18. De-Bugging Your Data Incorrect, Incomplete, Mis-Formatted “Dirty Data” – Mistakes and errors are almost never the patterns you’re looking for in a data set. Correcting and standardizing will tend to boost the signal. Correcting data problems vastly increases a data set’s usefulness for machine learning.
  • 19. De-Bugging Your Data Incorrect, Incomplete, Mis-Formatted “Dirty Data” – Mistakes and errors are almost never the patterns you’re looking for in a data set. Correcting and standardizing will tend to boost the signal. Multiple copies – If your data comes from many sources, as it often does, it may contain multiple records of information about the same person, company, product or other entity. Removing duplicates and enhancing the overall depth and accuracy of knowledge about a single entity can make a huge difference. Correcting data problems vastly increases a data set’s usefulness for machine learning.
  • 20. De-Bugging Your Data Incorrect, Incomplete, Mis-Formatted “Dirty Data” – Mistakes and errors are almost never the patterns you’re looking for in a data set. Correcting and standardizing will tend to boost the signal. Multiple copies – If your data comes from many sources, as it often does, it may contain multiple records of information about the same person, company, product or other entity. Removing duplicates and enhancing the overall depth and accuracy of knowledge about a single entity can make a huge difference. Enrichment – Enriching data with other data sets, such as geospatial, demographics, or firmographics data can provide new depths of analysis. For example, adding latitude and longitude may enable identification of geospatial patterns. Correcting data problems vastly increases a data set’s usefulness for machine learning.
  • 21. De-Bugging Your Data Incorrect, Incomplete, Mis-Formatted “Dirty Data” – Mistakes and errors are almost never the patterns you’re looking for in a data set. Correcting and standardizing will tend to boost the signal. Multiple copies – If your data comes from many sources, as it often does, it may contain multiple records of information about the same person, company, product or other entity. Removing duplicates and enhancing the overall depth and accuracy of knowledge about a single entity can make a huge difference. Enrichment – Enriching data with other data sets, such as geospatial, demographics, or firmographics data can provide new depths of analysis. For example, adding latitude and longitude may enable identification of geospatial patterns. Correcting data problems vastly increases a data set’s usefulness for machine learning. However, traditional data quality software is designed to work on smaller data sets. Traditional data quality processes are an effective method to remove defects.
  • 22. Data Quality Challenges of Enabling Machine Learning 1. Data Cleansing at Scale • Data quality cleansing and preparation routines have to be reproduced at scale, both to get the data ready to train machine learning models, and to comply with business regulations. • Other data quality tools are not designed to work on that scale of data. • Programming data cleansing workflows from scratch in Java MapReduce or Scala for Spark requires specialized skills and takes at least twice as long as designing the same workflows in graphical point and click tools. • Tuning those MapReduce or Spark workflows to get decent performance on a cluster takes even longer, and will have to be re-done if the job is moved to a bigger or smaller cluster, or from an on-premise data center to the Cloud.
  • 23. Harvard Business Review - 2018 “If your data is bad, your machine learning tools are useless.” Anonymous Computer Scientist - 1957 “Garbage in, garbage out.”
  • 24. Common Data Quality Problems • Many data records with different layouts • Lack of standardization of the different fields • Misspellings • Data sourced from third parties does not contain all the necessary fields • Inconsistent data formats (measurements, languages, postal conventions and dates) • Names spelled differently • Different number formatting
  • 25. “But I have a lot of data ….” Is not an excuse for non-compliance. To comply with GDPR, companies must know the answers to the following questions: • What do we know about a given customer? • Where is our customer data? • Is our customer contact information current? • How are we processing customer data? And supply those answers in the form of business processes that provide evidence of compliance. Data Quality is Critical for GDPR Compliance
  • 26. Data Quality Challenges of Enabling Machine Learning 1. Data Cleansing at Scale • Data quality cleansing and preparation routines have to be reproduced at scale, both to ready the data for machine learning, and to comply with business regulations. • Other data quality tools are not designed to work on that scale of data. • Programming data cleansing workflows from scratch in Java MapReduce or Scala for Spark requires specialized skills and takes at least twice as long as designing the same workflows in graphical point and click tools. • Tuning those MapReduce or Spark workflows to get decent performance on a cluster takes even longer, and will have to be re-done if the job is moved to a bigger or smaller cluster, or from an on-premise data center to the Cloud. 2. Entity Resolution • Distinguishing matches that relate to a single specific entity (a person, a company, a part, etc.) requires sophisticated multi-field matching algorithms • Distinguishing matches across massive datasets requires a lot of compute power. Essentially everything has to be compared to everything else, multiple times in multiple ways. • Other data quality tools cannot find and combine records of the same entity at that scale.
  • 27. Entity Resolution at Scale I have billions of records. How do I identify the same entity? Are these two businesses owned by the same person? Are these two accounts in the same building? Is that you, Bob? • Exact match + 36 different fuzzy matching comparison algorithms • Weighted decision trees • Match scoring for confidence thresholds • Multi-field matching, multi-pass and array matching • Transitive matching with multiple different match criteria: A=B, B=C, therefore A=B=C • High performance everything-to-everything comparison across any cluster in MapReduce or Spark Example records (fields: Name, Address1, Address2, City, Postal Code, Phone, Email): Customer Service: ROB SMITH, 3 DAVY DRIVE, S66 7EN, 01189407600, bob.smith@hotmail.com. Web Login: Dr Bob Smith, bob.smith@hotmail.com. Transfer: Mr Robert Smith, 3 Davey Drive, # 16, Rotherham, S667EN, 01189 407 600. Purchase: Bob Smith DR, 3 Davy Dr #16, Rotherham, S667EN, 01189 407 600. ATM Transaction: Dr. B. Smith, 3 Davy Dryve, 16, MALtby, S66 7EN, 01189 407 600, bsmith@gmail.com.
  • 28. Anti-Money Laundering on Hadoop at Global Bank Challenge: Meet AML transaction monitoring and Financial Conduct Authority (FCA) compliance demands • Data too large, diversely scattered to analyze • Disparate data sources – Mainframe, RDBMS, Cloud, etc. Requirements: • Consolidate, clean, and verify data for all analytics and reporting. • MUST be secure: Kerberos and LDAP integration required • Need unmodified copy of mainframe data stored on Hadoop for backup, and compliance archive • MUST have complete, detailed data lineage from origin to end point
  • 29. Impact of Entity Resolution
  • 30. Anti-Money Laundering on Hadoop at Global Bank Solution: • Must be secure – Kerberos, LDAP • Must have lineage – data origin to end point • Massive data volumes • Scattered data – Mainframe, RDBMS, Cloud, … • Must archive unaltered mainframe data Full Anti-Money Laundering regulatory compliance with financial crimes data lake – high performance results at massive scale. • Full end-to-end data lineage supplied to Apache Atlas and ASG Data Intelligence • Cluster-native data verification, enrichment, and demanding multi-field entity resolution on Spark • Unmodified mainframe “Golden Records” stored on Hadoop Bank must monitor transactions to detect Money Laundering for FCA compliance. Machine learning can detect patterns, but … Requires large amounts of current, clean data. • Syncsort DMX-h • Syncsort’s Trillium Quality for Big Data • Syncsort DMX Change Data Capture • Hortonworks HDP
  • 31. “For want of a nail, the kingdom was lost. For want of a data cleansing and integration tool, the whole AI superstructure can fall down.” James Kobielus, SiliconANGLE Wikibon Lead Analyst for Data Science, Deep Learning, App Development, 2018
  • 32. Demo: Big Data Matching With Trillium Quality for Big Data
  • 33. Trillium Quality for Big Data – Data Cleansing at Scale Boost effectiveness of machine learning, AI with complete, standardized data. 1. Visually create and test data quality processes locally 2. Execute in MapReduce or Spark On premise or in the Cloud
  • 34. Identity management
       Name | Address | City | State | Zip | DOB
       Nicholas Saunders | 22 Shady Lane | Mystic | CT | 06355 | 04/12/1971
       N.M Saunders Jnr | Crooked Trail | Trenton | NJ | 08604 | 12/04/1971
       Nick Saunders | 22 Shady Street | Mystic | CT | 06355 | 12/04/1971
       Saunders, Nicholas M. | 22 Shady Lane | Mystic | CT | 06355 | n/a
       Nicholas Sanders | Crooked Road | Trenton | NJ | 08604 | 04/12/1971
       Nicholas Saunders | 22 Shady Street | Mystic | NJ | 08604 | 12/04/1971
       Sources: CUSTOMERS, VENDORS, ACCOUNTS. Result: 360º View
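
The transitive matching idea from slide 27 (A=B, B=C, therefore A=B=C) can be illustrated with a small sketch. This is not Trillium's implementation: the record values are loosely transcribed from the slide, the field assignment is an assumption, and the two match passes (email alone, phone plus postal code) are hypothetical criteria chosen only to show how multi-pass matching and transitive grouping fit together.

```python
import re
from itertools import combinations

# Records loosely transcribed from slide 27; which value belongs to
# which field is an assumption made for illustration.
RECORDS = [
    {"source": "Customer Service", "name": "ROB SMITH", "postcode": "S66 7EN",
     "phone": "01189407600", "email": "bob.smith@hotmail.com"},
    {"source": "Web Login", "name": "Dr Bob Smith", "postcode": "",
     "phone": "", "email": "bob.smith@hotmail.com"},
    {"source": "Transfer", "name": "Mr Robert Smith", "postcode": "S667EN",
     "phone": "01189 407 600", "email": ""},
    {"source": "ATM Transaction", "name": "Dr. B. Smith", "postcode": "S66 7EN",
     "phone": "01189 407 600", "email": "bsmith@gmail.com"},
]

# Hypothetical match passes: each tuple lists fields that must all be
# present and equal after normalization for two records to match.
PASSES = [("email",), ("phone", "postcode")]


def normalize(value):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", value.lower())


def same_entity(a, b):
    """Return True if any pass finds the two records equal on its fields."""
    for fields in PASSES:
        keys_a = [normalize(a[f]) for f in fields]
        keys_b = [normalize(b[f]) for f in fields]
        if all(keys_a) and keys_a == keys_b:
            return True
    return False


def cluster(records):
    """Group records with union-find so indirect matches end up together."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Everything-to-everything comparison: fine for a toy list,
    # quadratic in the number of records.
    for i, j in combinations(range(len(records)), 2):
        if same_entity(records[i], records[j]):
            parent[find(i)] = find(j)

    groups = {}
    for i, record in enumerate(records):
        groups.setdefault(find(i), []).append(record["source"])
    return sorted(groups.values())


print(cluster(RECORDS))
```

The Web Login and Transfer records share no field directly, yet both match the Customer Service record (one on email, the other on phone plus postal code), so they land in the same group. The pairwise loop is also why the slide stresses compute power: the number of comparisons grows with the square of the record count.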

Editor's Notes

  1. For Collibra users: We are the only data quality solution with out-of-the-box bi-directional integration with Collibra Data Governance Center to give you “closed loop” data governance. If Trillium Discovery metrics fall below thresholds, the customer can configure a case to be triggered in Collibra Issue Management. Data stewards are alerted, enabling them to take corrective action.
  2. Intelligent Execution – artificially intelligent dynamic performance optimizer: Visually design your jobs once, and deploy them anywhere – MapReduce, Spark, Linux, Unix, Windows – on premise or in the cloud. No changes or tuning required. Easily move applications from standalone server environments, from MapReduce to Spark, from on premise to cloud – as easy as clicking on a drop-down menu. Future-proof job designs for emerging compute frameworks. Avoid tuning: Intelligent Execution dynamically plans for applications at run-time based on the chosen compute framework. Insulate your users from the underlying complexities of Hadoop and use existing data quality skills. Cut development time in half.
  3. Traditional data quality software is not designed to work at Hadoop scale.
  4. https://www.zdnet.com/article/most-executives-dont-trust-their-organizations-data-analytics-and-ai/
  5. Data is the new source code for AI.
  6. Match scoring for confidence thresholds – in a user-friendly scoring map that you can easily tune. Multi-pass matching for different combinations of fields. Array matching – cross-check multi-word or multi-field information, for example “3 Davy Dr #16” all in Address1 compared to “3 Davey Drive” in Address1 and “#16” in Address2. Even without intentionally trying to conceal identity, it can be difficult to resolve a single person or business from multiple touches across multiple data systems, each with its own data quality issues. Without good entity resolution, money laundering is much easier to get away with. You could hide who you are from a computer as easily as calling yourself Dr. Robert Smith in one place and Bob Smith in another. Data cleansing and standardization at scale, the previous step, will significantly increase the number of matches found, but doing an everything-to-everything comparison across a cluster is still a big challenge. Data scientists should be focused on perfecting anti-money laundering models, not the perfect windowing functions in Spark for doing Levenshtein distance matching on a cluster. Examples of multi-field matching: Name + email; Name + phone; Name + physical address; Email + phone. Multi-pass matching means you go over the data multiple times comparing different combinations of fields. Fuzzy matching algorithm examples: keystroke distance, Levenshtein distance, distance comparison of geo-location, and specialized date, name, street, etc. comparison algorithms.
  7. The Financial Conduct Authority (FCA) is a financial regulatory body in the United Kingdom, but operates independently of the UK Government, and is financed by charging fees to members of the financial services industry. The FCA regulates financial firms providing services to consumers and maintains the integrity of the financial markets in the United Kingdom.
  8. Overall, a good entity resolution solution makes AML teams 81% more productive
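
Note 6 mentions Levenshtein distance and match scoring against confidence thresholds. A minimal single-machine sketch of that combination follows. It is a generic illustration, not the product's scoring map: the `similarity` scaling and the `MATCH` and `REVIEW` threshold values are assumptions picked for the example.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def similarity(a, b):
    """Scale edit distance to 0.0..1.0, where 1.0 means identical."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Hypothetical confidence thresholds; real values would be tuned per data set.
MATCH, REVIEW = 0.85, 0.60


def classify(a, b):
    """Map a similarity score onto match / review / no match."""
    score = similarity(a, b)
    if score >= MATCH:
        return "match"
    if score >= REVIEW:
        return "review"
    return "no match"


print(classify("3 Davy Drive", "3 Davey Drive"))
print(classify("3 Davy Dryve", "3 Davey Drive"))
print(classify("Smith", "Jones"))
```

With these thresholds, a one-letter difference in a street name scores as a match, a two-letter difference falls into the review band, and unrelated names are rejected. Running this comparison for every pair of records across a cluster is the hard part the note refers to.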