When consolidating multiple sources of information from across your organization, how do you find the records that relate to the same customer, the same company, or the same product? This is the challenge many businesses face when putting a data lake to work. The problem is made far worse when different systems have not entered the same contact the same way. Is Bob Smith the same as Robert Smith? How about Dr. Robert L. Smith - is he the same person? What about Syncsort, Inc and Sinksort Corp. - are those the same company? You must compare each individual record to every other record in the dataset with some very sophisticated matching algorithms to determine who is who, and you may have to compare the data multiple times in multiple ways to resolve each entity.
To add to the difficulty, suppose your organization has very large volumes of records in its data lake. You don't have to compare a thousand records to a thousand other records multiple times - you must compare a million to a million, or 100 million to 100 million. This kind of compute-intensive comparison can bring even a powerful cluster to its knees.
This is a problem Syncsort customers must solve, and we have developed some very powerful and intelligent software to tackle it.
View this presentation to learn about the challenges of entity resolution at scale, how Syncsort’s Trillium data quality software line has tackled them successfully in production clusters, and to see a demonstration of this software in action.
Big Data Matching - How to Find Two Similar Needles in a Really Big Haystack
1. Trillium Software System: New Features and Big Data Matching
Paige Roberts, Product Marketing Manager
Steve Shissler, Director, Sales Engineering
2. Agenda
1 Syncsort
2 New Features in TSS
3 Big Data Matching Principles
4 Big Data Matching Case Study
5 Demo
6 Questions
3. Who is Syncsort?
>7,000 customers
84 of the Fortune 100
Customers in >100 countries
Headquarters: Pearl River, NY
U.S. LOCATIONS
• Burlington, MA; Irvine, CA;
Oakbrook Terrace, IL; Rochester, MN
GLOBAL PRESENCE
• U.K., France, Germany, Netherlands,
Israel, Hong Kong & Japan
Big Iron to Big Data is a fast-growing
market segment composed of solutions
that optimize traditional data systems
and deliver mission-critical data from
these systems to next-generation
analytic environments.
Global leader in
Big Iron to Big Data
5. Collibra Integration
Collibra can define and manage data quality
rules, but cannot enforce the rules on the
data or measure compliance to them.
Goal:
• Make data accessible, traceable and
meaningful to business users.
• Automatically pass Collibra rules into Trillium
Discovery and get rule compliance data passed
back to Collibra
Requirements:
• Bi-directional near real-time integration
between Trillium Discovery and Collibra DGC
for quality measurement and monitoring
• Trillium business rule analysis results / data
quality metrics shown in Collibra dashboards.
• Data Stewards can quickly identify issues and
take corrective action when data quality
standards are not met.
6. Closing the Loop
Collibra Data Governance Center
• Enables non-technical users to define
business policies and data quality rules
in plain language
• Makes data quality performance
available to all users
Trillium Discovery
• Imports DGC business rules so technical users can convert them to executable data quality rules
• Constantly runs data quality metrics on a near real-time basis and passes results back to Collibra dashboards
Bi-directional connectivity with constant sync: rulebooks become rules, and quality test results flow back.
A metric falling below its threshold can trigger a case in Collibra Issue Management.
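To make the "closing the loop" idea above concrete, here is a minimal sketch in Python of the threshold-to-case pattern. The endpoint paths, payload fields, and authentication shown are hypothetical stand-ins, not the actual Trillium Discovery or Collibra DGC REST APIs; consult each product's documentation for the real interfaces.

# Hypothetical sketch of the "metric below threshold -> Collibra issue" loop.
# Endpoint paths, payload fields, and auth here are illustrative only, not the
# actual Trillium Discovery or Collibra DGC APIs.
import requests

TRILLIUM_API = "https://trillium.example.com/api"   # hypothetical base URL
COLLIBRA_API = "https://collibra.example.com/api"   # hypothetical base URL

def check_rule_and_raise_issue(rule_id: str, threshold: float) -> None:
    # 1. Read the latest compliance metric for a data quality rule.
    metric = requests.get(f"{TRILLIUM_API}/rules/{rule_id}/metrics", timeout=30).json()
    pass_rate = metric["passRate"]   # e.g. 0.93 means 93% of rows passed the rule

    # 2. If the metric falls below the agreed threshold, open a case in
    #    Collibra Issue Management so a data steward is alerted.
    if pass_rate < threshold:
        requests.post(f"{COLLIBRA_API}/issues",
                      json={"name": f"DQ rule {rule_id} below threshold",
                            "description": f"Pass rate {pass_rate:.1%} below {threshold:.1%}"},
                      timeout=30).raise_for_status()

check_rule_and_raise_issue("customer_email_format", threshold=0.95)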
7. Trillium Quality for Big Data
Trillium Quality =
Best-of-breed data quality
solution.
Leader in the Gartner Magic Quadrant for Data Quality Tools, 12 years running.
Intelligent Execution =
Artificially intelligent
dynamic performance
optimizer for cluster
execution in MapReduce,
Amazon EMR, or Spark.
Trillium Quality +
Intelligent Execution =
High performance
industry-leading data
quality on Big Data and
Cloud platforms.
8. Trillium Quality – Powerful Data Cleansing
• Build data quality processes that ensure high-quality data that meets key business needs such as:
o Single customer view (SCV)
o Standardized product data
o Standardization for fraud detection
• Consolidate data sources on input
• Match on party, household, business, etc.
• Develop workflows to transform, parse, standardize, match, and survive the best record (a small survivorship sketch follows this slide)
• Manage “householding” issues associated with multiple physical addresses under a single account
KEY FUNCTIONALITY:
• Global address validation with individual country postal rules
• Enrich missing postal information, latitude/longitude, and other reference data
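To illustrate what "survive the best record" means, here is a minimal Python sketch of survivorship: a group of records already matched to one entity is consolidated into a single best record, field by field. The rules here (prefer non-empty values, newest record wins) are simplified examples for illustration, not Trillium's actual survivorship rule engine.

# Illustrative survivorship sketch: consolidate a matched group into one best record.
from datetime import date

matched_group = [
    {"name": "Bob Smith",       "email": "",                      "phone": "01189 407 600", "updated": date(2017, 3, 1)},
    {"name": "Dr Robert Smith", "email": "bob.smith@hotmail.com", "phone": "",              "updated": date(2018, 6, 9)},
    {"name": "R. Smith",        "email": "bsmith@gmail.com",      "phone": "01189407600",   "updated": date(2016, 1, 15)},
]

def survive_field(records, field):
    """Pick the surviving value: non-empty values only, newest record wins."""
    candidates = [r for r in records if r[field]]
    if not candidates:
        return ""
    return max(candidates, key=lambda r: r["updated"])[field]

best_record = {f: survive_field(matched_group, f) for f in ("name", "email", "phone")}
print(best_record)
# {'name': 'Dr Robert Smith', 'email': 'bob.smith@hotmail.com', 'phone': '01189 407 600'}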
9. Design Once, Deploy Anywhere
Intelligent Execution - Insulate your organization from underlying complexities of Hadoop.
Get excellent performance every time
without tuning, load balancing, etc.
No re-design, no re-compile, no re-work, ever
• Future-proof job designs for emerging
compute frameworks, e.g. Spark 2.x
• Move from dev to test to production
• Move from on-premise to Cloud
• Move from one Cloud to another
Use existing ETL skills
No parallel programming – Java, MapReduce, Spark …
No worries about:
• Mappers, Reducers
• Big side or small side of joins …
Design once in a visual GUI. Deploy anywhere: on-premise or Cloud; MapReduce, Spark, or future platforms; Windows, Unix, or Linux; batch or streaming; single node or cluster.
10. Trillium Quality for Big Data
• Deploy data quality workflows as native, parallel MapReduce or Spark
processes for optimal efficiency.
• Process hundreds of millions of records of data.
• Standardize, enhance, and match international data sets with postal and
country-code validation.
• Integrate, parse, standardize, and match new and legacy customer data
from multiple disparate sources.
• Increase processing efficiency.
• Support failover through Hadoop’s fault-tolerant design; during a node
failure, processing is redirected to another node.
11. Two Ways to Get Postal Updates
Trillium Postal Download Web Service
Trillium Postal Download Web Service is an
automated download service introduced in
TSS v15.7. The download service allows you
to check the status of your postal license and
download the postal directories from a
browser-based application.
TSS Download Center (File Portal) FTP website
TSS Download Center allows you to manually download
postal directories through Trillium Software’s secure
website. See the Trillium Software System Installation
Guide for procedures on downloading postal directories
through this website.
12. And more …
• Trillium Discovery REST APIs are installed with the TSS server; documentation is in the Help file for easy integration with other applications like ASG Data Intelligence
• Unique ID (UUID) Function
• Trillium Language Pack Locale Setting
• Apache Tomcat Upgrade to v8.5.32
• Australian (AU) Postal Directories and AU Postal
Matcher changes in accordance with Australia Post
licensing terms
• And more …
Example:
German locale setting in config.txt
key rest_api {
value locale "de"
}
14. Nobody wants a data swamp instead of a data lake!
“This sure looked a lot nicer on the
whiteboard…”
15. The Data Lake Needs Data Quality
• Only 35% of senior executives have a high level of trust in the accuracy of their Big Data analytics
• 92% of executives are concerned about the negative impact of data and analytics on corporate reputation
• The cost of poor data quality rose by 50% in 2017 (Gartner)
• 84% of CEOs are concerned about the quality of the data they’re basing decisions on
The importance of data quality in the enterprise:
• Decision making – Trust the data that drives your business
• Customer centricity – Get a single, complete, and accurate view of your customer for better sales, marketing, and customer service
• Compliance – Know your data, and ensure its accuracy to meet industry and government regulations
• Machine learning & AI – Train your models on accurate data
16. “The magic of machine learning is that you build a statistical model based on the most valid dataset for the domain of interest. If the data is junk, then you’ll be building a junk model that will not be able to do its job.”
James Kobielus
Lead Analyst for Data Science, Deep Learning, App Development
SiliconANGLE Wikibon, 2018
17. Common Machine Learning Applications
• Anti-money laundering
• Fraud detection
• Cybersecurity
• Targeted marketing
• Recommendation engine
• Next best action
• Customer churn prevention
• Know your customer
21. De-Bugging Your Data
Incorrect, Incomplete, Mis-Formatted “Dirty Data” – Mistakes and errors are almost never the patterns you’re looking for in a data set. Correcting and standardizing will tend to boost the signal.
Multiple copies – If your data comes from many sources, as it often does, it may contain multiple records of information about the same person, company, product, or other entity. Removing duplicates and enhancing the overall depth and accuracy of knowledge about a single entity can make a huge difference.
Enrichment – Enriching data with other data sets, such as geospatial, demographics, or firmographics data, can provide new depths of analysis. For example, adding latitude and longitude may enable identification of geospatial patterns.
Correcting data problems vastly increases a data set’s usefulness for machine learning. Traditional data quality processes are an effective method to remove defects. However, traditional data quality software is designed to work on smaller data sets.
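A minimal Python sketch of the kind of correction described above: standardizing fields so that trivially different representations of the same value line up, which is the prerequisite for both de-duplication and enrichment joins (for example, appending latitude/longitude keyed on a cleaned postal code). The records and rules are simplified illustrations, not Trillium's cleansing logic.

# Simplified standardization sketch: collapse spacing/case and normalize keys.
import re

def standardize(record: dict) -> dict:
    name = " ".join(record["name"].split()).title()          # collapse spaces, fix case
    phone = re.sub(r"\D", "", record["phone"])               # digits only
    postcode = record["postcode"].replace(" ", "").upper()   # canonical postcode key
    return {"name": name, "phone": phone, "postcode": postcode}

raw = [
    {"name": "ROB  SMITH", "phone": "01189 407 600", "postcode": "S66 7EN"},
    {"name": "rob smith",  "phone": "01189407600",   "postcode": "s667en"},
]

cleaned = [standardize(r) for r in raw]
print(cleaned[0] == cleaned[1])   # True - the two raw records now collapse to one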
22. Data Quality Challenges of Enabling Machine Learning
1. Data Cleansing at Scale
• Data quality cleansing and preparation routines have to be reproduced at scale, both to get the data ready to train
machine learning models, and to comply with business regulations.
• Other data quality tools are not designed to work on that scale of data.
• Programming data cleansing workflows from scratch in Java MapReduce or Scala for Spark requires specialized skills
and takes at least twice as long as designing the same workflows in graphical point and click tools.
• Tuning those MapReduce or Spark workflows to get decent performance on a cluster takes even longer, and will
have to be re-done if the job is moved to a bigger or smaller cluster, or from an on-premise data center to the Cloud.
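As a rough picture of what "data cleansing at scale" means in practice, here is a minimal hand-written PySpark sketch of a cleansing step running as a native Spark job. This is an illustration of the general approach, not the workflow code Trillium Quality for Big Data generates; the input path and column names are assumptions.

# Minimal PySpark sketch of a cleansing step as a native Spark job (illustrative only).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-cleansing-sketch").getOrCreate()

customers = spark.read.parquet("s3://example-bucket/raw/customers/")  # assumed path

cleaned = (customers
           .withColumn("name", F.upper(F.trim(F.col("name"))))                 # standardize case/whitespace
           .withColumn("phone", F.regexp_replace(F.col("phone"), r"\D", ""))   # keep digits only
           .withColumn("postcode", F.upper(F.regexp_replace(F.col("postcode"), r"\s", "")))
           .dropDuplicates(["name", "postcode", "phone"]))                     # exact-duplicate removal

cleaned.write.mode("overwrite").parquet("s3://example-bucket/clean/customers/")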
23. Harvard Business Review - 2018
“If your data is bad, your machine
learning tools are useless.”
Anonymous Computer Scientist - 1957
“Garbage in, garbage out.”
24. Common Data Quality Problems
• Many data records with different
layouts
• Lack of standardization of the
different fields
• Misspellings
• Data sourced from third parties does
not contain all the necessary fields
• Inconsistent data formats
(measurements, languages, postal
conventions and dates)
• Names spelled differently
• Different number formatting
25. “But I have a lot of data…” is not an excuse for non-compliance.
To comply with GDPR, companies must know the
answers to the following questions:
• What do we know about a given customer?
• Where is our customer data?
• Is our customer contact information current?
• How are we processing customer data?
And supply those answers in the form of business
processes that provide evidence of compliance.
Data Quality is Critical for GDPR Compliance
26. Data Quality Challenges of Enabling Machine Learning
1. Data Cleansing at Scale
• Data quality cleansing and preparation routines have to be reproduced at scale, both to ready the data for machine
learning, and to comply with business regulations.
• Other data quality tools are not designed to work on that scale of data.
• Programming data cleansing workflows from scratch in Java MapReduce or Scala for Spark requires specialized skills
and takes at least twice as long as designing the same workflows in graphical point and click tools.
• Tuning those MapReduce or Spark workflows to get decent performance on a cluster takes even longer, and will
have to be re-done if the job is moved to a bigger or smaller cluster, or from an on-premise data center to the Cloud.
2. Entity Resolution
• Distinguishing matches that relate to a single specific entity (a person, a company, a part, etc.) requires sophisticated
multi-field matching algorithms
• Distinguishing matches across massive datasets requires a lot of compute power. Essentially everything has to be
compared to everything else, multiple times in multiple ways.
• Other data quality tools cannot find and combine records of the same entity at that scale.
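One common, general way to tame the "everything compared to everything else" cost is blocking: only records that share a coarse candidate key are paired for detailed matching. The PySpark sketch below illustrates that idea under assumed column names; it is a generic technique shown for context, not a description of Trillium's internal matching engine.

# Blocking sketch: limit candidate pairs to records sharing a coarse key.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("blocking-sketch").getOrCreate()
records = spark.read.parquet("s3://example-bucket/clean/customers/")  # assumed path

# Candidate key: postcode plus first letter of the surname (illustrative choice).
keyed = records.withColumn(
    "block_key",
    F.concat_ws("|", F.col("postcode"), F.substring(F.col("surname"), 1, 1)))

# Self-join within each block produces candidate pairs; the quadratic blow-up
# is confined to records inside the same block instead of the whole dataset.
a = keyed.alias("a")
b = keyed.alias("b")
candidate_pairs = (a.join(b, on="block_key")
                     .where(F.col("a.record_id") < F.col("b.record_id")))

print(candidate_pairs.count(), "candidate pairs to score with fuzzy matching")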
27. Entity Resolution at Scale
I have billions of records. How do I identify the same entity? Are these two businesses owned by the same person? Are these two accounts in the same building? Is that you, Bob?
Example: the same person recorded in five different systems:
• Customer Service: ROB SMITH, 3 DAVY DRIVE, S66 7EN, 01189407600, bob.smith@hotmail.com
• Web Login: Dr Bob Smith, bob.smith@hotmail.com
• Transfer: Mr Robert Smith, 3 Davey Drive, # 16, Rotherham, S667EN, 01189 407 600
• Purchase: Bob Smith DR, 3 Davy Dr #16, Rotherham, S667EN, 01189 407 600
• ATM Transaction: Dr. B. Smith, 3 Davy Dryve 16, MALtby, S66 7EN, 01189 407 600, bsmith@gmail.com
How the records are resolved:
• Exact match + 36 different fuzzy matching comparison algorithms
• Weighted decision trees
• Match scoring for confidence thresholds
• Multi-field matching, multi-pass and array matching
• Transitive matching with multiple different match criteria: A=B, B=C, therefore A=B=C
• High-performance everything-to-everything comparison across any cluster in MapReduce or Spark
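To make the weighted multi-field scoring and transitive (A=B, B=C, therefore A=B=C) grouping above more tangible, here is a toy Python sketch. The standard library's SequenceMatcher stands in for Trillium's 36 comparison algorithms, and the field weights and 0.8 threshold are arbitrary illustrative values, not product defaults.

# Toy sketch: weighted multi-field match scoring plus transitive grouping.
from difflib import SequenceMatcher

records = {
    "A": {"name": "Rob Smith",       "address": "3 Davy Drive",    "phone": "01189407600"},
    "B": {"name": "Mr Robert Smith", "address": "3 Davey Drive",   "phone": "01189 407 600"},
    "C": {"name": "Dr. B. Smith",    "address": "3 Davy Dryve 16", "phone": "01189 407 600"},
}
WEIGHTS = {"name": 0.4, "address": 0.4, "phone": 0.2}   # illustrative weights

def similarity(x: str, y: str) -> float:
    x, y = x.lower().replace(" ", ""), y.lower().replace(" ", "")
    return SequenceMatcher(None, x, y).ratio()

def match_score(r1: dict, r2: dict) -> float:
    return sum(w * similarity(r1[f], r2[f]) for f, w in WEIGHTS.items())

# Pairwise scoring: keep pairs above the confidence threshold.
ids = sorted(records)
links = {(i, j) for n, i in enumerate(ids) for j in ids[n + 1:]
         if match_score(records[i], records[j]) >= 0.8}

# Transitive closure (A=B and B=C therefore A=B=C) with a tiny union-find.
parent = {i: i for i in ids}
def find(i):
    while parent[i] != i:
        i = parent[i]
    return i
for i, j in links:
    parent[find(j)] = find(i)

groups = {}
for i in ids:
    groups.setdefault(find(i), []).append(i)
print(groups)   # records that resolve to the same entity end up in one group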
28. Anti-Money Laundering on Hadoop at Global Bank
Challenge: Meet AML transaction monitoring and Financial Conduct
Authority (FCA) compliance demands
• Data too large and too widely scattered to analyze
• Disparate data sources – Mainframe, RDBMS, Cloud, etc.
Requirements:
• Consolidate, clean, and verify data for all analytics and
reporting.
• MUST be secure: Kerberos and LDAP integration
required
• Need an unmodified copy of mainframe data stored on Hadoop for backup and compliance archive
• MUST have complete, detailed data
lineage from origin to end point
30. Anti-Money Laundering on Hadoop at Global Bank
Challenge:
• The bank must monitor transactions to detect money laundering for FCA compliance.
• Machine learning can detect patterns, but it requires large amounts of current, clean data.
Requirements:
• Must be secure – Kerberos, LDAP
• Must have lineage – data origin to end point
• Massive data volumes
• Scattered data – Mainframe, RDBMS, Cloud, …
• Must archive unaltered mainframe data
Solution: Full Anti-Money Laundering regulatory compliance with a financial crimes data lake – high-performance results at massive scale.
• Full end-to-end data lineage supplied to Apache Atlas and ASG Data Intelligence
• Cluster-native data verification, enrichment, and demanding multi-field entity resolution on Spark
• Unmodified mainframe “Golden Records” stored on Hadoop
Technology:
• Syncsort DMX-h
• Syncsort’s Trillium Quality for Big Data
• Syncsort DMX Change Data Capture
• Hortonworks HDP
31. “For want of a nail, the kingdom was lost. For want of a data cleansing and integration tool, the whole AI superstructure can fall down.”
James Kobielus
Lead Analyst for Data Science, Deep Learning, App Development
SiliconANGLE Wikibon, 2018
32. Demo: Big Data Matching
With Trillium Quality for Big Data
33. Trillium Quality for Big Data – Data Cleansing at Scale
Boost effectiveness of machine learning, AI with complete, standardized data.
1. Visually create and test data
quality processes locally
2. Execute in MapReduce or Spark
On premise or in the Cloud
34. Identity Management
Name | Address | City | State | Zip | DOB
Nicholas Saunders | 22 Shady Lane | Mystic | CT | 06355 | 04/12/1971
N.M Saunders Jnr | Crooked Trail | Trenton | NJ | 08604 | 12/04/1971
Nick Saunders | 22 Shady Street | Mystic | CT | 06355 | 12/04/1971
Saunders, Nicholas M. | 22 Shady Lane | Mystic | CT | 06355 | n/a
Nicholas Sanders | Crooked Road | Trenton | NJ | 08604 | 04/12/1971
Nicholas Saunders | 22 Shady Street | Mystic | NJ | 08604 | 12/04/1971
CUSTOMERS – VENDORS – ACCOUNTS: 360º View
For Collibra users:
We are the only data quality solution with out-of-the-box bi-directional integration with Collibra Governance Center to give you “closed loop” data governance
If Trillium Discovery metrics fall below thresholds, customers can configure the integration so that a case is triggered in Collibra Issue Management
Data stewards alerted, enabling them to take corrective actions
Intelligent execution – artificially intelligent dynamic performance optimizer:
Visually design your jobs once, and deploy them anywhere – MapReduce, Spark, Linux, Unix, Windows – on premise or in the cloud. No changes or tuning required.
Easily move applications from standalone server environments, from MapReduce to Spark, from on premise to cloud – as easy as clicking on a drop-down menu
Future-proof job designs for emerging compute frameworks
Avoid tuning – Intelligent Execution dynamically plans application execution at run-time based on the chosen compute framework
Insulate your users from the underlying complexities of Hadoop and use existing data quality skills
Cut development time in half
Traditional data quality software is not designed to work at Hadoop scale.
Match scoring for confidence thresholds – in a user-friendly scoring map that you can easily tune
Multi-pass matching for different combinations of fields
Array matching – cross-check multi-word or multi-field information; for example, “3 Davy Dr #16” all in Address1 compared to “3 Davey Drive” in Address1 and “#16” in Address2
Even without intentionally trying to conceal identity, it can be difficult to resolve a single person or business from multiple touches across multiple data systems, each with its own data quality issues. Without good entity resolution, money laundering is much easier to get away with. You could hide who you are from a computer as easily as calling yourself Dr. Robert Smith in one place and Bob Smith in another.
Data cleansing and standardization at scale, the previous step, will significantly increase the number of matches found, but doing an everything-to-everything comparison across a cluster is still a big challenge. Data scientists should be focused on perfecting anti-money-laundering models, not on perfecting windowing functions in Spark for doing Levenshtein distance matching on a cluster.
Examples of multi-field matching:
Name + email
Name + phone
Name + physical address
Email + phone
Multi-pass matching means you go over the data multiple times comparing different combinations of fields.
Fuzzy matching algorithm examples: keystroke distance, Levenshtein distance, geo-location distance comparison, and specialized comparison algorithms for dates, names, streets, and other fields
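A small Python sketch of the multi-pass idea described in these notes: each pass compares a different combination of fields (name + email, name + phone, email + phone), and a pair is linked if any pass matches. Exact comparison on normalized values stands in here for the fuzzy comparisons; the records and passes are illustrative only.

# Sketch of multi-pass matching: one pass per field combination, link on any pass.
from itertools import combinations

PASSES = [("name", "email"), ("name", "phone"), ("email", "phone")]

records = [
    {"id": 1, "name": "bob smith",    "email": "bob.smith@hotmail.com", "phone": "01189407600"},
    {"id": 2, "name": "robert smith", "email": "bob.smith@hotmail.com", "phone": "01189407600"},
    {"id": 3, "name": "bob smith",    "email": "",                      "phone": "01189407600"},
]

def pass_match(r1, r2, fields):
    """A pass matches only if every field in the combination is present and equal."""
    return all(r1[f] and r2[f] and r1[f] == r2[f] for f in fields)

links = set()
for fields in PASSES:                         # one pass per field combination
    for r1, r2 in combinations(records, 2):
        if pass_match(r1, r2, fields):
            links.add((r1["id"], r2["id"]))

print(links)   # {(1, 2), (1, 3)}: 1-2 matched on the email+phone pass, 1-3 on name+phone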
The Financial Conduct Authority (FCA) is a financial regulatory body in the United Kingdom. It operates independently of the UK Government and is financed by charging fees to members of the financial services industry. The FCA regulates financial firms providing services to consumers and maintains the integrity of the financial markets in the United Kingdom.
Overall, a good entity resolution solution makes AML teams 81% more productive.