Organizations are increasingly challenged to deliver on new initiatives with more data sources and higher volumes of data across divergent, hybrid architectures. With this enterprise challenge in mind, Syncsort introduces Trillium DQ version 16, which brings the full range of data quality functionality forward into a highly scalable, natively executed framework. It works on both traditional and distributed platforms, ensuring consistent processing while achieving the performance necessary for today’s workloads and data volumes.
This webcast highlights the capabilities of Trillium DQ v16 with a focus on its highly scalable, distributed architecture.
View this webinar on-demand to learn:
• How Trillium Discovery provides easy-to-use insight into Big Data, relational, and text-based data sources for rapid understanding of your data
• How Trillium Quality delivers high-scale, high-performance execution for critical data quality processes including global data enrichment and multi-domain entity resolution
The New Trillium DQ: Big Data Insights When and Where You Need Them
1. The New Trillium DQ: Big Data Insights When and Where You Need Them
Harald Smith
2. Speaker
Harald Smith
• Director of Product Marketing, Syncsort
• 20+ years in Information Management with a focus on data quality, integration, and governance
• Co-author of Patterns of Information Management
• Author of two Redbooks on Information Governance and Data Integration
• Blogs on Dataversity and InfoWorld
3. • Only 35% of senior executives have a high level of trust in the accuracy of their Big Data Analytics (KPMG 2016 Global CEO Outlook)
• 92% of executives are concerned about the negative impact of data and analytics on corporate reputation (KPMG 2017 Global CEO Outlook)
• 80% of AI/ML projects are stalling due to poor data quality (Dimensional Research, 2019)
ALL Data Needs Data Quality
“Societal trust in business is arguably at an all-time low and, in a world increasingly driven by data and technology, reputations and brands are ever harder to protect.”
EY “Trust in Data and Why it Matters”, 2017.
The importance of data quality in the enterprise:
• Decision making
• Customer centricity
• Compliance
• Machine learning & AI
4. Key Outcomes
• Maximize the value of data quality across your organization
• Deploy and leverage data quality capabilities consistently when and where needed
• Leverage the resources and skills your organization has invested in whether on-premise or in the cloud
• Scale to address the data challenges you face and deliver high quality results you can trust for critical business decisions
• Integrate best-in-class data quality into your data governance framework to ensure visibility across your organization
• Ensure global data requirements are addressed
Trillium DQ version 16
5. • Single cross-platform scalable architecture
• Native Big Data connectivity
• Distributed execution for all functions
• Full, rich data quality capabilities and familiar interface
• Design-once, deploy-anywhere data quality projects
• Out-of-the-box data governance integration with Collibra
• Broad location and geoenrichment data options
Trillium DQ v16 Highlights
6. Ensures consistent use, processing, and outcomes for traditional or distributed platforms, on-premise or in the cloud
Trillium DQ – common scalable architecture
[Architecture diagram. Labels: UI Server or Edge Node; ODBC, Native RDBMS, Delimited, Fixed, Cobol; Distributed Cluster; Name Node; Distributed HDFS / Distributed Execution / Distributed Storage; Trillium DQ; Metadata; Delimited, HDFS]
7. • 2x faster data cleansing and matching on small distributed cluster – more nodes, faster time
• 3x faster data profiling on small distributed cluster – more nodes, faster time with linear scaling
• 2x faster data profiling even on traditional platforms
Key Outcomes
• More sources of data
• Higher volumes of data
• Faster processing of data
• Fit limited time windows
• Utilize Big Data investments
• Reduced disk space usage
Scalable Architecture
8. Match to corporate credit data with Syncsort Trillium
Challenge
Ensure accurate corporate credit ratings of 330M global companies for clients within contracted timeframes.
• Could not scale to deliver ratings to clients within SLAs – impacting client fulfillment
• Need to process >800M records daily
• Lacked flexibility to address issues with similar company names including volume and variety of data sources
“We can’t afford to miss or mix up information about businesses with similar names. Companies count on our highly accurate predictive scoring to provide fast, accurate ratings for their potential customers and vendors.”
Solution
Trillium DQ for Big Data on Amazon EMR:
• Cleansed, standardized and matched over 130 million recs/hour on basic 10-node test cluster
• Processing full transaction volume daily, and business is growing
• Met the business SLAs with ability to scale
Impact
• Delivered higher levels of matching/data accuracy and satisfied contracts
• Saved software costs – Replaced multiple solutions – Melissa Data, Oracle de-dupe, ...
• Saved Amazon cluster costs and left room for company growth
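As a rough sanity check on the two volume figures quoted for this customer, the arithmetic can be sketched in Python. The sketch assumes the rate measured on the 10-node test cluster is sustained across the whole daily load, which the slide does not state; the constant and function names are illustrative only.

```python
# Figures quoted on the slide; the constant-throughput assumption is ours.
DAILY_RECORDS = 800_000_000      # ">800M records daily" (a lower bound)
RECORDS_PER_HOUR = 130_000_000   # "over 130 million recs/hour" on the test cluster


def hours_to_process(records: int, records_per_hour: int) -> float:
    """Hours needed to process `records` at a constant hourly throughput."""
    if records_per_hour <= 0:
        raise ValueError("records_per_hour must be positive")
    return records / records_per_hour


hours = hours_to_process(DAILY_RECORDS, RECORDS_PER_HOUR)
print(f"Daily load needs about {hours:.1f} hours at the test-cluster rate")
# prints: Daily load needs about 6.2 hours at the test-cluster rate
```

Under that assumption the daily volume fits in roughly a quarter of a day, which is consistent with the stated outcome of meeting the SLAs with room to scale.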
9. Key Outcomes
• Reduce the time for business analysts to discover and understand data on Hadoop platforms
• Allow business analysts who understand the data but have little technical expertise to quickly find data and run data profiling in three steps
• Let analysts explore results and drilldown to details within seconds per view to review and then report on data issues to business leaders
• Scale to large volumes of data sources & attributes so that business analysts can understand the contents of any data source needed for business decisions
Trillium Discovery
10. • Delivers enterprise-trusted Trillium Discovery on traditional and distributed Hadoop platforms for high-volume, scalable data profiling
• Provides complete Trillium Discovery data profiling for analysis & review: attribute metadata, value & pattern frequencies, key & dependency analysis, cross-source join analysis, drill down to any outlier or issue, and more…
• Provides easily configured native connectivity for Big Data sources
• Provides management and monitoring of task execution
• Integrates with the security frameworks (Kerberos, AD, LDAP) of Big Data platforms
Trillium Discovery
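To make "attribute metadata, value & pattern frequencies" concrete, here is a minimal Python sketch of column profiling. It is a generic illustration, not Trillium Discovery's implementation; the function names and the pattern alphabet (`A` for letters, `9` for digits) are assumptions chosen for the example.

```python
from collections import Counter


def pattern_of(value: str) -> str:
    """Reduce a value to its shape: letters become 'A', digits become '9'."""
    return "".join(
        "A" if ch.isalpha() else "9" if ch.isdigit() else ch for ch in value
    )


def profile_column(values):
    """Return basic attribute metadata plus value and pattern frequencies."""
    non_null = [v for v in values if v not in (None, "")]
    return {
        "count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "value_frequencies": Counter(non_null),
        "pattern_frequencies": Counter(pattern_of(v) for v in non_null),
    }


# A hypothetical postal-code column with one mistyped value (letter O for zero).
profile = profile_column(["02110", "02110", "0211O", None, ""])
print(profile["pattern_frequencies"].most_common())
```

Rare patterns such as `9999A` in a column dominated by `99999` are the kind of outlier an analyst would drill down into.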
11. Trillium Discovery – Data Profiling at Scale
3 Steps to Run: Select Source, Run Profiling, Explore Profiles
Native Connectors
▪ HDFS source directories
▪ …
Execute Profiling (1 … n)
Stored Profiling Results
▪ Metadata & Statistics
▪ Frequency Distributions
▪ Drilldown Indices
Evaluate Business Rules
Drilldown to Issues
Share & Govern Results
▪ Integration (APIs)
▪ Notification
▪ Collaboration
12. Key Outcomes
• Match and link any data entity – customers, suppliers, products, etc. – into a trusted single view to support a broad array of business-critical use cases (e.g. Customer 360, fraud, AML)
• Parse and standardize complex multi-domain data, extended with enrichment and verification of critical address and geolocation data – all leveraging out-of-the-box templates
• Utilize “design once, deploy anywhere” approach to speed time-to-value and focus on building data quality business logic while letting the product handle the technical aspects of framework execution with no coding or tuning required
• Leverage the high-performance compute power of distributed Hadoop frameworks to process high volumes within targeted time windows to meet critical Service Level Agreements (SLAs)
Trillium Quality
13. Trillium Quality
• Integrate, parse, standardize, and match new and legacy customer data from multiple disparate sources.
• Provide high-quality entity resolution through multi-domain deduplication and matching with the most comprehensive set of match comparisons available, including fuzzy matching, distance comparisons, and more.
• Standardize, enhance, and match international data sets with postal and country-code validation.
• Deploy data quality workflows as native MapReduce processes for optimal efficiency.
• Process hundreds of millions of records of data.
• Increase processing efficiency.
• Support failover through Hadoop’s fault-tolerant design; during a node failure, processing is redirected to another node.
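As an illustration of what a "distance comparison" between two names can look like, here is a minimal Python sketch using Levenshtein edit distance. This is a standard textbook technique chosen for the example; the slide does not say which comparisons Trillium Quality uses, and the function names and normalization steps are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] after trimming whitespace and lowercasing."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))        # 3
print(similarity("Acme Corp", " acme corp "))  # 1.0
```

A matching process would typically compare such a score against a tuned threshold and combine several fields (name, address, identifiers) before declaring two records the same entity.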
14. Syncsort Trillium Delivers Data You Can Trust
Data Profiling
Business Rules & Data Quality Assessment
Data Validation, Standardization, Enrichment & more
Matching, Entity Resolution & Verification
• Customer 360
• AI/ML
Operational Integrations
• Analytics & Reporting
Data Governance
Trillium Discovery
Trillium Quality + Global Address Verification
Trillium DQ/Trillium DQ for Big Data
• Collibra DGC
• BI tools
15. Global Bank
Challenge
Meet AML transaction monitoring and Financial Conduct Authority (FCA) compliance
• Data volume too large, diversely scattered to analyze
• Disparate data sources – Mainframe, RDBMS, Cloud, etc.
• Maximize the value/ROI of the data lake
Solution
Trillium Quality for Big Data to support next-generation AML transaction monitoring and FCA compliance
• Cluster-native data verification, enrichment, and demanding multi-field entity resolution executing natively on Spark within financial crimes database
• Unmodified mainframe “Golden Records” stored on Hadoop
Impact
• Ensure Anti-Money Laundering regulatory compliance is met through financial crimes data lake – high performance results at massive scale.
• Achieve fast time to value with flexible deployment and ease of use
• Ensure the data lake is trusted source of data feeding critical machine learning-based fraud detection
• Expanding use to additional Customer Engagement solutions and applications.
16. Trillium DQ + Collibra DGC
Trillium Discovery
• Market-leading, best-of-breed data quality solution
• Profile and understand all the critical data
• Leverage highly flexible business rules for the right metrics
• Find ALL the DQ issues
Out-of-the-box integration of DQ metrics with Collibra DGC
✓ Bi-directional solution
✓ Automated & synchronized
✓ Configurable to organizational needs for all profiling results – broad API support
Collibra DGC
• Market-leading, best-of-breed data governance solution
• Establish a common understanding of the business
• Automate governance and stewardship tasks
• Interact with common workflows
Deploy Trillium’s bi-directional data quality integration to ensure:
✓ All key business rules are implemented and validated
✓ DQ metrics are automatically delivered to those who need to know when they need to know
17. Delivers fully integrated data quality with Collibra
Collibra Data Governance Center
✓ Enables non-technical users to define business policies and data quality rules in plain language
✓ Makes data quality metrics and performance available to all users
Trillium Discovery
✓ Automatically receives business rules so a technical user can convert them to executable data quality rules
✓ Constantly runs data quality metrics on the desired schedule and automatically delivers results back to Collibra dashboards
Bi-directional connectivity, constant sync
• Rulebooks to Rules
• Quality test Results
Metrics falling below thresholds can trigger a workflow in Collibra Issue Management
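The threshold check described on this slide can be sketched generically in Python. This is only an illustration of the idea: it calls no Trillium or Collibra API, and the class, function, rule names, and default threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RuleResult:
    """Outcome of one data quality rule over a data set (hypothetical shape)."""
    rule_name: str
    rows_tested: int
    rows_passed: int

    @property
    def pass_rate(self) -> float:
        # An empty data set is treated as passing; this is a design choice.
        return self.rows_passed / self.rows_tested if self.rows_tested else 1.0


def breaches(results, thresholds, default_threshold=0.95):
    """Return (rule, pass_rate, threshold) for every rule below its threshold."""
    flagged = []
    for result in results:
        threshold = thresholds.get(result.rule_name, default_threshold)
        if result.pass_rate < threshold:
            flagged.append((result.rule_name, result.pass_rate, threshold))
    return flagged


results = [
    RuleResult("email_format", rows_tested=1000, rows_passed=990),
    RuleResult("postcode_present", rows_tested=1000, rows_passed=900),
]
for rule, rate, threshold in breaches(results, {"email_format": 0.98}):
    # In a real integration this is where an issue workflow would be raised.
    print(f"{rule}: pass rate {rate:.1%} is below threshold {threshold:.0%}")
```

Keeping thresholds in a separate mapping mirrors the split described above, where business users own the policy and the profiling tool owns the measurement.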
18. Connection to/from Collibra is straightforward
Packaged Workflow
• Out-of-the-box packaged workflow with Trillium Discovery
✓ Easy to set up and run – no complex technical requirements
✓ Part of delivered product – use immediately; no add-on charges; fully supported
• Automatically connects to and delivers content via REST APIs
✓ Collibra provides a single self-service API which facilitates connecting integrations to Collibra DGC
✓ Trillium Discovery provides standard, documented REST APIs – easy to extend application; insulated from underlying product changes; same APIs used by UI, so always tested
19. DNB
Challenge
Poor, inconsistent customer data, and aggressive timelines to address regulatory compliance requirements (BCBS239, GDPR, and AML)
• Focus on whether DNB can measure Data Quality in an ongoing manner
• Concerns around Customer Sanctions Screening and Transaction Monitoring
Solution
Trillium DQ with Collibra DGC to:
• Profile, analyse and provide measurement of data quality concerns
• Integrate data quality rules and metrics between the tools to ensure management has immediate knowledge of improvements/issues
Impact
Pilot phase for 2 branches completed July 2019
• Able to provide proof that data wasn’t “missing”, but pinpointed a number of quality issues requiring improvements
• Able to report to regulators on the findings with proof rather than previous hearsay
Spun off requirements to provide similar work for all branches AND Head Office
Addressing Master Data Analysis on customer data and associated cleanup
See: The Data Journey at DNB: Data Driven Customer Centricity
20. • Rich set of capabilities to discover, classify, profile, and evaluate data across platforms including big data and cloud.
Don’t need to move data off the cluster and can provide drilldown to all issues
• High performance standardization and matching for entity resolution with global coverage in batch & real time.
Meet challenging time windows for critical analytics and regulations
• Native connectivity, execution, and storage for optimized Big Data processing.
Take full advantage of the cluster to expand and scale
• Design once, deploy anywhere architecture that future proofs existing applications.
Leverage the skills you already have
• Easy to connect to & integrate with CRM, ERP, MDM, enrichment, and Data Governance solutions.
Deliver consistent data quality processing and results throughout the organization
Trillium DQ
21. Available end of month
• Linux
• Cloudera
• CDH 5.8.3, 5.11, 5.15.2, 5.16.2
• HDP 2.6.4
• Google Cloud Platform
• Amazon EMR (Trillium Quality – now; Trillium Discovery - coming soon)
• Windows (coming soon)
22. Turn your data into a trusted view of your customers, products and more
Power machine learning and advanced analytics with reliable, fit-for-purpose data
Gain actionable business insights from high-volume disparate data sets from across the enterprise
Deploy industry-leading data quality processes at massive scale, with no coding or Big Data skills required
Trillium DQ evaluates & transforms your data for trusted business insights
23. Next Steps
For more information on Trillium DQ and our other Syncsort
Trillium data quality solutions, please visit:
https://www.syncsort.com/en/solutions/data-quality
https://www.syncsort.com/en/products/trillium-dq
https://www.syncsort.com/en/products/trillium-dq-for-big-data