2. WHO AM I
• BI consultant @ Ordina
• member of SQLUG.be
• MCTS, MCITP in SQL Server 2008
• working with Microsoft BI for over 2 years
• beer and comic books enthusiast
• married with children…
3. INTRODUCTION
data quality?
Data are of high quality "if they are fit for their intended uses in
operations, decision making and planning" (J. M. Juran).
- Wikipedia on Data Quality
• achieved through people, technology & processes
• can be measured with various dimensions
o accuracy
o consistency
o completeness
o uniqueness (duplicates)
o timeliness
o validity
• bad data = bad business
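A few of the dimensions above can be turned into simple ratios. The following is a minimal sketch, not part of the original slides: the sample records, the function names and the four-digit zip-code pattern are all assumptions made for illustration.

```python
import re

# Hypothetical sample records; "" marks a missing value.
customers = [
    {"id": 1, "last_name": "Doe", "zip": "1000"},
    {"id": 2, "last_name": "", "zip": "99999"},
    {"id": 3, "last_name": "Verbeeck", "zip": "2000"},
    {"id": 3, "last_name": "Verbeeck", "zip": "2000"},
]

def completeness(rows, field):
    """Share of rows where the field is filled in."""
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows)

def uniqueness(rows, field):
    """Share of distinct values among all values of the field."""
    values = [r.get(field) for r in rows]
    return len(set(values)) / len(values)

def validity(rows, field, pattern):
    """Share of rows whose value fully matches the regular expression."""
    regex = re.compile(pattern)
    valid = sum(1 for r in rows if regex.fullmatch(str(r.get(field) or "")))
    return valid / len(rows)

print(completeness(customers, "last_name"))   # 0.75
print(uniqueness(customers, "id"))            # 0.75
print(validity(customers, "zip", r"\d{4}"))   # 0.75 (assumed 4-digit zip rule)
```

Accuracy, consistency and timeliness usually need a reference source or timestamps, so they are left out of this sketch.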
4. INTRODUCTION
Data quality issues, each with a sample data problem:
• Standard: Are data elements consistently defined and understood?
o Gender code = M, F, U in one system and Gender code = 0, 1, 2 in another system
• Complete: Is all necessary data present?
o 20% of customers’ last name is blank, 50% of zip-codes are 99999
• Accurate: Does the data accurately represent reality or a verifiable source?
o A supplier is listed as ‘Active’ but went out of business six years ago
• Valid: Do data values fall within acceptable ranges?
o Temperature recordings should be between -100°C and +100°C
• Unique: Data appears several times
o Prince, The Artist formerly known as Prince, The Artist, … are they the same person?
5. INTRODUCTION
• Monitoring: Tracking and monitoring the state of Quality activities and Quality of Data.
• Cleansing: Amend, remove or enrich data that is incorrect or incomplete. This includes correction, standardization and enrichment.
• Profiling: Analysis of the data source to provide insight into the quality of the data and help to identify data quality issues.
• Matching: Identifying, linking or merging related entries within or across sets of data.
6. OUTLINE
• introduction
• overview of data quality services
• building a knowledge base
• data cleansing & matching
• SSIS integration
• conclusion
7. OVERVIEW OF DQS
Data Quality Services (DQS) is a
Knowledge-Driven data quality solution,
enabling IT Pros and data stewards to easily
improve the quality of their data
8. OVERVIEW OF DQS
• Knowledge-Driven: Based on a Data Quality Knowledge Base (DQKB)
• Semantics: Data Domains capture the semantics of your data
• Knowledge Discovery: Acquires additional knowledge the more you use it
• Open and Extendible: Support use of user-generated knowledge and IP by 3rd party reference data providers
• Easy to use: Compelling user experience designed for increased productivity
9. OVERVIEW OF DQS
• easy installation
• pre-installation checks
o SQL Server 2012 database engine (server)
o .NET 4.0 & IE 6.0 or higher (client)
• installation of DQS using SQL Server set-up
• post-installation tasks
o run DQSInstaller.exe
o grant DQS roles to users
o enable TCP/IP
11. BUILDING A KNOWLEDGE BASE
• Build: Knowledge Management (Discover / Explore Data / Connect)
• Knowledge Base
• Use: DQ Projects
• Integrated Profiling
12. BUILDING A KNOWLEDGE BASE
Knowledge Base
• Domains (represent the data type)
• Values
• Composite Domains
• 3rd party Reference Data
• Rules & Relations
• Matching Policy
15. BUILDING A KNOWLEDGE BASE
• iterative process
• knowledge discovery
• gather knowledge from
o Excel
o SQL Server
• profiling of data
o not the same as SSIS profiling task!
• automatically detects anomalies
16. BUILDING A KNOWLEDGE BASE
• domain management
• knowledge about fields is kept in domains
• data steward can
o create rules
o assign synonyms and corrections
o create term-based relations (e.g. str. --> street)
o link domains together into composite domains
• import knowledge from
o reference data (e.g. Azure Marketplace)
o other knowledge bases
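To make the idea of a domain more concrete, here is a minimal sketch of the kind of knowledge a data steward attaches to one. The `Domain` class and its field names are hypothetical stand-ins for illustration; they are not the DQS object model or API.

```python
from dataclasses import dataclass, field

@dataclass
class Domain:
    """Hypothetical stand-in for a knowledge-base domain (not the DQS object model)."""
    name: str
    synonyms: dict = field(default_factory=dict)        # whole value -> leading value
    term_relations: dict = field(default_factory=dict)  # single term -> replacement term
    rules: list = field(default_factory=list)           # predicates a valid value must satisfy

    def standardize(self, value: str) -> str:
        # Apply term-based relations word by word, then whole-value synonyms.
        terms = [self.term_relations.get(t.lower(), t) for t in value.split()]
        result = " ".join(terms)
        return self.synonyms.get(result.lower(), result)

    def is_valid(self, value: str) -> bool:
        # A value is valid only if every domain rule accepts it.
        return all(rule(value) for rule in self.rules)

street = Domain(
    name="Street",
    term_relations={"str.": "street", "st.": "street"},
    rules=[lambda v: len(v.strip()) > 0],
)
company = Domain(name="Company", synonyms={"ms": "Microsoft"})

print(street.standardize("High Str. 13"))  # High street 13
print(company.standardize("MS"))           # Microsoft
print(street.is_valid(""))                 # False
```

A composite domain could be modelled along the same lines as a list of such domains applied to the parts of one field, for example street, number and city within an address.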
18. DATA CLEANSING & MATCHING
• cleansing
• why?
o identifies incomplete or incorrect data
o standardizes and enriches data by using domain values, domain rules and reference data
• examples
o St. --> street (corrected)
o Microsot --> Microsoft (corrected)
o john.doe@hotmail (invalid)
o 0472/34672 (invalid)
o Verbeek --> Verbeeck (suggested)
• DQS cleansing
o create a knowledge base or select an existing one
o create a data quality project
o 2-step process
– computer assisted cleansing
– interactive cleansing
o export results
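The cleansing outcomes on this slide (corrected, suggested, invalid) can be illustrated with a small sketch. This is not how DQS implements cleansing: the known values, the correction table, the e-mail pattern, the 0.8 similarity cutoff and the extra "correct" and "new" labels are all assumptions, and Python's `difflib` stands in for whatever similarity logic the product uses.

```python
import difflib
import re

# Hypothetical knowledge: known-good values and explicit corrections.
KNOWN_NAMES = ["Verbeeck", "Janssens", "Peeters"]
CORRECTIONS = {"Microsot": "Microsoft", "St.": "street"}
EMAIL_RULE = re.compile(r"[^@\s]+@[^@\s]+\.[a-z]{2,}", re.IGNORECASE)

def cleanse_value(value, cutoff=0.8):
    """Return (output value, status) for a single value."""
    if value in KNOWN_NAMES:
        return value, "correct"
    if value in CORRECTIONS:
        return CORRECTIONS[value], "corrected"
    # No exact knowledge: propose the closest known value, if it is close enough.
    close = difflib.get_close_matches(value, KNOWN_NAMES, n=1, cutoff=cutoff)
    if close:
        return close[0], "suggested"
    return value, "new"

def cleanse_email(value):
    """Apply a domain rule: values that break the rule are flagged invalid."""
    status = "correct" if EMAIL_RULE.fullmatch(value) else "invalid"
    return value, status

print(cleanse_value("Microsot"))          # ('Microsoft', 'corrected')
print(cleanse_value("Verbeek"))           # ('Verbeeck', 'suggested')
print(cleanse_email("john.doe@hotmail"))  # ('john.doe@hotmail', 'invalid')
```

In terms of the 2-step process, the functions play the role of the computer-assisted step; the "suggested" and "new" results are what a data steward would then review in the interactive step.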
19. DATA CLEANSING & MATCHING
• matching
• why?
o identify duplicates within the data source
o create consolidated view of data
• examples
o Prince / The Artist Formerly Known As Prince / The Artist
o Jon Doe, High Street 13, NY, doe@gmail.com / John Doe, High Str, NY, doe@gmail.com
• DQS matching
o build a matching policy in KB
o matching training
o create matching project
o choose survivors
[Screenshot: DQ Client – Match Results]
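A matching policy can be thought of as a set of weighted field comparisons with a minimum score. The sketch below illustrates that idea on the Jon Doe / John Doe example; the weights, the 0.8 threshold, the third record and the use of `difflib.SequenceMatcher` are assumptions for illustration, not the DQS matching algorithm.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical matching policy: field weights sum to 1.0.
POLICY = {"name": 0.4, "address": 0.2, "email": 0.4}
THRESHOLD = 0.8  # assumed minimum score for two records to count as a match

def similarity(a, b):
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(r1, r2, policy=POLICY):
    """Weighted sum of per-field similarities."""
    return sum(w * similarity(r1[f], r2[f]) for f, w in policy.items())

def find_matches(rows, threshold=THRESHOLD):
    """Return index pairs of records whose score reaches the threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(rows), 2)
        if match_score(a, b) >= threshold
    ]

records = [
    {"name": "Jon Doe", "address": "High Street 13, NY", "email": "doe@gmail.com"},
    {"name": "John Doe", "address": "High Str, NY", "email": "doe@gmail.com"},
    {"name": "Jane Roe", "address": "Low Road 7, LA", "email": "roe@example.com"},
]

print(find_matches(records))  # [(0, 1)]
```

Comparing every pair of records is quadratic in the number of rows, which is acceptable for a sketch but not for large data sources.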
21. DATA CLEANSING & MATCHING
• create a cleansing project
• uses knowledge gathered in a DQS knowledge base
• simple user-friendly process
• profile results
22. DATA CLEANSING & MATCHING
• create a matching project
• uses a matching policy created in a knowledge base
• eliminates duplicates
• profile results
• the more knowledge is added, the better the results will be
o tip: clean up the data first using a cleansing project
• choose survivors at the end
• export results into .csv or SQL Server
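Choosing a survivor and exporting the result can be sketched as follows. The "most complete record wins, ties broken by length" rule is an assumed survivorship rule chosen for illustration, and the export writes to an in-memory buffer rather than to a real .csv file or SQL Server table.

```python
import csv
import io

def completeness(record):
    """Number of non-empty fields in a record."""
    return sum(1 for v in record.values() if v not in (None, ""))

def choose_survivor(cluster):
    """Pick the most complete record; break ties by total text length."""
    def rank(record):
        total_length = sum(len(str(v or "")) for v in record.values())
        return (completeness(record), total_length)
    return max(cluster, key=rank)

def export_csv(rows):
    """Serialize rows to CSV text, using the first row's keys as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical cluster of records that a matching project grouped together.
cluster = [
    {"name": "Jon Doe", "address": "High Street 13, NY", "email": ""},
    {"name": "John Doe", "address": "High Str, NY", "email": "doe@gmail.com"},
]

survivor = choose_survivor(cluster)
print(export_csv([survivor]))
```

Here the second record survives because it has all three fields filled in, even though its address is the shorter one.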
24. SSIS INTEGRATION
SSIS Data Flow
• SSIS Package: Source + Mapping --> Data Correction Component --> Destination
• Knowledge Base: Values/Rules, Reference Data Definition
26. SSIS INTEGRATION
• cleaning as a batch process
• cleansing only; matching is not (yet?) possible
• composite domains are supported
28. CONCLUSION
• Knowledge-driven
o Rich Knowledge Base
o Continuous improvement and knowledge acquisition
o Build once, reuse for multiple DQ improvements
• Easy To Use
o Focus on productivity and user experience
o Designed for business users
o Out-of-the-box knowledge
• Open & Extendible
o Focus on cloud-based Reference Data
o User-generated knowledge
o Integration with SSIS
29. RESOURCES
• DQS Team Blog @ MSDN
http://blogs.msdn.com/b/dqs/
• DQS documentation @ MSDN
http://msdn.microsoft.com/en-us/library/ff877917(v=sql.110).aspx
• SQL Server 2012 Resource Center (nice How-To videos)
http://msdn.microsoft.com/en-us/sqlserver/ff898410.aspx
• DQS Forum @ MSDN
http://social.msdn.microsoft.com/Forums/en-US/sqldataqualityservices/threads
• TechEd presentation about DQS by Elad Ziklik
http://channel9.msdn.com/Events/TechEd/NorthAmerica/2011/DBI207