At a Big Data Warehousing Meetup, we discussed how Hadoop 2.0 and YARN are used to solve the problem of identity resolution and customer matching. In this session, attendees learned how to build a dynamic, integrated customer view by extracting, cleansing, matching and linking data from virtually any data source, structured or unstructured, maximizing the value of Hadoop 2.0 and YARN.
Joe Caserta, President, Caserta Concepts, covered the identity resolution process and its challenges, and how it fits into the MDM paradigm and the big data ecosystem for customer behavior analytics.
For more information, visit www.casertaconcepts.com
2. Agenda
6:30 Networking: Grab some food and drink... Make some friends.
6:45 Joe Caserta, President, Caserta Concepts: Welcome + Intro to Big MDM; About the Meetup
7:15 George Corugedo, CTO, RedPoint Global: Introduction and Overview of RedPoint; Demo of Customer Matching on Hadoop
8:00 Q&A: Ask Questions, Share your experience
8:15 More Networking: Don’t leave until you make at least one new Data Nerd friend!
3. About the BDW Meetup
• Big Data is a complex, rapidly changing landscape
• We want to share our stories and hear about yours
• Great networking opportunity for like-minded data nerds
• Founded by Caserta Concepts
  • November 10, 2012
• Next BDW Meetup:
  • March 3rd
  • Topic: Graph Databases for MDM
  • Location: NWC
#BDWmeetup #maximizeDataValue @CasertaConcepts @RedPointGlobal
4. Caserta Timeline
Years on the timeline: 1986, 1996, 2001, 2004, 2009, 2010, 2012, 2013, 2014
Milestones:
• Top 20 Big Data Consulting - CIO Review
• Launched Big Data practice
• Co-author, with Ralph Kimball, The Data Warehouse ETL Toolkit (Wiley)
• Dedicated to Data Warehousing, Business Intelligence since 1996
• Began consulting: database programming and data modeling
• 25+ years hands-on experience building database solutions
• Founded Caserta Concepts in NYC
• Web log analytics solution published in Intelligent Enterprise
• Formalized Alliances / Partnerships – System Integrators
• Partnered with Big Data vendors Cloudera, Hortonworks, IBM, Cisco, Datameer, Basho and more…
• Launched Training practice, teaching data concepts world-wide
• Laser focus on extending Data Warehouses with Big Data solutions
• Launched Big Data Warehousing (BDW) Meetup-NYC, ~1500 Members
• Established best practices for big data ecosystem implementation – Healthcare, Finance, Insurance
• Top 20 Most Powerful Big Data consulting firms
• Dedicated to Data Governance Techniques on Big Data (Innovation)
5. About Caserta Concepts
• Award-winning technology innovation consulting with
expertise in:
• Big Data Solutions
• Data Warehousing
• Business Intelligence
• Core focus in the following industries:
• eCommerce / Retail / Marketing
• Financial Services / Insurance
• Healthcare / Ad Tech / Higher Ed
• Established in 2001:
• Increased growth year-over-year
• Industry-recognized workforce
• Strategy, Implementation
• Writing, Education, Mentoring
• Data Science & Analytics
• Cloud Computing
• Data Interaction & Visualization
6. Help Wanted
Does this word cloud excite you?
Word cloud: Spark, Big Data Architect, NoSQL, EC2, EMR, Redshift
Speak with us about our open positions: leslie@casertaconcepts.com
9. Mastering Data
Libraries: Staging Library, Consolidated Library, Integrated Library
Processes: Validation, Standardization, Matching, Survivorship

Source records:
Source | ID  | Name          | Home Address    | Birth Date | SSN
SYS A  | 123 | Jim Stagnitto | 123 Main St     | 8/20/1959  | 123-45-6789
SYS B  | ABC | J. Stagnitto  | 132 Main Street | 8/20/1959  | 123-45-6789
SYS C  | XYZ | James Stag    | NULL            | 8/20/1959  | NULL

After standardization and matching:
Source | ID  | Name          | Home Address    | Birth Date | SSN         | Std Name        | Std Addr        | MDM ID
SYS A  | 123 | Jim Stagnitto | 123 Main St     | 8/20/1959  | 123-45-6789 | James Stagnitto | 123 Main Street | 1
SYS B  | ABC | J. Stagnitto  | 132 Main Street | 8/20/1959  | 123-45-6789 | James Stagnitto | 132 Main Street | 1
SYS C  | XYZ | James Stag    | NULL            | 8/20/1959  | NULL        | James Stag      | NULL            | 1

After survivorship:
MDM ID | Name            | Home Address    | Birth Date | SSN
1      | James Stagnitto | 123 Main Street | 8/20/1959  | 123-45-6789
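
The standardization, matching, and survivorship steps on this slide can be sketched in a few lines of Python. This is a minimal illustration and not the tooling demonstrated at the meetup: the nickname and street lookup tables, the matching rule (same SSN, or same birth date with a compatible standardized name), and the survivorship rule (keep the longest non-null value, earliest source wins ties) are all assumptions chosen to fit the three sample records, and every function name is hypothetical.

```python
from itertools import combinations

# Sample source records from the slide (NULL shown as None).
records = [
    {"source": "SYS A", "id": "123", "name": "Jim Stagnitto",
     "address": "123 Main St", "birth_date": "8/20/1959", "ssn": "123-45-6789"},
    {"source": "SYS B", "id": "ABC", "name": "J. Stagnitto",
     "address": "132 Main Street", "birth_date": "8/20/1959", "ssn": "123-45-6789"},
    {"source": "SYS C", "id": "XYZ", "name": "James Stag",
     "address": None, "birth_date": "8/20/1959", "ssn": None},
]

# Toy lookup tables (assumptions); real standardization uses far richer reference data.
NICKNAMES = {"jim": "James", "j.": "James"}
STREET_ABBREVIATIONS = {"St": "Street"}


def standardize(rec):
    """Return a copy of rec with std_name and std_addr columns added."""
    out = dict(rec)
    first, _, rest = rec["name"].partition(" ")
    out["std_name"] = f"{NICKNAMES.get(first.lower(), first)} {rest}".strip()
    addr = rec["address"]
    if addr is not None:
        addr = " ".join(STREET_ABBREVIATIONS.get(w, w) for w in addr.split())
    out["std_addr"] = addr
    return out


def is_match(a, b):
    """Assumed rule: same SSN, or same birth date plus compatible names."""
    if a["ssn"] and a["ssn"] == b["ssn"]:
        return True
    if a["birth_date"] != b["birth_date"]:
        return False
    a_first, _, a_last = a["std_name"].partition(" ")
    b_first, _, b_last = b["std_name"].partition(" ")
    if not a_last or not b_last:
        return False
    return a_first == b_first and (a_last.startswith(b_last) or b_last.startswith(a_last))


def assign_mdm_ids(recs):
    """Cluster matching records with union-find and number the clusters from 1."""
    parent = list(range(len(recs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(recs)), 2):
        if is_match(recs[i], recs[j]):
            parent[find(j)] = find(i)

    ids = {}
    for i, rec in enumerate(recs):
        rec["mdm_id"] = ids.setdefault(find(i), len(ids) + 1)
    return recs


def survive(cluster):
    """Assumed rule: keep the longest non-null value; the earliest source wins ties."""
    def best(field):
        values = [r[field] for r in cluster if r[field]]
        return max(values, key=len) if values else None

    return {"mdm_id": cluster[0]["mdm_id"], "name": best("std_name"),
            "address": best("std_addr"), "birth_date": best("birth_date"),
            "ssn": best("ssn")}


standardized = [standardize(r) for r in records]
matched = assign_mdm_ids(standardized)
golden = survive(matched)
print(golden)
```

With these toy rules, the three sample records collapse to a single MDM ID and the surviving record matches the master row shown on the slide. Real matching would typically use fuzzy comparison and weighted scoring rather than exact lookups.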
10. MDM Information Ecosystem
Master data views: Informational Master Data, Operational Master Data, Holistic Master Data
Diagram labels: Service, Leads, Policies, Claims, Enrolls, Sales, Finance, DW, Dimensions & Cross-References, Marketing, Insights
11. Traditional Approaches to MDM
• Registry
• Transactional
• Co-Existence
Diagram labels, in slide order:
App A, App B
App A, App B, Single Version of the Truth
App A, App B, Non-Sensitive, Shared, Sensitive, App
12. New Master Data Management
• Traditional MDM will do, depending on your data size and requirements:
  • Relational is awkward: extreme normalization, poor usability and performance
• NoSQL stores like HBase have benefits:
  • Useful if you need super high performance (low-millisecond response times) to incorporate into your Big Data ETL
  • Flexible schema
• A graph database is a near-perfect fit: relationships and graph analysis bring master data to life!
• Data quality and matching processes are required:
  • Little to no community or vendor support
  • Achievable in Hadoop with YARN
MDM is best represented in a Graph database
Describe the workings of a full-feature MDM: near-real-time operational features, batch features
Extensibility into other forms of master data, e.g. Product
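
To illustrate why a graph suits master data, here is a small sketch that is not from the talk. It uses a plain Python adjacency map instead of a real graph database, and the node labels and helper names are hypothetical. It links the three sample source records from the Mastering Data slide to one master node, then walks the graph from any source record to find every record that resolves to the same identity.

```python
from collections import defaultdict, deque

# Undirected graph stored as an adjacency map: node -> set of neighbors.
edges = defaultdict(set)


def link(a, b):
    """Add an undirected edge between two nodes."""
    edges[a].add(b)
    edges[b].add(a)


# One master node (MDM ID 1) linked to each source record from the sample data.
master = ("MDM", "1")
for source_record in [("SYS A", "123"), ("SYS B", "ABC"), ("SYS C", "XYZ")]:
    link(master, source_record)


def same_identity(start):
    """Breadth-first walk returning every node reachable from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in edges[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


# Starting from the sparsest record, recover all source records for the same person.
linked = sorted(n for n in same_identity(("SYS C", "XYZ")) if n[0] != "MDM")
print(linked)
```

The same traversal extends naturally to other relationships (households, accounts, products) by adding more node and edge types, which is where a dedicated graph database would take over from this toy structure.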