Hadoop and big data are not an island within an organization. Analyzing event streams and similar data requires integrating them with data from other systems in the organization. This isn't easy with big data systems today because their technologies and environments differ from those of traditional IT. Data virtualization is one way to smooth over the integration, allowing Hadoop to access other data and allowing SQL-oriented tools to access Hadoop.
Using Data Virtualization to Integrate With Big Data
1. The Role of Data Virtualization in a World of Big Data
June 6, 2012
Mark Madsen
@markmadsen
www.ThirdNature.net
Information Management Through Human History
New technology development
(innovation)
creates
New methods to cope
(maturation)
creates
New information scale and availability
(saturation)
creates…
Copyright Third Nature, Inc.
2. Big Data
You keep using that word.
I do not think it means
what you think it means.
3. What makes data “big”?
Hierarchical structures
Nested structures
Encoded values
Non‐standard (for a database) types
Deep structure
Very large amounts
Human authored text
“big” is better off being defined as “complex” or “hard to manage”
6. Reality is multiple data stores and platforms
Separate, purpose-built databases and processing systems for
different types of data and query / computing workloads are the
norm for information delivery. Data flows between most of these
environments.
Diagram: BI, Reporting, Dashboards
Data Warehouse
Source Environments: Databases, Documents, Flat Files, XML, Queues, ERP, Applications
7. Example “big data”: Web tracking data
USER_ID 301212631165031
SESSION_ID 590387153892659
VISIT_DATE 1/10/2010 0:00
SESSION_START_DATE 1:41:44 AM
PAGE_VIEW_DATE 1/10/2010 9:59
DESTINATION_URL https://www.phisherking.com/gifts/store/LogonForm?mmc=link‐src‐email‐_‐m100109‐_‐44IOJ1‐_‐shop&langId=‐1&storeId=1055&URL=BECGiftListItemDisplay
REFERRAL_NAME Direct
REFERRAL_URL ‐
PAGE_ID PROD_24259_CARD
REL_PRODUCTS PROD_24654_CARD, PROD_3648_FLOWERS
SITE_LOCATION_NAME VALENTINE'S DAY MICROSITE
SITE_LOCATION_ID SHOP‐BY‐HOLIDAY VALENTINES DAY
IP_ADDRESS 67.189.110.179
BROWSER_OS_NAME MOZILLA/4.0 (COMPATIBLE; MSIE 7.0; AOL 9.0; WINDOWS NT 5.1; TRIDENT/4.0; GTB6; .NET CLR 1.1.4322)
The event stream contains IDs, but no reference data…
Reference data is also known as dimensions or master data. This isn’t an OLTP
DB, so there is no reference data available from the source.
“I need that data now.”
“It will take 6 months.”
“It would be logical to keep all the data in one place.”
The typical situation for analysts
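The web tracking record above carries identifiers such as PAGE_ID and REL_PRODUCTS but nothing that describes them. A minimal sketch of the enrichment an analyst needs might look like the following. The `products` lookup and its descriptive values are hypothetical, invented for illustration; in practice they would come from a warehouse or master data system rather than the event stream.

```python
# Hypothetical reference data (dimensions / master data). These descriptive
# values are invented for illustration and are NOT part of the event stream.
products = {
    "PROD_24259_CARD": {"name": "Valentine card", "category": "Cards"},
    "PROD_24654_CARD": {"name": "Anniversary card", "category": "Cards"},
    "PROD_3648_FLOWERS": {"name": "Rose bouquet", "category": "Flowers"},
}

# One page-view event, using field names and values from the example record.
event = {
    "USER_ID": "301212631165031",
    "SESSION_ID": "590387153892659",
    "PAGE_ID": "PROD_24259_CARD",
    "REL_PRODUCTS": "PROD_24654_CARD, PROD_3648_FLOWERS",
}


def enrich(event, products):
    """Attach descriptive attributes to the IDs found in a raw event."""
    unknown = {"name": None, "category": None}
    # REL_PRODUCTS is an encoded, comma-separated list inside a single field.
    related_ids = [p.strip() for p in event["REL_PRODUCTS"].split(",")]
    return {
        **event,
        "PAGE_PRODUCT": products.get(event["PAGE_ID"], unknown),
        "REL_PRODUCT_DETAILS": [products.get(p, unknown) for p in related_ids],
    }


enriched = enrich(event, products)
print(enriched["PAGE_PRODUCT"]["category"])
```

Without the lookup, the analyst can count page views per PAGE_ID but cannot say anything about product categories, which is the gap the rest of the deck addresses.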
8. There are two architectural approaches to
facilitating analysis, depending on where the
analyst works in the environment:
1. Back end integration: For analysts working within
the BD environment ‐ Reaching out from the
environment to get other data that's needed to
make sense of information.
2. Front end integration: For analysts working in a
more conventional BI / analysis environment ‐
reaching in to the BD environment from other tools.
Solution: copy the data into Hadoop?
Just load it from the DW. If it’s there. Otherwise, dump and load
the data from the sources.
Great for one-time analysis, but what if you need to do it again next
week, or need current values on a regular basis?
You can build custom extracts from each source. But…
Sources shown: Data warehouse, OLTP Sources
• Poor tool support
• Problem of on-demand / current values
• Minimal data management possible in the Hadoop environment
• The analyst waits
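The "current values" problem with copied data can be shown in a few lines. This is only a sketch under stated assumptions: `source_prices` is a hypothetical stand-in for a live source system, and `snapshot` stands in for a one-time extract loaded into the analysis environment.

```python
import copy

# Hypothetical source system holding current reference values.
source_prices = {"PROD_24259_CARD": 4.99}

# "Dump and load": a one-time copy into the analysis environment.
snapshot = copy.deepcopy(source_prices)

# The source changes after the extract has been taken.
source_prices["PROD_24259_CARD"] = 3.99


def price_from_snapshot(product_id):
    """Read from the copied data; reflects the time of the extract."""
    return snapshot[product_id]


def price_on_demand(product_id):
    """Stand-in for querying the live source at analysis time."""
    return source_prices[product_id]


print(price_from_snapshot("PROD_24259_CARD"), price_on_demand("PROD_24259_CARD"))
```

The copy is fine for a one-off analysis, but every repeat run either reuses stale values or requires another extract.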
10. Data virtualization can simplify access across the entire
data environment, “big” or not
DV also enables shared metadata across environments, avoiding
the cost of integrating models and of burying that integration in source code.
Diagram: BI, Reporting, Dashboards
Data virtualization layer (front end)
Data Warehouse
DV layer (back end)
Source Environments: Databases, Documents, Flat Files, XML, Queues, ERP, Applications
Bridge the data environment to uses beyond BI
The use cases now include interactive applications, lower latency
data and complex analytics, and they extend beyond read‐only queries.
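As a rough illustration of the idea behind a virtualization layer, the sketch below uses SQLite's `ATTACH DATABASE` and a temporary view to present two separate stores as one queryable object. SQLite here is only a stand-in, not a data virtualization product, and the table, column, and view names (`page_views`, `product_dim`, `enriched_page_views`) and the category values are hypothetical.

```python
import sqlite3

# Autocommit mode so that ATTACH is never issued inside an open transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)

# A second, separate in-memory database stands in for the data warehouse.
conn.execute("ATTACH DATABASE ':memory:' AS dw")

# The main database stands in for the event store (e.g. data held in Hadoop).
conn.execute("CREATE TABLE page_views (user_id TEXT, page_id TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [
        ("301212631165031", "PROD_24259_CARD"),
        ("301212631165031", "PROD_3648_FLOWERS"),
    ],
)

# Hypothetical reference data kept in the "warehouse" database.
conn.execute("CREATE TABLE dw.product_dim (page_id TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO dw.product_dim VALUES (?, ?)",
    [("PROD_24259_CARD", "Cards"), ("PROD_3648_FLOWERS", "Flowers")],
)

# The "virtual" layer: a view that joins both stores at query time.
# No data is copied; consumers see a single table-like object.
conn.execute(
    """
    CREATE TEMP VIEW enriched_page_views AS
    SELECT v.user_id, v.page_id, d.category
    FROM main.page_views AS v
    LEFT JOIN dw.product_dim AS d ON d.page_id = v.page_id
    """
)

rows = conn.execute(
    "SELECT page_id, category FROM enriched_page_views ORDER BY page_id"
).fetchall()
print(rows)
```

Because the join is resolved when the view is queried, a change to `dw.product_dim` is visible on the next query with no reload step, and the join logic lives in shared metadata (the view definition) rather than in each analyst's code.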
11. About the Presenter
Mark Madsen is president of Third
Nature, a technology research and
consulting firm focused on business
intelligence, analytics and
information management. Mark is an
award-winning author, architect and
former CTO whose work has been
featured in numerous industry
publications. During his career Mark
received awards from the American
Productivity & Quality Center, TDWI,
Computerworld and the Smithsonian
Institute. He is an international
speaker, contributing editor at
Intelligent Enterprise, and manages
the open source channel at the
Business Intelligence Network. For
more information or to contact Mark,
visit http://ThirdNature.net.
About Third Nature
Third Nature is a research and consulting firm focused on new and
emerging technology and practices in business intelligence, analytics and
performance management. If your question is related to BI, analytics,
information strategy and data then you’re at the right place.
Our goal is to help companies take advantage of information-driven
management practices and applications. We offer education, consulting
and research services to support business and IT organizations as well as
technology vendors.
We fill the gap between what the industry analyst firms cover and what IT
needs. We specialize in product and technology analysis, so we look at
emerging technologies and markets, evaluating technology and how it is
applied rather than vendor market positions.