Understanding Metadata: Why it's essential to your big data solution and how to manage it well

In this O'Reilly webcast, Ben Sharma (cofounder and CEO of Zaloni) and Vikram Sreekanti (software engineer in the AMPLab at UC Berkeley) discuss the value of collecting and analyzing metadata, and its potential to impact your big data solution and your business.

Watch the replay here: http://oreil.ly/28LO7IW

Published in: Data & Analytics

  1. Understanding Metadata: Why it’s essential to your big data solution and how to manage it well
     Tuesday, June 21, 2016
     Ben Sharma | Vikram Sreekanti
  2. Speakers
     Ben Sharma, Co-Founder & CEO, Zaloni: Ben Sharma is a passionate technologist and thought leader in big data, analytics and enterprise infrastructure solutions. Having previously worked in technology leadership at NetApp, Fujitsu and others, Ben's expertise ranges from business development to production deployment in a wide array of technologies including Hadoop, HBase, databases, virtualization and storage. Ben is co-author of Architecting Data Lakes and Java in Telecommunications.
     Vikram Sreekanti, Software Engineer, AMPLab, UC Berkeley: Vikram Sreekanti is a software engineer working on research in the AMPLab at UC Berkeley. A graduate of Berkeley's computer science department, he will begin his Ph.D. in Fall 2016, working with Joe Hellerstein.
  3. Metadata matters in a big data world
     In today’s data environment, with both structured and unstructured data, metadata matters more than before.
     •  Metadata allows you to keep track of what data is in the data lake, its source, its format and its lineage
     •  Metadata allows for better change management through impact analysis
     •  The result is data visibility, reliability and reduced time to insight for your analytics
     Zaloni Proprietary
  4. Data architecture modernization
     Traditional (diagram labels): Sources; ETL; EDW; Reporting, BI, Extracts
     New (diagram labels): Various Sources; Streaming; Unstructured Data; Data Lake; Derived (Transformed); Discovery Sandbox; EDW; Reporting, BI, Extracts; Data Science; Data Discovery
  5. Data lake reference architecture
     Diagram labels, in extraction order: Consumption Zone; Source System; File Data; DB Data; ETL Extracts; Streaming; Transient Loading Zone; Raw Data; Refined Data; Trusted Data; Discovery Sandbox; Original unaltered data attributes; Tokenized Data; APIs; Reference Data; Master Data; Data Wrangling; Data Discovery; Exploratory Analytics; Metadata; Data Quality; Data Catalog; Security; Data Lake; Integrate to common format; Data Validation; Data Cleansing; Aggregations; OLTP or ODS; Enterprise Data Warehouse; Logs (or other unstructured data); Cloud Services; Business Analysts; Researchers; Data Scientists
  6. Metadata improves data visibility and reliability
     •  Reduced time to insight for analytics
     •  Modern data architecture will require a holistic approach to metadata
     Type of metadata | Description | Example
     Technical | Captures the form and structure of each data set | Type of data (text, JSON, Avro), structure of the data (fields and their types)
     Operational | Captures lineage, quality, profile and provenance of the data | Source and target locations of data, size, number of records, lineage
     Business | Captures what it all means to the user | Business names, descriptions, tags, quality and masking rules
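
The three metadata types on slide 6 can be pictured as one record per data set. The following is a minimal sketch only: the class names, field names, paths and sample values are hypothetical and are not taken from Bedrock, Mica or any Zaloni API. Only the three categories and the kinds of examples in each come from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TechnicalMetadata:
    """Form and structure of the data set."""
    data_format: str              # e.g. "text", "JSON", "Avro"
    fields: Dict[str, str]        # field name -> field type


@dataclass
class OperationalMetadata:
    """Where the data came from, where it went, and how big it is."""
    source_location: str
    target_location: str
    size_bytes: int
    record_count: int
    lineage: List[str] = field(default_factory=list)


@dataclass
class BusinessMetadata:
    """What the data means to the user."""
    business_name: str
    description: str
    tags: List[str] = field(default_factory=list)
    masking_rules: Dict[str, str] = field(default_factory=dict)


@dataclass
class EntityMetadata:
    """All three metadata types for one data set (hypothetical container)."""
    technical: TechnicalMetadata
    operational: OperationalMetadata
    business: BusinessMetadata


# Hypothetical example values, for illustration only.
orders = EntityMetadata(
    technical=TechnicalMetadata("JSON", {"order_id": "string", "amount": "number"}),
    operational=OperationalMetadata(
        source_location="sftp://edge-node/orders.json",
        target_location="hdfs:///raw/orders",
        size_bytes=1024,
        record_count=10,
        lineage=["orders.json"],
    ),
    business=BusinessMetadata(
        business_name="Customer orders",
        description="Orders placed through the web store",
        tags=["sales"],
        masking_rules={"customer_email": "hash"},
    ),
)
```

Keeping the three parts separate mirrors the table: each part can be filled in by a different process (schema detection, ingestion, data stewards) without touching the others.
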
  7. Automated metadata registration
     Considerations:
     •  Integration with Enterprise Metadata Management Solutions
     •  Automated process for new metadata to be registered in the Data Lake
     •  Data follows the registered metadata
     Diagram labels: START; END; API check-in; copy to repository; retrieve metadata; Enterprise Metadata Repositories; metadata file; Hadoop Cluster; Edge-node to Cluster (SFTP); add tags (origin info, timestamp, etc.); Metadata; operational metadata file
  8. Data lineage example in Bedrock for impact analysis
  9. Metadata enhancing data quality and reliability
  10. Data profiling speeds up data discovery and time to insight
      Business users can quickly answer questions such as:
      •  How many records does an entity have? What is its total size?
      •  What does the activity look like for a specific entity (streaming, updated monthly, untouched from a year ago)?
      •  Is this entity a subset of another entity?
      •  Does this entity likely contain duplicates?
      •  Does this data apply to my target customers/market?
      •  What is the min/max of a particular column?
      •  Is this data reliable/does it have enough valid values?
  11. Data profiling example in Mica
      Capture profiling metrics for every entity
      •  Automatically collect profiling metrics at the:
         §  Entity level (e.g., size of data set)
         §  Field level (e.g., values, frequency of the field)
      •  Visually display metrics with metadata
      •  Allow data quality check rules to be created based on profiling information
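
Several of the profiling questions on slides 10 and 11 (record count, likely duplicates, min/max of a column, number of valid values) can be answered with a small amount of code. This is a minimal sketch under assumptions, not how Mica computes its metrics: the function name `profile_entity`, the metric names and the sample rows are all hypothetical, and "valid" is taken here to mean simply "not None".

```python
from collections import Counter


def profile_entity(rows):
    """Return entity-level and field-level metrics for a list of dict records."""
    field_names = sorted({name for row in rows for name in row})
    fields = {}
    for name in field_names:
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        fields[name] = {
            "valid_count": len(present),
            "null_count": len(values) - len(present),
            "distinct_count": len(set(present)),
            "min": min(present) if present else None,
            "max": max(present) if present else None,
            "most_common": Counter(present).most_common(1),
        }
    # A record counts as a duplicate when an identical record was already seen.
    seen = Counter(tuple(sorted(row.items())) for row in rows)
    duplicate_count = sum(n - 1 for n in seen.values())
    return {
        "record_count": len(rows),
        "duplicate_count": duplicate_count,
        "fields": fields,
    }


# Hypothetical sample data.
rows = [
    {"order_id": 1, "amount": 20.0},
    {"order_id": 2, "amount": 5.5},
    {"order_id": 2, "amount": 5.5},
    {"order_id": 3, "amount": None},
]
profile = profile_entity(rows)
```

Metrics like `null_count` are the kind of profiling information a data quality rule could be built on, for example "reject the entity if more than some share of `amount` values are missing".
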
  12. Data catalog example in Mica
  13. Data lifecycle management powered by Metadata
      •  Logical data lake that can include all tiers of storage:
         §  Files, HDFS, Object store in on-premise and cloud environments
      •  Data lifecycle management across tiered storage environments
         §  Hot -> Warm -> Cold on an entity level based on policies/SLAs
         §  Across on-premise and cloud environments
         §  Take advantage of various storage technologies
         §  Provide data management features to automate scheduling and orchestration of data movement between heterogeneous storage environments
      •  Elastic and on-demand compute for various analytical workloads
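
The Hot -> Warm -> Cold movement on slide 13 is driven by policies applied per entity. A minimal sketch of such a policy follows, assuming the policy is based on time since last access; the slide does not say what the policies measure, and the thresholds, tier names and the function `choose_tier` are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical policy: each tier with the maximum age (since last access)
# an entity may have and still stay in that tier. Anything older is "cold".
DEFAULT_POLICY = [
    ("hot", timedelta(days=30)),
    ("warm", timedelta(days=365)),
]


def choose_tier(last_accessed, now, policy=DEFAULT_POLICY):
    """Pick a storage tier for one entity from its last-access timestamp."""
    age = now - last_accessed
    for tier, max_age in policy:
        if age <= max_age:
            return tier
    return "cold"


now = datetime(2016, 6, 21)
recent = choose_tier(datetime(2016, 6, 1), now)    # 20 days old
older = choose_tier(datetime(2016, 1, 1), now)     # about half a year old
stale = choose_tier(datetime(2014, 1, 1), now)     # more than two years old
```

Because the decision needs only operational metadata (here, a last-access time), a scheduler could evaluate it for every entity without reading the data itself.
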
  14. Example: Metadata management in Financial Services
      Diagram labels: Source Systems (RDBMS/Mid Tier, Mainframe COBOL, Flat files, SAS files); Register/update metadata; Extract/Read metadata; Metadata repositories; Metadata Management solution; Data Ingestion; Data Quality and Validation; Layout Standardization; Operational Metadata Generation; Data Acquisition Automation
      •  Automated Data Acquisition Framework providing timeliness of data
      •  Capture Metadata in all phases: Ingestion, Transformation
      •  Integration with Enterprise Metadata Management
      •  Integrated Data Quality Analysis
  15. DON’T GO IN THE LAKE WITHOUT US
  16. Grounding Big Data
      Vikram Sreekanti, UC Berkeley
  17. REMEMBERING THE PAST
      Data Warehouse; Single Source of Truth; Enterprise Information Architecture; Golden Master; …
      Image labels: Truth, Truth
  18. Big data took us to a new world
  19. Big data took us to a new world
      There were changes in volume, velocity and variety, which were challenging.
  20. Big data took us to a new world
      There were changes in volume, velocity and variety, which were challenging.
      The real challenge now is the meaning and value of data, which depend critically on context.
  21. WHAT IS DIFFERENT?
      Shift in technology: Data representations
      Shift in behavior: Data-driven organizations
  22. Shift in behavior: Data-driven organizations
  23. Data in products
      Started with the Internet. Now, the Internet of Things
  24. Data in marketing (GARTNER GROUP)
      By 2017: marketing spends more on tech than IT does.
      By 2020: 90% of tech budget controlled outside of IT.
  25. MANY USE CASES, MANY CONSTITUENCIES, MANY INCENTIVES, MANY CONTEXTS
  26. WHAT IS DIFFERENT?
      Shift in technology: Data representations
      Shift in behavior: Data-driven organizations
  27. Shift in technology: Data representations
  28. Raw data in the data lake
      Simplifies capture
      Encourages exploration
      What does it mean? It depends on the context.
  29. A LITTLE SCENARIO
      HDFS
  30. BITS
      All the web logs from last year
  31. VIEWS, MODELS, CODE
      A script to extract orders. To be used for Market Basket analysis.
  32. VIEWS, MODELS, CODE
      A Hive table of orders. To be used for Market Basket analysis.
  33. BITS
      All the web logs from last year
  34. VIEWS, MODELS, CODE
      Code to extract abandoned user sessions
  35. VIEWS, MODELS, CODE
      A retargeting model
  36. VIEWS, MODELS, CODE
      A Hive table of orders
      A retargeting model
  37. MANY SCRIPTS, MANY MODELS, MANY APPLICATIONS, MANY CONTEXTS
  38. ground: A broader context for big data
  39. THE MEANING AND VALUE OF DATA DEPEND ON CONTEXT
      Application Context: Views, models, code
      Behavioral Context: Data lineage & usage
      Historical Context: In and over time
  40. APPLICATION CONTEXT
      Metadata
      Models for interpreting the data for use
      §  Data structures
      §  Semantic structures
      §  Statistical structures
      Theme: An unopinionated model of context
  41. HISTORICAL CONTEXT
      Versions
      Diagram labels: Web logs; Code to extract user/movie rentals; Recommender for movie licensing
      Trends over time: How does a movie with these features fare over time?
      Point in time: A promising new movie is similar to older hot movies at time of release!
  42. BEHAVIORAL CONTEXT
      Lineage & Usage
      Why Dora?!
  43. BEHAVIORAL CONTEXT
      Lineage & Usage
      Data Science Recommenders: “You should compare with book sales from last year.”
      Curation Tips: “Logistics staff checks weather data the 1st Monday of every month.”
      Proactive Impact Analysis: “The Twitter analysis script changed. You should check the boss’ dashboard!”
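
The "Proactive Impact Analysis" tip on slide 43 amounts to walking the lineage graph downstream from whatever changed. The sketch below illustrates that idea with a breadth-first traversal; it is not Ground's API. The function name `downstream`, the adjacency-dict representation and the artifact names other than the Twitter script and the boss's dashboard are hypothetical.

```python
from collections import deque


def downstream(lineage, changed):
    """Return every artifact derived, directly or indirectly, from `changed`.

    `lineage` maps an artifact to the artifacts produced from it.
    """
    affected = []
    seen = {changed}
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected


# Hypothetical lineage: script -> table -> dashboard and report.
lineage = {
    "twitter_analysis_script": ["sentiment_table"],
    "sentiment_table": ["boss_dashboard", "weekly_report"],
}
impacted = downstream(lineage, "twitter_analysis_script")
```

With lineage recorded this way, a change to the script can trigger a notification to the owners of everything in `impacted`.
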
  44. THE BIG CONTEXT
      A NEW WORLD NEEDS NEW SERVICES
  45. WHAT ARE WE BUILDING?
      Grounding philosophy
      §  Start useful, stay useful.
      §  Stay general.
      §  Design for scale.
  46. Architecture diagram
      ABOVEGROUND API TO APPLICATIONS: Parsing & Featurization; Catalog & Discovery; Wrangling; Analytics & Vis; Reference Data; Data Quality; Reproducibility; Model Serving
      COMMON GROUND: CONTEXT MODEL
      UNDERGROUND API TO SERVICES: Scavenging and Ingestion; Search & Query; Scheduling & Workflow; Versioned Storage; ID & Auth
  47. Architecture diagram (same labels as slide 46), with two additional labels: Pachyderm; Chronos
  48. COMMON GROUND: An unopinionated context model
      Versions; Models; Usage
  49. COMMON GROUND: The metamodel
      Model Graphs
  50. Model graph examples
      JSON DOCUMENT (diagram labels): Root; Object 2; member k1; member k1: string; member k2; member k2: number; member k11: string; member k12; element 1, element 2, element 3 (shown twice)
      RELATIONAL SCHEMA (diagram labels): Schema 1; Table 1; Column 1; Column c; Table t; Column 1; Column d; foreign key
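
Slide 50 shows a JSON document and a relational schema drawn as graphs. As a rough illustration of that idea, a nested JSON-like value can be flattened into parent-to-child edges. This is a sketch under assumptions, not Ground's metamodel: the function `json_to_edges`, the path notation and the sample document are hypothetical.

```python
def json_to_edges(value, path="root"):
    """List (parent, child) edges for a nested dict/list structure."""
    edges = []
    if isinstance(value, dict):
        for key, child in value.items():
            child_path = f"{path}.{key}"
            edges.append((path, child_path))
            edges.extend(json_to_edges(child, child_path))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            child_path = f"{path}[{index}]"
            edges.append((path, child_path))
            edges.extend(json_to_edges(child, child_path))
    # Scalars (strings, numbers) are leaves and add no further edges.
    return edges


# Hypothetical document loosely shaped like the slide's labels.
doc = {"k1": "a", "k2": {"k11": "b", "k12": [1, 2, 3]}}
edges = json_to_edges(doc)
```

A relational schema could be handled the same way, with schema, table and column nodes plus an extra edge for each foreign key.
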
  51. COMMON GROUND: The versioning model
      Model Graphs; Version Graphs
  52. COMMON GROUND: The versioning model
      Model Graphs; Version Graphs
  53. VERSION GRAPHS
      Directed Acyclic Graphs (partial orders)
      Diagram labels: In this order; In no particular order
      Version identifiers, as extracted: a3eb4b765520b0d0ab90594dcf2373c1ce5dbb0b0 0e9233e8e99cccd6861d304968efa4c945a0b918 3e64220f08374629ad43ca652d4ce7cef0bdbbca 3e0bada008655fe32d7d136eac0a3f333d23ed80fd75a4ba16f96d11f3f954854acc2d739054233
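
Slide 53 describes version graphs as directed acyclic graphs, that is, partial orders: some versions come "in this order", others "in no particular order". The sketch below shows one way to test that relationship, assuming each version records its parent versions; the function names, the parent-map representation and the short version IDs are hypothetical and not taken from Ground.

```python
def ancestors(parents, version):
    """Return all versions reachable by following parent links from `version`."""
    seen = set()
    stack = list(parents.get(version, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def compare(parents, a, b):
    """Say how versions a and b relate in the partial order."""
    if a == b:
        return "same"
    if a in ancestors(parents, b):
        return "before"
    if b in ancestors(parents, a):
        return "after"
    return "unordered"


# Hypothetical history: v2 and v3 both branch from v1, and v4 merges them.
parents = {"v1": [], "v2": ["v1"], "v3": ["v1"], "v4": ["v2", "v3"]}
```

Here `compare(parents, "v1", "v4")` reports an ordered pair, while `compare(parents, "v2", "v3")` reports that the two branches have no defined order, which is what distinguishes a partial order from a simple linear history.
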
  54. COMMON GROUND: The usage model
      Model Graphs; Version Graphs; Usage Graphs: Lineage
  55. USAGE GRAPHS
      Everything can participate in usage
  56. COMMON GROUND: The model
      Model Graphs; Version Graphs; Usage Graphs: Lineage
  57. INITIAL FOCUS AREAS
  58. INITIAL FOCUS AREAS
      Architecture diagram (same labels as slide 46)
  59. INITIAL FOCUS AREAS
      Architecture diagram (same labels as slide 46); labels set apart: Parsing & Featurization; Model Serving; Reproducibility
  60. INITIAL FOCUS AREAS
      Architecture diagram (same labels as slide 46); label set apart: Versioned Storage
  61. Architecture diagram (same labels as slide 46); labels set apart: ABOVEGROUND API TO APPLICATIONS; UNDERGROUND API TO SERVICES
  62. Learn more at: http://www.ground-context.org
      @vsreekanti
