
Enterprise Data Classification and Provenance


Published in: Technology

  1. Enterprise Data Classification and Provenance: Apache Atlas. Shwetha Shivalingamurthy, Suma Shivaprasad. © Hortonworks Inc. 2011 – 2016. All Rights Reserved.
  2. Disclaimer: This document may contain product features and technology directions that are under development, may be under development in the future, or may ultimately not be developed. Project capabilities are based on information that is publicly available within the Apache Software Foundation project websites ("Apache"). Progress of the project capabilities can be tracked from inception to release through Apache; however, technical feasibility, market demand, user feedback and the overarching Apache Software Foundation community development process can all affect timing and final delivery. This document's description of these features and technology directions does not represent a contractual commitment, promise or obligation from Hortonworks to deliver these features in any generally available product. Product features and technology directions are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind. Since this document contains an outline of general product development plans, customers should not rely upon it when making purchasing decisions.
  3. Agenda • Demo • Big Data Governance • Overview of Atlas • Atlas architecture • Features and Roadmap
  4. Demo use case: Ad network • Matches advertiser demand with ad space supply from publishers • Billing based on ad impressions/ad engagement • Enables targeting, tracking and reporting of ad impressions • Typical reports/queries: mismatch of demand and supply; country-wise and OS-wise reports; top advertisers/publishers
  5. Data landscape (diagram). Labels: Ad servers; User, Ad Impression, Click, and Billing logs; Metadata; Summaries; Traditional warehouse
  6. Data governance requirements • Cross-platform lineage: impact analysis, forensics, discovery • Asset search • Common business terms • Compliance
  7. Demo (HDP 2.5) • Technical and business metadata • Cross-component lineage • Creating views • Creating tags • Entity deletes • Search using tags and attributes • Entity audit • Business catalog: find assets • Flexible model, external lineage ingest
  8. Data Governance: Data governance refers to the processes, methods and tools used in an enterprise for effective control of the availability, usability, integrity, and security of data. Data Governance Aspects: Data Discovery and Tagging • Metadata Management • Data Lineage/Provenance • Access Management • Data Security & Privacy • Data Quality • Compliance and Audit • Data Wrangling • Data Lifecycle Management • Data Integration
  9. Enterprise Data Governance: Apache Atlas • Data management along the entire data lifecycle, with integrated provenance and lineage capability: cross-component lineage • Modeling with metadata enables a comprehensive business metadata vocabulary with enhanced tagging and attribute capabilities: common business language; hierarchically organized, no duplicates • Interoperable solutions across the Hadoop ecosystem through a common metadata store: combine and exchange metadata. Diagram labels: Structured; Traditional RDBMS; Metadata; MPP Appliances; Kafka; Storm; Sqoop; Hive; Atlas Metadata; Falcon; Ranger; Custom; Partners
  10. Background: DGI Community becomes Apache Atlas (DGI: Data Governance Initiative). Timeline: Dec 2014, DGI group kickoff • May 2015, Apache Atlas incubation • Jul 2015, HDP 2.3 / Apache 0.5 foundation release • Aug 2016, HDP 2.5 / Apache 0.7 release. Diagram label: Global Financial Company. Key Benefits: • Co-development = built for real customer use cases • Faster & safer = customers know the business + HWX knows Hadoop • Code contributors: Hortonworks, IBM, Aetna, Merck, Target
  11. Architecture (diagram)
  12. Architecture (diagram)
  13. Atlas Type System • Defines the model: the schema of metadata • Flexible and powerful enough to define any model or custom types • Supports inheritance • Types: primitive types (bool, integer types, string, date, enum); collections (array, map); Struct (a set of attributes); Class (an identifiable struct, with hierarchy); Trait (a set of attributes, with hierarchy)
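
To make the type-system idea concrete, here is a minimal in-memory sketch in Python. It is not the Atlas API: `AttributeDef`, `ClassType`, and the optional `comment` attribute are hypothetical names chosen for illustration, and the assumption that `hive_table` extends `DataSet` is ours (the deck only shows an "extends" edge in the Hive model diagram).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AttributeDef:
    """One attribute of a type: a name, a type name, and whether it is required."""
    name: str
    data_type: str  # e.g. "string", "date", "array<hive_column>"
    required: bool = True


@dataclass
class ClassType:
    """A class type with single inheritance (hypothetical model, not the Atlas API)."""
    name: str
    attributes: List[AttributeDef] = field(default_factory=list)
    super_type: Optional["ClassType"] = None

    def all_attributes(self) -> List[AttributeDef]:
        # Inherited attributes come first, then the attributes declared on this type.
        inherited = self.super_type.all_attributes() if self.super_type else []
        return inherited + self.attributes

    def missing_required(self, values: dict) -> List[str]:
        # Names of required attributes that an entity's values do not supply.
        return [a.name for a in self.all_attributes()
                if a.required and a.name not in values]


# Assumed hierarchy: hive_table extends DataSet.
data_set = ClassType("DataSet", [AttributeDef("name", "string")])
hive_table = ClassType(
    "hive_table",
    [AttributeDef("createTime", "date"),
     AttributeDef("comment", "string", required=False)],  # hypothetical attribute
    super_type=data_set,
)
```

Calling `hive_table.missing_required({"createTime": "2015-01-01"})` reports the inherited `name` attribute as missing, which is the kind of check inheritance in a type system makes possible.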
  14. Hive Model (diagram) • DataSet (metaType: ClassType): name: string, required • hive_db (metaType: ClassType): name: string, required; createTime: date, required; parameters: map<string,string>, optional • hive_table (metaType: ClassType): db: hive_db, required; createTime: date, required; columns: array<hive_column>, required • hive_column (metaType: ClassType): name: string, required; type: string, required • Edge labels: extends; references; references 0..n
  15. Entities: instances of types (diagram) • hive_db: name: rawlogs, guid: 1, createTime: 2015-01-01 10:00 • hive_table: name: impressions, guid: 2 • hive_column: name: adv_id, type: string, guid: 3 • hive_column: name: user_id, type: string, guid: 4 • Edge labels: db, column, column • Traits: EXPIRES_ON (time: March, 2016); PII
  16. Graph Engine • Graph database: Titan with storage backed by HBase • Types and entities are translated to the graph model: classes, structs and traits map to vertices; relationships map to edges; rich relationships between metadata objects • Indexing and search: indexing based on type annotations; external indexing, with Titan backed by Solr
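
The entity-to-graph translation described on this slide can be sketched in a few lines of plain Python. This only illustrates the idea (entities become vertices, references become labeled edges); it does not use Titan or the Atlas API, and the `to_graph` helper and the `REFERENCE_ATTRS` set are hypothetical names.

```python
# Entities loosely following the deck's example (guids 1 to 4).
entities = [
    {"guid": "1", "type": "hive_db", "name": "rawlogs"},
    {"guid": "2", "type": "hive_table", "name": "impressions",
     "db": "1", "columns": ["3", "4"]},
    {"guid": "3", "type": "hive_column", "name": "adv_id"},
    {"guid": "4", "type": "hive_column", "name": "user_id"},
]

# Attributes that hold references to other entities rather than plain values.
REFERENCE_ATTRS = {"db", "columns"}


def to_graph(entities):
    """Return (vertices, edges): one vertex per entity, one edge per reference."""
    vertices, edges = {}, []
    for entity in entities:
        # Plain attributes become vertex properties.
        vertices[entity["guid"]] = {
            k: v for k, v in entity.items() if k not in REFERENCE_ATTRS
        }
        # Reference attributes become labeled edges (source, label, target).
        for attr in sorted(REFERENCE_ATTRS):
            if attr in entity:
                value = entity[attr]
                targets = value if isinstance(value, list) else [value]
                edges.extend((entity["guid"], attr, t) for t in targets)
    return vertices, edges


vertices, edges = to_graph(entities)
```

With this data the table vertex keeps only `guid`, `type`, and `name`, while its `db` and `columns` references turn into three edges.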
  17. Titan property graph model: graph search with Gremlin
       saturn = g.V.has('name','saturn').next()
       hercules = saturn.as('x').in('father').loop('x'){it.loops < 3}.next()
       hercules.outE('battled').has('time', T.gt, 1).inV.name
     Result: cerberus, hydra
  18. Search: find relevant assets based on their attributes and their associations with business terms • DSL with SQL-like syntax based on the type system: from $type is $trait where $clause select|has $attributes, repeat • Example: select columns from a hive_table whose name is "impressions" and whose db name is "raw": hive_column where table.name = "impressions", table.db.name = "raw" • Example: select all columns from Hive tables that are tagged as "PII": hive_column is 'PII' • Full-text search: '(rawlogs) AND hive'; '(rawlogs OR supply*) AND hive'
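
The two DSL examples above can be mimicked over in-memory objects to show what they select. This is a hedged sketch, not the Atlas query engine: the `search` and `get_path` helpers, the `summary.clicks` table, and the choice of `user_id` as the PII-tagged column are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Db:
    name: str


@dataclass
class Table:
    name: str
    db: Db


@dataclass
class Column:
    name: str
    table: Table
    traits: set = field(default_factory=set)


def get_path(obj, path):
    """Follow a dotted attribute path such as 'table.db.name'."""
    for part in path.split("."):
        obj = getattr(obj, part)
    return obj


def search(columns, where=None, is_trait=None):
    """Return names of columns matching every path=value pair and the trait, if given."""
    where = where or {}
    return [
        c.name for c in columns
        if all(get_path(c, path) == value for path, value in where.items())
        and (is_trait is None or is_trait in c.traits)
    ]


impressions = Table("impressions", Db("raw"))
clicks = Table("clicks", Db("summary"))  # hypothetical second table
columns = [
    Column("adv_id", impressions),
    Column("user_id", impressions, {"PII"}),  # assumed to carry the PII trait
    Column("adv_id", clicks),
]

# hive_column where table.name = "impressions", table.db.name = "raw"
in_impressions = search(columns, where={"table.name": "impressions",
                                        "table.db.name": "raw"})
# hive_column is 'PII'
pii_columns = search(columns, is_trait="PII")
```

The point of the sketch is that a `where` clause walks relationships (`table.db.name`) while `is` filters on attached traits, independent of where the entity lives.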
  19. Features and Roadmap
  20. Apache Atlas Component Integration & Lineage • Cross-component dataset lineage: a centralized location for all metadata inside HDP • Single interface point for metadata exchange with platforms outside of HDP. Diagram labels: Apache Atlas; Hive; Ranger; Falcon; Sqoop; Storm; Kafka; Spark; NiFi; HBase; Partner; Custom. Release markers: HDP 2.3; HDP 2.5; Beyond HDP 2.5; External
  21. Business Catalog for Ease of Use • Organize data assets along business terms: authoritative hierarchical taxonomy creation; agile modeling of conceptual, logical, and physical assets; definition and assignment of tags such as PII (Personally Identifiable Information) • Comprehensive features for compliance: multiple user profiles, including Data Steward and Business Analyst; object auditing to track "who did it"; metadata versioning to track "what did they do" • Faster insight (roadmap): Data Quality tab for profiling and sampling; user comments. Key benefits: organize data assets along business terms; compliance features; faster insight
  22. Apache Ranger: Introduction • Centralized authorization and auditing across Hadoop components: HDFS, Hive, HBase, Knox, Storm, YARN, Kafka, Solr, ...; audit logs to Solr, HDFS, RDBMS, Log4j, ... • Resource-based security: policies for a specific set of resources; requires revision of policies as resources get added or moved • Classification-based security: policies for classifications, not for specific resources; a single policy protects resources in multiple components; as the classification of a resource changes, the appropriate policies are applied automatically; enables separation of duties between resource classification and security policies
  23. Scalable Access Control: Reusable Tag Policy (diagram). Labels: User group (AD, Linux); Resources (Files, Tables, Topologies); Atlas Tag (PII); ANY asset tagged PII (Files, Tables, Topologies); Single Admin Group Assigns; Many Stewards Tag. Direct user-group-to-resource mapping: Not Scalable. Tag-based mapping: Scalable • Single point of enforcement and audit • All future tagging is covered by existing policy
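
A small sketch of why a tag policy scales: the policy is written once against the tag, and any asset tagged later is covered without touching the policy. This is a simplified in-memory model, not Ranger's policy engine. The resource names are hypothetical, the group names are borrowed from the compliance example later in the deck, and treating untagged resources as allowed is an assumption made to keep the example short.

```python
# One tag policy: which groups may access assets carrying the tag.
tag_policies = {"PII": {"us_hr"}}

# Tag assignments across components (hypothetical resources).
resource_tags = {
    ("hive", "raw.impressions.user_id"): {"PII"},
    ("hdfs", "/data/raw/users"): {"PII"},
    ("hive", "raw.impressions.adv_id"): set(),
}


def is_allowed(group, resource):
    """Allow access unless a tag on the resource has a policy excluding the group."""
    tags = resource_tags.get(resource, set())
    return all(group in tag_policies[t] for t in tags if t in tag_policies)


# A steward tags a new asset later; no policy change is needed.
resource_tags[("kafka", "clicks")] = {"PII"}
```

After the new Kafka topic is tagged, `is_allowed("us_employee", ("kafka", "clicks"))` is already `False` and `is_allowed("us_hr", ("kafka", "clicks"))` is already `True`, which is the "all future tagging is covered by existing policy" property.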
  24. Open: Governance Ready Certification Program • Choice: customers choose the features they want to deploy, a la carte versus vendor lock-in • Curated & Fast: a selected group of vendor partners provides rich, complementary and complete features ready to deploy • Agile: low switching costs, faster deployment and innovation • Centralized: common SLA and common open metadata store • Flexibility: interoperability of products through Atlas metadata • Safe: HDP at the core to provide stability and interoperability. Completed: Waterline, Dataguise, Attivio, Trifacta. Pending: Collibra, Alation, Meta Integration (Miti), Paxata, Syncsort, Talend
  25. Roadmap • Multi-tenancy • Titan 1.x migration • Hive column-level lineage
  26. Summary • Designed for Hadoop at the platform level, not the application level • High-confidence data in Hadoop for regulated verticals • Compliance and business objectives aligned to data organization • Faster discovery for analysts: reduced time to value • Agile and adaptable: native connectors ensure information is current • Dynamic protection with Ranger through simple, audited policies
  27. Learn More: • Apache Incubator link: http://atlas.incubator.apache.org/ • Hortonworks links: http://hortonworks.com/solutions/security-and-governance/ • https://community.hortonworks.com/spaces/64/governance-lifecycle-track.html?topics=Atlas&type=question • Atlas Technical User Guide: http://atlas.incubator.apache.org/AtlasTechnicalUserGuide.pdf
  28. Questions
  29. Backup
  30. Dynamic Access Policy: Apache Ranger + Atlas Integration
  31. How does Atlas work with Ranger at scale? • Atlas provides metadata: business classification (taxonomy), e.g. Company > HR > Driver; hierarchy with inheritance of attributes by child objects, e.g. the sensitive "PII" tag of department HR is inherited by group HR > Driver; Atlas notifies Ranger of changes via a Kafka topic • Ranger provides access and entitlements: Ranger caches tags and asset mappings for performance; Ranger has policies based on tags instead of roles, e.g. PII = <group>, which can work for many assets • Atlas provides the metadata tags used to create policies. Diagram labels: Apache Atlas; Hive; Ranger; Falcon; Kafka; Storm
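
The inheritance rule on this slide (a tag on HR is inherited by HR > Driver) amounts to a walk up the taxonomy. The following is a minimal sketch under that reading; the `parents` and `own_tags` dictionaries and the `effective_tags` helper are hypothetical names, not part of Atlas.

```python
# Taxonomy from the slide: Company > HR > Driver (child -> parent).
parents = {"Driver": "HR", "HR": "Company", "Company": None}

# Tags assigned directly to a taxonomy term.
own_tags = {"HR": {"PII"}}


def effective_tags(term):
    """Collect the term's own tags plus every tag inherited from its ancestors."""
    tags = set()
    while term is not None:
        tags |= own_tags.get(term, set())
        term = parents[term]
    return tags
```

Here `effective_tags("Driver")` includes `PII` because it is inherited from HR, while `effective_tags("Company")` is empty because inheritance only flows from parent to child.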
  32. Automatic update of policies: active protection (diagram) • Atlas: metastore (tags, assets, entities) • Notification framework: Kafka topics; notification; metadata updates; message durability; event-driven updates • Ranger: Atlas client (subscribes to the topic, gets metadata updates); PDP; resource cache; optimized for speed
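
The event-driven flow in this diagram can be sketched with a standard-library queue standing in for the Kafka topic. This is an illustration only: no Kafka, Atlas, or Ranger client is used, and `publish_tag_change`, `drain_updates`, and the event shape are assumptions.

```python
import queue

topic = queue.Queue()  # stand-in for the Kafka topic Atlas publishes to
tag_cache = {}         # Ranger-side cache: asset name -> set of tags


def publish_tag_change(asset, tags):
    """Atlas side: emit a metadata update when an asset's tags change."""
    topic.put({"asset": asset, "tags": sorted(tags)})


def drain_updates():
    """Ranger side: apply all pending updates to the cache; return how many were applied."""
    applied = 0
    while not topic.empty():
        event = topic.get()
        tag_cache[event["asset"]] = set(event["tags"])
        applied += 1
    return applied


publish_tag_change("raw.impressions.user_id", {"PII"})
first_batch = drain_updates()
```

Because authorization decisions read from `tag_cache` rather than querying the metadata store, lookups stay fast, and the cache changes only when an update event arrives.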
  33. Apache Ranger: Authorization and Auditing (diagram). Labels: Enterprise Users; Ranger Administration Portal; Ranger Policy Store; Ranger Audit Store; Hadoop components, each with a Ranger Plugin: HBase, HDFS, Hive Server2, Knox, Storm, YARN, Kafka, Solr; storage labels: Log4j, HDFS, Solr, RDBMS
  34. Big Data Governance • Current landscape: opaque data in a variety of data stores (HDFS, S3, data warehouses); schema alone is hardly sufficient (Hive Metastore, Avro, data warehouse); platform tools like Ranger and Falcon solve parts of the problem • Need for data governance: organizations need data governance to understand their information and answer questions such as: What do we know about our information? Where did this data come from and how is it being used? Does this data adhere to company policies and rules? There is also a need for effective control and consumption of data • Atlas helps customers discover information about data objects, their meaning, location, characteristics, and usage.
  35. Business Taxonomy • Business Taxonomy (Catalog): taxonomy is the practice and science of classifying things or concepts, including the principles that underlie such classification. The business organization model is hierarchical, which makes it authoritative with no duplication. • Tags (Traits vs. Labels vs. Business Taxonomy): Atlas has tags that are authoritative and prevent duplication. A tag can span different parts of the business taxonomy; for example, a PII tag can be used in HR as well as in Finance or Sales. • Benefits: a view of data assets organized by business language; compliance and acceptable use through dynamic, metadata-based access control; a common taxonomy across Hadoop components
  36. Principal Roles & Activities in an Enterprise • Data Steward: curator, responsible for data classification; associates business taxonomy, tagging, and access policies • Data Scientist: analyst, primary consumer of the Business Taxonomy • Administrator/Operations: role management, data lifecycle management (archival, retention) • Data Engineer: data ingress and egress, semantic data quality. Callouts: 50% to 80%+ of time spent looking for data • Profit center • Primary user of Atlas • Enables scientist. Goal: < 25% of time spent on finding data, empowering scientists to spend their time uncovering insights faster
  37. Data Governance Use Cases: Impact Analysis • HortonAdNetwork: a large ad network with an international footprint, with multiple publishers and advertisers across several countries • Complex ETL jobs and data pipelines process real-time ad network data from several different sources and various data processing platforms • No easy way to determine the root cause when something is off the charts • Data analysts need effective data provenance tools for impact and root cause analysis • Cross-component lineage is a must • Data Lineage (Provenance): data lineage is defined as a data life cycle that includes the data's origins and where it moves over time. It describes what happens to data as it goes through diverse processes. It helps provide visibility into the analytics pipeline and simplifies tracing errors back to their sources.
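
Impact analysis and root cause analysis are the same graph walk in opposite directions over lineage edges. The sketch below uses a breadth-first search over a tiny hand-written lineage graph; the dataset names and the `walk`, `downstream`, and `upstream` helpers are hypothetical and are not taken from the demo or from the Atlas API.

```python
from collections import deque

# Lineage edges: dataset -> datasets derived directly from it (hypothetical names).
lineage = {
    "ad_server_logs": ["raw.impressions"],
    "raw.impressions": ["summary.daily_impressions", "summary.country_report"],
    "summary.daily_impressions": ["warehouse.billing"],
}

# Reverse edges: dataset -> datasets it was derived from.
reverse_lineage = {}
for source, targets in lineage.items():
    for target in targets:
        reverse_lineage.setdefault(target, []).append(source)


def walk(graph, start):
    """Breadth-first traversal returning every dataset reachable from start."""
    seen, order, todo = {start}, [], deque([start])
    while todo:
        for nxt in graph.get(todo.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                todo.append(nxt)
    return order


def downstream(dataset):
    """Impact analysis: what is affected if this dataset is wrong?"""
    return walk(lineage, dataset)


def upstream(dataset):
    """Root cause analysis: where could a problem in this dataset have come from?"""
    return walk(reverse_lineage, dataset)
```

For example, `downstream("raw.impressions")` lists both summaries and the billing table, while `upstream("warehouse.billing")` traces back through the daily summary to the ad server logs.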
  38. Data Governance Use Cases: Compliance • HortoniaBank: a mid-size bank expanding from the US to international markets • 2 customer tables owned by BH, with 50K customer records each and 38 fields (PII, PHI, PCI & non-sensitive data): us_customers (USA person data only) and ww_customers (multi-language, multi-country, localized person data) • 1 data set of prospects leased from a data broker: tax_2010 (data lease expired already!)
  39. User / Group / Access Privileges • joe_analyst / us_employee / US data only, non-sensitive data only, rest forbidden depending on sensitivity • kate_hr / us_hr / US data only, all sensitive data (PCI, PII, PHI). Tag-based policies: • US HR team members can see all original data (PCI, PII, ...) • Analysts are prohibited from viewing PII data in any of the tables • Everyone except Operations/Admin is prohibited from accessing tax_2010 after the specified date; the Expires_on policy turns off access on the configured expiry date
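
The three tag-based policies above can be expressed as two small checks. This is a simplified sketch, not Ranger policy syntax: the column names, the `admin` group name, and the concrete expiry date are hypothetical, and "only `us_hr` may read PII" is a simplification of the first two rules.

```python
from datetime import date

user_groups = {"joe_analyst": "us_employee", "kate_hr": "us_hr"}

# Hypothetical column names and tag assignments.
column_tags = {"us_customers.ssn": {"PII"}, "us_customers.city": set()}

# Hypothetical expiry date for the leased data set.
expires_on = {"tax_2010": date(2016, 3, 31)}


def can_read_column(user, column):
    """PII-tagged columns are readable only by the US HR group."""
    if "PII" in column_tags[column]:
        return user_groups[user] == "us_hr"
    return True


def can_read_table(group, table, today):
    """After a table's expiry date, only the admin group keeps access."""
    expiry = expires_on.get(table)
    if expiry is not None and today > expiry:
        return group == "admin"
    return True
```

Note that neither check names a specific user or table in its logic: access follows from the tag on the column and the expiry attribute on the table, so retagging an asset changes the outcome without editing the rules.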
  40. Expanded Native Connector: Dataset Lineage (diagram). Labels: Sqoop; Teradata Connector; Apache Kafka; Custom Activity Reporter; Metadata Repository; RDBMS • Any process using Sqoop is covered • No other tool tracks this out of the box
