
ING- CoreIntel- Collect and Process Network Logs Across Data Centers in Real Time

Security is at the core of every banking activity. ING set an ambitious goal: to gain insight into overall network data activity. The purpose is to quickly recognize and neutralize unwelcome guests such as malware and viruses, to prevent data leakage, and to track down misconfigured software components.
Since the inception of the CoreIntel project we knew we were going to face the challenges of capturing, storing and processing vast amounts of data of various types from all over the world. In our session we would like to share our experience in building a scalable, distributed system architecture based on Kafka, Spark Streaming, Hadoop and Elasticsearch to help us achieve these goals.
Why does choosing a good data format matter? How do you manage Kafka offsets? Why is dealing with Elasticsearch a love-hate relationship for us, and how did we manage to put it all together with wire encryption everywhere and a kerberized Hadoop cluster?


  1. Unleashing the power of Apache Atlas with Apache Ranger
     Virtual Data Connector Project
     NIGEL JONES
     JONESN@UK.IBM.COM
     DATAWORKS, MUNICH, APRIL 2017
     Apache®, Apache Atlas, Apache Ranger & other Apache project names referenced are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.
  2. About Me – Nigel Jones
     • https://www.linkedin.com/in/nigelljones/
     • jonesn@uk.ibm.com (Anyone still use email?)
     • @planetf1 – noisy, f1, electric vehicles, food & drink …. A split of work/life accounts didn’t work for me!
     • And of course the Apache Atlas & Ranger mailing lists & JIRA!
     • Science fan at school and uni. It was cloud chambers back then… now just the cloud
     • IBM Hursley, UK since 1990
     • Last 3 years focus on Data Lake, Information Governance, Open Metadata
  3. The Problem… WHY ARE WE HERE…
  4. Data?
     • What data do I have?
     • What does it mean?
     • Where is it?
     • Who has access to it?
     • Who owns it?
     • What quality is it?
     • How does it relate to other data?
     • How do I control, audit & understand access?
  5. Regulatory needs
     • Adhere to regulations like BCBS-239 and GDPR
     • Need to know meaning, value of the data
     • Demonstrate processes in place to govern access
     • Audit
     • Significant fines if rules breached
     • Whilst ensuring easy, ready access to appropriate data for data professionals to support an agile business
  6. So what do we need to address this?
  7. Metadata
     • Metadata enables data to be used outside of the application that created it.
       • Analytics and decision making
       • New business applications
       • Reporting and compliance
     • Metadata describes the format and content of data, allowing people to judge which dataset to use for a new project
       • Structure
       • Meaning
       • Origin
       • Valid values and quality
       • Usage and ownership
       • Regulations and classifications that apply
     • Metadata describes the business context and classification of data, allowing automated governance processes to operate.
  8. Which can support…
     • An enterprise data catalogue that lists all data including where it is, what it is, who owns it, its meaning, quality, where it came from, and can fully describe its business context & how the data should be governed
     • Subject matter experts searching, collaborating, feeding back about their data needs and use
     • Automated governance actions to protect and manage, including auditing, monitoring, quality control, rights management
  9. But easily…
     • Open frameworks & APIs
     • Automatic collection & discovery of metadata in a dynamic heterogeneous environment
     • Using predefined standards for glossaries, schemas, rules, regulations to reduce cost
     • Cheap to integrate new tools
     • No proprietary lock-in & assumptions that all tools are from one suite or vendor
     • Avoiding silos
     • Distributed and Open
  10. The vision: Open and Unified Metadata
  11. Virtualization Data Connector project
  12. Data virtualization project
      • Collaboration – IBM, several banks & open community
      • A Data Lake environment
        • Not just Hadoop, but other sources too
        • Business Terms, Classifications, Metadata rich
      • Offer virtualized views. Expose relational data with business terms
      • Manage access to resources – permit, deny, log, filter/mask …. THROUGH METADATA
      • Open, pluggable
      • Working through use cases, design, initial MVP (this year)
      • Critique, feedback is welcomed. We’re looking for guidance and support from the Atlas & Ranger communities, as well as to contribute our ideas
      • Proposed changes all go through mailing list and JIRA for feedback
  13. Apache Atlas
      • “Atlas is a scalable and extensible set of core foundational governance services – enabling enterprises to effectively and efficiently meet their compliance requirements within Hadoop and allows integration with the whole enterprise data ecosystem.” … http://www.apache.org
      • Open Community – Apache Incubator since May 2015
      • Type agnostic metadata store
      • REST API & UI
      • Supports many Hadoop components including HBase, Hive, Sqoop, Storm & others
  14. Apache Ranger
      • Centralized security administration to manage all security related tasks in a central UI or using REST APIs.
      • Fine grained authorization to do a specific action and/or operation with Hadoop component/tool and managed through a central administration tool
      • Standardize authorization method across all Hadoop components.
      • Enhanced support for different authorization methods – Role based access control, attribute based access control etc.
      • Centralize auditing of user access and administrative actions (security related) within all the components of Hadoop.
      • … from http://ranger.apache.org
  15. Project Interactions
      [Diagram. Components: Search/Report, GaianDB, Apache Atlas, Apache Ranger, Apache Solr, RDBMS, Hadoop]
      • Search for list of assets by metadata
      • Search for data
      • Reporting tool obtains data to draw report
      • Underlying data: sql, hive, HDFS, Oracle, Netezza etc
      • Manages logical views
      • Deploys rules, pushes classifications, source for user roles (not users)
      • + ranger plugin to permit/deny, mask etc
      • Pulls rules, classifications
  16. Why Atlas and Ranger?
      • Open Source essential to forming an active ecosystem
      • Vision, active community & evolving – ability to contribute & work with others to provide the best solution
      • Already have good core capabilities
        • Atlas type system is very flexible
        • Ranger offers a range of policy types and provides a pluggable framework
      • Already cross project integration
        • Use of tag based policies in Ranger sourced from Atlas
      • Can be used independently of full Hadoop stack
  17. Refined virtual connector scope
      [Architecture diagram. Components: GaianDB, Ranger Plugin, Virtualizer, Atlas, OMAS, OMRS, Titan (GraphDB, Metadata Repository), Ranger Server, Ranger Config (e.g. policies, audit log location), IGC, LDAP, Audit Log, Mapper, Metadata Navigator, Datameer, Data Lake Repositories (Oracle, Netezza, Hive Tables), Data Lake Virtualization]
      [Flow labels: Poll Policies, Pre, Post, Create View, Extract physical metadata, Manage Logical Tables, Retrieve metadata, Push metadata, Push and query metadata, tag-sync, rule-sync, Search for data/reporting]
  18. GaianDB & Virtualizer
      • GaianDB
        • Open Source
        • Federated, self learning, dynamic configuration
        • Based on Apache Derby
        • Already had “policy” support – we’re plugging in Ranger for this project
      • Virtualizer
        • Listens to event notifications on assets etc
        • Creates view definitions in GaianDB, and uses new Atlas APIs to store metadata. Could use a different virtual engine.
        • Designed to be open to other virtualization technologies.
      [Diagram labels: LT1, LT2, DS1, DS2, DS3, PolicyPlugin (ranger), Virtualizer, Atlas]
      GaianDB supports federation – not used for MVP
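The Virtualizer's job of creating view definitions that expose physical columns under business terms can be sketched as follows. This is a minimal illustration only: `ColumnMapping`, `build_view_sql` and the example names are hypothetical and are not part of the Virtualizer, GaianDB or Atlas. It assumes the target engine accepts a standard SQL `CREATE VIEW` statement with double-quoted identifiers.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ColumnMapping:
    """Hypothetical pairing of a physical column with its business term."""
    physical_column: str
    business_term: str


def quote_identifier(name: str) -> str:
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_view_sql(view_name: str, source_table: str,
                   mappings: Sequence[ColumnMapping]) -> str:
    """Build a CREATE VIEW statement that renames columns to business terms."""
    if not mappings:
        raise ValueError("a view needs at least one mapped column")
    select_list = ", ".join(
        f"{quote_identifier(m.physical_column)} AS {quote_identifier(m.business_term)}"
        for m in mappings
    )
    return (
        f"CREATE VIEW {quote_identifier(view_name)} AS "
        f"SELECT {select_list} FROM {quote_identifier(source_table)}"
    )


if __name__ == "__main__":
    print(build_view_sql("EMPLOYEE_VIEW", "EMP",
                         [ColumnMapping("EMP_SALARY", "Salary")]))
```

In a real deployment the mappings would come from the metadata repository in response to asset notifications rather than being hard-coded.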
  19. Atlas – glossary enhancements
      • Get Atlas closer to parity with commercial offerings
      • Business Terms – categories, category hierarchies
      • Has-a, is-a, type-of, synonym, antonym, arbitrary relationships
      • Assets mapped to Business Terms
      • Classifications
      • Hierarchy
      • Navigable mappings to retain ability to flatten tags to ranger
      • Instead of hive column EMP_SALARY -> SPI, now can be EMP_SALARY -> SALARY -> SPI …
      • Used to drive governance
      • ATLAS-1410
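Flattening a chain such as EMP_SALARY -> SALARY -> SPI into a plain set of tags amounts to a graph traversal. The sketch below assumes the mappings are available as a simple dictionary from each node to the nodes it maps to; that data shape and the function name `flatten_tags` are illustrative assumptions, not the proposed Atlas API.

```python
from typing import Dict, List, Set


def flatten_tags(asset: str, relations: Dict[str, List[str]]) -> Set[str]:
    """Return every term or classification reachable from the asset.

    `relations` maps a node (asset or term) to the nodes it is mapped to.
    The traversal tolerates cycles by tracking visited nodes.
    """
    seen: Set[str] = set()
    stack = list(relations.get(asset, []))
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(relations.get(node, []))
    seen.discard(asset)  # an asset is never its own tag, even in a cycle
    return seen


if __name__ == "__main__":
    relations = {"EMP_SALARY": ["SALARY"], "SALARY": ["SPI"]}
    print(sorted(flatten_tags("EMP_SALARY", relations)))
```

With this shape, an enforcement engine that only understands flat tags still sees SPI on the column even though the mapping now passes through the SALARY term.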
  20. Atlas – other enhancements
      • Consumer Centric APIs
        • Open Metadata Access Services (OMAS)
        • REST & more Kafka notifications
        • Asset, Catalog, Connector, Glossary, Governance Action, Governance Definitions, Information View, Roles and Access
      • Repository level APIs
        • Open Metadata Repository Services (OMRS)
        • REST & more Kafka notifications
        • Pluggability through an Open Connector Framework to other metadata repositories – distributed and Open
      • Standard data model/core
        • Enhancement to core model – versioning, external linkage etc
        • More standard types, i.e. for all relational databases, to ease sharing
  21. Ranger areas being looked at
      • Building a plugin for GaianDB
      • Access control, simple masking. More later
      • User synchronization (large #users, role of Atlas)
      • Changes to tag sync process for new glossary proposal
      • As more metadata goes into Atlas, it becomes source for generation of some kinds of policies. Where is the master?
      • Generating ranger rules from governance definitions
      • How about control of access to Atlas itself?
      • Aside: Interfaces used by enforcement engines (such as to get classification data) need to be efficient – these should work for projects like Apache Sentry as well as Atlas
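The kind of decision a plugin would make for access control and simple masking can be sketched as a tag-driven check. Everything here is a hypothetical illustration: `TagPolicy`, `decide` and `apply_decision` are not Ranger plugin interfaces, and the rules that the most restrictive outcome wins and that untagged columns are permitted are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Iterable, Set


class Decision(Enum):
    PERMIT = "permit"
    MASK = "mask"
    DENY = "deny"


@dataclass(frozen=True)
class TagPolicy:
    """Hypothetical policy attached to a classification tag."""
    tag: str
    allowed_roles: FrozenSet[str]  # roles that see the clear value
    masked_roles: FrozenSet[str]   # roles that see a masked value


def decide(column_tags: Set[str], user_roles: Iterable[str],
           policies: Iterable[TagPolicy]) -> Decision:
    """Combine all policies matching the column's tags; most restrictive wins."""
    roles = set(user_roles)
    decision = Decision.PERMIT  # assumption: untagged columns are open
    for policy in policies:
        if policy.tag not in column_tags:
            continue
        if roles & policy.allowed_roles:
            continue  # this policy permits; keep the current decision
        if roles & policy.masked_roles:
            decision = Decision.MASK
        else:
            return Decision.DENY
    return decision


def apply_decision(value: str, decision: Decision) -> str:
    """Return the value, a masked placeholder, or raise on denial."""
    if decision is Decision.PERMIT:
        return value
    if decision is Decision.MASK:
        return "****"
    raise PermissionError("access denied by tag policy")


if __name__ == "__main__":
    policies = [TagPolicy("SPI", frozenset({"hr"}), frozenset({"analyst"}))]
    print(apply_decision("52000", decide({"SPI"}, ["analyst"], policies)))
```

Keeping the decision separate from its application mirrors the permit, deny, mask split described in the slides, so the same decision could also be written to an audit log.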
  22. Beyond the MVP
      • Open Discovery Framework
      • Consider other security enforcement engines – such as Apache Sentry & driving more capability around rules & governance actions from Atlas metadata
      • Work on standard models to support different domains
      • Lineage
      • From high level design lineage through to operational detail. Logs vs graph…
      • API metadata
      • Infrastructure – JanusGraph…
      • Abstraction added by IBM in last few months for titan 1
  23. The vision
      • An enterprise data catalog that lists all of your data, where it is located, its origin (lineage), owner, structure, meaning, classification and quality
        • Spanning systems both on premise and cloud providers
        • Hosted locally to your data platforms but integrated to provide the enterprise view
      • New data tools (from any vendor) connect to your data catalog out of the box
        • No vendor lock-in; nor expensive population of yet another proprietary siloed metadata repository
      • Metadata is added automatically to the catalog as new data is created
        • Extensible discovery processes characterise and classify the data
        • Interested parties and processes are notified
      • Subject matter experts collaborating around the data
        • Locate the data they need, quickly and efficiently
        • Feed back their knowledge about the data and the uses they have made of it to help others and support economic evaluation of data
      • Automated governance processes protect and manage your data
        • Metadata-driven access control
        • Auditing, metering and monitoring
        • Quality control and exception management
        • Rights management
      • Predefined standards for glossaries, data schemas, rules and regulations that reduce the cost of doing business
      • Open frameworks and APIs for collaborating with universities, traditional vendors and new innovators around data and advanced analytics
  24. Summary
      • Atlas can help us have an industry wide common metadata platform around which a vibrant ecosystem can evolve
      • Not only in Hadoop but more broadly
      • Metadata driven governance can be scalable & enable us to manage our data better, and be compliant with regulations
      • The ideas presented here resonate with many people we’ve spoken to
      • Get involved! I’d love to hear the feedback on this approach!
      • Comment on the JIRAS, ask questions, contribute, disagree… ;-)
      • Look at JIRA Tag “VirtualDataConnector” or start at ATLAS-1689
      • Atlas wiki
      • “Innovation happens best not in isolation but in collaboration” (keynote)
      • THANKS!
  25. Questions?
      • After this talk
      • jonesn@uk.ibm.com
      • 17:50 Room 4 – Security & Governance BOF
  26. Backup charts
  27. ATLAS Virtualizer Architecture
      [Architecture diagram. Components: Atlas graphDB, “gaiandb”, IGC, IGC REST API, Oracle Data, HDFS Data, Netezza Data, P-JDBC (x3), GAF OMAS, Virtual Asset OMAS, Search, Search/Explore UI, Catalog OMAS, OMRS (x2), GAF Pre, GAF Post, Connector Framework *]
      [Legend: Atlas boundaries; Developed in POC; May not be in POC initially; * May be hardcoded at first]
  28. Metadata areas and types
      [Diagram of metadata areas numbered 1 to 7. Labels:]
      • Policy Metadata (Principles, Regulations, Standards, Approaches, Rule Specifications, Roles and Metrics)
      • Governance Actions and Processes
      • Augmentation, Mapping, Implementation
      • Connector Directories
      • Access
      • Business Objects and Relationships, Taxonomies and Ontologies
      • Business Attributes
      • Organization
      • Teaming Metadata (people profiles, communities, projects, notebooks, …)
      • Models and Schemas
      • Physical Asset Descriptions (Data stores, APIs, models and components)
      • Asset Collections (Sets, Typed Sets, Type Organized Sets)
      • Information Views
      • Rights Management
      • Reference Data
      • Feedback Metadata (tags, comments, ratings, …)
      • Classification Schemes
      • Classification Strategy
      • Subject Area Definition
      • Campaigns and Projects
      • Infrastructure and systems
      • Rollout
      • Discovery Metadata (profile data, technical classification, data classification, data quality assessment, …)
      • Augmentation, Instrument, Association
      • Information Process Instrumentation (design lineage)
      [Roles shown: Information Auditor, Integration Developer, Business Analyst, Data Scientist, Information Worker, Information Owner, Information Governor, Information Steward, Data Quality Analyst, Information Curator]
  29. User & Group/Role synchronization (UserSync2)
      • LDAP holds role-membership (LDAP groups) – could also be Active Directory
      • ATLAS manages definitive list of roles <that are used for atlas managed sources>
      • Corporate LDAP has a huge number of users/groups
      • Ranger currently needs to sync all
      • In future perhaps we establish group/role membership during authentication
      • Capability for alternative source could be merged in to base UserSync
      [Diagram: UserSync2 between Apache Ranger, LDAP and Apache Atlas; LDAP lookup -> group:member; Governance Action OMAS - getRoles]
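The idea of syncing only the roles Atlas manages, rather than the whole corporate directory, can be sketched as a filter over group memberships. The inputs here are plain Python values standing in for the LDAP lookup (group:member) and the role list; the function names are hypothetical and no real LDAP, Atlas or Ranger call is made.

```python
from typing import Dict, Iterable, List, Set


def select_role_memberships(ldap_groups: Dict[str, List[str]],
                            atlas_roles: Iterable[str]) -> Dict[str, List[str]]:
    """Keep only LDAP groups that correspond to roles managed in Atlas.

    Members are de-duplicated and sorted so repeated syncs are stable.
    """
    wanted = set(atlas_roles)
    return {
        group: sorted(set(members))
        for group, members in ldap_groups.items()
        if group in wanted
    }


def users_to_sync(memberships: Dict[str, List[str]]) -> Set[str]:
    """Collect the distinct users that appear in the selected groups."""
    return {user for members in memberships.values() for user in members}


if __name__ == "__main__":
    ldap = {
        "hr": ["bob", "alice", "bob"],
        "finance": ["carol"],
        "everyone": ["alice", "bob", "carol", "dave"],
    }
    selected = select_role_memberships(ldap, ["hr", "finance", "audit"])
    print(selected, sorted(users_to_sync(selected)))
```

In this example the large "everyone" group is never synced, which is the reduction the slide is after when the corporate LDAP holds a huge number of users and groups.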
  30. Atlas Glossary v2: Tag Sync to Ranger (TagSync2)
      • ATLAS glossary manages a sophisticated enterprise glossary structure
      • Atlas Glossary v2 proposed in ATLAS-1410 (David Radley)
      • Sync builds on existing tagsync approach
      • New API in Atlas will flatten classification structure
      • No changes to ranger – but exposing richer classification could be area of future work
      [Diagram. Apache Atlas side: emp_renum (Hive Column), Salary (Business Term), Confidential (Business Term). Apache Ranger side: emp_renum (Hive Column), Confidential (Tag). Linked through Governance Action OMAS.]
  31. Policy (Rule) synchronization (RuleSync)
      • Generate policies in Ranger based off entities in Atlas
      • Currently designing how this works
      • Scoped by policy service so existing Ranger UI approach still works
      [Diagram: RuleSync between Apache Atlas and Apache Ranger; Governance Action OMAS - getRules; labels: Role, Classifications, Asset, Action, Ranger Rule]
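Since the slide says the design was still open, the following is only one possible shape for turning governance rules into policies scoped to a single policy service. `GovernanceRule`, `to_policy`, `generate_policies` and the dictionary layout are all invented for illustration; the layout is deliberately not Ranger's policy JSON.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List


@dataclass(frozen=True)
class GovernanceRule:
    """Hypothetical rule as it might be read from the metadata repository."""
    name: str
    classification: str
    role: str
    action: str  # e.g. "select"
    allow: bool


def to_policy(rule: GovernanceRule, service: str) -> Dict[str, Any]:
    """Translate one rule into an illustrative policy dictionary."""
    item = {"roles": [rule.role], "accesses": [rule.action]}
    return {
        "service": service,  # scoping keeps generated policies in one service
        "name": rule.name,
        "tags": [rule.classification],
        "allow": [item] if rule.allow else [],
        "deny": [] if rule.allow else [item],
    }


def generate_policies(rules: Iterable[GovernanceRule],
                      service: str) -> List[Dict[str, Any]]:
    """Generate policies sorted by name so repeated runs are comparable."""
    return sorted((to_policy(r, service) for r in rules),
                  key=lambda p: p["name"])


if __name__ == "__main__":
    rules = [
        GovernanceRule("spi-hr", "SPI", "hr", "select", True),
        GovernanceRule("spi-deny-guest", "SPI", "guest", "select", False),
    ]
    for policy in generate_policies(rules, "gaiandb_tags"):
        print(policy)
```

Writing generated policies into their own service, here the made-up `gaiandb_tags`, is one way to read "scoped by policy service": hand-edited policies in other services stay untouched by the sync.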
  32. VirtualDataConnector JIRAS 20170402
      • RANGER-1488 – Create Ranger plugin for gaiandb
      • RANGER-1487 – generate rules from Governance definitions in Atlas
      • RANGER-1486 – New usersync alternative for Atlas (vdc)
      • RANGER-1485 – Ranger support for Virtual Data Connector Project (ATLAS)
      • RANGER-1464 – Support Atlas v2 glossary in Atlas plugin (for access control to terms etc)
      • RANGER-1454 – Support of Atlas v2 glossary API proposal for tag source
      • RANGER-1234 – Post-evaluation phase user extensions
      • RANGER-1186 – Ranger Source: eclipse
      • RANGER-1168 – Add data masking for tag based policies
      • ATLAS-1696 – Governance Action Framework OMAS
      • ATLAS-1694 – Sample assets to support Virtual Connector Project
      • ATLAS-1691 – OMAS Interfaces for Atlas
      • ATLAS-1158 – Build ATLAS using Docker
      • ATLAS-520 – Temporal / Versioning support for types, traits, entities ....
      • ATLAS-519 – metrics
      • ATLAS-455 – Timeouts in tests should be configurable from system property
      • ATLAS-197 – Add build instructions in top level dir
  33. References
      • Apache Atlas – http://atlas.apache.org/
      • Top level JIRA for this activity – https://issues.apache.org/jira/browse/ATLAS-1689
      • Apache Ranger – http://ranger.apache.org/
      • GaianDB
        • https://github.com/gaiandb/gaiandb
        • https://developer.ibm.com/open/openprojects/gaian-database/
      • The case for open metadata – A.M.Chessell
        • http://www.ibmbigdatahub.com/blog/case-open-metadata
