Netezza Deep Dives
Rush Shah
1.
Premier, Inc. Netezza Deep Dive
Rush Shah, Product Director, Premier Connect Enterprise, 2016.08.14
2.
PROPRIETARY & CONFIDENTIAL – © 2013 PREMIER INC.
Workshop Objective
• Why – the case for implementing a data warehouse appliance
• What – the traditional approach (SMP) and the appliance approach (MPP)
• Difference – what is the Netezza data warehouse?
• Capabilities – IBM Netezza architecture and capabilities
3.
Workshop Outcome
• Assimilate and articulate the reasons a data warehouse appliance is needed.
• Assimilate and articulate the concepts of the Netezza architecture.
• Assimilate, articulate, and apply selected functionality.
• Optimize Netezza databases for performance.
4.
Software Terminology
• Process – an instance of a computer program that is being executed. It has its own set of resources (e.g. memory) that are not shared with other processes.
• Thread – a process may run multiple threads to perform instructions in parallel. Threads within a process share resources (e.g. memory).
• Server – a program running to serve the requests of other programs. The term is also used for a physical computer dedicated to running one or more services.
• Multi-process – adding throughput by running multiple instances of a process or service, either on a single server or distributed across multiple servers.
• Multi-threaded – within a process, performing multiple tasks simultaneously across multiple CPUs.
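The process/thread distinction above can be sketched in Python. This is a minimal, hypothetical illustration (the `counter` dictionary and `work` function are invented for the example): threads within one process all see and update the same memory, which is exactly why access to shared state must be coordinated.

```python
# Minimal sketch: threads in one process share the process's memory.
import threading

counter = {"value": 0}          # shared state, visible to every thread
lock = threading.Lock()

def work(n):
    for _ in range(n):
        with lock:              # serialize access to the shared resource
            counter["value"] += 1

threads = [threading.Thread(target=work, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter["value"])  # 4000: all four threads updated the same memory
```

Separate processes, by contrast, would each get their own copy of `counter` and would have to exchange results explicitly.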
5.
Hardware Architecture Terminology
• Scalability – supporting multiple simultaneous actions, not making a single action faster.
• Availability – the ability of a solution to withstand component failures. Increasing the availability of a solution increases its cost.
• Scale up – adding more resources (CPU, RAM, etc.) to a single server.
• Scale out – adding more resources (CPU, RAM, etc.) by adding more servers in a "cluster".
6.
Hardware Architecture Terminology
• Single point of failure – a component within a solution whose failure causes the solution as a whole to fail.
• Active/active – all instances of a multi-process service process requests.
• Active/passive – only some instances of a multi-process service process requests; the other instances are activated only in the event of a component failure.
7.
Concurrency, Parallelism, Multitasking, Distributed Systems
Concurrency
• The ability of software (and hardware) to share resources at the same time and perform multiple computations at the same time.
• Software is written for concurrency to exploit hardware parallelism.
• Allows efficient processing of asynchronous events (e.g. a user-interface thread working independently of a computation thread).
Parallel Computing
• Complex computations or large jobs can often be divided into smaller ones, which are then processed concurrently.
• Parallelism is a feature of the hardware: a multi-core or multi-processor computer has multiple processing elements, and clusters and MPPs use multiple computers to work on the same task.
Distributed/Grid Computing
• Refers to the use of distributed systems to solve computational problems.
• Refers to a software system in which components located on a network or cluster of computers communicate and coordinate their efforts to achieve a common goal.
• As in parallel computing, the problem is subdivided into multiple tasks, each performed on a different computer.
• A distributed system must have concurrency of components and redundancy of data, to allow for component failure and self-sufficient computing.
8.
A Very High-Level Hardware 101
• CPU – central processing unit, the instruction performer. A hardware component known as the microprocessor; contains the arithmetic logic unit (ALU) and control unit (CU).
• Cache – stores copies of the most frequently used data. Smaller, faster memory integrated with the CPU; reduces average memory-access time.
• RAM – computer storage built from integrated circuits that can access data randomly, in constant time, regardless of physical location. Volatile memory: all data is lost when the computer is shut down.
• Bus – a communication system that transfers data from one component to another; this implies wires or fiber-optic cables.
• Storage – static memory: data is not lost when the computer is shut down. Has a write head and a read head; stores large amounts of data; the slowest of all components (tapes, magnetic disks, optical disks).
9.
Symmetrical Multi-Processing (SMP) Architecture
A very large SMP: multiple CPUs, a large bank of memory, and a separate storage area network. Shared RAM, shared NAS, shared I/O fiber-optic bandwidth... in short, shared everything.
10.
SMP and Large-Scale Data Processing
CPU-centric: SMP architectures tend to solve large-data and complex-computation problems with faster CPUs and bigger RAM. This approach is suitable for transferring and processing relatively small amounts of data, and the processor-bound design also allows for highly complex computations.
Bus-constrained (bandwidth bottleneck): moving large amounts of data from storage (a SAN drive) to the processor depends on the bus bandwidth (typically 70-100 MB/s). Data processing speed is constrained by the speed of the fastest bus, typically 100 to 1,000 times slower than the CPU.
Shared-everything architecture: RAM, the I/O bus, and storage disks are shared among different tasks, putting severe limitations on the ability to perform parallel processing. Shared resources are not necessarily dedicated to large data or complex calculations.
11.
Barriers of Traditional Data Warehousing
• Loosely coupled (shared-everything) architecture.
• Barriers to performance: general-purpose server and storage; OLTP database.
• Barriers to efficiency: inefficient use of human resources; inefficient use of system resources (software installation and upgrades, system management, and data flow).
12.
Requirements for a New Approach!
Support failure
• Failures of components should result in a graceful decline of performance, not system failure.
• Failed components should be able to rejoin the system upon recovery.
Data and component recoverability
• Component failures should not affect the execution or the outcome of a job.
• Failure should not result in any loss of data.
• If a component fails, its work should be resumed by the functioning units of the system.
Scalability
• Adding load to the system should result in a graceful decline of performance, not system failure.
• Increasing system resources should support increased load capacity and performance agility.
13.
Massively Parallel Processing (MPP) Architecture
MPP architecture = share nothing = divide and conquer.
• MPP systems consist of very large numbers of processors; each processor has its own memory, backplane, and storage.
• The no-shared-resources approach of pure MPP systems allows nearly linear scalability.
• High availability is another advantage: when one node fails, another can take over.
DW appliances (using MPP)
• Distribute data onto dedicated disk storage units connected to each server in the appliance.
• Distribution allows the appliance to resolve a relational query by scanning data on each server in parallel.
• The divide-and-conquer approach delivers high performance and scales linearly as new servers are added to the architecture.
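The divide-and-conquer scan above can be sketched in miniature in Python. This is a hypothetical illustration, not Netezza code: each worker process stands in for a "node" that holds its own partition and scans it independently, so only small partial results travel back to be combined.

```python
# Sketch of a shared-nothing parallel scan: partition the data across
# "nodes", scan each partition independently, combine partial results.
from concurrent.futures import ProcessPoolExecutor

def scan_partition(partition):
    # Each node filters and aggregates only its local slice of the data.
    return sum(x for x in partition if x % 2 == 0)

if __name__ == "__main__":
    data = list(range(1_000))
    # Distribute rows round robin across 4 "nodes", as an appliance might.
    partitions = [data[i::4] for i in range(4)]
    with ProcessPoolExecutor(max_workers=4) as pool:
        partials = list(pool.map(scan_partition, partitions))
    print(sum(partials))  # same answer as a single serial scan
```

Adding nodes shrinks each partition, which is why this pattern scales nearly linearly: no node ever waits on another's memory, bus, or disk.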
14.
MPP as Clustered SMP: A Hybrid Approach
Small SMP clusters operate in parallel, sharing a storage area network and management structure. Each CPU within an SMP node shares RAM, the I/O bus, and the storage network. Although this approach allows a very high degree of parallelism, the resource sharing built into the architecture imposes bottlenecks on distributed computing. In particular, the I/O bus and storage, the slowest components in any architecture, are shared, and are therefore not suitable for handling terabytes of data flowing from storage to the SMP clusters.
15.
MPP Architecture
In Teradata's MPP architecture, processor-RAM-storage disk pairs ("nodes") operating in parallel divide the workload to execute queries over large sets of data. Each processor communicates with its associated disk drive to get raw data and perform calculations. One SMP host collects intermediate results and assembles the query response for delivery back to the requesting application. With no contention for resources between MPP nodes, this architecture does allow scalability to petascale database sizes. A major weakness, however, is that it requires significant movement of data from disks to processors for BI queries.
16.
Netezza: TALL CLAIMS!
• Purpose-built analytics engine: integrated database, server, and storage, with standard interfaces.
• Speed: 10-100x faster than traditional systems.
• Simplicity: minimal administration and tuning.
• Scalability: peta-scale user data capacity.
• Smart: high-performance advanced analytics.
• Allows efficient resource allocation, information sharing, and high availability.
17.
Netezza Approach: Asymmetric Massively Parallel Processing
[Diagram] Client/BI applications and local applications connect via ODBC, JDBC, OLE DB, and SQL/92/99 to the IBM Netezza data warehouse appliance (e.g. the 8150): RDBMS + server + storage in a single integrated system.
18.
Bringing the Query to the Data
• Moore's Law: over the history of computing hardware, the number of transistors on integrated circuits doubles approximately every two years.
• TGMLC: the Great Moore's Law Compensator, also referred to as bloat or Wirth's law, is the principle that successive generations of computer software acquire enough bloat to offset the performance gains predicted by Moore's law.
• Demand for I/O is outpacing CPU performance gains.
• In Netezza's AMPP architecture, most processing is handled by the massively parallel snippet processing units (SPUs), as early in the data flow as possible. This approach of "bringing the query to the data" eliminates extraneous traffic from storage to the CPU and the resulting delays.
19.
Netezza AMPP Architecture: Deep Dive
20.
Netezza AMPP Architecture: Deep Dive
Primary tier
Netezza's AMPP architecture is a two-tiered system designed to handle very large queries from multiple users. The first tier is a high-performance Linux SMP host; a second host is available for fully redundant, dual-host configurations. The host compiles queries received from applications and generates query execution plans. It then divides a query into a sequence of sub-tasks, or snippets, that can be executed in parallel, and distributes the snippets to the second tier for execution. The host returns the final results to the requesting application.
Secondary tier
The second tier consists of dozens to hundreds or thousands of snippet processing units (SPUs) operating in parallel. Each SPU is an intelligent query processing and storage node, and consists of a powerful commodity processor, dedicated memory, a disk drive, and a field-programmable disk controller with hard-wired logic to manage data flows and process queries at the disk level. The massively parallel, shared-nothing SPU blades provide the performance advantage of MPP.
Conclusion
Nearly all query processing is done at the SPU level, with each SPU operating on its portion of the database. All operations that lend themselves easily to parallel processing, including record operations, parsing, filtering, projecting, interlocking, and logging, are performed by the SPU nodes, significantly reducing the amount of data that must be moved within the system. Operations on sets of intermediate results, such as sorts, joins, and aggregates, are executed primarily on the SPUs, but can also be done on the host, depending on the processing cost and complexity of the operation.
21.
S-Blades or Bleeding Edge: Genie in a Bottle
[Diagram] The FPGA core uncompresses, restricts (with visibility checks), and projects; the CPU core handles the complex operations (joins, aggregations, etc.).

Example query:

  select DISTRICT, PRODUCTGRP, sum(NRX)
  from MTHLY_RX_TERR_DATA
  where MONTH = '20091201'
    and MARKET = 509123
    and SPECIALTY = 'GASTRO'

A compressed slice of table MTHLY_RX_TERR_DATA streams off disk; the FPGA uncompresses it, applies the where restrictions (MONTH = '20091201' and MARKET = 509123 and SPECIALTY = 'GASTRO'), and projects only DISTRICT, PRODUCTGRP, and NRX; the CPU then computes the sum(NRX) aggregation.
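The stage pipeline above can be sketched in Python. This is a hypothetical, in-memory stand-in (the sample rows, zlib compression, and stage names are invented for illustration), showing how uncompress, restrict, and project shrink the data before the CPU ever touches it.

```python
# Sketch of the S-Blade pipeline: uncompress -> restrict -> project -> aggregate.
import zlib, json
from collections import defaultdict

# A compressed "slice" of rows, standing in for a disk extent.
rows = [
    {"MONTH": "20091201", "MARKET": 509123, "SPECIALTY": "GASTRO",
     "DISTRICT": "NE", "PRODUCTGRP": "A", "NRX": 10},
    {"MONTH": "20091201", "MARKET": 509123, "SPECIALTY": "GASTRO",
     "DISTRICT": "NE", "PRODUCTGRP": "A", "NRX": 5},
    {"MONTH": "20091201", "MARKET": 999999, "SPECIALTY": "CARDIO",
     "DISTRICT": "SW", "PRODUCTGRP": "B", "NRX": 7},
]
slice_on_disk = zlib.compress(json.dumps(rows).encode())

# Stage 1 (FPGA): uncompress the slice as it streams off disk.
table = json.loads(zlib.decompress(slice_on_disk))
# Stage 2 (FPGA): restrict -- apply the WHERE clause as early as possible.
restricted = [r for r in table
              if r["MONTH"] == "20091201" and r["MARKET"] == 509123
              and r["SPECIALTY"] == "GASTRO"]
# Stage 3 (FPGA): project -- keep only the columns the query references.
projected = [(r["DISTRICT"], r["PRODUCTGRP"], r["NRX"]) for r in restricted]
# Stage 4 (CPU): aggregate -- sum(NRX) per (DISTRICT, PRODUCTGRP).
totals = defaultdict(int)
for district, prodgrp, nrx in projected:
    totals[(district, prodgrp)] += nrx
print(dict(totals))  # {('NE', 'A'): 15}
```

Only two narrow tuples reach the aggregation stage here, which is the point: the later stages see a fraction of the bytes read from disk.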
22.
Netezza Software Architecture: Salient Points
23.
Crossfire: SMP vs. MPP
24.
Netezza – Key Hardware Components
25.
Traditional Complexity vs. Netezza Simplicity
0. Oracle:
  CREATE DATABASE TEST
  LOGFILE 'E:\OraData\TEST\LOG1TEST.ORA' SIZE 2M,
          'E:\OraData\TEST\LOG2TEST.ORA' SIZE 2M,
          'E:\OraData\TEST\LOG3TEST.ORA' SIZE 2M,
          'E:\OraData\TEST\LOG4TEST.ORA' SIZE 2M,
          'E:\OraData\TEST\LOG5TEST.ORA' SIZE 2M
  EXTENT MANAGEMENT LOCAL MAXDATAFILES 100
  DATAFILE 'E:\OraData\TEST\SYS1TEST.ORA' SIZE 50M
  DEFAULT TEMPORARY TABLESPACE temp TEMPFILE 'E:\OraData\TEST\TEMP.ORA' SIZE 50M
  UNDO TABLESPACE undo DATAFILE 'E:\OraData\TEST\UNDO.ORA' SIZE 50M
  NOARCHIVELOG CHARACTER SET WE8ISO8859P1;
...and the stack beneath it:
1. Oracle table and indexes
2. Oracle tablespace
3. Oracle datafile
4. Veritas file
5. Veritas file system
6. Veritas striped logical volume
7. Veritas mirror/plex
8. Veritas sub-disk
9. SunOS raw device
10. Brocade SAN switch
11. EMC Symmetrix volume
12. EMC Symmetrix striped meta-volume
13. EMC Symmetrix hyper-volume
14. EMC Symmetrix remote volume (replication)
15. Days/weeks of planning meetings
Netezza: low (ZERO) touch:
  CREATE DATABASE my_db;
26.
Data Model Migration and Simplification
Oracle (existing DDL) – create table, logical model:
  CREATE TABLE MTHLY_RX_TERR_DATA (
    TERR_HIER_SK INTEGER, CAL_DT_SK INTEGER, CUST_INCTV_GEO_SK INTEGER,
    PROD_SK INTEGER, DIST_CHAN_SK INTEGER, PSCRBR_CUST_SK INTEGER,
    PLAN_CUST_SK INTEGER, SRC_SYS_SK INTEGER,
    NRX_COUNT NUMBER(15,3), XY_PROD_ID NUMBER(10), NRX_UNITS_QTY NUMBER(15,3),
    XY_TERR_ID NUMBER, NRX_DOL_AMT NUMBER(15,3), TRX_COUNT NUMBER(15,3),
    TRX_UNITS_QTY NUMBER(15,3), TRX_DOL_AMT NUMBER(15,3),
    ADS_ROW_LOAD_DT DATE, PSCRBR_XY_CUST_ID VARCHAR2(30 BYTE),
    PLAN_XY_CUST_ID VARCHAR2(30 BYTE), ADS_ROW_LAST_UPDT_DT DATE,
    PROD_LEVEL VARCHAR2(30 BYTE), MONTH_NUM INTEGER,
    XY_HCP_INCTV_ZIP_CD VARCHAR2(20 BYTE), IMS_PLAN_ID VARCHAR2(20 BYTE),
    MKT_SPEC_GROUP_ID VARCHAR2(20 BYTE), ALGN_TYPE_FLAG VARCHAR2(10 BYTE)
  )
Oracle tablespaces and partitions:
  TABLESPACE ADS1_DATA PCTUSED 0 PCTFREE 10 INITRANS 1 MAXTRANS 255 NOLOGGING
  PARTITION BY RANGE (MONTH_NUM)
  SUBPARTITION BY LIST (PROD_LEVEL)
  SUBPARTITION TEMPLATE (
    SUBPARTITION BRAND VALUES ('BRAND'),
    SUBPARTITION PRODGROUP VALUES ('PRODGROUP'),
    SUBPARTITION PFS VALUES ('PFS'),
    SUBPARTITION OTHERS VALUES (DEFAULT)
  )
  ( PARTITION P_DEFAULT_LL VALUES LESS THAN (1) NOLOGGING
    TABLESPACE ADS1_DATA PCTFREE 10 INITRANS 1 MAXTRANS 255
    STORAGE ( BUFFER_POOL DEFAULT ) ) . . .
Oracle indexes:
  CREATE INDEX MRT_PLAN_XY_CUST_ID ON MTHLY_RX_TERR_DATA (PLAN_XY_CUST_ID)
  NOLOGGING TABLESPACE ADS1_INDEX PCTFREE 10 INITRANS 2 MAXTRANS 255
  STORAGE ( INITIAL 64K MINEXTENTS 1 MAXEXTENTS 2147483645 PCTINCREASE 0
            BUFFER_POOL DEFAULT )
  PARALLEL ( DEGREE DEFAULT INSTANCES DEFAULT );
Netezza DDL – create table, logical model:
  CREATE TABLE MTHLY_RX_TERR_DATA (
    TERR_HIER_SK INTEGER, CAL_DT_SK INTEGER, CUST_INCTV_GEO_SK INTEGER,
    PROD_SK INTEGER, DIST_CHAN_SK INTEGER, PSCRBR_CUST_SK INTEGER,
    PLAN_CUST_SK INTEGER, SRC_SYS_SK INTEGER,
    NRX_COUNT NUMERIC(15,3), XY_PROD_ID BIGINT, NRX_UNITS_QTY NUMERIC(15,3),
    XY_TERR_ID BIGINT, NRX_DOL_AMT NUMERIC(15,3), TRX_COUNT NUMERIC(15,3),
    TRX_UNITS_QTY NUMERIC(15,3), TRX_DOL_AMT NUMERIC(15,3),
    ADS_ROW_LOAD_DT TIMESTAMP, PSCRBR_XY_CUST_ID VARCHAR(30),
    PLAN_XY_CUST_ID VARCHAR(30), ADS_ROW_LAST_UPDT_DT TIMESTAMP,
    PROD_LEVEL VARCHAR(30), MONTH_NUM INTEGER,
    XY_HCP_INCTV_ZIP_CD VARCHAR(20), IMS_PLAN_ID VARCHAR(20),
    MKT_SPEC_GROUP_ID VARCHAR(20), ALGN_TYPE_FLAG VARCHAR(10)
  )
  DISTRIBUTE ON (PSCRBR_XY_CUST_ID);
• Logical model only
• No indexes
• No physical tuning/admin
• Distribute data by columns or round robin
Netezza eliminates the indexes, their associated storage space, and the rebuild times currently required by Oracle, DB2, and SQL Server.
27.
The KISS Principle – Yes, It's That Simple!
• No dbspace/tablespace sizing and configuration
• No redo/physical/logical log sizing and configuration
• No page/block sizing and configuration for tables
• No extent sizing and configuration for tables
• No temp space allocation and monitoring
• No RAID level decisions for dbspaces
• No logical volume creation of files
• No integration of OS kernel recommendations
• No maintenance of OS recommended patch levels
• No JAD sessions to configure host/network/storage
• No storage administration
• No indexes and tuning
• No software installation
Resources become data managers instead of database administrators.
28.
Workshop – Part II
Introduction to HADOOP
Architectural Principles behind HADOOP
29.
The Specter of Big Data!
30.
Data Bottleneck
Modern data systems have to process far more data than has hitherto been possible: Facebook is estimated at more than 100 PB of data, eBay and Twitter at around 10 PB, and many organizations are generating data at a rate of terabytes per week.
Traditional distributed systems rely on the same old framework: faster CPUs and bigger RAM, plus SANs for data storage with finite bandwidth, from which data is copied to compute nodes at run time. This is fine for relatively small amounts of data; big volumes of data are the real bottleneck.
Moore's Law has held firm for nearly 40 years, but so has TGMLC (Wirth's law). Getting data to the CPUs is the real bottleneck: at a typical disk transfer rate of 75 MB/s, copying 100 GB of data to the CPU would take about 22 minutes, assuming no interruptions. Actual times will be worse, because most servers do not have 100 GB of RAM.
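The 22-minute figure above follows directly from the stated numbers; a quick back-of-envelope check (decimal units assumed):

```python
# Time to stream 100 GB from disk at a sustained rate of 75 MB/s.
size_mb = 100 * 1000           # 100 GB expressed in MB (decimal units)
rate_mb_per_s = 75             # typical sustained disk transfer rate
seconds = size_mb / rate_mb_per_s
print(round(seconds / 60, 1))  # -> 22.2 minutes, matching the slide
```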
31.
Requirements for a New Approach!
Support failure
• Failures of components should result in a graceful decline of performance, not system failure.
• Failed components should be able to rejoin the system upon recovery.
Data and component recoverability
• Component failures should not affect the execution or the outcome of a job.
• Failure should not result in any loss of data.
• If a component fails, its work should be resumed by the functioning units of the system.
Scalability
• Adding load to the system should result in a graceful decline of performance, not system failure.
• Increasing system resources should support increased load capacity and performance agility.
32.
Hadoop: There It Is!
• Hadoop is based on work done by Google in the late 1990s and early 2000s, specifically the papers describing the Google File System (GFS) and MapReduce.
• Hadoop is an open-source project overseen by the Apache Software Foundation, written entirely in Java.
• It is based on the same principles as data warehouse appliances: distribute data as it is initially stored in the system; processing happens where data is stored, with individual nodes working on the data local to them; nodes should talk to each other as little as possible (shared-nothing architecture).
• Data is replicated multiple times on the system for increased availability and reliability.
• Reliability and infinite scalability (not the case with Netezza or Teradata).
• Core concepts: the Hadoop Distributed File System (HDFS) and MapReduce.
• A set of machines running HDFS and MapReduce is known as a cluster; individual machines are known as nodes.
• The Hadoop ecosystem also includes many other projects: Pig, Hive, HBase, Flume, Oozie, etc.
33.
How It Works
HDFS
• HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster.
• When data is loaded into the system, it is split into blocks (64/128 MB) and distributed across multiple nodes in the cluster.
• Each block is replicated multiple times, typically three times, on different nodes.
MapReduce
• MapReduce is the system used to process data on the Hadoop cluster. It consists of two phases, map and reduce, with a shuffle-and-sort stage between them.
• Each map task operates on a discrete portion of the dataset.
• After all map (initial processing) tasks are complete, the MapReduce system distributes the intermediate data to the nodes running the reduce tasks.
• A master program allocates work to the nodes: tens, hundreds, or thousands of nodes work in parallel, each on its own part of the overall dataset.
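The block-splitting and replication arithmetic above is easy to make concrete. A hypothetical sketch (the 1 TB file size is invented for illustration; the 128 MB block size and 3x replication are the figures mentioned above):

```python
# How HDFS would carve up a file: block count and raw capacity consumed.
import math

file_size_mb = 1_000_000       # a 1 TB file, in MB (illustrative)
block_size_mb = 128            # one of the common HDFS block sizes
replication = 3                # typical replication factor

blocks = math.ceil(file_size_mb / block_size_mb)
stored_mb = file_size_mb * replication
print(blocks, stored_mb)       # 7813 blocks; 3 TB of raw capacity consumed
```

The 3x write amplification is the price of the availability and locality benefits described in the following slides.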
Data Locality
• Hadoop will perform a Map task on data stored locally on the node through HDFS. If this is not possible, the Map task will have to transfer data across the network as it processes that data.
• When the Map tasks are complete, data is transferred across the network to the Reducers. In general, all Map tasks have to communicate (send processed data) to all Reduce tasks.
• Reduce tasks may run on the same physical machines as Map tasks; however, there is no concept of data locality for Reducers.
• Reduce tasks cannot begin until all Map tasks have completed. In practice, Hadoop does start transmitting data to the Reduce tasks as individual Mappers finish; however, the Reduce tasks cannot begin processing until all intermediate data has been transferred and sorted.
Fault Tolerance
• If a node fails, the master will detect the failure and reassign its work to a different node in the system.
• Restarting a task does not require communication with the nodes working on other portions of the data.
• If a failed node restarts, it is automatically added back to the system.
• If a node appears to be running slowly, the master can redundantly execute another instance of the same task. This is known as Speculative Execution; the result from the first instance to finish will be used.
• This feature does not exist in data warehouse appliances like Netezza, Teradata, or Greenplum.
Hadoop Daemonology
• NameNode: holds the metadata for HDFS.
• Secondary NameNode: not a failover or hot standby for the NameNode; it performs housekeeping functions for the NameNode.
• DataNode: stores data blocks.
• JobTracker: manages and distributes MapReduce jobs.
• TaskTracker: manages and monitors individual MapReduce tasks.
• Each daemon runs in its own JVM.
• No single node on a real cluster will run all five daemons, although this is possible.
Daemons, in other words…
Nodes fall into two different categories:
• Master Nodes: run the NameNode, Secondary NameNode, and JobTracker daemons. Only one of each runs in a Hadoop cluster.
• Slave Nodes: run the DataNode and TaskTracker daemons. Each slave node runs both of these daemons.
Comparison Sheet
Workshop Part III -- Agenda

Using Netezza for Optimal Performance
• Distribution (including scenarios for selecting the right candidate)
• Grooming
• Generate Stats
• Zone Maps
• Materialized Views
• Interpreting Queries

Creating and Managing Data
• Databases and Tables
• Security
The Optimal Performance Bubble Chart
(Diagram: DISTRIBUTION, MVs, ZONE MAPS, GROOM, GEN STATS, CBT)
Distribution & Collocation – Two Peas in a Pod

Distribution
• A way of storing data in the data slices to enable distributed and parallel computing.
• A table exists once in the Netezza catalog, but physically the table has as many instances as there are SPUs.
• Queries are compiled into as many snippets as there are SPUs for processing. This is a Single Instruction, Multiple Data architecture.
• At any given moment the data we are seeking is only as far away as the slowest-performing SPU. We want all the SPUs working together on every query: the more SPUs are working, the faster the query will complete.

Collocation
• The distribution of two or more tables on the same Snippet Processor (SPU) based on the same value.
• If a fact and a dimension table are distributed on the same column, Netezza will hash the column values and store matching rows on the same SPU based on the hash value.
• The columns need not be surrogate keys; Netezza does not know or care if the column is a primary key in one table and a foreign key in another.
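Hash distribution and collocation can be illustrated with a toy sketch. The slice count and hash function here are stand-ins, not Netezza's internal hashing; the point is that equal key values always land on the same slice for both tables:

```python
NUM_SLICES = 4  # stand-in for the number of SPUs / data slices

def data_slice(key):
    """Map a distribution-key value to a data slice (toy stand-in hash)."""
    return hash(key) % NUM_SLICES

# Fact and dimension tables distributed on the same column (ptnt_sk):
encntr_fct = [(enc_sk, enc_sk % 10) for enc_sk in range(100)]  # (encntr_sk, ptnt_sk)
ptnt_dim = list(range(10))                                     # ptnt_sk values

# Matching ptnt_sk values hash to the same slice for both tables,
# so the join needs no data movement between SPUs: it is collocated.
collocated = all(data_slice(ptnt_sk) == data_slice(p)
                 for _, ptnt_sk in encntr_fct
                 for p in ptnt_dim if p == ptnt_sk)
print(collocated)  # True
```

This is why distributing a fact and its large dimension on the same join column avoids redistribution: the hash, not any key/constraint declaration, decides placement.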
Distribution – Syntactically Speaking

Random (Round Robin)
• CREATE TABLE <tablename> [ ( <column> [, … ] ) ] DISTRIBUTE ON RANDOM;

Column / Hash Based
• CREATE TABLE <tablename> [ ( <column> [, … ] ) ] DISTRIBUTE ON (<column> [, …]);

Random (Round Robin) example
• CREATE TABLE encntr_fct (encntr_sk bigint, encntr_cd varchar(20), ptnt_sk bigint, discharge_dt timestamp, encntr_dt timestamp);

Column / Hash Based example
• CREATE TABLE encntr_fct (encntr_sk bigint, encntr_cd varchar(20), ptnt_sk bigint, discharge_dt timestamp, encntr_dt timestamp) DISTRIBUTE ON (ptnt_sk);
Data & Process Skew: Bugs that don’t budge

Data Skew
• Occurs when data is not distributed evenly across all SPUs (data slices).
• Occurs when data in fact tables is distributed on low-cardinality dimensional keys or on columns with Boolean data types.

Process Skew
• Occurs when a table is distributed evenly among the SPUs, but the rows a query touches live on only a subset of the SPUs, so only those SPUs do the work.
• Occurs most often when a table is distributed on date-datatype columns.
• Data skew may also cause process skew.
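Why a low-cardinality distribution key skews data can be shown with a toy sketch (hypothetical slice count and row counts):

```python
from collections import Counter

NUM_SLICES = 8  # stand-in for the number of data slices

def slices_for(keys):
    """Count how many rows land on each slice for a given distribution key."""
    return Counter(hash(k) % NUM_SLICES for k in keys)

rows = 10_000
high_card = slices_for(range(rows))                   # e.g. encntr_sk: near-unique per row
low_card = slices_for([i % 2 for i in range(rows)])   # e.g. a Boolean flag

print(len(high_card))  # rows spread across all 8 slices
print(len(low_card))   # only 2 slices hold any data; the other 6 sit idle
```

With the Boolean-like key, two slices each hold 5,000 rows while six hold nothing, so every scan runs at the speed of the two overloaded slices.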
Redistribution & Broadcasting: Small is big, big is bad!

Broadcasting
• When one table of the join is very small, the Netezza optimizer will choose to broadcast that table instead of doing a redistribute. All SPUs send their records to the host for consolidation, and the full table is broadcast to all nodes.

Single Redistribution
• In some cases it may not be possible to distribute both tables on the relationship key; in this case Netezza will redistribute the needed columns to other disks (data slices). One table is distributed on the join key and the other is redistributed.

Double Redistribution
• If neither table of a join is distributed on the join key, a double redistribute is necessary: both tables are redistributed on the join key.
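A back-of-the-envelope model hints at why the optimizer broadcasts small tables: copying a tiny table to every SPU moves far fewer rows than rehashing a large fact table. This is my illustrative simplification, not Netezza's actual cost formula, which weighs many more factors:

```python
def movement_cost(small_rows, big_rows, num_spus):
    """Rows moved over the network under each join strategy (toy model)."""
    return {
        "broadcast": small_rows * num_spus,       # small table copied to every SPU
        "double_redistribute": small_rows + big_rows,  # both tables rehashed on the join key
    }

# A 1,000-row dimension joined to a 10M-row fact on a 32-SPU system:
costs = movement_cost(1_000, 10_000_000, 32)
cheapest = min(costs, key=costs.get)
print(cheapest)  # broadcast
```

Flip the sizes (two large tables, neither distributed on the join key) and the double redistribute becomes unavoidable, which is exactly the expensive case the next slide's do's and don'ts try to prevent.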
Distribution Keys – Do’s and Don’ts

Good Distribution = Maximum Collocated Processing
• Columns with high cardinality
• Integer columns
• Columns used frequently in query joins
• You may use up to 4 columns, but use as few as possible
• Use columns that can perform equi-joins across multiple fact tables
• Use CTAS for analytical tables, to reuse the distribution

Bad Distribution = Redistributions & Skew
• Boolean keys: will cause data skew
• Date datatypes: may cause data and process skew
• Surrogate keys of mini-dimension tables: low cardinality
• Do not use multiple surrogate keys of different dimension tables as the distribution key of the table
Scenario 1 – Distribution on Multiple Surrogate Keys

What would happen if the Encounter table were distributed on 3 columns: Provider SK, Patient SK, and Diagnosis SK?

(Diagram: Healthcare Data Model, Encounter Subject Conceptual Model: Encounter, Calendar, Organization, Provider, Clinical Focus Grps, ICD Codes, Patient, MS/APR DRGs, Problem Lists, Encounter Type, CPT HCPCS, Medication Lists, Allergies, Conditions)

Answer: Depending on the query, this will cause a massive amount of redistribution of all three dimension tables. While the Provider and Diagnosis dimension tables are typically smaller, the Patient dimension tends to be massive, and redistributing it can adversely impact query performance. If possible, restrict distribution to a single key.
Scenario 2: Distribution on a High-Cardinality Column

There are three big tables: Lab Results, Specimen Collection, and Encounter. Lab Results is distributed on Lab Result SK, Encounter on Encounter SK, and Specimen Collection on Specimen SK; each SK represents the unique record of its fact table. What will happen to a query joining all three of these tables?

(Diagram: Healthcare Data Model, Lab Result Subject Conceptual Model: Lab Results, Encounter, Organization, Provider, Clinical Order Type, Susceptibility, Patient, Lab Tests (LOINC), Specimen, Encounter Type, Calendar, Organism (SNOMED))

The distribution strategy here is based on a high-cardinality column of each fact table. However, it is not based on the join columns: the Lab Results table joins the Specimen Collection table on Specimen SK, and the Encounter table joins the Lab Results table on Encounter SK. This strategy will result in massive double redistribution of two of the fact tables; it could even result in disk joins.

Solution: Find the common connector between the three fact tables. Each fact table in this scenario has Encounter SK and Patient SK, so the distribution strategy should be based on these two columns.
Scenario 3: Distribution on a Date Key

What will happen to queries if the Encounter table is distributed on the Encounter (aka Contact) date key?

(Diagram: Healthcare Data Model, Encounter Subject Conceptual Model, as in Scenario 1)

The distribution strategy here is based on a low-cardinality column: a Calendar dimension with 200 years' worth of days is a table of no more than about 100K records. The problem is that if you query 2–3 years' worth of data, you will not be using the SPUs on which the rest of the historical data is distributed and stored. Not only will this result in data and process skew, it will also cause redistribution of large dimension tables (like Patient) or other large fact tables, because distribution is not based on the join columns.

Solution: NEVER distribute on date-style columns (including a Date SK of integer datatype) or on Boolean-datatype columns.
Scenario 4: Correct Distribution is not enough!

Let’s hypothesize that your distribution strategy is theoretically justifiable: you have chosen to distribute all your fact tables by Patient or by Encounter to avert massive, repeated redistributions. But what is happening in your queries? Are you informing Netezza to use that distribution? What if you are still joining the Lab Results and Specimen Collection tables on Specimen SK, and the Encounter and Specimen Collection tables on Encounter SK? Will this join scheme work?

Encounter
  Join Specimen_Collection on Encounter_sk
  Join Lab_Results on Specimen_Sk

(Diagram: Lab Result and Pharmacy Order & Administration subject areas, each joined to Encounter, Organization, Provider, Patient, Encounter Type, and Calendar dimensions)
Scenario 4: Distribution Regained!

Two common scenarios in which we forget to “use” distribution:

1. The join clause must include the distribution key. In the scenario above, rewrite the join clause thus:

Encounter
  Join Specimen_Collection on Encounter_SK
  Join Lab_Results on Specimen_Sk and Encounter_Sk

Notice the redundancy in the last join: the Encounter_Sk predicate is logically redundant, but it lets Netezza process the Lab Results join collocated on the distribution key.

2. The PARTITION BY clause of a windowed SQL function must include the distribution key. Let’s count the organism acquisition rate per hospital inpatient encounter in the Lab Results table, which is distributed on Patient SK (surrogate key):

count(distinct organism_sk) over (partition by encounter_sk)

Rewrite this function to redundantly include Patient SK:

count(distinct organism_sk) over (partition by encounter_sk, patient_sk)
Scenario 5: In Search of Distribution Lost!

AHRQ Measures
• Typically the measure is derived per encounter; therefore Encounter SK (or encounter number) serves as a justified candidate for distribution.

ACO Measures
• Typically the measure tests whether the patient’s health assessment over different time periods passes or fails the CMS-set criteria; Patient SK (or patient number) serves as an ideal candidate for distribution.

ANTIBIOGRAMS
• Typically antibiograms measure resistance to antibiotics for a hospital population. Specimen SK serves as an ideal candidate for the distribution key because organisms develop in the collected specimen, so each result is associated with the specimens collected from the patient.
Who is the ideal candidate?
• ENCOUNTER
• PATIENT
• SPECIMEN
• DIAGNOSIS

When in doubt, use the candidate with the lowest cardinality among the available candidates, provided that all candidates have very high cardinality.

(Diagram: one PATIENT with ENCOUNTER 1–3; the encounters contain DIAGNOSES, DRUGS, SPECIMENS, LAB TESTS, and PROCEDURES)
Zone Maps – Where Ignorance is Bliss!
• An extent is the smallest unit of disk allocation (3 MB of disk space).
• A zone map is an internal mapping structure recording the range (min and max) of values within each extent.
• Zone maps are internally generated system tables.
• Zone maps reduce the disk scan operations required to retrieve data by eliminating extents whose records fall outside the range of a query’s WHERE clause.
• When a table is dropped or truncated, the system updates the zone map and removes all records associated with the table.
• Netezza automatically creates new zone maps and refreshes existing ones on each data slice when you insert, update, or load data into tables, or when you generate statistics.
• During GENERATE STATISTICS execution, zone maps are disabled.
• In addition, zone maps for the following data types are created if columns of these types are used as the ORDER BY restriction of a materialized view or as the organizing key of a CBT:
  o Char datatypes: all sizes, but only the first 8 bytes are used in the zone map
  o Numeric: all sizes up to numeric(18)
  o Float, Double, Boolean
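The "eliminate extents outside the WHERE range" behavior can be sketched with toy extents (real zone maps live in internal system tables; the 3 MB extents here are just integer lists):

```python
def build_zone_map(extents):
    """Record the (min, max) of the zone-mapped column for each extent."""
    return [(min(e), max(e)) for e in extents]

def extents_to_scan(zone_map, lo, hi):
    """Zone maps only say where NOT to look: skip any extent whose
    (min, max) range cannot overlap the WHERE-clause range [lo, hi]."""
    return [i for i, (mn, mx) in enumerate(zone_map)
            if not (mx < lo or mn > hi)]

# Four toy extents of an integer column (e.g. a date surrogate key):
extents = [[1, 5, 9], [10, 14, 19], [20, 25, 29], [30, 35, 39]]
zmap = build_zone_map(extents)

# WHERE col BETWEEN 12 AND 22: only extents 1 and 2 need to be read
print(extents_to_scan(zmap, 12, 22))  # [1, 2]
```

Note the pruning only helps when values cluster by extent; this is why sorted materialized views and CBTs "enhance" zone maps in the later slides.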
CBT – Expeditious Organization?
• A CBT clusters records based on a key and stores them in the same or nearby extents. Cluster Based Tables are a way to rearrange data within all the extents of the same data slice, according to the values of the columns picked.
• The syntax to create a Cluster Based Table uses the ORGANIZE ON keyword:
SYNTAX: CREATE TABLE ... ORGANIZE ON (<column>, <column>, ... <column>);
• CBTs can be changed with an ALTER statement; use the NONE keyword to turn a table back into a non-CBT table:
SYNTAX: ALTER TABLE ... ORGANIZE ON [ NONE | (<column>, ...) ];
• Organize on the columns most often used in WHERE conditions and in joins; this expedites data selection and balances query performance across the organizing columns.
• May impact query performance and compression ratios.
• Materialized views are not allowed on CBTs.
• Clustering is “defragmentation” of data: it must be manually initiated with the GROOM command.
Groom and Generate Stats: Always!
• Although the Netezza system automatically maintains certain statistics, you should run the GENERATE STATISTICS and GROOM TABLE commands periodically.
• GENERATE STATISTICS gathers statistics for each table column (duplicate values, minimum and maximum values, null values, unique values) and updates the system catalog tables.
• GENERATE STATISTICS should especially be run when a table structure is altered or a table is renamed. It forces the Netezza system to update the zone maps and thereby improve query performance.
• The difference between GENERATE STATISTICS and GENERATE EXPRESS STATISTICS lies in how column uniqueness is calculated: express statistics calculates estimated dispersion values based on a sampling of the rows in the table.
• As part of your routine database maintenance, plan to recover the disk space occupied by outdated or deleted rows. In normal Netezza operation, an update or delete of a table row does not remove the old version of the row; this benefits multi-version concurrency control. Over time, however, the outdated or deleted tuples are of no interest to any transaction, and you can reclaim the space they occupy with the SQL GROOM TABLE command.
• GROOM commands typically run longer on Cluster Based Tables (tables created with the ORGANIZE ON clause).

Syntax:
Generate statistics on table <table name>;
Generate express statistics on table <table name>;
Groom: Syntactically Speaking
• To migrate data for a versioned table: GROOM TABLE <table_name> VERSIONS;
• To reclaim deleted records in a table: GROOM TABLE <table_name> RECORDS ALL;
• To identify data pages that contain only deleted records, and to reclaim extents that become empty as a result: GROOM TABLE <table_name> PAGES ALL;
• To organize data that is not already organized in a clustered base table: GROOM TABLE <table_name> RECORDS READY;

SYNTAX: GROOM TABLE <name> <mode-choice>
<mode-choice> := RECORDS READY | RECORDS ALL | PAGES ALL | PAGES START | VERSIONS
Materialized Views – New York-Style Thin Slices!!!
• A materialized view is a sorted projection of one and only one table.
• MVs reduce the width of the data being scanned by creating a thin version of the table based on frequently queried columns. This selection of frequently queried columns is called a projection.
• MVs are typically sorted like the base table, but can be ordered on columns different from the base table. This reduces table scans through enhanced zone maps, and it improves query performance when the sorted columns appear in the WHERE clause of the query.
• Essentially, MVs improve performance by reducing the amount of data transferred from disk to CPU and RAM.
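The "thin slice" saving is simple arithmetic: a full scan reads rows × record width, so a narrow projection cuts the bytes read proportionally. The row counts and widths below are illustrative numbers of my own, not measurements:

```python
def scan_bytes(rows, row_width_bytes):
    """Bytes read by a full table scan: rows x record width."""
    return rows * row_width_bytes

rows = 500_000_000                # a big fact table
base = scan_bytes(rows, 400)      # wide base table, ~400-byte records
mv = scan_bytes(rows, 24)         # 3-column MV of the frequently queried columns

print(base // mv)  # 16: the MV scan reads ~16x less data from disk
```

The row count cancels out entirely, which matches the later best-practice slide: Netezza cares about the width of the record it scans, not whether there are millions or billions of them.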
MVs – What’s under the hood?
Materialized views are user defined.
Materialized views are automatically managed:
o The query planner/optimizer on the NZ host decides whether to use the base table or the MV based on factors like least scanning cost: the sort order required by the query, and the columns projected and used in WHERE clauses.
o Adding MVs can be useful if the Netezza database is used by multiple applications, especially when some applications (ETL loaders) write into the database and others (BI apps) read from it.
Materialized views are automatically maintained:
o MVs are automatically maintained on LOAD, INSERT, UPDATE, DELETE, and TRUNCATE TABLE.
o REFRESH on a materialized view refreshes the sort order of the data, not the data itself.
o The system automatically adds a column to the MV recording where in the base table each record originated.
o LOAD, INSERT, and UPDATE append unsorted rows, which may degrade performance over time.
o For large table modifications or batch jobs (LOAD, INSERT, or UPDATE), it is recommended to SUSPEND a materialized view before the batch operation and REFRESH it upon completion of the batch job.
Thou Shall Not! in Materialized Views

Base table
• A materialized view can be based on only one base table.
• A materialized view cannot be created on external tables, temporary tables, system tables, or cluster-based tables.

SELECT clause
• Expressions are not allowed in the projected (SELECT clause) columns.
• A materialized view must have at least one column in the SELECT clause.

JOIN clause
• No JOINs are allowed in the query of a materialized view.

WHERE clause
• No WHERE clause is allowed in the query of a materialized view.

ORDER BY clause
• Columns in the ORDER BY clause must also be specified in the SELECT clause.
• The ORDER BY clause cannot be descending.
Best Practices for Materialized Views
Thin slices imply as few columns as possible:
o Tables with 100–150 columns will not perform as well during table scans because of the width of the data.
o Netezza does not care about the number of records (hundreds of millions or billions); it cares about the width of the record it scans before projection.
o So, create materialized views on frequently queried columns.
o Ensure that you use ORDER BY on the most restrictive column.
o Create a few different materialized views for each fact table to accommodate different querying scenarios.
o Set an acceptable threshold percentage of unsorted data in a materialized view.
o Periodically (weekly, if you run nightly updates and inserts) manually refresh the materialized views to re-sort the unsorted data; this will benefit performance.
o Date-datatype columns are ideal for ORDER BY, but depending on the query filters, Booleans may also be an appropriate target.
o Remember: unlike Cluster Based Tables, materialized views DO incur storage overhead.
Traditionally Speaking – Not what you thought, mate!
• If an ORDER BY clause is not specified in the DDL of a materialized view, it inherits the natural ordering of the base table.
• A materialized view has the same distribution as its base table.

SYNTAX:
CREATE MATERIALIZED VIEW <view_name> AS
SELECT <column, column, ...> FROM <base_table>
[ORDER BY <column, column, ...>];

SYNTAX: ALTER VIEWS ON <table> MATERIALIZE {SUSPEND | REFRESH};

• MVs should be suspended before ETL jobs or large table modifications, and refreshed after the large data operations complete.
• Since unsorted data is appended after data operations, it is important to set an appropriate refresh threshold for re-sorting the data:

SYNTAX: SET SYSTEM DEFAULT MATERIALIZE THRESHOLD <%>;
SYNTAX: SET SYSTEM DEFAULT MATERIALIZE REFRESH THRESHOLD TO <NUMBER>;
Sequences: Random Uniqueness
• A sequence is a database object from which multiple users can generate unique integers. Sequences are mostly used for surrogate-key columns in dimensions.
• These key values are meaningless from an end-user perspective and are not guaranteed to be contiguous in insert order. This is because each SPU caches sequence values for performance, to eliminate unnecessary traffic between the host and the SPU on each next-value call.
• The NPS system flushes the cache when: a) the NPS system is stopped; b) the system or a SPU crashes; c) someone issues an ALTER SEQUENCE command. Unused values in the cache are no longer available once the system flushes the cache.

SYNTAX: CREATE SEQUENCE <sequence name> AS <data type> [options];

Retrieving sequence values:
SELECT NEXT VALUE FOR <sequence name>;
SELECT NEXT <integer> VALUES FOR <sequence name>;
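Why cached sequences stay unique but leave gaps can be sketched with a toy model (the cache size and host/SPU split are hypothetical stand-ins for NPS internals):

```python
class CachedSequence:
    """Toy model of per-SPU sequence-value caching."""
    def __init__(self, cache_size=10):
        self.cache_size = cache_size
        self.next_block = 1   # next uncached value held by the "host"
        self.cache = []       # values cached on one "SPU"

    def next_value(self):
        if not self.cache:    # fetch a whole block from the host at once
            self.cache = list(range(self.next_block,
                                    self.next_block + self.cache_size))
            self.next_block += self.cache_size
        return self.cache.pop(0)

    def flush(self):
        """System stop, SPU crash, or ALTER SEQUENCE: unused cached values are lost."""
        self.cache = []

seq = CachedSequence(cache_size=10)
first = [seq.next_value() for _ in range(3)]  # [1, 2, 3]
seq.flush()                                   # cached values 4..10 are discarded
after = seq.next_value()                      # 11: still unique, but a gap remains
print(first, after)
```

Every value handed out is unique, but the discarded block 4–10 is never reissued, which is exactly why surrogate keys from sequences must not be assumed contiguous.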
Partitioning <> Distribution
• A partition is a division of a database table. Table partitioning is normally done for managing space, performance, and availability.
• Partitioning a table leads to distribution of the table onto multiple disks if they are available; however, partitioning does not guarantee or even attempt collocated processing. The key difference between partitioning (in a traditional RDBMS) and distribution (in a DW appliance) is the possibility of collocated data placement. Even if you partition two different tables on the same partition key, there is no guarantee that the database will store them on the same disk, let alone process them on the same node.
• Typical partitioning is horizontal in nature: a single logical table is physically divided into multiple physical tables on the hard disk based on the row value of a column. For example, if you partition an Encounter fact table on Encounter Date (by week), you create a new partition of the table for each new week.
• Vertical partitioning divides a table based on columns rather than row values. In Netezza there is no way to distribute different table columns to different node/disk units, and it is deceptive to think that Netezza materialized views are the same as vertical partitions.
• Partitioning is useful for loading data but not for retrieving large volumes of it. In a traditional RDBMS, partitioning for performance can involve contradictory practices: partition pruning (partitioning to avoid table scans, typically focused on the WHERE clause of the query) and partition-wise joins (partitioning on the columns used for joining tables).
• Partitioning for availability is a false promise for data warehouses: if one partition of table data is not available, it may adversely impact the query output. This is only useful in OLTP scenarios. Data distribution in Netezza and Hadoop involves redundant data storage (data is stored two or three times on different nodes), so a single node going down does not impact query output.
Indexing <> Zone Maps
• Superficially, Netezza zone maps and traditional RDBMS indexes both aid lookup performance. An index consists of column values from one table held in a data structure that is stored and managed separately on disk.
• Different types of indexes:
  o B-Tree index: stores sorted values (date and integer columns); useful for range searches
  o Hash index: very useful for lookups on textual columns
  o R-Tree index: used in geospatial searches
  o Bitmap index: used on columns with Boolean data types
• Traditional RDBMSs automatically create an index for all surrogate-key columns (primary/foreign keys) and all columns with a constraint. Zone maps are created automatically on all integer and date columns in Netezza.
• Indexes slow down data load operations; for data warehouses it is recommended that data be loaded before indexes are created. Zone maps have no impact on data load operations.
• It is recommended that before index structures are created on big tables, the developer create temporary table spaces (for sort operations) in the database to reduce resource contention. The concept of table spaces does not exist in Netezza.
• A database can use indexes more effectively when it has statistical information about the tables in its queries, so DBAs must frequently gather statistics to improve the performance of data retrieval and load queries. Netezza collects Just-in-Time (JIT) statistics on data being loaded or manipulated in the system. It also provides the ability to organize data into nearby extents to reduce disk I/O.
• The most essential difference: indexes are pointers that store the address of where data is located on disk, whereas zone maps are location-ignorant. They know only the min and max values of the columns per extent, and therefore zone maps only know WHERE NOT TO LOOK.
Bad Performance – There is always a reason!
• Data skew & process skew
• Large table broadcasts & large table redistributions
• Very large table (disk hash) joins, expression-based joins, bad join-key data types
• Zone maps are not invoked: WHERE clauses on non-zone-mappable columns/subqueries
• Unnecessary use of CBTs
• Poorly written SQL
• Limits on concurrency
Understand the Query!
The Netezza service that determines the best execution plan is called the Optimizer. It considers:
• Data distribution and organization
• Join orders (join algorithms like Hash, Merge Sort, Nested Loop)
• Sorting, grouping, and restricting sets at different times/parts of the query
The execution (query) plan defines exactly what the database does to return the query results. The Optimizer depends on accurate, up-to-date statistics to create an optimal query plan:
• Number of rows
• Min and max values (date and number data types)
• Frequency and dispersion
• Number of extents in the table, and in the weakest data slice (the SPU with the most data)
Netezza’s cost-based optimizer takes the FPGA capabilities into account and determines:
• The best methods for scan operations
• The selection of join methods and join order
• Data movement between SPUs: redistribute and broadcast
Optimizer is a tough Editor!
The Optimizer may rewrite queries to improve performance:
• Push-down of tables to subqueries
• Pull-up or de-correlation of subqueries
• Rewriting of expressions
All query tasks are broken down into snippets that are executed in parallel. Examples of snippet operations include scans, joins, filters, sorts, aggregations, etc.
The Netezza snippet scheduler determines which snippet to add to the current workload based on session priority, estimated memory consumption vs. available memory, and estimated SPU disk time.
Syntactically Speaking – Query Plans
Netezza provides three different ways to display query plans:
• EXPLAIN VERBOSE provides detailed cost estimates per SPU and operation
• Text plans provide a natural-language description of the plan
• HTML plans provide a graphical representation of the query plan

SYNTAX: EXPLAIN VERBOSE <QUERY>
SYNTAX: EXPLAIN PLANTEXT <QUERY>
SYNTAX: EXPLAIN PLANGRAPH <QUERY>
Example Query

select fcy.fcy_nm as facility
     , cdr.mo_of_yr_abbr as month
     , (case when enc.ptnt_cl_cd = 'I' then 'INPATIENT' else 'OUTPATIENT' end) as patient_class
     , avg(los_cnt) as length_of_stay
from encntr as enc
inner join cdr on enc.dschrg_cdr_dk = cdr.cdr_dk
inner join fcy_demog_ref as fcy on enc.fcy_num = fcy.fcy_num
where cdr.yr_num = 2012
group by fcy.fcy_nm, cdr.mo_of_yr_abbr, cdr.mo_and_yr_num, enc.ptnt_cl_cd
order by cdr.mo_and_yr_num

• The query joins three tables: encounter, calendar, and hospital facility.
• The tables are distributed on Facility and Encounter Num.
• The query has grouping, sorting, and filtering operations.
EXPLAIN VERBOSE – Detailed Query Plan
• Restrictions correspond to the WHERE clause of the query.
• Projections correspond to the SELECT clause of the query.
• Join types like MergeSort or disk (multi-node) HASH joins can have a big impact.

Node 1. [SPU Sequential Scan table "fcy_demog_ref" as "fcy" {}]
    -- Estimated Rows = 23, Width = 36, Cost = 0.0 .. 0.0, Conf = 100.0
    Projections: 1:fcy.fcy_nm 2:fcy.fcy_num
    [SPU Broadcast] [HashIt for Join]
Node 2. [SPU Sequential Scan table "cdr" {}]
    -- Estimated Rows = 364, Width = 15, Cost = 0.0 .. 2.1, Conf = 80.0
    Restrictions: (cdr.yr_num = 2012)
    Projections: 1:cdr.mo_of_yr_abbr 2:cdr.mo_and_yr_num 3:cdr.cdr_dk
    Cardinality: cdr.cdr_dk 364 (Adjusted) cdr.mo_and_yr_num 364 (Adjusted)
    [SPU Broadcast] [HashIt for Join]
Node 3. [SPU Sequential Scan table "encntr" as "enc" {(enc.encntr_num),(enc.fcy_num)}]
    -- Estimated Rows = 11155981, Width = 23, Cost = 0.0 .. 804.7, Conf = 100.0
    Projections: 1:enc.ptnt_cl_cd 2:enc.los_cnt 3:enc.dschrg_cdr_dk 4:enc.fcy_num
HTML Plan Graphs
• Plan graphs make it easy to see join orders. In this case the calendar table is broadcast after the restriction to year 2012; the facility table is also broadcast (23 rows) without any restrictions or filters.
• The first join occurs between the Encounter and Calendar tables, followed by the join with the Facility table.
• The result set is then grouped by patient class, facility, month, etc.; then the aggregation step takes place; the last step is the sort.
• Notice how Netezza provides statistics for each stage: rows returned, the size of the data returned, the width of the record, etc.
Use Explain Plans for Long-Running Queries
• For long-running queries, use the EXPLAIN functionality to identify possible problems.
• Analyze the query plan: do the cost estimates seem reasonable?
• Is the system performing redistributions or broadcasts? Could you prevent this with different distribution keys? If the system broadcasts tables, are they small? You may need to generate statistics.
• Review the join order: is the biggest table scanned last?
• Run the query again after you have updated the statistics and changed the distribution, then review the changes in the query plan.
• Other options include grooming tables that have had many deletions, creating materialized views, or using cluster-based tables.
73.
Creating & Managing Data
• Databases
• Tables
• Users
• Groups
• Privileges
• Row-Level Data Authorization
74.
Creating & Managing Databases
CREATE — Databases can be created with a SQL tool, NzAdmin, or nzsql (Unix utility). The privilege to create databases must be granted, or you must be an admin.
Syntax: CREATE DATABASE database_name;
Example: create database my_db;
ALTER — You can rename a database or change its ownership.
Syntax: ALTER DATABASE db_name [RENAME|OWNER] TO name;
Example: alter database my_db rename to clinical_db;
Example: alter database clinical_db owner to jdoe;
DROP — You can drop a database, if the privilege has been granted to you.
Syntax: DROP DATABASE database_name;
Example: drop database clinical_db;
75.
Creating Tables
A user must be an admin or must have object privileges in order to create a table.
CREATE TABLE table_name (
  column_name data_type,
  column_name data_type,
  column_name data_type)
DISTRIBUTE ON (column_1, column_2);
You may also create a temporary table with a slight modification to the syntax:
CREATE [TEMPORARY | TEMP] TABLE table_name (column_name data_type [,…]);
Limitations: 1600 columns; column size <= 64 KB and row size <= 65,535 bytes.
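A concrete instance of the syntax above (a sketch; the table and column names are illustrative):

```sql
CREATE TABLE encntr_fct (
    encntr_num    VARCHAR(20) NOT NULL,  -- encounter number
    fcy_num       INTEGER     NOT NULL,  -- facility number
    dschrg_cdr_dk INTEGER,               -- discharge calendar key
    los_cnt       INTEGER                -- length of stay
)
DISTRIBUTE ON (encntr_num);
```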
76.
Column Data Types
Use the smallest integer data type necessary. Do not use approximate numeric data types for distribution columns, join columns, or columns that require mathematical operations (sum, avg, etc.). Floating-point data types have performance implications: Netezza will not perform hash joins on floating-point columns; instead it must perform a slower sort-merge join. Use date, time, and timestamp types for consistency and validation.
77.
Constraints
Constraints are rules every record/row in a table must satisfy at the time of load or update operations. Netezza enforces NOT NULL and DEFAULT constraints. Netezza will not enforce referential integrity constraints for reasons of performance, but specifying them may be helpful for integration with third-party tools.
CREATE TABLE table_name (
  column_name data_type NOT NULL,
  column_name data_type NOT NULL,
  column_name data_type NOT NULL DEFAULT value,
  PRIMARY KEY (column_name, column_name),
  UNIQUE (column));
78.
CTAS – The Best Tool for Analytics
A CREATE TABLE AS (CTAS) table is just a regular table. CTAS is an efficient method to create new tables from existing tables without a costly reload of data, and is highly recommended for changing the distribution of a table without a data reload. You may also create temporary, session-specific CTAS tables for your analytical needs.
Syntax: CREATE TABLE table_name AS <select clause>;
Syntax: CREATE TABLE table_name AS <select clause> DISTRIBUTE ON (column_name);
Syntax: CREATE [TEMPORARY | TEMP] TABLE table_name AS <select clause>;
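For example, a typical CTAS workflow for redistributing an existing table on a new key (a sketch; the names are illustrative):

```sql
-- Rebuild the fact table on a better distribution key.
CREATE TABLE encntr_fct_new AS
SELECT * FROM encntr_fct
DISTRIBUTE ON (encntr_num);

DROP TABLE encntr_fct;
ALTER TABLE encntr_fct_new RENAME TO encntr_fct;
GENERATE STATISTICS ON encntr_fct;  -- refresh statistics on the rebuilt table
```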
79.
Managing Tables
Tables may be truncated to remove all rows but retain the table structure or definition; truncating also removes the zone maps. Tables can be dropped to remove them from the database. You can rename a table or change its ownership.
Syntax: TRUNCATE TABLE table_name;
Example: TRUNCATE TABLE encntr_fct;
Syntax: DROP TABLE table_name;
Example: DROP TABLE encntr_fct;
Syntax: ALTER TABLE table_name [RENAME|OWNER] TO name;
Example: ALTER TABLE encntr_fct RENAME TO encounter_fact;
Example: ALTER TABLE encounter_fact OWNER TO jdoe;
80.
Altering Table Structure
You can add, rename, or drop columns in a table. When you drop or rename a column, Netezza performs versioning of the dropped or renamed column. Run GROOM TABLE and GENERATE STATISTICS immediately after the table is altered.
Syntax: ALTER TABLE table_name [ADD|DROP] COLUMN column_name datatype;
Example: ALTER TABLE encntr_fct ADD COLUMN encntr_num varchar(20);
Example: ALTER TABLE encntr_fct DROP COLUMN encntr_num;
Syntax: ALTER TABLE table_name RENAME COLUMN column_name TO new_column_name;
Example: ALTER TABLE encntr_fct RENAME COLUMN encntr_num TO encntr_nbr;
81.
Securing Netezza
Appliance Management
• Manage SPUs and data slices
• Manage backups and restores (disaster recovery)
• Manage host statistics
• Manage alerts and event rules
Database Management
• Creator or consumer of data
• Application service account or actual user
• Level of privileges to modify a database or objects
• Capability to assign or revoke privileges
Data Authorization
• Usage of columns in a table
• Usage of rows in a table
• Usage of database objects
• Usage of databases
82.
Securing Netezza II
Users: Admin, individual users
Groups: PUBLIC, admin-defined groups
Privileges: Admin and Object; Local vs. Global scope
83.
Managing Users
The default user is ADMIN, the Netezza super-user, which has all object permissions. All users (the ADMIN user plus other users) are members of the system group PUBLIC. User, group, and database names must be unique – they cannot overlap. User names cannot be greater than 128 characters in length. Users can be created using any SQL (ODBC/JDBC) tool, the nzsql utility on Unix-based platforms, or the NzAdmin utility. Users may be altered or dropped.
Syntax: CREATE USER user_name WITH [options];
Example: CREATE USER jdoe WITH password 'jenesaispas';
Syntax: ALTER USER user_name WITH [options];
Example: alter USER jdoe WITH password 'idonotknow';
Syntax: DROP USER user_name;
84.
Managing Users – Options
RENAME TO <newname> |
RESET ACCOUNT |
PASSWORD {'<pw>'|NULL} |
EXPIRE PASSWORD |
PASSWORDEXPIRY <days> |
AUTH {LOCAL|DEFAULT} |
SYSID <userid> |
IN GROUP <usergrp>[,<usergrp>…] |
IN RESOURCEGROUP <rsg> |
VALID UNTIL '<valid_date>' |
DEFPRIORITY {CRITICAL|HIGH|NORMAL|LOW|NONE} |
MAXPRIORITY {CRITICAL|HIGH|NORMAL|LOW|NONE} |
ROWSETLIMIT <rslimit> |
SESSIONTIMEOUT <sessiontimeout> |
QUERYTIMEOUT <querytimeout> |
CONCURRENT SESSIONS <concsessions> |
SECURITY LABEL {'<seclabel>|PUBLIC::'} |
AUDIT CATEGORY {NONE|'<category>[,<category>…]'} |
COLLECT HISTORY {ON|OFF|DEFAULT} |
ALLOW CROSS JOIN {TRUE|FALSE|NULL} |
ACCESS TIME {ALL|DEFAULT|(<access-time>[,<access-time>…])}
85.
Managing Groups
Groups allow database administrators to group users together based on their functions, departments, or usage activities. Groups are the most efficient method to control user permissions. Netezza comes with a predefined group called PUBLIC. Users are automatically added, upon creation, to the group PUBLIC; users cannot be dropped or deleted from the group PUBLIC, and the PUBLIC group cannot be deleted. Permissions granted to a group are automatically inherited by its users. User, group, and database names must be unique – they cannot overlap. Group names cannot be greater than 128 characters in length.
Syntax: CREATE GROUP group_name WITH [options];
Example: create group qlty_of_care;
Example: create group qlty_of_care with user jdoe, johnpublic;
86.
Managing Groups II
Groups may be altered or dropped.
Syntax: ALTER GROUP name { {ADD | DROP} USER <user>[,<user>…] | OWNER TO <user> | RENAME TO <new_group_name> | WITH <clause> [<clause>…] }
Example: alter group qlty_of_care add user jjohnson;
Example: alter group qlty_of_care rename to quality;
Example: alter group quality drop user jdoe, jjohnson;
Syntax: DROP GROUP group_name;
Example: drop group quality;
87.
Managing Groups – Options
DEFPRIORITY {CRITICAL|HIGH|NORMAL|LOW|NONE} |
MAXPRIORITY {CRITICAL|HIGH|NORMAL|LOW|NONE} |
ROWSETLIMIT <rslimit> |
SESSIONTIMEOUT <sessiontimeout> |
QUERYTIMEOUT <querytimeout> |
CONCURRENT SESSIONS <concsessions> |
RESOURCE MINIMUM <min_percent> |
RESOURCE MAXIMUM <max_percent> |
JOB MAXIMUM <jobmax> |
COLLECT HISTORY {ON|OFF|DEFAULT} |
ALLOW CROSS JOIN {TRUE|FALSE|NULL} |
PASSWORDEXPIRY <days> |
ACCESS TIME {ALL|DEFAULT|(<access_time>[,<access_time>…])}
88.
With Power Comes Responsibility
Scope of Privileges
• Global – applicable to the entire system
• Local – applicable to the database within which the privileges are recorded
Admin Privileges
• Control creation of objects and system administration
• Some privileges are global in scope, regardless of the current database
Object Privileges
• Control access to specific objects
• Privileges granted in the SYSTEM database are global in scope
• Privileges granted within a specific database are local to that database
89.
OBJECT Privileges
Privileges can be granted on individual objects or object classes. Object classes include:
o DATABASE
o TABLE
o VIEW
o FUNCTION
o SEQUENCE
o SYNONYM
o USER
o GROUP
The presence of the keyword ON in the GRANT statement implies object privileges. LIST is a special privilege:
o You can only see an object if you have list privilege on it.
o You can only connect to a database when you have list privilege on it.
o You can see users' activity only when you have list rights on both the user and the database they are connected to.
90.
OBJECT Privileges II
The creator of a database object automatically has all privileges over the object; all other users must be granted object privileges. The GRANT command with the ON keyword gives specific privileges on an object or object class. Granted privileges are always additive:
o any privilege granted to the user, plus
o any privilege granted to a user group the user belongs to, plus
o any privilege granted to the PUBLIC group.
Use the REVOKE command to remove object privileges from a user, a group, or PUBLIC.
Syntax: GRANT object_privilege_name ON object_name TO {PUBLIC|GROUP group_name|user_name} [WITH GRANT OPTION];
Example: GRANT ALL ON TABLE TO jdoe;
Example: GRANT list, select ON TABLE TO GROUP quality;
Example: GRANT list, select, insert, update ON encntr_fct TO jdoe;
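The REVOKE counterpart mirrors the GRANT examples above (a sketch):

```sql
-- Syntax: REVOKE object_privilege_name ON object_name
--         FROM {PUBLIC | GROUP group_name | user_name};
REVOKE insert, update ON encntr_fct FROM jdoe;
REVOKE ALL ON TABLE FROM GROUP quality;
```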
91.
ADMIN Privileges
An admin privilege allows a user to create certain types of objects or execute certain global operations. No ON keyword is required when granting this privilege.
92.
ADMIN Privileges II
The Netezza super-user, ADMIN, is the only user that has full admin privileges; other users must be granted admin privileges. Admin privileges are associated with OBJECT CLASSES, not specific objects. The GRANT command without the ON clause gives specific admin privileges to a user or group.
Syntax: GRANT admin_privilege_name TO {PUBLIC|GROUP group_name|user_name} [WITH GRANT OPTION];
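For example, mirroring the "grant database" admin privilege used in the scenarios later in this deck (the user and group names are illustrative, and the TABLE grant is an assumption based on the same pattern):

```sql
GRANT DATABASE TO etl_admin;      -- etl_admin may now create databases
GRANT TABLE TO GROUP developers;  -- members may now create tables
```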
93.
THE NZ SIX
There are six nzsql (Unix utility) commands to display privileges. To see what SQL commands are issued when these nzsql statements are run at the Unix command prompt, use nzsql with the "-E" option.
94.
Where's the metadata at?
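Netezza keeps its metadata in system views. A few commonly used queries (a sketch; the _v_* views exist in the standard catalog, but column names can vary by release):

```sql
SELECT database, owner FROM _v_database;   -- list databases
SELECT tablename, owner FROM _v_table;     -- list tables in the current database
SELECT username FROM _v_user;              -- list users
SELECT groupname FROM _v_group;            -- list groups
SELECT name, attname, format_type
FROM _v_relation_column
WHERE name = 'ENCNTR_FCT';                 -- columns of one table
```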
95.
Order of Precedence
A user's privileges are evaluated in an order of precedence. In case of conflicts, the order of precedence determines the user's rights on the Netezza database:
1. Object privileges granted explicitly to the user.
2. Object privileges granted via a group.
3. Database object class privileges granted to the user.
4. Database object class privileges granted via a group.
5. Global object class privileges granted to the user.
6. Global object class privileges granted via a group.
96.
Scenario 1 – Object vs. Admin Privilege
create user user1 with password 'user1';   -- user created
create user user2 with password 'user2';   -- user created
grant all on database to user1;            -- object privileges granted
grant database to user2;                   -- database (admin) privilege granted
Which of the two users can see databases? User1 – we gave him all the privileges on all databases.
Which of the two users can create databases? User2 – we gave him the administration-level privilege to create databases.
97.
Scenario 2 – Object vs. Object Class Privilege
create user user1 with password 'user1';   -- user created
create user user2 with password 'user2';   -- user created
create database test_db;                   -- database created
grant list on test_db to user1;            -- object privilege granted
grant list on database to user2;           -- object class privilege granted
Which of the two users can see the test_db database? Both – user1 has list on test_db; user2 has the privilege on all databases.
Which of the two users can connect to the test_db database? Both – ditto; the list privilege allows both users to connect to the database.
Which of the two users can see (and/or connect to) other databases? User2 – we gave user2 the list privilege on the object class, so user2 can connect to all databases.
98.
Scenario 3: Order of Precedence – Object vs. Object Class
--connect to the system database as admin
create user user1 with password 'user1';
grant list on database to user1;
--connect to the clinical_db database as admin
grant all on table to user1;
grant delete on encntr_fct to user1;
--connect to clinical_db as user1
select * from encntr_fct limit 2;
Can user1 select from the encntr_fct table? NO! No matter what object class privileges a user was granted, from the moment he receives at least one object privilege, object class privileges no longer apply to that object. The order in which the object and object class privileges were granted does not matter.
99.
Scenario 4 – Global vs. Local Privileges
--connect to the system database as admin
create user user1 with password 'user1';
grant list on database to user1;
grant select on table to user1;
--connect to the clinical_db database as user1
select * from cdr_dim limit 2;
select * from encntr_fct limit 2;
--connect to the system database as admin
create user user2 with password 'user2';
grant list on database to user2;
--connect to the clinical_db database as admin
grant select on table to user2;
--connect to clinical_db as user2
select * from cdr_dim limit 2;
select * from encntr_fct limit 2;
Can user1 select from tables in the clinical_db database? YES – we granted select on the object class TABLE while connected to the SYSTEM database, so the privileges are global.
Can user1 select from tables in other databases? YES – ditto!
Can user2 select from tables in the clinical_db database? YES – we granted select on TABLE while connected to the clinical_db database.
Can user2 select from tables in other databases? NO – we granted select on TABLE while connected to the clinical_db database, so the privileges are local.
100.
Scenario 5: Order of Precedence – Global vs. Local
--connect to the system database as admin
create user user1 with password 'user1';   -- user created
grant list on database to user1;           -- global object class privilege granted
grant select on table to user1;            -- global object class privilege granted
--connect to the clinical_db database as admin
grant alter on table to user1;             -- local object class privilege granted
--connect to clinical_db as user1
select * from encntr_fct limit 2;
Can user1 select from the encntr_fct table? NO! Local privileges override global privileges. No matter what global object class privileges a user was granted, from the moment the user receives at least one local object class privilege, global class privileges no longer apply to that class of objects. The order in which the local and global object class privileges are granted does not matter.
101.
Scenario 6: Order of Precedence – User vs. Group
--connect to the system database as admin
create user user1 with password 'user1';   -- user1 created
create group group1;                       -- group1 created
alter group group1 add user user1;         -- user1 added to group1
grant list on database to group1;          -- global object class privilege to connect to DBs
--connect to the clinical_db database as admin
grant select on table to group1;           -- local object class privilege granted to group1
revoke select on table from user1;         -- local object class privilege revoked from user1
grant alter on cdr_dim to group1;          -- local object privilege granted on the cdr_dim table
grant alter on ptnt_dim to user1;          -- local object privilege granted on the ptnt_dim table
--connect to clinical_db as user1
select * from encntr_fct limit 2;
select * from cdr_dim limit 2;
select * from ptnt_dim limit 2;
From which of the above tables can the user select?
encntr_fct: YES – group1 has select rights on the object class TABLE.
cdr_dim: NO – group1 has the object privilege to alter the cdr_dim table and the object class privilege to select on all tables; the object privilege overrides the object class privilege, so user1 cannot select from the cdr_dim table.
ptnt_dim: NO – ditto; the alter object privilege granted to user1 overrides the object class select privilege.
102.
Scenario 7 – Grants from a Non-Admin User
--connect to the system database as admin
create user user1 with password 'user1';   -- user1 created
create user user2 with password 'user2';   -- user2 created
grant list on database to user1;           -- global object class privilege granted to user1
grant list on database to user2;           -- global object class privilege granted to user2
grant database to user1;                   -- global admin privilege to create DBs granted
--connect to the system database as user1
create database clinical_db;               -- clinical_db database created
--connect to clinical_db as user1
create table test_tbl1 (c1 int, c2 char);  -- test_tbl1 table created
grant list, select on test_tbl1 to user2;  -- grant object privileges on test_tbl1 to user2
Can user1, as the owner of clinical_db, grant these privileges to user2? NO! User1 does not see user2, and user1 does not have the privilege to alter user2. To fix the problem, connect to the system database as an ADMIN user and run:
grant list, alter on user2 to user1;
103.
User and Group Privileges – Order of Precedence
A privilege given to a user via a group cannot be revoked from that user with a "revoke … from <user>" command; it can be revoked only from the group. A privilege given to a user explicitly cannot be revoked from that user with a "revoke … from <group>" command; it can be revoked only explicitly from the user. A user can be a member of multiple groups and hence receive some privileges more than once; a privilege granted via one group cannot be revoked via a different group. A user has a certain privilege if it was granted to him explicitly or through at least one of the groups he belongs to.
GRANTS ARE ADDITIVE! REVOKES ARE NOT NECESSARILY SUBTRACTIVE!
104.
Workload Management
Netezza workload management can be performed:
• By setting limits on resource utilization at the customer (aka tenant), user group/role, or user level. Guaranteed Resource Allocation (GRA) limits are set to ensure that there is no monopolization of resources. Netezza allows High/Low/Critical prioritization of queries.
• By simple workload limits – max queries/jobs, row limits, query time limit, etc.
• Short Query Bias – Netezza provides express lanes for short queries.
Based on high-level GRA settings, Netezza manages resources across the whole stack: host, node, CPU, FPGA, memory, disk.
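These knobs map directly onto the group and user options listed earlier. A sketch of a GRA setup (the group name and percentages are illustrative):

```sql
-- Guarantee 20% of resources to this tenant, cap it at 60%,
-- and allow at most 11 concurrent jobs in its slice.
CREATE GROUP tenant1_rsg WITH RESOURCE MINIMUM 20 RESOURCE MAXIMUM 60 JOB MAXIMUM 11;

-- Place a user in the resource group and bound query priorities.
ALTER USER jdoe WITH IN RESOURCEGROUP tenant1_rsg DEFPRIORITY NORMAL MAXPRIORITY HIGH;
```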
105.
Query Prioritization and Job Limits
Job limits can be set per customer or user group. Within the Customer 1 slice, the max number of jobs = 11. Query execution is prioritized.
106.
Simple Workload Limits
Simple workload limits can be set globally, by customer, by user group/role, or by user: limit the number of jobs, rows, query time, and idle time. Event management helps as well, e.g. for long-running queries. Apply a basic level of global control at the very minimum: limit by result rows, manage long-running queries, eliminate idle sessions.
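Using the user and group options shown earlier, a minimal level of global control might look like this (the numeric limits are illustrative):

```sql
-- Cap result rows, runaway queries, and idle sessions for one user.
ALTER USER jdoe WITH ROWSETLIMIT 5000000 QUERYTIMEOUT 120 SESSIONTIMEOUT 60;

-- Looser limits for a whole group.
ALTER GROUP quality WITH QUERYTIMEOUT 240 SESSIONTIMEOUT 120;
```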
107.
Best Practices for Maintaining Privileges
Manage at a high level
• Grant privileges at a group level.
• Grant privileges at an object class level.
• Define workloads for multi-tenancy, or if there is resource contention between multiple applications.
Break up data logically
• If there are different databases for different applications, grant object class privileges at the specific database level (local level).
Create groups based on usage types
• Create a separate group for admin user(s).
• Create a separate group for the ETL users of each database.
• Create a service account for each application that will access the database, e.g. a service account for the ETL tool, one for the BI tool, one for the statistical tool, etc.
• Create separate generic groups for read-only and read-write users of each database.
108.
Data Authorization – Security Models in the Clinical Domain
Organization or Departmental Model
• All authorized members of the org or dept. can see all the data pertaining to their org or dept.
• Typically used for operational and supply-chain reporting (non-clinical and HR functions).
• Can be applied to clinical data: the users of one department can see the patient records associated with that hospital or department.
360° Patient View Model
• Access is granted to the patient and all of their records within the network, regardless of the hospital or dept. where the care is provided.
• Used for clinical reporting (ACO, AHRQ, PQRS, etc.).
Hybrid (Combination) Model
• The hybrid model implies different types of access for different user types.
• Some user types (IT administrators and developers, executives and managers) have access to all patient data for all hospitals; others have access to all aggregated data for all organizations.
• Some have access only to data pertaining to their department.
• Care providers have full access to all care data for patients attributed to them.
109.
Anatomy of a Dimensional Model
Event: Encounter
Who: Patient, Provider
What: Procedure
Where: Location or Unit
When: Proc. Date, Admit Date
Why: Diagnosis
How: Drug Delivery Route?
110.
Organizational Model of Security
The dimensional model (the Encounter event with its Who/What/Where/When/Why/How dimensions) is extended with two security tables:
• User Group Table – has details about each security group and the users who are members of it; usually published from LDAP.
• User-Group-to-Location Table – has the user groups and the locations to which each user group is authorized.
111.
Patient-Centric Model of Security
The same dimensional model is secured through patient groups:
• User Group Table – has details about each security group and the users who are members of it; usually published from LDAP.
• User-Group-to-Patient-Group Table – has the user groups and the patient groups to which each user group is authorized.
• Patient-Group-to-Patient Table – has details of the patients and the patient groups of which the patients are members.
112.
Hybrid Model of Security
The hybrid model combines both mappings against the same dimensional model: the User Group Table, the User-Group-to-Patient-Group Table, the Patient-Group-to-Patient Table, and the User-Group-to-Location Table.
113.
Best Practices for Row-Level Security
The in-built mechanisms to secure row-level data with security labels and Row-Secure Tables (RST) are often restrictive and cumbersome. Use a database-table-driven security solution, but maintain your security in LDAP:
o Publish LDAP user role/group information into a database table at a regular frequency.
o Use the user role/group table and associate it to Location (hospital, clinic, department, etc.) based on criteria defined and approved by the data stewardship program.
o Use the user role/group table and associate it to Patient Group (ACO-attributed population, etc.) based on criteria defined and approved by the data stewardship program.
o Security criteria may be rule-based (and therefore can be automated) or manual (and therefore cannot be automated).
o All security solutions have exceptions which require some manual intervention (inserts, updates, and deletes); ensure that you have a process defined to rapidly handle manual interventions.
Table-driven security solutions tend to be flexible and scalable without major overhaul.
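A table-driven filter of this kind is often exposed through a view, so that base tables never need direct grants. A sketch, assuming the LDAP-published user-group and user-group-to-location tables described above (all table and column names are illustrative):

```sql
-- Each user sees only encounters at facilities their group is authorized for.
CREATE VIEW encntr_fct_secure AS
SELECT f.*
FROM encntr_fct f
JOIN usr_grp_2_loc gl ON f.fcy_num = gl.fcy_num   -- group-to-location mapping
JOIN usr_grp ug       ON gl.grp_nm = ug.grp_nm    -- user-to-group mapping (from LDAP)
WHERE ug.usr_nm = CURRENT_USER;

GRANT list, select ON encntr_fct_secure TO GROUP quality;
```

End users are granted access to the view only; refreshing the usr_grp table from LDAP then changes which rows each user sees without any re-granting.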
114.
Appendix
Netezza SQL Command Reference:
https://www.ibm.com/developerworks/community/blogs/Netezza/?lang=en
http://pic.dhe.ibm.com/infocenter/ntz/v7r0m3/index.jsp?topic=%2Fcom.ibm.nz.dbu.doc%2Fr_dbuser_ntz_sql_command_reference.html
Image References – the following images are sourced from:
• Netezza Architecture Comparison Whitepaper (IBM)
• Wikimedia Commons
• Netezza Appliance Architecture Whitepaper
• IBM IOD Presentation of Netezza Architecture
Netezza Underground – the book and the blog, an indispensable tool in any Netezza journey. The Netezza JEDI: Shawn Fox.
115.
Netezza Appliance Specification Sample