This document covers the main features of Exadata. Exadata consists of two layers, a storage layer and a database layer, which communicate over the iDB protocol to improve performance. Offloading (Smart Scan) processes queries on the storage cells to reduce both the data transferred and the CPU used on the database servers. Exadata also uses Hybrid Columnar Compression, Storage Indexes, and Smart Flash Cache to improve performance further.
Exadata - The Whole Is Greater Than the Sum of Its Parts
1. EXADATA
“THE WHOLE IS GREATER THAN THE SUM OF ITS PARTS”
Luís Marques - @drune http://lcmarques.com
@FailsafeSA
#failsafedays
2. AGENDA
• What Exadata is – “The Iron”
• Offloading / Smart Scan – “The soul of the business”
• Storage Indexes
• Hybrid Columnar Compression
• Exadata Smart Flash Cache
• Understanding Exadata Performance – “Now we’re talking!”
• Exadata (Un)(Re)learning
Twitter: @failsafeSA #failsafedays
3. EXADATA – WHAT IS IT?
• Bottleneck: moving massive amounts of data from storage to the database server
• Two parts: “storage layer” and “database layer”
4. EXADATA – WHAT IS IT?
• Database layer
• Several servers (Sun, Intel x86-64) running Oracle 11gR2
• Configured as one or more RAC clusters
• RAC is not mandatory
• ASM is mandatory; it maps the “storage layer”
• Storage layer
• Several servers (Sun, Intel x86-64)
• Runs the Oracle storage server software: cellsrv
• The layers communicate via iDB, a network protocol that runs over InfiniBand
• Storage cells do not communicate directly with each other
5. EXADATA – DB LAYER
• Database layer / server software
• Intel x86-64 architecture running OEL 5/6
• Oracle 11gR2 installed, with nothing Exadata-specific
• ASM (Oracle Automatic Storage Management): the DB servers have no direct access to the storage
• iostat becomes useless: there are no OS calls to open or close files
• ASM is fully responsible for redundancy (Normal/High); there is no hardware or software RAID
• ASM mirrors data across the storage cells
• RAC is recommended when needed!
6. EXADATA – STORAGE LAYER
• Storage layer / server software
• Intel x86-64 architecture running OEL 5/6
• Cell Services (cellsrv)
• Multi-threaded
• Processes I/O requests coming from the database layer
• Can send either already-processed data or whole blocks back to the database server
• Implements IORM (I/O Resource Manager)
• Management Server (MS)
• Interface between cellsrv and CellCLI (Cell Command Line Interface)
• Restart Server (RS)
• Process monitoring
• OSWatcher
• Collects OS information: vmstat and netstat
7. EXADATA – IDB
• iDB – Intelligent Database protocol
• Communication between the two layers
• Function shipping architecture: information about the SQL being executed is sent to the storage cells, and processed data is sent back
• Limits the data sent to the database server to only the rows and columns that satisfy the query
• Cells can send whole blocks when offload is not possible
• iDB uses RDS (Reliable Datagram Sockets) over InfiniBand:
• Low latency
• Low overhead
• Minimal CPU usage
9. EXADATA – IS IT PRETTY ON THE OUTSIDE?
10. OFFLOAD/SMART SCAN
• Offload / Smart Scan: the cure for (almost) all ills
• Processing done on the storage servers instead of the database servers = OFFLOAD
• Smart Scan is a “run-time decision”
• Goals:
• Reduce the volume of data transferred between the storage and the database servers
• Reduce CPU consumption on the database servers
• Reduce disk access time
11. SMART SCAN – EXAMPLE
• Let's imagine that…
• 1 table with 1 column
• 50 rows per block
• Only one 8k block
• Query:
select * from t1 where rowid = 'AAAAB0AABAAAAOhAAA'
• At least the whole block (8k) has to be read, which means an extra, useless transfer of 49 rows.
• Multiply this by billions = too much irrelevant data passed to the database server = bottleneck
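The arithmetic behind this example can be sketched in a few lines of Python. This is only an illustration under the slide's numbers (8k blocks, 50 rows per block); the function names are hypothetical, and the assumption that all rows in a block have the same size is mine, not the slide's.

```python
BLOCK_SIZE_BYTES = 8 * 1024  # one 8k block, as in the slide
ROWS_PER_BLOCK = 50          # 50 rows per block, as in the slide


def wasted_fraction(rows_needed: int, rows_per_block: int = ROWS_PER_BLOCK) -> float:
    """Fraction of a shipped block that the query never asked for.

    Assumes (hypothetically) that every row occupies the same space.
    """
    return (rows_per_block - rows_needed) / rows_per_block


def wasted_bytes(blocks_read: int, rows_needed_per_block: int = 1) -> float:
    """Bytes shipped to the database server that carry irrelevant rows."""
    return blocks_read * BLOCK_SIZE_BYTES * wasted_fraction(rows_needed_per_block)


if __name__ == "__main__":
    print(f"wasted share of one block: {wasted_fraction(1):.0%}")
    print(f"wasted bytes over a million blocks: {wasted_bytes(1_000_000):,.0f}")
```

With one useful row out of 50, roughly 98% of every block shipped is overhead, which is the bottleneck the slide describes.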
12. SMART SCAN – REQUIREMENTS
• To get a Smart Scan:
• Full scan
• Full Table Scan (TABLE ACCESS STORAGE FULL)
• Index Fast Full Scan
• Direct path read
• Available in 11gR2
• Data is read directly into the PGA, bypassing the buffer cache (SGA)
• Parallel and serial (SMALL_TABLE_THRESHOLD)
• Exadata storage
• Objects that use a mix of Exadata and non-Exadata storage are not eligible
13. SMART SCAN – COLUMN PROJECTION
• Column projection
• Metadata sent to the storage cells (via iDB)
• Result returned via iDB
• Example: 6 columns out of a table with 100 columns: select a, b, c, d, e, f from t where a=7;
• IO_CELL_OFFLOAD_ELIGIBLE_BYTES (volume of data eligible for offload to the storage cells)
• IO_INTERCONNECT_BYTES (volume of data actually exchanged with the DB server)
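One common way to read these two counters together is as a rough "bytes saved" percentage. The sketch below assumes you have already fetched both values for a statement (for example from V$SQL); the function name and the sample numbers are hypothetical, and the formula is a rule of thumb rather than an official Oracle metric.

```python
def offload_savings_pct(eligible_bytes: int, interconnect_bytes: int) -> float:
    """Rough percentage of eligible I/O that never crossed the interconnect.

    eligible_bytes     -> IO_CELL_OFFLOAD_ELIGIBLE_BYTES for the statement
    interconnect_bytes -> IO_INTERCONNECT_BYTES for the statement
    """
    if eligible_bytes == 0:
        return 0.0  # nothing was eligible, so nothing was offloaded
    return 100.0 * (eligible_bytes - interconnect_bytes) / eligible_bytes


if __name__ == "__main__":
    # Hypothetical values: 1 GB eligible, 60 MB actually shipped.
    print(offload_savings_pct(1_000_000_000, 60_000_000))
```

Treat the result as an indicator only: the interconnect counter also includes traffic that has nothing to do with Smart Scan, so the figure can come out low or even negative for some statements.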
14. SMART SCAN – PREDICATE FILTERING
• Predicate filtering
• The iDB message carries information about the predicates
• WHERE clause filtering is done on the storage cells instead of the database server
• Example:
• select a, b, c, d, e, f from t where a=7;
• Reduces the volume of data sent to the database server
• Reduces CPU usage
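To make the combined effect of column projection and predicate filtering concrete, here is a toy simulation. It is not how cellsrv is implemented; the table contents, the column names beyond a to f, and every function name are invented for illustration.

```python
from typing import Callable, Dict, List

Row = Dict[str, int]


def block_shipping(rows: List[Row]) -> List[Row]:
    """No offload: every row travels to the database server with every column."""
    return [dict(row) for row in rows]


def smart_scan(rows: List[Row], columns: List[str],
               predicate: Callable[[Row], bool]) -> List[Row]:
    """Offload: filter rows and project columns before anything is shipped."""
    return [{name: row[name] for name in columns} for row in rows if predicate(row)]


def values_shipped(rows: List[Row]) -> int:
    """Count the column values that cross the interconnect."""
    return sum(len(row) for row in rows)


# Hypothetical table: 100 columns (a..f plus col7..col100) and 1000 rows.
COLUMN_NAMES = ["a", "b", "c", "d", "e", "f"] + [f"col{i}" for i in range(7, 101)]
TABLE = [{name: (i % 10 if name == "a" else i) for name in COLUMN_NAMES}
         for i in range(1000)]

# select a, b, c, d, e, f from t where a=7;
without_offload = values_shipped(block_shipping(TABLE))
with_offload = values_shipped(
    smart_scan(TABLE, ["a", "b", "c", "d", "e", "f"], lambda row: row["a"] == 7))

if __name__ == "__main__":
    print(without_offload, with_offload)
```

In this toy table, 100 of the 1000 rows match and only 6 of the 100 columns are requested, so the offloaded scan ships 600 values instead of 100,000.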
15. “NON” SMART SCAN
• Smart File Creation
• Block initialization (formatting) when blocks are allocated by the database server
• Creation and extension of datafiles (tablespaces)
• RMAN incremental backups
• BCT (Block Change Tracking) tracks individual blocks instead of groups of blocks
• Fewer blocks to back up, less backup time
• RMAN restores
• “File initialization” during the restore
16. SMART SCAN – “DISABLERS”
• When any of the requirements is not met
• In the case of:
• Clustered tables
• Index Organized Tables (IOTs)
• Partial Smart Scan (block shipping mode)
• Chained rows: Smart Scan pauses, and the database server does a single block read
• Read consistency issues: if a block is “newer” than the query reading it, the database server does a single block read
17. SMART SCAN – STORAGE INDEXES
• Not regular indexes (B-tree, bitmap, etc.)
• They identify the locations where a row is not stored
• They store the minimum and maximum value of each column, plus a NULL flag, for each 1MB storage unit
• They cannot be tuned or altered
• They are rebuilt after every storage cell reboot
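The min/max/NULL-flag idea can be sketched as follows. This is a simplified model, not the cell's actual data structure: regions here hold a handful of values instead of 1MB of blocks, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RegionSummary:
    """What the sketch keeps per storage region for one column."""
    min_value: Optional[int]
    max_value: Optional[int]
    has_nulls: bool


def summarize(region: List[Optional[int]]) -> RegionSummary:
    """Build the summary for one region (None stands for NULL)."""
    non_null = [v for v in region if v is not None]
    return RegionSummary(
        min_value=min(non_null) if non_null else None,
        max_value=max(non_null) if non_null else None,
        has_nulls=any(v is None for v in region),
    )


def regions_for_greater_than(index: List[RegionSummary], threshold: int) -> List[int]:
    """Regions that might hold a value > threshold; the rest are skipped."""
    return [i for i, s in enumerate(index)
            if s.max_value is not None and s.max_value > threshold]


def regions_for_is_null(index: List[RegionSummary]) -> List[int]:
    """Regions that might hold a NULL, according to the flag."""
    return [i for i, s in enumerate(index) if s.has_nulls]


# Hypothetical cust_age values, four regions of three rows each.
REGIONS = [[22, 77, 31], [18, 30, 25], [12, None, 35], [20, 33, 29]]
INDEX = [summarize(region) for region in REGIONS]

if __name__ == "__main__":
    print(regions_for_greater_than(INDEX, 35))  # candidates for cust_age > 35
    print(regions_for_is_null(INDEX))           # candidates for cust_age IS NULL
```

The index never says where a row is; it only rules out regions that cannot contain it, which is why only the first region needs to be read for `cust_age > 35` in this sample.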
19. STORAGE INDEXES – REQUIREMENTS
• Requirements
• Smart Scan / Offload
• At least one predicate (WHERE)
• Comparison: =, <, >, BETWEEN, >=, <=, IN, IS NULL, IS NOT NULL
• Supported
• Multiple predicates (WHERE…AND…AND)
• Joins between several tables
• Parallel queries
• HCC
• Bind variables
• Partitions
• Subqueries
• Not supported
• CLOB
• Predicates with a % wildcard (LIKE '%')
20. STORAGE INDEXES – PERFORMANCE
• Can result in dramatic performance gains
• How the data is ordered is decisive (clustering factor*)
• The NULL flag gives a performance gain that B-tree indexes cannot, since they do not store NULLs
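Why ordering matters can be shown with a small experiment: the same unique values, once sorted and once shuffled, checked against per-region min/max ranges. The region size, value range, and function name below are arbitrary choices for the sketch.

```python
import random
from typing import List


def regions_read_for_equality(values: List[int], region_size: int, target: int) -> int:
    """Count regions whose [min, max] range cannot rule out the target value."""
    count = 0
    for start in range(0, len(values), region_size):
        region = values[start:start + region_size]
        if min(region) <= target <= max(region):
            count += 1
    return count


# 10,000 unique values split into 10 regions of 1,000 values each.
ordered = list(range(10_000))
shuffled = ordered[:]
random.Random(0).shuffle(shuffled)  # fixed seed so the run is repeatable

sorted_reads = regions_read_for_equality(ordered, 1_000, 4_242)
shuffled_reads = regions_read_for_equality(shuffled, 1_000, 4_242)

if __name__ == "__main__":
    print("regions read, sorted data:  ", sorted_reads)
    print("regions read, shuffled data:", shuffled_reads)
```

With sorted data exactly one region can contain the value; once the rows are scattered, nearly every region's range spans the target and the storage index eliminates little or nothing.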
21. HCC – HYBRID COLUMNAR COMPRESSION
• Available compression types:
• BASIC
• Compresses only on direct path insert operations
• Compression unit: block (8k/16k…)
• Data warehousing oriented
• OLTP
• Compresses for all types of operation
• Symbol table for repeated values
• Compression is not immediate: it happens when the block fills up
• Fallback for HCC
• HCC
• Compresses only on direct path insert operations:
• APPEND / parallel insert / SQL*Loader / CTAS…
• Any other type of operation = OLTP compression
22. HCC – MECHANICS
• Data stored in a non-conventional format: ordered and organized by column
• Available only on Exadata storage
• Blocks are combined into structures called Compression Units (CUs) of 32k/64k
• An intermediate format between row-oriented and column-oriented storage:
• Allows an entire row to be read from a single CU
• Data is sent to the database server compressed
• Decompression = database server
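The "hybrid" layout can be sketched as: take a set of rows, pivot them into columns, and compress each column on its own, keeping all columns of those rows together in one unit. The code below is a loose illustration only. It uses zlib and JSON as stand-ins, which is an assumption of this sketch and says nothing about Oracle's on-disk format or algorithms; all names are hypothetical.

```python
import json
import zlib
from typing import List, Tuple

Row = Tuple[int, str, str]


def build_compression_unit(rows: List[Row]) -> List[bytes]:
    """Pivot a set of rows into columns and compress each column separately."""
    columns = list(zip(*rows))  # one tuple of values per column
    return [zlib.compress(json.dumps(list(col)).encode("utf-8")) for col in columns]


def read_compression_unit(cu: List[bytes]) -> List[Row]:
    """Decompress every column and stitch whole rows back together."""
    columns = [json.loads(zlib.decompress(col).decode("utf-8")) for col in cu]
    return [tuple(values) for values in zip(*columns)]


# Hypothetical rows: (id, country, status) with highly repetitive columns.
ROWS: List[Row] = [(i, "PT" if i % 2 else "ES", "ACTIVE") for i in range(1000)]
CU = build_compression_unit(ROWS)

if __name__ == "__main__":
    raw_size = len(json.dumps(ROWS).encode("utf-8"))
    cu_size = sum(len(col) for col in CU)
    print(f"raw bytes: {raw_size}, compression unit bytes: {cu_size}")
```

Because every column of a given row lives in the same unit, a whole row can be rebuilt from a single CU, while similar values sitting next to each other within a column give the compressor more to work with.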
23. EXADATA SMART FLASH CACHE
• Marketing: “The World’s First Database Machine for OLTP”
• Hardware
• Cell cache on the storage servers
• Each storage server = 4 Sun Flash PCIe Accelerator F20 cards, 3.2TB in total (X4)
• “Oracle is using flash PCIe cards in Exadata – not flash disks”
• 1.33 GB/s throughput per PCIe flash card
• 1,960,000 8K write I/Os per second
• Full rack = 56 PCIe flash cards – 44.8 TB
• Energy Storage Module (ESM): flushes volatile data to non-volatile storage
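The capacity figures on this slide are consistent with each other, as a quick calculation shows. The per-card capacity (800 GB) and the number of storage servers (14) are not stated on the slide; they are derived here from 3.2TB per 4 cards and from 56 cards at 4 per server, so treat them as assumptions of the sketch.

```python
from typing import Tuple

CARDS_PER_CELL = 4       # from the slide
CARD_CAPACITY_GB = 800   # assumption: 3.2 TB per storage server / 4 cards
CELLS_IN_FULL_RACK = 14  # assumption: 56 cards / 4 cards per storage server


def flash_totals(cells: int,
                 cards_per_cell: int = CARDS_PER_CELL,
                 card_capacity_gb: int = CARD_CAPACITY_GB) -> Tuple[int, int]:
    """Return (number of flash cards, total flash capacity in GB)."""
    cards = cells * cards_per_cell
    return cards, cards * card_capacity_gb


if __name__ == "__main__":
    cards, total_gb = flash_totals(CELLS_IN_FULL_RACK)
    print(f"{cards} cards, {total_gb / 1000} TB")
```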
24. ESFC – IS IT PRETTY ON THE OUTSIDE?
25. EXADATA SMART FLASH CACHE
• Cache for the storage disks
• Not subject to Smart Scans!
• Use cases
• Cache (ESFC) – [CellCLI> create flashcache all]
• ASM disks (SSD) – [CellCLI> create flashcache all size=200g]
• Both
• Performance
• Uses PCIe cards instead of SSD disks to avoid the disk controller (disk interface) bottleneck
• Smart caching
• Intelligent data cache: “hot” data goes into the Flash Cache
• “Non-hot” data or objects are ignored
• The DBA can tune caching policies (alter table foo.bar storage (cell_flash_cache keep);)
• Exadata Flash Cache compression (~80 TB on an X4 full rack): hardware capable
27. EXADATA SMART FLASH LOG
• Redo on Flash Cache/SSD disks? NO!
• Unless…
• Over budget: too much money to spend on regular disks
• You have an Exadata:
• Redo is written in parallel to disk and to flash: “faster wins”
• LGWR is notified as soon as the first write completes
• Bottleneck mitigated when one of the I/O subsystems is overloaded
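The “faster wins” rule amounts to taking, for each redo write, the lower of the two completion times. The sketch below models just that; the latency samples are invented, and the function name is hypothetical.

```python
from typing import List, Sequence


def redo_ack_latencies(disk_ms: Sequence[float], flash_ms: Sequence[float]) -> List[float]:
    """Latency LGWR sees per write when disk and flash are written in parallel.

    The acknowledgement arrives as soon as the faster of the two writes completes.
    """
    return [min(d, f) for d, f in zip(disk_ms, flash_ms)]


# Hypothetical samples (ms): each device has one slow outlier, at different times.
DISK_MS = [0.4, 12.0, 0.5, 0.6]
FLASH_MS = [0.9, 0.8, 0.7, 15.0]
ACKS = redo_ack_latencies(DISK_MS, FLASH_MS)

if __name__ == "__main__":
    print("worst case, disk only: ", max(DISK_MS))
    print("worst case, flash only:", max(FLASH_MS))
    print("worst case, faster wins:", max(ACKS))
```

Neither device is faster every time in this sample, but the outliers do not coincide, so the combined worst case stays low. That is the sense in which the bottleneck is mitigated when one I/O subsystem is overloaded.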
28. EXADATA – (RE)LEARNING
• OLTP
• ESFC
• High hit percentage for physical I/O (avg wait 0.5ms)
• Scalability: upgrading to a half/full rack = 2x CPUs, 2x ESFC
• ESFL
• Write-intensive workload = LGWR performance
• Data warehousing
• Smart Scans/Offload
• FTS/IFFS – direct path is a “run-time decision”:
• Explain Plan is useless
• Chained rows – “pass-through mode”
• Hints? “No No No”
• Storage cells too busy:
• Column projection and predicate filtering are not performed
• Block shipping mode
• “Workaround solution” – “cell physical IO bytes pushed back due to excessive CPU on cell”
• Partitioning
• Size matters!
• To Index or Not to Index
• Exadata is different – Think differently ™
LIBCELL: a library that is linked with the Oracle kernel.
LIBCELL has the code that knows how to request data via iDB. This provides a very nonintrusive mechanism that lets the Oracle kernel talk to the storage tier via network-based calls instead of operating system reads and writes.
Smart Scan - The concept has been around for some time. In fact, rumor has it that Oracle approached at least one of the large SAN manufacturers several years ago with the idea. The manufacturer was apparently not interested at the time and Oracle decided to pursue the idea on its own.
In a non-Exadata environment this data has to be read from disk over the I/O subsystem (most commonly Fibre Channel) using large, sequential multi-block I/Os. While the I/O request is in flight, the database process waiting for the data is left in a wait state (known as db file scattered read) whilst the blocks from disk are scattered to available slots in the data cache (in memory on the database server). This will inevitably cause many useful blocks to be aged out of the cache, with adverse implications for performance going forward.
Oracle automatically determines whether to use direct path reads for non-parallel scans. The calculation is based on several factors, including the size of the object, the size of the buffer cache, and how many of the object's blocks are already in the buffer cache.
In 10g, serial table scans for "large" tables used to go through the buffer cache (by default).
Queries against objects whose segments reside on these mixed storage diskgroups are also not eligible for Offloading.
On non-Exadata platforms, block changes are tracked for groups of blocks.
When Oracle encounters a chained row, the head piece will contain a pointer to the block containing the second row piece. Since the storage cells do not communicate directly with each other, and it is unlikely that the chained block resides on the same storage cell, cellsrv simply ships the entire block and allows the database layer to deal with it.
If Oracle notices that a block is “newer” than the current query, the process of finding an age-appropriate version of the block is left for the database layer to deal with. This effectively pauses the Smart Scan processing while the database does its traditional read consistency processing.
As you can see in the diagram, the first storage region in the Customer table has a maximum value of 77, indicating that it’s possible for it to contain rows that will satisfy the query predicate (cust_age > 35). The other storage regions in the diagram do not have maximum values that are high enough to contain any records that will satisfy the query predicate. Therefore, those storage regions will not be read from disk.
In addition to the minimum and maximum values, there is a flag to indicate whether any of the records in a storage region contain nulls. The fact that nulls are represented at all is somewhat surprising given that nulls are not stored in traditional Oracle indexes. This ability of Storage Indexes to track nulls may actually have repercussions for design and implementation decisions.
SORTING DATA:
Suppose you have a table that has a column with unique values (that is, no value is repeated). If the data is stored on disk in such a manner that the rows are ordered by that column, then there will be one and only one storage region for any given value of that column. Any query with an equality predicate on that column will have to read at most one storage region.
HCC is only appropriate for data that is no longer being modified (or only occasionally modified), because of locking issues and the fact that updated rows are moved into a much less compressed format (OLTP compression format).
When data is loaded, column values for a set of rows are grouped together and compressed. After the column data for a set of rows has been compressed, it is stored in a compression unit.
The cellsrv program may actually fire off async I/O requests to both the disk and the flash cache.
If the requested data is in the cache, the requests will be fulfilled by the flash cache before the disk reads will be able to complete.
When the system is heavily loaded, it is possible for some requests to be fulfilled by the flash cache while others are fulfilled by the disks. This two-pronged attack effectively increases the amount of throughput that the system can deliver.
However, after sending an acknowledgement back to the database server, Oracle’s storage software then copies the data into the cache, assuming it is suitable for caching. This is a key point. The metadata that is sent with the write request lets the storage software know if the data is likely to be used again and if so, the data is also written to the cache.
Chained rows pass-through: a combination of single block read wait events and cell Smart Scan wait events
Hints: they can prevent Exadata from using some of its features