1. "Technology is dominated by two types of people: those who understand what they do not manage, and those who manage what they do not understand."
Putt's Law and the Successful Technocrat: How to Win in the Information Age
2. DAT322: SQL Server 2005 Memory Internals
Geyzerskiy Dmitriy
Chief Architect Microsoft Technologies
dimag@dbnet.co.il
3. Session Objectives and Agenda
• Windows Memory Management
• NUMA Architecture
• SQL Server Memory Management
• AWE vs. 64-bit
11. 3GB Process Space Option
• /3GB switch in BOOT.INI
• /USERVA (a value between 2048 and 3072 MB, in 128MB increments)
• The .EXE must be linked with the /LARGEADDRESSAWARE flag
12. 64-bit Address Space
• Map more data into the address space
• The application "speed" is the same on 32-bit and 64-bit
• The OS needs 2GB of memory to hold pointers to 16GB or more of physical memory
• Applies to x64 and IA64
16. Sizing the Page File
• More RAM should mean a smaller page file!
• Crash dump settings affect the size
• Full dump: size of RAM
• Kernel dump: much smaller
• To size correctly, review what goes there
• Minimum should equal the commit charge peak
• Maximum could be a multiple of this
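As a back-of-the-envelope aid, the sizing rule above can be written as a small calculation. The function, its names, and the 1.5x headroom multiplier are illustrative assumptions, not an official formula:

```python
def page_file_size(commit_charge_peak_mb, ram_mb, dump="kernel", headroom=1.5):
    """Illustrative page-file sizing following the slide's rule of thumb:
    minimum = commit charge peak, maximum = a multiple of the minimum.
    A full crash dump additionally needs a page file at least the size
    of RAM; a kernel dump is much smaller, so the peak rule dominates.
    """
    minimum = commit_charge_peak_mb
    if dump == "full":
        minimum = max(minimum, ram_mb)      # full dump must fit all of RAM
    maximum = int(minimum * headroom)        # hypothetical headroom multiple
    return minimum, maximum

# e.g. 4GB of RAM, observed commit charge peak of 2600MB
pf_min, pf_max = page_file_size(commit_charge_peak_mb=2600, ram_mb=4096)
```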
21. What is SMP
• SMP – Symmetric Multi-Processing
• The front-side bus is a point of contention
• Difficult to scale beyond 32 CPUs
22. What is NUMA
• NUMA – Non-Uniform Memory Access
• Local memory access is fast; foreign (remote) memory access costs roughly 4x a local access
23. What is Interleaved-NUMA
• Enables NUMA hardware to behave as SMP
• Memory is used by all CPUs
• Each CPU accesses slices of memory from all nodes
• SQL Server 2000 should use interleaved-NUMA
24. What is Soft-NUMA
• Activates a custom NUMA configuration on top of any hardware
• Registry settings control the final Soft-NUMA configuration
• Provides greater performance, scalability, and manageability on SMP as well as on real NUMA hardware
25. Soft-NUMA Configuration Example
We have: a NUMA system with 2 nodes and 4 CPUs per node
We need: 2 CPUs for the loading application and the rest of the CPUs for queries
34. Agenda
• Windows Memory Management
• NUMA Architecture
• SQL Server Memory Management
• AWE vs. 64-bit
35. Address Windowing Extensions (AWE)
• Access more than 4GB of physical memory
• Is ignored on systems with less than 3GB of physical memory
• AWE memory is never swapped to disk
1. Allocate the physical memory (requires the Lock Pages in Memory privilege)
2. Create a region in the process address space to serve as a window for mapping views of this physical memory
3. Map a view of the physical memory into the virtual memory window
36. SQL Server Process Address Space with AWE
• 0x00000000–0x7FFFFFFF: SQL Server user address space – MemToLeave area, thread stacks, and the buffer pool (Locks, Query Workspace, Plan Cache, DB Page Cache)
• 0x80000000–0xBFFFFFFF: SQL Server or OS (depending on the /3GB switch)
• 0xC0000000–0xFFFFFFFF: Operating System
• AWE memory sits outside the virtual address space; views of it are mapped into the DB Page Cache window
37. SQL Server 2005 32-bit AWE Memory
• Right OS version
• Windows Server 2003 Standard and up
• /PAE in boot.ini enables a 32-bit OS to address more than 4GB of memory
• SQL Server Edition
• Enterprise Edition
• Developer Edition
• sp_configure ‘awe enabled’
38. Lock Pages in Memory Option
• Entry in the SQL Server error log
• 64-bit: "Using locked pages for buffer pool"
• 32-bit: "Address Windowing Extensions is enabled"
• The option is ignored in Standard Edition
• The Local System account has the 'Lock Pages in Memory' privilege by default
A significant part of sql server process memory has been paged
out. This may result in a performance degradation. Duration: 0
seconds. Working set (KB): 1086400, committed (KB): 2160928,
memory utilization: 50%.
40. SQL Server 2005 64 bit vs. 32 bit
• The only way to get virtual memory > 3GB
• What is different from 32-bit?
• All pointers are 64-bit
• SQL Server commits ‘min server memory’ memory at startup
• Some internal memory-related data-structure constants are larger
• 64-bit alignment of data structures
41. SQL Server 2005 64 bit vs. 32 bit
• What is the same?
• No on-disk database format changes
• No differences in buffer pool policy / algorithms from 32-bit
• All uses of memory can use additional 64-bit memory
• DB Page Cache, Query Workspace Memory, Plan Cache,
Locks, External uses, Utilities, …
43. Resources
• Blogs
• Slava Oks' blog: http://blogs.msdn.com/slavao
• SQL Programmability & API Development Team Blog: http://blogs.msdn.com/sqlprogrammability/
• External Links
• NUMA FAQ: http://lse.sourceforge.net/numa/faq
• Books
• Eldad Eilam: Reversing: Secrets of Reverse Engineering
• Ken Henderson: SQL Server 2005 Practical Troubleshooting: The Database Engine
• Kalen Delaney: Inside Microsoft SQL Server 2005: The Storage Engine
44. Summary
• It pays to understand SQL Server memory management
• A number of performance issues either originate as or manifest as memory issues
• Memory-based performance tuning is a very useful technique
• Significant internal and external changes in SQL Server 2005
• Consider NUMA for your next large-scale project
• Upgrade your system to 64 bit
49. Database Page Cache
• Most common use of memory – often referred to as the "Buffer Pool"
• Stores database pages – indexes and heaps
• The lazy writer thread sweeps across the buffer pool to age pages out of cache
• Uses a modified clock algorithm
• Each page has a reference count
• The reference count is divided by 4 each time the clock hand passes
• Pages with a reference count of 0 can be discarded
• A dirty page needs to be written out first
• Favors keeping often-used pages in cache
• Higher-level index pages are naturally favored
• Full scans may do some damage to the buffer pool
• The clock algorithm limits the damage
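The aging scheme above can be sketched roughly as follows. This is an illustrative simulation of a clock-style sweep, not the actual SQL Server lazy writer code; all names are invented for the sketch:

```python
class BufferPage:
    """A cached database page with a clock-algorithm reference count."""

    def __init__(self, page_id):
        self.page_id = page_id
        self.ref_count = 0
        self.dirty = False

    def touch(self):
        """Each access to the page bumps its reference count."""
        self.ref_count += 1


def clock_sweep(pages, write_dirty):
    """One pass of a lazy-writer-style clock hand over the buffer pool.

    The reference count is divided by 4 each time the hand passes;
    pages that reach 0 can be discarded, but dirty ones must be
    written out first (via the write_dirty callback).
    """
    freed = []
    for page in pages:
        page.ref_count //= 4
        if page.ref_count == 0:
            if page.dirty:
                write_dirty(page)   # flush before discarding
                page.dirty = False
            freed.append(page.page_id)
    return freed
```

A frequently touched page (such as a high-level index page) keeps a large reference count and survives many sweeps, while pages touched once by a full scan decay to zero quickly, which is how the clock limits scan damage.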
50. Monitoring Database Page Cache
• Perfmon: SQL Server:Buffer Manager
• Buffer cache hit ratio: SQL 2000 SP3 onwards, this is “recent”
data (last 2K to 3K accesses)
• Page life expectancy: low value (< 300 seconds) indicates
“churn” in buffer pool
• Physical Disk: Avg. Disk sec/Read, Avg. Disk sec/Write, Avg. Disk sec/Transfer
• Beware perfmon averaging
• Free list stalls/sec: another indication of memory pressure
• AWE counters – may correlate to kernel time
• Related: Per file I/O statistics obtained via
::fn_virtualfilestats(dbid, fileid)
• IoStallMS shows file-level I/O bottleneck
51. Plan Cache
• Caches plans for various types of batches
• Stored procedures, Triggers, Ad-hoc SQL, Auto-parameterized
SQL, Parameterized SQL (sp_executesql or via client APIs)
• Plans are of two types
• Compiled plan
• Read-only
• One per unique combination of statement, dbid, unique
combination of set options
• Executable plan / Execution Context
• Derived from compiled plan – points back to it
• One per concurrent execution
• Contains execution specific data – e.g. parameter/row values
• Not all executable plans cacheable – e.g. hash, parallel plans
• No pre-defined upper limit for size of plan cache
• Depends on buffer pool to manage space
52. Monitoring Plan Cache
• master.dbo.syscacheobjects
• Lists all items in plan cache
• Can aggregate this data to get use counts
• Very useful indicator of nature of application
• dbcc proccache
• High level summary data on plan cache
• dbcc cachestats
• Summary by cache object type
• Perfmon counters under Cache Manager
• Counts of cache pages, objects
• However, hit ratios are from instance startup
• Perfmon counters under SQL Statistics
• Can monitor compiles, recompiles, etc.
• Profiler Events
• SP:CacheHit, SP:CacheMiss, SP:CacheInsert
53. SQL Server 32-bit AWE Memory
• Mapping and Un-mapping AWE memory
• Mapping cost is small – equivalent to soft fault
• Un-mapping cost is substantial – need to update page tables
on all processors
• Pages mapped mostly 1 at a time
• Read-ahead may map multiple at a time
• Perfmon:
• Buffer Manager: AWE lookup maps/sec
• Pages un-mapped many at a time
• Up to 1 MB at a go
• Perfmon:
• Buffer Manager: AWE unmap calls/sec
• Buffer Manager: AWE unmap pages/sec
• Only the DB Page Cache is able to use AWE memory
• No virtual memory pointers within it
54. Buffer Pool & AWE
• The AWE mechanism is enabled (the system default) in a 64-bit environment
• When using the AWE mechanism, the buffer pool no longer uses committed virtual memory
• dbcc memorystatus
Memory Manager KB
------------------------------ ----------------
VM Reserved 16979888
VM Committed 217928
AWE Allocated 14116272
Reserved Memory 1024
Reserved Memory In Use 0
SQL Server 2005 Memory Internals. Where did my memory go? How is my memory being used? How can I find out which operation uses all the memory? These are just a few of the questions that will be answered during this session.
Virtual Memory and Paging
Virtual memory is a fundamental concept in contemporary operating systems. The idea is that instead of letting software directly access physical memory, the
processor, in combination with the operating system, creates an invisible layer between the software and the physical memory. For every memory access, the
processor consults a special table called the page table that tells the processor which physical memory address to actually use. Of course, it wouldn't be
practical to have a table entry for each byte of memory (such a table would be larger than the total available physical memory), so instead processors divide
memory into pages.
Pages are just fixed-size chunks of memory; each entry in the page table deals with one page of memory. The actual size of a page of memory differs
between processor architectures, and some architectures support more than one page size. IA-32 processors generally use 4K pages, though they also support
2 MB and 4 MB pages. For the most part Windows uses 4K pages, so you can generally consider that to be the default page size.
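The page-table lookup described above can be illustrated with a toy translator. The dict-based page table and function names are invented for the sketch; a real page table is a hardware-walked structure, not a dictionary:

```python
PAGE_SIZE = 4 * 1024  # 4K pages, the common default on IA-32 Windows

def translate(virtual_addr, page_table):
    """Translate a virtual address to a physical one using a toy page
    table (a dict mapping virtual page number -> physical frame number).
    A missing entry models an invalid page-table entry, i.e. a page fault.
    """
    vpn, offset = divmod(virtual_addr, PAGE_SIZE)
    if vpn not in page_table:
        raise LookupError("page fault")   # no valid page-table entry
    return page_table[vpn] * PAGE_SIZE + offset

page_table = {0: 7, 1: 3}         # two mapped pages
translate(0x1004, page_table)     # vpn 1, offset 4 -> somewhere in frame 3
```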
Committed: backed by some type of physical storage: physical memory or page file.
Reserved: address space set aside for later use; no physical storage used
Free: not used
Paging
Paging is a process whereby memory regions are temporarily flushed to the
hard drive when they are not in use. The idea is simple: because physical
memory is much faster and much more expensive than hard drive space, it
makes sense to use a file for backing up memory areas when they are not in
use. Think of a system that’s running many applications. When some of these
applications are not in use, instead of keeping the entire applications in physical
memory, the virtual memory architecture enables the system to dump all
of that memory to a file and simply load it back as soon as it is needed. This
process is entirely transparent to the application.
Page Faults
From the processor’s perspective, a page fault is generated whenever a memory
address is accessed that doesn’t have a valid page-table entry. As end
users, we’ve grown accustomed to the thought that a page-fault equals bad
news. That’s akin to saying that a bacterium equals bad news to the human
body; nothing could be farther from the truth. Page faults have a bad reputation
because any program or system crash is usually accompanied by a message
informing us of an unhandled page fault. In reality, page faults are
triggered thousands of times each second in a healthy system. In most cases,
the system deals with such page faults as a part of its normal operations. A
good example of a legitimate page fault is when a page has been paged out to
the paging file and is being accessed by a program. Because the page's page-table
entry is invalid, the processor generates a page fault, which the operating
system resolves by simply loading the page’s contents from the paging file and
resuming the program that originally triggered the fault.
Working Sets
A working set is a per-process data structure that lists the current physical
pages that are in use in the process’s address space. The system uses working
sets to determine each process’s active use of physical memory and which
memory pages have not been accessed in a while. Such pages can then be
paged out to disk and removed from the process’s working set.
It can be said that the memory usage of a process at any given moment can
be measured as the total size of its working set. That’s generally true, but is a
bit of an oversimplification because significant chunks of the average process
address space contain shared memory, which is also counted as part of the
total working set size. Measuring memory usage in a virtual memory system
is not a trivial task!
Kernel Memory and User Memory
Probably the most important concept in memory management is the distinction
between kernel memory and user memory. It is well known that in order
to create a robust operating system, applications must not be able to access the
operating system’s internal data structures. That’s because we don’t want a
single programmer’s bug to overwrite some important data structure and
destabilize the entire system. Additionally, we want to make sure malicious
software can’t take control of the system or harm it by accessing critical operating
system data structures.
Windows uses a 32-bit (4-gigabyte) memory address space that is typically
divided into two 2-GB portions: a 2-GB application memory portion, and a
2-GB shared kernel-memory portion. There are several cases where 32-bit systems
use a different memory layout, but these are not common. The general
idea is that the upper 2 GB contain all kernel-related memory in the system
and are shared among all address spaces. This is convenient because it means
that the kernel memory is always available, regardless of which process is currently
running. The upper 2 GB are, of course, protected from any user-mode
access.
Visual Studio solution DEMO introducing Win32 API calls to get the system page size, the number of CPU cores, etc.
Windbg: !memusage 0x8
38:13 in video
Soft faults – pages that are brought into the working set from either the Standby List or the Modified List (no disk I/O involved)
Hard faults – paging read operations involving disk I/O
The Process\Page Faults/sec counter is misleading (it counts soft faults as well)
The Free List and Zero List will be empty on most busy systems, so the memory manager will get memory from the Standby List.
The Standby List is a file cache.
Available memory = Free List + Zero List + Standby List
Don't use memory optimizers!
The only indicator that you need more memory is Available Memory that stays small much of the time (Available Bytes perfmon counter)
Commit charge limit (the maximum of the page file) vs. total committed private memory (the sum of all private bytes from all processes + the kernel paged pool's private bytes)
PF Usage in Task Manager is really Potential Pagefile Usage (how much memory may be committed or paged out)
Peak: the real counter that should be used for sizing the page file
Windows shrinks the page file when it no longer references it or commits virtual memory to disk
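The available-memory bookkeeping above is simple arithmetic; a sketch, with the three list names from the notes as inputs (the helper itself is illustrative):

```python
def available_memory_kb(free_list_kb, zero_list_kb, standby_list_kb):
    """Available memory as Windows reports it: free pages plus zeroed
    pages plus the standby list (which doubles as the file cache)."""
    return free_list_kb + zero_list_kb + standby_list_kb

# On a busy system the free and zero lists are typically empty,
# so almost all "available" memory comes from the standby list.
available_memory_kb(0, 0, 524288)
```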
How to determine the appropriate page file size for 64-bit versions of Windows Server 2003 or Windows XP
http://support.microsoft.com/default.aspx/kb/889654/
Process Explorer, Task Manager, Windows Debugger
Mem Usage in Task Manager is actually the Physical Memory Working Set (as shown in Process Explorer)
VM Size in Task Manager is actually Private Bytes in Process Explorer (both reserved and committed)
Virtual Size in Process Explorer is the actual VM size
Memory leak monitoring using Private Bytes (Delta and History columns in Process Explorer)
If there were a memory leak we would in theory eventually hit the 8TB limit; in practice we will hit a limit sooner.
The System Commit Limit is the total amount of private virtual memory, across all processes in the system plus the operating system itself, that the system can keep track of at any one time.
It is less than the full physical memory because Windows uses some amount of memory on startup that can't be used for process virtual memory.
A process starts with an empty working set. As the process starts consuming resources, the OS populates the working set.
Prefetch technology (the myth about emptying the Prefetch directory on Windows XP)
As the process working set grows, the Memory Manager will take some pages away (LRU)
Symmetric Multiprocessing, or SMP, is a multiprocessor computer architecture where two or more identical processors are connected to a single shared main memory. Most common multiprocessor systems today use an SMP architecture.
SMP systems allow any processor to work on any task no matter where the data for that task is located in memory; with proper operating system support, SMP systems can easily move tasks between processors to balance the workload efficiently. On the downside, memory is much slower than the processors accessing them, and even single-processor machines tend to spend a considerable amount of time waiting for data to arrive from memory. SMP makes this worse, as only one processor can access memory at a time; it is possible for several processors to be starved.
The typical SMP system includes multiple CPUs. The CPU typically contains an L1 cache. The L2 cache is typically managed by the CPU but the memory for the L2 cache is external to the CPU. The system may have an L3 cache which is managed externally to the CPU. The L3 cache is likely to be shared by multiple CPUs. The system will also contain main memory. The contents of main memory may be present in any of the caches. Hardware must exist to maintain the coherency of main memory and the various caches. Typical memory latency increases at each level: L1 cache hit, L2 cache hit, L3 cache hit, main memory access.
The system also contains one or more I/O busses, I/O controllers attached to the I/O bus, and devices attached to the controllers.
Source: http://lse.sourceforge.net/numa/faq/
Minimize/eliminate front-side-bus contention to surpass the scalability limits of the SMP architecture
Performance penalty for accessing foreign-node memory
The application needs to be NUMA-aware to take advantage of the node-locality design
What does NUMA stand for? NUMA stands for Non-Uniform Memory Access.
OK, so what does Non-Uniform Memory Access really mean to me? Non-Uniform Memory Access means that it will take longer to access some regions of memory than others. This is because some regions of memory are on physically different busses from other regions. For a more visual description, please refer to the section on NUMA architecture implementations. Also, see the real-world analogy for the NUMA architecture. This can result in some programs that are not NUMA-aware performing poorly. It also introduces the concept of local and remote memory.
What is the difference between NUMA and SMP? The NUMA architecture was designed to surpass the scalability limits of the SMP architecture. With SMP, which stands for Symmetric Multi-Processing, all memory accesses are posted to the same shared memory bus. This works fine for a relatively small number of CPUs, but the problem with the shared bus appears when you have dozens, even hundreds, of CPUs competing for access to the shared memory bus. NUMA alleviates these bottlenecks by limiting the number of CPUs on any one memory bus, and connecting the various nodes by means of a high-speed interconnect.
What is the difference between NUMA and ccNUMA? The difference is almost nonexistent at this point. ccNUMA stands for Cache-Coherent NUMA, but NUMA and ccNUMA have really come to be synonymous. The applications for non-cache-coherent NUMA machines are almost nonexistent, and they are a real pain to program for, so unless specifically stated otherwise, NUMA actually means ccNUMA.
What is a node? One of the problems with describing NUMA is that there are many different ways to implement this technology. This has led to a plethora of "definitions" for node. A fairly technically correct and also fairly ugly definition of a node is: a region of memory in which every byte has the same distance from each CPU. A more common definition is: a block of memory and the CPUs, I/O, etc. physically on the same bus as the memory. Some architectures do not have memory, CPUs, and I/O all on the same physical bus, so the second definition does not truly hold. In many cases, the less technical definition should be sufficient, but often the technical definition is more correct.
What is meant by local and remote memory? The terms local memory and remote memory are typically used in reference to a currently running process. That said, local memory is typically defined to be the memory that is on the same node as the CPU currently running the process. Any memory that does not belong to the node on which the process is currently running is then, by that definition, remote. Local and remote memory can also be used in reference to things other than the currently running process. When in interrupt context, there technically is no currently executing process, but memory on the node containing the CPU handling the interrupt is still called local memory. Also, you could use local and remote memory in terms of a disk. For example, if there was a disk (attached to node 1) doing a DMA, the memory it is reading or writing would be called remote if it were located on another node (i.e., node 0).
What do you mean by distance? NUMA-based architectures necessarily introduce a notion of distance between system components (i.e., CPUs, memory, I/O busses, etc.). The metric used to determine a distance varies, but hops is a popular metric, along with latency and bandwidth. These terms all mean essentially the same thing that they do when used in a networking context (mostly because a NUMA machine is not all that different from a very tightly coupled cluster). So when used to describe a node, we could say that a particular range of memory is 2 hops (busses) from CPUs 0..3 and SCSI Controller 0. Thus, CPUs 0..3 and the SCSI Controller are a part of the same node.
Could you give a real-world analogy of the NUMA architecture to help understand all these terms? Imagine that you are baking a cake. You have a group of ingredients (= memory pages) that you need to complete the recipe (= process). Some of the ingredients you may have in your cabinet (= local memory), but some of the ingredients you might not have, and have to ask a neighbor for (= remote memory). The general idea is to try and have as many of the ingredients in your own cabinet as possible, since this reduces your time and effort in making the cake. You also have to remember that your cabinets can only hold a fixed amount of ingredients (= physical nodal memory). If you try to buy more but you have no room to store it, you may have to ask your neighbor to keep it in his/her cabinet until you need it (= local memory full, so allocate pages remotely). A bit of a strange example, I'll admit, but I think it works. If you have a better analogy, I'm all ears! ;)
Why should I use NUMA? What are the benefits of NUMA? The main benefit of NUMA is, as mentioned above, scalability. It is extremely difficult to scale SMP past 8–12 CPUs. At that number of CPUs, the memory bus is under heavy contention. NUMA is one way of reducing the number of CPUs competing for access to a shared memory bus. This is accomplished by having several memory busses and only having a small number of CPUs on each of those busses. There are other ways of building massively multiprocessor machines, but this is a NUMA FAQ, so we'll leave the discussion of other methods to other FAQs.
What are the peculiarities of NUMA? CPU and/or node caches can result in NUMA effects. For example, the CPUs on a particular node will have a higher bandwidth and/or a lower latency when accessing the memory and CPUs on that same node. Due to this, you can see things like lock starvation under high contention. This is because if CPU x in the node requests a lock already held by another CPU y in the node, its request will tend to beat out a request from a remote CPU z.
What are some alternatives to NUMA? Splitting memory up and (possibly arbitrarily) assigning it to groups of CPUs can give some performance benefits similar to actual NUMA. A setup like this would be like a regular NUMA machine where the line between local and remote memory is blurred, since all the memory is actually on the same bus. The PowerPC Regatta system is an example of this. You can achieve some NUMA-like performance by using clusters as well. A cluster is very similar to a NUMA machine, where each individual machine in the cluster becomes a node in our virtual NUMA machine. The only real difference is the nodal latency. In a clustered environment, the latency and bandwidth of the internodal links are likely to be much worse.
On startup, Database Engine writes the node information to the error log. To determine the node number of the node you want to use, either read the node information from the error log, or from the sys.dm_os_schedulers view.
select * from sys.dm_os_schedulers
http://blogs.msdn.com/slavao/archive/2005/08/18/453354.aspx
http://blogs.msdn.com/slavao/articles/441058.aspx
http://technet.microsoft.com/en-us/library/ms178144.aspx
1. Use a flipchart and calculator to show the calculation of the affinity mask
The customer wanted to partition a single SQL Server instance based on the load. The customer's application is heterogeneous: it consists of TPC-H-type queries and data-loading applications. The customer has a NUMA system with 2 nodes and 4 CPUs per node. The customer wanted to give the loading application two CPUs and the rest of the CPUs to the queries. Is it possible to achieve this?
As you might guess, the answer is SQL 2005's Soft-NUMA support. We advised them to configure SQL Server and clients to treat the system as a three-node NUMA system. (Surprised? Yes, it is possible with SQL 2005 Soft-NUMA support.) The configuration looks like the following: node zero has 4 CPUs, the first node has 2 CPUs, and the last node has 2 CPUs. Keep in mind that when you configure SQL Server for Soft-NUMA, soft nodes should be fully contained in the real nodes, i.e. a soft node cannot span several real NUMA nodes. The customer's TPC-H queries were configured to utilize the zero and first nodes, and the load application was configured to utilize the last node. Once configured and started, this configuration worked as expected – load was fully partitioned across CPUs. It is important to note that the customer was delighted with the experience. Below is the example of node & network configuration we provided the customer with:
http://blogs.msdn.com/slavao/archive/2005/08/18/453354.aspx
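The affinity-mask arithmetic from the flipchart exercise can be sketched like this. The node/CPU split matches the customer scenario above; the helper function and the specific CPU-to-node assignment shown are illustrative assumptions:

```python
def affinity_mask(cpus):
    """Build a CPU affinity bitmask from a list of CPU indices:
    bit n is set if CPU n belongs to the node."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask

# 2 real NUMA nodes x 4 CPUs, exposed as 3 soft-NUMA nodes:
# soft node 0 keeps CPUs 0-3, soft node 1 gets CPUs 4-5, and soft
# node 2 (the load application) gets CPUs 6-7. Note that no soft
# node spans a real node boundary (CPUs 0-3 vs. CPUs 4-7).
soft_nodes = {
    0: affinity_mask([0, 1, 2, 3]),   # 0x0F
    1: affinity_mask([4, 5]),         # 0x30
    2: affinity_mask([6, 7]),         # 0xC0
}
```

These hex masks are the values one would plug into the Soft-NUMA registry configuration mentioned on slide 24.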
The Lock Manager and Buffer Pool are not really part of SQLOS; they are just lumped in with it today
Resource Monitor:
Maps internal and external memory states into the appropriate notification
Encapsulates a simple state machine
Sends memory notifications to memory clerks
"Slows down" if the memory state doesn't change for some time
Scheduling facts:
Runs on its own (hidden) scheduler, one per (NUMA) node
Runs in non-preemptive mode
http://blogs.msdn.com/slavao/archive/2005/02/19/376714.aspx
http://sqlserver.ro/blogs/cristians_blog/archive/2007/10/02/the-return-of-the-ring-buffers.aspx
http://support.microsoft.com/default.aspx/kb/918483
Two types of memory pressure:
VAS memory pressure
Physical memory pressure
VAS memory pressure – RM is notified through:
Reactive – notified by the memory node when the Virtual or Shared memory interfaces fail to allocate a region of 4MB or below
(RM doesn't get notified if the size of the region is above 4MB)
Proactive – RM probes VAS for a 4MB-sized region
Physical memory pressure:
Internal
Shrinking the Buffer Pool causes internal pressure
Dynamic change of max server memory
75% of BP stolen by caches
When triggered, BP reclaims pages from caches
External (OS)
Signaled by the OS; wakes up RM, which broadcasts a notification to the memory clerks
BP recalculates its target commit and starts shrinking if the new target is lower than currently committed, until the pressure disappears
De-commits in non-AWE mode
Frees physical memory in AWE mode (different from SQL Server 2000)
BP only monitors external memory pressure
Resource Monitor and Memory Pressure
When configuring SQL Server it is very important to understand how it reacts to memory pressure. I have already spent a significant amount of time describing the types of memory pressure. In this post you will understand why it is important. Memory pressure falls into two major groups: VAS and physical. Physical memory pressure can be imposed by the OS – we call it external – or it can be imposed by the process itself – we call it internal.
SQLOS implements a complete framework that enables the process to handle any type of memory pressure. At the heart of the framework lies the Resource Monitor task, RM. RM monitors the state of the external and internal memory indicators. Once one of them changes, RM observes the state of all indicators. Then it maps the indicators' states into the corresponding notification. Once the notification is calculated, it broadcasts it to the memory clerks.
                         ------------------
                         | Resource Monitor |
                         ------------------
                        /        |         \
 ---------------------------  -----------  ----------------------------
 | Low Physical            |  | Low VAS |  | High Physical            |
 | Internal/External       |  -----------  | Internal/External        |
 ---------------------------               ----------------------------
Resource Monitor and Memory Clerks
Remember that SQLOS has two types of nodes: memory nodes and CPU nodes. Memory nodes provide locality of allocations and CPU nodes provide locality of scheduling. Currently every CPU node has its own Resource Monitor. The reason is to be able to react to memory pressure on a given node – I will talk more about CPU nodes when covering the SQLOS scheduling subsystem. For now, remember that depending on the machine configuration there could be multiple RM tasks running at the same time.
Large memory consumers leverage memory clerks to allocate memory. One more important task of memory clerks is to process notifications from RM. A consumer can subscribe its clerk to receive memory pressure notifications and react to them accordingly.
Every CPU node has a list of memory clerks. First RM calculates the notification it needs to send. Then it goes through the list and broadcasts the notification to each memory clerk one by one. During the broadcast, caches receive the notification as well, since they are memory clerks.
                       ------------------
                       | Resource Monitor |
                       ------------------
           /               |              |              \
 Generic Memory     Cache Memory     Buffer Pool      CLR Memory
     Clerk             Clerk         Memory Clerk        Clerk
From RM's scheduling point of view there are a couple of important points you need to be aware of:
Resource Monitor runs on its own scheduler, which we call a hidden scheduler
Resource Monitor runs in non-preemptive mode
The DAC node doesn't have its own Resource Monitor
There are several memory clerks that can respond to memory pressure. We already talked about caches. In addition, every CPU node leverages its clerk to trim the worker and system thread pools under memory pressure. Full-text search leverages its memory clerk to shrink the shared memory buffers it shares with MSSearch. The CLR uses its clerk to trigger garbage collection. The Buffer Pool leverages its clerk to respond to external and VAS memory pressure only. (Why?)
External Memory Pressure: RM and Buffer Pool
From the SQLOS perspective the Buffer Pool is a single-page allocator – an extensively used memory manager. External memory pressure is signaled by Windows. RM wakes up and broadcasts the corresponding notification to the clerks. Upon receiving the notification, BP recalculates its target commit – the amount of physical memory BP is allowed to consume. Keep in mind that the target commit can't be lower than the min server memory configuration parameter specified through sp_configure, and can't be higher than max server memory. If the new target commit is lower than the currently committed buffers, BP starts shrinking until the external physical memory pressure disappears. During this process BP tries to de-commit memory or, in the case of AWE, free physical memory back to the OS. Remember that in SQL 2000, BP didn't react to physical memory pressure when running in AWE mode.
Internal Memory Pressure: BP and Resource Monitor
Shrinkage of BP causes internal memory pressure. This is one of the ways for BP to get process into internal physical memory pressure What components BP notifies about internal memory pressure? Yes, you guessed correctly, SQLOS exposes a mechanism for BP to turn on RM&apos;s indicator corresponding to internal memory pressure. As you learned RM translates the indicator&apos;s signal to notification it will broadcast to clerks. BP has its clerk and will get RM&apos;s notification back. Oh no, we get into infinite loop!? Actually this is not the case because BP only monitors external physical memory pressure. It ignores internal physical memory pressure altogether.
There are a couple of other ways for internal physical pressure to appear. It can be caused by dynamically changing max server memory. In addition, it can arise when 75% of BP's pages are stolen through SQLOS's single-page allocator interface. By triggering internal physical memory pressure, BP reclaims its pages from caches and other components currently consuming them.
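One way to see which components are holding pages obtained through the single-page allocator (i.e., stolen from the buffer pool) is to query sys.dm_os_memory_clerks; a sketch, with column names as they appear in SQL Server 2005:

```sql
-- pages from the single-page allocator come out of the buffer pool, so
-- large single_pages_kb consumers are the likely "page stealers"
SELECT type,
       SUM(single_pages_kb) AS single_pages_kb,  -- stolen from the buffer pool
       SUM(multi_pages_kb)  AS multi_pages_kb    -- allocated outside the buffer pool
FROM sys.dm_os_memory_clerks
GROUP BY type
ORDER BY SUM(single_pages_kb) DESC;
```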
VAS Memory Pressure
So far I have discussed how SQLOS, and consequently SQL Server, handles external and internal physical memory pressure. Handling VAS pressure is harder, because on Windows it is difficult to recognize. There are two ways RM can be notified about VAS pressure. The first is for a memory node to notify RM: when a memory node's Virtual or Shared memory interface fails to allocate a region of 4MB or below (RM doesn't get notified if the size of the region is above 4MB), the memory node turns on RM's VAS low indicator. There is also a proactive way: while RM is running, it probes VAS for a 4MB region; if no such region exists any longer, RM itself turns on the VAS low signal and starts broadcasting the corresponding notification.
Responding to VAS pressure is part of what makes Yukon different from SQL2000. In SQL2000 it is hard for the server to recover once it gets into VAS pressure. In Yukon the VAS pressure notification is sent to all memory clerks so they have an opportunity to shrink. For example, a CPU node will shrink its thread pool, the CLR might unload appdomains that are not currently in use, and the network libraries will shrink their network buffers.
Do you remember that, when talking about the SQLOS memory manager, I mentioned that in AWE mode BP is capable of reacting to VAS pressure? Here it all comes together. When BP receives the VAS low notification, it enumerates the 4MB VAS regions it reserved previously. If it finds a 4MB region that is either not currently in use or used only by database pages, it can easily free it.
Monitoring memory pressure:
The subject wouldn't be complete without taking a look at how one can monitor and diagnose the different types of pressure SQL Server gets exposed to. Yes, we made your life and ours simpler: there is a DMV you can query to find the history of memory pressure.
The following query shows the set of recent notifications RM has broadcast:
select * from sys.dm_os_ring_buffers
where
ring_buffer_type='RING_BUFFER_RESOURCE_MONITOR'
(Yes, we have several different ring buffers that you can peek into :-), including schedulers, exceptions, and OOMs, but these are subjects for different posts.)
Here is an example of the query output:
<Record id="0" type="RING_BUFFER_RESOURCE_MONITOR" time="788327260">
  <ResourceMonitor>
    <Notification>RESOURCE_MEMPHYSICAL_HIGH</Notification>
    <Indicators>1</Indicators>
    <NodeId>0</NodeId>
  </ResourceMonitor>
  <MemoryNode id="0">
    <AvailableMemoryOnNode>0</AvailableMemoryOnNode>
    <ReservedMemory>2111472</ReservedMemory>
    <CommittedMemory>20944</CommittedMemory>
    <SharedMemory>0</SharedMemory>
    <AWEMemory>0</AWEMemory>
    <SinglePagesMemory>1792</SinglePagesMemory>
    <MultiplePagesMemory>6680</MultiplePagesMemory>
    <CachedMemory>592</CachedMemory>
  </MemoryNode>
  <MemoryRecord>
    <TotalPhysicalMemory>1047556</TotalPhysicalMemory>
    <AvailablePhysicalMemory>542532</AvailablePhysicalMemory>
    <TotalPageFile>3254476</TotalPageFile>
    <AvailablePageFile>2242756</AvailablePageFile>
    <TotalVirtualAddressSpace>2097024</TotalVirtualAddressSpace>
    <AvailableVirtualAddressSpace>972352</AvailableVirtualAddressSpace>
    <AvailableExtendedVirtualAddressSpace>0</AvailableExtendedVirtualAddressSpace>
  </MemoryRecord>
</Record>
The following query shows when BP, the single-page allocator, turns internal memory pressure on and off:
select * from sys.dm_os_ring_buffers
where
ring_buffer_type='RING_BUFFER_SINGLE_PAGE_ALLOCATOR'
<Record id="9" type="RING_BUFFER_SINGLE_PAGE_ALLOCATOR" time="789165566">
  <Pressure status="0">
    <AllocatedPages>477</AllocatedPages>
    <AllAllocatedPages>477</AllAllocatedPages>
    <TargetPages>31553</TargetPages>
    <AjustedTargetPages>31553</AjustedTargetPages>
    <CurrentTime>788967250</CurrentTime>
    <DeltaTime>110</DeltaTime>
    <CurrentAllocationRequests>79709</CurrentAllocationRequests>
    <DeltaAllocationRequests>156</DeltaAllocationRequests>
    <CurrentFreeRequests>79232</CurrentFreeRequests>
    <DeltaFreeRequests>23640</DeltaFreeRequests>
  </Pressure>
</Record>
Sorting the outputs from these two queries by time allows you to observe the actual behavior of SQL Server over time with respect to memory pressure.
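The sorting described above can be sketched as a single query that merges the two ring buffers into one timeline, ordered by the DMV's timestamp column (milliseconds since server start):

```sql
-- one combined, time-ordered view of both memory-pressure ring buffers
SELECT ring_buffer_type,
       timestamp,                     -- ms since server start; the sort key
       CAST(record AS xml) AS record  -- raw XML payload, as in the examples above
FROM sys.dm_os_ring_buffers
WHERE ring_buffer_type IN ('RING_BUFFER_RESOURCE_MONITOR',
                           'RING_BUFFER_SINGLE_PAGE_ALLOCATOR')
ORDER BY timestamp;
```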
If you are a careful reader, most of the output from the ring buffer queries should make sense to you by now. Some time later I will try to spend more time on a detailed description of the output.
Conclusion:
Memory pressure can significantly impact server performance and stability, especially when SQL Server shares a box with other applications, or shares its VAS with extended stored procedures or the CLR. Memory pressure can trigger extra I/O, recompiles, and other unnecessary activity. Understanding and diagnosing the types of memory pressure SQL Server is exposed to is a very important part of managing your server and writing applications for it. I hope the information provided in this post will enable you to do your job more efficiently.
http://www.modhul.com/2007/11/10/optimising-system-memory-for-sql-server-part-i/
The Address Windowing Extensions (AWE) facility in Windows exists to allow applications to access more than 4GB of physical memory. A 32-bit pointer is an integer that is limited to storing values of 0xFFFFFFFF or less—that is, to references within a linear 4GB memory address space. AWE allows an application to circumvent this limitation and access all the memory supported by the operating system.
At a conceptual level, AWE is nothing new—operating systems and applications have been using similar mechanisms to get around pointer limitations practically since the dawn of computers. Typically, mechanisms that allow a pointer to access memory at locations beyond its direct reach (i.e., at addresses too large to store in the pointer itself) pull off their magic by providing a window or region within the accessible address space that is used to transfer memory to and from the inaccessible region. This is how AWE works: You provide a region in the process address space—a window—to serve as a kind of staging area for transfers to and from memory that would otherwise be inaccessible to user mode code.
In order to use AWE, an application:
1. Allocates the physical memory to be accessed using the Win32 AllocateUserPhysicalPages API function. This function requires that the caller hold the Lock Pages in Memory privilege.
2. Creates a region in the process address space to serve as a window for mapping views of this physical memory, using the VirtualAlloc API function.
3. Maps a view of the physical memory into the virtual memory window using the MapUserPhysicalPages or MapUserPhysicalPagesScatter Win32 API functions.
While AWE exists on all editions of Windows 2000 and later and can be used even on systems with less than 2GB of physical RAM, it's most typically used on systems with 2GB or more of memory because it's the only way a 32-bit process can access memory beyond 3GB. If you enable AWE support in SQL Server on a system with less than 3GB of physical memory, the system ignores the option and uses conventional virtual memory management instead. One interesting characteristic of AWE memory is that it is never swapped to disk. You'll notice that the AWE-specific API routines refer to the memory they access as physical memory. This is exactly what AWE memory is: physical memory that is never swapped to or from the system paging file.
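In SQL Server 2005, AWE support is turned on through sp_configure; a minimal sketch (the service account must also hold the Lock Pages in Memory privilege, and the change takes effect after an instance restart):

```sql
-- 'awe enabled' is an advanced option, so expose it first
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- enable AWE; effective after the instance is restarted
EXEC sp_configure 'awe enabled', 1;
RECONFIGURE;
```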
DBCC MEMORYSTATUS
(AWE available only for BUFFERPOOL)
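DBCC MEMORYSTATUS takes no arguments; its output includes per-clerk counters, and only the buffer pool reports AWE-allocated memory:

```sql
-- snapshot of SQL Server's internal memory state, including
-- per-clerk counters such as "AWE Allocated" for the buffer pool
DBCC MEMORYSTATUS;
```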
When using AWE, you must remove the /3GB switch from boot.ini if the server has 16GB or more of physical memory installed, because the OS needs the full 2GB of kernel address space to manage 16GB or more of physical memory.
/3GB vs. AWE
The ability to increase the private process address space by as much as 50 percent via application memory tuning is certainly a handy and welcome enhancement to Windows memory management facilities; however, the Windows AWE facility is far more flexible and scalable. As I said earlier, when you increase the private process address space by a gigabyte, that gigabyte comes from the kernel mode address space, which shrinks from 2GB to 1GB. Since the kernel mode code is already cramped for space even when it has the full 2GB to work with, shrinking this space means that certain internal kernel structures must also shrink. Chief among these is the table Windows uses to manage the physical memory in the machine. When you shrink the kernel mode partition to 1GB, you limit the size of this table such that it can manage a maximum of only 16GB of physical memory. For example, if you're running under Windows 2000 Data Center on a machine with 64GB of physical memory and you boot with the /3GB option, you'll be able to access only 25 percent of the machine's RAM—the remaining 48GB will not be usable by the operating system or applications.
AWE also allows you to access far more memory than /3GB does. Obviously, you get just one additional gigabyte of private process space via /3GB. This additional space is made available to applications that are large-address aware automatically and transparently, but it is limited to just 1GB. AWE, by contrast, can make the entirety of the physical RAM that's available to the operating system available to an application, provided the application has been coded to make use of the AWE Win32 API functions. So, while AWE is more trouble to use and access, it's far more flexible and open-ended.
This isn't to say that there aren't situations where /3GB is preferable to AWE—there certainly are. For example, if you need more space for memory allocations that cannot reside in AWE memory (thread stacks, lock memory, procedure plans), you may find that /3GB is a better fit.
SQL Server acquires ‘max server memory’ memory at startup
Documented in BOL ‘Managing AWE memory’
When available physical memory is less than ‘max server memory’, allocates as much as available
AWE memory not released with memory pressure – static (dynamic in 2005)
AWE memory can not be shared between processes (SQL Server instances)
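Under SQL Server 2005 the AWE mapping belongs to the buffer pool's memory clerk, so its current size can be sketched from sys.dm_os_memory_clerks (awe_allocated_kb is the 2005 column name):

```sql
-- physical memory currently mapped through AWE, per clerk; on a 2005
-- instance only the buffer pool clerk is expected to report a nonzero value
SELECT type, SUM(awe_allocated_kb) AS awe_allocated_kb
FROM sys.dm_os_memory_clerks
GROUP BY type
HAVING SUM(awe_allocated_kb) > 0;
```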
How to reduce paging of buffer pool memory in the 64-bit version of SQL Server 2005
http://support.microsoft.com/kb/918483
http://blogs.msdn.com/psssql/archive/2007/10/18/do-i-have-to-assign-the-lock-privilege-for-local-system.aspx
SELECT CONVERT(varchar(30), GETDATE(), 121) AS runtime,
       DATEADD(ms, -1 * ((sys.cpu_ticks / sys.cpu_ticks_in_ms) - a.[Record Time]), GETDATE()) AS Notification_time,
       a.*,
       sys.ms_ticks AS [Current Time]
FROM (SELECT x.value('(//Record/ResourceMonitor/Notification)[1]', 'varchar(30)') AS [Notification_type],
             x.value('(//Record/MemoryRecord/MemoryUtilization)[1]', 'bigint') AS [MemoryUtilization %],
             x.value('(//Record/MemoryRecord/TotalPhysicalMemory)[1]', 'bigint') AS [TotalPhysicalMemory_KB],
             x.value('(//Record/MemoryRecord/AvailablePhysicalMemory)[1]', 'bigint') AS [AvailablePhysicalMemory_KB],
             x.value('(//Record/MemoryRecord/TotalPageFile)[1]', 'bigint') AS [TotalPageFile_KB],
             x.value('(//Record/MemoryRecord/AvailablePageFile)[1]', 'bigint') AS [AvailablePageFile_KB],
             x.value('(//Record/MemoryRecord/TotalVirtualAddressSpace)[1]', 'bigint') AS [TotalVirtualAddressSpace_KB],
             x.value('(//Record/MemoryRecord/AvailableVirtualAddressSpace)[1]', 'bigint') AS [AvailableVirtualAddressSpace_KB],
             x.value('(//Record/MemoryNode/@id)[1]', 'bigint') AS [Node Id],
             x.value('(//Record/MemoryNode/ReservedMemory)[1]', 'bigint') AS [SQL_ReservedMemory_KB],
             x.value('(//Record/MemoryNode/CommittedMemory)[1]', 'bigint') AS [SQL_CommittedMemory_KB],
             x.value('(//Record/@id)[1]', 'bigint') AS [Record Id],
             x.value('(//Record/@type)[1]', 'varchar(30)') AS [Type],
             x.value('(//Record/ResourceMonitor/Indicators)[1]', 'bigint') AS [Indicators],
             x.value('(//Record/@time)[1]', 'bigint') AS [Record Time]
      FROM (SELECT CAST(record AS xml)
            FROM sys.dm_os_ring_buffers
            WHERE ring_buffer_type = 'RING_BUFFER_RESOURCE_MONITOR') AS R(x)) a
CROSS JOIN sys.dm_os_sys_info sys
ORDER BY a.[Record Time] ASC
What is different from 32-bit?
All pointers are 64-bit
SQL Server commits ‘min server memory’ memory at startup
Otherwise, could take a long time to reach ‘min server memory’ on large memory systems
Some internal memory-related data-structure constants larger
64-bit alignment of data structures
What is the same?
No on-disk database format changes
Attach/detach, replication, log shipping …
No differences in buffer pool policy / algorithms from 32-bit
All uses of memory can use additional 64-bit memory
DB Page Cache, Query Workspace Memory, Plan Cache, Locks, External uses, Utilities,
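The address-space difference is visible from inside the instance; a sketch using sys.dm_os_sys_info (the byte-valued column names here are as they appear in SQL Server 2005):

```sql
-- on 64-bit, virtual_memory_in_bytes is terabytes-scale rather than ~2-3GB
SELECT cpu_count,
       physical_memory_in_bytes / 1048576 AS physical_memory_mb,
       virtual_memory_in_bytes  / 1048576 AS virtual_memory_mb
FROM sys.dm_os_sys_info;
```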
Comparison of 32-Bit and 64-Bit Memory Architecture
Article ID : 294418
SUMMARY
In the following table, the increased maximum resources of computers that are based on 64-bit versions of Windows and the 64-bit Intel processor are compared with existing 32-bit resource maximums. (Table columns: Architectural component | 64-bit Windows | 32-bit Windows.)
MORE INFORMATION
Virtual Memory
This is a method of extending the available physical memory on a computer. In a virtual memory system, the operating system creates a pagefile, or swapfile, and divides memory into units called pages. Recently referenced pages are located in physical memory, or RAM. If a page of memory is not referenced for a while, it is written to the pagefile. This is called &quot;swapping&quot; or &quot;paging out&quot; memory. If that piece of memory is then later referenced by a program, the operating system reads the memory page back from the pagefile into physical memory, also called &quot;swapping&quot; or &quot;paging in&quot; memory. The total amount of memory that is available to programs is the amount of physical memory in the computer in addition to the size of the pagefile.
Paging File
This is a disk file that the computer uses to increase the amount of physical storage for virtual memory.
Hyperspace
This is a special region that is used to map the process working set list and to temporarily map other physical pages for operations such as zeroing a page on the free list (when the zero list is empty and a zeroed page is needed), invalidating page table entries in other page tables (such as when a page is removed from the standby list), and, during process creation, setting up the address space of a new process.
Paged Pool
This is a region of virtual memory in system space that can be paged in and out of the working set of the system process. Paged pool is created during system initialization and is used by Kernel-mode components to allocate system memory. Uniprocessor systems have two paged pools, and multiprocessor systems have four. Having more than one paged pool reduces the frequency of system code blocking on simultaneous calls to pool routines.
Non-paged Pool
This is a memory pool that consists of ranges of system virtual addresses that are guaranteed to be resident in physical memory at all times and thus can be accessed from any address space without incurring paging input/output (I/O). Non-paged pool is created during system initialization and is used by Kernel-mode components to allocate system memory.
System Cache
These are pages that are used to map open files in the system cache.
System PTEs
A pool of system Page Table Entries (PTEs) that is used to map system pages such as I/O space, Kernel stacks, and memory descriptor lists.
The 2-GB User-Mode Virtual Memory Limitation
64-bit programs use a 16-terabyte tuning model (8 terabytes User and 8 terabytes Kernel). 32-bit programs still use the 4-GB tuning model (2 GB User and 2 GB Kernel). This means that 32-bit processes that run on 64-bit versions of Windows run in a 4-GB tuning model (2 GB User and 2GB Kernel). 64-bit versions of Windows do not support the use of the /3GB switch in the boot options.
sys.dm_exec_cached_plans (includes memory address)
Memory address of the cached entry. This value can be used with sys.dm_os_memory_objects to get the memory breakdown of the cached plan, and with sys.dm_os_memory_cache_entries to obtain the cost of caching the entry.
select ce.* from sys.dm_os_memory_objects mo
inner join sys.dm_exec_cached_plans cp
on cp.memory_object_address = mo.memory_object_address
inner join sys.dm_os_memory_cache_entries ce
on mo.memory_object_address = ce.memory_object_address
where cp.bucketid = 4444