2. Lord of the Rings
• The x86 processor has four protection levels,
called Ring 0 to Ring 3.
• Privileged code (the kernel) runs in Ring 0.
The processor ensures that privileged
instructions (such as enabling or disabling interrupts)
execute in kernel mode only.
• User applications run in Ring 3.
• Windows uses only Ring 0 and Ring 3. Some virtualization schemes use Ring 1
(for example, to run a guest kernel under a hypervisor).
4. How a system call works
• User code cannot enter kernel space directly with a jmp or call instruction.
• When an application makes a system call (for example through CreateFile or
ReadFile), the OS enters kernel mode (Ring 0) with the instruction int 2E,
which goes through an interrupt gate.
• The code segment descriptor records the ‘Ring’ at which the code may run.
For kernel-mode modules it is always Ring 0. If a user-mode program tries
‘jmp <kernel mode address>’, it causes an access violation, because the
segment descriptor says the processor must be in Ring 0.
• Kernel mode is entered very often (most Windows API calls end up there),
so newer processors provide sysenter, an optimized instruction for
entering kernel mode.
5. System Call continued..
• Windows maintains a system service dispatch table,
which is similar to the IDT. Each entry in the system service
table points to a kernel-mode system call routine.
• The int 2E handler probes and copies the parameters from the
user-mode stack to the thread’s kernel-mode stack, then fetches
and executes the correct system call routine from the
system service table.
• There are multiple system service tables: one for the
NT native APIs, one for USER and GDI (win32k.sys), and on older
releases IIS (spud.sys) could register an additional one.
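The dispatch step described above can be sketched as a small model. This is only an illustration: the routines, the service numbers, and the name system_service_trap are hypothetical and do not correspond to real Windows values.

```python
# Toy model of a system service dispatch table (all names and numbers are hypothetical).

def nt_close(handle):
    """Stand-in for a kernel-mode service routine."""
    return f"closed handle {handle}"

def nt_write_file(handle, data):
    """Stand-in for a kernel-mode service routine."""
    return f"wrote {len(data)} bytes to handle {handle}"

# Index = system service number; each entry is (routine, argument count).
SERVICE_TABLE = [
    (nt_close, 1),
    (nt_write_file, 2),
]

def system_service_trap(service_number, user_stack):
    """Validate the request, copy the arguments, then dispatch through the table."""
    if not 0 <= service_number < len(SERVICE_TABLE):
        raise ValueError("invalid system service number")
    routine, arg_count = SERVICE_TABLE[service_number]
    if len(user_stack) < arg_count:
        raise ValueError("user stack too short")
    # Copy the arguments so later changes to the user stack cannot affect the call.
    kernel_stack = tuple(user_stack[:arg_count])
    return routine(*kernel_stack)
```

The point of the copy is that the service routine only ever works on the kernel-side copy of the parameters.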
8. Let's try it in WinDbg..
• NtWriteFile:
mov eax, 0x0E ; build 2195 system service number for NtWriteFile
mov ebx, esp ; point to parameters
int 0x2E ; execute system service trap
ret 0x2C ; pop parameters off stack and return to caller
9. Software Interrupt Request Levels (IRQLs)
• Windows has its own interrupt priority scheme, known as
IRQL.
• IRQLs range from 0 to 31; a higher number means a
higher-priority interrupt level.
• The HAL maps hardware interrupts to IRQL 3 (Device 1)
through IRQL 31 (High).
• When a higher-priority interrupt occurs, it masks all
lower-priority interrupts and its ISR executes.
• After the ISR finishes, the kernel lowers the IRQL
and the ISR of the pending lower-priority interrupt executes.
• An ISR should do minimal work and defer
the bulk of the work to a Deferred Procedure Call
(DPC), which runs at the lower IRQL 2.
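The masking and deferral rules above can be tried out in a toy single-processor model. This is a sketch under simplifying assumptions: the device IRQLs 5 and 3, the class Processor, and the ISR/DPC routines are all made up for illustration.

```python
from collections import deque

PASSIVE_LEVEL, APC_LEVEL, DISPATCH_LEVEL = 0, 1, 2

class Processor:
    """Toy single-CPU model of IRQL masking with a per-processor DPC queue."""

    def __init__(self):
        self.irql = PASSIVE_LEVEL
        self.pending = []          # (irql, isr) pairs masked by the current IRQL
        self.dpc_queue = deque()
        self.log = []

    def interrupt(self, irql, isr):
        if irql <= self.irql:      # masked: remember it for later
            self.pending.append((irql, isr))
            return
        previous = self.irql
        self.irql = irql
        isr(self)
        self.lower_irql(previous)

    def queue_dpc(self, routine):
        self.dpc_queue.append(routine)

    def lower_irql(self, new_irql):
        self.irql = new_irql
        # Deliver interrupts that are no longer masked, highest IRQL first.
        while any(level > self.irql for level, _ in self.pending):
            entry = max((p for p in self.pending if p[0] > self.irql),
                        key=lambda p: p[0])
            self.pending.remove(entry)
            self.interrupt(*entry)
        # Drain DPCs at DISPATCH_LEVEL once the IRQL falls below it.
        if self.irql < DISPATCH_LEVEL and self.dpc_queue:
            self.irql = DISPATCH_LEVEL
            while self.dpc_queue:
                self.dpc_queue.popleft()(self)
            self.irql = new_irql

def keyboard_isr(cpu):
    cpu.log.append(f"keyboard ISR at IRQL {cpu.irql}")

def disk_dpc(cpu):
    cpu.log.append(f"disk DPC at IRQL {cpu.irql}")

def disk_isr(cpu):
    cpu.log.append(f"disk ISR at IRQL {cpu.irql}")
    cpu.interrupt(3, keyboard_isr)   # arrives while masked by IRQL 5
    cpu.queue_dpc(disk_dpc)          # defer the heavy work

def run_demo():
    cpu = Processor()
    cpu.interrupt(5, disk_isr)
    return cpu.log
```

In run_demo the keyboard interrupt is held back until the disk ISR returns, and the DPC runs last, after every ISR has finished.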
11. IRQL and DPC
• The DPC concept exists in other operating systems too; in
Linux it is called the bottom half.
• DPC queues are per processor, so a dual-processor
SMP box has two DPC queues.
• The ISR generally fetches data from the
hardware and queues a DPC for further
processing.
• IRQL priority is different from thread
scheduling priority.
12. IRQL and DPC
• The scheduler (dispatcher) also runs at IRQL 2.
• So code that executes at or above IRQL 2
(dispatch level) cannot be preempted by the scheduler.
• In the diagram, only hardware interrupts
and a few higher-priority interrupts such as clock and
power fail are above IRQL 2.
• Most of the time the OS is at IRQL 0 (passive
level).
• All user programs and most kernel code
execute at passive level.
13. IRQL continued..
• The scheduler runs at IRQL 2, so what happens if my driver tries to wait
at or above dispatch level?
• Simple: the system crashes with a ‘Blue Screen’, usually with the bug
check IRQL_NOT_LESS_OR_EQUAL.
• If a thread waits at or above dispatch level, no one can come and
switch the thread.
• What happens on an access to paged pool at or above dispatch level?
• If the pages are on disk, a page fault occurs; the current thread
must wait while the page fault handler reads the pages from the
page file into page frames in memory.
• If the page fault happens at or above dispatch level, no one can suspend
the current thread and schedule the page fault handler. Thus paged pool
cannot be accessed at or above dispatch level.
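The two rules above reduce to one check on the current IRQL. The sketch below is a model only: BugCheck, wait_for_object and touch_paged_memory are hypothetical names, and a real kernel does not implement the check this way.

```python
DISPATCH_LEVEL = 2

class BugCheck(Exception):
    """Stands in for a system crash (blue screen)."""

def wait_for_object(current_irql):
    # Waiting needs the scheduler, which cannot run at or above dispatch level.
    if current_irql >= DISPATCH_LEVEL:
        raise BugCheck("IRQL_NOT_LESS_OR_EQUAL")
    return "thread blocked; scheduler picks another thread"

def touch_paged_memory(current_irql, page_resident):
    if page_resident:
        return "access ok"
    # A non-resident page needs the page fault handler, which must be able to wait.
    if current_irql >= DISPATCH_LEVEL:
        raise BugCheck("IRQL_NOT_LESS_OR_EQUAL")
    return "page fault serviced; access ok"
```

Note that touching paged memory at a high IRQL only fails in this model when the page is not resident, which is why such bugs tend to show up intermittently.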
14. IRQL 1 - APCs
• Asynchronous Procedure Calls (APCs) run at IRQL 1.
• The main duty of an APC is to deliver data in the context
of a specific user thread.
• The APC queue is thread specific: each thread has its own APC
queue.
• A user-space thread initiates a read from a
device and either waits for it to finish or continues with
other work.
• The IO may finish some time later; the buffer then has
to be delivered in the calling thread’s process context. That is the
duty of the APC.
16. App issues ReadFile
User Land
– NtReadFile
Kernel Land
– IO Manager: creates the IRP and sends it to the driver stack
– File System
– Volume Manager
– Disk Class Driver
– Hardware Driver
17. What is an IO Request Packet (IRP)
• An IO operation passes through
– Different stages.
– Different threads.
– Different drivers.
• The IRP encapsulates the IO request.
• The IRP is thread independent.
18. IO Request Packet (IRP)
• When a thread initiates an IO operation, the IO
Manager creates a data structure called an IO Request
Packet (IRP).
• The IRP contains all the information about the
request.
• The IO Manager sends the IRP to the top device in
the driver stack.
• Demo: !irpfind to see all current IRPs.
• Demo: !irp <irp address> to see information
about one IRP.
19. IRP Continued..
• Compare the IRP with Windows messages
(the MSG structure).
• Each driver in the stack does its own task
and then forwards the IRP to the next lower driver
in the stack.
• An IRP can be processed synchronously or
asynchronously.
20. IRP Continued..
• The lower-level hardware driver usually takes the most
time. The H/W driver can mark the IRP pending
and return.
• When the H/W finishes the IO, the H/W driver completes the
IRP by calling IoCompleteRequest().
• IoCompleteRequest() calls the IO completion routines
set by the drivers in the stack and completes the IO.
21. Structure of IRP
• Fixed IRP header
• Variable number of stack locations
– One stack location per driver

IRP Header
Stack Location 1
Stack Location 2
Stack Location 3
Stack Location N
22. Flow of IRP
Storage Stack          IRP for Storage Stack
                       IRP Header
File System            Stack Location 1
Volume Manager         Stack Location 2
Disk Class Driver      Stack Location 3
Hardware Driver        Stack Location 4
Each driver forwards the IRP to the next lower driver in the stack.
23. Flow of IRP Completion
Storage Stack          IRP for Storage Stack
                       IRP Header
File System            Stack Location 1, Completion Routine
Volume Manager         Stack Location 2, Completion Routine
Disk Class Driver      Stack Location 3, Completion Routine
Hardware Driver        Stack Location 4, Complete the IRP
The completion routines are called while the IRP is being completed.
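The forward and completion flow can be sketched as a toy model. It is simplified on purpose: the class Irp and the functions send_irp and complete_irp are hypothetical, and each driver's completion routine is stored in that driver's own slot, which glosses over how the real IO stack locations are laid out.

```python
DRIVERS = ["File System", "Volume Manager", "Disk Class Driver", "Hardware Driver"]

class Irp:
    def __init__(self, stack_size):
        self.completion = [None] * stack_size   # one slot per stack location
        self.status = "PENDING"
        self.trace = []

def make_completion_routine(name):
    def routine(irp):
        irp.trace.append(f"completion routine: {name}")
    return routine

def send_irp(irp):
    """Pass the IRP down the stack; the lowest driver completes it."""
    lowest = len(DRIVERS) - 1
    for index, name in enumerate(DRIVERS):
        irp.trace.append(f"dispatch: {name}")
        if index < lowest:
            # Each upper driver registers a completion routine before forwarding.
            irp.completion[index] = make_completion_routine(name)
    complete_irp(irp, lowest)

def complete_irp(irp, completing_index):
    """Walk back up the stack, calling completion routines in reverse order."""
    irp.status = "SUCCESS"
    for index in range(completing_index - 1, -1, -1):
        if irp.completion[index] is not None:
            irp.completion[index](irp)
```

After send_irp(Irp(len(DRIVERS))), the trace lists the four dispatch steps top-down, followed by the three completion routines bottom-up.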
24. IRP Header
• IO buffer information.
• Flags
– Paging IO flag
– No-caching IO flag
• IO status – set on completion to the final status
of the request.
• IRP cancel routine
25. IRP Stack Location
• The IO Manager gets the number of drivers in the
stack from the top device object in the stack.
• When creating an IRP, the IO Manager allocates
as many IO stack locations as that
count.
26. Contents of IO Stack Location
• IO Completion routine specific to the
driver.
• File object specific to the request.
27. Asynchronous IO
• CreateFile(…, FILE_FLAG_OVERLAPPED ,..),
ReadFile(.., LPOVERLAPPED)
• When the IO operation completes, the IO Manager
signals the event in the OVERLAPPED structure.
28. How Async IO works in the Kernel
• The lower-layer driver completes the IRP in an arbitrary
thread context.
• The IO Manager calls the IO completion routines in reverse
order.
• If the operation is asynchronous, the IO Manager queues an APC
to the initiator thread.
• This APC carries the complete buffer and size information.
• The APC executes later in the context of the
initiator thread; it copies the buffer to user space and
signals the event set by the app.
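The hand-off from an arbitrary thread context back to the initiator can be modelled with a per-thread queue. This is a sketch only: Thread, deliver_apcs and complete_async_read are hypothetical names, and APC delivery is modelled as an explicit call instead of something the scheduler triggers.

```python
from collections import deque

class Thread:
    def __init__(self, name):
        self.name = name
        self.apc_queue = deque()      # each thread owns its APC queue
        self.user_buffer = None
        self.event_signaled = False

    def deliver_apcs(self):
        """Run queued APCs in this thread's own context."""
        while self.apc_queue:
            self.apc_queue.popleft()(self)

def complete_async_read(initiator, kernel_buffer):
    """Called in an arbitrary thread context when the IO finishes."""
    data = bytes(kernel_buffer)       # the APC carries the buffer contents and size

    def apc(thread):
        thread.user_buffer = data     # copy to user space in the right context
        thread.event_signaled = True  # wake up whoever waits on the event

    initiator.apc_queue.append(apc)
```

Nothing becomes visible to the initiator until its own deliver_apcs runs, which mirrors why the copy has to wait for the right thread context.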
29. Common issues related to IRPs
• After forwarding the IRP down, do not touch it (except from
the IO completion routine).
• If a lower driver marks the IRP pending, every driver above
it should do the same.
• If a middle-level driver needs to keep the IRP for further
processing after the lower driver has completed it, it can
return STATUS_MORE_PROCESSING_REQUIRED
from its completion routine.
• The middle-layer driver must then complete the IRP itself later.
• See the ReactOS source code (instead of reading a 20-page
doc).
• FastIO - Concepts
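The STATUS_MORE_PROCESSING_REQUIRED rule for completion routines can be illustrated with a small model of the completion walk. The function run_completion_routines and the three routines in the usage note are hypothetical; only the status name comes from the slide.

```python
STATUS_SUCCESS = "STATUS_SUCCESS"
STATUS_MORE_PROCESSING_REQUIRED = "STATUS_MORE_PROCESSING_REQUIRED"

def run_completion_routines(routines, start):
    """Call completion routines from index `start` up to index 0.

    `routines` is ordered top of the stack first, so the walk goes bottom-up.
    Returns the index at which a driver stopped the walk, or -1 if every
    routine ran.
    """
    for index in range(start, -1, -1):
        if routines[index]() == STATUS_MORE_PROCESSING_REQUIRED:
            return index     # this driver now owns the IRP
    return -1
```

Usage, assuming three routines fs, volmgr and disk_class where volmgr returns STATUS_MORE_PROCESSING_REQUIRED: run_completion_routines([fs, volmgr, disk_class], 2) stops at index 1, and fs is not called until the middle driver completes the IRP again, modelled as run_completion_routines([fs, volmgr, disk_class], 0).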
31. Locality Theory
• If page/cluster n is accessed, there is a high probability that
blocks near n will be accessed soon.
• All memory-based computing systems
rely on this principle.
• Windows has registry keys that configure
how many blocks/pages to pre-fetch.
• Applications with their own memory managers, such as
databases and multimedia workloads, do
application-aware pre-fetching.
32. Virtual Memory Manager (VMM)
• Apps feel that memory is unlimited – magic
done by the VMM.
• Multiple apps run concurrently without
interfering with each other's data.
• Each app feels the entire resource is its own.
• Protects OS memory from apps.
• Advanced apps may need to share
memory; the VMM provides an easy way to
share it.
33. VMM Continued..
• The VMM reserves a certain amount of the address space
for the kernel.
• On a 32-bit box: 2GB for the kernel and 2GB for
user apps.
• A specific area in kernel memory, called hyperspace, is
reserved for process-specific data such as PDEs and
PTEs.
34. Segmentation and Paging
• The x86 processor supports both segmentation and
paging.
• Paging can be enabled or disabled, but
segmentation is always on.
• Windows uses paging.
• Since segmentation cannot be disabled, Windows
defines segments that cover the entire address space
(also called ‘flat segments’).
35. Paging
• Physical memory is divided into equal-size
pages (4K on x86 platforms).
These are called ‘page frames’, and the list that tracks them
is the ‘page frame database’ (PF DB).
• The PF DB also holds flags for each frame, such as
read/write in progress, shared page, etc.
36. VMM Continued..
• The upper 2GB of kernel space is common to
all processes.
• What does that mean? Half of the page directory entries (PDEs)
are common to all processes!
• Experiment – Look at the PDEs of two processes
and confirm that half of them are the same.
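The "half of the PDEs" claim follows from simple arithmetic, sketched below under the usual 32-bit x86 assumptions of 4 KB pages and 1024 entries per table. The helper names are hypothetical, and the model ignores per-process kernel areas such as hyperspace.

```python
PAGE_SIZE = 4 * 1024
ENTRIES_PER_TABLE = 1024
BYTES_PER_PDE = ENTRIES_PER_TABLE * PAGE_SIZE   # one PDE maps 4 MB

def first_kernel_pde(kernel_base=0x80000000):
    """Index of the first page directory entry that maps kernel space."""
    return kernel_base // BYTES_PER_PDE

def kernel_half_is_shared(directory_a, directory_b):
    """Compare the kernel-space entries of two page directories."""
    start = first_kernel_pde()
    return directory_a[start:] == directory_b[start:]
```

With a 2GB split, first_kernel_pde() is 512, so entries 512 to 1023 of every page directory describe the same kernel mappings.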
37. Physical to Virtual address translation
• Address translation is needed in both directions: when the VMM
writes a PF to the pagefile, it must update the matching
PDE/PTE to say the page is on disk.
• Done by
– The Memory Management Unit (MMU) of the processor.
– The VMM, which helps the MMU.
• The VMM keeps the PDE/PTE information and hands it to the MMU
on each process context switch.
• The MMU translates virtual addresses to physical
addresses.
• CR3 register: holds the physical address of the current process's page directory.
38. Translation Lookaside Buffer (TLB)
• Address translation is a costly operation.
• It happens frequently – whenever code touches virtual
memory.
• The TLB keeps a list of the most frequently used
address translations.
• The list is tagged by process ID (on architectures without
tags, the TLB is flushed on a process switch).
• The TLB is a generic OS concept; the implementation is
architecture dependent.
• Before doing the address translation, the MMU
searches the TLB for the PF.
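The lookup order described above (TLB first, full translation only on a miss) can be sketched as a small cache. The class Tlb and its walk_page_tables callback are hypothetical; real TLBs are hardware structures with limited size and replacement policies that this model leaves out.

```python
class Tlb:
    """Toy TLB: caches (address space, virtual page) -> page frame."""

    def __init__(self, walk_page_tables):
        self.entries = {}
        self.walk_page_tables = walk_page_tables   # slow path: full table walk
        self.hits = 0
        self.misses = 0

    def translate(self, address_space_id, virtual_address):
        virtual_page = virtual_address >> 12       # 4 KB pages
        offset = virtual_address & 0xFFF
        key = (address_space_id, virtual_page)
        if key in self.entries:
            self.hits += 1
        else:
            self.misses += 1
            self.entries[key] = self.walk_page_tables(address_space_id, virtual_page)
        return (self.entries[key] << 12) | offset
```

The first access to a page pays for the table walk; repeated accesses to the same page are answered from the cache.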
39. Address Translation
• In a 32-bit x86 address, the 10 most significant bits
select an entry in the page directory; that PDE points to a
page table. Thus a process's page directory has 1024 entries.
• The next 10 bits select a PTE in that page table; the PTE
holds the starting address of the PF. Thus each page table
has 1024 entries.
• The remaining 12 bits address the location
within the PF. Thus the page size is 4K.
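The 10/10/12 split can be checked with a few shifts and masks. This is a minimal sketch: the function names are made up, and the page directory is modelled as nested dictionaries holding page frame numbers instead of real PDE/PTE bit layouts.

```python
def split_virtual_address(virtual_address):
    """Split a 32-bit address into (directory index, table index, page offset)."""
    directory_index = (virtual_address >> 22) & 0x3FF   # top 10 bits
    table_index = (virtual_address >> 12) & 0x3FF       # next 10 bits
    offset = virtual_address & 0xFFF                    # low 12 bits
    return directory_index, table_index, offset

def translate(virtual_address, page_directory):
    """Two-level walk: page directory -> page table -> page frame."""
    directory_index, table_index, offset = split_virtual_address(virtual_address)
    page_table = page_directory[directory_index]
    page_frame = page_table[table_index]
    return (page_frame << 12) | offset
```

For example, 0x80123456 splits into directory index 512, table index 291 (0x123) and offset 0x456.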
40. What is a Zero Page
• Page frames are not specific to apps.
• If App1 writes sensitive data to PF1, and later the VMM pushes
the page to the page file and attaches PF1 to App2, App2 can see
that sensitive data.
• That would be a big security flaw, so the VMM keeps a zero page list.
• Pages cannot be cleaned at the time memory is freed – that would be a
performance problem.
• The VMM has a dedicated thread, running at the lowest priority,
that picks page frames from the
free PF list, cleans them, and pushes them to the zero page list.
• The VMM allocates memory from the zero page list.
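The free list and zero page list can be modelled in a few lines. This is a sketch under simplifying assumptions: PageFrames and its methods are hypothetical, frames are tiny bytearrays, and the zeroing thread is modelled as a method that is called explicitly.

```python
from collections import deque

class PageFrames:
    def __init__(self, frame_count, frame_size=16):
        self.frames = [bytearray(frame_size) for _ in range(frame_count)]
        self.zero_list = deque(range(frame_count))   # all frames start out zeroed
        self.free_list = deque()                     # freed but still dirty

    def allocate(self):
        """Hand out zeroed frames only (raises IndexError if none are left)."""
        return self.zero_list.popleft()

    def free(self, frame):
        """Freeing is cheap: the frame is not cleaned here."""
        self.free_list.append(frame)

    def zero_page_thread_pass(self):
        """Background work: clean free frames and move them to the zero list."""
        while self.free_list:
            frame = self.free_list.popleft()
            self.frames[frame][:] = bytes(len(self.frames[frame]))
            self.zero_list.append(frame)
```

A frame that held another app's data can only be handed out again after zero_page_thread_pass has wiped it.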
41. Arbitrary Thread Context
• The top layer of the driver stack gets the
request (IRP) in the context of the requesting
process.
• A middle or lower layer driver MAY get the
request in any thread context (e.g. IO
completion): whatever thread happens to be
running.
• The address in the IRP is only valid with the
PDEs/PTEs of the original process context.
42. Arbitrary Thread Context continued..
• How to solve the issue?
• Note that half of the PDEs (the kernel area) are
common to all processes.
• If the buffer is somehow mapped into kernel memory
(the upper half of the PDEs), it is
accessible from every process.
43. Mapping buffer to Kernel space
• Allocate kernel pool in the calling
process context and copy the user buffer to this
kernel space.
• Memory Descriptor List (MDL) – the most
commonly used mechanism to make the data available
in kernel space; it describes the physical pages behind the buffer.
44. Standby list
• To reclaim pages from a process, the VMM first moves them
to the standby list.
• The VMM keeps them there for a pre-defined number of ticks.
• If the process refers to the same page again, the VMM removes it from
the standby list and gives it back to the process.
• The VMM frees pages from the standby list after the timeout
expires.
• Pages on the standby list are neither free nor owned by a
process.
• The VMM keeps min and max values for the free and standby
page counts. When a count goes out of its limits, the appropriate events are
signaled and the lists are adjusted.
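The standby behaviour described above (a grace period during which a trimmed page can be taken back cheaply) can be sketched as follows. The model follows the slide's timeout-based description; StandbyList and its methods are hypothetical names, and the tick counter is passed in by the caller.

```python
class StandbyList:
    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.pages = {}        # page -> tick at which it was trimmed
        self.free_list = []

    def trim(self, page, now):
        """Reclaim a page from a process: park it on the standby list."""
        self.pages[page] = now

    def soft_fault(self, page):
        """The process touched the page again; True if it was still on standby."""
        return self.pages.pop(page, None) is not None

    def tick(self, now):
        """Move pages whose timeout has expired to the free list."""
        expired = [p for p, t in self.pages.items() if now - t >= self.timeout_ticks]
        for page in expired:
            del self.pages[page]
            self.free_list.append(page)
```

A page recovered by soft_fault never touches the disk, which is the whole point of keeping it on standby for a while.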
47. Cache Manager concepts
• If disk heads moved at the speed of supersonic
jets, a Cache Manager would not be required.
• Disk access is the main bottleneck that
reduces system performance. CPUs and memory
got faster, but disks are still in the stone
age.
• It is a common concept in operating systems;
Unix flavors call it the ‘buffer cache’.
48. What Cache Manager does
• Keeps a system-wide cache of
frequently used secondary storage blocks.
• Provides read-ahead and write-back to
improve overall system performance.
• With write-back, the Cache Manager combines
multiple write requests and issues a single
write request to improve performance.
There is a risk associated with write-back: data still in the cache
is lost if the system crashes before it is written.
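Combining write requests can be illustrated by merging dirty block numbers into contiguous runs. This is a sketch of the general idea only, not of how the Windows Cache Manager tracks dirty data; coalesce_writes is a hypothetical helper.

```python
def coalesce_writes(dirty_blocks):
    """Merge dirty block numbers into contiguous (start, count) runs."""
    runs = []
    for block in sorted(set(dirty_blocks)):
        if runs and block == runs[-1][0] + runs[-1][1]:
            # Block directly follows the previous run: extend it.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((block, 1))
    return runs
```

Six write requests to blocks 7, 3, 4, 5, 8 and 4 collapse into two disk writes: (3, 3) and (7, 2).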
49. How Cache Manager works
• The Cache Manager implements caching using
memory mapping.
• The concept is similar to an app using a
memory-mapped file.
• CreateFile(…dwFlagsAndAttributes ,..)
• dwFlagsAndAttributes ==
FILE_FLAG_NO_BUFFERING means ‘I don’t want
the Cache Manager’.
50. How Cache Manager works..
• The Cache Manager reserves an area in the upper 2GB (x86
platform) system area.
• The number of pages reserved for the Cache Manager adjusts
to the system's memory demand.
• If the system runs many IO-intensive tasks, it
dynamically increases the cache size.
• If the system is low on memory, it reduces
the cache size.
51. How cached read operation works
User Space
(1) Cached read
Kernel Space
(2) File System gets the pages from the Cache Manager (CM)
(3) CM does memory mapping through the VMM
(4) Page fault in the VMM
(5) Get the blocks from disk through the disk stack (SCSI/Fibre Channel)
52. How cached write operation works
User Space
(1) Cached write
Kernel Space
(2) File System copies the pages to the Cache Manager (CM)
(3) CM does memory mapping and copies the data to VMM pages
(4) The modified page writer thread of the VMM writes to disk later
(5) Write the blocks to disk through the disk stack (SCSI/Fibre Channel)
53. Storage Stack Comparison – Windows vs. Linux
Windows                          Linux
File System (NTFS), Cache Mgr    VFS
Volume Manager                   File System (ext2, ext3, ...)
Class Driver (disk.sys)          Cache Mgr
Port Driver (e.g. storport)      Block Layer (LVM, RAID)
MiniPort (Emulex HBA)            Upper SCSI (Disk, CD)
                                 IO Scheduler
                                 SCSI Mid layer
                                 SCSI lower layer (HW)