Windows Kernel

  Sisimon Soman
Lord of the Rings
• The x86 processor has 4 levels of protection,
  called Ring 0 – 3.
• Privileged code (the kernel) runs in Ring 0.
  The processor ensures that privileged
  instructions (like enabling/disabling interrupts)
  execute in kernel mode only.
• User applications run in Ring 3.
• Rings 1 and 2 are rarely used; some hypervisors have run
  paravirtualized guest kernels in Ring 1.
Rings continued..
How system calls work
•   User code cannot enter kernel space directly with a jmp or a call
    instruction.
•   When a program makes a system call (like CreateFile or ReadFile), the OS
    enters kernel mode (Ring 0) through the instruction int 2E (an interrupt
    gate).
•   The code segment descriptor records the 'Ring' at which the code may
    run. For kernel-mode modules it is always Ring 0. If a user-mode program
    tries 'jmp <kernel mode address>' it causes an access violation, because
    the segment descriptor says the processor must be in Ring 0.
•   Kernel mode is entered very frequently (most Windows API calls end up
    in the kernel), so sysenter was introduced as an optimized instruction
    for entering kernel mode.
System Call continued..
• Windows maintains a system service dispatch table,
  similar to the IDT. Each entry in the system service
  table points to a kernel-mode system call routine.
• The int 2E handler probes and copies parameters from the
  user-mode stack to the thread's kernel-mode stack, then
  fetches and executes the correct system call procedure from
  the system service table.

• There are multiple system service tables: one for the
  NT native APIs, and another for the USER and GDI (win32k)
  calls.
System call mechanism..
Let's try it in WinDBG..
• NtWriteFile:
  mov eax, 0x0E ; build 2195 system service number for NtWriteFile
  mov ebx, esp ; point to parameters
  int 0x2E     ; execute system service trap
  ret 0x2C      ; pop parameters off stack and return to caller
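To see this stub on a live system, disassemble it in WinDBG with
u ntdll!NtWriteFile. The service number in eax varies between builds,
and newer builds dispatch through sysenter (syscall on x64) rather
than int 2E.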
Software Interrupt Request
          Levels (IRQLs)
• Windows has its own interrupt priority scheme known as
  IRQL.
• IRQL levels run from 0 to 31; a higher number means a
  higher-priority interrupt level.
• The HAL maps hardware interrupts to IRQL 3 (Device 1)
  through IRQL 31 (High).
• When a higher-priority interrupt occurs, it masks all lower
  interrupts and the ISR for the higher interrupt runs.
• After the ISR executes, the kernel lowers the interrupt level
  and executes the ISRs of the lower interrupts.
• An ISR should do minimal work and defer the bulk of the
  work to a Deferred Procedure Call (DPC), which runs at the
  lower IRQL 2.
Software Interrupt Request
     Levels (IRQLs)
IRQL and DPC
• The DPC concept is similar to other OSes; in
  Linux it is called the bottom half.
• DPC queues are per processor, so a dual-
  processor SMP box has two DPC queues.
• The ISR typically fetches data from the
  hardware and queues a DPC for further
  processing (see the sketch below).
• IRQL priority is different from thread
  scheduling priority.
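A minimal WDM-style sketch of the ISR/DPC split (the device extension,
register read and routine names are illustrative, not from the slides):

  #include <ntddk.h>

  typedef struct _MY_DEVICE_EXTENSION {      // hypothetical device extension
      KDPC  Dpc;                             // DPC object, initialized at startup
      ULONG LatchedStatus;                   // value grabbed from hardware in the ISR
  } MY_DEVICE_EXTENSION, *PMY_DEVICE_EXTENSION;

  // DPC routine: runs later at IRQL 2 (DISPATCH_LEVEL) and does the heavy work.
  VOID MyDpcRoutine(PKDPC Dpc, PVOID Context, PVOID Arg1, PVOID Arg2)
  {
      PMY_DEVICE_EXTENSION ext = (PMY_DEVICE_EXTENSION)Context;
      UNREFERENCED_PARAMETER(Dpc);
      UNREFERENCED_PARAMETER(Arg1);
      UNREFERENCED_PARAMETER(Arg2);
      DbgPrint("status latched by ISR: 0x%08X\n", ext->LatchedStatus);
  }

  // ISR: runs at device IRQL (> DISPATCH_LEVEL); do the minimum and defer the rest.
  BOOLEAN MyIsr(PKINTERRUPT Interrupt, PVOID Context)
  {
      PMY_DEVICE_EXTENSION ext = (PMY_DEVICE_EXTENSION)Context;
      UNREFERENCED_PARAMETER(Interrupt);
      ext->LatchedStatus = 0;                    // READ_REGISTER_ULONG(...) on real hardware
      KeInsertQueueDpc(&ext->Dpc, NULL, NULL);   // queue the deferred work
      return TRUE;                               // the interrupt was ours
  }

  // At driver init: KeInitializeDpc(&ext->Dpc, MyDpcRoutine, ext);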
IRQL and DPC
• The scheduler (dispatcher) also runs at IRQL 2.
• So code that executes at or above IRQL 2
  (dispatch level) cannot be preempted by the scheduler.
• From the diagram, note that only hardware interrupts
  and a few higher-priority interrupts like the clock and
  power-fail interrupts sit above IRQL 2.
• Most of the time the OS runs at IRQL 0 (passive
  level).
• All user programs and most kernel code
  execute at passive level.
IRQL continued..
•   The scheduler runs at IRQL 2, so what happens if my driver tries to wait
    at or above dispatch level?
•   Simple: the system crashes with a 'Blue Screen', usually with the bug
    check IRQL_NOT_LESS_OR_EQUAL.
•   If you wait at or above dispatch level, there is nobody left to come in
    and switch the thread.
•   What happens if you try to access paged pool at or above dispatch level?
•   If the pages are on disk, a page fault exception occurs; the
    current thread must wait while the page fault handler reads the
    pages from the page file into page frames in memory.
•   If the page fault happens at or above dispatch level, nobody can stop
    the current thread and schedule the page fault handler. Therefore paged
    pool cannot be accessed at or above dispatch level (see the sketch below).
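A minimal sketch of guarding a pageable access with the real
KeGetCurrentIrql routine (the function and buffer names are illustrative):

  #include <ntddk.h>

  VOID TouchPagedBuffer(PVOID PagedBuffer, SIZE_T Length)
  {
      // Touching pageable memory at or above DISPATCH_LEVEL risks the
      // IRQL_NOT_LESS_OR_EQUAL bug check: a page fault cannot be serviced there.
      if (KeGetCurrentIrql() < DISPATCH_LEVEL) {
          RtlZeroMemory(PagedBuffer, Length);   // safe: a page fault can be handled
      }
  }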
IRQL 1 - APCs
• Asynchronous Procedure Calls (APCs) run at IRQL 1.
• The main duty of an APC is to deliver data back into the
  user thread's context.
• The APC queue is thread specific; each thread has its
  own APC queue.
• A user-space thread initiates a read from a device and
  either waits for it to finish or continues with other work.
• The IO may finish some time later; the buffer then needs
  to be delivered into the calling thread's process context.
  That is the job of the APC (a user-mode analogue is
  sketched below).
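Kernel APC queuing is internal to the OS, but the same idea is visible
from user mode through the Win32 QueueUserAPC API: a routine queued to a
specific thread runs in that thread's context the next time the thread
enters an alertable wait. A minimal user-mode sketch (routine names and
the parameter value are illustrative):

  #include <windows.h>
  #include <stdio.h>

  // APC routine: executes in the context of the thread it was queued to.
  VOID CALLBACK MyApc(ULONG_PTR param)
  {
      printf("APC delivered, param=%lu\n", (unsigned long)param);
  }

  DWORD WINAPI Worker(LPVOID arg)
  {
      (void)arg;
      // An alertable wait lets pending APCs run in this thread.
      SleepEx(INFINITE, TRUE);
      return 0;
  }

  int main(void)
  {
      HANDLE thread = CreateThread(NULL, 0, Worker, NULL, 0, NULL);
      QueueUserAPC(MyApc, thread, 42);       // queue the APC to that thread
      WaitForSingleObject(thread, 2000);     // give it time to run
      CloseHandle(thread);
      return 0;
  }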
IO Manager
The app issues ReadFile, which calls NtReadFile and crosses from
user land into kernel land.

The IO Manager creates an IRP and sends it down the driver stack:

       File System
       Volume Manager
       Disk Class Driver
       Hardware Driver
What is an IO Request Packet (IRP)
• An IO operation passes through
  – different stages,
  – different threads,
  – different drivers.
• The IRP encapsulates the IO request.
• The IRP is thread independent.
IO Request Packet (IRP)
• When a thread initiates an IO operation, the IO
  Manager creates a data structure called an IO Request
  Packet (IRP).
• The IRP contains all the information about the
  request.
• The IO Manager sends the IRP to the top device in
  the driver stack.
• Demo : !irpfind to list all current IRPs.
  Demo : !irp <irp address> to see information
  about one IRP.
IRP Continued..
• Compare an IRP with the Windows message
  (MSG) structure.
• Each driver in the stack does its own task and
  finally forwards the IRP to the lower driver
  in the stack.
• An IRP can be processed synchronously or
  asynchronously.
IRP Continued..

• Usually the lower-level hardware driver takes the most
  time. The H/W driver can mark the IRP as pending
  and return.
• When the H/W finishes the IO, the H/W driver completes the
  IRP by calling IoCompleteRequest().
• IoCompleteRequest() calls the IO completion routines
  set by the drivers in the stack and completes the IO
  (see the sketch below).
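A minimal sketch of the pass-through and pending patterns using standard
WDM calls (the filter extension and routine names are illustrative):

  #include <ntddk.h>

  typedef struct _FILTER_EXTENSION {            // hypothetical device extension
      PDEVICE_OBJECT LowerDeviceObject;         // device we are layered above
  } FILTER_EXTENSION, *PFILTER_EXTENSION;

  // Pass-through dispatch: hand the IRP to the next-lower driver untouched.
  NTSTATUS FilterDispatchPassThrough(PDEVICE_OBJECT DeviceObject, PIRP Irp)
  {
      PFILTER_EXTENSION ext = (PFILTER_EXTENSION)DeviceObject->DeviceExtension;

      IoSkipCurrentIrpStackLocation(Irp);       // reuse our stack location for the lower driver
      return IoCallDriver(ext->LowerDeviceObject, Irp);
  }

  // A lowest-level driver that cannot finish immediately marks the IRP pending:
  //     IoMarkIrpPending(Irp);
  //     return STATUS_PENDING;
  // and later, when the hardware finishes:
  //     Irp->IoStatus.Status = STATUS_SUCCESS;
  //     Irp->IoStatus.Information = bytesTransferred;
  //     IoCompleteRequest(Irp, IO_NO_INCREMENT);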
Structure of IRP
• A fixed IRP header.
• A variable number of stack locations –
  – one sub-stack (IO stack location) per driver.
  (Layout: IRP Header, then Stack Location 1 … Stack Location N.)
Flow of IRP
(Diagram: the IRP for the storage stack – an IRP header plus one stack
location per driver.)

       File System        -> Stack Location 1
       Volume Manager     -> Stack Location 2
       Disk Class Driver  -> Stack Location 3
       Hardware Driver    -> Stack Location 4

Each driver forwards the IRP to the lower driver in the stack.
Flow of IRP Completion
(Diagram: the same IRP; the hardware driver completes the IRP and the
completion routines are called on the way back up.)

       File System        – completion routine   (Stack Location 1)
       Volume Manager     – completion routine   (Stack Location 2)
       Disk Class Driver  – completion routine   (Stack Location 3)
       Hardware Driver    – completes the IRP    (Stack Location 4)

The completion routines are called while the IRP is being completed.
IRP Header
• IO buffer information.
• Flags
  – page IO flag
  – no-caching IO flag
• IO status – set to the completion status when the IO
  is completed.
• IRP cancel routine.
IRP Stack Location
• The IO Manager gets the driver count of the
  stack from the top device in the stack.
• While creating the IRP, the IO Manager allocates
  a number of IO stack locations equal to the stack
  size recorded in the top device object.
Contents of IO Stack Location
• IO Completion routine specific to the
  driver.
• File object specific to the request.
Asynchronous IO
• CreateFile(…, FILE_FLAG_OVERLAPPED, …),
  ReadFile(…, LPOVERLAPPED)
• When the IO operation completes, the IO Mgr
  signals the EVENT in the OVERLAPPED structure
  (see the sketch below).
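A minimal user-mode sketch of an overlapped (asynchronous) read; the
file path is illustrative:

  #include <windows.h>
  #include <stdio.h>

  int main(void)
  {
      // Open for overlapped (asynchronous) reads.
      HANDLE file = CreateFileW(L"C:\\temp\\data.bin", GENERIC_READ,
                                FILE_SHARE_READ, NULL, OPEN_EXISTING,
                                FILE_FLAG_OVERLAPPED, NULL);
      if (file == INVALID_HANDLE_VALUE) return 1;

      char buffer[4096];
      OVERLAPPED ov = {0};
      ov.hEvent = CreateEventW(NULL, TRUE, FALSE, NULL);   // manual-reset event

      if (!ReadFile(file, buffer, sizeof(buffer), NULL, &ov) &&
          GetLastError() == ERROR_IO_PENDING) {
          // Do other work here; the IO manager signals ov.hEvent on completion.
          WaitForSingleObject(ov.hEvent, INFINITE);
      }

      DWORD bytesRead = 0;
      GetOverlappedResult(file, &ov, &bytesRead, FALSE);
      printf("read %lu bytes\n", bytesRead);

      CloseHandle(ov.hEvent);
      CloseHandle(file);
      return 0;
  }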
How Async IO works in the Kernel
• The lower-layer driver completes the IRP in an arbitrary
  thread context.
• The IO Mgr calls the IO completion routines in reverse
  order.
• If the operation is async, the IO Mgr queues an APC
  targeted at the initiating thread.
• The APC carries the complete buffer and size information.
• The APC executes later in the context of the
  initiating thread; it copies the buffer to user space and
  signals the event set by the app.
Common issues related to IRPs
• After forwarding the IRP down, don't touch it (except from
  the IO completion routine).
• If a lower driver marks the IRP as pending, all drivers above
  it should do the same.
• If a middle-level driver needs to keep the IRP for further
  processing after the lower driver completes it, it can
  return STATUS_MORE_PROCESSING_REQUIRED
  from its completion routine (see the sketch below).
• The middle-layer driver must then complete the IRP later itself.
• See the ReactOS source code (instead of reading a 20-page
  doc).
• FastIO - concepts.
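A minimal sketch of a middle-layer driver reclaiming an IRP with
STATUS_MORE_PROCESSING_REQUIRED, using the standard completion-routine
pattern (routine names are illustrative; the wait assumes the dispatch
routine runs at PASSIVE_LEVEL):

  #include <ntddk.h>

  // Completion routine: stop completion here so this driver can look at the IRP.
  NTSTATUS MidCompletion(PDEVICE_OBJECT DeviceObject, PIRP Irp, PVOID Context)
  {
      UNREFERENCED_PARAMETER(DeviceObject);
      UNREFERENCED_PARAMETER(Irp);
      KeSetEvent((PKEVENT)Context, IO_NO_INCREMENT, FALSE);  // wake the waiting dispatch routine
      return STATUS_MORE_PROCESSING_REQUIRED;                // IO Manager stops unwinding the stack
  }

  // Inside the middle-layer driver's dispatch routine (LowerDevice is assumed):
  //     KEVENT event;
  //     KeInitializeEvent(&event, NotificationEvent, FALSE);
  //     IoCopyCurrentIrpStackLocationToNext(Irp);
  //     IoSetCompletionRoutine(Irp, MidCompletion, &event, TRUE, TRUE, TRUE);
  //     IoCallDriver(LowerDevice, Irp);
  //     KeWaitForSingleObject(&event, Executive, KernelMode, FALSE, NULL);
  //     // ... inspect or modify the completed IRP here, then finish it:
  //     IoCompleteRequest(Irp, IO_NO_INCREMENT);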
Memory and Cache Manager
Locality Theory
• If page/cluster n is accessed, there is a high probability
  that blocks near n will be accessed soon.
• All memory-based computing systems
  work on this principle.
• Windows has registry keys to configure how
  many blocks/pages to pre-fetch.
• Application-specific memory managers, such as
  databases and multimedia workloads, do
  application-aware pre-fetching.
Virtual Memory Manager (VMM)
• Apps feel memory is unlimited – magic
  done by the VMM.
• Multiple apps run concurrently without
  interfering with each other's data.
• Each app feels the entire resource is its own.
• Protects OS memory from apps.
• Advanced apps may need to share
  memory; the VMM provides an easy way to
  share memory.
VMM Continued..
• The VMM reserves a certain amount of memory
  for the kernel.
• On a 32-bit box, 2GB for the kernel and 2GB for
  user apps.
• A specific area of kernel memory is reserved
  to store process-specific data like the PDE and
  PTEs; it is called hyperspace.
Segmentation and Paging
• The x86 processor supports both segmentation and
  paging.
• Paging can be enabled or disabled, but
  segmentation is always on.
• Windows uses paging.
• Since segmentation cannot be disabled, Windows
  uses segments that span the entire address space
  (so-called 'flat segments').
Paging
• The entire physical memory is divided into equal-
  size pages (4K on x86 platforms).
  These are called 'page frames' and the list is called the
  'page frame database' (PF DB).

• The PF DB also contains flags stating whether a
  read/write is underway, whether the page is shared, etc.
VMM Continued..
• The upper 2GB of kernel space is common to
  all processes.
• What does that mean? Half of the PDE is common
  to all processes!
• Experiment – look at the PDEs of two processes
  and confirm that half of each PDE is the same.
Physical to Virtual address
             translation
• Address translation works in both directions – when a
  PF is written to the pagefile, the VMM must update the proper
  PDE/PTE to record that the page is on disk.
• Done by
  – the Memory Management Unit (MMU) of the processor,
  – with help from the VMM.
• The VMM keeps the PDE/PTE info and passes it to the MMU
  during a process context switch.
• The MMU translates virtual addresses to physical
  addresses.
• The CR3 register holds the physical address of the
  current page directory.
Translation Lookaside Buffer (TLB)
• Address translation is a costly operation.
• It happens frequently – whenever anything touches virtual
  memory.
• The TLB keeps a list of the most frequently used
  address translations.
• The list is tagged by process ID.
• The TLB is a generic OS concept – the implementation is
  architecture dependent.
• Before doing an address translation the MMU
  searches the TLB for the page frame.
Address Translation
• In an x86 32-bit address, the 10 most significant
  bits index into the page directory (PDE). The page
  directory therefore has 1024 entries.
• The next 10 bits index into the page table to select
  the PTE for the page frame. Each page table also has
  1024 entries.
• The remaining 12 bits address the location
  within the page frame, so the page size is 4K
  (see the sketch below).
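A small sketch of the 10/10/12 split for a (non-PAE) 32-bit x86 virtual
address; the example address is arbitrary:

  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
      uint32_t va = 0x7FFDF004;                     // arbitrary example address

      uint32_t pdeIndex = (va >> 22) & 0x3FF;       // top 10 bits: one of 1024 PDEs
      uint32_t pteIndex = (va >> 12) & 0x3FF;       // next 10 bits: one of 1024 PTEs
      uint32_t offset   =  va        & 0xFFF;       // low 12 bits: byte within the 4K page

      printf("PDE index %u, PTE index %u, offset 0x%03X\n",
             pdeIndex, pteIndex, offset);
      return 0;
  }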
What is a Zero Page
• Page frames are not specific to apps.
• If App1 writes sensitive data to PF1, the VMM later pushes
  the page to the page file and attaches PF1 to App2, then App2
  can see that sensitive data.
• That is a big security flaw, so the VMM keeps a zero page list.
• Cleaning a page at the moment memory is freed would be a
  performance problem.
• The VMM has a dedicated thread that wakes up when the system
  is low on memory, picks page frames from the free PF list,
  cleans them and pushes them to the zero page list.
• The VMM allocates memory from the zero page list.
Arbitrary Thread Context
• The top layer of the driver stack gets the
  request (IRP) in the context of the requesting
  process.
• Middle or lower layer drivers MAY get the
  request in any thread context (e.g., during IO
  completion), i.e., whatever thread happens to be
  running.
• A user address in the IRP is only valid with the
  PDE/PTE of the original process context.
Arbitrary Thread Context
              continued..
• How do we solve the issue?
• Note that half of the PDE (the kernel area) is
  common to all processes.
• If the buffer is somehow mapped into kernel memory
  (the upper half of the PDE), it becomes
  accessible from every process.
Mapping buffers to Kernel space
• Allocate kernel pool from the calling
  process context and copy the user buffer into that
  kernel space.
• Memory Descriptor List (MDL) – the most
  commonly used mechanism to keep data accessible
  in kernel space (see the sketch below).
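A minimal sketch of the MDL pattern using IoAllocateMdl,
MmProbeAndLockPages and MmGetSystemAddressForMdlSafe (the function name
and error handling are illustrative):

  #include <ntddk.h>

  // Lock a user buffer with an MDL and get a kernel-space mapping that
  // stays valid in an arbitrary thread context.
  PVOID MapUserBuffer(PVOID UserBuffer, ULONG Length, PMDL *MdlOut)
  {
      PMDL mdl = IoAllocateMdl(UserBuffer, Length, FALSE, FALSE, NULL);
      if (mdl == NULL) return NULL;

      __try {
          MmProbeAndLockPages(mdl, UserMode, IoWriteAccess);  // pin the pages
      } __except (EXCEPTION_EXECUTE_HANDLER) {
          IoFreeMdl(mdl);
          return NULL;
      }

      // System-space alias to the same physical pages; usable from any process.
      PVOID systemVa = MmGetSystemAddressForMdlSafe(mdl, NormalPagePriority);
      if (systemVa == NULL) {
          MmUnlockPages(mdl);
          IoFreeMdl(mdl);
          return NULL;
      }

      *MdlOut = mdl;      // caller later calls MmUnlockPages + IoFreeMdl
      return systemVa;
  }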
Standby list
• To reclaim pages from a process, the VMM first moves the pages
  to the standby list.
• The VMM keeps them there for a pre-defined number of ticks.
• If the process touches the same page again, the VMM removes it
  from the standby list and gives it back to the process.
• The VMM frees pages from the standby list after the timeout
  expires.
• Pages on the standby list are neither free nor owned by any
  process.
• The VMM keeps minimum and maximum values for the free and
  standby page counts. If the counts go out of these limits, the
  appropriate events are signaled and the lists are adjusted.
Miscellaneous VMM Terms
• Paged Pool

• Non Paged Pool

• Copy on write (COW)
Cache Manager
Cache Manager concepts
• If disk heads ran at the speed of supersonic
  jets, a Cache Manager would not be required.
• Disk access is the main bottleneck limiting
  system performance: CPUs and memory keep getting
  faster, but the disk is still in the stone
  age.
• Caching is a common concept in operating systems;
  the Unix flavor is called the 'buffer cache'.
What the Cache Manager does
• Keeps a system-wide cache of
  frequently used secondary storage blocks.
• Facilitates read-ahead and write-back to
  improve the overall system performance.
• With write-back, the Cache Manager combines
  multiple write requests and issues a single
  write request to improve performance.
  There is a risk associated with write-back.
How the Cache Manager works
• The Cache Manager implements caching using
  memory mapping.
• The concept is similar to an app using a
  memory-mapped file.
• CreateFile(… dwFlagsAndAttributes …)
• dwFlagsAndAttributes ==
  FILE_FLAG_NO_BUFFERING means 'I don't want the
  cache manager' (see the fragment below).
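A user-mode fragment showing the flag in use (the path is illustrative):

  // Open a file bypassing the cache manager entirely.
  HANDLE h = CreateFileW(L"C:\\temp\\raw.bin", GENERIC_READ, FILE_SHARE_READ,
                         NULL, OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, NULL);
  // With FILE_FLAG_NO_BUFFERING, read sizes, file offsets and buffer
  // addresses must all be sector-aligned; the app does its own caching.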
How the Cache Manager works..
• The Cache Manager reserves an area in the upper 2GB (x86
  platform) system space.
• The number of pages reserved by the Cache Manager adjusts
  according to the system's memory requirements.
• If the system runs lots of IO-intensive tasks, it
  dynamically increases the cache size.
• If the system is in a low-memory situation, it reduces
  the buffer cache size.
How a cached read operation works
(flow, user space -> kernel space)
1. The app issues a cached read.
2. The file system gets the pages from the Cache Manager.
3. The Cache Manager does the memory mapping with the VMM.
4. A page fault is taken if the data is not yet resident.
5. The blocks are read from the disk stack (SCSI/Fibre Channel).
How a cached write operation works
(flow, user space -> kernel space)
1. The app issues a cached write.
2. The file system copies the pages to the Cache Manager.
3. The Cache Manager does the memory mapping and copies the data to
   VMM pages.
4. The modified page writer thread of the VMM writes the data to disk
   later.
5. The blocks are written to the disk stack (SCSI/Fibre Channel).
Storage Stack Comparison –
       Windows vs. Linux
Windows: File System (NTFS) + Cache Mgr -> Volume Manager ->
         Class Driver (disk.sys) -> Port Driver (ex: storport) ->
         MiniPort (ex: Emulex HBA)

Linux:   VFS -> File System (ext2, ext3, ...) + Cache Mgr (page cache) ->
         Block Layer (LVM, RAID) -> SCSI upper layer (disk, CD) ->
         IO Scheduler -> SCSI mid layer -> SCSI lower layer (HW)
Questions ?
