SlideShare uma empresa Scribd logo
1 de 30
Memory Management in Linux

      Anand Sivasubramaniam
Two Parts

• Architecture Independent Memory
   Should be flexible and portable enough across
     platforms

• Implementation for a specific architecture
Architecture Independent
         Memory Model
• Process virtual address space divided into pages
• Page size given in PAGE_SIZE macro in asm/page.h
   (4K for x86 and 8K for Alpha)
• The pages are divided between 4 segments
• User Code, User Data, Kernel Code, Kernel Data
• In User mode, access only User Code and User Data
• But in Kernel mode, access also needed for User Data
• put_user(), get_user(), memcpy_tofs(),
  memcpy_fromfs() allow kernel to access user data
  (defined in asm/segment.h)
• Registers cs and ds point to the code and data
  segments of the current mode
• fs points to the data segment of the calling process in
  kernel mode.
• Get_ds(), get_fs(), and set_fs() are defined in
  asm/segment.h
• Segment + Offset = 4 GB Linear address (32 bits)
• Of this, user space = 3 GB (defined by TASK_SIZE
  macro) and kernel space = 1GB
• Linear Address converted to physical address using 3
  levels




    Index into   Index into  Index into      Page
    Page Dir.    Page Middle Page Table      Offset
                 Dir.
Page Dir. And Middle Dir. Access Functions
         (in asm/page.h and asm/pgtable.h)

•   Structures pgd_t and pmd_t define an entry of these tables.
•   pgd_alloc_alloc()/pgd_free() to allocate and free a page for the page
    directory
•   pmd_alloc(),pmd_alloc_kernel()/pmd_free(),pmd_free_kernel()
    allocate and free a page middle directory in user and kernel segments.
•   pgd_set(),pgd_clear()/pmd_set(),pmd_clear() set and clear a
    entry of their tables.
•   pgd_present()/pmd_present() checks for presence of what the
    entries are pointing to.
•   pgd_page()/pmd_page() returns the base address of the page to
    which the entry is pointing
•   …..
Page Table Entry (pte_t)
Attributes
• Presence (is page present in VAS?)
• Read, Write and Execute
• Accessed ? (age)
• Dirty
•   Macros of Pgprot_type
    – PAGE_NONE (invalid)
    – PAGE_SHARED (read-write)
    – PAGE_COPY/READ_ONLY (read only, used by copy-on-
      write)
    – PAGE_KERNEL (accessibe only by kernel)
Page Table Functions
•   mk_pte(), Pte_clear(), set_pte()
•   pte_mkclean(), pte_mkdirty(), pt_mkread(), ….
•   pte_none() (check whether entry is set)
•   pte_page() (returns address of page)
•   pte_dirty(), pte_present(), pte_young(),
    pte_read(), pte_write()
Process Address Space (not to scale!)
                        Kernel

0xC0000000
               File name, Environment

                      Arguments

                        Stack



      _end
                         bss
  _bss_start
     _edata
                        Data
     _etext
                        Code
                        Header
 0x84000000

                   Shared Libs
Address Space Descriptor
•   mm_struct defined in the process descriptor. (in linux/sched.h)
•   This is duplicated if CLONE_VM is specified on forking.
•   struct mm_struct {
     int count; // no. of processes sharing this descriptor
     pgd_t *pgd; //page directory ptr
     unsigned long start_code, end_code;
     unsigned long start_data, end_data;
     unsigned long start_brk, brk;
     unsigned long start_stack;
     unsigned long arg_start, arg_end, env_start, env_end;
     unsigned long rss; // no. of pages resident in memory
     unsigned long total_vm; // total # of bytes in this address space
     unsigned long locked_vm; // # of bytes locked in memory
     unsigned long def_flags; // status to use when mem regions are created
     struct vm_area_struct *mmap; // ptr to first region desc.
     struct vm_area_struct *mmap_avl; // faster search of region desc.
     }
Region Descriptors
•   Why even allocate all of the VAS? Allocate only on demand.
•   Use region descriptors for each allocated region of VAS
•   Map allocated but unused regions to same physical page to save space.
•   struct vm_area_struct {
     struct mm_struct *vm_mm; // descriptor of VAS
     unsigned long vm_start, vm_end; // of this region
     pgprot_t vm_page_prot; // protection attributes for this region
     short vm_avl_height;
     struct vm_avl_left;
     vm_area_struct *vm_avl_permission; // right hand child
     vm_area_struct * vm_next_share, *vm_prev_share; // doubly linked
     vm_operations_struct *vm_ops;
     struct inode *vm_inode; // of file mapped, or NULL = “anonymous
        mapping”
     unsigned long vm_offset; // offset in file/device
     }
• If vm_inode is NULL (anonymous mapping), all PTEs
  for this region point to the same page.
• If the process does a write to any of these pages, the
  faulting mechanism creates a new physical page (copy-
  on-write).
• This is used by the brk() system call.
• Operations specific to this region (including fault
  handling) are specified in vm_operations_struct.
• Hence, different regions can have different functions.
Struct vm_operations_struct {
    void (*open)(struct vm_area_struct *);
    void (*close)(struct vm_area_struct *);
    void (*unmap)(…);
    void (*protect)(…)
    void (*sync)(…);
    unsigned long (*nopage)(struct vm_area_struct *, unsigned long
       address, unsigned long page, int write_access);
    void (*swapout)(struct vm_area_struct *, unsigned long, pte_t *);
    pte_t (*swapin)(struct vm_area_struct *, unsigned long, unsigned long);
}
Traditional mmap()
• int do_mmap(struct file *, unsigned long addr,
  unsigned long len, unsigned long prot, unsigned
  long flags, unsigned long off);

•   Creates a new memory region
•   Creates the required PTEs
•   Sets the PTEs to fault later
•   The handler (nopage) will either copy-on-write if
    anonymous mapping, or will bring in the required page
    of file.
How is brk() implemented?

• Check whether to allocate (deny if not enough physical
  memory, exceeds its VA limits, or crosses stack).
• Then call do_mmap() for anonymous mapping
  between the old and new values of brk (in process
  table).
• Return the new brk value.
Kernel Segment
•   On a sys call, CS points to kernel segment. DS and ES are set to
    kernel segment as well.
•   Next, FS is set to user data segment.
•   Put_user() and get_user() can then access user space if needed.
•   The address parameters to these functions cannot exceed
    0xc0000000.
•   Violation of this should result in a trap, together with any writes
    to a read-only page (creates a problem on 386, while the problem
    does not exist in 486/Pentium)
•   Hence, verify_area() is typically called before performing such
    operations.
•   Physical and Virtual addresses are same except for those allocated
    using vmalloc().
•   Kernel segment shared across processes (not switched!)
Memory Allocn for Kernel Segment

• Static
     Memory_start = console_init(memory_start, memory_end);
     Typically done for drivers to reserve areas, and for some other kernel
       components.

•   Dynamic
     Void *kmalloc(size, priority), Void kfree (void *) // in mm/kmalloc.c
     Void *vmalloc(size), void *vmfree(void *) // in mm/vmalloc.c

     Kmalloc is used for physically contiguous pages while vmalloc does not
       necessarily allocate physically contiguous pages
     Memory allocated is not initialized (and is not paged out).
kmalloc() data structures
sizes[]
    32
     64    size_descriptor
    128                           page_descriptor

   252
   508
  1020
   2040                      bh         bh
  4080
  8176                       bh         bh
 16368
  32752                      bh         bh
 65520
                  Null                        Null
131056
vmalloc()
•   Allocated virtually contiguous pages, but they do not need to be physically
    contiguous.
•   Uses __get_free_page() to allocate physical frames.
•   Once all the required physical frames are found, the virtual addresses are
    created (and mappings set) at an unused part.
•   The virtual address search (for unused parts) on x86 begins at the next
    address after physical memory on an 8 MB boundary.
•   One (virtual) page is left free after each allocation for cushioning.

                     next
         vmstruct     size
                     addr
                                                            VAS
                     next
                      size
                      addr
                 vmlist
vmalloc vs kmalloc
• Contiguous vs non-contiguous physical memory
• kmalloc is faster but less flexible
• vmalloc involves __get_free_page() and may need to
  block to find a free physical page
• DMA requires contiguous physical memory
Paging
•   All kernel segment pages are locked in memory (no swapping)
•   User pages can be paged out:
     – Complete block device
     – Fixed length files in a file system
•   First 4096 bytes are a bitmap indicating that space for that page is
    available for paging.
•   At byte 4086, string “SWAP_SPACE” is stored.
•   Hence, max swap of 4086*8-1 = 32687 pages = 130784KB per
    device or file
•   MAX_SWAPFILES specifies number of swap files or devices
•   Swap device is more efficient than swap file.
•   Inform swap space to kernel using
•   int sys_swapon(char * swapfile, int swapflags);
•   Ceates an entry in swap_info table.
•   struct swap_info_struct {
          unsigned int flags;
          kdev_t swap_device;
          struct indoe *swap_file;
          unsigned char *swap_map; // ptr to table, with 1 byte for
            each page to indicate how many processes are referring
            to this page
          unsigned char *swap_lockmap; // ptr to bitmap, bit
            indicating lock
          int lowest_bit, highest_bit; // to calculate maximum page
            number
          unsigned long max; // highest_bit + 1
          int prio; // priority for this swap space
          int cluster_nr, cluster_next; // to cluster pages on
            storage device
          int next;
     }
•   For each physical frame (mm.h):
     typedef struct page {
         struct page *prev, *next; // doubly linked
         struct inode *inode; unsigned long offset; // where to
            swap
         struct page *prev_hash, next_hash; // in hash list of
            pages in page cache
         atomic_t count; // number of users of this page
         unsigned dirty:16, age:8;
         struct buffer_head * buffers; // if it is part of a block
            buffer
         unsigned long map_nr; // frame #
         struct wait_queue *wait; // Tasks waiting for page to be
            unlocked
         unsigned flags;
     } mem_map_t;
Finding a Physical Page
•   unsigned long __get_free_pages(int priority, unsigned long
                              order, int dma) in mm/page_alloc.c
•   Priority =
     – GFP_BUFFER (free page returned only if available in physical
        memory)
     – GFP_ATOMIC (return page if possible, do not interrupt current
        process)
     – GFP_USER (current process can be interrupted)
     – GFP_KERNEL (kernel can be interrupted)
     – GFP_NOBUFFER (do not attempt to reduce buffer cache)
•   order says give me 2^^order pages (max is 128KB)
•   dma specifies that it is for DMA purposes
•    First tries to find a free frame using Buddy system.
•    Table free_area[] keeps appropriate data structures.




    free_area_list[0 1 2]                     free_area_map
                                          0         1       2
•   If you cannot find a free page,
int try_to_free_page(int priority, int dma, int wait) {
      static int state = 6; int I = 6; int stop;
      stop = 3; if (wait) stop = 0;
      switch (state) {
           do {
                Case 0:
                       if (shrink_mmap(i,dma)) return 1;
                      state = 1;
                Case 1:
                       if (shm_swap(i,dma)) return 1;
                      state = 2;
                Default:
                       if (swap_out(i,dma,wait)) return 1;
                      state = 0;
                i--;
           } while ((i-stop) >= 0);
      }
      return 0;
}
• shrink_mmap() tries to discard pages in page cache or buffer
  cache that have only one user currently, and have not been
  references since the last cycle. The number of examined pages
  depends on priority.
• shm_swap() tries pages allocated for shared memory.
• swap_out()
   – Uses swap_cnt to determine how many pages to swap out for
      current process before moving on to next.
   – Always start where you left off last time (Clock algorithm)
   – Uses swap_out_process() function, which then calls
      try_to_swap_out() for each possible page present in memory
      (and is not locked).
   – try_to_swap_out() checks the age attribute in mem_map
      data structure, and the page is selected if this is 0.
   – VM area’s swapout() operation is called.
   – Write back if the page is dirty
   – Invalidate page table entry.
• kswapd kernel thread running in background is
  activated each time the number of free pages falls
  below a critical level.
• This thread calls the try_to_free_page() function.
• A block of memory is released using free_pages().
  When the number of users reaches 0, the frames are
  entered in free_area[].
Page Fault
• Error code written onto stack, and the VA is stored in register
  CR2
• do_page_fault(struct pt_regs *regs, unsigned long
  error_code) is now called.
• If faulting address is in kernel segment, alarm messages are
  printed out and the process is terminated.
• If faulting address is not in a virtual memory area, check if
  VM_GROWSDOWN for the nexy virtual memory area is set (I.e.
  Stack). If so, expand VM. If error in expanding send SIGSEGV.
• If faulting address is in a virtual memory area, check if protection
  bits are OK. If not legal, send SIGSEGV. Else, call do_no_page()
  or do_wp_page().
•   void do_wp_page(struct task_struct *task, struct
    vm_area_struct *vma, unsigned long address, int write_access);
     – Check if page is mapped in
     – If it is referenced only once, change permissions
     – Else, create a copy and entry in page table to this copy with write
        permissions on

•   void do_no_page(struct task_struct *task, struct vm_area_struct *vma,
    unsigned long address, int write_access);
     – If there is no nopage() handler, an empty page is mapped to this area
     – Else, do_swap_page(struct task_struct, struct
       vm_area_struct, unsigned long address, pte_t, int
       write_access) is called.
     – If no swap_in() handler is defined, swap_in(struct task_struct,
       struct vm_area_struct, pte_t *, unsigned long entry, int
       write_access) is called.
     – entry contains swap space and page number.
     – This calls swap_free() to release te page in swap space and the
       swap_map counter is decremented.

Mais conteúdo relacionado

Mais procurados

Decompressed vmlinux: linux kernel initialization from page table configurati...
Decompressed vmlinux: linux kernel initialization from page table configurati...Decompressed vmlinux: linux kernel initialization from page table configurati...
Decompressed vmlinux: linux kernel initialization from page table configurati...Adrian Huang
 
Linux MMAP & Ioremap introduction
Linux MMAP & Ioremap introductionLinux MMAP & Ioremap introduction
Linux MMAP & Ioremap introductionGene Chang
 
Arm device tree and linux device drivers
Arm device tree and linux device driversArm device tree and linux device drivers
Arm device tree and linux device driversHoucheng Lin
 
Linux Initialization Process (2)
Linux Initialization Process (2)Linux Initialization Process (2)
Linux Initialization Process (2)shimosawa
 
Linux memory-management-kamal
Linux memory-management-kamalLinux memory-management-kamal
Linux memory-management-kamalKamal Maiti
 
linux device driver
linux device driverlinux device driver
linux device driverRahul Batra
 
malloc & vmalloc in Linux
malloc & vmalloc in Linuxmalloc & vmalloc in Linux
malloc & vmalloc in LinuxAdrian Huang
 
Linux SD/MMC Driver Stack
Linux SD/MMC Driver Stack Linux SD/MMC Driver Stack
Linux SD/MMC Driver Stack Champ Yen
 
Linux Memory Management
Linux Memory ManagementLinux Memory Management
Linux Memory ManagementNi Zo-Ma
 
Linux kernel architecture
Linux kernel architectureLinux kernel architecture
Linux kernel architectureSHAJANA BASHEER
 
Linux kernel debugging
Linux kernel debuggingLinux kernel debugging
Linux kernel debugginglibfetion
 
Linux Kernel - Virtual File System
Linux Kernel - Virtual File SystemLinux Kernel - Virtual File System
Linux Kernel - Virtual File SystemAdrian Huang
 
Understanding a kernel oops and a kernel panic
Understanding a kernel oops and a kernel panicUnderstanding a kernel oops and a kernel panic
Understanding a kernel oops and a kernel panicJoseph Lu
 
Memory management in Linux
Memory management in LinuxMemory management in Linux
Memory management in LinuxRaghu Udiyar
 
Linux Kernel Booting Process (1) - For NLKB
Linux Kernel Booting Process (1) - For NLKBLinux Kernel Booting Process (1) - For NLKB
Linux Kernel Booting Process (1) - For NLKBshimosawa
 
Page cache in Linux kernel
Page cache in Linux kernelPage cache in Linux kernel
Page cache in Linux kernelAdrian Huang
 
The Linux Block Layer - Built for Fast Storage
The Linux Block Layer - Built for Fast StorageThe Linux Block Layer - Built for Fast Storage
The Linux Block Layer - Built for Fast StorageKernel TLV
 
Physical Memory Management.pdf
Physical Memory Management.pdfPhysical Memory Management.pdf
Physical Memory Management.pdfAdrian Huang
 
Understanding of linux kernel memory model
Understanding of linux kernel memory modelUnderstanding of linux kernel memory model
Understanding of linux kernel memory modelSeongJae Park
 

Mais procurados (20)

Decompressed vmlinux: linux kernel initialization from page table configurati...
Decompressed vmlinux: linux kernel initialization from page table configurati...Decompressed vmlinux: linux kernel initialization from page table configurati...
Decompressed vmlinux: linux kernel initialization from page table configurati...
 
Linux MMAP & Ioremap introduction
Linux MMAP & Ioremap introductionLinux MMAP & Ioremap introduction
Linux MMAP & Ioremap introduction
 
Arm device tree and linux device drivers
Arm device tree and linux device driversArm device tree and linux device drivers
Arm device tree and linux device drivers
 
Linux Initialization Process (2)
Linux Initialization Process (2)Linux Initialization Process (2)
Linux Initialization Process (2)
 
Linux memory-management-kamal
Linux memory-management-kamalLinux memory-management-kamal
Linux memory-management-kamal
 
linux device driver
linux device driverlinux device driver
linux device driver
 
spinlock.pdf
spinlock.pdfspinlock.pdf
spinlock.pdf
 
malloc & vmalloc in Linux
malloc & vmalloc in Linuxmalloc & vmalloc in Linux
malloc & vmalloc in Linux
 
Linux SD/MMC Driver Stack
Linux SD/MMC Driver Stack Linux SD/MMC Driver Stack
Linux SD/MMC Driver Stack
 
Linux Memory Management
Linux Memory ManagementLinux Memory Management
Linux Memory Management
 
Linux kernel architecture
Linux kernel architectureLinux kernel architecture
Linux kernel architecture
 
Linux kernel debugging
Linux kernel debuggingLinux kernel debugging
Linux kernel debugging
 
Linux Kernel - Virtual File System
Linux Kernel - Virtual File SystemLinux Kernel - Virtual File System
Linux Kernel - Virtual File System
 
Understanding a kernel oops and a kernel panic
Understanding a kernel oops and a kernel panicUnderstanding a kernel oops and a kernel panic
Understanding a kernel oops and a kernel panic
 
Memory management in Linux
Memory management in LinuxMemory management in Linux
Memory management in Linux
 
Linux Kernel Booting Process (1) - For NLKB
Linux Kernel Booting Process (1) - For NLKBLinux Kernel Booting Process (1) - For NLKB
Linux Kernel Booting Process (1) - For NLKB
 
Page cache in Linux kernel
Page cache in Linux kernelPage cache in Linux kernel
Page cache in Linux kernel
 
The Linux Block Layer - Built for Fast Storage
The Linux Block Layer - Built for Fast StorageThe Linux Block Layer - Built for Fast Storage
The Linux Block Layer - Built for Fast Storage
 
Physical Memory Management.pdf
Physical Memory Management.pdfPhysical Memory Management.pdf
Physical Memory Management.pdf
 
Understanding of linux kernel memory model
Understanding of linux kernel memory modelUnderstanding of linux kernel memory model
Understanding of linux kernel memory model
 

Semelhante a Linux memory

Windows memory manager internals
Windows memory manager internalsWindows memory manager internals
Windows memory manager internalsSisimon Soman
 
Linux kernel memory allocators
Linux kernel memory allocatorsLinux kernel memory allocators
Linux kernel memory allocatorsHao-Ran Liu
 
Driver development – memory management
Driver development – memory managementDriver development – memory management
Driver development – memory managementVandana Salve
 
Process' Virtual Address Space in GNU/Linux
Process' Virtual Address Space in GNU/LinuxProcess' Virtual Address Space in GNU/Linux
Process' Virtual Address Space in GNU/LinuxVarun Mahajan
 
Linux memorymanagement
Linux memorymanagementLinux memorymanagement
Linux memorymanagementpradeepelinux
 
Linux Initialization Process (1)
Linux Initialization Process (1)Linux Initialization Process (1)
Linux Initialization Process (1)shimosawa
 
Spectrum Scale Memory Usage
Spectrum Scale Memory UsageSpectrum Scale Memory Usage
Spectrum Scale Memory UsageTomer Perry
 
Chapter 8 : Memory
Chapter 8 : MemoryChapter 8 : Memory
Chapter 8 : MemoryAmin Omi
 
Bab 4
Bab 4Bab 4
Bab 4n k
 
ECECS 472572 Final Exam ProjectRemember to check the errat.docx
ECECS 472572 Final Exam ProjectRemember to check the errat.docxECECS 472572 Final Exam ProjectRemember to check the errat.docx
ECECS 472572 Final Exam ProjectRemember to check the errat.docxtidwellveronique
 
ECECS 472572 Final Exam ProjectRemember to check the err.docx
ECECS 472572 Final Exam ProjectRemember to check the err.docxECECS 472572 Final Exam ProjectRemember to check the err.docx
ECECS 472572 Final Exam ProjectRemember to check the err.docxtidwellveronique
 
ECECS 472572 Final Exam ProjectRemember to check the errata
ECECS 472572 Final Exam ProjectRemember to check the errata ECECS 472572 Final Exam ProjectRemember to check the errata
ECECS 472572 Final Exam ProjectRemember to check the errata EvonCanales257
 

Semelhante a Linux memory (20)

Memory
MemoryMemory
Memory
 
memory_mapping.ppt
memory_mapping.pptmemory_mapping.ppt
memory_mapping.ppt
 
Windows memory manager internals
Windows memory manager internalsWindows memory manager internals
Windows memory manager internals
 
Linux kernel memory allocators
Linux kernel memory allocatorsLinux kernel memory allocators
Linux kernel memory allocators
 
kerch04.ppt
kerch04.pptkerch04.ppt
kerch04.ppt
 
Driver development – memory management
Driver development – memory managementDriver development – memory management
Driver development – memory management
 
Process' Virtual Address Space in GNU/Linux
Process' Virtual Address Space in GNU/LinuxProcess' Virtual Address Space in GNU/Linux
Process' Virtual Address Space in GNU/Linux
 
Linux memorymanagement
Linux memorymanagementLinux memorymanagement
Linux memorymanagement
 
Linux Initialization Process (1)
Linux Initialization Process (1)Linux Initialization Process (1)
Linux Initialization Process (1)
 
It322 intro 2
It322 intro 2It322 intro 2
It322 intro 2
 
Spectrum Scale Memory Usage
Spectrum Scale Memory UsageSpectrum Scale Memory Usage
Spectrum Scale Memory Usage
 
Chapter 8 : Memory
Chapter 8 : MemoryChapter 8 : Memory
Chapter 8 : Memory
 
Updates
UpdatesUpdates
Updates
 
Updates
UpdatesUpdates
Updates
 
memory.ppt
memory.pptmemory.ppt
memory.ppt
 
Vmfs
VmfsVmfs
Vmfs
 
Bab 4
Bab 4Bab 4
Bab 4
 
ECECS 472572 Final Exam ProjectRemember to check the errat.docx
ECECS 472572 Final Exam ProjectRemember to check the errat.docxECECS 472572 Final Exam ProjectRemember to check the errat.docx
ECECS 472572 Final Exam ProjectRemember to check the errat.docx
 
ECECS 472572 Final Exam ProjectRemember to check the err.docx
ECECS 472572 Final Exam ProjectRemember to check the err.docxECECS 472572 Final Exam ProjectRemember to check the err.docx
ECECS 472572 Final Exam ProjectRemember to check the err.docx
 
ECECS 472572 Final Exam ProjectRemember to check the errata
ECECS 472572 Final Exam ProjectRemember to check the errata ECECS 472572 Final Exam ProjectRemember to check the errata
ECECS 472572 Final Exam ProjectRemember to check the errata
 

Último

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 

Último (20)

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 

Linux memory

  • 1. Memory Management in Linux Anand Sivasubramaniam
  • 2. Two Parts • Architecture Independent Memory Should be flexible and portable enough across platforms • Implementation for a specific architecture
  • 3. Architecture Independent Memory Model • Process virtual address space divided into pages • Page size given in PAGE_SIZE macro in asm/page.h (4K for x86 and 8K for Alpha) • The pages are divided between 4 segments • User Code, User Data, Kernel Code, Kernel Data • In User mode, access only User Code and User Data • But in Kernel mode, access also needed for User Data
  • 4. • put_user(), get_user(), memcpy_tofs(), memcpy_fromfs() allow kernel to access user data (defined in asm/segment.h) • Registers cs and ds point to the code and data segments of the current mode • fs points to the data segment of the calling process in kernel mode. • Get_ds(), get_fs(), and set_fs() are defined in asm/segment.h
  • 5. • Segment + Offset = 4 GB Linear address (32 bits) • Of this, user space = 3 GB (defined by TASK_SIZE macro) and kernel space = 1GB • Linear Address converted to physical address using 3 levels Index into Index into Index into Page Page Dir. Page Middle Page Table Offset Dir.
  • 6. Page Dir. And Middle Dir. Access Functions (in asm/page.h and asm/pgtable.h) • Structures pgd_t and pmd_t define an entry of these tables. • pgd_alloc_alloc()/pgd_free() to allocate and free a page for the page directory • pmd_alloc(),pmd_alloc_kernel()/pmd_free(),pmd_free_kernel() allocate and free a page middle directory in user and kernel segments. • pgd_set(),pgd_clear()/pmd_set(),pmd_clear() set and clear a entry of their tables. • pgd_present()/pmd_present() checks for presence of what the entries are pointing to. • pgd_page()/pmd_page() returns the base address of the page to which the entry is pointing • …..
  • 7. Page Table Entry (pte_t) Attributes • Presence (is page present in VAS?) • Read, Write and Execute • Accessed ? (age) • Dirty • Macros of Pgprot_type – PAGE_NONE (invalid) – PAGE_SHARED (read-write) – PAGE_COPY/READ_ONLY (read only, used by copy-on- write) – PAGE_KERNEL (accessibe only by kernel)
  • 8. Page Table Functions • mk_pte(), Pte_clear(), set_pte() • pte_mkclean(), pte_mkdirty(), pt_mkread(), …. • pte_none() (check whether entry is set) • pte_page() (returns address of page) • pte_dirty(), pte_present(), pte_young(), pte_read(), pte_write()
  • 9. Process Address Space (not to scale!) Kernel 0xC0000000 File name, Environment Arguments Stack _end bss _bss_start _edata Data _etext Code Header 0x84000000 Shared Libs
  • 10. Address Space Descriptor • mm_struct defined in the process descriptor. (in linux/sched.h) • This is duplicated if CLONE_VM is specified on forking. • struct mm_struct { int count; // no. of processes sharing this descriptor pgd_t *pgd; //page directory ptr unsigned long start_code, end_code; unsigned long start_data, end_data; unsigned long start_brk, brk; unsigned long start_stack; unsigned long arg_start, arg_end, env_start, env_end; unsigned long rss; // no. of pages resident in memory unsigned long total_vm; // total # of bytes in this address space unsigned long locked_vm; // # of bytes locked in memory unsigned long def_flags; // status to use when mem regions are created struct vm_area_struct *mmap; // ptr to first region desc. struct vm_area_struct *mmap_avl; // faster search of region desc. }
  • 11. Region Descriptors • Why even allocate all of the VAS? Allocate only on demand. • Use region descriptors for each allocated region of VAS • Map allocated but unused regions to same physical page to save space. • struct vm_area_struct { struct mm_struct *vm_mm; // descriptor of VAS unsigned long vm_start, vm_end; // of this region pgprot_t vm_page_prot; // protection attributes for this region short vm_avl_height; struct vm_avl_left; vm_area_struct *vm_avl_permission; // right hand child vm_area_struct * vm_next_share, *vm_prev_share; // doubly linked vm_operations_struct *vm_ops; struct inode *vm_inode; // of file mapped, or NULL = “anonymous mapping” unsigned long vm_offset; // offset in file/device }
  • 12. • If vm_inode is NULL (anonymous mapping), all PTEs for this region point to the same page. • If the process does a write to any of these pages, the faulting mechanism creates a new physical page (copy- on-write). • This is used by the brk() system call. • Operations specific to this region (including fault handling) are specified in vm_operations_struct. • Hence, different regions can have different functions.
  • 13. Struct vm_operations_struct { void (*open)(struct vm_area_struct *); void (*close)(struct vm_area_struct *); void (*unmap)(…); void (*protect)(…) void (*sync)(…); unsigned long (*nopage)(struct vm_area_struct *, unsigned long address, unsigned long page, int write_access); void (*swapout)(struct vm_area_struct *, unsigned long, pte_t *); pte_t (*swapin)(struct vm_area_struct *, unsigned long, unsigned long); }
  • 14. Traditional mmap() • int do_mmap(struct file *, unsigned long addr, unsigned long len, unsigned long prot, unsigned long flags, unsigned long off); • Creates a new memory region • Creates the required PTEs • Sets the PTEs to fault later • The handler (nopage) will either copy-on-write if anonymous mapping, or will bring in the required page of file.
  • 15. How is brk() implemented? • Check whether to allocate (deny if not enough physical memory, exceeds its VA limits, or crosses stack). • Then call do_mmap() for anonymous mapping between the old and new values of brk (in process table). • Return the new brk value.
  • 16. Kernel Segment • On a sys call, CS points to kernel segment. DS and ES are set to kernel segment as well. • Next, FS is set to user data segment. • Put_user() and get_user() can then access user space if needed. • The address parameters to these functions cannot exceed 0xc0000000. • Violation of this should result in a trap, together with any writes to a read-only page (creates a problem on 386, while the problem does not exist in 486/Pentium) • Hence, verify_area() is typically called before performing such operations. • Physical and Virtual addresses are same except for those allocated using vmalloc(). • Kernel segment shared across processes (not switched!)
  • 17. Memory Allocn for Kernel Segment • Static Memory_start = console_init(memory_start, memory_end); Typically done for drivers to reserve areas, and for some other kernel components. • Dynamic Void *kmalloc(size, priority), Void kfree (void *) // in mm/kmalloc.c Void *vmalloc(size), void *vmfree(void *) // in mm/vmalloc.c Kmalloc is used for physically contiguous pages while vmalloc does not necessarily allocate physically contiguous pages Memory allocated is not initialized (and is not paged out).
  • 18. kmalloc() data structures sizes[] 32 64 size_descriptor 128 page_descriptor 252 508 1020 2040 bh bh 4080 8176 bh bh 16368 32752 bh bh 65520 Null Null 131056
  • 19. vmalloc() • Allocated virtually contiguous pages, but they do not need to be physically contiguous. • Uses __get_free_page() to allocate physical frames. • Once all the required physical frames are found, the virtual addresses are created (and mappings set) at an unused part. • The virtual address search (for unused parts) on x86 begins at the next address after physical memory on an 8 MB boundary. • One (virtual) page is left free after each allocation for cushioning. next vmstruct size addr VAS next size addr vmlist
  • 20. vmalloc vs kmalloc • Contiguous vs non-contiguous physical memory • kmalloc is faster but less flexible • vmalloc involves __get_free_page() and may need to block to find a free physical page • DMA requires contiguous physical memory
  • 21. Paging • All kernel segment pages are locked in memory (no swapping) • User pages can be paged out: – Complete block device – Fixed length files in a file system • First 4096 bytes are a bitmap indicating that space for that page is available for paging. • At byte 4086, string “SWAP_SPACE” is stored. • Hence, max swap of 4086*8-1 = 32687 pages = 130784KB per device or file • MAX_SWAPFILES specifies number of swap files or devices • Swap device is more efficient than swap file.
  • 22. Inform swap space to kernel using • int sys_swapon(char * swapfile, int swapflags); • Ceates an entry in swap_info table. • struct swap_info_struct { unsigned int flags; kdev_t swap_device; struct indoe *swap_file; unsigned char *swap_map; // ptr to table, with 1 byte for each page to indicate how many processes are referring to this page unsigned char *swap_lockmap; // ptr to bitmap, bit indicating lock int lowest_bit, highest_bit; // to calculate maximum page number unsigned long max; // highest_bit + 1 int prio; // priority for this swap space int cluster_nr, cluster_next; // to cluster pages on storage device int next; }
  • 23. For each physical frame (mm.h): typedef struct page { struct page *prev, *next; // doubly linked struct inode *inode; unsigned long offset; // where to swap struct page *prev_hash, next_hash; // in hash list of pages in page cache atomic_t count; // number of users of this page unsigned dirty:16, age:8; struct buffer_head * buffers; // if it is part of a block buffer unsigned long map_nr; // frame # struct wait_queue *wait; // Tasks waiting for page to be unlocked unsigned flags; } mem_map_t;
  • 24. Finding a Physical Page • unsigned long __get_free_pages(int priority, unsigned long order, int dma) in mm/page_alloc.c • Priority = – GFP_BUFFER (free page returned only if available in physical memory) – GFP_ATOMIC (return page if possible, do not interrupt current process) – GFP_USER (current process can be interrupted) – GFP_KERNEL (kernel can be interrupted) – GFP_NOBUFFER (do not attempt to reduce buffer cache) • order says give me 2^^order pages (max is 128KB) • dma specifies that it is for DMA purposes
  • 25. First tries to find a free frame using Buddy system. • Table free_area[] keeps appropriate data structures. free_area_list[0 1 2] free_area_map 0 1 2
  • 26. If you cannot find a free page, int try_to_free_page(int priority, int dma, int wait) { static int state = 6; int I = 6; int stop; stop = 3; if (wait) stop = 0; switch (state) { do { Case 0: if (shrink_mmap(i,dma)) return 1; state = 1; Case 1: if (shm_swap(i,dma)) return 1; state = 2; Default: if (swap_out(i,dma,wait)) return 1; state = 0; i--; } while ((i-stop) >= 0); } return 0; }
  • 27. • shrink_mmap() tries to discard pages in page cache or buffer cache that have only one user currently, and have not been references since the last cycle. The number of examined pages depends on priority. • shm_swap() tries pages allocated for shared memory. • swap_out() – Uses swap_cnt to determine how many pages to swap out for current process before moving on to next. – Always start where you left off last time (Clock algorithm) – Uses swap_out_process() function, which then calls try_to_swap_out() for each possible page present in memory (and is not locked). – try_to_swap_out() checks the age attribute in mem_map data structure, and the page is selected if this is 0. – VM area’s swapout() operation is called. – Write back if the page is dirty – Invalidate page table entry.
  • 28. • kswapd kernel thread running in background is activated each time the number of free pages falls below a critical level. • This thread calls the try_to_free_page() function. • A block of memory is released using free_pages(). When the number of users reaches 0, the frames are entered in free_area[].
  • 29. Page Fault • Error code written onto stack, and the VA is stored in register CR2 • do_page_fault(struct pt_regs *regs, unsigned long error_code) is now called. • If faulting address is in kernel segment, alarm messages are printed out and the process is terminated. • If faulting address is not in a virtual memory area, check if VM_GROWSDOWN for the nexy virtual memory area is set (I.e. Stack). If so, expand VM. If error in expanding send SIGSEGV. • If faulting address is in a virtual memory area, check if protection bits are OK. If not legal, send SIGSEGV. Else, call do_no_page() or do_wp_page().
  • 30. void do_wp_page(struct task_struct *task, struct vm_area_struct *vma, unsigned long address, int write_access); – Check if page is mapped in – If it is referenced only once, change permissions – Else, create a copy and entry in page table to this copy with write permissions on • void do_no_page(struct task_struct *task, struct vm_area_struct *vma, unsigned long address, int write_access); – If there is no nopage() handler, an empty page is mapped to this area – Else, do_swap_page(struct task_struct, struct vm_area_struct, unsigned long address, pte_t, int write_access) is called. – If no swap_in() handler is defined, swap_in(struct task_struct, struct vm_area_struct, pte_t *, unsigned long entry, int write_access) is called. – entry contains swap space and page number. – This calls swap_free() to release te page in swap space and the swap_map counter is decremented.