SlideShare a Scribd company logo
1 of 27
Download to read offline
Operating System Management
      of Shared Caches
  on Multicore Processors
     -- Ph.D. Thesis Presentation --
               Apr. 20, 2010

            David Tam

          Supervisor: Michael Stumm
                                       1
Multicores Today
                        ...
                Cache
                        ...    Cache

                    Shared Cache




Multicores are Ubiquitous
 ●   Unexpected by most software developers
 ●   Software support is lacking (e.g., OS)
General Role of OS
 ●   Manage shared hardware resources
New Candidate
 ●   Shared cache: performance critical
 ●   Focus of thesis
                                              2
Thesis
OS should manage on-chip shared caches
of multicore processors

Demonstrate:
 ●   Properly managing shared caches at OS level
     can increase performance

Management Principles
 1. Promote sharing
      ●   For threads that share data
      ●   Maximize major advantage of shared caches
 2. Provide isolation
      ●   For threads that do not share data
      ●   Minimize major disadvantage of shared caches

Supporting Role
 ●   Provision the shared cache online
                                                         3
#1 - Promote Sharing
  Problem: Cross-chip accesses are slow
  Solution: Exploit major advantage of shared caches:
              Fast access to shared data
  OS Actions: Identify & localize data sharing
  View: Match software sharing to hardware sharing


             Chip A                            Chip B
Thread A                                                Thread B




                 L2                              L2
           Shared Data                    Shared Data


                         Shared Data Traffic

                                                                   4
#1 - Promote Sharing
  Problem: Cross-chip accesses are slow
  Solution: Exploit major advantage of shared caches:
              Fast access to shared data
  OS Actions: Identify & localize data sharing
  View: Match software sharing to hardware sharing


             Chip A      Thread B     Chip B
Thread A




                 L2                     L2
           Shared Data




                                                        5
Thread Clustering [EuroSys'07]
  Identify Data Sharing
  ●   Detect sharing online with hardware performance counters
      ● Monitor remote cache accesses (data addresses)

      ● Track on a per-thread basis

      ● Data addresses are memory regions shared with other threads




  Localize Data Sharing
  ● Identify clusters of threads that access same memory regions
  ● Migrate threads of a cluster onto same chip




                Chip A                            Chip B
Thread A                                                   Thread B




                    L2                              L2
              Shared Data                    Shared Data


                            Shared Data Traffic
                                                                      6
Thread Clustering [EuroSys'07]
  Identify Data Sharing
  ●   Detect sharing online with hardware performance counters
      ● Monitor remote cache accesses (data addresses)

      ● Track on a per-thread basis

      ● Data addresses are memory regions shared with other threads




  Localize Data Sharing
  ● Identify clusters of threads that access same memory regions
  ● Migrate threads of a cluster onto same chip




                Chip A      Thread B     Chip B
Thread A




                    L2                      L2
              Shared Data




                                                                      7
Visualization of Clusters
           ●   SPECjbb 2000                                            Sharing Intensity
               ●   4 warehouses, 16 threads per warehouse                     High
                                                                              Medium
           ●   Threads have been sorted by cluster for visualization          Low
                                                                              None

   16
      {
threads




 Threads




           0                            Memory Regions                          264
                                         (Virtual Address)
                                                                                           8
Visualization of Clusters
           ●   SPECjbb 2000                                            Sharing Intensity
               ●   4 warehouses, 16 threads per warehouse                     High
                                                                              Medium
           ●   Threads have been sorted by cluster for visualization          Low
                                                                              None

   16
      {
threads




 Threads




           0                            Memory Regions                          264
                                         (Virtual Address)
                                                                                           9
Performance Results

                     1.9MB L2           1.9MB L2
           36 MB                                   36 MB




                        4 GB            4 GB

●   Multithreaded commercial workloads
     ●   RUBiS, VolanoMark, SPECjbb2k
●   8-way IBM POWER5 Linux system
     ●   22%, 32%, 70% reduction in stalls caused by
                         cross-chip accesses
     ●    7%, 5%, 6% performance improvement

●   32-way IBM POWER5+ Linux system
     ●   14% SPECjbb2k potential improvement               10
#2 – Provide Isolation
Problem: Major disadvantage of shared caches
             Cache space interference
Solution: Provide cache space isolation between applications
OS Actions: Enforce isolation during physical page allocation
View: Partition into smaller private caches


                                              Apache

                                              MySQL




                                                                11
#2 – Provide Isolation
Problem: Major disadvantage of shared caches
             Cache space interference
Solution: Provide cache space isolation between applications
OS Actions: Enforce isolation during physical page allocation
View: Partition into smaller private caches


                                              Apache

                                              MySQL




                                                                12
#2 – Provide Isolation
Problem: Major disadvantage of shared caches
             Cache space interference
Solution: Provide cache space isolation between applications
OS Actions: Enforce isolation during physical page allocation
View: Partition into smaller private caches


                                              Apache

                                              MySQL




                                                                13
#2 – Provide Isolation
Problem: Major disadvantage of shared caches
             Cache space interference
Solution: Provide cache space isolation between applications
OS Actions: Enforce isolation during physical page allocation
View: Partition into smaller private caches


                                              Apache

                                              MySQL




       Boundary


                                                                14
#2 – Provide Isolation
Problem: Major disadvantage of shared caches
             Cache space interference
Solution: Provide cache space isolation between applications
OS Actions: Enforce isolation during physical page allocation
View: Partition into smaller private caches


                                              Apache

                                              MySQL




       Boundary


                                                                15
Cache Partitioning                                   [WIOSCA'07]

 ●   Apply page-coloring technique
 ●   Guide physical page allocation to control cache line usage
 ●   Works on existing processors


                Virtual Pages          Physical Pages
                                        Color A


                                                              L2 Cache
Application
                                                          {          } Color A
                                                                         (N sets)
                                        Color A




                                        Color A


                          OS Managed              Fixed Mapping
                                                    (Hardware)



                                                                                    16
Cache Partitioning                                    [WIOSCA'07]

   ●   Apply page-coloring technique
   ●   Guide physical page allocation to control cache line usage
   ●   Works on existing processors


                  Virtual Pages       Physical Pages
                                          Color A
                                          Color B
                                                                 L2 Cache
Application A
                                                             {          } Color A
                                                                            (N sets)
                                          Color A
                                          Color B

                 Virtual Pages
                                                             {          } Color B
                                                                            (N sets)

                                          Color A
Application B                             Color B




                             OS Managed             Fixed Mapping
                                                      (Hardware)                       17
Impact of Partitioning

                                       mcf
                                                        Performance
                                                        Without
                                                 art
                                                        Isolation




                                                         mcf
           16 14 12 10          8     6      4     2   0 art
                   L2 Cache Sizes (# of Colors)

Performance of Other Combos
●   10 pairs of applications: SPECcpu2k, SPECjbb2k
      ●
          4% to 17% improvement (36MB L3 cache)
      ●
          28%, 50% improvement (no L3 cache)                          18
Provisioning the Cache
Problem: How to determine cache partition size
Solution: Use L2 cache miss rate curve (MRC) of application
Criteria: Obtain MRC rapidly, accurately, online, with low overhead,
        on existing hardware
OS Actions: Monitor L2 cache accesses
               using hardware performance counters
                                                  Application X
                              100
                               90
                               80
              Miss Rate (%)




                               70
                               60
                               50
                               40
                               30
                               20
                               10
                                0
                                    0   10   20   30   40   50   60   70   80   90 100
                                         Allocated Cache Size (%)                        19
RapidMRC [ASPLOS'09]
Design
 ●   Upon every L2 access:
     ● Update sampling register with data address

     ● Trigger interrupt to copy register to trace log in main memory



 ●   Feed trace log into Mattson's stack algorithm [1970]
     to obtain L2 MRC

Results
 ●
     Workloads
     ● 30 apps from SPECcpu2k, SPECcpu2k6, SPECjbb2k

 ●
     Latency
     ● 227 ms to generate online L2 MRC

 ●   Accuracy
     ● Good, e.g. up to 27% performance improvement when

       applied to cache partitioning


                                                                        20
Accuracy of RapidMRC
                   ●   Execution slice at 10 billion instructions
                               jbb                         gzip     mgrid
Miss Rate (MPKI)




                        Cache Size (# colors)
                               mcf 2k                   xalancbmk   ammp




                                                                            21
Effectiveness on Provisioning



                                             Performance
                                             Without
                           RapidMRC Real MRC Isolation


          0   2 4      6    8   10   12 14   16 twolf
          16 14 12    10    8    6    4 2     0 equake
              L2 Cache Sizes (# of colors)


 Performance of Other Combos Using RapidMRC
  ●   12% improvement for vpr+applu
  ●   14% improvement for ammp+3applu

                                                           22
Contributions
On commodity multicores, first to demonstrate
● Mechanism: To detect data sharing online & automatically cluster threads
● Benefits: Promoting sharing [EuroSys'07]




● Mechanism: To partition shared cache by applying page-coloring
● Benefits: Providing isolation [WIOSCA'07]




● Mechanism: To approximate L2 MRCs online in software
● Benefits: Provisioning the cache [ASPLOS'09]



...all performed by the OS.




                                                                             23
Concluding Remarks
Demonstrated Performance Improvements
●
    Promoting Sharing
     ●   5% – 7%        SPECjbb2k, RUBiS, VolanoMark (2 chips)
     ●   14% potential: SPECjbb2k (8 chips)

●
    Providing Isolation
     ●   4% – 17% 8 combos: SPECcpu2k, SPECjbb2k (36MB L3 cache)
     ●   28%, 50% 2 combos: SPECcpu2k (no L3 cache)

●
    Provisioning the Cache Online
     ●   12% – 27% 3 combos: SPECcpu2k




    OS should manage on-chip shared caches


                                                                   24
Thank You


            25
24-9=15 slides




                 26
Future Research Opportunities
 Shared cache management principles
 can be applied to other layers:
  ●   Application, managed runtime, virtual machine monitor

 Promoting sharing
  ●   Improve locality on NUMA multiprocessor systems

 Providing isolation
  ●   Finer granularity, within one application [MICRO'08]
      ●   Regions
      ●   Objects

 RapidMRC
  ●   Online L2 MRCs
      ●   Reducing energy
      ●   Guiding co-scheduling

  ●   Underlying Tracing Mechanism
      ●   Trace other hardware events

                                                              27

More Related Content

What's hot

Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldRichard McDougall
 
Membase Meetup Chicago - january 2011
Membase Meetup Chicago - january 2011Membase Meetup Chicago - january 2011
Membase Meetup Chicago - january 2011Membase
 
Scaling With Sun Systems For MySQL Jan09
Scaling With Sun Systems For MySQL Jan09Scaling With Sun Systems For MySQL Jan09
Scaling With Sun Systems For MySQL Jan09Steve Staso
 
Gluster Webinar: Introduction to GlusterFS
Gluster Webinar: Introduction to GlusterFSGluster Webinar: Introduction to GlusterFS
Gluster Webinar: Introduction to GlusterFSGlusterFS
 
Scalability
ScalabilityScalability
Scalabilityfelho
 
Hadoop World 2011: HDFS Federation - Suresh Srinivas, Hortonworks
Hadoop World 2011: HDFS Federation - Suresh Srinivas, HortonworksHadoop World 2011: HDFS Federation - Suresh Srinivas, Hortonworks
Hadoop World 2011: HDFS Federation - Suresh Srinivas, HortonworksCloudera, Inc.
 
Webinar Sept 22: Gluster Partners with Redapt to Deliver Scale-Out NAS Storage
Webinar Sept 22: Gluster Partners with Redapt to Deliver Scale-Out NAS StorageWebinar Sept 22: Gluster Partners with Redapt to Deliver Scale-Out NAS Storage
Webinar Sept 22: Gluster Partners with Redapt to Deliver Scale-Out NAS StorageGlusterFS
 
Enea Linux and LWRT FTF China 2012
Enea Linux and LWRT FTF China 2012Enea Linux and LWRT FTF China 2012
Enea Linux and LWRT FTF China 2012EneaSoftware
 
Embedded Database Technology | Interbase From Embarcadero Technologies
Embedded Database Technology | Interbase From Embarcadero TechnologiesEmbedded Database Technology | Interbase From Embarcadero Technologies
Embedded Database Technology | Interbase From Embarcadero TechnologiesMichael Findling
 
SLES 11 SP2 PerformanceEvaluation for Linux on System z
SLES 11 SP2 PerformanceEvaluation for Linux on System zSLES 11 SP2 PerformanceEvaluation for Linux on System z
SLES 11 SP2 PerformanceEvaluation for Linux on System zIBM India Smarter Computing
 
VNSISPL_DBMS_Concepts_ch17
VNSISPL_DBMS_Concepts_ch17VNSISPL_DBMS_Concepts_ch17
VNSISPL_DBMS_Concepts_ch17sriprasoon
 
Using Novell Sentinel Log Manager to Monitor Novell Applications
Using Novell Sentinel Log Manager to Monitor Novell ApplicationsUsing Novell Sentinel Log Manager to Monitor Novell Applications
Using Novell Sentinel Log Manager to Monitor Novell ApplicationsNovell
 
Sql Server 2005 Memory Internals
Sql Server 2005 Memory InternalsSql Server 2005 Memory Internals
Sql Server 2005 Memory InternalsDmitry Geyzersky
 
bfarm-v2
bfarm-v2bfarm-v2
bfarm-v2Zeus G
 
Apache Hadoop on Virtual Machines
Apache Hadoop on Virtual MachinesApache Hadoop on Virtual Machines
Apache Hadoop on Virtual MachinesDataWorks Summit
 

What's hot (20)

Openstorage Openstack
Openstorage OpenstackOpenstorage Openstack
Openstorage Openstack
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworld
 
Membase Meetup Chicago - january 2011
Membase Meetup Chicago - january 2011Membase Meetup Chicago - january 2011
Membase Meetup Chicago - january 2011
 
Scaling With Sun Systems For MySQL Jan09
Scaling With Sun Systems For MySQL Jan09Scaling With Sun Systems For MySQL Jan09
Scaling With Sun Systems For MySQL Jan09
 
Gluster Webinar: Introduction to GlusterFS
Gluster Webinar: Introduction to GlusterFSGluster Webinar: Introduction to GlusterFS
Gluster Webinar: Introduction to GlusterFS
 
Scalability
ScalabilityScalability
Scalability
 
Hadoop World 2011: HDFS Federation - Suresh Srinivas, Hortonworks
Hadoop World 2011: HDFS Federation - Suresh Srinivas, HortonworksHadoop World 2011: HDFS Federation - Suresh Srinivas, Hortonworks
Hadoop World 2011: HDFS Federation - Suresh Srinivas, Hortonworks
 
Cosbench apac
Cosbench apacCosbench apac
Cosbench apac
 
cosbench-openstack.pdf
cosbench-openstack.pdfcosbench-openstack.pdf
cosbench-openstack.pdf
 
Webinar Sept 22: Gluster Partners with Redapt to Deliver Scale-Out NAS Storage
Webinar Sept 22: Gluster Partners with Redapt to Deliver Scale-Out NAS StorageWebinar Sept 22: Gluster Partners with Redapt to Deliver Scale-Out NAS Storage
Webinar Sept 22: Gluster Partners with Redapt to Deliver Scale-Out NAS Storage
 
Barrelfish OS
Barrelfish OS Barrelfish OS
Barrelfish OS
 
Enea Linux and LWRT FTF China 2012
Enea Linux and LWRT FTF China 2012Enea Linux and LWRT FTF China 2012
Enea Linux and LWRT FTF China 2012
 
Embedded Database Technology | Interbase From Embarcadero Technologies
Embedded Database Technology | Interbase From Embarcadero TechnologiesEmbedded Database Technology | Interbase From Embarcadero Technologies
Embedded Database Technology | Interbase From Embarcadero Technologies
 
SLES 11 SP2 PerformanceEvaluation for Linux on System z
SLES 11 SP2 PerformanceEvaluation for Linux on System zSLES 11 SP2 PerformanceEvaluation for Linux on System z
SLES 11 SP2 PerformanceEvaluation for Linux on System z
 
VNSISPL_DBMS_Concepts_ch17
VNSISPL_DBMS_Concepts_ch17VNSISPL_DBMS_Concepts_ch17
VNSISPL_DBMS_Concepts_ch17
 
Using Novell Sentinel Log Manager to Monitor Novell Applications
Using Novell Sentinel Log Manager to Monitor Novell ApplicationsUsing Novell Sentinel Log Manager to Monitor Novell Applications
Using Novell Sentinel Log Manager to Monitor Novell Applications
 
Windows 8 Hyper-V: Scalability
Windows 8 Hyper-V: ScalabilityWindows 8 Hyper-V: Scalability
Windows 8 Hyper-V: Scalability
 
Sql Server 2005 Memory Internals
Sql Server 2005 Memory InternalsSql Server 2005 Memory Internals
Sql Server 2005 Memory Internals
 
bfarm-v2
bfarm-v2bfarm-v2
bfarm-v2
 
Apache Hadoop on Virtual Machines
Apache Hadoop on Virtual MachinesApache Hadoop on Virtual Machines
Apache Hadoop on Virtual Machines
 

Similar to Ph.D. thesis presentation

Key-value databases in practice Redis @ DotNetToscana
Key-value databases in practice Redis @ DotNetToscanaKey-value databases in practice Redis @ DotNetToscana
Key-value databases in practice Redis @ DotNetToscanaMatteo Baglini
 
"The Open Source effect in the Storage world" by George Mitropoulos @ eLibera...
"The Open Source effect in the Storage world" by George Mitropoulos @ eLibera..."The Open Source effect in the Storage world" by George Mitropoulos @ eLibera...
"The Open Source effect in the Storage world" by George Mitropoulos @ eLibera...eLiberatica
 
[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programmingnpinto
 
Managing Exadata in the Real World
Managing Exadata in the Real WorldManaging Exadata in the Real World
Managing Exadata in the Real WorldEnkitec
 
CloudStack Best Practice in PPTV
CloudStack Best Practice in PPTVCloudStack Best Practice in PPTV
CloudStack Best Practice in PPTVgavin_lee
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsTony Nguyen
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsYoung Alista
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsFraboni Ec
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsHoang Nguyen
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsLuis Goldster
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsHarry Potter
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsJames Wong
 
NVidia CUDA Tutorial - June 15, 2009
NVidia CUDA Tutorial - June 15, 2009NVidia CUDA Tutorial - June 15, 2009
NVidia CUDA Tutorial - June 15, 2009Randall Hand
 
2 architectural at CloudStack Developer Day
2  architectural at CloudStack Developer Day2  architectural at CloudStack Developer Day
2 architectural at CloudStack Developer DayKimihiko Kitase
 
Pm 01 bradley stone_openstorage_openstack
Pm 01 bradley stone_openstorage_openstackPm 01 bradley stone_openstorage_openstack
Pm 01 bradley stone_openstorage_openstackOpenCity Community
 
Openstorage with OpenStack, by Bradley
Openstorage with OpenStack, by BradleyOpenstorage with OpenStack, by Bradley
Openstorage with OpenStack, by BradleyHui Cheng
 
Private cloud virtual reality to reality a partner story daniel mar_technicom
Private cloud virtual reality to reality a partner story daniel mar_technicomPrivate cloud virtual reality to reality a partner story daniel mar_technicom
Private cloud virtual reality to reality a partner story daniel mar_technicomMicrosoft Singapore
 
Xldb2011 wed 1415_andrew_lamb-buildingblocks
Xldb2011 wed 1415_andrew_lamb-buildingblocksXldb2011 wed 1415_andrew_lamb-buildingblocks
Xldb2011 wed 1415_andrew_lamb-buildingblocksliqiang xu
 

Similar to Ph.D. thesis presentation (20)

Key-value databases in practice Redis @ DotNetToscana
Key-value databases in practice Redis @ DotNetToscanaKey-value databases in practice Redis @ DotNetToscana
Key-value databases in practice Redis @ DotNetToscana
 
"The Open Source effect in the Storage world" by George Mitropoulos @ eLibera...
"The Open Source effect in the Storage world" by George Mitropoulos @ eLibera..."The Open Source effect in the Storage world" by George Mitropoulos @ eLibera...
"The Open Source effect in the Storage world" by George Mitropoulos @ eLibera...
 
[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming
 
My sql tutorial-oscon-2012
My sql tutorial-oscon-2012My sql tutorial-oscon-2012
My sql tutorial-oscon-2012
 
Managing Exadata in the Real World
Managing Exadata in the Real WorldManaging Exadata in the Real World
Managing Exadata in the Real World
 
CloudStack Best Practice in PPTV
CloudStack Best Practice in PPTVCloudStack Best Practice in PPTV
CloudStack Best Practice in PPTV
 
Extlect03
Extlect03Extlect03
Extlect03
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessors
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessors
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessors
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessors
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessors
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessors
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessors
 
NVidia CUDA Tutorial - June 15, 2009
NVidia CUDA Tutorial - June 15, 2009NVidia CUDA Tutorial - June 15, 2009
NVidia CUDA Tutorial - June 15, 2009
 
2 architectural at CloudStack Developer Day
2  architectural at CloudStack Developer Day2  architectural at CloudStack Developer Day
2 architectural at CloudStack Developer Day
 
Pm 01 bradley stone_openstorage_openstack
Pm 01 bradley stone_openstorage_openstackPm 01 bradley stone_openstorage_openstack
Pm 01 bradley stone_openstorage_openstack
 
Openstorage with OpenStack, by Bradley
Openstorage with OpenStack, by BradleyOpenstorage with OpenStack, by Bradley
Openstorage with OpenStack, by Bradley
 
Private cloud virtual reality to reality a partner story daniel mar_technicom
Private cloud virtual reality to reality a partner story daniel mar_technicomPrivate cloud virtual reality to reality a partner story daniel mar_technicom
Private cloud virtual reality to reality a partner story daniel mar_technicom
 
Xldb2011 wed 1415_andrew_lamb-buildingblocks
Xldb2011 wed 1415_andrew_lamb-buildingblocksXldb2011 wed 1415_andrew_lamb-buildingblocks
Xldb2011 wed 1415_andrew_lamb-buildingblocks
 

Ph.D. thesis presentation

  • 1. Operating System Management of Shared Caches on Multicore Processors -- Ph.D. Thesis Presentation -- Apr. 20, 2010 David Tam Supervisor: Michael Stumm 1
  • 2. Multicores Today ... Cache ... Cache Shared Cache Multicores are Ubiquitous ● Unexpected by most software developers ● Software support is lacking (e.g., OS) General Role of OS ● Manage shared hardware resources New Candidate ● Shared cache: performance critical ● Focus of thesis 2
  • 3. Thesis OS should manage on-chip shared caches of multicore processors Demonstrate: ● Properly managing shared caches at OS level can increase performance Management Principles 1. Promote sharing ● For threads that share data ● Maximize major advantage of shared caches 2. Provide isolation ● For threads that do not share data ● Minimize major disadvantage of shared caches Supporting Role ● Provision the shared cache online 3
  • 4. #1 - Promote Sharing Problem: Cross-chip accesses are slow Solution: Exploit major advantage of shared caches: Fast access to shared data OS Actions: Identify & localize data sharing View: Match software sharing to hardware sharing Chip A Chip B Thread A Thread B L2 L2 Shared Data Shared Data Shared Data Traffic 4
  • 5. #1 - Promote Sharing Problem: Cross-chip accesses are slow Solution: Exploit major advantage of shared caches: Fast access to shared data OS Actions: Identify & localize data sharing View: Match software sharing to hardware sharing Chip A Thread B Chip B Thread A L2 L2 Shared Data 5
  • 6. Thread Clustering [EuroSys'07] Identify Data Sharing ● Detect sharing online with hardware performance counters ● Monitor remote cache accesses (data addresses) ● Track on a per-thread basis ● Data addresses are memory regions shared with other threads Localize Data Sharing ● Identify clusters of threads that access same memory regions ● Migrate threads of a cluster onto same chip Chip A Chip B Thread A Thread B L2 L2 Shared Data Shared Data Shared Data Traffic 6
  • 7. Thread Clustering [EuroSys'07] Identify Data Sharing ● Detect sharing online with hardware performance counters ● Monitor remote cache accesses (data addresses) ● Track on a per-thread basis ● Data addresses are memory regions shared with other threads Localize Data Sharing ● Identify clusters of threads that access same memory regions ● Migrate threads of a cluster onto same chip Chip A Thread B Chip B Thread A L2 L2 Shared Data 7
  • 8. Visualization of Clusters ● SPECjbb 2000 Sharing Intensity ● 4 warehouses, 16 threads per warehouse High Medium ● Threads have been sorted by cluster for visualization Low None 16 { threads Threads 0 Memory Regions 264 (Virtual Address) 8
  • 9. Visualization of Clusters ● SPECjbb 2000 Sharing Intensity ● 4 warehouses, 16 threads per warehouse High Medium ● Threads have been sorted by cluster for visualization Low None 16 { threads Threads 0 Memory Regions 264 (Virtual Address) 9
  • 10. Performance Results 1.9MB L2 1.9MB L2 36 MB 36 MB 4 GB 4 GB ● Multithreaded commercial workloads ● RUBiS, VolanoMark, SPECjbb2k ● 8-way IBM POWER5 Linux system ● 22%, 32%, 70% reduction in stalls caused by cross-chip accesses ● 7%, 5%, 6% performance improvement ● 32-way IBM POWER5+ Linux system ● 14% SPECjbb2k potential improvement 10
  • 11. #2 – Provide Isolation Problem: Major disadvantage of shared caches Cache space interference Solution: Provide cache space isolation between applications OS Actions: Enforce isolation during physical page allocation View: Partition into smaller private caches Apache MySQL 11
  • 12. #2 – Provide Isolation Problem: Major disadvantage of shared caches Cache space interference Solution: Provide cache space isolation between applications OS Actions: Enforce isolation during physical page allocation View: Partition into smaller private caches Apache MySQL 12
  • 13. #2 – Provide Isolation Problem: Major disadvantage of shared caches Cache space interference Solution: Provide cache space isolation between applications OS Actions: Enforce isolation during physical page allocation View: Partition into smaller private caches Apache MySQL 13
  • 14. #2 – Provide Isolation Problem: Major disadvantage of shared caches Cache space interference Solution: Provide cache space isolation between applications OS Actions: Enforce isolation during physical page allocation View: Partition into smaller private caches Apache MySQL Boundary 14
  • 15. #2 – Provide Isolation Problem: Major disadvantage of shared caches Cache space interference Solution: Provide cache space isolation between applications OS Actions: Enforce isolation during physical page allocation View: Partition into smaller private caches Apache MySQL Boundary 15
  • 16. Cache Partitioning [WIOSCA'07] ● Apply page-coloring technique ● Guide physical page allocation to control cache line usage ● Works on existing processors Virtual Pages Physical Pages Color A L2 Cache Application { } Color A (N sets) Color A Color A OS Managed Fixed Mapping (Hardware) 16
  • 17. Cache Partitioning [WIOSCA'07] ● Apply page-coloring technique ● Guide physical page allocation to control cache line usage ● Works on existing processors Virtual Pages Physical Pages Color A Color B L2 Cache Application A { } Color A (N sets) Color A Color B Virtual Pages { } Color B (N sets) Color A Application B Color B OS Managed Fixed Mapping (Hardware) 17
  • 18. Impact of Partitioning mcf Performance Without art Isolation mcf 16 14 12 10 8 6 4 2 0 art L2 Cache Sizes (# of Colors) Performance of Other Combos ● 10 pairs of applications: SPECcpu2k, SPECjbb2k ● 4% to 17% improvement (36MB L3 cache) ● 28%, 50% improvement (no L3 cache) 18
  • 19. Provisioning the Cache Problem: How to determine cache partition size Solution: Use L2 cache miss rate curve (MRC) of application Criteria: Obtain MRC rapidly, accurately, online, with low overhead, on existing hardware OS Actions: Monitor L2 cache accesses using hardware performance counters Application X 100 90 80 Miss Rate (%) 70 60 50 40 30 20 10 0 0 10 20 30 40 50 60 70 80 90 100 Allocated Cache Size (%) 19
  • 20. RapidMRC [ASPLOS'09] Design ● Upon every L2 access: ● Update sampling register with data address ● Trigger interrupt to copy register to trace log in main memory ● Feed trace log into Mattson's stack algorithm [1970] to obtain L2 MRC Results ● Workloads ● 30 apps from SPECcpu2k, SPECcpu2k6, SPECjbb2k ● Latency ● 227 ms to generate online L2 MRC ● Accuracy ● Good, e.g. up to 27% performance improvement when applied to cache partitioning 20
  • 21. Accuracy of RapidMRC ● Execution slice at 10 billion instructions jbb gzip mgrid Miss Rate (MPKI) Cache Size (# colors) mcf 2k xalancbmk ammp 21
  • 22. Effectiveness on Provisioning Performance Without RapidMRC Real MRC Isolation 0 2 4 6 8 10 12 14 16 twolf 16 14 12 10 8 6 4 2 0 equake L2 Cache Sizes (# of colors) Performance of Other Combos Using RapidMRC ● 12% improvement for vpr+applu ● 14% improvement for ammp+3applu 22
  • 23. Contributions On commodity multicores, first to demonstrate ● Mechanism: To detect data sharing online & automatically cluster threads ● Benefits: Promoting sharing [EuroSys'07] ● Mechanism: To partition shared cache by applying page-coloring ● Benefits: Providing isolation [WIOSCA'07] ● Mechanism: To approximate L2 MRCs online in software ● Benefits: Provisioning the cache [ASPLOS'09] ...all performed by the OS. 23
  • 24. Concluding Remarks Demonstrated Performance Improvements ● Promoting Sharing ● 5% – 7% SPECjbb2k, RUBiS, VolanoMark (2 chips) ● 14% potential: SPECjbb2k (8 chips) ● Providing Isolation ● 4% – 17% 8 combos: SPECcpu2k, SPECjbb2k (36MB L3 cache) ● 28%, 50% 2 combos: SPECcpu2k (no L3 cache) ● Provisioning the Cache Online ● 12% – 27% 3 combos: SPECcpu2k OS should manage on-chip shared caches 24
  • 25. Thank You 25
  • 27. Future Research Opportunities Shared cache management principles can be applied to other layers: ● Application, managed runtime, virtual machine monitor Promoting sharing ● Improve locality on NUMA multiprocessor systems Providing isolation ● Finer granularity, within one application [MICRO'08] ● Regions ● Objects RapidMRC ● Online L2 MRCs ● Reducing energy ● Guiding co-scheduling ● Underlying Tracing Mechanism ● Trace other hardware events 27