SlideShare a Scribd company logo
1 of 50
Rethinking Applications in
the NVM Era
Amitabha Roy
ex- Intel Research
InfoQ.com: News & Community Site
• Over 1,000,000 software developers, architects and CTOs read the site world-
wide every month
• 250,000 senior developers subscribe to our weekly newsletter
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• 2 dedicated podcast channels: The InfoQ Podcast, with a focus on
Architecture and The Engineering Culture Podcast, with a focus on building
• 96 deep dives on innovative topics packed as downloadable emags and
minibooks
• Over 40 new content items per week
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
applications-nvm
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon San Francisco
www.qconsf.com
NVM = Non Volatile Memory
● Like DRAM, but retains its contents across reboots
● Past: Non Volatile DIMMs
○ Memory DIMM + Ultra-capacitor + Flash
○ Contents dumped on power fail, restored on startup
○ DRAM style access and performance but non-volatile
● Future: New types of non volatile memory media
○ Memristor, Phase Change Memory, Crossbar resistive memory, 3DXPoint
○ 3DXPoint DIMMS (Intel and Micron), demoed at Sapphire NOW 2017
○ Non Volatile without extra machinery - practical
Software Design
● New level in the storage hierarchy
Disk/SSD DRAMNVM
Persistent Persistent Volatile
Block oriented Byte oriented Byte oriented
Slow Fast Fast
⇒ Fundamental breakthroughs in how we design systems
Use Case: Rocksdb
● Rocksdb - open source persistent key value store
● Optimized for Flash SSDs
● persistent map<key:string, value:string>
● Two levels (LSM tree) - sorted by key
L0: DRAM
PUT(<K, V>)
OK
Absorb updates quickly
L1: SSD
Flush large batches to SSD
Use Case: Rocksdb
● Problem: Lose all data in DRAM on power fail
● Durability guarantee requires a write ahead log
● Solution: synchronously append to a write ahead log
L0: DRAM
PUT(<K, V>)
Absorb updates quickly
L1: SSD
Flush large batches to SSD
Write Ahead Log: SSD
WAL.Append(<K, V>) OK
Rocksdb
+ WAL
Have to choose between safety and performance
Synchronous == 10X GAP
Rocksdb WAL Flow
PUT(<K, V>)
SSD
WAL.Append(<K, V>) OK 20 us round trip to SSD
Small KV pairs ~ 100 bytes
Synchronous writes => 5 MB/s
SSD is not the problem.
Sequential SSD BW => 1 GB/s
Problem: Persistence is block oriented
Most efficient path to SSD is 4KB units not 100 bytes
Have to pay fixed latency cost for only 100 byte IO
Rocksdb WAL Flow
Solution: Use byte oriented persistent memory
PUT(<K, V>)
SSD
WAL.Append(<K, V>) OK
NVM
Drain 4KB
OK
~100 ns round trip to NVDIMM
Small KV pairs ~ 100 bytes
Synchronous writes => 1GB/s
Sequential SSD BW => 1 GB/s
Rocksdb
+ WAL
+ WAL + NVM
NVM removes the need for a safety vs performance choice
NVM = No more synchronous logging pain for KV stores, FS, Databases...
Software Engineering for NVM
● Building software for NVM has high payoffs
○ Make everything go much faster
● Not as simple as writing code for data in DRAM
○ Even though NVM looks exactly like DRAM for access
● Writing correct code to maintain persistent data structures is difficult
○ Part 2 of this talk
● Getting it wrong has high cost
○ Persistence = errors do not go away with reboot
○ No more ctrl+alt+del to fix problems
● Software engineering aids to deal with persistent memory
○ Part 3 of this talk
Example: Building an NVM log
● Like the one we need for RocksDB
● Start from DRAM version
int * entries;
int tail;
void append(int value) {
tail++;
entries[tail] = value;
}
Making it Persistent
…
entries = mmap(fd, ...);
...
int * entries;
int tail;
void append(int value) {
tail++;
entries[tail] = value;
}
DRAM
(pagecache) IO Device
page_out( )
page_in( )
OS VMM
Persistent devices are block oriented
Hide block interface behind mmap abstraction
entries
Persistent Data Structures Tomorrow
Does not work for NVM
Wasteful copying - NVM is byte oriented and directly addressable
NVM
page_out( )
page_in( )
OS VMMDRAM
(pagecache)
Direct Access (DAX)
NVM
# Most Linux filesystems support for NVM
# mount -t ramfs -o dax,size=128m ext2 /nvm
fd = open(“/nvm/log”, …);
int *entries = mmap(fd, ...);
int tail;
void append(int value) {
tail++;
entries[tail] = value;
}
entries
Tolerating Reboots
fd = open(“/nvm/log”, …);
int entries = mmap(fd, ...);
int tail;
void append(int value) {
tail++;
entries[tail] = value;
}
Persistent data structures live across reboots
Thinking about Persistence
void * area = mmap(.., fd, ..);
int *entries=0xabc
Virtual Physical
0xabc 0xdef
... ...
Page Table
0xdef
In NVM NVM
Thinking about Persistence
After reboot
Virtual Physical
0xbbb 0xdef
0xabc NULL
Page Table
0xdef
Persistent data structures live across reboots. Address mappings do not.
int *entries=0xabc
In NVM
CRASH
NVM
Persistent Pointers
Solution: Make pointers base relative. Base comes from mmap.
fd = open(“/nvm/log”, …);
nvm_base = mmap(fd, ...);
#define VA(off) ((off) + nvm_base)
offset_t entries;
int tail;
void append(int value) {
tail++;
VA(entries)[tail] = value;
}
entries
nvm_base:
Power Failure
fd = open(“/nvm/log”, …);
nvm_base = mmap(fd, ...);
#define VA(off) ((off) + nvm_base)
offset_t entries;
int tail;
void append(int value) {
tail++;
VA(entries)[tail] = value;
}
value
garbage
Entries
tail
value
garbage
Entries
tail
value
value
Entries
tail
Before
tail++
.. = value
Power Failure
fd = open(“/nvm/log”, …);
nvm_base = mmap(fd, ...);
#define VA(off) ((off) + nvm_base)
offset_t entries;
int tail;
void append(int value) {
tail++;
VA(entries)[tail] = value;
}
value
garbage
Entries
tail
value
garbage
Entries
tail
value
value
Entries
tail
Before
tail++
Reboot after Power Failure
fd = open(“/nvm/log”, …);
nvm_base = mmap(fd, ...);
#define VA(off) ((off) + nvm_base)
offset_t entries;
int tail;
void append(int value) {
tail++;
VA(entries)[tail] = value;
}
value
garbage
Entries
tailreboot
Ordering Matters
fd = open(“/nvm/log”, …);
nvm_base = mmap(fd, ...);
#define VA(off) ((off) + nvm_base)
offset_t entries;
int tail;
void append(int value) {
VA(entries)[tail + 1] = value;
tail++;
}
value
garbage
Entries
tail
value
garbage
Entries
tail
value
value
Entries
tail
Before
tail++
.. = value
OK to fail !
The last piece: CPU caches
Transparent processor caches reorder your updates to NVM
CPU
core
1
Cache
2
NVM
fd = open(“/nvm/log”, …);
nvm_base = mmap(fd, ...);
#define VA(off) ((off) + nvm_base)
offset_t entries;
int tail;
void append(int value) {
VA(entries)[tail + 1] = value;
tail++;
}
Cache:
{tail, entries[tail]}
NVM:
{}
Cache:
{entries[tail]}
NVM:
{tail}
Explicit Cache Control
fd = open(“/nvm/log”, …);
nvm_base = mmap(fd, ...);
#define VA(off) ((off) + nvm_base)
offset_t entries;
int tail;
void append(int value) {
tail++;
sfence();
clflush(&tail);
VA(entries)[tail] = value;
sfence();
clflush(&VA(entries)[tail]);
}
CPU
core
1
Cache
2
NVM
Use explicit instructions to control cache behavior
Cache:
{tail}
NVM:
{entries[tail]}
Getting NVM Right
COMPLEXITY !!!
fd = open(“/nvm/log”, …);
nvm_base = mmap(fd, ...);
#define VA(off) ((off) + nvm_base)
offset_t entries;
int tail;
void append(int value) {
tail++;
sfence();
clflush(&tail);
VA(entries)[tail] = value;
sfence();
clflush(&VA(entries)[tail]);
}
Software Toolchains for NVM
● Correctly manipulating NVM can be difficult.
● Bugs and errors propagate past the lifetime of the program
○ Fixing errors with DRAM is easy - ctrl + alt + del
○ Your data structures will outlive your code
○ New reality for software engineering
● People will still do it (this talk encourages you to)
● Need automation to relieve software burden
○ Testing
○ Libraries
Software Testing for NVM
fd = open(“/nvm/log”, …);
nvm_base = mmap(fd, ...);
#define VA(off) ((off) + nvm_base)
offset_t entries;
int tail;
void append(int value) {
tail++;
sfence();
clflush(&tail);
VA(entries)[tail] = value;
sfence();
clflush(&VA(entries)[tail]);
}
TEST {
append(42);
ASSERT(entries[1] == 42);
}
Software Testing for NVM
fd = open(“/nvm/log”, …);
nvm_base = mmap(fd, ...);
#define VA(off) ((off) + nvm_base)
offset_t entries;
int tail;
void append(int value) {
tail++;
sfence();
clflush(&tail); // BUG!!
VA(entries)[tail] = value;
sfence();
clflush(&VA(entries)[tail]);
}
TEST {
append(42);
ASSERT(entries[1] == 42);
}
Thousands of executions…..
ASSERT nevers fires
Software Testing for NVM
fd = open(“/nvm/log”, …);
nvm_base = mmap(fd, ...);
#define VA(off) ((off) + nvm_base)
offset_t entries;
int tail;
void append(int value) {
tail++;
sfence();
clflush(&tail); // BUG!!
VA(entries)[tail] = value;
sfence();
clflush(&VA(entries)[tail]);
}
TEST {
append(42);
REBOOT;
ASSERT(entries[1] == 42);
}
Thousands of executions…..
ASSERT maybe fires
YAT
Automated testing tool for NVM software
Yat: A Validation Framework for Persistent Memory. Dulloor et al. USENIX 2014
Idea: Test power failure without really pulling the plug
1. Extract possible store orders to NVM
Use a hypervisor or instrumentation via binary instrumentation (eg. PIN, Valgrind)
Use understanding of x86 memory ordering model
fd = open(“/nvm/log”, …);
nvm_base = mmap(fd, ...);
#define VA(off) ((off) + nvm_base)
offset_t entries;
int tail;
void append(int value) {
tail++;
sfence();
clflush(&tail); // BUG!!
VA(entries)[tail] = value;
sfence();
clflush(&VA(entries)[tail]);
}
YAT {
append(42);
ASSERT(entries[1] == 42);
}
tail=1;
..=42;
..=42;
tail=1;
2. Consider All Possible Truncations
Each truncation is a simulated power failure!
fd = open(“/nvm/log”, …);
nvm_base = mmap(fd, ...);
#define VA(off) ((off) + nvm_base)
offset_t entries;
int tail;
void append(int value) {
tail++;
sfence();
clflush(&tail); // BUG!!
VA(entries)[tail] = value;
sfence();
clflush(&VA(entries)[tail]);
}
YAT {
append(42);
ASSERT(entries[1] == 42);
}
tail=1;
..=42;
..=42;
tail=1;
tail=1;
..=42;
tail=1; ..=42;
tail=1;
..=42;
2. Check Assertion for Each Truncation
fd = open(“/nvm/log”, …);
nvm_base = mmap(fd, ...);
#define VA(off) ((off) + nvm_base)
offset_t entries;
int tail;
void append(int value) {
tail++;
sfence();
clflush(&tail); // BUG!!
VA(entries)[tail] = value;
sfence();
clflush(&VA(entries)[tail]);
}
YAT {
append(42);
ASSERT(entries[1] == 42);
}
tail=1;
..=42;
..=42;
tail=1;
tail=1;
..=42;
tail=1; ..=42;
tail=1;
..=42;
Non Volatile Memory Library
● Testing does not stop bugs - it only catches them after the fact
● Need to stop bugs at the source
● Make NVM look exactly DRAM to the programmer
● Automate the extra bits
● Enable complex datastructures - such as trees
■ Not as easy to reason about consistency like our toy example
■ Impossible except for ninja programmers
● Non Volatile Memory Library (NVML)
http://nvml.io
PMEMoid entries;
int tail;
void append(int v) {
TX_BEGIN(...) {
pmemobj_tx_add_range_direct(&tail, sizeof(int));
tail++;
int* array = pmemobj_direct(entries);
pmemobj_tx_add_range_direct(&array[top], sizeof(T));
array[tail] = v;
} TX_END
NVML
Automation/Magic
1. Persistent pointers
2. No need for sfence, clflush
3. Order of updates irrelevant
Undo Log
TX_BEGIN(...) {
...
pmemobj_tx_add_range_direct(&tail, ..);
tail++;
...
pmemobj_tx_add_range_direct(&array[tail], ..);
array[tail] = v;
} TX_END
&tail 10
&array[11] GARBAGE
ADDRESS CONTENT
UNDO LOG
Undo Log
TX_BEGIN(...) {
...
pmemobj_tx_add_range_direct(&tail, ..);
tail++;
...
pmemobj_tx_add_range_direct(&array[tail], ..);
array[tail] = v;
} TX_END
&tail 10
&array[11] GARBAGE
ADDRESS CONTENT
UNDO LOG
If Hit TX_END - Success !
sfence;
foreach e in UNDO LOG:
clflush e.address
Delete UNDO LOG
Undo Log
TX_BEGIN(...) {
...
pmemobj_tx_add_range_direct(&tail, ..);
tail++;
...
pmemobj_tx_add_range_direct(&array[tail], ..);
array[tail] = v;
} TX_END
&tail 10
&array[11] GARBAGE
ADDRESS CONTENT
UNDO LOG (in NVM)
If Hit TX_END - Success !
sfence;
foreach e in UNDO LOG:
clflush e.address
Delete UNDO LOG
else Restart - Failed !
foreach e in UNDO LOG:
*e.address = e.content
sfence
clflush e.address
Delete UNDO LOG
Undo Log == Failure Atomicity
TX_BEGIN(...) {
...
pmemobj_tx_add_range_direct(&tail, ..);
tail++;
...
pmemobj_tx_add_range_direct(&array[tail], ..);
array[tail] = v;
} TX_END
&tail 10
&array[11] GARBAGE
ADDRESS CONTENT
UNDO LOG (in NVM)
All or nothing semantics in the face of failure
Like ACID from DBMS world
Profilers
● Tiered memory => Different performance flavors of main memory
● NVM slower and more plentiful than DRAM
CPUNVM
(most flavors)
DRAM
Long Latency
Low Bandwidth
Low Latency
High Bandwidth
NVM = Tiered performance of main memory
Performance and Placement
● Analytics - don’t care about persistence
● Use NVM as cheap, plentiful and slow memory
○ Surprising but projected use of NVM
● Choice of where to place data structures
CPU
NVM
DRAM
Long Latency
Low Bandwidth
Low Latency
High Bandwidth
Performance and Placement
CPU
NVM
DRAM
Long Latency
Low Bandwidth
Low Latency
High Bandwidth
Data Structure A: 20 accesses/sec
Data Structure B: 10 accesses/sec
Storage caching wisdom(aka 5 minute rule): More frequent accesses to faster memory
Performance and Placement
CPU
NVM
DRAM
Long Latency
Low Bandwidth
Low Latency
High Bandwidth
Data Structure A: 20 accesses/sec - sequential scans
Data Structure B: 10 accesses/sec - pointer chasing
Memory access performance strongly governed by access pattern and not just access frequency
Data Tiering in Heterogeneous Memory Systems. Eurosys 2016.Storage caching wisdom is wrong
More frequently accessed DS in slower memory!
Performance and Placement
● Little’s Law
InFlightRequests = Bandwidth * Latency
● Can have larger bandwidth even with longer latency
Bandwidth = InFlightRequests/Latency
● OOO CPU pipelines good at increasing InFlightRequests for scans
○ Prefetching
○ Non-Blocking Caches
● Scans should go to longer latency NVM even if more frequent
○ Pointer chasing needs high performance memory
X-Mem
● Guide data structure placement via profiling
● Beyond simple cache miss rate optimization
○ Eg. tools like vtune, gprof
● Need to determine access pattern (pointer or scan?)
Solution:
malloc(size, TAG) + map<Virtual Pages, TAG>
● TAG unique to datastructure: maps memory access to datastructure
● Access profiler determines best memory type for TAG
● malloc maps data structure blocks to pages from correct memory type
Example
● MemC3 hash table - An improved memcached
Name Frequency Type Location
512B buckets 2% Random DRAM
8K Values 8% Scans NVM
Conclusion
● NVM adds a whole new dimension to software engineering
● Opportunities for fundamental breakthroughs
○ Solve system design problems in new ways
○ Eg. fixing synchronous logging in Rocksdb
● Challenges
○ Data structures outlive the code - can’t restart on a bug!
○ Persistent pointers, ordering, processor caches
○ Tiered main memory architecture
● Software engineering solutions
○ New ideas in testing, libraries, profilers
● What will you do with NVM ?
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
applications-nvm

More Related Content

More from C4Media

Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDC4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine LearningC4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at SpeedC4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsC4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsC4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerC4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleC4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeC4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereC4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing ForC4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreC4Media
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsC4Media
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechC4Media
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/awaitC4Media
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaC4Media
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayC4Media
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?C4Media
 
CockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseCockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseC4Media
 
A Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinA Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinC4Media
 

More from C4Media (20)

Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery Teams
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in Adtech
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/await
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven Utopia
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?
 
CockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseCockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL Database
 
A Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinA Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with Brooklin
 

Recently uploaded

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 

Recently uploaded (20)

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 

Rethinking Applications for the NVM Era

  • 1. Rethinking Applications in the NVM Era Amitabha Roy ex- Intel Research
  • 2. InfoQ.com: News & Community Site • Over 1,000,000 software developers, architects and CTOs read the site world- wide every month • 250,000 senior developers subscribe to our weekly newsletter • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture and The Engineering Culture Podcast, with a focus on building • 96 deep dives on innovative topics packed as downloadable emags and minibooks • Over 40 new content items per week Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ applications-nvm
  • 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon San Francisco www.qconsf.com
  • 4. NVM = Non Volatile Memory ● Like DRAM, but retains its contents across reboots ● Past: Non Volatile DIMMs ○ Memory DIMM + Ultra-capacitor + Flash ○ Contents dumped on power fail, restored on startup ○ DRAM style access and performance but non-volatile ● Future: New types of non volatile memory media ○ Memristor, Phase Change Memory, Crossbar resistive memory, 3DXPoint ○ 3DXPoint DIMMS (Intel and Micron), demoed at Sapphire NOW 2017 ○ Non Volatile without extra machinery - practical
  • 5. Software Design ● New level in the storage hierarchy Disk/SSD DRAMNVM Persistent Persistent Volatile Block oriented Byte oriented Byte oriented Slow Fast Fast ⇒ Fundamental breakthroughs in how we design systems
  • 6. Use Case: Rocksdb ● Rocksdb - open source persistent key value store ● Optimized for Flash SSDs ● persistent map<key:string, value:string> ● Two levels (LSM tree) - sorted by key L0: DRAM PUT(<K, V>) OK Absorb updates quickly L1: SSD Flush large batches to SSD
  • 7. Use Case: Rocksdb ● Problem: Lose all data in DRAM on power fail ● Durability guarantee requires a write ahead log ● Solution: synchronously append to a write ahead log L0: DRAM PUT(<K, V>) Absorb updates quickly L1: SSD Flush large batches to SSD Write Ahead Log: SSD WAL.Append(<K, V>) OK
  • 8. Rocksdb + WAL Have to choose between safety and performance Synchronous == 10X GAP
  • 9. Rocksdb WAL Flow PUT(<K, V>) SSD WAL.Append(<K, V>) OK 20 us round trip to SSD Small KV pairs ~ 100 bytes Synchronous writes => 5 MB/s SSD is not the problem. Sequential SSD BW => 1 GB/s Problem: Persistence is block oriented Most efficient path to SSD is 4KB units not 100 bytes Have to pay fixed latency cost for only 100 byte IO
  • 10. Rocksdb WAL Flow Solution: Use byte oriented persistent memory PUT(<K, V>) SSD WAL.Append(<K, V>) OK NVM Drain 4KB OK ~100 ns round trip to NVDIMM Small KV pairs ~ 100 bytes Synchronous writes => 1GB/s Sequential SSD BW => 1 GB/s
  • 11. Rocksdb + WAL + WAL + NVM NVM removes the need for a safety vs performance choice NVM = No more synchronous logging pain for KV stores, FS, Databases...
  • 12. Software Engineering for NVM ● Building software for NVM has high payoffs ○ Make everything go much faster ● Not as simple as writing code for data in DRAM ○ Even though NVM looks exactly like DRAM for access ● Writing correct code to maintain persistent data structures is difficult ○ Part 2 of this talk ● Getting it wrong has high cost ○ Persistence = errors do not go away with reboot ○ No more ctrl+alt+del to fix problems ● Software engineering aids to deal with persistent memory ○ Part 3 of this talk
  • 13. Example: Building an NVM log ● Like the one we need for RocksDB ● Start from DRAM version int * entries; int tail; void append(int value) { tail++; entries[tail] = value; }
  • 14. Making it Persistent … entries = mmap(fd, ...); ... int * entries; int tail; void append(int value) { tail++; entries[tail] = value; } DRAM (pagecache) IO Device page_out( ) page_in( ) OS VMM Persistent devices are block oriented Hide block interface behind mmap abstraction entries
  • 15. Persistent Data Structures Tomorrow Does not work for NVM Wasteful copying - NVM is byte oriented and directly addressable NVM page_out( ) page_in( ) OS VMMDRAM (pagecache)
  • 16. Direct Access (DAX) NVM # Most Linux filesystems support for NVM # mount -t ramfs -o dax,size=128m ext2 /nvm fd = open(“/nvm/log”, …); int *entries = mmap(fd, ...); int tail; void append(int value) { tail++; entries[tail] = value; } entries
  • 17. Tolerating Reboots fd = open(“/nvm/log”, …); int entries = mmap(fd, ...); int tail; void append(int value) { tail++; entries[tail] = value; } Persistent data structures live across reboots
  • 18. Thinking about Persistence void * area = mmap(.., fd, ..); int *entries=0xabc Virtual Physical 0xabc 0xdef ... ... Page Table 0xdef In NVM NVM
  • 19. Thinking about Persistence After reboot Virtual Physical 0xbbb 0xdef 0xabc NULL Page Table 0xdef Persistent data structures live across reboots. Address mappings do not. int *entries=0xabc In NVM CRASH NVM
  • 20. Persistent Pointers Solution: Make pointers base relative. Base comes from mmap. fd = open(“/nvm/log”, …); nvm_base = mmap(fd, ...); #define VA(off) ((off) + nvm_base) offset_t entries; int tail; void append(int value) { tail++; VA(entries)[tail] = value; } entries nvm_base:
  • 21. Power Failure fd = open(“/nvm/log”, …); nvm_base = mmap(fd, ...); #define VA(off) ((off) + nvm_base) offset_t entries; int tail; void append(int value) { tail++; VA(entries)[tail] = value; } value garbage Entries tail value garbage Entries tail value value Entries tail Before tail++ .. = value
  • 22. Power Failure fd = open(“/nvm/log”, …); nvm_base = mmap(fd, ...); #define VA(off) ((off) + nvm_base) offset_t entries; int tail; void append(int value) { tail++; VA(entries)[tail] = value; } value garbage Entries tail value garbage Entries tail value value Entries tail Before tail++
  • 23. Reboot after Power Failure fd = open(“/nvm/log”, …); nvm_base = mmap(fd, ...); #define VA(off) ((off) + nvm_base) offset_t entries; int tail; void append(int value) { tail++; VA(entries)[tail] = value; } value garbage Entries tailreboot
  • 24. Ordering Matters fd = open(“/nvm/log”, …); nvm_base = mmap(fd, ...); #define VA(off) ((off) + nvm_base) offset_t entries; int tail; void append(int value) { VA(entries)[tail + 1] = value; tail++; } value garbage Entries tail value garbage Entries tail value value Entries tail Before tail++ .. = value OK to fail !
  • 25. The last piece: CPU caches Transparent processor caches reorder your updates to NVM CPU core 1 Cache 2 NVM fd = open(“/nvm/log”, …); nvm_base = mmap(fd, ...); #define VA(off) ((off) + nvm_base) offset_t entries; int tail; void append(int value) { VA(entries)[tail + 1] = value; tail++; } Cache: {tail, entries[tail]} NVM: {} Cache: {entries[tail]} NVM: {tail}
  • 26. Explicit Cache Control fd = open(“/nvm/log”, …); nvm_base = mmap(fd, ...); #define VA(off) ((off) + nvm_base) offset_t entries; int tail; void append(int value) { tail++; sfence(); clflush(&tail); VA(entries)[tail] = value; sfence(); clflush(&VA(entries)[tail]); } CPU core 1 Cache 2 NVM Use explicit instructions to control cache behavior Cache: {tail} NVM: {entries[tail]}
  • 27. Getting NVM Right COMPLEXITY !!! fd = open(“/nvm/log”, …); nvm_base = mmap(fd, ...); #define VA(off) ((off) + nvm_base) offset_t entries; int tail; void append(int value) { tail++; sfence(); clflush(&tail); VA(entries)[tail] = value; sfence(); clflush(&VA(entries)[tail]); }
  • 28. Software Toolchains for NVM ● Correctly manipulating NVM can be difficult. ● Bugs and errors propagate past the lifetime of the program ○ Fixing errors with DRAM is easy - ctrl + alt + del ○ Your data structures will outlive your code ○ New reality for software engineering ● People will still do it (this talk encourages you to) ● Need automation to relieve software burden ○ Testing ○ Libraries
  • 29. Software Testing for NVM fd = open(“/nvm/log”, …); nvm_base = mmap(fd, ...); #define VA(off) ((off) + nvm_base) offset_t entries; int tail; void append(int value) { tail++; sfence(); clflush(&tail); VA(entries)[tail] = value; sfence(); clflush(&VA(entries)[tail]); } TEST { append(42); ASSERT(entries[1] == 42); }
  • 30. Software Testing for NVM fd = open(“/nvm/log”, …); nvm_base = mmap(fd, ...); #define VA(off) ((off) + nvm_base) offset_t entries; int tail; void append(int value) { tail++; sfence(); clflush(&tail); // BUG!! VA(entries)[tail] = value; sfence(); clflush(&VA(entries)[tail]); } TEST { append(42); ASSERT(entries[1] == 42); } Thousands of executions….. ASSERT nevers fires
  • 31. Software Testing for NVM fd = open(“/nvm/log”, …); nvm_base = mmap(fd, ...); #define VA(off) ((off) + nvm_base) offset_t entries; int tail; void append(int value) { tail++; sfence(); clflush(&tail); // BUG!! VA(entries)[tail] = value; sfence(); clflush(&VA(entries)[tail]); } TEST { append(42); REBOOT; ASSERT(entries[1] == 42); } Thousands of executions….. ASSERT maybe fires
  • 32. YAT Automated testing tool for NVM software Yat: A Validation Framework for Persistent Memory. Dulloor et al. USENIX 2014 Idea: Test power failure without really pulling the plug
  • 33. 1. Extract possible store orders to NVM Use a hypervisor or instrumentation via binary instrumentation (eg. PIN, Valgrind) Use understanding of x86 memory ordering model fd = open(“/nvm/log”, …); nvm_base = mmap(fd, ...); #define VA(off) ((off) + nvm_base) offset_t entries; int tail; void append(int value) { tail++; sfence(); clflush(&tail); // BUG!! VA(entries)[tail] = value; sfence(); clflush(&VA(entries)[tail]); } YAT { append(42); ASSERT(entries[1] == 42); } tail=1; ..=42; ..=42; tail=1;
  • 34. 2. Consider All Possible Truncations Each truncation is a simulated power failure! fd = open(“/nvm/log”, …); nvm_base = mmap(fd, ...); #define VA(off) ((off) + nvm_base) offset_t entries; int tail; void append(int value) { tail++; sfence(); clflush(&tail); // BUG!! VA(entries)[tail] = value; sfence(); clflush(&VA(entries)[tail]); } YAT { append(42); ASSERT(entries[1] == 42); } tail=1; ..=42; ..=42; tail=1; tail=1; ..=42; tail=1; ..=42; tail=1; ..=42;
  • 35. 2. Check Assertion for Each Truncation fd = open(“/nvm/log”, …); nvm_base = mmap(fd, ...); #define VA(off) ((off) + nvm_base) offset_t entries; int tail; void append(int value) { tail++; sfence(); clflush(&tail); // BUG!! VA(entries)[tail] = value; sfence(); clflush(&VA(entries)[tail]); } YAT { append(42); ASSERT(entries[1] == 42); } tail=1; ..=42; ..=42; tail=1; tail=1; ..=42; tail=1; ..=42; tail=1; ..=42;
  • 36. Non Volatile Memory Library ● Testing does not stop bugs - it only catches them after the fact ● Need to stop bugs at the source ● Make NVM look exactly DRAM to the programmer ● Automate the extra bits ● Enable complex datastructures - such as trees ■ Not as easy to reason about consistency like our toy example ■ Impossible except for ninja programmers ● Non Volatile Memory Library (NVML) http://nvml.io
  • 37. PMEMoid entries; int tail; void append(int v) { TX_BEGIN(...) { pmemobj_tx_add_range_direct(&tail, sizeof(int)); tail++; int* array = pmemobj_direct(entries); pmemobj_tx_add_range_direct(&array[top], sizeof(T)); array[tail] = v; } TX_END NVML Automation/Magic 1. Persistent pointers 2. No need for sfence, clflush 3. Order of updates irrelevant
  • 38. Undo Log TX_BEGIN(...) { ... pmemobj_tx_add_range_direct(&tail, ..); tail++; ... pmemobj_tx_add_range_direct(&array[tail], ..); array[tail] = v; } TX_END &tail 10 &array[11] GARBAGE ADDRESS CONTENT UNDO LOG
  • 39. Undo Log TX_BEGIN(...) { ... pmemobj_tx_add_range_direct(&tail, ..); tail++; ... pmemobj_tx_add_range_direct(&array[tail], ..); array[tail] = v; } TX_END &tail 10 &array[11] GARBAGE ADDRESS CONTENT UNDO LOG If Hit TX_END - Success ! sfence; foreach e in UNDO LOG: clflush e.address Delete UNDO LOG
  • 40. Undo Log TX_BEGIN(...) { ... pmemobj_tx_add_range_direct(&tail, ..); tail++; ... pmemobj_tx_add_range_direct(&array[tail], ..); array[tail] = v; } TX_END &tail 10 &array[11] GARBAGE ADDRESS CONTENT UNDO LOG (in NVM) If Hit TX_END - Success ! sfence; foreach e in UNDO LOG: clflush e.address Delete UNDO LOG else Restart - Failed ! foreach e in UNDO LOG: *e.address = e.content sfence clflush e.address Delete UNDO LOG
  • 41. Undo Log == Failure Atomicity TX_BEGIN(...) { ... pmemobj_tx_add_range_direct(&tail, ..); tail++; ... pmemobj_tx_add_range_direct(&array[tail], ..); array[tail] = v; } TX_END &tail 10 &array[11] GARBAGE ADDRESS CONTENT UNDO LOG (in NVM) All or nothing semantics in the face of failure Like ACID from DBMS world
  • 42. Profilers ● Tiered memory => Different performance flavors of main memory ● NVM slower and more plentiful than DRAM CPUNVM (most flavors) DRAM Long Latency Low Bandwidth Low Latency High Bandwidth NVM = Tiered performance of main memory
  • 43. Performance and Placement ● Analytics - don’t care about persistence ● Use NVM as cheap, plentiful and slow memory ○ Surprising but projected use of NVM ● Choice of where to place data structures CPU NVM DRAM Long Latency Low Bandwidth Low Latency High Bandwidth
  • 44. Performance and Placement CPU NVM DRAM Long Latency Low Bandwidth Low Latency High Bandwidth Data Structure A: 20 accesses/sec Data Structure B: 10 accesses/sec Storage caching wisdom(aka 5 minute rule): More frequent accesses to faster memory
  • 45. Performance and Placement CPU NVM DRAM Long Latency Low Bandwidth Low Latency High Bandwidth Data Structure A: 20 accesses/sec - sequential scans Data Structure B: 10 accesses/sec - pointer chasing Memory access performance strongly governed by access pattern and not just access frequency Data Tiering in Heterogeneous Memory Systems. Eurosys 2016.Storage caching wisdom is wrong More frequently accessed DS in slower memory!
  • 46. Performance and Placement ● Little’s Law InFlightRequests = Bandwidth * Latency ● Can have larger bandwidth even with longer latency Bandwidth = InFlightRequests/Latency ● OOO CPU pipelines good at increasing InFlightRequests for scans ○ Prefetching ○ Non-Blocking Caches ● Scans should go to longer latency NVM even if more frequent ○ Pointer chasing needs high performance memory
  • 47. X-Mem ● Guide data structure placement via profiling ● Beyond simple cache miss rate optimization ○ Eg. tools like vtune, gprof ● Need to determine access pattern (pointer or scan?) Solution: malloc(size, TAG) + map<Virtual Pages, TAG> ● TAG unique to datastructure: maps memory access to datastructure ● Access profiler determines best memory type for TAG ● malloc maps data structure blocks to pages from correct memory type
  • 48. Example ● MemC3 hash table - An improved memcached Name Frequency Type Location 512B buckets 2% Random DRAM 8K Values 8% Scans NVM
  • 49. Conclusion ● NVM adds a whole new dimension to software engineering ● Opportunities for fundamental breakthroughs ○ Solve system design problems in new ways ○ Eg. fixing synchronous logging in Rocksdb ● Challenges ○ Data structures outlive the code - can’t restart on a bug! ○ Persistent pointers, ordering, processor caches ○ Tiered main memory architecture ● Software engineering solutions ○ New ideas in testing, libraries, profilers ● What will you do with NVM ?
  • 50. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ applications-nvm