SlideShare a Scribd company logo
1 of 78
EBtree
QCon New York, June 24-26, 2019
Design for a scheduler, and use (almost)
everywhere
Andjelko Iharos
aiharos@haproxy.com
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
ebtree-design/
Presented at QCon New York
www.qconnewyork.com
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
EBtree
QCon New York, June 24-26, 2019
Design for a scheduler, and use (almost)
everywhere
Andjelko Iharos
aiharos@haproxy.com
● Fast tree descent & search
● Memory efficient
● Lookup by mask or prefix (i.e. IPv4 and IPv6)
● Optimized for inserts and deletes
● Great with bit-addressable data
EBtree features
● Scheduling requirements
● Candidate solutions
● EBtree design
● Implementation
● Production use
● Results
Outline
Scheduling
requirements
● Handle network connections
● Run active tasks
● Check suspended tasks, wake them up
HAProxy event loop
HAProxy event loop
HAProxy task
● Active & suspended tasks
● Insert
● Duplicates
● Read sorted
● Delete
● Priorities
Scheduler features
● Up to high frequency of events
● Up to very large number of entries
● Large variations in rate of entry change
● Frequent lookups
Scheduling environment
● Speed
● Predictability
● Simplicity
Desirable qualities
Candidate
solutions
● Array
● Linked list
● Stack, Queue
● Hash Map
● Tree
Basic data structures
Linked list
By Lasindi - Own work, Public Domain, https://commons.wikimedia.org/w/index.php?curid=2194027
By No machine-readable author provided. Dcoetzee assumed (based on copyright claims). - No machine-readable source provided. Own work assumed (based
on copyright claims)., Public Domain, https://commons.wikimedia.org/w/index.php?curid=488330
Binary search tree
By Bruno Schalch - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=64250599
AVL tree rotations
By Claudio Rocchini - Own work, CC BY 2.5, https://commons.wikimedia.org/w/index.php?curid=2118795
Prefix (Radix) trees
● O(log n) insert, O(1) delete
● Fast comparison even for long keys
● Prefix matching
● Nodes and leaves are different
● Not balanced
Prefix (Radix) trees
EBtree design
● Simplify memory management
● Reduce impact of imbalance and tree height
Can we improve?
Basic elements
Inserting elements
0
0
0
1
0
0
1
0
0
0
1
1
0
1
0
0
0
1
0
1
Deleting elements
Implementation
● 32 bits = 4 bytes word, 2 bits available
● 64 bits = 8 bytes word, 3 bits available
0x846010 = 100001000110000000010000
0x846030 = 100001000110000000110000
0x846050 = 100001000110000001010000
Pointer tagging
● regparm compiler directive (historical reason)
● forced inlining
● __builtin_expect
● ASM
C is portable Assembler
typedef void eb_troot_t;
struct eb_root {
eb_troot_t *b[2]; /* left and right branches */
};
struct eb_node {
struct eb_root branches; /* branches, must be at the beginning */
short int bit; /* link's bit position. */
eb_troot_t *node_p; /* link node's parent */
eb_troot_t *leaf_p; /* leaf node's parent */
};
Base structs
ebtree/ebtree.h
/* Return next leaf node after an existing leaf node, or NULL if none. */
static inline struct eb_node *eb_next(struct eb_node *node)
{
eb_troot_t *t = node->leaf_p;
while (eb_gettag(t) != EB_LEFT)
/* Walking up from right branch, so we cannot be below root */
t = (eb_root_to_node(eb_untag(t, EB_RGHT)))->node_p;
/* Note that <t> cannot be NULL at this stage */
t = (eb_untag(t, EB_LEFT))->b[EB_RGHT];
if (eb_clrtag(t) == NULL)
return NULL;
return eb_walk_down(t, EB_LEFT);
}
Base functions
eb_next() in ebtree/ebtree.h
● eb32 / eb64
● ebpt for pointers
● ebim and ebis for indirect memory and strings
● ebmb and ebst for memory block and strings
(allocated after just after the node)
● All support storage and ordered retrieval of duplicate
keys
EBtree data types
struct eb64_node {
struct eb_node node; /* the tree node, must be at the beginning */
u64 key;
};
eb64 node
ebtree/eb64tree.h
troot = root->b[EB_LEFT];
if (unlikely(troot == NULL))
return NULL;
while (1) {
if ((eb_gettag(troot) == EB_LEAF)) {
node = container_of(eb_untag(troot, EB_LEAF),
struct eb64_node, node.branches);
if (node->key == x)
return node;
else
return NULL;
}
eb64 specific functions
eb64_lookup() in ebtree/eb64tree.h
node = container_of(eb_untag(troot, EB_NODE),
struct eb64_node, node.branches);
y = node->key ^ x;
if (!y) {
/* Either we found the node which holds the key, or
* we have a dup tree. */
return node;
}
if ((y >> node->node.bit) >= EB_NODE_BRANCHES) /* 2 */
return NULL; /* no more common bits */
troot = node->node.branches.b[(x >> node->node.bit) &
EB_NODE_BRANCH_MASK];
}
}
eb64 specific functions
eb64_lookup() in ebtree/eb64tree.h
Production use
● computational load associated with a proxied
connection
● active
● suspended
● millisecond resolution
HAProxy tasks
● EBtree
● indexed on expiration date
Suspended HAProxy tasks
● EBtree
● indexed on expiration date, taking priority into
consideration
Active HAProxy tasks
while (1) {
/* Process a few tasks */
process_runnable_tasks();
/* Check if we can expire some tasks */
next = wake_expired_tasks();
/* expire immediately if events are pending */
if (fd_cache_num || run_queue) next = now_ms;
/* The poller will ensure it returns around <next> */
cur_poller.poll(&cur_poller, next);
fd_process_cached_events();
}
HAProxy event loop
run_poll_loop() in haproxy-1.7.x/src/haproxy.c
/* The base for all tasks */
struct task {
struct eb32_node rq; /* ebtree node used to hold the task in the run queue */
struct eb32_node wq; /* ebtree node used to hold the task in the wait queue */
unsigned short state; /* task state : bit field of TASK_* */
short nice; /* the task's current nice value from -1024 to +1024 */
unsigned int calls; /* number of times ->process() was called */
struct task * (*process)(struct task *t); /* the function which processes the task */
void *context; /* the task's context */
int expire; /* next expiration date for this task, in ticks */
};
Task struct
haproxy-1.7.x/include/types/task.h
if (likely(last_timer && last_timer->node.bit < 0 &&
last_timer->key == task->wq.key && last_timer->node.node_p)) {
eb_insert_dup(&last_timer->node, &task->wq.node);
if (task->wq.node.bit < last_timer->node.bit)
last_timer = &task->wq;
return;
}
eb32_insert(&timers, &task->wq);
/* Make sure we don't assign the last_timer to a node-less entry */
if (task->wq.node.node_p && (!last_timer || (task->wq.node.bit < last_timer->node.bit)))
last_timer = &task->wq;
return;
}
Scheduling tasks for later
__task_queue() in haproxy-1.7.x/src/task.c
if (likely(t->nice)) {
int offset;
niced_tasks++;
if (likely(t->nice > 0))
offset = (unsigned)((tasks_run_queue * (unsigned int)t->nice) / 32U);
else
offset = -(unsigned)((tasks_run_queue * (unsigned int)-t->nice) / 32U);
t->rq.key += offset;
}
eb32_insert(&rqueue, &t->rq);
rq_next = NULL;
return t;
}
Waking up tasks to run
__task_wakeup() in haproxy-1.7.x/src/task.c
while (max_processed--) {
if (unlikely(!rq_next)) {
rq_next = eb32_lookup_ge(&rqueue, rqueue_ticks - TIMER_LOOK_BACK);
if (!rq_next) {
/* we might have reached the end of the tree, typically because
* <rqueue_ticks> is in the first half and we're first scanning
* the last half. Let's loop back to the beginning of the tree now.
*/
rq_next = eb32_first(&rqueue);
if (!rq_next)
break;
}
}
Running tasks
process_runnable_tasks() in haproxy-1.7.x/src/task.c
t = eb32_entry(rq_next, struct task, rq);
rq_next = eb32_next(rq_next);
__task_unlink_rq(t);
t->state |= TASK_RUNNING;
t->calls++;
t = t->process(t);
if (likely(t != NULL)) {
t->state &= ~TASK_RUNNING;
if (t->expire)
task_queue(t);
}
}
Running tasks
process_runnable_tasks() in haproxy-1.7.x/src/task.c
● timers
● schedulers
● ACL
● stick-tables (stats, counters)
● LRU cache
EBtree in HAProxy
● Down to 100ns inserts
● > 200k TCP conn/s
● > 350k HTTP req/s
● scheduler using up only 3-5% CPU
● Halog utility - up to 4 million log lines per second
● 450000 BGP routes table: >2 million lookups per
second
EBtree performing in HAProxy
struct lru64_list {
struct lru64_list *n;
struct lru64_list *p;
};
struct lru64_head {
struct lru64_list list;
struct eb_root keys;
struct lru64 *spare;
int cache_size;
int cache_usage;
};
LRU cache structs
ebtree/examples/lru.h
struct lru64 {
struct eb64_node node; /* indexing key, typically a hash64 */
struct lru64_list lru; /* LRU list */
void *domain; /* who this data belongs to */
unsigned long long revision; /* data revision (to avoid use-after-free) */
void *data; /* returned value, user decides how to use this */
};
LRU cache structs
ebtree/examples/lru.h
struct lru64 *lru64_get(unsigned long long key, struct lru64_head *lru,
void *domain, unsigned long long revision)
{
struct eb64_node *node;
struct lru64 *elem;
if (!lru->spare) {
if (!lru->cache_size)
return NULL;
lru->spare = malloc(sizeof(*lru->spare));
if (!lru->spare)
return NULL;
lru->spare->domain = NULL;
}
LRU cache get/store
lru64_get() in ebtree/examples/lru.c
/* Lookup or insert */
lru->spare->node.key = key;
node = __eb64_insert(&lru->keys, &lru->spare->node);
elem = container_of(node, typeof(*elem), node);
if (elem != lru->spare) {
/* Existing entry found, check validity then move it at the head of the LRU list. */
return elem;
}
else {
/* New entry inserted, initialize and move to the head of the
* LRU list, and lock it until commit. */
lru->cache_usage++;
lru->spare = NULL; // used, need a new one next time
}
LRU cache get/store
lru64_get() in ebtree/examples/lru.c
if (lru->cache_usage > lru->cache_size) {
struct lru64 *old;
old = container_of(lru->list.p, typeof(*old), lru);
if (old->domain) {
/* not locked */
LIST_DEL(&old->lru);
__eb64_delete(&old->node);
if (!lru->spare)
lru->spare = old;
else
free(old);
lru->cache_usage--;
}
}
LRU cache get/store
lru64_get() in ebtree/examples/lru.c
Results
● Fast tree descent & search
● Memory efficient
● Lookup by mask or prefix (i.e. IPv4 and IPv6)
● Optimized for inserts and deletes
● Great with bit-addressable data
EBtree features
Check out EBtree at http://git.1wt.eu/web/ebtree.git/
Check out HAProxy at haproxy.org or haproxy.com
Join development at haproxy@formilux.org
aiharos@haproxy.com
Q&A
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
ebtree-design/

More Related Content

What's hot

Putting a Fork in Fork (Linux Process and Memory Management)
Putting a Fork in Fork (Linux Process and Memory Management)Putting a Fork in Fork (Linux Process and Memory Management)
Putting a Fork in Fork (Linux Process and Memory Management)
David Evans
 
My talk about Tarantool and Lua at Percona Live 2016
My talk about Tarantool and Lua at Percona Live 2016My talk about Tarantool and Lua at Percona Live 2016
My talk about Tarantool and Lua at Percona Live 2016
Konstantin Osipov
 
Multi-Tasking Map (MapReduce, Tasks in Rust)
Multi-Tasking Map (MapReduce, Tasks in Rust)Multi-Tasking Map (MapReduce, Tasks in Rust)
Multi-Tasking Map (MapReduce, Tasks in Rust)
David Evans
 
20100712-OTcl Command -- Getting Started
20100712-OTcl Command -- Getting Started20100712-OTcl Command -- Getting Started
20100712-OTcl Command -- Getting Started
Teerawat Issariyakul
 
Pylons + Tokyo Cabinet
Pylons + Tokyo CabinetPylons + Tokyo Cabinet
Pylons + Tokyo Cabinet
Ben Cheng
 

What's hot (20)

Memory management
Memory managementMemory management
Memory management
 
Virtual Memory (Making a Process)
Virtual Memory (Making a Process)Virtual Memory (Making a Process)
Virtual Memory (Making a Process)
 
Putting a Fork in Fork (Linux Process and Memory Management)
Putting a Fork in Fork (Linux Process and Memory Management)Putting a Fork in Fork (Linux Process and Memory Management)
Putting a Fork in Fork (Linux Process and Memory Management)
 
Nicety of Java 8 Multithreading
Nicety of Java 8 MultithreadingNicety of Java 8 Multithreading
Nicety of Java 8 Multithreading
 
RealmDB for Android
RealmDB for AndroidRealmDB for Android
RealmDB for Android
 
Making a Process
Making a ProcessMaking a Process
Making a Process
 
Redis: Lua scripts - a primer and use cases
Redis: Lua scripts - a primer and use casesRedis: Lua scripts - a primer and use cases
Redis: Lua scripts - a primer and use cases
 
My talk about Tarantool and Lua at Percona Live 2016
My talk about Tarantool and Lua at Percona Live 2016My talk about Tarantool and Lua at Percona Live 2016
My talk about Tarantool and Lua at Percona Live 2016
 
Multi-Tasking Map (MapReduce, Tasks in Rust)
Multi-Tasking Map (MapReduce, Tasks in Rust)Multi-Tasking Map (MapReduce, Tasks in Rust)
Multi-Tasking Map (MapReduce, Tasks in Rust)
 
NS2 Object Construction
NS2 Object ConstructionNS2 Object Construction
NS2 Object Construction
 
Intro to Rust from Applicative / NY Meetup
Intro to Rust from Applicative / NY MeetupIntro to Rust from Applicative / NY Meetup
Intro to Rust from Applicative / NY Meetup
 
20100712-OTcl Command -- Getting Started
20100712-OTcl Command -- Getting Started20100712-OTcl Command -- Getting Started
20100712-OTcl Command -- Getting Started
 
Ganga: an interface to the LHC computing grid
Ganga: an interface to the LHC computing gridGanga: an interface to the LHC computing grid
Ganga: an interface to the LHC computing grid
 
서버 개발자가 바라 본 Functional Reactive Programming with RxJava - SpringCamp2015
서버 개발자가 바라 본 Functional Reactive Programming with RxJava - SpringCamp2015서버 개발자가 바라 본 Functional Reactive Programming with RxJava - SpringCamp2015
서버 개발자가 바라 본 Functional Reactive Programming with RxJava - SpringCamp2015
 
packet destruction in NS2
packet destruction in NS2packet destruction in NS2
packet destruction in NS2
 
Scheduling
SchedulingScheduling
Scheduling
 
NS2 Shadow Object Construction
NS2 Shadow Object ConstructionNS2 Shadow Object Construction
NS2 Shadow Object Construction
 
Pylons + Tokyo Cabinet
Pylons + Tokyo CabinetPylons + Tokyo Cabinet
Pylons + Tokyo Cabinet
 
2013.02.02 지앤선 테크니컬 세미나 - Xcode를 활용한 디버깅 팁(OSXDEV)
2013.02.02 지앤선 테크니컬 세미나 - Xcode를 활용한 디버깅 팁(OSXDEV)2013.02.02 지앤선 테크니컬 세미나 - Xcode를 활용한 디버깅 팁(OSXDEV)
2013.02.02 지앤선 테크니컬 세미나 - Xcode를 활용한 디버깅 팁(OSXDEV)
 
Preparation for mit ose lab4
Preparation for mit ose lab4Preparation for mit ose lab4
Preparation for mit ose lab4
 

Similar to EBtree - Design for a Scheduler and Use (Almost) Everywhere

Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014
Holden Karau
 

Similar to EBtree - Design for a Scheduler and Use (Almost) Everywhere (20)

Dragoncraft Architectural Overview
Dragoncraft Architectural OverviewDragoncraft Architectural Overview
Dragoncraft Architectural Overview
 
Beyond Breakpoints: A Tour of Dynamic Analysis
Beyond Breakpoints: A Tour of Dynamic AnalysisBeyond Breakpoints: A Tour of Dynamic Analysis
Beyond Breakpoints: A Tour of Dynamic Analysis
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with Hadoop
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
 
Writing MySQL UDFs
Writing MySQL UDFsWriting MySQL UDFs
Writing MySQL UDFs
 
Return of c++
Return of c++Return of c++
Return of c++
 
RR & Docker @ MuensteR Meetup (Sep 2017)
RR & Docker @ MuensteR Meetup (Sep 2017)RR & Docker @ MuensteR Meetup (Sep 2017)
RR & Docker @ MuensteR Meetup (Sep 2017)
 
Introduction to redis - version 2
Introduction to redis - version 2Introduction to redis - version 2
Introduction to redis - version 2
 
What's new in Redis v3.2
What's new in Redis v3.2What's new in Redis v3.2
What's new in Redis v3.2
 
2 architecture anddatastructures
2 architecture anddatastructures2 architecture anddatastructures
2 architecture anddatastructures
 
4Developers 2018: Ile (nie) wiesz o strukturach w .NET (Łukasz Pyrzyk)
4Developers 2018: Ile (nie) wiesz o strukturach w .NET (Łukasz Pyrzyk)4Developers 2018: Ile (nie) wiesz o strukturach w .NET (Łukasz Pyrzyk)
4Developers 2018: Ile (nie) wiesz o strukturach w .NET (Łukasz Pyrzyk)
 
Ropython-windbg-python-extensions
Ropython-windbg-python-extensionsRopython-windbg-python-extensions
Ropython-windbg-python-extensions
 
freertos-proj.pdf
freertos-proj.pdffreertos-proj.pdf
freertos-proj.pdf
 
ROS Course Slides Course 2.pdf
ROS Course Slides Course 2.pdfROS Course Slides Course 2.pdf
ROS Course Slides Course 2.pdf
 
Deep Dive into Futures and the Parallel Programming Library
Deep Dive into Futures and the Parallel Programming LibraryDeep Dive into Futures and the Parallel Programming Library
Deep Dive into Futures and the Parallel Programming Library
 
Introduction to eBPF and XDP
Introduction to eBPF and XDPIntroduction to eBPF and XDP
Introduction to eBPF and XDP
 
stackconf 2022: Cluster Management: Heterogeneous, Lightweight, Safe. Pick Three
stackconf 2022: Cluster Management: Heterogeneous, Lightweight, Safe. Pick Threestackconf 2022: Cluster Management: Heterogeneous, Lightweight, Safe. Pick Three
stackconf 2022: Cluster Management: Heterogeneous, Lightweight, Safe. Pick Three
 
RxJava on Android
RxJava on AndroidRxJava on Android
RxJava on Android
 
.NET Multithreading/Multitasking
.NET Multithreading/Multitasking.NET Multithreading/Multitasking
.NET Multithreading/Multitasking
 
Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014
 

More from C4Media

More from C4Media (20)

Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoStreaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live Video
 
Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy Mobile
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java Applications
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No Keeper
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like Owners
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate Guide
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 

Recently uploaded

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Recently uploaded (20)

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 

EBtree - Design for a Scheduler and Use (Almost) Everywhere

  • 1. EBtree QCon New York, June 24-26, 2019 Design for a scheduler, and use (almost) everywhere Andjelko Iharos aiharos@haproxy.com
  • 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ ebtree-design/
  • 3. Presented at QCon New York www.qconnewyork.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • 4.
  • 5.
  • 6.
  • 7. EBtree QCon New York, June 24-26, 2019 Design for a scheduler, and use (almost) everywhere Andjelko Iharos aiharos@haproxy.com
  • 8.
  • 9. ● Fast tree descent & search ● Memory efficient ● Lookup by mask or prefix (i.e. IPv4 and IPv6) ● Optimized for inserts and deletes ● Great with bit-addressable data EBtree features
  • 10. ● Scheduling requirements ● Candidate solutions ● EBtree design ● Implementation ● Production use ● Results Outline
  • 12. ● Handle network connections ● Run active tasks ● Check suspended tasks, wake them up HAProxy event loop
  • 15. ● Active & suspended tasks ● Insert ● Duplicates ● Read sorted ● Delete ● Priorities Scheduler features
  • 16.
  • 17. ● Up to high frequency of events ● Up to very large number of entries ● Large variations in rate of entry change ● Frequent lookups Scheduling environment
  • 18. ● Speed ● Predictability ● Simplicity Desirable qualities
  • 20. ● Array ● Linked list ● Stack, Queue ● Hash Map ● Tree Basic data structures
  • 21. Linked list By Lasindi - Own work, Public Domain, https://commons.wikimedia.org/w/index.php?curid=2194027
  • 22. By No machine-readable author provided. Dcoetzee assumed (based on copyright claims). - No machine-readable source provided. Own work assumed (based on copyright claims)., Public Domain, https://commons.wikimedia.org/w/index.php?curid=488330 Binary search tree
  • 23. By Bruno Schalch - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=64250599 AVL tree rotations
  • 24. By Claudio Rocchini - Own work, CC BY 2.5, https://commons.wikimedia.org/w/index.php?curid=2118795 Prefix (Radix) trees
  • 25. ● O(log n) insert, O(1) delete ● Fast comparison even for long keys ● Prefix matching ● Nodes and leaves are different ● Not balanced Prefix (Radix) trees
  • 27.
  • 28. ● Simplify memory management ● Reduce impact of imbalance and tree height Can we improve?
  • 30.
  • 31.
  • 32.
  • 33.
  • 34.
  • 35.
  • 37.
  • 38.
  • 41.
  • 42.
  • 44.
  • 45. ● 32 bits = 4 bytes word, 2 bits available ● 64 bits = 8 bytes word, 3 bits available 0x846010 = 100001000110000000010000 0x846030 = 100001000110000000110000 0x846050 = 100001000110000001010000 Pointer tagging
  • 46. ● regparm compiler directive (historical reason) ● forced inlining ● __builtin_expect ● ASM C is portable Assembler
  • 47.
  • 48. typedef void eb_troot_t; struct eb_root { eb_troot_t *b[2]; /* left and right branches */ }; struct eb_node { struct eb_root branches; /* branches, must be at the beginning */ short int bit; /* link's bit position. */ eb_troot_t *node_p; /* link node's parent */ eb_troot_t *leaf_p; /* leaf node's parent */ }; Base structs ebtree/ebtree.h
  • 49. /* Return next leaf node after an existing leaf node, or NULL if none. */ static inline struct eb_node *eb_next(struct eb_node *node) { eb_troot_t *t = node->leaf_p; while (eb_gettag(t) != EB_LEFT) /* Walking up from right branch, so we cannot be below root */ t = (eb_root_to_node(eb_untag(t, EB_RGHT)))->node_p; /* Note that <t> cannot be NULL at this stage */ t = (eb_untag(t, EB_LEFT))->b[EB_RGHT]; if (eb_clrtag(t) == NULL) return NULL; return eb_walk_down(t, EB_LEFT); } Base functions eb_next() in ebtree/ebtree.h
  • 50. ● eb32 / eb64 ● ebpt for pointers ● ebim and ebis for indirect memory and strings ● ebmb and ebst for memory block and strings (allocated after just after the node) ● All support storage and ordered retrieval of duplicate keys EBtree data types
  • 51. struct eb64_node { struct eb_node node; /* the tree node, must be at the beginning */ u64 key; }; eb64 node ebtree/eb64tree.h
  • 52. troot = root->b[EB_LEFT]; if (unlikely(troot == NULL)) return NULL; while (1) { if ((eb_gettag(troot) == EB_LEAF)) { node = container_of(eb_untag(troot, EB_LEAF), struct eb64_node, node.branches); if (node->key == x) return node; else return NULL; } eb64 specific functions eb64_lookup() in ebtree/eb64tree.h
  • 53. node = container_of(eb_untag(troot, EB_NODE), struct eb64_node, node.branches); y = node->key ^ x; if (!y) { /* Either we found the node which holds the key, or * we have a dup tree. */ return node; } if ((y >> node->node.bit) >= EB_NODE_BRANCHES) /* 2 */ return NULL; /* no more common bits */ troot = node->node.branches.b[(x >> node->node.bit) & EB_NODE_BRANCH_MASK]; } } eb64 specific functions eb64_lookup() in ebtree/eb64tree.h
  • 55. ● computational load associated with a proxied connection ● active ● suspended ● millisecond resolution HAProxy tasks
  • 56. ● EBtree ● indexed on expiration date Suspended HAProxy tasks
  • 57. ● EBtree ● indexed on expiration date, taking priority into consideration Active HAProxy tasks
  • 58.
  • 59. while (1) { /* Process a few tasks */ process_runnable_tasks(); /* Check if we can expire some tasks */ next = wake_expired_tasks(); /* expire immediately if events are pending */ if (fd_cache_num || run_queue) next = now_ms; /* The poller will ensure it returns around <next> */ cur_poller.poll(&cur_poller, next); fd_process_cached_events(); } HAProxy event loop run_poll_loop() in haproxy-1.7.x/src/haproxy.c
  • 60.
  • 61. /* The base for all tasks */ struct task { struct eb32_node rq; /* ebtree node used to hold the task in the run queue */ struct eb32_node wq; /* ebtree node used to hold the task in the wait queue */ unsigned short state; /* task state : bit field of TASK_* */ short nice; /* the task's current nice value from -1024 to +1024 */ unsigned int calls; /* number of times ->process() was called */ struct task * (*process)(struct task *t); /* the function which processes the task */ void *context; /* the task's context */ int expire; /* next expiration date for this task, in ticks */ }; Task struct haproxy-1.7.x/include/types/task.h
  • 62. if (likely(last_timer && last_timer->node.bit < 0 && last_timer->key == task->wq.key && last_timer->node.node_p)) { eb_insert_dup(&last_timer->node, &task->wq.node); if (task->wq.node.bit < last_timer->node.bit) last_timer = &task->wq; return; } eb32_insert(&timers, &task->wq); /* Make sure we don't assign the last_timer to a node-less entry */ if (task->wq.node.node_p && (!last_timer || (task->wq.node.bit < last_timer->node.bit))) last_timer = &task->wq; return; } Scheduling tasks for later __task_queue() in haproxy-1.7.x/src/task.c
  • 63. if (likely(t->nice)) { int offset; niced_tasks++; if (likely(t->nice > 0)) offset = (unsigned)((tasks_run_queue * (unsigned int)t->nice) / 32U); else offset = -(unsigned)((tasks_run_queue * (unsigned int)-t->nice) / 32U); t->rq.key += offset; } eb32_insert(&rqueue, &t->rq); rq_next = NULL; return t; } Waking up tasks to run __task_wakeup() in haproxy-1.7.x/src/task.c
  • 64. while (max_processed--) { if (unlikely(!rq_next)) { rq_next = eb32_lookup_ge(&rqueue, rqueue_ticks - TIMER_LOOK_BACK); if (!rq_next) { /* we might have reached the end of the tree, typically because * <rqueue_ticks> is in the first half and we're first scanning * the last half. Let's loop back to the beginning of the tree now. */ rq_next = eb32_first(&rqueue); if (!rq_next) break; } } Running tasks process_runnable_tasks() in haproxy-1.7.x/src/task.c
  • 65. t = eb32_entry(rq_next, struct task, rq); rq_next = eb32_next(rq_next); __task_unlink_rq(t); t->state |= TASK_RUNNING; t->calls++; t = t->process(t); if (likely(t != NULL)) { t->state &= ~TASK_RUNNING; if (t->expire) task_queue(t); } } Running tasks process_runnable_tasks() in haproxy-1.7.x/src/task.c
  • 66. ● timers ● schedulers ● ACL ● stick-tables (stats, counters) ● LRU cache EBtree in HAProxy
  • 67. ● Down to 100ns inserts ● > 200k TCP conn/s ● > 350k HTTP req/s ● scheduler using up only 3-5% CPU ● Halog utility - up to 4 million log lines per second ● 450000 BGP routes table: >2 million lookups per second EBtree performing in HAProxy
  • 68.
  • 69. struct lru64_list { struct lru64_list *n; struct lru64_list *p; }; struct lru64_head { struct lru64_list list; struct eb_root keys; struct lru64 *spare; int cache_size; int cache_usage; }; LRU cache structs ebtree/examples/lru.h
  • 70. struct lru64 { struct eb64_node node; /* indexing key, typically a hash64 */ struct lru64_list lru; /* LRU list */ void *domain; /* who this data belongs to */ unsigned long long revision; /* data revision (to avoid use-after-free) */ void *data; /* returned value, user decides how to use this */ }; LRU cache structs ebtree/examples/lru.h
  • 71. struct lru64 *lru64_get(unsigned long long key, struct lru64_head *lru, void *domain, unsigned long long revision) { struct eb64_node *node; struct lru64 *elem; if (!lru->spare) { if (!lru->cache_size) return NULL; lru->spare = malloc(sizeof(*lru->spare)); if (!lru->spare) return NULL; lru->spare->domain = NULL; } LRU cache get/store lru64_get() in ebtree/examples/lru.c
  • 72. /* Lookup or insert */ lru->spare->node.key = key; node = __eb64_insert(&lru->keys, &lru->spare->node); elem = container_of(node, typeof(*elem), node); if (elem != lru->spare) { /* Existing entry found, check validity then move it at the head of the LRU list. */ return elem; } else { /* New entry inserted, initialize and move to the head of the * LRU list, and lock it until commit. */ lru->cache_usage++; lru->spare = NULL; // used, need a new one next time } LRU cache get/store lru64_get() in ebtree/examples/lru.c
  • 73. if (lru->cache_usage > lru->cache_size) { struct lru64 *old; old = container_of(lru->list.p, typeof(*old), lru); if (old->domain) { /* not locked */ LIST_DEL(&old->lru); __eb64_delete(&old->node); if (!lru->spare) lru->spare = old; else free(old); lru->cache_usage--; } } LRU cache get/store lru64_get() in ebtree/examples/lru.c
  • 75.
  • 76. ● Fast tree descent & search ● Memory efficient ● Lookup by mask or prefix (i.e. IPv4 and IPv6) ● Optimized for inserts and deletes ● Great with bit-addressable data EBtree features
  • 77. Check out EBtree at http://git.1wt.eu/web/ebtree.git/ Check out HAProxy at haproxy.org or haproxy.com Join development at haproxy@formilux.org aiharos@haproxy.com Q&A
  • 78. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ ebtree-design/