perf: add direct-mapped node cache to BTreeMap #416
Conversation
Add a 32-slot direct-mapped node cache to BTreeMap that avoids re-loading hot nodes from stable memory. Modeled after CPU caches: O(1) lookup via `(address / page_size) % 32`; a collision evicts the resident node. Read paths (`get`, `contains_key`, `first/last_key_value`) use a take+return pattern to borrow nodes from the cache without RefCell lifetime issues. Write paths (`insert`, `remove`, split, merge) invalidate affected cache slots.

Key changes:
- Switch `get()` from the destructive `extract_entry_at` to `node.value()`
- Remove the now-unused `extract_entry_at` method
- Change the `traverse()` closure from `Fn(&mut Node)` to `Fn(&Node)`
- Invalidate the cache in `save_node`, `deallocate_node`, `merge`, and `clear_new`

Expected improvement: ~15-20% for random reads, ~65% for hot-key workloads, ~0% overhead for writes (`cache.get_mut()` bypasses RefCell).
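The slot mapping described above can be sketched as follows. `PAGE_SIZE` and `NUM_SLOTS` are illustrative constants for this sketch, not necessarily the crate's actual values:

```rust
// Sketch of the direct-mapped slot lookup described above.
// PAGE_SIZE and NUM_SLOTS are illustrative, not the crate's actual constants.
const PAGE_SIZE: u64 = 1024;
const NUM_SLOTS: u64 = 32;

fn slot_index(address: u64) -> usize {
    ((address / PAGE_SIZE) % NUM_SLOTS) as usize
}

fn main() {
    // Addresses on the same page share a slot.
    assert_eq!(slot_index(2048), slot_index(2050));
    // Addresses exactly NUM_SLOTS pages apart collide: direct-mapped means
    // a collision simply evicts the resident node (no LRU tracking).
    assert_eq!(slot_index(0), slot_index(PAGE_SIZE * NUM_SLOTS));
}
```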
…expose hit/miss stats
src/btreemap.rs
Outdated
```rust
if !self.is_enabled() {
    self.misses += 1;
    return None;
}
```
Just an idea, feel free to ignore: rather than disable the cache (and incur the cost of a branch), might it be preferable to have a cache of size 1?
(As a side note, I would actually be curious to see the difference in benchmark scores from adding this and the stats. Maybe it's all a storm in a teacup. But maybe not.)
If we have a single entry, that will strictly be more computationally expensive (always), and it's unlikely to give any benefit since we'll keep overwriting it all the time.
The size-0 execution path will be correctly predicted every time and costs ~0 cycles in practice.
A size-1 cache, on the other hand, would still allocate a Node on the heap and would collide on every operation (every node maps to slot 0), producing worse miss behaviour than even a small real cache.
So I'd recommend we go with size 0 by default in the next version, and then turn it up to size 32 or larger by default once we get some feedback from production runs.
btw, I now see in the benchmarks with the default 0-size cache some increase in instruction cost, up to +8%.
Would it be possible to have some kind of short circuit so we don't do extra work when the cache size is zero?
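One possible shape for such a short-circuit, as a sketch: a slot vector whose length is the configured cache size, so size 0 bails out on a single well-predicted branch. `NodeCache` and its fields here are illustrative stand-ins, not the PR's actual types:

```rust
// Hypothetical sketch: a zero-sized cache returns immediately from get(),
// skipping slot lookup and stat bookkeeping entirely.
struct NodeCache {
    // (address, node) pairs; String stands in for the real Node<K>.
    slots: Vec<(u64, Option<String>)>,
    hits: u64,
    misses: u64,
}

impl NodeCache {
    fn get(&mut self, addr: u64) -> Option<&String> {
        // Short-circuit: with zero slots there is nothing to look up.
        if self.slots.is_empty() {
            return None;
        }
        let idx = (addr as usize) % self.slots.len();
        if self.slots[idx].0 == addr {
            self.hits += 1;
            return self.slots[idx].1.as_ref();
        }
        self.misses += 1;
        None
    }
}

fn main() {
    let mut disabled = NodeCache { slots: Vec::new(), hits: 0, misses: 0 };
    assert!(disabled.get(1024).is_none());
    // The disabled path does no bookkeeping at all.
    assert_eq!(disabled.misses, 0);
}
```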
> btw I now see in the benchmarks with default 0-size cache size some increase in instruction cost, up to +8%.
This is what I was referring to. Sure, a size-1 cache is going to be more expensive than a size-0 cache. But having the option of a size-0 cache makes all other cache sizes more expensive. And the size-0 cache is more expensive than not having this feature at all.
So I guess it's a trade-off between optimizing for no caching vs. optimizing for caching. I was going for the latter, since caching appears to make a positive change in performance, but we could also stick with no caching as the default.
By inlining one function, we're now roughly in the negative range (marginally improved performance) even with cache disabled.
re: size 1 cache
In the current implementation the direct-mapped cache does not prioritise top-level nodes over lower-level nodes, meaning that a cache of size 1 will always be overwritten on each tree traversal and will always have 0 hits, which is a pure waste of cycles.
Moreover, any cache smaller than the tree height will have 0 hits. So the rough range of cache sizes that make it practically usable:
- lower bound: at least the tree height, maybe x2 to account for collisions
- upper bound: any cache size that reaches a 90-95% hit ratio; above that the ROI is too small
(optional exploration idea) Based on that, I suppose one interesting idea to explore (not now, in future PRs) is to also store in the cache the node's height (or level) and prioritize top nodes (closer to the root) over lower-level ones. Or, if we decide to use a set-associative cache, store cache lines of different levels with prioritization by level.
src/btreemap.rs
Outdated
```rust
pub fn cache_stats(&self) -> CacheStats {
    self.cache.borrow().stats()
}
```
Might be useful to return the cache size in bytes. The end user might have a hard time figuring out the average node size or even just the page size.
There were some requests from users to add methods to stable structures that report memory usage: heap usage, stable memory allocated and actually used. If that is implemented, it also covers any cache, so maybe this should not report back the cache size. Also, if the user configures the cache size, it's expected to be full.
I'm fine if this PR does not implement reporting the cache size.
Reporting the cache size in bytes is easy and cheap, so I just added it.
it wasn't so easy after all, I'll look for another way to do it.
I added a MemSize trait implementation to this branch.
It calculates heap usage with (in my opinion) an acceptable tradeoff between the cost and precision.
Please take a look and let me know what you think. It's ok to revert my changes if they don't work.
src/btreemap.rs
Outdated
````rust
/// map.set_cache_size(stats.size_bytes * 2);
/// }
/// ```
pub fn set_cache_size(&mut self, bytes: u64) {
````
After thinking about this a bit more, I realized we cannot guarantee this cache size limit, because each node holds 11 key-value pairs, which can be unbounded and take any amount of space. So we should probably go back to defining the cache size as an actual number of nodes.
Based on that, if the user wants to know the exact (or at least realistic) cache size but can only define it as a number of nodes, then we need to provide a way to report this data; maybe it should be returned as a field in the stats. We can use something similar to a DataSize trait implementation, or maybe some existing crate.
And with such a way to dynamically change the cache size, I suppose these are the methods we need to cover the cache lifecycle:

```rust
let stats = map.node_cache_stats();      // provides: hits, misses, hit ratio, memory usage
...
map.node_cache_resize(number_of_nodes);  // should also clear hit/miss counters
map.node_cache_clear();                  // same as resizing to the current cache size
```

I don't think we should stick to power-of-2 cache sizes, it's too restrictive; let's allow users to choose their favourite number, the same way one can resize a vector to any number.
Calculating the exact cache size is not trivial after all.
- You'd need to walk all slots and sum up the actual heap allocations inside each `Node` — the `Vec<LazyEntry>` for entries, the `Vec` for children, and each `LazyEntry`'s `OnceCell` contents (which may or may not be materialized)
- This is inherently imprecise: a Rust `Vec` allocates by capacity, not len, so you'd be measuring a lower bound unless you account for allocator overhead, which you can't do portably
- A `Node` doesn't own its values in the cache (that's the whole point of `read_value_uncached`), so the "big" part of the data isn't even cached. The cache holds mostly keys + child addresses + metadata
- Adding a `DataSize` trait or pulling in a crate like `deepsize`/`datasize` is dependency bloat for a feature that not everyone will use
How about this instead?
```rust
/// Returns an estimate of the cache's heap usage in bytes.
///
/// This is a rough upper bound: `num_slots * (page_size + overhead)`.
/// Actual usage is typically lower because cached nodes don't
/// materialize values (only keys and child pointers).
pub fn node_cache_size_bytes_approx(&self) -> usize {
    self.cache_num_slots
        * (self.version.page_size().get() as usize + size_of::<(Address, Option<Node<K>>)>())
}
```
I have committed to the branch an implementation of a MemSize trait which makes the calculation of node cache memory usage approximately accurate and fast, without much cost.
- it does not iterate over all the nodes each time you need to read memory usage; internally it keeps track of memory usage when adding/removing nodes to/from the cache
- the imprecision of vec len vs vec capacity is good enough; I suppose with MemSize one could even measure capacity instead of len
- incorporating MemSize into lazy objects (keys and values) will properly calculate actual heap usage (only keys, when values are not cached)
- MemSize is currently an in-place implementation, so no dependency-bloat issue

The approach of approximate measurement via assuming that each node is a single 'virtual page' (1 KiB) is too imprecise in my opinion.
Let's take a look at those options closely and discuss the tradeoffs.
```rust
+ self.version.mem_size()
+ self.allocator.mem_size()
+ self.length.mem_size()
+ self.node_cache_memory_used()
```
FYI: this line is O(1), it does not traverse all the slots to calculate memory usage, because it tracks the usage on adding/removing the nodes.
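A minimal sketch of that bookkeeping, under the assumption that the cache adjusts a running byte counter whenever a node enters or leaves a slot (names and types are illustrative, not the branch's actual API):

```rust
// Illustrative O(1) memory accounting: put/take adjust a running counter,
// so reading the usage never walks the slots.
struct NodeCache {
    slots: Vec<(u64, Option<Vec<u8>>)>, // Vec<u8> stands in for Node<K>
    memory_used: usize,
}

impl NodeCache {
    fn put(&mut self, addr: u64, node: Vec<u8>) {
        let idx = (addr as usize) % self.slots.len();
        // Evicting a resident node must release its accounted bytes first.
        if let Some(old) = self.slots[idx].1.take() {
            self.memory_used -= old.len();
        }
        self.memory_used += node.len();
        self.slots[idx] = (addr, Some(node));
    }

    fn take(&mut self, addr: u64) -> Option<Vec<u8>> {
        let idx = (addr as usize) % self.slots.len();
        if self.slots[idx].0 == addr {
            if let Some(node) = self.slots[idx].1.take() {
                self.memory_used -= node.len();
                return Some(node);
            }
        }
        None
    }
}

fn main() {
    let mut cache = NodeCache { slots: vec![(0, None); 4], memory_used: 0 };
    cache.put(1, vec![0u8; 100]);
    cache.put(2, vec![0u8; 50]);
    assert_eq!(cache.memory_used, 150); // O(1) read, no slot traversal
    cache.take(1);
    assert_eq!(cache.memory_used, 50);
}
```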
src/btreemap.rs
Outdated
```diff
  }
- let root = self.load_node(self.root_addr);
+ let root = self.take_or_load_node(self.root_addr);
  let (k, encoded_v) = root.get_min(self.memory());
```
Node::get_min and Node::get_max have their own "fast" tree traversal that skips key search, but they always load nodes without leveraging the cache.
Maybe a future optimization would be to move this logic from node level to btreemap level, so that it can still traverse without searching while using the cache.
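The suggested btreemap-level traversal could look roughly like this sketch, where every step goes through the same lookup a cache-aware loader (such as take_or_load_node) would use; a `HashMap` stands in for stable memory plus the cache, and the types are deliberately simplified:

```rust
use std::collections::HashMap;

// Simplified stand-in for a B-tree node: keys plus child addresses.
struct Node {
    keys: Vec<u64>,
    children: Vec<u64>, // empty for leaves
}

// Descend along the leftmost edge without searching keys; because each
// load goes through the shared lookup, hot upper nodes can hit the cache.
fn get_min(nodes: &HashMap<u64, Node>, root: u64) -> Option<u64> {
    let mut addr = root;
    loop {
        let node = nodes.get(&addr)?; // stands in for take_or_load_node
        match node.children.first() {
            Some(&child) => addr = child, // internal node: keep going left
            None => return node.keys.first().copied(), // leaf: smallest key
        }
    }
}

fn main() {
    let mut nodes = HashMap::new();
    nodes.insert(1, Node { keys: vec![10], children: vec![2, 3] });
    nodes.insert(2, Node { keys: vec![3, 7], children: vec![] });
    nodes.insert(3, Node { keys: vec![15], children: vec![] });
    assert_eq!(get_min(&nodes, 1), Some(3));
}
```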
src/btreemap.rs
Outdated
```diff
  }
- let root = self.load_node(self.root_addr);
+ let root = self.take_or_load_node(self.root_addr);
  let (k, encoded_v) = root.get_max(self.memory());
```
```rust
fn put(&mut self, addr: Address, node: Node<K>) {
    debug_assert!(self.is_enabled());
    self.metrics.add_memory_used(node.heap_memory_used());
```
(optional exploration) Temporarily, for debugging, we could add a debug assert checking that no element in a cached node contains a value; if some do, we can inspect how they find their way into the cache and maybe remove them where possible (as in the get_min/get_max traversal case).
This is not critical for this PR and can be done later.
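The debug-only check suggested above could be a one-liner in `put`. A sketch with simplified types — the real `LazyEntry` uses `OnceCell`, here an `Option` stands in, and `Vec<Node>` stands in for the slot array:

```rust
// Simplified entry: value is None until materialized.
struct LazyEntry {
    key: u64,
    value: Option<Vec<u8>>,
}

struct Node {
    entries: Vec<LazyEntry>,
}

fn put(cache: &mut Vec<Node>, node: Node) {
    // Debug-only invariant: nodes entering the cache should carry no
    // materialized values, only keys and child pointers.
    debug_assert!(
        node.entries.iter().all(|e| e.value.is_none()),
        "cached node unexpectedly holds a materialized value"
    );
    cache.push(node);
}

fn main() {
    let mut cache = Vec::new();
    put(&mut cache, Node { entries: vec![LazyEntry { key: 1, value: None }] });
    assert_eq!(cache.len(), 1);
}
```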
```rust
fn take(&mut self, addr: Address) -> Option<Node<K>> {
    debug_assert!(self.is_enabled());
    let idx = self.slot_index(addr);
    if self.slots[idx].0 == addr {
```
This comparison looks a bit suspicious due to NULL, which is Address(0): if take is called with NULL, it'll record a hit, which I don't think is correct. But then again, maybe at this point it's always called for non-NULL values? Maybe it should have an extra check for non-null.
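A sketch of the extra check, assuming NULL is `Address(0)` and slots are zero-initialized (types simplified; the slot array and its contents are illustrative):

```rust
const NULL: u64 = 0;

// Slots are zero-initialized, so an empty slot's address field equals NULL;
// without a guard, take(NULL) would match slot 0 and record a false hit.
fn take(slots: &mut [(u64, Option<String>)], addr: u64) -> Option<String> {
    if addr == NULL {
        return None; // never treat the null address as a cache hit
    }
    let idx = (addr as usize) % slots.len();
    if slots[idx].0 == addr {
        return slots[idx].1.take();
    }
    None
}

fn main() {
    let mut slots = vec![(0u64, Some("resident".to_string())), (0, None)];
    // Without the guard this would have returned the slot-0 contents.
    assert!(take(&mut slots, NULL).is_none());
}
```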
Summary
- Direct-mapped: `(address / page_size) % 32`, collision = eviction (no LRU tracking)
- Read paths (`get`, `contains_key`, `first/last_key_value`) use a take+return pattern to avoid re-loading hot upper-tree nodes from stable memory
- Write paths invalidate affected slots in `save_node`, `deallocate_node`, `merge`, and `clear_new`
- Switched `get()` from the destructive `extract_entry_at` (swap_remove) to the non-destructive `node.value()` (borrows via OnceCell)
- Removed the unused `extract_entry_at` method

This subsumes all four previous caching approaches (root-only, LRU+clone, LRU+Rc, page cache) into a single design that:
- stores `Node<K>` directly (no Rc, no Clone, no heap allocation per cache entry)
- uses `cache.get_mut()` on write paths (zero RefCell overhead)

Expected improvement: ~15-20% for random reads, ~65% for hot-key workloads, ~0% overhead for writes.