Skip to content

Flow eviction and LFT/UFT lifecycle rework#990

Open
FelixMcFelix wants to merge 6 commits into
masterfrom
eviction-policy
Open

Flow eviction and LFT/UFT lifecycle rework#990
FelixMcFelix wants to merge 6 commits into
masterfrom
eviction-policy

Conversation

@FelixMcFelix
Copy link
Copy Markdown
Collaborator

This PR reworks the LFT/UFT lifecycle so that inbound/outbound entries are correctly bound by a single lifetime, and so that a slowpath packet tracks all LFTs which were used in its handling. We then associate this list of LFTs with the flow UFT, and the TCP flow entry if created.

This allows us to correctly keep LFTs alive so long as a UFT/TCP entry refers to them. This has been a longstanding bug. The new and really useful feature is that this allows us to remove all related and dependent entries when we deliberately evict an LFT to make space for a new flow.

An expiry policy for any table can now also specifiy an "eviction priority" for a given flow. Generally speaking most flows are unevictable so long as they are well-behaved, but the main issue is that we need information from elsewhere to know when we can make this determination. How we tackle this is by looking at all descendent flows which rely on a given flow. Most will state no preference on the flow lifecycle, but the TCP flow entry can return a non-zero priority when inactive in a given state for too long.

Closes #986.
Closes #788.
Closes #789.

@FelixMcFelix FelixMcFelix self-assigned this May 18, 2026
This PR reworks the LFT/UFT lifecycle so that inbound/outbound entries
are correctly bound by a single lifetime, and so that a slowpath packet
tracks all LFTs which were used in its handling. We then associate this
list of LFTs with the flow UFT, and the TCP flow entry if created.

This allows us to correctly keep LFTs alive so long as a UFT/TCP entry
refers to them. This has been a longstanding bug. The new and really
useful feature is that this allows us to remove all related and
dependent entries when we deliberately evict an LFT to make space for a
new flow.

An expiry policy for any table can now also specifiy an "eviction
priority" for a given flow. Generally speaking most flows are
unevictable so long as they are well-behaved, but the main issue is that
we need information from elsewhere to *know* when we can make this
determination. How we tackle this is by looking at all descendent flows
which rely on a given flow. Most will state no preference on the flow
lifecycle, but the TCP flow entry can return a non-zero priority when
inactive in a given state for too long.

Closes #986.
Closes #788.
Closes #789.
@FelixMcFelix FelixMcFelix marked this pull request as ready for review May 20, 2026 15:59
@FelixMcFelix
Copy link
Copy Markdown
Collaborator Author

From what I can tell this appears to have the same throughput on glasgow as master. I was mildly concerned that changing the Arc<TcpFlowEntryState> held by a UFT into a Weak<_> in the name of consistent leak prevention would have had a mild impact, due to the calls to Weak::upgrade (similar enough cost to an Arc::clone). Otherwise the extra work we're doing affects only the slowpath and the periodic reaper callback.

Copy link
Copy Markdown
Contributor

@rcgoodfellow rcgoodfellow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting this together @FelixMcFelix. Questions and comments follow.

On the #986 front, while this PR does not introduce the provisional table we discussed, it looks like it does satisfy properties (1) and (2) from that ticket and therefore would satisfy the intent of the ticket. For (1) we only mark connections as eligible for eviction if the last hit is older than the TTL for the connection state they are in. For (2) the priority based eviction seems to cover that, but I have some questions on how it's implemented below.


/// Operations that the main flow identifier type for a given [`NetworkImpl`]
/// must support.
// This should live in `opte::api`, but we don't want to introduce an
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's file a follow up issue for this.

// should be immune to timer-driven expiry. They are *not* immune to
// explicit eviction.
//
// We don't arrange the UFT as a parent of the TCP flow entry because
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just food for thought, I wonder if the three way relationship between UFT/LFT/TFT is more trilateral in nature than parent/child. E.g. we just want to have the three way associations in place through pointers to be able to have policy over all three that may evolve over time and make the parent/child relationship less of a thing.

Direction::Out => &mut inner.outbound_parents,
};

core::mem::swap(new_parent_slot, &mut parents);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the reaper runs when the parent/child lists are potentially inconsistent between the parent vec swap here and the child addition/removals below? (or other threads that may hit this inconsistent window, consecutive slowpath packets on the same flow?)

// Ordinarily we will expire flows in these states quickly. If
// we are under table pressure, we can allow them to be
// explicitly selected for removal faster.
a if a > self.incipient_ttl.as_milliseconds() / 2 => {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider making the 2 a constant with some rationale on why this divisor was selected.

// quickly, but leave them in place for the ~120s when we are
// not under pressure.
a if a > self.incipient_ttl.as_milliseconds() => {
Some(EvictionPriority::Evictable(NonZeroU16::MIN))
Copy link
Copy Markdown
Contributor

@rcgoodfellow rcgoodfellow May 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For flows that are beyond the TTL, should the priority reflect how far beyond the TTL they are, something like so the flows furthest beyond their TTL get evicted first? Same question for line 3182 below.

// Allow established flows to be evicted on the same time scale as
// UFT/LFT expiry, but if we are not under pressure then we aim to
// keep the LFTs in place even while the flow is inactive.
TcpState::Established
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the established case, we may want to have this be tunable per OPTE port. Maybe I'm not totally understanding the model, but if we get to a place where a flow exists for a TCP session with a sparse communication pattern that has intervals of silence beyond 60 seconds, the flow will stay in place until the system comes under pressure, then the flow may be evicted in favor of new flows due to high connection churn. In that situation, is the flow dead or does it just encounter a slowpath speed bump? If dead, that seems rather unfortunate and I think we'd want to provide a knob for users to tune FLOW_DEF_TTL for their applications.

}
Some((EvictionKey::Evictable(curr_prio, curr_time), ..))
if prio > curr_prio
|| (prio == curr_prio && last_hit < curr_time) =>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is last hit what we want here, or do we want amount of time beyond TTL, as the TTL can pretty different depending on what state the flow is in? This is related to the question earlier on how we set the eviction priority when looking at the delta relative to TTL in TcpExpiry::eviction_priority.

/// Return an iterator containing references to all flow entries from other
/// tables which underpin `self`.
fn parents(&self) -> impl Iterator<Item = Arc<dyn FlowEntryInfo>> {
[].into_iter()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this is a bit of a bug magnet if a new impl forgets or does not realize they need to implement this. Consider removing the default and making impls fill in the method?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants