Skip to content

frr: resync grout state when this one restarts#522

Merged
rjarry merged 5 commits intoDPDK:mainfrom
maxime-leroy:resync_grout
Mar 3, 2026
Merged

frr: resync grout state when this one restarts#522
rjarry merged 5 commits intoDPDK:mainfrom
maxime-leroy:resync_grout

Conversation

@maxime-leroy
Copy link
Copy Markdown
Collaborator

@maxime-leroy maxime-leroy commented Feb 26, 2026

Summary by CodeRabbit

  • New Features

    • Added ability to atomically clear interface index mappings
    • Introduced per‑VRF synchronization channel for staged sync and graceful restart flows
  • Bug Fixes

    • Made interface mapping operations thread‑safe to prevent races and duplicates
    • Improved error handling, reconnection, and recovery to ensure reliable resync and notification processing

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Feb 26, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b679d81 and 1533a24.

📒 Files selected for processing (3)
  • frr/if_map.c
  • frr/if_map.h
  • frr/zebra_dplane_grout.c

📝 Walkthrough

Walkthrough

The changes add mutex-based thread-safety to if_map (introducing a global ifindex mutex and a new public function clear_ifindex_mappings()), and refactor zebra_dplane_grout to introduce a dedicated sync_client, per-VRF synchronization state and events, new sync/reconnect workflows, updated connection/recovery logic, and notification handling that triggers full resync on errors. Public APIs remain unchanged.

Tip

Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs).
Share your feedback on Discord.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
frr/zebra_dplane_grout.c (1)

267-268: Manual bitfield reset - consider using a helper if available.

The manual memset and n = 0 reset works correctly, but FRR may have a bf_reset() or similar macro for cleaner bitfield clearing. If not available, this approach is acceptable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@frr/zebra_dplane_grout.c` around lines 267 - 268, The manual clearing of the
bitfield (memset on grout_ctx.sync_vrf.data and setting grout_ctx.sync_vrf.n =
0) should be replaced with the project's bitfield reset helper if one exists;
locate the bitfield API used elsewhere (e.g., bf_reset, BITFIELD_RESET,
bitfield_reset) and call that with grout_ctx.sync_vrf (or its address) instead
of the raw memset and n = 0, otherwise leave the current code as-is if no helper
is available.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@frr/zebra_dplane_grout.c`:
- Around line 267-268: The manual clearing of the bitfield (memset on
grout_ctx.sync_vrf.data and setting grout_ctx.sync_vrf.n = 0) should be replaced
with the project's bitfield reset helper if one exists; locate the bitfield API
used elsewhere (e.g., bf_reset, BITFIELD_RESET, bitfield_reset) and call that
with grout_ctx.sync_vrf (or its address) instead of the raw memset and n = 0,
otherwise leave the current code as-is if no helper is available.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3019bed and 5ba4b9e.

📒 Files selected for processing (3)
  • frr/if_map.c
  • frr/if_map.h
  • frr/zebra_dplane_grout.c

Comment thread frr/zebra_dplane_grout.c Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@frr/if_map.c`:
- Around line 118-125: The early return inside the frr_with_mutex block can
leave the mutex locked; instead, perform the find and set a local boolean (e.g.,
found_flag or result) or a pointer variable for found, do not return from inside
frr_with_mutex, and after leaving the mutex block return the appropriate value.
Specifically, call grout_to_frr_find(&grout_to_frr_mappings, &key) inside
frr_with_mutex, store the outcome in a local variable, only call
grout_to_frr_del and frr_to_grout_del if found, and move the final return
(return false/true) to after the frr_with_mutex block so the mutex is always
released by frr_with_mutex.
- Around line 66-86: The code inside the frr_with_mutex(...) block returns early
on error paths (after calls to grout_to_frr_add and frr_to_grout_add), which
prevents the frr_with_mutex loop from executing its unlock step; change the
logic to use a local bool result (e.g., success = true) and set it to false on
the error conditions, perform necessary cleanup calls (grout_to_frr_del, XFREE)
as currently done, then break out of the frr_with_mutex block and return the
local result after the block; reference the existing symbols grout_to_frr_add,
frr_to_grout_add, grout_to_frr_del, XFREE, mapping and the frr_with_mutex macro
when making the change.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5ba4b9e and b679d81.

📒 Files selected for processing (3)
  • frr/if_map.c
  • frr/if_map.h
  • frr/zebra_dplane_grout.c

Comment thread frr/if_map.c
Comment thread frr/if_map.c
Comment thread frr/zebra_dplane_grout.c Outdated
Comment thread frr/zebra_dplane_grout.c
grout_ctx.client is shared between the dplane thread
(grout_client_send_recv) and the main thread (sync functions) with no
synchronization. gr_api_client has no locking: its response queue and
the static message_id counter in gr_api_client_send can be corrupted
by concurrent access. Most of the time this is not an issue as the
initial sync completes before the dplane thread starts processing, but
as soon as interfaces are created in zebra the dplane thread may start
sending requests concurrently with the main thread's streaming reads.

Add a dedicated sync_client to grout_ctx for exclusive use by the main
thread sync functions (grout_sync_ifaces, grout_sync_ifaces_addresses,
grout_sync_nhs, grout_sync_routes). The existing client field remains
for the dplane thread.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
Reviewed-by: Robin Jarry <rjarry@redhat.com>
The ifindex/VRF mapping hash tables are accessed from both the main
thread (grout_sync_ifaces via grout_link_change, grout_sync_* via
vrf_grout_to_frr lookups) and the dplane thread (zd_grout_process via
grout_add_del_route, grout_add_del_nexthop, grout_add_del_address).

FRR hash tables (DECLARE_HASH) are not thread-safe. Concurrent add,
del or clear from one thread while the other traverses the hash can
corrupt the internal structure or cause a use-after-free on the
ifindex_mapping entries.

This race exists from initial startup: grout_sync_ifaces populates the
mappings on the main thread while the dplane provider may already be
processing queued requests that perform lookups on the dplane thread.

Add a pthread_mutex_t to serialize all hash table operations.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
Reviewed-by: Robin Jarry <rjarry@redhat.com>
grout_sync_ifaces and grout_sync_addrs (renamed from
grout_sync_ifaces_addresses) were running on the zebra main thread
instead of the dplane thread. Unlike the kernel datapath where the
initial netlink sync completes before the dplane thread starts, the
grout plugin syncs via the frr_config_post hook which fires after the
dplane thread is already running. At that point,
dplane_read_notifications is actively processing iface and address
events, so the sync must run on the same thread to avoid concurrent
access.

Schedule grout_sync_ifaces and grout_sync_addrs on
dplane_get_thread_master() so that API calls using the sync_client
happen on the thread that owns it.

Additionally, all VRFs were being synced concurrently by scheduling
one event per VRF at the end of grout_sync_ifaces. Replace this with
a sequential chain: grout_sync_ifaces starts the first VRF, and
grout_sync_routes chains to the next VRF when done. This avoids
interleaving API requests from multiple VRFs and ensures that the
single dg_t_dplane_sync event pointer is not overwritten.

Use a bitfield instead of a stack-allocated bool array to track which
VRFs need syncing, so the set persists across the chained callbacks.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
Reviewed-by: Robin Jarry <rjarry@redhat.com>
When grout restarts, the zebra notification socket receives EOF.
This triggers grout_reconnect which flushes all FRR state (VRFs,
ifindex mappings), resets the namespace, and schedules grout_sync.

grout_sync is the single entry point for connecting the main thread
API client (sync_client) to grout. It retries every second until
grout becomes available, then schedules notification channels and
the interface sync chain. This same function is used at startup via
the frr_config_post hook, allowing FRR to start before grout is
ready. Connection failures are logged so operators can tell why FRR
has not synced yet.

Notification connect functions (dplane_grout_connect,
zebra_grout_connect) no longer self-retry but schedule
grout_reconnect on failure so the full sync is retried.
grout_notif_subscribe disconnects any stale client before
reconnecting since the ordering between EOF processing and
reconnect scheduling is not deterministic.

grout_client_send_recv no longer attempts inline reconnection.
If the dplane client connection is lost, it fails the request,
disconnects the client, and schedules grout_reconnect so recovery
does not depend solely on the notification EOF path.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
Reviewed-by: Robin Jarry <rjarry@redhat.com>
When grout restarts, grout_ns_reset() tears down all VRFs and their
interfaces via vrf_terminate(). Routes and nexthops are properly
cleaned up through zebra_vrf_disable(), but interfaces are destroyed
via if_delete() which never calls if_down().

In normal FRR operation with the kernel backend, if_down() has
already been called in response to a netlink RTM_DELLINK notification
before the interface is destroyed. With grout, there is no kernel
notification, so global state referencing the interface (such as the
RA timer wheel) is not cleaned up before the interface is freed. In
particular, BGP unnumbered enables RA on peering interfaces which
adds them to zrouter.ra_wheel. After the reset, the wheel timer
thread calls process_rtadv() on a stale pointer, causing a SIGSEGV.

Call if_down() on all interfaces before tearing down namespaces and
VRFs.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
Reviewed-by: Robin Jarry <rjarry@redhat.com>
@rjarry rjarry merged commit 2ecbd25 into DPDK:main Mar 3, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants