metrics: add handle ddl event duration metric #4320
wk989898 wants to merge 5 commits into pingcap:master
Conversation
📝 Walkthrough

Instrumentation and a new Prometheus histogram were added to measure DDL handling duration in the dispatcher; metric lifecycle cleanup was implemented; Grafana dashboard JSONs gained a heatmap panel for "Handle DDL Duration". No control-flow or return-value changes.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Dispatcher as BasicDispatcher
    participant Shared as SharedInfo
    participant Prom as Prometheus
    participant Graf as Grafana
    Dispatcher->>Shared: register DDL post-flush callback (record start time)
    note right of Shared: callback will run after flush completes
    Shared->>Dispatcher: (on DDL post-flush) invoke callback
    Dispatcher->>Prom: observe duration via HandleDDLHistogram.WithLabels(...)
    Prom->>Graf: histogram buckets exposed (scraped by Prometheus)
    Graf->>Graf: heatmap panel queries ticdc_ddl_handle_duration_bucket for visualization
```
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the monitoring capabilities of TiCDC by introducing a dedicated metric to measure the duration of Data Definition Language (DDL) event handling. This provides crucial insights into the performance and latency characteristics of DDL operations, which are vital for maintaining database schema consistency and overall system health. The addition of corresponding Grafana panels ensures that this new observability data is readily available for analysis and troubleshooting.

Highlights
Activity
Code Review
This pull request introduces a new metric, ticdc_ddl_handle_duration, to monitor the duration of DDL event handling. The implementation involves adding the metric definition, observing its value within the dispatcher logic, and updating Grafana dashboards with a new visualization panel. The changes are generally well-implemented. I've provided a couple of suggestions to improve code consistency and performance.
Actionable comments posted: 1
🧹 Nitpick comments (1)
downstreamadapter/dispatcher/basic_dispatcher_info.go (1)
73-75: Fix the metric field comment to match the actual identifier. Line 73 says `metricExecDDLHis`, but the field at Line 75 is `metricHandleDDLHis`. Please align the comment to avoid confusion during maintenance.
📒 Files selected for processing (6)
- downstreamadapter/dispatcher/basic_dispatcher.go
- downstreamadapter/dispatcher/basic_dispatcher_info.go
- metrics/grafana/ticdc_new_arch.json
- metrics/nextgengrafana/ticdc_new_arch_next_gen.json
- metrics/nextgengrafana/ticdc_new_arch_with_keyspace_name.json
- pkg/metrics/ddl.go
```go
now := time.Now()
ddl.AddPostFlushFunc(func() {
	if d.tableSchemaStore != nil {
		d.tableSchemaStore.AddEvent(ddl)
	}
	wakeCallback()
	d.sharedInfo.metricHandleDDLHis.Observe(time.Since(now).Seconds())
	log.Debug("dispatcher handle ddl event finish",
		zap.Duration("cost", time.Since(now)),
		zap.Any("ddl", ddl))
})
```
Metric boundary currently measures beyond pre-sink handling.
The timer starts at Line 672, but the value is observed in the post-flush callback at Line 678 (after the sink write/flush) and after wakeCallback() at Line 677. This captures end-to-end latency plus callback overhead, not just the dispatcher's pre-sink DDL handling targeted by issue #4295.
If pre-sink latency is the target, record the metric right before the first sink write path. At a minimum, move the Observe call before wakeCallback and reuse one computed elapsed duration to avoid callback skew.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: wk989898 <nhsmwk@gmail.com>
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: asddongmen.
What problem does this PR solve?
Issue Number: close #4295
What is changed and how it works?
Check List
Tests
Questions
Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?
Release note