metrics: add handle ddl event duration metric #4320
wk989898 wants to merge 5 commits into pingcap:master
Conversation
📝 Walkthrough

Instrumentation and a new Prometheus histogram were added to measure DDL handling duration in the dispatcher; metric lifecycle cleanup was implemented; Grafana dashboard JSONs gained a heatmap panel for "Handle DDL Duration". No control-flow or return-value changes.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Dispatcher as BasicDispatcher
    participant Shared as SharedInfo
    participant Prom as Prometheus
    participant Graf as Grafana
    Dispatcher->>Shared: register DDL post-flush callback (record start time)
    note right of Shared: callback will run after flush completes
    Shared->>Dispatcher: (on DDL post-flush) invoke callback
    Dispatcher->>Prom: observe duration via HandleDDLHistogram.WithLabels(...)
    Prom->>Graf: histogram buckets exposed (scraped by Prometheus)
    Graf->>Graf: heatmap panel queries ticdc_ddl_handle_duration_bucket for visualization
```
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the monitoring capabilities of TiCDC by introducing a dedicated metric to measure the duration of Data Definition Language (DDL) event handling. This provides crucial insights into the performance and latency characteristics of DDL operations, which are vital for maintaining database schema consistency and overall system health. The addition of corresponding Grafana panels ensures that this new observability data is readily available for analysis and troubleshooting.

Highlights
Activity
Code Review
This pull request introduces a new metric, ticdc_ddl_handle_duration, to monitor the duration of DDL event handling. The implementation involves adding the metric definition, observing its value within the dispatcher logic, and updating Grafana dashboards with a new visualization panel. The changes are generally well-implemented. I've provided a couple of suggestions to improve code consistency and performance.
Actionable comments posted: 1
🧹 Nitpick comments (1)
downstreamadapter/dispatcher/basic_dispatcher_info.go (1)
73-75: Fix the metric field comment to match the actual identifier. Line 73 says `metricExecDDLHis`, but the field at Line 75 is `metricHandleDDLHis`. Please align the comment to avoid confusion during maintenance.
📒 Files selected for processing (6)
- downstreamadapter/dispatcher/basic_dispatcher.go
- downstreamadapter/dispatcher/basic_dispatcher_info.go
- metrics/grafana/ticdc_new_arch.json
- metrics/nextgengrafana/ticdc_new_arch_next_gen.json
- metrics/nextgengrafana/ticdc_new_arch_with_keyspace_name.json
- pkg/metrics/ddl.go
```go
now := time.Now()
ddl.AddPostFlushFunc(func() {
	if d.tableSchemaStore != nil {
		d.tableSchemaStore.AddEvent(ddl)
	}
	wakeCallback()
	d.sharedInfo.metricHandleDDLHis.Observe(time.Since(now).Seconds())
	log.Debug("dispatcher handle ddl event finish",
		zap.Duration("cost", time.Since(now)),
		zap.Any("ddl", ddl))
})
```
Metric boundary currently measures beyond pre-sink handling.
The timer starts at Line 672, but the value is observed in the post-flush callback at Line 678 (after the sink write/flush) and after wakeCallback() at Line 677. This captures end-to-end latency plus callback overhead, not just the dispatcher's pre-sink DDL handling targeted by issue #4295.
If pre-sink latency is the target, record the metric right before the first sink write path. At a minimum, move the Observe call before wakeCallback and reuse one computed elapsed duration to avoid callback skew.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: wk989898 <nhsmwk@gmail.com>
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: asddongmen.
What problem does this PR solve?
Issue Number: close #4295
What is changed and how it works?
Check List
Tests
Questions
Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?
Release note