ticdc/ticdc-architecture.md: 43 additions, 0 deletions
@@ -76,6 +76,49 @@ In table split mode, pay attention to the following settings:
- [`scheduler.region-count-per-span`](/ticdc/ticdc-changefeed-config.md#region-count-per-span-new-in-v854): the default value is `100`. During changefeed initialization, TiCDC splits each table that meets the split conditions according to this parameter. After splitting, each sub-table contains at most `region-count-per-span` regions.
- [`scheduler.write-key-threshold`](/ticdc/ticdc-changefeed-config.md#write-key-threshold): the default value is `0` (disabled). When the sink write throughput of a table exceeds this threshold, TiCDC triggers table splitting. In most cases, keep this parameter set to `0`.

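As a rough illustration of the `region-count-per-span` limit, the following sketch computes the smallest number of sub-tables that can hold a table's regions when each sub-table contains at most `region-count-per-span` regions. The function name is hypothetical, and the actual number of sub-tables that TiCDC creates might be larger.

```python
def min_sub_tables(region_count: int, region_count_per_span: int = 100) -> int:
    """Return the smallest number of sub-tables that can hold `region_count`
    regions when each sub-table holds at most `region_count_per_span` regions.

    This is only a lower bound derived from the limit itself; it does not
    model how TiCDC actually chooses split points.
    """
    if region_count < 0 or region_count_per_span <= 0:
        raise ValueError("region_count must be >= 0 and region_count_per_span > 0")
    # Integer ceiling division, avoiding floating-point rounding.
    return -(-region_count // region_count_per_span)
```

For example, with the default value of `100`, a table with 250 regions needs at least three sub-tables.
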
## Storage Sink file name changes and consumption instructions
80
+
81
+
After switching to the new TiCDC architecture and enabling table-level task splitting, for [Storage Sink](/ticdc/ticdc-sink-to-cloud-storage.md), the file name format for recording data changes changes from `CDC_{num}.{extension}` to `CDC_{uuid}_{num}.{extension}`, and the Index file name format changes from `CDC.index` to `CDC_{uuid}.index`. Here, `uuid` identifies the sub replication task after table splitting, and `num` indicates the file sequence number.
After table-level task splitting is enabled, under the `{schema}/{table}/{table-version-separator}/` directory, the same table might have multiple data files with different `uuid` values but the same sequence number. For example:
96
+
97
+
```
├── metadata
└── test
    ├── tbl_1
    │   ├── 437752935075545091
    │   │   ├── CDC_11_000001.json
    │   │   ├── CDC_11_000002.json
    │   │   ├── CDC_22_000001.json
    │   │   └── meta
    │   │       ├── CDC_11.index
    │   │       └── CDC_22.index
    │   ├── 437752935075546092
    │   │   ├── CDC_33_000001.json
    │   │   ├── CDC_44_000001.json
    │   │   └── meta
    │   │       ├── CDC_33.index
    │   │       └── CDC_44.index
```

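To make the two naming formats concrete, the following is a minimal sketch that parses data file names in either format and groups them by `uuid`. The regular expression and the `group_by_uuid` helper are illustrative assumptions based only on the formats described above; they are not part of TiCDC.

```python
import re
from collections import defaultdict

# Matches both `CDC_{num}.{extension}` and `CDC_{uuid}_{num}.{extension}`.
# Assumption: `num` consists of digits only.
DATA_FILE = re.compile(r"^CDC_(?:(?P<uuid>.+)_)?(?P<num>\d+)\.(?P<ext>\w+)$")


def group_by_uuid(file_names):
    """Group data file names by uuid, ordered by sequence number.

    Files in the old format (no uuid) are grouped under the key None.
    Names that are not data files, including index files, are skipped.
    """
    groups = defaultdict(list)
    for name in file_names:
        m = DATA_FILE.match(name)
        if m is None or m.group("ext") == "index":
            continue
        groups[m.group("uuid")].append((int(m.group("num")), name))
    return {uuid: [n for _, n in sorted(files)] for uuid, files in groups.items()}
```

Applied to the first table version directory in the example above, this groups `CDC_11_000001.json` and `CDC_11_000002.json` under `11`, and `CDC_22_000001.json` under `22`.
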
Because multiple sub replication tasks write files in parallel, a downstream consumer might read a data file before it is fully written and therefore miss part of the data. To avoid this, make your consumer program read data in the following order:

1. Read the `meta/CDC_{uuid}.index` file (for example, `CDC_11.index`) to obtain the name of the latest file that has been completely written (for example, `CDC_11_000002.json`).
2. Read, in order, the files with the same `uuid` whose sequence numbers are less than or equal to the sequence number in that file name (for example, `CDC_11_000001.json` and `CDC_11_000002.json`).
3. After reading the DML events from the files of all sub tasks, sort these events by `commit-ts`, and then process them downstream in a unified manner.

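The reading order above can be sketched as follows. This is a minimal illustration, not a reference consumer. It assumes that each index file contains a single data file name, that the files are available on a local file system, and that `parse_events` is a hypothetical callable you supply to decode one data file into `(commit_ts, event)` pairs.

```python
import re
from pathlib import Path

DATA_FILE = re.compile(r"^CDC_(?P<uuid>.+)_(?P<num>\d+)\.(?P<ext>\w+)$")


def completed_files(version_dir: Path):
    """Steps 1 and 2: for each sub task, yield its fully written data files
    in sequence order, based on the file name recorded in its index file."""
    for index_path in sorted((version_dir / "meta").glob("CDC_*.index")):
        # Assumption: the index file holds one data file name.
        last = DATA_FILE.match(index_path.read_text().strip())
        if last is None:
            continue
        uuid, last_num = last.group("uuid"), int(last.group("num"))
        files = []
        for path in version_dir.iterdir():
            m = DATA_FILE.match(path.name)
            # Keep only this sub task's files up to and including last_num.
            if m and m.group("uuid") == uuid and int(m.group("num")) <= last_num:
                files.append((int(m.group("num")), path))
        yield [path for _, path in sorted(files)]


def merged_events(version_dir: Path, parse_events):
    """Step 3: collect the events of all sub tasks and sort them by commit-ts.

    `parse_events(path)` is hypothetical and must return (commit_ts, event) pairs.
    """
    events = []
    for files in completed_files(version_dir):
        for path in files:
            events.extend(parse_events(path))
    return sorted(events, key=lambda pair: pair[0])
```

Sorting in memory keeps the sketch short. A consumer that handles large volumes would more likely merge the per-task streams incrementally.
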
## Compatibility

Except as described in the following special cases, the new TiCDC architecture is fully compatible with the classic architecture.
ticdc/ticdc-storage-consumer-dev-guide.md: 10 additions, 0 deletions
@@ -176,3 +176,13 @@ The consumption logic is consistent. Specifically, the consumer parses the table
After DDL events are properly processed, you can process DML events in the `{schema}/{table}/{table-version-separator}/` directory based on the specific file format (CSV or Canal-JSON) and file number.

TiCDC ensures that data is replicated at least once, so the data might contain duplicates. Compare the commit ts of each change with the consumer checkpoint. If the commit ts is less than the consumer checkpoint, you need to perform deduplication.

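As a minimal sketch of this check, the following function filters out changes whose commit ts is less than the consumer checkpoint. It assumes that deduplication here means skipping such changes, and that each change is represented as a hypothetical `(commit_ts, event)` pair. Adjust both assumptions to match your consumer.

```python
def drop_already_consumed(events, checkpoint_ts):
    """Keep only changes whose commit ts is not less than the checkpoint.

    `events` is an iterable of (commit_ts, event) pairs (a hypothetical
    representation). Changes with commit_ts < checkpoint_ts are treated as
    duplicates of data that the consumer has already processed.
    """
    return [(ts, event) for ts, event in events if ts >= checkpoint_ts]
```
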
When processing files, a downstream consumer might read a data file before it is fully written and therefore miss some data. To avoid this issue, make your consumer read data in the following order:

1. Read the `meta/CDC.index` file in the `{schema}/{table}/{table-version-separator}/` directory to obtain the name of the file that has been completely written.
2. For the [new TiCDC architecture](/ticdc/ticdc-architecture.md), read, in sequence, the files whose file numbers are less than or equal to the number in that file name. For the [classic TiCDC architecture](/ticdc/ticdc-classic-architecture.md), read, in sequence, the files whose file numbers are less than the number in that file name.

> **Note:**
>
> When `scheduler.enable-table-across-nodes` is enabled in the [new TiCDC architecture](/ticdc/ticdc-architecture.md), the name format of data change files changes from `CDC_{num}.{extension}` to `CDC_{uuid}_{num}.{extension}`, and the name format of index files changes from `CDC.index` to `CDC_{uuid}.index`. In this case, the `{schema}/{table}/{table-version-separator}/` directory might contain files with different UUIDs but the same sequence number. When writing a downstream consumer, follow the reading order described in [Storage Sink file name changes and consumption instructions](/ticdc/ticdc-architecture.md#storage-sink-file-name-changes-and-consumption-instructions).