Commit f0c4c4a

Auto-sync: Update English docs from Chinese PR
Synced from: pingcap/docs-cn#21221
Target PR: #22738
AI Provider: azure
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
1 parent 0813b50 commit f0c4c4a

2 files changed

Lines changed: 53 additions & 0 deletions

ticdc/ticdc-architecture.md

Lines changed: 43 additions & 0 deletions
@@ -76,6 +76,49 @@ In table split mode, pay attention to the following settings:
- [`scheduler.region-count-per-span`](/ticdc/ticdc-changefeed-config.md#region-count-per-span-new-in-v854): the default value is `100`. During changefeed initialization, tables that meet the split conditions are split according to this parameter. After splitting, each split sub-table contains at most `region-count-per-span` regions.
- [`scheduler.write-key-threshold`](/ticdc/ticdc-changefeed-config.md#write-key-threshold): the default value is `0` (disabled). When the sink write throughput of a table exceeds this threshold, TiCDC triggers table splitting. In most cases, keep this parameter set to `0`.

## Storage Sink file name changes and consumption instructions

After you switch to the new TiCDC architecture and enable table-level task splitting, [Storage Sink](/ticdc/ticdc-sink-to-cloud-storage.md) uses different file names. The name format of data change files changes from `CDC_{num}.{extension}` to `CDC_{uuid}_{num}.{extension}`, and the name format of index files changes from `CDC.index` to `CDC_{uuid}.index`. Here, `uuid` identifies the sub replication task created by table splitting, and `num` is the file sequence number.

- Data change record path

    ```
    {scheme}://{prefix}/{schema}/{table}/{table-version-separator}/{partition-separator}/{date-separator}/CDC_{uuid}_{num}.{extension}
    ```

- Index file path

    ```
    {scheme}://{prefix}/{schema}/{table}/{table-version-separator}/{partition-separator}/{date-separator}/meta/CDC_{uuid}.index
    ```

After table-level task splitting is enabled, under the `{schema}/{table}/{table-version-separator}/` directory, the same table might have multiple data files with different `uuid` values but the same sequence number. For example:

```
├── metadata
└── test
    ├── tbl_1
    │   ├── 437752935075545091
    │   │   ├── CDC_11_000001.json
    │   │   ├── CDC_11_000002.json
    │   │   ├── CDC_22_000001.json
    │   │   └── meta
    │   │       ├── CDC_11.index
    │   │       └── CDC_22.index
    │   ├── 437752935075546092
    │   │   ├── CDC_33_000001.json
    │   │   ├── CDC_44_000001.json
    │   │   └── meta
    │   │       ├── CDC_33.index
    │   │       └── CDC_44.index
```
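
Because files from different sub tasks can share a sequence number, a consumer has to group files by `uuid` before ordering them. The following is a minimal Python sketch of that grouping, based only on the `CDC_{uuid}_{num}.{extension}` format described above. The helper name `group_by_uuid` is hypothetical, and the pattern assumes that `uuid` contains no underscore and that `num` consists of digits only.

```python
import re
from collections import defaultdict

# Follows the `CDC_{uuid}_{num}.{extension}` format described above.
# Assumption: `uuid` contains no underscore and `num` is all digits.
DATA_FILE = re.compile(r"^CDC_(?P<uuid>[^_]+)_(?P<num>\d+)\.(?P<ext>\w+)$")


def group_by_uuid(file_names):
    """Map each uuid to its data file names, ordered by sequence number."""
    groups = defaultdict(list)
    for name in file_names:
        m = DATA_FILE.match(name)
        if m is None:
            continue  # not a split-mode data file (for example, an index file)
        groups[m.group("uuid")].append((int(m.group("num")), name))
    return {uuid: [n for _, n in sorted(files)] for uuid, files in groups.items()}


names = ["CDC_11_000002.json", "CDC_22_000001.json", "CDC_11_000001.json", "CDC_11.index"]
grouped = group_by_uuid(names)
```

In this example, `grouped` maps `"11"` to its two data files in sequence order and `"22"` to its single data file. The index file name is skipped because it does not match the data file pattern.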

Because multiple sub replication tasks write files in parallel, a downstream consumer might read a data file before it is fully written and fail to read part of the data. To avoid this situation, read the data in the following order when you write a downstream consumer program:

1. Read the `meta/CDC_{uuid}.index` file (for example, `CDC_11.index`) to obtain the name of the file that has been completely written (for example, `CDC_11_000002.json`).
2. In sequence-number order, read the files whose sequence numbers are less than or equal to the sequence number in that file name (for example, `CDC_11_000001.json` and `CDC_11_000002.json`).
3. After reading the DML events from the files of all sub tasks, sort these events by their `commit-ts`, and then process them downstream in a unified manner.
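
The three steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not TiCDC's consumer implementation: it reads from a local directory as a stand-in for the storage service, it assumes the index file contains only the data file name, and the names `completed_files`, `ordered_events`, and the caller-supplied `parse_events` decoder (which returns `(commit_ts, event)` pairs for one file) are hypothetical.

```python
import re
from pathlib import Path

# Patterns follow the file name formats described above.
# Assumption: `uuid` contains no underscore and `num` is all digits.
INDEX_FILE = re.compile(r"^CDC_(?P<uuid>[^_]+)\.index$")
DATA_FILE = re.compile(r"^CDC_(?P<uuid>[^_]+)_(?P<num>\d+)\.(?P<ext>\w+)$")


def completed_files(data_dir):
    """Return the paths of data files that an index file marks as fully written."""
    data_dir = Path(data_dir)
    result = []
    for index_path in sorted((data_dir / "meta").iterdir()):
        im = INDEX_FILE.match(index_path.name)
        if im is None:
            continue
        # Step 1: the index file names the completely written file.
        # Assumption: the index file content is just that file name.
        last = DATA_FILE.match(index_path.read_text().strip())
        if last is None or last.group("uuid") != im.group("uuid"):
            continue
        limit = int(last.group("num"))
        # Step 2: keep files of this sub task with sequence number <= limit.
        candidates = []
        for path in data_dir.iterdir():
            dm = DATA_FILE.match(path.name)
            if dm and dm.group("uuid") == im.group("uuid") and int(dm.group("num")) <= limit:
                candidates.append((int(dm.group("num")), path))
        result.extend(path for _, path in sorted(candidates))
    return result


def ordered_events(data_dir, parse_events):
    """Collect events from all sub tasks and sort them by commit-ts (step 3)."""
    events = []
    for path in completed_files(data_dir):
        events.extend(parse_events(path.read_bytes()))
    events.sort(key=lambda pair: pair[0])  # pair is (commit_ts, event)
    return events
```

Sorting on the `commit_ts` element alone keeps the sort stable, so events with the same `commit-ts` stay in the order in which they were read. A real consumer would replace the local directory access with calls to its storage client and supply a `parse_events` function that matches the configured file format (CSV or Canal-JSON).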

## Compatibility

Except as described in the following special cases, the new TiCDC architecture is fully compatible with the classic architecture.

ticdc/ticdc-storage-consumer-dev-guide.md

Lines changed: 10 additions & 0 deletions
@@ -176,3 +176,13 @@ The consumption logic is consistent. Specifically, the consumer parses the table
After DDL events are properly processed, you can process DML events in the `{schema}/{table}/{table-version-separator}/` directory based on the specific file format (CSV or Canal-JSON) and file number.

TiCDC ensures that data is replicated at least once. Therefore, there might be duplicate data. You need to compare the commit ts of the change data with the consumer checkpoint. If the commit ts is less than the consumer checkpoint, you need to perform deduplication.

When processing files, a downstream consumer might read a data file before it is fully written and fail to read some of the data. To avoid this issue, read data in the following order when you write a downstream consumer:

1. Read the `meta/CDC.index` file in the `{schema}/{table}/{table-version-separator}/` directory to obtain the name of the file that has been completely written.
2. For the [new TiCDC architecture](/ticdc/ticdc-architecture.md), read, in sequence, the files whose file numbers are less than or equal to the number in that file name. For the [classic TiCDC architecture](/ticdc/ticdc-classic-architecture.md), read, in sequence, the files whose file numbers are less than the number in that file name.
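
The only difference between the two architectures in step 2 is whether the file named in the index is itself safe to read. A minimal Python sketch of that boundary follows. The function name `readable_file_numbers` and its parameters are hypothetical, and it assumes the file numbers have already been parsed into integers.

```python
def readable_file_numbers(indexed_number, available_numbers, new_architecture):
    """Return the file numbers that are safe to read, in ascending order.

    indexed_number: the file number taken from the file name in the index file.
    available_numbers: file numbers found in the data directory.
    new_architecture: True for the new TiCDC architecture, False for the classic one.
    """
    if new_architecture:
        # New architecture: the indexed file is included (<=).
        safe = [n for n in available_numbers if n <= indexed_number]
    else:
        # Classic architecture: the indexed file is excluded (<).
        safe = [n for n in available_numbers if n < indexed_number]
    return sorted(safe)
```

For example, with files 1 to 4 present and file 3 named in the index, the new architecture yields files 1, 2, and 3, while the classic architecture yields files 1 and 2.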

> **Note:**
>
> When `scheduler.enable-table-across-nodes` is enabled in the [new TiCDC architecture](/ticdc/ticdc-architecture.md), the name format of data change files changes from `CDC_{num}.{extension}` to `CDC_{uuid}_{num}.{extension}`, and the name format of index files changes from `CDC.index` to `CDC_{uuid}.index`. In this case, files with different UUIDs but the same sequence number exist in the `{schema}/{table}/{table-version-separator}/` directory. When writing a downstream consumer, follow the order described in [Storage Sink file name changes and consumption instructions](/ticdc/ticdc-architecture.md#storage-sink-file-name-changes-and-consumption-instructions).
