[FEATURE] Allow output \0 terminated frames (for WebSocket streaming support)#2105
cfsmp3 merged 15 commits into CCExtractor:master
cfsmp3
left a comment
Good feature with a clear real-world use case. The implementation is clean and properly wired through both C and Rust. However, the --null-terminated flag currently only works for DVB bitmap subtitles, not for text-based captions (CEA-608/708). This needs to be fixed before merging.
The problem
In src/lib_ccx/ccx_encoders_transcript.c, you replaced encoded_crlf with encoded_end_frame in only one place, the bitmap subtitle path at line 92:
// write_cc_bitmap_as_transcript(), line 92: ✅ changed
write_wrapped(context->out->fh, context->encoded_end_frame, context->encoded_end_frame_length);
But the text subtitle path (write_cc_buffer_as_transcript) still uses encoded_crlf in two write() calls that also need updating:
// Line 206: ❌ not changed (end of each subtitle line)
ret = write(context->out->fh, context->encoded_crlf, context->encoded_crlf_length);
// Line 328: ❌ not changed (end of each subtitle block)
ret = write(context->out->fh, context->encoded_crlf, context->encoded_crlf_length);
Lines 77 and 90 also use encoded_crlf, but for parsing/splitting tokens. Those should probably stay as-is since they're detecting line breaks within the input, not writing output.
How to verify
I tested with a CEA-608 stream:
./ccextractor input.ts --txt --stdout --null-terminated 2>/dev/null | xxd | head -30
The output contains only 0d 0a (CRLF) — zero null bytes. The flag has no effect for text-based captions.
What to fix
In src/lib_ccx/ccx_encoders_transcript.c, replace encoded_crlf with encoded_end_frame on lines 206 and 328 (the two write() calls in write_cc_buffer_as_transcript). Leave lines 77 and 90 alone — those are input parsing, not output.
Note: you'll also need to update the ret < context->encoded_crlf_length comparisons on lines 207 and 329 to use encoded_end_frame_length accordingly.
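To illustrate why the length comparison has to change together with the buffer, here is a minimal sketch in C. It is not the CCExtractor code: struct demo_encoder_ctx and demo_write_end_frame are hypothetical names, only the two field names come from the review above, and POSIX write() is assumed.

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical stand-in for the encoder context. Only the two fields
   named in the review are modelled; everything else is omitted. */
struct demo_encoder_ctx
{
    int fh;                                 /* output file descriptor */
    const unsigned char *encoded_end_frame; /* e.g. "\r\n", "\n" or "\0" */
    unsigned int encoded_end_frame_length;
};

/* Writes the frame terminator and compares the result against the
   terminator's own length, not against the CRLF length.
   Returns 0 on success, -1 on error or short write. */
int demo_write_end_frame(const struct demo_encoder_ctx *ctx)
{
    ssize_t ret = write(ctx->fh, ctx->encoded_end_frame,
                        ctx->encoded_end_frame_length);
    if (ret < 0 || (size_t)ret < ctx->encoded_end_frame_length)
        return -1;
    return 0;
}
```

If the comparison kept using the 2-byte CRLF length while a 1-byte \0 terminator was written, every successful write would look like a short write.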
Thanks @cfsmp3, I've fixed the missing code paths.
cfsmp3
left a comment
Thanks for addressing the previous feedback — the C paths all work now. However there's still one path that doesn't respect --null-terminated:
CEA-708 via the Rust decoder — src/rust/src/decoder/tv_screen.rs:353 hardcodes \r\n:
writer.write_to_file(b"\r\n")?;
This means --null-terminated has no effect on CEA-708 transcript output. You can verify:
ccextractor input.ts --txt -o /tmp/test.txt --null-terminated -svc 1
xxd /tmp/test.p1.svc01.txt | head -20
# No null bytes, only 0d 0a
The frame_terminator_0 option needs to be plumbed into the Rust Writer struct so that write_transcript can use it instead of the hardcoded \r\n.
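The idea of deriving the terminator from options instead of hardcoding it can be sketched as follows. This is written in C for consistency with the rest of the review and is not the Rust Writer code. struct demo_options, struct demo_terminator, demo_select_end_frame and the lf_only field are hypothetical, and giving the null terminator priority over LF is an assumption.

```c
#include <stddef.h>

/* Hypothetical option block. frame_terminator_0 is the option named in
   the review; lf_only is an assumed stand-in for whatever --lf sets. */
struct demo_options
{
    int frame_terminator_0; /* set by --null-terminated */
    int lf_only;            /* assumed flag for --lf */
};

/* Terminator bytes plus explicit length, since "\0" cannot be measured
   with strlen(). */
struct demo_terminator
{
    const char *bytes;
    size_t length;
};

/* Picks the end-of-frame bytes once, so writers never hardcode "\r\n".
   Assumption: the null terminator takes priority over LF mode. */
struct demo_terminator demo_select_end_frame(const struct demo_options *opt)
{
    struct demo_terminator t;
    if (opt->frame_terminator_0)
    {
        t.bytes = "\0";
        t.length = 1;
    }
    else if (opt->lf_only)
    {
        t.bytes = "\n";
        t.length = 1;
    }
    else
    {
        t.bytes = "\r\n";
        t.length = 2;
    }
    return t;
}
```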
Hi @pszemus, I noticed the latest review feedback about plumbing frame_terminator_0 into the Rust Writer struct for CEA-708 support. I'd be happy to help with this if you'd like, just let me know!
@cfsmp3 Thanks! I've made the necessary changes and the project builds fine, but the Rust format check fails with a "too many arguments" error from clippy. Could you please review my changes?
cfsmp3
left a comment
Thanks for the update — the DVBSUB bitmap path works correctly, and the Rust formatting/clippy issues are resolved. However, --null-terminated only produces correct frame-level \0 delimiters on the DVBSUB (bitmap/OCR) path. On CEA-608 and CEA-708, the \0 is written per line, not per frame, which breaks the websocat -0 use case for those codecs and contradicts the PR description.
How to reproduce
CEA-608:
./ccextractor sample.ts -out=txt --null-terminated -o /tmp/test.txt
xxd /tmp/test.txt | head -20
You'll see \0 after every individual line, not after each complete subtitle frame. A two-line pop-on caption like:
♪ So no one told you
life was gonna be this way ♪
produces line1\0line2\0 instead of the expected line1\nline2\0.
CEA-708:
./ccextractor sample.ts -out=txt --null-terminated -svc 1 -o /tmp/test708.txt
xxd /tmp/test708.p1.svc01.txt | head -20
Same issue: \0 per row instead of per frame.
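To make the consumer's view concrete, here is a small sketch in C of how a NUL-splitting reader would count messages in the two byte sequences discussed above. demo_count_nul_frames is a hypothetical helper and is not how websocat is implemented; it only models "one message per \0".

```c
#include <stddef.h>

/* Counts NUL-delimited messages in buf, the way a consumer that splits
   on \0 would see them. Trailing bytes without a closing \0 are not
   counted as a message. */
size_t demo_count_nul_frames(const unsigned char *buf, size_t len)
{
    size_t frames = 0;
    for (size_t i = 0; i < len; i++)
    {
        if (buf[i] == '\0')
            frames++;
    }
    return frames;
}
```

For the 12 bytes of line1\0line2\0 this counts two messages, so the caption is split; for line1\nline2\0 it counts one, which is the behaviour the PR description promises.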
Root cause
Three code paths need fixing:
- write_cc_line_as_transcript2 (CEA-608, ccx_encoders_transcript.c ~line 325): This function is called per line and writes encoded_end_frame at the end of each individual line. The caller write_cc_buffer_as_transcript2 iterates over the 15 rows and calls this function for each used row. Fix: keep using encoded_crlf (or \n) as the separator between lines within a frame, and only write encoded_end_frame once after the last line. Since write_cc_line_as_transcript2 doesn't know whether it's the last line, the frame terminator should be moved to the caller (write_cc_buffer_as_transcript2) after the row loop.
- write_cc_subtitle_as_transcript (ccx_encoders_transcript.c ~line 203): The do...while (strtok_r) loop writes encoded_end_frame after each token (line). Fix: use \n (or encoded_crlf) between tokens within the loop, and write encoded_end_frame once after the loop exits.
- CEA-708 Rust path (tv_screen.rs ~line 353-354): end_frame is written inside the row iteration loop. Fix: move the write_to_file(&end_frame) call outside the for row_index in ... loop, writing it once after all rows are emitted. Use \n or encoded_crlf between rows within the loop (the current line separator behavior).
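The common shape of all three fixes can be sketched in C as below. This is an in-memory illustration under stated assumptions, not a patch: demo_append and demo_emit_frame are hypothetical helpers, and the real code writes to a file handle rather than a buffer.

```c
#include <stddef.h>
#include <string.h>

/* Appends n bytes to out at *pos if they fit.
   Returns 0 on success, -1 if the buffer is too small. */
int demo_append(unsigned char *out, size_t cap, size_t *pos,
                const char *src, size_t n)
{
    if (n > cap - *pos)
        return -1;
    memcpy(out + *pos, src, n);
    *pos += n;
    return 0;
}

/* Emits one frame: line_sep between lines, end_frame exactly once after
   the last line. Returns the number of bytes written, or 0 if out is
   too small. */
size_t demo_emit_frame(const char *const *lines, size_t line_count,
                       const char *line_sep, size_t line_sep_len,
                       const char *end_frame, size_t end_frame_len,
                       unsigned char *out, size_t cap)
{
    size_t pos = 0;
    for (size_t i = 0; i < line_count; i++)
    {
        /* Separator goes before every line except the first. */
        if (i > 0 && demo_append(out, cap, &pos, line_sep, line_sep_len) != 0)
            return 0;
        if (demo_append(out, cap, &pos, lines[i], strlen(lines[i])) != 0)
            return 0;
    }
    /* The terminator belongs to the caller of the per-line step,
       after the loop, because only the caller knows the frame is done. */
    if (demo_append(out, cap, &pos, end_frame, end_frame_len) != 0)
        return 0;
    return pos;
}
```

Lengths are passed explicitly for the separator and terminator because a \0 terminator has no usable strlen().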
Why the bitmap path works
In write_cc_bitmap_as_transcript, the entire subtitle text is processed first (internal CRLFs are replaced with spaces), then encoded_end_frame is written once at the end. This is the correct pattern — the other paths should follow a similar structure.
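A rough sketch of that pattern in C, for comparison with the per-line paths. demo_flatten_frame is a hypothetical helper; collapsing each CRLF pair into exactly one space is an assumption about the details, and the real function writes to a file handle instead of a buffer.

```c
#include <stddef.h>
#include <string.h>

/* Copies text into out with every CRLF pair replaced by a single space
   (assumed detail), then appends end_frame once.
   Returns the number of bytes written, or 0 if out is too small. */
size_t demo_flatten_frame(const char *text,
                          const char *end_frame, size_t end_frame_len,
                          unsigned char *out, size_t cap)
{
    size_t pos = 0;
    for (size_t i = 0; text[i] != '\0'; i++)
    {
        char c = text[i];
        /* text is NUL-terminated, so reading text[i + 1] is safe here. */
        if (c == '\r' && text[i + 1] == '\n')
        {
            c = ' ';
            i++; /* skip the LF half of the pair */
        }
        if (pos >= cap)
            return 0;
        out[pos++] = (unsigned char)c;
    }
    /* Whole frame processed: terminator is written exactly once. */
    if (end_frame_len > cap - pos)
        return 0;
    memcpy(out + pos, end_frame, end_frame_len);
    return pos + end_frame_len;
}
```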
Testing checklist
After fixing, verify with xxd that:
- CEA-608 multi-line pop-on: \n between lines, single \0 at frame end
- CEA-608 single-line: single \0 at frame end
- CEA-708 multi-line: \n between lines, single \0 at frame end
- Normal mode (without --null-terminated): output identical to master (no regression)
- --lf mode: \n line terminators still work as before
Thanks @cfsmp3, I must have missed those paths.
CCExtractor CI platform finished running the test files on linux. Below is a summary of the test results, when compared to test for commit 9f250b1...:
Your PR breaks these cases:
NOTE: The following tests have been failing on the master branch as well as the PR:
Congratulations: Merging this PR would fix the following tests:
It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you). Check the result page for more info.
CCExtractor CI platform finished running the test files on windows. Below is a summary of the test results, when compared to test for commit 9f250b1...:
Your PR breaks these cases:
NOTE: The following tests have been failing on the master branch as well as the PR:
Congratulations: Merging this PR would fix the following tests:
It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you). Check the result page for more info.
When streaming subtitles (particularly DVBSUB) from ccextractor to WebSocket endpoints via tools like websocat, multi-line subtitles cause issues. Each line is sent as a separate message, resulting in only the last line being visible at the receiving end.
For example, using the following pipeline:
multi-line subtitle frames are sent line-by-line, losing all but the final line.
This PR introduces the --null-terminated option, which appends a null character (\0) as a frame delimiter after each complete subtitle frame (whether single or multi-line). This enables proper frame boundaries for streaming scenarios.
Then, it'll be possible to create the following pipeline:
With this change, websocat's -0 flag can properly parse complete subtitle frames using the null delimiter (see websocat documentation).
Benefits:
Please compare the following two output files, where with --null-terminated enabled new lines in multi-line subtitles were preserved and all frames end with \0.
--out=webvtt: ccextractor_webvtt.txt
--out=txt --null-terminated: ccextractor_txt_null-terminated.txt