
[FEATURE] Allow output \0 terminated frames (for WebSocket streaming support) #2105

Merged
cfsmp3 merged 15 commits into CCExtractor:master from pszemus:null-terminated-frames
Mar 19, 2026

Conversation

@pszemus
Contributor

@pszemus pszemus commented Feb 10, 2026

In raising this pull request, I confirm the following (please check boxes):

  • I have read and understood the contributors guide.
  • I have checked that another pull request for this purpose does not exist.
  • I have considered, and confirmed that this submission will be valuable to others.
  • I accept that this submission may not be used, and the pull request closed at the will of the maintainer.
  • I give this submission freely, and claim no ownership to its content.
  • I have mentioned this change in the changelog.

My familiarity with the project is as follows (check one):

  • I have never used CCExtractor.
  • I have used CCExtractor just a couple of times.
  • I absolutely love CCExtractor, but have not contributed previously.
  • I am an active contributor to CCExtractor.

When streaming subtitles (particularly DVBSUB) from ccextractor to WebSocket endpoints via tools like websocat, multi-line subtitles cause issues. Each line is sent as a separate message, resulting in only the last line being visible at the receiving end.

For example, using the following pipeline:

ccextractor --udp <src_stream_address> --codec dvbsub --out=txt --stdout --forceflush | websocat ws://<endpoint-uri>

multi-line subtitle frames are sent line-by-line, losing all but the final line.

This PR introduces the --null-terminated option, which appends a null character (\0) as a frame delimiter after each complete subtitle frame (whether single or multi-line). This enables proper frame boundaries for streaming scenarios.

Then, it'll be possible to create the following pipeline:

ccextractor --udp <src_stream_address> --codec dvbsub --out=txt --null-terminated --stdout --forceflush | websocat -0 ws://<endpoint-uri>

With this change, websocat's -0 flag can properly parse complete subtitle frames using the null delimiter (see websocat documentation).
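
For illustration, the receiving side of a NUL-delimited stream can be sketched roughly as below. This is only a minimal sketch of the splitting idea, not websocat's implementation; `split_frames` is a hypothetical helper, and it assumes an unterminated trailing chunk should be held back as incomplete.

```rust
/// Split a byte stream into frames on a delimiter byte.
/// Hypothetical helper for illustration; a trailing chunk that has not yet
/// been terminated is treated as incomplete and is not returned.
fn split_frames(stream: &[u8], delimiter: u8) -> Vec<&[u8]> {
    let mut frames = Vec::new();
    let mut start = 0;
    for (index, &byte) in stream.iter().enumerate() {
        if byte == delimiter {
            // Everything since the previous delimiter is one complete frame.
            frames.push(&stream[start..index]);
            start = index + 1;
        }
    }
    frames
}
```

Splitting `b"line one\nline two\0single\0"` on `0` keeps the two-line subtitle together as one frame, while splitting the same bytes on `b'\n'` cuts it after the first line, which is roughly the line-by-line behaviour described above.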

Benefits:

  • Enables reliable WebSocket streaming of subtitles without data loss
  • Maintains backward compatibility (opt-in feature)
  • Follows established patterns for null-terminated stream processing
  • Simple, focused change that solves a real-world use case

Please compare the following two output files: with --null-terminated enabled, line breaks inside multi-line subtitles are preserved and every frame ends with \0.

@pszemus pszemus force-pushed the null-terminated-frames branch from bdf3aa1 to ff9c160 Compare February 11, 2026 15:42
Contributor

@cfsmp3 cfsmp3 left a comment

Good feature with a clear real-world use case. The implementation is clean and properly wired through both C and Rust. However, the --null-terminated flag currently only works for DVB bitmap subtitles, not for text-based captions (CEA-608/708). This needs to be fixed before merging.

The problem

In src/lib_ccx/ccx_encoders_transcript.c, you replaced encoded_crlf with encoded_end_frame in only one place — the bitmap subtitle path at line 92:

// write_cc_bitmap_as_transcript() — line 92 — ✅ changed
write_wrapped(context->out->fh, context->encoded_end_frame, context->encoded_end_frame_length);

But the text subtitle path (write_cc_buffer_as_transcript) still uses encoded_crlf in two places that also need updating:

// Line 206 — ❌ not changed (end of each subtitle line)
ret = write(context->out->fh, context->encoded_crlf, context->encoded_crlf_length);

// Line 328 — ❌ not changed (end of each subtitle block)
ret = write(context->out->fh, context->encoded_crlf, context->encoded_crlf_length);

There are also lines 77 and 90, where encoded_crlf is used for parsing/splitting tokens — those should probably stay as-is since they're detecting line breaks within the input, not writing output.

How to verify

I tested with a CEA-608 stream:

./ccextractor input.ts --txt --stdout --null-terminated 2>/dev/null | xxd | head -30

The output contains only 0d 0a (CRLF) — zero null bytes. The flag has no effect for text-based captions.
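
The same check can be scripted instead of reading the xxd dump by eye. A minimal sketch, assuming the captured output is already in memory; `count_terminators` is a hypothetical helper, not part of CCExtractor.

```rust
/// Count NUL bytes and CRLF pairs in captured transcript output.
/// Returns `(nul_count, crlf_count)`. Hypothetical helper for illustration.
fn count_terminators(output: &[u8]) -> (usize, usize) {
    // Every 0x00 byte is a candidate frame terminator.
    let nul_count = output.iter().filter(|&&byte| byte == 0).count();
    // Slide a two-byte window over the output to find 0x0d 0x0a pairs.
    let crlf_count = output
        .windows(2)
        .filter(|pair| pair[0] == b'\r' && pair[1] == b'\n')
        .count();
    (nul_count, crlf_count)
}
```

A result with a zero NUL count and a non-zero CRLF count corresponds to the situation described here, where the flag had no visible effect.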

What to fix

In src/lib_ccx/ccx_encoders_transcript.c, replace encoded_crlf with encoded_end_frame on lines 206 and 328 (the two write() calls in write_cc_buffer_as_transcript). Leave lines 77 and 90 alone — those are input parsing, not output.

Note: you'll also need to update the ret < context->encoded_crlf_length comparisons on lines 207 and 329 to use encoded_end_frame_length accordingly.
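
To illustrate why the length comparison has to follow the buffer: assuming the frame terminator is a single NUL byte while CRLF is two bytes, a fully successful one-byte write compared against the CRLF length would look like a short write. A hedged Rust sketch of the pattern follows; the C code uses write() with the context fields, and `write_terminator` is only a hypothetical stand-in.

```rust
use std::io::{self, Write};

/// Write a terminator and verify the byte count against the SAME buffer.
/// Hypothetical stand-in for the C write()/length check discussed above.
fn write_terminator<W: Write>(out: &mut W, terminator: &[u8]) -> io::Result<()> {
    let written = out.write(terminator)?;
    // Compare against the length of the buffer that was actually written,
    // not against the length of some other terminator.
    if written < terminator.len() {
        return Err(io::Error::new(
            io::ErrorKind::WriteZero,
            "short write of frame terminator",
        ));
    }
    Ok(())
}
```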

@pszemus
Contributor Author

pszemus commented Feb 16, 2026

Thanks @cfsmp3, I've fixed the missing code paths.
With my test file, setting --null-terminated now changes the output from:

00000000: 5745 4c4c 2c20 4920 4755 4553 5320 594f  WELL, I GUESS YO
00000010: 5520 434f 554c 4420 5341 5920 5448 4154  U COULD SAY THAT
00000020: 0d0a 4920 4341 5245 2e2e 2e42 4543 4155  ..I CARE...BECAU
00000030: 5345 2049 2042 524f 5547 4854 2059 4f55  SE I BROUGHT YOU
00000040: 0d0a 494e 544f 2054 4849 5320 574f 524c  ..INTO THIS WORL
00000050: 442e 0d0a

to:

00000000: 5745 4c4c 2c20 4920 4755 4553 5320 594f  WELL, I GUESS YO
00000010: 5520 434f 554c 4420 5341 5920 5448 4154  U COULD SAY THAT
00000020: 0049 2043 4152 452e 2e2e 4245 4341 5553  .I CARE...BECAUS
00000030: 4520 4920 4252 4f55 4748 5420 594f 5500  E I BROUGHT YOU.
00000040: 494e 544f 2054 4849 5320 574f 524c 442e  INTO THIS WORLD.
00000050: 00

@pszemus pszemus requested a review from cfsmp3 February 25, 2026 16:08
Contributor

@cfsmp3 cfsmp3 left a comment

Thanks for addressing the previous feedback — the C paths all work now. However there's still one path that doesn't respect --null-terminated:

CEA-708 via the Rust decoder: src/rust/src/decoder/tv_screen.rs:353 hardcodes \r\n:

writer.write_to_file(b"\r\n")?;

This means --null-terminated has no effect on CEA-708 transcript output. You can verify:

ccextractor input.ts --txt -o /tmp/test.txt --null-terminated -svc 1
xxd /tmp/test.p1.svc01.txt | head -20
# No null bytes — only 0d 0a

The frame_terminator_0 option needs to be plumbed into the Rust Writer struct so that write_transcript can use it instead of the hardcoded \r\n.
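
One possible shape for that plumbing, as a minimal sketch only. `TranscriptWriter`, its fields, and `write_terminated` are hypothetical stand-ins; the real `Writer` struct and `write_to_file` in the Rust decoder have their own fields and signatures, which are not reproduced here.

```rust
use std::io::{self, Write};

const CRLF: &[u8] = b"\r\n";
const NUL: &[u8] = b"\0";

/// Hypothetical, simplified writer. It is NOT CCExtractor's `Writer` struct.
struct TranscriptWriter<W: Write> {
    sink: W,
    /// Mirrors the `frame_terminator_0` option named in the review.
    frame_terminator_0: bool,
}

impl<W: Write> TranscriptWriter<W> {
    /// Pick the terminator from the option instead of hardcoding CRLF.
    fn terminator(&self) -> &'static [u8] {
        if self.frame_terminator_0 {
            NUL
        } else {
            CRLF
        }
    }

    /// Write `text` followed by the configured terminator.
    fn write_terminated(&mut self, text: &str) -> io::Result<()> {
        let terminator = self.terminator();
        self.sink.write_all(text.as_bytes())?;
        self.sink.write_all(terminator)
    }
}
```

Carrying the choice on the writer keeps the decision in one place, so call sites do not each need to know about the command-line flag.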

@SuvidhJ

SuvidhJ commented Mar 2, 2026

Hi @pszemus, I noticed the latest review feedback about plumbing frame_terminator_0 into the Rust Writer struct for CEA-708 support. I'd be happy to help with this if you'd like, just let me know!

@pszemus pszemus force-pushed the null-terminated-frames branch from 224d594 to b5312f8 Compare March 10, 2026 11:26
@pszemus
Contributor Author

pszemus commented Mar 10, 2026

@cfsmp3 Thanks! I've made the necessary changes and the project builds well, but the Rust format check fails with a "too many arguments" error from clippy. Could you please review my changes?

@pszemus pszemus requested a review from cfsmp3 March 10, 2026 11:41
@pszemus
Contributor Author

pszemus commented Mar 13, 2026

> Hi @pszemus, I noticed the latest review feedback about plumbing frame_terminator_0 into the Rust Writer struct for CEA-708 support. I'd be happy to help with this if you'd like, just let me know!

Hi @SuvidhJ, it would be much appreciated if you could review the changes I made in the Rust decoder.

Contributor

@cfsmp3 cfsmp3 left a comment

Thanks for the update — the DVBSUB bitmap path works correctly, and the Rust formatting/clippy issues are resolved. However, --null-terminated only produces correct frame-level \0 delimiters on the DVBSUB (bitmap/OCR) path. On CEA-608 and CEA-708, the \0 is written per line, not per frame, which breaks the websocat -0 use case for those codecs and contradicts the PR description.

How to reproduce

CEA-608:

./ccextractor sample.ts -out=txt --null-terminated -o /tmp/test.txt
xxd /tmp/test.txt | head -20

You'll see \0 after every individual line, not after each complete subtitle frame. A two-line pop-on caption like:

♪ So no one told you
life was gonna be this way ♪

produces line1\0line2\0 instead of the expected line1\nline2\0.

CEA-708:

./ccextractor sample.ts -out=txt --null-terminated -svc 1 -o /tmp/test708.txt
xxd /tmp/test708.p1.svc01.txt | head -20

Same issue — \0 per row instead of per frame.

Root cause

Three code paths need fixing:

  1. write_cc_line_as_transcript2 (CEA-608, ccx_encoders_transcript.c ~line 325): This function is called per line and writes encoded_end_frame at the end of each individual line. The caller write_cc_buffer_as_transcript2 iterates over the 15 rows and calls this function for each used row. Fix: keep using encoded_crlf (or \n) as the separator between lines within a frame, and only write encoded_end_frame once after the last line. Since write_cc_line_as_transcript2 doesn't know whether it's the last line, the frame terminator should be moved to the caller (write_cc_buffer_as_transcript2) after the row loop.

  2. write_cc_subtitle_as_transcript (ccx_encoders_transcript.c ~line 203): The do...while (strtok_r) loop writes encoded_end_frame after each token (line). Fix: use \n (or encoded_crlf) between tokens within the loop, and write encoded_end_frame once after the loop exits.

  3. CEA-708 Rust path (tv_screen.rs ~line 353-354): end_frame is written inside the row iteration loop. Fix: move the write_to_file(&end_frame) call outside the for row_index in ... loop, writing it once after all rows are emitted. Use \n or encoded_crlf between rows within the loop (the current line separator behavior).
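
The difference between the two shapes can be sketched in a few lines. Both functions below are hypothetical illustrations with made-up names; they are not the C or Rust functions listed above.

```rust
/// The shape flagged as wrong: the terminator is written after every row.
fn terminate_every_row(rows: &[&str], terminator: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    for row in rows {
        out.extend_from_slice(row.as_bytes());
        out.extend_from_slice(terminator);
    }
    out
}

/// The shape asked for: a separator between rows, then one terminator
/// after the whole frame has been emitted.
fn terminate_frame(rows: &[&str], separator: &str, terminator: &[u8]) -> Vec<u8> {
    let mut out = rows.join(separator).into_bytes();
    out.extend_from_slice(terminator);
    out
}
```

For the rows `["line1", "line2"]` with a NUL terminator, the first function yields `line1\0line2\0` and the second, with `"\n"` as separator, yields `line1\nline2\0`.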

Why the bitmap path works

In write_cc_bitmap_as_transcript, the entire subtitle text is processed first (internal CRLFs are replaced with spaces), then encoded_end_frame is written once at the end. This is the correct pattern — the other paths should follow a similar structure.
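
As a rough sketch of that pattern, assuming the subtitle text arrives as one string with CRLF between its lines; `flatten_and_terminate` is a hypothetical name and does not reproduce the C function:

```rust
/// Replace internal CRLFs with spaces, then append the terminator once.
/// Hypothetical illustration of the bitmap-path pattern described above.
fn flatten_and_terminate(text: &str, terminator: &[u8]) -> Vec<u8> {
    // The whole subtitle is processed first...
    let mut out = text.replace("\r\n", " ").into_bytes();
    // ...and the frame terminator is written exactly once at the end.
    out.extend_from_slice(terminator);
    out
}
```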

Testing checklist

After fixing, verify with xxd that:

  • CEA-608 multi-line pop-on: \n between lines, single \0 at frame end
  • CEA-608 single-line: single \0 at frame end
  • CEA-708 multi-line: \n between lines, single \0 at frame end
  • Normal mode (without --null-terminated): output identical to master (no regression)
  • --lf mode: \n line terminators still work as before
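
Part of this checklist could be automated. The sketch below is a hypothetical checker (`check_frames` is not an existing tool) that assumes `\n` separates lines inside a frame and a single `\0` ends each frame; it reports how many lines each frame contains so the result can be compared with the expected captions.

```rust
/// Validate a NUL-terminated frame stream and return the number of lines
/// in each frame. Hypothetical checker for illustration only.
fn check_frames(output: &[u8]) -> Result<Vec<usize>, String> {
    if output.is_empty() {
        return Ok(Vec::new());
    }
    // The stream must end with a frame terminator.
    if output.last() != Some(&0u8) {
        return Err("stream does not end with a NUL terminator".to_string());
    }
    // Drop the final terminator so splitting does not yield a trailing empty frame.
    let body = &output[..output.len() - 1];
    let mut line_counts = Vec::new();
    for (index, frame) in body.split(|&byte| byte == 0).enumerate() {
        // An empty frame means two terminators were written back to back.
        if frame.is_empty() {
            return Err(format!("frame {} is empty (doubled NUL?)", index));
        }
        line_counts.push(frame.split(|&byte| byte == b'\n').count());
    }
    Ok(line_counts)
}
```

Note that a per-line stream such as `line1\0line2\0` still passes the structural checks; it shows up as two one-line frames where one two-line frame was expected, so the line counts are what to compare.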

@pszemus
Copy link
Contributor Author

pszemus commented Mar 18, 2026

Thanks @cfsmp3, I must have missed those paths.
I think I've fixed them now. I moved the writing of line endings to the beginning of the loop and ended the loop with encoded_end_frame. This removes the double CRLF when --null-terminated is not set and keeps the original behaviour.
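
A rough sketch of that loop shape, under the assumption that the frame terminator equals the line ending when --null-terminated is not set. `write_frame` and its parameters are hypothetical names, not the functions in ccx_encoders_transcript.c.

```rust
/// Emit one frame: a line ending before every row except the first,
/// then the frame terminator once after the loop. Hypothetical sketch.
fn write_frame(rows: &[&str], line_end: &[u8], end_frame: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    for (index, row) in rows.iter().enumerate() {
        // Writing the line ending at the start of the iteration means the
        // last row is never followed by one.
        if index > 0 {
            out.extend_from_slice(line_end);
        }
        out.extend_from_slice(row.as_bytes());
    }
    // With end_frame == line_end this matches "line ending after every row".
    out.extend_from_slice(end_frame);
    out
}
```

With CRLF for both arguments, two rows come out as `a\r\nb\r\n`, with no doubled CRLF at the end; with a NUL as `end_frame`, the same rows come out as `a\r\nb\0`.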

@ccextractor-bot
Collaborator

CCExtractor CI platform finished running the test files on linux. Below is a summary of the test results, when compared to test for commit 9f250b1...:
Report Name    Tests Passed
Broken         9/13
CEA-708        1/14
DVB            3/7
DVD            3/3
DVR-MS         2/2
General        20/27
Hardsubx       1/1
Hauppage       3/3
MP4            3/3
NoCC           10/10
Options        77/86
Teletext       20/21
WTV            13/13
XDS            31/34

Your PR breaks these cases:

  • ccextractor --autoprogram --out=ttxt --latin1 --ucla --xds 8e8229b88b...
  • ccextractor --autoprogram --out=srt --latin1 --quant 0 85271be4d2...
  • ccextractor --autoprogram --out=ttxt --latin1 132d7df7e9...
  • ccextractor --autoprogram --out=ttxt --latin1 99e5eaafdc...
  • ccextractor --autoprogram --out=srt --latin1 b22260d065...
  • ccextractor --autoprogram --out=ttxt --latin1 --ucla 7aad20907e...
  • ccextractor --autoprogram --out=ttxt --latin1 --ucla dab1c1bd65...
  • ccextractor --autoprogram --out=ttxt --latin1 01509e4d27...
  • ccextractor --out=srt --latin1 --autoprogram 29e5ffd34b...
  • ccextractor --out=spupng c83f765c66...
  • ccextractor --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
  • ccextractor --startcreditsnotbefore 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
  • ccextractor --startcreditsforatleast 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
  • ccextractor --autoprogram --out=ttxt --xds --latin1 --ucla 85058ad37e...
  • ccextractor --autoprogram --out=srt --latin1 --ucla b22260d065...
  • ccextractor --autoprogram --out=ttxt --latin1 --ucla --xds 7f41299cc7...

NOTE: The following tests have been failing on the master branch as well as the PR:

Congratulations: Merging this PR would fix the following tests:

  • ccextractor --startcreditsnotafter 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsforatmost 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never

It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you).

Check the result page for more info.

@pszemus pszemus requested a review from cfsmp3 March 18, 2026 12:11
@ccextractor-bot
Collaborator

CCExtractor CI platform finished running the test files on windows. Below is a summary of the test results, when compared to test for commit 9f250b1...:
Report Name    Tests Passed
Broken         9/13
CEA-708        1/14
DVB            4/7
DVD            3/3
DVR-MS         2/2
General        22/27
Hardsubx       1/1
Hauppage       3/3
MP4            3/3
NoCC           10/10
Options        81/86
Teletext       20/21
WTV            13/13
XDS            31/34

Your PR breaks these cases:

  • ccextractor --autoprogram --out=ttxt --latin1 --ucla --xds 8e8229b88b...
  • ccextractor --autoprogram --out=ttxt --latin1 132d7df7e9...
  • ccextractor --autoprogram --out=ttxt --latin1 99e5eaafdc...
  • ccextractor --autoprogram --out=srt --latin1 b22260d065...
  • ccextractor --autoprogram --out=ttxt --latin1 --ucla 7aad20907e...
  • ccextractor --autoprogram --out=ttxt --latin1 01509e4d27...
  • ccextractor --autoprogram --out=ttxt --xds --latin1 --ucla 85058ad37e...
  • ccextractor --autoprogram --out=srt --latin1 --ucla b22260d065...
  • ccextractor --autoprogram --out=ttxt --latin1 --ucla --xds 7f41299cc7...

NOTE: The following tests have been failing on the master branch as well as the PR:

Congratulations: Merging this PR would fix the following tests:

  • ccextractor --autoprogram --out=srt --latin1 --quant 0 85271be4d2..., Last passed: Never
  • ccextractor --autoprogram --out=ttxt --latin1 --ucla dab1c1bd65..., Last passed: Never
  • ccextractor --out=srt --latin1 --autoprogram 29e5ffd34b..., Last passed: Never
  • ccextractor --out=spupng c83f765c66..., Last passed: Never
  • ccextractor --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsnotbefore 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsnotafter 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsforatleast 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsforatmost 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never

It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you).

Check the result page for more info.

@cfsmp3 cfsmp3 merged commit 03ad9e8 into CCExtractor:master Mar 19, 2026
25 of 28 checks passed

4 participants