[server] TabletServer fails to restart after unclean shutdown due to EOFException in log recovery #2941

@LiebingYu

Search before asking

  • I searched in the issues and found nothing similar.

Fluss version

main (development)

Please describe the bug 🐞

After an unclean shutdown (e.g., SIGKILL, OOM, power failure), TabletServer fails to restart due to a FlussRuntimeException wrapping an EOFException during log recovery. The server becomes completely unrecoverable without manual intervention.

On unclean shutdown, recoverSegment() is only called when sanityCheck() throws (i.e., the index file is corrupt). When the index file is intact but the .log file has a truncated tail (the most common unclean-shutdown scenario), sanityCheck() passes, recoverSegment() is skipped, and a subsequent call through readNextOffset() → maxTimestampSoFar() → readLargestTimestamp() hits the partial batch and crashes.
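To illustrate what recovering from a truncated tail generally involves, here is a minimal, self-contained sketch. It is not Fluss code: the class `TruncatedTailSketch`, its methods, and the record framing (a 4-byte big-endian length followed by that many payload bytes) are assumptions made purely for illustration, and the real batch layout differs. The idea is to scan the file from the start, stop at the first record that does not fit in the remaining bytes, and truncate the file to the end of the last complete record.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch, NOT the Fluss implementation.
// Assumed framing: [4-byte big-endian length N][N payload bytes].
class TruncatedTailSketch {
    static final int LENGTH_FIELD_SIZE = 4;

    /** Returns how many leading bytes of the file form complete records. */
    static long validPrefixLength(FileChannel channel) throws IOException {
        long size = channel.size();
        long position = 0;
        ByteBuffer lengthField = ByteBuffer.allocate(LENGTH_FIELD_SIZE);
        while (size - position >= LENGTH_FIELD_SIZE) {
            lengthField.clear();
            long readPos = position;
            // Positional reads may return fewer bytes than requested, so loop.
            while (lengthField.hasRemaining()) {
                int n = channel.read(lengthField, readPos);
                if (n < 0) {
                    return position; // unexpected end of file
                }
                readPos += n;
            }
            lengthField.flip();
            int payloadLength = lengthField.getInt();
            if (payloadLength < 0) {
                return position; // corrupt length field, stop here
            }
            long end = position + LENGTH_FIELD_SIZE + payloadLength;
            if (end > size) {
                return position; // partial record at the tail
            }
            position = end;
        }
        return position; // fewer than 4 bytes left: a partial length field
    }

    /** Drops any partial record at the tail and returns the new file length. */
    static long truncateToValidPrefix(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(
                file, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            long valid = validPrefixLength(channel);
            if (valid < channel.size()) {
                channel.truncate(valid);
            }
            return valid;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("segment", ".log");
        ByteBuffer data = ByteBuffer.allocate(4 + 8 + 4 + 3);
        data.putInt(8).put(new byte[8]);   // one complete record
        data.putInt(16).put(new byte[3]);  // claims 16 bytes, only 3 were written
        Files.write(file, data.array());
        System.out.println("valid prefix: " + truncateToValidPrefix(file));
        System.out.println("file size after truncate: " + Files.size(file));
        Files.delete(file);
    }
}
```

In this sketch the scan never throws on a short tail; it treats the first incomplete record as the end of valid data, which is the behavior the restart path would need for the scenario described above.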

Steps to Reproduce:

  1. Start a TabletServer and produce data to a table (e.g., t_geely_2x_freeze_frame_info, bucket 33).
  2. Force-kill the TabletServer process (kill -9) while writes are in progress, ensuring the active log segment has unflushed data.
  3. Restart the TabletServer.
  4. Observe that the server fails to start with the stack trace below.

Stack Trace:

org.apache.fluss.exception.FlussRuntimeException: Failed to recovery log
    at org.apache.fluss.server.log.LogManager.loadLogs(LogManager.java:207)
    at org.apache.fluss.server.log.LogManager.startup(LogManager.java:139)
    at org.apache.fluss.server.tablet.TabletServer.startServices(TabletServer.java:228)
    ...
Caused by: org.apache.fluss.exception.FlussRuntimeException: Failed to load record batch at position 495460312
    from FileRecords(size=495460352, file=.../00000000000000000000.log, start=0, end=2147483647)
    at org.apache.fluss.record.FileLogInputStream$FileChannelLogRecordBatch.loadByteBufferWithSize(FileLogInputStream.java:222)
    at org.apache.fluss.record.FileLogInputStream$FileChannelLogRecordBatch.loadBatchHeader(FileLogInputStream.java:211)
    at org.apache.fluss.record.FileLogInputStream$FileChannelLogRecordBatch.commitTimestamp(FileLogInputStream.java:134)
    at org.apache.fluss.record.FileLogRecords.largestTimestampAfter(FileLogRecords.java:386)
    at org.apache.fluss.server.log.LogSegment.readLargestTimestamp(LogSegment.java:644)
    at org.apache.fluss.server.log.LogSegment.readMaxTimestampAndStartOffsetSoFar(LogSegment.java:214)
    at org.apache.fluss.server.log.LogSegment.maxTimestampSoFar(LogSegment.java:200)
    at org.apache.fluss.server.log.LogSegment.recover(LogSegment.java:319)
    at org.apache.fluss.server.log.LogLoader.recoverSegment(LogLoader.java:269)
    at org.apache.fluss.server.log.LogLoader.recoverLog(LogLoader.java:168)
    ...
Caused by: java.io.EOFException: Failed to read `record batch header` from file channel.
    Expected to read 48 bytes, but reached end of file after reading 40 bytes.
    Started read from position 495460312.
    at org.apache.fluss.utils.FileUtils.readFullyOrFail(FileUtils.java:110)
    at org.apache.fluss.utils.FileUtils.loadByteBufferFromFile(FileUtils.java:138)
    ...
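The numbers in the trace are consistent with a partial batch header at the tail: the file size is 495460352 and the read starts at position 495460312, which leaves 40 bytes where a 48-byte header is expected. The small check below restates that arithmetic. The class and method names are hypothetical, and the 48-byte header size is taken from the exception message rather than from the Fluss source.

```java
// Hypothetical helper, not part of Fluss: restates the arithmetic in the trace.
class TailCheck {
    /** Bytes left in the file from the given position. */
    static long bytesAvailable(long fileSize, long position) {
        return fileSize - position;
    }

    /** True when fewer bytes remain than a full header needs. */
    static boolean headerIsTruncated(long fileSize, long position, int headerSize) {
        long available = bytesAvailable(fileSize, position);
        return available >= 0 && available < headerSize;
    }

    public static void main(String[] args) {
        long fileSize = 495460352L; // FileRecords(size=...) in the trace
        long position = 495460312L; // "Started read from position ..."
        int headerSize = 48;        // "Expected to read 48 bytes" (from the message)
        System.out.println(bytesAvailable(fileSize, position));
        System.out.println(headerIsTruncated(fileSize, position, headerSize));
    }
}
```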

Solution

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
