Description
Search before asking
- I searched in the issues and found nothing similar.
Fluss version
main (development)
Please describe the bug 🐞
After an unclean shutdown (e.g., SIGKILL, OOM, power failure), TabletServer fails to restart due to a FlussRuntimeException wrapping an EOFException during log recovery. The server becomes completely unrecoverable without manual intervention.
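On unclean shutdown, recoverSegment() is only called when sanityCheck() throws (i.e., the index file is corrupt). When the index file is intact but the .log file has a truncated tail (the most common unclean shutdown scenario), sanityCheck() passes, recoverSegment() is skipped, and a subsequent call to readNextOffset() → maxTimestampSoFar() → readLargestTimestamp() hits the partial batch and crashes. The stack trace below shows the partial batch: the segment file is 495460352 bytes and the last batch starts at position 495460312, so only 40 bytes remain where a 48-byte batch header is expected.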
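To illustrate the failure mode described above, here is a minimal, standalone sketch of how a truncated tail could be detected and trimmed before anything reads the last batch. It is not Fluss code and does not use any Fluss API. The class name `TailScanSketch`, its methods, and the 12-byte header layout (8-byte base offset, 4-byte payload length) are all hypothetical simplifications chosen for the example; the real batch header is 48 bytes according to the trace, and a real recovery would presumably also validate checksums.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Illustrative sketch only; the layout below is an assumption, not the Fluss batch format. */
final class TailScanSketch {
    // Hypothetical layout: [8-byte base offset][4-byte payload length][payload bytes]
    static final int HEADER_SIZE = 12;

    /** Returns the length in bytes of the longest file prefix made of complete batches. */
    static long validBytes(Path segment) throws IOException {
        try (FileChannel ch = FileChannel.open(segment, StandardOpenOption.READ)) {
            long size = ch.size();
            long pos = 0;
            ByteBuffer header = ByteBuffer.allocate(HEADER_SIZE);
            // Stop as soon as fewer than HEADER_SIZE bytes remain: that is a partial header.
            while (pos + HEADER_SIZE <= size) {
                header.clear();
                while (header.hasRemaining()) {
                    // Absolute-position read; does not move the channel's own position.
                    int n = ch.read(header, pos + header.position());
                    if (n < 0) {
                        return pos; // file shrank underneath us; treat as truncated
                    }
                }
                header.flip();
                header.getLong(); // base offset, unused in this sketch
                int payloadLength = header.getInt();
                // A negative length or a payload running past EOF means a partial/corrupt batch.
                if (payloadLength < 0 || pos + HEADER_SIZE + payloadLength > size) {
                    break;
                }
                pos += HEADER_SIZE + payloadLength;
            }
            return pos;
        }
    }

    /** Truncates the file to its last complete batch and returns the new length. */
    static long truncateToValidBytes(Path segment) throws IOException {
        long valid = validBytes(segment);
        try (FileChannel ch = FileChannel.open(segment, StandardOpenOption.WRITE)) {
            if (valid < ch.size()) {
                ch.truncate(valid);
            }
        }
        return valid;
    }
}
```

With this layout, a file holding two complete batches followed by a few stray bytes is reported as valid only up to the end of the second batch, and truncating to that length removes the partial tail. The point of the sketch is the ordering: the tail is validated by length before any code tries to decode the final batch header.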
Steps to Reproduce:
- Start a TabletServer and produce data to a table (e.g., t_geely_2x_freeze_frame_info, bucket 33).
- Force-kill the TabletServer process (kill -9) while writes are in progress, ensuring the active log segment has unflushed data.
- Restart the TabletServer.
- Observe that the server fails to start with the stack trace below.
Stack Trace:
org.apache.fluss.exception.FlussRuntimeException: Failed to recovery log
at org.apache.fluss.server.log.LogManager.loadLogs(LogManager.java:207)
at org.apache.fluss.server.log.LogManager.startup(LogManager.java:139)
at org.apache.fluss.server.tablet.TabletServer.startServices(TabletServer.java:228)
...
Caused by: org.apache.fluss.exception.FlussRuntimeException: Failed to load record batch at position 495460312
from FileRecords(size=495460352, file=.../00000000000000000000.log, start=0, end=2147483647)
at org.apache.fluss.record.FileLogInputStream$FileChannelLogRecordBatch.loadByteBufferWithSize(FileLogInputStream.java:222)
at org.apache.fluss.record.FileLogInputStream$FileChannelLogRecordBatch.loadBatchHeader(FileLogInputStream.java:211)
at org.apache.fluss.record.FileLogInputStream$FileChannelLogRecordBatch.commitTimestamp(FileLogInputStream.java:134)
at org.apache.fluss.record.FileLogRecords.largestTimestampAfter(FileLogRecords.java:386)
at org.apache.fluss.server.log.LogSegment.readLargestTimestamp(LogSegment.java:644)
at org.apache.fluss.server.log.LogSegment.readMaxTimestampAndStartOffsetSoFar(LogSegment.java:214)
at org.apache.fluss.server.log.LogSegment.maxTimestampSoFar(LogSegment.java:200)
at org.apache.fluss.server.log.LogSegment.recover(LogSegment.java:319)
at org.apache.fluss.server.log.LogLoader.recoverSegment(LogLoader.java:269)
at org.apache.fluss.server.log.LogLoader.recoverLog(LogLoader.java:168)
...
Caused by: java.io.EOFException: Failed to read `record batch header` from file channel.
Expected to read 48 bytes, but reached end of file after reading 40 bytes.
Started read from position 495460312.
at org.apache.fluss.utils.FileUtils.readFullyOrFail(FileUtils.java:110)
at org.apache.fluss.utils.FileUtils.loadByteBufferFromFile(FileUtils.java:138)
...
Solution
No response
Are you willing to submit a PR?
- I'm willing to submit a PR!