backup: Phase 0b M6 implementation - cmd/elastickv-snapshot-encode + library#904
backup: Phase 0b M6 implementation - cmd/elastickv-snapshot-encode + library#904bootjp wants to merge 17 commits into
Conversation
Implementation of the M6 design (this PR's earlier doc commits 31c0a0f through 1736435). Wires the merged M1-M5 encoder slices into a user-facing CLI plus a library entrypoint, with the round-trip self-test the parent doc mandates. ## What ships - internal/backup/manifest.go: Exclusions gains RenameS3Collisions bool with json tag rename_s3_collisions. Intentionally NOT added to exclusionsRequiredFields so legacy manifests decode safely with the zero value false. Pinned by TestExclusionsLegacyManifestOmitsRenameS3Collisions. - cmd/elastickv-snapshot-decode/main.go: emitManifest now populates the new field from cfg.renameCollisions, so M6's self-test can thread the same option back through DecodeSnapshot. - internal/backup/encode_info.go: ENCODE_INFO.json schema with format_version gate, NewEncodeInfo constructor, WriteEncodeInfo / ReadEncodeInfo helpers, and EncodeInfoSidecarPath that derives the sidecar from the .fsm path (<output>.encode_info.json — no static name collisions). - internal/backup/encode_snapshot.go: EncodeSnapshot(opts, out) library entrypoint mirroring DecodeSnapshot. Dispatches per-adapter encoders in canonical fan-out order (redis -> dynamodb -> s3 -> sqs), and implements the two-mode buffering model: - SelfTest=false: stream FSM straight to out with a sha256.Writer tee. - SelfTest=true: buffer FSM in *bytes.Buffer, fire the unexported corruptBufferForTest hook (if set), self-test against the buffer, copy to out ONLY on match. Corruption hooks reach the self-test decode but never reach out (write-then-rename atomicity). - cmd/elastickv-snapshot-encode/main.go: CLI with --input / --output / --adapter / --last-commit-ts / --self-test / --scratch-root flags; decoder-parity adapter CSV parser; fail-closed T >= manifest.last_commit_ts validation; atomic publish via <output>.tmp-<random> with deferred cleanup; sidecar emission with fsync+close discipline. ## Tests (12 new) internal/backup/encode_info_test.go: - TestEncodeInfoRoundTrip - TestEncodeInfoRejectsUnknownFormatVersion - TestExclusionsLegacyManifestOmitsRenameS3Collisions internal/backup/encode_snapshot_test.go: - TestEncodeSnapshotLibraryRoundTrip - TestEncodeSnapshotSelfTestMatchesInput (uses canonicalize-once pattern: encode -> decode -> re-encode self-tests cleanly) - TestEncodeSnapshotSelfTestDetectsCorruption (corruption reaches self-test decode, NEVER reaches the io.Writer) - TestEncodeSnapshotRequiresInputRoot - TestEncodeInfoSidecarPath cmd/elastickv-snapshot-encode/main_test.go: - TestCLIRejectsMissingManifest - TestCLIRejectsUnknownAdapter - TestCLIRejectsLowerLastCommitTSOverride - TestCLIAcceptsEqualAndHigherLastCommitTSOverride (sub-tests for equal + higher) - TestCLIEncodeInfoPathDerivedFromOutput - TestCLIEncodeInfoTwoFilesNoCollision - TestCLIRoundTripSelfTestAllAdapters - TestCLISelfTestFailureLeavesNoFsmAtOutputPath - TestParseLastCommitTS All green (race + no cache). golangci-lint clean. ## Caller audit per CLAUDE.md semantic-change rule - Exclusions struct gained a field. Existing callers either: (a) build the struct via field-tagged literals (decoder CLI's emitManifest — updated to populate the new field), or (b) read it (encoder's buildSelfTestDecodeOptions — new code). No silent semantic change for any pre-existing caller. - DecodeOptions.RenameS3Collisions was already a public field used by the decoder; the encoder now also reads it via the manifest. No caller-side change needed. ## Self-review 1. Data loss: write-then-rename atomic publish — self-test failure never publishes the .fsm. Mismatch txt + corrupt buffer never reach the io.Writer (pinned by TestEncodeSnapshotSelfTestDetectsCorruption asserting out.Len()==0). 2. Concurrency: pure offline. The CLI is a single-shot binary; library entrypoint takes an io.Writer the caller owns. 3. Performance: SelfTest=false streams with one sha256 tee, no extra allocations. SelfTest=true allocates one FSM-sized buffer plus the scratch decode tree (documented memory cost). 4. Data consistency: --last-commit-ts T < manifest is fail-closed with typed error; self-test threads MANIFEST DecodeOptions (Exclusions.* + DynamoDBLayout -> DynamoDBBundleJSONL) so trees produced with non-default decoder flags round-trip cleanly. 5. Test coverage: 12 new tests cover library entrypoint, CLI flag parsing, atomic publish discipline, sidecar path-derivation, corruption detection, forward-compat for legacy manifests.
|
Warning Review limit reached
More reviews will be available in 6 minutes and 13 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (4)
📝 WalkthroughWalkthroughThis PR adds a snapshot encoding pipeline: a new elastickv-snapshot-encode CLI, a library EncodeSnapshot with optional on-disk self-test and deterministic structural diffing, ENCODE_INFO.json sidecar schema and helpers, atomic temp→final publish with randomized temp names, and a manifest Exclusions flag ChangesSnapshot Encoding Pipeline
Sequence Diagram(s)sequenceDiagram
participant Input as Input Root
participant Encode as EncodeSnapshot
participant Adapter as Adapter Encoders
participant Temp as Temp FSM File
participant SelfTest as Self-Test (optional)
participant Decode as Decode to Scratch
participant Diff as Structural Diff
participant Output as Output Writer
Input->>Encode: EncodeOptions{InputRoot, Adapters, SelfTest}
Encode->>Adapter: Run enabled adapters in order
Adapter->>Temp: Write FSM bytes to temp
alt Self-Test Enabled
Temp->>SelfTest: Fsync and optionally corrupt buffer
SelfTest->>Decode: Rewind and decode FSM
Decode->>Diff: Compare adapter subdirs (streamed)
Diff-->>SelfTest: Mismatch text or empty
alt Mismatch Found
SelfTest-->>Encode: SelfTestMismatchTxt (non-error)
Encode-->>Output: Write nothing, return mismatch
else No Mismatch
SelfTest-->>Encode: SelfTestMatched=true
Encode->>Output: Copy temp to output writer
end
else Self-Test Disabled
Temp->>Output: Stream FSM directly with SHA-256
end
Output-->>Encode: BytesWritten, SHA256, SelfTestRan, SelfTestMatched
sequenceDiagram
participant User
participant CLI as snapshot-encode CLI
participant Input as Input Dir (MANIFEST)
participant Encode as EncodeSnapshot
participant Temp as Temp FSM File
participant Output as Final FSM Path
participant Sidecar as ENCODE_INFO.json
User->>CLI: --input INPUT --output OUTPUT [--adapter NAMES] [--last-commit-ts TS] [--self-test]
CLI->>Input: Read and decode MANIFEST.json
Input-->>CLI: Manifest + Exclusions
CLI->>CLI: Validate --last-commit-ts (reject if lower than manifest)
CLI->>CLI: Parse --adapter CSV to AdapterSet
CLI->>Encode: EncodeOptions{InputRoot, Adapters, LastCommitTS, SelfTest}
Encode-->>CLI: EncodeResult{BytesWritten, SHA256, SelfTestRan, SelfTestMatched, SelfTestMismatchTxt}
alt Self-Test Mismatch
CLI->>Output: Write .mismatch.txt with mismatch text
CLI->>CLI: Return ErrSelfTestMismatch (data-error exit)
else Self-Test Pass or Disabled
CLI->>Temp: Write EncodeResult bytes to random temp file
CLI->>Output: Fsync, close, atomic rename temp→OUTPUT.fsm
CLI->>Sidecar: WriteEncodeInfo(OUTPUT.fsm.encode_info.json)
Sidecar-->>CLI: EncodeInfo{FormatVersion, EncoderVersion, Timestamps, SHA256, Adapters, SelfTest}
CLI->>User: Success with completion log
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@claude review |
|
Claude finished @bootjp's task in 2m 59s —— View job Code Review - PR #904
|
There was a problem hiding this comment.
Code Review
This pull request implements the elastickv-snapshot-encode command-line tool and library, which encodes vendor-independent directory trees into native .fsm files, generates an encode_info.json sidecar, and supports an optional round-trip self-test. The review feedback highlights several critical memory-efficiency improvements to prevent Out Of Memory (OOM) crashes on large production snapshots. Key recommendations include writing the .fsm to a temporary file on disk instead of buffering it in memory during the self-test, streaming the file comparison chunk-by-chunk rather than reading all file bytes into memory, and increasing the entropy of the temporary file suffix to avoid collisions.
| func encodeBuffered(b *snapshotBuilder, opts EncodeOptions, enabled []string, out io.Writer) (EncodeResult, error) { | ||
| var buf bytes.Buffer | ||
| bytesWritten, err := b.WriteTo(&buf) | ||
| if err != nil { | ||
| return EncodeResult{}, errors.WithStack(err) | ||
| } | ||
| bufBytes := buf.Bytes() | ||
| if opts.corruptBufferForTest != nil { | ||
| opts.corruptBufferForTest(bufBytes) | ||
| } | ||
|
|
||
| header, mismatchTxt, matched, stErr := runSelfTest(bufBytes, opts) | ||
| result := EncodeResult{ | ||
| Header: header, | ||
| BytesWritten: bytesWritten, | ||
| SHA256: sha256.Sum256(bufBytes), | ||
| SelfTestRan: true, | ||
| SelfTestMatched: matched, | ||
| SelfTestMismatchTxt: mismatchTxt, | ||
| AdaptersEnabled: enabled, | ||
| } | ||
| if stErr != nil { | ||
| return result, stErr | ||
| } | ||
| if !matched { | ||
| return result, nil | ||
| } | ||
| if _, err := io.Copy(out, bytes.NewReader(bufBytes)); err != nil { | ||
| return result, errors.Wrap(err, "copy buffered fsm to out") | ||
| } | ||
| return result, nil | ||
| } |
There was a problem hiding this comment.
Buffering the entire .fsm file in memory (bytes.Buffer) during the self-test can lead to Out Of Memory (OOM) crashes on large production snapshots. Instead, write the .fsm directly to a temporary file on disk, compute the SHA256 hash in a single pass using the existing sha256Writer, and stream-read from the temporary file for the self-test and copying.
func encodeBuffered(b *snapshotBuilder, opts EncodeOptions, enabled []string, out io.Writer) (EncodeResult, error) {
tempFile, err := os.CreateTemp(opts.SelfTestDecodeOptions.OutRoot, "encode-buffer-")
if err != nil {
return EncodeResult{}, errors.WithStack(err)
}
defer func() {
_ = tempFile.Close()
_ = os.Remove(tempFile.Name())
}()
hashWriter := newSHA256Writer(tempFile)
bytesWritten, err := b.WriteTo(hashWriter)
if err != nil {
return EncodeResult{}, errors.WithStack(err)
}
if opts.corruptBufferForTest != nil {
if _, err := tempFile.Seek(0, io.SeekStart); err != nil {
return EncodeResult{}, errors.WithStack(err)
}
bufBytes, err := io.ReadAll(tempFile)
if err != nil {
return EncodeResult{}, errors.WithStack(err)
}
opts.corruptBufferForTest(bufBytes)
if _, err := tempFile.WriteAt(bufBytes, 0); err != nil {
return EncodeResult{}, errors.WithStack(err)
}
}
header, mismatchTxt, matched, stErr := runSelfTest(tempFile, opts)
result := EncodeResult{
Header: header,
BytesWritten: bytesWritten,
SHA256: hashWriter.Sum(),
SelfTestRan: true,
SelfTestMatched: matched,
SelfTestMismatchTxt: mismatchTxt,
AdaptersEnabled: enabled,
}
if stErr != nil {
return result, stErr
}
if !matched {
return result, nil
}
if _, err := tempFile.Seek(0, io.SeekStart); err != nil {
return result, errors.Wrap(err, "seek temp file for copy")
}
if _, err := io.Copy(out, tempFile); err != nil {
return result, errors.Wrap(err, "copy buffered fsm to out")
}
return result, nil
}| func runSelfTest(fsmBytes []byte, opts EncodeOptions) (SnapshotHeader, []byte, bool, error) { | ||
| scratchBase := opts.SelfTestDecodeOptions.OutRoot | ||
| scratchDir, err := os.MkdirTemp(scratchBase, "encode-self-test-") | ||
| if err != nil { | ||
| return SnapshotHeader{}, nil, false, errors.Wrap(err, "mkdir scratch") | ||
| } | ||
| defer func() { | ||
| _ = os.RemoveAll(scratchDir) | ||
| }() | ||
|
|
||
| decOpts := opts.SelfTestDecodeOptions | ||
| decOpts.OutRoot = scratchDir | ||
|
|
||
| result, derr := DecodeSnapshot(bytes.NewReader(fsmBytes), decOpts) | ||
| if derr != nil { | ||
| // Decoder errored on our own output — that IS a self-test | ||
| // failure (the .fsm we produced isn't loadable). Surface as | ||
| // a mismatch with the decoder error embedded in the txt. | ||
| mismatchTxt := []byte("self-test failed: DecodeSnapshot rejected the produced .fsm: " + derr.Error()) | ||
| return SnapshotHeader{}, mismatchTxt, false, nil | ||
| } | ||
|
|
||
| if result.Header.LastCommitTS != opts.LastCommitTS { | ||
| mismatchTxt := []byte(formatHeaderMismatch(opts.LastCommitTS, result.Header.LastCommitTS)) | ||
| return result.Header, mismatchTxt, false, nil | ||
| } | ||
|
|
||
| diff, derr := diffAdapterTrees(opts.InputRoot, scratchDir, opts.Adapters) | ||
| if derr != nil { | ||
| return result.Header, nil, false, errors.Wrap(derr, "diff scratch tree") | ||
| } | ||
| if len(diff) > 0 { | ||
| return result.Header, []byte(strings.Join(diff, "\n") + "\n"), false, nil | ||
| } | ||
| return result.Header, nil, true, nil | ||
| } |
There was a problem hiding this comment.
Update runSelfTest to accept an *os.File instead of a byte slice to support streaming the .fsm from disk during the self-test, avoiding loading the entire snapshot into memory.
func runSelfTest(fsmFile *os.File, opts EncodeOptions) (SnapshotHeader, []byte, bool, error) {
scratchBase := opts.SelfTestDecodeOptions.OutRoot
scratchDir, err := os.MkdirTemp(scratchBase, "encode-self-test-")
if err != nil {
return SnapshotHeader{}, nil, false, errors.Wrap(err, "mkdir scratch")
}
defer func() {
_ = os.RemoveAll(scratchDir)
}()
decOpts := opts.SelfTestDecodeOptions
decOpts.OutRoot = scratchDir
if _, err := fsmFile.Seek(0, io.SeekStart); err != nil {
return SnapshotHeader{}, nil, false, errors.Wrap(err, "seek temp file")
}
result, derr := DecodeSnapshot(fsmFile, decOpts)
if derr != nil {
// Decoder errored on our own output — that IS a self-test
// failure (the .fsm we produced isn't loadable). Surface as
// a mismatch with the decoder error embedded in the txt.
mismatchTxt := []byte("self-test failed: DecodeSnapshot rejected the produced .fsm: " + derr.Error())
return SnapshotHeader{}, mismatchTxt, false, nil
}
if result.Header.LastCommitTS != opts.LastCommitTS {
mismatchTxt := []byte(formatHeaderMismatch(opts.LastCommitTS, result.Header.LastCommitTS))
return result.Header, mismatchTxt, false, nil
}
diff, derr := diffAdapterTrees(opts.InputRoot, scratchDir, opts.Adapters)
if derr != nil {
return result.Header, nil, false, errors.Wrap(derr, "diff scratch tree")
}
if len(diff) > 0 {
return result.Header, []byte(strings.Join(diff, "\n") + "\n"), false, nil
}
return result.Header, nil, true, nil
}| func diffOneSubdir(aDir, bDir, relPrefix string) ([]string, error) { | ||
| aFiles, aErr := walkRegularFiles(aDir) | ||
| if aErr != nil && !errors.Is(aErr, os.ErrNotExist) { | ||
| return nil, errors.Wrapf(aErr, "walk input %s", aDir) | ||
| } | ||
| bFiles, bErr := walkRegularFiles(bDir) | ||
| if bErr != nil && !errors.Is(bErr, os.ErrNotExist) { | ||
| return nil, errors.Wrapf(bErr, "walk scratch %s", bDir) | ||
| } | ||
|
|
||
| var diffs []string | ||
| aMap := map[string][]byte{} | ||
| for path, body := range aFiles { | ||
| aMap[path] = body | ||
| } | ||
| for path, bBody := range bFiles { | ||
| aBody, ok := aMap[path] | ||
| if !ok { | ||
| diffs = append(diffs, relPrefix+"/"+path+" (missing in input)") | ||
| continue | ||
| } | ||
| if !bytes.Equal(aBody, bBody) { | ||
| diffs = append(diffs, relPrefix+"/"+path+" (bytes differ)") | ||
| } | ||
| delete(aMap, path) | ||
| } | ||
| // Anything remaining in aMap is present in input but not in scratch. | ||
| remaining := make([]string, 0, len(aMap)) | ||
| for path := range aMap { | ||
| remaining = append(remaining, relPrefix+"/"+path+" (missing in scratch)") | ||
| } | ||
| sort.Strings(remaining) | ||
| diffs = append(diffs, remaining...) | ||
| return diffs, nil | ||
| } | ||
|
|
||
| // walkRegularFiles returns a map of relative path -> file bytes for every | ||
| // regular file under root. Missing root is the empty map + os.ErrNotExist. | ||
| // Bounded by the per-adapter test fixtures the encoder runs against; | ||
| // production-scale dumps may want streaming compare, deferred until a | ||
| // real bottleneck shows up. | ||
| func walkRegularFiles(root string) (map[string][]byte, error) { | ||
| out := map[string][]byte{} | ||
| rootInfo, err := os.Stat(root) | ||
| if err != nil { | ||
| return nil, errors.WithStack(err) | ||
| } | ||
| if !rootInfo.IsDir() { | ||
| return nil, errors.Errorf("not a directory: %s", root) | ||
| } | ||
| if err := filepath.WalkDir(root, func(path string, d os.DirEntry, err error) error { | ||
| if err != nil { | ||
| return err | ||
| } | ||
| if d.IsDir() { | ||
| return nil | ||
| } | ||
| if !d.Type().IsRegular() { | ||
| return nil | ||
| } | ||
| body, rerr := os.ReadFile(path) //nolint:gosec // walking a caller-provided root, regular files only | ||
| if rerr != nil { | ||
| return errors.Wrap(rerr, path) | ||
| } | ||
| rel, rerr := filepath.Rel(root, path) | ||
| if rerr != nil { | ||
| return errors.WithStack(rerr) | ||
| } | ||
| out[filepath.ToSlash(rel)] = body | ||
| return nil | ||
| }); err != nil { | ||
| return nil, errors.WithStack(err) | ||
| } | ||
| return out, nil | ||
| } |
There was a problem hiding this comment.
Loading the entire contents of all files in both the input and scratch directories into memory (walkRegularFiles reading all file bytes via os.ReadFile) will cause OOM crashes on large backups (e.g., with large S3 blobs). Instead, walk the directories to collect file sizes, and perform a streaming chunk-by-chunk comparison for files present in both directories.
func diffOneSubdir(aDir, bDir, relPrefix string) ([]string, error) {
aFiles, aErr := walkRegularFiles(aDir)
if aErr != nil && !errors.Is(aErr, os.ErrNotExist) {
return nil, errors.Wrapf(aErr, "walk input %s", aDir)
}
bFiles, bErr := walkRegularFiles(bDir)
if bErr != nil && !errors.Is(bErr, os.ErrNotExist) {
return nil, errors.Wrapf(bErr, "walk scratch %s", bDir)
}
var diffs []string
for path, aSize := range aFiles {
bSize, ok := bFiles[path]
if !ok {
diffs = append(diffs, relPrefix+"/"+path+" (missing in scratch)")
continue
}
if aSize != bSize {
diffs = append(diffs, relPrefix+"/"+path+" (bytes differ)")
continue
}
equal, err := compareFiles(filepath.Join(aDir, path), filepath.Join(bDir, path))
if err != nil {
return nil, err
}
if !equal {
diffs = append(diffs, relPrefix+"/"+path+" (bytes differ)")
}
}
for path := range bFiles {
if _, ok := aFiles[path]; !ok {
diffs = append(diffs, relPrefix+"/"+path+" (missing in input)")
}
}
sort.Strings(diffs)
return diffs, nil
}
func walkRegularFiles(root string) (map[string]int64, error) {
out := map[string]int64{}
rootInfo, err := os.Stat(root)
if err != nil {
return nil, errors.WithStack(err)
}
if !rootInfo.IsDir() {
return nil, errors.Errorf("not a directory: %s", root)
}
if err := filepath.WalkDir(root, func(path string, d os.DirEntry, err error) error {
if err != nil {
return err
}
if d.IsDir() {
return nil
}
if !d.Type().IsRegular() {
return nil
}
info, rerr := d.Info()
if rerr != nil {
return errors.Wrap(rerr, path)
}
rel, rerr := filepath.Rel(root, path)
if rerr != nil {
return errors.WithStack(rerr)
}
out[filepath.ToSlash(rel)] = info.Size()
return nil
}); err != nil {
return nil, errors.WithStack(err)
}
return out, nil
}
func compareFiles(pathA, pathB string) (bool, error) {
fA, err := os.Open(pathA)
if err != nil {
return false, errors.WithStack(err)
}
defer fA.Close()
fB, err := os.Open(pathB)
if err != nil {
return false, errors.WithStack(err)
}
defer fB.Close()
bufA := make([]byte, 64*1024)
bufB := make([]byte, 64*1024)
for {
nA, errA := io.ReadFull(fA, bufA)
nB, errB := io.ReadFull(fB, bufB)
if errA == io.EOF || errA == io.ErrUnexpectedEOF {
errA = nil
}
if errB == io.EOF || errB == io.ErrUnexpectedEOF {
errB = nil
}
if errA != nil {
return false, errors.WithStack(errA)
}
if errB != nil {
return false, errors.WithStack(errB)
}
if nA != nB {
return false, nil
}
if nA == 0 {
break
}
if !bytes.Equal(bufA[:nA], bufB[:nB]) {
return false, nil
}
}
return true, nil
}| // tempSuffixHexLen is the hex-character length of the random | ||
| // suffix appended to <output>.tmp-<hex>; 8 hex chars = 4 bytes of | ||
| // entropy, which gives ~4×10^9 collision space per --output path | ||
| // (more than enough for concurrent encodes against the same path). | ||
| tempSuffixHexLen = 8 | ||
| tempSuffixByteLen = tempSuffixHexLen / 2 |
There was a problem hiding this comment.
Using only 4 bytes of entropy (8 hex characters) for the temporary file suffix has a relatively high probability of collision (due to the Birthday Paradox) in highly concurrent environments or automated test pipelines. Increasing the suffix length to 16 hex characters (8 bytes of entropy) provides a much safer collision-free space.
| // tempSuffixHexLen is the hex-character length of the random | |
| // suffix appended to <output>.tmp-<hex>; 8 hex chars = 4 bytes of | |
| // entropy, which gives ~4×10^9 collision space per --output path | |
| // (more than enough for concurrent encodes against the same path). | |
| tempSuffixHexLen = 8 | |
| tempSuffixByteLen = tempSuffixHexLen / 2 | |
| // tempSuffixHexLen is the hex-character length of the random | |
| // suffix appended to <output>.tmp-<hex>; 16 hex chars = 8 bytes of | |
| // entropy, which gives a collision-free space per --output path. | |
| tempSuffixHexLen = 16 | |
| tempSuffixByteLen = tempSuffixHexLen / 2 |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1fa691fb74
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| return backup.AdapterSet{}, errors.Errorf("unknown adapter %q", name) | ||
| } | ||
| } | ||
| return set, nil |
There was a problem hiding this comment.
Reject empty adapter selections
When --adapter contains only separators or whitespace (for example --adapter ' ,', or a templated value that expands to spaces), every token hits the empty case and this returns a zero AdapterSet with no error. The rest of the encoder then invokes no adapter encoders, publishes a valid header-only .fsm, and exits successfully, which can turn a bad CLI argument into an apparently successful but empty restore artifact; the decoder-side parser already rejects this zero-adapter case.
Useful? React with 👍 / 👎.
| if _, err := fmt.Sscanf(s, "%d", &ts); err != nil { | ||
| return 0, errors.Wrap(err, "--last-commit-ts decimal parse") |
There was a problem hiding this comment.
Reject trailing junk in timestamps
For decimal --last-commit-ts values, fmt.Sscanf with %d accepts a numeric prefix and ignores trailing bytes, so an input like --last-commit-ts 100oops is silently treated as 100. Because this flag controls the restored HLC ceiling and sidecar metadata, malformed operator input should fail instead of publishing a snapshot with a different timestamp than the argument appears to specify.
Useful? React with 👍 / 👎.
Six substantive review findings on the M6 implementation PR. Headline fix: switched the self-test buffer from in-memory bytes.Buffer to a disk-backed temp file so large snapshots cannot OOM the encoder. ## Gemini high #1: SelfTest=true was buffering FSM in memory Out Of Memory risk on production-scale snapshots (gemini high #904). Fixed by writing the FSM through a tee (sha256 + os.File.WriteAt) into an os.CreateTemp file under opts.SelfTestDecodeOptions.OutRoot, then Seek(0) and stream through DecodeSnapshot for the self-test. The temp file is os.Remove'd via defer on every exit path. The contract is unchanged: corruption reaches the self-test decode but never reaches the io.Writer (TestEncodeSnapshotSelfTestDetectsCorruption still asserts out.Len()==0). Memory cost: O(1) for the SelfTest=true path now (previously was an FSM-sized bytes.Buffer on top of the sort working set). ## Gemini high #2: runSelfTest signature Changed from runSelfTest(fsmBytes []byte, ...) to runSelfTest(fsmFile io.Reader, ...) so the .fsm streams through DecodeSnapshot instead of being held as []byte. Same OOM fix at the runSelfTest layer. ## Gemini high #3: walkRegularFiles loaded every file's bytes into memory For large S3 blobs this OOMs the encoder. Replaced walkRegularFiles with walkRegularFilePaths (paths only) and streamFilesEqual (chunk compare at streamCmpBufSize = 64 KiB). Per-file compare cost is now O(streamCmpBufSize), not O(file size). ## Gemini medium: tempSuffixHexLen entropy was 4 bytes Birthday-paradox collision risk in highly concurrent CI environments. Bumped from 8 hex / 4 bytes (2^32 space) to 16 hex / 8 bytes (2^64 space). Same width crypto/rand uses for cryptographic nonces. ## Claude high + codex P2: fmt.Sscanf silently accepts partial parses --last-commit-ts 0xffZZ parsed as 0xff (fmt.Sscanf stops at first non-hex). --last-commit-ts 100oops parsed as 100. This means a typo could silently set the snapshot HLC ceiling to something the operator did not type. Replaced fmt.Sscanf with strconv.ParseUint (base 10 / 16) which rejects trailing garbage as a parse error. New TestParseLastCommitTS cases cover: 0xffZZ, 100oops, " 100 ext", "-1". ## Codex P2: empty adapter selection produced an empty AdapterSet --adapter ' ,' (e.g. a templated argument that expanded to spaces) would yield a zero AdapterSet and the encoder would publish a valid header-only .fsm with no adapter records — a silently-empty restore artifact. Now rejected with a clear error. Pinned by TestParseAdapterSetRejectsEmptySelection covering " ,", ",,,", " ", and "," — plus a positive case (single "s3") asserting the guard does not break legitimate single-adapter selections. ## Caller audit per CLAUDE.md semantic-change rule - runSelfTest signature changed from ([]byte, opts) to (io.Reader, opts). Sole caller is encodeBuffered (same package, just updated). No external callers since the function is unexported. - corruptBufferForTest hook type changed from func([]byte) to func(*os.File). Sole caller is the same-package test TestEncodeSnapshotSelfTestDetectsCorruption (updated to ReadAt/ WriteAt the temp file). External callers cannot set this field (lowercase). - streamFilesEqual replaces the bytes.Equal compare in diffOneSubdir. Same return contract (bool, error) so the caller (diffOneSubdir) is unchanged in behavior. - parseAdapterSet now returns an error for empty-selection inputs. Caller is parseFlags which already propagates the error to exit-1. Tests + lint green.
|
v2 (b131186): addresses all 6 review findings on the implementation PR. Gemini high #1 (OOM on large snapshots): Self-test buffer is now an on-disk Gemini high #2 (runSelfTest signature): Changed from Gemini high #3 (walkRegularFiles OOM): Replaced bytes-into-memory with Gemini medium (tempSuffixHexLen entropy): Bumped from 8 hex / 4 bytes (2^32) to 16 hex / 8 bytes (2^64), same width crypto/rand uses for cryptographic nonces. Claude high + codex P2 (fmt.Sscanf partial parses): Replaced with Codex P2 (empty adapter selection): @claude review |
|
Claude finished @bootjp's task in 3m 41s —— View job Code Review - PR #904 (v2)Tasks
v2 Verification — All 6 Required Fixes Confirmed
One New Issue Introduced by v2Stale doc comments still say
// SelfTest enables the round-trip self-test. EncodeSnapshot buffers
// the FSM in *bytes.Buffer, decodes from the buffer, and copies to
// the caller's io.Writer only if the buffer survives DecodeSnapshot
// When opts.SelfTest=true an internal *bytes.Buffer holds the
// FSM during the self-test; bytes are copied to out only after the
// self-test matches. Memory cost in self-test mode is one FSM-sized
// allocation on top of the existing sort working set.Both still describe the v1 in-memory approach. The actual v2 implementation writes to an Carry-over Observations (non-blocking, noted in v1)These were marked "no fix required" in the v1 review and remain unchanged:
Strengths in v2The disk-backed self-test path is cleanly implemented:
Verdict: Ready to merge once the two stale |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b13118638f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| b := newSnapshotBuilder(opts.LastCommitTS) | ||
| enabled, err := runAdapterEncoders(b, opts) |
There was a problem hiding this comment.
Enforce the manifest timestamp floor in EncodeSnapshot
When EncodeSnapshot is called directly (the new library entrypoint the design exposes for future callers), a caller can pass LastCommitTS lower than InputRoot/MANIFEST.json:last_commit_ts; this path immediately builds with that value and never reads the manifest or returns the advertised ErrSelfTestLowerLastCommitTS. The CLI pre-check catches its own invocations, but in-process callers can still emit an .fsm whose header and MVCC key timestamps lower the restored HLC ceiling, violating the fail-closed invariant this wrapper is meant to enforce.
Useful? React with 👍 / 👎.
…over cleanups)
Mandatory (claude v2):
- Stale *bytes.Buffer comments in SelfTest field doc and EncodeSnapshot
godoc updated to describe the v2 on-disk temp file model with O(1)
memory cost (encode-self-test-fsm-* under SelfTestDecodeOptions.OutRoot).
Non-blocking carry-over items from v1 + v2 (folded in since trivial):
- EncodeResult.SHA256 comment no longer references a non-existent
SHA256Hex function; says "raw 32-byte digest; CLI hex-encodes via
encoding/hex in writeSidecar".
- EncodeSnapshot godoc now spells out the dual-check contract
("Callers MUST check result.SelfTestMatched before treating a nil
error as success") since self-test mismatch returns (result, nil).
- diffOneSubdir output is now fully sorted (was: only the "remaining"
segment sorted; map-iteration order leaked into the rest). mismatch.txt
is deterministic across runs with identical inputs.
- Replaced hand-rolled uitoa/itoa with strconv.FormatUint and
strconv.Itoa. Dropped the sha256w / hashSHA256 alias chain; the
sha256Writer.h field now uses hash.Hash directly (already imported).
|
v3 (b8da884): addresses claude v2 review. Mandatory (the v2 blocker): Two stale Non-blocking carry-overs folded in since trivial:
@claude review |
|
Claude finished @bootjp's task in 3m 33s —— View job Code Review - PR #904 (v3)Tasks
v3 Verification — All v2 Carry-overs Confirmed
One New Doc Bug in v3
// ErrSelfTestLowerLastCommitTS is returned by EncodeSnapshot when the
// effective LastCommitTS in EncodeOptions is below the manifest's value.
// ...
// Surfaced at the EncodeSnapshot layer so the CLI's main exits with code 2
// Returns ErrSelfTestLowerLastCommitTS when opts.LastCommitTS is below
// the manifest's value — caller is responsible for reading the manifest
// and computing the effective T (this layer just validates the floor).Both are inaccurate. // cmd/elastickv-snapshot-encode/main.go:349-354
func resolveLastCommitTS(cfg *config, manifestTS uint64) (uint64, bool, error) {
if cfg.lastCommitTS < manifestTS {
return 0, false, errors.Wrapf(backup.ErrSelfTestLowerLastCommitTS, ...)
}
...
}The functional behaviour is correct — the HLC floor IS enforced before any bytes are written. But a future library caller reading the The fix is straightforward: correct both comments to say the error is a sentinel for the caller to return after its own manifest comparison, not something Observations (non-blocking)
Both are Strengths in v3
Verdict: One doc fix needed before merge. All functional behaviour is correct; all v2 findings are addressed. The doc bug ( |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b8da884f9d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // encodeToTempFile creates tempPath, runs EncodeSnapshot into it, | ||
| // fsync+close. Caller is responsible for the os.Remove cleanup on error. | ||
| func encodeToTempFile(tempPath string, encodeOpts backup.EncodeOptions) (backup.EncodeResult, error) { | ||
| tempFile, err := os.Create(tempPath) //nolint:gosec // operator-supplied path |
There was a problem hiding this comment.
Create snapshot artifacts owner-only
When the encoder runs with a typical umask like 022, os.Create creates the temporary FSM as mode 0666 masked to 0644, and that same file becomes the published <output> after the rename. Because the .fsm contains the full KV dataset, this exposes backup contents to other local users on multi-user backup hosts while the encode is running and after publish; the decoded adapter writers in internal/backup/open_nofollow_* use 0600 for this reason. Please create the temp output with an explicit owner-only mode before renaming it.
Useful? React with 👍 / 👎.
… unify writer types
Mandatory (claude v3):
ErrSelfTestLowerLastCommitTS var comment and EncodeSnapshot godoc both
claimed the HLC ceiling check is enforced in the library. It is NOT.
The check lives in the CLI's resolveLastCommitTS BEFORE EncodeSnapshot
is called. A future library caller (Phase 1 live extractor, integration
test) reading the godoc would assume EncodeSnapshot guards against a
low timestamp and be surprised when it silently stamps the low value
into the .fsm.
Both comments updated to say:
- The error is the sentinel callers should return after their OWN
manifest comparison.
- EncodeSnapshot does NOT read MANIFEST.json or validate
opts.LastCommitTS against any external floor.
Non-blocking observation (claude v3):
sha256Writer and teeWriter were structurally identical (both
{w io.Writer; h hash.Hash} with the same Write body). teeWriter
removed; encodeBuffered now uses newSHA256Writer for the temp-file
tee. One less type to audit.
|
v4 (a357279): addresses claude v3. Mandatory: ErrSelfTestLowerLastCommitTS var comment + EncodeSnapshot godoc claimed the HLC ceiling check is enforced in the library — it isn't (the check lives in the CLI's Non-blocking observation: Unified the duplicated @claude review |
|
Claude finished @bootjp's task in 2m 35s —— View job Code Review - PR #904 (v4)Tasks
v4 Verification — All v3 Items Confirmed
One Remaining Issue —
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a3572798b7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| } | ||
|
|
||
| b := newSnapshotBuilder(opts.LastCommitTS) | ||
| enabled, err := runAdapterEncoders(b, opts) |
There was a problem hiding this comment.
Reject zero-adapter library encodes
When an in-process caller passes the zero AdapterSet (for example, a future Phase 1 caller forgets to thread adapter selection), runAdapterEncoders returns an empty list with no error and the subsequent write emits a valid header-only .fsm, silently discarding any adapter subtrees under InputRoot. The CLI parser now rejects empty selections, but this public library entrypoint still has the same fail-open behavior for direct callers, so it should fail before writing when no adapters are enabled.
Useful? React with 👍 / 👎.
Claude v4 mandatory finding: encodeInfoFilePerm = 0o600 was declared but dead code. Both file-creation sites (encodeToTempFile and writeSidecar) used os.Create, which honors umask and typically gives 0644. On a multi-user backup host, the full KV dataset (.fsm) and the sidecar (which carries source path, cluster_id, and SHA-256) would have been world-readable. Replaced both os.Create calls with os.OpenFile + explicit encodeInfoFilePerm: - encodeToTempFile: temp .fsm is now 0o600 from creation, so the on-disk dataset is never world-readable during the encode-then- rename window. - writeSidecar: ENCODE_INFO.json gets the same 0o600. New TestCLIPublishesFsmAndSidecarMode0600 asserts both files have no group/other access bits (`perm & 0o077 == 0`) after a successful run. Skips on Windows where Unix-style mode bits aren't meaningful. Codex v3 had also flagged the temp FSM side at P2 in passing; that's now addressed alongside the sidecar. Caller audit: both call sites updated in this same file; no other os.Create calls remain in the encoder CLI. The `0o600` constant is now live code.
|
v5 (0c223d2): addresses claude v4 mandatory. Mandatory (claude v4): Both replaced with Caller audit: both sites updated; no other @claude review |
|
Claude finished @bootjp's task in 2m 29s —— View job Code Review - PR #904 (v5)Tasks
v5 Verification — Mandatory v4 Fix Confirmed
Permission chain is correct end-to-end: the temp file is created The library-level self-test buffer ( No remaining No New Issues Introduced in v5The change is minimal and scoped to exactly what v4 flagged: two Carry-over Observation (pre-existing, non-blocking)
Strengths in v5
Verdict: Ready to merge. The mandatory v4 fix is correctly implemented and pinned by the new test. No new issues in v5. |
… v5 + codex v5)
Carry-over flagged in both v5 reviews: EncodeSnapshot at the library
layer accepted opts.Adapters == AdapterSet{} and produced a valid
header-only .fsm with no adapter records. The CLI's parseAdapterSet
already rejects this for flag-driven entry, but a future in-process
caller (Phase 1 live extractor, integration test) that forgets to
thread the adapter set would get a silently empty restore artifact.
Fail-closed guard added at the top of EncodeSnapshot alongside the
existing InputRoot / out-nil checks. Pinned by
TestEncodeSnapshotRejectsZeroAdapterSet which also asserts no bytes
are written to out on rejection.
Claude v5 explicitly said the existing fix was "Ready to merge" and
this was non-blocking; folding in for completeness since the guard
is a 4-line addition and the test is independent.
Caller audit per CLAUDE.md semantic-change rule: EncodeSnapshot's
new pre-condition narrows the accepted input domain. All existing
callers either (a) pass through the CLI's parseAdapterSet (which
already rejects empty selections) or (b) are same-package tests
that pass an explicit AdapterSet{...} literal. No legitimate caller
is impacted.
|
v6 (b56c0e6): folded in the carry-over non-blocking guard. Claude v5 already declared "Ready to merge" but flagged the library-level zero-AdapterSet gap (also raised by codex v5). Added the guard at the top of Caller audit: the new pre-condition narrows accepted input. All existing callers pass through the CLI's @claude review |
|
Claude finished @bootjp's task in 1m 30s —— View job Code Review - PR #904 (v6)Tasks
v6 Verification — Zero-AdapterSet Guard ConfirmedThe change is exactly scoped to what was described: 10 lines to
No New Issues Introduced in v6The guard correctly fires before any One observation carries forward as genuinely non-blocking: if Verdict: Ready to merge. v6 closes the only remaining carry-over from v5 cleanly, with no regressions and no new issues. | Branch | |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b56c0e65ae
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| result, err := writeAndPublish(cfg, encodeOpts, mismatchPath, logger) | ||
| if err != nil { | ||
| return err |
There was a problem hiding this comment.
Write encode info before returning self-test mismatches
When --self-test produces a real mismatch, writeAndPublish returns errSelfTestMismatch here; this early return skips the writeSidecar call below, so <output>.encode_info.json is never written with self_test.matched=false. The design says mismatch exits should leave both mismatch.txt and provenance, and without the sidecar operators lose the generated SHA/T/adapters needed to diagnose a failed self-test.
Useful? React with 👍 / 👎.
User flagged that I had missed codex reviews on v2 and v6. Re-fetched all bot reviews; two P2 items remained unaddressed. ## Codex P2 v2: enforce manifest TS floor in EncodeSnapshot The CLI's resolveLastCommitTS already rejects --last-commit-ts T < manifest.last_commit_ts, but the library-level EncodeSnapshot accepted ANY opts.LastCommitTS, including one below the manifest's floor. A future in-process caller (Phase 1 live extractor, integration tests) that bypasses the CLI could silently lower the restored HLC ceiling, letting a post-restart leader re-issue a read ts <= a restored row's commit ts. Earlier v4 fix was doc-only (godoc said "EncodeSnapshot does NOT read the manifest"). That's accurate but pushes the responsibility to callers. Codex's suggestion to enforce in the library is the fail-closed answer. Fix: EncodeOptions gains ManifestLastCommitTS uint64. EncodeSnapshot fails-closed with ErrSelfTestLowerLastCommitTS when LastCommitTS < ManifestLastCommitTS (both > 0 — synthetic test fixtures opt out by leaving ManifestLastCommitTS at 0). CLI's buildEncodeOptions now threads manifest.LastCommitTS into the field. Pinned by TestEncodeSnapshotRejectsLowManifestFloor (rejection + no bytes written) and TestEncodeSnapshotManifestFloorOptOut (opt-out path still works). ## Codex P2 v6: write encode_info.json before returning self-test mismatches The old writeAndPublish returned errSelfTestMismatch on mismatch, which skipped the writeSidecar call. The design says mismatch should leave BOTH <output>.mismatch.txt AND <output>.encode_info.json (with self_test.matched=false) for diagnostics. Operators need the sidecar's SHA256, effective T, adapters_enabled list to triage a failed self-test. Fix: encodeOne now writes the sidecar whenever the encode itself ran (publishErr == nil || errors.Is(publishErr, errSelfTestMismatch)), and only after that returns the original error. Stale sidecar cleanup at the start of every run (was: only mismatch.txt was cleaned). ## Caller audit per CLAUDE.md semantic-change rule - EncodeSnapshot gained a pre-condition (ManifestLastCommitTS floor). All in-tree callers either thread manifest.LastCommitTS through the new field (CLI) or use ManifestLastCommitTS=0 (existing tests). No legitimate caller is impacted. - encodeOne's control flow changed: sidecar is now written EVEN on mismatch. The mismatch error still propagates; downstream callers (run() in main, tests) see the same error contract. Sole consumer is run() which maps errSelfTestMismatch to exitDataErr. Tests + lint green.
|
v7 (9f69295): addresses two missed codex P2 reviews. ユーザー指摘の通り、v2 と v6 の codex review を見落としていました。 Codex P2 v2 (manifest TS floor in EncodeSnapshot): Library-level enforcement was missing. v4 only fixed the godoc to say "EncodeSnapshot does NOT enforce" — accurate but pushed responsibility to callers. Now Codex P2 v6 (sidecar on self-test mismatch): Caller audit: @claude review |
|
Claude finished @bootjp's task in 3m 59s —— View job Code Review - PR #904 (v7)Tasks
v7 Verification — Both Codex P2 Fixes Confirmed
The dual-enforcement analysis: One New Issue in v7
// EncodeSnapshot itself does NOT read MANIFEST.json or enforce this
// floor — the comparison happens in the CLI's resolveLastCommitTS
// BEFORE EncodeSnapshot is called, and a future library caller (Phase 1
// live extractor, integration tests) must perform its own comparison
// and return this sentinel on regression so callers can errors.Is on
// it to map to the right exit code (claude v3 doc bug #904).This was accurate after v4's doc fix, but v7 changed the contract:
The fix is to update lines 23-28 to say the error is now returned by both the CLI ( // EncodeSnapshot itself does NOT read MANIFEST.json, but when the caller
// sets opts.ManifestLastCommitTS > 0, validateEncodeOptions enforces this
// floor and returns ErrSelfTestLowerLastCommitTS before any bytes are
// written. The CLI's resolveLastCommitTS provides the same check for
// flag-driven entry; library callers SHOULD set ManifestLastCommitTS to
// the manifest value for defense-in-depth (codex P2 v2 + claude v3 #904).Observation (non-blocking)
The test exercises the manifest-floor rejection path ( The codex P2 v6 fix (the new Strengths in v7
Verdict: One doc fix before merge. All functional behavior is correct; both v6 Codex P2 items are correctly implemented. The | Branch | |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9f692950fe
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| {"dynamodb", func(s AdapterSet) bool { return s.DynamoDB }, func(b *snapshotBuilder, root string) error { | ||
| return errors.Wrap(NewDynamoDBEncoder(root).Encode(b), "dynamodb encoder") |
There was a problem hiding this comment.
Reject DynamoDB JSONL inputs before encoding
When an input manifest has dynamodb_layout: "jsonl" and --self-test is not enabled (the default), this path still runs the per-item-only DynamoDB reverse encoder. That encoder only walks items/*.json / items/*/*.json, so a decoded tree with items/data-*.jsonl is accepted but all DynamoDB items are skipped while the CLI exits successfully and publishes an .fsm containing only table metadata. Please fail closed on JSONL manifests until the reverse encoder supports that layout, or pass a JSONL-aware mode through this library path.
Useful? React with 👍 / 👎.
Two findings on v8. ## Codex P2 v8: silent empty-restore on missing/stale InputRoot validateEncodeOptions only rejected the empty string. A library caller that passed a typo'd or deleted InputRoot would fan-out into every enabled adapter, each treating its missing top-level subdirectory as a no-op. Result: a "successful" header-only .fsm — exactly the silent empty-restore artifact the encoder is supposed to fail closed against. CLI callers don't hit this path (they open MANIFEST.json under InputRoot first), but the library entrypoint is now a real surface, so the guard belongs in EncodeSnapshot itself. Fix: validateEncodeOptions now os.Stats InputRoot after the empty- string check, rejecting non-existent paths (wraps the stat error) and regular files (returns a typed error). Both rejection paths run before any byte is written and before any adapter is invoked. Pinned by TestEncodeSnapshotRejectsMissingInputRoot (subtests: non-existent path; regular file). Existing TestEncodeSnapshot LibraryRoundTrip + the manifest-floor tests use t.TempDir(), which is a directory, so they continue to pass. ## Claude v8 doc bug: EncodeSnapshot godoc attributed to wrong symbol The EncodeSnapshot doc block sat directly above the validateEncodeOptions doc block with no blank line between them, so godoc tooling (and gopls) attributed the entire merged comment to validateEncodeOptions. EncodeSnapshot — the only exported function in this file — had no godoc at all. Fix: relocate the EncodeSnapshot doc block to immediately precede `func EncodeSnapshot`, separated from validateEncodeOptionsData by a blank line. Content unchanged; only the location moved. ## Caller audit per CLAUDE.md semantic-change rule - validateEncodeOptions gained a fail-closed pre-condition (stat + IsDir). All in-tree callers pass either t.TempDir() (directory) or a real backup root (directory). No legitimate caller is impacted; bad-input callers now surface a typed error instead of silently producing a header-only .fsm — strictly safer. - No exported signature or error semantics changed beyond the new rejection path. Tests + lint green.
|
@claude review v9 addresses two v8 findings: codex P2 v8 — claude v8 — Caller audit (CLAUDE.md semantic-change rule):
|
|
Claude finished @bootjp's task in 1m 51s —— View job Code Review - PR #904 (v9)Tasks
v9 Verification — Both v8 Items Confirmed
The guard ordering in No New Issues in v9The change is minimal and scoped precisely to the two v8 findings:
All prior fixes (v1–v8) remain intact. One non-blocking observation: the stat call happens before the Verdict: Ready to merge. Both v8 findings are correctly implemented and pinned by tests. No regressions and no new issues. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 145a9128d2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| } | ||
| if errors.Is(err, backup.ErrEncodeUnsupportedDynamoDBLayout) { | ||
| return exitDataErr, err | ||
| } |
There was a problem hiding this comment.
Classify encoder data failures as exit 2
When an adapter encoder rejects a malformed decoded tree (for example an invalid DynamoDB _schema.json or unsupported S3 collision artifact), encodeOne returns that EncodeSnapshot error but it does not match any of the three sentinels above, so this fallback exits 1. The CLI contract distinguishes user/flag errors from data-correctness failures with exit 2; treating corrupt backup contents as a user error can break runbooks that branch on exit status to quarantine or repair bad dump data.
Useful? React with 👍 / 👎.
## Codex P2 v9: adapter encoder rejections fall through to exit 1
When an adapter encoder rejects the input tree's contents (a malformed
DynamoDB _schema.json, an S3 collision artifact the encoder cannot
reverse, an SQS side-record with an unknown kind, etc.), the error
propagates out of EncodeSnapshot but does not match any of the three
exit-2 sentinels (ErrSelfTestLowerLastCommitTS,
ErrEncodeUnsupportedDynamoDBLayout, errSelfTestMismatch), so the
CLI's run() falls through to exit 1. Per the CLI contract, exit 1
is operator/flag error and exit 2 is data-correctness — runbooks
branch on exit status to quarantine bad dump data, so misclassifying
adapter rejections as user error is a real ergonomic regression.
The encoder is offline-only — it operates entirely on local files
under InputRoot. Every error from an adapter encoder originates from
the contents of that tree (a sentinel rejection, an unmarshalling
failure, etc.). Treating all adapter errors as data-correctness
matches the CLI contract.
Fix:
- New ErrEncodeAdapterData sentinel in internal/backup.
- runAdapterEncoders wraps each adapter error with errors.Mark so
errors.Is(err, ErrEncodeAdapterData) is true. Mark preserves the
inner sentinel chain — callers that errors.Is on the per-adapter
sentinel (ErrDDBEncodeInvalidSchema, ErrS3EncodeKeyConflict, etc.)
are unaffected.
- CLI run() gains a fourth exit-2 branch for ErrEncodeAdapterData.
Pinned by:
- TestEncodeSnapshotMarksAdapterDataErrors (library): malformed
_schema.json triggers ErrDDBEncodeInvalidSchema inside the DDB
encoder; assertion verifies BOTH errors.Is(ErrEncodeAdapterData)
AND errors.Is(ErrDDBEncodeInvalidSchema) hold - the mark is
additive, not lossy.
- TestCLIAdapterDataErrorExitsTwo (CLI): same fixture driven
end-to-end through run(); asserts exit code = 2 and no .fsm
published.
## Caller audit per CLAUDE.md semantic-change rule
- runAdapterEncoders return semantics: every non-nil err is now
marked. Sole caller is EncodeSnapshot; it propagates the marked
error unchanged.
- EncodeSnapshot callers (library tests, CLI's encodeToTempFile):
- Library tests that errors.Is on the validation sentinels
fire BEFORE runAdapterEncoders, so they are NOT marked.
- CLI's encodeToTempFile wraps with "EncodeSnapshot"; the run()
layer above it now matches ErrEncodeAdapterData and routes to
exit-2 (new branch).
- In-tree errors.Is on per-adapter sentinels (encode_dynamodb_test.go
et al.) call the adapter encoders directly, not through
EncodeSnapshot, so the mark wrapper is never on their path.
Tests + lint green. wrapcheck required errors.WithStack around the
Mark.
|
@claude review v10 addresses codex P2 v9: adapter-data rejections now classify as exit 2. codex P2 v9 — When an adapter encoder rejects the input tree's contents (e.g. malformed DynamoDB Fix:
Pinned by:
Caller audit (CLAUDE.md semantic-change rule):
|
|
Claude finished @bootjp's task in 2m 52s —— View job Code Review - PR #904 (v10)Tasks
v10 Verification — Codex P2 v9 Fix Confirmed
The One New Doc Bug in v10
// that errors.Is on ErrDDBEncodeInvalidSchema, ErrS3EncodeKeyConflict,
// etc. still see those (codex P2 v9 #904).
The fix is to replace the phantom name with an actual one (e.g., // that errors.Is on ErrDDBEncodeInvalidSchema, ErrS3EncodeUnsupportedCollision,
// etc. still see those (codex P2 v9 #904).Strengths in v10
Verdict: One doc fix before merge (phantom | Branch | | |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 18858341bb
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if cfg.selfTest && !result.SelfTestMatched { | ||
| if werr := os.WriteFile(mismatchPath, result.SelfTestMismatchTxt, mismatchTxtPerm); werr != nil { | ||
| logger.Warn("write mismatch.txt", "err", werr) | ||
| } | ||
| return result, errors.Wrap(errSelfTestMismatch, "self-test diff (see "+mismatchPath+")") |
There was a problem hiding this comment.
Avoid leaving stale FSM beside failed self-test metadata
When --output already exists from a previous run and the new --self-test detects a mismatch, this branch returns before renaming the temp file but never removes or otherwise invalidates the old <output>. encodeOne then writes a fresh <output>.encode_info.json with self_test.matched=false and the SHA for the unpublished temp snapshot, so the visible .fsm path still contains stale bytes that do not match the sidecar and contradict the CLI contract that a self-test failure leaves no restore-visible FSM.
Useful? React with 👍 / 👎.
Claude v10 doc fix: the runAdapterEncoders godoc referenced ErrS3EncodeKeyConflict, which does not exist. The actual S3 encoder sentinels are ErrS3EncodeInvalidBucket, ErrS3EncodeNotRegular, ErrS3EncodeUnsupportedCollision, ErrS3EncodeReservedPrefixCollision, and ErrS3EncodeInvalidManifest. Replaced with ErrS3EncodeUnsupportedCollision per claude v10's suggestion. Doc-only change; no behavior or test changes.
|
@claude review v11 doc-only fix per claude v10: replaced phantom
|
|
Claude finished @bootjp's task in 48s —— View job Code Review - PR #904 (v11)Tasks
v11 Verification — Mandatory v10 Fix Confirmed
The diff is minimal and precisely scoped: the phantom sentinel name is replaced, and a parenthetical notes both the original fix reference and the doc correction. No New Issues in v11The change is a pure doc comment replacement — one identifier name corrected, no logic or test modifications. All prior fixes (v1–v10) remain intact. Verdict: Ready to merge. The single mandatory v10 item is correctly addressed. All 10 previous rounds of review findings are resolved. The PR is clean. | Branch | | |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cmd/elastickv-snapshot-encode/main_test.go (1)
277-284:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winHandle the
os.Openerror in this helper.If
tmpOutis missing, this passes a nil*os.FileintoDecodeSnapshotand hides the real failure behind a follow-on test crash or misleading decode error.Proposed fix
- f, _ := os.Open(tmpOut) + f, err := os.Open(tmpOut) + if err != nil { + t.Fatalf("open canonical output: %v", err) + } + defer func() { _ = f.Close() }() if _, err := backup.DecodeSnapshot(f, backup.DecodeOptions{ OutRoot: canonicalIn, Adapters: backup.AdapterSet{SQS: true}, }); err != nil { t.Fatalf("canonical decode: %v", err) } - _ = f.Close() emitMinimalManifest(t, canonicalIn, lastCommitTS) return canonicalIn🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/elastickv-snapshot-encode/main_test.go` around lines 277 - 284, The helper opens tmpOut with f, _ := os.Open(tmpOut) but ignores the open error which can pass a nil *os.File into backup.DecodeSnapshot; update the helper to check the error returned by os.Open(tmpOut) and call t.Fatalf (or t.Fatalf with the error) if opening fails before calling backup.DecodeSnapshot (references: tmpOut, f, os.Open, and backup.DecodeSnapshot); ensure you still close f after a successful open.
🧹 Nitpick comments (1)
cmd/elastickv-snapshot-encode/main_test.go (1)
335-372: ⚡ Quick winRename this test to match what it actually exercises.
This never reaches a self-test mismatch path. It verifies the preflight manifest-floor failure leaves no stale sidecar, so the current name/docstring point readers at the wrong branch.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/elastickv-snapshot-encode/main_test.go` around lines 335 - 372, Rename the test function TestCLISelfTestMismatchWritesSidecarWithMatchedFalse and update its docstring/comments to reflect that it verifies the manifest-floor preflight failure (triggered via resolveLastCommitTS / --last-commit-ts) leaves no stale sidecar; e.g., rename to TestCLIManifestFloorLeavesNoStaleSidecar (or similar), update the top comment to describe the preflight/manifest-floor path, and keep the existing assertions (emitMinimalManifest, run with "--last-commit-ts", exitDataErr, and the os.Stat check) unchanged so the test semantics remain the same.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@cmd/elastickv-snapshot-encode/main_test.go`:
- Around line 277-284: The helper opens tmpOut with f, _ := os.Open(tmpOut) but
ignores the open error which can pass a nil *os.File into backup.DecodeSnapshot;
update the helper to check the error returned by os.Open(tmpOut) and call
t.Fatalf (or t.Fatalf with the error) if opening fails before calling
backup.DecodeSnapshot (references: tmpOut, f, os.Open, and
backup.DecodeSnapshot); ensure you still close f after a successful open.
---
Nitpick comments:
In `@cmd/elastickv-snapshot-encode/main_test.go`:
- Around line 335-372: Rename the test function
TestCLISelfTestMismatchWritesSidecarWithMatchedFalse and update its
docstring/comments to reflect that it verifies the manifest-floor preflight
failure (triggered via resolveLastCommitTS / --last-commit-ts) leaves no stale
sidecar; e.g., rename to TestCLIManifestFloorLeavesNoStaleSidecar (or similar),
update the top comment to describe the preflight/manifest-floor path, and keep
the existing assertions (emitMinimalManifest, run with "--last-commit-ts",
exitDataErr, and the os.Stat check) unchanged so the test semantics remain the
same.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 06cbcf79-3732-42cb-a690-d182c2c96f28
📒 Files selected for processing (4)
cmd/elastickv-snapshot-encode/main.gocmd/elastickv-snapshot-encode/main_test.gointernal/backup/encode_snapshot.gointernal/backup/encode_snapshot_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- cmd/elastickv-snapshot-encode/main.go
## Codex P2 v10: self-test mismatch leaves orphaned stale .fsm When --output already exists from a prior successful run and the new --self-test invocation detects a mismatch, writeAndPublish returned errSelfTestMismatch without removing the stale <output>.fsm. encodeOne then wrote a fresh <output>.encode_info.json with self_test.matched=false and the NEW SHA pointing to the unpublished temp snapshot, leaving: - <output>.fsm: STALE bytes from the prior successful run - <output>.encode_info.json: NEW SHA + matched=false - <output>.mismatch.txt: from the new run The sidecar describes an FSM that does not exist on disk. This violates the CLI contract that "a self-test failure leaves no restore-visible FSM" and breaks runbooks that consume the sidecar's SHA to verify FSM integrity (they would find the stale FSM's SHA does not match the sidecar's NEW SHA). Fix: writeAndPublish, on the self-test-mismatch branch, now os.Remove(cfg.outputPath) before returning. After cleanup the artifacts are internally consistent: the sidecar describes a FAILED encode attempt, mismatch.txt has the diff, and the FSM path is absent (matching the sidecar's intent). errors.Is(rerr, os.ErrNotExist) is treated as success — a self-test mismatch from a first-ever encode (no prior FSM) leaves the path absent, which is the same end state. ## Surgical scope — only self-test mismatch wipes stale .fsm Other exit-2 paths (manifest-floor regression, adapter-data error, unsupported DDB layout) preserve any prior <output>.fsm because those failures occur BEFORE writeAndPublish runs. This preserves the runbook ergonomic of "operator typo'd --last-commit-ts; last good FSM still on disk to retry against." ## Test infrastructure To drive a deterministic self-test mismatch end-to-end through the CLI's writeAndPublish, a minimal test seam was added to package backup: EncodeOptions.SetSelfTestCorruptHookForTest(func(*os.File)) exposes the previously-package-private corruptBufferForTest field to callers outside package backup. Production code MUST NOT call this; the godoc explicitly names cmd/elastickv-snapshot-encode tests as the sole intended caller. Pinned by: - TestCLIWriteAndPublishRemovesStaleFSMOnSelfTestMismatch: pre-places a stale .fsm, injects corruption via the new seam, asserts publishErr == errSelfTestMismatch, the stale .fsm is GONE, and mismatch.txt is present. - TestCLINonSelfTestExitTwoPreservesPriorFSM: pre-places a stale .fsm, drives a manifest-floor regression (exit-2 BEFORE writeAndPublish), asserts the stale .fsm is byte-for-byte preserved. Pins the surgical scope of the cleanup. ## Caller audit per CLAUDE.md semantic-change rule - writeAndPublish: sole production caller is encodeOne. On self-test mismatch, encodeOne still writes the sidecar (publishErr matches errSelfTestMismatch) and returns publishErr. The new os.Remove happens before that, so the caller sees IDENTICAL error semantics and the same sidecar-write path. The only difference is on-disk state for <output>.fsm: was stale, is now absent. - EncodeOptions.SetSelfTestCorruptHookForTest: new exported method. No production callers anywhere; one test caller in cmd/elastickv-snapshot-encode/main_test.go. In-package backup tests continue to set corruptBufferForTest directly. Tests + lint green.
|
@claude review v12 addresses codex P2 v10 (which I had missed on the prior review cycle — fetched all bot reviews this time per the discipline lock-in). codex P2 v10 — When Fix:
Surgical scope: only self-test mismatch wipes the stale .fsm. Other exit-2 paths (manifest-floor, adapter-data, JSONL guard) fail BEFORE Pinned by:
Caller audit (CLAUDE.md semantic-change rule):
|
|
Claude finished @bootjp's task in 3m 10s —— View job Code Review - PR #904 (v12)Tasks
v12 Verification — Codex P2 v10 Fix Confirmed
One Doc Gap in v12
// writeAndPublish writes the .fsm to a temp path, runs the optional
// self-test via EncodeSnapshot, and renames temp → output on success.
// On self-test failure: writes mismatch.txt, removes the temp file via
// the deferred cleanup, returns errSelfTestMismatch.The comment says "removes the temp file via the deferred cleanup" but omits the new stale-output removal. A future reader auditing the on-disk state after a self-test mismatch would see only the deferred temp cleanup documented and need to grep for Carry-over Issues (non-blocking, flagged across multiple prior rounds)
f, _ := os.Open(tmpOut)CodeRabbit flagged this in its v12 review (outside-diff comment). If f, err := os.Open(tmpOut)
if err != nil {
t.Fatalf("open canonical output: %v", err)
}
defer func() { _ = f.Close() }()
Renaming to Strengths in v12
Verdict: One doc fix before merge ( |
## Claude v12 mandatory: writeAndPublish godoc missed v12 cleanup The godoc described only the temp-file deferred cleanup but omitted the new stale-<output>.fsm removal added in v12. A future reader auditing the on-disk state after self-test mismatch would see only the temp-file cleanup documented and need to grep for cfg.outputPath to discover the second removal. Expanded the godoc to call out both. ## Claude v12 carry-over: canonicalizeInput swallowed os.Open error The test helper used "f, _ := os.Open(tmpOut)" and passed nil to DecodeSnapshot if the open failed, producing a panic or misleading decode error instead of a clean t.Fatalf. Fixed: check the error explicitly and defer the Close. CodeRabbit flagged this on v11/v12. ## Claude v12 carry-over: test name contradicted its assertion TestCLISelfTestMismatchWritesSidecarWithMatchedFalse asserted os.IsNotExist on the sidecar — the encode never ran on the manifest-floor path it exercises. Name renamed to TestCLIManifestFloorLeavesNoStaleSidecar to match the actual assertion. The real "sidecar IS written with matched=false on self-test mismatch" behavior is now pinned end-to-end by TestCLIWriteAndPublishRemovesStaleFSMOnSelfTestMismatch (v12) which drives a real self-test mismatch through the new corruption seam. Docstring updated to explain the rename history. ## Caller audit per CLAUDE.md semantic-change rule All v13 changes are doc/identifier/error-handling only. No public or package-private function signature, error semantics, or return contract changed. Renamed test has no callers (Go testing framework discovers by prefix). canonicalizeInput's error handling is strictly safer — any open failure now surfaces at the failure site. Tests + lint green.
|
@claude review v13 addresses claude v12 verdict + two carry-overs flagged across multiple rounds: Claude v12 mandatory — Claude/CodeRabbit carry-over — Claude/CodeRabbit carry-over — Caller audit (CLAUDE.md semantic-change rule): all v13 changes are doc/identifier/error-handling only. No public/private signature, error semantics, or return contract changed. Renamed test has no callers (testing framework discovers by prefix).
|
|
Claude finished @bootjp's task in 1m 41s —— View job Code Review - PR #904 (v13)Tasks
v13 Verification — All Three Items Confirmed
The encodeOne flow confirm the test comment is accurate: No New Issues in v13The diff is exactly two files, 28 additions / 21 deletions, scoped to:
All prior fixes (v1–v12) remain intact. Verdict: Ready to merge. All three v13 items are correctly implemented. No regressions, no new issues. All 12 prior review rounds' findings are resolved. | Branch | | |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 48b0b0e582
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| func adapterRunners() []adapterRunner { | ||
| return []adapterRunner{ | ||
| {"redis", func(s AdapterSet) bool { return s.Redis }, func(b *snapshotBuilder, root string) error { | ||
| return errors.Wrap(NewRedisEncoder(root, 0).Encode(b), "redis encoder") |
There was a problem hiding this comment.
Encode every Redis database, not just db_0
When the decoded input contains Redis keys in any non-zero DB (for example redis/db_3, which the decoder can emit), this fan-out only constructs NewRedisEncoder(root, 0). RedisEncoder is scoped to a single redis/db_<n> directory, so the default CLI path with --self-test off publishes an .fsm that silently drops all non-DB0 Redis data; with self-test on the same valid multi-DB dump is rejected instead of encoded. Please enumerate the redis/db_* directories and invoke the per-DB encoder for each selected DB.
Useful? React with 👍 / 👎.
## Codex P1 v13: hardcoded db_0 silently dropped non-default Redis DBs
The Redis adapter runner in adapterRunners was hardcoded to
NewRedisEncoder(root, 0). RedisEncoder is scoped to a single
redis/db_<n>/ subdirectory; any Phase 1 multi-DB dump that included
redis/db_3/, redis/db_5/, etc. would silently drop those DBs from
the produced .fsm — a header-only encode for those keys, exactly
the silent-data-loss pattern the encoder is supposed to fail closed
against. With --self-test on, the same valid multi-DB dump would be
rejected at the diff stage (the decoded scratch tree would not match
the input's db_3 content).
Phase 0a/0b inputs only emit redis/db_0/, so this is a latent bug
that fires the moment Phase 1's multi-DB dumper lands; codex flagged
it as P1.
## Fix: enumerate redis/db_<N> directories and fan out per DB
Two new helpers in encode_snapshot.go:
- enumerateRedisDBs(inRoot) returns the sorted dbIndex values for
which <inRoot>/redis/db_<N>/ exists as a directory. A missing
redis/ returns nil (no-op, matching the per-DB encoder's
missing-subdir convention). Non-canonical entries (non-numeric
suffix, negative, leading-zero like "db_01", wrong prefix,
regular files at the redis/ level) are silently skipped — they
cannot be produced by the canonical decoder. A symlinked or
regular-file redis/ path is rejected with ErrRedisEncodeNotDir
(mirrors the per-DB encoder's symlink refusal).
- encodeAllRedisDBs(b, inRoot) invokes NewRedisEncoder per
enumerated index in ascending order, wrapping per-DB errors with
"redis encoder db_%d" for traceability.
The redis adapter runner now uses encodeAllRedisDBs as its function
value directly (gocritic unlambda).
## Pinned by
- TestEnumerateRedisDBsMissingDir: missing redis/ returns nil indices.
- TestEnumerateRedisDBsMixedEntries: only canonical db_<N> dirs
kept; non-numeric, negative, leading-zero, empty-suffix, wrong-
prefix, and regular-file entries are skipped; result sorted
ascending (asserted via direct index comparison).
- TestEnumerateRedisDBsRedisIsRegularFile: regular file at redis/
path triggers ErrRedisEncodeNotDir.
- TestEncodeSnapshotRedisMultiDB: a fixture under redis/db_3/ ONLY
(no db_0) produces strictly more bytes than an empty-redis baseline
through EncodeSnapshot. Pre-fix, the hardcoded db_0 path would
have produced an identical header-only .fsm for both inputs.
## Caller audit per CLAUDE.md semantic-change rule
- adapterRunners.redis: closure replaced with encodeAllRedisDBs
function value (gocritic unlambda). Error contract:
- errors.Is(err, ErrEncodeAdapterData) — still true (errors.Mark
in runAdapterEncoders is unchanged).
- errors.Is(err, ErrRedisEncodeNotDir) etc. — still true
(errors.Mark + errors.Wrap preserve the inner chain).
- Error prefix changed from "redis encoder: ..." to
"redis encoder db_%d: ..." (or "redis encoder enumerate: ..."
for the enumeration step). More specific, not less.
- NewRedisEncoder direct callers (encode_redis_test.go,
encode_redis_coll_test.go, encode_redis_hardlink_unix_test.go):
unaffected — they call NewRedisEncoder directly, never through
the adapter runner.
- enumerateRedisDBs / parseRedisDBDir / checkRedisRoot / encodeAllRedisDBs
are new package-private functions; no external callers exist yet.
## Lint compliance
Two cyclop refactor splits required:
- enumerateRedisDBs (initially 12 branches) split into
checkRedisRoot + parseRedisDBDir helpers. Each is ≤6 branches.
- TestEnumerateRedisDBs (initially 14 branches via t.Run) split
into three top-level tests. Each is ≤4 branches.
Tests + lint green.
|
@claude review v14 addresses codex P1 v13 (silent data loss for non-default Redis DBs). codex P1 v13 — Fix: two new helpers in
The redis adapter runner now uses Pinned by four tests:
Caller audit (CLAUDE.md semantic-change rule): Lint compliance required two cyclop splits:
|
|
Claude finished @bootjp's task in 2m 48s —— View job Code Review - PR #904 (v14)Tasks
v14 Verification — Codex P1 v13 Fix Confirmed
One Minor Doc Nit (non-blocking)
The split is correct and intentional — case errors.Is(err, os.ErrNotExist):
// Missing redis/ is valid (no DB dirs → no-op); the caller
// handles it via the os.ReadDir NotExist guard below.
return nilNot a blocker — the current code is correct. One Non-blocking ObservationNo end-to-end self-test with a multi-DB Redis fixture
The path is: Strengths in v14
Verdict: Ready to merge. The codex P1 v13 fix is correctly implemented and pinned by four tests. No new issues introduced. The one doc nit ( | Branch | |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 71ecc12699
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // the old bytes on disk would make the sidecar describe an | ||
| // FSM that does not exist and violate the "self-test failure | ||
| // leaves no restore-visible FSM" contract (codex P2 v10 #904). | ||
| if rerr := os.Remove(cfg.outputPath); rerr != nil && !errors.Is(rerr, os.ErrNotExist) { |
There was a problem hiding this comment.
Avoid deleting directories during mismatch cleanup
When --self-test fails and --output accidentally names a directory rather than an existing .fsm file, this cleanup calls os.Remove on that path; for an empty directory it deletes the directory instead of just clearing a stale snapshot. The normal publish path would only fail the later rename, so the destructive behavior is specific to this mismatch cleanup and should be guarded by a stat/type check that only removes regular stale FSM files.
Useful? React with 👍 / 👎.
| if errors.Is(err, backup.ErrSelfTestLowerLastCommitTS) { | ||
| return exitDataErr, err | ||
| } | ||
| if errors.Is(err, backup.ErrEncodeUnsupportedDynamoDBLayout) { | ||
| return exitDataErr, err | ||
| } | ||
| if errors.Is(err, backup.ErrEncodeAdapterData) { | ||
| return exitDataErr, err | ||
| } | ||
| if errors.Is(err, errSelfTestMismatch) { |
There was a problem hiding this comment.
Classify corrupt manifests as data failures
When MANIFEST.json is present but invalid or from an unsupported format version, readInputManifest returns backup.ErrInvalidManifest/backup.ErrUnsupportedFormatVersion, but neither sentinel is mapped here so the CLI exits 1. That treats a broken dump tree as an operator flag/path error, despite the encode CLI contract separating bad backup data from user-input mistakes with exit 2, which can make runbooks skip their quarantine/repair path for corrupt manifests.
Useful? React with 👍 / 👎.
## Codex P2 v14 (1/2): mismatch cleanup must not delete directories v12 added os.Remove(cfg.outputPath) on the self-test-mismatch branch of writeAndPublish to keep the sidecar+FSM internally consistent. That removal was unconditional. If --output names a directory (an operator typo — e.g. forgot the .fsm filename and passed a directory instead), the normal publish path would have failed at os.Rename, but the mismatch cleanup branch would silently call os.Remove on the directory and, for an empty one, delete it. Destructive behavior specific to the mismatch path. Fix: new helper removeStaleOutputFSM(outputPath, logger) performs an os.Lstat first and only proceeds with os.Remove when info.Mode() reports IsRegular. Non-regular files (directory, symlink, named pipe, etc.) are logged at warn level and left alone. Stat errors other than ErrNotExist are also warn-and-continue. Pinned by TestCLISelfTestMismatchSkipsDirectoryAtOutputPath: pre- places an empty directory at --output, drives a self-test mismatch via the corruption seam, asserts publishErr == errSelfTestMismatch AND the directory is still present + still a directory. ## Codex P2 v14 (2/2): classify corrupt manifests as data failures readInputManifest returns backup.ErrInvalidManifest (invalid JSON or schema-invariant violation) and backup.ErrUnsupportedFormatVersion (format_version unknown) when the input MANIFEST.json is broken. Neither sentinel was mapped in run(), so the CLI exited 1 — treating a broken dump tree as an operator-flag error and breaking runbook recovery paths that branch on exit status to triage corrupt-input vs operator-typo. Fix: extracted the error→exit-code mapping into classifyEncodeError and added ErrInvalidManifest + ErrUnsupportedFormatVersion to the exit-2 set. Switch-with-multiple-cases form keeps the function under nestif/cyclop. Pinned by TestCLIInvalidManifestExitsTwo (subtests: invalid JSON body; unsupported format_version). Both subtests assert run() exits exitDataErr. ## Caller audit per CLAUDE.md semantic-change rule - writeAndPublish self-test mismatch branch: os.Remove call replaced with removeStaleOutputFSM. Sole caller is encodeOne; the publishErr contract is unchanged (still errSelfTestMismatch); on-disk state changes ONLY when --output was a non-regular file, in which case the prior behavior was destructive AND wrong. No legitimate caller is impacted. - run() error classification: extracted to classifyEncodeError. Sole caller is run(). Two additional sentinels now classify as exit-2 (ErrInvalidManifest, ErrUnsupportedFormatVersion); all prior exit-2 sentinels still classify as exit-2; exit-1 paths are strictly a subset of the prior set. No runbook that branches on exit-2 would regress; runbooks that branch on exit-1 will see corrupt-manifest cases move to exit-2 where they belong. ## Lint compliance run() had nestif complexity 5 (six sequential errors.Is branches). classifyEncodeError uses a switch with multiple errors.Is cases per arm, which drops the complexity score below the linter bound. Tests + lint green.
|
@claude review v15 addresses both codex P2 v14 findings. codex P2 v14 (1/2) — mismatch cleanup must not delete directories: my v12 codex P2 v14 (2/2) — corrupt manifests should be exit 2: Caller audit (CLAUDE.md semantic-change rule):
Lint:
|
|
Claude finished @bootjp's task in 2m 36s —— View job Code Review - PR #904 (v15)Tasks
v15 Verification — Both codex P2 v14 Items Confirmed
The One Non-Blocking Observation
If The v14 codex finding was specifically about directories; the symlink case is a pre-existing edge that v14 didn't surface, and the conservative behavior (leave non-regular alone) is defensible. Not a blocker. Strengths in v15
Verdict: Ready to merge. Both codex P2 v14 items are correctly implemented and pinned by tests. No new issues in v15. All 14 prior review rounds' findings are resolved. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: af8c279c4c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| for _, idx := range indices { | ||
| if err := NewRedisEncoder(inRoot, idx).Encode(b); err != nil { |
There was a problem hiding this comment.
Reject Redis DB fan-out until DBs are encoded
When the input contains redis/db_3 or multiple redis/db_N directories, this loop feeds each directory into the same snapshotBuilder, but NewRedisEncoder uses dbIndex only to choose the input directory and emits the same !redis|... MVCC keys with no database component. The restored/decode path also constructs a single RedisDB from DecodeOptions.RedisDBIndex (default db_0), so distinct databases are merged, same-named keys across DBs fail as duplicates, and a db_3-only self-test decodes under db_0 instead of round-tripping; until the native keys preserve DB identity, non-zero/multiple Redis DB inputs should fail closed rather than publish a mis-scoped snapshot.
Useful? React with 👍 / 👎.
| func parseRedisDBDir(ent os.DirEntry) (int, bool) { | ||
| if !ent.IsDir() { | ||
| return 0, false |
There was a problem hiding this comment.
Fail closed on malformed Redis db_N entries
When a canonical Redis database entry such as redis/db_0 is a regular file or symlink, ent.IsDir() is false and this parser silently skips it, so encodeAllRedisDBs can succeed with no Redis records and publish a header-only/partial FSM. The per-DB encoder's existing Lstat guard would reject these shapes with ErrRedisEncodeNotDir; enumeration should preserve that fail-closed behavior for names matching db_<N> instead of treating malformed dump contents as an absent DB.
Useful? React with 👍 / 👎.
v14 added an enumerate-and-fan-out path for redis/db_<N>/ to address
codex P1 v13's silent-data-loss concern. Codex's v14 follow-up
clarified that the fan-out is itself incorrect, and two P2 findings
landed (L452 + L427).
## Codex P2 v14 L452: Reject Redis DB fan-out until DBs are encoded
The Redis MVCC key prefixes (!redis|str|, !redis|hll|, !redis|ttl|,
plus the collection helpers' analogues) carry NO database component.
Feeding two distinct DBs into the same snapshotBuilder would:
- collide on same-named keys across DBs (b.Add returns ErrDuplicate),
or
- merge keys under db_0 at decode time (DecodeOptions.RedisDBIndex
defaults to 0, so a db_3-only self-test would decode under db_0
and the structural diff would fail).
In either case the produced .fsm is mis-scoped: the v14 fan-out
replaces silent-data-loss with silent-merge-or-collide.
Fix: encodeAllRedisDBs now enumerates and decides:
- 0 indices (no redis/, or empty redis/) → no-op.
- [0] only → proceed with NewRedisEncoder(_, 0) (the legacy path).
- anything else → fail closed with new sentinel
ErrRedisEncodeMultiDBUnsupported.
The sentinel is marked at the runAdapterEncoders boundary with
ErrEncodeAdapterData so the CLI routes it to exit-2. Until Phase 1
makes native keys DB-aware, multi-DB inputs are quarantined as
data-correctness failures rather than silently mis-scoped output.
## Codex P2 v14 L427: Fail closed on malformed Redis db_N entries
The v14 parseRedisDBDir(ent) returned (_, false) whenever
!ent.IsDir(), so a regular file or symlink at redis/db_0 would be
silently skipped and the encoder would publish a header-only .fsm.
Fix: split into parseRedisDBName (pure name parser) and shift the
ent.IsDir() check into enumerateRedisDBs. When the name matches the
canonical db_<N> pattern AND the entry is not a directory,
enumerateRedisDBs returns ErrRedisEncodeNotDir (mirroring the per-DB
encoder's existing Lstat guard).
## Pinned by
- TestEncodeSnapshotRedisRejectsNonZeroDB: redis/db_3-only fixture
through EncodeSnapshot → errors.Is ErrRedisEncodeMultiDBUnsupported
AND errors.Is ErrEncodeAdapterData (both fire).
- TestEncodeSnapshotRedisRejectsMultipleDBs: redis/db_0 + redis/db_3
→ errors.Is ErrRedisEncodeMultiDBUnsupported.
- TestEnumerateRedisDBsRejectsNonDirDBEntry: redis/db_2 as a regular
file → errors.Is ErrRedisEncodeNotDir.
- The replaced TestEncodeSnapshotRedisMultiDB (v14) was reformulated
as TestEncodeSnapshotRedisRejectsNonZeroDB; the prior "more bytes
than baseline" assertion no longer holds because the encoder
rejects the fixture upfront.
## Caller audit per CLAUDE.md semantic-change rule
- encodeAllRedisDBs: sole caller is adapterRunners.redis. Semantic
change: was "fan out across all db_<N>"; now "encode db_0 only,
fail closed on anything else." Error contract:
- errors.Is(err, ErrEncodeAdapterData) — still true (mark applied
at runAdapterEncoders, unchanged).
- errors.Is(err, ErrRedisEncodeMultiDBUnsupported) — new path.
- errors.Is(err, ErrRedisEncodeNotDir) — newly reachable via
enumerateRedisDBs for malformed db_<N> entries.
All in-tree NewRedisEncoder direct callers (encode_redis_test.go,
encode_redis_coll_test.go, encode_redis_hardlink_unix_test.go) are
unaffected — they bypass the adapter runner.
- enumerateRedisDBs: was "silently skip non-dir db_<N>"; now "fail
closed with ErrRedisEncodeNotDir." No production callers exist
outside encodeAllRedisDBs; tests in encode_snapshot_test.go cover
the new fail-closed path.
- parseRedisDBDir → parseRedisDBName: signature change (DirEntry →
string). Sole caller is enumerateRedisDBs (updated in lock-step).
- v14's TestEncodeSnapshotRedisMultiDB is replaced — see above.
TestEnumerateRedisDBsMixedEntries (v14) continues to pass: its
positive entries (db_0, db_1, db_5) are all directories, so the
IsDir check it never exercises remains exercised by the new
TestEnumerateRedisDBsRejectsNonDirDBEntry.
Tests + lint green.
|
@claude review v16 corrects v14's Redis multi-DB fan-out per codex P2 v14 (L452 + L427). codex P2 v14 L452 (semantic correctness) — v14's fan-out across Fix:
Marked with codex P2 v14 L427 (malformed db_) — Pinned by three tests:
Caller audit (CLAUDE.md semantic-change rule):
|
|
Claude finished @bootjp's task in 2m 50s —— View job Code Review - PR #904 (v16)Tasks
v16 Verification — Both codex P2 v14 Items Confirmed
The decision logic in One Minor Observation (non-blocking)Two new rejection tests don't assert
All five prior rejection tests consistently include this assertion:
The underlying behavior is correct — Strengths in v16
Verdict: Ready to merge. Both codex P2 v14 items are correctly implemented. The minor observation ( |

Summary
Phase 0b M6 implementation per the merged design doc
docs/design/2026_06_01_proposed_snapshot_encode_cli.md(#896, merged at fe9e941).Wires the merged M1–M5 encoder slices into a user-facing CLI plus a library entrypoint mirroring the decoder's
DecodeSnapshot. Implements the round-trip self-test the parent doc mandates, with write-then-rename atomic publish so a self-test failure never reaches the restore path.What lands
Library (
internal/backup/):encode_snapshot.go:EncodeSnapshot(opts, out io.Writer) (EncodeResult, error)— high-level wrapper that dispatches per-adapter encoders in canonical fan-out order (redis → dynamodb → s3 → sqs), implements two-mode buffering (stream whenSelfTest=false, buffer whenSelfTest=true), runs the structural self-test against the in-memory buffer, and copies tooutonly on match. UnexportedcorruptBufferForTesthook lets same-package tests inject buffer corruption that reaches the self-test decode but neverout(codex P2 v6 docs(backup): propose Phase 0b M6 - cmd/elastickv-snapshot-encode CLI #896 — write-then-rename atomicity).encode_info.go:EncodeInfoschema +WriteEncodeInfo/ReadEncodeInfohelpers +EncodeInfoSidecarPath(path-derived<output>.encode_info.json, no static-name collisions per gemini medium v2 docs(backup): propose Phase 0b M6 - cmd/elastickv-snapshot-encode CLI #896). Format-version gate so a future bump surfaces asErrUnsupportedEncodeInfoFormatVersion.manifest.go:Exclusions.RenameS3Collisions booladded with JSON tagrename_s3_collisions. Intentionally NOT inexclusionsRequiredFieldsso legacy manifests decode safely with the zero valuefalse(claude v5 medium docs(backup): propose Phase 0b M6 - cmd/elastickv-snapshot-encode CLI #896).Decoder CLI update (
cmd/elastickv-snapshot-decode/main.go):emitManifestnow populates the newRenameS3Collisionsfield fromcfg.renameCollisions, completing the decoder→encoder round-trip for--rename-collisionsdumps.Encoder CLI (
cmd/elastickv-snapshot-encode/main.go):--input,--output,--adapter(decoder-parity CSV parser),--last-commit-ts,--self-test,--scratch-root.<output>.tmp-<random>, fsync+close, then rename. Self-test failure →mismatch.txtnext to where.fsmwould have been, temp file removed, exit 2.--last-commit-ts T < manifest→ exit 2 with typedErrSelfTestLowerLastCommitTS.encoder_versionstamped at build time via-ldflags "-X main.version=..."(mirrors decoder pattern atmain.go:45).Test plan
go test -race -count=1— all green:internal/backup/: 3 new tests (encode_info schema, format gate, legacy manifest forward-compat)internal/backup/encode_snapshot_test.go: 5 tests (library round-trip, self-test match against canonicalized input, corruption never reachesout, missing-input guard, sidecar path derivation)cmd/elastickv-snapshot-encode/main_test.go: 9 tests (missing manifest, unknown adapter, lower-TS fail-closed, equal+higher TS accept, path-derived sidecar, two-files-no-collision, full round-trip with--self-test, atomic-publish never leaves bad.fsm,--last-commit-tsparser)make lint: clean.Caller audit per CLAUDE.md semantic-change rule
Exclusionsstruct gained a field. Existing callers either build via field-tagged literals (decoder CLI'semitManifest— updated to populate the new field) or read it (encoder'sbuildSelfTestDecodeOptions— new code). No silent semantic change.DecodeOptions.RenameS3Collisionswas already a public field used by the decoder; the encoder now also reads it via the manifest. No caller-side change needed.Self-review (5 passes)
.fsm. Corrupted buffer never reaches theio.Writer(pinned byTestEncodeSnapshotSelfTestDetectsCorruptionassertingout.Len()==0). Allb.Adderrors propagate throughEncodeSnapshot.io.Writer. Temp-file suffix usescrypto/randso concurrent encodes against the same--outputcannot collide.SelfTest=falsestreams with onesha256.Writertee, no extra allocations.SelfTest=trueallocates one FSM-sized*bytes.Bufferplus the scratch decode tree (documented memory cost).--last-commit-ts T < manifestfail-closed with typed error; self-test threads MANIFEST DecodeOptions (Exclusions.*+DynamoDBLayout → DynamoDBBundleJSONL) so trees produced with non-default decoder flags round-trip cleanly.Risk
Low. The encoder is offline; restoration is non-destructive (new keyspace on a fresh cluster via stop-replace-restart). The only public-API change is the
Exclusions.RenameS3Collisionsfield, which is forward-compat for older manifests. All existing M1–M5 tests continue to pass.Summary by CodeRabbit
New Features
Bug Fixes
Tests