fix(backup): handle S3-style object stores in config restore #4182

Open
elangelo wants to merge 4 commits into apache:main from elangelo:handle-s3-restores-ignore-marker

Conversation

@elangelo commented Mar 3, 2026

Description

When restoring a backup, use both exists() and listAll() to check for the config directory. Object stores like S3 may not have an explicit directory marker object, but the config files are still present under that prefix. The previous check using only exists() would incorrectly throw an error in such cases.

Solution

See description

For transparency: GitHub Copilot was used to identify and fix the issue we were encountering with this functionality.

Tests

Manually verified on our SolrCloud clusters.

Checklist

Please review the following and check all that apply:

  • [x] I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • [ ] I have created a Jira issue and added the issue ID to my pull request title.
  • [x] I have given Solr maintainers access to contribute to my PR branch. (optional but recommended, not available for branches on forks living under an organisation)
  • [x] I have developed this patch against the main branch.
  • [x] I have run ./gradlew check.
  • [x] I have added tests for my changes.
  • [x] I have added documentation for the Reference Guide.
  • [x] I have added a changelog entry for my change.

samuelverstraete and others added 2 commits March 3, 2026 16:53
When restoring a backup, use both `exists()` and `listAll()` to check
for the config directory. Object stores like S3 may not have an explicit
directory marker object, but the config files are still present under
that prefix. The previous check using only `exists()` would incorrectly
throw an error in such cases.
…configset

When a backup is copied through a local filesystem and re-uploaded to S3
(e.g. via `aws s3 sync`), the zero-byte directory marker objects created
by S3BackupRepository are typically lost. This caused restore to fail with
an error even though all configset files were present under the expected
key prefix.

Change BackupManager#uploadConfigDir to fall back to listing objects by
key prefix when the directory marker is absent, so restore succeeds as
long as the configset files themselves are present. A clear
IllegalArgumentException is still thrown when neither marker nor files
can be found.

- Add unit tests covering the no-marker, truly-absent, and normal cases
- Add documentation in the S3 backup/restore reference guide explaining
  the directory-marker behaviour and the graceful fallback

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
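The fallback described in this commit can be sketched as follows. This is a minimal illustration, not the actual patch: `BackupRepo`, `exists()`, and `listAll()` are simplified stand-ins for Solr's `BackupRepository` API, and `verifyConfigDir` is a hypothetical helper.

```java
import java.util.List;

// Simplified stand-in for Solr's BackupRepository abstraction.
interface BackupRepo {
    boolean exists(String path);          // true only if an explicit object (or marker) exists
    List<String> listAll(String prefix);  // keys found under the prefix
}

class ConfigDirCheck {
    // Accept the config directory if either the marker object exists or
    // files are present under the prefix; otherwise fail with a clear error.
    static void verifyConfigDir(BackupRepo repo, String configDir) {
        if (repo.exists(configDir)) {
            return; // explicit directory marker found
        }
        if (!repo.listAll(configDir).isEmpty()) {
            return; // S3 "virtual directory": marker lost, but files are present
        }
        throw new IllegalArgumentException(
            "Config directory not found in backup: " + configDir);
    }
}
```

The key point is that an empty listing, not a missing marker, is what signals a truly absent configset.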
@github-actions bot added the documentation (Improvements or additions to documentation) and tests labels Mar 3, 2026
@@ -672,6 +672,14 @@ The location must already exist within your S3 bucket before you can perform any
Empty Location::
If you do not want to use a sub-directory within your bucket to store your backup, you can use any of the following location options: `/`, `s3:/`, `s3://`.
However, the location option is mandatory and you will receive an error when trying to perform backup operations without it.

Directory Markers::
S3 has no native concept of directories.
Contributor
Is there some better approach that we SHOULD be using instead of this zero-byte marker-objects approach? If you were doing this today, is this how you would fix it? It feels like you are doing a bug fix on top of a more fundamentally flawed approach. Or is this the way we do these types of things?

Author
Well... I got sucked into this without, indeed, looking at the bigger picture. I'm not saying it's absolutely wrong to work with PREMARKER blobs, but it is certainly not how I would implement it.
The whole 'check if a file exists' pattern is an anti-pattern anyway when working with cloud storage: even that check is a request you pay for. It's better to just attempt the operation and deal with the error if there is one.
I'm going to look at whether it's possible to change the way the backups themselves are taken and drop the whole PREMARKER approach entirely. This should still be backwards compatible with the current backups that people rely on. The only restriction is that I will have to look into the storage abstraction and see whether I can make it work for both local and cloud storage without violating their ways of working.

Contributor
Awesome.

ZkStateReader mockZkStateReader = mock(ZkStateReader.class);
ConfigSetService mockConfigSetService = mock(ConfigSetService.class);

// Simulate S3: no marker object → exists() is false, but listAll() finds the files
Contributor

Maybe it's just me, but I find mocks hard to understand. We have S3Mock as a dependency; could we be using it for these tests?

Author

I'll have a look into this. I missed that.

S3StorageClient.isDirectory: when no explicit directory-marker object
exists (NoSuchKeyException), fall back to listing up to one object
under the path prefix. This handles "virtual directories" that arise
after an `aws s3 sync` round-trip, which silently drops zero-byte
directory marker keys, causing restore to fail with NoSuchKey 404.
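The `isDirectory` fallback described above might look roughly like this. `ObjectStore`, `headObject`, and `listObjects` are hypothetical stand-ins for the S3 client calls, not the real `S3StorageClient` or AWS SDK signatures.

```java
import java.util.List;

// Stand-in for the SDK's 404-on-missing-key error.
class NoSuchKeyException extends RuntimeException {}

// Simplified stand-in for the S3 client used by S3StorageClient.
interface ObjectStore {
    void headObject(String key);                      // throws NoSuchKeyException if absent
    List<String> listObjects(String prefix, int max); // up to `max` keys under the prefix
}

class DirCheck {
    static boolean isDirectory(ObjectStore store, String path) {
        String marker = path.endsWith("/") ? path : path + "/";
        try {
            store.headObject(marker); // explicit zero-byte directory marker
            return true;
        } catch (NoSuchKeyException e) {
            // Marker lost (e.g. after an `aws s3 sync` round-trip): fall back
            // to listing at most one object under the prefix.
            return !store.listObjects(marker, 1).isEmpty();
        }
    }
}
```

Listing with a limit of one keeps the fallback cheap: a single object under the prefix is enough to prove the "virtual directory" exists.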

CloudSolrClient: before computeIfAbsent for collectionRefreshes, evict
any already-completed future via computeIfPresent. Without this, a
stale done future is reused instead of submitting a fresh refresh,
so callers (e.g. stale-state retry) never observe an updated ref.

Also adds a unit test covering the virtual-directory fallback path and
a changelog entry for the S3 fix.

Labels

client:solrj, documentation (Improvements or additions to documentation), module:s3-repository, tests


3 participants