
feat: add StorageAdapter plugin system for third-party storage protocols#1432

Open
kushalbakshi wants to merge 5 commits into datajoint:master from kushalbakshi:feature/storage-adapter-plugin

Conversation

@kushalbakshi
Contributor

Summary

Adds a StorageAdapter entry-point plugin system that allows third-party packages to register new storage protocols without modifying DataJoint internals.

  • New StorageAdapter ABC in src/datajoint/storage_adapter.py with 4 extension points: create_filesystem, validate_spec, full_path, get_url
  • Lazy entry-point discovery via datajoint.storage group (mirrors the existing codec plugin pattern in codecs.py)
  • Plugin delegation wired into settings.py (get_store_spec) and storage.py (_create_filesystem, _full_path, get_url)
  • Zero changes to built-in protocol code paths (file, s3, gcs, azure)
  • Clear error message when an unknown protocol has no adapter installed

Motivation

DataJoint's external storage layer currently supports four hardcoded protocols. Extending it with custom fsspec-backed storage requires monkey-patching private methods — fragile, non-composable, and breaks on upstream refactors. This PR adds a clean plugin system so third-party packages can register storage protocols via standard Python entry points.

How third-party packages use it

```python
from datajoint.storage_adapter import StorageAdapter

class MyStorageAdapter(StorageAdapter):
    protocol = "myprotocol"
    required_keys = ("protocol", "endpoint", "bucket")
    allowed_keys = required_keys + ("token",)

    def create_filesystem(self, spec):
        return MyFsspecFilesystem(**spec)
```

```toml
# In their pyproject.toml:
[project.entry-points."datajoint.storage"]
myprotocol = "my_package:MyStorageAdapter"
```

No import needed — DataJoint auto-discovers the adapter when it encounters the protocol in dj.config.stores.

Test plan

  • 19 new unit tests covering registry, validation defaults, delegation, and error handling
  • All existing unit tests pass unchanged (zero regressions)

🤖 Generated with Claude Code

kushalbakshi and others added 5 commits April 15, 2026 16:35
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When a store config uses an unrecognised protocol, settings.py now
queries the adapter registry before raising an error. Registered
plugin adapters are validated and default keys are applied; unknown
protocols surface a clear message directing users to install a plugin
package. Includes new test class TestGetStoreSpecPluginDelegation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The three else-branches in StorageBackend (_create_filesystem,
_full_path, get_url) now query the adapter registry before falling
back to the built-in behaviour. Registered plugin adapters are called
for filesystem construction, path composition, and URL generation;
unknown protocols still raise DataJointError. Includes new test class
TestStorageBackendPluginDelegation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move imports to top of file and apply ruff-format to satisfy CI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@kushalbakshi force-pushed the feature/storage-adapter-plugin branch from 59c47de to cd93c19 on April 15, 2026 21:39
@dimitri-yatsenko
Member

Review — two simpler alternatives to consider

Good work, Kushal. The need is real: third-party packages should be able to register new storage backends without monkey-patching. Before we commit to the ABC approach, I want to explore whether we can achieve the same result with less machinery — or with machinery that also improves the existing code.

Background: what StorageBackend actually does on top of fsspec

Looking at `storage.py`, there are five protocol-specific concerns:

  1. Config validation — different required keys per protocol
  2. Constructor arg mapping — translating DJ config → fsspec constructor params (e.g., S3 endpoint → `client_kwargs`)
  3. Path construction — cloud protocols prepend bucket; file:// uses pathlib
  4. URL generation — file:// needs absolute path resolution; clouds are `protocol://path`
  5. Atomic writes — file:// uses `safe_copy`/`safe_write`; clouds delegate to fsspec

~80% of I/O already delegates to fsspec directly. The protocol-specific code is concentrated in items 1-4, spread across ~26 `if self.protocol == "..."` branches.

fsspec already has its own protocol discovery — `fsspec.filesystem("s3")` auto-discovers S3FileSystem. So the question: what does DataJoint need to add on top of fsspec's own plugin system?


Alternative A — Generic fsspec passthrough

The lightest option: make `StorageBackend` accept any fsspec-supported protocol by default. No new ABC, no entry points, no registry.

For unknown protocols:

  • Pass non-DataJoint config keys directly to `fsspec.filesystem(protocol, **kwargs)`
  • Use generic defaults for path (`location/relpath`) and URL (`protocol://path`)

~30 lines of changes. No new module. Works for any fsspec filesystem immediately. Downside: no config validation or custom path/URL logic for unknown protocols.
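A sketch of what that passthrough could look like — the `RESERVED_KEYS` set and function names are assumptions for illustration, not existing DataJoint code:

```python
# Keys DataJoint consumes itself; everything else is forwarded to fsspec.
# This set is an illustrative assumption, not the real config schema.
RESERVED_KEYS = {"protocol", "location", "subfolding", "stage"}

def passthrough_kwargs(spec):
    """Non-DataJoint keys, to feed fsspec.filesystem(spec['protocol'], **kwargs)."""
    return {k: v for k, v in spec.items() if k not in RESERVED_KEYS}

def generic_full_path(spec, relpath):
    """Generic default path: location/relpath, joined with single slashes."""
    return "/".join(p.strip("/") for p in (spec.get("location", ""), relpath) if p)

def generic_url(spec, relpath):
    """Generic default URL: protocol://path."""
    return f"{spec['protocol']}://{generic_full_path(spec, relpath)}"
```

The trade-off is visible here: anything not in the reserved set is forwarded blindly, so a typo in a config key surfaces as an fsspec constructor error rather than a validation message.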


Alternative B — Unified adapter model for ALL protocols (preferred)

Route all backends through adapters — including the existing four. The 26 hardcoded `if self.protocol == "..."` branches become adapter implementations:

```python
class S3Adapter(StorageAdapter):
    protocol = "s3"

    def create_filesystem(self, spec):
        endpoint = spec["endpoint"]
        if not endpoint.startswith(("http://", "https://")):
            endpoint_url = f"{'https' if spec.get('secure') else 'http'}://{endpoint}"
        else:
            endpoint_url = endpoint
        return fsspec.filesystem("s3", client_kwargs={"endpoint_url": endpoint_url}, ...)

    def validate_spec(self, spec):
        missing = [k for k in ("endpoint", "bucket", "access_key", "secret_key") if k not in spec]
        if missing:
            raise DataJointError(f"S3 store missing: {', '.join(missing)}")
```

Built-in adapters (File, S3, GCS, Azure) registered at import time. Third-party adapters discovered via entry points. `StorageBackend` becomes a thin dispatcher — one code path for all protocols.

This eliminates the 26 hardcoded protocol branches, makes third-party backends first-class citizens alongside built-ins, and gives us the right abstraction. The user-facing API (`dj.config.stores`) doesn't change at all — this is purely internal refactoring. We test against the existing integration test suite and release within the 2.2.x series when confident.
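The dispatch side of that unified model could be very small. A minimal sketch, assuming illustrative names (`_REGISTRY`, `register`, `get_adapter`) rather than actual DataJoint identifiers:

```python
# Minimal sketch of a unified adapter registry: built-ins register at
# import time; entry-point plugins would land in the same dict.
_REGISTRY = {}

def register(adapter_cls):
    """Class decorator registering a built-in adapter at import time."""
    _REGISTRY[adapter_cls.protocol] = adapter_cls()
    return adapter_cls

@register
class FileAdapter:
    protocol = "file"

    def full_path(self, spec, relpath):
        return f"{spec['location'].rstrip('/')}/{relpath}"

def get_adapter(protocol):
    """One lookup path for built-ins and plugins alike."""
    try:
        return _REGISTRY[protocol]
    except KeyError:
        raise LookupError(f"no storage adapter registered for {protocol!r}") from None
```

With this shape, `StorageBackend` methods reduce to `get_adapter(self.protocol).full_path(...)` and so on — no protocol conditionals left in the dispatcher.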


Assessment

Your current PR has the right instincts — ABC, entry points, lazy discovery. But it creates a two-tier system: hardcoded paths for built-in protocols, adapter path for everything else. That means maintaining both code paths going forward.

I'd prefer Alternative B: if we're introducing the adapter abstraction, route everything through it. The existing integration tests validate that the built-in adapters behave correctly. Since the user API doesn't change, this is safe to land in 2.2.x with thorough testing.

@kushalbakshi — would you be up for reworking this PR to route the four existing protocols through adapters as well? The `StorageAdapter` ABC and entry-point discovery from your PR are the right foundation; the additional work is extracting the existing protocol-specific logic from `StorageBackend` into adapter subclasses.

