Refactor and type the direct APIs, and add complementary high-level Dataverse, filesystem, MCP, and docs layers#227
Open
Conversation
The test suite has been rewritten to cover most aspects and behaviours of pyDataverse.
Bump GitHub Actions test matrix to Python 3.10–3.13 (remove 3.8 and 3.9) and change poetry install from `--with tests` to `--all-extras`. This ensures CI runs on newer supported Python versions and installs all optional extras needed for the test environment.
Introduce a Dockerfile to build a containerized environment for running the project's test suite. Uses a configurable PYTHON_VERSION arg (default 3.11), sets common Python env vars, copies the project source and tests into /app, upgrades pip and installs the project with test-related extras, and uses `pytest -v` as the default command.
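Based on that description, the Dockerfile likely resembles the following sketch; the exact paths, extras names, and env vars are assumptions, not the committed file.

```dockerfile
# Sketch of the test image described above; paths and extras are assumptions.
ARG PYTHON_VERSION=3.11
FROM python:${PYTHON_VERSION}-slim

# Common Python env vars mentioned in the commit
ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1

WORKDIR /app
COPY . /app

# Upgrade pip and install the project with test-related extras
RUN pip install --upgrade pip && pip install ".[tests]"

CMD ["pytest", "-v"]
```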
Remove legacy Sphinx/docs ignore entries (Sphinx header, docs/_build/ and /docs) and add ignores for editor/runtime files: .vscode (including nested .vscode), .cursor, and .virtual_documents. This avoids committing editor settings and generated virtual document files while keeping project-specific docs under src/pyDataverse intact.
Introduce a new pyDataverse/mcp package that integrates Dataverse with FastMCP. Adds core modules:

- `__init__.py`: exports DataverseMCP and FastMCP.
- `server.py`: DataverseMCP and MCPConfiguration to register tools and expose MCP endpoints.
- `collection.py`: collection metadata, listing, RDF graph fetching, SPARQL querying and graph summaries with LRU caching (rdflib + TOON encoding).
- `datacite.py`: async DataCite client to search DOIs and filter results to Dataverse instances (httpx, pydantic models).
- `dataset.py`: fetch dataset metadata and list files, with full/summary modes (pandas support).
- `file.py`: read files (images, PDFs, notebooks, binary blobs, tabular) with size limits, image compression, PDF text extraction, tabular reading and schema export.
- `search.py`: compact search wrapper converting Dataverse search results into lightweight models and TOON encoding.
- `middleware.py`: DataverseMiddleware to inject Dataverse instances into FastMCP context with a TTL cache.

Common behaviors: results are TOON-encoded, sensible defaults for limits/caching, and careful handling of binary/text types. This commit scaffolds MCP-facing utilities for Dataverse tooling and data access.
Introduce a PyFilesystem2-backed Dataverse filesystem and helpers. Adds DataverseFS implementing file browsing, read/write and metadata operations (getinfo, listdir, openbin, remove, setinfo) with caching and URL parsing (from_url). Includes DataverseFileReader to stream downloads (supporting ranged reads and slice access), DataverseFileWriter to stream uploads via a background thread, and TabSpecs/TABULAR_MIME_TYPES to parse tabular files with pandas (open_tabular/stream_tabular). Also exports reader/writer in filesystem package __init__.py. Suppresses certain deprecation warnings during FS imports.
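The ranged-read/slice behavior mentioned above can be sketched as follows; `slice_to_range_header` is a hypothetical helper illustrating how slice access could map onto HTTP `Range` headers, not the actual DataverseFileReader internals.

```python
# Hypothetical sketch: mapping Python slice access to an HTTP Range header.
# The real DataverseFileReader streams via the Dataverse data-access API.
def slice_to_range_header(s: slice, size: int) -> str:
    """Convert reader[start:stop] into a 'bytes=start-stop' Range value."""
    start, stop, _ = s.indices(size)  # normalizes negative/None bounds
    if stop <= start:
        raise ValueError("empty byte range")
    # HTTP byte ranges are inclusive, hence stop - 1
    return f"bytes={start}-{stop - 1}"

header = slice_to_range_header(slice(0, 1024), 10_000)       # "bytes=0-1023"
tail = slice_to_range_header(slice(-100, None), 10_000)      # "bytes=9900-9999"
```

Note how `slice.indices` handles negative and open-ended bounds, which is what makes slice-style access convenient for tail reads of large files.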
Introduce core dataverse content layer: add ContentBase, Dataset, Collection and package exports. ContentBase provides API accessors and common serialization helpers. Dataset implements metadata handling, export/graph generation, file I/O (fs/open/upload), publishing, review workflow, and conversion to Dataverse API payloads. Collection adds content listing, overview (DataFrame), creation/update of sub-dataverses and datasets, async graph aggregation with concurrency control and blank-node deduplication, search scoped to a collection, and metrics access. __init__.py exposes main types and triggers model_rebuild for Pydantic models.
Introduce pyDataverse/dataverse/connect package to dynamically build and handle Dataverse metadata models. Adds create_model_from_block (builder) to generate Pydantic models from metadata block definitions, split/convert field types (primitive, compound, controlledVocabulary), and performant CV validation using Annotated + AfterValidator with caching. Implements BlockConfig and DataverseMetaclass for attaching block metadata, JSONSchemaExtra and FieldInfo helpers, MetadataBlockBase and CompoundField bases with custom serialization and conversion to Dataverse API DTOs, utilities (clean_name, map_field_type), and a ValidatingList type that validates/coerces items and provides a convenient .add(...) API for compound entries. These additions enable schema-driven model creation, serialization, and round-trip mapping to Dataverse create/edit payloads.
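The ValidatingList behavior can be approximated in plain Python; this is a stdlib sketch of the idea (validate/coerce items on insertion, plus a convenient `.add(...)` for compound entries), while the real class integrates with Pydantic, and the `Author` type here is purely illustrative.

```python
from dataclasses import dataclass

# Stdlib sketch of the ValidatingList idea: a list that coerces/validates
# items on insertion and offers .add(...) for compound entries.
class ValidatingList(list):
    def __init__(self, item_type, items=()):
        super().__init__()
        self._item_type = item_type
        for item in items:
            self.append(item)

    def _coerce(self, item):
        if isinstance(item, self._item_type):
            return item
        if isinstance(item, dict):
            return self._item_type(**item)  # coerce dicts into the item type
        raise TypeError(f"cannot coerce {type(item).__name__}")

    def append(self, item):
        super().append(self._coerce(item))

    def add(self, **kwargs):
        """Build a compound entry from keyword arguments and append it."""
        entry = self._item_type(**kwargs)
        super().append(entry)
        return entry

# Hypothetical compound entry type for demonstration
@dataclass
class Author:
    name: str
    affiliation: str = ""

authors = ValidatingList(Author, [{"name": "Ada"}])
authors.add(name="Grace", affiliation="Navy")
```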
Introduce a set of view abstractions for pyDataverse to provide cached, iterable and indexable access to collections, datasets and files. Files added:

- `baseview.py`: Generic BaseView providing caching, iteration and getitem semantics.
- `contentview.py`: ContentView extending BaseView with async prefetching, identifier resolution and lazy fallback iteration.
- `collectionview.py`: CollectionView for iterating/lookup of child collections with async prefetch support.
- `datasetview.py`: DatasetView for dataset iteration/lookup, supports index access, DOI/version suffix parsing and caching.
- `filesview.py`: FilesView to list/index dataset files, optional tabular-only filtering, metadata prefetching, tree-style repr and a pandas DataFrame view.

Key details: caching of fetched items, concurrency-limited prefetch using asyncio/asyncer, robust identifier resolution (including DOI/version handling), and best-effort background metadata prefetching for file objects to reduce blocking I/O.
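The BaseView contract (cached, iterable, indexable) can be sketched in plain Python; the real views are generic, async-aware and prefetching, so the fetch callables below are stand-ins.

```python
# Illustrative stdlib sketch of the BaseView contract: cached, iterable and
# indexable access over lazily fetched items.
class BaseView:
    def __init__(self, fetch_ids, fetch_item):
        self._fetch_ids = fetch_ids    # returns the list of identifiers
        self._fetch_item = fetch_item  # resolves one identifier to an item
        self._ids = None
        self._cache = {}

    def _ensure_ids(self):
        if self._ids is None:
            self._ids = list(self._fetch_ids())
        return self._ids

    def __len__(self):
        return len(self._ensure_ids())

    def __getitem__(self, key):
        ids = self._ensure_ids()
        ident = ids[key] if isinstance(key, int) else key
        if ident not in self._cache:          # fetch once, then serve cached
            self._cache[ident] = self._fetch_item(ident)
        return self._cache[ident]

    def __iter__(self):
        for ident in self._ensure_ids():
            yield self[ident]

calls = []
view = BaseView(lambda: ["a", "b"], lambda i: (calls.append(i), i.upper())[1])
items = list(view)        # triggers one fetch per identifier
repeat = view["a"]        # served from cache, no second fetch
```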
Introduce a new pyDataverse.api.utilities package providing helper utilities for API workflows: an async crawl_collection to recursively enumerate collections and datasets, conc_get_datasets for batched concurrent dataset fetching with semaphore-based rate control, file_input to normalize various file inputs for uploads, and ApiLogger using Rich for formatted console logging. The package exposes crawl_collection and file_input via __init__.py. These utilities centralize common functionality for crawling, bulk fetching, file handling, and logging to simplify higher-level API code.
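Semaphore-based rate control for batched fetching can be sketched as below; this is a hedged sketch in the spirit of conc_get_datasets, where the real helper issues Dataverse API calls and `fetch` here is any awaitable-returning callable.

```python
import asyncio

# Hedged sketch of semaphore-bounded concurrent fetching.
async def bounded_gather(ids, fetch, limit=5):
    sem = asyncio.Semaphore(limit)

    async def one(ident):
        async with sem:          # at most `limit` requests in flight
            return await fetch(ident)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(one(i) for i in ids))

async def fake_fetch(ident):
    await asyncio.sleep(0)       # stand-in for network latency
    return ident * 2

results = asyncio.run(bounded_gather([1, 2, 3], fake_fetch, limit=2))
```

The semaphore caps concurrency without batching manually, which is why this pattern scales to large collections without overwhelming the server.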
Remove unused ApiAuthorizationError import and replace custom exception with ValueError for type checks in ApiTokenAuth and BearerTokenAuth. This simplifies error handling by raising a built-in exception when non-string tokens are provided.
Introduce a new API response model for the Dataverse API. Adds a Status enum (OK/ERROR) and an APIResponse Pydantic BaseModel with fields: status, data (dict|list with default_factory), and optional message. Includes a convenience classmethod from_out_of_format to construct responses from raw dict/list payloads and map HTTP 200 to Status.OK (otherwise Status.ERROR). Adds type hints and documentation for clarity.
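The described model can be approximated with the stdlib as follows; the real APIResponse is a pydantic.BaseModel, and the exact `from_out_of_format` signature is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional, Union

class Status(str, Enum):
    OK = "OK"
    ERROR = "ERROR"

# Stdlib approximation of the Pydantic APIResponse described above.
@dataclass
class APIResponse:
    status: Status
    data: Union[dict, list] = field(default_factory=dict)
    message: Optional[str] = None

    @classmethod
    def from_out_of_format(cls, payload, status_code):
        # HTTP 200 maps to Status.OK, anything else to Status.ERROR
        status = Status.OK if status_code == 200 else Status.ERROR
        return cls(status=status, data=payload)
```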
Replace the old ad-hoc search parameter handling with a typed QueryOptions pydantic model and modernize SearchApi.

- Add QueryOptions (pydantic) to validate and document search parameters (type, subtree, sort, per_page, start, order, fq alias, show_* flags, geo_radius).
- Introduce computed_field api_base_url to build the search endpoint.
- Change SearchApi.search signature to search(query: str, options: Optional[QueryOptions]) -> SearchResponse and use options.model_dump(by_alias=True, exclude_none=True) to build query params.
- Use get_request with url, params and response_model=SearchResponse.
- Expand docstrings and examples.

Note: This is a breaking API change for callers — previous keyword parameters (data_type, subtree, sort, etc.) were removed and should be passed via QueryOptions.
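The effect of `model_dump(by_alias=True, exclude_none=True)` on query-param assembly can be sketched without Pydantic; the field names below are illustrative, not the actual QueryOptions schema.

```python
# Hedged sketch of the parameter-building behavior QueryOptions provides:
# unset (None) fields are dropped before the request is issued.
def build_search_params(query, **options):
    params = {"q": query}
    params.update({k: v for k, v in options.items() if v is not None})
    return params

# Before (removed): keyword arguments passed straight to search(...)
# After: parameters travel through a validated options object, roughly:
params = build_search_params("covid", per_page=50, subtree=None, sort="date")
```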
Introduce a new SemanticApi class to pyDataverse that adds support for Dataverse semantic/linked-data endpoints. Implements synchronous and batched async retrieval of dataset JSON-LD (get_dataset, get_datasets/_get_datasets) with type-hinted overloads, conversion utilities to rdflib.Graph (response_to_graph, responses_to_graph), and URI normalization (_normalize_schema_org_uris) to canonicalize schema.org HTTP→HTTPS. Uses pydantic computed_field properties and integrates with the existing Api base class.
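The HTTP-to-HTTPS canonicalization can be sketched simply; this is a hedged approximation of what `_normalize_schema_org_uris` does, with the real method operating on JSON-LD responses before graph conversion.

```python
# Hedged sketch of schema.org URI canonicalization (HTTP -> HTTPS).
def normalize_schema_org(text: str) -> str:
    return text.replace("http://schema.org", "https://schema.org")

normalized = normalize_schema_org('{"@context": "http://schema.org"}')
```

Canonicalizing to one scheme matters for RDF merging: `http://schema.org/name` and `https://schema.org/name` would otherwise be distinct predicates in the merged graph.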
Migrate SwordApi to a Pydantic-style implementation: add a SwordApiVersion enum and a sword_api_version Field with a default, enable use_enum_values via model_config, and introduce computed fields for api_base_url and base_url_api_sword. Replace the old __init__/httpx.BasicAuth logic and ApiUrlError checks with a validator for sword_api_version and simplify URL construction (get_service_document now uses _assemble_url). Removes direct httpx import and streamlines version handling and URL assembly.
Large refactor of the Api implementation to add async/sync parity, stronger typing, and robust response handling. Highlights:

- Introduced DataverseBase and AbstractApi; Api now inherits improved base and implements api_base_url.
- Added detailed typing, TypeVar/overloads and return-type unions for get/post/put/delete methods to support both raw httpx.Response and parsed Pydantic models (single & lists).
- Implemented async support with _async_request and ensured synchronous behavior via _sync_request; added from_api helper.
- Added stream_file_context for streaming downloads.
- Centralized response processing in _handle_response using APIResponse, improved error handling (HTTPStatusError), and support for Union types via TypeAdapter.
- Improved header handling, default User-Agent injection, configurable timeouts/max connections, and connection check in model_validator.
- Replaced warnings usage, switched APIVersion.LATEST value to "v1", and added structured logging via ApiLogger.

This change improves type-safety, developer ergonomics, and reliability when interacting with Dataverse endpoints (sync and async).
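The TypeVar/overload pattern for the request methods can be sketched as below; `RawResponse` stands in for httpx.Response and the body is a placeholder, since the real code dispatches through httpx and Pydantic's TypeAdapter.

```python
from dataclasses import dataclass
from typing import Optional, Type, TypeVar, Union, overload

T = TypeVar("T")

class RawResponse:
    """Stand-in for httpx.Response in this sketch."""
    def __init__(self, payload):
        self.payload = payload
    def json(self):
        return self.payload

# Sketch of the overload pattern: get() returns the raw response when no
# response_model is given, and a parsed model otherwise.
@overload
def get(url: str, response_model: None = ...) -> RawResponse: ...
@overload
def get(url: str, response_model: Type[T]) -> T: ...

def get(url: str, response_model: Optional[Type[T]] = None) -> Union[RawResponse, T]:
    raw = RawResponse({"url": url})       # placeholder for the HTTP call
    if response_model is None:
        return raw
    return response_model(**raw.json())   # parse into the requested model

# Hypothetical model for demonstration
@dataclass
class Page:
    url: str
```

With the overloads, type checkers infer `RawResponse` for `get("/x")` and `Page` for `get("/x", Page)` without casts at call sites.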
Major refactor of pyDataverse/api/data_access.py: add typing hints, pydantic computed field for API base URL, and httpx-based request handling. Consolidates datafile download logic into an overloaded _get_datafile_core, exposes get_datafile_download_url for redirect URLs, and adds contextmanager-based streaming helpers (stream_datafile, stream_datafiles, stream_datafiles_bundle) for efficient large-file/zip downloads. Improves PID handling and parameter assembly across endpoints, and updates request/permission APIs (request_access, allow_access_request, grant_file_access, list_file_access_requests) to return typed message/AccessRequest models and accept params consistently.
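The contextmanager-based streaming helpers share a shape that can be sketched generically; `open_stream` below is a stand-in for the httpx streaming response the real `stream_datafile` wraps.

```python
import io
from contextlib import contextmanager

# Illustrative shape of a stream_datafile-style helper: the context manager
# guarantees the underlying stream is closed when the block exits.
@contextmanager
def stream_datafile(open_stream):
    stream = open_stream()
    try:
        yield stream
    finally:
        stream.close()

buf = io.BytesIO(b"chunk1chunk2")
with stream_datafile(lambda: buf) as s:
    data = s.read()
# buf is now closed even if reading had raised
```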
Refactors pyDataverse MetricsApi: adds typing, pydantic computed field for base URL, and returns typed MetricsResponse objects. Methods now accept typed parameters (including date objects), support parentAlias query scoping, and use the internal _assemble_url and use_async flags. CSV endpoints are parsed into pandas.DataFrame, dataverse-by-* methods are renamed to get_collections_by_* with deprecated wrappers for the old names, and dataset/location endpoints return structured responses. Misc: improved imports, deprecation annotations, and small HTTPX handling for CSV payloads.
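The deprecated-wrapper pattern for the renamed methods can be sketched as follows; the helper and method names here are illustrative, not the committed code.

```python
import functools
import warnings

# Hedged sketch of the deprecated-wrapper pattern used when renaming
# dataverse-by-* methods to get_collections_by_*.
def deprecated_alias(new_func, old_name):
    @functools.wraps(new_func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{old_name} is deprecated; use {new_func.__name__} instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return new_func(*args, **kwargs)
    return wrapper

def get_collections_by_subject(subject):
    # Placeholder for the real API call
    return {"subject": subject}

# Old name kept working, but emits a DeprecationWarning on use
dataverses_by_subject = deprecated_alias(get_collections_by_subject, "dataverses_by_subject")
```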
Introduce pyDataverse.api.hub with DataverseHub client and HUB_BASE_URL. Adds a cached_property installations that calls the Dataverse Hub API (/api/installations/status) via httpx, validates the JSON with InstallationStatusResponse.model_validate, and returns a List[InstallationStatus]. Includes docstrings describing behavior and possible exceptions (httpx errors, validation errors).
Major rewrite of pyDataverse.api.native.NativeApi: reorganized imports, added extensive type hints and Pydantic response/request models, and introduced async support and concurrent dataset fetching. Replaced many legacy endpoints with typed methods (get_collection, create_collection, publish_collection, delete_collection, get_dataset, get_dataset_versions, get_dataset_export, get_datasets_export, create_dataset, edit_dataset_metadata, private URL helpers, download/stream dataset bundles, etc.), added context manager for streaming, and improved URL/param assembly. Deprecated several old dataverse helpers in favor of new collection-based methods and added logging for key operations. Overall cleans up error handling paths and moves toward a model-driven, async-capable API surface.
Clean up pyDataverse.api exports: explicitly import and expose DataAccessApi, MetricsApi, NativeApi, SearchApi, SwordApi, SemanticApi and DataverseHub via an __all__ list. Removes the old Api import and noqa comments to make the package's public API explicit and add new SemanticApi and DataverseHub exports.
Import warnings and nest_asyncio in package init; suppress pkg_resources DeprecationWarning (from a third-party dependency) and call nest_asyncio.apply() to enable nested asyncio event loops (useful in REPLs/notebooks). Also import QueryOptions and core types (Collection, Dataset, Dataverse, File) and expose them via __all__ so they are available from the package namespace.
Convert pyproject.toml from Poetry-specific [tool.poetry] layout to PEP 621 [project] metadata, bump package version to 0.4.0 and add readme, license, and requires-python. Replace Poetry dependency groups with PEP-style dependencies and optional-dependencies (including updated dependency pins), add project.urls.repository, tighten build-system poetry-core requirement, and reintroduce packages include for pyDataverse. Also adjust pytest asyncio_mode and coverage source to point at the library and remove legacy Poetry group/radon settings.
Rework README.md to modernize badges and project summary, state Python 3.10+ support, and provide clear installation instructions for pip, uv, and poetry. Add a concise Quickstart with high-level API and DOI examples (dataset creation, metadata updates, file I/O, publish), and document running local tests against a Dataverse container with run-tests.sh usage. Clean up links to documentation and community chat (ReadTheDocs, Zulip) and streamline setup/test dependency install commands.
Replace docker-compose based test orchestration with a simpler build-and-run flow. Switch to /usr/bin/env bash and enable strict flags (set -euo pipefail). Read PYTHON_VERSION from environment (default 3.11), require BASE_URL and API_TOKEN (with API_TOKEN_SUPERUSER fallback), normalize container host URLs, build a pydataverse-tests image with PYTHON_VERSION as a build-arg, and run it passing necessary env vars and host.docker.internal mapping. Remove argument parsing, compose files, container polling/logging, and make the script executable.
Add a generated uv.lock file that pins Python package versions, sources, hashes and wheels for reproducible installs (requires-python >=3.10,<4.0). This lockfile locks transitive dependencies and resolution markers to ensure consistent environments across installs.
Replace the ad-hoc globals/sys.modules lookup of CompoundField with a direct call to `field_info.dtype.from_dataverse_dict` when `field.value` is a dict. This simplifies the code, makes the conversion use the field's declared dtype (more robust and accurate), and removes the fallback import logic.
Replace actions/setup-python + Poetry steps with astral-sh/setup-uv and uv commands across workflows; update dependency/install/test commands to use uv (uv sync, uvx ruff, uv run pytest, uv build/publish). Bump actions/checkout to v5 in all workflows and extend build Python matrix to 3.10–3.13. Overall simplifies CI by centralizing tooling on UV and modernizing checkout/versioned actions.
Add setuptools>=82.0.0 to the 'tests' extra in pyproject.toml and update the lockfile accordingly. This is necessary because the "fs" package requires `pkg_resources`, which ships with setuptools.
Remove setuptools from the tests extra in pyproject.toml because it is no longer needed; update the lockfile to reflect the dependency removal.
Replace `uv sync --all-extras` with `uv pip install --system ".[tests,mcp]"` to install the project's test extras, and run tests with `pytest -vv` instead of `uv run pytest` to produce more verbose test output. These changes ensure CI installs the required extras and yields clearer test logs.
Introduce actions/setup-python@v5 using matrix.python-version to ensure the correct Python interpreter is available for subsequent steps. Remove the python-version input from the astral-sh/setup-uv step since Python is now configured separately. This clarifies environment setup and avoids duplicating interpreter configuration.
Add a step to install setuptools and wheel in the tests workflow's Install Python Dependencies job. This ensures build tools are available before installing the package extras (`uv pip install --system ".[tests,mcp]"`), reducing potential build/install failures during CI.
Update GitHub Actions workflow (.github/workflows/tests.yml): remove Python 3.13 from the test matrix so CI runs on 3.10–3.12, and remove a redundant pip install line for setuptools/wheel. Tests now rely on a single pip install for the package plus test extras.
Stop subclassing fs.base.FS and remove the fs import/requirement; DataverseFS is now a plain class. Reorder and clean up imports (add pandas, cachetools.TTLCache, typing_extensions.Self), remove deprecation-warning suppression and unnecessary noqa comments, and adjust placement of pyDataverse.models.file.update import. These changes simplify dependencies and tidy module imports.
Delete the "fs" package from the dependencies list in pyproject.toml. Verify that no code relies on the removed package or add an alternative dependency if required.
Update .github/workflows/tests.yml to include Python 3.13 in the GitHub Actions matrix so the test suite runs against the latest Python release and verifies compatibility with 3.13.
Update: clarify the Annotated base_url parameter docstrings across mcp modules (collection, dataset, dataverse, file, search) to state that the MCP server's connected dataverse is used by default. Add DataverseMCP.base_url and base_url_instructions properties and include the base_url_instructions text in multiple tool descriptions so tools explicitly inform users which dataverse the MCP is connected to and how to override it. Also tighten dependency version bounds in pyproject.toml for several optional mcp-related packages.
Correct malformed version specifiers in pyproject.toml by adding missing commas (e.g., sniffio) and normalizing constraints for mcp-related extras. Tighten upper bounds for several packages (fastmcp, mcp, nbconvert, nbformat, pillow, pymupdf, toon-format) to avoid accidentally allowing next major releases and ensure constraints parse as intended.
Update test dependency range for pytest-asyncio from >=0.23.7,<0.24.0 to >=1.3.0,<2.0.0 in pyproject.toml and refresh the lockfile. The lockfile adds backports-asyncio-runner (for Python <3.11) and typing-extensions marker (for Python <3.13), and updates pytest-asyncio sdist/wheel entries and hashes to 1.3.0.
This PR is a broad refactor of `pyDataverse` that moves the project from a mostly flat, monolithic layout to a typed, package-oriented structure with clearer public APIs.

The main theme is a cleanup and expansion of the direct low-level API surface. The new high-level Dataverse objects are an addition on top of that layer, not a replacement for it.
For reviewers who want to inspect the new documentation experience directly, a preview is available at:
At a high level, it:
Detailed Changes
1. Refactor the low-level API surface into typed modules
The old flat `pyDataverse/api.py` module has been split into a dedicated `pyDataverse/api/` package. This includes:

- `Api` base with typed request handling, `httpx` support, async support, auth handling, and response parsing
- `APIResponse` model and common response/status abstractions
- `NativeApi`, `DataAccessApi`, `SearchApi`, `MetricsApi`, and `SwordApi`
- `SemanticApi` for JSON-LD/RDF-oriented access
- `DataverseHub` client for Hub API interactions

Notable behavior and capability changes in this layer include:
- `QueryOptions`

2. Add a new high-level Dataverse object model
This PR introduces a new `pyDataverse/dataverse/` package that provides higher-level workflow objects on top of the low-level APIs.

New public objects include:

- `Dataverse`
- `Collection`
- `Dataset`
- `File`
- `Metrics`

This layer adds:
The root package export surface is also updated so users can import the high-level workflow objects directly from `pyDataverse`, while the direct low-level API clients remain available through `pyDataverse.api`.

3. Add dynamic metadata block support and generated model packages
The previous monolithic `pyDataverse/models.py` module is replaced by a structured `pyDataverse/models/` package.

This work adds:

- `models/` directory
- `models/generate.sh` script for regenerating typed model code from the schemas
- `pyDataverse/dataverse/connect/` for building typed metadata models from Dataverse metadata block specifications

This makes the codebase easier to extend and gives the API layer a clearer typed contract.
4. Add filesystem-style dataset access
This PR adds a new `pyDataverse/filesystem/` package with a filesystem-oriented interface for working with dataset files.

Key additions include:

- `DataverseFS` for browsing and interacting with a dataset like a filesystem

This gives users a more natural way to read and write files without manually managing raw API calls.
5. Add MCP support and Dataverse MCP tools
This PR adds a new `pyDataverse/mcp/` package and a configurable `DataverseMCP` server integration built around FastMCP.

Included MCP capabilities cover:

This creates a native path for agent/tool interoperability on top of `pyDataverse`.

6. Replace the documentation stack
The old in-package Sphinx/ReadTheDocs documentation has been removed and replaced with a dedicated `docs/` site based on Astro + Starlight.

The new docs include dedicated sections for:
The README is also refreshed to better reflect the current API shape, installation options, and test workflow.
7. Modernize packaging, dependency management, and CI
The project metadata and development workflow have been updated substantially.
Notable changes include:
- PEP 621 `pyproject.toml`
- `tests` and `mcp` extras
- `uv.lock` lockfile
- removal of `requirements.txt`, `tox.ini`, `.travis.yml`, and `.readthedocs.yml`
- `Dockerfile` for running tests
- `run-tests.sh` to build and run the test container locally
- CI on `uv`, newer checkout/setup actions, and an expanded Python matrix

8. Reorganize and expand the test suite
The test layout now follows the refactored package structure more closely.
Additions include dedicated tests for:
- `api/data_access`
- `api/metrics`
- `api/native`
- `api/search`
- `api/semantic`
- `api/sword`
- `api/hub`
- `dataverse`
- `mcp`

Legacy tests tied to the old flat API/model structure and removed helper assets have been cleaned up accordingly.
Breaking Changes / Migration Notes
This is a large architectural change and likely requires downstream updates.
Important migration points:
- `pyDataverse/api.py` and `pyDataverse/models.py` are removed in favor of `pyDataverse/api/` and `pyDataverse/models/`
- `Dataverse`, `Collection`, `Dataset`, and `File` workflow objects
- supported Python is declared in `pyproject.toml` as Python `>=3.10,<4.0`

TLDR
Together, these changes expand `pyDataverse` from its earlier flat layout into a broader typed toolkit for Dataverse.