Refactor and type the direct APIs, and add complementary high-level Dataverse, filesystem, MCP, and docs layers#227
Open
Conversation
The test suite has been rewritten to cover most aspects and behaviours of pyDataverse.
Bump GitHub Actions test matrix to Python 3.10–3.13 (remove 3.8 and 3.9) and change poetry install from `--with tests` to `--all-extras`. This ensures CI runs on newer supported Python versions and installs all optional extras needed for the test environment.
Introduce a Dockerfile to build a containerized environment for running the project's test suite. Uses a configurable PYTHON_VERSION arg (default 3.11), sets common Python env vars, copies the project source and tests into /app, upgrades pip and installs the project with test-related extras, and uses `pytest -v` as the default command.
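Based on that description, the Dockerfile likely resembles the following sketch; the exact paths, extras names, and env vars are assumptions, not the committed file.

```dockerfile
# Sketch of the test image described above; paths and extras are assumptions.
ARG PYTHON_VERSION=3.11
FROM python:${PYTHON_VERSION}-slim

# Common Python env vars mentioned in the commit
ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1

WORKDIR /app
COPY . /app

# Upgrade pip and install the project with test-related extras
RUN pip install --upgrade pip && pip install ".[tests]"

CMD ["pytest", "-v"]
```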
Remove legacy Sphinx/docs ignore entries (Sphinx header, docs/_build/ and /docs) and add ignores for editor/runtime files: .vscode (including nested .vscode), .cursor, and .virtual_documents. This avoids committing editor settings and generated virtual document files while keeping project-specific docs under src/pyDataverse intact.
Introduce a new pyDataverse/mcp package that integrates Dataverse with FastMCP. Adds core modules:

- `__init__.py`: exports DataverseMCP and FastMCP.
- `server.py`: DataverseMCP and MCPConfiguration to register tools and expose MCP endpoints.
- `collection.py`: collection metadata, listing, RDF graph fetching, SPARQL querying and graph summaries with LRU caching (rdflib + TOON encoding).
- `datacite.py`: async DataCite client to search DOIs and filter results to Dataverse instances (httpx, pydantic models).
- `dataset.py`: fetch dataset metadata and list files, with full/summary modes (pandas support).
- `file.py`: read files (images, PDFs, notebooks, binary blobs, tabular) with size limits, image compression, PDF text extraction, tabular reading and schema export.
- `search.py`: compact search wrapper converting Dataverse search results into lightweight models and TOON encoding.
- `middleware.py`: DataverseMiddleware to inject Dataverse instances into FastMCP context with a TTL cache.

Common behaviors: results are TOON-encoded, sensible defaults for limits/caching, and careful handling of binary/text types. This commit scaffolds MCP-facing utilities for Dataverse tooling and data access.
Introduce a PyFilesystem2-backed Dataverse filesystem and helpers. Adds DataverseFS implementing file browsing, read/write and metadata operations (getinfo, listdir, openbin, remove, setinfo) with caching and URL parsing (from_url). Includes DataverseFileReader to stream downloads (supporting ranged reads and slice access), DataverseFileWriter to stream uploads via a background thread, and TabSpecs/TABULAR_MIME_TYPES to parse tabular files with pandas (open_tabular/stream_tabular). Also exports reader/writer in filesystem package __init__.py. Suppresses certain deprecation warnings during FS imports.
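The ranged-read/slice behavior mentioned above can be sketched as follows; `slice_to_range_header` is a hypothetical helper illustrating how slice access could map onto HTTP `Range` headers, not the actual DataverseFileReader internals.

```python
# Hypothetical sketch: mapping Python slice access to an HTTP Range header.
# The real DataverseFileReader streams via the Dataverse data-access API.
def slice_to_range_header(s: slice, size: int) -> str:
    """Convert reader[start:stop] into a 'bytes=start-stop' Range value."""
    start, stop, _ = s.indices(size)  # normalizes negative/None bounds
    if stop <= start:
        raise ValueError("empty byte range")
    # HTTP byte ranges are inclusive, hence stop - 1
    return f"bytes={start}-{stop - 1}"

header = slice_to_range_header(slice(0, 1024), 10_000)       # "bytes=0-1023"
tail = slice_to_range_header(slice(-100, None), 10_000)      # "bytes=9900-9999"
```

Note how `slice.indices` handles negative and open-ended bounds, which is what makes slice-style access convenient for tail reads of large files.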
Introduce core dataverse content layer: add ContentBase, Dataset, Collection and package exports. ContentBase provides API accessors and common serialization helpers. Dataset implements metadata handling, export/graph generation, file I/O (fs/open/upload), publishing, review workflow, and conversion to Dataverse API payloads. Collection adds content listing, overview (DataFrame), creation/update of sub-dataverses and datasets, async graph aggregation with concurrency control and blank-node deduplication, search scoped to a collection, and metrics access. __init__.py exposes main types and triggers model_rebuild for Pydantic models.
Introduce pyDataverse/dataverse/connect package to dynamically build and handle Dataverse metadata models. Adds create_model_from_block (builder) to generate Pydantic models from metadata block definitions, split/convert field types (primitive, compound, controlledVocabulary), and performant CV validation using Annotated + AfterValidator with caching. Implements BlockConfig and DataverseMetaclass for attaching block metadata, JSONSchemaExtra and FieldInfo helpers, MetadataBlockBase and CompoundField bases with custom serialization and conversion to Dataverse API DTOs, utilities (clean_name, map_field_type), and a ValidatingList type that validates/coerces items and provides a convenient .add(...) API for compound entries. These additions enable schema-driven model creation, serialization, and round-trip mapping to Dataverse create/edit payloads.
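The ValidatingList behavior can be approximated in plain Python; this is a stdlib sketch of the idea (validate/coerce items on insertion, plus a convenient `.add(...)` for compound entries), while the real class integrates with Pydantic, and the `Author` type here is purely illustrative.

```python
from dataclasses import dataclass

# Stdlib sketch of the ValidatingList idea: a list that coerces/validates
# items on insertion and offers .add(...) for compound entries.
class ValidatingList(list):
    def __init__(self, item_type, items=()):
        super().__init__()
        self._item_type = item_type
        for item in items:
            self.append(item)

    def _coerce(self, item):
        if isinstance(item, self._item_type):
            return item
        if isinstance(item, dict):
            return self._item_type(**item)  # coerce dicts into the item type
        raise TypeError(f"cannot coerce {type(item).__name__}")

    def append(self, item):
        super().append(self._coerce(item))

    def add(self, **kwargs):
        """Build a compound entry from keyword arguments and append it."""
        entry = self._item_type(**kwargs)
        super().append(entry)
        return entry

# Hypothetical compound entry type for demonstration
@dataclass
class Author:
    name: str
    affiliation: str = ""

authors = ValidatingList(Author, [{"name": "Ada"}])
authors.add(name="Grace", affiliation="Navy")
```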
Introduce a set of view abstractions for pyDataverse to provide cached, iterable and indexable access to collections, datasets and files. Files added:

- `baseview.py`: Generic BaseView providing caching, iteration and getitem semantics.
- `contentview.py`: ContentView extending BaseView with async prefetching, identifier resolution and lazy fallback iteration.
- `collectionview.py`: CollectionView for iterating/lookup of child collections with async prefetch support.
- `datasetview.py`: DatasetView for dataset iteration/lookup, supports index access, DOI/version suffix parsing and caching.
- `filesview.py`: FilesView to list/index dataset files, optional tabular-only filtering, metadata prefetching, tree-style repr and a pandas DataFrame view.

Key details: caching of fetched items, concurrency-limited prefetch using asyncio/asyncer, robust identifier resolution (including DOI/version handling), and best-effort background metadata prefetching for file objects to reduce blocking I/O.
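The BaseView contract (cached, iterable, indexable) can be sketched in plain Python; the real views are generic, async-aware and prefetching, so the fetch callables below are stand-ins.

```python
# Illustrative stdlib sketch of the BaseView contract: cached, iterable and
# indexable access over lazily fetched items.
class BaseView:
    def __init__(self, fetch_ids, fetch_item):
        self._fetch_ids = fetch_ids    # returns the list of identifiers
        self._fetch_item = fetch_item  # resolves one identifier to an item
        self._ids = None
        self._cache = {}

    def _ensure_ids(self):
        if self._ids is None:
            self._ids = list(self._fetch_ids())
        return self._ids

    def __len__(self):
        return len(self._ensure_ids())

    def __getitem__(self, key):
        ids = self._ensure_ids()
        ident = ids[key] if isinstance(key, int) else key
        if ident not in self._cache:          # fetch once, then serve cached
            self._cache[ident] = self._fetch_item(ident)
        return self._cache[ident]

    def __iter__(self):
        for ident in self._ensure_ids():
            yield self[ident]

calls = []
view = BaseView(lambda: ["a", "b"], lambda i: (calls.append(i), i.upper())[1])
items = list(view)        # triggers one fetch per identifier
repeat = view["a"]        # served from cache, no second fetch
```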
Introduce a new pyDataverse.api.utilities package providing helper utilities for API workflows: an async crawl_collection to recursively enumerate collections and datasets, conc_get_datasets for batched concurrent dataset fetching with semaphore-based rate control, file_input to normalize various file inputs for uploads, and ApiLogger using Rich for formatted console logging. The package exposes crawl_collection and file_input via __init__.py. These utilities centralize common functionality for crawling, bulk fetching, file handling, and logging to simplify higher-level API code.
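Semaphore-based rate control for batched fetching can be sketched as below; this is a hedged sketch in the spirit of conc_get_datasets, where the real helper issues Dataverse API calls and `fetch` here is any awaitable-returning callable.

```python
import asyncio

# Hedged sketch of semaphore-bounded concurrent fetching.
async def bounded_gather(ids, fetch, limit=5):
    sem = asyncio.Semaphore(limit)

    async def one(ident):
        async with sem:          # at most `limit` requests in flight
            return await fetch(ident)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(one(i) for i in ids))

async def fake_fetch(ident):
    await asyncio.sleep(0)       # stand-in for network latency
    return ident * 2

results = asyncio.run(bounded_gather([1, 2, 3], fake_fetch, limit=2))
```

The semaphore caps concurrency without batching manually, which is why this pattern scales to large collections without overwhelming the server.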
Remove unused ApiAuthorizationError import and replace custom exception with ValueError for type checks in ApiTokenAuth and BearerTokenAuth. This simplifies error handling by raising a built-in exception when non-string tokens are provided.
Introduce a new API response model for the Dataverse API. Adds a Status enum (OK/ERROR) and an APIResponse Pydantic BaseModel with fields: status, data (dict|list with default_factory), and optional message. Includes a convenience classmethod from_out_of_format to construct responses from raw dict/list payloads and map HTTP 200 to Status.OK (otherwise Status.ERROR). Adds type hints and documentation for clarity.
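The described model can be approximated with the stdlib as follows; the real APIResponse is a pydantic.BaseModel, and the exact `from_out_of_format` signature is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional, Union

class Status(str, Enum):
    OK = "OK"
    ERROR = "ERROR"

# Stdlib approximation of the Pydantic APIResponse described above.
@dataclass
class APIResponse:
    status: Status
    data: Union[dict, list] = field(default_factory=dict)
    message: Optional[str] = None

    @classmethod
    def from_out_of_format(cls, payload, status_code):
        # HTTP 200 maps to Status.OK, anything else to Status.ERROR
        status = Status.OK if status_code == 200 else Status.ERROR
        return cls(status=status, data=payload)
```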
Replace the old ad-hoc search parameter handling with a typed QueryOptions pydantic model and modernize SearchApi.

- Add QueryOptions (pydantic) to validate and document search parameters (type, subtree, sort, per_page, start, order, fq alias, show_* flags, geo_radius).
- Introduce computed_field api_base_url to build the search endpoint.
- Change SearchApi.search signature to search(query: str, options: Optional[QueryOptions]) -> SearchResponse and use options.model_dump(by_alias=True, exclude_none=True) to build query params.
- Use get_request with url, params and response_model=SearchResponse.
- Expand docstrings and examples.

Note: This is a breaking API change for callers — previous keyword parameters (data_type, subtree, sort, etc.) were removed and should be passed via QueryOptions.
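The effect of `model_dump(by_alias=True, exclude_none=True)` on query-param assembly can be sketched without Pydantic; the field names below are illustrative, not the actual QueryOptions schema.

```python
# Hedged sketch of the parameter-building behavior QueryOptions provides:
# unset (None) fields are dropped before the request is issued.
def build_search_params(query, **options):
    params = {"q": query}
    params.update({k: v for k, v in options.items() if v is not None})
    return params

# Before (removed): keyword arguments passed straight to search(...)
# After: parameters travel through a validated options object, roughly:
params = build_search_params("covid", per_page=50, subtree=None, sort="date")
```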
Introduce a new SemanticApi class to pyDataverse that adds support for Dataverse semantic/linked-data endpoints. Implements synchronous and batched async retrieval of dataset JSON-LD (get_dataset, get_datasets/_get_datasets) with type-hinted overloads, conversion utilities to rdflib.Graph (response_to_graph, responses_to_graph), and URI normalization (_normalize_schema_org_uris) to canonicalize schema.org HTTP→HTTPS. Uses pydantic computed_field properties and integrates with the existing Api base class.
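The HTTP-to-HTTPS canonicalization can be sketched simply; this is a hedged approximation of what `_normalize_schema_org_uris` does, with the real method operating on JSON-LD responses before graph conversion.

```python
# Hedged sketch of schema.org URI canonicalization (HTTP -> HTTPS).
def normalize_schema_org(text: str) -> str:
    return text.replace("http://schema.org", "https://schema.org")

normalized = normalize_schema_org('{"@context": "http://schema.org"}')
```

Canonicalizing to one scheme matters for RDF merging: `http://schema.org/name` and `https://schema.org/name` would otherwise be distinct predicates in the merged graph.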
Migrate SwordApi to a Pydantic-style implementation: add a SwordApiVersion enum and a sword_api_version Field with a default, enable use_enum_values via model_config, and introduce computed fields for api_base_url and base_url_api_sword. Replace the old __init__/httpx.BasicAuth logic and ApiUrlError checks with a validator for sword_api_version and simplify URL construction (get_service_document now uses _assemble_url). Removes direct httpx import and streamlines version handling and URL assembly.
Large refactor of the Api implementation to add async/sync parity, stronger typing, and robust response handling. Highlights:

- Introduced DataverseBase and AbstractApi; Api now inherits improved base and implements api_base_url.
- Added detailed typing, TypeVar/overloads and return-type unions for get/post/put/delete methods to support both raw httpx.Response and parsed Pydantic models (single & lists).
- Implemented async support with _async_request and ensured synchronous behavior via _sync_request; added from_api helper.
- Added stream_file_context for streaming downloads.
- Centralized response processing in _handle_response using APIResponse, improved error handling (HTTPStatusError), and support for Union types via TypeAdapter.
- Improved header handling, default User-Agent injection, configurable timeouts/max connections, and connection check in model_validator.
- Replaced warnings usage, switched APIVersion.LATEST value to "v1", and added structured logging via ApiLogger.

This change improves type-safety, developer ergonomics, and reliability when interacting with Dataverse endpoints (sync and async).
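The TypeVar/overload pattern for the request methods can be sketched as below; `RawResponse` stands in for httpx.Response and the body is a placeholder, since the real code dispatches through httpx and Pydantic's TypeAdapter.

```python
from dataclasses import dataclass
from typing import Optional, Type, TypeVar, Union, overload

T = TypeVar("T")

class RawResponse:
    """Stand-in for httpx.Response in this sketch."""
    def __init__(self, payload):
        self.payload = payload
    def json(self):
        return self.payload

# Sketch of the overload pattern: get() returns the raw response when no
# response_model is given, and a parsed model otherwise.
@overload
def get(url: str, response_model: None = ...) -> RawResponse: ...
@overload
def get(url: str, response_model: Type[T]) -> T: ...

def get(url: str, response_model: Optional[Type[T]] = None) -> Union[RawResponse, T]:
    raw = RawResponse({"url": url})       # placeholder for the HTTP call
    if response_model is None:
        return raw
    return response_model(**raw.json())   # parse into the requested model

# Hypothetical model for demonstration
@dataclass
class Page:
    url: str
```

With the overloads, type checkers infer `RawResponse` for `get("/x")` and `Page` for `get("/x", Page)` without casts at call sites.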
Major refactor of pyDataverse/api/data_access.py: add typing hints, pydantic computed field for API base URL, and httpx-based request handling. Consolidates datafile download logic into an overloaded _get_datafile_core, exposes get_datafile_download_url for redirect URLs, and adds contextmanager-based streaming helpers (stream_datafile, stream_datafiles, stream_datafiles_bundle) for efficient large-file/zip downloads. Improves PID handling and parameter assembly across endpoints, and updates request/permission APIs (request_access, allow_access_request, grant_file_access, list_file_access_requests) to return typed message/AccessRequest models and accept params consistently.
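The contextmanager-based streaming helpers share a shape that can be sketched generically; `open_stream` below is a stand-in for the httpx streaming response the real `stream_datafile` wraps.

```python
import io
from contextlib import contextmanager

# Illustrative shape of a stream_datafile-style helper: the context manager
# guarantees the underlying stream is closed when the block exits.
@contextmanager
def stream_datafile(open_stream):
    stream = open_stream()
    try:
        yield stream
    finally:
        stream.close()

buf = io.BytesIO(b"chunk1chunk2")
with stream_datafile(lambda: buf) as s:
    data = s.read()
# buf is now closed even if reading had raised
```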
Refactors pyDataverse MetricsApi: adds typing, pydantic computed field for base URL, and returns typed MetricsResponse objects. Methods now accept typed parameters (including date objects), support parentAlias query scoping, and use the internal _assemble_url and use_async flags. CSV endpoints are parsed into pandas.DataFrame, dataverse-by-* methods are renamed to get_collections_by_* with deprecated wrappers for the old names, and dataset/location endpoints return structured responses. Misc: improved imports, deprecation annotations, and small HTTPX handling for CSV payloads.
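The deprecated-wrapper pattern for the renamed methods can be sketched as follows; the helper and method names here are illustrative, not the committed code.

```python
import functools
import warnings

# Hedged sketch of the deprecated-wrapper pattern used when renaming
# dataverse-by-* methods to get_collections_by_*.
def deprecated_alias(new_func, old_name):
    @functools.wraps(new_func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{old_name} is deprecated; use {new_func.__name__} instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return new_func(*args, **kwargs)
    return wrapper

def get_collections_by_subject(subject):
    # Placeholder for the real API call
    return {"subject": subject}

# Old name kept working, but emits a DeprecationWarning on use
dataverses_by_subject = deprecated_alias(get_collections_by_subject, "dataverses_by_subject")
```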
Introduce pyDataverse.api.hub with DataverseHub client and HUB_BASE_URL. Adds a cached_property installations that calls the Dataverse Hub API (/api/installations/status) via httpx, validates the JSON with InstallationStatusResponse.model_validate, and returns a List[InstallationStatus]. Includes docstrings describing behavior and possible exceptions (httpx errors, validation errors).
Major rewrite of pyDataverse.api.native.NativeApi: reorganized imports, added extensive type hints and Pydantic response/request models, and introduced async support and concurrent dataset fetching. Replaced many legacy endpoints with typed methods (get_collection, create_collection, publish_collection, delete_collection, get_dataset, get_dataset_versions, get_dataset_export, get_datasets_export, create_dataset, edit_dataset_metadata, private URL helpers, download/stream dataset bundles, etc.), added context manager for streaming, and improved URL/param assembly. Deprecated several old dataverse helpers in favor of new collection-based methods and added logging for key operations. Overall cleans up error handling paths and moves toward a model-driven, async-capable API surface.
Clean up pyDataverse.api exports: explicitly import and expose DataAccessApi, MetricsApi, NativeApi, SearchApi, SwordApi, SemanticApi and DataverseHub via an __all__ list. Removes the old Api import and noqa comments to make the package's public API explicit and add new SemanticApi and DataverseHub exports.
Import warnings and nest_asyncio in package init; suppress pkg_resources DeprecationWarning (from a third-party dependency) and call nest_asyncio.apply() to enable nested asyncio event loops (useful in REPLs/notebooks). Also import QueryOptions and core types (Collection, Dataset, Dataverse, File) and expose them via __all__ so they are available from the package namespace.
Convert pyproject.toml from Poetry-specific [tool.poetry] layout to PEP 621 [project] metadata, bump package version to 0.4.0 and add readme, license, and requires-python. Replace Poetry dependency groups with PEP-style dependencies and optional-dependencies (including updated dependency pins), add project.urls.repository, tighten build-system poetry-core requirement, and reintroduce packages include for pyDataverse. Also adjust pytest asyncio_mode and coverage source to point at the library and remove legacy Poetry group/radon settings.
Rework README.md to modernize badges and project summary, state Python 3.10+ support, and provide clear installation instructions for pip, uv, and poetry. Add a concise Quickstart with high-level API and DOI examples (dataset creation, metadata updates, file I/O, publish), and document running local tests against a Dataverse container with run-tests.sh usage. Clean up links to documentation and community chat (ReadTheDocs, Zulip) and streamline setup/test dependency install commands.
Replace docker-compose based test orchestration with a simpler build-and-run flow. Switch to /usr/bin/env bash and enable strict flags (set -euo pipefail). Read PYTHON_VERSION from environment (default 3.11), require BASE_URL and API_TOKEN (with API_TOKEN_SUPERUSER fallback), normalize container host URLs, build a pydataverse-tests image with PYTHON_VERSION as a build-arg, and run it passing necessary env vars and host.docker.internal mapping. Remove argument parsing, compose files, container polling/logging, and make the script executable.
Add a generated uv.lock file that pins Python package versions, sources, hashes and wheels for reproducible installs (requires-python >=3.10,<4.0). This lockfile locks transitive dependencies and resolution markers to ensure consistent environments across installs.
Replace the ad-hoc globals/sys.modules lookup of CompoundField with a direct call to `field_info.dtype.from_dataverse_dict` when `field.value` is a dict. This simplifies the code, makes the conversion use the field's declared dtype (more robust and accurate), and removes the fallback import logic.
Replace actions/setup-python + Poetry steps with astral-sh/setup-uv and uv commands across workflows; update dependency/install/test commands to use uv (uv sync, uvx ruff, uv run pytest, uv build/publish). Bump actions/checkout to v5 in all workflows and extend build Python matrix to 3.10–3.13. Overall simplifies CI by centralizing tooling on UV and modernizing checkout/versioned actions.
Add setuptools>=82.0.0 to the 'tests' extra in pyproject.toml and update the lockfile accordingly. This is necessary because the "fs" package requires `pkg_resources`, which ships with setuptools.
Remove setuptools from the tests extra in pyproject.toml because it is no longer needed; update the lockfile to reflect the dependency removal.
Replace `uv sync --all-extras` with `uv pip install --system ".[tests,mcp]"` to install the project's test extras, and run tests with `pytest -vv` instead of `uv run pytest` to produce more verbose test output. These changes ensure CI installs the required extras and yields clearer test logs.
Introduce actions/setup-python@v5 using matrix.python-version to ensure the correct Python interpreter is available for subsequent steps. Remove the python-version input from the astral-sh/setup-uv step since Python is now configured separately. This clarifies environment setup and avoids duplicating interpreter configuration.
Add a step to install setuptools and wheel in the tests workflow's Install Python Dependencies job. This ensures build tools are available before installing the package extras (`uv pip install --system ".[tests,mcp]"`), reducing potential build/install failures during CI.
Update GitHub Actions workflow (.github/workflows/tests.yml): remove Python 3.13 from the test matrix so CI runs on 3.10–3.12, and remove a redundant pip install line for setuptools/wheel. Tests now rely on a single pip install for the package plus test extras.
Stop subclassing fs.base.FS and remove the fs import/requirement; DataverseFS is now a plain class. Reorder and clean up imports (add pandas, cachetools.TTLCache, typing_extensions.Self), remove deprecation-warning suppression and unnecessary noqa comments, and adjust placement of pyDataverse.models.file.update import. These changes simplify dependencies and tidy module imports.
Delete the "fs" package from the dependencies list in pyproject.toml. Verify that no code relies on the removed package or add an alternative dependency if required.
Update .github/workflows/tests.yml to include Python 3.13 in the GitHub Actions matrix so the test suite runs against the latest Python release and verifies compatibility with 3.13.
Update: clarify the Annotated base_url parameter docstrings across mcp modules (collection, dataset, dataverse, file, search) to state that the MCP server's connected dataverse is used by default. Add DataverseMCP.base_url and base_url_instructions properties and include the base_url_instructions text in multiple tool descriptions so tools explicitly inform users which dataverse the MCP is connected to and how to override it. Also tighten dependency version bounds in pyproject.toml for several optional mcp-related packages.
Correct malformed version specifiers in pyproject.toml by adding missing commas (e.g., sniffio) and normalizing constraints for mcp-related extras. Tighten upper bounds for several packages (fastmcp, mcp, nbconvert, nbformat, pillow, pymupdf, toon-format) to avoid accidentally allowing next major releases and ensure constraints parse as intended.
Update test dependency range for pytest-asyncio from >=0.23.7,<0.24.0 to >=1.3.0,<2.0.0 in pyproject.toml and refresh the lockfile. The lockfile adds backports-asyncio-runner (for Python <3.11) and typing-extensions marker (for Python <3.13), and updates pytest-asyncio sdist/wheel entries and hashes to 1.3.0.
This PR is a broad refactor of `pyDataverse` that moves the project from a mostly flat, monolithic layout to a typed, package-oriented structure with clearer public APIs.

The main theme is a cleanup and expansion of the direct low-level API surface. The new high-level Dataverse objects are an addition on top of that layer, not a replacement for it.
For reviewers who want to inspect the new documentation experience directly, a preview is available at:
At a high level, it:
Detailed Changes
1. Refactor the low-level API surface into typed modules
The old flat `pyDataverse/api.py` module has been split into a dedicated `pyDataverse/api/` package. This includes:

- `Api` base with typed request handling, `httpx` support, async support, auth handling, and response parsing
- `APIResponse` model and common response/status abstractions
- `NativeApi`, `DataAccessApi`, `SearchApi`, `MetricsApi`, and `SwordApi`
- `SemanticApi` for JSON-LD/RDF-oriented access
- `DataverseHub` client for Hub API interactions

Notable behavior and capability changes in this layer include:
- `QueryOptions`

2. Add a new high-level Dataverse object model
This PR introduces a new `pyDataverse/dataverse/` package that provides higher-level workflow objects on top of the low-level APIs.

New public objects include:

- `Dataverse`
- `Collection`
- `Dataset`
- `File`
- `Metrics`

This layer adds:
The root package export surface is also updated so users can import the high-level workflow objects directly from `pyDataverse`, while the direct low-level API clients remain available through `pyDataverse.api`.

3. Add dynamic metadata block support and generated model packages
The previous monolithic `pyDataverse/models.py` module is replaced by a structured `pyDataverse/models/` package.

This work adds:

- `models/` directory
- `models/generate.sh` script for regenerating typed model code from the schemas
- `pyDataverse/dataverse/connect/` for building typed metadata models from Dataverse metadata block specifications

This makes the codebase easier to extend and gives the API layer a clearer typed contract.
4. Add filesystem-style dataset access
This PR adds a new `pyDataverse/filesystem/` package with a filesystem-oriented interface for working with dataset files.

Key additions include:

- `DataverseFS` for browsing and interacting with a dataset like a filesystem

This gives users a more natural way to read and write files without manually managing raw API calls.
5. Add MCP support and Dataverse MCP tools
This PR adds a new `pyDataverse/mcp/` package and a configurable `DataverseMCP` server integration built around FastMCP.

Included MCP capabilities cover:

This creates a native path for agent/tool interoperability on top of `pyDataverse`.

6. Replace the documentation stack
The old in-package Sphinx/ReadTheDocs documentation has been removed and replaced with a dedicated `docs/` site based on Astro + Starlight.

The new docs include dedicated sections for:
The README is also refreshed to better reflect the current API shape, installation options, and test workflow.
7. Modernize packaging, dependency management, and CI
The project metadata and development workflow have been updated substantially.
Notable changes include:
- PEP 621 `pyproject.toml`
- `tests` and `mcp` extras
- `uv.lock` lockfile
- removal of `requirements.txt`, `tox.ini`, `.travis.yml`, and `.readthedocs.yml`
- `Dockerfile` for running tests
- `run-tests.sh` to build and run the test container locally
- CI on `uv`, newer checkout/setup actions, and an expanded Python matrix

8. Reorganize and expand the test suite
The test layout now follows the refactored package structure more closely.
Additions include dedicated tests for:
- `api/data_access`
- `api/metrics`
- `api/native`
- `api/search`
- `api/semantic`
- `api/sword`
- `api/hub`
- `dataverse`
- `mcp`

Legacy tests tied to the old flat API/model structure and removed helper assets have been cleaned up accordingly.
Breaking Changes / Migration Notes
This is a large architectural change and likely requires downstream updates.
Important migration points:
- `pyDataverse/api.py` and `pyDataverse/models.py` are removed in favor of `pyDataverse/api/` and `pyDataverse/models/`
- `Dataverse`, `Collection`, `Dataset`, and `File` workflow objects
- supported Python is declared in `pyproject.toml` as Python `>=3.10,<4.0`

TLDR
Together, these changes expand `pyDataverse` from its earlier flat layout into a broader typed toolkit for Dataverse.