
Basic CLI #1

Open
gaurav wants to merge 68 commits into main from basic-implementation-in-uv

Conversation

gaurav (Collaborator) commented Dec 3, 2025

Introduces babel-explorer, a CLI tool to query Babel intermediate files (Parquet) via DuckDB and NodeNorm. BabelDownloader handles caching/freshness, BabelXRefs handles querying, NodeNorm handles label enrichment, and cli.py wires them together with Click.

gaurav and others added 19 commits December 2, 2025 15:38
- Add IdentifierRecord dataclass to babel_xrefs.py (resolves TODO)
- Add 89 tests across 3 files: test_downloader (26), test_babel_xrefs (31), test_nodenorm (23)
- Unit tests (71) use mocks and run without network; integration tests (18) use real downloads/APIs
- Add session-scoped fixtures in conftest.py for shared Parquet file downloads
- Parametrize integration tests over tests/data/valid_curies.txt for easy expansion
- Add integration and slow pytest markers to pyproject.toml
- Update CLAUDE.md and README.md with testing documentation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

This pull request implements a basic version of babel-explorer in Python using the uv package manager. It's a tool for querying Babel intermediate files to understand why biological/chemical identifiers are considered equivalent. The implementation includes a downloader for large Parquet files with MD5 validation and resume support, NodeNorm API integration for label enrichment, DuckDB-based cross-reference querying, and a Click-based CLI.

Changes:

  • Initial project structure with uv-based package management (pyproject.toml, Python 3.11+)
  • Core functionality: BabelDownloader with streaming downloads and MD5 validation, NodeNorm API client with LRU caching, BabelXRefs for DuckDB-based Parquet queries
  • CLI with three commands: xrefs, ids, and test-concord
  • Comprehensive test suite with 80 tests split between unit tests (mocked) and integration tests (real network calls)

Reviewed changes

Copilot reviewed 15 out of 19 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
pyproject.toml Project configuration with dependencies (click, duckdb, requests, tqdm) and pytest markers
.python-version Specifies Python 3.11 requirement
.gitignore Excludes /data directory for downloaded files
README.md User documentation with setup, usage examples, and testing instructions
CLAUDE.md AI assistant guidance documentation (contains outdated wget reference)
src/babel_explorer/cli.py Click-based CLI with xrefs, ids, and test-concord commands
src/babel_explorer/core/downloader.py Streaming file downloader with MD5 validation and resume capability
src/babel_explorer/core/nodenorm.py NodeNorm API client for identifier normalization
src/babel_explorer/core/babel_xrefs.py DuckDB-based cross-reference query engine (has frozen dataclass bug)
tests/conftest.py Session-scoped pytest fixtures for shared test resources
tests/constants.py Shared test constants and CURIE loader utility
tests/data/valid_curies.txt Parametrized test data (one CURIE)
tests/test_downloader.py 26 tests for BabelDownloader (22 unit, 3 integration, 1 slow)
tests/test_nodenorm.py 23 tests for NodeNorm (18 unit, 5 integration)
tests/test_babel_xrefs.py 31 tests for BabelXRefs (22 unit, 8 integration, 1 slow)


Comment thread src/babel_explorer/core/nodenorm.py Outdated
Comment thread src/babel_explorer/core/downloader.py Outdated
Comment thread src/babel_explorer/cli.py Outdated
Comment thread pyproject.toml Outdated
Comment thread src/babel_explorer/core/downloader.py Outdated
Comment thread src/babel_explorer/core/babel_xrefs.py Outdated
Comment thread src/babel_explorer/core/babel_xrefs.py Outdated
Comment thread src/babel_explorer/core/nodenorm.py Outdated
Comment thread CLAUDE.md Outdated
gaurav and others added 6 commits March 2, 2026 17:35
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Remove _calculate_md5/_fetch_remote_md5 (too slow on 2.5-3.9 GB files)
- Add sidecar .meta JSON files (ETag, Last-Modified, Content-Length, last_checked)
- Three-tier logic: freshness window → HEAD/ETag check → full re-download
- Add freshness_seconds param to BabelDownloader (default 3h)
- Add --check-download CLI option to xrefs and ids commands (e.g. 3h, never)
- Update tests: replace MD5 test classes with meta/ETag/tier coverage

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
gaurav and others added 3 commits March 16, 2026 16:36
After production code was updated to use `with duckdb.connect()` and
`with requests.get()`, the test mocks (plain Mock()) no longer supported
the context manager protocol. Updated affected mocks to MagicMock() with
__enter__.return_value = self.

Also extracted _make_response() helper in TestDownloadWithRetry to
eliminate five near-identical 4-line mock setup blocks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
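The mock failure is easy to reproduce in isolation: Python looks up __enter__ on the type, not the instance, so a plain Mock cannot be used in a with statement, while MagicMock preconfigures the magic methods. A minimal sketch of the fix described above:

```python
from unittest.mock import MagicMock, Mock

# A plain Mock() fails as a context manager: the with statement looks up
# __enter__ on the type, which Mock does not define.
plain = Mock()
try:
    with plain:
        pass
    plain_supported = True
except (TypeError, AttributeError):  # TypeError on 3.11+, AttributeError earlier
    plain_supported = False

# MagicMock supports the protocol; pointing __enter__ back at the mock itself
# mirrors `with duckdb.connect() as conn:` in the updated tests.
conn = MagicMock()
conn.__enter__.return_value = conn
with conn as inner:
    assert inner is conn
```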
normalize_curie() was calling requests.get() with no timeout, risking
an indefinite hang if the NodeNorm service stalls. The downloader had
timeout=30 hardcoded in two places with no way to override it.

Add timeout: int = 30 to both constructors and thread self.timeout
through all three request call sites, making the default consistent
and the value overridable without patching.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The docstring claimed the method updates last_checked in the .meta file,
but it is a pure predicate — the caller (get_downloaded_file) owns that
write. Updated the docstring and removed stale inline comments that said
the same thing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@gaurav gaurav marked this pull request as ready for review March 16, 2026 20:45
gaurav and others added 7 commits March 16, 2026 16:51
Add .github/workflows/lint.yml to run ruff check and ruff format --check
on every pull request.

Fix the 7 errors ruff found, and apply ruff format to all files:
- Remove unused `import warnings` in babel_xrefs.py (left over after
  ignore_curies_in_expansion was removed)
- Add # noqa: F841 to the three read_parquet() assignments: ruff flags
  them as unused, but DuckDB resolves SQL table names by matching the
  Python variable name, so the assignments are load-bearing
- Remove spurious f-prefix from two string literals in downloader.py
- Drop unused `local_path` variable in test_downloader.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaced .github/workflows/lint.yml with ci.yml. The new file keeps the
existing ruff lint/format job and adds a parallel test job that runs
`pytest -v -m "not integration"` (unit tests only — no network required).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Covers all 5 branches: empty input, 'never', unit suffix, bare integer,
and invalid values that raise click.BadParameter.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
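The five branches being tested can be sketched as follows. This is an illustrative reconstruction, not the project's exact code: ValueError stands in for click.BadParameter so the example has no dependencies, and mapping 'never' to None is an assumed convention.

```python
UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_duration(value: str):
    """Illustrative duration parser covering the five tested branches."""
    lower = value.strip().lower()
    if not lower:
        raise ValueError("empty duration")        # branch 1: empty input
    if lower == "never":
        return None                               # branch 2: 'never' (assumed to mean: skip checks)
    if lower[-1] in UNITS:                        # branch 3: unit suffix, e.g. '3h', '30m'
        return int(lower[:-1]) * UNITS[lower[-1]]
    return int(lower)                             # branch 4: bare integer seconds
    # branch 5: anything else falls through int() and raises ValueError
```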
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 17 out of 22 changed files in this pull request and generated 5 comments.



Comment on lines +302 to +305
if os.path.exists(local_path_to_download_to):
    meta = self._load_meta(local_path_to_download_to)
    if meta is not None:
        # Tier 1: within freshness window — skip all network calls

Copilot AI Mar 30, 2026


When the file exists but the sidecar .meta is missing or corrupt, the code falls through to _download_with_retry() without deleting the existing file. _download_with_retry() will then attempt to resume (send a Range request / append), which can silently corrupt the local file if the remote content differs. Treat missing/invalid metadata as an unknown state and force a full re-download (e.g., delete the existing file before downloading, or add a flag to disable resume when no metadata is present).

Comment thread src/babel_explorer/core/nodenorm.py Outdated
Comment on lines +7 to +13
@dataclasses.dataclass
class Identifier:
    curie: str
    label: str = ""
    biolink_type: str = ""
    taxa: list[str] = dataclasses.field(default_factory=list)
    description: list[str] = dataclasses.field(default_factory=list)

Copilot AI Mar 30, 2026


Identifier.biolink_type is annotated as str, but NodeNorm responses typically provide type as a list of Biolink classes (and from_dict() assigns d["type"] directly). This creates an inconsistent runtime type and can break callers that treat biolink_type as a string. Consider making the field consistently list[str] (or tuple[str, ...]) and updating call sites/printing accordingly, or normalize to a single string (e.g., first element) if that’s the intended semantics.

Comment thread src/babel_explorer/core/nodenorm.py Outdated
if "description" in d:
    identifier.description = d["description"]
if "type" in d:
    identifier.biolink_type = d["type"]

Copilot AI Mar 30, 2026


Identifier.from_dict() assigns identifier.biolink_type = d["type"], which is often a list in NodeNorm responses. If you keep biolink_type as a string field, this should normalize the value (e.g., join, pick a primary type, or store separately as biolink_types). If you switch biolink_type to a collection type, adjust the dataclass annotation/default and any formatting/printing paths (e.g., CLI output) to avoid printing raw Python lists.

Suggested change
-identifier.biolink_type = d["type"]
+type_value = d["type"]
+if isinstance(type_value, list):
+    identifier.biolink_type = ", ".join(str(t) for t in type_value)
+elif type_value is not None:
+    identifier.biolink_type = str(type_value)

Comment thread src/babel_explorer/cli.py
Comment on lines +21 to +34
# Value with unit suffix (e.g. '3h', '30m')
if lower[-1] in units:
    try:
        amount = int(lower[:-1])
    except ValueError:
        raise click.BadParameter(
            f"Invalid duration {value!r}: expected an integer followed by an optional unit "
            "('s', 'm', 'h', or 'd'), or 'never'."
        )
    return amount * units[lower[-1]]
# Bare integer seconds
try:
    return int(lower)
except ValueError:

Copilot AI Mar 30, 2026


parse_duration() accepts negative values (e.g., "-1h" or "-30"), which then feed into freshness_seconds. A negative freshness makes the downloader treat everything as stale and can cause unexpected HEAD/GET behavior. Consider rejecting negative durations with a clear click.BadParameter message (or explicitly documenting that negatives are allowed and what they mean).

Comment thread src/babel_explorer/cli.py Outdated
gaurav and others added 5 commits March 30, 2026 16:14
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…n .gitignore

Both backported from add-nodenorm-frontend:
- babel_xrefs.py: avoid calling get_identifier() twice per CURIE in _to_labeled_xref
- .gitignore: anchor lib/ and lib64/ to repo root so nested lib dirs aren't ignored

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- nodenorm.py: Identifier.biolink_type str→list[str] to match NodeNorm API
- nodenorm.py: get_clique_identifiers returns [] instead of None; add return type annotation
- nodenorm.py: log debug message when get_identifier finds no exact match
- cli.py: parse_duration return type int|float; join biolink_type list for display
- tests: update assertions for new biolink_type type; add test-concord edge cases
  (unknown CURIE producing no output, multiple CURIEs queried independently)
- ci.yml: add workflow_dispatch trigger and integration-test job

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add push trigger for master branch (fires when PRs are merged)
- Add schedule trigger: Tuesdays at 17:00 UTC (12pm EST / 1pm EDT)
- Change integration-test job condition to run on all non-PR events

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Source files:
- nodenorm.py: module, Identifier class/from_dict, NodeNorm class/__init__/
  get_identifier/normalize_curie/get_clique_identifiers
- babel_xrefs.py: convert # comment to module docstring; CrossReference class/
  from_tuple/curies property; LabeledCrossReference class; IdentifierRecord.__str__;
  BabelXRefs class/__init__/get_curie_xref
- downloader.py: module, BabelDownloader.__init__, get_output_file
- cli.py: cli() group, test_concord() command

Test files (class docstrings only):
- test_babel_xrefs.py: TestCrossReference, TestLabeledCrossReference,
  TestIdentifierRecord, TestBabelXRefsInit
- test_nodenorm.py: TestIdentifier, TestNodeNormInit, TestNormalizeCurieMocked,
  TestGetIdentifierMocked, TestGetCliqueIdentifiersMocked

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 17 out of 22 changed files in this pull request and generated 2 comments.



Comment on lines +127 to +135
return LabeledCrossReference(
    subj=xref.subj,
    obj=xref.obj,
    filename=xref.filename,
    pred=xref.pred,
    subj_label=subj_ident.label,
    subj_biolink_type=subj_ident.biolink_type,
    obj_label=obj_ident.label,
    obj_biolink_type=obj_ident.biolink_type,

Copilot AI Apr 1, 2026


LabeledCrossReference declares subj_biolink_type/obj_biolink_type as str, but _to_labeled_xref() passes Identifier.biolink_type, which is a list[str] from NodeNorm. This makes the runtime type inconsistent with the dataclass contract and with unit tests that construct LabeledCrossReference using strings. Consider either changing these fields to list[str] (and updating the name/docs accordingly) or serializing the list (e.g., join with ', ' or select a primary type) before constructing the labeled xref.

Suggested change
-return LabeledCrossReference(
-    subj=xref.subj,
-    obj=xref.obj,
-    filename=xref.filename,
-    pred=xref.pred,
-    subj_label=subj_ident.label,
-    subj_biolink_type=subj_ident.biolink_type,
-    obj_label=obj_ident.label,
-    obj_biolink_type=obj_ident.biolink_type,
+# NodeNorm.Identifier.biolink_type is a list[str]; LabeledCrossReference expects str.
+# Normalize by selecting the primary (first) type when a list is provided.
+subj_biolink_type = (
+    subj_ident.biolink_type[0]
+    if isinstance(subj_ident.biolink_type, list) and subj_ident.biolink_type
+    else subj_ident.biolink_type
+)
+obj_biolink_type = (
+    obj_ident.biolink_type[0]
+    if isinstance(obj_ident.biolink_type, list) and obj_ident.biolink_type
+    else obj_ident.biolink_type
+)
+return LabeledCrossReference(
+    subj=xref.subj,
+    obj=xref.obj,
+    filename=xref.filename,
+    pred=xref.pred,
+    subj_label=subj_ident.label,
+    subj_biolink_type=subj_biolink_type,
+    obj_label=obj_ident.label,
+    obj_biolink_type=obj_biolink_type,

Comment on lines +280 to +289
@functools.lru_cache(maxsize=None)
def get_downloaded_file(self, dirpath: str, chunk_size: int = 1024 * 1024):
    """
    Download a file from the Babel server to local storage with ETag-based caching.

    Three-tier freshness logic:
    1. If .meta exists and last_checked is within freshness window → return immediately
    2. If .meta exists but stale → HEAD request to compare ETag; return if unchanged
    3. If ETag changed or no .meta → full re-download

Copilot AI Apr 1, 2026


get_downloaded_file() implements time/ETag-based freshness checks, but it is wrapped in @lru_cache. After the first call for a given (dirpath, chunk_size), subsequent calls return from the cache and will never re-evaluate the freshness window or perform the intended HEAD/GET logic in the same process. If the goal is to honor freshness_seconds on each call, consider removing the lru_cache here (or caching only path computation separately) so the three-tier logic can run as documented.

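The interaction the reviewer describes is easy to reproduce in miniature: once @lru_cache memoizes a call, the function body (and any freshness logic inside it) never runs again for the same arguments in that process. The Downloader class below is a toy stand-in, not the project's code:

```python
import functools


class Downloader:
    def __init__(self) -> None:
        self.freshness_checks = 0

    @functools.lru_cache(maxsize=None)  # memoizes on (self, dirpath, chunk_size)
    def get_downloaded_file(self, dirpath: str, chunk_size: int = 1024 * 1024) -> str:
        self.freshness_checks += 1  # stands in for the three-tier freshness logic
        return f"{dirpath}/file.parquet"


d = Downloader()
d.get_downloaded_file("data")
d.get_downloaded_file("data")  # served from the cache: the body never runs again
```

After the second call, d.freshness_checks is still 1, which is exactly why the cache defeats the documented freshness window.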
gaurav and others added 8 commits April 1, 2026 01:40
…map to listcomp

- babel_xrefs.py: subj_biolink_type/obj_biolink_type str→list[str] to match
  Identifier.biolink_type after the nodenorm.py type change
- nodenorm.py: replace map(lambda) with list comprehension in get_clique_identifiers
- tests: update LabeledCrossReference construction to use list values

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…D, type fixes

- nodenorm.py: Identifier is now frozen=True; rewrite from_dict as one-shot
  constructor to avoid post-construction mutation of lru_cache'd objects
- nodenorm.py: remove **kwargs from get_clique_identifiers — unhashable and unused,
  would raise TypeError if any kwarg was ever passed
- downloader.py: download to .tmp then os.replace() so the final file is never
  partially written; clean up .tmp on failure
- downloader.py: _etag_matches returns True (fail open) on HEAD network error
  instead of False, avoiding spurious 2GB re-downloads on transient failures
- cli.py: add nodenorm_url: str annotation in xrefs and test_concord; move
  test_concord inline comment to docstring
- tests: update test_returns_false_on_request_error → test_returns_true_on_request_error
- FUTURE.md: track CLI option deduplication refactor

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
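The .tmp-then-os.replace pattern from this commit, shown in isolation (atomic_write is an illustrative name, not the project's actual function):

```python
import os


def atomic_write(path: str, chunks) -> None:
    """Write chunks to `path` via a temporary file so readers never observe
    a partially written file; os.replace() is atomic on POSIX and Windows."""
    tmp = path + ".tmp"
    try:
        with open(tmp, "wb") as f:
            for chunk in chunks:
                f.write(chunk)
        os.replace(tmp, path)  # only now does `path` come into existence
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)  # clean up the partial .tmp on any failure
        raise
```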
- Fix --expand → --recurse (the actual flag name) in Data Flow and Key Design Patterns
- BabelXRefs: remove false claim about writing DuckDB databases to disk;
  all connections are in-memory (duckdb.connect() with no path)
- Remove 'Generated DuckDB databases' entry from File Locations (nothing on disk)
- Update test count table: numbers were stale and test_cli.py was missing entirely
- Add Identifier to Key Dataclasses (now frozen=True as of recent fix)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…/babel-explorer into basic-implementation-in-uv
- list→tuple on Identifier and LabeledCrossReference fields so frozen
  dataclasses are hashable (was a TypeError crash in get_curie_xrefs)
- NodeNorm(''): add early return in normalize_curie so empty URL truly
  skips all network calls as documented
- BabelDownloader: auto-append trailing slash to url_base so urljoin
  can't silently drop path segments
- CI: fix push trigger branch master → main
- Remove dead get_downloaded_dir method (lru_cache + NotImplementedError)
- parse_duration: reject negative values with a clear BadParameter error

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
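The list→tuple fix matters because a frozen dataclass derives __hash__ from its fields, so any list-typed field makes instances unhashable; this is the TypeError crash mentioned above. A minimal demonstration with toy classes:

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class WithList:
    types: list[str] = dataclasses.field(default_factory=list)


@dataclasses.dataclass(frozen=True)
class WithTuple:
    types: tuple[str, ...] = ()


try:
    hash(WithList(["biolink:Gene"]))  # list field -> unhashable instance
    list_hashable = True
except TypeError:
    list_hashable = False

hash(WithTuple(("biolink:Gene",)))  # tuple field -> hashable, so lru_cache works
```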
Introduces a central formatting.py module (write_records + _record_to_dict)
that serialises any dataclass to text, JSON, TSV, or CSV without touching
domain objects. A format_option decorator wires --format and --json-indent
onto xrefs, ids, and test-concord. test-concord injects a query_curie column
for non-text formats. 30 new unit tests in test_formatting.py; 7 CLI format
tests added to test_cli.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
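A stdlib-only sketch of what such a central serializer might look like. The Rec dataclass and this write_records signature are illustrative, not the project's actual formatting.py:

```python
import csv
import dataclasses
import io
import json


@dataclasses.dataclass
class Rec:  # stand-in for any domain dataclass
    curie: str
    label: str


def write_records(records, fmt: str = "json", json_indent: int = 2) -> str:
    """Serialize dataclass instances to JSON, TSV, or CSV without the
    domain objects knowing anything about output formats."""
    rows = [dataclasses.asdict(r) for r in records]
    if fmt == "json":
        return json.dumps(rows, indent=json_indent)
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=list(rows[0]), delimiter="\t" if fmt == "tsv" else ","
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Because dataclasses.asdict works on any dataclass, one function handles CrossReference, IdentifierRecord, and future record types alike.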
Replaces the 'text' default format with 'console', backed by the rich
library. xrefs and test-concord highlight query CURIEs in bold cyan
wherever they appear as subject or object; rich auto-strips markup when
output is piped. ids uses console.print(str(record)) for TTY-aware plain
output. formatting.py gains make_console() and hl_curie() utilities for
new commands to reuse. LabeledCrossReference labels appear in parentheses
next to CURIEs in console output.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
tuple() on a bare string iterates its characters, so biolink_type,
taxa, and description would become ('b','i','o',...) when NodeNorm
returns them as strings rather than lists. _to_tuple() now wraps
a bare string in a 1-tuple. Four new unit tests cover the string
case for each field.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>