
Basic CLI #1

Open
gaurav wants to merge 68 commits into main from basic-implementation-in-uv

Conversation

gaurav (Collaborator) commented Dec 3, 2025

Introduces babel-explorer, a CLI tool to query Babel intermediate files (Parquet) via DuckDB and NodeNorm. BabelDownloader handles caching/freshness, BabelXRefs handles querying, NodeNorm handles label enrichment, and cli.py wires them together with Click.

gaurav and others added 19 commits December 2, 2025 15:38
- Add IdentifierRecord dataclass to babel_xrefs.py (resolves TODO)
- Add 89 tests across 3 files: test_downloader (26), test_babel_xrefs (31), test_nodenorm (23)
- Unit tests (71) use mocks and run without network; integration tests (18) use real downloads/APIs
- Add session-scoped fixtures in conftest.py for shared Parquet file downloads
- Parametrize integration tests over tests/data/valid_curies.txt for easy expansion
- Add integration and slow pytest markers to pyproject.toml
- Update CLAUDE.md and README.md with testing documentation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

This pull request implements a basic version of babel-explorer in Python using the uv package manager. It's a tool for querying Babel intermediate files to understand why biological/chemical identifiers are considered equivalent. The implementation includes a downloader for large Parquet files with MD5 validation and resume support, NodeNorm API integration for label enrichment, DuckDB-based cross-reference querying, and a Click-based CLI.

Changes:

  • Initial project structure with uv-based package management (pyproject.toml, Python 3.11+)
  • Core functionality: BabelDownloader with streaming downloads and MD5 validation, NodeNorm API client with LRU caching, BabelXRefs for DuckDB-based Parquet queries
  • CLI with three commands: xrefs, ids, and test-concord
  • Comprehensive test suite with 80 tests split between unit tests (mocked) and integration tests (real network calls)

Reviewed changes

Copilot reviewed 15 out of 19 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
pyproject.toml Project configuration with dependencies (click, duckdb, requests, tqdm) and pytest markers
.python-version Specifies Python 3.11 requirement
.gitignore Excludes /data directory for downloaded files
README.md User documentation with setup, usage examples, and testing instructions
CLAUDE.md AI assistant guidance documentation (contains outdated wget reference)
src/babel_explorer/cli.py Click-based CLI with xrefs, ids, and test-concord commands
src/babel_explorer/core/downloader.py Streaming file downloader with MD5 validation and resume capability
src/babel_explorer/core/nodenorm.py NodeNorm API client for identifier normalization
src/babel_explorer/core/babel_xrefs.py DuckDB-based cross-reference query engine (has frozen dataclass bug)
tests/conftest.py Session-scoped pytest fixtures for shared test resources
tests/constants.py Shared test constants and CURIE loader utility
tests/data/valid_curies.txt Parametrized test data (one CURIE)
tests/test_downloader.py 26 tests for BabelDownloader (22 unit, 3 integration, 1 slow)
tests/test_nodenorm.py 23 tests for NodeNorm (18 unit, 5 integration)
tests/test_babel_xrefs.py 31 tests for BabelXRefs (22 unit, 8 integration, 1 slow)


Comment thread src/babel_explorer/core/nodenorm.py Outdated
Comment thread src/babel_explorer/core/downloader.py Outdated
Comment thread src/babel_explorer/cli.py Outdated
Comment thread pyproject.toml Outdated
Comment thread src/babel_explorer/core/downloader.py Outdated
Comment thread src/babel_explorer/core/babel_xrefs.py Outdated
Comment thread src/babel_explorer/core/babel_xrefs.py Outdated
Comment thread src/babel_explorer/core/nodenorm.py Outdated
Comment thread CLAUDE.md Outdated
gaurav and others added 6 commits March 2, 2026 17:35
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Remove _calculate_md5/_fetch_remote_md5 (too slow on 2.5-3.9 GB files)
- Add sidecar .meta JSON files (ETag, Last-Modified, Content-Length, last_checked)
- Three-tier logic: freshness window → HEAD/ETag check → full re-download
- Add freshness_seconds param to BabelDownloader (default 3h)
- Add --check-download CLI option to xrefs and ids commands (e.g. 3h, never)
- Update tests: replace MD5 test classes with meta/ETag/tier coverage

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
gaurav and others added 3 commits March 16, 2026 16:36
After production code was updated to use `with duckdb.connect()` and
`with requests.get()`, the test mocks (plain Mock()) no longer supported
the context manager protocol. Updated affected mocks to MagicMock() with
__enter__.return_value = self.

Also extracted _make_response() helper in TestDownloadWithRetry to
eliminate five near-identical 4-line mock setup blocks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
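The mock failure is easy to reproduce in isolation: Python looks up __enter__ on the type, not the instance, so a plain Mock cannot be used in a with statement, while MagicMock preconfigures the magic methods. A minimal sketch of the fix described above:

```python
from unittest.mock import MagicMock, Mock

# A plain Mock() fails as a context manager: the with statement looks up
# __enter__ on the type, which Mock does not define.
plain = Mock()
try:
    with plain:
        pass
    plain_supported = True
except (TypeError, AttributeError):  # TypeError on 3.11+, AttributeError earlier
    plain_supported = False

# MagicMock supports the protocol; pointing __enter__ back at the mock itself
# mirrors `with duckdb.connect() as conn:` in the updated tests.
conn = MagicMock()
conn.__enter__.return_value = conn
with conn as inner:
    assert inner is conn
```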
normalize_curie() was calling requests.get() with no timeout, risking
an indefinite hang if the NodeNorm service stalls. The downloader had
timeout=30 hardcoded in two places with no way to override it.

Add timeout: int = 30 to both constructors and thread self.timeout
through all three request call sites, making the default consistent
and the value overridable without patching.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The docstring claimed the method updates last_checked in the .meta file,
but it is a pure predicate — the caller (get_downloaded_file) owns that
write. Updated the docstring and removed stale inline comments that said
the same thing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@gaurav gaurav marked this pull request as ready for review March 16, 2026 20:45
gaurav and others added 7 commits March 16, 2026 16:51
Add .github/workflows/lint.yml to run ruff check and ruff format --check
on every pull request.

Fix the 7 errors ruff found, and apply ruff format to all files:
- Remove unused `import warnings` in babel_xrefs.py (left over after
  ignore_curies_in_expansion was removed)
- Add # noqa: F841 to the three read_parquet() assignments: ruff flags
  them as unused, but DuckDB resolves SQL table names by matching the
  Python variable name, so the assignments are load-bearing
- Remove spurious f-prefix from two string literals in downloader.py
- Drop unused `local_path` variable in test_downloader.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaced .github/workflows/lint.yml with ci.yml. The new file keeps the
existing ruff lint/format job and adds a parallel test job that runs
`pytest -v -m "not integration"` (unit tests only — no network required).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Covers all 5 branches: empty input, 'never', unit suffix, bare integer,
and invalid values that raise click.BadParameter.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
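The five branches being tested can be sketched as follows. This is an illustrative reconstruction, not the project's exact code: ValueError stands in for click.BadParameter so the example has no dependencies, and mapping 'never' to None is an assumed convention.

```python
UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_duration(value: str):
    """Illustrative duration parser covering the five tested branches."""
    lower = value.strip().lower()
    if not lower:
        raise ValueError("empty duration")        # branch 1: empty input
    if lower == "never":
        return None                               # branch 2: 'never' (assumed to mean: skip checks)
    if lower[-1] in UNITS:                        # branch 3: unit suffix, e.g. '3h', '30m'
        return int(lower[:-1]) * UNITS[lower[-1]]
    return int(lower)                             # branch 4: bare integer seconds
    # branch 5: anything else falls through int() and raises ValueError
```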
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 17 out of 22 changed files in this pull request and generated 5 comments.



Comment on lines +302 to +305
if os.path.exists(local_path_to_download_to):
    meta = self._load_meta(local_path_to_download_to)
    if meta is not None:
        # Tier 1: within freshness window — skip all network calls

Copilot AI Mar 30, 2026


When the file exists but the sidecar .meta is missing or corrupt, the code falls through to _download_with_retry() without deleting the existing file. _download_with_retry() will then attempt to resume (send a Range request / append), which can silently corrupt the local file if the remote content differs. Treat missing/invalid metadata as an unknown state and force a full re-download (e.g., delete the existing file before downloading, or add a flag to disable resume when no metadata is present).

Comment thread src/babel_explorer/core/nodenorm.py Outdated
Comment on lines +7 to +13
@dataclasses.dataclass
class Identifier:
    curie: str
    label: str = ""
    biolink_type: str = ""
    taxa: list[str] = dataclasses.field(default_factory=list)
    description: list[str] = dataclasses.field(default_factory=list)

Copilot AI Mar 30, 2026


Identifier.biolink_type is annotated as str, but NodeNorm responses typically provide type as a list of Biolink classes (and from_dict() assigns d["type"] directly). This creates an inconsistent runtime type and can break callers that treat biolink_type as a string. Consider making the field consistently list[str] (or tuple[str, ...]) and updating call sites/printing accordingly, or normalize to a single string (e.g., first element) if that’s the intended semantics.

Comment thread src/babel_explorer/core/nodenorm.py Outdated
if "description" in d:
    identifier.description = d["description"]
if "type" in d:
    identifier.biolink_type = d["type"]

Copilot AI Mar 30, 2026


Identifier.from_dict() assigns identifier.biolink_type = d["type"], which is often a list in NodeNorm responses. If you keep biolink_type as a string field, this should normalize the value (e.g., join, pick a primary type, or store separately as biolink_types). If you switch biolink_type to a collection type, adjust the dataclass annotation/default and any formatting/printing paths (e.g., CLI output) to avoid printing raw Python lists.

Suggested change
-identifier.biolink_type = d["type"]
+type_value = d["type"]
+if isinstance(type_value, list):
+    identifier.biolink_type = ", ".join(str(t) for t in type_value)
+elif type_value is not None:
+    identifier.biolink_type = str(type_value)

Comment thread src/babel_explorer/cli.py
Comment on lines +21 to +34
# Value with unit suffix (e.g. '3h', '30m')
if lower[-1] in units:
    try:
        amount = int(lower[:-1])
    except ValueError:
        raise click.BadParameter(
            f"Invalid duration {value!r}: expected an integer followed by an optional unit "
            "('s', 'm', 'h', or 'd'), or 'never'."
        )
    return amount * units[lower[-1]]
# Bare integer seconds
try:
    return int(lower)
except ValueError:

Copilot AI Mar 30, 2026


parse_duration() accepts negative values (e.g., "-1h" or "-30"), which then feed into freshness_seconds. A negative freshness makes the downloader treat everything as stale and can cause unexpected HEAD/GET behavior. Consider rejecting negative durations with a clear click.BadParameter message (or explicitly documenting that negatives are allowed and what they mean).

Comment thread src/babel_explorer/cli.py Outdated
gaurav and others added 5 commits March 30, 2026 16:14
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…n .gitignore

Both backported from add-nodenorm-frontend:
- babel_xrefs.py: avoid calling get_identifier() twice per CURIE in _to_labeled_xref
- .gitignore: anchor lib/ and lib64/ to repo root so nested lib dirs aren't ignored

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- nodenorm.py: Identifier.biolink_type str→list[str] to match NodeNorm API
- nodenorm.py: get_clique_identifiers returns [] instead of None; add return type annotation
- nodenorm.py: log debug message when get_identifier finds no exact match
- cli.py: parse_duration return type int|float; join biolink_type list for display
- tests: update assertions for new biolink_type type; add test-concord edge cases
  (unknown CURIE producing no output, multiple CURIEs queried independently)
- ci.yml: add workflow_dispatch trigger and integration-test job

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add push trigger for master branch (fires when PRs are merged)
- Add schedule trigger: Tuesdays at 17:00 UTC (12pm EST / 1pm EDT)
- Change integration-test job condition to run on all non-PR events

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Source files:
- nodenorm.py: module, Identifier class/from_dict, NodeNorm class/__init__/
  get_identifier/normalize_curie/get_clique_identifiers
- babel_xrefs.py: convert # comment to module docstring; CrossReference class/
  from_tuple/curies property; LabeledCrossReference class; IdentifierRecord.__str__;
  BabelXRefs class/__init__/get_curie_xref
- downloader.py: module, BabelDownloader.__init__, get_output_file
- cli.py: cli() group, test_concord() command

Test files (class docstrings only):
- test_babel_xrefs.py: TestCrossReference, TestLabeledCrossReference,
  TestIdentifierRecord, TestBabelXRefsInit
- test_nodenorm.py: TestIdentifier, TestNodeNormInit, TestNormalizeCurieMocked,
  TestGetIdentifierMocked, TestGetCliqueIdentifiersMocked

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 17 out of 22 changed files in this pull request and generated 2 comments.



Comment on lines +127 to +135
return LabeledCrossReference(
    subj=xref.subj,
    obj=xref.obj,
    filename=xref.filename,
    pred=xref.pred,
    subj_label=subj_ident.label,
    subj_biolink_type=subj_ident.biolink_type,
    obj_label=obj_ident.label,
    obj_biolink_type=obj_ident.biolink_type,

Copilot AI Apr 1, 2026


LabeledCrossReference declares subj_biolink_type/obj_biolink_type as str, but _to_labeled_xref() passes Identifier.biolink_type, which is a list[str] from NodeNorm. This makes the runtime type inconsistent with the dataclass contract and with unit tests that construct LabeledCrossReference using strings. Consider either changing these fields to list[str] (and updating the name/docs accordingly) or serializing the list (e.g., join with ', ' or select a primary type) before constructing the labeled xref.

Suggested change
-return LabeledCrossReference(
-    subj=xref.subj,
-    obj=xref.obj,
-    filename=xref.filename,
-    pred=xref.pred,
-    subj_label=subj_ident.label,
-    subj_biolink_type=subj_ident.biolink_type,
-    obj_label=obj_ident.label,
-    obj_biolink_type=obj_ident.biolink_type,
+# NodeNorm.Identifier.biolink_type is a list[str]; LabeledCrossReference expects str.
+# Normalize by selecting the primary (first) type when a list is provided.
+subj_biolink_type = (
+    subj_ident.biolink_type[0]
+    if isinstance(subj_ident.biolink_type, list) and subj_ident.biolink_type
+    else subj_ident.biolink_type
+)
+obj_biolink_type = (
+    obj_ident.biolink_type[0]
+    if isinstance(obj_ident.biolink_type, list) and obj_ident.biolink_type
+    else obj_ident.biolink_type
+)
+return LabeledCrossReference(
+    subj=xref.subj,
+    obj=xref.obj,
+    filename=xref.filename,
+    pred=xref.pred,
+    subj_label=subj_ident.label,
+    subj_biolink_type=subj_biolink_type,
+    obj_label=obj_ident.label,
+    obj_biolink_type=obj_biolink_type,

Comment on lines +280 to +289
@functools.lru_cache(maxsize=None)
def get_downloaded_file(self, dirpath: str, chunk_size: int = 1024 * 1024):
    """
    Download a file from the Babel server to local storage with ETag-based caching.

    Three-tier freshness logic:
    1. If .meta exists and last_checked is within freshness window → return immediately
    2. If .meta exists but stale → HEAD request to compare ETag; return if unchanged
    3. If ETag changed or no .meta → full re-download

Copilot AI Apr 1, 2026


get_downloaded_file() implements time/ETag-based freshness checks, but it is wrapped in @lru_cache. After the first call for a given (dirpath, chunk_size), subsequent calls return from the cache and will never re-evaluate the freshness window or perform the intended HEAD/GET logic in the same process. If the goal is to honor freshness_seconds on each call, consider removing the lru_cache here (or caching only path computation separately) so the three-tier logic can run as documented.

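The interaction the reviewer describes is easy to reproduce in miniature: once @lru_cache memoizes a call, the function body (and any freshness logic inside it) never runs again for the same arguments in that process. The Downloader class below is a toy stand-in, not the project's code:

```python
import functools


class Downloader:
    def __init__(self) -> None:
        self.freshness_checks = 0

    @functools.lru_cache(maxsize=None)  # memoizes on (self, dirpath, chunk_size)
    def get_downloaded_file(self, dirpath: str, chunk_size: int = 1024 * 1024) -> str:
        self.freshness_checks += 1  # stands in for the three-tier freshness logic
        return f"{dirpath}/file.parquet"


d = Downloader()
d.get_downloaded_file("data")
d.get_downloaded_file("data")  # served from the cache: the body never runs again
```

After the second call, d.freshness_checks is still 1, which is exactly why the cache defeats the documented freshness window.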
gaurav and others added 8 commits April 1, 2026 01:40
…map to listcomp

- babel_xrefs.py: subj_biolink_type/obj_biolink_type str→list[str] to match
  Identifier.biolink_type after the nodenorm.py type change
- nodenorm.py: replace map(lambda) with list comprehension in get_clique_identifiers
- tests: update LabeledCrossReference construction to use list values

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…D, type fixes

- nodenorm.py: Identifier is now frozen=True; rewrite from_dict as one-shot
  constructor to avoid post-construction mutation of lru_cache'd objects
- nodenorm.py: remove **kwargs from get_clique_identifiers — unhashable and unused,
  would raise TypeError if any kwarg was ever passed
- downloader.py: download to .tmp then os.replace() so the final file is never
  partially written; clean up .tmp on failure
- downloader.py: _etag_matches returns True (fail open) on HEAD network error
  instead of False, avoiding spurious 2GB re-downloads on transient failures
- cli.py: add nodenorm_url: str annotation in xrefs and test_concord; move
  test_concord inline comment to docstring
- tests: update test_returns_false_on_request_error → test_returns_true_on_request_error
- FUTURE.md: track CLI option deduplication refactor

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
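The .tmp-then-os.replace pattern from this commit, shown in isolation (atomic_write is an illustrative name, not the project's actual function):

```python
import os


def atomic_write(path: str, chunks) -> None:
    """Write chunks to `path` via a temporary file so readers never observe
    a partially written file; os.replace() is atomic on POSIX and Windows."""
    tmp = path + ".tmp"
    try:
        with open(tmp, "wb") as f:
            for chunk in chunks:
                f.write(chunk)
        os.replace(tmp, path)  # only now does `path` come into existence
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)  # clean up the partial .tmp on any failure
        raise
```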
- Fix --expand → --recurse (the actual flag name) in Data Flow and Key Design Patterns
- BabelXRefs: remove false claim about writing DuckDB databases to disk;
  all connections are in-memory (duckdb.connect() with no path)
- Remove 'Generated DuckDB databases' entry from File Locations (nothing on disk)
- Update test count table: numbers were stale and test_cli.py was missing entirely
- Add Identifier to Key Dataclasses (now frozen=True as of recent fix)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…/babel-explorer into basic-implementation-in-uv
- list→tuple on Identifier and LabeledCrossReference fields so frozen
  dataclasses are hashable (was a TypeError crash in get_curie_xrefs)
- NodeNorm(''): add early return in normalize_curie so empty URL truly
  skips all network calls as documented
- BabelDownloader: auto-append trailing slash to url_base so urljoin
  can't silently drop path segments
- CI: fix push trigger branch master → main
- Remove dead get_downloaded_dir method (lru_cache + NotImplementedError)
- parse_duration: reject negative values with a clear BadParameter error

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
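The list→tuple fix matters because a frozen dataclass derives __hash__ from its fields, so any list-typed field makes instances unhashable; this is the TypeError crash mentioned above. A minimal demonstration with toy classes:

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class WithList:
    types: list[str] = dataclasses.field(default_factory=list)


@dataclasses.dataclass(frozen=True)
class WithTuple:
    types: tuple[str, ...] = ()


try:
    hash(WithList(["biolink:Gene"]))  # list field -> unhashable instance
    list_hashable = True
except TypeError:
    list_hashable = False

hash(WithTuple(("biolink:Gene",)))  # tuple field -> hashable, so lru_cache works
```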
Introduces a central formatting.py module (write_records + _record_to_dict)
that serialises any dataclass to text, JSON, TSV, or CSV without touching
domain objects. A format_option decorator wires --format and --json-indent
onto xrefs, ids, and test-concord. test-concord injects a query_curie column
for non-text formats. 30 new unit tests in test_formatting.py; 7 CLI format
tests added to test_cli.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
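A stdlib-only sketch of what such a central serializer might look like. The Rec dataclass and this write_records signature are illustrative, not the project's actual formatting.py:

```python
import csv
import dataclasses
import io
import json


@dataclasses.dataclass
class Rec:  # stand-in for any domain dataclass
    curie: str
    label: str


def write_records(records, fmt: str = "json", json_indent: int = 2) -> str:
    """Serialize dataclass instances to JSON, TSV, or CSV without the
    domain objects knowing anything about output formats."""
    rows = [dataclasses.asdict(r) for r in records]
    if fmt == "json":
        return json.dumps(rows, indent=json_indent)
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=list(rows[0]), delimiter="\t" if fmt == "tsv" else ","
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Because dataclasses.asdict works on any dataclass, one function handles CrossReference, IdentifierRecord, and future record types alike.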
Replaces the 'text' default format with 'console', backed by the rich
library. xrefs and test-concord highlight query CURIEs in bold cyan
wherever they appear as subject or object; rich auto-strips markup when
output is piped. ids uses console.print(str(record)) for TTY-aware plain
output. formatting.py gains make_console() and hl_curie() utilities for
new commands to reuse. LabeledCrossReference labels appear in parentheses
next to CURIEs in console output.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
tuple() on a bare string iterates its characters, so biolink_type,
taxa, and description would become ('b','i','o',...) when NodeNorm
returns them as strings rather than lists. _to_tuple() now wraps
a bare string in a 1-tuple. Four new unit tests cover the string
case for each field.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>