openkb remove leaves orphan hash and reformats unrelated wiki #58

@SeungwookHan

A single openkb remove <doc> run surfaces two independent bugs at once. I'm reporting them together because
they share a single repro, but they have different root causes and need separate fixes.

Follow-up to #41 — both are regressions in the implementation shipped by PR #51.

Repro

  1. KB containing at least one document ingested before PR #51 (its hashes.json entry has only {name, type}, no doc_name key) and a handful of LLM-generated concept pages with pre-existing dangling
    wikilinks.
  2. openkb remove <that-doc> (e.g. openkb remove ollama).
  3. Observed: removal "succeeds" but git status / hashes.json show the symptoms below.

Bug 1 — hash entry is not removed for docs ingested before PR #51

cat .openkb/hashes.json still contains the removed doc's entry after openkb remove reports success.
Re-running openkb add <same-file> is then incorrectly treated as a duplicate via the SHA dedup.

Root cause

Commit c504e26 (within this same PR) fixed add_single_file so newly-ingested docs persist doc_name
into the registry. However, entries that already existed in hashes.json before that commit were not
backfilled; they still carry only {name, type}.

HashRegistry.remove_by_doc_name (openkb/state.py:44-51) matches with meta.get("doc_name") == doc_name. For un-backfilled legacy entries the comparison evaluates to None == "<slug>" → always
False. The method silently returns False; nothing in the call chain checks the return value.

Meanwhile cli.py:670 (doc_name = meta.get("doc_name") or Path(name).stem) does fall back to the
filename stem to drive every other step, so summary/source/concept/index removal succeeds and the failure
is invisible at the surface.
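
The mismatch can be reproduced in isolation. The snippet below is a minimal sketch, not the openkb source: the entry values and the helper names registry_matches and cli_doc_name are made up for illustration, and only the two expressions quoted above are taken from the code.

```python
from pathlib import Path

# Illustrative hashes.json metadata; the "type" values are placeholders.
legacy_meta = {"name": "ollama.md", "type": "md"}  # ingested before PR #51
current_meta = {"name": "ollama.md", "type": "md", "doc_name": "ollama"}


def registry_matches(meta: dict, doc_name: str) -> bool:
    """The comparison quoted from HashRegistry.remove_by_doc_name."""
    return meta.get("doc_name") == doc_name


def cli_doc_name(meta: dict) -> str:
    """The fallback quoted from cli.py:670."""
    return meta.get("doc_name") or Path(meta["name"]).stem


slug = cli_doc_name(legacy_meta)             # "ollama": the CLI resolves the doc fine
print(registry_matches(legacy_meta, slug))   # False: None == "ollama"
print(registry_matches(current_meta, slug))  # True
```

The CLI and the registry disagree only for legacy entries, which is why every other removal step works while the hash entry survives.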

Suggested fix

Either of the following — both are robust against un-backfilled legacy data:

  • Add a fallback in remove_by_doc_name that also matches when Path(meta["name"]).stem == doc_name, OR
  • Introduce remove_by_hash(file_hash) and call it from cli.py:842 since the CLI already has the
    matched hash in hand. Preferred — eliminates the slug round-trip and works regardless of doc_name
    presence.

A one-shot migration that backfills doc_name on the next openkb invocation would also clean this up,
but the call-site fix above is sufficient and avoids touching user data on read paths.
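
A rough sketch of both options, assuming the registry is a {file_hash: metadata} mapping persisted as JSON. HashRegistrySketch is a hypothetical stand-in; the real HashRegistry in openkb/state.py may store and save its data differently.

```python
import json
from pathlib import Path


class HashRegistrySketch:
    """Hypothetical, simplified registry: {file_hash: metadata} stored as JSON."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.entries = json.loads(path.read_text()) if path.exists() else {}

    def _save(self) -> None:
        self.path.write_text(json.dumps(self.entries, indent=2))

    def remove_by_hash(self, file_hash: str) -> bool:
        """Preferred option: drop the entry by its key, no slug round-trip."""
        if self.entries.pop(file_hash, None) is None:
            return False
        self._save()
        return True

    def remove_by_doc_name(self, doc_name: str) -> bool:
        """Fallback option: use the filename stem when doc_name is missing."""
        for file_hash, meta in list(self.entries.items()):
            stored = meta.get("doc_name") or Path(meta.get("name", "")).stem
            if stored == doc_name:
                return self.remove_by_hash(file_hash)
        return False
```

Whichever variant is used, the caller should check the boolean result and warn or fail when it is False, so a missed entry is no longer silent.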


Bug 2 — unrelated wiki pages get reformatted

Removing a single doc produces a sprawling diff. In my repro, removing one ollama.md produced a
39-file / 1254-line diff; 27 of those were concept pages that didn't list ollama as a source.

Example (from a concept page unrelated to ollama):

- **Knowledge access**: agents need curated context such as [[LLM Wiki]]                                
+ **Knowledge access**: agents need curated context such as LLM Wiki                                      

Root cause

cli.py:815 calls fix_broken_links(wiki_dir) over the entire wiki on every remove.
openkb/lint.py:fix_broken_links strips every dangling wikilink in the KB, not only the ones created by
this removal. Pre-existing ghost links (LLM-generated, hand-edited, links to not-yet-added concepts, etc.)
get swept up too.

Impact

  • Removal commits are unreadable — actual deletion effects are buried under unrelated reformat noise.
  • Users lose [[wikilinks]] they may want to keep (e.g. links to a concept they plan to add later).
  • Violates least-surprise: the command name says "remove one doc," but the diff shows wiki-wide
    refactoring.

Suggested fix (preferred)

Limit ghost-link stripping to the files actually touched by this removal: concept_result["modified"] ∪
{index.md}. This preserves the original PR #49 intent (clean up dangling links the removal just created)
without sweeping the rest of the KB.
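
A minimal sketch of the scoped pass. strip_ghost_links is a hypothetical helper, not the existing fix_broken_links, and it assumes a wikilink resolves when some .md page in the wiki has a stem equal to the link target; openkb's real resolution rules may differ.

```python
import re
from pathlib import Path
from typing import Iterable

# Matches [[Target]] and [[Target|label]].
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]*))?\]\]")


def strip_ghost_links(files: Iterable[Path], wiki_dir: Path) -> list[Path]:
    """Unlink dangling wikilinks in `files` only; return the files rewritten."""
    # Assumption: a link target resolves if a page with that stem exists.
    existing = {p.stem for p in wiki_dir.rglob("*.md")}

    def unlink(match: re.Match) -> str:
        target = match.group(1).strip()
        if target in existing:
            return match.group(0)  # live link, leave untouched
        return (match.group(2) or target).strip()  # keep the visible text

    changed = []
    for path in files:
        text = path.read_text(encoding="utf-8")
        new_text = WIKILINK.sub(unlink, text)
        if new_text != text:
            path.write_text(new_text, encoding="utf-8")
            changed.append(path)
    return changed
```

Called with the modified concept pages plus index.md, this leaves every other page byte-for-byte unchanged, so a pre-existing [[LLM Wiki]] ghost in an unrelated page stays as written.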

Alternatives

  • Snapshot the global ghost set before/after the removal and strip only the newly-introduced ghosts.
  • Make the global pass opt-in via a flag (e.g. --lint), default off.
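
The snapshot alternative boils down to a set difference. The sketch below is illustrative only: ghost_targets and ghosts_introduced_by are hypothetical names, and it again assumes a link resolves when a page with a matching stem exists.

```python
import re
from pathlib import Path
from typing import Callable

# Captures the target of [[Target]] and [[Target|label]].
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def ghost_targets(wiki_dir: Path) -> set[str]:
    """Return every wikilink target with no page of a matching stem."""
    pages = list(wiki_dir.rglob("*.md"))
    existing = {p.stem for p in pages}
    ghosts = set()
    for page in pages:
        for match in WIKILINK.finditer(page.read_text(encoding="utf-8")):
            target = match.group(1).strip()
            if target not in existing:
                ghosts.add(target)
    return ghosts


def ghosts_introduced_by(wiki_dir: Path, removal: Callable[[], None]) -> set[str]:
    """Run `removal` and return only the ghost targets it newly created."""
    before = ghost_targets(wiki_dir)
    removal()
    return ghost_targets(wiki_dir) - before
```

Only the returned targets would then be stripped, so ghosts that existed before the removal are never touched.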
