169 changes: 169 additions & 0 deletions .claude/skills/audit-langchain-docs/SKILL.md
@@ -0,0 +1,169 @@
---
name: audit-langchain-docs
description: "Audit all Diffbot documentation in langchain-ai/docs for drift against this repo's public API. Runs Vale and other docs-repo validators, then applies any prose fixes found back to README.md. Use when the user says: audit docs, check docs are up to date, review langchain docs, docs drift."
allowed-tools: Bash(python3:*), Bash(grep:*), Bash(find:*), Bash(make:*), Read, Edit
---

# Audit Diffbot docs in langchain-ai/docs

This skill checks every Diffbot documentation file on the LangChain docs site against this repo's public API, runs the docs repo's own validators (Vale, etc.), and propagates any findings back into this repo — `README.md` and Python source files (docstrings, comments). The docs repo's validation standards apply repo-wide here.

## Source of truth (this repo)

| Artifact | What it defines |
|----------|----------------|
| `langchain_diffbot/__init__.py` (`__all__`) | The complete public API — every class that should be documented |
| `langchain_diffbot/tools/` | Tool class signatures, input schemas, return types |
| `langchain_diffbot/retrievers/` | Retriever constructor parameters, return types |
| `langchain_diffbot/chat_models/` | Chat model constructor parameters |
| `langchain_diffbot/document_loaders/` | Loader constructor parameters |
| `README.md` | Canonical prose, API table, auth model, examples |

## Documentation files to audit (langchain-ai/docs)

A local `langchain-ai/docs` checkout is expected at the sibling **`../langchain-docs`** by default (override with `$LANGCHAIN_DOCS_REPO`).

| File | What it should reflect |
|------|----------------------|
| `src/oss/python/integrations/providers/diffbot.mdx` | Overview hub: API table, install, auth, component table, links to tools/retrievers pages |
| `src/oss/python/integrations/tools/diffbot.mdx` | All 7 tools: `DiffbotExtractTool`, `DiffbotWebSearchTool`, `DiffbotKnowledgeGraphTool`, `DiffbotEntitiesTool`, `DiffbotAskTool`, `DiffbotOntologyTool`, `DiffbotDQLProbeTool` |
| `src/oss/python/integrations/retrievers/diffbot.mdx` | Both retrievers: `DiffbotKnowledgeGraphRetriever`, `DiffbotWebSearchRetriever` |
| `src/oss/python/integrations/providers/all_providers.mdx` | Card entry for Diffbot |
| `src/oss/python/integrations/tools/index.mdx` | Row in Search table + card in All tools and toolkits |
| `src/oss/python/integrations/retrievers/index.mdx` | Rows in External index table + card in All retrievers |

## Steps

<Steps>

### Locate the docs repo

```bash
DOCS_REPO="${LANGCHAIN_DOCS_REPO:-../langchain-docs}"
if [ ! -d "$DOCS_REPO/src" ]; then
  echo "ERROR: docs repo not found at $DOCS_REPO"
  exit 1
fi
echo "Docs repo: $DOCS_REPO"
```

### Read the public API surface

Read `langchain_diffbot/__init__.py` and extract every name from `__all__`. This is the canonical list of classes that must appear in the docs.

Also read the source files to extract key constructor parameters and return types for each class:
- Tools: what input schema does each tool accept? What does it return?
- Retrievers: what constructor parameters does each take (`client`, `k`, `fields`, `content_fields`, `document_mapper`)?
- Chat model: what parameters does `ChatDiffbot` take?
- Loaders: what parameters do `DiffbotExtractLoader` and `DiffbotCrawlLoader` take?
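
One way to collect the names without importing the package is to parse `__init__.py` statically. The sketch below is only an illustration, not part of this skill's tooling: `read_dunder_all` is a hypothetical helper, the sample module text is made up, and it assumes `__all__` is assigned a plain list or tuple literal at module level.

```python
from __future__ import annotations

import ast


def read_dunder_all(source: str) -> list[str]:
    """Return the names in a module-level ``__all__`` assignment, or [] if absent."""
    tree = ast.parse(source)
    for node in tree.body:
        if isinstance(node, ast.Assign):
            targets, value = node.targets, node.value
        elif isinstance(node, ast.AnnAssign) and node.value is not None:
            targets, value = [node.target], node.value
        else:
            continue
        if any(isinstance(t, ast.Name) and t.id == "__all__" for t in targets):
            # literal_eval only accepts literals, so no package code is executed.
            return list(ast.literal_eval(value))
    return []


# Illustrative module text; the real input would be langchain_diffbot/__init__.py.
sample = '''
from langchain_diffbot.tools import DiffbotExtractTool
__all__ = ["ChatDiffbot", "DiffbotExtractTool"]
'''
print(read_dunder_all(sample))
```

Parsing rather than importing keeps the audit independent of whether the package and its dependencies are installed in the current environment.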

### Read every documentation file

Read all 6 files listed in the table above. The paths in that table are relative to `$DOCS_REPO` (they already begin with `src/`).

### Run Vale on the three Diffbot content pages

Run the docs repo's own Vale validation against the three Diffbot content pages. The docs repo's `make lint_prose` accepts a space-separated `FILES=` argument:

```bash
make -C "${LANGCHAIN_DOCS_REPO:-../langchain-docs}" lint_prose \
FILES="src/oss/python/integrations/providers/diffbot.mdx \
src/oss/python/integrations/tools/diffbot.mdx \
src/oss/python/integrations/retrievers/diffbot.mdx"
```

Capture the full output. Any errors or warnings Vale reports are **definitive** — they are exactly what would fail the docs repo's CI. Include every Vale finding verbatim in the report under a **Vale violations** category. If Vale reports clean, note that explicitly.
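
If the output needs to be captured programmatically, the same `make` call can be wrapped in Python. This is a rough sketch under stated assumptions: `vale_command` and `run_vale` are hypothetical helpers, and the pass/fail flag assumes the `lint_prose` target exits non-zero when Vale reports errors, which should be confirmed against the docs repo's Makefile.

```python
from __future__ import annotations

import subprocess

PAGES = [
    "src/oss/python/integrations/providers/diffbot.mdx",
    "src/oss/python/integrations/tools/diffbot.mdx",
    "src/oss/python/integrations/retrievers/diffbot.mdx",
]


def vale_command(docs_repo: str, pages: list[str]) -> list[str]:
    """Build the argv for `make lint_prose` with a space-separated FILES= value."""
    return ["make", "-C", docs_repo, "lint_prose", "FILES=" + " ".join(pages)]


def run_vale(docs_repo: str, pages: list[str]) -> tuple[bool, str]:
    """Run the lint target and return (passed, combined stdout and stderr)."""
    proc = subprocess.run(
        vale_command(docs_repo, pages),
        capture_output=True,
        text=True,
        check=False,  # a failing lint is a finding to report, not an exception
    )
    return proc.returncode == 0, proc.stdout + proc.stderr
```

Calling `run_vale("../langchain-docs", PAGES)` returns the raw text that goes verbatim into the **Vale violations** category.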

### Compare and report

For each issue found, report: the file path, the line or section, and the specific discrepancy. Organize the report into these categories:

**Missing coverage** — Classes in `__all__` not mentioned anywhere in the docs:
- Check `tools/diffbot.mdx` covers all 7 tool classes
- Check `retrievers/diffbot.mdx` covers both retriever classes
- Check `providers/diffbot.mdx` component table has all 12 classes

**Stale class names** — Class names in docs that no longer exist in `__all__`:
- Search docs files for any class name starting with `Diffbot` or `ChatDiffbot` and verify each is still in `__all__`
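
The two checks above (missing coverage and stale class names) reduce to set differences between `__all__` and the class names mentioned in the docs. A minimal sketch, assuming class names can be recognised by the `Diffbot` + capitalised suffix pattern or the literal `ChatDiffbot`; `compare_names` is a hypothetical helper and `DiffbotOldTool` is an invented stale name. The bare word `Diffbot` is deliberately ignored because it also appears as a brand name in prose.

```python
from __future__ import annotations

import re

# Matches ChatDiffbot and names such as DiffbotExtractTool, but not bare "Diffbot".
CLASS_NAME = re.compile(r"\b(?:ChatDiffbot|Diffbot[A-Z]\w*)\b")


def compare_names(public_api: set[str], docs_text: str) -> tuple[set[str], set[str]]:
    """Return (missing_from_docs, stale_in_docs) for the tracked class names."""
    tracked = {name for name in public_api if CLASS_NAME.fullmatch(name)}
    mentioned = set(CLASS_NAME.findall(docs_text))
    return tracked - mentioned, mentioned - tracked


missing, stale = compare_names(
    {"ChatDiffbot", "DiffbotAskTool"},
    "Use `DiffbotAskTool`, or the older `DiffbotOldTool`.",
)
print("missing:", sorted(missing))
print("stale:", sorted(stale))
```

A regex pass like this only narrows the search; each hit still needs a look in context, since a name can legitimately appear in a changelog or migration note.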

**Missing constructor parameters** — Key parameters documented in docs but removed from source, or present in source but undocumented:
- Retrievers: `k`, `fields`, `content_fields`, `document_mapper` — verify docs mention all four
- Tools: verify each tool's documented input fields match the actual input schema

**Stale prose** — Description in docs contradicts current behavior:
- The authentication model (client-based, not token-based env var pattern)
- Error handling behavior of `DiffbotExtractTool` (returns `{"error": ..., "errorCode": ...}` dict on failure, does not raise)
- `DiffbotCrawlLoader` page_content behavior (URL only, not page content)

**Broken cross-links** — Internal links in docs files that point to non-existent pages:
- `providers/diffbot.mdx` links to `/oss/integrations/tools/diffbot` and `/oss/integrations/retrievers/diffbot` — verify both pages exist
- `tools/diffbot.mdx` links to `/oss/integrations/providers/diffbot` — verify it exists
- `retrievers/diffbot.mdx` links to `/oss/integrations/providers/diffbot` — verify it exists
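
The link checks can be sketched as a route-to-file lookup. The mapping used here is an assumption inferred from the paths in this skill (a route `/oss/integrations/<section>/<page>` served from `src/oss/python/integrations/<section>/<page>.mdx`); `route_to_file` and `broken_links` are hypothetical helpers, and the docs site's real routing rules should be checked before relying on this.

```python
from __future__ import annotations

import re
from pathlib import Path

PREFIX = "/oss/integrations/"
# Captures the target of Markdown links such as [text](/oss/integrations/tools/diffbot).
LINK = re.compile(r"\]\((/oss/integrations/[^)#\s]+)")


def route_to_file(docs_repo: Path, route: str) -> Path:
    """Map a site route to the .mdx file assumed to back it."""
    tail = route[len(PREFIX):].strip("/")
    return docs_repo / "src" / "oss" / "python" / "integrations" / (tail + ".mdx")


def broken_links(docs_repo: Path, page_text: str) -> list[str]:
    """Return every internal integrations route whose backing file is absent."""
    return [
        route
        for route in LINK.findall(page_text)
        if not route_to_file(docs_repo, route).exists()
    ]
```

For example, `broken_links(Path("../langchain-docs"), hub_text)` with the text of `providers/diffbot.mdx` would list any tools or retrievers link that does not resolve to a file.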

**Missing index entries** — Diffbot entries absent from listing pages:
- `all_providers.mdx` has a card for Diffbot
- `tools/index.mdx` has a row in the Search table and a card in All tools and toolkits
- `retrievers/index.mdx` has rows in the External index table and a card in All retrievers

**Vale violations** — Prose issues caught by the docs repo's Vale CI (terminology, dash spacing, etc.). These are definitive: any error here would block a PR merge. Include the raw Vale output line for each finding.

**README/hub drift** — The provider hub page (`providers/diffbot.mdx`) is kept in sync with README.md. Check:
- The API table in the hub matches the API table in README.md
- The component reference table in the hub matches README.md and `__all__`
- The install instructions are consistent

### Output a summary

Produce a concise report:

```
DIFFBOT DOCS AUDIT — <date>

VALE — <pass/fail>
<raw Vale output, or "✔ 0 errors, 0 warnings in 3 files">

✅ PASSING
- All 12 classes in __all__ appear in docs
- ...

⚠️ ISSUES (<count>)
1. [VALE] providers/diffbot.mdx:39 — Remove whitespace around ' —'. (LangChain.DashesSpaces)
2. [MISSING COVERAGE] tools/diffbot.mdx: DiffbotFooTool added to __all__ but not documented
3. [STALE IMPORTS] retrievers/diffbot.mdx:55 — legacy langchain.schema.* import path
4. [MISSING INDEX ENTRY] tools/index.mdx: Diffbot missing from Search table
...

FIXED IN THIS REPO
- README.md:48 — "pre-built" → "prebuilt"
- langchain_diffbot/tools.py:43 — docstring: "pre-built" → "prebuilt"

STILL NEEDS ATTENTION IN langchain-docs
- retrievers/diffbot.mdx:55 — legacy langchain.schema.* import (edit directly in langchain-docs)
```

### Propagate findings back into this repo

For every issue found — Vale violations, stale prose, outdated import patterns — apply the same fix throughout this repo:

1. **README.md**: fix any matching prose, terminology, or example code.
2. **Python source files** (`langchain_diffbot/**/*.py`, including the `tools/`, `retrievers/`, `chat_models/`, and `document_loaders/` subpackages): fix matching issues in docstrings and inline comments. Do not change logic or signatures.

Use `grep` to find occurrences before editing:

```bash
grep -rn "pre-built\| — " langchain_diffbot/ README.md
```

After applying fixes, run the parity tests to confirm nothing broke:

```bash
uv run pytest tests/unit_tests/test_readme_parity.py tests/unit_tests/test_readme_examples.py -q
```

</Steps>

## Notes

- Fixes to `README.md` and Python source are applied directly by this skill. Fixes to `langchain-docs` files must be made there separately (this repo has no write access to the remote).
- To sync the provider hub after fixing README.md: run `/sync-langchain-docs`.
- To fix tools/retrievers deep-dive pages: edit them directly in `langchain-ai/docs`.
114 changes: 114 additions & 0 deletions .claude/skills/sync-langchain-docs/SKILL.md
@@ -0,0 +1,114 @@
---
name: sync-langchain-docs
description: "Keep all Diffbot docs in sync: README.md in this repo and the three LangChain docs pages (providers, tools, retrievers). Use when code changes, when any doc page drifts, or when the user says: sync langchain docs, update the diffbot docs, sync the integration pages."
allowed-tools: Bash(python3:*), Bash(git:*), Bash(gh:*), Bash(make:*), Read, Edit, Write
---

# Sync Diffbot docs across all pages

## What this skill does

Four documents describe the `langchain-diffbot` package. Keep them all accurate and consistent — update whichever ones need it, not just one.

| File | Audience | Owns |
|------|----------|------|
| `README.md` (this repo) | GitHub / PyPI readers | Complete package reference — install, auth, all classes, examples |
| `providers/diffbot.mdx` (langchain-docs) | Docs site visitors landing on Diffbot | Overview only: install, auth, components table, links to detail pages |
| `tools/diffbot.mdx` (langchain-docs) | Docs site visitors looking for tools | Full tool documentation with examples |
| `retrievers/diffbot.mdx` (langchain-docs) | Docs site visitors looking for retrievers | Full retriever documentation with examples |

**Link, don't duplicate within langchain-docs.** The provider hub names every class and links to the tools/retrievers pages; the detail pages don't repeat install/auth. The README is for a different audience and channel (GitHub/PyPI) — it can be complete without violating this rule.

## Where the docs repo lives

A local `langchain-ai/docs` checkout is expected at the sibling **`../langchain-docs`** by default (override with `$LANGCHAIN_DOCS_REPO`). Verify it:

```bash
python3 .claude/skills/sync-langchain-docs/sync.py --path
```

If that errors, ask the user for its path or to clone `git@github.com:langchain-ai/docs.git` next to this repo.

## Steps

<Steps>

### Read all four files

Read every file before touching any of them:

- `README.md`
- `$(python3 .claude/skills/sync-langchain-docs/sync.py --repo)/src/oss/python/integrations/providers/diffbot.mdx`
- `$(python3 .claude/skills/sync-langchain-docs/sync.py --repo)/src/oss/python/integrations/tools/diffbot.mdx`
- `$(python3 .claude/skills/sync-langchain-docs/sync.py --repo)/src/oss/python/integrations/retrievers/diffbot.mdx`

Also check what triggered the sync — git diff, the user's description, or a specific change — so you know what actually changed and can limit edits to what's necessary.

### Identify what needs updating

For each of the four files, decide independently whether it needs a change. Common triggers:

- **New or renamed class** → update the components table in README + provider hub; add documentation to the appropriate detail page (tools or retrievers); update any import examples.
- **Behavior change to a tool or retriever** → update README + the matching detail page.
- **Auth model change** → update README + provider hub (both cover auth); check if detail pages reference auth.
- **Install instructions change** → update README + provider hub.
- **Example improvement** → update README; mirror to the matching detail page if it's more illustrative.
- **Detail page drifted from the code** → update just that page.

If only one file needs a change, only edit that file.

### Apply the updates

Edit each file that needs it. Rules per file:

**README.md** — complete reference, no format restrictions. Run the parity guard after any change to ensure the components table and examples stay in sync with the package:

```bash
uv run pytest tests/unit_tests/test_readme_parity.py tests/unit_tests/test_readme_examples.py -q
```

**providers/diffbot.mdx** — hub only. Structure:
1. Frontmatter (`title`, `description`)
2. Sync comment (keep it — see current file for wording)
3. Short intro + API → class mapping table
4. `## Installation` as `<CodeGroup>` with `pip` and `uv` tabs
5. `## Authentication` — prose + `db = Diffbot(...)` snippet only; no usage examples
6. One short section per class group (Retrievers, Tools, Chat model, Document loaders) — one sentence + import snippet + link to the detail page; no examples
7. `## Components reference` table

**tools/diffbot.mdx** — full tool docs. Include every tool class, usage examples, and any agent patterns. Link back to the provider hub for install/auth. Do not repeat retriever content.

**retrievers/diffbot.mdx** — full retriever docs. Include both retriever classes, output shaping, LCEL chain usage. Link back to the provider hub for install/auth. Do not repeat tool content.

MDX formatting rules (Vale enforces these — violations block the docs CI):
- Em dashes: no surrounding spaces (`word—word`, not `word — word`)
- `prebuilt` not `pre-built`
- Install blocks: `<CodeGroup>` with `pip` and `uv` tabs
- Relative links to this repo become absolute `https://github.com/diffbot/langchain-diffbot/...` URLs
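
The first two rules are mechanical enough to pre-check before running Vale. The sketch below is an illustrative pre-flight only, not a replacement for Vale, which remains the authority: `style_findings` is a hypothetical helper and its messages are not Vale's.

```python
from __future__ import annotations

import re

# An em dash (U+2014) with whitespace on either side.
SPACED_EM_DASH = re.compile(r"\s\u2014|\u2014\s")
HYPHENATED_PREBUILT = re.compile(r"\bpre-built\b", re.IGNORECASE)


def style_findings(text: str) -> list[tuple[int, str]]:
    """Return (line_number, message) pairs for the two mechanical rules."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        if SPACED_EM_DASH.search(line):
            findings.append((number, "em dash with surrounding space"))
        if HYPHENATED_PREBUILT.search(line):
            findings.append((number, "use 'prebuilt'"))
    return findings


sample = "tools \u2014 retrievers\npre-built agents\nword\u2014word is fine"
for number, message in style_findings(sample):
    print(f"line {number}: {message}")
```

Running the same function over `README.md` gives an early view of what will need to be mirrored back once Vale flags the MDX pages.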

### Lint every changed MDX file

The docs repo has its own Vale setup. Run it against each changed MDX:

```bash
make -C "$(python3 .claude/skills/sync-langchain-docs/sync.py --repo)" \
lint_prose FILES="src/oss/python/integrations/providers/diffbot.mdx"
```

**If Vale or any other docs-repo validation catches a prose issue, fix it in `README.md` too** — the docs repo leads on prose quality and the README should follow. Fix violations and re-run until clean.

### Commit and push

Work from inside the docs repo. Reuse the existing `integration/diffbot` branch if it exists; otherwise create `docs/sync-diffbot-<topic>`.

```bash
DOCS="${LANGCHAIN_DOCS_REPO:-../langchain-docs}"
cd "$DOCS"
git add <changed files>
git commit -m "docs: <short summary of what changed and why>"
git push
```

Stage only the files you changed. If the branch already has an open PR, the push updates it automatically — no need to open a new one unless the user asks.

</Steps>
71 changes: 71 additions & 0 deletions .claude/skills/sync-langchain-docs/sync.py
@@ -0,0 +1,71 @@
#!/usr/bin/env python3
"""Resolve the Diffbot provider page inside a local langchain-ai/docs checkout.

README.md in this repo is the single source of truth for the Diffbot provider
page on the LangChain docs site; the `sync-langchain-docs` skill *generates* the
page (`.mdx`) from it. The generation itself is agent-driven (prose → house
style), so this script does not copy anything — it just resolves where the page
lives, so the skill never hardcodes the path:

    # Resolve the docs repo from --docs-repo, $LANGCHAIN_DOCS_REPO, or the
    # sibling ../langchain-docs, then print:
    python3 sync.py --path    # absolute path of the target .mdx page
    python3 sync.py --repo    # absolute path of the docs repo root

The target path inside the docs repo is fixed:
    src/oss/python/integrations/providers/diffbot.mdx
"""

from __future__ import annotations

import argparse
import os
from pathlib import Path

# This file lives at <repo>/.claude/skills/sync-langchain-docs/sync.py.
_REPO_ROOT = Path(__file__).resolve().parents[3]
TARGET_RELPATH = "src/oss/python/integrations/providers/diffbot.mdx"


def resolve_docs_repo(arg: str | None) -> Path:
    """Find the langchain-ai/docs checkout from the arg, env, or sibling dir."""
    candidate = arg or os.environ.get("LANGCHAIN_DOCS_REPO")
    if candidate:
        path = Path(candidate).expanduser().resolve()
    else:
        path = (_REPO_ROOT.parent / "langchain-docs").resolve()
    if not (path / TARGET_RELPATH).exists():
        msg = (
            f"langchain-ai/docs checkout not found at {path} "
            f"(no {TARGET_RELPATH}). Pass --docs-repo or set LANGCHAIN_DOCS_REPO."
        )
        raise SystemExit(msg)
    return path


def main() -> int:
    """Print the resolved docs repo root or target page path."""
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument(
        "--repo",
        action="store_true",
        help="Print the docs repo root instead of the target page path.",
    )
    parser.add_argument(
        "--path",
        action="store_true",
        help="Print the absolute path of the target page (default).",
    )
    parser.add_argument(
        "--docs-repo", default=None, help="Path to a local langchain-ai/docs checkout."
    )
    args = parser.parse_args()

    repo = resolve_docs_repo(args.docs_repo)
    # Default to the page path so a bare invocation is useful too.
    print(repo if args.repo else repo / TARGET_RELPATH)
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
3 changes: 3 additions & 0 deletions .gitignore
@@ -14,3 +14,6 @@ build/

# Claude Code harness runtime artifacts
**/.claude/scheduled_tasks.lock

# Claude Code personal/local settings
.claude/settings.local.json