Skip to content

feat: add index persistence#140

Open
stephantul wants to merge 1 commit into
mainfrom
add-persistence
Open

feat: add index persistence#140
stephantul wants to merge 1 commit into
mainfrom
add-persistence

Conversation

@stephantul
Copy link
Copy Markdown
Contributor

This PR adds index persistence for use in the CLI, which is a feature requested by several users.

This is a big PR! Sorry for that. This PR:

  • Adds a new index command to the CLI.

You can now index a repository as follows:

semble index -o "my_index"

and then search using:

semble search "where is persistency defined?" --index "my_index"

This greatly speeds up subsequent searches. Loading a decently-sized index takes 200-400ms.

  • Adds persistence to the embedding backend. This was necessary because we override the basicbackend a little weirdly. Something we can think about doing differently
  • Adds persistence to the index itself. All components are saved in subfolders, except the model. For the model, we save the name of the model. To facilitate saving itself, I added helpers to Chunk. These are thin wrappers around asdict and a dictionary expansion.
  • Removed the Encoder protocol: this no longer made sense because we use the saving and loading methods in model2vec.

The ugly part of this is that I chose to refactor a large part of the code: we now now longer pass a model to the index when building it. Instead I use a path to the model. This path is then used to load the model, and reverts to the default model when None. This is a more elegant construction I think, since this allows us to store the model path, and also cache model loading more efficiently.

Follow-up tasks:

  • Not all benchmarks work, but the most important ones (i.e., the regular one and ablations) still work.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Files with missing lines Coverage Δ
src/semble/__init__.py 100.00% <100.00%> (ø)
src/semble/cli.py 100.00% <100.00%> (ø)
src/semble/index/create.py 100.00% <100.00%> (ø)
src/semble/index/dense.py 100.00% <100.00%> (ø)
src/semble/index/index.py 100.00% <100.00%> (ø)
src/semble/index/types.py 100.00% <100.00%> (ø)
src/semble/mcp.py 100.00% <100.00%> (ø)
src/semble/search.py 100.00% <100.00%> (ø)
src/semble/types.py 100.00% <100.00%> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Member

@Pringled Pringled left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very very nice, some minor things

Comment thread src/semble/index/index.py
if root_path:
root_path = Path(root_path)

model = StaticModel.from_pretrained(model_path)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this not just call load_model? So you disable the progressbars and set force download?

Comment thread src/semble/index/index.py
persistence_paths = PersistencePath.from_path(path)
bm_25_index = BM25.load(persistence_paths.bm25_index)
semantic_index = SelectableBasicBackend.load(persistence_paths.semantic_index)
metadata = orjson.loads(open(persistence_paths.metadata).read())
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess with open() is a slightly safer pattern but I've honestly never seen any issues in practice with this so yeah... whatever you want

Comment thread src/semble/index/index.py
def load_from_disk(cls: type[SembleIndex], path: Path | str) -> SembleIndex:
"""Load the index from disk."""
path = Path(path)
if not path.exists():
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm is this enough to check? If you use the wrong path here that does exist (but without the data) you get a nasty error when it tries to load the stuff later. Then again checking it adds a bunch of ifs iguess

Comment thread src/semble/cli.py
asyncio.run(serve(args.path, ref=args.ref, include_text_files=args.include_text_files))


def _run_index(*, path: str, include_text_files: bool = False, out: str) -> None:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we intentionally only support local paths or do we also want to do git urls? If we don't want to support it I think we should maybe document that or raise an error here if someone does try to do a git url

Comment thread src/semble/index/index.py
model: Encoder | None = None,
extensions: Sequence[str] | None = None,
include_text_files: bool = False,
model_path: str | None = None,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically a breaking API change but I don't think anyone uses Semble as a programmatic API tbh wdyt?

Comment thread src/semble/mcp.py
"""Pre-load the model and optionally pre-index the default source in parallel with starting the server."""
try:
cache._model = await asyncio.to_thread(load_model)
_, cache._model_path = await asyncio.to_thread(load_model)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

load_model() here caches on (), but later _IndexCache.get calls load_model("minishlab/potion-code-16M") via from_path so it won't hit the cache, I think we can just use None everywhere, should be safe and then we do have a working cache?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants