feat: add index persistence by stephantul · Pull Request #140 · MinishLab/semble

stephantul · 2026-05-21T14:47:28Z

This PR adds index persistence for use in the CLI, which is a feature requested by several users.

This is a big PR! Sorry for that. This PR:

Adds a new index command to the CLI.

You can now index a repository as follows:

semble index -o "my_index"

and then search using:

semble search "where is persistency defined?" --index "my_index"

This greatly speeds up subsequent searches. Loading a decently-sized index takes 200-400ms.

Adds persistence to the embedding backend. This was necessary because we override the BasicBackend a little weirdly. This is something we could consider doing differently.
Adds persistence to the index itself. All components are saved in subfolders, except the model, for which we save only its name. To facilitate saving, I added helpers to Chunk. These are thin wrappers around asdict and a dictionary expansion.
Removes the Encoder protocol: this no longer made sense because we use the saving and loading methods in model2vec.
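
As a rough illustration of the Chunk helpers described above (thin wrappers around asdict and a dictionary expansion), a minimal sketch could look like this. The field names and the method names to_dict and from_dict are assumptions made for the example; the actual Chunk in semble may differ.

```python
from dataclasses import asdict, dataclass
from typing import Any


@dataclass(frozen=True)
class Chunk:
    # Hypothetical fields, chosen only for illustration.
    file_path: str
    start_line: int
    end_line: int
    content: str

    def to_dict(self) -> dict[str, Any]:
        """Serialize the chunk to a plain dictionary (thin wrapper around asdict)."""
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "Chunk":
        """Rebuild a chunk by expanding the dictionary into keyword arguments."""
        return cls(**data)


# Round trip: serialize a chunk and rebuild it from the resulting dictionary.
chunk = Chunk(file_path="src/app.py", start_line=1, end_line=3, content="import os")
restored = Chunk.from_dict(chunk.to_dict())
```

Because both helpers delegate to the dataclass machinery, adding a field to Chunk would not require touching the serialization code, although indexes saved before the change would then lack that key.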

The ugly part of this is that I chose to refactor a large part of the code: we now no longer pass a model to the index when building it. Instead, I pass a path to the model. This path is then used to load the model, falling back to the default model when it is None. I think this is a more elegant construction, since it allows us to store the model path and also to cache model loading more efficiently.

Follow-up tasks:

Not all benchmarks work, but the most important ones (i.e., the regular one and ablations) still work.

codecov · 2026-05-21T14:48:50Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

Files with missing lines	Coverage Δ
src/semble/__init__.py	`100.00% <100.00%> (ø)`
src/semble/cli.py	`100.00% <100.00%> (ø)`
src/semble/index/create.py	`100.00% <100.00%> (ø)`
src/semble/index/dense.py	`100.00% <100.00%> (ø)`
src/semble/index/index.py	`100.00% <100.00%> (ø)`
src/semble/index/types.py	`100.00% <100.00%> (ø)`
src/semble/mcp.py	`100.00% <100.00%> (ø)`
src/semble/search.py	`100.00% <100.00%> (ø)`
src/semble/types.py	`100.00% <100.00%> (ø)`


Pringled

Very very nice, some minor things

Pringled · 2026-05-21T15:23:59Z

+        if root_path:
+            root_path = Path(root_path)
+
+        model = StaticModel.from_pretrained(model_path)


Should this not just call load_model, so that you disable the progress bars and set force download?

Pringled · 2026-05-21T15:26:01Z

+        persistence_paths = PersistencePath.from_path(path)
+        bm_25_index = BM25.load(persistence_paths.bm25_index)
+        semantic_index = SelectableBasicBackend.load(persistence_paths.semantic_index)
+        metadata = orjson.loads(open(persistence_paths.metadata).read())


I guess with open() is a slightly safer pattern but I've honestly never seen any issues in practice with this so yeah... whatever you want
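
For reference, the context-manager variant mentioned above could be sketched as follows. This example uses the standard-library json module as a stand-in for orjson so that it runs without extra dependencies, and read_metadata is a hypothetical helper name, not something from the PR.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def read_metadata(path: Path) -> dict[str, Any]:
    """Read a JSON metadata file; the handle is closed even if parsing fails."""
    with open(path, "rb") as handle:
        return json.loads(handle.read())


# Usage: write a small metadata file to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    metadata_path = Path(tmp) / "metadata.json"
    metadata_path.write_text(json.dumps({"model_path": None}))
    loaded = read_metadata(metadata_path)
```

In CPython the bare open(...).read() form usually closes the file promptly through reference counting, which is probably why it rarely causes problems in practice; the with form just makes the close explicit.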

Pringled · 2026-05-21T15:27:45Z

+    def load_from_disk(cls: type[SembleIndex], path: Path | str) -> SembleIndex:
+        """Load the index from disk."""
+        path = Path(path)
+        if not path.exists():


Hmm, is this enough to check? If you use a wrong path here that does exist (but without the data), you get a nasty error when it tries to load the stuff later. Then again, checking it adds a bunch of ifs, I guess.
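
One way to do the stricter check without a long chain of ifs is to compare the directory against a list of expected components. The sketch below is only an illustration: the component names and both function names are hypothetical, and the PR's real on-disk layout may differ.

```python
import tempfile
from pathlib import Path

# Hypothetical component names; the real layout of a saved index may differ.
EXPECTED_COMPONENTS = ("bm25_index", "semantic_index", "metadata.json")


def missing_components(path: Path) -> list[str]:
    """Return the expected components that are absent from the directory."""
    return [name for name in EXPECTED_COMPONENTS if not (path / name).exists()]


def check_index_dir(path: Path) -> Path:
    """Raise a descriptive error if the path is not a complete index directory."""
    if not path.is_dir():
        raise FileNotFoundError(f"No index directory at {path}")
    missing = missing_components(path)
    if missing:
        raise FileNotFoundError(f"{path} is not a complete index; missing: {', '.join(missing)}")
    return path


# Usage: an existing but empty directory is reported as missing every component.
with tempfile.TemporaryDirectory() as tmp:
    missing_in_empty_dir = missing_components(Path(tmp))
```

This keeps the validation in a single loop, so adding a component to the index only means adding a name to the tuple.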

Pringled · 2026-05-21T15:28:45Z

    asyncio.run(serve(args.path, ref=args.ref, include_text_files=args.include_text_files))


+def _run_index(*, path: str, include_text_files: bool = False, out: str) -> None:


Do we intentionally only support local paths or do we also want to do git urls? If we don't want to support it I think we should maybe document that or raise an error here if someone does try to do a git url
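
If the decision is to raise an error, a guard along these lines could work. It is a heuristic sketch, not the PR's code: looks_like_git_url and ensure_local_path are hypothetical names, and the prefix list is an assumption about which remote forms should be rejected.

```python
# Assumed set of prefixes that mark a remote Git location rather than a local path.
REMOTE_PREFIXES = ("http://", "https://", "ssh://", "git://", "git@")


def looks_like_git_url(path: str) -> bool:
    """Heuristically detect a remote Git location."""
    return path.startswith(REMOTE_PREFIXES)


def ensure_local_path(path: str) -> str:
    """Return the path unchanged, or raise if it looks like a Git URL."""
    if looks_like_git_url(path):
        raise ValueError(f"Indexing only supports local paths, got a Git URL: {path}")
    return path


# Usage: local paths pass through untouched.
local = ensure_local_path("./my_repo")
```

Failing early here gives the user a clear message, rather than a confusing error later when the path is treated as a directory.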

Pringled · 2026-05-21T15:31:43Z

-        model: Encoder | None = None,
        extensions: Sequence[str] | None = None,
        include_text_files: bool = False,
+        model_path: str | None = None,


Technically a breaking API change but I don't think anyone uses Semble as a programmatic API tbh wdyt?

Pringled · 2026-05-21T15:40:31Z

        """Pre-load the model and optionally pre-index the default source in parallel with starting the server."""
        try:
-            cache._model = await asyncio.to_thread(load_model)
+            _, cache._model_path = await asyncio.to_thread(load_model)


load_model() here caches on (), but later _IndexCache.get calls load_model("minishlab/potion-code-16M") via from_path, so it won't hit the cache. I think we can just use None everywhere; that should be safe, and then we do have a working cache?
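
The cache miss described above can be reproduced with functools.lru_cache, which keys on the exact call arguments. The sketch below assumes load_model is cached in that way (the thread does not show its implementation) and uses a string as a stand-in for the real model. It also shows a variant of the suggested fix: resolving None to the default before the cached call, so that every spelling of the default shares one entry.

```python
from functools import lru_cache
from typing import Optional

DEFAULT_MODEL = "minishlab/potion-code-16M"

naive_calls: list[str] = []
load_calls: list[str] = []


@lru_cache(maxsize=None)
def naive_load(model_path: Optional[str] = None) -> str:
    """Cached directly on the arguments: () and (DEFAULT_MODEL,) are different keys."""
    resolved = model_path or DEFAULT_MODEL
    naive_calls.append(resolved)
    return f"model<{resolved}>"  # stand-in for the real model object


@lru_cache(maxsize=None)
def _load_resolved(model_path: str) -> str:
    """Cached on the already-resolved path only."""
    load_calls.append(model_path)
    return f"model<{model_path}>"


def load_model(model_path: Optional[str] = None) -> tuple[str, str]:
    """Resolve None to the default first, then hit the shared cache entry."""
    resolved = model_path or DEFAULT_MODEL
    return _load_resolved(resolved), resolved


# The naive version loads the same model twice.
naive_load()
naive_load(DEFAULT_MODEL)

# The normalizing version loads it once, however the default is spelled.
load_model()
load_model(None)
load_model(DEFAULT_MODEL)
```

Passing None everywhere, as suggested, achieves the same effect with less code, as long as no caller ever passes the default name explicitly.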

feat: add index persistence

eaa1bc7

stephantul mentioned this pull request May 21, 2026

Tree-sitter chunker fails on deep C++ ASTs (recursion → segfault); no per-file isolation; no CLI index persistence #135

Open

stephantul requested a review from Pringled May 21, 2026 14:48

Pringled approved these changes May 21, 2026


stephantul wants to merge 1 commit into main from add-persistence


2 participants
