feat: add index persistence#140
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests.
🚀 New features to boost your workflow:
|
Pringled
left a comment
There was a problem hiding this comment.
Very very nice, some minor things
| if root_path: | ||
| root_path = Path(root_path) | ||
|
|
||
| model = StaticModel.from_pretrained(model_path) |
There was a problem hiding this comment.
Should this not just call load_model? So you disable the progressbars and set force download?
| persistence_paths = PersistencePath.from_path(path) | ||
| bm_25_index = BM25.load(persistence_paths.bm25_index) | ||
| semantic_index = SelectableBasicBackend.load(persistence_paths.semantic_index) | ||
| metadata = orjson.loads(open(persistence_paths.metadata).read()) |
There was a problem hiding this comment.
I guess with open() is a slightly safer pattern but I've honestly never seen any issues in practice with this so yeah... whatever you want
| def load_from_disk(cls: type[SembleIndex], path: Path | str) -> SembleIndex: | ||
| """Load the index from disk.""" | ||
| path = Path(path) | ||
| if not path.exists(): |
There was a problem hiding this comment.
Hmm is this enough to check? If you use the wrong path here that does exist (but without the data) you get a nasty error when it tries to load the stuff later. Then again checking it adds a bunch of ifs iguess
| asyncio.run(serve(args.path, ref=args.ref, include_text_files=args.include_text_files)) | ||
|
|
||
|
|
||
| def _run_index(*, path: str, include_text_files: bool = False, out: str) -> None: |
There was a problem hiding this comment.
Do we intentionally only support local paths or do we also want to do git urls? If we don't want to support it I think we should maybe document that or raise an error here if someone does try to do a git url
| model: Encoder | None = None, | ||
| extensions: Sequence[str] | None = None, | ||
| include_text_files: bool = False, | ||
| model_path: str | None = None, |
There was a problem hiding this comment.
Technically a breaking API change but I don't think anyone uses Semble as a programmatic API tbh wdyt?
| """Pre-load the model and optionally pre-index the default source in parallel with starting the server.""" | ||
| try: | ||
| cache._model = await asyncio.to_thread(load_model) | ||
| _, cache._model_path = await asyncio.to_thread(load_model) |
There was a problem hiding this comment.
load_model() here caches on (), but later _IndexCache.get calls load_model("minishlab/potion-code-16M") via from_path so it won't hit the cache, I think we can just use None everywhere, should be safe and then we do have a working cache?
This PR adds index persistence for use in the CLI, which is a feature requested by several users.
This is a big PR! Sorry for that. This PR:
indexcommand to the CLI.You can now index a repository as follows:
semble index -o "my_index"and then search using:
This greatly speeds up subsequent searches. Loading a decently-sized index takes 200-400ms.
Chunk. These are thin wrappers aroundasdictand a dictionary expansion.Encoderprotocol: this no longer made sense because we use the saving and loading methods inmodel2vec.The ugly part of this is that I chose to refactor a large part of the code: we now now longer pass a
modelto the index when building it. Instead I use a path to the model. This path is then used to load the model, and reverts to the default model whenNone. This is a more elegant construction I think, since this allows us to store the model path, and also cache model loading more efficiently.Follow-up tasks: