62 changes: 51 additions & 11 deletions README.md
@@ -63,6 +63,20 @@ semble search "save_pretrained" ./my-project
 semble search "save model to disk" ./my-project --top-k 10
 ​```

+If you anticipate doing more than one search, use `semble index` to create an index.
+
+​```bash
+semble index ./my-project -o my_index
+​```
+
+You can then reuse this index later on:
+
+​```bash
+semble search "save_pretrained" --index my_index
+​```
+
+An index is not automatically updated, so if the code changes significantly, reindex. If you notice stale results while resolving searches to files, reindex.
+
 Use `--content docs` to search documentation and prose, `--content config` for config files (yaml, toml, etc.), or `--content all` to search code, docs, and config:

 ​```bash
@@ -77,17 +91,20 @@ Use `semble find-related` to discover code similar to a known location (pass `fi
 semble find-related src/auth.py 42 ./my-project
 ​```

+Like search, `find-related` also accepts an `--index` argument.
+
 `path` defaults to the current directory when omitted; git URLs are accepted.

 If `semble` is not on `$PATH`, use `uvx --from "semble[mcp]" semble` in its place.

 ### Workflow

-1. Start with `semble search` to find relevant chunks.
-2. Use `--content docs` for documentation, `--content config` for config files, or `--content all` for everything.
-3. Inspect full files only when the returned chunk is not enough context.
-4. Optionally use `semble find-related` with a promising result's `file_path` and `line` to discover related implementations.
-5. Use grep only when you need exhaustive literal matches or quick confirmation of an exact string.
+1. Index the repo using `semble index -o cached_index`.
+2. Start with `semble search` to find relevant chunks. Pass the index to achieve results faster.
+3. Use `--content docs` for documentation, `--content config` for config files, or `--content all` for everything.
+4. Inspect full files only when the returned chunk does not give enough context.
+5. Optionally use `semble find-related` with a promising result's `file_path` and `line` to discover related implementations.
+6. Use grep only when you need exhaustive literal matches or quick confirmation of an exact string.
```

 </details>
@@ -318,6 +335,20 @@ semble search "save_pretrained" ./my-project
 semble search "save model to disk" ./my-project --top-k 10
 ​```

+If you anticipate doing more than one search, use `semble index` to create an index.
+
+​```bash
+semble index ./my-project -o my_index
+​```
+
+You can then reuse this index later on:
+
+​```bash
+semble search "save_pretrained" --index my_index
+​```
+
+An index is not automatically updated, so if the code changes significantly, reindex. If you notice stale results while resolving searches to files, reindex.
+
 Use `--content docs` to search documentation and prose, `--content config` for config files (yaml, toml, etc.), or `--content all` to search code, docs, and config:

 ​```bash
@@ -332,17 +363,20 @@ Use `semble find-related` to discover code similar to a known location (pass `fi
 semble find-related src/auth.py 42 ./my-project
 ​```

+Like search, `find-related` also accepts an `--index` argument.
+
 `path` defaults to the current directory when omitted; git URLs are accepted.

 If `semble` is not on `$PATH`, use `uvx --from "semble[mcp]" semble` in its place.

-## Workflow
+### Workflow

-1. Start with `semble search` to find relevant chunks.
-2. Use `--content docs` for documentation, `--content config` for config files, or `--content all` for everything.
-3. Inspect full files only when the returned chunk is not enough context.
-4. Optionally use `semble find-related` with a promising result's `file_path` and `line` to discover related implementations.
-5. Use grep only when you need exhaustive literal matches or quick confirmation of an exact string.
+1. Index the repo using `semble index -o cached_index`.
+2. Start with `semble search` to find relevant chunks. Pass the index to achieve results faster.
+3. Use `--content docs` for documentation, `--content config` for config files, or `--content all` for everything.
+4. Inspect full files only when the returned chunk does not give enough context.
+5. Optionally use `semble find-related` with a promising result's `file_path` and `line` to discover related implementations.
+6. Use grep only when you need exhaustive literal matches or quick confirmation of an exact string.
```

 ### Sub-agent setup
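
The README text added in the hunks above says that an index is not updated automatically and that `uvx --from "semble[mcp]" semble` can stand in for `semble` when it is not on `$PATH`. The Python sketch below shows one way a wrapper script might act on both points: it picks the command prefix, decides whether to reindex, and builds the `semble index` and `semble search` argument lists using only the flags that appear in this diff. The staleness rule (comparing file modification times against the index path) and the helper names `semble_prefix`, `index_is_stale`, and `plan_commands` are assumptions made for illustration and are not part of semble. The sketch also assumes the index is stored outside the project directory.

```python
from __future__ import annotations

import os
import shutil
from pathlib import Path


def semble_prefix() -> list[str]:
    """Use `semble` when it is on PATH, otherwise the uvx fallback from the README."""
    if shutil.which("semble"):
        return ["semble"]
    return ["uvx", "--from", "semble[mcp]", "semble"]


def newest_mtime(root: Path) -> float:
    """Latest modification time of any file under root (0.0 if there are none)."""
    latest = 0.0
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip hidden directories such as .git; they rarely matter for search.
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for name in filenames:
            try:
                latest = max(latest, (Path(dirpath) / name).stat().st_mtime)
            except OSError:
                # Ignore files that vanish or cannot be read (e.g. broken symlinks).
                continue
    return latest


def index_is_stale(project: Path, index: Path) -> bool:
    """Hypothetical heuristic: stale if the index is missing or older than any project file."""
    if not index.exists():
        return True
    return newest_mtime(project) > index.stat().st_mtime


def plan_commands(query: str, project: Path, index: Path) -> list[list[str]]:
    """Build the argument lists to run: reindex first if needed, then search with --index."""
    prefix = semble_prefix()
    commands: list[list[str]] = []
    if index_is_stale(project, index):
        commands.append([*prefix, "index", str(project), "-o", str(index)])
    commands.append([*prefix, "search", query, "--index", str(index)])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`. A modification-time comparison is only a rough proxy for "the code changed significantly", so a script built on it would still reindex more often than strictly needed.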
@@ -365,8 +399,14 @@ If semble is not on `$PATH`, prefix the command with `uvx --from "semble[mcp]"`.
 Semble also ships as a standalone CLI. This is useful in scripts or anywhere you want search results without an MCP session.

```bash
+# Index a local repository
+semble index ./my-project -o my-index
+
 # Search a local repo
 semble search "authentication flow" ./my-project
+# Or with index (significantly faster)
+# the index flag applies to all commands below.
+semble search "authentication flow" --index my-index

 # Search for a symbol or identifier
 semble search "save_pretrained" ./my-project
13 changes: 3 additions & 10 deletions benchmarks/baselines/ablations.py
@@ -5,7 +5,6 @@
 from dataclasses import asdict

 import numpy as np
-from model2vec import StaticModel

 from benchmarks.data import (
     RepoSpec,
@@ -38,8 +37,6 @@
 def _bench(
     repo_tasks: dict[str, list[Task]],
     specs: dict[str, RepoSpec],
-    model: StaticModel,
-    modes: list[str],
     *,
     verbose: bool = False,
 ) -> list[RepoResult]:
@@ -62,7 +59,7 @@ def _bench(
         print(f"\n--- {repo} ---", file=sys.stderr)

         started = time.perf_counter()
-        index = SembleIndex.from_path(spec.benchmark_dir, model=model)
+        index = SembleIndex.from_path(spec.benchmark_dir)
         index_ms = (time.perf_counter() - started) * 1000

         for mode, (alpha, rerank) in sorted(_MODE_PARAMS.items()):
@@ -98,30 +95,26 @@ def _bench(
 def _parse_args() -> argparse.Namespace:
     parser = argparse.ArgumentParser(description="semble ablation benchmarks.")
     add_filter_args(parser, verbose=True)
-    parser.add_argument(
-        "--mode", action="append", default=[], choices=sorted(_MODE_PARAMS), help="Mode(s) to evaluate (default: all)."
-    )
     return parser.parse_args()


 def main() -> None:
     """Run the semble ablation benchmarks."""
     args = _parse_args()
-    modes = args.mode or sorted(_MODE_PARAMS)

     repo_specs, tasks = load_filtered_tasks(args.repo or None, args.language or None)

     print("Loading model...", file=sys.stderr)
     started = time.perf_counter()
-    model = StaticModel.from_pretrained(_DEFAULT_MODEL_NAME)
     print(f"Loaded in {(time.perf_counter() - started) * 1000:.0f}ms", file=sys.stderr)
     print(file=sys.stderr)

-    results = _bench(grouped_tasks(tasks), repo_specs, model, modes, verbose=args.verbose)
+    results = _bench(grouped_tasks(tasks), repo_specs, verbose=args.verbose)

     if not results:
         return

+    modes = sorted(_MODE_PARAMS)
     print(file=sys.stderr)
     for mode in modes:
         mode_results = [r for r in results if r.mode == mode]
7 changes: 3 additions & 4 deletions benchmarks/baselines/coderankembed.py
@@ -63,7 +63,6 @@ class RepoResult:
 def _evaluate(
     index: SembleIndex,
     tasks: list[Task],
-    mode: str,
     *,
     verbose: bool = False,
 ) -> tuple[float, float, list[float], dict[str, float]]:
@@ -78,7 +77,7 @@ def _evaluate(
         results: list[SearchResult] = []
         for _ in range(_LATENCY_RUNS):
             started = time.perf_counter()
-            results = index.search(task.query, top_k=_TOP_K, mode=mode)
+            results = index.search(task.query, top_k=_TOP_K)
             query_latencies.append((time.perf_counter() - started) * 1000)
         latencies.append(float(np.median(query_latencies)))

@@ -176,12 +175,12 @@ def _bench(
         print(f"\n--- {repo} ---", file=sys.stderr)

         started = time.perf_counter()
-        index = SembleIndex.from_path(spec.benchmark_dir, model=model)
+        index = SembleIndex.from_path(spec.benchmark_dir)
         index_ms = (time.perf_counter() - started) * 1000

         repo_results: list[RepoResult] = []
         for mode in modes:
-            ndcg5, ndcg10, latencies, by_category = _evaluate(index, tasks, mode, verbose=verbose)
+            ndcg5, ndcg10, latencies, by_category = _evaluate(index, tasks, verbose=verbose)
             p50, p90 = np.percentile(latencies, [50, 90]).tolist()
             result = RepoResult(
                 repo=repo,