Skip to content

Latest commit

 

History

History
499 lines (379 loc) · 18.7 KB

File metadata and controls

499 lines (379 loc) · 18.7 KB

peekdocs Library API Reference

Latest release

Use peekdocs — a privacy-first local document search and analysis tool — programmatically from Python code. For CLI and GUI usage, see the User Guide or README.

Prerequisites. Python 3.10+ with peekdocs installed via pipx or pip — see Installation. Run peekdocs --check to verify your install is healthy before scripting against the API.

API at a Glance

Every search workflow available in the CLI and GUI is also available in the Python API:

Workflow API Function CLI Flag GUI
Single search search() peekdocs "term" Search bar
Named search suite run_suite("name") --suite "name" Tools → Search Suites
Named regex collection run_regex_collection("name") --regex-collection "name" Regex Search → Restore
List suites list_suites(directory) Tools → Search Suites
List regex collections list_regex_collections() --regex-collection --list Regex Search → Restore

Quick examples

from peekdocs import search, run_suite, run_regex_collection
from peekdocs import list_suites, list_regex_collections

def main():
    # Single search
    result = search(["budget"], directory="/path/to/docs", recursive=True)
    print(f"{len(result.matches)} matches")

    # Run a named search suite
    suite = run_suite("Weekly Code Scan", directory="/path/to/docs")
    for sr in suite.search_results:
        print(f"  {sr.search_name}: {len(sr.matches)} matches")

    # Run a named regex collection
    collection = run_regex_collection("Code Patterns", directory="/path/to/docs", recursive=True)
    for pr in collection.pattern_results:
        print(f"  {pr.name}: {len(pr.matches)} matches")

    # List what's available
    print("Suites:", list_suites("/path/to/docs"))
    print("Regex collections:", list_regex_collections())

if __name__ == "__main__":
    main()

See the sections below for full parameter details, return values, and error handling.

Table of Contents

Complete Working Example

This example is available as samples/api_example.py and can be run directly:

"""Example: Using the peekdocs Python API to search documents programmatically."""

from peekdocs import search


def main():
    # Basic search — find "budget" in the current directory
    result = search(["budget"], directory=".")

    print(f"Files searched: {len(result.files_searched)}")
    print(f"Matches found: {len(result.matches)}")
    print(f"Elapsed: {result.elapsed:.2f}s")
    print()

    # Print each match
    for match in result.matches[:10]:  # first 10 matches
        print(f"  {match.filename}, line {match.line_num}: {match.text[:80]}")

    print()

    # Advanced search — AND mode, recursive, only PDFs and Word docs
    result = search(
        ["invoice", "payment"],
        directory=".",
        match_all=True,         # AND mode — both terms must appear on the same line
        recursive=True,         # search subfolders
        file_types=[".pdf", ".docx"],  # only PDFs and Word docs
    )

    print(f"AND search: {len(result.matches)} match(es) in {len(result.files_searched)} file(s)")

    # Regex search — find invoice numbers like INV-12345
    result = search(
        [r"INV-\d{4,}"],
        directory=".",
        use_regex=True,
        recursive=True,
    )

    print(f"Invoice pattern: {len(result.matches)} match(es) found")

    # Access match details
    for match in result.matches:
        print(f"  File: {match.filename}")
        print(f"  Line: {match.line_num}")
        print(f"  Text: {match.text}")
        print()


# Required for multiprocessing on macOS and Windows
if __name__ == "__main__":
    main()

Basic Usage

The simplest way to use the API is to call search() with a list of terms and a directory:

from peekdocs import search

def main():
    result = search(["budget", "revenue"], directory="/path/to/docs")

    print(f"Found {len(result.matches)} matches in {len(result.files_searched)} files")
    for match in result.matches:
        print(f"  {match.filename}:{match.line_num}: {match.text}")

# Required for multiprocessing on macOS and Windows
if __name__ == "__main__":
    main()

Note: The if __name__ == "__main__": guard is required because peekdocs uses multiprocessing to search files in parallel. Without it, macOS and Windows will crash with a RuntimeError. See samples/api_example.py for a complete working example.

With Options

search() accepts many keyword arguments for fine-grained control. The example below shows several common patterns:

from peekdocs import search

def main():
    # Wildcard search in specific file types, with subdirectories
    result = search(
        ["budg*"],
        directory="/path/to/docs",
        use_wildcard=True,
        recursive=True,
        file_types=[".pdf", ".docx"],
    )

    # Regex search with AND mode
    result = search(
        [r"\d{3}-\d{3}-\d{4}", "invoice"],
        directory="/path/to/docs",
        use_regex=True,
        match_all=True,
    )

    # Boolean expression search
    result = search(
        [],
        directory="/path/to/docs",
        expression="(budget OR revenue) AND (cost OR profit)",
    )

    # Expression with wildcard mode
    result = search(
        [],
        directory="/path/to/docs",
        expression="budg* AND rev*",
        use_wildcard=True,
    )

    # Range query — filter by value ranges
    result = search(
        ["invoice"],
        directory="/path/to/docs",
        range_filters=["amount:1000..5000", "date:2024-01-01..2024-12-31"],
    )

    # Range-only search (no text terms)
    result = search(
        [],
        directory="/path/to/docs",
        range_filters=["amount:1000..5000"],
    )

    # Range specs inside boolean expressions
    result = search(
        [],
        directory="/path/to/docs",
        expression="budget AND amount:1000..5000",
    )

    # Filename range — filter files by date in filename
    result = search(
        ["budget"],
        directory="/path/to/docs",
        range_filters=["fn:date:2024-01-01..2024-12-31"],
    )

    # Search emails for structured reference IDs (e.g., ORD-12345, REF-9876)
    result = search(
        [r"\b[A-Z]{2,3}-\d{4,}\b"],
        directory="/path/to/exported-emails",
        use_regex=True,
        file_types=[".eml", ".msg", ".pst"],
    )

    # Search inside ZIP archives
    result = search(
        ["confidential"],
        directory="/path/to/docs",
        file_types=[".zip", ".7z"],
    )

    # Search legacy and modern Office files together
    result = search(
        ["budget"],
        directory="/path/to/docs",
        file_types=[".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx"],
    )

    # Progress tracking
    def on_progress(done, total, filename):
        print(f"  [{done}/{total}] {filename}")

    result = search(["error"], directory="/var/log", progress=on_progress)

# Required for multiprocessing on macOS and Windows
if __name__ == "__main__":
    main()

Parameters

Parameter Type Default Description
search_terms list[str] (required) Terms to search for (pass [] when using expression or range_filters)
directory str Current directory Directory to search in
match_all bool False Require ALL terms (AND mode)
expression str None Boolean expression with AND, OR, NOT, parentheses, and range specs (e.g. "(budget OR revenue) AND NOT draft", "budget AND amount:1000..5000")
recursive bool False Search subdirectories
use_regex bool False Treat terms as regex patterns
use_fuzzy bool False Approximate matching
use_wildcard bool False Wildcard patterns (* and ?)
use_whole_word bool False Whole-word matching — matches complete words only
use_ocr bool False OCR for scanned PDFs and images
exclude_terms list[str] None Exclude lines matching these terms
file_types list[str] None Limit to these extensions (e.g. [".pdf", ".docx"])
file_names list[str] None Search only these specific files
context_before int 0 Lines to include before each match. What counts as a "line" follows the unit peekdocs indexes per file format: a literal line for plain text and source code, a paragraph for Word (.docx) and PDF, a row for Excel. Paragraph-heavy formats can include several sentences per "line."
context_after int 0 Lines to include after each match. Same per-format meaning as context_before — see above.
proximity int 0 Require terms within N words of each other
line_proximity int 0 Require terms within N lines of each other (the line-proximity counterpart to proximity, which is word-proximity)
cores int Auto CPU cores for parallel processing
use_index bool Auto Use search index if available
progress callable None Callback progress(done, total, filename)
range_filters list[str] None Range filter specs (e.g. ["amount:1000..5000", "date:2024-01-01..2024-12-31"]). Use fn: prefix for filename ranges (e.g. ["fn:date:2024-01-01..2024-12-31"])
max_file_size_mb int 100 Skip files larger than this (in MB). Prevents slow searches and memory issues from very large files. Set to 0 for no limit

Return Value

search() returns a SearchResult with these fields:

Field Type Description
matches list[SearchMatch] List of matches found
files_searched list[str] Absolute paths of all files examined
skipped_files list[tuple] Files that couldn't be read: (filename, error_msg)
elapsed float Search time in seconds
used_index bool Whether the indexed search path was used
index_bypass_reason str Non-empty when the index was requested but bypassed (e.g., regex / fuzzy / wildcard / proximity queries fall through to direct scan because FTS5 can't accelerate them). Empty string otherwise
index_stale_notice str Non-empty when the index's stored parameters don't match the current max_file_size_mb (e.g., the index was built with a 100 MB limit but the call passes 0 / no limit). The notice text names the mismatch and points the user at peekdocs --index to rebuild. The search still runs against the existing index; this is informational, not a failure. Empty string otherwise

Each SearchMatch has fields: file_dir, filename, line_num, text.

Search Suites

Run saved search suites programmatically. Suites are groups of saved searches created in the GUI (Tools → Search Suites) and stored per-folder in .peekdocs_collection.json.

List suites in a folder

from peekdocs import list_suites

suites = list_suites("/path/to/docs")
print(suites)  # e.g. {'Weekly Code Scan': ['Find passwords', 'Find TODOs', 'Find drafts']}

Run a suite

from peekdocs import run_suite

def main():
    result = run_suite(
        "Weekly Code Scan",
        directory="/path/to/docs",
    )

    print(f"Suite: {result.suite}")
    print(f"Total matches: {result.total_matches}")
    print(f"Elapsed: {result.elapsed:.2f}s")
    print(f"Skipped searches: {result.skipped_searches}")

    for sr in result.search_results:
        print(f"  {sr.search_name} ({sr.mode}): {len(sr.matches)} matches in {len(sr.files_searched)} files")
        for match in sr.matches[:3]:  # first 3 per search
            print(f"    {match.filename}:{match.line_num}: {match.text[:80]}")

if __name__ == "__main__":
    main()

With progress tracking

from peekdocs import run_suite

def main():
    def on_progress(i, total, name):
        if i < total:
            print(f"  [{i+1}/{total}] {name}")

    result = run_suite(
        "Weekly Code Scan",
        directory="/path/to/docs",
        progress=on_progress,
    )

if __name__ == "__main__":
    main()

run_suite() Parameters

Parameter Type Default Description
name str (required) Name of a saved search suite
directory str Current directory Directory containing the suite's .peekdocs_collection.json
progress callable None Callback progress(search_index, total_searches, search_name)
max_file_size_mb int 100 Skip files larger than this (MB). 0 = no limit

Return Value

run_suite() returns a SuiteResult with these fields:

Field Type Description
suite str Suite name
search_results list[SuiteSearchResult] Per-search results
total_matches int Sum of all search match counts
elapsed float Total time in seconds
skipped_searches list[tuple] Searches not found: (name, reason)

Each SuiteSearchResult has fields: search_name, search_terms, matches (list of SearchMatch), files_searched, elapsed, mode ("ALL" or "ANY").

Error Handling

Exception When
FileNotFoundError No collection file exists in the directory
KeyError Named suite not found (message lists available suites)
ValueError Suite has no searches

Regex Collections

Run saved regex collections programmatically. Collections are created in the GUI (Regex Search → Save Collection As) and stored in ~/.peekdocs_regex_collections.json.

List saved collections

from peekdocs import list_regex_collections

names = list_regex_collections()
print(names)  # e.g. ['Code Patterns', 'Financial', 'Invoice Extraction']

Run a collection

from peekdocs import run_regex_collection

def main():
    result = run_regex_collection(
        "Code Patterns",
        directory="/path/to/project",
        recursive=True,
    )

    print(f"Collection: {result.collection}")
    print(f"Total matches: {result.total_matches}")
    print(f"Elapsed: {result.elapsed:.2f}s")
    print(f"Skipped patterns: {result.skipped_patterns}")

    for pr in result.pattern_results:
        print(f"  {pr.name}: {len(pr.matches)} matches in {pr.files_matched} files")
        for match in pr.matches[:3]:  # first 3 per pattern
            print(f"    {match.filename}:{match.line_num}: {match.text[:80]}")

if __name__ == "__main__":
    main()

With progress tracking

from peekdocs import run_regex_collection

def main():
    def on_progress(i, total, name):
        if i < total:
            print(f"  [{i+1}/{total}] {name}")

    result = run_regex_collection(
        "Code Review",
        directory=".",
        recursive=True,
        progress=on_progress,
    )

if __name__ == "__main__":
    main()

run_regex_collection() Parameters

Parameter Type Default Description
name str (required) Name of a saved regex collection
directory str Current directory Directory to search in
recursive bool False Search subdirectories
progress callable None Callback progress(pattern_index, total_patterns, pattern_name)
max_file_size_mb int 100 Skip files larger than this (MB). 0 = no limit

Return Value

run_regex_collection() returns a CollectionResult with these fields:

Field Type Description
collection str Collection name
pattern_results list[PatternResult] Per-pattern results
total_matches int Sum of all pattern match counts
files_searched list[str] Absolute paths of all files examined
elapsed float Total time in seconds
skipped_patterns list[tuple] Patterns with invalid regex: (name, error_msg)

Each PatternResult has fields: name, regex, matches (list of SearchMatch), files_matched (int).

Error Handling

Exception When
FileNotFoundError No collections file exists (~/.peekdocs_regex_collections.json)
KeyError Named collection not found (message lists available names)
ValueError Collection has no enabled patterns

Notes

  • No match limit: The API returns all matches. The CLI's -m (max matches) flag caps report output only — it does not exist in the API. If you need to limit results, slice result.matches after the search.
  • No inverse mode: Inverse search (listing files that do not contain terms) is a GUI/CLI feature. To achieve the same result with the API, compare result.files_searched against result.matches to find files with zero matches.

Error Handling

search() raises ValueError for invalid parameter combinations (e.g. combining regex + fuzzy) and FileNotFoundError if specified files are not found.

from peekdocs import search

def main():
    try:
        result = search([r"[invalid"], use_regex=True)
    except ValueError as e:
        print(f"Invalid search: {e}")

if __name__ == "__main__":
    main()

Stuck on something try/except won't catch? If your script crashes with ModuleNotFoundError, hangs without finishing, or behaves differently than the CLI, first run peekdocs --check from your terminal — it verifies your Python version, dependencies, Tesseract availability, SQLite, and free disk space, and tells you exactly what's missing. If that's clean, see FAQ & Troubleshooting for common Python-API and install pitfalls (especially the multiprocessing / __main__ guard issue if your script crashes on Mac or Windows).

Next Steps

For richer end-to-end automation patterns, see the User Guide's worked nightly source-tree watch example (a complete cron pipeline using --stdout, --hash, --diff, and an alert step) and the Search Suite Use Cases section (driving suites and regex collections in Python loops). The complete CLI flag reference lives in the Flag Use Summary.