Mixpeek

Mixpeek is flexible search infrastructure that's built to scale with you.

Give your agents eyes and ears.

Mixpeek breaks every video, image, and audio file into structured features
your agents can search, reason over, and trust.

Docs · Get Started · Quickstart · Blog


What is Mixpeek?

Mixpeek is multimodal infrastructure for AI agents. Upload video, images, audio, and documents — Mixpeek automatically extracts features (faces, objects, transcripts, embeddings, structured metadata) and indexes them into searchable collections. Your agent queries a single endpoint and gets structured results back.

Index → Upload files to buckets. Mixpeek runs feature extraction automatically — faces, objects, transcripts, embeddings, and structured metadata all get indexed.

Search → Build retrieval pipelines. Semantic search, face search, object search, transcript search — chain them into multi-stage retrievers exposed as a single endpoint.

Integrate → Wire Mixpeek into your agent as a LangChain tool, an MCP server, or a direct REST call.
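The Integrate step can be sketched as a thin tool wrapper that an agent framework calls into. This is illustrative only: the wrapper class and its field names are invented here, and `client` stands in for anything exposing the `retrievers.execute(retriever_id=..., inputs=..., limit=...)` call shown in the quickstart below.

```python
from dataclasses import dataclass


@dataclass
class MixpeekSearchTool:
    """Illustrative agent-tool wrapper around a Mixpeek-style client.

    `client` is any object exposing retrievers.execute(...) with the
    keyword arguments used in the quickstart; in practice you would pass
    the real SDK client here.
    """
    client: object
    retriever_id: str
    limit: int = 10

    def run(self, query_text: str):
        # One call hits the single retriever endpoint; the multi-stage
        # pipeline behind it is configured server-side.
        return self.client.retrievers.execute(
            retriever_id=self.retriever_id,
            inputs={"query_text": query_text},
            limit=self.limit,
        )
```

Registering `run` as a LangChain tool or an MCP tool is then a matter of the framework's own tool-definition boilerplate.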

Quickstart

pip install mixpeek

from mixpeek import Mixpeek

mx = Mixpeek(api_key="YOUR_API_KEY")

# Upload a video
mx.buckets.upload(bucket_id="my-bucket", file_path="video.mp4")

# Search across all extracted features
results = mx.retrievers.execute(
    retriever_id="my-retriever",
    inputs={"query_text": "person wearing a red jacket"},
    limit=10,
)

Also available as:

  • JavaScript SDK: npm install mixpeek
  • MCP Server: Connect Claude, Cursor, or any MCP-compatible agent
  • REST API: POST https://api.mixpeek.com/v1/retrievers/{id}/execute
  • CLI: mixpeek --version (included in the Python SDK)
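For the raw REST option, the request can be assembled with nothing but the standard library. A minimal sketch, assuming the endpoint above and the request body from the Python quickstart; the `Bearer` authorization scheme is an assumption, so check the API docs for the real auth header.

```python
import json
import urllib.request

API_BASE = "https://api.mixpeek.com/v1"  # from the REST endpoint above


def build_execute_request(retriever_id: str, query_text: str,
                          api_key: str = "YOUR_API_KEY", limit: int = 10):
    """Build (but do not send) the POST for /retrievers/{id}/execute."""
    payload = {"inputs": {"query_text": query_text}, "limit": limit}
    return urllib.request.Request(
        f"{API_BASE}/retrievers/{retriever_id}/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Auth scheme assumed for the sketch; verify against the docs.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending it is then one `urllib.request.urlopen(req)` call.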

What Gets Extracted

  • Video: Face embeddings (ArcFace), scene descriptions (Gemini), visual embeddings (Vertex AI), transcripts (Whisper), keyframes
  • Images: Visual embeddings (SigLIP / Vertex AI), face embeddings (ArcFace), OCR, descriptions, structured extraction
  • Audio: Transcripts (Whisper), transcript embeddings (E5-Large), multimodal audio embeddings
  • Documents: Text chunks, text embeddings (E5-Large), OCR for scanned PDFs, structured extraction

Each extracted feature becomes an independently searchable document. A single video can produce hundreds of documents — one per face, one per transcript segment, one per scene.
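The one-document-per-feature fan-out can be sketched in a few lines. Everything here is illustrative: the field names and feature-type labels are invented for the sketch, not the actual index schema.

```python
def fan_out(asset_id: str, features: dict) -> list:
    """Illustrative fan-out: one media asset -> one document per extracted feature.

    `features` maps a feature type (e.g. "face", "transcript_segment",
    "scene") to the list of values extracted from the asset.
    """
    docs = []
    for feature_type, items in features.items():
        for i, item in enumerate(items):
            docs.append({
                "asset_id": asset_id,       # links the doc back to its source file
                "feature_type": feature_type,
                "feature_index": i,          # position within this feature stream
                "payload": item,
            })
    return docs
```

A long video with many faces, transcript segments, and scenes produces a correspondingly long list, which is why a single upload can yield hundreds of searchable documents.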

Architecture

┌─────────────────────────────────────────────────────────┐
│                      Your Agent                         │
│         (LangChain · MCP · REST · SDK · CLI)            │
└──────────────────────┬──────────────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────────────┐
│                   API Layer                              │
│            FastAPI + Celery Workers                      │
│   Buckets · Collections · Retrievers · Webhooks         │
└──────────────────────┬──────────────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────────────┐
│                 Engine Layer                             │
│                Ray Serve Cluster                         │
│   SigLIP · ArcFace · Whisper · Gemini · E5 · LayoutLM  │
└──────────────────────┬──────────────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────────────┐
│                    Storage                               │
│       MongoDB · Qdrant · Redis · S3-compatible          │
└─────────────────────────────────────────────────────────┘
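The layering above reads as a simple pipeline: the API layer accepts and enqueues work, the engine layer runs the models, and the storage layer persists what comes out. A toy sketch of that flow, with every interface invented for illustration; `models`, `vector_store`, and `doc_store` stand in for the Ray Serve cluster and the Qdrant/MongoDB backends.

```python
def ingest(file_path: str, models: dict, vector_store: list, doc_store: dict):
    """Toy walk through the three layers for one uploaded file."""
    # API layer: in the real system a Celery worker picks this up asynchronously.
    stages = ["api:accepted"]

    # Engine layer: each model in the serving cluster emits one feature stream.
    features = {name: model(file_path) for name, model in models.items()}
    stages.append("engine:extracted")

    # Storage layer: embeddings to the vector store, metadata to the doc store.
    for name, value in features.items():
        vector_store.append((file_path, name, value))
    doc_store[file_path] = sorted(features)
    stages.append("storage:indexed")
    return stages
```

The point of the sketch is the ordering: nothing is searchable until all three layers have run, which is why indexing is asynchronous while retrieval is a single synchronous endpoint.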

Integrations

Object Storage: AWS S3, Google Cloud Storage, Azure Blob, Cloudflare R2, Backblaze B2, Supabase, Wasabi, Tigris, Mux, Box

Agent Frameworks: LangChain, MCP (Model Context Protocol), OpenAI Function Calling, direct REST

Data Warehouses: Snowflake, Databricks

Use Cases

  • Video understanding — Search surveillance footage by face, scene, or spoken word
  • Content moderation — Detect brand logos, faces, and unsafe content across media libraries
  • Document intelligence — Extract structured data from scanned PDFs, invoices, and forms
  • Media asset management — Find the exact frame across millions of hours of video
  • E-commerce — Visual similarity search, product matching, catalog enrichment

Related Repos

  • mixpeek/recipes: Ready-to-run examples
  • mixpeek/use-cases: End-to-end demos
  • mixpeek/showcase: Community showcase
  • mixpeek/multimodal-benchmarks: Benchmarks

Resources

Pinned Repositories

  1. awesome-object-storage: A curated, opinionated guide to S3-compatible object storage — 21 providers, pricing, features, gotchas, and an interactive comparison tool.

  2. amux: Open-source Claude Code agent multiplexer — run dozens of parallel AI coding agents unattended via tmux.

  3. awesome-multimodal-search: Collections of multimodal search libraries, services, and research papers.

  4. multimodal-tools: 🧰 Simple, standalone tools for working with multimodal data: video, audio, image, and text.

  5. multimodal-inference-server: Production-grade Rust inference server for multimodal models (image + text → streamed text), with OpenAI-compatible APIs and high-throughput GPU scheduling.

  6. video-embedding-benchmark: Head-to-head benchmark of multimodal embedding models for text-to-video retrieval. 6 models, 20 CC0 videos, 60 queries, reproducible IR metrics (NDCG, MRR, Recall).
