Skip to content

How the Interactive Explorer queries work (tutorial for contributors) #82

@rdhyee

Description

@rdhyee

How the Interactive Explorer Queries Work

A walkthrough of the data architecture and SQL queries powering the Interactive Explorer, for contributors who want to understand or extend it.

Source code: tutorials/isamples_explorer.qmd


Architecture: Two-Tier Loading

The Explorer uses DuckDB-WASM — a full SQL engine running in the browser. No server is involved. It queries parquet files on data.isamples.org using HTTP range requests (only downloads the bytes it needs).

Tier 1 — Instant facet counts (2 KB):

data.isamples.org/isamples_202601_facet_summaries.parquet

Pre-computed counts for each filter category (source, material, sampled feature, specimen type). Loads in milliseconds so the sidebar filters appear immediately.

Tier 2 — Full sample data (278 MB, but only partial bytes transferred):

data.isamples.org/isamples_202601_wide.parquet

The wide-format PQG file with 20M rows. DuckDB-WASM uses range requests to read only the columns and rows matching the current filters — typically a few MB of actual transfer even though the file is 278 MB.


The Key Queries

1. Facet summaries (Tier 1)

Source: lines 346–357

SELECT * FROM read_parquet('https://data.isamples.org/isamples_202601_facet_summaries.parquet')

Returns rows like:

facet_type facet_value count
source SESAR 3,428,639
source OPENCONTEXT 1,096,127
material Rock 1,234,567
context Earth interior 890,123
object_type Physical specimen 2,345,678

These populate the checkbox counts in the sidebar before the big parquet file loads.

2. Create a SQL view over the remote parquet

Source: lines 311–316

CREATE VIEW samples AS
SELECT * FROM read_parquet('https://data.isamples.org/isamples_202601_wide.parquet')

This doesn't download the file — it registers it as a virtual table. DuckDB-WASM only fetches data when you actually query it.

3. Dynamic WHERE clause from filters

Source: lines 388–434

The WHERE clause is built dynamically from the current UI state:

-- Always applied (base conditions):
otype = 'MaterialSampleRecord'
AND latitude IS NOT NULL

-- Text search (when user types in search box):
AND (label ILIKE '%pottery%'
     OR description ILIKE '%pottery%'
     OR CAST(place_name AS VARCHAR) ILIKE '%pottery%')

-- Source filter (checkbox selection):
AND n IN ('SESAR', 'OPENCONTEXT')

-- Material filter:
AND has_material_category IN ('Rock', 'Ite Biology')

-- Sampled feature filter:
AND has_context_category IN ('Earth interior')

-- Specimen type filter:
AND has_specimen_category IN ('Physical specimen')

Key column mappings in the wide parquet:

  • n → source name (SESAR, OPENCONTEXT, GEOME, SMITHSONIAN)
  • otype → entity type (MaterialSampleRecord, SamplingEvent, etc.)
  • has_material_category → material classification
  • has_context_category → sampled feature / context
  • has_specimen_category → specimen type
  • label, description, place_name → text-searchable fields

4. Count matching records

Source: lines 447–455

SELECT COUNT(*) as count
FROM samples
WHERE {whereClause}

5. Load sample data for display

Source: lines 460–500

SELECT
  row_id, pid, label,
  COALESCE(description, '') as description,
  latitude, longitude,
  n as source, place_name
FROM samples
WHERE {whereClause}
ORDER BY RANDOM()
LIMIT {maxSamples}   -- default 25,000

The ORDER BY RANDOM() ensures a representative geographic spread rather than returning the first N rows (which might cluster in one region). The LIMIT caps browser memory usage.


Data Flow Summary

User adjusts filters
    ↓
Observable reactivity rebuilds WHERE clause
    ↓
DuckDB-WASM sends HTTP range requests to data.isamples.org
    ↓
Only matching columns/row groups are downloaded (~2-10 MB)
    ↓
Results rendered as Globe (Cesium) / Table / List
    ↓
URL params updated for bookmarkability

Data Files Reference

File Size Purpose
facet_summaries.parquet 2 KB Pre-computed filter counts
wide.parquet 278 MB Full sample data (queried via range requests)
wide_h3.parquet 292 MB Wide + H3 hex indices (used by globe tutorials)
samples_map_lite.parquet 60 MB Lightweight subset for fast map rendering

All files served from Cloudflare R2 via data.isamples.org with range request + CORS support.


How to Extend

Add a new filter:

  1. Add a new facet type to the facet_summaries parquet (generated by PQG pipeline)
  2. Add a checkbox widget in the Filters section
  3. Add a condition to the whereClause builder

Add a new view mode:

  1. Add option to the viewMode radio selector
  2. Add rendering branch in the results display block

Change data source: Update parquet_url and facet_summaries_url at the top of the file.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions