-
Notifications
You must be signed in to change notification settings - Fork 5
How the Interactive Explorer queries work (tutorial for contributors) #82
Description
How the Interactive Explorer Queries Work
A walkthrough of the data architecture and SQL queries powering the Interactive Explorer, for contributors who want to understand or extend it.
Source code: tutorials/isamples_explorer.qmd
Architecture: Two-Tier Loading
The Explorer uses DuckDB-WASM — a full SQL engine running in the browser. No server is involved. It queries parquet files on data.isamples.org using HTTP range requests (only downloads the bytes it needs).
Tier 1 — Instant facet counts (2 KB):
data.isamples.org/isamples_202601_facet_summaries.parquet
Pre-computed counts for each filter category (source, material, sampled feature, specimen type). Loads in milliseconds so the sidebar filters appear immediately.
Tier 2 — Full sample data (278 MB, but only partial bytes transferred):
data.isamples.org/isamples_202601_wide.parquet
The wide-format PQG file with 20M rows. DuckDB-WASM uses range requests to read only the columns and rows matching the current filters — typically a few MB of actual transfer even though the file is 278 MB.
The Key Queries
1. Facet summaries (Tier 1)
SELECT * FROM read_parquet('https://data.isamples.org/isamples_202601_facet_summaries.parquet')Returns rows like:
| facet_type | facet_value | count |
|---|---|---|
| source | SESAR | 3,428,639 |
| source | OPENCONTEXT | 1,096,127 |
| material | Rock | 1,234,567 |
| context | Earth interior | 890,123 |
| object_type | Physical specimen | 2,345,678 |
These populate the checkbox counts in the sidebar before the big parquet file loads.
2. Create a SQL view over the remote parquet
CREATE VIEW samples AS
SELECT * FROM read_parquet('https://data.isamples.org/isamples_202601_wide.parquet')This doesn't download the file — it registers it as a virtual table. DuckDB-WASM only fetches data when you actually query it.
3. Dynamic WHERE clause from filters
The WHERE clause is built dynamically from the current UI state:
-- Always applied (base conditions):
otype = 'MaterialSampleRecord'
AND latitude IS NOT NULL
-- Text search (when user types in search box):
AND (label ILIKE '%pottery%'
OR description ILIKE '%pottery%'
OR CAST(place_name AS VARCHAR) ILIKE '%pottery%')
-- Source filter (checkbox selection):
AND n IN ('SESAR', 'OPENCONTEXT')
-- Material filter:
AND has_material_category IN ('Rock', 'Ite Biology')
-- Sampled feature filter:
AND has_context_category IN ('Earth interior')
-- Specimen type filter:
AND has_specimen_category IN ('Physical specimen')Key column mappings in the wide parquet:
n→ source name (SESAR, OPENCONTEXT, GEOME, SMITHSONIAN)otype→ entity type (MaterialSampleRecord, SamplingEvent, etc.)has_material_category→ material classificationhas_context_category→ sampled feature / contexthas_specimen_category→ specimen typelabel,description,place_name→ text-searchable fields
4. Count matching records
SELECT COUNT(*) as count
FROM samples
WHERE {whereClause}5. Load sample data for display
SELECT
row_id, pid, label,
COALESCE(description, '') as description,
latitude, longitude,
n as source, place_name
FROM samples
WHERE {whereClause}
ORDER BY RANDOM()
LIMIT {maxSamples} -- default 25,000The ORDER BY RANDOM() ensures a representative geographic spread rather than returning the first N rows (which might cluster in one region). The LIMIT caps browser memory usage.
Data Flow Summary
User adjusts filters
↓
Observable reactivity rebuilds WHERE clause
↓
DuckDB-WASM sends HTTP range requests to data.isamples.org
↓
Only matching columns/row groups are downloaded (~2-10 MB)
↓
Results rendered as Globe (Cesium) / Table / List
↓
URL params updated for bookmarkability
Data Files Reference
| File | Size | Purpose |
|---|---|---|
facet_summaries.parquet |
2 KB | Pre-computed filter counts |
wide.parquet |
278 MB | Full sample data (queried via range requests) |
wide_h3.parquet |
292 MB | Wide + H3 hex indices (used by globe tutorials) |
samples_map_lite.parquet |
60 MB | Lightweight subset for fast map rendering |
All files served from Cloudflare R2 via data.isamples.org with range request + CORS support.
How to Extend
Add a new filter:
- Add a new facet type to the
facet_summariesparquet (generated by PQG pipeline) - Add a checkbox widget in the Filters section
- Add a condition to the
whereClausebuilder
Add a new view mode:
- Add option to the
viewModeradio selector - Add rendering branch in the results display block
Change data source: Update parquet_url and facet_summaries_url at the top of the file.