---
title: "How to Use iSamples"
subtitle: "Get started exploring 6.7 million scientific samples"
number-sections: false
---

## Quick Start {.unnumbered}

1. **Open the [Interactive Explorer](/tutorials/progressive_globe.html)** — a 3D globe loads with clustered sample data
2. **Zoom in** — clusters break into finer detail as you zoom (resolution 4 → 6 → 8 → individual samples)
3. **Filter by source** — use the checkboxes to show/hide data from SESAR, OpenContext, GEOME, or Smithsonian
4. **Click a cluster** — see sample count and nearby samples with links to source records
5. **Click an individual sample** — view metadata and follow the "View at source" link to the original repository
6. **Share your view** — copy the URL to share your exact position, zoom level, and selected sample

## What's in the Data? {.unnumbered}

| Source | Samples | Focus |
|--------|---------|-------|
| **SESAR** | 4.6M | Earth science — rocks, minerals, sediments, soils |
| **OpenContext** | 1M | Archaeology — artifacts, excavation materials |
| **GEOME** | 605K | Biology — genomic and tissue specimens |
| **Smithsonian** | 322K | Natural history — museum collections |

## No Installation Required {.unnumbered}

Everything runs in your browser using:

- **DuckDB-WASM** — a fast analytical database running client-side
- **HTTP range requests** — only the data you need is downloaded (typically < 1 MB to start)
- **Cesium** — 3D globe visualization

Works in Chrome, Firefox, Edge, Safari, and Brave. No plugins, no downloads, no accounts.
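
To make the range-request idea concrete, here is a minimal sketch of how a client can read a Parquet file's footer without downloading the whole file. It relies on the standard Parquet layout, in which a file ends with a 4-byte little-endian metadata length followed by the magic bytes `PAR1`. The helper names `tail_request` and `footer_length` are illustrative, not part of any iSamples code, and DuckDB-WASM's actual request pattern may differ.

```python
import struct
import urllib.request

PARQUET_MAGIC = b"PAR1"


def tail_request(url: str, nbytes: int = 8) -> urllib.request.Request:
    """Build a request for only the last `nbytes` bytes of a remote file."""
    # A suffix range ("bytes=-N") asks the server for the final N bytes.
    return urllib.request.Request(url, headers={"Range": f"bytes=-{nbytes}"})


def footer_length(tail: bytes) -> int:
    """Decode the metadata length from the final 8 bytes of a Parquet file.

    Layout: 4-byte little-endian footer length, then the magic b"PAR1".
    """
    if len(tail) != 8 or tail[4:] != PARQUET_MAGIC:
        raise ValueError("not the tail of a Parquet file")
    (length,) = struct.unpack("<I", tail[:4])
    return length
```

Passing the request to `urllib.request.urlopen` and feeding the 8 returned bytes to `footer_length` tells a reader how many more bytes to fetch for the column metadata. From there it can request only the row groups and columns a query touches.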

## For Developers {.unnumbered}

All code is visible and foldable on tutorial pages. Want to build your own analysis?

- **[Search Explorer](/tutorials/isamples_explorer.html)** — faceted search across all 6.7M samples with cross-filtering
- **[Deep-Dive Analysis](/tutorials/zenodo_isamples_analysis.html)** — statistical exploration with Observable Plot
- **[Tutorials index](/tutorials/)** — step-by-step guides from basic exploration to advanced analysis
- **[GitHub](https://github.com/isamplesorg/)** — all source code and data pipelines
- **[Zenodo](https://zenodo.org/communities/isamples)** — archived datasets for reproducible research

## Data Catalog {.unnumbered}

All files are served from [`data.isamples.org`](https://data.isamples.org/),
backed by Cloudflare R2. A Cloudflare Worker in front of the bucket sets
`Cache-Control: public, max-age=31536000, immutable` on filename-versioned
Parquet files (so browsers and the Cloudflare edge cache aggressively) and
exposes the CORS headers required by DuckDB-WASM's HTTP range requests.

File naming convention: `isamples_<YYYYMM>_<variant>.parquet`. The month
in the filename is the data-generation snapshot, so the content at a dated
URL never changes.
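
The naming convention is regular enough to parse mechanically. This is a small sketch, assuming the variant is everything between the date and the `.parquet` suffix; `Snapshot` and `parse_snapshot` are hypothetical names introduced here for illustration.

```python
import re
from typing import NamedTuple

# isamples_<YYYYMM>_<variant>.parquet
_NAME = re.compile(
    r"^isamples_(?P<year>\d{4})(?P<month>\d{2})_(?P<variant>.+)\.parquet$"
)


class Snapshot(NamedTuple):
    year: int
    month: int
    variant: str


def parse_snapshot(filename: str) -> Snapshot:
    """Split a dated iSamples filename into snapshot year, month, and variant."""
    match = _NAME.match(filename)
    if match is None:
        raise ValueError(f"not a dated iSamples filename: {filename!r}")
    month = int(match["month"])
    if not 1 <= month <= 12:
        raise ValueError(f"invalid month in {filename!r}")
    return Snapshot(int(match["year"]), month, match["variant"])
```

For example, `parse_snapshot("isamples_202601_wide_h3.parquet")` yields year 2026, month 1, and variant `wide_h3`. Undated names such as `wide.parquet` are rejected.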

### Primary datasets {.unnumbered}

The three main files carrying the sample records themselves:

| File | Size | Shape | Rows | Use when you need… |
|---|---:|---|---:|---|
| [`current/wide.parquet`](https://data.isamples.org/current/wide.parquet) ∗ | 292 MB | Wide (one row per entity, nested relationships in `p__*` array columns) | 20 M | General entity queries, UI filtering, description text |
| [`isamples_202601_wide_h3.parquet`](https://data.isamples.org/isamples_202601_wide_h3.parquet) | 292 MB | Wide + H3 BIGINT indices (`h3_res4`, `h3_res6`, `h3_res8`) | 20 M | Geospatial queries with H3 clustering at arbitrary zoom |
| [`isamples_202512_narrow.parquet`](https://data.isamples.org/isamples_202512_narrow.parquet) | 820 MB | Narrow (graph: nodes + explicit `_edge_` rows, s/p/o/n fields) | 106 M | Graph traversals, relationship-centric analysis, PQG work |

∗ `/current/wide.parquet` is a stable alias that HTTP 302-redirects to the
latest dated file (currently
[`isamples_202604_wide.parquet`](https://data.isamples.org/isamples_202604_wide.parquet),
enriched with ~47 K OpenContext thumbnails). The dated filename is
immutable; the alias rotates atomically when we rebuild. Use the alias for
interactive work, the dated URL when you want a pinned, reproducible
reference. The original
[`isamples_202601_wide.parquet`](https://data.isamples.org/isamples_202601_wide.parquet)
(278 MB, no thumbnails) is kept available for historical pinning.
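
If you start from the alias but want to record a pinned URL for reproducibility, one option is to follow the redirect once and store where it lands. This is a minimal sketch using only the standard library; `resolve_alias` is a hypothetical helper, and the `opener` parameter exists only so the function can be exercised without network access.

```python
import urllib.request

ALIAS = "https://data.isamples.org/current/wide.parquet"


def resolve_alias(url: str = ALIAS, opener=urllib.request.urlopen) -> str:
    """Follow redirects from `url` and return the final (dated) URL."""
    # urlopen follows the 302 automatically; geturl() reports the final URL.
    # The body is never read here, and leaving the `with` block closes the
    # connection instead of downloading the full file.
    with opener(url) as response:
        return response.geturl()
```

Calling `resolve_alias()` with the defaults contacts the live server. Recording the returned dated URL next to your results lets a later reader run the same query against the same bytes.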

All three represent the same underlying data (SESAR + OpenContext + GEOME
+ Smithsonian) with identical semantics — they differ only in serialization
strategy. See the
[Technical: Narrow vs Wide tutorial](/tutorials/narrow_vs_wide_performance.html)
for a performance comparison.
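
The difference between the two serializations can be sketched in a few lines of plain Python. The example below folds narrow-style edge rows (subject `s`, predicate `p`, object `o`) into wide-style rows with one `p__<predicate>` list per entity. The identifiers and predicate names are invented for illustration, the `n` field is ignored, and the real files may lay out these columns differently.

```python
from collections import defaultdict

# Hypothetical narrow-format edge rows: one row per relationship.
EXAMPLE_EDGES = [
    {"s": "sample:1", "p": "produced_by", "o": "event:9"},
    {"s": "sample:1", "p": "has_material", "o": "material:rock"},
    {"s": "sample:1", "p": "has_material", "o": "material:mineral"},
    {"s": "sample:2", "p": "produced_by", "o": "event:9"},
]


def edges_to_wide(edges):
    """Group (s, p, o) edge rows into one dict per subject.

    Each predicate becomes a `p__<predicate>` key holding the list of
    objects, in the order the edges were seen.
    """
    wide = defaultdict(lambda: defaultdict(list))
    for edge in edges:
        wide[edge["s"]][f"p__{edge['p']}"].append(edge["o"])
    return {subject: dict(columns) for subject, columns in wide.items()}
```

With the example data, `edges_to_wide(EXAMPLE_EDGES)["sample:1"]` holds two materials under `p__has_material`. A wide-format query reads that as a single array value, while a narrow-format query joins over edge rows to get the same answer.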

### Pre-aggregated helpers {.unnumbered}

Small lookup tables computed ahead of time so a page can render facets
and counts instantly, without touching the primary wide file (278 to 292 MB, depending on the snapshot):

| File | Size | Contents | Use when… |
|---|---:|---|---|
| [`isamples_202601_facet_summaries.parquet`](https://data.isamples.org/isamples_202601_facet_summaries.parquet) | 2 KB | `(facet_type, facet_value, count)` for source, material, context, object_type | You want instant initial facet counts with no filters applied |
| [`isamples_202601_facet_cross_filter.parquet`](https://data.isamples.org/isamples_202601_facet_cross_filter.parquet) | 6 KB | Pre-computed counts for single-facet selections | You want instant cross-filtered counts for a single active filter |
| [`isamples_202601_sample_facets_v2.parquet`](https://data.isamples.org/isamples_202601_sample_facets_v2.parquet) | 63 MB | `(pid, material, context, object_type)` facet URIs per sample | You need to filter on *combinations* of facets at query time |
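
Read top to bottom, the table suggests an escalation: use the smallest file that can answer the current question. The sketch below encodes that rule, assuming "single-facet selection" means exactly one selected value in total. `facet_file_for` is a hypothetical helper, not code from the tutorials.

```python
BASE = "https://data.isamples.org/"


def facet_file_for(active_filters: dict) -> str:
    """Pick the smallest helper file that can answer a facet-count query.

    `active_filters` maps a facet name (e.g. "source") to the list of
    values currently selected for it.
    """
    selected = sum(len(values) for values in active_filters.values())
    if selected == 0:
        # No filters: the 2 KB summary already holds every count.
        return BASE + "isamples_202601_facet_summaries.parquet"
    if selected == 1:
        # One active value: counts were pre-computed per single selection.
        return BASE + "isamples_202601_facet_cross_filter.parquet"
    # Combinations must be counted at query time from per-sample facets.
    return BASE + "isamples_202601_sample_facets_v2.parquet"
```

A page following this pattern would only download the 63 MB per-sample file once a user combines filters.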

### Geospatial aggregates (H3) {.unnumbered}

Hexagonal H3 cells pre-aggregated at three resolutions for zoom-adaptive
globe rendering. Each row: `h3_cell, center_lat, center_lng, sample_count,
dominant_source, source_count`.

| File | Size | Cells | Typical altitude |
|---|---:|---:|---|
| [`isamples_202601_h3_summary_res4.parquet`](https://data.isamples.org/isamples_202601_h3_summary_res4.parquet) | 580 KB | ~38 K | Continental (world view) |
| [`isamples_202601_h3_summary_res6.parquet`](https://data.isamples.org/isamples_202601_h3_summary_res6.parquet) | 1.6 MB | ~112 K | Regional (country / state) |
| [`isamples_202601_h3_summary_res8.parquet`](https://data.isamples.org/isamples_202601_h3_summary_res8.parquet) | 2.4 MB | ~176 K | Neighborhood |

CSV twins exist alongside each parquet (3× larger) for human inspection —
browsers use the parquet versions.
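
Zoom-adaptive rendering amounts to mapping camera altitude to one of these files. The sketch below shows one way to do that. The 3000 km and 300 km thresholds are illustrative assumptions, not values taken from the Interactive Explorer. Only the ~120 km switch to individual points comes from this catalog (see the lite file's entry).

```python
BASE = "https://data.isamples.org/"


def layer_for_altitude(
    altitude_km: float,
    res4_above_km: float = 3000.0,   # assumed threshold, for illustration
    res6_above_km: float = 300.0,    # assumed threshold, for illustration
    points_below_km: float = 120.0,  # from the catalog's lite-file entry
) -> str:
    """Return the URL of the file to render at a given camera altitude."""
    if altitude_km < points_below_km:
        return BASE + "isamples_202601_samples_map_lite.parquet"
    if altitude_km >= res4_above_km:
        return BASE + "isamples_202601_h3_summary_res4.parquet"
    if altitude_km >= res6_above_km:
        return BASE + "isamples_202601_h3_summary_res6.parquet"
    return BASE + "isamples_202601_h3_summary_res8.parquet"
```

The thresholds are parameters so a caller can tune them to match the behavior they observe on the globe.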

### Individual sample points (lite) {.unnumbered}

| File | Size | Contents | Use when… |
|---|---:|---|---|
| [`isamples_202601_samples_map_lite.parquet`](https://data.isamples.org/isamples_202601_samples_map_lite.parquet) | 60 MB | `pid, label, source, latitude, longitude, place_name, result_time, h3_res8, h3_res8_hex` — no description | Point-level rendering below ~120 km altitude |
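
The lite file carries the resolution-8 cell twice, as `h3_res8` and `h3_res8_hex`. Assuming the second column is the usual H3 string form of the first (the 64-bit index written in lowercase hexadecimal), converting between them needs no H3 library. The function names below are illustrative.

```python
def h3_int_to_hex(cell: int) -> str:
    """Render a 64-bit H3 index as its lowercase hexadecimal string form."""
    return format(cell, "x")


def h3_hex_to_int(cell: str) -> int:
    """Parse an H3 hexadecimal string back into its integer index."""
    return int(cell, 16)
```

This is handy when joining the lite file, using the hex string, against the BIGINT `h3_res8` column in the wide H3 file.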

### Which tutorial uses which file {.unnumbered}

| | Interactive Explorer | Search Explorer | Deep-Dive Analysis |
|---|:-:|:-:|:-:|
| `wide.parquet` | | ● | |
| `wide_h3.parquet` | | | ● |
| `facet_summaries.parquet` | ● | ● | ● |
| `facet_cross_filter.parquet` | | ● | |
| `sample_facets_v2.parquet` | ● | ● | |
| `h3_summary_res4/6/8.parquet` | ● | | |
| `samples_map_lite.parquet` | ● | | |

### Quick query recipes {.unnumbered}

From Python:

```python
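import duckdb

# In-memory database. Recent DuckDB releases load the httpfs extension
# automatically for https:// URLs; if yours does not, run
# con.sql("INSTALL httpfs; LOAD httpfs;") first.
con = duckdb.connect()

# Count sample records per source, largest first (.df() requires pandas).
counts = con.sql("""
    SELECT source, COUNT(*) AS n
    FROM read_parquet('https://data.isamples.org/current/wide.parquet')
    WHERE otype = 'MaterialSampleRecord'
    GROUP BY 1 ORDER BY 2 DESC
""").df()
print(counts)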
```

From the browser via DuckDB-WASM — see the
[tutorials](/tutorials/) for complete examples with HTTP range requests.