Add Amazon Reviews 2023 (Subscription Boxes) dataset + publish redistribution guards #16

Merged
mprammer merged 2 commits into develop from mp/add-amazon-reviews-2023 on Jun 11, 2026

@mprammer

Contributor

The McAuley-Lab Amazon-Reviews-2023 corpus is an academic HTML crawl of Amazon with no upstream license, and Amazon's Conditions of Use forbid both the data-mining that produced it and any redistribution of the derivatives. This adds the smallest category (Subscription_Boxes, 16,216 reviews) as the catalog's first redistribution-restricted entry: license NoAssertion, redistribution_permitted=false, with a scrape_advisory recording the provenance. It builds and loads locally (fetch → parse → write → vortex all green, raincloud.load(...) round-trips it including the nested images struct), which is the customary research posture; republishing it is the one thing the terms actually forbid.

To hold that line, publish now refuses two classes of slug by default: anything carrying a scrape_advisory, and anything with redistribution_permitted=false. They're independent gates with separate --allow-scrape-advisory / --allow-no-redistribution bypasses, so a slug tripping both (like this one) needs both flags; --all skips blocked slugs and keeps going. This also retroactively guards the eight existing scrape-flagged slugs (C4, FineWeb, SlimPajama, …), not just the new one.
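The two gates described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual implementation: `LicenseInfo`, `publish_blockers`, `select_publishable`, and the slug names are hypothetical, and only the field names `scrape_advisory` and `redistribution_permitted` and the semantics of the two bypass flags come from this PR.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LicenseInfo:
    """Hypothetical stand-in for a catalog entry's license metadata."""
    spdx: str
    redistribution_permitted: bool = True
    scrape_advisory: Optional[str] = None


def publish_blockers(
    lic: LicenseInfo,
    *,
    allow_scrape_advisory: bool = False,
    allow_no_redistribution: bool = False,
) -> list[str]:
    """Return the gates that still block publishing this entry.

    Each gate is checked on its own, so clearing one never clears the other.
    """
    blockers: list[str] = []
    if lic.scrape_advisory is not None and not allow_scrape_advisory:
        blockers.append("scrape_advisory")
    if not lic.redistribution_permitted and not allow_no_redistribution:
        blockers.append("redistribution_permitted=false")
    return blockers


def select_publishable(
    catalog: dict[str, LicenseInfo], **flags: bool
) -> tuple[list[str], dict[str, list[str]]]:
    """Model of the --all behaviour: skip blocked slugs and keep going."""
    publishable: list[str] = []
    skipped: dict[str, list[str]] = {}
    for slug, lic in catalog.items():
        blockers = publish_blockers(lic, **flags)
        if blockers:
            skipped[slug] = blockers
        else:
            publishable.append(slug)
    return publishable, skipped


# Hypothetical catalog: slug names and license values are illustrative only.
catalog = {
    "open-dataset": LicenseInfo(spdx="CC0-1.0"),
    "scraped-corpus": LicenseInfo(
        spdx="ODC-By-1.0", scrape_advisory="web crawl"
    ),
    "amazon-reviews-subscription-boxes": LicenseInfo(
        spdx="NoAssertion",
        redistribution_permitted=False,
        scrape_advisory="academic HTML crawl of Amazon",
    ),
}

publishable, skipped = select_publishable(catalog)
```

Under these assumptions, passing only `allow_scrape_advisory=True` unblocks `scraped-corpus` but leaves the Amazon entry blocked on `redistribution_permitted=false`, which is the "needs both flags" behaviour the description calls out.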

The docs/v1 regen is scoped to this slug only; a separate follow-up PR will resync the unrelated drift already sitting in the tracked snapshot.

🤖 Generated with Claude Code

mprammer and others added 2 commits June 11, 2026 11:48
A mirror upload is redistribution, so refuse two classes of slug by default:
those whose license carries a scrape_advisory (declared license doesn't cover
the underlying content) and those with redistribution_permitted=false (license
grants no redistribution at all). Independent gates, each with its own
--allow-* bypass, so clearing one never silently clears the other; a slug
tripping both needs both flags. --all skips blocked slugs and continues.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>

The McAuley-Lab Amazon-Reviews-2023 corpus is an academic HTML crawl of
Amazon with no upstream license; Amazon's Conditions of Use forbid the
data-mining that produced it and any redistribution of the derivatives. Add
the smallest category (Subscription_Boxes, 16,216 reviews) as the catalog's
first redistribution-restricted entry: license NoAssertion,
redistribution_permitted=false, plus a scrape_advisory documenting the
provenance. Buildable and loadable locally; publish refuses it by default.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
mprammer marked this pull request as ready for review June 11, 2026 16:07
mprammer merged commit 8410621 into develop Jun 11, 2026
7 checks passed
mprammer deleted the mp/add-amazon-reviews-2023 branch June 11, 2026 16:09