Add Amazon Reviews 2023 (Subscription Boxes) dataset + publish redistribution guards #16
Merged
A mirror upload is redistribution, so refuse two classes of slug by default: those whose license carries a scrape_advisory (declared license doesn't cover the underlying content) and those with redistribution_permitted=false (license grants no redistribution at all). Independent gates, each with its own --allow-* bypass, so clearing one never silently clears the other; a slug tripping both needs both flags. --all skips blocked slugs and continues. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
The McAuley-Lab Amazon-Reviews-2023 corpus is an academic HTML crawl of Amazon with no upstream license; Amazon's Conditions of Use forbid the data-mining that produced it and any redistribution of the derivatives. Add the smallest category (Subscription_Boxes, 16,216 reviews) as the catalog's first redistribution-restricted entry: license NoAssertion, redistribution_permitted=false, plus a scrape_advisory documenting the provenance. Buildable and loadable locally; publish refuses it by default. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
This was referenced Jun 11, 2026
The McAuley-Lab Amazon-Reviews-2023 corpus is an academic HTML crawl of Amazon with no upstream license, and Amazon's Conditions of Use forbid both the data-mining that produced it and any redistribution of the derivatives. This adds the smallest category (Subscription_Boxes, 16,216 reviews) as the catalog's first redistribution-restricted entry: license NoAssertion, redistribution_permitted=false, with a scrape_advisory recording the provenance. It builds and loads locally (fetch → parse → write → vortex all green, raincloud.load(...) round-trips it including the nested images struct), which is the customary research posture; republishing it is the one thing the terms actually forbid.

To hold that line, publish now refuses two classes of slug by default: anything carrying a scrape_advisory, and anything with redistribution_permitted=false. They're independent gates with separate --allow-scrape-advisory / --allow-no-redistribution bypasses, so a slug tripping both (like this one) needs both flags; --all skips blocked slugs and keeps going. This also retroactively guards the eight existing scrape-flagged slugs (C4, FineWeb, SlimPajama, …), not just the new one.

The docs/v1 regen is scoped to this slug only; a separate follow-up PR will resync the unrelated drift already sitting in the tracked snapshot.

🤖 Generated with Claude Code
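
For readers who want the gate semantics spelled out, here is a minimal sketch of how two independent publish gates with separate bypasses might behave. It is not the code from this PR: the SlugLicense class, the blocking_gates and publish_all functions, and the slug names are all hypothetical, chosen only to mirror the scrape_advisory and redistribution_permitted fields and the two --allow-* flags described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SlugLicense:
    """Hypothetical stand-in for the license metadata of one catalog slug."""
    slug: str
    scrape_advisory: Optional[str] = None  # set when the declared license doesn't cover the content
    redistribution_permitted: bool = True  # False when the license grants no redistribution


def blocking_gates(entry, allow_scrape_advisory=False, allow_no_redistribution=False):
    """Return the names of the gates that still block publishing this slug.

    Each gate is checked on its own, so clearing one never clears the other.
    An empty list means the slug may be published.
    """
    blocked = []
    if entry.scrape_advisory is not None and not allow_scrape_advisory:
        blocked.append("scrape_advisory")
    if not entry.redistribution_permitted and not allow_no_redistribution:
        blocked.append("no_redistribution")
    return blocked


def publish_all(entries, **allow):
    """Mimic the --all behavior: skip blocked slugs and keep going."""
    published, skipped = [], {}
    for entry in entries:
        blocked = blocking_gates(entry, **allow)
        if blocked:
            skipped[entry.slug] = blocked  # record why it was skipped
            continue
        published.append(entry.slug)  # a real implementation would upload here
    return published, skipped


# Hypothetical slugs: one trips both gates, one trips only the scrape gate.
amazon = SlugLicense("amazon-subscription-boxes",
                     scrape_advisory="academic HTML crawl",
                     redistribution_permitted=False)
scraped = SlugLicense("scraped-corpus", scrape_advisory="web crawl")
clean = SlugLicense("clean-corpus")

published, skipped = publish_all([amazon, scraped, clean])
print(published)
print(skipped)
```

With no bypass flags only the clean slug is published. Passing allow_scrape_advisory=True alone releases the scrape-only slug but still leaves the Amazon-style entry blocked on the redistribution gate, which is the "needs both flags" property the description relies on.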