Propagate HTTP Last-Modified from probe to distribution for download caching #296
Description
The ImportResolver probes data dump distributions via HEAD requests and stores the HTTP Last-Modified header on the probe result (ProbeResult.lastModified). However, this value is never propagated to Distribution.lastModified before passing distributions to the importer.
As a result, LastModifiedDownloader.localFileIsUpToDate() returns false whenever distribution.lastModified is undefined, so the file is re-downloaded on every run. Each re-download updates the file's mtime, which invalidates the QLever index cache and forces a full re-index on every pipeline run.
This affects all distributions where lastModified isn't set by the dataset source (e.g. manual dataset.ttl selections). Registry-backed datasets are unaffected because @lde/dataset-registry-client sets lastModified from the registry's modified field.
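To make the failure mode concrete, here is a minimal sketch of a freshness check of the kind described above. It is not the actual LastModifiedDownloader code: the function name `isUpToDateSketch` and its parameters are hypothetical, and the comparison rule (local mtime at or after the remote timestamp) is an assumption for illustration.

```typescript
// Hypothetical stand-in for the check described in this issue;
// the real LastModifiedDownloader.localFileIsUpToDate() may differ.
function isUpToDateSketch(
  localMtime: Date | undefined,
  remoteLastModified: Date | undefined,
): boolean {
  // Without a remote timestamp there is nothing to compare against,
  // so the local copy is treated as stale and downloaded again.
  if (localMtime === undefined || remoteLastModified === undefined) {
    return false;
  }
  // Assumption: the local file counts as fresh when it is at least
  // as new as the remote Last-Modified value.
  return localMtime.getTime() >= remoteLastModified.getTime();
}
```

With `remoteLastModified` left undefined, the sketch reports the file as stale no matter how recent the local copy is, which matches the behaviour reported here.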
Suggested fix
In ImportResolver.importDataset(), copy probeResult.lastModified onto the candidate distribution before passing it to the importer:
    for (const candidate of candidates) {
      const probeResult = probeResults.find(r => r.url === candidate.accessUrl.toString());
      if (probeResult?.lastModified && !candidate.lastModified) {
        candidate.lastModified = probeResult.lastModified;
      }
    }

This would let the downloader use the HTTP Last-Modified header to skip redundant downloads, which in turn preserves the QLever index cache.
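The same idea can be tried in isolation with a self-contained sketch. The interfaces `DistributionSketch` and `ProbeResultSketch` and the helper `propagateLastModified` are hypothetical names that model only the fields mentioned in this issue; the real Distribution and ProbeResult types will have more.

```typescript
// Hypothetical minimal shapes, modelling only the fields named in this issue.
interface DistributionSketch {
  accessUrl: URL;
  lastModified?: Date;
}

interface ProbeResultSketch {
  url: string;
  lastModified?: Date;
}

// Copies the probed Last-Modified value onto candidates that lack one.
function propagateLastModified(
  candidates: DistributionSketch[],
  probeResults: ProbeResultSketch[],
): void {
  for (const candidate of candidates) {
    const probed = probeResults.find(
      (r) => r.url === candidate.accessUrl.toString(),
    );
    // Only fill the gap: a value already set by the dataset source wins.
    if (probed?.lastModified && !candidate.lastModified) {
      candidate.lastModified = probed.lastModified;
    }
  }
}
```

As in the snippet above, the match relies on the probe URL being identical to `accessUrl.toString()`, and candidates that already carry a lastModified (for example registry-backed datasets) are left untouched.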