Propagate HTTP Last-Modified from probe to distribution for download caching #296

@ddeboer

Description

The ImportResolver probes data dump distributions via HEAD requests and stores the HTTP Last-Modified header on the probe result (ProbeResult.lastModified). However, this value is never propagated to Distribution.lastModified before passing distributions to the importer.

As a result, LastModifiedDownloader.localFileIsUpToDate() returns false whenever distribution.lastModified is undefined, so the file is re-downloaded on every run. Each re-download updates the file's mtime, which invalidates the QLever index cache and forces a full re-index on every pipeline run.
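To illustrate the failure mode, here is a minimal sketch of such an up-to-date check. It is not the actual LastModifiedDownloader source: the DistributionLike type, the localMtime parameter and the comparison direction are assumptions made for illustration only.

```typescript
// Hypothetical stand-in for the real Distribution type.
interface DistributionLike {
  lastModified?: Date;
}

// Assumed logic: the local file counts as up to date only when it exists
// and its mtime is not older than the remote lastModified.
function localFileIsUpToDate(
  localMtime: Date | undefined,
  distribution: DistributionLike,
): boolean {
  if (localMtime === undefined) return false; // nothing downloaded yet
  if (distribution.lastModified === undefined) return false; // nothing to compare against
  return localMtime.getTime() >= distribution.lastModified.getTime();
}

const mtime = new Date('2024-06-01T00:00:00Z');

// lastModified was never propagated: the check can only answer "re-download".
const withoutLastModified = localFileIsUpToDate(mtime, {});

// lastModified is known and older than the local file: the download can be skipped.
const withLastModified = localFileIsUpToDate(mtime, {
  lastModified: new Date('2024-05-01T00:00:00Z'),
});
```

Under these assumptions, a distribution without lastModified can never be considered up to date, no matter how recent the local copy is.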

This affects all distributions where lastModified isn't set by the dataset source (e.g. manual dataset.ttl selections). Registry-backed datasets are unaffected because @lde/dataset-registry-client sets lastModified from the registry's modified field.

Suggested fix

In ImportResolver.importDataset(), copy probeResult.lastModified onto the candidate distribution before passing it to the importer:

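for (const candidate of candidates) {
    // Match the probe result to the candidate by its access URL.
    const probeResult = probeResults.find(r => r.url === candidate.accessUrl.toString());
    // Only fill in lastModified when the dataset source did not already set it.
    if (probeResult?.lastModified && !candidate.lastModified) {
        candidate.lastModified = probeResult.lastModified;
    }
}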

This would let the downloader use the HTTP Last-Modified header to skip redundant downloads, which in turn preserves the QLever index cache.
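The suggested fix can also be sketched as a standalone function that is easy to unit-test. The ProbeResultLike and CandidateLike types, the propagateLastModified name and the example.org URLs are hypothetical; the real ProbeResult and Distribution classes may have different shapes.

```typescript
// Hypothetical shapes, reduced to the fields the fix needs.
interface ProbeResultLike {
  url: string;
  lastModified?: Date;
}

interface CandidateLike {
  accessUrl: URL;
  lastModified?: Date;
}

// Copies the probed Last-Modified value onto candidates that lack one.
// Values already set by the dataset source are left untouched.
function propagateLastModified(
  candidates: CandidateLike[],
  probeResults: ProbeResultLike[],
): void {
  // Index probe results by URL to avoid a linear search per candidate.
  const byUrl = new Map<string, ProbeResultLike>();
  for (const result of probeResults) {
    byUrl.set(result.url, result);
  }

  for (const candidate of candidates) {
    const probed = byUrl.get(candidate.accessUrl.toString())?.lastModified;
    if (probed !== undefined && candidate.lastModified === undefined) {
      candidate.lastModified = probed;
    }
  }
}

const candidates: CandidateLike[] = [
  { accessUrl: new URL('https://example.org/dump.nt.gz') },
  {
    accessUrl: new URL('https://example.org/other.nt.gz'),
    lastModified: new Date('2024-01-01T00:00:00Z'),
  },
];

const probeResults: ProbeResultLike[] = [
  { url: 'https://example.org/dump.nt.gz', lastModified: new Date('2024-05-01T00:00:00Z') },
  { url: 'https://example.org/other.nt.gz', lastModified: new Date('2024-03-01T00:00:00Z') },
];

propagateLastModified(candidates, probeResults);
```

After the call, the first candidate carries the probed date, while the second keeps the value it already had. Note that the lookup depends on the probe result's url matching the serialized accessUrl exactly, so both sides would need to be normalized in the same way.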
