Skip to content

improvement: Use tracking timestamps instead of data timestamps for stale-entry eviction in ACTIVE_OBJECT_STORE_SYNC_FILES #1634

@coderabbitai

Description

@coderabbitai

Summary

In src/storage/object_storage.rs, the stale-entry cleanup in collect_upload_results currently uses extract_datetime_from_parquet_path_regex to compare the parquet file's data timestamp (extracted from the filename) against a 5-minute threshold, rather than the time the path was added to ACTIVE_OBJECT_STORE_SYNC_FILES.

Problem

  • Historical ingestion (with time partition): entries are immediately eligible for cleanup because the data timestamp (from the filename) is old, potentially causing races where a currently-uploading file gets evicted from the tracking set prematurely.
  • Near-real-time ingestion: works correctly since data timestamps closely match wall-clock time.

Proposed Improvement

Change the tracking structure from HashSet<PathBuf> to a HashMap<PathBuf, DateTime<Utc>> (or equivalent), recording Utc::now() when each path is inserted. Update the retain logic to compare now - tracked_at >= Duration::minutes(5) instead of parsing the data timestamp from the filename.

This ensures accurate duration-based eviction regardless of whether data is near-real-time or historical.

Context

Metadata

Metadata

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions