Skip to content

Offline storage: data-safety fixes + batched flush (empty-filter guard, store-failure propagation, event-loss fix, one-transaction flush)#1491

Open
bmehta001 wants to merge 16 commits into
microsoft:mainfrom
bmehta001:bhamehta/fix-storage-data-safety
Open

Offline storage: data-safety fixes + batched flush (empty-filter guard, store-failure propagation, event-loss fix, one-transaction flush)#1491
bmehta001 wants to merge 16 commits into
microsoft:mainfrom
bmehta001:bhamehta/fix-storage-data-safety

Conversation

@bmehta001

@bmehta001 bmehta001 commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Summary

Offline-storage data-safety + flush performance (combines the data-safety fixes with the SQLite batch-flush optimization, formerly PR #1496 and #1497).

Area Fix
MemoryStorage::DeleteRecords(filter) An empty filter matched every record and silently wiped the RAM queue. Now ignored (use DeleteAllRecords()), mirroring the SQLite backend.
OfflineStorage_SQLite::StoreRecords Batches the whole flush into one BEGIN EXCLUSIVE/COMMIT (one fsync) instead of one transaction per record. ~11× (200 recs) / ~40× (1000 recs) faster (measured on the SDK's vendored sqlite). All-or-nothing: any insert failure rolls the batch back (new SqliteDB::rollback), so it can be re-queued without duplicate rows (no unique record_id).
OfflineStorage_SQLite::StoreRecord Returns false and reports OnStorageFailed on a real write failure (was ignoring the execute() result). Validation runs before the transaction; write failures are reported after it closes.
OfflineStorageHandler::StoreRecord Propagates the disk store result (was always true).
OfflineStorageHandler::Flush Drained the RAM queue and persisted before confirming, losing events on a disk-write failure. Now drains and persists via the batched, all-or-nothing StoreRecords(), and re-queues the whole batch if it doesn't fully persist — realizing the batch speedup and keeping the no-loss / no-duplicate guarantee. Also null-guards the size read (disk-only storage).

Tests

  • MemoryStorageTests.DeleteRecordsWithEmptyFilterDoesNotDeleteAll
  • OfflineStorageHandlerFlushTests.FailedDiskStoreDuringFlushReturnsRecordsToMemory — records the disk store rejects remain in memory after Flush(). Verified it fails against the previous GetRecords()-based Flush.
  • OfflineStorageTests_SQLite.StoreRecordsBatchStoresAllRecords
  • Full WSL UnitTests: 527/527 pass.

History

The batching half (PR #1497) was driven to a clean Copilot review over ~9 rounds, then folded here so the batched StoreRecords() and the Flush data-loss fix cooperate (all-or-nothing batch + re-queue) rather than conflict.


Additional correctness fixes (bundled)

Three small, unrelated correctness fixes (happy to split out if reviewers prefer):

Issue Fix
#1334 PrivacyGuard JNI use-after-free: nativeInitializePrivacyGuard[WithoutCommonDataContext] stored JStringToStdString(...).c_str() into InitializationConfiguration's const char* fields; the temporary std::string died at end-of-statement, so config pointed at freed memory before PrivacyGuard was constructed. Now the converted strings are held in locals that outlive make_shared<PrivacyGuard>(config).
#1333 GetAppLocalTempDirectory leaked a RoInitialize reference on the UWP path (no matching RoUninitialize). Balanced with RoUninitialize() when the call succeeded, releasing the WinRT StorageFolder first so it isn't destroyed in an uninitialized apartment.
#312 TransmitProfiles JSON powerState map was missing low_battery, so profiles using it silently fell back to default. Mapped low_battery -> PowerSource_LowBattery.

Validation: full WSL UnitTests 530/530 pass (covers #312 + no regression). The JNI (#1334) and UWP (#1333) paths are platform-specific and not in the desktop build; both are minimal lifetime fixes.

Concurrency / lifecycle fixes (bundled)

Issue Fix
#1221 Data race (TSan, iOS): OfflineStorageHandler::GetAndReserveRecords wrote m_lastReadCount/m_readFromMemory unsynchronized while LastReadRecordCount()/IsLastReadFromMemory() read them from the upload path. Both members are now std::atomic (all uses are by-value loads/stores/fetch-add).
#1134 SQLite leak (ASan): SqliteDB had no destructor, so a SqliteDB destroyed without an explicit shutdown() leaked its handle + prepared statements. Added ~SqliteDB() calling the idempotent shutdown() (finalize + close + release instance count).

Validation: full WSL UnitTests 530/530 pass, including the SQLite instance-count suite (confirms ~SqliteDB doesn't double-release). #1221 is a memory-model fix (TSan-only repro); #1134's leak is sanitizer-detected (how it was reported), so both are validated by no-regression + correctness rather than a new deterministic test.

bmehta001 and others added 2 commits June 22, 2026 10:48
…ailure

Two latent data-loss bugs found during a repo-wide review:

1) MemoryStorage::DeleteRecords(whereFilter) matched EVERY record when
   whereFilter was empty (the matcher starts `matched = true` and the
   per-key loop never runs), silently wiping the entire in-memory queue.
   This contradicts the fail-closed OfflineStorage_SQLite::DeleteRecords
   and the Room backend. Guard an empty filter and return without
   deleting; intentional full clears use DeleteAllRecords().

2) OfflineStorage_SQLite::StoreRecord ignored the bool returned by
   SqliteStatement::execute(), returning true and bumping
   m_DbSizeEstimate even on a real write failure (SQLITE_FULL/IOERR/etc).
   The event is silently lost with no OnStorageFailed notification and
   the size estimate drifts. Capture the result; on failure log, notify
   the observer, and return false (skipping the size bump).

Tests: added MemoryStorageTests.DeleteRecordsWithEmptyFilterDoesNotDeleteAll
(fails without the guard -- the queue is wiped to 0; passes with it).
The StoreRecord write-failure path isn't unit-testable here (the insert
is REPLACE INTO with no constraint to violate), so it's covered by build
+ review. Verified locally on Linux: all 9 MemoryStorageTests and 32
OfflineStorageTests_SQLite pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Verified TDD: this test fails without the empty-filter guard (the queue
is wiped, GetSize()/GetRecordCount() drop to 0) and passes with it.
Run on Linux host.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bmehta001 bmehta001 requested a review from a team as a code owner June 22, 2026 15:56
Comment thread lib/offline/OfflineStorage_SQLite.cpp Outdated
lib/offline/OfflineStorage_SQLite.cpp::StoreRecord now returns false on a write
failure (this PR), but OfflineStorageHandler::StoreRecord ignored the disk
result and always returned true, so a failed synchronous store (RAM queue
disabled or during shutdown) was counted as successfully persisted by
StoreRecords()/StorageObserver.

Return the disk StoreRecord() result in the direct-to-disk path. The memory
path is unchanged: MemoryStorage::StoreRecord returning false means an
intentional latency-Off skip, not a failure, so it must not surface as an error.

Verified at lib/offline/OfflineStorageHandler.cpp:266-275 and
lib/offline/OfflineStorage_SQLite.cpp:180-186.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the offline-storage layer against two data-loss scenarios by (1) preventing accidental full-queue deletes when a delete predicate is empty and (2) ensuring SQLite write failures are surfaced to callers/observers instead of being treated as successful stores.

Changes:

  • Add an empty-filter guard to MemoryStorage::DeleteRecords(whereFilter) and a unit test to prevent unintended full deletes.
  • Propagate synchronous disk StoreRecord() failures through OfflineStorageHandler::StoreRecord so callers don’t count failed persistence as success.
  • Treat SqliteStatement::execute() failures in OfflineStorage_SQLite::StoreRecord as hard failures: log, notify OnStorageFailed, and return false without updating size estimates.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
tests/unittests/MemoryStorageTests.cpp Adds coverage to ensure DeleteRecords({}) does not wipe the in-memory queue.
lib/offline/MemoryStorage.cpp Guards empty whereFilter to avoid “match-all” deletion behavior.
lib/offline/OfflineStorageHandler.cpp Returns disk StoreRecord() result to propagate synchronous persistence failures to callers.
lib/offline/OfflineStorage_SQLite.cpp Checks execute() result and reports SQLite insert failures via logging + OnStorageFailed.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread lib/offline/OfflineStorageHandler.cpp
Comment thread lib/offline/OfflineStorageHandler.cpp
bmehta001 added a commit to bmehta001/cpp_client_telemetry that referenced this pull request Jun 23, 2026
OfflineStorage_SQLite::StoreRecord: return false when the INSERT execute() fails
(was ignoring the result and returning true even on a real write failure). The
Flush() reserve/confirm-delete logic in this PR relies on the disk backend
reporting per-record failure; without this, a failed sqlite3_step would still be
treated as persisted and dropped from memory. (This is the same one-line fix as
PR microsoft#1491, included here so this PR is correct on its own; trivial overlap that
resolves cleanly at merge.)

OfflineStorageTests.cpp NoopTaskDispatcher: own queued tasks and delete them on
Cancel()/Join()/destruction instead of dropping the Task* (PAL::scheduleTask
allocates with new and expects the dispatcher to take ownership), so the helper
cannot leak.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Combine the Flush() data-loss fix into this storage-data-safety PR (the two are
halves of the same fix: this PR already makes OfflineStorage_SQLite::StoreRecord
report write failures; Flush() must act on that).

OfflineStorageHandler::Flush() previously drained the in-memory queue with
GetRecords() (which removes records) and handed them to StoreRecords() before
confirming persistence. On a partial/total disk write failure the un-persisted
records were already gone from memory and never re-queued -> events lost.

Flush() now drains into a local batch, persists one record at a time, and
re-inserts only the records that fail to persist (so failures are retried, not
lost). Per-record StoreRecord() is used deliberately: a batched StoreRecords()
only returns a count, so on a partial failure we could not tell which records to
re-queue, and re-storing already-saved records would duplicate them (no unique
record_id constraint). Also null-guards the dbSizeBeforeFlush read so Flush() is
safe with disk-only storage (CFG_INT_RAM_QUEUE_SIZE == 0).

Adds OfflineStorageHandlerFlushTests.FailedDiskWriteDuringFlushReturnsRecordsToMemory
(records the SQLite store rejects stay in memory after Flush; verified it fails
against the previous GetRecords()-based Flush). Closes the separate PR microsoft#1496.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bmehta001 bmehta001 changed the title Offline storage: guard empty-filter delete and propagate SQLite store failures Offline storage data-safety: empty-filter guard, store-failure propagation, and flush event-loss fix Jun 23, 2026
@bmehta001 bmehta001 requested a review from Copilot June 23, 2026 03:59

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comment thread tests/unittests/OfflineStorageTests.cpp
The test helper's Cancel() returned true unconditionally, violating the
ITaskDispatcher::Cancel contract (return whether the task was found/cancelled).
Return true only when the task was present in the queue, false otherwise.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Comment thread lib/offline/OfflineStorageHandler.cpp Outdated
Comment thread tests/unittests/OfflineStorageTests.cpp Outdated
Comment thread tests/unittests/OfflineStorageTests.cpp Outdated
Rename FailedDiskWriteDuringFlush... -> FailedDiskStoreDuringFlush... and reword
its comments: the test exercises a disk StoreRecord() rejection (SQLite input
validation), which drives the same Flush() re-queue path as any disk store
failure, not a literal disk write/IO error.

(The reviewer's separate note that Flush() ignores EventPersistence_DoNotStoreOnDisk
is a pre-existing behavior, out of scope for this data-safety change and not
cleanly unit-testable via the public API; tracked as a follow-up.)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

bmehta001 and others added 2 commits June 23, 2026 10:02
…(was PR microsoft#1497)

Combine the batched-flush perf work into this PR and make it cooperate with the
Flush() data-loss fix, so both land together.

OfflineStorage_SQLite: StoreRecords() now inserts the whole batch in a single
BEGIN EXCLUSIVE / COMMIT (one fsync) instead of one transaction per record
(~11x at 200 records, ~40x at 1000 vs the SDK's vendored sqlite). Shared per-record
logic is factored into isValidRecord / insertRecordUnsafe / checkStorageSizeLimits.
The batch is all-or-nothing: if any insert fails, the transaction is rolled back
(new SqliteDB::rollback / DbTransaction::markForRollback) and the size estimate is
undone, so callers can re-queue the whole batch without risking duplicate rows
(the events table has no unique record_id constraint).

OfflineStorageHandler::Flush() now uses the batched StoreRecords() to persist a
drained batch in one transaction. Because StoreRecords() is all-or-nothing, on
failure nothing is committed and Flush returns every record to the in-memory queue
for retry -- realizing the batching speedup while keeping the no-event-loss /
no-duplicate guarantee.

StoreRecords/StoreRecord report write failures via OnStorageFailed after the
transaction closes; validation runs before the transaction. Adds
OfflineStorageTests_SQLite.StoreRecordsBatchStoresAllRecords. Full UnitTests (527)
pass. Closes PR microsoft#1497.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bmehta001 bmehta001 changed the title Offline storage data-safety: empty-filter guard, store-failure propagation, and flush event-loss fix Offline storage: data-safety fixes + batched flush (empty-filter guard, store-failure propagation, event-loss fix, one-transaction flush) Jun 23, 2026
@bmehta001 bmehta001 requested a review from Copilot June 23, 2026 16:34

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Comment thread lib/offline/OfflineStorage_SQLite.cpp Outdated
…cords

StoreRecords() previously filtered out invalid records and committed the valid
ones, so it could return a count < records.size() even though some records were
persisted. OfflineStorageHandler::Flush() treats totalSaved < records.size() as a
batch failure and re-queues ALL drained records, which would duplicate the valid
records that were actually stored.

Make StoreRecords() truly all-or-nothing: if ANY input record is invalid, store
nothing and return 0 (invalids are still reported via isValidRecord()). Combined
with the existing rollback-on-write-failure, StoreRecords() now returns either
records.size() (whole batch committed) or 0 (nothing committed), so Flush's
re-queue-all-on-short-return can never duplicate records.

Adds OfflineStorageTests_SQLite.StoreRecordsBatchWithAnyInvalidStoresNothing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Comment thread lib/offline/OfflineStorageHandler.cpp
Flush() re-queued the whole drained batch whenever StoreRecords() returned a
count < records.size(). Both disk backends are all-or-nothing (SQLite rolls back;
Room returns 0 on a failed JNI batch), so the only meaningful "failure" value is
0. Room also caps its returned count at min(size, INT32_MAX); keying off
< records.size() would treat that capped count as a failure and re-queue
already-persisted records (duplicates). Key the re-queue off totalSaved == 0
instead, which is the true "nothing committed" signal. (The cap only matters for
a batch larger than the RAM queue could ever hold.)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.

bmehta001 and others added 2 commits June 24, 2026 11:36
…profile

Three small correctness fixes bundled with the offline-storage work:

- microsoft#1334: PrivacyGuard JNI use-after-free. nativeInitializePrivacyGuard[WithoutCommonDataContext]
  assigned JStringToStdString(...).c_str() into InitializationConfiguration's const char*
  fields; the temporary std::string was destroyed at the end of the statement, leaving the
  config pointing at freed memory before PrivacyGuard was constructed. Hold the converted
  strings in locals that outlive the make_shared<PrivacyGuard>(config) call.

- microsoft#1333: GetAppLocalTempDirectory leaked a RoInitialize reference on the UWP path (no matching
  RoUninitialize). Balance it with RoUninitialize() when the call succeeded, releasing the
  WinRT StorageFolder first so it is not destroyed in an uninitialized apartment.

- microsoft#312: TransmitProfiles JSON powerState map was missing the low_battery key, so profiles
  using it silently fell back to default. Map low_battery -> PowerSource_LowBattery.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated no new comments.

…wn leak (microsoft#1134)

- microsoft#1221: OfflineStorageHandler::GetAndReserveRecords wrote m_lastReadCount and
  m_readFromMemory with no synchronization while IsLastReadFromMemory() and
  LastReadRecordCount() read them from the upload path (TSan-reported on iOS).
  Make both members std::atomic so every access is well-defined; all uses are
  by-value loads/stores/fetch-add, so no other change is needed.

- microsoft#1134: SqliteDB had no destructor, so a SqliteDB destroyed without an explicit
  shutdown() (e.g. when the owning OfflineStorage_SQLite is torn down without
  Shutdown()) leaked its open handle and prepared statements -- the one-time
  sqlite allocation seen under ASan. Add ~SqliteDB() that calls the existing
  idempotent shutdown() (finalizes statements, closes the db, releases the
  instance count); an earlier explicit shutdown() makes it a no-op.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

Comment thread lib/utils/Utils.cpp Outdated
Utils.cpp microsoft#1333: the explicit RoUninitialize() only ran on the normal return
path, so a throwing WinRT call (e.g. TemporaryFolder access) between
RoInitialize() and it would leave a successful RoInitialize() unbalanced.
Move the balance into an RAII guard so it runs on every exit path including
exceptions; the WinRT StorageFolder is still released in an inner scope before
the guard runs, so it is not destroyed in an uninitialized apartment.
  Verified against lib/utils/Utils.cpp:105-127.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated no new comments.

load_Json_RuleWithLowBatteryPowerState_MapsToPowerSourceLowBattery loads a
profile whose rule uses "powerState": "low_battery" and asserts the parsed
rule maps to PowerSource_LowBattery. Verified it fails against the pre-fix code
(the key was absent from transmitProfilePowerState, so powerState fell back to
the default PowerSource_Any) and passes with the fix. Full UnitTests: 531/531.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

Comment thread lib/offline/OfflineStorageHandler.cpp
OfflineStorageHandler::Flush() called m_offlineStorageDisk->Flush() in the
CFG_BOOL_CHECKPOINT_DB_ON_FLUSH branch without a null check. With RAM-only
storage (no disk backend, e.g. HAVE_MAT_STORAGE disabled) m_offlineStorageDisk
is null, so enabling that config would dereference null and crash. Guard the
call with m_offlineStorageDisk, matching the null checks elsewhere in Flush().
  Verified at lib/offline/OfflineStorageHandler.cpp:221-225.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated no new comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants