# Rebuild native crash detection as app lifecycle monitors

## Problem Statement

Harness needs reliable native crash detection across Android, iOS Simulator, and physical iOS devices. The current crash detector should be re-implemented from scratch because the new design needs to be explicit about platform capabilities, easier to test, and less coupled to app launch internals.

From a Harness user's perspective, native crashes should fail the relevant startup or test execution phase quickly, include useful evidence, and avoid false positives when Harness itself intentionally stops or restarts the app. The system should work well on Android and iOS Simulator, while being honest about the more limited realtime guarantees available on physical iOS devices.

The implementation should preserve Harness' existing platform ownership model: Jest orchestrates runs and lifecycle timing, platform packages own platform tool details, and shared packages define only stable contracts.

## Solution

Introduce a shared `AppLifecycleMonitor` contract implemented by each platform. The Jest package will drive this contract linearly during session setup, app launch, restart, stop, startup readiness, test execution, and teardown. Every platform returns a monitor implementation; disabled or unsupported monitoring is represented by a noop monitor rather than optional metadata or conditional checks in Jest.
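As a rough illustration of the contract and the noop fallback, the sketch below shows one possible shape. `AppLifecycleMonitor`, `createAppMonitor`, and `detectNativeCrashes` are named in this issue; every method name, type, and signature here is an assumption made for illustration, and the ADR remains the reference for the actual interface.

```typescript
// Hypothetical sketch: method names and signatures are illustrative only.
type CrashPhase = "startup" | "execution";

interface CrashReport {
  phase: CrashPhase;
  summary: string;
  artifactPaths: string[];
}

interface AppLifecycleMonitor {
  start(): Promise<void>;
  notifyLaunch(launchId: string): void;
  notifyStop(): void;
  notifyRestart(launchId: string): void;
  // Settles only when a crash is detected, so Jest can race it against
  // startup readiness or test execution.
  watchForCrash(phase: CrashPhase): Promise<CrashReport>;
  reset(): void;
  isAppAlive(): Promise<boolean>;
  stop(): Promise<void>;
  dispose(): Promise<void>;
}

// Implements the full contract and observes nothing.
const noopMonitor: AppLifecycleMonitor = {
  start: async () => {},
  notifyLaunch: () => {},
  notifyStop: () => {},
  notifyRestart: () => {},
  watchForCrash: () => new Promise<CrashReport>(() => {}), // never settles
  reset: () => {},
  isAppAlive: async () => true,
  stop: async () => {},
  dispose: async () => {},
};

// The toggle picks an implementation; callers never branch on it again.
function createAppMonitor(
  options: { detectNativeCrashes: boolean },
  createRealMonitor: () => AppLifecycleMonitor,
): AppLifecycleMonitor {
  return options.detectNativeCrashes ? createRealMonitor() : noopMonitor;
}
```

Because the noop watch never settles, a `Promise.race` between it and the readiness or execution promise always resolves with the latter, which is what lets Jest keep a single orchestration path.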

Platform packages will own evidence collection:

- Android will monitor `adb logcat`, process state through `pidof`, and best-effort tombstone/ANR artifacts.
- iOS Simulator will monitor `simctl spawn ... log stream`, simulator/app state, and host crash reports.
- Physical iOS devices will monitor `devicectl` JSON process state and crash log artifacts, with heavier fallbacks where needed.
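To make the Android log evidence concrete, here is a minimal parsing sketch that assumes logcat output in the `threadtime` format. The type and function names are hypothetical, and the three signatures checked (the `AndroidRuntime` "FATAL EXCEPTION" line, the `libc` "Fatal signal" line, and the `ActivityManager` "ANR in" line) are common examples rather than an exhaustive or authoritative list.

```typescript
// Hypothetical sketch: parses one `threadtime` logcat line and flags
// a few well-known crash signatures. Not an exhaustive matcher.
interface LogcatLine {
  pid: number;
  tid: number;
  level: string;
  tag: string;
  message: string;
}

type LogSignal = "java-crash" | "native-crash" | "anr";

// threadtime layout: "MM-DD HH:MM:SS.mmm  PID  TID LEVEL TAG: message"
const THREADTIME =
  /^\d\d-\d\d \d\d:\d\d:\d\d\.\d{3}\s+(\d+)\s+(\d+)\s+([VDIWEF])\s+(.*?)\s*: (.*)$/;

function parseThreadtimeLine(line: string): LogcatLine | null {
  const match = THREADTIME.exec(line);
  if (!match) return null; // unrelated or malformed output is ignored
  return {
    pid: Number(match[1]),
    tid: Number(match[2]),
    level: match[3],
    tag: match[4],
    message: match[5],
  };
}

function classifyLine(line: LogcatLine): LogSignal | null {
  if (line.tag === "AndroidRuntime" && line.message.startsWith("FATAL EXCEPTION")) {
    return "java-crash";
  }
  if (line.tag === "libc" && line.message.startsWith("Fatal signal")) {
    return "native-crash";
  }
  if (line.tag === "ActivityManager" && line.message.startsWith("ANR in ")) {
    return "anr";
  }
  return null;
}
```

A real collector would also need to attribute each signal to the app under test, for example by comparing the parsed PID with the one observed through `pidof`.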

The monitor will distinguish low-latency suspicion from confirmed crash reports. A suspected crash can be raised from logs or process transitions, then confirmed through corroborating evidence such as process exit, crash report files, tombstones, ANR traces, or device crash logs. Crash evidence should be persisted through the existing crash artifact writer so test failures can point to local reports.
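The suspicion and confirmation split can be sketched as a small pure function. The evidence kinds, the state names, and the specific rules below are assumptions chosen to mirror the description above: a crash report confirms on its own, a fatal log line raises suspicion, a process exit close to a fatal log confirms it, and a process exit alone stays neutral.

```typescript
// Hypothetical sketch of suspicion vs. confirmation; rules are illustrative.
type EvidenceKind = "fatal-log" | "process-exit" | "crash-report";
type CrashState = "none" | "suspected" | "confirmed";

interface Evidence {
  kind: EvidenceKind;
  atMs: number; // timestamp in milliseconds
}

function classifyEvidence(evidence: Evidence[], confirmWindowMs: number): CrashState {
  // A crash report (tombstone, .ips file, device crash log) is strong on its own.
  if (evidence.some((e) => e.kind === "crash-report")) return "confirmed";

  const fatalLogs = evidence.filter((e) => e.kind === "fatal-log");
  // Process disappearance without any crash signal is treated as neutral.
  if (fatalLogs.length === 0) return "none";

  const exits = evidence.filter((e) => e.kind === "process-exit");
  const corroborated = fatalLogs.some((log) =>
    exits.some((exit) => Math.abs(exit.atMs - log.atMs) <= confirmWindowMs),
  );
  return corroborated ? "confirmed" : "suspected";
}
```

Keeping this logic free of I/O means the same function could be exercised with fixtures from any platform collector.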

## User Stories

1. As a Harness user, I want Android Java crashes to fail the active test quickly, so that failures are reported as native crashes instead of bridge timeouts.
2. As a Harness user, I want Android native crashes to include logcat evidence, so that I can diagnose failures without rerunning manually.
3. As a Harness user, I want Android ANRs to be detected when platform evidence is available, so that hangs are not reported only as generic timeouts.
4. As a Harness user, I want iOS Simulator crashes to be detected from simulator logs and crash reports, so that simulator runs fail with actionable native diagnostics.
5. As a Harness user, I want physical iOS device crashes to be detected from official device artifacts, so that device runs can report native failures without relying on unsupported log streams.
6. As a Harness user, I want crash detection to avoid false positives during controlled restarts, so that Harness does not treat its own `stopApp` calls as crashes.
7. As a Harness user, I want startup crashes to fail during app readiness, so that Harness does not wait for the bridge until timeout when the app has already crashed.
8. As a Harness user, I want execution crashes to fail the currently running test file, so that the failure is attributed to the correct phase.
9. As a Harness user, I want fast provisional detection and later confirmed reports, so that I get low latency without losing diagnostic depth.
10. As a Harness user, I want crash artifacts to be saved locally, so that I can inspect tombstones, ANR traces, logcat windows, or iOS crash reports after the run.
11. As a Harness user, I want monitoring to be togglable, so that I can disable native crash detection without changing the rest of the run flow.
12. As a Harness user, I want disabled monitoring to behave predictably, so that the same Jest orchestration path is used whether monitoring is real or noop.
13. As a Harness maintainer, I want Jest orchestration to stay platform-neutral, so that it does not depend on `adb`, `simctl`, or `devicectl` command details.
14. As a Harness maintainer, I want platform packages to own their evidence collectors, so that platform-specific behavior can evolve independently.
15. As a Harness maintainer, I want a shared monitor interface, so that tests can use fake monitors to verify lifecycle ordering.
16. As a Harness maintainer, I want lifecycle events around launch, stop, and restart, so that monitors can suppress controlled process exits and correlate real launch windows.
17. As a Harness maintainer, I want monitor implementations to use instance keys rather than bare PIDs, so that fast app restarts and PID reuse do not merge unrelated evidence.
18. As a Harness maintainer, I want Android detection to combine logcat and process polling, so that neither noisy logs nor ambiguous process disappearance becomes the only signal.
19. As a Harness maintainer, I want iOS Simulator detection to combine unified logs and crash report watchers, so that realtime suspicion and file-based confirmation reinforce each other.
20. As a Harness maintainer, I want physical iOS detection to be conservative, so that weak process-loss evidence does not become an unreliable crash failure.
21. As a Harness maintainer, I want warning states for degraded capabilities, so that missing tombstone access or delayed iOS crash logs do not look like monitor bugs.
22. As a Harness maintainer, I want the artifact writer to stay storage-only, so that detection logic remains separate from persistence.
23. As a Harness maintainer, I want monitor tests to use command output fixtures, so that platform parsing can be hardened without requiring live devices in unit tests.
24. As a Harness maintainer, I want end-to-end validation on emulator, simulator, and physical-device paths, so that the implementation reflects real platform behavior.
25. As a plugin author, I want future structured crash events to contain phase and artifact metadata, so that reporting integrations can display useful failure information.
26. As a CI user, I want Android and iOS Simulator crash detection to work in CI-friendly environments, so that native failures are caught automatically.
27. As a CI user with physical iOS devices, I want degraded but official crash retrieval paths, so that device-lab runs can collect evidence when fast crash logs are delayed.
28. As a developer debugging a flaky run, I want duplicate crash signals to be debounced, so that one crash does not produce multiple conflicting failures.
29. As a developer debugging app launch, I want launch IDs or launch windows to appear in monitor correlation, so that startup failures are tied to the correct attempt.
30. As a developer maintaining platform launch code, I want launch mechanics to remain unchanged from the monitor's perspective, so that crash detection does not tightly couple to start command construction.

## Implementation Decisions

- Add a shared `AppLifecycleMonitor` contract with lifecycle notifications, crash watches, reset, start, stop, dispose, and liveness methods.
- Add a noop monitor that implements the full contract and observes nothing.
- Keep monitor creation on the platform runner through `createAppMonitor`.
- Do not add optional `monitorTarget` metadata. Platform implementations capture their target identifiers when constructing monitors.
- Preserve `detectNativeCrashes` as the user-facing toggle by returning either a real monitor or the noop monitor.
- Let Jest drive monitor lifecycle linearly: create, start, notify launch/stop/restart, race watches, reset, stop, dispose.
- Keep platform command details out of Jest.
- Keep `adb`, `simctl`, and `devicectl` usage inside the respective platform packages.
- Use lifecycle events to suppress crashes during controlled stops and restarts.
- Use launch IDs or launch windows to correlate startup and execution crashes.
- Normalize platform evidence into shared crash concepts: suspected crash, confirmed crash, report ready, warning, and monitor error.
- Use confidence levels internally so weak evidence, such as process disappearance alone, can be treated differently from crash reports or fatal log signatures.
- Use instance keys instead of bare PIDs.
- Debounce duplicate signals within a short correlation window.
- Persist crash evidence through the existing artifact writer, but keep the writer out of detection decisions.
- Android monitor implementation should include a logcat session, process poller, and artifact fetcher.
- Android logcat should stream `crash`, `main`, and `system` buffers with `threadtime`; the `events` buffer may be added for ANR correlation.
- Android process polling should use `pidof` and treat process disappearance as neutral unless correlated with crash evidence or exit-reason support.
- Android artifact fetch should opportunistically retrieve tombstones and ANR traces when accessible, treating access failures as warnings.
- iOS Simulator monitor implementation should include a unified log stream, process-state checks, and a host crash report watcher.
- iOS Simulator log streaming should prefer JSON style and fall back to compact style when needed.
- iOS Simulator crash report matching should filter out stale reports and confirm that each report belongs to the current simulator or launch window.
- Physical iOS monitor implementation should use `devicectl` JSON-output commands and avoid stdout scraping.
- Physical iOS process checks should match running processes through installed app metadata.
- Physical iOS confirmation should primarily come from matching device crash logs rather than process disappearance alone.
- Physical iOS sysdiagnose collection should be an escalation fallback, not the normal fast path.
- App-assisted Android `ApplicationExitInfo` and physical iOS heartbeat support are valid future enhancements but are not required for the baseline.
- Monitor warnings should distinguish degraded platform capability from app crash failures.
- The ADR attached in the issue comments is the detailed implementation reference for command shapes and detection flow.
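Three of the decisions above (instance keys, duplicate debouncing, and controlled-stop suppression) can be combined in one small filter. The sketch below is hypothetical: the key format, the `SignalGate` name, and the fixed-window debounce policy are assumptions for illustration, not the ADR's design.

```typescript
// Hypothetical sketch: instance keys, debouncing, and controlled-stop suppression.
interface CrashSignal {
  instanceKey: string;
  atMs: number; // timestamp in milliseconds
}

// Keys combine app, launch, and PID so a reused PID in a later launch
// does not merge with evidence from an earlier one.
function makeInstanceKey(appId: string, launchId: string, pid: number): string {
  return `${appId}#${launchId}#${pid}`;
}

class SignalGate {
  private readonly lastAccepted = new Map<string, number>();
  private readonly controlledStops = new Set<string>();
  private readonly windowMs: number;

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  // Called when Harness itself stops or restarts the app instance.
  notifyControlledStop(instanceKey: string): void {
    this.controlledStops.add(instanceKey);
  }

  // Returns true when the signal should be reported as a new crash signal.
  accept(signal: CrashSignal): boolean {
    if (this.controlledStops.has(signal.instanceKey)) return false;
    const last = this.lastAccepted.get(signal.instanceKey);
    if (last !== undefined && signal.atMs - last <= this.windowMs) return false;
    this.lastAccepted.set(signal.instanceKey, signal.atMs);
    return true;
  }
}
```

In this variant the debounce window is measured from the last accepted signal, so a burst of duplicates does not keep extending it; a sliding window would be an equally valid choice.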

## Testing Decisions

- Test external behavior and lifecycle ordering rather than private implementation details.
- Jest orchestration tests should use fake monitors and assert that lifecycle events are emitted around startup, restart, stop, execution, and teardown.
- Jest orchestration tests should verify that disabled monitoring still follows the same code path through a noop monitor.
- Shared monitor tests should cover suspicion windows, confirmation windows, duplicate suppression, controlled-stop suppression, launch correlation, and report assembly.
- Android monitor tests should use logcat fixtures for Java exceptions, native crashes, ANRs, unrelated logs, and noisy device output.
- Android process tests should cover PID disappearance, PID replacement, controlled force-stop, device disconnect, and inaccessible artifact directories.
- iOS Simulator tests should use unified log fixtures and `.ips` or `.crash` fixtures for current, stale, and mismatched simulator reports.
- iOS Simulator tests should cover JSON log parsing and compact log fallback.
- Physical iOS tests should use `devicectl` JSON fixtures for device discovery, app lookup, process listing, crash log listing, and crash log copying.
- Physical iOS tests should cover process disappearance without crash logs, delayed crash log arrival, stale crash logs, and device unplug during artifact retrieval.
- Artifact persistence tests should remain focused on file/text persistence and deduplication.
- End-to-end validation should cover Android Java crash, Android native crash, Android ANR, iOS Simulator fatal error, iOS Simulator abort, physical iOS crash log retrieval, and controlled restart false-positive suppression.
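For the orchestration tests, a fake monitor only needs to record which lifecycle methods were called and in what order. The following sketch is one hypothetical way to do that; `createFakeMonitor`, `inOrder`, and the recorded method names are illustrative and would need to match whatever the real contract exposes.

```typescript
// Hypothetical sketch: a recording fake plus an ordering check for tests.
function createFakeMonitor() {
  const calls: string[] = [];
  // Each method ignores its arguments and records its own name.
  const record = (name: string) => (..._args: unknown[]): void => {
    calls.push(name);
  };
  return {
    calls,
    start: record("start"),
    notifyLaunch: record("notifyLaunch"),
    notifyStop: record("notifyStop"),
    notifyRestart: record("notifyRestart"),
    reset: record("reset"),
    stop: record("stop"),
    dispose: record("dispose"),
  };
}

// True when `expected` appears within `calls` in the same relative order
// (other calls may be interleaved between the expected ones).
function inOrder(calls: string[], expected: string[]): boolean {
  let matched = 0;
  for (const call of calls) {
    if (call === expected[matched]) matched++;
  }
  return matched === expected.length;
}
```

A test would pass the fake to the orchestration code under test and then assert, for example, that `start` precedes `notifyLaunch` and that `stop` precedes `dispose`.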

## Out of Scope

- Reusing or refactoring the existing crash monitor internals.
- Building a generic cross-platform process-death detector that treats all platforms the same.
- Adding mandatory app instrumentation for the baseline implementation.
- Guaranteeing Android-style realtime crash detection on physical iOS devices.
- Uploading crash artifacts to external services.
- Implementing full symbolication as a required synchronous step.
- Adding retention-policy UI or configuration for crash artifacts.
- Implementing an app-assisted Android exit-reason bridge in the baseline, unless it naturally falls out as a small optional extension.
- Implementing an app-assisted physical iOS heartbeat in the baseline, unless it is explicitly prioritized later.

## Further Notes

The key architectural rule is: platform adapters may know platform tools; the shared monitor and Jest orchestration may only know lifecycle events, monitor capabilities, normalized evidence, and crash artifacts.

The implementation should be honest about asymmetry:

- Android has the strongest near-realtime path through logcat plus PID correlation.
- iOS Simulator has a strong path through simulator unified logs plus host crash reports.
- Physical iOS devices should use official `devicectl` JSON and crash artifact retrieval, with conservative realtime assumptions.

The attached ADR contains the detailed interface sketch, dependency chain, shell commands, platform detection logic, artifact policy, migration plan, and open questions.

