ADR 0009: Robust Synchronization and Error Handling
Status
Accepted (Supersedes behavioral sections of ADR 0005)
Context
As the system matured, several edge cases were identified:
1. Panic Risk: In high-concurrency scenarios, a panic in a leader thread could result in "mutex poisoning," causing all subsequent threads waiting for the same key to panic and crash the entire process.
2. Silent Storage Failures: Errors in LMDB transactions (e.g., oversized keys or disk full) were being logged but not propagated to the user, leading to data loss or confusing behavior.
3. Brittle Invalidation: The invalidation bus used a simple `tag|version` string format. A tag containing a `|` character made the message ambiguous to parse and broke the distributed sync.
4. Counter Drift: In LMDB, the entry count was updated in-memory before the transaction committed, leading to desynchronization if the commit failed.
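The counter drift in item 4 can be reproduced in a few lines. The following is a minimal sketch, not the project's code: `FakeTxn`, `insert_count_first`, and `insert_commit_first` are hypothetical names, and a simulated commit failure stands in for a real LMDB error. It contrasts the problematic ordering with the commit-first ordering adopted in the Decision section.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical stand-in for an LMDB write transaction (assumption, not the real API).
struct FakeTxn {
    fail_on_commit: bool,
}

impl FakeTxn {
    /// Simulates a commit that may fail, e.g. because the disk is full.
    fn commit(self) -> Result<(), String> {
        if self.fail_on_commit {
            Err("simulated commit failure".to_string())
        } else {
            Ok(())
        }
    }
}

/// Problematic ordering: the counter is bumped before the commit outcome is known.
fn insert_count_first(count: &AtomicUsize, txn: FakeTxn) -> Result<(), String> {
    count.fetch_add(1, Ordering::SeqCst);
    txn.commit()
}

/// Commit-first ordering: the counter changes only after a successful commit.
fn insert_commit_first(count: &AtomicUsize, txn: FakeTxn) -> Result<(), String> {
    txn.commit()?;
    count.fetch_add(1, Ordering::SeqCst);
    Ok(())
}

fn main() {
    let drifting = AtomicUsize::new(0);
    let accurate = AtomicUsize::new(0);

    // Both inserts fail at commit time; only the first leaves a stale count behind.
    let _ = insert_count_first(&drifting, FakeTxn { fail_on_commit: true });
    let _ = insert_commit_first(&accurate, FakeTxn { fail_on_commit: true });

    println!(
        "count-first: {}, commit-first: {}",
        drifting.load(Ordering::SeqCst),
        accurate.load(Ordering::SeqCst)
    );
}
```

With the count-first ordering, the in-memory count reports one entry that was never persisted. The commit-first variant leaves the count at zero.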
Decision
- Safe Synchronization: Replace all `.unwrap()` calls on mutexes and condition variables in `flight.rs`. Implement a 60-second timeout for waiting threads to prevent indefinite stalls. Detect poisoned mutexes and return `FlightStatus::Error` instead of panicking.
- Error Propagation: Ensure all storage operations (`set`, `clear`, etc.) return a `PyResult`. Propagate LMDB errors to Python as a `RuntimeError`.
- Tag Validation: Enforce a strict character set for dependency tags: `[a-zA-Z0-9_:]`. Introduce a custom `InvalidTag` exception raised on both `invalidate()` and `set()`.
- Post-Commit Updates: For persistent storage (LMDB), update in-memory atomic counters ONLY after the transaction has successfully committed.
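The Safe Synchronization decision can be sketched with the standard library alone. This is an illustration under assumptions, not the contents of the actual source file: the `Flight` struct, the `wait_for_leader` function, and the `Done` and `Timeout` variants are hypothetical. Only `FlightStatus::Error` and the 60-second limit come from this ADR. The waiter handles both a poisoned lock and a poisoned wait without calling `.unwrap()`.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Wait limit from this ADR.
const WAIT_TIMEOUT: Duration = Duration::from_secs(60);

/// Only `Error` is named in the ADR; `Done` and `Timeout` are assumed for illustration.
#[derive(Debug, PartialEq)]
enum FlightStatus {
    Done(String),
    Timeout,
    Error,
}

/// Hypothetical per-key slot shared between a leader and its waiters.
struct Flight {
    result: Mutex<Option<String>>,
    ready: Condvar,
}

/// Waits for the leader's result without panicking on poison or stalling forever.
fn wait_for_leader(flight: &Flight, timeout: Duration) -> FlightStatus {
    // A poisoned lock means some thread panicked while holding it.
    let guard = match flight.result.lock() {
        Ok(guard) => guard,
        Err(_) => return FlightStatus::Error,
    };
    // Sleep until a result is present or the timeout elapses.
    match flight
        .ready
        .wait_timeout_while(guard, timeout, |result| result.is_none())
    {
        Ok((_, wait)) if wait.timed_out() => FlightStatus::Timeout,
        Ok((guard, _)) => match &*guard {
            Some(value) => FlightStatus::Done(value.clone()),
            None => FlightStatus::Error,
        },
        // The mutex was poisoned while this thread was waiting.
        Err(_) => FlightStatus::Error,
    }
}

fn main() {
    let flight = Arc::new(Flight {
        result: Mutex::new(None),
        ready: Condvar::new(),
    });

    // Simulate a leader that panics while holding the lock, which poisons the mutex.
    // The unwrap here is part of the failure simulation only.
    let leader = Arc::clone(&flight);
    let handle = thread::spawn(move || {
        let _guard = leader.result.lock().unwrap();
        panic!("simulated leader failure");
    });
    let _ = handle.join();

    // The waiter reports an error status instead of propagating the panic.
    let status = wait_for_leader(&flight, WAIT_TIMEOUT);
    println!("{:?}", status);
}
```

Because the poisoned lock is detected before any waiting starts, the waiter in this sketch returns immediately even though it was given the full 60-second budget.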
Consequences
- Positive: Significantly improved system stability. A single failing request can no longer crash the entire node.
- Positive: Correctness and transparency. Users are immediately notified of storage-level issues.
- Positive: Robust distributed invalidation by preventing invalid characters in tags.
- Negative: Small performance overhead for character validation (nanoseconds).
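The tag validation behind the last two consequences could look roughly like the sketch below. The function names, the `u64` version type, and the rejection of empty tags are assumptions made for illustration. The ADR specifies only the allowed character set, the `InvalidTag` name, and the `tag|version` format. The check is a single pass over the tag's bytes, which is consistent with a per-tag cost that stays very small.

```rust
use std::fmt;

/// Error mirroring the `InvalidTag` exception named in this ADR.
#[derive(Debug, PartialEq)]
struct InvalidTag(String);

impl fmt::Display for InvalidTag {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "invalid tag: {:?}", self.0)
    }
}

/// Accepts only characters from [a-zA-Z0-9_:].
/// Rejecting the empty string is an assumption, not stated in the ADR.
fn validate_tag(tag: &str) -> Result<(), InvalidTag> {
    let valid = !tag.is_empty()
        && tag
            .bytes()
            .all(|b| b.is_ascii_alphanumeric() || b == b'_' || b == b':');
    if valid {
        Ok(())
    } else {
        Err(InvalidTag(tag.to_string()))
    }
}

/// Hypothetical encoder for the `tag|version` bus message.
fn encode(tag: &str, version: u64) -> Result<String, InvalidTag> {
    validate_tag(tag)?;
    Ok(format!("{}|{}", tag, version))
}

/// Hypothetical decoder; unambiguous because a valid tag cannot contain '|'.
fn decode(message: &str) -> Option<(&str, u64)> {
    let (tag, version) = message.split_once('|')?;
    Some((tag, version.parse().ok()?))
}

fn main() {
    let message = encode("user:42", 7).expect("tag uses only allowed characters");
    println!("{:?}", decode(&message));

    // A tag containing the separator is rejected before it reaches the bus.
    println!("{:?}", encode("user|42", 7));
}
```

Validating at the encoding boundary means the decoder can split on the first `|` without any escaping scheme.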