hi, i'm charles
always open to talk · discord · twitter · email · site
infra oriented engineer across fintech, web3, and tooling
building to solve. focused on correctness, auditability, automation, efficiency
currently working on low latency infra, compliance systems, evaluation harnesses
polyedge · fintech infrastructure · prediction market platform · hardening paid tiers
tokio-powered data plane axum + tower parquet react · vercel · cloudflare
fortifai · cognitive infrastructure · LLM-proctored harness · hardening falsifiability
discord interfaced sqlite claude prompt-cached streaming agnostic pipeline
___
fortifai · self-audit loop · streak 1d
self-audit: scenario-based time-pressured recall, cross-domain breadth, b3-calibrated
invariant: zero outside assistance. no docs, no ai, no peers. 10m/response, 5m/single refinement
breadth: systems/distributed, backend, sre, ml, ai/llm, frontend, data, security
bar: consistent ≥3 across all 8 swe fields
industry swe updated 2026-06-06
scope cross-domain · grab-bag duration 43m 39s
calibration b3 "practitioner" rotation bias underindexed-weighted
b₂ b3 b₄
q1 security secret-rotation ₃ 3 ₂
q2 systems-distributedheartbeat-failure ₅ 4 ₃
q3 sre rolling-update-surge ₂ 2 ₂
q4 backend soft-delete-vs-hard ₂ 2 ₂
q5 ml-engineering feature-normalization ₃ 2 ₂
strengths false-positive-failover-tradeoff · heartbeat-failure-detection · phi-accrual-detector
gaps deployment-capacity-headroom · feature-normalization-train-test-fit · maxsurge-maxunavailable-tradeoff · partial-unique-index
score (dreyfus) 1 (novice) → 3 (competent) → 5 (mastered)
band (swecom) b1 (technician) → b3 (practitioner) → b5 (principal)
q1 · security · secret-rotation-mechanism · pre 2 → post 3 · ceiling — · transitional b1–b3
Scenario: A SaaS platform, 'LedgerVault', stores customer financial documents encrypted at rest in object storage. The security team currently encrypts every object directly with a single master key held in a KMS. They want to support periodic master-key rotation (e.g., quarterly) WITHOUT having to re-encrypt and rewrite every object in the bucket. A junior engineer proposes 'just rotate the KMS key and re-encrypt everything in a nightly batch job.' Explain why envelope encryption (a per-object data encryption key wrapped by the master key) is the standard mechanism here, what specifically it lets you rotate cheaply versus what it does NOT rotate, and what the re-encrypt-everything approach gives up. Be concrete about which ciphertext changes on rotation and which does not.
Refinement: You said 'cheap rotations of the encrypted secret key'. Clarify: which specific ciphertext artifact changes on a master-key rotation under envelope encryption, and which ciphertext artifact remains byte-for-byte identical to what was stored before the rotation?
Assessment: The original response inverted the mechanism — claiming per-object keys rotate while the master-key ciphertext stays fixed, and importing irrelevant signing/hash-round vocabulary. The refinement probe corrected the inversion: the answerer now places the changing artifact on the master-key/KMS side and the unchanged bulk ciphertext downstream, which is the right direction. The remaining gap is precision about which artifact re-wraps: it is the small wrapped-DEK blob that changes, the DEK plaintext value and the payload ciphertext do not, and the answer does not yet state why that makes rotation cheap or what the re-encrypt path actually buys (DEK/payload rotation after compromise).
Literature
- [remediation] AWS Key Management Service Developer Guide — Envelope Encryption — §Envelope encryption and §Rotating AWS KMS keys — the specific subsection on how automatic key rotation re-wraps the data key without re-encrypting data — ~20m
- [remediation] Cryptography Engineering — Ch. 21 §Key Management — key hierarchies, key-encryption keys vs data-encryption keys, and what rotation at each layer does and does not invalidate — ~1h 0m
q2 · systems-distributed · heartbeat-failure-detection · pre 3 → post 4 · ceiling b3 · transitional b4–b5
Scenario: A distributed coordination service runs 5 nodes that detect each other's liveness via periodic heartbeats. Operators tune two parameters: heartbeat interval and the timeout after which a peer is declared dead. During a brief network hiccup (200ms of packet loss), the cluster declared a healthy node dead and triggered an unnecessary failover, causing a leadership churn. Explain the tradeoff governed by the timeout value: what you give up by setting it aggressively low versus conservatively high. Then explain why a fixed timeout is a blunt instrument under variable network latency, and describe at a mechanism level how an adaptive (phi-accrual-style) detector improves on it — what it measures and what it outputs instead of a binary alive/dead.
Refinement: You said 'it measures multiple signals like throughput, calls, resource utilization, etc'. Clarify: what specific statistical property of heartbeat arrival times does a phi-accrual detector compute, and what does it output in place of a binary threshold comparison?
Assessment: Pre-refinement the answer nailed the timeout tradeoff and correctly argued why a fixed value is blunt under jitter, but misidentified the adaptive mechanism as load-driven timeout adjustment. The refinement probe moved it onto the right track — it recognized that the detector builds a baseline from the statistical history of heartbeat inter-arrival times over a sliding window and emits a graded health signal instead of a binary verdict. The remaining gap is the precise statistical object the detector outputs and how a consumer acts on it; the reader should pin down exactly what 'phi' is computed from and what it represents before consuming it.
Literature
- [remediation] The Phi Accrual Failure Detector — §3 'The φ Accrual Failure Detector' and §4 'Implementation' — the inter-arrival sampling window, normal-distribution estimation, and the φ = -log10(P_later(t)) output — ~40m
- [growth] Designing Data-Intensive Applications, Ch. 8 §The Truth Is Defined by the Majority / §Knowledge, Truth, and Lies — Connects this question's accrual-detector mechanism to the broader design-space framing: how unreliable detection couples to leader leases, fencing tokens, and why no detector can be both perfectly complete and perfectly accurate over an asynchronous network. — ~1h
q3 · sre · rolling-update-surge-unavailable · pre 2 → post 2 · ceiling — · transitional b1
Scenario: A team runs a stateless API on Kubernetes with a Deployment of 10 replicas sized to run at ~80% CPU at peak. They configure a RollingUpdate with maxSurge=0 and maxUnavailable=2 and notice that deploys during peak traffic cause elevated latency and some 503s. Explain the mechanism: what maxSurge and maxUnavailable each control during a rolling update, and why this specific combination reduces serving capacity mid-deploy. Then compare the tradeoff of flipping to maxSurge=2, maxUnavailable=0 — what it costs and what it buys — and state the precondition under which the maxSurge approach can still fail to protect capacity.
Refinement: You said 'surge has to handle that 20% flawlessly'. Clarify: what cluster-level resource condition determines whether the extra surge pods can be scheduled at all, independent of CPU headroom on existing nodes?
Assessment: The answer never establishes what maxSurge and maxUnavailable actually control during a Deployment rollout — it treats maxSurge as utilization-spike protection and maxUnavailable as a node-health threshold, so the central capacity-loss mechanism (terminating ready replicas before replacements exist, dropping an 80%-CPU fleet to a level that saturates survivors) is not derived. The refinement probe pointed directly at the scheduling precondition (allocatable cluster headroom for the surge pods, independent of per-node CPU), and the response guessed network throughput/queue health instead, leaving the B3 boundedness 'why' unmet. The gaps are in (1) the rollout-parameter semantics and (2) the cluster-scheduling resource that gates surge-pod admission.
Literature
- [remediation] Kubernetes Documentation — Deployments: Rolling Update Strategy — §Rolling Update Deployment and §Max Unavailable / Max Surge — the maxUnavailable/maxSurge parameter pair and how the controller terminates old / creates new ReplicaSet pods during a rollout — ~20m
- [remediation] Kubernetes Documentation — Scheduling: Resource Requests and Node Allocatable — §Node Allocatable and Resource Management for Pods and Containers §How Pods with resource requests are scheduled — why a surge pod stays Pending when no node has allocatable CPU/memory to satisfy its requests, independent of live CPU utilization — ~25m
q4 · backend · soft-delete-vs-hard-delete · pre 2 → post 2 · ceiling — · transitional b1
Scenario: A backend service for 'TaskHub' models users with a deleted_at timestamp column for soft deletes (NULL = active). They have a UNIQUE constraint on email. A user deletes their account, then a new person tries to register with that same email — and the insert fails on the unique constraint, even though the old row is 'deleted'. Separately, an analytics query accidentally counted deleted users because it forgot a WHERE deleted_at IS NULL clause. Explain the mechanism behind both problems (why the unique constraint still bites, and why soft delete leaks into queries), then give the database-level fix for the uniqueness problem using a partial unique index — state precisely what predicate the index carries and why that resolves the collision while still preventing two ACTIVE users from sharing an email. Name one tradeoff soft delete imposes that hard delete does not.
Refinement: You said 'The predicate the index carries is a check on the null column, to determine state at query time'. Clarify: what is the exact SQL predicate expression on that index, and why does a NULL value in deleted_at satisfy or not satisfy it?
Assessment: The answer landed on the right tool (partial unique index) and the right framing (separating active from deleted rows) but never produced the gating mechanism: the index predicate and why it excludes soft-deleted rows from uniqueness enforcement while still catching two active rows. The explanation of why the original constraint bites was built on an incorrect lock/PK model rather than index coverage of all rows. The refinement asked directly for the predicate expression and the NULL semantics, and the answerer honestly conceded they did not know it and then described NULL incorrectly as a SELECT no-op — confirming the mechanism is not held. The gap is in how a unique constraint is a uniqueness predicate over indexed rows and how a filtered index changes the row set under enforcement.
Literature
- [remediation] PostgreSQL Documentation — CREATE INDEX (Partial Indexes) — §11.8 Partial Indexes — the example 'CREATE UNIQUE INDEX ... WHERE ...' enforcing uniqueness over a subset of rows (one focused section) — ~15m
- [remediation] PostgreSQL Documentation — Index Uniqueness Checks & Unique Indexes — §CREATE INDEX — UNIQUE clause and NULL handling: how a unique index treats every indexed row and how NULLs interact with uniqueness (one focused subsection) — ~15m
q5 · ml-engineering · feature-normalization-train-test-fit · pre 2 → post 2 · ceiling — · transitional b1–b2
Scenario: An ML engineer building a churn classifier standardizes features (subtract mean, divide by std) before splitting into train and test sets — they call StandardScaler.fit_transform on the full dataset, then split. Validation accuracy looks great but production performance is worse than expected. Explain the mechanism of the leakage: what statistic crosses the train/test boundary when you fit the scaler on the full dataset, and why that inflates the validation metric. State the correct procedure (fit on train, transform train and test with the train-derived parameters) and explain why fitting on train only is the faithful simulation of production. Then identify why this same leakage is more dangerous inside k-fold cross-validation than in a single split, and name the mechanism that fixes it across folds.
Refinement: You said 'on training only is the ideal way, you want to seperate the test from the training and have the test be a complete black box/no alterations or bias in the dataset'. Clarify: what specific statistic computed from the test set's values contaminates the train-derived model when fit_transform is called on the full dataset before splitting, and how does that statistic's presence in the scaler parameters cause the validation metric to diverge from production performance?
Assessment: The answer correctly identifies that this is preprocessing leakage and that something about the mean crosses the train/test boundary, but never specifies that the test set's per-feature mean and standard deviation are aggregated into the scaler parameters and then applied to the training features — the precise statistic the question and refinement both asked for. The refinement did not close this gap; the answerer drifted into a separate and incorrect claim that fit_transform corrupts the random sampling/splitting itself. The correct fit-on-train / transform-with-train-parameters procedure is stated only as a vague 'isolate the test set', and the across-fold fix (encapsulating the scaler in a Pipeline so it re-fits per fold) is explicitly unknown to the answerer. The path to the answer is understanding that any preprocessing step must derive its parameters only from data available at training time, and how cross-validation amplifies the violation.
Literature
- [remediation] Designing Machine Learning Systems — Ch. 5 §Data Leakage — specifically the 'scaling before splitting' worked example and the rule that statistics must be computed on the training split only — ~45m
- [remediation] scikit-learn User Guide — Pipelines and composite estimators / Common pitfalls and recommended practices — §'Data leakage' and §'How to avoid data leakage' — the Pipeline-inside-cross_val_score pattern that re-fits the scaler per fold — ~30m
widget regenerated from fortifai's data/session.json via profile_widget.py



