NAME

ralph — the KAHN retry loop. Exponential backoff scheduler for failed done_when conditions.

DESCRIPTION

Ralph retries nodes that fail their done_when condition. It waits with exponential backoff between attempts, enforces a configurable per-node budget (maximum iterations), and handles interrupts cleanly.

Every failed node evaluation increments that node's retry counter. Once the counter reaches the budget, the node is marked as failed and its dependents are notified.
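
The control flow described above can be sketched in a few lines of Python. This is a minimal illustration, not Ralph's actual code: retry_until_done, evaluate, delay_ms and sleep are hypothetical names, and the sketch assumes the delay for iteration N is waited out before evaluation N runs.

```python
import time
from typing import Callable


def retry_until_done(
    evaluate: Callable[[], bool],        # hypothetical: returns True once done_when holds
    delay_ms: Callable[[int], float],    # hypothetical: backoff delay for a 1-based iteration
    budget: int = 10,                    # maximum iterations for this node
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the node converged within its budget, False otherwise."""
    for iteration in range(1, budget + 1):
        # Assumption: wait first, then evaluate (iteration 1 waits 0 ms).
        sleep(delay_ms(iteration) / 1000.0)
        if evaluate():
            return True
    # Budget exhausted: the caller would mark the node as failed.
    return False
```

Passing sleep as a parameter keeps the sketch testable without real waiting. An interrupt such as KeyboardInterrupt propagates out of the loop unchanged, which is one simple way to leave cleanup to the caller.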

BACKOFF STRATEGY

iteration 1:  0ms   (immediate)
iteration 2:  100ms
iteration 3:  200ms
iteration 4:  400ms
iteration 5:  800ms
iteration 6:  1600ms
iteration 7:  3200ms
...
iteration N:  min(100ms * 2^(N-2), 60s)  # for N >= 2; capped at 60 seconds

All delays are subject to jitter (±10%) to avoid a thundering herd during mass retries.
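
As a rough illustration of the schedule above, the following Python sketch computes the nominal and jittered delay for a given iteration. The function and constant names are hypothetical. Drawing the jitter factor uniformly from [0.9, 1.1] and applying it after the 60 second cap are both assumptions; the text only states ±10%.

```python
import random
from typing import Optional

BASE_MS = 100      # delay before iteration 2, taken from the table
CAP_MS = 60_000    # 60 second ceiling
JITTER = 0.10      # +/-10%


def nominal_delay_ms(iteration: int) -> int:
    """Delay before the given 1-based iteration, without jitter."""
    if iteration < 1:
        raise ValueError("iteration is 1-based")
    if iteration == 1:
        return 0  # the first evaluation runs immediately
    return min(BASE_MS * 2 ** (iteration - 2), CAP_MS)


def jittered_delay_ms(iteration: int, rng: Optional[random.Random] = None) -> float:
    """Nominal delay scaled by a uniform factor in [0.9, 1.1] (assumed distribution)."""
    rng = rng or random.Random()
    return nominal_delay_ms(iteration) * rng.uniform(1 - JITTER, 1 + JITTER)
```

Under this formula the nominal delay for iteration 10 is 25600 ms and the cap first applies at iteration 12, so with the default budget of 10 iterations the 60 second ceiling only matters when the budget is raised.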

BUDGET TRACKING

Each node has a retry budget (default: 10 iterations). Once a node has failed that many evaluations, Ralph marks it as FAILED and logs the reason. The run continues with dependent nodes if possible.

Budgets are per-node and per-run. They reset when a new run begins.
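
One way to picture the bookkeeping is a per-run tracker that holds one failure counter per node. The sketch below is hypothetical (RetryBudget, record_failure and reset are illustrative names, not part of Ralph) and assumes a node is exhausted as soon as its failure count reaches the budget.

```python
from dataclasses import dataclass, field

DEFAULT_BUDGET = 10  # default iterations per node, as stated above


@dataclass
class RetryBudget:
    """Hypothetical per-run tracker: one failure counter per node name."""
    budget: int = DEFAULT_BUDGET
    failures: dict = field(default_factory=dict)

    def record_failure(self, node: str) -> bool:
        """Count one failed evaluation; return True once the node is exhausted."""
        self.failures[node] = self.failures.get(node, 0) + 1
        return self.failures[node] >= self.budget

    def reset(self) -> None:
        """Start of a new run: every node gets its full budget back."""
        self.failures.clear()
```

Because counters are keyed by node, one flaky node cannot consume another node's budget, and calling reset at the start of a run gives the per-run behaviour described above.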

DONE_WHEN EVALUATION

Ralph executes a shell expression (the node's done_when field) in the node's working directory. The expression must exit with code 0 to indicate success.

done_when: "cargo test && ./scripts/smoke.sh"
done_when: "[ -f build/output.txt ]"
done_when: "curl -fsS http://localhost:8080/health | jq -r .status | grep -qx online"

If the expression exits with a non-zero code, Ralph logs the failure and schedules a retry after the backoff delay.
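
The evaluation step can be approximated with Python's standard subprocess module. This is only a sketch: done_when_satisfied is a hypothetical name, and which shell Ralph actually invokes is not stated here. On POSIX systems subprocess.run with shell=True uses /bin/sh.

```python
import subprocess


def done_when_satisfied(expression: str, workdir: str) -> bool:
    """Run the expression through the shell in workdir; exit code 0 means done."""
    # shell=True hands the whole string to the shell, so && and | work as written.
    result = subprocess.run(expression, shell=True, cwd=workdir)
    return result.returncode == 0
```

Under a plain POSIX shell, the exit status of a pipeline is that of its last command, so in a pipeline like the curl example above it is grep's exit code that decides whether the node counts as done.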

EXIT CODES

0
    all nodes converged (all done_when satisfied)
1
    at least one node exhausted its budget without converging

SEE ALSO

kahn(1)
    main orchestrator command
kahn.tools
    architecture writeups and design rationale

COLOPHON

Ralph is the retry loop component of the KAHN orchestrator. Named after Ralph Langley, because determination and persistence in the face of obstacles are a feature, not a bug. See https://kahn.tools for details.