author | Kent Overstreet <kent.overstreet@linux.dev> | 2025-02-26 18:44:23 -0500 |
---|---|---|
committer | Kent Overstreet <kent.overstreet@linux.dev> | 2025-03-14 21:02:16 -0400 |
commit | 981e3801443f507d74e2dae5710452642c96e8e3 (patch) | |
tree | af7f2560bc7e587b7da33438592226483f68e723 /tools/perf/scripts/python/export-to-postgresql.py | |
parent | d71e023376d3e56bf2a787c9b5d2600a2db2aabf (diff) |
bcachefs: Kick devices out after too many write IO errors
We're improving our handling of write errors - we shouldn't write
degraded data just because a write failed once; we should retry it (on
other devices, if possible).
But for this to work, we need to kick devices out when they're only
returning errors - otherwise those retries will loop infinitely.
This adds a configurable timeout - if writes are failing for too long,
we'll set that device read-only.
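As a rough illustration of the timeout idea, here is a minimal standalone sketch in C. It is not the code from this patch: the struct, the function name, the use of seconds, and the convention that a timeout of 0 disables the check are all assumptions made for the example. It records when a run of failed writes began, clears that record on any successful write, and marks the device read-only once the run has lasted at least the timeout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-device state; these names are illustrative, not bcachefs's. */
struct dev_write_state {
	bool     failing;          /* true while a run of failed writes is in progress */
	uint64_t first_error_time; /* time (seconds) of the first failure in the run */
	bool     read_only;        /* set once the device has been kicked out */
};

/*
 * Call on every write completion.
 *
 * @failed:  whether this write returned an error
 * @now:     current time in seconds (caller supplied, so the logic is testable)
 * @timeout: how long writes may keep failing; 0 disables the check
 *           (an assumption of this sketch)
 *
 * Returns true only on the call that flips the device to read-only.
 */
static bool dev_write_complete(struct dev_write_state *d, bool failed,
			       uint64_t now, uint64_t timeout)
{
	if (!failed) {
		/* Any successful write ends the failing run. */
		d->failing = false;
		return false;
	}

	if (d->read_only)
		return false;

	if (!d->failing) {
		/* First failure of a new run: just remember when it started. */
		d->failing = true;
		d->first_error_time = now;
		return false;
	}

	if (timeout && now - d->first_error_time >= timeout) {
		/* Writes have been failing for too long: stop retrying here. */
		d->read_only = true;
		return true;
	}

	return false;
}
```

Because a single success resets the run, a device that fails only occasionally is never kicked out by this check alone, while a device that returns nothing but errors is set read-only after the timeout and the retries stop looping on it.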
In the future we should also implement more tracking and another knob
for an "allowed error rate", so that we can kick out drives that are
acting "unhealthy".
Another thing we'll want is a mechanism (likely in userspace) for
bringing a device back in after a transient error - perhaps a cable was
jiggled, or there was a controller reset.
After transient errors we also need a mechanism to walk (from the
journal) recent btree updates that weren't flushed to that device and
treat them as "degraded", since unflushed data may well not have been
written. Out of scope for this patch, but becoming relevant.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Diffstat (limited to 'tools/perf/scripts/python/export-to-postgresql.py')
0 files changed, 0 insertions, 0 deletions