summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)Author
2016-12-08dm table: an 'all_blk_mq' table must be loaded for a blk-mq DM deviceBart Van Assche
When dm_table_set_type() is used by a target to establish a DM table's type (e.g. DM_TYPE_MQ_REQUEST_BASED in the case of DM multipath) the DM core must go on to verify that the devices in the table are compatible with the established type. Fixes: e83068a5 ("dm mpath: add optional "queue_mode" feature") Cc: stable@vger.kernel.org # 4.8+ Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-12-08dm table: fix 'all_blk_mq' inconsistency when an empty table is loadedMike Snitzer
An earlier DM multipath table could have been build ontop of underlying devices that were all using blk-mq. In that case, if that active multipath table is replaced with an empty DM multipath table (that reflects all paths have failed) then it is important that the 'all_blk_mq' state of the active table is transfered to the new empty DM table. Otherwise dm-rq.c:dm_old_prep_tio() will incorrectly clone a request that isn't needed by the DM multipath target when it is to issue IO to an underlying blk-mq device. Fixes: e83068a5 ("dm mpath: add optional "queue_mode" feature") Cc: stable@vger.kernel.org # 4.8+ Reported-by: Bart Van Assche <bart.vanassche@sandisk.com> Tested-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-12-08md/r5cache: after recovery, increase journal seq by 10000Song Liu
Currently, we increase journal entry seq by 10 after recovery. However, this is not sufficient in the following case. After crash the journal looks like | seq+0 | +1 | +2 | +3 | +4 | +5 | +6 | +7 | ... | +11 | +12 | If +1 is not valid, we dropped all entries from +1 to +12; and write seq+10: | seq+0 | +10 | +2 | +3 | +4 | +5 | +6 | +7 | ... | +11 | +12 | However, if we write a big journal entry with seq+11, it will connect with some stale journal entry: | seq+0 | +10 | +11 | +12 | To reduce the risk of this issue, we increase seq by 10000 instead. Shaohua: use 10000 instead of 1000. The risk should be very unlikely. The total stripe cache size is less than 2k typically, and several stripes can fit into one meta data block. So the total inflight meta data blocks would be quite small, which means the the total sequence number used should be quite small. The 10000 sequence number increase should be far more than safe. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-12-08md/raid5-cache: fix crc in rewrite_data_only_stripes()Song Liu
r5l_recovery_create_empty_meta_block() creates crc for the empty metablock. After the metablock is updated, we need clear the checksum before recalculate it. Shaohua: moved checksum calculation out of r5l_recovery_create_empty_meta_block. We should calculate it after all fields are updated. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-12-08md/raid5-cache: no recovery is required when create super-blockJackieLiu
When create the super-block information, We do not need to do this recovery stage, only need to initialize some variables. Signed-off-by: JackieLiu <liuyun01@kylinos.cn> Reviewed-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-12-05md: fix refcount problem on mddev when stopping array.NeilBrown
md_open() gets a counted reference on an mddev using mddev_find(). If it ends up returning an error, it must drop this reference. There are two error paths where the reference is not dropped. One only happens if the process is signalled and an awkward time, which is quite unlikely. The other was introduced recently in commit af8d8e6f0. Change the code to ensure the drop the reference when returning an error, and make it harded to re-introduce this sort of bug in the future. Reported-by: Marc Smith <marc.smith@mcc.edu> Fixes: af8d8e6f0315 ("md: changes for MD_STILL_CLOSED flag") Signed-off-by: NeilBrown <neilb@suse.com> Acked-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-12-05md/r5cache: do r5c_update_log_state after log recoveryZhengyuan Liu
We should update log state after we did a log recovery, current completion may get wrong log state since log->log_start wasn't initalized until we called r5l_recovery_log. At log recovery stage, no lock needed as there is no race conditon. next_checkpoint field will be initialized in r5l_recovery_log too. Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>
2016-12-05md/raid5-cache: adjust the write position of the empty block if no data blocksJackieLiu
When recovery is complete, we write an empty block and record his position first, then make the data-only stripes rewritten done, the location of the empty block as the last checkpoint position to write into the super block. And we should update last_checkpoint to this empty block position. ------------------------------------------------------------------ | old log | empty block | data only stripes | invalid log | ------------------------------------------------------------------ ^ ^ ^ | |- log->last_checkpoint |- log->log_start | |- log->last_cp_seq |- log->next_checkpoint |- log->seq=n |- log->seq=10+n At the same time, if there is no data-only stripes, this scene may appear, | meta1 | meta2 | meta3 | meta 1 is valid, meta 2 is invalid. meta 3 could be valid. so we should The solution is we create a new meta in meta2 with its seq == meta1's seq + 10 and let superblock points to meta2. Signed-off-by: JackieLiu <liuyun01@kylinos.cn> Reviewed-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Reviewed-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-12-02md/r5cache: run_no_space_stripes() when R5C_LOG_CRITICAL == 0Song Liu
With writeback cache, we define log space critical as free_space < 2 * reclaim_required_space So the deassert of R5C_LOG_CRITICAL could happen when 1. free_space increases 2. reclaim_required_space decreases Currently, run_no_space_stripes() is called when 1 happens, but not (always) when 2 happens. With this patch, run_no_space_stripes() is call when R5C_LOG_CRITICAL is cleared. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-29md/raid5: limit request size according to implementation limitsKonstantin Khlebnikov
Current implementation employ 16bit counter of active stripes in lower bits of bio->bi_phys_segments. If request is big enough to overflow this counter bio will be completed and freed too early. Fortunately this not happens in default configuration because several other limits prevent that: stripe_cache_size * nr_disks effectively limits count of active stripes. And small max_sectors_kb at lower disks prevent that during normal read/write operations. Overflow easily happens in discard if it's enabled by module parameter "devices_handle_discard_safely" and stripe_cache_size is set big enough. This patch limits requests size with 256Mb - 8Kb to prevent overflows. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Shaohua Li <shli@kernel.org> Cc: Neil Brown <neilb@suse.com> Cc: stable@vger.kernel.org Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-29md/raid5-cache: do not need to set STRIPE_PREREAD_ACTIVE repeatedlyJackieLiu
R5c_make_stripe_write_out has set this flag, do not need to set again. Signed-off-by: JackieLiu <liuyun01@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-29md/raid5-cache: remove the unnecessary next_cp_seq field from the r5l_logJackieLiu
The next_cp_seq field is useless, remove it. Signed-off-by: JackieLiu <liuyun01@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-29md/raid5-cache: release the stripe_head at the appropriate locationJackieLiu
If we released the 'stripe_head' in r5c_recovery_flush_log, ctx->cached_list will both release the data-parity stripes and data-only stripes, which will become empty. And we also need to use the data-only stripes in r5c_recovery_rewrite_data_only_stripes, so we should wait util rewrite data-only stripes is done before releasing them. Reviewed-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Reviewed-by: Song Liu <songliubraving@fb.com> Signed-off-by: JackieLiu <liuyun01@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-29md/raid5-cache: use ring add to prevent overflowJackieLiu
'write_pos' must be protected with 'r5l_ring_add', or it may overflow Signed-off-by: JackieLiu <liuyun01@kylinos.cn> Reviewed-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-29md/raid5-cache: remove unnecessary function parametersJackieLiu
The function parameter 'recovery_list' is not used in body, we can delete it Signed-off-by: JackieLiu <liuyun01@kylinos.cn> Reviewed-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-29raid5-cache: don't set STRIPE_R5C_PARTIAL_STRIPE flag while load stripe into ↵Zhengyuan Liu
cache r5c_recovery_load_one_stripe should not set STRIPE_R5C_PARTIAL_STRIPE flag,as the data-only stripe may be STRIPE_R5C_FULL_STRIPE stripe. The state machine would release the stripe later and add it into neither r5c_cached_full_stripes list or r5c_cached_partial_stripes list and set correct flag. Reviewed-by: JackieLiu <liuyun01@kylinos.cn> Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-29raid5-cache: add another check conditon before replaying one stripeZhengyuan Liu
New stripe that was just allocated has no STRIPE_R5C_CACHING state too, add this check condition could avoid unnecessary replaying for empty stripe. r5l_recovery_replay_one_stripe would reset stripe for any case, delete it to make code more clean. Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-27md/r5cache: enable IRQs on error pathDan Carpenter
We need to re-enable the IRQs here before returning. Fixes: a39f7afde358 ("md/r5cache: write-out phase and reclaim support") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-27md/r5cache: handle alloc_page failureSong Liu
RMW of r5c write back cache uses an extra page to store old data for prexor. handle_stripe_dirtying() allocates this page by calling alloc_page(). However, alloc_page() may fail. To handle alloc_page() failures, this patch adds an extra page to disk_info. When alloc_page fails, handle_stripe() trys to use these pages. When these pages are used by other stripe (R5C_EXTRA_PAGE_IN_USE), the stripe is added to delayed_list. Signed-off-by: Song Liu <songliubraving@fb.com> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-23md: stop write should stop journal reclaimShaohua Li
__md_stop_writes currently doesn't stop raid5-cache reclaim thread. It's possible the reclaim thread is still running and doing write, which doesn't match what __md_stop_writes should do. The extra ->quiesce() call should not harm any raid types. For raid5-cache, this will guarantee we reclaim all caches before we update superblock. Signed-off-by: Shaohua Li <shli@fb.com> Reviewed-by: NeilBrown <neilb@suse.de> Cc: Song Liu <songliubraving@fb.com>
2016-11-23raid5-cache: suspend reclaim thread instead of shutdownShaohua Li
There is mechanism to suspend a kernel thread. Use it instead of playing create/destroy game. Signed-off-by: Shaohua Li <shli@fb.com> Reviewed-by: NeilBrown <neilb@suse.de> Cc: Song Liu <songliubraving@fb.com>
2016-11-22md/raid10: add failfast handling for writes.NeilBrown
When writing to a fastfail device, we use MD_FASTFAIL unless it is the only device being written to. For resync/recovery, assume there was a working device to read from so always use MD_FASTFAIL. If a write for resync/recovery fails, we just fail the device - there is not much else to do. If a normal write fails, but the device cannot be marked Faulty (must be only one left), we queue for write error handling which calls narrow_write_error() to write the block synchronously without any failfast flags. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-22md/raid10: add failfast handling for reads.NeilBrown
If a device is marked FailFast, and it is not the only device we can read from, we mark the bio as MD_FAILFAST. If this does fail-fast, we don't try read repair but just allow failure. If it was the last device, it doesn't get marked Faulty so the retry happens on the same device - this time without FAILFAST. A subsequent failure will not retry but will just pass up the error. During resync we may use FAILFAST requests, and on a failure we will simply use the other device(s). During recovery we will only use FAILFAST in the unusual case were there are multiple places to read from - i.e. if there are > 2 devices. If we get a failure we will fail the device and complete the resync/recovery with remaining devices. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-22md/raid1: add failfast handling for writes.NeilBrown
When writing to a fastfail device we use MD_FASTFAIL unless it is the only device being written to. For resync/recovery, assume there was a working device to read from so always use REQ_FASTFAIL_DEV. If a write for resync/recovery fails, we just fail the device - there is not much else to do. If a normal failfast write fails, but the device cannot be failed (must be only one left), we queue for write error handling. This will call narrow_write_error() to retry the write synchronously and without any FAILFAST flags. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-22md/raid1: add failfast handling for reads.NeilBrown
If a device is marked FailFast and it is not the only device we can read from, we mark the bio with REQ_FAILFAST_* flags. If this does fail, we don't try read repair but just allow failure. If it was the last device it doesn't fail of course, so the retry happens on the same device - this time without FAILFAST. A subsequent failure will not retry but will just pass up the error. During resync we may use FAILFAST requests and on a failure we will simply use the other device(s). During recovery we will only use FAILFAST in the unusual case were there are multiple places to read from - i.e. if there are > 2 devices. If we get a failure we will fail the device and complete the resync/recovery with remaining devices. The new R1BIO_FailFast flag is set on read reqest to suggest the a FAILFAST request might be acceptable. The rdev needs to have FailFast set as well for the read to actually use REQ_FAILFAST_*. We need to know there are at least two working devices before we can set R1BIO_FailFast, so we mustn't stop looking at the first device we find. So the "min_pending == 0" handling to not exit early, but too always choose the best_pending_disk if min_pending == 0. The spinlocked region in raid1_error() in enlarged to ensure that if two bios, reading from two different devices, fail at the same time, then there is no risk that both devices will be marked faulty, leaving zero "In_sync" devices. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-22md: Use REQ_FAILFAST_* on metadata writes where appropriateNeilBrown
This can only be supported on personalities which ensure that md_error() never causes an array to enter the 'failed' state. i.e. if marking a device Faulty would cause some data to be inaccessible, the device is status is left as non-Faulty. This is true for RAID1 and RAID10. If we get a failure writing metadata but the device doesn't fail, it must be the last device so we re-write without FAILFAST to improve chance of success. We also flag the device as LastDev so that future metadata updates don't waste time on failfast writes. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-22md/failfast: add failfast flag for md to be used by some personalities.NeilBrown
This patch just adds a 'failfast' per-device flag which can be stored in v0.90 or v1.x metadata. The flag is not used yet but the intent is that it can be used for mirrored (raid1/raid10) arrays where low latency is more important than keeping all devices on-line. Setting the flag for a device effectively gives permission for that device to be marked as Faulty and excluded from the array on the first error. The underlying driver will be directed not to retry requests that result in failures. There is a proviso that the device must not be marked faulty if that would cause the array as a whole to fail, it may only be marked Faulty if the array remains functional, but is degraded. Failures on read requests will cause the device to be marked as Faulty immediately so that further reads will avoid that device. No attempt will be made to correct read errors by over-writing with the correct data. It is expected that if transient errors, such as cable unplug, are possible, then something in user-space will revalidate failed devices and re-add them when they appear to be working again. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-22bcache: debug: avoid accessing .bi_io_vec directlyMing Lei
Instead we use standard iterator way to do that. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-22block: bio: pass bvec table to bio_init()Ming Lei
Some drivers often use external bvec table, so introduce this helper for this case. It is always safe to access the bio->bi_io_vec in this way for this case. After converting to this usage, it will becomes a bit easier to evaluate the remaining direct access to bio->bi_io_vec, so it can help to prepare for the following multipage bvec support. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Fixed up the new O_DIRECT cases. Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-21dm mpath: do not modify *__clone if blk_mq_alloc_request() failsBart Van Assche
Purely cleanup, avoids potential for strange coding bugs. But in reality if __multipath_map() fails the caller has no business looking at *__clone. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-11-21dm mpath: change return type of pg_init_all_paths() from int to voidBart Van Assche
None of the callers of pg_init_all_paths() check its return value. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-11-21dm mpath: add checks for priority group count to avoid invalid memory accesstang.junhui
This avoids the potential for invalid memory access, if/when there are no priority groups, in response to invalid arguments being sent by the user via DM message (e.g. "switch_group", "disable_group" or "enable_group"). Signed-off-by: tang.junhui <tang.junhui@zte.com.cn> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-11-21dm mpath: add m->hw_handler_name NULL pointer check in parse_hw_handler()tang.junhui
Avoids false positive of no hardware handler being specified (which is implied by a NULL m->hw_handler_name). Signed-off-by: tang.junhui <tang.junhui@zte.com.cn> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-11-21dm flakey: return -EINVAL on interval bounds error in flakey_ctr()Wei Yongjun
Fix to return error code -EINVAL instead of 0, as is done elsewhere in this function. Fixes: e80d1c805a3b ("dm: do not override error code returned from dm_get_device()") Cc: stable@vger.kernel.org # 4.3+ Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-11-21dm crypt: constify crypt_iv_operations structuresJulia Lawall
The crypt_iv_operations are never modified, so declare them as const. Done with the help of Coccinelle. Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-11-21dm raid: correct error messages on old metadata validationHeinz Mauelshagen
When target 1.9.1 gets takeover/reshape requests on devices with old superblock format not supporting such conversions and rejects them in super_init_validation(), it logs bogus error message (e.g. Reshape when a takeover is requested). Whilst on it, add messages for disk adding/removing and stripe sectors reshape requests, use the newer rs_{takeover,reshape}_requested() API, address a raid10 false positive in checking array positions and remove rs_set_new() because device members are already set proper. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-11-21dm cache: add missing cache device name to DMERR in set_cache_mode()Mike Snitzer
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-11-21dm cache metadata: remove an extra newline in DMERR and codeMike Snitzer
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-11-21dm verity: fix incorrect error messageEric Biggers
Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-11-21dm crypt: rename crypt_setkey_allcpus to crypt_setkeyMikulas Patocka
In the past, dm-crypt used per-cpu crypto context. This has been removed in the kernel 3.15 and the crypto context is shared between all cpus. This patch renames the function crypt_setkey_allcpus to crypt_setkey, because there is really no activity that is done for all cpus. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-11-21dm crypt: mark key as invalid until properly loadedOndrej Kozina
In crypt_set_key(), if a failure occurs while replacing the old key (e.g. tfm->setkey() fails) the key must not have DM_CRYPT_KEY_VALID flag set. Otherwise, the crypto layer would have an invalid key that still has DM_CRYPT_KEY_VALID flag set. Cc: stable@vger.kernel.org Signed-off-by: Ondrej Kozina <okozina@redhat.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-11-21dm crypt: use bio_add_page()Ming Lei
Use bio_add_page(), the standard interface for adding a page to a bio, rather than open-coding the same. It should be noted that the 'clone' bio that is allocated using bio_alloc_bioset(), in crypt_alloc_buffer(), does _not_ set the bio's BIO_CLONED flag. As such, bio_add_page()'s early return for true bio clones (those with BIO_CLONED set) isn't applicable. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-11-21dm io: use bvec iterator helpers to implement .get_page and .next_pageMing Lei
Firstly we have mature bvec/bio iterator helper for iterate each page in one bio, not necessary to reinvent a wheel to do that. Secondly the coming multipage bvecs requires this patch. Also add comments about the direct access to bvec table. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-11-21dm rq: replace 'bio->bi_vcnt == 1' with !bio_multiple_segmentsMing Lei
Avoid accessing .bi_vcnt directly, because the bio can be split from block layer and .bi_vcnt should never have been used here. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-11-18md/r5cache: handle FLUSH and FUASong Liu
With raid5 cache, we committing data from journal device. When there is flush request, we need to flush journal device's cache. This was not needed in raid5 journal, because we will flush the journal before committing data to raid disks. This is similar to FUA, except that we also need flush journal for FUA. Otherwise, corruptions in earlier meta data will stop recovery from reaching FUA data. slightly changed the code by Shaohua Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18md/r5cache: r5cache recovery: part 2Song Liu
1. In previous patch, we: - add new data to r5l_recovery_ctx - add new functions to recovery write-back cache The new functions are not used in this patch, so this patch does not change the behavior of recovery. 2. In this patchpatch, we: - modify main recovery procedure r5l_recovery_log() to call new functions - remove old functions Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18md/r5cache: r5cache recovery: part 1Song Liu
Recovery of write-back cache has different logic to write-through only cache. Specifically, for write-back cache, the recovery need to scan through all active journal entries before flushing data out. Therefore, large portion of the recovery logic is rewritten here. To make the diffs cleaner, we split the rewrite as follows: 1. In this patch, we: - add new data to r5l_recovery_ctx - add new functions to recovery write-back cache The new functions are not used in this patch, so this patch does not change the behavior of recovery. 2. In next patch, we: - modify main recovery procedure r5l_recovery_log() to call new functions - remove old functions With cache feature, there are 2 different scenarios of recovery: 1. Data-Parity stripe: a stripe with complete parity in journal. 2. Data-Only stripe: a stripe with only data in journal (or partial parity). The code differentiate Data-Parity stripe from Data-Only stripe with flag STRIPE_R5C_CACHING. For Data-Parity stripes, we use the same procedure as raid5 journal, where all the data and parity are replayed to the RAID devices. For Data-Only strips, we need to finish complete calculate parity and finish the full reconstruct write or RMW write. For simplicity, in the recovery, we load the stripe to stripe cache. Once the array is started, the stripe cache state machine will handle these stripes through normal write path. r5c_recovery_flush_log contains the main procedure of recovery. The recovery code first scans through the journal and loads data to stripe cache. The code keeps tracks of all these stripes in a list (use sh->lru and ctx->cached_list), stripes in the list are organized in the order of its first appearance on the journal. During the scan, the recovery code assesses each stripe as Data-Parity or Data-Only. During scan, the array may run out of stripe cache. In these cases, the recovery code will also call raid5_set_cache_size to increase stripe cache size. If the array still runs out of stripe cache because there isn't enough memory, the array will not assemble. At the end of scan, the recovery code replays all Data-Parity stripes, and sets proper states for Data-Only stripes. The recovery code also increases seq number by 10 and rewrites all Data-Only stripes to journal. This is to avoid confusion after repeated crashes. More details is explained in raid5-cache.c before r5c_recovery_rewrite_data_only_stripes(). Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18md/r5cache: refactoring journal recovery codeSong Liu
1. rename r5l_read_meta_block() as r5l_recovery_read_meta_block(); 2. pull the code that initialize r5l_meta_block from r5l_log_write_empty_meta_block() to a separate function r5l_recovery_create_empty_meta_block(), so that we can reuse this piece of code. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18md/r5cache: sysfs entry journal_modeSong Liu
With write cache, journal_mode is the knob to switch between write-back and write-through. Below is an example: root@virt-test:~/# cat /sys/block/md0/md/journal_mode [write-through] write-back root@virt-test:~/# echo write-back > /sys/block/md0/md/journal_mode root@virt-test:~/# cat /sys/block/md0/md/journal_mode write-through [write-back] Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18md/r5cache: write-out phase and reclaim supportSong Liu
There are two limited resources, stripe cache and journal disk space. For better performance, we priotize reclaim of full stripe writes. To free up more journal space, we free earliest data on the journal. In current implementation, reclaim happens when: 1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 30 seconds) reclaim if there is no reclaim in the past 5 seconds. 2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (256) cached full stripes, or cached stripes is enough for a full stripe (chunk size / 4k) (r5c_check_cached_full_stripe) 3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage) 4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data) r5c_do_reclaim() contains new logic of reclaim. For stripe cache: When stripe cache pressure is high (more than 3/4 stripes are cached, or there is empty inactive lists), flush all full stripe. If fewer than R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2) full stripes are flushed, flush some paritial stripes. When stripe cache pressure is moderate (1/2 to 3/4 of stripes are cached), flush all full stripes. For log space: To avoid deadlock due to log space, we need to reserve enough space to flush cached data. The size of required log space depends on total number of cached stripes (stripe_in_journal_count). In current implementation, the writing-out phase automatically include pending data writes with parity writes (similar to write through case). Therefore, we need up to (conf->raid_disks + 1) pages for each cached stripe (1 page for meta data, raid_disks pages for all data and parity). r5c_log_required_to_flush_cache() calculates log space required to flush cache. In the following, we refer to the space calculated by r5c_log_required_to_flush_cache() as reclaim_required_space. Two flags are added to r5conf->cache_state: R5C_LOG_TIGHT and R5C_LOG_CRITICAL. R5C_LOG_TIGHT is set when free space on the log device is less than 3x of reclaim_required_space. R5C_LOG_CRITICAL is set when free space on the log device is less than 2x of reclaim_required_space. r5c_cache keeps all data in cache (not fully committed to RAID) in a list (stripe_in_journal_list). These stripes are in the order of their first appearance on the journal. So the log tail (last_checkpoint) should point to the journal_start of the first item in the list. When R5C_LOG_TIGHT is set, r5l_reclaim_thread starts flushing out stripes at the head of stripe_in_journal. When R5C_LOG_CRITICAL is set, the state machine only writes data that are already in the log device (in stripe_in_journal_list). This patch includes a fix to improve performance by Shaohua Li <shli@fb.com>. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>