Age | Commit message (Collapse) | Author |
|
Drop the 'id' part of the attribute group name because we want to expose
non 'id' related attributes via the ns attribute group.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Use nvme_ns_head instead of nvme_ns where possible. This reduces the
coupling between the different data structures.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Pass in the nvme_ns_head pointer directly. This reduces the necessity on
the caller side have the nvme_ns data structure present. Thus we can
refactor the caller side in the next step as well.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Move the namesapce info to struct nvme_ns_head, because it's the same
for all associated namespaces.
Note: with multipathing enabled the PI information is shared between all
paths. If a path is using a different PI configuration it will overwrite
the previous settings. This is obviously not correct and such
configuration will be rejected in future. For the time being we expect
a correctly configured storage.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
'first_minor' represents the starting minor number of disks, and
'minors' represents the number of partitions in the device. Neither
of them can be greater than MINORMASK + 1.
Commit e338924bd05d ("block: check minor range in device_add_disk()")
only added the check of 'first_minor + minors'. However, their sum might
be less than MINORMASK but their values are wrong. Complete the checks now.
Fixes: e338924bd05d ("block: check minor range in device_add_disk()")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231219075942.840255-1-linan666@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Even if BLK_CGROUP is enabled, it does not work for passthrough io.
So skip setting up blkg for passthrough bio.
Reduced processing gives ~5% hike in peak-performance workload.
Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20231218152722.1768-1-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
After commit 1e50915fe0bb ("raid: improve MD/raid10 handling of correctable
read errors."), rdev will be set to faulty if it reads data error to many
times in raid10. Add this mechanism to raid1 now.
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231215023852.3478228-3-linan666@huaweicloud.com
|
|
Move check_decay_read_errors() to raid1-10.c and factor out a helper
exceed_read_errors() to check if read_errors exceeds the limit, so that
raid1 can also use it. There are no functional changes.
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231215023852.3478228-2-linan666@huaweicloud.com
|
|
Upon assembling the array, both kernel and mdadm allow the devices to have event
counter difference of 1, and still consider them as up-to-date.
However, a device whose event count is behind by 1, may in fact not be up-to-date,
and array resync with such a device may cause data corruption.
To avoid this, consult the superblock of the freshest device about the status
of a device, whose event counter is behind by 1.
Signed-off-by: Alex Lyakas <alex.lyakas@zadara.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/1702470271-16073-1-git-send-email-alex.lyakas@zadara.com
|
|
It's clearly been a while since someone looked at this, so I gave it a
quick shot. There are few issues in here:
- Random bundling of members that are mostly read-only and often written
- Random holes that need not be there
This moves the most frequently used bits into cacheline 1 and 2, with
the 2nd one being more write intensive than the first one, which is
basically read-only.
Outside of making this work a bit more efficiently, it also reduces the
size of struct request_queue for my test setup from 864 bytes (spanning
14 cachelines!) to 832 bytes and 13 cachelines.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/d2b7b61c-4868-45c0-9060-4f9c73de9d7e@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
bio_add_hw_page currently always fails or succeeds. This is fine for
the existing callers that always add PAGE_SIZE worth given that the
max_segment_size and max_sectors must always allow at least a page
worth of data. But when we want to add it for bigger amounts of data
this means it can also fail when adding the data to a bio, and creating
a fallback for that becomes really annoying in the callers.
Make use of the existing API design that allows to return a smaller
length than the one passed in and add up to max_segment_size worth
of data from a larger input. All the existing callers are fine with
this - not because they handle this return correctly, but because they
never pass more than a page in.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20231204173419.782378-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Reordered a check to avoid a possible overflow when adding len to bv_len.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20231204173419.782378-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If %__GFP_DIRECT_RECLAIM is set then bio_alloc_bioset will always
be able to allocate a bio. See comment of bio_alloc_bioset.
Signed-off-by: Gou Hao <gouhao@uniontech.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231214151458.28970-1-gouhao@uniontech.com
|
|
Switch to the modern style of printing kernel messages. Use %u instead
of %d to print unsigned integers.
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20231213194702.90381-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The cntlid_min and cntlid_max are checked in configfs, don't check
again in nvmet_alloc_ctrl().
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
When the user wants to restrict to only creating one controller,
they can set cntlid_min and cntlid_max to the same value.
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Before calling add partition or resize partition, there is no check
on whether the length is aligned with the logical block size.
If the logical block size of the disk is larger than 512 bytes,
then the partition size maybe not the multiple of the logical block size,
and when the last sector is read, bio_truncate() will adjust the bio size,
resulting in an IO error if the size of the read command is smaller than
the logical block size.If integrity data is supported, this will also
result in a null pointer dereference when calling bio_integrity_free.
Cc: <stable@vger.kernel.org>
Signed-off-by: Min Li <min15.li@samsung.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230629142517.121241-1-min15.li@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
On the error path of device_add_disk(), device's memalloc_noio flag was
set but not cleared. As the comment of pm_runtime_set_memalloc_noio(),
"The function should be called between device_add() and device_del()".
Clear this flag before device_del() now.
Fixes: 25e823c8c37d ("block/genhd.c: apply pm_runtime_set_memalloc_noio on block devices")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231211075356.1839282-1-linan666@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Since "dev_search_path" can technically be as large as PATH_MAX,
there was a risk of truncation when copying it and a second string
into "full_path" since it was also PATH_MAX sized. The W=1 builds were
reporting this warning:
drivers/block/rnbd/rnbd-srv.c: In function 'process_msg_open.isra':
drivers/block/rnbd/rnbd-srv.c:616:51: warning: '%s' directive output may be truncated writing up to 254 bytes into a region of size between 0 and 4095 [-Wformat-truncation=]
616 | snprintf(full_path, PATH_MAX, "%s/%s",
| ^~
In function 'rnbd_srv_get_full_path',
inlined from 'process_msg_open.isra' at drivers/block/rnbd/rnbd-srv.c:721:14: drivers/block/rnbd/rnbd-srv.c:616:17: note: 'snprintf' output between 2 and 4351 bytes into a destination of size 4096
616 | snprintf(full_path, PATH_MAX, "%s/%s",
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
617 | dev_search_path, dev_name);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~
To fix this, unconditionally check for truncation (as was already done
for the case where "%SESSNAME%" was present).
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202312100355.lHoJPgKy-lkp@intel.com/
Cc: Md. Haris Iqbal <haris.iqbal@ionos.com>
Cc: Jack Wang <jinpu.wang@ionos.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <linux-block@vger.kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/20231212214738.work.169-kees@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.8/block
Pull MD updates from Song:
"1. Fix/Cleanup RCU usage from conf->disks[i].rdev, by Yu Kuai;
2. Fix raid5 hang issue, by Junxiao Bi;
3. Add Yu Kuai as Reviewer of the md subsystem."
* tag 'md-next-20231208' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
md: synchronize flush io with array reconfiguration
MAINTAINERS: SOFTWARE RAID: Add Yu Kuai as Reviewer
md/md-multipath: remove rcu protection to access rdev from conf
md/raid5: remove rcu protection to access rdev from conf
md/raid1: remove rcu protection to access rdev from conf
md/raid10: remove rcu protection to access rdev from conf
md: remove flag RemoveSynchronized
Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d"
md: bypass block throttle for superblock update
|
|
The special casing was originally added in pre-git history; reproducing
the commit log here:
> commit a318a92567d77
> Author: Andrew Morton <akpm@osdl.org>
> Date: Sun Sep 21 01:42:22 2003 -0700
>
> [PATCH] Speed up direct-io hugetlbpage handling
>
> This patch short-circuits all the direct-io page dirtying logic for
> higher-order pages. Without this, we pointlessly bounce BIOs up to
> keventd all the time.
In the last twenty years, compound pages have become used for more than
just hugetlb. Rewrite these functions to operate on folios instead
of pages and remove the special case for hugetlbfs; I don't think
it's needed any more (and if it is, we can put it back in as a call
to folio_test_hugetlb()).
This was found by inspection; as far as I can tell, this bug can lead
to pages used as the destination of a direct I/O read not being marked
as dirty. If those pages are then reclaimed by the MM without being
dirtied for some other reason, they won't be written out. Then when
they're faulted back in, they will not contain the data they should.
It'll take a pretty unusual setup to produce this problem with several
races all going the wrong way.
This problem predates the folio work; it could for example have been
triggered by mmaping a THP in tmpfs and using that as the target of an
O_DIRECT read.
Fixes: 800d8c63b2e98 ("shmem: add huge pages support")
Cc: <stable@vger.kernel.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Make sure that ioccsz and iorcsz returned by target are correct before use it.
Per 2.0a base NVMe spec:
I/O Queue Command Capsule Supported Size (IOCCSZ): This field defines
the maximum I/O command capsule size in 16 byte units. The minimum value
that shall be indicated is 4 corresponding to 64 bytes.
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Inroduce nvme_check_ctrl_fabric_info helper to check fabric controller info
returned by target.
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Write-back throttling (WBT) enables QUEUE_FLAG_STATS on the request
queue. But WBT does not make sense for passthrough io, so skip
QUEUE_FLAG_STATS processing.
Also skip rq_qos_issue/done for passthrough io.
Overall, the change gives ~11% hike in peak performance.
Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20231123190331.7934-1-kundan.kumar@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
No more users of this field.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20231130215309.2923568-5-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
No more users of this flag.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20231130215309.2923568-4-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Map user metadata buffers directly. Now that the bio tracks the
metadata, nvme doesn't need special metadata handling and tracking with
callbacks and additional fields in the pdu.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20231130215309.2923568-3-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Passthrough commands that utilize metadata currently need to bounce the
user space buffer through the kernel. Add support for mapping user space
directly so that we can avoid this costly overhead. This is similar to
how the normal bio data payload utilizes user addresses with
bio_map_user_iov().
If the user address can't directly be used for reason, like too many
segments or address unalignement, fallback to a copy of the user vec
while keeping the user address pinned for the IO duration so that it
can safely be copied on completion in any process context.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20231130215309.2923568-2-kbusch@meta.com
[axboe: fold in fix from Kanchan Joshi]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently rcu is used to protect iterating rdev from submit_flushes():
submit_flushes remove_and_add_spares
synchronize_rcu
pers->hot_remove_disk()
rcu_read_lock()
rdev_for_each_rcu
if (rdev->raid_disk >= 0)
rdev->radi_disk = -1;
atomic_inc(&rdev->nr_pending)
rcu_read_unlock()
bi = bio_alloc_bioset()
bi->bi_end_io = md_end_flush
bi->private = rdev
submit_bio
// issue io for removed rdev
Fix this problem by grabbing 'acive_io' before iterating rdev, make sure
that remove_and_add_spares() won't concurrent with submit_flushes().
Fixes: a2826aa92e2e ("md: support barrier requests on all personalities.")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231129020234.1586910-1-yukuai1@huaweicloud.com
|
|
Add Yu Kuai as reviewer for md/raid subsystem.
Signed-off-by: Song Liu <song@kernel.org>
Acked-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20231128035807.3191738-1-song@kernel.org
|
|
From Yu Kuai:
md: remove rcu protection to access rdev from conf
The lifetime of rdev:
1. md_import_device() generate a rdev based on underlying disk;
mddev_lock()
rdev = kzalloc();
rdev->bdev = blkdev_get_by_dev();
mddev_unlock()
2. bind_rdev_to_array() add this rdev to mddev->disks;
mddev_lock()
kobject_add(&rdev->kobj, &mddev->kobj, ...);
list_add_rcu(&rdev->same_set, &mddev->disks);
mddev_unlock()
3. remove_and_add_spares() add this rdev to conf;
mddev_lock()
rdev_addable();
pers->hot_add_disk();
rcu_assign_pointer(conf->rdev, rdev);
mddev_unlock()
4. Use this array with rdev;
5. remove_and_add_spares() remove rdev from conf;
// triggered by sysfs/ioctl
mddev_lock()
rdev_removeable();
pers->hot_remove_disk();
rcu_assign_pointer(conf->rdev, NULL);
synchronize_rcu();
mddev_unlock()
// triggered by daemon
mddev_lock()
rdev_removeable();
synchronize_rcu(); -> this can't protect accessing rdev from conf
pers->hot_remove_disk();
rcu_assign_pointer(conf->rdev, NULL);
mddev_unlock()
6. md_kick_rdev_from_array() remove rdev from mddev->disks;
mddev_lock()
list_del_rcu(&rdev->same_set);
synchronize_rcu();
list_add(&rdev->same_set, &mddev->deleting)
mddev_unlock()
export_rdev
There are two separate rcu protection for rdev, and this pathset remove
the protection of conf(step 3 and 5), because it's safe to access rdev
from conf in following cases:
- If 'reconfig_mutex' is held, because rdev can't be added or rmoved to
conf;
- If there is normal IO inflight, because mddev_suspend() will wait for
IO to be done and prevent rdev to be added or removed to conf;
- If sync thread is running, because remove_and_add_spares() can only be
called from daemon thread when sync thread is done, and
'MD_RECOVERY_RUNNING' is also checked for ioctl/sysfs;
- if any spinlock or rcu_read_lock() is held, because synchronize_rcu()
from step 6 prevent rdev to be freed until spinlock is released or
rcu_read_unlock();
|
|
Because it's safe to accees rdev from conf:
- If any spinlock is held, because synchronize_rcu() from
md_kick_rdev_from_array() will prevent 'rdev' to be freed until
spinlock is released;
- If there is normal IO inflight, because mddev_suspend() will prevent
rdev to be added or removed from array;
And these will cover all the scenarios in md-multipath.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231125081604.3939938-6-yukuai1@huaweicloud.com
|
|
Because it's safe to accees rdev from conf:
- If any spinlock is held, because synchronize_rcu() from
md_kick_rdev_from_array() will prevent 'rdev' to be freed until
spinlock is released;
- If 'reconfig_lock' is held, because rdev can't be added or removed from
array;
- If there is normal IO inflight, because mddev_suspend() will prevent
rdev to be added or removed from array;
- If there is sync IO inflight, because 'MD_RECOVERY_RUNNING' is
checked in remove_and_add_spares().
And these will cover all the scenarios in raid456.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231125081604.3939938-5-yukuai1@huaweicloud.com
|
|
Because it's safe to accees rdev from conf:
- If any spinlock is held, because synchronize_rcu() from
md_kick_rdev_from_array() will prevent 'rdev' to be freed until
spinlock is released;
- If 'reconfig_lock' is held, because rdev can't be added or removed from
array;
- If there is normal IO inflight, because mddev_suspend() will prevent
rdev to be added or removed from array;
- If there is sync IO inflight, because 'MD_RECOVERY_RUNNING' is
checked in remove_and_add_spares().
And these will cover all the scenarios in raid1.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231125081604.3939938-4-yukuai1@huaweicloud.com
|
|
Because it's safe to accees rdev from conf:
- If any spinlock is held, because synchronize_rcu() from
md_kick_rdev_from_array() will prevent 'rdev' to be freed until
spinlock is released;
- If 'reconfig_lock' is held, because rdev can't be added or removed from
array;
- If there is normal IO inflight, because mddev_suspend() will prevent
rdev to be added or removed from array;
- If there is sync IO inflight, because 'MD_RECOVERY_RUNNING' is
checked in remove_and_add_spares().
And these will cover all the scenarios in raid10.
This patch also cleanup the code to handle the case that replacement
replace rdev while IO is still inflight.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231125081604.3939938-3-yukuai1@huaweicloud.com
|
|
rcu is not used correctly here, because synchronize_rcu() is called
before replacing old value, for example:
remove_and_add_spares // other path
synchronize_rcu
// called before replacing old value
set_bit(RemoveSynchronized)
rcu_read_lock()
rdev = conf->mirros[].rdev
pers->hot_remove_disk
conf->mirros[].rdev = NULL;
if (!test_bit(RemoveSynchronized))
synchronize_rcu
/*
* won't be called, and won't wait
* for concurrent readers to be done.
*/
// access rdev after remove_and_add_spares()
rcu_read_unlock()
Fortunately, there is a separate rcu protection to prevent such rdev
to be freed:
md_kick_rdev_from_array //other path
rcu_read_lock()
rdev = conf->mirros[].rdev
list_del_rcu(&rdev->same_set)
rcu_read_unlock()
/*
* rdev can be removed from conf, but
* rdev won't be freed.
*/
synchronize_rcu()
free rdev
Hence remove this useless flag and prepare to remove rcu protection to
access rdev from 'conf'.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231125081604.3939938-2-yukuai1@huaweicloud.com
|
|
This reverts commit 5e2cf333b7bd5d3e62595a44d598a254c697cd74.
That commit introduced the following race and can cause system hung.
md_write_start: raid5d:
// mddev->in_sync == 1
set "MD_SB_CHANGE_PENDING"
// running before md_write_start wakeup it
waiting "MD_SB_CHANGE_PENDING" cleared
>>>>>>>>> hung
wakeup mddev->thread
...
waiting "MD_SB_CHANGE_PENDING" cleared
>>>> hung, raid5d should clear this flag
but get hung by same flag.
The issue reverted commit fixing is fixed by last patch in a new way.
Fixes: 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d")
Cc: stable@vger.kernel.org # v5.19+
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231108182216.73611-2-junxiao.bi@oracle.com
|
|
commit 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d")
introduced a hung bug and will be reverted in next patch, since the issue
that commit is fixing is due to md superblock write is throttled by wbt,
to fix it, we can have superblock write bypass block layer throttle.
Fixes: 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d")
Cc: stable@vger.kernel.org # v5.19+
Suggested-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231108182216.73611-1-junxiao.bi@oracle.com
|
|
Allow using a few symbols with IS_ENABLED instead of #idef by moving
the declarations out of #idef CONFIG_BLK_DEV_ZONED, and move
bdev_nr_zones into the remaining #idef CONFIG_BLK_DEV_ZONED, #else
block below.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231127072002.1332685-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
While printing error, replace %ld by %pe. %pe prints a string
whereas %ld would print an error code.
Signed-off-by: Supriti Singh <supriti.singh@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Link: https://lore.kernel.org/r/20231124213422.113449-3-haris.iqbal@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Remove REQ_OP_WRITE_SAME in favour of REQ_OP_WRITE_ZEROES.
Signed-off-by: Santosh Pradhan <santosh.pradhan@ionos.com>
Reviewed-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Link: https://lore.kernel.org/r/20231124213422.113449-2-haris.iqbal@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt::
"Eventfs fixes:
- With the usage of simple_recursive_remove() recommended by Al Viro,
the code should not be calling "d_invalidate()" itself. Doing so is
causing crashes. The code was calling d_invalidate() on the race of
trying to look up a file while the parent was being deleted. This
was detected, and the added dentry was having d_invalidate() called
on it, but the deletion of the directory was also calling
d_invalidate() on that same dentry.
- A fix to not free the eventfs_inode (ei) until the last dput() was
called on its ei->dentry made the ei->dentry exist even after it
was marked for free by setting the ei->is_freed. But code elsewhere
still was checking if ei->dentry was NULL if ei->is_freed is set
and would trigger WARN_ON if that was the case. That's no longer
true and there should not be any warnings when it is true.
- Use GFP_NOFS for allocations done under eventfs_mutex. The
eventfs_mutex can be taken on file system reclaim, make sure that
allocations done under that mutex do not trigger file system
reclaim.
- Clean up code by moving the taking of inode_lock out of the helper
functions and into where they are needed, and not use the parameter
to know to take it or not. It must always be held but some callers
of the helper function have it taken when they were called.
- Warn if the inode_lock is not held in the helper functions.
- Warn if eventfs_start_creating() is called without a parent. As
eventfs is underneath tracefs, all files created will have a parent
(the top one will have a tracefs parent).
Tracing update:
- Add Mathieu Desnoyers as an official reviewer of the tracing subsystem"
* tag 'trace-v6.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
MAINTAINERS: TRACING: Add Mathieu Desnoyers as Reviewer
eventfs: Make sure that parent->d_inode is locked in creating files/dirs
eventfs: Do not allow NULL parent to eventfs_start_creating()
eventfs: Move taking of inode_lock into dcache_dir_open_wrapper()
eventfs: Use GFP_NOFS for allocation when eventfs_mutex is held
eventfs: Do not invalidate dentry in create_file/dir_dentry()
eventfs: Remove expectation that ei->is_freed means ei->dentry == NULL
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc architecture fixes from Helge Deller:
"This patchset fixes and enforces correct section alignments for the
ex_table, altinstructions, parisc_unwind, jump_table and bug_table
which are created by inline assembly.
Due to not being correctly aligned at link & load time they can
trigger unnecessarily the kernel unaligned exception handler at
runtime. While at it, I switched the bug table to use relative
addresses which reduces the size of the table by half on 64-bit.
We still had the ENOSYM and EREMOTERELEASE errno symbols as left-overs
from HP-UX, which now trigger build-issues with glibc. We can simply
remove them.
Most of the patches are tagged for stable kernel series.
Summary:
- Drop HP-UX ENOSYM and EREMOTERELEASE return codes to avoid glibc
build issues
- Fix section alignments for ex_table, altinstructions, parisc unwind
table, jump_table and bug_table
- Reduce size of bug_table on 64-bit kernel by using relative
pointers"
* tag 'parisc-for-6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Reduce size of the bug_table on 64-bit kernel by half
parisc: Drop the HP-UX ENOSYM and EREMOTERELEASE error codes
parisc: Use natural CPU alignment for bug_table
parisc: Ensure 32-bit alignment on parisc unwind section
parisc: Mark lock_aligned variables 16-byte aligned on SMP
parisc: Mark jump_table naturally aligned
parisc: Mark altinstructions read-only and 32-bit aligned
parisc: Mark ex_table entries 32-bit aligned in uaccess.h
parisc: Mark ex_table entries 32-bit aligned in assembly.h
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 microcode fixes from Ingo Molnar:
"Fix/enhance x86 microcode version reporting: fix the bootup log spam,
and remove the driver version announcement to avoid version confusion
when distros backport fixes"
* tag 'x86-urgent-2023-11-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode: Rework early revisions reporting
x86/microcode: Remove the driver announcement and version
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 perf event fix from Ingo Molnar:
"Fix a bug in the Intel hybrid CPUs hardware-capabilities enumeration
code resulting in non-working events on those platforms"
* tag 'perf-urgent-2023-11-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Correct incorrect 'or' operation for PMU capabilities
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Ingo Molnar:
"Fix lockdep block chain corruption resulting in KASAN warnings"
* tag 'locking-urgent-2023-11-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
lockdep: Fix block chain corruption
|
|
Pull smb client fixes from Steve French:
- use after free fix in releasing multichannel interfaces
- fixes for special file types (report char, block, FIFOs properly when
created e.g. by NFS to Windows)
- fixes for reporting various special file types and symlinks properly
when using SMB1
* tag '6.7-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: introduce cifs_sfu_make_node()
smb: client: set correct file type from NFS reparse points
smb: client: introduce ->parse_reparse_point()
smb: client: implement ->query_reparse_point() for SMB1
cifs: fix use after free for iface while disabling secondary channels
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB / PHY / Thunderbolt fixes from Greg KH:
"Here are a number of reverts, fixes, and new device ids for 6.7-rc3
for the USB, PHY, and Thunderbolt driver subsystems. Include in here
are:
- reverts of some PHY drivers that went into 6.7-rc1 that shouldn't
have been merged yet, the author is reworking them based on review
comments as they were using older apis that shouldn't be used
anymore for newer drivers
- small thunderbolt driver fixes for reported issues
- USB driver fixes for a variety of small issues in dwc3, typec,
xhci, and other smaller drivers.
- new device ids for usb-serial and onboard_usb_hub drivers.
All of these have been in linux-next with no reported issues"
* tag 'usb-6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (33 commits)
USB: serial: option: add Luat Air72*U series products
USB: dwc3: qcom: fix ACPI platform device leak
USB: dwc3: qcom: fix software node leak on probe errors
USB: dwc3: qcom: fix resource leaks on probe deferral
USB: dwc3: qcom: simplify wakeup interrupt setup
USB: dwc3: qcom: fix wakeup after probe deferral
dt-bindings: usb: qcom,dwc3: fix example wakeup interrupt types
usb: misc: onboard-hub: add support for Microchip USB5744
dt-bindings: usb: microchip,usb5744: Add second supply
usb: misc: ljca: Fix enumeration error on Dell Latitude 9420
USB: serial: option: add Fibocom L7xx modules
USB: xhci-plat: fix legacy PHY double init
usb: typec: tipd: Supply also I2C driver data
usb: xhci-mtk: fix in-ep's start-split check failure
usb: dwc3: set the dma max_seg_size
usb: config: fix iteration issue in 'usb_get_bos_descriptor()'
usb: dwc3: add missing of_node_put and platform_device_put
USB: dwc2: write HCINT with INTMASK applied
usb: misc: ljca: Drop _ADR support to get ljca children devices
usb: cdnsp: Fix deadlock issue during using NCM gadget
...
|
|
Pull xfs fix from Chandan Babu:
- Validate quota records recovered from the log before writing them to
the disk.
* tag 'xfs-6.7-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: dquot recovery does not validate the recovered dquot
xfs: clean up dqblk extraction
|