Age | Commit message | Author
2017-06-30 | lightnvm: pblk: schedule if data is not ready | Javier González
When user threads place data into the write buffer, they reserve space and do the memory copy outside the lock. As a consequence, when the write thread starts persisting data, there is a chance that it has not been copied yet. In this case, avoid polling, and schedule before retrying. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
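The idea of yielding instead of busy-polling can be sketched in user space. This is only an analogy under stated assumptions: `struct entry`, `entry_fill()` and `entry_read()` are hypothetical names, C11 atomics stand in for the buffer entry flags, and `sched_yield()` stands in for the kernel scheduling call the patch uses.

```c
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical write-buffer entry: space is reserved first, data copied later. */
struct entry {
	atomic_bool ready;	/* set once the copy has finished */
	int data;
};

/* Producer side: copy the payload, then publish it. */
void entry_fill(struct entry *e, int value)
{
	e->data = value;
	atomic_store_explicit(&e->ready, true, memory_order_release);
}

/* Consumer side: if the copy is not done yet, give up the CPU and retry. */
int entry_read(struct entry *e)
{
	while (!atomic_load_explicit(&e->ready, memory_order_acquire))
		sched_yield();
	return e->data;
}
```

Yielding lets the thread that still has to finish the copy run, which matters most when both threads compete for the same CPU.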
2017-06-30 | lightnvm: pblk: remove unused return variable | Javier González
Remove unused variable. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-30 | lightnvm: pblk: fix double-free on pblk init | Javier González
Prevent pblk->lines being double freed in case of an error during pblk initialization. Fixes: dd2a43437337: "lightnvm: pblk: sched. metadata on write thread" Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-30 | lightnvm: pblk: fix bad le64 assignations | Javier González
Use the right types and conversions on le64 variables. Reported by sparse. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-29 | nvme: Makefile: remove dead build rule | Valentin Rothberg
Remove dead build rule for drivers/nvme/host/scsi.c which has been removed by commit ("nvme: Remove SCSI translations"). Signed-off-by: Valentin Rothberg <vrothberg@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-29 | blk-mq: map all HWQ also in hyperthreaded system | Max Gurtovoy
This patch performs sequential mapping between CPUs and queues. If the system has more CPUs than HWQs, there are still CPUs left to map to HWQs. In a hyperthreaded system, map the unmapped CPUs and their siblings to the same HWQ. This fixes a bug where HWQs were left unmapped in a system with 2 sockets, 18 cores per socket, 2 threads per core (total 72 CPUs) running NVMEoF (opens up to a maximum of 64 HWQs).

Performance results running fio (72 jobs, 128 iodepth) using null_blk (w/w.o patch):

bs     IOPS(read submit_queues=72)   IOPS(write submit_queues=72)   IOPS(read submit_queues=24)   IOPS(write submit_queues=24)
-----  ----------------------------  -----------------------------  ----------------------------  -----------------------------
512    4890.4K/4723.5K               4524.7K/4324.2K                4280.2K/4264.3K               3902.4K/3909.5K
1k     4910.1K/4715.2K               4535.8K/4309.6K                4296.7K/4269.1K               3906.8K/3914.9K
2k     4906.3K/4739.7K               4526.7K/4330.6K                4301.1K/4262.4K               3890.8K/3900.1K
4k     4918.6K/4730.7K               4556.1K/4343.6K                4297.6K/4264.5K               3886.9K/3893.9K
8k     4906.4K/4748.9K               4550.9K/4346.7K                4283.2K/4268.8K               3863.4K/3858.2K
16k    4903.8K/4782.6K               4501.5K/4233.9K                4292.3K/4282.3K               3773.1K/3773.5K
32k    4885.8K/4782.4K               4365.9K/4184.2K                4307.5K/4289.4K               3780.3K/3687.3K
64k    4822.5K/4762.7K               2752.8K/2675.1K                4308.8K/4312.3K               2651.5K/2655.7K
128k   2388.5K/2313.8K               1391.9K/1375.7K                2142.8K/2152.2K               1395.5K/1374.2K

Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
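A minimal sketch of the mapping rule described in this commit, not the kernel code itself. It assumes a hypothetical topology in which CPU c and CPU c + nr_cpus / 2 are hyperthread siblings; `first_sibling()`, `map_cpus_to_queues()` and `count_unmapped_queues()` are illustrative names.

```c
/*
 * Assumed topology (for illustration only): two threads per core, where
 * CPU c and CPU c + nr_cpus / 2 share a core. The lower number is the
 * "first sibling".
 */
unsigned int first_sibling(unsigned int cpu, unsigned int nr_cpus)
{
	unsigned int half = nr_cpus / 2;

	return cpu >= half ? cpu - half : cpu;
}

/* Fill map[cpu] with the hardware queue index chosen for each CPU. */
void map_cpus_to_queues(unsigned int *map, unsigned int nr_cpus,
			unsigned int nr_queues)
{
	for (unsigned int cpu = 0; cpu < nr_cpus; cpu++) {
		unsigned int sib = first_sibling(cpu, nr_cpus);

		if (cpu < nr_queues || sib == cpu)
			/* Sequential mapping, wrapping when queues run out. */
			map[cpu] = cpu % nr_queues;
		else
			/* Leftover thread: share the queue of its sibling. */
			map[cpu] = map[sib];
	}
}

/* Count queues that no CPU maps to; 0 means every HWQ is reachable. */
unsigned int count_unmapped_queues(const unsigned int *map,
				   unsigned int nr_cpus,
				   unsigned int nr_queues)
{
	unsigned int unmapped = 0;

	for (unsigned int q = 0; q < nr_queues; q++) {
		int used = 0;

		for (unsigned int cpu = 0; cpu < nr_cpus; cpu++)
			if (map[cpu] == q)
				used = 1;
		if (!used)
			unmapped++;
	}
	return unmapped;
}
```

With 72 CPUs and 64 queues, CPUs 0 to 63 each get their own queue and CPUs 64 to 71 reuse the queue of their first sibling, so no queue is left without a CPU.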
2017-06-28 | nvmet-rdma: register ib_client to not deadlock in device removal | Sagi Grimberg
We can deadlock if we get a device removal event on a queue which is already in the process of destroying the cm_id, as this blocks until all events on this cm_id have drained. On the other hand, we cannot guarantee that rdma_destroy_id was invoked, as we only have an indication that the queue disconnect flow has been queued (the queue state is updated before the release work has been queued). So, we leave all the queue removal to a separate ib_client to avoid this deadlock, as ib_client device removal is in a different context than the cm_id itself. Reported-by: Shiraz Saleem <shiraz.saleem@intel.com> Tested-by: Shiraz Saleem <shiraz.saleem@intel.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-28 | nvme_fc: fix error recovery on link down. | James Smart
Currently, the fc transport invokes nvme_fc_error_recovery() on every io in which the transport detects an error. This means: a) it's really noisy on large io loads that all get hit by a link down; b) we repeatedly call nvme_stop_queues() even though queues are stopped upon the first error or as first steps of reset_work. Correct by: Errors are only meaningful if the controller is in the LIVE state. Thus, enact the reset_work only if LIVE. If called repeatedly, state will have already transitioned. There's no need to stop the queues here. Let the first steps of reset_work do the queue stopping. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
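The "only act if LIVE" rule can be sketched as a tiny state check. This is a simplified, single-threaded illustration: `struct ctrl`, the two-state enum and `error_recovery()` are hypothetical names, and a counter stands in for queueing reset_work. A real driver would need an atomic state transition.

```c
#include <stdbool.h>

/* Simplified controller states; names are illustrative only. */
enum ctrl_state { CTRL_LIVE, CTRL_RESETTING };

struct ctrl {
	enum ctrl_state state;
	unsigned int resets_scheduled;	/* how often reset work was queued */
};

/*
 * Called for every I/O that completes with a transport error.
 * Only the first error seen while LIVE schedules the reset; later
 * calls find the state already changed and return without doing work.
 */
bool error_recovery(struct ctrl *c)
{
	if (c->state != CTRL_LIVE)
		return false;

	c->state = CTRL_RESETTING;
	c->resets_scheduled++;	/* stands in for queueing reset_work */
	return true;
}
```

However many failed I/Os arrive after a link down, the reset is scheduled once, and the queue stopping is left to the reset path.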
2017-06-28 | nvmet_fc: fix crashes on bad opcodes | James Smart
If an nvme command is issued with an opcode that is not supported by the target (example: opcode 21 - detach namespace), the target crashes due to a null pointer. nvmet_req_init() detects the bad opcode and immediately calls the nvme command done routine with an error status, allowing the transport to send the response. However, the FC transport was aborting the command on error, so the abort freed the lldd point, but the rsp transmit path referenced it after the free. Fix by removing the abort call on nvmet_req_init() failure. The completion response will be sent with an error status code. As the completion path will terminate the io, ensure the data_sg lists show an unused state so that teardown paths are successful. Signed-off-by: Paul Ely <Paul.Ely@broadcom.com> Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-28 | nvme_fc: Fix crash when nvme controller connection fails. | James Smart
If a controller connection is attempted (say to a subsystem that does not exist), the first attempt errors out. If another connect is attempted, it crashes. The issue is that the prior controller has yet to execute its final put, thus it's still on lists. However, opts points on it have been cleared, thus causing the crash if they are referenced. Fix is to add the missing put after the nvme_uninit_ctrl() call on the attachment failure. Signed-off-by: Paul Ely <Paul.Ely@broadcom.com> Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-28 | nvme_fc: replace ioabort msleep loop with completion | James Smart
Per the recommendation by Sagi on: http://lists.infradead.org/pipermail/linux-nvme/2017-April/009261.html The wait for io aborts to complete is converted from an msleep loop to using a struct completion. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-28 | nvme_fc: fix double calls to nvme_cleanup_cmd() | James Smart
Current fc transport code, on io termination, is calling nvme_cleanup_cmd() followed by the transport dma unmap routine which also calls nvme_cleanup_cmd(). This means two kfrees occur on the same address, wreaking havoc. This resulted in odd data errors, effectively corruption. Fix by removing the extraneous double calls. The call now occurs only in teardown paths and as part of the dma unmap routine. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Ewan D. Milne <emilne@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-28 | nvme-fabrics: verify that a controller returns the correct NQN | Christoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-28 | nvme: simplify nvme_dev_attrs_are_visible | Christoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-28 | nvme: read the subsystem NQN from Identify Controller | Christoph Hellwig
NVMe 1.2.1 or later requires controllers to provide a subsystem NQN in the Identify controller data structures. Use this NQN for the subsysnqn sysfs attribute by storing it in the nvme_ctrl structure after verifying it. For older controllers we generate a "fake" NQN per non-normative text in the NVMe 1.3 spec. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-28 | nvme: remove a misleading comment on struct nvme_ns | Christoph Hellwig
While an NVMe Namespace is somewhat similar to a SCSI Logical Unit (and not a Logical Unit Number anyway) there are subtle differences. Remove the misleading comment. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grmberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-28 | nvme: explicitly disable APST on quirked devices | Kai-Heng Feng
A user reports APST is enabled, even when the NVMe is quirked or with option "default_ps_max_latency_us=0". The current logic will not set APST if the device is quirked. But the NVMe in question will enable APST automatically. Separate the logic "apst is supported" and "to enable apst", so we can use the latter one to explicitly disable APST at initialization. BugLink: https://bugs.launchpad.net/bugs/1699004 Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Reviewed-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-28 | nvme: use a single NVME_AQ_DEPTH and relax it to 32 | Sagi Grimberg
No need to differentiate fabrics from pci/loop, also lower it to 32 as we don't really need 256 inflight admin commands. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-28 | nvme: add hostid token to fabric options | Johannes Thumshirn
Currently we have no way to define a stable host-id but always use the one which is randomly generated when we add the host or use the default host. Provide a "hostid=%s" for user-space to pass in a persistent host-id which overrides the randomly generated one. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-28 | nvme: Remove SCSI translations | Keith Busch
The SCSI-to-NVMe translations were added to assist storage applications utilizing SG_IO transitioning to NVMe. It was always recommended, however, to use native NVMe for device management as too much is lost in translation and the maintenance burden in keeping this kludgey layer around has been neglected such that much of the translations are completely broken. This patch removes SG_IO handling from NVMe to avoid any confusion regarding maintenance support for this interface. The config option for NVMe SCSI emulation has been disabled by default since 4.5. The driver has supported native nvme user commands since the beginning, and native tooling is publicly available for use or as reference for anyone writing their own tools, so there's no excuse for hanging onto a broken crutch. Signed-off-by: Keith Busch <keith.busch@intel.com> Acked-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Guan Junxiong <guanjunxiong@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-28 | nvme-pci: open-code polling logic in nvme_poll | Sagi Grimberg
Given that the code is simple enough, it seems better than passing a tag by reference for each call site. Also, we can now get rid of __nvme_process_cq. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-28 | nvme-pci: factor out the cqe reading mechanics from __nvme_process_cq | Sagi Grimberg
Also, maintain a consumed counter to rely on for doorbell and cqe_seen update instead of directly relying on the cq head and phase. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
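The split between reading a completion entry and counting consumed entries can be sketched against a simulated queue. This is an illustration under assumptions, not the driver code: `struct cq`, `read_cqe()` and `process_cq()` are hypothetical names, the queue depth is arbitrary, and a plain field stands in for the doorbell register. The one borrowed fact is that bit 0 of the completion status carries the phase tag.

```c
#include <stdbool.h>
#include <stdint.h>

#define CQ_DEPTH 4

struct cqe {
	uint16_t command_id;
	uint16_t status;	/* bit 0 is the phase tag */
};

struct cq {
	struct cqe entries[CQ_DEPTH];
	uint16_t head;
	uint8_t phase;		/* phase value that marks a new entry */
	uint32_t doorbell;	/* stands in for the MMIO doorbell register */
};

/* Copy out the entry at head if it is new, then advance head and phase. */
bool read_cqe(struct cq *q, struct cqe *out)
{
	struct cqe *e = &q->entries[q->head];

	if ((e->status & 1) != q->phase)
		return false;

	*out = *e;
	if (++q->head == CQ_DEPTH) {
		q->head = 0;
		q->phase = !q->phase;	/* expected tag flips on every wrap */
	}
	return true;
}

/* Drain new entries; ring the doorbell once, and only if work was done. */
unsigned int process_cq(struct cq *q)
{
	struct cqe cqe;
	unsigned int consumed = 0;

	while (read_cqe(q, &cqe))
		consumed++;	/* a real driver would complete the request here */

	if (consumed)
		q->doorbell = q->head;
	return consumed;
}
```

Keeping a `consumed` count means the caller does not have to compare head and phase before and after the loop to learn whether anything was processed.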
2017-06-28 | nvme-pci: factor out cqe handling into a dedicated routine | Sagi Grimberg
Makes the code slightly more readable. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-28 | nvme-pci: Introduce nvme_ring_cq_doorbell | Sagi Grimberg
Nice abstraction of the actual mechanics of how to do it. Note the change that we call it after we assign nvmeq->cq_head to avoid passing it. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-28 | fs/fcntl: use copy_to/from_user() for u64 types | Jens Axboe
Some architectures (at least PPC) don't like get/put_user with 64-bit types on a 32-bit system. Use the variably sized copy to/from user variants instead. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Fixes: c75b1d9421f8 ("fs: add fcntl() interface for setting/getting write life time hints") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 | drbd: Drop unnecessary static | Julia Lawall
Drop static on a local variable, when the variable is initialized before any use, on every possible execution path through the function. The static has no benefit, and dropping it reduces the code size.

The semantic patch that fixes this problem is as follows: (http://coccinelle.lip6.fr/)

// <smpl>
@bad exists@
position p;
identifier x;
type T;
@@

static T x@p;
...
x = <+...x...+>

@@
identifier x;
expression e;
type T;
position p != bad.p;
@@

-static
 T x@p;
 ... when != x
     when strict
?x = e;
// </smpl>

The change in code size is indicated by the following output from the size command.

before:
   text    data     bss     dec     hex filename
  67299    2291    1056   70646   113f6 drivers/block/drbd/drbd_nl.o

after:
   text    data     bss     dec     hex filename
  67283    2291    1056   70630   113e6 drivers/block/drbd/drbd_nl.o

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
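A small illustration of the pattern being cleaned up, using made-up functions rather than the drbd code: `sum_before()` and `sum_after()` are hypothetical and differ only in the storage class of the local.

```c
/* Before: 'static' gives the local permanent storage it never needs. */
int sum_before(const int *v, int n)
{
	static int total;

	total = 0;	/* always assigned before any read */
	for (int i = 0; i < n; i++)
		total += v[i];
	return total;
}

/* After: an ordinary automatic variable, same result. */
int sum_after(const int *v, int n)
{
	int total;

	total = 0;
	for (int i = 0; i < n; i++)
		total += v[i];
	return total;
}
```

Because the variable is written before every read, its value never carries over between calls, so the static storage only adds size and makes the function non-reentrant.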
2017-06-27 | block, bfq: update wr_busy_queues if needed on a queue split | Paolo Valente
This commit fixes a bug triggered by a non-trivial sequence of events. These events are briefly described in the next two paragraphs. The impatient, or those who are familiar with queue merging and splitting, can jump directly to the last paragraph.

On each I/O-request arrival for a shared bfq_queue, i.e., for a bfq_queue that is the result of the merge of two or more bfq_queues, BFQ checks whether the shared bfq_queue has become seeky (i.e., if too many random I/O requests have arrived for the bfq_queue; if the device is non rotational, then random requests must be also small for the bfq_queue to be tagged as seeky). If the shared bfq_queue is actually detected as seeky, then a split occurs: the bfq I/O context of the process that has issued the request is redirected from the shared bfq_queue to a new non-shared bfq_queue. As a degenerate case, if the shared bfq_queue actually happens to be shared only by one process (because of previous splits), then no new bfq_queue is created: the state of the shared bfq_queue is just changed from shared to non shared.

Regardless of whether a brand new non-shared bfq_queue is created, or the pre-existing shared bfq_queue is just turned into a non-shared bfq_queue, several parameters of the non-shared bfq_queue are set (restored) to the original values they had when the bfq_queue associated with the bfq I/O context of the process (that has just issued an I/O request) was merged with the shared bfq_queue. One of these parameters is the weight-raising state.

If, on the split of a shared bfq_queue, 1) a pre-existing shared bfq_queue is turned into a non-shared bfq_queue; 2) the previously shared bfq_queue happens to be busy; 3) the weight-raising state of the previously shared bfq_queue happens to change; the number of weight-raised busy queues changes. The field wr_busy_queues must then be updated accordingly, but such an update was missing. This commit adds the missing update.
Reported-by: Luca Miccio <lucmiccio@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
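The missing bookkeeping described in the last paragraph can be sketched as follows. This is a simplified model, not the BFQ source: `struct sched_data`, `struct queue` and `resume_wr_state()` are hypothetical names, and the sketch assumes that a weight-raising coefficient greater than 1 means "weight-raised".

```c
#include <stdbool.h>

struct sched_data {
	int wr_busy_queues;	/* number of busy, weight-raised queues */
};

struct queue {
	bool busy;
	unsigned int wr_coeff;	/* assumed: > 1 means weight-raised */
};

/*
 * Restore the saved weight-raising state after a split and keep the
 * counter of weight-raised busy queues consistent with it.
 */
void resume_wr_state(struct sched_data *d, struct queue *q,
		     unsigned int saved_wr_coeff)
{
	bool was_raised = q->wr_coeff > 1;
	bool now_raised;

	q->wr_coeff = saved_wr_coeff;
	now_raised = q->wr_coeff > 1;

	/* Only a busy queue whose state actually changed affects the count. */
	if (!q->busy || was_raised == now_raised)
		return;

	if (now_raised)
		d->wr_busy_queues++;
	else
		d->wr_busy_queues--;
}
```

The three conditions from the commit message map onto the checks: the queue is reused in place, it is busy, and its weight-raising state differs before and after the restore.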
2017-06-27 | mmc/block: remove a call to blk_queue_bounce_limit | Christoph Hellwig
BLK_BOUNCE_ANY is the default now, so the call is superfluous. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 | dm: don't set bounce limit | Christoph Hellwig
Now that all queue allocators come without a bounce limit by default, dm doesn't have to override this anymore. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 | block: don't set bounce limit in blk_init_queue | Christoph Hellwig
Instead move it to the callers. Those that either don't use bio_data() or page_address() or are specific to architectures that do not support highmem are skipped. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 | block: don't set bounce limit in blk_init_allocated_queue | Christoph Hellwig
And just move it into scsi_transport_sas which needs it due to low-level drivers directly dereferencing bio_data, and into blk_init_queue_node, which will need a further push into the callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 | blk-mq: don't bounce by default | Christoph Hellwig
For historical reasons we default to bouncing highmem pages for all block queues. But the blk-mq drivers are easy to audit to ensure that we don't need this - scsi and mtip32xx set explicit limits and everyone else doesn't have any particular ones. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 | block: don't bother with bounce limits for make_request drivers | Christoph Hellwig
We only call blk_queue_bounce for request-based drivers, so stop messing with it for make_request based drivers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 | block: remove the queue_bounce_pfn helper | Christoph Hellwig
Only used inside the bounce code, and opencoding it makes it more obvious what is going on. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 | block: move bounce declarations to block/blk.h | Christoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 | blk-map: call blk_queue_bounce from blk_rq_append_bio | Christoph Hellwig
This moves the knowledge about bouncing out of the callers into the block core (just like we do for the normal I/O path), and allows us to unexport blk_queue_bounce. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 | pktcdvd: remove the call to blk_queue_bounce | Christoph Hellwig
pktcdvd is a make_request based stacking driver and thus doesn't have any addressing limits of its own. It also doesn't use bio_data() or page_address(), so it doesn't need a lowmem bounce either. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 | nvme: add support for streams and directives | Jens Axboe
This adds support for Directives in NVMe, in particular for the Streams directive. Support for Directives is a new feature in NVMe 1.3. It allows a user to pass in information about where to store the data, so that the device can do so most efficiently. If an application is managing and writing data with different life times, mixing data with different retention onto the same locations on flash can cause write amplification to grow. This, in turn, will reduce performance and life time of the device. Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 | btrfs: add support for passing in write hints for buffered writes | Jens Axboe
Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Chris Mason <clm@fb.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 | xfs: add support for passing in write hints for buffered writes | Jens Axboe
Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 | ext4: add support for passing in write hints for buffered writes | Jens Axboe
Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 | fs: add support for buffered writeback to pass down write hints | Jens Axboe
Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 | fs: add O_DIRECT and aio support for sending down write life time hints | Jens Axboe
Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 | blk-mq: expose write hints through debugfs | Jens Axboe
Useful to verify that things are working the way they should. Reading the file will return number of kb written with each write hint. Writing the file will reset the statistics. No care is taken to ensure that we don't race on updates. Drivers will write to q->write_hints[] if they handle a given write hint. Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 | block: add support for write hints in a bio | Jens Axboe
No functional changes in this patch, we just use up some holes in the bio and request structures to define a write hint that we pass down the stack. Ensure that we don't merge requests that have different life time hints assigned to them, and that we inherit the write hint when cloning a bio. Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
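The merge rule can be sketched with a toy request type. This is an assumption-laden illustration, not the block layer code: `struct request` here is a made-up three-field struct, `can_back_merge()` is a hypothetical helper, and the enum numbering is chosen only for the sketch.

```c
#include <stdbool.h>

/* Hint classes; the numbering is an assumption made for this sketch. */
enum write_hint {
	HINT_NOT_SET,
	HINT_NONE,
	HINT_SHORT,
	HINT_MEDIUM,
	HINT_LONG,
	HINT_EXTREME
};

/* Toy request: a contiguous run of sectors tagged with one hint. */
struct request {
	unsigned long long sector;
	unsigned int nr_sectors;
	enum write_hint hint;
};

/* b may be appended to a only if it is contiguous and has the same hint. */
bool can_back_merge(const struct request *a, const struct request *b)
{
	if (a->hint != b->hint)
		return false;

	return a->sector + a->nr_sectors == b->sector;
}
```

Refusing the merge keeps data with different expected life times in separate requests, so a driver can still place them separately.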
2017-06-27 | fs: add fcntl() interface for setting/getting write life time hints | Jens Axboe
Define a set of write life time hints:

RWH_WRITE_LIFE_NOT_SET    No hint information set
RWH_WRITE_LIFE_NONE       No hints about write life time
RWH_WRITE_LIFE_SHORT      Data written has a short life time
RWH_WRITE_LIFE_MEDIUM     Data written has a medium life time
RWH_WRITE_LIFE_LONG       Data written has a long life time
RWH_WRITE_LIFE_EXTREME    Data written has an extremely long life time

The intent is for these values to be relative to each other, no absolute meaning should be attached to these flag names.

Add an fcntl interface for querying these flags, and also for setting them as well:

F_GET_RW_HINT         Returns the read/write hint set on the underlying inode.
F_SET_RW_HINT         Set one of the above write hints on the underlying inode.
F_GET_FILE_RW_HINT    Returns the read/write hint set on the file descriptor.
F_SET_FILE_RW_HINT    Set one of the above write hints on the file descriptor.

The user passes in a pointer to a 64-bit value to get/set these values, and the interface returns 0/-1 on success/error.

Sample program testing/implementing basic setting/getting of write hints is below.

Add support for storing the write life time hint in the inode flags and in struct file as well, and pass them to the kiocb flags. If both a file and its corresponding inode have a write hint, then we use the one in the file, if available. The file hint can be used for sync/direct IO, for buffered writeback only the inode hint is available.

This is in preparation for utilizing these hints in the block layer, to guide on-media data placement.
/*
 * writehint.c: get or set an inode write hint
 */
#include <stdio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdbool.h>
#include <inttypes.h>

#ifndef F_GET_RW_HINT
#define F_LINUX_SPECIFIC_BASE	1024
#define F_GET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 11)
#define F_SET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 12)
#endif

/* Hint names, indexed by the numeric hint value */
static char *str[] = {
	"RWH_WRITE_LIFE_NOT_SET",
	"RWH_WRITE_LIFE_NONE",
	"RWH_WRITE_LIFE_SHORT",
	"RWH_WRITE_LIFE_MEDIUM",
	"RWH_WRITE_LIFE_LONG",
	"RWH_WRITE_LIFE_EXTREME"
};

int main(int argc, char *argv[])
{
	uint64_t hint;
	int fd, ret;

	if (argc < 2) {
		fprintf(stderr, "%s: file <hint>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 2;
	}

	/* Optional second argument: numeric hint to set on the inode */
	if (argc > 2) {
		hint = atoi(argv[2]);
		ret = fcntl(fd, F_SET_RW_HINT, &hint);
		if (ret < 0) {
			perror("fcntl: F_SET_RW_HINT");
			return 4;
		}
	}

	ret = fcntl(fd, F_GET_RW_HINT, &hint);
	if (ret < 0) {
		perror("fcntl: F_GET_RW_HINT");
		return 3;
	}

	/* Guard the table lookup against values this program does not know */
	if (hint >= sizeof(str) / sizeof(str[0])) {
		fprintf(stderr, "%s: unknown hint %" PRIu64 "\n", argv[1], hint);
		return 5;
	}

	printf("%s: hint %s\n", argv[1], str[hint]);
	close(fd);
	return 0;
}

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 | lightnvm: if LUNs are already allocated fix return | Rakesh Pandit
While creating a new device with NVM_DEV_CREATE, if LUNs are already allocated, the ioctl would return -ENOMEM, which is wrong. This patch propagates -EBUSY from nvm_reserve_luns, which is the correct response. Fixes: ade69e243 ("lightnvm: merge gennvm with core") Reviewed-by: Frans Klaver <fransklaver@gmail.com> Signed-off-by: Rakesh Pandit <rakesh@tuxera.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
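The fix amounts to passing the callee's error code through instead of replacing it. A minimal sketch, assuming a plain boolean array as the LUN reservation map; `reserve_luns()` and `create_target()` are hypothetical stand-ins, and the caller is assumed to pass an in-bounds range.

```c
#include <errno.h>
#include <stdbool.h>

/* Reserve LUNs begin..end (inclusive); roll back and fail if one is taken. */
int reserve_luns(bool *in_use, int begin, int end)
{
	for (int i = begin; i <= end; i++) {
		if (in_use[i]) {
			/* Release what this call reserved so far. */
			while (--i >= begin)
				in_use[i] = false;
			return -EBUSY;
		}
		in_use[i] = true;
	}
	return 0;
}

/* Create a target on the given LUN range. */
int create_target(bool *in_use, int begin, int end)
{
	int ret = reserve_luns(in_use, begin, end);

	if (ret)
		return ret;	/* propagate -EBUSY rather than a fixed -ENOMEM */

	/* ... the rest of target setup would follow here ... */
	return 0;
}
```

Returning `ret` unchanged lets user space tell "LUNs already taken" apart from a real allocation failure.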
2017-06-26 | lightnvm: pblk: fail gracefully on irrec. error | Javier González
Due to user writes being decoupled from media writes because of the need for an intermediate write buffer, irrecoverable media write errors lead to pblk stalling; user writes fill up the buffer and end up in an infinite retry loop. In order to let user writes fail gracefully, it is necessary for pblk to keep track of its own internal state and prevent further writes from being placed into the write buffer. This patch implements a state machine to keep track of internal errors and, in case of failure, fail further user writes in a standard way. Depending on the type of error, pblk will do its best to persist buffered writes (which are already acknowledged) and close down in a graceful manner. This way, data might be recovered by re-instantiating pblk. Such a state machine paves the way for a state-based FTL log. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
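A rough sketch of such a state machine, with invented names and only three states; the actual pblk states and transitions are not given in this message, so `enum ftl_state`, `user_write()`, `media_write_failed()` and `drain()` are all assumptions.

```c
#include <errno.h>

/* Illustrative states; the real driver may define more or different ones. */
enum ftl_state { FTL_RUNNING, FTL_STOPPING, FTL_STOPPED };

struct ftl {
	enum ftl_state state;
	unsigned int buffered;	/* acknowledged writes not yet on media */
};

/* User write path: refuse new data once an internal error was recorded. */
int user_write(struct ftl *f)
{
	if (f->state != FTL_RUNNING)
		return -EIO;

	f->buffered++;
	return 0;
}

/* Called when a media write fails irrecoverably. */
void media_write_failed(struct ftl *f)
{
	if (f->state == FTL_RUNNING)
		f->state = FTL_STOPPING;
}

/* Best effort: persist what was already acknowledged, then shut down. */
void drain(struct ftl *f)
{
	if (f->state != FTL_STOPPING)
		return;

	f->buffered = 0;	/* stands in for flushing the write buffer */
	f->state = FTL_STOPPED;
}
```

Because the write path checks the state before touching the buffer, writers get an error right away instead of retrying against a buffer that can no longer drain.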
2017-06-26 | lightnvm: pblk: set mempool and workqueue params. | Javier González
Make constants to define sizes for internal mempools and workqueues. In this process, adjust the values to be more meaningful given the internal constraints of the FTL. In order to do this for workqueues, separate the current auxiliary workqueue into two dedicated workqueues to manage lines being closed and bad blocks. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-26 | lightnvm: pblk: redesign GC algorithm | Javier González
At the moment, in order to get enough read parallelism, we have recycled several lines at the same time. This approach has proven not to work well when reaching capacity, since we end up mixing valid data from all lines, thus not maintaining a sustainable free/recycled line ratio. The new design relies on a two level workqueue mechanism. In the first level, we read the metadata for a number of lines based on the GC list they reside on (this is governed by the number of valid sectors in each line). In the second level, we recycle a single line at a time. Here, we issue reads in parallel, while a single GC write thread places data in the write buffer. This design allows us to (i) only move data from one line at a time, thus maintaining a sane free/recycled ratio and (ii) keep the GC writer busy with recycled data. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
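Choosing one line at a time can be sketched as a victim selection over per-line valid sector counts. The policy shown, picking the closed line with the fewest valid sectors, is an assumption made for illustration; the message only says the GC lists are governed by valid sector counts. `struct line` and `pick_gc_victim()` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

struct line {
	unsigned int valid_sectors;	/* sectors still holding live data */
	bool closed;			/* only fully written lines qualify */
};

/*
 * Return the index of the closed line with the fewest valid sectors,
 * or -1 if no line is eligible. One victim is recycled at a time.
 */
int pick_gc_victim(const struct line *lines, size_t nr_lines)
{
	int victim = -1;

	for (size_t i = 0; i < nr_lines; i++) {
		if (!lines[i].closed)
			continue;
		if (victim < 0 ||
		    lines[i].valid_sectors < lines[victim].valid_sectors)
			victim = (int)i;
	}
	return victim;
}
```

Moving the least live data per reclaimed line is what keeps the ratio of freed to rewritten space favourable.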