
[GIT PULL] SCSI multi-actuator support #11

Merged: 164 commits, Nov 1, 2021

Commits on Oct 18, 2021

  1. blk-cgroup: blk_cgroup_bio_start() should use irq-safe operations on blkg->iostat_cpu
    
    c3df5fb ("cgroup: rstat: fix A-A deadlock on 32bit around
    u64_stats_sync") made u64_stats updates irq-safe to avoid A-A deadlocks.
    Unfortunately, the conversion missed one in blk_cgroup_bio_start(). Fix it.
    
    Fixes: 2d146aa ("mm: memcontrol: switch to rstat")
    Cc: stable@vger.kernel.org # v5.13+
    Reported-by: syzbot+9738c8815b375ce482a1@syzkaller.appspotmail.com
    Signed-off-by: Tejun Heo <tj@kernel.org>
    Link: https://lore.kernel.org/r/YWi7NrQdVlxD6J9W@slm.duckdns.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    htejun authored and axboe committed Oct 18, 2021
    Commit: 3c08b09
  2. mm: don't include <linux/blk-cgroup.h> in <linux/writeback.h>

    blk-cgroup.h pulls in blkdev.h and thus pretty much all the block
    headers.  Break this dependency chain by turning wbc_blkcg_css into a
    macro and dropping the blk-cgroup.h include.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Link: https://lore.kernel.org/r/20210920123328.1399408-2-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit: 348332e
  3. mm: don't include <linux/blk-cgroup.h> in <linux/backing-dev.h>

    There is no need to pull blk-cgroup.h and thus blkdev.h in here, so
    break the include chain.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Link: https://lore.kernel.org/r/20210920123328.1399408-3-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit: e41d12f
  4. mm: don't include <linux/blkdev.h> in <linux/backing-dev.h>

    Move inode_to_bdi out of line to avoid having to include blkdev.h.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Link: https://lore.kernel.org/r/20210920123328.1399408-4-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit: ccdf774
  5. mm: remove spurious blkdev.h includes

    Various files have acquired spurious includes of <linux/blkdev.h> over
    time.  Remove them.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Link: https://lore.kernel.org/r/20210920123328.1399408-5-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit: 518d550
  6. arch: remove spurious blkdev.h includes

    Various files have acquired spurious includes of <linux/blkdev.h> over
    time.  Remove them.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Link: https://lore.kernel.org/r/20210920123328.1399408-6-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit: dcbfa22
  7. kernel: remove spurious blkdev.h includes

    Various files have acquired spurious includes of <linux/blkdev.h> over
    time.  Remove them.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Link: https://lore.kernel.org/r/20210920123328.1399408-7-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit: 545c664
  8. sched: move the <linux/blkdev.h> include out of kernel/sched/sched.h

    Only core.c needs blkdev.h, so move the #include statement there.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Link: https://lore.kernel.org/r/20210920123328.1399408-8-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit: 6a5850d
  9. block: remove the unused rq_end_sector macro

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Link: https://lore.kernel.org/r/20210920123328.1399408-9-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit: 1d9433c
  10. block: remove the unused blk_queue_state enum

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Link: https://lore.kernel.org/r/20210920123328.1399408-10-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit: 9013823
  11. block: remove the cmd_size field from struct request_queue

    Entirely unused.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Link: https://lore.kernel.org/r/20210920123328.1399408-11-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit: 713e4e1
  12. block: remove the struct blk_queue_ctx forward declaration

    This type doesn't exist at all, so no need to forward declare it.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Link: https://lore.kernel.org/r/20210920123328.1399408-12-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit: 9778ac7
  13. block: move elevator.h to block/

    Except for the features passed to blk_queue_required_elevator_features,
    elevator.h is only needed internally to the block layer.  Move the
    ELEVATOR_F_* definitions to blkdev.h, then move elevator.h to
    block/, dropping all the spurious includes outside of it.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Link: https://lore.kernel.org/r/20210920123328.1399408-13-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit: 2e9bc34
  14. block: drop unused includes in <linux/blkdev.h>

    Drop various includes that are not actually used in blkdev.h itself.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Link: https://lore.kernel.org/r/20210920123328.1399408-14-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit: 3ab0bc7
  15. block: drop unused includes in <linux/genhd.h>

    Drop various includes that are not actually used in genhd.h itself, and
    move the remaining includes closer together.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Link: https://lore.kernel.org/r/20210920123328.1399408-15-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit: b81e0c2
  16. block: move a few merge helpers out of <linux/blkdev.h>

    These are block-layer internal helpers, so move them to block/blk.h and
    block/blk-merge.c.  Also update a comment a bit to use better grammar.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Link: https://lore.kernel.org/r/20210920123328.1399408-16-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit: badf7f6
  17. block: move integrity handling out of <linux/blkdev.h>

    Split the integrity/metadata handling definitions out into a new header.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Link: https://lore.kernel.org/r/20210920123328.1399408-17-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit: fe45e63
  18. block: move struct request to blk-mq.h

    struct request is only used by blk-mq drivers, so move it and all
    related declarations to blk-mq.h.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Link: https://lore.kernel.org/r/20210920123328.1399408-18-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit: 24b83de
  19. block/mq-deadline: Improve request accounting further

    The scheduler .insert_requests() callback is called when a request is
    queued for the first time and also when it is requeued. Only count a
    request the first time it is queued. Additionally, since the mq-deadline
    scheduler only performs zone locking for requests that have been
    inserted, skip the zone unlock code for requests that have not been
    inserted into the mq-deadline scheduler.
    
    Fixes: 38ba64d ("block/mq-deadline: Track I/O statistics")
    Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
    Reviewed-by: Niklas Cassel <Niklas.Cassel@wdc.com>
    Cc: Hannes Reinecke <hare@suse.de>
    Signed-off-by: Bart Van Assche <bvanassche@acm.org>
    Link: https://lore.kernel.org/r/20210927220328.1410161-2-bvanassche@acm.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    bvanassche authored and axboe committed Oct 18, 2021
    Commit: e2c7275
  20. block/mq-deadline: Add an invariant check

    Check a statistics invariant at module unload time. When running
    blktests, the invariant is verified every time a request queue is
    removed and hence is verified at least once per test.
    
    Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
    Reviewed-by: Niklas Cassel <Niklas.Cassel@wdc.com>
    Cc: Hannes Reinecke <hare@suse.de>
    Signed-off-by: Bart Van Assche <bvanassche@acm.org>
    Link: https://lore.kernel.org/r/20210927220328.1410161-3-bvanassche@acm.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    bvanassche authored and axboe committed Oct 18, 2021
    Commit: 32f64ca
  21. block/mq-deadline: Stop using per-CPU counters

    Calculating the sum over all CPUs of per-CPU counters frequently is
    inefficient. Hence switch from per-CPU to individual counters. Three
    counters are protected by the mq-deadline spinlock since these are
    only accessed from contexts that already hold that spinlock. The fourth
    counter is atomic because protecting it with the mq-deadline spinlock
    would trigger lock contention.
    
    Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
    Reviewed-by: Niklas Cassel <Niklas.Cassel@wdc.com>
    Cc: Hannes Reinecke <hare@suse.de>
    Signed-off-by: Bart Van Assche <bvanassche@acm.org>
    Link: https://lore.kernel.org/r/20210927220328.1410161-4-bvanassche@acm.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    bvanassche authored and axboe committed Oct 18, 2021
    Commit: bce0363
  22. block/mq-deadline: Prioritize high-priority requests

    In addition to reverting commit 7b05bf7 ("Revert "block/mq-deadline:
    Prioritize high-priority requests""), this patch uses 'jiffies' instead
    of ktime_get() in the code for aging lower priority requests.
    
    This patch has been tested as follows:
    
    Measured QD=1/jobs=1 IOPS for nullb with the mq-deadline scheduler.
    Result without and with this patch: 555 K IOPS.
    
    Measured QD=1/jobs=8 IOPS for nullb with the mq-deadline scheduler.
    Result without and with this patch: about 380 K IOPS.
    
    Ran the following script:
    
    set -e
    scriptdir=$(dirname "$0")
    if [ -e /sys/module/scsi_debug ]; then modprobe -r scsi_debug; fi
    modprobe scsi_debug ndelay=1000000 max_queue=16
    sd=''
    while [ -z "$sd" ]; do
      sd=$(basename /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/*)
    done
    echo $((100*1000)) > "/sys/block/$sd/queue/iosched/prio_aging_expire"
    if [ -e /sys/fs/cgroup/io.prio.class ]; then
      cd /sys/fs/cgroup
      echo restrict-to-be >io.prio.class
      echo +io > cgroup.subtree_control
    else
      cd /sys/fs/cgroup/blkio/
      echo restrict-to-be >blkio.prio.class
    fi
    echo $$ >cgroup.procs
    mkdir -p hipri
    cd hipri
    if [ -e io.prio.class ]; then
      echo none-to-rt >io.prio.class
    else
      echo none-to-rt >blkio.prio.class
    fi
    { "${scriptdir}/max-iops" -a1 -d32 -j1 -e mq-deadline "/dev/$sd" >& ~/low-pri.txt & }
    echo $$ >cgroup.procs
    "${scriptdir}/max-iops" -a1 -d32 -j1 -e mq-deadline "/dev/$sd" >& ~/hi-pri.txt
    
    Result:
    * 11000 IOPS for the high-priority job
    *    40 IOPS for the low-priority job
    
    If the prio aging expiry time is changed from 100 s to 0, the IOPS
    results change to 6712 and 6796 IOPS.
    
    The max-iops script is a script that runs fio with the following arguments:
    --bs=4K --gtod_reduce=1 --ioengine=libaio --ioscheduler=${arg_e} --runtime=60
    --norandommap --rw=read --thread --buffered=0 --numjobs=${arg_j}
    --iodepth=${arg_d} --iodepth_batch_submit=${arg_a}
    --iodepth_batch_complete=$((arg_d / 2)) --name=${positional_argument_1}
    --filename=${positional_argument_1}
    
    Cc: Damien Le Moal <damien.lemoal@wdc.com>
    Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
    Cc: Hannes Reinecke <hare@suse.de>
    Signed-off-by: Bart Van Assche <bvanassche@acm.org>
    Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
    Link: https://lore.kernel.org/r/20210927220328.1410161-5-bvanassche@acm.org
    [axboe: @latest -> @latest_start]
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    bvanassche authored and axboe committed Oct 18, 2021
    Commit: 322cff7
  23. block: print the current process in handle_bad_sector

    Make the bad sector information a little more useful by printing
    current->comm to identify the caller.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
    Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
    Link: https://lore.kernel.org/r/20210928052755.113016-1-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit: 8a3ee67
  24. blk-mq: Change rqs check in blk_mq_free_rqs()

    The original code in commit 24d2f90 ("blk-mq: split out tag
    initialization, support shared tags") would check tags->rqs is non-NULL and
    then dereference tags->rqs[].
    
    Then in commit 2af8cbe ("blk-mq: split tag ->rqs[] into two"), we
    started to dereference tags->static_rqs[], but continued to check non-NULL
    tags->rqs.
    
    Check that tags->static_rqs is non-NULL instead, which is more logical.
    
    Signed-off-by: John Garry <john.garry@huawei.com>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Link: https://lore.kernel.org/r/1633429419-228500-2-git-send-email-john.garry@huawei.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    John Garry authored and axboe committed Oct 18, 2021
    Commit: 65de57b
  25. block: Rename BLKDEV_MAX_RQ -> BLKDEV_DEFAULT_RQ

    It is a bit confusing that there is BLKDEV_MAX_RQ and MAX_SCHED_RQ, as
    the name BLKDEV_MAX_RQ would imply it is always the maximum number of
    requests, which it is not.
    
    Rename BLKDEV_MAX_RQ to BLKDEV_DEFAULT_RQ, matching its usage: the
    default number of requests assigned when allocating a request queue.
    
    Signed-off-by: John Garry <john.garry@huawei.com>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Link: https://lore.kernel.org/r/1633429419-228500-3-git-send-email-john.garry@huawei.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    John Garry authored and axboe committed Oct 18, 2021
    Commit: d2a2796
  26. blk-mq: Relocate shared sbitmap resize in blk_mq_update_nr_requests()

    For shared sbitmap, if the call to blk_mq_tag_update_depth() was
    successful for any hctx when hctx->sched_tags is not set, then it would
    be successful for all (due to the way in which blk_mq_tag_update_depth()
    fails).
    
    As such, there is no need to call blk_mq_tag_resize_shared_sbitmap() for
    each hctx. So relocate the call to after the hctx iteration, under the
    !q->elevator check, which is equivalent (to !hctx->sched_tags).
    
    Signed-off-by: John Garry <john.garry@huawei.com>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Link: https://lore.kernel.org/r/1633429419-228500-4-git-send-email-john.garry@huawei.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    John Garry authored and axboe committed Oct 18, 2021
    Commit: 8fa0446
  27. blk-mq: Invert check in blk_mq_update_nr_requests()

    It's easier to read:
    
    if (x)
    	X;
    else
    	Y;
    
    over:
    
    if (!x)
    	Y;
    else
    	X;
    
    No functional change intended.
    
    Signed-off-by: John Garry <john.garry@huawei.com>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Link: https://lore.kernel.org/r/1633429419-228500-5-git-send-email-john.garry@huawei.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    John Garry authored and axboe committed Oct 18, 2021
    Commit: f6adcef
  28. blk-mq-sched: Rename blk_mq_sched_alloc_{tags -> map_and_rqs}()

    Function blk_mq_sched_alloc_tags() does the same as
    __blk_mq_alloc_map_and_request(), so give it a similar name for
    consistency.
    
    Similarly rename label err_free_tags -> err_free_map_and_rqs.
    
    Signed-off-by: John Garry <john.garry@huawei.com>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Link: https://lore.kernel.org/r/1633429419-228500-6-git-send-email-john.garry@huawei.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    John Garry authored and axboe committed Oct 18, 2021
    Commit: d99a6bb
  29. blk-mq-sched: Rename blk_mq_sched_free_{requests -> rqs}()

    To be more concise and consistent in naming, rename
    blk_mq_sched_free_requests() -> blk_mq_sched_free_rqs().
    
    Signed-off-by: John Garry <john.garry@huawei.com>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Link: https://lore.kernel.org/r/1633429419-228500-7-git-send-email-john.garry@huawei.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    John Garry authored and axboe committed Oct 18, 2021
    Commit: 1820f4f
  30. blk-mq: Pass driver tags to blk_mq_clear_rq_mapping()

    Function blk_mq_clear_rq_mapping() will be used for shared sbitmap tags
    in the future, so pass a driver tags pointer instead of the tagset
    container and HW queue index.
    
    Signed-off-by: John Garry <john.garry@huawei.com>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Link: https://lore.kernel.org/r/1633429419-228500-8-git-send-email-john.garry@huawei.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    John Garry authored and axboe committed Oct 18, 2021
    Commit: f32e4ea
  31. blk-mq: Don't clear driver tags own mapping

    Function blk_mq_clear_rq_mapping() is required to clear the sched tags
    mappings in driver tags rqs[].
    
    But there is no need for driver tags to clear their own mapping, so
    skip clearing the mapping in this scenario.
    
    Signed-off-by: John Garry <john.garry@huawei.com>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Link: https://lore.kernel.org/r/1633429419-228500-9-git-send-email-john.garry@huawei.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    John Garry authored and axboe committed Oct 18, 2021
    Commit: 4f245d5
  32. blk-mq: Add blk_mq_tag_update_sched_shared_sbitmap()

    Put the functionality to update the sched shared sbitmap size into a
    common function.
    
    Since the same formula is always used to resize, and its inputs can be
    derived from the request queue argument, just pass the request queue
    pointer.
    
    Signed-off-by: John Garry <john.garry@huawei.com>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Link: https://lore.kernel.org/r/1633429419-228500-10-git-send-email-john.garry@huawei.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    John Garry authored and axboe committed Oct 18, 2021
    Commit: a7e7388
  33. blk-mq: Add blk_mq_alloc_map_and_rqs()

    Add a function to combine allocating tags and the associated requests,
    and factor out common patterns to use this new function.
    
    Some functions only call blk_mq_alloc_map_and_rqs() now, but more
    functionality will be added later.
    
    Also make blk_mq_alloc_rq_map() and blk_mq_alloc_rqs() static since they
    are only used in blk-mq.c, and finally rename some functions for
    conciseness and consistency with other function names:
    - __blk_mq_alloc_map_and_{request -> rqs}()
    - blk_mq_alloc_{map_and_requests -> set_map_and_rqs}()
    
    Suggested-by: Ming Lei <ming.lei@redhat.com>
    Signed-off-by: John Garry <john.garry@huawei.com>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Link: https://lore.kernel.org/r/1633429419-228500-11-git-send-email-john.garry@huawei.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    John Garry authored and axboe committed Oct 18, 2021
    Commit: 63064be
  34. blk-mq: Refactor and rename blk_mq_free_map_and_{requests->rqs}()

    Refactor blk_mq_free_map_and_requests() such that it can be used at many
    sites at which the tag map and rqs are freed.
    
    Also rename to blk_mq_free_map_and_rqs(), which is shorter and matches the
    alloc equivalent.
    
    Suggested-by: Ming Lei <ming.lei@redhat.com>
    Signed-off-by: John Garry <john.garry@huawei.com>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Link: https://lore.kernel.org/r/1633429419-228500-12-git-send-email-john.garry@huawei.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    John Garry authored and axboe committed Oct 18, 2021
    Commit: 645db34
  35. blk-mq: Use shared tags for shared sbitmap support

    Currently we use separate sbitmap pairs and active_queues atomic_t for
    shared sbitmap support.
    
    However, a full set of static requests is allocated per HW queue, which
    is quite wasteful, considering that the total number of requests usable
    at any given time across all HW queues is limited by the shared sbitmap
    depth.
    
    As such, it is considerably more memory efficient in the case of shared
    sbitmap to allocate a set of static rqs per tag set or request queue, and
    not per HW queue.
    
    So replace the sbitmap pairs and active_queues atomic_t with a shared
    tags per tagset and request queue, which will hold a set of shared static
    rqs.
    
    Since there is now no valid HW queue index to be passed to the blk_mq_ops
    .init and .exit_request callbacks, pass an invalid index token. This
    changes the semantics of the APIs, such that the callback would need to
    validate the HW queue index before using it. Currently no user of shared
    sbitmap actually uses the HW queue index (as would be expected).
    
    Signed-off-by: John Garry <john.garry@huawei.com>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Link: https://lore.kernel.org/r/1633429419-228500-13-git-send-email-john.garry@huawei.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    John Garry authored and axboe committed Oct 18, 2021
    Commit: e155b0c
  36. blk-mq: Stop using pointers for blk_mq_tags bitmap tags

    Now that we use shared tags for shared sbitmap support, we don't require
    the tags sbitmap pointers, so drop them.
    
    This essentially reverts commit 222a5ae ("blk-mq: Use pointers for
    blk_mq_tags bitmap tags").
    
    Function blk_mq_init_bitmap_tags() is removed as well, since it would
    only be a wrapper for blk_mq_init_bitmaps().
    
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Signed-off-by: John Garry <john.garry@huawei.com>
    Link: https://lore.kernel.org/r/1633429419-228500-14-git-send-email-john.garry@huawei.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    John Garry authored and axboe committed Oct 18, 2021
    Commit: ae0f1a7
  37. blk-mq: Change shared sbitmap naming to shared tags

    Now that shared sbitmap support really means shared tags, rename symbols
    to match that.
    
    Signed-off-by: John Garry <john.garry@huawei.com>
    Link: https://lore.kernel.org/r/1633429419-228500-15-git-send-email-john.garry@huawei.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    John Garry authored and axboe committed Oct 18, 2021
    Commit: 079a2e3
  38. block: move blk-throtl fast path inline

    Even if no policies are defined, we spend ~2% of the total IO time
    checking. Move the fast path inline.
    
    Acked-by: Tejun Heo <tj@kernel.org>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 18, 2021
    Commit: a7b36ee
  39. block: inherit request start time from bio for BLK_CGROUP

    Doing high IOPS testing with blk-cgroups enabled spends ~15-20% of the
    time just doing ktime_get_ns() -> readtsc. We essentially read and
    set the start time twice, one for the bio and then again when that bio
    is mapped to a request.
    
    Given that the time between the two is very short, inherit the bio
    start time instead of reading it again. This cuts 1/3rd of the overhead
    of the time keeping.
    
    Acked-by: Tejun Heo <tj@kernel.org>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 18, 2021
    Commit: 0006707
  40. block: bump max plugged deferred size from 16 to 32

    Particularly for NVMe with efficient deferred submission for many
    requests, there are nice benefits to be seen by bumping the default max
    plug count from 16 to 32. This is especially true for virtualized
    setups, where the submit part is more expensive, but it can be noticed
    even on native hardware.
    
    Reduce the multiple queue factor from 4 to 2, since we're changing the
    default size.
    
    While changing it, move the defines into the block layer private header.
    These aren't values that anyone outside of the block layer uses, or
    should use.
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 18, 2021
    Commit: ba0ffdd
  41. block: pre-allocate requests if plug is started and is a batch

    The caller typically has a good (or even exact) idea of how many requests
    it needs to submit. We can make the request/tag allocation a lot more
    efficient if we just allocate N requests/tags upfront when we queue the
    first bio from the batch.
    
    Provide a new plug start helper that allows the caller to specify how many
    IOs are expected. This sets plug->nr_ios, and we can use that for smarter
    request allocation. The plug provides a holding spot for requests, and
    request allocation will check it before calling into the normal request
    allocation path.
    
    When blk_finish_plug() is called, check if there are unused requests
    and free them. This should not happen in normal operation. The
    exception is if we get merging; then we may be left with requests that
    need freeing when done.
    
    This raises the per-core performance on my setup from ~5.8M to ~6.1M
    IOPS.
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 18, 2021
    Commit 47c122e
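    The holding-spot idea described above can be sketched in plain C. This is an illustrative stand-in, not the kernel's struct blk_plug; the names (plug_start, alloc_request, reqs) are invented for the example:

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Sketch of the plug as a request cache: the caller states how many
     * IOs it expects (nr_ios); allocation takes from the cache before
     * falling back to the normal path. */
    #define CACHE_MAX 32

    struct plug {
        int nr_ios;              /* expected IOs, set at plug start */
        int cached;              /* requests currently held */
        int reqs[CACHE_MAX];     /* stand-ins for struct request */
    };

    static void plug_start(struct plug *p, int nr_ios)
    {
        p->nr_ios = nr_ios;
        p->cached = 0;
    }

    static int alloc_request(struct plug *p)
    {
        if (p->cached == 0) {
            /* batch-fill from the "normal" allocator */
            int n = p->nr_ios < CACHE_MAX ? p->nr_ios : CACHE_MAX;
            for (int i = 0; i < n; i++)
                p->reqs[i] = i;
            p->cached = n;
        }
        return p->reqs[--p->cached];   /* take from the holding spot */
    }

    int main(void)
    {
        struct plug p;

        plug_start(&p, 4);
        (void)alloc_request(&p);   /* first alloc batch-fills nr_ios */
        assert(p.cached == 3);     /* three left for later submissions */
        (void)alloc_request(&p);
        assert(p.cached == 2);
        return 0;
    }
    ```

    Subsequent allocations within the same plug hit the cache without touching the allocator, which is where the batching win comes from.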
  42. blk-mq: cleanup and rename __blk_mq_alloc_request

    The newly added loop for the cached requests in __blk_mq_alloc_request
    is a little too convoluted for my taste, so unwind it a bit.  Also
    rename the function to __blk_mq_alloc_requests now that it can allocate
    more than a single request.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20211012104045.658051-2-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit b90cfae
  43. blk-mq: cleanup blk_mq_submit_bio

    Move the blk_mq_alloc_data stack allocation only into the branch
    that actually needs it, and use rq->mq_hctx instead of data.hctx
    to refer to the hctx.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20211012104045.658051-3-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit 0f38d76
  44. block: don't dereference request after flush insertion

    We could have a race here, where the request gets freed before we call
    into blk_mq_run_hw_queue(). If this happens, we cannot rely on the state
    of the request.
    
    Grab the hardware context before inserting the flush.
    
    Fixes: 0f38d76 ("blk-mq: cleanup blk_mq_submit_bio")
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 18, 2021
    Commit 4a60f36
  45. block: unexport blkdev_ioctl

    With the raw driver gone, there is no modular user left.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20211012104450.659013-2-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit fea349b
  46. block: move the *blkdev_ioctl declarations out of blkdev.h

    These are only used inside of block/.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20211012104450.659013-3-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit 84b8514
  47. block: merge block_ioctl into blkdev_ioctl

    Simplify the ioctl path and match the code structure on the compat side.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20211012104450.659013-4-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit 8a70951
  48. block: inline hot paths of blk_account_io_*()

    Extract hot paths of __blk_account_io_start() and
    __blk_account_io_done() into inline functions, so we don't always pay
    for function calls.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/b0662a636bd4cc7b4f84c9d0a41efa46a688ef13.1633781740.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    isilence authored and axboe committed Oct 18, 2021
    Commit be6bfe3
  49. blk-mq: inline hot part of __blk_mq_sched_restart

    Extract a fast check out of __blk_mq_sched_restart() and inline it for
    performance reasons.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/894abaa0998e5999f2fe18f271e5efdfc2c32bd2.1633781740.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    isilence authored and axboe committed Oct 18, 2021
    Commit e9ea159
  50. block: remove BIO_BUG_ON

    BIO_DEBUG is always defined, so just switch the two instances to use
    BUG_ON directly.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20211012161804.991559-2-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit 9e8c0d0
  51. block: don't include <linux/ioprio.h> in <linux/bio.h>

    bio.h doesn't need any of the definitions from ioprio.h.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20211012161804.991559-3-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit 11d9cab
  52. block: move bio_mergeable out of bio.h

    bio_mergeable is only needed by I/O schedulers, so move it to
    blk-mq-sched.h.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20211012161804.991559-4-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit 8addffd
  53. block: fold bio_cur_bytes into blk_rq_cur_bytes

    Fold bio_cur_bytes into the only caller.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20211012161804.991559-5-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit b6559d8
  54. block: move bio_full out of bio.h

    bio_full is only used in bio.c, so move it there.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20211012161804.991559-6-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit 9a6083b
  55. block: mark __bio_try_merge_page static

    Mark __bio_try_merge_page static and move it up a bit to avoid the need
    for a forward declaration.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20211012161804.991559-7-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit 9774b39
  56. block: move bio_get_{first,last}_bvec out of bio.h

    bio_get_first_bvec and bio_get_last_bvec are only used in blk-merge.c,
    so move them there.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20211012161804.991559-8-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit ff18d77
  57. block: mark bio_truncate static

    bio_truncate is only used in bio.c, so mark it static.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20211012161804.991559-9-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit 4f7ab09
  58. blk-mq: optimise *end_request non-stat path

    We already have a blk_mq_need_time_stamp() check in
    __blk_mq_end_request() to get a timestamp, hide all the statistics
    accounting under it. It cuts some cycles for requests that don't need
    stats, and is free otherwise.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/e0f2ea812e93a8adcd07101212e7d7e70ca304e7.1634115360.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    isilence authored and axboe committed Oct 18, 2021
    Commit 8971a3b
  59. sbitmap: add __sbitmap_queue_get_batch()

    The block layer tag allocation batching still calls into sbitmap to get
    each tag, but we can improve on that. Add __sbitmap_queue_get_batch(),
    which returns a mask of tags all at once, along with an offset for
    those tags.
    
    An example return would be 0xff, where bits 0..7 are set, with
    tag_offset == 128. The valid tags in this case would be 128..135.
    
    A batch is specific to an individual sbitmap_map, hence it cannot be
    larger than that. The requested number of tags is automatically reduced
    to the max that can be satisfied with a single map.
    
    On failure, 0 is returned. The caller should fall back to single tag
    allocation at that point.
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 18, 2021
    Commit 9672b0d
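    The mask-plus-offset return convention can be illustrated with a small standalone decoder. This is plain C, not the kernel's sbitmap code; decode_tag_batch is an invented name for the example:

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* Decode a batch: 'mask' has one bit per allocated tag, 'tag_offset'
     * is the absolute tag number of bit 0. Stores the tag numbers into
     * 'tags' and returns how many were stored. */
    static int decode_tag_batch(unsigned long mask, unsigned int tag_offset,
                                unsigned int *tags, int max)
    {
        int n = 0;

        for (unsigned int bit = 0; mask && n < max; bit++, mask >>= 1)
            if (mask & 1)
                tags[n++] = tag_offset + bit;
        return n;
    }

    int main(void)
    {
        unsigned int tags[64];
        /* the example from the commit message: 0xff with
         * tag_offset == 128 yields the valid tags 128..135 */
        int n = decode_tag_batch(0xff, 128, tags, 64);

        assert(n == 8);
        assert(tags[0] == 128 && tags[7] == 135);
        printf("decoded %d tags: %u..%u\n", n, tags[0], tags[n - 1]);
        return 0;
    }
    ```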
  60. block: improve batched tag allocation

    Add a blk_mq_get_tags() helper, which uses the new sbitmap API for
    allocating a batch of tags all at once. This both simplifies the block
    code for batched allocation, and it is also more efficient than just
    doing repeated calls into __sbitmap_queue_get().
    
    This reduces the sbitmap overhead in peak runs from ~3% to ~1% and
    yields a performance increase from 6.6M IOPS to 6.8M IOPS for a single
    CPU core.
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 18, 2021
    Commit 349302d
  61. block: remove redundant =y from BLK_CGROUP dependency

    CONFIG_BLK_CGROUP is a boolean option, that is, its value is 'y' or 'n'.
    The comparison to 'y' is redundant.
    
    Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20210927140000.866249-2-masahiroy@kernel.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    masahir0y authored and axboe committed Oct 18, 2021
    Commit df252bd
  62. block: simplify Kconfig files

    Everything under block/ depends on BLOCK. BLOCK_HOLDER_DEPRECATED is
    selected from drivers/md/Kconfig, which is entirely dependent on BLOCK.
    
    Extend the 'if BLOCK' ... 'endif' so it covers the whole block/Kconfig.
    
    Also, clean up the definition of BLOCK_COMPAT and BLK_MQ_PCI because
    COMPAT and PCI are boolean.
    
    Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20210927140000.866249-3-masahiroy@kernel.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    masahir0y authored and axboe committed Oct 18, 2021
    Commit c50fca5
  63. block: move menu "Partition type" to block/partitions/Kconfig

    Move the menu to the relevant place.
    
    Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20210927140000.866249-4-masahiroy@kernel.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    masahir0y authored and axboe committed Oct 18, 2021
    Commit b8b98a6
  64. block: move CONFIG_BLOCK guard to top Makefile

    Every object under block/ depends on CONFIG_BLOCK.
    
    Move the guard to the top Makefile since there is no point to
    descend into block/ if CONFIG_BLOCK=n.
    
    Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20210927140000.866249-5-masahiroy@kernel.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    masahir0y authored and axboe committed Oct 18, 2021
    Commit 4c92890
  65. block: only check previous entry for plug merge attempt

    Currently we scan the entire plug list, which is potentially very
    expensive. In an IOPS bound workload, we can drive about 5.6M IOPS with
    merging enabled, and profiling shows that the plug merge check is the
    (by far) most expensive thing we're doing:
    
      Overhead  Command   Shared Object     Symbol
      +   20.89%  io_uring  [kernel.vmlinux]  [k] blk_attempt_plug_merge
      +    4.98%  io_uring  [kernel.vmlinux]  [k] io_submit_sqes
      +    4.78%  io_uring  [kernel.vmlinux]  [k] blkdev_direct_IO
      +    4.61%  io_uring  [kernel.vmlinux]  [k] blk_mq_submit_bio
    
    Instead of browsing the whole list, just check the previously inserted
    entry. That is enough for a naive merge check and will catch most cases,
    and for devices that need full merging, the IO scheduler attached to
    such devices will do that anyway. The plug merge is meant to be an
    inexpensive check to avoid getting a request, but if we repeatedly
    scan the list for every single insert, it is very much not a cheap
    check.
    
    With this patch, the workload instead runs at ~7.0M IOPS, providing
    a 25% improvement. Disabling merging entirely yields another 5%
    improvement.
    
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 18, 2021
    Commit d38a9c0
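    The "check only the previously inserted entry" idea can be sketched with a toy list. This is not the kernel's struct blk_plug; the structs and the sector_end merge criterion are invented stand-ins:

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Newest request sits at the head of a singly linked list. */
    struct rq {
        struct rq *next;
        int sector_end;   /* stand-in for the real merge criteria */
    };

    static int can_merge(const struct rq *rq, int bio_sector)
    {
        return rq && rq->sector_end == bio_sector;
    }

    /* Naive plug merge: look only at the head (the previously inserted
     * entry) instead of walking the whole list. */
    static int plug_try_merge(struct rq *head, int bio_sector)
    {
        return can_merge(head, bio_sector);
    }

    int main(void)
    {
        struct rq old    = { .next = NULL, .sector_end = 8 };
        struct rq newest = { .next = &old, .sector_end = 64 };

        assert(plug_try_merge(&newest, 64));   /* contiguous with newest */
        assert(!plug_try_merge(&newest, 8));   /* would match 'old', but
                                                  only one entry is
                                                  checked */
        return 0;
    }
    ```

    Misses like the second case are fine: an attached IO scheduler still performs full merging for devices that need it.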
  66. direct-io: remove blk_poll support

    The polling support in the legacy direct-io support is a little crufty.
    It already doesn't support the asynchronous polling needed for io_uring
    polling, and is hard to adopt to upcoming changes in the polling
    interfaces.  Given that all the major file systems already use the iomap
    direct I/O code, just drop the polling support.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
    Link: https://lore.kernel.org/r/20211012111226.760968-2-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit 94c2ed5
  67. block: don't try to poll multi-bio I/Os in __blkdev_direct_IO

    If an iocb is split into multiple bios we can't poll for both.  So don't
    even bother to try to poll in that case.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20211012111226.760968-3-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit 71fc3f5
  68. iomap: don't try to poll multi-bio I/Os in __iomap_dio_rw

    If an iocb is split into multiple bios we can't poll for both.  So don't
    bother to even try to poll in that case.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
    Link: https://lore.kernel.org/r/20211012111226.760968-4-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit f79d474
  69. io_uring: fix a layering violation in io_iopoll_req_issued

    syscall-level code can't just poke into the details of the poll cookie,
    which is private information of the block layer.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20211012111226.760968-5-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit 30da1b4
  70. blk-mq: factor out a blk_qc_to_hctx helper

    Add a helper to get the hctx from a request_queue and cookie, and fold
    the blk_qc_t_to_queue_num helper into it as no other callers are left.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
    Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
    Link: https://lore.kernel.org/r/20211012111226.760968-6-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit f70299f
  71. blk-mq: factor out a "classic" poll helper

    Factor the code to do the classic full metal polling out of blk_poll into
    a separate blk_mq_poll_classic helper.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
    Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
    Link: https://lore.kernel.org/r/20211012111226.760968-7-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit c6699d6
  72. blk-mq: remove blk_qc_t_to_tag and blk_qc_t_is_internal

    Merge both functions into their only caller to keep the blk-mq tag to
    blk_qc_t mapping as private as possible in blk-mq.c.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
    Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
    Link: https://lore.kernel.org/r/20211012111226.760968-8-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit efbabbe
  73. blk-mq: remove blk_qc_t_valid

    Move the trivial check into the only caller.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
    Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
    Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
    Link: https://lore.kernel.org/r/20211012111226.760968-9-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit 28a1ae6
  74. block: replace the spin argument to blk_iopoll with a flags argument

    Switch the boolean spin argument to blk_poll to passing a set of flags
    instead.  This will allow controlling polling behavior in a more
    fine-grained way.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
    Link: https://lore.kernel.org/r/20211012111226.760968-10-hch@lst.de
    [axboe: adapt to changed io_uring iopoll]
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit ef99b2d
  75. io_uring: don't sleep when polling for I/O

    There is no point in sleeping for the expected I/O completion timeout
    in the io_uring async polling model as we never poll for a specific
    I/O.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
    Link: https://lore.kernel.org/r/20211012111226.760968-11-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit d729cf9
  76. block: rename REQ_HIPRI to REQ_POLLED

    Unlike the RWF_HIPRI userspace ABI which is intentionally kept vague,
    the bio flag is specific to the polling implementation, so rename and
    document it properly.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
    Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
    Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
    Link: https://lore.kernel.org/r/20211012111226.760968-12-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit 6ce913f
  77. block: use SLAB_TYPESAFE_BY_RCU for the bio slab

    This flag ensures that the pages will not be reused for non-bio
    allocations before the end of an RCU grace period.  With that we can
    safely use an RCU lookup for bio polling as long as we are fine with
    occasionally polling the wrong device.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
    Link: https://lore.kernel.org/r/20211012111226.760968-13-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit 1a7e76e
  78. block: define 'struct bvec_iter' as packed

    'struct bvec_iter' is embedded into 'struct bio'; define it as packed
    so that we can get an extra 4 bytes for other uses without expanding
    the bio.
    
    'struct bvec_iter' is often allocated on the stack, so making it
    packed doesn't affect performance. I have also run io_uring on both
    nvme and null_blk and observed no performance effect from the change.
    
    Suggested-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
    Link: https://lore.kernel.org/r/20211012111226.760968-14-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Ming Lei authored and axboe committed Oct 18, 2021
    Configuration menu
    Copy the full SHA
    1941612 View commit details
    Browse the repository at this point in the history
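    The space saving can be demonstrated with layouts modeled on bvec_iter's fields (a u64 sector plus three u32 counters). These are illustrative userspace structs, not the kernel definitions:

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Without packing, the trailing u32s leave the struct padded out to
     * the u64 alignment (24 bytes on typical 64-bit ABIs). */
    struct iter_padded {
        uint64_t bi_sector;
        uint32_t bi_size;
        uint32_t bi_idx;
        uint32_t bi_bvec_done;
    };

    /* Packed, the struct is exactly the sum of its members: 20 bytes,
     * reclaiming 4 bytes inside the embedding bio. */
    struct iter_packed {
        uint64_t bi_sector;
        uint32_t bi_size;
        uint32_t bi_idx;
        uint32_t bi_bvec_done;
    } __attribute__((packed));

    int main(void)
    {
        assert(sizeof(struct iter_packed) == 20);
        assert(sizeof(struct iter_padded) >= sizeof(struct iter_packed));
        printf("padded=%zu packed=%zu\n",
               sizeof(struct iter_padded), sizeof(struct iter_packed));
        return 0;
    }
    ```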
  79. block: switch polling to be bio based

    Replace the blk_poll interface that requires the caller to keep a queue
    and cookie from the submissions with polling based on the bio.
    
    Polling for the bio itself leads to a few advantages:
    
     - the cookie construction can be made entirely private in blk-mq.c
     - the caller does not need to remember the request_queue and cookie
       separately and thus sidesteps their lifetime issues
     - keeping the device and the cookie inside the bio makes it trivial
       to support polling of BIOs remapped by stacking drivers
     - a lot of code to propagate the cookie back up the submission path can
       be removed entirely.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
    Link: https://lore.kernel.org/r/20211012111226.760968-15-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit 3e08773
  80. block: don't allow writing to the poll queue attribute

    The poll attribute is a historic artefact from before we had explicit
    poll queues that require driver-specific configuration. Just print a
    warning when writing to the attribute.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
    Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
    Link: https://lore.kernel.org/r/20211012111226.760968-16-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit a614dd2
  81. nvme-multipath: enable polled I/O

    Set the poll queue flag to enable polling, given that the multipath
    node just dispatches the bios to a lower queue.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
    Link: https://lore.kernel.org/r/20211012111226.760968-17-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 18, 2021
    Commit c712dcc
  82. block: cache bdev in struct file for raw bdev IO

    bdev = &BDEV_I(file->f_mapping->host)->bdev
    
    Getting struct block_device from a file requires two memory
    dereferences, as illustrated above. That takes a toll on performance,
    so cache it in the as-yet-unused file->private_data. This gives a
    noticeable peak performance improvement.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/8415f9fe12e544b9da89593dfbca8de2b52efe03.1634115360.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    isilence authored and axboe committed Oct 18, 2021
    Commit fac7c6d
  83. block: use flags instead of bit fields for blkdev_dio

    This generates a lot better code for me, and bumps performance from
    7650K IOPS to 7750K IOPS. Looking at profiles for the run and running
    perf diff, it confirms that we're now spending a lot less time there:
    
         6.38%     -2.80%  [kernel.vmlinux]  [k] blkdev_direct_IO
    
    Taking it from the 2nd most cycle consumer to only the 9th most at
    3.35% of the CPU time.
    
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 18, 2021
    Commit 09ce874
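    The flags-word pattern can be sketched as follows. The flag names and the dio_sketch struct are invented for illustration, not the actual blkdev_dio definitions; the point is that a single integer is tested and updated with plain AND/OR instead of the read-modify-write sequences bit-field members can require:

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* One bit per former bit field, in a single flags word. */
    #define DIO_MULTI_BIO    (1U << 0)
    #define DIO_SHOULD_DIRTY (1U << 1)
    #define DIO_IS_SYNC      (1U << 2)

    struct dio_sketch {
        unsigned int flags;
    };

    int main(void)
    {
        struct dio_sketch dio = { .flags = DIO_IS_SYNC };

        dio.flags |= DIO_MULTI_BIO;               /* set */
        assert(dio.flags & DIO_IS_SYNC);          /* test */
        assert(!(dio.flags & DIO_SHOULD_DIRTY));
        dio.flags &= ~DIO_MULTI_BIO;              /* clear */
        assert(dio.flags == DIO_IS_SYNC);
        printf("flags=0x%x\n", dio.flags);
        return 0;
    }
    ```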
  84. block: handle fast path of bio splitting inline

    The fast path is the case where no splitting is needed. Separate the
    handling into a check part we can inline, and an out-of-line handling
    path for when we do need to split.
    
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 18, 2021
    Commit abd45c1
  85. block: cache request queue in bdev

    There are tons of places where we need a request_queue while only
    having a bdev, which turns into bdev->bd_disk->queue. There are
    probably a hundred such places counting inline helpers, and enough of
    them are in hot paths.
    
    Cache queue pointer in struct block_device and make use of it in
    bdev_get_queue().
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/a3bfaecdd28956f03629d0ca5c63ebc096e1c809.1634219547.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    isilence authored and axboe committed Oct 18, 2021
    Commit 17220ca
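    A minimal sketch of the caching, using simplified stand-in structs rather than the kernel types: the queue pointer is duplicated into the block_device so a lookup costs one load instead of two dependent ones:

    ```c
    #include <assert.h>
    #include <stddef.h>

    struct request_queue { int id; };
    struct gendisk { struct request_queue *queue; };
    struct block_device {
        struct gendisk *bd_disk;
        struct request_queue *bd_queue;   /* cached copy of
                                             bd_disk->queue */
    };

    /* One pointer load, versus the two dependent loads behind
     * bdev->bd_disk->queue. */
    static struct request_queue *bdev_get_queue(struct block_device *bdev)
    {
        return bdev->bd_queue;
    }

    int main(void)
    {
        struct request_queue q = { .id = 1 };
        struct gendisk disk = { .queue = &q };
        struct block_device bdev = {
            .bd_disk  = &disk,
            .bd_queue = disk.queue,   /* filled in when bdev is set up */
        };

        /* cached path agrees with the long way around */
        assert(bdev_get_queue(&bdev) == bdev.bd_disk->queue);
        return 0;
    }
    ```

    The follow-up commits below then switch callers from bdev->bd_disk->queue to the cached accessor.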
  86. block: use bdev_get_queue() in bdev.c

    Convert bdev->bd_disk->queue to bdev_get_queue(), which uses a cached
    queue pointer and so is faster.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/a352936ce5d9ac719645b1e29b173d931ebcdc02.1634219547.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    isilence authored and axboe committed Oct 18, 2021
    Commit 025a386
  87. block: use bdev_get_queue() in bio.c

    Convert bdev->bd_disk->queue to bdev_get_queue(), which uses a cached
    queue pointer and so is faster.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/85c36ea784d285a5075baa10049e6b59e15fb484.1634219547.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    isilence authored and axboe committed Oct 18, 2021
    Commit 3caee46
  88. block: use bdev_get_queue() in blk-core.c

    Convert bdev->bd_disk->queue to bdev_get_queue(), which uses a cached
    queue pointer and so is faster.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/efc41f880262517c8dc32f932f1b23112f21b255.1634219547.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    isilence authored and axboe committed Oct 18, 2021
    Commit eab4e02
  89. block: convert the rest of block to bdev_get_queue

    Convert bdev->bd_disk->queue to bdev_get_queue(); it uses a cached
    queue pointer and so is faster.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/addf6ea988c04213697ba3684c853e4ed7642a39.1634219547.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    isilence authored and axboe committed Oct 18, 2021
    Commit ed6cdde
  90. block: don't bother iter advancing a fully done bio

    If we're completing nbytes and nbytes is the size of the bio, don't bother
    with calling into the iterator increment helpers. Just clear the bio
    size and we're done.
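    
    The early exit can be illustrated with a minimal userspace sketch; the
    struct layout and names below are illustrative stand-ins, not the real
    kernel definitions:
    
```c
#include <assert.h>

/* Illustrative stand-ins for the bio iterator; not the kernel structs. */
struct bvec_iter { unsigned int bi_size; unsigned int bi_idx; };
struct bio { struct bvec_iter bi_iter; };

/* Stand-in for the generic (more expensive) iterator advance. */
static void bio_advance(struct bio *bio, unsigned int nbytes)
{
    bio->bi_iter.bi_size -= nbytes;
    bio->bi_iter.bi_idx++; /* pretend we walked the bvec array */
}

/* The optimization: a completion covering the whole bio just zeroes
 * the remaining size instead of walking the iterator. */
static void bio_complete_bytes(struct bio *bio, unsigned int nbytes)
{
    if (nbytes == bio->bi_iter.bi_size)
        bio->bi_iter.bi_size = 0;
    else
        bio_advance(bio, nbytes);
}
```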
    
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 18, 2021
    Commit d4aa57a
  91. block: remove useless caller argument to print_req_error()

    We have exactly one caller of this, just get rid of adding the useless
    function name to the output.
    
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 18, 2021
    Commit c477b79
  92. block: move update request helpers into blk-mq.c

    For some reason we still have them in blk-core, with the rest of the
    request completion being in blk-mq. That causes an out-of-line call
    for each completion.
    
    Move them into blk-mq.c instead, where they belong.
    
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 18, 2021
    Commit 9be3e06
  93. block: improve layout of struct request

    It's been a while since this was analyzed; move some members around to
    better flow with the use case. Initial state up top, and queued state
    after that. This improves my peak case by about 1.5%, from 7750K to
    7900K IOPS.
    
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 18, 2021
    Commit b608762
  94. block: only mark bio as tracked if it really is tracked

    We set BIO_TRACKED unconditionally when rq_qos_throttle() is called, even
    though we may not even have an rq_qos handler. Only mark it as TRACKED if
    it really is potentially tracked.
    
    This saves considerable time for the case where the bio isn't tracked:
    
         2.64%     -1.65%  [kernel.vmlinux]  [k] bio_endio
    
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 18, 2021
    Commit 90b8faa
  95. block: store elevator state in request

    Add an rq private RQF_ELV flag, which tells the block layer that this
    request was initialized on a queue that has an IO scheduler attached.
    This allows for faster checking in the fast path, rather than having to
    dereference rq->q later on.
    
    Elevator switching does full quiesce of the queue before detaching an
    IO scheduler, so it's safe to cache this in the request itself.
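    
    A minimal sketch of the idea (the flag value and struct layouts here
    are hypothetical, not the kernel definitions): seed a per-request flag
    at allocation time, then test that flag in the fast path instead of
    chasing rq->q->elevator:
    
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag bit; the real RQF_ELV lives in the blk-mq headers. */
#define RQF_ELV (1u << 0)

struct elevator_queue { int dummy; };
struct request_queue { struct elevator_queue *elevator; };
struct request {
    struct request_queue *q;
    unsigned int rq_flags;
};

/* Seeded once at allocation; stable because elevator switching fully
 * quiesces the queue before detaching the scheduler. */
static void rq_init_elv_flag(struct request *rq, struct request_queue *q)
{
    rq->q = q;
    rq->rq_flags = q->elevator ? RQF_ELV : 0;
}

/* Fast path: one flag test, no pointer dereference chain. */
static bool rq_has_elevator(const struct request *rq)
{
    return rq->rq_flags & RQF_ELV;
}
```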
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 18, 2021
    Commit 2ff0682
  96. block: skip elevator fields init for non-elv queue

    Don't init rq->hash and rq->rb_node in blk_mq_rq_ctx_init() if there is
    no elevator. Also, move some other initialisers that imply barriers to
    the end, so the compiler is free to rearrange and optimise the rest of
    them.
    
    note: fold in a change from Jens leaving queue_list unconditional, as
    it might lead to problems otherwise.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    isilence authored and axboe committed Oct 18, 2021
    Commit 4f266f2
  97. block: blk_mq_rq_ctx_init cache ctx/q/hctx

    We should have enough registers in blk_mq_rq_ctx_init(); store the
    values in local vars so we don't keep reloading them.
    
    note: keeping q->elevator may look unnecessary, but it's also used
    inside inlined blk_mq_tags_from_data().
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    isilence authored and axboe committed Oct 18, 2021
    Commit 605f784
  98. block: cache rq_flags inside blk_mq_rq_ctx_init()

    Add a local variable for rq_flags, it helps to compile out some of
    rq_flags reloads.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    isilence authored and axboe committed Oct 18, 2021
    Commit 1284590
  99. block: remove debugfs blk_mq_ctx dispatched/merged/completed attributes

    These were added as part of early days debugging for blk-mq, and they
    are not really useful anymore. Rather than spend cycles updating them,
    just get rid of them.
    
    As a bonus, this shrinks the per-cpu software queue size from 256b
    to 192b. That's a whole cacheline less.
    
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 18, 2021
    Commit 9a14d6c
  100. block: remove some blk_mq_hw_ctx debugfs entries

    Just like the blk_mq_ctx counterparts, we've got a bunch of counters
    in here that are only for debugfs and are of questionable value. They
    are:
    
    - dispatched, index of how many requests were dispatched in one go
    
    - poll_{considered,invoked,success}, which track poll success rates. We're
      confident in the iopoll implementation at this point, don't bother
      tracking these.
    
    As a bonus, this shrinks each hardware queue from 576 bytes to 512 bytes,
    dropping a whole cacheline.
    
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 18, 2021
    Commit afd7de0
  101. block: provide helpers for rq_list manipulation

    Instead of open-coding the list additions, traversal, and removal,
    provide a basic set of helpers.
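    
    The shape of such helpers can be sketched as follows; this is an
    illustrative userspace version built around an intrusive single link,
    not the actual kernel rq_list API:
    
```c
#include <assert.h>
#include <stddef.h>

/* Minimal request with an intrusive single link (names illustrative). */
struct request { struct request *rq_next; int tag; };

/* Push at the head of the list. */
static void rq_list_add(struct request **listptr, struct request *rq)
{
    rq->rq_next = *listptr;
    *listptr = rq;
}

/* Pop the head of the list, or return NULL if empty. */
static struct request *rq_list_pop(struct request **listptr)
{
    struct request *rq = *listptr;
    if (rq)
        *listptr = rq->rq_next;
    return rq;
}

/* Traversal helper in the usual intrusive-list style. */
#define rq_list_for_each(listptr, pos) \
    for (pos = *(listptr); pos; pos = pos->rq_next)
```

    Callers then manipulate the list through these helpers instead of
    open-coding the pointer updates at every site.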
    
    Suggested-by: Christoph Hellwig <hch@infradead.org>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 18, 2021
    Commit 013a7f9
  102. block: add a struct io_comp_batch argument to fops->iopoll()

    struct io_comp_batch contains a list head and a completion handler, which
    will allow batches of IO to be completed more efficiently.
    
    For now, no functional changes in this patch, we just define the
    io_comp_batch structure and add the argument to the file_operations iopoll
    handler.
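    
    The basic shape can be sketched like this; field names and helpers are
    illustrative, not the exact kernel definition:
    
```c
#include <assert.h>
#include <stddef.h>

struct request { struct request *rq_next; int tag; };

/* Illustrative shape: a list of finished requests plus one callback
 * that completes them all in a single pass. */
struct io_comp_batch {
    struct request *req_list;
    void (*complete)(struct io_comp_batch *);
};

static int completed;

/* Example handler: complete (here, just count) every queued request. */
static void complete_all(struct io_comp_batch *iob)
{
    for (struct request *rq = iob->req_list; rq; rq = rq->rq_next)
        completed++;
    iob->req_list = NULL;
}

/* Driver side: instead of completing each request inline, queue it. */
static void batch_add(struct io_comp_batch *iob, struct request *rq)
{
    rq->rq_next = iob->req_list;
    iob->req_list = rq;
}

/* Caller side: one call drains everything collected so far. */
static void batch_complete(struct io_comp_batch *iob)
{
    if (iob->req_list && iob->complete)
        iob->complete(iob);
}
```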
    
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 18, 2021
    Commit 5a72e89
  103. sbitmap: add helper to clear a batch of tags

    sbitmap currently only supports clearing tags one-by-one, add a helper
    that allows the caller to pass in an array of tags to clear.
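    
    The win is amortizing per-call overhead over the batch. A toy
    word-based tag map shows the idea (this is not the real sbitmap API,
    just an illustrative sketch):
    
```c
#include <assert.h>
#include <stdbool.h>

#define TAG_WORDS 4
#define BITS_PER_WORD (8 * sizeof(unsigned long))

struct tag_map { unsigned long words[TAG_WORDS]; };

static void tag_set(struct tag_map *m, unsigned int tag)
{
    m->words[tag / BITS_PER_WORD] |= 1UL << (tag % BITS_PER_WORD);
}

static bool tag_test(const struct tag_map *m, unsigned int tag)
{
    return m->words[tag / BITS_PER_WORD] & (1UL << (tag % BITS_PER_WORD));
}

/* Clear an array of tags in one call rather than one call per tag. */
static void tag_clear_batch(struct tag_map *m, const unsigned int *tags,
                            int nr_tags)
{
    for (int i = 0; i < nr_tags; i++)
        m->words[tags[i] / BITS_PER_WORD] &=
            ~(1UL << (tags[i] % BITS_PER_WORD));
}
```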
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 18, 2021
    Commit 1aec5e4
  104. block: add support for blk_mq_end_request_batch()

    Instead of calling blk_mq_end_request() on a single request, add a helper
    that takes the new struct io_comp_batch and completes any request stored
    in there.
    
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 18, 2021
    Commit f794f33
  105. nvme: add support for batched completion of polled IO

    Take advantage of struct io_comp_batch, if passed in to the nvme poll
    handler. If it's set, rather than complete each request individually
    inline, store them in the io_comp_batch list. We only do so for requests
    that will complete successfully, anything else will be completed inline as
    before.
    
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 18, 2021
    Commit c234a65
  106. io_uring: utilize the io batching infrastructure for more efficient p…

    …olled IO
    
    Wire up using an io_comp_batch for f_op->iopoll(). If the lower stack
    supports it, we can handle high rates of polled IO more efficiently.
    
    This raises the single core efficiency on my system from ~6.1M IOPS to
    ~6.6M IOPS running a random read workload at depth 128 on two gen2
    Optane drives.
    
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 18, 2021
    Commit b688f11
  107. nvme: wire up completion batching for the IRQ path

    Trivial to do now, just need our own io_comp_batch on the stack and pass
    that in to the usual command completion handling.
    
    I pondered making this dependent on how many entries we had to process,
    but even for a single entry there's no discernable difference in
    performance or latency. Running a sync workload over io_uring:
    
    t/io_uring -b512 -d1 -s1 -c1 -p0 -F1 -B1 -n2 /dev/nvme1n1 /dev/nvme2n1
    
    yields the below performance before the patch:
    
    IOPS=254820, BW=124MiB/s, IOS/call=1/1, inflight=(1 1)
    IOPS=251174, BW=122MiB/s, IOS/call=1/1, inflight=(1 1)
    IOPS=250806, BW=122MiB/s, IOS/call=1/1, inflight=(1 1)
    
    and the following after:
    
    IOPS=255972, BW=124MiB/s, IOS/call=1/1, inflight=(1 1)
    IOPS=251920, BW=123MiB/s, IOS/call=1/1, inflight=(1 1)
    IOPS=251794, BW=122MiB/s, IOS/call=1/1, inflight=(1 1)
    
    which definitely isn't slower, about the same if you factor in a bit of
    variance. For peak performance workloads, benchmarking shows a 2%
    improvement.
    
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 18, 2021
    Commit 4f50224

Commits on Oct 19, 2021

  1. block: fix too broad elevator check in blk_mq_free_request()

    We added RQF_ELV to tell whether there's an IO scheduler attached, and
    RQF_ELVPRIV tells us whether there's an IO scheduler with private data
    attached. Don't check RQF_ELV in blk_mq_free_request(), what we care
    about here is just if we have scheduler private data attached.
    
    This fixes a boot crash
    
    Fixes: 2ff0682 ("block: store elevator state in request")
    Reported-by: Yi Zhang <yi.zhang@redhat.com>
    Reported-by: syzbot+eb8104072aeab6cc1195@syzkaller.appspotmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 19, 2021
    Commit e0d78af
  2. block: move bdev_read_only() into the header

    This is called for every write in the fast path; move it inline, next
    to get_disk_ro(), which it calls internally.
    
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 19, 2021
    Commit db9a02b
  3. block: don't call blk_status_to_errno in blk_update_request

    We only need to call it to resolve the blk_status_t -> errno mapping for
    tracing, so move the conversion into the tracepoints that are not called
    at all when tracing isn't enabled.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 19, 2021
    Commit 8a7d267
  4. block: return whether or not to unplug through boolean

    Instead of returning the same queue request through a request pointer,
    use a boolean to accomplish the same.
    
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 19, 2021
    Commit 87c037d
  5. block: get rid of plug list sorting

    Even if we have multiple queues in the plug list, the chances that they
    are very interspersed are minimal. Don't bother spending CPU cycles
    sorting the list.
    
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 19, 2021
    Commit df87eb0
  6. block: move blk_mq_tag_to_rq() inline

    This is in the fast path of driver issue or completion, and it's a single
    array index operation. Move it inline to avoid a function call for it.
    
    This does mean making struct blk_mq_tags block layer public, but there's
    not really much in there.
    
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 19, 2021
    Commit e028f16
  7. block: align blkdev_dio inlined bio to a cacheline

    We get all sorts of unreliable and funky results since the bio is
    designed to align on a cacheline, which it does not when inlined like
    this.
    
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 19, 2021
    Commit 6155631
  8. blk-wbt: prevent NULL pointer dereference in wb_timer_fn

    The timer callback used to evaluate if the latency is exceeded can be
    executed after the corresponding disk has been released, causing the
    following NULL pointer dereference:
    
    [ 119.987108] BUG: kernel NULL pointer dereference, address: 0000000000000098
    [ 119.987617] #PF: supervisor read access in kernel mode
    [ 119.987971] #PF: error_code(0x0000) - not-present page
    [ 119.988325] PGD 7c4a4067 P4D 7c4a4067 PUD 7bf63067 PMD 0
    [ 119.988697] Oops: 0000 [#1] SMP NOPTI
    [ 119.988959] CPU: 1 PID: 9353 Comm: cloud-init Not tainted 5.15-rc5+arighi #rc5+arighi
    [ 119.989520] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
    [ 119.990055] RIP: 0010:wb_timer_fn+0x44/0x3c0
    [ 119.990376] Code: 41 8b 9c 24 98 00 00 00 41 8b 94 24 b8 00 00 00 41 8b 84 24 d8 00 00 00 4d 8b 74 24 28 01 d3 01 c3 49 8b 44 24 60 48 8b 40 78 <4c> 8b b8 98 00 00 00 4d 85 f6 0f 84 c4 00 00 00 49 83 7c 24 30 00
    [ 119.991578] RSP: 0000:ffffb5f580957da8 EFLAGS: 00010246
    [ 119.991937] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
    [ 119.992412] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88f476d7f780
    [ 119.992895] RBP: ffffb5f580957dd0 R08: 0000000000000000 R09: 0000000000000000
    [ 119.993371] R10: 0000000000000004 R11: 0000000000000002 R12: ffff88f476c84500
    [ 119.993847] R13: ffff88f4434390c0 R14: 0000000000000000 R15: ffff88f4bdc98c00
    [ 119.994323] FS: 00007fb90bcd9c00(0000) GS:ffff88f4bdc80000(0000) knlGS:0000000000000000
    [ 119.994952] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 119.995380] CR2: 0000000000000098 CR3: 000000007c0d6000 CR4: 00000000000006e0
    [ 119.995906] Call Trace:
    [ 119.996130] ? blk_stat_free_callback_rcu+0x30/0x30
    [ 119.996505] blk_stat_timer_fn+0x138/0x140
    [ 119.996830] call_timer_fn+0x2b/0x100
    [ 119.997136] __run_timers.part.0+0x1d1/0x240
    [ 119.997470] ? kvm_clock_get_cycles+0x11/0x20
    [ 119.997826] ? ktime_get+0x3e/0xa0
    [ 119.998110] ? native_apic_msr_write+0x2c/0x30
    [ 119.998456] ? lapic_next_event+0x20/0x30
    [ 119.998779] ? clockevents_program_event+0x94/0xf0
    [ 119.999150] run_timer_softirq+0x2a/0x50
    [ 119.999465] __do_softirq+0xcb/0x26f
    [ 119.999764] irq_exit_rcu+0x8c/0xb0
    [ 120.000057] sysvec_apic_timer_interrupt+0x43/0x90
    [ 120.000429] ? asm_sysvec_apic_timer_interrupt+0xa/0x20
    [ 120.000836] asm_sysvec_apic_timer_interrupt+0x12/0x20
    
    In this case simply return from the timer callback (no action
    required) to prevent the NULL pointer dereference.
    
    BugLink: https://bugs.launchpad.net/bugs/1947557
    Link: https://lore.kernel.org/linux-mm/YWRNVTk9N8K0RMst@arighi-desktop/
    Fixes: 34dbad5 ("blk-stat: convert to callback-based statistics reporting")
    Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
    Link: https://lore.kernel.org/r/YW6N2qXpBU3oc50q@arighi-desktop
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Andrea Righi authored and axboe committed Oct 19, 2021
    Commit 480d42d
  9. block: change plugging to use a singly linked list

    Use a singly linked list for the blk_plug. This saves 8 bytes in the
    blk_plug struct, and makes for faster list manipulations than doubly
    linked lists. As we don't use the doubly linked lists for anything,
    singly linked is just fine.
    
    This yields a bump in default (merging enabled) performance from 7.0
    to 7.1M IOPS, and ~7.5M IOPS with merging disabled.
    
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 19, 2021
    Commit bc490f8
  10. block: attempt direct issue of plug list

    If we have just one queue type in the plug list, then we can extend our
    direct issue to cover a full plug list as well. This allows sending a
    batch of requests for direct issue, which is more efficient than doing
    one-at-a-time kind of issue.
    
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 19, 2021
    Commit dc5fc36
  11. blk-mq: don't handle non-flush requests in blk_insert_flush

    Return to the normal blk_mq_submit_bio flow if the bio did not end up
    actually being a flush because the device didn't support it.  Note that
    this is basically impossible to hit without special instrumentation given
    that submit_bio_checks already clears these flags usually, so we'd need a
    tight race to actually hit this code path.
    
    With this the call to blk_mq_run_hw_queue for the flush requests can be
    removed given that the actual flush requests are always issued via the
    requeue workqueue which runs the queue unconditionally.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20211019122553.2467817-1-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 19, 2021
    Commit d92ca9d
  12. block: inline fast path of driver tag allocation

    If we don't use an IO scheduler or have shared tags, then we don't need
    to call into this external function at all. This saves ~2% for such
    a setup.
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 19, 2021
    Commit a808a9d
  13. block, bfq: fix UAF problem in bfqg_stats_init()

    In bfq_pd_alloc(), bfqg_stats_init() initializes bfqg. If
    blkg_rwstat_init() initializes bfqg_stats->bytes successfully but the
    init of bfqg_stats->ios fails, bfqg_stats_init() returns failure and
    bfqg is freed. But blkg_rwstat->cpu_cnt is not deleted from the list
    of percpu_counters, so traversing that list hits a use-after-free.
    
    Use blkg_rwstat_exit() to clean up bfqg_stats->bytes in the above
    scenario.
    
    Fixes: commit fd41e60 ("bfq-iosched: stop using blkg->stat_bytes and ->stat_ios")
    Signed-off-by: Zheng Liang <zhengliang6@huawei.com>
    Acked-by: Tejun Heo <tj@kernel.org>
    Link: https://lore.kernel.org/r/20211018024225.1493938-1-zhengliang6@huawei.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    z00448126 authored and axboe committed Oct 19, 2021
    Commit 2fc428f

Commits on Oct 20, 2021

  1. nvme: add APIs for stopping/starting admin queue

    Add two APIs for stopping and starting the admin queue.
    
    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20211014081710.1871747-2-ming.lei@redhat.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Ming Lei authored and axboe committed Oct 20, 2021
    Commit a277654
  2. nvme: apply nvme API to quiesce/unquiesce admin queue

    Apply the two added APIs to quiesce/unquiesce the admin queue.
    
    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20211014081710.1871747-3-ming.lei@redhat.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Ming Lei authored and axboe committed Oct 20, 2021
    Commit 6ca1d90
  3. nvme: prepare for pairing quiescing and unquiescing

    Add two helpers so that we can prepare for pairing quiescing and
    unquiescing, which will be done in the next patch.
    
    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20211014081710.1871747-4-ming.lei@redhat.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Ming Lei authored and axboe committed Oct 20, 2021
    Commit ebc9b95
  4. nvme: pairing quiesce/unquiesce

    The current blk_mq_quiesce_queue() and blk_mq_unquiesce_queue() always
    stop and start the queue unconditionally, and there can be concurrent
    quiesce/unquiesce coming from different unrelated code paths, so an
    unquiesce may come unexpectedly and start the queue too early.
    
    Prepare for supporting concurrent quiesce/unquiesce from multiple
    contexts, so that we can address the above issue.
    
    NVMe has a very complicated quiesce/unquiesce use pattern; add one atomic
    bit to make sure that blk-mq quiesce/unquiesce is always called in
    pairs.
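    
    The pairing discipline can be sketched with a small userspace model
    (the struct, counters, and names are illustrative, not the nvme code):
    quiesce only acts if the bit was clear, unquiesce only if it was set,
    so the underlying blk-mq calls stay balanced even if callers are not.
    
```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct ctrl {
    atomic_bool stopped;  /* stand-in for the _Q_STOPPED bit */
    int quiesce_calls;    /* how many times the real quiesce would run */
    int unquiesce_calls;
};

static void ctrl_quiesce(struct ctrl *c)
{
    /* Only the transition clear -> set performs the real quiesce. */
    if (!atomic_exchange(&c->stopped, true))
        c->quiesce_calls++;   /* would call blk_mq_quiesce_queue() */
}

static void ctrl_unquiesce(struct ctrl *c)
{
    /* Only the transition set -> clear performs the real unquiesce. */
    if (atomic_exchange(&c->stopped, false))
        c->unquiesce_calls++; /* would call blk_mq_unquiesce_queue() */
}
```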
    
    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20211014081710.1871747-5-ming.lei@redhat.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Ming Lei authored and axboe committed Oct 20, 2021
    Commit 9e6a6b1
  5. nvme: loop: clear NVME_CTRL_ADMIN_Q_STOPPED after admin queue is real…

    …located
    
    The nvme-loop admin queue may be freed and reallocated, and we have to
    reset the NVME_CTRL_ADMIN_Q_STOPPED flag so that it matches the
    quiesce state of the admin queue.
    
    nvme-loop is the only driver that reallocates its request queue; no
    such usage is seen in the other nvme drivers.
    
    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20211014081710.1871747-6-ming.lei@redhat.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Ming Lei authored and axboe committed Oct 20, 2021
    Commit 1d35d51
  6. blk-mq: support concurrent queue quiesce/unquiesce

    blk_mq_quiesce_queue() is now used fairly widely, but so far we don't
    support concurrent/nested quiesce. The biggest issue is that an
    unquiesce can happen unexpectedly when quiesce/unquiesce run
    concurrently from more than one context.
    
    This patch introduces q->mq_quiesce_depth to deal with concurrent
    quiesce, and we only unquiesce the queue when it is the last/outermost
    one of all contexts.
    
    Several kernel panic issues have been reported[1][2][3] when running
    stress quiesce tests, and this patch has been verified against those
    reports.
    
    [1] https://lore.kernel.org/linux-block/9b21c797-e505-3821-4f5b-df7bf9380328@huawei.com/T/#m1fc52431fad7f33b1ffc3f12c4450e4238540787
    [2] https://lore.kernel.org/linux-block/9b21c797-e505-3821-4f5b-df7bf9380328@huawei.com/T/#m10ad90afeb9c8cc318334190a7c24c8b5c5e0722
    [3] https://listman.redhat.com/archives/dm-devel/2021-September/msg00189.html
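    
    The depth-counting idea can be sketched like this; locking is elided
    and the field names are illustrative, so treat it as a model of the
    approach, not the kernel implementation:
    
```c
#include <assert.h>
#include <stdbool.h>

/* Only the first quiesce and the last unquiesce touch the queue state,
 * so nested or concurrent quiesce/unquiesce pairs compose correctly. */
struct request_queue {
    int mq_quiesce_depth;
    bool quiesced;
};

static void queue_quiesce(struct request_queue *q)
{
    if (q->mq_quiesce_depth++ == 0)
        q->quiesced = true;   /* would block dispatch, wait for grace */
}

static void queue_unquiesce(struct request_queue *q)
{
    if (--q->mq_quiesce_depth == 0)
        q->quiesced = false;  /* only the outermost unquiesce restarts */
}
```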
    
    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20211014081710.1871747-7-ming.lei@redhat.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Ming Lei authored and axboe committed Oct 20, 2021
    Commit e70feb8
  7. block: turn macro helpers into inline functions

    Replace bio_set_dev() with an identical inline helper and move it
    further to fix a dependency problem with bio_associate_blkg(). Do the
    same for bio_copy_dev().
    
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    isilence authored and axboe committed Oct 20, 2021
    Commit cf6d623
  8. block: convert leftovers to bdev_get_queue

    Convert bdev->bd_disk->queue to bdev_get_queue(), which is faster.
    Apparently, there are a few such spots in block that got lost during
    rebases.
    
    Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    isilence authored and axboe committed Oct 20, 2021
    Commit 859897c
  9. block: optimise req_bio_endio()

    First, get rid of an extra branch and chain the error checks. Also
    reshuffle it with bio_advance() so it goes closer to the final check;
    with that, the compiler loads rq->rq_flags only once, and also doesn't
    reload bio->bi_iter.bi_size if bio_advance() didn't actually advance
    the iter.
    
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    isilence authored and axboe committed Oct 20, 2021
    Commit 478eb72
  10. block: don't bloat enter_queue with percpu_ref

    percpu_ref_put() is inlined for performance and bloats the binary. We
    don't care about the failure case of blk_try_enter_queue(), so we can
    replace it with a call to blk_queue_exit().
    
    Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    isilence authored and axboe committed Oct 20, 2021
    Commit 1497a51
  11. block: inline a part of bio_release_pages()

    Inline the BIO_NO_PAGE_REF check of bio_release_pages() to avoid a
    function call.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    isilence authored and axboe committed Oct 20, 2021
    Commit c809084
  12. block: remove inaccurate requeue check

    This check is meant to catch cases where a requeue is attempted on a
    request that is still inserted. It's never really been useful to catch any
    misuse, and now it's actively wrong. Outside of that, this should not be a
    BUG_ON() to begin with.
    
    Remove the check as it's now causing active harm, as requeue off the plug
    path will trigger it even though the request state is just fine.
    
    Reported-by: Yi Zhang <yi.zhang@redhat.com>
    Link: https://lore.kernel.org/linux-block/CAHj4cs80zAUc2grnCZ015-2Rvd-=gXRfB_dFKy=RTm+wRo09HQ@mail.gmail.com/
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 20, 2021
    Commit 037057a
  13. blk-mq: only flush requests from the plug in blk_mq_submit_bio

    Replace the call to blk_flush_plug_list in blk_mq_submit_bio with a
    direct call to blk_mq_flush_plug_list.  This means we do not flush
    plug callbacks from stackable devices, which doesn't really help with
    the accumulated requests anyway; it also means the cached requests
    aren't freed here, as they can still be used later on.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20211020144119.142582-2-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 20, 2021
    Commit a214b94
  14. blk-mq: move blk_mq_flush_plug_list to block/blk-mq.h

    This helper is internal to the block layer.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20211020144119.142582-3-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 20, 2021
    Commit dbb6f76
  15. block: optimise blk_flush_plug_list

    Don't call flush_plug_callbacks if there are no plug callbacks.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    [hch: split from a larger patch]
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20211020144119.142582-4-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    isilence authored and axboe committed Oct 20, 2021
    Commit b600455
  16. block: cleanup the flush plug helpers

    Consolidate the various helpers into a single blk_flush_plug helper that
    takes a blk_plug and the from_scheduler bool, and switch all callsites to
    call it directly.  Checks that the plug is non-NULL must be performed by
    the caller, something that most already do anyway.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20211020144119.142582-5-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Christoph Hellwig authored and axboe committed Oct 20, 2021
    Commit 008f75a

Commits on Oct 21, 2021

  1. blk-mq: Fix blk_mq_tagset_busy_iter() for shared tags

    Since it is now possible for a tagset to share a single set of tags, the
    iter function should not iterate the tags once per hardware queue in
    that case. Rather, it should iterate them just once.
    
    Fixes: e155b0c ("blk-mq: Use shared tags for shared sbitmap support")
    Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
    Signed-off-by: John Garry <john.garry@huawei.com>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
    Link: https://lore.kernel.org/r/1634550083-202815-1-git-send-email-john.garry@huawei.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    John Garry authored and axboe committed Oct 21, 2021
    Commit 0994c64
  2. fs: bdev: fix conflicting comment from lookup_bdev

    We switched to directly using a dev_t to get the block device, which
    changed the meaning of lookup_bdev()'s use; fix the now-conflicting
    comment.
    
    Fixes: 4e7b567 ("block: remove i_bdev")
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: linux-fsdevel@vger.kernel.org
    Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20211021071344.1600362-1-liu.yun@linux.dev
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    JackieLiu1 authored and axboe committed Oct 21, 2021
    Commit 057178c
  3. block: optimise boundary blkdev_read_iter's checks

    Combine pos and len checks and mark unlikely. Also, don't reexpand if
    it's not truncated.
    
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/fff34e613aeaae1ad12977dc4592cb1a1f5d3190.1634755800.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    isilence authored and axboe committed Oct 21, 2021
    Commit 6450fe1
  4. block: clean up blk_mq_submit_bio() merging

    Combine blk_mq_sched_bio_merge() and blk_attempt_plug_merge() under a
    common if, so we don't check it twice.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/daedc90d4029a5d1d73344771632b1faca3aaf81.1634755800.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    isilence authored and axboe committed Oct 21, 2021
    Commit 179ae84
  5. block: convert fops.c magic constants to SECTOR_SHIFT

    Don't shift by the magic number 9; replace it with the more
    descriptive SECTOR_SHIFT constant.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/068782b9f7e97569fb59a99529b23bb17ea4c5e2.1634755800.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    isilence authored and axboe committed Oct 21, 2021
    Commit 6549a87
  6. percpu_ref: percpu_ref_tryget_live() version holding RCU

    Add percpu_ref_tryget_live_rcu(), which is a version of
    percpu_ref_tryget_live() where the user is responsible for enclosing
    it in an RCU read lock section.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Acked-by: Dennis Zhou <dennis@kernel.org>
    Link: https://lore.kernel.org/r/3066500d7a6eb3e03f10adf98b87fdb3b1c49db8.1634822969.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    isilence authored and axboe committed Oct 21, 2021
    Commit 3b13c16
  7. block: kill extra rcu lock/unlock in queue enter

    blk_try_enter_queue() already takes rcu_read_lock/unlock, so we can
    avoid the second pair in percpu_ref_tryget_live(), use a newly added
    percpu_ref_tryget_live_rcu().
    
    As rcu_read_lock/unlock imply barrier()s, it's pretty noticeable,
    especially for !CONFIG_PREEMPT_RCU (the default for some
    distributions), where __rcu_read_lock/unlock() are not inlined.
    
    Before:
    3.20%  io_uring  [kernel.vmlinux]  [k] __rcu_read_unlock
    3.05%  io_uring  [kernel.vmlinux]  [k] __rcu_read_lock
    
    After:
    2.52%  io_uring  [kernel.vmlinux]  [k] __rcu_read_unlock
    2.28%  io_uring  [kernel.vmlinux]  [k] __rcu_read_lock
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/6b11c67ea495ed9d44f067622d852de4a510ce65.1634822969.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    isilence authored and axboe committed Oct 21, 2021
    Commit e94f685
  8. block: Add invalidate_disk() helper to invalidate the gendisk

    To hide internal implementation and simplify some driver code,
    this adds a helper to invalidate the gendisk. It will clean the
    gendisk's associated buffer/page caches and reset its internal
    states.
    
    Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20210922123711.187-2-xieyongji@bytedance.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    YongjiXie authored and axboe committed Oct 21, 2021
    Commit f059a1d
  9. loop: Use invalidate_disk() helper to invalidate gendisk

    Use invalidate_disk() helper to simplify the code for gendisk
    invalidation.
    
    Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20210922123711.187-3-xieyongji@bytedance.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    YongjiXie authored and axboe committed Oct 21, 2021
    Commit e515be8
  10. loop: Remove the unnecessary bdev checks and unused bdev variable

    The lo->lo_device can't be null if the lo->lo_backing_file is set.
    So let's remove the unnecessary bdev checks and the entire bdev
    variable in __loop_clr_fd() since the lo->lo_backing_file is already
    checked before.
    
    Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20210922123711.187-4-xieyongji@bytedance.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    YongjiXie authored and axboe committed Oct 21, 2021
    Commit 19f553d
  11. nbd: Use invalidate_disk() helper on disconnect

    When a nbd device encounters a writeback error, that error will
    get propagated to the bd_inode's wb_err field. Then if this nbd
    device's backend is disconnected and another is attached, we will
    get back the previous writeback error on fsync, which is unexpected.
    
    To fix it, let's use invalidate_disk() helper to invalidate the
    disk on disconnect instead of just setting disk's capacity to zero.
    
    Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20210922123711.187-5-xieyongji@bytedance.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    YongjiXie authored and axboe committed Oct 21, 2021
    Commit 435c2ac
  12. blk-crypto-fallback: properly prefix function and struct names

    For clarity, avoid using just the "blk_crypto_" prefix for functions and
    structs that are specific to blk-crypto-fallback.  Instead, use
    "blk_crypto_fallback_".  Some places already did this, but others
    didn't.
    
    This is also a prerequisite for using "struct blk_crypto_keyslot" to
    mean a generic blk-crypto keyslot (which is what it sounds like).
    Rename the fallback one to "struct blk_crypto_fallback_keyslot".
    
    No change in behavior.
    
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
    Reviewed-by: Mike Snitzer <snitzer@redhat.com>
    Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
    Signed-off-by: Eric Biggers <ebiggers@google.com>
    Link: https://lore.kernel.org/r/20211018180453.40441-2-ebiggers@kernel.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    ebiggers authored and axboe committed Oct 21, 2021
    Commit eebcafa
  13. blk-crypto: rename keyslot-manager files to blk-crypto-profile

    In preparation for renaming struct blk_keyslot_manager to struct
    blk_crypto_profile, rename the keyslot-manager.h and keyslot-manager.c
    source files.  Renaming these files separately before making a lot of
    changes to their contents makes it easier for git to understand that
    they were renamed.
    
    Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # For MMC
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Mike Snitzer <snitzer@redhat.com>
    Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
    Signed-off-by: Eric Biggers <ebiggers@google.com>
    Link: https://lore.kernel.org/r/20211018180453.40441-3-ebiggers@kernel.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    ebiggers authored and axboe committed Oct 21, 2021
    Commit 1e8d44b
  14. blk-crypto: rename blk_keyslot_manager to blk_crypto_profile

    blk_keyslot_manager is misnamed because it doesn't necessarily manage
    keyslots.  It actually does several different things:
    
      - Contains the crypto capabilities of the device.
    
      - Provides functions to control the inline encryption hardware.
        Originally these were just for programming/evicting keyslots;
        however, new functionality (hardware-wrapped keys) will require new
        functions here which are unrelated to keyslots.  Moreover,
        device-mapper devices already (ab)use "keyslot_evict" to pass key
        eviction requests to their underlying devices even though
        device-mapper devices don't have any keyslots themselves (so it
        really should be "evict_key", not "keyslot_evict").
    
      - Sometimes (but not always!) it manages keyslots.  Originally it
        always did, but device-mapper devices don't have keyslots
        themselves, so they use a "passthrough keyslot manager" which
        doesn't actually manage keyslots.  This hack works, but the
        terminology is unnatural.  Also, some hardware doesn't have keyslots
        and thus also uses a "passthrough keyslot manager" (support for such
        hardware is yet to be upstreamed, but it will happen eventually).
    
    Let's stop having keyslot managers which don't actually manage keyslots.
    Instead, rename blk_keyslot_manager to blk_crypto_profile.
    
    This is a fairly big change, since for consistency it also has to update
    keyslot manager-related function names, variable names, and comments --
    not just the actual struct name.  However it's still a fairly
    straightforward change, as it doesn't change any actual functionality.
    
    Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # For MMC
    Reviewed-by: Mike Snitzer <snitzer@redhat.com>
    Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
    Signed-off-by: Eric Biggers <ebiggers@google.com>
    Link: https://lore.kernel.org/r/20211018180453.40441-4-ebiggers@kernel.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    ebiggers authored and axboe committed Oct 21, 2021
    Commit cb77cb5
  15. blk-crypto: update inline encryption documentation

    Rework most of inline-encryption.rst to be easier to follow, to correct
    some information, to add some important details and remove some
    unimportant details, and to take into account the renaming from
    blk_keyslot_manager to blk_crypto_profile.
    
    Reviewed-by: Mike Snitzer <snitzer@redhat.com>
    Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
    Signed-off-by: Eric Biggers <ebiggers@google.com>
    Link: https://lore.kernel.org/r/20211018180453.40441-5-ebiggers@kernel.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    ebiggers authored and axboe committed Oct 21, 2021
    Commit 8e9f666

Commits on Oct 22, 2021

  1. block: fix req_bio_endio append error handling

    Shinichiro Kawasaki reports that a bug in a recent req_bio_endio()
    patch is causing problems with zonefs. As Shinichiro suggested, invert
    the condition in the zone append path to resemble how it was before:
    fail when the request is not fully completed.
    
    Fixes: 478eb72 ("block: optimise req_bio_endio()")
    Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/344ea4e334aace9148b41af5f2426da38c8aa65a.1634914228.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    isilence authored and axboe committed Oct 22, 2021
    Commit 297db73
  2. blk-mq-sched: Don't reference queue tagset in blk_mq_sched_tags_teardown()
    
    We should not reference the queue tagset in blk_mq_sched_tags_teardown()
    (see function comment) for the blk-mq flags, so use the passed flags
    instead.
    
    This solves a use-after-free, similarly fixed earlier (and since broken
    again) in commit f0c1c4d ("blk-mq: fix use-after-free in
    blk_mq_exit_sched").
    
    Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
    Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
    Tested-by: Anders Roxell <anders.roxell@linaro.org>
    Fixes: e155b0c ("blk-mq: Use shared tags for shared sbitmap support")
    Signed-off-by: John Garry <john.garry@huawei.com>
    Link: https://lore.kernel.org/r/1634890340-15432-1-git-send-email-john.garry@huawei.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    John Garry authored and axboe committed Oct 22, 2021
    Commit 8bdf7b3

Commits on Oct 23, 2021

  1. sched: make task_struct->plug always defined

    If CONFIG_BLOCK isn't set, then it's an empty struct anyway. Just make
    it generally available, so we don't break the compile:
    
    kernel/sched/core.c: In function ‘sched_submit_work’:
    kernel/sched/core.c:6346:35: error: ‘struct task_struct’ has no member named ‘plug’
     6346 |                 blk_flush_plug(tsk->plug, true);
          |                                   ^~
    kernel/sched/core.c: In function ‘io_schedule_prepare’:
    kernel/sched/core.c:8357:20: error: ‘struct task_struct’ has no member named ‘plug’
     8357 |         if (current->plug)
          |                    ^~
    kernel/sched/core.c:8358:39: error: ‘struct task_struct’ has no member named ‘plug’
     8358 |                 blk_flush_plug(current->plug, true);
          |                                       ^~
    
    Reported-by: Nathan Chancellor <nathan@kernel.org>
    Fixes: 008f75a ("block: cleanup the flush plug helpers")
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 23, 2021
    Commit 599593a

Commits on Oct 25, 2021

  1. block: add single bio async direct IO helper

    As with __blkdev_direct_IO_simple(), we can implement direct IO more
    efficiently if there is only one bio. Add __blkdev_direct_IO_async() and
    blkdev_bio_end_io_async(). This patch brings me from 4.45-4.5 MIOPS with
    nullblk to 4.7+.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/f0ae4109b7a6934adede490f84d188d53b97051b.1635006010.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    isilence authored and axboe committed Oct 25, 2021
    Commit 54a88eb
  2. block: refactor bio_iov_bvec_set()

    Combine bio_iov_bvec_set() and bio_iov_bvec_set_append() and let the
    caller do iov_iter_advance(). Also get rid of __bio_iov_bvec_set(),
    which was duplicated in the final binary, and replace a weird
    iov_iter_truncate() of a temporary iter copy with min(), better
    reflecting the intention.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/bcf1ac36fce769a514e19475f3623cd86a1d8b72.1635006010.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    isilence authored and axboe committed Oct 25, 2021
    Commit fa5fa8e
  3. blk-cgroup: synchronize blkg creation against policy deactivation

    Our test reports a null pointer dereference:
    
    [  168.534653] ==================================================================
    [  168.535614] Disabling lock debugging due to kernel taint
    [  168.536346] BUG: kernel NULL pointer dereference, address: 0000000000000008
    [  168.537274] #PF: supervisor read access in kernel mode
    [  168.537964] #PF: error_code(0x0000) - not-present page
    [  168.538667] PGD 0 P4D 0
    [  168.539025] Oops: 0000 [#1] PREEMPT SMP KASAN
    [  168.539656] CPU: 13 PID: 759 Comm: bash Tainted: G    B             5.15.0-rc2-next-202100
    [  168.540954] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_0738364
    [  168.542736] RIP: 0010:bfq_pd_init+0x88/0x1e0
    [  168.543318] Code: 98 00 00 00 e8 c9 e4 5b ff 4c 8b 65 00 49 8d 7c 24 08 e8 bb e4 5b ff 4d0
    [  168.545803] RSP: 0018:ffff88817095f9c0 EFLAGS: 00010002
    [  168.546497] RAX: 0000000000000001 RBX: ffff888101a1c000 RCX: 0000000000000000
    [  168.547438] RDX: 0000000000000003 RSI: 0000000000000002 RDI: ffff888106553428
    [  168.548402] RBP: ffff888106553400 R08: ffffffff961bcaf4 R09: 0000000000000001
    [  168.549365] R10: ffffffffa2e16c27 R11: fffffbfff45c2d84 R12: 0000000000000000
    [  168.550291] R13: ffff888101a1c098 R14: ffff88810c7a08c8 R15: ffffffffa55541a0
    [  168.551221] FS:  00007fac75227700(0000) GS:ffff88839ba80000(0000) knlGS:0000000000000000
    [  168.552278] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [  168.553040] CR2: 0000000000000008 CR3: 0000000165ce7000 CR4: 00000000000006e0
    [  168.554000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [  168.554929] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [  168.555888] Call Trace:
    [  168.556221]  <TASK>
    [  168.556510]  blkg_create+0x1c0/0x8c0
    [  168.556989]  blkg_conf_prep+0x574/0x650
    [  168.557502]  ? stack_trace_save+0x99/0xd0
    [  168.558033]  ? blkcg_conf_open_bdev+0x1b0/0x1b0
    [  168.558629]  tg_set_conf.constprop.0+0xb9/0x280
    [  168.559231]  ? kasan_set_track+0x29/0x40
    [  168.559758]  ? kasan_set_free_info+0x30/0x60
    [  168.560344]  ? tg_set_limit+0xae0/0xae0
    [  168.560853]  ? do_sys_openat2+0x33b/0x640
    [  168.561383]  ? do_sys_open+0xa2/0x100
    [  168.561877]  ? __x64_sys_open+0x4e/0x60
    [  168.562383]  ? __kasan_check_write+0x20/0x30
    [  168.562951]  ? copyin+0x48/0x70
    [  168.563390]  ? _copy_from_iter+0x234/0x9e0
    [  168.563948]  tg_set_conf_u64+0x17/0x20
    [  168.564467]  cgroup_file_write+0x1ad/0x380
    [  168.565014]  ? cgroup_file_poll+0x80/0x80
    [  168.565568]  ? __mutex_lock_slowpath+0x30/0x30
    [  168.566165]  ? pgd_free+0x100/0x160
    [  168.566649]  kernfs_fop_write_iter+0x21d/0x340
    [  168.567246]  ? cgroup_file_poll+0x80/0x80
    [  168.567796]  new_sync_write+0x29f/0x3c0
    [  168.568314]  ? new_sync_read+0x410/0x410
    [  168.568840]  ? __handle_mm_fault+0x1c97/0x2d80
    [  168.569425]  ? copy_page_range+0x2b10/0x2b10
    [  168.570007]  ? _raw_read_lock_bh+0xa0/0xa0
    [  168.570622]  vfs_write+0x46e/0x630
    [  168.571091]  ksys_write+0xcd/0x1e0
    [  168.571563]  ? __x64_sys_read+0x60/0x60
    [  168.572081]  ? __kasan_check_write+0x20/0x30
    [  168.572659]  ? do_user_addr_fault+0x446/0xff0
    [  168.573264]  __x64_sys_write+0x46/0x60
    [  168.573774]  do_syscall_64+0x35/0x80
    [  168.574264]  entry_SYSCALL_64_after_hwframe+0x44/0xae
    [  168.574960] RIP: 0033:0x7fac74915130
    [  168.575456] Code: 73 01 c3 48 8b 0d 58 ed 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 444
    [  168.577969] RSP: 002b:00007ffc3080e288 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
    [  168.578986] RAX: ffffffffffffffda RBX: 0000000000000009 RCX: 00007fac74915130
    [  168.579937] RDX: 0000000000000009 RSI: 000056007669f080 RDI: 0000000000000001
    [  168.580884] RBP: 000056007669f080 R08: 000000000000000a R09: 00007fac75227700
    [  168.581841] R10: 000056007655c8f0 R11: 0000000000000246 R12: 0000000000000009
    [  168.582796] R13: 0000000000000001 R14: 00007fac74be55e0 R15: 00007fac74be08c0
    [  168.583757]  </TASK>
    [  168.584063] Modules linked in:
    [  168.584494] CR2: 0000000000000008
    [  168.584964] ---[ end trace 2475611ad0f77a1a ]---
    
    This is because blkg_alloc() is called from blkg_conf_prep() without
    holding 'q->queue_lock', and elevator is exited before blkg_create():
    
    thread 1                            thread 2
    blkg_conf_prep
     spin_lock_irq(&q->queue_lock);
     blkg_lookup_check -> return NULL
     spin_unlock_irq(&q->queue_lock);
    
     blkg_alloc
      blkcg_policy_enabled -> true
      pd = ->pd_alloc_fn
      blkg->pd[i] = pd
                                       blk_mq_exit_sched
                                        bfq_exit_queue
                                         blkcg_deactivate_policy
                                          spin_lock_irq(&q->queue_lock);
                                          __clear_bit(pol->plid, q->blkcg_pols);
                                          spin_unlock_irq(&q->queue_lock);
                                        q->elevator = NULL;
      spin_lock_irq(&q->queue_lock);
       blkg_create
        if (blkg->pd[i])
         ->pd_init_fn -> q->elevator is NULL
      spin_unlock_irq(&q->queue_lock);
    
    Because blkcg_deactivate_policy() requires the queue to be frozen, we
    can grab q_usage_counter to synchronize blkg_conf_prep() against
    blkcg_deactivate_policy().
    
    Fixes: e21b7a0 ("block, bfq: add full hierarchical scheduling and cgroups support")
    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Acked-by: Tejun Heo <tj@kernel.org>
    Link: https://lore.kernel.org/r/20211020014036.2141723-1-yukuai3@huawei.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Yu Kuai authored and axboe committed Oct 25, 2021
    Commit 0c9d338
  4. sbitmap: silence data race warning

    KCSAN complains about the sbitmap hint update:
    
    ==================================================================
    BUG: KCSAN: data-race in sbitmap_queue_clear / sbitmap_queue_clear
    
    write to 0xffffe8ffffd145b8 of 4 bytes by interrupt on cpu 1:
     sbitmap_queue_clear+0xca/0xf0 lib/sbitmap.c:606
     blk_mq_put_tag+0x82/0x90
     __blk_mq_free_request+0x114/0x180 block/blk-mq.c:507
     blk_mq_free_request+0x2c8/0x340 block/blk-mq.c:541
     __blk_mq_end_request+0x214/0x230 block/blk-mq.c:565
     blk_mq_end_request+0x37/0x50 block/blk-mq.c:574
     lo_complete_rq+0xca/0x170 drivers/block/loop.c:541
     blk_complete_reqs block/blk-mq.c:584 [inline]
     blk_done_softirq+0x69/0x90 block/blk-mq.c:589
     __do_softirq+0x12c/0x26e kernel/softirq.c:558
     run_ksoftirqd+0x13/0x20 kernel/softirq.c:920
     smpboot_thread_fn+0x22f/0x330 kernel/smpboot.c:164
     kthread+0x262/0x280 kernel/kthread.c:319
     ret_from_fork+0x1f/0x30
    
    write to 0xffffe8ffffd145b8 of 4 bytes by interrupt on cpu 0:
     sbitmap_queue_clear+0xca/0xf0 lib/sbitmap.c:606
     blk_mq_put_tag+0x82/0x90
     __blk_mq_free_request+0x114/0x180 block/blk-mq.c:507
     blk_mq_free_request+0x2c8/0x340 block/blk-mq.c:541
     __blk_mq_end_request+0x214/0x230 block/blk-mq.c:565
     blk_mq_end_request+0x37/0x50 block/blk-mq.c:574
     lo_complete_rq+0xca/0x170 drivers/block/loop.c:541
     blk_complete_reqs block/blk-mq.c:584 [inline]
     blk_done_softirq+0x69/0x90 block/blk-mq.c:589
     __do_softirq+0x12c/0x26e kernel/softirq.c:558
     run_ksoftirqd+0x13/0x20 kernel/softirq.c:920
     smpboot_thread_fn+0x22f/0x330 kernel/smpboot.c:164
     kthread+0x262/0x280 kernel/kthread.c:319
     ret_from_fork+0x1f/0x30
    
    value changed: 0x00000035 -> 0x00000044
    
    Reported by Kernel Concurrency Sanitizer on:
    CPU: 0 PID: 10 Comm: ksoftirqd/0 Not tainted 5.15.0-rc6-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    ==================================================================
    
    which is a data race, but not an important one. This is just updating the
    percpu alloc hint, and the reader of that hint doesn't ever require it to
    be valid.
    
    Just annotate it with data_race() to silence this one.
    
    Reported-by: syzbot+4f8bfd804b4a1f95b8f6@syzkaller.appspotmail.com
    Acked-by: Marco Elver <elver@google.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    axboe committed Oct 25, 2021
    Commit 9f8b93a

Commits on Oct 26, 2021

  1. blk-mq: don't issue request directly in case that current is to be blocked
    
    When flushing the plug list in the case that current is about to be
    blocked, we can't issue requests directly because ->queue_rq() may
    sleep; otherwise the scheduler may complain.
    
    Fixes: dc5fc36 ("block: attempt direct issue of plug list")
    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Link: https://lore.kernel.org/r/20211026082257.2889890-1-ming.lei@redhat.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Ming Lei authored and axboe committed Oct 26, 2021
    Commit ff15522

Commits on Oct 27, 2021

  1. block: Add independent access ranges support

    The Concurrent Positioning Ranges VPD page (for SCSI) and data log page
    (for ATA) contain parameters describing the set of contiguous LBAs that
    can be served independently by a single LUN multi-actuator hard-disk.
    Similarly, a logically defined block device composed of multiple disks
    can in some cases execute requests directed at different sector ranges
    in parallel. A dm-linear device aggregating 2 block devices together is
    an example.
    
    This patch implements support for exposing a block device's
    independent access ranges to the user through sysfs, to allow
    optimizing device accesses to increase performance.
    
    To describe the set of independent sector ranges of a device (the
    actuators of a multi-actuator HDD or the table entries of a dm-linear
    device), the type struct blk_independent_access_ranges is introduced.
    This structure describes the sector ranges using an array of
    struct blk_independent_access_range structures. This range structure
    defines the start sector and number of sectors of the access range.
    The ranges in the array cannot overlap and must contain all sectors
    within the device capacity.
    
    The function disk_set_independent_access_ranges() allows a device
    driver to signal to the block layer that a device has multiple
    independent access ranges.  In this case, a struct
    blk_independent_access_ranges is attached to the device request queue
    by the function disk_set_independent_access_ranges(). The function
    disk_alloc_independent_access_ranges() is provided for drivers to
    allocate this structure.
    
    struct blk_independent_access_ranges contains kobjects (struct
    kobject) used to expose the set of independent access ranges supported
    by a device to the user through sysfs. When the device is initialized,
    sysfs registration of the range information is done from
    blk_register_queue() using the block layer internal function
    disk_register_independent_access_ranges(). If a driver calls
    disk_set_independent_access_ranges() for a registered queue, e.g. when
    a device is revalidated, disk_set_independent_access_ranges() will
    execute disk_register_independent_access_ranges() to update the sysfs
    attribute files. The sysfs file hierarchy created starts from the
    independent_access_ranges sub-directory and contains the start sector
    and number of sectors of each range, with the information for each
    range grouped in numbered sub-directories.
    
    E.g. for a dual-actuator HDD, the user sees:
    
    $ tree /sys/block/sdk/queue/independent_access_ranges/
    /sys/block/sdk/queue/independent_access_ranges/
    |-- 0
    |   |-- nr_sectors
    |   `-- sector
    `-- 1
        |-- nr_sectors
        `-- sector
    
    For a regular device with a single access range, the
    independent_access_ranges sysfs directory does not exist.
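    The per-range attributes can be read with a short shell loop; sdk is
    just the example device name used above:

    ```shell
    # Print "<range> <start sector> <nr_sectors>" for each independent
    # access range under the given queue directory (defaults to the sdk
    # example above).
    show_ranges() {
    	dir="${1:-/sys/block/sdk/queue/independent_access_ranges}"
    	for r in "$dir"/*/; do
    		printf '%s %s %s\n' "$(basename "$r")" \
    			"$(cat "$r/sector")" "$(cat "$r/nr_sectors")"
    	done
    }
    ```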
    
    Device revalidation may lead to changes to this structure and to the
    attribute values. When manipulated, the queue sysfs_lock and
    sysfs_dir_lock mutexes are held for atomicity, similarly to how the
    blk-mq and elevator sysfs queue sub-directories are protected.
    
    The code related to the management of independent access ranges is
    added in the new file block/blk-ia-ranges.c.
    
    Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
    Reviewed-by: Keith Busch <kbusch@kernel.org>
    Link: https://lore.kernel.org/r/20211027022223.183838-2-damien.lemoal@wdc.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    damien-lemoal authored and axboe committed Oct 27, 2021
    a2247f1
  2. scsi: sd: add concurrent positioning ranges support

    Add the sd_read_cpr() function to the sd scsi disk driver to discover
    whether a device has multiple concurrent positioning ranges (i.e.
    multiple actuators on an HDD). The existence of VPD page B9h indicates
    that a device has multiple concurrent positioning ranges; the page
    content describes each range supported by the device.
    
    sd_read_cpr() is called from sd_revalidate_disk() and uses the block
    layer functions disk_alloc_independent_access_ranges() and
    disk_set_independent_access_ranges() to represent the set of actuators
    of the device as independent access ranges.
    
    The format of the Concurrent Positioning Ranges VPD page B9h is defined
    in section 6.6.6 of SBC-5.
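    A user-space sketch of parsing such a page follows. The 64-byte page
    header, the 32-byte range descriptors, and the field offsets (start
    LBA at descriptor offset 8, LBA count at offset 16, both big-endian)
    are assumptions based on the description above; check SBC-5 before
    relying on them:

    ```c
    #include <stddef.h>
    #include <stdint.h>

    static uint64_t get_be64(const uint8_t *p)
    {
    	uint64_t v = 0;
    	for (int i = 0; i < 8; i++)
    		v = (v << 8) | p[i];
    	return v;
    }

    struct cpr_range {
    	uint64_t start_lba;
    	uint64_t num_lbas;
    };

    /*
     * Parse a Concurrent Positioning Ranges VPD page B9h buffer into
     * 'ranges' (at most max_ranges entries). Returns the number of
     * descriptors parsed, or -1 if the buffer is malformed or too short.
     * Layout assumed: 64-byte page header, then 32-byte descriptors.
     */
    static int parse_vpd_b9(const uint8_t *buf, size_t buf_len,
    			struct cpr_range *ranges, int max_ranges)
    {
    	size_t vpd_len;
    	int i, nr;

    	if (buf_len < 64 || buf[1] != 0xb9)
    		return -1;

    	/* Page length field (bytes 2-3) excludes the 4-byte VPD header. */
    	vpd_len = ((size_t)buf[2] << 8 | buf[3]) + 4;
    	if (vpd_len < 64 + 32 || vpd_len > buf_len)
    		return -1;

    	nr = (int)((vpd_len - 64) / 32);
    	if (nr > max_ranges)
    		nr = max_ranges;

    	for (i = 0; i < nr; i++) {
    		const uint8_t *desc = buf + 64 + i * 32;

    		ranges[i].start_lba = get_be64(desc + 8);
    		ranges[i].num_lbas = get_be64(desc + 16);
    	}
    	return nr;
    }
    ```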
    
    Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
    Reviewed-by: Keith Busch <kbusch@kernel.org>
    Link: https://lore.kernel.org/r/20211027022223.183838-3-damien.lemoal@wdc.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    damien-lemoal authored and axboe committed Oct 27, 2021
    e815d36
  3. libata: support concurrent positioning ranges log

    Add support to discover if an ATA device supports the Concurrent
    Positioning Ranges data log (address 0x47), indicating that the device
    is capable of seeking to multiple different locations in parallel using
    multiple actuators serving different LBA ranges.
    
    Also add support to translate the concurrent positioning ranges log
    into its equivalent Concurrent Positioning Ranges VPD page B9h in
    libata-scsi.c.
    
    The format of the Concurrent Positioning Ranges Log is defined in ACS-5
    r9.
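    The translation is largely a byte-order conversion: ATA log data is
    little-endian while SCSI VPD data is big-endian. A minimal user-space
    sketch of converting one range descriptor follows; the descriptor
    offsets (start LBA at offset 8, LBA count at offset 16 in both
    formats) are assumptions for illustration, not quotes from ACS-5 or
    SBC-5:

    ```c
    #include <stdint.h>

    /* Read a little-endian 64-bit value, as found in ATA log pages. */
    static uint64_t get_le64(const uint8_t *p)
    {
    	uint64_t v = 0;
    	for (int i = 7; i >= 0; i--)
    		v = (v << 8) | p[i];
    	return v;
    }

    /* Store a big-endian 64-bit value, as expected in SCSI VPD pages. */
    static void put_be64(uint64_t v, uint8_t *p)
    {
    	for (int i = 7; i >= 0; i--) {
    		p[i] = v & 0xff;
    		v >>= 8;
    	}
    }

    /*
     * Translate one 32-byte range descriptor from the ATA Concurrent
     * Positioning Ranges log into its VPD page B9h equivalent, assuming
     * only the endianness of the two 64-bit fields differs.
     */
    static void cpr_log_desc_to_vpd(const uint8_t *log_desc,
    				uint8_t *vpd_desc)
    {
    	put_be64(get_le64(log_desc + 8), vpd_desc + 8);	/* start LBA */
    	put_be64(get_le64(log_desc + 16), vpd_desc + 16); /* LBA count */
    }
    ```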
    
    Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
    Reviewed-by: Keith Busch <kbusch@kernel.org>
    Link: https://lore.kernel.org/r/20211027022223.183838-4-damien.lemoal@wdc.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    damien-lemoal authored and axboe committed Oct 27, 2021
    fe22e1c
  4. doc: document sysfs queue/independent_access_ranges attributes

    Update the file Documentation/block/queue-sysfs.rst to add a
    description of the device queue sysfs entries related to independent
    access ranges (e.g. concurrent positioning ranges for multi-actuator
    hard-disks).
    
    Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
    Reviewed-by: Keith Busch <kbusch@kernel.org>
    Link: https://lore.kernel.org/r/20211027022223.183838-5-damien.lemoal@wdc.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    damien-lemoal authored and axboe committed Oct 27, 2021
    6b3bae2
  5. doc: Fix typo in request queue sysfs documentation

    Fix a typo (are -> as) in the introduction paragraph of
    Documentation/block/queue-sysfs.rst.
    
    Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
    Reviewed-by: Keith Busch <kbusch@kernel.org>
    Link: https://lore.kernel.org/r/20211027022223.183838-6-damien.lemoal@wdc.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    damien-lemoal authored and axboe committed Oct 27, 2021
    9d82464