
Dev39 #112

Closed
wants to merge 2 commits into from

Conversation

tgsergeant

No description provided.

@tgsergeant tgsergeant closed this Aug 5, 2014
@tgsergeant tgsergeant deleted the dev39 branch August 5, 2014 04:51
@tgsergeant
Author

I'm really sorry, this was created by mistake. Please ignore!

cyndis pushed a commit to cyndis/linux that referenced this pull request Aug 5, 2014
Set the disconnected flag in struct usbhid when a USB device is removed, and check
the flag before sending URB requests. This prevents a kernel panic when a HID
driver calls hid_hw_request() after the USB device has been removed.

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
 IP: [<ffffffff8161746f>] hid_submit_ctrl+0x7f/0x290
 PGD 0
 Oops: 0002 [#1] PREEMPT SMP
 CPU: 2 PID: 39 Comm: khubd Tainted: G          IO  3.16.0-rc5+ torvalds#112
 Hardware name: Microsoft Corporation Surface Pro 2/Surface Pro 2, BIOS 2.03.0250 09/06/2013
 task: ffff880118aba6e0 ti: ffff8800daf80000 task.ti: ffff8800daf80000
 RIP: 0010:[<ffffffff8161746f>]  [<ffffffff8161746f>] hid_submit_ctrl+0x7f/0x290
 RSP: 0018:ffff8800daf83750  EFLAGS: 00010086
 RAX: 0000000080000300 RBX: ffff88003f60c000 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff880117f78000
 RBP: ffff8800daf83788 R08: 0000000000000001 R09: 0000000000000001
 R10: 0000000000000001 R11: 0000000000000000 R12: ffff880117f78000
 R13: ffff88003f11a290 R14: 000000000000000c R15: ffff880091cb3ab8
 FS:  0000000000000000(0000) GS:ffff88011b000000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000058 CR3: 0000000001c11000 CR4: 00000000001407e0
 Stack:
  ffff880117f3dcd0 ffff880117f78000 ffff88003f60c000 ffff880117f78000
  ffff880117f78000 ffff88003f11a290 0000000000000000 ffff8800daf837b0
  ffffffff81617707 ffff880117f78000 ffff88003f60c000 0000000000000013
 Call Trace:
  [<ffffffff81617707>] usbhid_restart_ctrl_queue+0x87/0x140
  [<ffffffff81617a88>] usbhid_submit_report+0x2c8/0x370
  [<ffffffff81617b4a>] usbhid_request+0x1a/0x30
  [<ffffffffa020edfb>] sensor_hub_set_feature+0x8b/0xd0 [hid_sensor_hub]
  [<ffffffffa02d9084>] hid_sensor_power_state+0x84/0x110 [hid_sensor_trigger]
  [<ffffffffa02d9129>] hid_sensor_data_rdy_trigger_set_state+0x19/0x20 [hid_sensor_trigger]
  [<ffffffffa034d5b7>] iio_triggered_buffer_predisable+0xa7/0xb0 [industrialio]
  [<ffffffffa034cc4a>] iio_disable_all_buffers+0x3a/0xc0 [industrialio]
  [<ffffffffa03487d3>] iio_device_unregister+0x53/0x80 [industrialio]
  [<ffffffffa026c06a>] hid_accel_3d_remove+0x2a/0x50 [hid_sensor_accel_3d]
  [<ffffffff814f433d>] platform_drv_remove+0x1d/0x40
  [<ffffffff814f18bf>] __device_release_driver+0x7f/0xf0
  [<ffffffff814f1955>] device_release_driver+0x25/0x40
  [<ffffffff814f121c>] bus_remove_device+0x11c/0x1a0
  [<ffffffff814ed7d6>] device_del+0x136/0x1e0
  [<ffffffff81512190>] ? mfd_cell_disable+0x80/0x80
  [<ffffffff814f41d1>] platform_device_del+0x21/0xc0
  [<ffffffff814f4282>] platform_device_unregister+0x12/0x30
  [<ffffffff815121d3>] mfd_remove_devices_fn+0x43/0x50
  [<ffffffff814ed3e3>] device_for_each_child+0x43/0x70
  [<ffffffff81512105>] mfd_remove_devices+0x25/0x30
  [<ffffffffa020ebd7>] sensor_hub_remove+0x87/0x140 [hid_sensor_hub]
  [<ffffffff81607c5b>] hid_device_remove+0x6b/0xd0
  [<ffffffff814f18bf>] __device_release_driver+0x7f/0xf0
  [<ffffffff814f1955>] device_release_driver+0x25/0x40
  [<ffffffff814f121c>] bus_remove_device+0x11c/0x1a0
  [<ffffffff814ed7d6>] device_del+0x136/0x1e0
  [<ffffffff81607d47>] hid_destroy_device+0x27/0x60
  [<ffffffff81616972>] usbhid_disconnect+0x22/0x50
  [<ffffffff81568597>] usb_unbind_interface+0x77/0x2b0
  [<ffffffff814f18bf>] __device_release_driver+0x7f/0xf0
  [<ffffffff814f1955>] device_release_driver+0x25/0x40
  [<ffffffff814f121c>] bus_remove_device+0x11c/0x1a0
  [<ffffffff814ed7d6>] device_del+0x136/0x1e0
  [<ffffffff81565cd1>] usb_disable_device+0x91/0x2a0
  [<ffffffff8155b046>] usb_disconnect+0x96/0x2e0
  [<ffffffff8155d74a>] hub_thread+0xb5a/0x1840

Signed-off-by: Reyad Attiyat <reyad.attiyat@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
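
As a rough illustration of the guard described in that commit message, here is a minimal kernel-style sketch; the struct, lock, and function names are invented for the example, and only usb_submit_urb() is the real API:

  /* Illustrative only: set a flag on disconnect and check it before submitting. */
  struct example_usbhid {
          spinlock_t lock;
          bool disconnected;
  };

  static void example_disconnect(struct example_usbhid *hid)
  {
          spin_lock_irq(&hid->lock);
          hid->disconnected = true;               /* device is gone, stop new I/O */
          spin_unlock_irq(&hid->lock);
  }

  static int example_submit_request(struct example_usbhid *hid, struct urb *urb)
  {
          if (hid->disconnected)                  /* bail out instead of touching freed state */
                  return -ENODEV;
          return usb_submit_urb(urb, GFP_ATOMIC); /* real USB submission API */
  }
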
swarren pushed a commit to swarren/linux-tegra that referenced this pull request Oct 16, 2014
WARNING: 'retuned' may be misspelled - perhaps 'returned'?
torvalds#112: FILE: fs/ocfs2/file.c:2329:
+		 * If generic_file_buffered_write() retuned a synchronous error

total: 0 errors, 1 warnings, 121 lines checked

./patches/ocfs2-do-not-fallback-to-buffer-i-o-write-if-fill-holes.patch has style problems, please review.

If any of these errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Weiwei Wang <wangww631@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
aryabinin pushed a commit to aryabinin/linux that referenced this pull request Oct 27, 2014
WARNING: 'retuned' may be misspelled - perhaps 'returned'?
torvalds#112: FILE: fs/ocfs2/file.c:2329:
+		 * If generic_file_buffered_write() retuned a synchronous error

total: 0 errors, 1 warnings, 121 lines checked

./patches/ocfs2-do-not-fallback-to-buffer-i-o-write-if-fill-holes.patch has style problems, please review.

If any of these errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Weiwei Wang <wangww631@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
tobetter pushed a commit to tobetter/linux that referenced this pull request Sep 22, 2015
Remove excessive kernel log of amvideocap driver
0day-ci pushed a commit to 0day-ci/linux that referenced this pull request Jun 4, 2016
Commit e8d975e ("fixing infinite OPEN loop in 4.0 stateid recovery")
introduced access to state after it was just potentially freed by
nfs4_put_open_state leading to a random data corruption somewhere.

BUG: unable to handle kernel paging request at ffff88004941ee40
IP: [<ffffffff813baf01>] nfs4_do_reclaim+0x461/0x740
PGD 3501067 PUD 3504067 PMD 6ff37067 PTE 800000004941e060
Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
Modules linked in: loop rpcsec_gss_krb5 acpi_cpufreq tpm_tis joydev i2c_piix4 pcspkr tpm virtio_console nfsd ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops floppy serio_raw virtio_blk drm
CPU: 6 PID: 2161 Comm: 192.168.10.253- Not tainted 4.7.0-rc1-vm-nfs+ torvalds#112
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff8800463dcd00 ti: ffff88003ff48000 task.ti: ffff88003ff48000
RIP: 0010:[<ffffffff813baf01>]  [<ffffffff813baf01>] nfs4_do_reclaim+0x461/0x740
RSP: 0018:ffff88003ff4bd68  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffffff81a49900 RCX: 00000000000000e8
RDX: 00000000000000e8 RSI: ffff8800418b9930 RDI: ffff880040c96c88
RBP: ffff88003ff4bdf8 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff880040c96c98
R13: ffff88004941ee20 R14: ffff88004941ee40 R15: ffff88004941ee00
FS:  0000000000000000(0000) GS:ffff88006d000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff88004941ee40 CR3: 0000000060b0b000 CR4: 00000000000006e0
Stack:
 ffffffff813baad5 ffff8800463dcd00 ffff880000000001 ffffffff810e6b68
 ffff880043ddbc88 ffff8800418b9800 ffff8800418b98c8 ffff88004941ee48
 ffff880040c96c90 ffff880040c96c00 ffff880040c96c20 ffff880040c96c40
Call Trace:
 [<ffffffff813baad5>] ? nfs4_do_reclaim+0x35/0x740
 [<ffffffff810e6b68>] ? trace_hardirqs_on_caller+0x128/0x1b0
 [<ffffffff813bb7cd>] nfs4_run_state_manager+0x5ed/0xa40
 [<ffffffff813bb1e0>] ? nfs4_do_reclaim+0x740/0x740
 [<ffffffff813bb1e0>] ? nfs4_do_reclaim+0x740/0x740
 [<ffffffff810af0d1>] kthread+0x101/0x120
 [<ffffffff810e6b68>] ? trace_hardirqs_on_caller+0x128/0x1b0
 [<ffffffff818843af>] ret_from_fork+0x1f/0x40
 [<ffffffff810aefd0>] ? kthread_create_on_node+0x250/0x250
Code: 65 80 4c 8b b5 78 ff ff ff e8 fc 88 4c 00 48 8b 7d 88 e8 13 67 d2 ff 49 8b 47 40 a8 02 0f 84 d3 01 00 00 4c 89 ff e8 7f f9 ff ff <f0> 41 80 26 7f 48 8b 7d c8 e8 b1 84 4c 00 e9 39 fd ff ff 3d e6
RIP  [<ffffffff813baf01>] nfs4_do_reclaim+0x461/0x740
 RSP <ffff88003ff4bd68>
CR2: ffff88004941ee40

Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
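
For illustration only, a minimal sketch of the general pattern that avoids this class of use-after-free: read whatever is still needed from the state before the put that may free it. The types and helpers below are invented to show the ordering and are not the actual NFSv4 recovery fix:

  /* Illustrative pattern only, not the actual NFSv4 recovery code. */
  static void example_reclaim_one(struct example_state *state)
  {
          struct example_owner *owner = state->owner;   /* copy what we still need */

          example_put_state(state);     /* may drop the last reference and free it */
          /* 'state' must not be dereferenced past this point; continue with the copy */
          example_continue_reclaim(owner);
  }
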
0day-ci pushed a commit to 0day-ci/linux that referenced this pull request Jun 15, 2016
Commit e8d975e ("fixing infinite OPEN loop in 4.0 stateid recovery")
introduced access to state after it was just potentially freed by
nfs4_put_open_state leading to a random data corruption somewhere.

BUG: unable to handle kernel paging request at ffff88004941ee40
IP: [<ffffffff813baf01>] nfs4_do_reclaim+0x461/0x740
PGD 3501067 PUD 3504067 PMD 6ff37067 PTE 800000004941e060
Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
Modules linked in: loop rpcsec_gss_krb5 acpi_cpufreq tpm_tis joydev i2c_piix4 pcspkr tpm virtio_console nfsd ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops floppy serio_raw virtio_blk drm
CPU: 6 PID: 2161 Comm: 192.168.10.253- Not tainted 4.7.0-rc1-vm-nfs+ torvalds#112
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff8800463dcd00 ti: ffff88003ff48000 task.ti: ffff88003ff48000
RIP: 0010:[<ffffffff813baf01>]  [<ffffffff813baf01>] nfs4_do_reclaim+0x461/0x740
RSP: 0018:ffff88003ff4bd68  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffffff81a49900 RCX: 00000000000000e8
RDX: 00000000000000e8 RSI: ffff8800418b9930 RDI: ffff880040c96c88
RBP: ffff88003ff4bdf8 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff880040c96c98
R13: ffff88004941ee20 R14: ffff88004941ee40 R15: ffff88004941ee00
FS:  0000000000000000(0000) GS:ffff88006d000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff88004941ee40 CR3: 0000000060b0b000 CR4: 00000000000006e0
Stack:
 ffffffff813baad5 ffff8800463dcd00 ffff880000000001 ffffffff810e6b68
 ffff880043ddbc88 ffff8800418b9800 ffff8800418b98c8 ffff88004941ee48
 ffff880040c96c90 ffff880040c96c00 ffff880040c96c20 ffff880040c96c40
Call Trace:
 [<ffffffff813baad5>] ? nfs4_do_reclaim+0x35/0x740
 [<ffffffff810e6b68>] ? trace_hardirqs_on_caller+0x128/0x1b0
 [<ffffffff813bb7cd>] nfs4_run_state_manager+0x5ed/0xa40
 [<ffffffff813bb1e0>] ? nfs4_do_reclaim+0x740/0x740
 [<ffffffff813bb1e0>] ? nfs4_do_reclaim+0x740/0x740
 [<ffffffff810af0d1>] kthread+0x101/0x120
 [<ffffffff810e6b68>] ? trace_hardirqs_on_caller+0x128/0x1b0
 [<ffffffff818843af>] ret_from_fork+0x1f/0x40
 [<ffffffff810aefd0>] ? kthread_create_on_node+0x250/0x250
Code: 65 80 4c 8b b5 78 ff ff ff e8 fc 88 4c 00 48 8b 7d 88 e8 13 67 d2 ff 49 8b 47 40 a8 02 0f 84 d3 01 00 00 4c 89 ff e8 7f f9 ff ff <f0> 41 80 26 7f 48 8b 7d c8 e8 b1 84 4c 00 e9 39 fd ff ff 3d e6
RIP  [<ffffffff813baf01>] nfs4_do_reclaim+0x461/0x740
 RSP <ffff88003ff4bd68>
CR2: ffff88004941ee40

Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
torvalds pushed a commit that referenced this pull request Jun 29, 2016
Commit e8d975e ("fixing infinite OPEN loop in 4.0 stateid recovery")
introduced access to state after it was just potentially freed by
nfs4_put_open_state leading to a random data corruption somewhere.

BUG: unable to handle kernel paging request at ffff88004941ee40
IP: [<ffffffff813baf01>] nfs4_do_reclaim+0x461/0x740
PGD 3501067 PUD 3504067 PMD 6ff37067 PTE 800000004941e060
Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
Modules linked in: loop rpcsec_gss_krb5 acpi_cpufreq tpm_tis joydev i2c_piix4 pcspkr tpm virtio_console nfsd ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops floppy serio_raw virtio_blk drm
CPU: 6 PID: 2161 Comm: 192.168.10.253- Not tainted 4.7.0-rc1-vm-nfs+ #112
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff8800463dcd00 ti: ffff88003ff48000 task.ti: ffff88003ff48000
RIP: 0010:[<ffffffff813baf01>]  [<ffffffff813baf01>] nfs4_do_reclaim+0x461/0x740
RSP: 0018:ffff88003ff4bd68  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffffff81a49900 RCX: 00000000000000e8
RDX: 00000000000000e8 RSI: ffff8800418b9930 RDI: ffff880040c96c88
RBP: ffff88003ff4bdf8 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff880040c96c98
R13: ffff88004941ee20 R14: ffff88004941ee40 R15: ffff88004941ee00
FS:  0000000000000000(0000) GS:ffff88006d000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff88004941ee40 CR3: 0000000060b0b000 CR4: 00000000000006e0
Stack:
 ffffffff813baad5 ffff8800463dcd00 ffff880000000001 ffffffff810e6b68
 ffff880043ddbc88 ffff8800418b9800 ffff8800418b98c8 ffff88004941ee48
 ffff880040c96c90 ffff880040c96c00 ffff880040c96c20 ffff880040c96c40
Call Trace:
 [<ffffffff813baad5>] ? nfs4_do_reclaim+0x35/0x740
 [<ffffffff810e6b68>] ? trace_hardirqs_on_caller+0x128/0x1b0
 [<ffffffff813bb7cd>] nfs4_run_state_manager+0x5ed/0xa40
 [<ffffffff813bb1e0>] ? nfs4_do_reclaim+0x740/0x740
 [<ffffffff813bb1e0>] ? nfs4_do_reclaim+0x740/0x740
 [<ffffffff810af0d1>] kthread+0x101/0x120
 [<ffffffff810e6b68>] ? trace_hardirqs_on_caller+0x128/0x1b0
 [<ffffffff818843af>] ret_from_fork+0x1f/0x40
 [<ffffffff810aefd0>] ? kthread_create_on_node+0x250/0x250
Code: 65 80 4c 8b b5 78 ff ff ff e8 fc 88 4c 00 48 8b 7d 88 e8 13 67 d2 ff 49 8b 47 40 a8 02 0f 84 d3 01 00 00 4c 89 ff e8 7f f9 ff ff <f0> 41 80 26 7f 48 8b 7d c8 e8 b1 84 4c 00 e9 39 fd ff ff 3d e6
RIP  [<ffffffff813baf01>] nfs4_do_reclaim+0x461/0x740
 RSP <ffff88003ff4bd68>
CR2: ffff88004941ee40

Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
CkNoSFeRaTU pushed a commit to CkNoSFeRaTU/linux that referenced this pull request Aug 25, 2016
laijs pushed a commit to laijs/linux that referenced this pull request Feb 13, 2017
Revert "lkl: Fix missing screen_info during compilation."
fengguang pushed a commit to 0day-ci/linux that referenced this pull request May 11, 2017
Shubham was recently asking on netdev why in arm64 JIT we don't multiply
the index for accessing the tail call map by 8. That led me into testing
out arm64 JIT wrt tail calls and it turned out I got a NULL pointer
dereference on the tail call.

The buggy access is at:

  prog = array->ptrs[index];
  if (prog == NULL)
      goto out;

  [...]
  00000060:  d2800e0a  mov x10, #0x70 // torvalds#112
  00000064:  f86a682a  ldr x10, [x1,x10]
  00000068:  f862694b  ldr x11, [x10,x2]
  0000006c:  b40000ab  cbz x11, 0x00000080
  [...]

The code triggering the crash is f862694b. x1 at the time contains the
address of the bpf array, x10 offsetof(struct bpf_array, ptrs). Meaning,
above we load the pointer to the program at map slot 0 into x10. x10
can then be NULL if the slot is not occupied, which we later on try to
access with a user given offset in x2 that is the map index.

Fix this by emitting the following instead:

  [...]
  00000060:  d2800e0a  mov x10, #0x70 // torvalds#112
  00000064:  8b0a002a  add x10, x1, x10
  00000068:  d37df04b  lsl x11, x2, #3
  0000006c:  f86b694b  ldr x11, [x10,x11]
  00000070:  b40000ab  cbz x11, 0x00000084
  [...]

This basically adds the offset of ptrs to the base address of the bpf
array we got, and we later access the map at an index * 8 offset
relative to that. The tail call map itself is basically one large area
with metadata at the head followed by the array of prog pointers.
This makes tail calls work again; tested on Cavium ThunderX ARMv8.

Fixes: ddb5599 ("arm64: bpf: implement bpf_tail_call() helper")
Reported-by: Shubham Bansal <illusionist.neo@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
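
In C terms, the fixed instruction sequence above computes roughly the following lookup; this is an illustrative sketch rather than the JIT source, and the helper name is invented:

  /* Rough C equivalent of the fixed sequence (illustrative, not the JIT source). */
  static struct bpf_prog *example_tail_call_lookup(struct bpf_array *array, u64 index)
  {
          void *base = (void *)array + offsetof(struct bpf_array, ptrs); /* add x10, x1, x10 */

          /* index the prog pointer array with an 8-byte stride: lsl x11, x2, #3 + ldr */
          return *(struct bpf_prog **)(base + (index << 3));
          /* the caller branches out if the returned pointer is NULL (cbz x11, ...) */
  }
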
fengguang pushed a commit to 0day-ci/linux that referenced this pull request May 11, 2017
Shubham was recently asking on netdev why in arm64 JIT we don't multiply
the index for accessing the tail call map by 8. That led me into testing
out arm64 JIT wrt tail calls and it turned out I got a NULL pointer
dereference on the tail call.

The buggy access is at:

  prog = array->ptrs[index];
  if (prog == NULL)
      goto out;

  [...]
  00000060:  d2800e0a  mov x10, #0x70 // torvalds#112
  00000064:  f86a682a  ldr x10, [x1,x10]
  00000068:  f862694b  ldr x11, [x10,x2]
  0000006c:  b40000ab  cbz x11, 0x00000080
  [...]

The code triggering the crash is f862694b. x1 at the time contains the
address of the bpf array, x10 offsetof(struct bpf_array, ptrs). Meaning,
above we load the pointer to the program at map slot 0 into x10. x10
can then be NULL if the slot is not occupied, which we later on try to
access with a user given offset in x2 that is the map index.

Fix this by emitting the following instead:

  [...]
  00000060:  d2800e0a  mov x10, #0x70 // torvalds#112
  00000064:  8b0a002a  add x10, x1, x10
  00000068:  d37df04b  lsl x11, x2, #3
  0000006c:  f86b694b  ldr x11, [x10,x11]
  00000070:  b40000ab  cbz x11, 0x00000084
  [...]

This basically adds the offset of ptrs to the base address of the bpf
array we got, and we later access the map at an index * 8 offset
relative to that. The tail call map itself is basically one large area
with metadata at the head followed by the array of prog pointers.
This makes tail calls work again; tested on Cavium ThunderX ARMv8.

Fixes: ddb5599 ("arm64: bpf: implement bpf_tail_call() helper")
Reported-by: Shubham Bansal <illusionist.neo@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
fengguang pushed a commit to 0day-ci/linux that referenced this pull request May 21, 2017
GIT dac94e29110cd606dec37673644caf2cf6fd1dde

commit 7e1b9521f5a8356553f5e58b07952bf346632ea4
Author: Colin Ian King <colin.king@canonical.com>
Date:   Sat Mar 11 19:09:45 2017 +0000

    dm cache: handle kmalloc failure allocating background_tracker struct
    
    Currently there is no kmalloc failure check on the allocation of
    the background_tracker struct in btracker_create(), and so a NULL return
    will lead to a NULL pointer dereference.  Add a NULL check.
    
    Detected by CoverityScan, CID#1416587 ("Dereference null return value")
    
    Fixes: b29d4986d ("dm cache: significant rework to leverage dm-bio-prison-v2")
    Signed-off-by: Colin Ian King <colin.king@canonical.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
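
A minimal sketch of the pattern the commit describes, with invented names (only kmalloc()/GFP_KERNEL are the real API): check the allocation before touching it and let the caller handle NULL.

  struct example_tracker {
          unsigned int max_work;
          /* ... */
  };

  static struct example_tracker *example_tracker_create(unsigned int max_work)
  {
          struct example_tracker *b = kmalloc(sizeof(*b), GFP_KERNEL);

          if (!b)                 /* allocation failed */
                  return NULL;    /* caller must handle the NULL return */

          b->max_work = max_work;
          return b;
  }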

commit 83345d51a49a4b3f3b4a08a5db644dae438b0189
Author: Tin Huynh <tnhuynh@apm.com>
Date:   Wed May 17 11:25:34 2017 +0700

    i2c: xgene: Set ACPI_COMPANION_I2C
    
    With ACPI, i2c-core requires ACPI companion to be set in order for it
    to create slave device.
    This patch sets the ACPI companion accordingly.
    
    Signed-off-by: Tin Huynh <tnhuynh@apm.com>
    Signed-off-by: Wolfram Sang <wsa@the-dreams.de>

commit 88ad60c23a394b2f8bf1e570c756f415435d1d35
Author: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Date:   Tue May 16 14:07:24 2017 +0200

    i2c: mv64xxx: don't override deferred probing when getting irq
    
    There is no reason to use platform_get_irq() for non-DT probing and
    irq_of_parse_and_map() for DT probing. Indeed, platform_get_irq()
    works fine for both.
    
    In addition, using platform_get_irq() properly returns -EPROBE_DEFER
    when the interrupt controller is not yet available, so instead of
    inventing our own error code (-ENXIO), return the one provided by
    platform_get_irq().
    
    Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
    Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
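
An illustrative probe sketch of the idea above (function name invented): return whatever platform_get_irq() reports, including -EPROBE_DEFER, instead of substituting -ENXIO.

  static int example_probe(struct platform_device *pdev)
  {
          int irq = platform_get_irq(pdev, 0);

          if (irq < 0)
                  return irq;     /* may be -EPROBE_DEFER; the driver core will retry */

          /* ... request the interrupt and finish probing ... */
          return 0;
  }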

commit 13840d38016203f0095cd547b90352812d24b787
Author: Mikulas Patocka <mpatocka@redhat.com>
Date:   Sun Apr 30 17:32:28 2017 -0400

    dm bufio: make the parameter "retain_bytes" unsigned long
    
    Change the type of the parameter "retain_bytes" from unsigned to
    unsigned long, so that on 64-bit machines the user can set more than
    4GiB of data to be retained.
    
    Also, change the type of the variable "count" in the function
    "__evict_old_buffers" to unsigned long.  The assignment
    "count = c->n_buffers[LIST_CLEAN] + c->n_buffers[LIST_DIRTY];"
    could result in unsigned long to unsigned overflow and that could result
    in buffers not being freed when they should.
    
    While at it, avoid division in get_retain_buffers().  Division is slow,
    we can change it to shift because we have precalculated the log2 of
    block size.
    
    Cc: stable@vger.kernel.org
    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
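
A tiny illustrative helper showing the division-to-shift idea, assuming log2(block size) has been precalculated; the names are invented and this is not the dm-bufio code.

  static unsigned long example_get_retain_buffers(unsigned long retain_bytes,
                                                  unsigned char block_size_bits)
  {
          return retain_bytes >> block_size_bits;   /* == retain_bytes / block_size */
  }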

commit 6f61dd3aa35179043f1fcdb0965c5d56278ab724
Author: Kees Cook <keescook@chromium.org>
Date:   Fri May 12 14:52:34 2017 -0700

    efi-pstore: Fix read iter after pstore API refactor
    
    During the internal pstore API refactoring, the EFI vars read entry was
    accidentally made to update a stack variable instead of the pstore
    private data pointer. This corrects the problem (and removes the now
    needless argument).
    
    Fixes: 125cc42baf8a ("pstore: Replace arguments for read() API")
    Signed-off-by: Kees Cook <keescook@chromium.org>

commit 8b671f906c2debc4f2393438c4e7668936522e99
Author: Shannon Nelson <shannon.nelson@oracle.com>
Date:   Mon May 15 10:51:08 2017 -0700

    ldmvsw: stop the clean timer at beginning of remove
    
    Stop the clean timer earlier to be sure there's no asynchronous
    interference while stopping the port.
    
    Orabug: 25748241
    
    Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit b18e5e86b44be0dca399d8e2383f97c8077392ce
Author: Thomas Tai <thomas.tai@oracle.com>
Date:   Mon May 15 10:51:07 2017 -0700

    ldmvsw: unregistering netdev before disable hardware
    
    When running LDom binding/unbinding test, kernel may panic
    in ldmvsw_open(). It is more likely that because we're removing
    the ldc connection before unregistering the netdev in vsw_port_remove(),
    we set up a window of time where one process could be removing the
    device while another trying to UP the device. This also sometimes causes
    vio handshake error due to opening a device without closing it completely.
    We should unregister the netdev before we disable the "hardware".
    
    Orabug: 25980913, 25925306
    
    Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
    Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit ca9df7ede41afd006d74fd6f09f36d909d0eaad7
Author: Miroslav Lichvar <mlichvar@redhat.com>
Date:   Mon May 15 16:04:36 2017 +0200

    net: netcp: fix check of requested timestamping filter
    
    The driver doesn't support timestamping of all received packets and
    should return error when trying to enable the HWTSTAMP_FILTER_ALL
    filter.
    
    Cc: WingMan Kwok <w-kwok2@ti.com>
    Cc: Richard Cochran <richardcochran@gmail.com>
    Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
    Acked-by: Richard Cochran <richardcochran@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit f98e0eb68008aff9824d1c4dad7276c8bab83ca5
Author: Christoph Hellwig <hch@lst.de>
Date:   Mon May 15 17:28:38 2017 +0200

    dm mpath: multipath_clone_and_map must not return -EIO
    
    Since 412445ac ("dm: introduce a new DM_MAPIO_KILL return value"), the
    clone_and_map_rq methods must not return errno values, so fix it up
    to properly return DM_MAPIO_KILL, instead of the -EIO value that snuck
    in due to a conflict between two patches.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>

commit 18a482f5245cc875755090853e84283512b3e6bd
Author: Christoph Hellwig <hch@lst.de>
Date:   Mon May 15 17:28:37 2017 +0200

    dm mpath: don't return -EIO from dm_report_EIO
    
    Instead just turn the macro into a helper for the warning message.
    This removes an unnecessary assignment and will allow the next commit to
    fix a place where -EIO is the wrong return value.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>

commit ece0728037b15f4d31198f12b359104bcb5db4c8
Author: Christoph Hellwig <hch@lst.de>
Date:   Mon May 15 17:28:36 2017 +0200

    dm rq: add a missing break to map_request
    
    We don't want to bug when receiving a DM_MAPIO_KILL value..
    
    Fixes: 412445ac ("dm: introduce a new DM_MAPIO_KILL return value")
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>

commit 0377a07c7a035e0d033cd8b29f0cb15244c0916a
Author: Joe Thornber <ejt@redhat.com>
Date:   Mon May 15 09:45:40 2017 -0400

    dm space map disk: fix some book keeping in the disk space map
    
    When decrementing the reference count for a block, the free count wasn't
    being updated if the reference count went to zero.
    
    Cc: stable@vger.kernel.org
    Signed-off-by: Joe Thornber <ejt@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>

commit 91bcdb92d39711d1adb40c26b653b7978d93eb98
Author: Joe Thornber <ejt@redhat.com>
Date:   Mon May 15 09:43:05 2017 -0400

    dm thin metadata: call precommit before saving the roots
    
    These calls were the wrong way round in __write_initial_superblock.
    
    Cc: stable@vger.kernel.org
    Signed-off-by: Joe Thornber <ejt@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>

commit 66eb9f86e50547ec2a8ff7a75997066a74ef584b
Author: Mahesh Bandewar <maheshb@google.com>
Date:   Fri May 12 17:03:39 2017 -0700

    ipv6: avoid dad-failures for addresses with NODAD
    
    Every address gets added with TENTATIVE flag even for the addresses with
    IFA_F_NODAD flag and dad-work is scheduled for them. During this DAD process
    we realize it's an address with NODAD and complete the process without
    sending any probe. However, the TENTATIVE flag stays on the
    address long enough to cause misinterpretation when we receive an NS.
    While processing the NS, if the address has the TENTATIVE flag, we mark it
    DADFAILED and end up with an address that was originally configured as NODAD
    but now has DADFAILED.
    
    We can't avoid scheduling dad_work for addresses with NODAD but we can
    avoid adding TENTATIVE flag to avoid this racy situation.
    
    Signed-off-by: Mahesh Bandewar <maheshb@google.com>
    Acked-by: David Ahern <dsahern@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
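
A small illustrative sketch of the flag handling described above; the helper is hypothetical, while IFA_F_NODAD and IFA_F_TENTATIVE are the real flags. Only addresses that will actually perform DAD get marked TENTATIVE.

  static u32 example_initial_ifa_flags(u32 cfg_flags)
  {
          u32 flags = cfg_flags;

          if (!(cfg_flags & IFA_F_NODAD))
                  flags |= IFA_F_TENTATIVE;

          return flags;
  }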

commit aa4ad88cfcd4ee45f527fb982140576711e3b501
Author: Mintz, Yuval <Yuval.Mintz@cavium.com>
Date:   Sun May 14 12:21:23 2017 +0300

    qed: Fix uninitialized data in aRFS infrastructure
    
    The current memset is using an incorrect type of variable, causing the
    upper half of the structure to be left uninitialized and causing:
    
      ethernet/qlogic/qed/qed_init_fw_funcs.c: In function 'qed_set_rfs_mode_disable':
      ethernet/qlogic/qed/qed_init_fw_funcs.c:993:3: error: '*((void *)&ramline+4)' is used uninitialized in this function [-Werror=uninitialized]
    
    Fixes: d51e4af5c209 ("qed: aRFS infrastructure support")
    Reported-by: Arnd Bergmann <arnd@arndb.de>
    Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
    Reviewed-by: Arnd Bergmann <arnd@arndb.de>
    Signed-off-by: David S. Miller <davem@davemloft.net>
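
An illustrative reduction of the bug class described above (struct and function names invented): a memset() sized by the wrong type clears only part of the structure.

  struct example_ramline {
          u32 lo;
          u32 hi;
  };

  static void example_clear(struct example_ramline *ramline)
  {
          /* buggy form: memset(ramline, 0, sizeof(u32)) clears only 'lo' */
          memset(ramline, 0, sizeof(*ramline));   /* clears the whole structure */
  }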

commit 8c977f5a856a7276450ddf3a7b57b4a8815b63f9
Author: Julia Lawall <julia.lawall@lip6.fr>
Date:   Fri May 12 22:54:23 2017 +0800

    mdio: mux: fix device_node_continue.cocci warnings
    
    Device node iterators put the previous value of the index variable, so an
    explicit put causes a double put.
    
    In particular, of_mdiobus_register can fail before doing anything
    interesting, so one could view it as a no-op from the reference count
    point of view.
    
    Generated by: scripts/coccinelle/iterators/device_node_continue.cocci
    
    CC: Jon Mason <jon.mason@broadcom.com>
    Signed-off-by: Julia Lawall <julia.lawall@lip6.fr>
    Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
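
An illustrative sketch of the iterator rule behind the warning (helper names invented): the child-node iterator drops the reference to the previous node on each step, so an explicit of_node_put() before a continue double-puts.

  static void example_scan_children(struct device_node *parent)
  {
          struct device_node *child;

          for_each_available_child_of_node(parent, child) {
                  if (!example_child_usable(child))   /* hypothetical check */
                          continue;                   /* no of_node_put() here */
                  example_register_child(child);      /* hypothetical helper  */
          }
  }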

commit d19b183cdc1fa3d70d6abe2a4c369e748cd7ebb8
Author: Douglas Caetano dos Santos <douglascs@taghos.com.br>
Date:   Fri May 12 15:19:15 2017 -0300

    net/packet: fix missing net_device reference release
    
    When using a TX ring buffer, if an error occurs processing a control
    message (e.g. invalid message), the net_device reference is not
    released.
    
    Fixes c14ac9451c348 ("sock: enable timestamping using control messages")
    Signed-off-by: Douglas Caetano dos Santos <douglascs@taghos.com.br>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 4762010f09ac0453f613df345c5281e7f2dec510
Author: yuval.shaia@oracle.com <yuval.shaia@oracle.com>
Date:   Fri May 12 09:10:51 2017 +0300

    net/mlx4_core: Use min3 to select number of MSI-X vectors
    
    Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
    Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 70957eaecc2e43308e403c80293bec3d59632412
Author: Vlad Yasevich <vyasevich@gmail.com>
Date:   Thu May 11 11:09:52 2017 -0400

    macvlan: Fix performance issues with vlan tagged packets
    
    Macvlan always turns on offload features that have a software
    fallback (NETIF_F_GSO_SOFTWARE).  This allows much higher guest-guest
    communications over macvtap.
    
    However, macvtap does not turn on these features for vlan tagged traffic.
    As a result, depending on the HW that macvtap is configured on, the
    performance of guest-guest communication over a vlan is very
    inconsistent.  If the HW supports TSO/UFO over vlans, then the
    performance will be fine.  If not, the performance will suffer
    greatly since the VM may continue using TSO/UFO, and will force the host
    to segment the traffic and possibly overflow the macvtap queue.
    
    This patch adds the always on offloads to vlan_features.  This
    makes sure that any vlan tagged traffic between 2 guest will not
    be segmented needlessly.
    
    Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
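
A minimal illustrative setup snippet for the change described above (not the macvlan source; NETIF_F_GSO_SOFTWARE is the real feature mask): mirror the software-fallback offloads onto vlan_features so VLAN-tagged traffic keeps them.

  static void example_setup(struct net_device *dev)
  {
          dev->features      |= NETIF_F_GSO_SOFTWARE;
          dev->vlan_features |= NETIF_F_GSO_SOFTWARE;   /* also for VLAN-tagged traffic */
  }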

commit 9fce894d03a98ec8e8e8106a964644633d2772ee
Author: Peter Rosin <peda@axentia.se>
Date:   Mon May 15 09:03:50 2017 +0200

    i2c: mux: only print failure message on error
    
    As is, a failure message is printed unconditionally, which is confusing.
    And noisy.
    
    Fixes: 8d4d159f25a7 ("i2c: mux: provide more info on failure in i2c_mux_add_adapter")
    Signed-off-by: Peter Rosin <peda@axentia.se>

commit a36d4637e4a06be067b8e327a0b1118bb2a73cb8
Author: Peter Rosin <peda@axentia.se>
Date:   Mon May 15 18:48:55 2017 +0200

    i2c: mux: reg: rename label to indicate what it does
    
    That maintains sanity if it is ever called from some other spot, and
    also makes the label names coherent.
    
    Signed-off-by: Peter Rosin <peda@axentia.se>

commit 68118e0e73aa3a6291c8b9eb1ee708e05f110cea
Author: Peter Rosin <peda@axentia.se>
Date:   Sun May 7 07:16:30 2017 +0200

    i2c: mux: reg: put away the parent i2c adapter on probe failure
    
    It is only prudent to let go of resources that are not used.
    
    Fixes: b3fdd32799d8 ("i2c: mux: Add register-based mux i2c-mux-reg")
    Signed-off-by: Peter Rosin <peda@axentia.se>

commit 66c25f6e31766a9ec19c2bdc7f5f69f9c59bafd7
Author: Niklas Cassel <niklas.cassel@axis.com>
Date:   Mon May 15 10:56:06 2017 +0200

    net: stmmac: use correct pointer when printing normal descriptor ring
    
    There are two pointers in sysfs_display_ring,
    one that increments if using normal dma descriptors,
    another if using extended dma descriptors.
    
    When printing the normal dma descriptors, the wrong pointer is used,
    thus the printed descriptor addresses are incorrect.
    
    Signed-off-by: Niklas Cassel <niklas.cassel@axis.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit fb317002ab4419ae7e068bee6897f2d5745aa3b9
Author: Heiko Carstens <heiko.carstens@de.ibm.com>
Date:   Tue May 9 12:50:53 2017 +0200

    s390/virtio: change virtio_feature_desc:features type to __le32
    
    The feature member of virtio_feature_desc contains little endian
    values, given that it contents will be converted with
    le32_to_cpu(). The "wrong" __u32 type leads to the sparse warnings
    below.
    In order to avoid them, use the correct __le32 type instead.
    
    drivers/s390/virtio/virtio_ccw.c:749:14: warning: cast to restricted __le32
    drivers/s390/virtio/virtio_ccw.c:762:28: warning: cast to restricted __le32
    
    Acked-by: Halil Pasic <pasic@linux.vnet.ibm.com>
    Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
    Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
    Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
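
An illustrative sketch of the type change (struct and function names invented): declaring little-endian wire data as __le32 lets le32_to_cpu() type-check cleanly under sparse.

  struct example_feature_desc {
          __le32 features;        /* little-endian on the wire */
          __u8   index;
  };

  static u32 example_read_features(const struct example_feature_desc *d)
  {
          return le32_to_cpu(d->features);   /* no "cast to restricted __le32" warning */
  }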

commit 2e63309507c818e8b631a03f02c363031c007fb7
Author: Joe Thornber <ejt@redhat.com>
Date:   Thu May 11 09:09:04 2017 -0400

    dm cache policy smq: don't do any writebacks unless IDLE
    
    If there are no clean blocks to be demoted the writeback will be
    triggered at that point.  Preemptively writing back can hurt high IO
    load scenarios.
    
    Signed-off-by: Joe Thornber <ejt@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>

commit 49b7f768900f4084a65c3689d955b2fceac39e53
Author: Joe Thornber <ejt@redhat.com>
Date:   Thu May 11 09:07:16 2017 -0400

    dm cache: simplify the IDLE vs BUSY state calculation
    
    Drop the MODERATE state since it wasn't buying us much.
    
    Also, in check_migrations(), prepare for the next commit ("dm cache
    policy smq: don't do any writebacks unless IDLE") by deferring to the
    policy to make the final decision on whether writebacks can be
    serviced.
    
    Signed-off-by: Joe Thornber <ejt@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>

commit 701e03e4e180f0cd97d4139a32e2b2d879d12da2
Author: Joe Thornber <ejt@redhat.com>
Date:   Thu May 11 08:22:31 2017 -0400

    dm cache: track all IO to the cache rather than just the origin device's IO
    
    IO tracking used to throttle writebacks when the origin device is busy.
    
    Even if all the IO is going to the fast device, writebacks can
    significantly degrade performance.  So track all IO to gauge whether the
    cache is busy or not.
    
    Otherwise, synthetic IO tests (e.g. fio) that might send all IO to the
    fast device wouldn't cause writebacks to get throttled.
    
    Signed-off-by: Joe Thornber <ejt@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>

commit 6cf4cc8f8b3b7bc9e3c04a7eab44b985d50029fc
Author: Joe Thornber <ejt@redhat.com>
Date:   Thu May 11 07:48:18 2017 -0400

    dm cache policy smq: stop preemptively demoting blocks
    
    It causes a lot of churn if the working set's size is close to the fast
    device's size.
    
    Signed-off-by: Joe Thornber <ejt@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>

commit 4d44ec5ab751be63c5d348f13294304d87baa8c3
Author: Joe Thornber <ejt@redhat.com>
Date:   Thu May 11 05:11:06 2017 -0400

    dm cache policy smq: put newly promoted entries at the top of the multiqueue
    
    This stops entries bouncing in and out of the cache quickly.
    
    Signed-off-by: Joe Thornber <ejt@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>

commit 78c45607b909fb384c47c134d89b39285a6a8b45
Author: Joe Thornber <ejt@redhat.com>
Date:   Thu May 11 05:09:38 2017 -0400

    dm cache policy smq: be more aggressive about triggering a writeback
    
    If there are no clean entries to demote we really want to writeback
    immediately.
    
    Signed-off-by: Joe Thornber <ejt@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>

commit a8cd1eba6135e086109e2b94bf96deb17456ede8
Author: Joe Thornber <ejt@redhat.com>
Date:   Thu May 11 05:07:34 2017 -0400

    dm cache policy smq: only demote entries in bottom half of the clean multiqueue
    
    Heavy IO load may mean there are very few clean blocks in the cache, and
    we risk demoting entries that get hit a lot.
    
    Signed-off-by: Joe Thornber <ejt@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>

commit 072792dcdfc8d5f91a26050e5665285f50afebf5
Author: Joe Thornber <ejt@redhat.com>
Date:   Thu May 11 06:14:16 2017 -0400

    dm cache: fix incorrect 'idle_time' reset in IO tracker
    
    Some bios have no payload (eg, a FLUSH), don't reset the idle_time when
    these come in.
    
    Signed-off-by: Joe Thornber <ejt@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>

commit 508541146af18e43072e41a31aa62fac2b01aac1
Author: Yishai Hadas <yishaih@mellanox.com>
Date:   Tue Apr 25 10:39:57 2017 +0300

    net/mlx5: Use underlay QPN from the root name space
    
    Root flow table is dynamically changed by the underlying flow steering
    layer, and IPoIB/ULPs have no idea what will be the root flow table in
    the future, hence we need a dynamic infrastructure to move Underlay QPs
    with the root flow table.
    
    Fixes: b3ba51498bdd ("net/mlx5: Refactor create flow table method to accept underlay QP")
    Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
    Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
    Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
    Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

commit 5360fd473c23f18b42722cdb13a1c6ec7acd96ff
Author: Saeed Mahameed <saeedm@mellanox.com>
Date:   Thu May 4 17:53:32 2017 +0300

    net/mlx5e: IPoIB, Only support regular RQ for now
    
    IPoIB doesn't support striding RQ at the moment, for this
    we need to explicitly choose non striding RQ in IPoIB init,
    even if the HW supports it.
    
    Fixes: 8f493ffd88ea ("net/mlx5e: IPoIB, RX steering RSS RQTs and TIRs")
    Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

commit 20b6a1c78280dbeb45214c463cf9cbccb3665146
Author: Saeed Mahameed <saeedm@mellanox.com>
Date:   Tue May 9 16:40:46 2017 +0300

    net/mlx5e: Fix setup TC ndo
    
    Fail-safe support patches introduced a trivial bug,
    setup tc callback is doing a wrong check of the netdevice state,
    the fix is simply to invert the condition.
    
    Fixes: 6f9485af4020 ("net/mlx5e: Fail safe tc setup")
    Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

commit e3c19503712d6360239b19c14cded56dd63c40d7
Author: Gal Pressman <galp@mellanox.com>
Date:   Wed Apr 19 14:35:15 2017 +0300

    net/mlx5e: Fix ethtool pause support and advertise reporting
    
    Pause bit should set when RX pause is on, not TX pause.
    Also, setting Asym_Pause is incorrect, and should be turned off.
    
    Fixes: 665bc53969d7 ("net/mlx5e: Use new ethtool get/set link ksettings API")
    Signed-off-by: Gal Pressman <galp@mellanox.com>
    Cc: kernel-team@fb.com
    Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

commit b383b544f2666d67446b951a9a97af239dafed5d
Author: Gal Pressman <galp@mellanox.com>
Date:   Mon Apr 3 15:11:22 2017 +0300

    net/mlx5e: Use the correct pause values for ethtool advertising
    
    Query the operational pause from firmware (PFCC register) instead of
    always passing zeros.
    
    Fixes: 665bc53969d7 ("net/mlx5e: Use new ethtool get/set link ksettings API")
    Signed-off-by: Gal Pressman <galp@mellanox.com>
    Cc: kernel-team@fb.com
    Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

commit 67b4c889cc835a2a6e2ff4e20544a33e37e2875d
Author: Steve French <smfrench@gmail.com>
Date:   Fri May 12 20:59:10 2017 -0500

    [CIFS] Minor cleanup of xattr query function
    
    Some minor cleanup of cifs query xattr functions (will also make
    SMB3 xattr implementation cleaner as well).
    
    Signed-off-by: Steve French <steve.french@primarydata.com>

commit 4328fea77ca30ef6af938ae3f263a3d055a9c0e6
Author: Karim Eshapa <karim.eshapa@gmail.com>
Date:   Fri May 12 01:53:38 2017 +0200

    fs: cifs: transport: Use time_after for time comparison
    
    Use time_after kernel macro for time comparison
    that has safety check.
    
    Signed-off-by: Karim Eshapa <karim.eshapa@gmail.com>
    Signed-off-by: Steve French <smfrench@gmail.com>
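
For illustration, time_after() compares jiffies values safely across wrap-around; a minimal sketch with an invented helper name:

  static bool example_has_timed_out(unsigned long deadline)
  {
          return time_after(jiffies, deadline);   /* true once 'deadline' has passed */
  }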

commit cd1230070ae1c12fd34cf6a557bfa81bf9311009
Author: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Date:   Fri May 12 17:59:32 2017 +0200

    SMB2: Fix share type handling
    
    In fs/cifs/smb2pdu.h, we have:
    #define SMB2_SHARE_TYPE_DISK    0x01
    #define SMB2_SHARE_TYPE_PIPE    0x02
    #define SMB2_SHARE_TYPE_PRINT   0x03
    
    Knowing that, with the current code, the SMB2_SHARE_TYPE_PRINT case can
    never trigger and printer share would be interpreted as disk share.
    
    So, test the ShareType value for equality instead.
    
    Fixes: faaf946a7d5b ("CIFS: Add tree connect/disconnect capability for SMB2")
    Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
    Acked-by: Aurelien Aptel <aaptel@suse.com>
    Signed-off-by: Steve French <smfrench@gmail.com>
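
A minimal illustrative sketch of the corrected check; the helper and out-parameters are invented, while the SMB2_SHARE_TYPE_* defines are as quoted above. Since the share types are enumerated values (1, 2, 3), test with == rather than &.

  static void example_classify_share(u8 share_type, bool *is_pipe, bool *is_print)
  {
          *is_pipe  = (share_type == SMB2_SHARE_TYPE_PIPE);
          *is_print = (share_type == SMB2_SHARE_TYPE_PRINT);  /* can now actually match */
  }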

commit ecdcf622eb74b52cebde1387a7a1852a787d8050
Author: Joe Perches via samba-technical <samba-technical@lists.samba.org>
Date:   Sun May 7 01:31:47 2017 -0700

    cifs: cifsacl: Use a temporary ops variable to reduce code length
    
    Create an ops variable to store tcon->ses->server->ops and cache
    indirections and reduce code size a trivial bit.
    
    $ size fs/cifs/cifsacl.o*
       text    data     bss     dec     hex filename
       5338     136       8    5482    156a fs/cifs/cifsacl.o.new
       5371     136       8    5515    158b fs/cifs/cifsacl.o.old
    
    Signed-off-by: Joe Perches <joe@perches.com>
    Acked-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
    Signed-off-by: Steve French <smfrench@gmail.com>

commit 1c4d5f51a812a82de97beee24f48ed05c65ebda5
Author: Neil Horman <nhorman@tuxdriver.com>
Date:   Fri May 12 12:00:01 2017 -0400

    vmxnet3: ensure that adapter is in proper state during force_close
    
    There are several paths in vmxnet3, where settings changes cause the
    adapter to be brought down and back up (vmxnet3_set_ringparam among
    them).  Should part of the reset operation fail, these paths call
    vmxnet3_force_close, which enables all napi instances prior to calling
    dev_close (with the expectation that vmxnet3_close will then properly
    disable them again).  However, vmxnet3_force_close neglects to clear
    VMXNET3_STATE_BIT_QUIESCED prior to calling dev_close.  As a result
    vmxnet3_quiesce_dev (called from vmxnet3_close), returns early, and
    leaves all the napi instances in a enabled state while the device itself
    is closed.  If a device in this state is activated again, napi_enable
    will be called on already enabled napi_instances, leading to a BUG halt.
    
    The fix is to simply ensure that the QUIESCED bit is cleared in
    vmxnet3_force_close to allow quiescence to be completed properly on close.
    
    Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
    CC: Shrikrishna Khare <skhare@vmware.com>
    CC: "VMware, Inc." <pv-drivers@vmware.com>
    CC: "David S. Miller" <davem@davemloft.net>
    Signed-off-by: David S. Miller <davem@davemloft.net>
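
An illustrative ordering sketch of the fix described above (the struct and field usage are simplified; clear_bit() and dev_close() are the real kernel APIs): clear the QUIESCED bit before dev_close() so the close path can quiesce the device properly.

  static void example_force_close(struct vmxnet3_adapter *adapter)
  {
          clear_bit(VMXNET3_STATE_BIT_QUIESCED, &adapter->state);
          dev_close(adapter->netdev);     /* napi instances now get disabled properly */
  }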

commit 42e6cae1b35b186650f5abc7b24c20ab6986c5a0
Author: Bert Kenward <bkenward@solarflare.com>
Date:   Fri May 12 17:18:50 2017 +0100

    sfc: revert changes to NIC revision numbers
    
    The revision enum values (eg EFX_REV_HUNT_A0) form part of our API,
     and are included in ethtool. If these are inconsistent then ethtool
     will print garbage for a register dump (ethtool -d).
    
    Fixes: 5a6681e22c14 ("sfc: separate out SFC4000 ("Falcon") support into new sfc-falcon driver")
    Signed-off-by: Edward Cree <ecree@solarflare.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit b12ca80ca145cecadf841ba27cc061c510cd97ca
Author: Johan Hovold <johan@kernel.org>
Date:   Fri May 12 12:13:26 2017 +0200

    net: ch9200: add missing USB-descriptor endianness conversions
    
    Add the missing endianness conversions to a debug statement printing
    the USB device-descriptor idVendor and idProduct fields during probe.
    
    Signed-off-by: Johan Hovold <johan@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 75cf067953d5ee543b3bda90bbfcbee5e1f94ae8
Author: Johan Hovold <johan@kernel.org>
Date:   Fri May 12 12:11:13 2017 +0200

    net: irda: irda-usb: fix firmware name on big-endian hosts
    
    Add missing endianness conversion when using the USB device-descriptor
    bcdDevice field to construct a firmware file name.
    
    Fixes: 8ef80aef118e ("[IRDA]: irda-usb.c: STIR421x cleanups")
    Cc: stable <stable@vger.kernel.org>     # 2.6.18
    Cc: Nick Fedchik <nfedchik@atlantic-link.com.ua>
    Signed-off-by: Johan Hovold <johan@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 9fc3e4dc67fde953fd26adb3cea8f92597810b64
Author: Gustavo A. R. Silva <garsilva@embeddedor.com>
Date:   Thu May 11 22:11:29 2017 -0500

    net: dsa: mv88e6xxx: add default case to switch
    
    Add default case to switch in order to avoid any chance of using an
    uninitialized variable _low_, in case s->type does not match any of
    the listed case values.
    
    Addresses-Coverity-ID: 1398130
    Suggested-by: Andrew Lunn <andrew@lunn.ch>
    Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit dbc2b5e9a09e9a6664679a667ff81cff6e5f2641
Author: Xin Long <lucien.xin@gmail.com>
Date:   Fri May 12 14:39:52 2017 +0800

    sctp: fix src address selection if using secondary addresses for ipv6
    
    Commit 0ca50d12fe46 ("sctp: fix src address selection if using secondary
    addresses") has fixed a src address selection issue when using secondary
    addresses for ipv4.
    
    Now sctp ipv6 also has the similar issue. When using a secondary address,
    sctp_v6_get_dst tries to choose the saddr which has the most same bits
    with the daddr by sctp_v6_addr_match_len. It may make some cases not work
    as expected.
    
    hostA:
      [1] fd21:356b:459a:cf10::11 (eth1)
      [2] fd21:356b:459a:cf20::11 (eth2)
    
    hostB:
      [a] fd21:356b:459a:cf30::2  (eth1)
      [b] fd21:356b:459a:cf40::2  (eth2)
    
    route from hostA to hostB:
      fd21:356b:459a:cf30::/64 dev eth1  metric 1024  mtu 1500
    
    The expected path should be:
      fd21:356b:459a:cf10::11 <-> fd21:356b:459a:cf30::2
    But addr[2] matches addr[a] more bits than addr[1] does, according to
    sctp_v6_addr_match_len. It causes the path to be:
      fd21:356b:459a:cf20::11 <-> fd21:356b:459a:cf30::2
    
    This patch is to fix it with the same way as Marcelo's fix for sctp ipv4.
    As no ip_dev_find for ipv6, this patch is to use ipv6_chk_addr to check
    if the saddr is in a dev instead.
    
    Note that for backwards compatibility, it will still do the addr_match_len
    check here when no optimal is found.
    
    Reported-by: Patrick Talbert <ptalbert@redhat.com>
    Signed-off-by: Xin Long <lucien.xin@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
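
A rough illustrative sketch of the selection preference described above, not the actual sctp_v6_get_dst(): the candidate struct and match-length helper are invented, while ipv6_chk_addr() is the real API. Prefer a source address that lives on the egress device, and only fall back to the longest-prefix match.

  struct example_candidate {
          struct in6_addr a;      /* one locally bound source address */
  };

  static struct example_candidate *
  example_pick_saddr(struct net *net, struct net_device *egress_dev,
                     struct example_candidate *cand, int n,
                     const struct in6_addr *daddr)
  {
          struct example_candidate *best = NULL;
          int best_len = -1, len, i;

          for (i = 0; i < n; i++) {
                  /* prefer a source address configured on the egress device */
                  if (ipv6_chk_addr(net, &cand[i].a, egress_dev, 0))
                          return &cand[i];

                  /* otherwise remember the longest-prefix match as a fallback */
                  len = example_addr_match_len(&cand[i].a, daddr);  /* hypothetical */
                  if (len > best_len) {
                          best_len = len;
                          best = &cand[i];
                  }
          }
          return best;
  }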

commit df0c8d911abf6ba97b2c2fc3c5a12769e0b081a3
Author: Florian Fainelli <f.fainelli@gmail.com>
Date:   Thu May 11 11:24:16 2017 -0700

    net: phy: Call bus->reset() after releasing PHYs from reset
    
    The API convention makes it that a given MDIO bus reset should be able
    to access PHY devices in its reset() callback and perform additional
    MDIO accesses in order to bring the bus and PHYs in a working state.
    
    Commit 69226896ad63 ("mdio_bus: Issue GPIO RESET to PHYs.") broke that
    contract by first calling bus->reset() and then release all PHYs from
    reset using their shared GPIO line, so restore the expected
    functionality here.
    
    Fixes: 69226896ad63 ("mdio_bus: Issue GPIO RESET to PHYs.")
    Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 6832a333ed4a7cc4fcb170c045d1d96d0976fdd4
Author: David S. Miller <davem@davemloft.net>
Date:   Thu May 11 19:30:02 2017 -0700

    bpf: Handle multiple variable additions into packet pointers in verifier.
    
    We must accumulate into reg->aux_off rather than use a plain assignment.
    
    Add a test for this situation to test_align.
    
    Reported-by: Alexei Starovoitov <ast@fb.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 844cf763fba654436d3a4279b6a672c196cf1901
Author: Jon Paul Maloy <jon.maloy@ericsson.com>
Date:   Thu May 11 20:28:15 2017 +0200

    tipc: make macro tipc_wait_for_cond() smp safe
    
    The macro tipc_wait_for_cond() is embedding the macro sk_wait_event()
    to fulfil its task. The latter, in turn, is evaluating the stated
    condition outside the socket lock context. This is problematic if
    the condition is accessing non-trivial data structures which may be
    altered by incoming interrupts, as is the case with the cong_links()
    linked list, used by socket to keep track of the current set of
    congested links. We sometimes see crashes when this list is accessed
    by a condition function at the same time as a SOCK_WAKEUP interrupt
    is removing an element from the list.
    
    We fix this by expanding selected parts of sk_wait_event() into the
    outer macro, while ensuring that all evaluations of a given condition
    are performed under socket lock protection.
    
    Fixes: commit 365ad353c256 ("tipc: reduce risk of user starvation during link congestion")
    Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
    Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit ad990dbe6d3ac3af1f5f4484b1126b9fc601e98a
Author: Andy Gospodarek <andy@greyhouse.net>
Date:   Thu May 11 15:52:30 2017 -0400

    samples/bpf: run cleanup routines when receiving SIGTERM
    
    Shahid Habib noticed that when xdp1 was killed from a different console the xdp
    program was not cleaned-up properly in the kernel and it continued to forward
    traffic.
    
    Most of the applications in samples/bpf cleanup properly, but only when getting
    SIGINT.  Since kill defaults to using SIGTERM, add support to cleanup when the
    application receives either SIGINT or SIGTERM.
    
    Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
    Reported-by: Shahid Habib <shahid.habib@broadcom.com>
    Acked-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: Daniel Borkmann <daniel@iogearbox.net>
    Signed-off-by: David S. Miller <davem@davemloft.net>
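
A minimal userspace sketch of the change described above: install the same cleanup handler for both SIGINT and SIGTERM so a default kill (SIGTERM) also cleans up. The handler body is a placeholder.

  #include <signal.h>
  #include <stdlib.h>
  #include <unistd.h>

  static void example_cleanup(int sig)
  {
          /* detach the XDP program / free maps here, then exit */
          exit(0);
  }

  int main(void)
  {
          signal(SIGINT,  example_cleanup);
          signal(SIGTERM, example_cleanup);   /* plain 'kill' now cleans up too */

          /* ... attach program and poll ... */
          for (;;)
                  pause();
  }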

commit d2be3667f3769b3c60aa294ef7f2b03d1b16559c
Author: Colin Ian King <colin.king@canonical.com>
Date:   Thu May 11 19:29:40 2017 +0100

    ethernet: aquantia: remove redundant checks on error status
    
    The error status err is initialized to zero and then checked
    several times to see if it is less than zero even though it has not
    been updated.  It may seem that err should be assigned to the
    return code of the call to the various *offload_en_set calls and
    then we check for failure, however, these functions are void and
    never actually return any status.
    
    Since these error checks are redundant we can remove these
    as well as err and the error exit label err_exit.
    
    Detected by CoverityScan, CID#1398313 and CID#1398306 ("Logically
    dead code")
    
    Signed-off-by: Colin Ian King <colin.king@canonical.com>
    Reviewed-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
    Acked-by: Pavel Belous <pavel.belous@aquantia.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 69a73e744db1a57175039e3d1e6ef3913e816eb8
Author: David S. Miller <davem@davemloft.net>
Date:   Thu May 11 21:41:09 2017 -0400

    bpf: Remove commented out debugging hack in test_align.
    
    Reported-by: Alexander Alemayhu <alexander@alemayhu.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 33c16bfd5bc00bbb9823d25af46c496966e0ac0d
Author: Chopra, Manish <Manish.Chopra@cavium.com>
Date:   Thu May 11 07:12:48 2017 -0700

    qlcnic: Update version to 5.3.66
    
    Bumping up the version as couple of fixes added after 5.3.65
    
    Signed-off-by: Manish Chopra <manish.chopra@cavium.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit f9c3fe2f43343be7972226c2e736e7a68e383dc1
Author: Chopra, Manish <Manish.Chopra@cavium.com>
Date:   Thu May 11 07:12:47 2017 -0700

    qlcnic: Fix link configuration with autoneg disabled
    
    Currently driver returns error on speed configurations
    for 83xx adapter's non XGBE ports, due to this link doesn't
    come up on the ports using 1000Base-T as a connector with
    autoneg disabled. This patch fixes this with initializing
    appropriate port type based on queried module/connector
    types from hardware before any speed/autoneg configuration.
    
    Signed-off-by: Manish Chopra <manish.chopra@cavium.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit d86b5672b1adb98b4cdd6fbf0224bbfb03db6e2e
Author: Vitaly Kuznetsov <vkuznets@redhat.com>
Date:   Thu May 11 13:58:06 2017 +0200

    xen-netfront: avoid crashing on resume after a failure in talk_to_netback()
    
    Unavoidable crashes in netfront_resume() and netback_changed() after a
    previous fail in talk_to_netback() (e.g. when we fail to read MAC from
    xenstore) were discovered. The failure path in talk_to_netback() does
    unregister/free for netdev but we don't reset drvdata and we try accessing
    it after resume.
    
    Fix the bug by removing the whole xen device completely with
    device_unregister(), this guarantees we won't have any calls into netfront
    after a failure.
    
    Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit cb395b2010879a8461aa1b1c37025769708c32cf
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed May 10 21:59:28 2017 -0700

    net: sched: optimize class dumps
    
    In commit 59cc1f61f09c ("net: sched: convert qdisc linked list to
    hashtable") we missed the opportunity to considerably speed up
    tc_dump_tclass_root() if a qdisc handle is provided by user.
    
    Instead of iterating all the qdiscs, use qdisc_match_from_root()
    to directly get the one we look for.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Jiri Kosina <jkosina@suse.cz>
    Cc: Jamal Hadi Salim <jhs@mojatatu.com>
    Cc: Cong Wang <xiyou.wangcong@gmail.com>
    Cc: Jiri Pirko <jiri@resnulli.us>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit b451e5d24ba6687c6f0e7319c727a709a1846c06
Author: Yuchung Cheng <ycheng@google.com>
Date:   Wed May 10 17:01:27 2017 -0700

    tcp: avoid fragmenting peculiar skbs in SACK
    
    This patch fixes a bug in splitting an SKB during SACK
    processing. Specifically if an skb contains multiple
    packets and is only partially sacked in the higher sequences,
    tcp_match_sack_to_skb() splits the skb and marks the second fragment
    as SACKed.
    
    The current code further attempts rounding up the first fragment
    to MSS boundaries. But it misses a boundary condition when the
    rounded-up fragment size (pkt_len) is exactly the skb size.  Splitting
    such an skb is pointless and causes a kernel warning and aborts
    the SACK processing. This patch universally checks such over-split
    before calling tcp_fragment to prevent these unnecessary warnings.
    
    Fixes: adb92db857ee ("tcp: Make SACK code to split only at mss boundaries")
    Signed-off-by: Yuchung Cheng <ycheng@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
    Acked-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
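
An illustrative guard for the boundary case described above (helper name invented): if rounding pkt_len up to an MSS boundary reaches the full skb length, there is nothing to split.

  static bool example_should_split(unsigned int pkt_len, const struct sk_buff *skb)
  {
          return pkt_len < skb->len;   /* splitting at skb->len would be pointless */
  }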

commit f6ba8d33cfbb46df569972e64dbb5bb7e929bfd9
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu May 11 15:24:41 2017 -0700

    netem: fix skb_orphan_partial()
    
    I should have known that lowering skb->truesize was dangerous :/
    
    In case packets are not leaving the host via a standard Ethernet device,
    but looped back to local sockets, bad things can happen, as reported
    by Michael Madsen ( https://bugzilla.kernel.org/show_bug.cgi?id=195713 )
    
    So instead of tweaking skb->truesize, let's change skb->destructor
    and keep a reference on the owner socket via its sk_refcnt.
    
    Fixes: f2f872f9272a ("netem: Introduce skb_orphan_partial() helper")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reported-by: Michael Madsen <mkm@nabto.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
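
A rough sketch of the idea described above: instead of shrinking skb->truesize,
keep the owner socket alive by taking a reference on sk_refcnt and switching to
a lightweight destructor. The function name is illustrative and the details are
simplified from the actual fix:

  static void skb_orphan_partial_sketch(struct sk_buff *skb)
  {
          if (skb->destructor == sock_wfree) {
                  struct sock *sk = skb->sk;

                  /* Hold the owner socket so swapping the destructor is safe. */
                  if (atomic_inc_not_zero(&sk->sk_refcnt))
                          skb->destructor = sock_efree;   /* sock_efree drops the ref */
          } else {
                  skb_orphan(skb);
          }
  }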

commit d67b9cd28c1d7f82c2e5e727731ea7c89b23a0a8
Author: Daniel Borkmann <daniel@iogearbox.net>
Date:   Fri May 12 01:04:46 2017 +0200

    xdp: refine xdp api with regards to generic xdp
    
    While working on the iproute2 generic XDP frontend, I noticed that
    as of right now it's possible to have a native *and* a generic XDP
    program loaded at the same time when a driver supports native XDP.
    
    The intended model for generic XDP from b5cdae3291f7 ("net: Generic
    XDP") is, however, that only one out of the two can be present at
    once which is also indicated as such in the XDP netlink dump part.
    The main rationale for generic XDP is to ease accessibility (in
    case a driver does not yet have XDP support) and to generically
    provide a semantical model as an example for driver developers
    wanting to add XDP support. The generic XDP option for an XDP
    aware driver can still be useful for comparing and testing both
    implementations.
    
    However, the intent was never to have a second XDP processing stage
    or layer with exactly the same functionality as the first, native
    stage. The only reason for one would be a partial fallback for future
    XDP features that are not yet supported in the native implementation,
    and we probably shouldn't strive for such a fallback anyway, but instead
    encourage native feature support in the first place. Given there's
    currently no such fallback issue or use case, let's not go there yet
    if we don't need to.
    
    Therefore, change semantics for loading XDP and bail out if the
    user tries to load a generic XDP program when a native one is
    present and vice versa. Another alternative to bailing out would
    be to handle the transition from one flavor to another gracefully,
    but that would require to bring the device down, exchange both
    types of programs, and bring it up again in order to avoid a tiny
    window where a packet could hit both hooks. Given this complicates
    the logic for just a debugging feature in the native case, I went
    with the simpler variant.
    
    For the dump, remove IFLA_XDP_FLAGS that was added with b5cdae3291f7
    and reuse IFLA_XDP_ATTACHED for indicating the mode. Dumping all
    or just a subset of flags that were used for loading the XDP prog
    is suboptimal in the long run since not all flags are useful for
    dumping and if we start to reuse the same flag definitions for
    load and dump, then we'll waste bit space. What we really just
    want is to dump the mode for now.
    
    Current IFLA_XDP_ATTACHED semantics are: nothing was installed (0),
    a program is running at the native driver layer (1). Thus, add a
    mode that says that a program is running at generic XDP layer (2).
    Applications will handle this fine, in that older binaries will
    just indicate that something is attached at the XDP layer; effectively
    this is similar to the IFLA_XDP_FLAGS attr we would have had,
    modulo the redundancy.
    
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 0489df9a430e9607de8130a6bc4bf4c02f96eaf1
Author: Daniel Borkmann <daniel@iogearbox.net>
Date:   Fri May 12 01:04:45 2017 +0200

    xdp: add flag to enforce driver mode
    
    After commit b5cdae3291f7 ("net: Generic XDP") we automatically fall
    back to a generic XDP variant if the driver does not support native
    XDP. Allow for an option where the user can specify that always the
    native XDP variant should be selected and in case it's not supported
    by a driver, just bail out.
    
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 0a5539f66133a02b24f9cc43da5b84b7e6f3f436
Author: David S. Miller <davem@davemloft.net>
Date:   Thu May 11 12:00:50 2017 -0700

    bpf: Provide a linux/types.h override for bpf selftests.
    
    We do not want to use the architecture's types.h header when
    building BPF programs, which are always 64-bit.
    
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 18b3ad90b64e9893297357608abddd26170730eb
Author: David S. Miller <davem@davemloft.net>
Date:   Wed May 10 11:43:51 2017 -0700

    bpf: Add verifier test case for alignment.
    
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Acked-by: Daniel Borkmann <daniel@iogearbox.net>

commit 91045f5e5238a6d2ad21d41d7e86e2fa65f90d76
Author: David S. Miller <davem@davemloft.net>
Date:   Wed May 10 11:42:48 2017 -0700

    bpf: Add bpf_verify_program() to the library.
    
    This allows a test case to load a BPF program and unconditionally
    acquire the verifier log.
    
    It also allows specification of the strict alignment flag.
    
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Acked-by: Daniel Borkmann <daniel@iogearbox.net>

commit e07b98d9bffe410019dfcf62c3428d4a96c56a2c
Author: David S. Miller <davem@davemloft.net>
Date:   Wed May 10 11:38:07 2017 -0700

    bpf: Add strict alignment flag for BPF_PROG_LOAD.
    
    Add a new field, "prog_flags", and an initial flag value
    BPF_F_STRICT_ALIGNMENT.
    
    When set, the verifier will enforce strict pointer alignment
    regardless of the setting of CONFIG_EFFICIENT_UNALIGNED_ACCESS.
    
    The verifier, in this mode, will also use a fixed value of "2" in
    place of NET_IP_ALIGN.
    
    This facilitates test cases that will exercise and validate this part
    of the verifier even when run on architectures where alignment doesn't
    matter.
    
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Acked-by: Daniel Borkmann <daniel@iogearbox.net>
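
A minimal userspace sketch of requesting strict alignment checking through the
new prog_flags field; it assumes a 64-bit build and a trivial program type, and
omits error handling:

  #include <linux/bpf.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static int load_prog_strict(const struct bpf_insn *insns, unsigned int insn_cnt,
                              char *log, unsigned int log_size)
  {
          union bpf_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.prog_type  = BPF_PROG_TYPE_SOCKET_FILTER;
          attr.insns      = (unsigned long)insns;
          attr.insn_cnt   = insn_cnt;
          attr.license    = (unsigned long)"GPL";
          attr.log_buf    = (unsigned long)log;
          attr.log_size   = log_size;
          attr.log_level  = 2;                        /* per-insn state dumps */
          attr.prog_flags = BPF_F_STRICT_ALIGNMENT;   /* enforce strict checks */

          return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
  }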

commit c5fc9692d101d1318b0f53f9f691cd88ac029317
Author: David S. Miller <davem@davemloft.net>
Date:   Wed May 10 11:25:17 2017 -0700

    bpf: Do per-instruction state dumping in verifier when log_level > 1.
    
    If log_level > 1, do a state dump every instruction and emit it in
    a more compact way (without a leading newline).
    
    This will facilitate more sophisticated test cases which inspect the
    verifier log for register state.
    
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Acked-by: Daniel Borkmann <daniel@iogearbox.net>

commit d1174416747d790d750742d0514915deeed93acf
Author: David S. Miller <davem@davemloft.net>
Date:   Wed May 10 11:22:52 2017 -0700

    bpf: Track alignment of register values in the verifier.
    
    Currently if we add only constant values to pointers we can fully
    validate the alignment, and properly check if we need to reject the
    program on !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS architectures.
    
    However, once an unknown value is introduced we only allow byte sized
    memory accesses which is too restrictive.
    
    Add logic to track the known minimum alignment of register values,
    and propagate this state into registers containing pointers.
    
    The most common paradigm that makes use of this new logic is computing
    the transport header using the IP header length field.  For example:
    
            struct ethhdr *ep = skb->data;
            struct iphdr *iph = (struct iphdr *) (ep + 1);
            struct tcphdr *th;
     ...
            n = iph->ihl;
            th = ((void *)iph + (n * 4));
            port = th->dest;
    
    The existing code will reject the load of th->dest because it cannot
    validate that the alignment is at least 2 once "n * 4" is added to
    the packet pointer.
    
    In the new code, the register holding "n * 4" will have a reg->min_align
    value of 4, because any value multiplied by 4 will be at least 4 byte
    aligned.  (actually, the eBPF code emitted by the compiler in this case
    is most likely to use a shift left by 2, but the end result is identical)
    
    At the critical addition:
    
            th = ((void *)iph + (n * 4));
    
    The register holding 'th' will start with reg->off value of 14.  The
    pointer addition will transform that reg into something that looks like:
    
            reg->aux_off = 14
            reg->aux_off_align = 4
    
    Next, the verifier will look at the th->dest load, and it will see
    a load offset of 2, and first check:
    
            if (reg->aux_off_align % size)
    
    which will pass because aux_off_align is 4.  reg_off will be computed:
    
            reg_off = reg->off;
     ...
                    reg_off += reg->aux_off;
    
    plus we have off==2, and it will thus check:
    
            if ((NET_IP_ALIGN + reg_off + off) % size != 0)
    
    which evaluates to:
    
            if ((NET_IP_ALIGN + 14 + 2) % size != 0)
    
    On strict alignment architectures, NET_IP_ALIGN is 2, thus:
    
            if ((2 + 14 + 2) % size != 0)
    
    which passes.
    
    These pointer transformations and checks work regardless of whether
    the constant offset or the variable with known alignment is added
    first to the pointer register.
    
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Acked-by: Daniel Borkmann <daniel@iogearbox.net>
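
The arithmetic walked through above can be reproduced stand-alone; the snippet
below simply hard-codes the values from the example and is not verifier code:

  #include <stdio.h>

  int main(void)
  {
          int net_ip_align  = 2;  /* strict-alignment architectures          */
          int reg_off       = 14; /* reg->off + reg->aux_off in the example  */
          int off           = 2;  /* offsetof(struct tcphdr, dest)           */
          int size          = 2;  /* 16-bit load of th->dest                 */
          int aux_off_align = 4;  /* known minimum alignment of "n * 4"      */

          if (aux_off_align % size)
                  printf("reject: variable part may be misaligned\n");
          else if ((net_ip_align + reg_off + off) % size)
                  printf("reject: misaligned access\n");
          else
                  printf("accept: (2 + 14 + 2) %% 2 == 0\n");
          return 0;
  }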

commit d8b54110ee944de522ccd3531191f39986ec20f9
Author: Daniel Borkmann <daniel@iogearbox.net>
Date:   Thu May 11 01:53:15 2017 +0200

    bpf, arm64: fix faulty emission of map access in tail calls
    
    Shubham was recently asking on netdev why in arm64 JIT we don't multiply
    the index for accessing the tail call map by 8. That led me into testing
    out arm64 JIT wrt tail calls and it turned out I got a NULL pointer
    dereference on the tail call.
    
    The buggy access is at:
    
      prog = array->ptrs[index];
      if (prog == NULL)
          goto out;
    
      [...]
      00000060:  d2800e0a  mov x10, #0x70 // #112
      00000064:  f86a682a  ldr x10, [x1,x10]
      00000068:  f862694b  ldr x11, [x10,x2]
      0000006c:  b40000ab  cbz x11, 0x00000080
      [...]
    
    The code triggering the crash is f862694b. x1 at the time contains the
    address of the bpf array, x10 offsetof(struct bpf_array, ptrs). Meaning,
    above we load the pointer to the program at map slot 0 into x10. x10
    can then be NULL if the slot is not occupied, which we later on try to
    access with a user given offset in x2 that is the map index.
    
    Fix this by emitting the following instead:
    
      [...]
      00000060:  d2800e0a  mov x10, #0x70 // #112
      00000064:  8b0a002a  add x10, x1, x10
      00000068:  d37df04b  lsl x11, x2, #3
      0000006c:  f86b694b  ldr x11, [x10,x11]
      00000070:  b40000ab  cbz x11, 0x00000084
      [...]
    
    This basically adds the offset to ptrs to the base address of the bpf
    array we got and we later on access the map with an index * 8 offset
    relative to that. The tail call map itself is basically one large area
    with meta data at the head followed by the array of prog pointers.
    This makes tail calls work again, tested on Cavium ThunderX ARMv8.
    
    Fixes: ddb55992b04d ("arm64: bpf: implement bpf_tail_call() helper")
    Reported-by: Shubham Bansal <illusionist.neo@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 5b6cb43b4d625b04a4049d727a116edbfe5cf0f4
Author: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Date:   Wed May 10 10:28:05 2017 -0700

    net: ethernet: ti: netcp_core: return error while dma channel open issue
    
    Fix the error path for a DMA channel open failure. Also, there is no need
    to check the output for NULL if NULL is never returned.
    
    Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit ebccc7397e4a49ff64c8f44a54895de9d32fe742
Author: Ursula Braun <ubraun@linux.vnet.ibm.com>
Date:   Wed May 10 19:07:54 2017 +0200

    s390/qeth: add missing hash table initializations
    
    commit 5f78e29ceebf ("qeth: optimize IP handling in rx_mode callback")
    added new hash tables, but failed to initialize them.
    
    Fixes: 5f78e29ceebf ("qeth: optimize IP handling in rx_mode callback")
    Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
    Reviewed-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
    Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 25e2c341e7818a394da9abc403716278ee646014
Author: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Date:   Wed May 10 19:07:53 2017 +0200

    s390/qeth: avoid null pointer dereference on OSN
    
    Access card->dev only after checking whether it's valid.
    
    Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
    Reviewed-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 2d2ebb3ed0c6acfb014f98e427298673a5d07b82
Author: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Date:   Wed May 10 19:07:52 2017 +0200

    s390/qeth: unbreak OSM and OSN support
    
    commit b4d72c08b358 ("qeth: bridgeport support - basic control")
    broke the support for OSM and OSN devices as follows:
    
    As OSM and OSN are L2 only, qeth_core_probe_device() does an early
    setup by loading the l2 discipline and calling qeth_l2_probe_device().
    In this context, adding the l2-specific bridgeport sysfs attributes
    via qeth_l2_create_device_attributes() hits a BUG_ON in fs/sysfs/group.c,
    since the basic sysfs infrastructure for the device hasn't been
    established yet.
    
    Note that OSN actually has its own unique sysfs attributes
    (qeth_osn_devtype), so the additional attributes shouldn't be created
    at all.
    For OSM, add a new qeth_l2_devtype that contains all the common
    and l2-specific sysfs attributes.
    When qeth_core_probe_device() does early setup for OSM or OSN, assign
    the corresponding devtype so that the ccwgroup probe code creates the
    full set of sysfs attributes.
    This allows us to skip qeth_l2_create_device_attributes() in case
    of an early setup.
    
    Any device that can't do early setup will initially have only the
    generic sysfs attributes, and when it's probed later
    qeth_l2_probe_device() adds the l2-specific attributes.
    
    If an early-setup device is removed (by calling ccwgroup_ungroup()),
    device_unregister() will - using the devtype - delete the
    l2-specific attributes before qeth_l2_remove_device() is called.
    So make sure to not remove them twice.
    
    What complicates the issue is that qeth_l2_probe_device() and
    qeth_l2_remove_device() are also called on a device when its
    layer2 attribute changes (i.e. its layer mode is switched).
    For early-setup devices this wouldn't work properly - we wouldn't
    remove the l2-specific attributes when switching to L3.
    But switching the layer mode doesn't actually make any sense;
    we already decided that the device can only operate in L2!
    So just refuse to switch the layer mode on such devices. Note that
    OSN doesn't have a layer2 attribute, so we only need to special-case
    OSM.
    
    Based on an initial patch by Ursula Braun.
    
    Fixes: b4d72c08b358 ("qeth: bridgeport support - basic control")
    Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 9111e7880ccf419548c7b0887df020b08eadb075
Author: Ursula Braun <ubraun@linux.vnet.ibm.com>
Date:   Wed May 10 19:07:51 2017 +0200

    s390/qeth: handle sysfs error during initialization
    
    When setting up the device from within the layer discipline's
    probe routine, creating the layer-specific sysfs attributes can fail.
    Report this error back to the caller, and handle it by
    releasing the layer discipline.
    
    Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
    [jwi: updated commit msg, moved an OSN change to a subsequent patch]
    Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit b60161668199ac62011c024adc9e66713b9554e7
Author: Jon Mason <jon.mason@broadcom.com>
Date:   Wed May 10 11:20:27 2017 -0400

    mdio: mux: Correct mdio_mux_init error path issues
    
    There is a potential unnecessary refcount decrement on error path of
    put_device(&pb->mii_bus->dev), as it is possible to avoid the
    of_mdio_find_bus() call if mux_bus is specified by the calling function.
    
    The same put_device() is not called in the error path if the
    devm_kzalloc of pb fails.  This caused the variable used in the
    put_device() to be changed, as the pb pointer was obviously not set up.
    
    There is an unnecessary of_node_get() on child_bus_node if the
    of_mdiobus_register() is successful, as the
    for_each_available_child_of_node() automatically increments this.
    Thus the refcount on this node will always be +1 more than it should be.
    
    There is no of_node_put() on child_bus_node if the of_mdiobus_register()
    call fails.
    
    Finally, a devm_kfree() of pb is missing in the error path.  While this
    might not be technically necessary, it was present in other parts of the
    function, so I am adding it where necessary to make it uniform.
    
    Signed-off-by: Jon Mason <jon.mason@broadcom.com>
    Fixes: f20e6657a875 ("mdio: mux: Enhanced MDIO mux framework for integrated multiplexers")
    Fixes: 0ca2997d1452 ("netdev/of/phy: Add MDIO bus multiplexer support.")
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 83eaddab4378db256d00d295bda6ca997cd13a52
Author: WANG Cong <xiyou.wangcong@gmail.com>
Date:   Tue May 9 16:59:54 2017 -0700

    ipv6/dccp: do not inherit ipv6_mc_list from parent
    
    Like commit 657831ffc38e ("dccp/tcp: do not inherit mc_list from parent")
    we should clear ipv6_mc_list etc. for IPv6 sockets too.
    
    Cc: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
    Acked-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 0fe20fafd1791f993806d417048213ec57b81045
Author: Colin Ian King <colin.king@canonical.com>
Date:   Tue May 9 17:19:42 2017 +0100

    netxen_nic: set rcode to the return status from the call to netxen_issue_cmd
    
    Currently rcode is being initialized to NX_RCODE_SUCCESS and later it
    is checked to see if it is not NX_RCODE_SUCCESS which is never true. It
    appears that there is an unintentional missing assignment of rcode from
    the return of the call to netxen_issue_cmd() that was dropped in
    an earlier fix, so add it in.
    
    Detected by CoverityScan, CID#401900 ("Logically dead code")
    
    Fixes: 2dcd5d95ad6b2 ("netxen_nic: fix cdrp race condition")
    Signed-off-by: Colin Ian King <colin.king@canonical.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 8d66c30b12ed3cb533696dea8b9a9eadd5da426a
Author: Stefan Wahren <stefan.wahren@i2se.com>
Date:   Tue May 9 15:40:38 2017 +0200

    net: qca_spi: Fix alignment issues in rx path
    
    The qca_spi driver causes alignment issues on ARM devices.
    So fix this by using netdev_alloc_skb_ip_align().
    
    Signed-off-by: Stefan Wahren <stefan.wahren@i2se.com>
    Fixes: 291ab06ecf67 ("net: qualcomm: new Ethernet over SPI driver for QCA7000")
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 1a4a5bf52a4adb477adb075e5afce925824ad132
Author: Gao Feng <gfree.wind@vip.163.com>
Date:   Tue May 9 18:27:33 2017 +0800

    driver: vrf: Fix one possible use-after-free issue
    
    The current code only handles the case where the skb is dropped; it can
    hit a use-after-free when NF_HOOK returns 0, which means the skb was
    stolen by a netfilter rule or hook.

    When a netfilter rule or hook steals the skb and returns NF_STOLEN, the
    skb is taken over by that rule and no other module may ever touch it
    again. The rule may queue the skb or free it directly.

    Now use nf_hook instead of NF_HOOK to get the netfilter verdict, and
    check the return value of nf_hook. Only when it equals 1 may the skb
    proceed; otherwise reset the skb pointer to NULL.

    By the way, because vrf_rcv_finish is an empty function, there is no need
    to invoke it even when nf_hook returns 1. But vrf_rcv_finish does need to
    be modified to deal with the NF_STOLEN case.
    
    There are two cases when skb is stolen.
    1. The skb is stolen and freed directly.
       There is nothing we need to do, and vrf_rcv_finish isn't invoked.
    2. The skb is queued and reinjected again.
       vrf_rcv_finish is invoked as the okfn, so the skb needs to be
       freed there.
    
    Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
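
A minimal sketch of the pattern described above: call nf_hook() directly and
only let the skb continue when the hook returns 1 (accepted); on drop or steal
the caller must forget the pointer. The helper name is illustrative:

  static struct sk_buff *vrf_rcv_nfhook_sketch(u8 pf, unsigned int hook,
                                               struct sk_buff *skb,
                                               struct net_device *dev)
  {
          struct net *net = dev_net(dev);

          if (nf_hook(pf, hook, net, NULL, skb, dev, NULL, vrf_rcv_finish) != 1)
                  skb = NULL;     /* dropped or stolen: never touch it again */

          return skb;
  }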

commit efc0c21c9ea786d6f019d7df7b4e3932f3578d90
Author: Elena Reshetova <elena.reshetova@intel.com>
Date:   Thu Mar 2 12:23:45 2017 +0100

    s390: convert debug_info.ref_count from atomic_t to refcount_t
    
    refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This helps avoid accidental refcount
    overflows that might lead to use-after-free situations.
    
    Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
    Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: David Windsor <dwindsor@gmail.com>
    Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
    Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
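
A generic before/after sketch of the conversion described above (not the s390
debug code itself); unlike atomic_t, the refcount API saturates on overflow
instead of wrapping:

  #include <linux/refcount.h>
  #include <linux/slab.h>

  struct obj {
          refcount_t ref_count;                       /* was: atomic_t ref_count */
  };

  static void obj_init(struct obj *o)
  {
          refcount_set(&o->ref_count, 1);             /* was: atomic_set(..., 1) */
  }

  static void obj_get(struct obj *o)
  {
          refcount_inc(&o->ref_count);                /* was: atomic_inc() */
  }

  static void obj_put(struct obj *o)
  {
          if (refcount_dec_and_test(&o->ref_count))   /* was: atomic_dec_and_test() */
                  kfree(o);
  }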

commit de1892b887eeb85ce458a93979c2108e6f329618
Author: Steve French <smfrench@gmail.com>
Date:   Thu May 4 07:54:04 2017 -0500

    Don't delay freeing mids when blocked on slow socket write of request
    
    When processing responses, and in particular when freeing mids (DeleteMidQEntry),
    which is very important since it also frees the associated buffers (cifs_buf_release),
    we can block for a long time if writes to the socket are slow due to low memory
    or networking issues.

    We can block in send (smb request) waiting for memory, and be blocked in processing
    responses (which could free memory if we let it) - since they both grab the
    server->srv_mutex.

    In practice, in the DeleteMidQEntry case there is no reason we need to
    grab the srv_mutex, so remove it around DeleteMidQEntry; this allows
    us to free memory faster.
    
    Signed-off-by: Steve French <steve.french@primarydata.com>
    Acked-by: Pavel Shilovsky <pshilov@microsoft.com>

commit 560d388950ceda5e7c7cdef7f3d9a8ff297bbf9d
Author: Rabin Vincent <rabinv@axis.com>
Date:   Wed May 3 17:17:21 2017 +0200

    CIFS: silence lockdep splat in cifs_relock_file()
    
    cifs_relock_file() can perform a down_write() on the inode's lock_sem even
    though it was already performed in cifs_strict_readv().  Lockdep complains
    about this.  AFAICS, there is no problem here, and lockdep just needs to be
    told that this nesting is OK.
    
     =============================================
     [ INFO: possible recursive locking detected ]
     4.11.0+ #20 Not tainted
     ---------------------------------------------
     cat/701 is trying to acquire lock:
      (&cifsi->lock_sem){++++.+}, at: cifs_reopen_file+0x7a7/0xc00
    
     but task is already holding lock:
      (&cifsi->lock_sem){++++.+}, at: cifs_strict_readv+0x177/0x310
    
     other info that might help us debug this:
      Possible unsafe locking scenario:
    
            CPU0
            ----
       lock(&cifsi->lock_sem);
       lock(&cifsi->lock_sem);
    
      *** DEADLOCK ***
    
      May be due to missing lock nesting notation
    
     1 lock held by cat/701:
      #0:  (&cifsi->lock_sem){++++.+}, at: cifs_strict_readv+0x177/0x310
    
     stack backtrace:
     CPU: 0 PID: 701 Comm: cat Not tainted 4.11.0+ #20
     Call Trace:
      dump_stack+0x85/0xc2
      __lock_acquire+0x17dd/0x2260
      ? trace_hardirqs_on_thunk+0x1a/0x1c
      ? preempt_schedule_irq+0x6b/0x80
      lock_acquire+0xcc/0x260
      ? lock_acquire+0xcc/0x260
      ? cifs_reopen_file+0x7a7/0xc00
      down_read+0x2d/0x70
      ? cifs_reopen_file+0x7a7/0xc00
      cifs_reopen_file+0x7a7/0xc00
      ? printk+0x43/0x4b
      cifs_readpage_worker+0x327/0x8a0
      cifs_readpage+0x8c/0x2a0
      generic_file_read_iter+0x692/0xd00
      cifs_strict_readv+0x29f/0x310
      generic_file_splice_read+0x11c/0x1c0
      do_splice_to+0xa5/0xc0
      splice_direct_to_actor+0xfa/0x350
      ? generic_pipe_buf_nosteal+0x10/0x10
      do_splice_direct+0xb5/0xe0
      do_sendfile+0x278/0x3a0
      SyS_sendfile64+0xc4/0xe0
      entry_SYSCALL_64_fastpath+0x1f/0xbe
    
    Signed-off-by: Rabin Vincent <rabinv@axis.com>
    Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
    Signed-off-by: Steve French <smfrench@gmail.com>

commit d04a4c76f71dd5335f8e499b59617382d84e2b8d
Author: Heiko Carstens <heiko.carstens@de.ibm.com>
Date:   Thu May 4 09:42:22 2017 +0200

    s390: move _text symbol to address higher than zero
    
    The perf tool assumes that kernel symbols are never present at address
    zero. In fact it assumes that if functions that map symbols to addresses
    return zero, the symbol was not found.
    
    Given that s390's _text symbol has historically been located at address
    zero, this yields at least a couple of false errors and warnings in one of
    perf's test cases about missing symbols ("perf test 1").
    
    To fix this simply move the _text symbol to address 0x200, just behind
    the initial psw and channel program located at the beginning of the
    kernel image. This is now hard coded within the linker script.
    
    I tried a nicer solution which moves the initial psw and channel
    program into its own section. However, that would move the symbols
    within the "real" head.text section to different addresses, since the
    ".org" statements within head.S are relative to the head.text
    section. If there is a new section in front, everything else will be
    moved…
fengguang pushed a commit to 0day-ci/linux that referenced this pull request Jun 12, 2017
Wire up the existing arm64 support for SMBIOS tables (aka DMI) for ARM as
well, by moving the arm64 init code to drivers/firmware/efi/arm-runtime.c
(which is shared between ARM and arm64), and adding a asm/dmi.h header to
ARM that defines the mapping routines for the firmware tables.

This allows userspace to access these tables to discover system information
exposed by the firmware. It also sets the hardware name used in crash
dumps, e.g.:

  Unable to handle kernel NULL pointer dereference at virtual address 00000000
  pgd = ed3c0000
  [00000000] *pgd=bf1f3835
  Internal error: Oops: 817 [#1] SMP THUMB2
  Modules linked in:
  CPU: 0 PID: 759 Comm: bash Not tainted 4.10.0-09601-g0e8f38792120-dirty #112
  Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
  ^^^

NOTE: This does *NOT* enable or encourage the use of DMI quirks, i.e., the
      practice of identifying the platform via DMI to decide whether
      certain workarounds for buggy hardware and/or firmware need to be
      enabled. This would require the DMI subsystem to be enabled much
      earlier than we do on ARM, which is non-trivial.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Russell King <rmk+kernel@armlinux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-efi@vger.kernel.org
Link: http://lkml.kernel.org/r/20170602135207.21708-14-ard.biesheuvel@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
wzyy2 pushed a commit to wzyy2/linux that referenced this pull request Jun 19, 2017
(1) use the cpu id that bl31 delivers;
(2) sp_el0 should point to a kernel address in EL1 mode.

On ARM64 the kernel uses sp_el0 to store current_thread_info(), and
we see a problem: when a fiq occurs, the cpu is in EL1 mode but sp_el0
points to a userspace address. If we read 'current_thread_info()->cpu'
(or anything else) at that moment, it leads to an error.

We find the above situation happens when saving/restoring cpu context
between system mode and user mode under heavy load.
In 'ret_fast_syscall()', for example, the kernel restores the user-mode
context, but the fiq occurs before the 'eret' instruction, which causes
the above situation.

Assembly code:

ffffff80080826c8 <ret_fast_syscall>:

...skipping...

ffffff80080826fc:       d503201f        nop
ffffff8008082700:       d5384100        mrs     x0, sp_el0
ffffff8008082704:       f9400c00        ldr     x0, [x0,#24]
ffffff8008082708:       d5182000        msr     ttbr0_el1, x0
ffffff800808270c:       d5033fdf        isb
ffffff8008082710:       f9407ff7        ldr     x23, [sp,#248]
ffffff8008082714:       d5184117        msr     sp_el0, x23
ffffff8008082718:       d503201f        nop
ffffff800808271c:       d503201f        nop
ffffff8008082720:       d5184035        msr     elr_el1, x21
ffffff8008082724:       d5184016        msr     spsr_el1, x22
ffffff8008082728:       a94007e0        ldp     x0, x1, [sp]
ffffff800808272c:       a9410fe2        ldp     x2, x3, [sp,#16]
ffffff8008082730:       a94217e4        ldp     x4, x5, [sp,#32]
ffffff8008082734:       a9431fe6        ldp     x6, x7, [sp,#48]
ffffff8008082738:       a94427e8        ldp     x8, x9, [sp,#64]
ffffff800808273c:       a9452fea        ldp     x10, x11, [sp,#80]
ffffff8008082740:       a94637ec        ldp     x12, x13, [sp,#96]
ffffff8008082744:       a9473fee        ldp     x14, x15, [sp,#112]
ffffff8008082748:       a94847f0        ldp     x16, x17, [sp,#128]
ffffff800808274c:       a9494ff2        ldp     x18, x19, [sp,#144]
ffffff8008082750:       a94a57f4        ldp     x20, x21, [sp,#160]
ffffff8008082754:       a94b5ff6        ldp     x22, x23, [sp,#176]
ffffff8008082758:       a94c67f8        ldp     x24, x25, [sp,#192]
ffffff800808275c:       a94d6ffa        ldp     x26, x27, [sp,#208]
ffffff8008082760:       a94e77fc        ldp     x28, x29, [sp,#224]
ffffff8008082764:       f9407bfe        ldr     x30, [sp,#240]
ffffff8008082768:       9104c3ff        add     sp, sp, #0x130
ffffff800808276c:       d69f03e0        eret

Change-Id: I071e899f8a407764e166ca0403199c9d87d6ce78
Signed-off-by: chenjh <chenjh@rock-chips.com>
commodo pushed a commit to commodo/linux that referenced this pull request Jun 19, 2018
…lise-support

Add adrv9009 talise support
fengguang pushed a commit to 0day-ci/linux that referenced this pull request Jun 22, 2018
On fast hosts or malicious bots, we trigger a DCCP_BUG() which
seems excessive.

syzbot reported :

BUG: delta (-6195) <= 0 at net/dccp/ccids/ccid3.c:628/ccid3_hc_rx_send_feedback()
CPU: 1 PID: 18 Comm: ksoftirqd/1 Not tainted 4.18.0-rc1+ #112
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
 ccid3_hc_rx_send_feedback net/dccp/ccids/ccid3.c:628 [inline]
 ccid3_hc_rx_packet_recv.cold.16+0x38/0x71 net/dccp/ccids/ccid3.c:793
 ccid_hc_rx_packet_recv net/dccp/ccid.h:185 [inline]
 dccp_deliver_input_to_ccids+0xf0/0x280 net/dccp/input.c:180
 dccp_rcv_established+0x87/0xb0 net/dccp/input.c:378
 dccp_v4_do_rcv+0x153/0x180 net/dccp/ipv4.c:654
 sk_backlog_rcv include/net/sock.h:914 [inline]
 __sk_receive_skb+0x3ba/0xd80 net/core/sock.c:517
 dccp_v4_rcv+0x10f9/0x1f58 net/dccp/ipv4.c:875
 ip_local_deliver_finish+0x2eb/0xda0 net/ipv4/ip_input.c:215
 NF_HOOK include/linux/netfilter.h:287 [inline]
 ip_local_deliver+0x1e9/0x750 net/ipv4/ip_input.c:256
 dst_input include/net/dst.h:450 [inline]
 ip_rcv_finish+0x823/0x2220 net/ipv4/ip_input.c:396
 NF_HOOK include/linux/netfilter.h:287 [inline]
 ip_rcv+0xa18/0x1284 net/ipv4/ip_input.c:492
 __netif_receive_skb_core+0x2488/0x3680 net/core/dev.c:4628
 __netif_receive_skb+0x2c/0x1e0 net/core/dev.c:4693
 process_backlog+0x219/0x760 net/core/dev.c:5373
 napi_poll net/core/dev.c:5771 [inline]
 net_rx_action+0x7da/0x1980 net/core/dev.c:5837
 __do_softirq+0x2e8/0xb17 kernel/softirq.c:284
 run_ksoftirqd+0x86/0x100 kernel/softirq.c:645
 smpboot_thread_fn+0x417/0x870 kernel/smpboot.c:164
 kthread+0x345/0x410 kernel/kthread.c:240
 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Cc: dccp@vger.kernel.org
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Sep 15, 2022
WARNING: line length of 108 exceeds 100 columns
#97: FILE: tools/testing/selftests/vm/mremap_test.c:136:
+	char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

WARNING: Missing a blank line after declarations
#98: FILE: tools/testing/selftests/vm/mremap_test.c:137:
+	char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	munmap(start + page_size, page_size);

ERROR: space required before the open parenthesis '('
#107: FILE: tools/testing/selftests/vm/mremap_test.c:146:
+	while(getline(&line, &len, fp) != -1) {

ERROR: space required after that ',' (ctx:VxV)
#108: FILE: tools/testing/selftests/vm/mremap_test.c:147:
+		char *first = strtok(line,"- ");
 		                         ^

ERROR: space required after that ',' (ctx:VxV)
#110: FILE: tools/testing/selftests/vm/mremap_test.c:149:
+		char *second = strtok(NULL,"- ");
 		                          ^

WARNING: Missing a blank line after declarations
#112: FILE: tools/testing/selftests/vm/mremap_test.c:151:
+		void *second_val = (void *) strtol(second, NULL, 16);
+		if (first_val == start && second_val == start + 3 * page_size) {

total: 3 errors, 3 warnings, 113 lines checked

NOTE: For some of the reported defects, checkpatch may be able to
      mechanically convert to the typical style using --fix or --fix-inplace.

./patches/mm-add-merging-after-mremap-resize.patch has style problems, please review.

NOTE: If any of the errors are false positives, please report
      them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Jakub Matěna <matenajakub@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
staging-kernelci-org pushed a commit to kernelci/linux that referenced this pull request Sep 16, 2022
WARNING: line length of 108 exceeds 100 columns
torvalds#97: FILE: tools/testing/selftests/vm/mremap_test.c:136:
+	char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

WARNING: Missing a blank line after declarations
torvalds#98: FILE: tools/testing/selftests/vm/mremap_test.c:137:
+	char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	munmap(start + page_size, page_size);

ERROR: space required before the open parenthesis '('
torvalds#107: FILE: tools/testing/selftests/vm/mremap_test.c:146:
+	while(getline(&line, &len, fp) != -1) {

ERROR: space required after that ',' (ctx:VxV)
torvalds#108: FILE: tools/testing/selftests/vm/mremap_test.c:147:
+		char *first = strtok(line,"- ");
 		                         ^

ERROR: space required after that ',' (ctx:VxV)
torvalds#110: FILE: tools/testing/selftests/vm/mremap_test.c:149:
+		char *second = strtok(NULL,"- ");
 		                          ^

WARNING: Missing a blank line after declarations
torvalds#112: FILE: tools/testing/selftests/vm/mremap_test.c:151:
+		void *second_val = (void *) strtol(second, NULL, 16);
+		if (first_val == start && second_val == start + 3 * page_size) {

total: 3 errors, 3 warnings, 113 lines checked

NOTE: For some of the reported defects, checkpatch may be able to
      mechanically convert to the typical style using --fix or --fix-inplace.

./patches/mm-add-merging-after-mremap-resize.patch has style problems, please review.

NOTE: If any of the errors are false positives, please report
      them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Jakub Matěna <matenajakub@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Sep 16, 2022
WARNING: line length of 108 exceeds 100 columns
torvalds#97: FILE: tools/testing/selftests/vm/mremap_test.c:136:
+	char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

WARNING: Missing a blank line after declarations
torvalds#98: FILE: tools/testing/selftests/vm/mremap_test.c:137:
+	char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	munmap(start + page_size, page_size);

ERROR: space required before the open parenthesis '('
torvalds#107: FILE: tools/testing/selftests/vm/mremap_test.c:146:
+	while(getline(&line, &len, fp) != -1) {

ERROR: space required after that ',' (ctx:VxV)
torvalds#108: FILE: tools/testing/selftests/vm/mremap_test.c:147:
+		char *first = strtok(line,"- ");
 		                         ^

ERROR: space required after that ',' (ctx:VxV)
torvalds#110: FILE: tools/testing/selftests/vm/mremap_test.c:149:
+		char *second = strtok(NULL,"- ");
 		                          ^

WARNING: Missing a blank line after declarations
torvalds#112: FILE: tools/testing/selftests/vm/mremap_test.c:151:
+		void *second_val = (void *) strtol(second, NULL, 16);
+		if (first_val == start && second_val == start + 3 * page_size) {

total: 3 errors, 3 warnings, 113 lines checked

NOTE: For some of the reported defects, checkpatch may be able to
      mechanically convert to the typical style using --fix or --fix-inplace.

./patches/mm-add-merging-after-mremap-resize.patch has style problems, please review.

NOTE: If any of the errors are false positives, please report
      them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Jakub Matěna <matenajakub@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Sep 19, 2022
WARNING: line length of 108 exceeds 100 columns
torvalds#97: FILE: tools/testing/selftests/vm/mremap_test.c:136:
+	char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

WARNING: Missing a blank line after declarations
torvalds#98: FILE: tools/testing/selftests/vm/mremap_test.c:137:
+	char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	munmap(start + page_size, page_size);

ERROR: space required before the open parenthesis '('
torvalds#107: FILE: tools/testing/selftests/vm/mremap_test.c:146:
+	while(getline(&line, &len, fp) != -1) {

ERROR: space required after that ',' (ctx:VxV)
torvalds#108: FILE: tools/testing/selftests/vm/mremap_test.c:147:
+		char *first = strtok(line,"- ");
 		                         ^

ERROR: space required after that ',' (ctx:VxV)
torvalds#110: FILE: tools/testing/selftests/vm/mremap_test.c:149:
+		char *second = strtok(NULL,"- ");
 		                          ^

WARNING: Missing a blank line after declarations
torvalds#112: FILE: tools/testing/selftests/vm/mremap_test.c:151:
+		void *second_val = (void *) strtol(second, NULL, 16);
+		if (first_val == start && second_val == start + 3 * page_size) {

total: 3 errors, 3 warnings, 113 lines checked

NOTE: For some of the reported defects, checkpatch may be able to
      mechanically convert to the typical style using --fix or --fix-inplace.

./patches/mm-add-merging-after-mremap-resize.patch has style problems, please review.

NOTE: If any of the errors are false positives, please report
      them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Jakub Matěna <matenajakub@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Sep 22, 2022
WARNING: line length of 108 exceeds 100 columns
torvalds#97: FILE: tools/testing/selftests/vm/mremap_test.c:136:
+	char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

WARNING: Missing a blank line after declarations
torvalds#98: FILE: tools/testing/selftests/vm/mremap_test.c:137:
+	char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	munmap(start + page_size, page_size);

ERROR: space required before the open parenthesis '('
torvalds#107: FILE: tools/testing/selftests/vm/mremap_test.c:146:
+	while(getline(&line, &len, fp) != -1) {

ERROR: space required after that ',' (ctx:VxV)
torvalds#108: FILE: tools/testing/selftests/vm/mremap_test.c:147:
+		char *first = strtok(line,"- ");
 		                         ^

ERROR: space required after that ',' (ctx:VxV)
torvalds#110: FILE: tools/testing/selftests/vm/mremap_test.c:149:
+		char *second = strtok(NULL,"- ");
 		                          ^

WARNING: Missing a blank line after declarations
torvalds#112: FILE: tools/testing/selftests/vm/mremap_test.c:151:
+		void *second_val = (void *) strtol(second, NULL, 16);
+		if (first_val == start && second_val == start + 3 * page_size) {

total: 3 errors, 3 warnings, 113 lines checked

NOTE: For some of the reported defects, checkpatch may be able to
      mechanically convert to the typical style using --fix or --fix-inplace.

./patches/mm-add-merging-after-mremap-resize.patch has style problems, please review.

NOTE: If any of the errors are false positives, please report
      them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Jakub Matěna <matenajakub@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Sep 25, 2022
WARNING: line length of 108 exceeds 100 columns
torvalds#97: FILE: tools/testing/selftests/vm/mremap_test.c:136:
+	char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

WARNING: Missing a blank line after declarations
torvalds#98: FILE: tools/testing/selftests/vm/mremap_test.c:137:
+	char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	munmap(start + page_size, page_size);

ERROR: space required before the open parenthesis '('
torvalds#107: FILE: tools/testing/selftests/vm/mremap_test.c:146:
+	while(getline(&line, &len, fp) != -1) {

ERROR: space required after that ',' (ctx:VxV)
torvalds#108: FILE: tools/testing/selftests/vm/mremap_test.c:147:
+		char *first = strtok(line,"- ");
 		                         ^

ERROR: space required after that ',' (ctx:VxV)
torvalds#110: FILE: tools/testing/selftests/vm/mremap_test.c:149:
+		char *second = strtok(NULL,"- ");
 		                          ^

WARNING: Missing a blank line after declarations
torvalds#112: FILE: tools/testing/selftests/vm/mremap_test.c:151:
+		void *second_val = (void *) strtol(second, NULL, 16);
+		if (first_val == start && second_val == start + 3 * page_size) {

total: 3 errors, 3 warnings, 113 lines checked

NOTE: For some of the reported defects, checkpatch may be able to
      mechanically convert to the typical style using --fix or --fix-inplace.

./patches/mm-add-merging-after-mremap-resize.patch has style problems, please review.

NOTE: If any of the errors are false positives, please report
      them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Jakub Matěna <matenajakub@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Sep 27, 2022
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Oct 26, 2022
zsmalloc has 255 size classes. Size classes contain a number of zspages,
which store objects of the same size. A zspage can consist of up to four
physical pages. The optimal zspage size is calculated for each size class
during zsmalloc pool creation.

As a reasonable optimization, zsmalloc merges size classes that have
similar characteristics: number of pages per zspage and number of
objects zspage can store.

For example, let's look at the following size classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
..
   94  1536           0            0             0          0          0                3        0
  100  1632           0            0             0          0          0                2        0
..

Size classes torvalds#95-99 are merged with size class torvalds#100. That is, each time
we store an object of, say, 1568 bytes, instead of using class torvalds#96 we
end up storing it in size class torvalds#100. Class torvalds#100 is for objects of
1632 bytes in size, hence every 1568-byte object wastes 1632 - 1568 = 64 bytes.
Class torvalds#100 zspages consist of 2 physical pages and can hold 5 objects.
When we need to store, say, 13 objects of size 1568, we end up allocating
three zspages; in other words, 6 physical pages.

However, if we look closer at size class torvalds#96 (which should hold objects
of size 1568 bytes) and trace get_pages_per_zspage():

    pages per zspage      wasted bytes     used%
           1                  960           76
           2                  352           95
           3                 1312           89
           4                  704           95
           5                   96           99

We'd notice that the optimal zspage configuration for this class is when it
consists of 5 physical pages, but currently we never let zspages consist of
more than 4 pages. A 5-page class torvalds#96 configuration would
store 13 objects of size 1568 in a single zspage, allocating 5 physical
pages, as opposed to 6 physical pages that class torvalds#100 will allocate.
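
The table above can be reproduced with a few lines of user-space C. This is only a sketch of the packing arithmetic, not the kernel code; it assumes a 4096-byte page and that objects may span page boundaries within a zspage:

#include <stdio.h>

#define PAGE_SIZE	4096

int main(void)
{
	const int size = 1568;	/* the 1568-byte class discussed above */
	int pages;

	for (pages = 1; pages <= 5; pages++) {
		int zspage = pages * PAGE_SIZE;
		int objs = zspage / size;		/* objects per zspage */
		int wasted = zspage - objs * size;	/* unusable tail */

		printf("%d pages: %2d objects, %4d wasted, %d%% used\n",
		       pages, objs, wasted, (zspage - wasted) * 100 / zspage);
	}
	return 0;
}

For 5 pages this gives 96 wasted bytes and 99% utilization, matching the last row of the trace above.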

A higher order zspage for class torvalds#96 also changes its key characteristics:
pages per zspage and objects per zspage. As a result, classes torvalds#96 and torvalds#100
are not merged anymore, which gives us a more compact zsmalloc.

Of course the described effect does not apply only to size classes torvalds#96 and
torvalds#100. We still merge classes, but less often. In other words, classes are
grouped in a more compact way, which decreases memory wastage:

zspage order               # unique size classes
     2                                69
     3                               123
     4                               191

Let's take a closer look at the bottom of /sys/kernel/debug/zsmalloc/zram0/classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  254  4096           0            0             0          0          0                1        0
...

For exactly the same reason - a maximum of 4 pages per zspage - the last non-huge
size class is torvalds#202, which stores objects of size 3264 bytes. Any object
larger than 3264 bytes, hence, is considered to be huge and lands in size
class torvalds#254, which uses a whole physical page to store every object. To put
it slightly differently - objects in huge classes don't share physical pages.
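
The 3264-byte boundary is just packing arithmetic: with at most 4 pages per zspage, class torvalds#202 can still place 5 objects into a 4-page zspage (5 x 3264 = 16320 <= 16384), i.e. more than one object per page, whereas for the next class size up (3280 bytes, going by the 16-byte steps visible in the tables), no zspage of up to 4 pages ever fits more than one object per page, so everything above 3264 bytes falls into the huge class.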

3264 bytes is too low of a watermark and we have too many huge classes:
classes from torvalds#203 to torvalds#254. Similarly to size class torvalds#96 above, higher order
zspages change key characteristics for some of those huge size classes and
thus those classes become normal classes, where stored objects share physical
pages.

Hence yet another consequence of higher order zspages: they move the huge
size class watermark up, leave us with fewer huge classes and let us store
large objects in a more compact way.

For order 3, huge class watermark becomes 3632 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  211  3408           0            0             0          0          0                5        0
  217  3504           0            0             0          0          0                6        0
  222  3584           0            0             0          0          0                7        0
  225  3632           0            0             0          0          0                8        0
  254  4096           0            0             0          0          0                1        0
...

For order 4, huge class watermark becomes 3840 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  206  3328           0            0             0          0          0               13        0
  207  3344           0            0             0          0          0                9        0
  208  3360           0            0             0          0          0               14        0
  211  3408           0            0             0          0          0                5        0
  212  3424           0            0             0          0          0               16        0
  214  3456           0            0             0          0          0               11        0
  217  3504           0            0             0          0          0                6        0
  219  3536           0            0             0          0          0               13        0
  222  3584           0            0             0          0          0                7        0
  223  3600           0            0             0          0          0               15        0
  225  3632           0            0             0          0          0                8        0
  228  3680           0            0             0          0          0                9        0
  230  3712           0            0             0          0          0               10        0
  232  3744           0            0             0          0          0               11        0
  234  3776           0            0             0          0          0               12        0
  235  3792           0            0             0          0          0               13        0
  236  3808           0            0             0          0          0               14        0
  238  3840           0            0             0          0          0               15        0
  254  4096           0            0             0          0          0                1        0
...
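
The new watermarks follow from the same arithmetic. At order 3 the largest zspage is 8 x 4096 = 32768 bytes, and 3632 is the largest class size that still beats one object per page: 9 x 3632 = 32688 <= 32768, which is exactly the 8-page, 9-object configuration of class torvalds#225 above, while 3648-byte objects only fit 8 to an 8-page zspage and therefore stay huge. At order 4 the limit is 16 x 4096 = 65536 bytes and the watermark moves to 3840: 16 x 3840 = 61440 fills the 15-page class torvalds#238 exactly, whereas 3856-byte objects no longer beat one object per page even with 16 pages.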

TESTS
=====

1) ChromeOS memory pressure test
=============================================================================

Our standard memory pressure test, designed with reproducibility in mind.

zram is configured as a swap device with the lzo-rle compression algorithm.
We captured /sys/block/zram0/mm_stat after every test and rebooted the
device.

Columns per (Documentation/admin-guide/blockdev/zram.rst)

orig_data_size compr_data_size mem_used_total mem_limit mem_used_max same_pages pages_compacted huge_pages
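
Not part of the test harness, but an illustrative user-space sketch that reads the eight counters in the order listed above (it assumes this eight-column mm_stat layout and a zram0 device) could look like:

#include <stdio.h>

int main(void)
{
	unsigned long long orig, compr, used, limit, max, same, compacted, huge;
	FILE *fp = fopen("/sys/block/zram0/mm_stat", "r");

	if (!fp)
		return 1;
	if (fscanf(fp, "%llu %llu %llu %llu %llu %llu %llu %llu",
		   &orig, &compr, &used, &limit, &max,
		   &same, &compacted, &huge) != 8) {
		fclose(fp);
		return 1;
	}
	fclose(fp);
	printf("orig=%llu compr=%llu used=%llu max=%llu huge=%llu\n",
	       orig, compr, used, max, huge);
	return 0;
}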

ORDER 2 (BASE) zspage

10353639424 2981711944 3166896128        0 3543158784   579494   825135   123707
10168573952 2932288347 3106541568        0 3499085824   565187   853137   126153
9950461952 2815911234 3035693056        0 3441090560   586696   748054   122103
9892335616 2779566152 2943459328        0 3514736640   591541   650696   119621
9993949184 2814279212 3021357056        0 3336421376   582488   711744   121273
9953226752 2856382009 3025649664        0 3512893440   564559   787861   123034
9838448640 2785481728 2997575680        0 3367219200   573282   777099   122739

ORDER 3 zspage

9509138432 2706941227 2823393280        0 3389587456   535856  1011472    90223
10105245696 2882368370 3013095424        0 3296165888   563896  1059033    94808
9531236352 2666125512 2867650560        0 3396173824   567117  1126396    88807
9561812992 2714536764 2956652544        0 3310505984   548223   827322    90992
9807470592 2790315707 2908053504        0 3378315264   563670  1020933    93725
10178371584 2948838782 3071209472        0 3329548288   548533   954546    90730
9925165056 2849839413 2958274560        0 3336978432   551464  1058302    89381

ORDER 4 zspage

9444515840 2613362645 2668232704        0 3396759552   573735  1162207    83475
10129108992 2925888488 3038351360        0 3499597824   555634  1231542    84525
9876594688 2786692282 2897006592        0 3469463552   584835  1290535    84133
10012909568 2649711847 2801512448        0 3171323904   675405   750728    80424
10120966144 2866742402 2978639872        0 3257815040   587435  1093981    83587
9578790912 2671245225 2802270208        0 3376353280   545548  1047930    80895
10108588032 2888433523 2983960576        0 3316641792   571445  1290640    81402

First, we establish that order 3 and 4 don't cause any statistically
significant change in `orig_data_size` (the number of bytes we store during
the test); in other words, larger zspages don't cause regressions.

T-test for order 3:

x order-2-stored
+ order-3-stored
+-----------------------------------------------------------------------------+
|+ +  +                     +  x   x  +  x   x         +    x+               x|
| |________________________AM__|_________M_____A____|__________|              |
+-----------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   7 9.8384486e+09 1.0353639e+10 9.9532268e+09 1.0021519e+10 1.7916718e+08
+   7 9.5091384e+09 1.0178372e+10 9.8074706e+09 9.8026344e+09 2.7856206e+08
No difference proven at 95.0% confidence

T-test for order 4:

x order-2-stored
+ order-4-stored
+-----------------------------------------------------------------------------+
|                                                         +                   |
|+          +                     x  +x    xx  x +       ++   x              x|
|              |__________________|____A____M____M____________|_|             |
+-----------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   7 9.8384486e+09 1.0353639e+10 9.9532268e+09 1.0021519e+10 1.7916718e+08
+   7 9.4445158e+09 1.0129109e+10  1.001291e+10 9.8959249e+09 2.7947784e+08
No difference proven at 95.0% confidence

Next we establish that there is a statistically significant improvement
in `mem_used_total` metrics.

T-test for order 3:

x order-2-usedmem
+ order-3-usedmem
+-----------------------------------------------------------------------------+
|+         +        +       x ++        x  + xx x       +       x            x|
|        |_________________A__M__|____________|__A________________|           |
+-----------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   7 2.9434593e+09 3.1668961e+09 3.0256497e+09 3.0424532e+09      73235062
+   7 2.8233933e+09 3.0712095e+09 2.9566525e+09 2.9426185e+09      84630851
Difference at 95.0% confidence
	-9.98347e+07 +/- 9.21744e+07
	-3.28139% +/- 3.02961%
	(Student's t, pooled s = 7.91383e+07)
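
(The reported numbers follow directly from the summary rows: the difference is simply the difference of the two means, 2.9426185e9 - 3.0424532e9 ~= -9.98e7 bytes, roughly 95 MiB less memory used on average, and the +/- 9.22e7 half-width is t(0.975, 12 d.o.f.) x pooled-s x sqrt(2/7) ~= 2.179 x 7.91e7 x 0.535.)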

T-test for order 4:

x order-2-usedmem
+ order-4-usedmem
+-----------------------------------------------------------------------------+
|                    +                                 x                      |
|+                   +              +      x    ++ x   x *          x        x|
|             |__________________A__M__________|_____|_M__A__________|        |
+-----------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   7 2.9434593e+09 3.1668961e+09 3.0256497e+09 3.0424532e+09      73235062
+   7 2.6682327e+09 3.0383514e+09 2.8970066e+09 2.8814248e+09 1.3098053e+08
Difference at 95.0% confidence
	-1.61028e+08 +/- 1.23591e+08
	-5.29272% +/- 4.0622%
	(Student's t, pooled s = 1.06111e+08)

Order 3 zspages also show statistically significant improvement in
`mem_used_max` metrics.

T-test for order 3:

x order-2-maxmem
+ order-3-maxmem
+-----------------------------------------------------------------------------+
|+   +     + x+        x  +   + +             x                x    x        x|
|    |________M__A_________|_|_____________________A___________M____________| |
+-----------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   7 3.3364214e+09 3.5431588e+09 3.4990858e+09 3.4592294e+09      80073158
+   7 3.2961659e+09 3.3961738e+09 3.3369784e+09 3.3481822e+09      39840377
Difference at 95.0% confidence
	-1.11047e+08 +/- 7.36589e+07
	-3.21017% +/- 2.12934%
	(Student's t, pooled s = 6.32415e+07)

Order 4 zspages, on the other hand, do not show any statistically significant
improvement in `mem_used_max` metrics.

T-test for order 4:

x order-2-maxmem
+ order-4-maxmem
+-----------------------------------------------------------------------------+
|+                 +           +   x     x +   +        x     +     *  x     x|
|              |_______________________A___M________________A_|_____M_______| |
+-----------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   7 3.3364214e+09 3.5431588e+09 3.4990858e+09 3.4592294e+09      80073158
+   7 3.1713239e+09 3.4995978e+09 3.3763533e+09 3.3554221e+09 1.1609062e+08
No difference proven at 95.0% confidence

Overall, with a sufficient level of confidence, order 3 zspages appear to be
beneficial for this particular use-case and data patterns.

Rather expectedly, we also observed lower huge_pages counts (the mm_stat
column above) when zsmalloc is configured with order 3 and order 4 zspages,
for the reason already explained.

2) Synthetic test
=============================================================================

Test untars linux-6.0.tar.xz and compiles the kernel.

zram is configured as a block device with an ext4 file system and the
lzo-rle compression algorithm. We captured /sys/block/zram0/mm_stat after
every test and rebooted the VM.

orig_data_size compr_data_size mem_used_total mem_limit mem_used_max same_pages pages_compacted huge_pages

ORDER 2 (BASE) zspage

1691791360 628086729 655171584        0 655171584       60        0    34043
1691787264 628089196 655175680        0 655175680       60        0    34046
1691803648 628098840 655187968        0 655187968       59        0    34047
1691795456 628091503 655183872        0 655183872       60        0    34044
1691799552 628086877 655183872        0 655183872       60        0    34047

ORDER 3 zspage

1691803648 627792993 641794048        0 641794048       60        0    33591
1691787264 627779342 641708032        0 641708032       59        0    33591
1691811840 627786616 641769472        0 641769472       60        0    33591
1691803648 627794468 641818624        0 641818624       59        0    33592
1691783168 627780882 641794048        0 641794048       61        0    33591

ORDER 4 zspage

1691803648 627726635 639655936        0 639655936       60        0    33435
1691811840 627733348 639643648        0 639643648       61        0    33434
1691795456 627726290 639614976        0 639614976       60        0    33435
1691803648 627730458 639688704        0 639688704       60        0    33434
1691811840 627727771 639688704        0 639688704       60        0    33434

Order 3 and order 4 show statistically significant improvement in
`mem_used_max` metrics.

T-test for order 3:

x order-2-maxmem
+ order-3-maxmem
+--------------------------------------------------------------------------+
|+                                                                        x|
|+                                                                        x|
|+                                                                        x|
|++                                                                       x|
|A|                                                                       A|
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.4170803e+08 6.4181862e+08 6.4179405e+08 6.4177684e+08     42210.666
Difference at 95.0% confidence
	-1.34038e+07 +/- 44080.7
	-2.04581% +/- 0.00672802%
	(Student's t, pooled s = 30224.5)

T-test for order 4:

x order-2-maxmem
+ order-4-maxmem
+--------------------------------------------------------------------------+
|+                                                                        x|
|+                                                                        x|
|+                                                                        x|
|+                                                                        x|
|+                                                                        x|
|A                                                                        A|
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.3961498e+08  6.396887e+08 6.3965594e+08 6.3965839e+08     31408.602
Difference at 95.0% confidence
	-1.55222e+07 +/- 33126.2
	-2.36915% +/- 0.00505604%
	(Student's t, pooled s = 22713.4)

This test tends to benefit more from order 4 zspages, due to the test's data
patterns.

zsmalloc object distribution analysis
=============================================================================

Order 2 (4 pages per zspage) tends to put many objects in size class 2048,
which is merged with size classes torvalds#112-torvalds#125:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            0          6146       6146       1756                2        0
    74  1216           0            1          4560       4552       1368                3        0
    76  1248           0            1          2938       2934        904                4        0
    83  1360           0            0         10971      10971       3657                1        0
    91  1488           0            0         16126      16126       5864                4        0
    94  1536           0            1          5912       5908       2217                3        0
   100  1632           0            0         11990      11990       4796                2        0
   107  1744           0            1         15771      15768       6759                3        0
   111  1808           0            1         10386      10380       4616                4        0
   126  2048           0            0         45444      45444      22722                1        0
   144  2336           0            0         47446      47446      27112                4        0
   151  2448           1            0         10760      10759       6456                3        0
   168  2720           0            0         10173      10173       6782                2        0
   190  3072           0            1          1700       1697       1275                3        0
   202  3264           0            1           290        286        232                4        0
   254  4096           0            0         34051      34051      34051                1        0
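
This grouping is again a consequence of the packing arithmetic: class sizes grow in 16-byte steps (32 + 16 * index, consistent with the tables above), so classes torvalds#112 (1824 bytes) through torvalds#125 (2032 bytes) end up with exactly the same characteristics as torvalds#126 (2048 bytes) at order 2 - one page per zspage holding two objects, since 2 x 2048 = 4096 - and are therefore merged into torvalds#126; an object of 1824 bytes stored there wastes 2048 - 1824 = 224 bytes.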

Order 3 (8 pages per zspage) changed pool characteristics and unmerged
some of the size classes, which resulted in fewer objects being put into
size class 2048, because lower size classes are now available
for more compact object storage:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            1          2996       2994        856                2        0
    72  1184           0            1          1632       1609        476                7        0
    73  1200           1            0          1445       1442        425                5        0
    74  1216           0            0          1510       1510        453                3        0
    75  1232           0            1          1495       1479        455                7        0
    76  1248           0            1          1456       1451        448                4        0
    78  1280           0            1          3040       3033        950                5        0
    79  1296           0            1          1584       1571        504                7        0
    83  1360           0            0          6375       6375       2125                1        0
    84  1376           0            1          1817       1796        632                8        0
    87  1424           0            1          6020       6006       2107                7        0
    88  1440           0            1          2108       2101        744                6        0
    89  1456           0            1          2072       2064        740                5        0
    91  1488           0            1          4169       4159       1516                4        0
    92  1504           0            1          2014       2007        742                7        0
    94  1536           0            1          3904       3900       1464                3        0
    95  1552           0            1          1890       1873        720                8        0
    96  1568           0            1          1963       1958        755                5        0
    97  1584           0            1          1980       1974        770                7        0
   100  1632           0            1          6190       6187       2476                2        0
   103  1680           0            0          6477       6477       2667                7        0
   104  1696           0            1          2256       2253        940                5        0
   105  1712           0            1          2356       2340        992                8        0
   107  1744           1            0          4697       4696       2013                3        0
   110  1792           0            1          7744       7734       3388                7        0
   111  1808           0            1          2655       2649       1180                4        0
   114  1856           0            1          8371       8365       3805                5        0
   116  1888           1            0          5863       5862       2706                6        0
   117  1904           0            1          2955       2942       1379                7        0
   118  1920           0            1          3009       2997       1416                8        0
   126  2048           0            0         25276      25276      12638                1        0
   128  2080           0            1          6060       6052       3232                8        0
   129  2096           1            0          3081       3080       1659                7        0
   134  2176           0            1         14835      14830       7912                8        0
   135  2192           0            1          2769       2758       1491                7        0
   137  2224           0            1          5082       5077       2772                6        0
   140  2272           0            1          7236       7232       4020                5        0
   144  2336           0            1          8428       8423       4816                4        0
   147  2384           0            1          5316       5313       3101                7        0
   151  2448           0            1          5445       5443       3267                3        0
   155  2512           0            0          4121       4121       2536                8        0
   158  2560           0            1          2208       2205       1380                5        0
   160  2592           0            0          1133       1133        721                7        0
   168  2720           0            0          2712       2712       1808                2        0
   177  2864           1            0          1100       1098        770                7        0
   180  2912           0            1           189        183        135                5        0
   184  2976           0            1           176        166        128                8        0
   190  3072           0            0           252        252        189                3        0
   197  3184           0            1           198        192        154                7        0
   202  3264           0            1           100         96         80                4        0
   211  3408           0            1           210        208        175                5        0
   217  3504           0            1            98         94         84                6        0
   222  3584           0            0           104        104         91                7        0
   225  3632           0            1            54         50         48                8        0
   254  4096           0            0         33591      33591      33591                1        0

Note that the huge size class watermark is now above 3632 bytes, and a number
of new normal classes are available that previously were merged with the huge
class.
For instance, size class torvalds#211 holds 210 objects of size 3408 and uses 175
physical pages, while previously for those objects we would have used 210
physical pages.
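
(With 5 pages per zspage, one zspage of class torvalds#211 is 5 x 4096 = 20480 bytes and holds 6 objects of 3408 bytes, wasting only 32 bytes; the 210 objects above therefore sit in 35 such zspages, i.e. the 175 pages shown, instead of the 210 pages a one-object-per-page huge class would need.)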

Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Oct 27, 2022
jonhunter pushed a commit to jonhunter/linux that referenced this pull request Oct 28, 2022
zsmalloc has 255 size classes.  Size classes contain a number of zspages,
which store objects of the same size. A zspage can consist of up to four
physical pages. The optimal zspage size is calculated for each size class
during zsmalloc pool creation.

As a reasonable optimization, zsmalloc merges size classes that have
similar characteristics: number of pages per zspage and number of objects
zspage can store.

For example, let's look at the following size classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
..
   94  1536           0            0             0          0          0                3        0
  100  1632           0            0             0          0          0                2        0
..

Size classes torvalds#95-99 are merged with size class torvalds#100. That is, each time
we store an object of, say, 1568 bytes, instead of using class torvalds#96 we
end up storing it in size class torvalds#100. Class torvalds#100 is for objects of
1632 bytes in size, hence every 1568-byte object wastes 1632 - 1568 = 64 bytes.
Class torvalds#100 zspages consist of 2 physical pages and can hold 5 objects.
When we need to store, say, 13 objects of size 1568, we end up allocating
three zspages; in other words, 6 physical pages.

However, if we look closer at size class torvalds#96 (which should hold objects
of size 1568 bytes) and trace get_pages_per_zspage():

    pages per zspage      wasted bytes     used%
           1                  960           76
           2                  352           95
           3                 1312           89
           4                  704           95
           5                   96           99

We'd notice that the optimal zspage configuration for this class is when it
consists of 5 physical pages, but currently we never let zspages consist of
more than 4 pages. A 5-page class torvalds#96 configuration would
store 13 objects of size 1568 in a single zspage, allocating 5 physical
pages, as opposed to 6 physical pages that class torvalds#100 will allocate.

A higher order zspage for class torvalds#96 also changes its key characteristics:
pages per zspage and objects per zspage. As a result, classes torvalds#96 and torvalds#100
are not merged anymore, which gives us a more compact zsmalloc.

Of course the described effect does not apply only to size classes torvalds#96 and
torvalds#100. We still merge classes, but less often. In other words, classes are
grouped in a more compact way, which decreases memory wastage:

zspage order               # unique size classes
     2                                69
     3                               123
     4                               191

Let's take a closer look at the bottom of /sys/kernel/debug/zsmalloc/zram0/classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  254  4096           0            0             0          0          0                1        0
...

For exactly the same reason - a maximum of 4 pages per zspage - the last non-huge
size class is torvalds#202, which stores objects of size 3264 bytes. Any object
larger than 3264 bytes, hence, is considered to be huge and lands in size
class torvalds#254, which uses a whole physical page to store every object. To put
it slightly differently - objects in huge classes don't share physical pages.

3264 bytes is too low of a watermark and we have too many huge classes:
classes from torvalds#203 to torvalds#254. Similarly to size class torvalds#96 above, higher order
zspages change key characteristics for some of those huge size classes and
thus those classes become normal classes, where stored objects share physical
pages.

Hence yet another consequence of higher order zspages: they move the huge
size class watermark up, leave us with fewer huge classes and let us store
large objects in a more compact way.

For order 3, huge class watermark becomes 3632 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  211  3408           0            0             0          0          0                5        0
  217  3504           0            0             0          0          0                6        0
  222  3584           0            0             0          0          0                7        0
  225  3632           0            0             0          0          0                8        0
  254  4096           0            0             0          0          0                1        0
...

For order 4, huge class watermark becomes 3840 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  206  3328           0            0             0          0          0               13        0
  207  3344           0            0             0          0          0                9        0
  208  3360           0            0             0          0          0               14        0
  211  3408           0            0             0          0          0                5        0
  212  3424           0            0             0          0          0               16        0
  214  3456           0            0             0          0          0               11        0
  217  3504           0            0             0          0          0                6        0
  219  3536           0            0             0          0          0               13        0
  222  3584           0            0             0          0          0                7        0
  223  3600           0            0             0          0          0               15        0
  225  3632           0            0             0          0          0                8        0
  228  3680           0            0             0          0          0                9        0
  230  3712           0            0             0          0          0               10        0
  232  3744           0            0             0          0          0               11        0
  234  3776           0            0             0          0          0               12        0
  235  3792           0            0             0          0          0               13        0
  236  3808           0            0             0          0          0               14        0
  238  3840           0            0             0          0          0               15        0
  254  4096           0            0             0          0          0                1        0
...

TESTS
=====

Test untars linux-6.0.tar.xz and compiles the kernel.

zram is configured as a block device with an ext4 file system and the
lzo-rle compression algorithm. We captured /sys/block/zram0/mm_stat after
every test and rebooted the VM.

orig_data_size compr_data_size mem_used_total mem_limit mem_used_max same_pages pages_compacted huge_pages

ORDER 2 (BASE) zspage

1691791360 628086729 655171584        0 655171584       60        0    34043
1691787264 628089196 655175680        0 655175680       60        0    34046
1691803648 628098840 655187968        0 655187968       59        0    34047
1691795456 628091503 655183872        0 655183872       60        0    34044
1691799552 628086877 655183872        0 655183872       60        0    34047

ORDER 3 zspage

1691803648 627792993 641794048        0 641794048       60        0    33591
1691787264 627779342 641708032        0 641708032       59        0    33591
1691811840 627786616 641769472        0 641769472       60        0    33591
1691803648 627794468 641818624        0 641818624       59        0    33592
1691783168 627780882 641794048        0 641794048       61        0    33591

ORDER 4 zspage

1691803648 627726635 639655936        0 639655936       60        0    33435
1691811840 627733348 639643648        0 639643648       61        0    33434
1691795456 627726290 639614976        0 639614976       60        0    33435
1691803648 627730458 639688704        0 639688704       60        0    33434
1691811840 627727771 639688704        0 639688704       60        0    33434

Order 3 and order 4 show statistically significant improvement in
`mem_used_max` metrics.

T-test for order 3:

x order-2-maxmem
+ order-3-maxmem
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.4170803e+08 6.4181862e+08 6.4179405e+08 6.4177684e+08     42210.666
Difference at 95.0% confidence
	-1.34038e+07 +/- 44080.7
	-2.04581% +/- 0.00672802%
	(Student's t, pooled s = 30224.5)

T-test for order 4:

x order-2-maxmem
+ order-4-maxmem
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.3961498e+08  6.396887e+08 6.3965594e+08 6.3965839e+08     31408.602
Difference at 95.0% confidence
	-1.55222e+07 +/- 33126.2
	-2.36915% +/- 0.00505604%
	(Student's t, pooled s = 22713.4)

This test tends to benefit more from order 4 zspages, due to the test's data
patterns.

zsmalloc object distribution analysis
=============================================================================

Order 2 (4 pages per zspage) tends to put many objects in size class 2048,
which is merged with size classes torvalds#112-torvalds#125:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            0          6146       6146       1756                2        0
    74  1216           0            1          4560       4552       1368                3        0
    76  1248           0            1          2938       2934        904                4        0
    83  1360           0            0         10971      10971       3657                1        0
    91  1488           0            0         16126      16126       5864                4        0
    94  1536           0            1          5912       5908       2217                3        0
   100  1632           0            0         11990      11990       4796                2        0
   107  1744           0            1         15771      15768       6759                3        0
   111  1808           0            1         10386      10380       4616                4        0
   126  2048           0            0         45444      45444      22722                1        0
   144  2336           0            0         47446      47446      27112                4        0
   151  2448           1            0         10760      10759       6456                3        0
   168  2720           0            0         10173      10173       6782                2        0
   190  3072           0            1          1700       1697       1275                3        0
   202  3264           0            1           290        286        232                4        0
   254  4096           0            0         34051      34051      34051                1        0

Order 3 (8 pages per zspage) changed the pool characteristics and unmerged
some of the size classes, which resulted in fewer objects being put into
size class 2048, because lower size classes are now available for more
compact object storage:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            1          2996       2994        856                2        0
    72  1184           0            1          1632       1609        476                7        0
    73  1200           1            0          1445       1442        425                5        0
    74  1216           0            0          1510       1510        453                3        0
    75  1232           0            1          1495       1479        455                7        0
    76  1248           0            1          1456       1451        448                4        0
    78  1280           0            1          3040       3033        950                5        0
    79  1296           0            1          1584       1571        504                7        0
    83  1360           0            0          6375       6375       2125                1        0
    84  1376           0            1          1817       1796        632                8        0
    87  1424           0            1          6020       6006       2107                7        0
    88  1440           0            1          2108       2101        744                6        0
    89  1456           0            1          2072       2064        740                5        0
    91  1488           0            1          4169       4159       1516                4        0
    92  1504           0            1          2014       2007        742                7        0
    94  1536           0            1          3904       3900       1464                3        0
    95  1552           0            1          1890       1873        720                8        0
    96  1568           0            1          1963       1958        755                5        0
    97  1584           0            1          1980       1974        770                7        0
   100  1632           0            1          6190       6187       2476                2        0
   103  1680           0            0          6477       6477       2667                7        0
   104  1696           0            1          2256       2253        940                5        0
   105  1712           0            1          2356       2340        992                8        0
   107  1744           1            0          4697       4696       2013                3        0
   110  1792           0            1          7744       7734       3388                7        0
   111  1808           0            1          2655       2649       1180                4        0
   114  1856           0            1          8371       8365       3805                5        0
   116  1888           1            0          5863       5862       2706                6        0
   117  1904           0            1          2955       2942       1379                7        0
   118  1920           0            1          3009       2997       1416                8        0
   126  2048           0            0         25276      25276      12638                1        0
   128  2080           0            1          6060       6052       3232                8        0
   129  2096           1            0          3081       3080       1659                7        0
   134  2176           0            1         14835      14830       7912                8        0
   135  2192           0            1          2769       2758       1491                7        0
   137  2224           0            1          5082       5077       2772                6        0
   140  2272           0            1          7236       7232       4020                5        0
   144  2336           0            1          8428       8423       4816                4        0
   147  2384           0            1          5316       5313       3101                7        0
   151  2448           0            1          5445       5443       3267                3        0
   155  2512           0            0          4121       4121       2536                8        0
   158  2560           0            1          2208       2205       1380                5        0
   160  2592           0            0          1133       1133        721                7        0
   168  2720           0            0          2712       2712       1808                2        0
   177  2864           1            0          1100       1098        770                7        0
   180  2912           0            1           189        183        135                5        0
   184  2976           0            1           176        166        128                8        0
   190  3072           0            0           252        252        189                3        0
   197  3184           0            1           198        192        154                7        0
   202  3264           0            1           100         96         80                4        0
   211  3408           0            1           210        208        175                5        0
   217  3504           0            1            98         94         84                6        0
   222  3584           0            0           104        104         91                7        0
   225  3632           0            1            54         50         48                8        0
   254  4096           0            0         33591      33591      33591                1        0
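
To make the effect of the merging concrete, consider a hypothetical 1850-byte
object (the size is illustrative, not taken from the trace). Under order 2
the smallest class that fits is the merged 2048-byte class #126, wasting
198 bytes per object; under order 3 the unmerged 1856-byte class #114 fits it
with only 6 bytes of waste:

/* Minimal sketch: per-object waste for a hypothetical 1850-byte object.
 * Slot sizes are taken from the class tables above: under order 2 the next
 * class above 1808 bytes is 2048 (#126), under order 3 it is 1856 (#114). */
#include <stdio.h>

int main(void)
{
	unsigned int obj = 1850;
	unsigned int order2_slot = 2048;
	unsigned int order3_slot = 1856;

	printf("order 2: %u-byte slot, %u bytes wasted\n",
	       order2_slot, order2_slot - obj);
	printf("order 3: %u-byte slot, %u bytes wasted\n",
	       order3_slot, order3_slot - obj);
	return 0;
}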

Note that the huge size watermark is above 3632 bytes and a number of new
normal classes are available that previously were merged with the huge class.
For instance, size class #211 holds 210 objects of size 3408 and uses 175
physical pages, while previously those objects would have used 210 physical
pages.
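
As a cross-check of those figures (assuming 4 KiB pages): an order-3 zspage
for class #211 spans 5 physical pages (20480 bytes) and so holds 6 objects of
3408 bytes, hence 210 objects need 35 zspages, i.e. 175 pages, whereas the
huge class stores one object per page, i.e. 210 pages. A minimal sketch:

/* Minimal sketch: reproduce the 175-vs-210 page figure for 210 objects
 * of size 3408 stored in order-3 class #211 (5 pages per zspage). */
#include <stdio.h>

int main(void)
{
	unsigned int page_size = 4096;		/* assumed 4 KiB pages */
	unsigned int obj_size = 3408;		/* class #211 */
	unsigned int pages_per_zspage = 5;	/* order-3 value from the table */
	unsigned int nr_objs = 210;

	unsigned int objs_per_zspage = pages_per_zspage * page_size / obj_size;
	unsigned int zspages = (nr_objs + objs_per_zspage - 1) / objs_per_zspage;

	printf("normal class: %u physical pages\n", zspages * pages_per_zspage);
	printf("huge class:   %u physical pages\n", nr_objs);
	return 0;
}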

Link: https://lkml.kernel.org/r/20221027042651.234524-3-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Alexey Romanov <avromanov@sberdevices.ru>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Oct 29, 2022
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Nov 1, 2022
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Nov 1, 2022
   222  3584           0            0           104        104         91                7        0
   225  3632           0            1            54         50         48                8        0
   254  4096           0            0         33591      33591      33591                1        0

Note that the huge size class watermark is now above 3632 bytes and a number
of new normal classes are available that previously were merged with the huge
class. For instance, size class #211 holds 210 objects of size 3408 bytes and
uses 175 physical pages, while previously those objects would have used 210
physical pages.
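
The page counts follow from simple arithmetic. Below is a small user-space
sketch, given for illustration only (it assumes a PAGE_SIZE of 4096 bytes and
plugs in the class #211 parameters quoted above; it is not kernel code), that
computes the pages needed to store the same 210 objects in the order-3 class
versus one page per object in the huge class:

#include <stdio.h>

#define PAGE_SIZE 4096UL

/* Pages needed when objects share zspages of pages_per_zspage pages each. */
static unsigned long class_pages_needed(unsigned long obj_size,
					unsigned long pages_per_zspage,
					unsigned long nr_objs)
{
	unsigned long objs_per_zspage = pages_per_zspage * PAGE_SIZE / obj_size;
	unsigned long zspages = (nr_objs + objs_per_zspage - 1) / objs_per_zspage;

	return zspages * pages_per_zspage;
}

int main(void)
{
	/* Size class #211: 3408-byte objects, 5 pages per zspage (order-3 pool). */
	printf("order-3 class #211: %lu pages for 210 objects\n",
	       class_pages_needed(3408, 5, 210));	/* prints 175 */
	/* Huge class: every object occupies a whole physical page. */
	printf("huge class:         %lu pages for 210 objects\n", 210UL);
	return 0;
}

A 5-page zspage holds 6 objects of 3408 bytes (20480 / 3408, rounded down), so
210 objects fit into 35 zspages, i.e. 175 physical pages, instead of one page
per object.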

Link: https://lkml.kernel.org/r/20221031054108.541190-3-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Alexey Romanov <avromanov@sberdevices.ru>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Nov 2, 2022
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Nov 3, 2022
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Nov 5, 2022
zsmalloc has 255 size classes.  Size classes contain a number of zspages,
which store objects of the same size.  zspage can consist of up to four
physical pages.  The exact (most optimal) zspage size is calculated for
each size class during zsmalloc pool creation.

As a reasonable optimization, zsmalloc merges size classes that have
similar characteristics: number of pages per zspage and number of objects
zspage can store.

For example, let's look at the following size classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
..
   94  1536           0            0             0          0          0                3        0
  100  1632           0            0             0          0          0                2        0
..

Size classes torvalds#95-99 are merged with size class torvalds#100. That is, each time
we store an object of size, say, 1568 bytes instead of using class torvalds#96
we end up storing it in size class torvalds#100. Class torvalds#100 is for objects of
1632 bytes in size, hence every 1568 bytes object wastes 1632-1568 bytes.
Class torvalds#100 zspages consist of 2 physical pages and can hold 5 objects.
When we need to store, say, 13 objects of size 1568 we end up allocating
three zspages; in other words, 6 physical pages.

However, if we'll look closer at size class torvalds#96 (which should hold objects
of size 1568 bytes) and trace get_pages_per_zspage():

    pages per zspage      wasted bytes     used%
           1                  960           76
           2                  352           95
           3                 1312           89
           4                  704           95
           5                   96           99

We'd notice that the most optimal zspage configuration for this class is
when it consists of 5 physical pages, but currently we never let zspages
to consists of more than 4 pages. A 5 page class torvalds#96 configuration would
store 13 objects of size 1568 in a single zspage, allocating 5 physical
pages, as opposed to 6 physical pages that class torvalds#100 will allocate.

A higher order zspage for class torvalds#96 also changes its key characteristics:
pages per-zspage and objects per-zspage. As a result classes torvalds#96 and torvalds#100
are not merged anymore, which gives us more compact zsmalloc.

Of course the described effect does not apply only to size classes torvalds#96 and
We still merge classes, but less often so. In other words classes are grouped
in a more compact way, which decreases memory wastage:

zspage order               # unique size classes
     2                                69
     3                               123
     4                               191

Let's take a closer look at the bottom of /sys/kernel/debug/zsmalloc/zram0/classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  254  4096           0            0             0          0          0                1        0
...

For exactly same reason - maximum 4 pages per zspage - the last non-huge
size class is torvalds#202, which stores objects of size 3264 bytes. Any object
larger than 3264 bytes, hence, is considered to be huge and lands in size
class torvalds#254, which uses a whole physical page to store every object. To put
it slightly differently - objects in huge classes don't share physical pages.

3264 bytes is too low of a watermark and we have too many huge classes:
classes from torvalds#203 to torvalds#254. Similarly to class size torvalds#96 above, higher order
zspages change key characteristics for some of those huge size classes and
thus those classes become normal classes, where stored objects share physical
pages.

Hence yet another consequence of higher order zspages: we move the huge
size class watermark with higher order zspages, have less huge classes and
store large objects in a more compact way.

For order 3, huge class watermark becomes 3632 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  211  3408           0            0             0          0          0                5        0
  217  3504           0            0             0          0          0                6        0
  222  3584           0            0             0          0          0                7        0
  225  3632           0            0             0          0          0                8        0
  254  4096           0            0             0          0          0                1        0
...

For order 4, huge class watermark becomes 3840 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  206  3328           0            0             0          0          0               13        0
  207  3344           0            0             0          0          0                9        0
  208  3360           0            0             0          0          0               14        0
  211  3408           0            0             0          0          0                5        0
  212  3424           0            0             0          0          0               16        0
  214  3456           0            0             0          0          0               11        0
  217  3504           0            0             0          0          0                6        0
  219  3536           0            0             0          0          0               13        0
  222  3584           0            0             0          0          0                7        0
  223  3600           0            0             0          0          0               15        0
  225  3632           0            0             0          0          0                8        0
  228  3680           0            0             0          0          0                9        0
  230  3712           0            0             0          0          0               10        0
  232  3744           0            0             0          0          0               11        0
  234  3776           0            0             0          0          0               12        0
  235  3792           0            0             0          0          0               13        0
  236  3808           0            0             0          0          0               14        0
  238  3840           0            0             0          0          0               15        0
  254  4096           0            0             0          0          0                1        0
...

TESTS
=====

Test untars linux-6.0.tar.xz and compiles the kernel.

zram is configured as a block device with ext4 file system, lzo-rle
compression algorithm. We captured /sys/block/zram0/mm_stat after
every test and rebooted the VM.

orig_data_size       mem_used_total     mem_used_max       pages_compacted
          compr_data_size         mem_limit         same_pages       huge_pages

ORDER 2 (BASE) zspage

1691791360 628086729 655171584        0 655171584       60        0    34043
1691787264 628089196 655175680        0 655175680       60        0    34046
1691803648 628098840 655187968        0 655187968       59        0    34047
1691795456 628091503 655183872        0 655183872       60        0    34044
1691799552 628086877 655183872        0 655183872       60        0    34047

ORDER 3 zspage

1691803648 627792993 641794048        0 641794048       60        0    33591
1691787264 627779342 641708032        0 641708032       59        0    33591
1691811840 627786616 641769472        0 641769472       60        0    33591
1691803648 627794468 641818624        0 641818624       59        0    33592
1691783168 627780882 641794048        0 641794048       61        0    33591

ORDER 4 zspage

1691803648 627726635 639655936        0 639655936       60        0    33435
1691811840 627733348 639643648        0 639643648       61        0    33434
1691795456 627726290 639614976        0 639614976       60        0    33435
1691803648 627730458 639688704        0 639688704       60        0    33434
1691811840 627727771 639688704        0 639688704       60        0    33434

Order 3 and order 4 show statistically significant improvement in
`mem_used_max` metrics.

T-test for order 3:

x order-2-maxmem
+ order-3-maxmem
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.4170803e+08 6.4181862e+08 6.4179405e+08 6.4177684e+08     42210.666
Difference at 95.0% confidence
	-1.34038e+07 +/- 44080.7
	-2.04581% +/- 0.00672802%
	(Student's t, pooled s = 30224.5)

T-test for order 4:

x order-2-maxmem
+ order-4-maxmem
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.3961498e+08  6.396887e+08 6.3965594e+08 6.3965839e+08     31408.602
Difference at 95.0% confidence
	-1.55222e+07 +/- 33126.2
	-2.36915% +/- 0.00505604%
	(Student's t, pooled s = 22713.4)

This test tends to benefit more from order 4 zspages, due to test's data
patterns.

zsmalloc object distribution analysis
=============================================================================

Order 2 (4 pages per zspage) tends to put many objects in size class 2048,
which is merged with size classes torvalds#112-torvalds#125:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            0          6146       6146       1756                2        0
    74  1216           0            1          4560       4552       1368                3        0
    76  1248           0            1          2938       2934        904                4        0
    83  1360           0            0         10971      10971       3657                1        0
    91  1488           0            0         16126      16126       5864                4        0
    94  1536           0            1          5912       5908       2217                3        0
   100  1632           0            0         11990      11990       4796                2        0
   107  1744           0            1         15771      15768       6759                3        0
   111  1808           0            1         10386      10380       4616                4        0
   126  2048           0            0         45444      45444      22722                1        0
   144  2336           0            0         47446      47446      27112                4        0
   151  2448           1            0         10760      10759       6456                3        0
   168  2720           0            0         10173      10173       6782                2        0
   190  3072           0            1          1700       1697       1275                3        0
   202  3264           0            1           290        286        232                4        0
   254  4096           0            0         34051      34051      34051                1        0

Order 3 (8 pages per zspage) changed pool characteristics and unmerged
some of the size classes, which resulted in less objects being put into
size class 2048, because there are lower size classes are now available
for more compact object storage:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            1          2996       2994        856                2        0
    72  1184           0            1          1632       1609        476                7        0
    73  1200           1            0          1445       1442        425                5        0
    74  1216           0            0          1510       1510        453                3        0
    75  1232           0            1          1495       1479        455                7        0
    76  1248           0            1          1456       1451        448                4        0
    78  1280           0            1          3040       3033        950                5        0
    79  1296           0            1          1584       1571        504                7        0
    83  1360           0            0          6375       6375       2125                1        0
    84  1376           0            1          1817       1796        632                8        0
    87  1424           0            1          6020       6006       2107                7        0
    88  1440           0            1          2108       2101        744                6        0
    89  1456           0            1          2072       2064        740                5        0
    91  1488           0            1          4169       4159       1516                4        0
    92  1504           0            1          2014       2007        742                7        0
    94  1536           0            1          3904       3900       1464                3        0
    95  1552           0            1          1890       1873        720                8        0
    96  1568           0            1          1963       1958        755                5        0
    97  1584           0            1          1980       1974        770                7        0
   100  1632           0            1          6190       6187       2476                2        0
   103  1680           0            0          6477       6477       2667                7        0
   104  1696           0            1          2256       2253        940                5        0
   105  1712           0            1          2356       2340        992                8        0
   107  1744           1            0          4697       4696       2013                3        0
   110  1792           0            1          7744       7734       3388                7        0
   111  1808           0            1          2655       2649       1180                4        0
   114  1856           0            1          8371       8365       3805                5        0
   116  1888           1            0          5863       5862       2706                6        0
   117  1904           0            1          2955       2942       1379                7        0
   118  1920           0            1          3009       2997       1416                8        0
   126  2048           0            0         25276      25276      12638                1        0
   128  2080           0            1          6060       6052       3232                8        0
   129  2096           1            0          3081       3080       1659                7        0
   134  2176           0            1         14835      14830       7912                8        0
   135  2192           0            1          2769       2758       1491                7        0
   137  2224           0            1          5082       5077       2772                6        0
   140  2272           0            1          7236       7232       4020                5        0
   144  2336           0            1          8428       8423       4816                4        0
   147  2384           0            1          5316       5313       3101                7        0
   151  2448           0            1          5445       5443       3267                3        0
   155  2512           0            0          4121       4121       2536                8        0
   158  2560           0            1          2208       2205       1380                5        0
   160  2592           0            0          1133       1133        721                7        0
   168  2720           0            0          2712       2712       1808                2        0
   177  2864           1            0          1100       1098        770                7        0
   180  2912           0            1           189        183        135                5        0
   184  2976           0            1           176        166        128                8        0
   190  3072           0            0           252        252        189                3        0
   197  3184           0            1           198        192        154                7        0
   202  3264           0            1           100         96         80                4        0
   211  3408           0            1           210        208        175                5        0
   217  3504           0            1            98         94         84                6        0
   222  3584           0            0           104        104         91                7        0
   225  3632           0            1            54         50         48                8        0
   254  4096           0            0         33591      33591      33591                1        0

Note, the huge size watermark is above 3632 and there are a number of new
normal classes available that previously were merged with the huge class.
For instance, size class #211 holds 210 objects of size 3408 and uses 175
physical pages, while previously for those objects we would have used 210
physical pages.
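
As a sanity check of that page math, the following standalone C snippet (a
hedged sketch, not kernel code; the constants come from the order-3 table
above and PAGE_SIZE is assumed to be 4096) reproduces the 175-page figure
for class #211:

  /*
   * Hedged sketch, not kernel code: reproduce the class #211 page math
   * from the table above, assuming PAGE_SIZE == 4096 and the order-3
   * limit of 8 pages per zspage.
   */
  #include <stdio.h>

  int main(void)
  {
          const int page_size = 4096;
          const int class_size = 3408;      /* size class #211 */
          const int pages_per_zspage = 5;   /* from the order-3 table */
          int objs_per_zspage = pages_per_zspage * page_size / class_size; /* 6 */
          int objects = 210;
          int zspages = (objects + objs_per_zspage - 1) / objs_per_zspage; /* 35 */

          printf("normal class pages: %d\n", zspages * pages_per_zspage);  /* 175 */
          printf("huge class pages:   %d\n", objects);                     /* 210 */
          return 0;
  }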

Link: https://lkml.kernel.org/r/20221031054108.541190-3-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Alexey Romanov <avromanov@sberdevices.ru>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
jonhunter pushed a commit to jonhunter/linux that referenced this pull request Nov 7, 2022
jonhunter pushed a commit to jonhunter/linux that referenced this pull request Nov 8, 2022
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Nov 9, 2022
gatieme pushed a commit to gatieme/linux that referenced this pull request Nov 24, 2022
ANBZ: #112

commit bd74048 upstream

If we hit an earlier error path in io_uring_create(), then we will have
accounted memory, but not set ctx->{sq,cq}_entries yet. Then when the
ring is torn down in error, we use those values to unaccount the memory.

Ensure we set the ctx entries before we're able to hit a potential error
path.
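
To make the ordering concrete, here is a minimal userspace sketch (hedged,
with made-up names; not the actual io_uring code): the entry counts are
recorded immediately after the memory is accounted, so an error path that
unaccounts based on those counts balances out even when it runs early.

  /*
   * Hedged sketch with hypothetical names, not the real io_uring code:
   * record the entry counts right after accounting, before any error
   * path can run, so error-path unaccounting sees the same values.
   */
  #include <stdio.h>
  #include <stdlib.h>

  struct ring_ctx {
          unsigned sq_entries;
          unsigned cq_entries;
  };

  static long accounted;

  static void ring_unaccount(struct ring_ctx *ctx)
  {
          /* would subtract 0 if the counts had not been set yet */
          accounted -= (long)ctx->sq_entries + ctx->cq_entries;
  }

  static struct ring_ctx *ring_create(unsigned sq, unsigned cq, int fail_early)
  {
          struct ring_ctx *ctx = calloc(1, sizeof(*ctx));

          accounted += (long)sq + cq;     /* memory accounted first...  */
          ctx->sq_entries = sq;           /* ...so set the counts now,  */
          ctx->cq_entries = cq;           /* before any bail-out below  */

          if (fail_early) {               /* an "earlier error path"    */
                  ring_unaccount(ctx);
                  free(ctx);
                  return NULL;
          }
          return ctx;
  }

  int main(void)
  {
          ring_create(128, 256, 1);
          printf("accounted after failed create: %ld\n", accounted);  /* 0 */
          return 0;
  }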

Cc: stable@vger.kernel.org
Reported-by: Tomáš Chaloupka <chalucha@gmail.com>
Tested-by: Tomáš Chaloupka <chalucha@gmail.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Damenly pushed a commit to Damenly/linux that referenced this pull request Jul 25, 2023
Signed-off-by: nascs <nascs@radxa.com>
mj22226 pushed a commit to mj22226/linux that referenced this pull request Jul 30, 2023
Signed-off-by: nascs <nascs@radxa.com>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Nov 27, 2023
With latest upstream llvm18, the following test cases failed:
  $ ./test_progs -j
  torvalds#13/2    bpf_cookie/multi_kprobe_link_api:FAIL
  torvalds#13/3    bpf_cookie/multi_kprobe_attach_api:FAIL
  torvalds#13      bpf_cookie:FAIL
  torvalds#77      fentry_fexit:FAIL
  torvalds#78/1    fentry_test/fentry:FAIL
  torvalds#78      fentry_test:FAIL
  torvalds#82/1    fexit_test/fexit:FAIL
  torvalds#82      fexit_test:FAIL
  torvalds#112/1   kprobe_multi_test/skel_api:FAIL
  torvalds#112/2   kprobe_multi_test/link_api_addrs:FAIL
  ...
  torvalds#112     kprobe_multi_test:FAIL
  torvalds#356/17  test_global_funcs/global_func17:FAIL
  torvalds#356     test_global_funcs:FAIL

Further analysis shows llvm upstream patch [1] is responsible
for the above failures. For example, for function bpf_fentry_test7()
in net/bpf/test_run.c, without [1], the asm code is:
  0000000000000400 <bpf_fentry_test7>:
     400: f3 0f 1e fa                   endbr64
     404: e8 00 00 00 00                callq   0x409 <bpf_fentry_test7+0x9>
     409: 48 89 f8                      movq    %rdi, %rax
     40c: c3                            retq
     40d: 0f 1f 00                      nopl    (%rax)
and with [1], the asm code is:
  0000000000005d20 <bpf_fentry_test7.specialized.1>:
    5d20: e8 00 00 00 00                callq   0x5d25 <bpf_fentry_test7.specialized.1+0x5>
    5d25: c3                            retq
and <bpf_fentry_test7.specialized.1> is called instead of <bpf_fentry_test7>
and this caused test failures for #13/#77 etc. except #356.

For test case #356/17, with [1] (progs/test_global_func17.c),
the main prog looks like:
  0000000000000000 <global_func17>:
       0:       b4 00 00 00 2a 00 00 00 w0 = 0x2a
       1:       95 00 00 00 00 00 00 00 exit
which passed verification while the test itself expects a verification
failure.

Let us add 'barrier_var' style asm code in both places to prevent
function specialization which caused selftests failure.

  [1] llvm/llvm-project#72903
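
For reference, barrier_var() is a compiler barrier provided by libbpf's bpf_helpers.h.
The sketch below shows the general idiom (the exact macro definition may vary between
versions, and the function name here is purely illustrative): an empty asm statement
with the variable as an in/out operand makes the value opaque to the optimizer, so the
argument can no longer be assumed constant and the function cannot be specialized away.

  /* barrier_var-style compiler barrier: "+r" tells the compiler the value
   * may change, defeating constant propagation and specialization. */
  #define barrier_var(var) asm volatile("" : "+r"(var))

  __attribute__((noinline)) long keep_me_generic(long arg)
  {
          barrier_var(arg);       /* 'arg' is now opaque to the optimizer */
          return arg;
  }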

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
alobakin pushed a commit to alobakin/linux that referenced this pull request Nov 27, 2023
With latest upstream llvm18, the following test cases failed:

  $ ./test_progs -j
  #13/2    bpf_cookie/multi_kprobe_link_api:FAIL
  #13/3    bpf_cookie/multi_kprobe_attach_api:FAIL
  #13      bpf_cookie:FAIL
  torvalds#77      fentry_fexit:FAIL
  torvalds#78/1    fentry_test/fentry:FAIL
  torvalds#78      fentry_test:FAIL
  torvalds#82/1    fexit_test/fexit:FAIL
  torvalds#82      fexit_test:FAIL
  torvalds#112/1   kprobe_multi_test/skel_api:FAIL
  torvalds#112/2   kprobe_multi_test/link_api_addrs:FAIL
  [...]
  torvalds#112     kprobe_multi_test:FAIL
  torvalds#356/17  test_global_funcs/global_func17:FAIL
  torvalds#356     test_global_funcs:FAIL

Further analysis shows llvm upstream patch [1] is responsible for the above
failures. For example, for function bpf_fentry_test7() in net/bpf/test_run.c,
without [1], the asm code is:

  0000000000000400 <bpf_fentry_test7>:
     400: f3 0f 1e fa                   endbr64
     404: e8 00 00 00 00                callq   0x409 <bpf_fentry_test7+0x9>
     409: 48 89 f8                      movq    %rdi, %rax
     40c: c3                            retq
     40d: 0f 1f 00                      nopl    (%rax)

... and with [1], the asm code is:

  0000000000005d20 <bpf_fentry_test7.specialized.1>:
    5d20: e8 00 00 00 00                callq   0x5d25 <bpf_fentry_test7.specialized.1+0x5>
    5d25: c3                            retq

... and <bpf_fentry_test7.specialized.1> is called instead of <bpf_fentry_test7>
and this caused test failures for #13/#77 etc. except #356.

For test case #356/17, with [1] (progs/test_global_func17.c), the main prog
looks like:

  0000000000000000 <global_func17>:
       0:       b4 00 00 00 2a 00 00 00 w0 = 0x2a
       1:       95 00 00 00 00 00 00 00 exit

... which passed verification while the test itself expects a verification
failure.

Let us add 'barrier_var' style asm code in both places to prevent function
specialization which caused selftests failure.

  [1] llvm/llvm-project#72903

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20231127050342.1945270-1-yonghong.song@linux.dev
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Dec 25, 2023
If CONFIG_BPF_JIT_ALWAYS_ON is not set and bpf_jit_enable is 0, there
exist 6 failed tests.

  [root@linux bpf]# echo 0 > /proc/sys/net/core/bpf_jit_enable
  [root@linux bpf]# ./test_verifier | grep FAIL
  torvalds#107/p inline simple bpf_loop call FAIL
  torvalds#108/p don't inline bpf_loop call, flags non-zero FAIL
  torvalds#109/p don't inline bpf_loop call, callback non-constant FAIL
  torvalds#110/p bpf_loop_inline and a dead func FAIL
  torvalds#111/p bpf_loop_inline stack locations for loop vars FAIL
  torvalds#112/p inline bpf_loop call in a big program FAIL
  Summary: 505 PASSED, 266 SKIPPED, 6 FAILED

The test log shows that callbacks are not allowed in non-JITed programs,
interpreter doesn't support them yet, thus these tests should be skipped
if jit is disabled, just return -ENOTSUPP instead of -EINVAL for pseudo
calls in fixup_call_args().

With this patch:

  [root@linux bpf]# echo 0 > /proc/sys/net/core/bpf_jit_enable
  [root@linux bpf]# ./test_verifier | grep FAIL
  Summary: 505 PASSED, 272 SKIPPED, 0 FAILED
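
Paraphrasing the change rather than quoting the verifier: when the JIT is disabled and
a program relies on callbacks that the interpreter cannot run, the verifier should
report the kernel-internal ENOTSUPP (524), which the selftests translate into a skip,
rather than EINVAL. An illustrative sketch with made-up names, not the verbatim
fixup_call_args() code:

  #define ENOTSUPP 524    /* kernel-internal "operation is not supported" */

  static int check_pseudo_call(int jit_enabled, int uses_callbacks)
  {
          if (uses_callbacks && !jit_enabled)
                  return -ENOTSUPP;       /* was -EINVAL before this patch */
          return 0;
  }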

Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
logic10492 pushed a commit to logic10492/linux-amd-zen2 that referenced this pull request Jan 18, 2024
Kaz205 pushed a commit to Kaz205/linux that referenced this pull request Feb 5, 2024
[ Upstream commit b16904f ]

With latest upstream llvm18, the following test cases failed:

  $ ./test_progs -j
  torvalds#13/2    bpf_cookie/multi_kprobe_link_api:FAIL
  torvalds#13/3    bpf_cookie/multi_kprobe_attach_api:FAIL
  torvalds#13      bpf_cookie:FAIL
  torvalds#77      fentry_fexit:FAIL
  torvalds#78/1    fentry_test/fentry:FAIL
  torvalds#78      fentry_test:FAIL
  torvalds#82/1    fexit_test/fexit:FAIL
  torvalds#82      fexit_test:FAIL
  torvalds#112/1   kprobe_multi_test/skel_api:FAIL
  torvalds#112/2   kprobe_multi_test/link_api_addrs:FAIL
  [...]
  torvalds#112     kprobe_multi_test:FAIL
  torvalds#356/17  test_global_funcs/global_func17:FAIL
  torvalds#356     test_global_funcs:FAIL

Further analysis shows llvm upstream patch [1] is responsible for the above
failures. For example, for function bpf_fentry_test7() in net/bpf/test_run.c,
without [1], the asm code is:

  0000000000000400 <bpf_fentry_test7>:
     400: f3 0f 1e fa                   endbr64
     404: e8 00 00 00 00                callq   0x409 <bpf_fentry_test7+0x9>
     409: 48 89 f8                      movq    %rdi, %rax
     40c: c3                            retq
     40d: 0f 1f 00                      nopl    (%rax)

... and with [1], the asm code is:

  0000000000005d20 <bpf_fentry_test7.specialized.1>:
    5d20: e8 00 00 00 00                callq   0x5d25 <bpf_fentry_test7.specialized.1+0x5>
    5d25: c3                            retq

... and <bpf_fentry_test7.specialized.1> is called instead of <bpf_fentry_test7>
and this caused test failures for #13/#77 etc. except #356.

For test case #356/17, with [1] (progs/test_global_func17.c), the main prog
looks like:

  0000000000000000 <global_func17>:
       0:       b4 00 00 00 2a 00 00 00 w0 = 0x2a
       1:       95 00 00 00 00 00 00 00 exit

... which passed verification while the test itself expects a verification
failure.

Let us add 'barrier_var' style asm code in both places to prevent function
specialization which caused selftests failure.

  [1] llvm/llvm-project#72903

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20231127050342.1945270-1-yonghong.song@linux.dev
Signed-off-by: Sasha Levin <sashal@kernel.org>
1054009064 pushed a commit to 1054009064/linux that referenced this pull request Feb 5, 2024
[ Upstream commit b16904f ]

With latest upstream llvm18, the following test cases failed:

  $ ./test_progs -j
  torvalds#13/2    bpf_cookie/multi_kprobe_link_api:FAIL
  torvalds#13/3    bpf_cookie/multi_kprobe_attach_api:FAIL
  torvalds#13      bpf_cookie:FAIL
  torvalds#77      fentry_fexit:FAIL
  torvalds#78/1    fentry_test/fentry:FAIL
  torvalds#78      fentry_test:FAIL
  torvalds#82/1    fexit_test/fexit:FAIL
  torvalds#82      fexit_test:FAIL
  torvalds#112/1   kprobe_multi_test/skel_api:FAIL
  torvalds#112/2   kprobe_multi_test/link_api_addrs:FAIL
  [...]
  torvalds#112     kprobe_multi_test:FAIL
  torvalds#356/17  test_global_funcs/global_func17:FAIL
  torvalds#356     test_global_funcs:FAIL

Further analysis shows llvm upstream patch [1] is responsible for the above
failures. For example, for function bpf_fentry_test7() in net/bpf/test_run.c,
without [1], the asm code is:

  0000000000000400 <bpf_fentry_test7>:
     400: f3 0f 1e fa                   endbr64
     404: e8 00 00 00 00                callq   0x409 <bpf_fentry_test7+0x9>
     409: 48 89 f8                      movq    %rdi, %rax
     40c: c3                            retq
     40d: 0f 1f 00                      nopl    (%rax)

... and with [1], the asm code is:

  0000000000005d20 <bpf_fentry_test7.specialized.1>:
    5d20: e8 00 00 00 00                callq   0x5d25 <bpf_fentry_test7.specialized.1+0x5>
    5d25: c3                            retq

... and <bpf_fentry_test7.specialized.1> is called instead of <bpf_fentry_test7>
and this caused test failures for #13/#77 etc. except #356.

For test case #356/17, with [1] (progs/test_global_func17.c), the main prog
looks like:

  0000000000000000 <global_func17>:
       0:       b4 00 00 00 2a 00 00 00 w0 = 0x2a
       1:       95 00 00 00 00 00 00 00 exit

... which passed verification while the test itself expects a verification
failure.

Let us add 'barrier_var' style asm code in both places to prevent function
specialization which caused selftests failure.

  [1] llvm/llvm-project#72903

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20231127050342.1945270-1-yonghong.song@linux.dev
Signed-off-by: Sasha Levin <sashal@kernel.org>
rodriguezst pushed a commit to rodriguezst/linux that referenced this pull request May 17, 2024
This change adds support for a configurable eir_max_name_len for
platforms which require a complete name longer than 48 bytes in the EIR.

From bluetoothctl:
[bluetooth]# system-alias
012345678901234567890123456789012345678901234567890123456789
Changing 012345678901234567890123456789012345678901234567890123456789
succeeded
[CHG] Controller DC:71:96:69:02:89 Alias:
012345678901234567890123456789012345678901234567890123456789

From btmon:
< HCI Command: Write Local Name (0x03|0x0013) plen 248     torvalds#109
[hci0] 88.567990
        Name:
012345678901234567890123456789012345678901234567890123456789
> HCI Event: Command Complete (0x0e) plen 4  torvalds#110 [hci0] 88.663854
      Write Local Name (0x03|0x0013) ncmd 1
        Status: Success (0x00)
@ MGMT Event: Local Name Changed (0x0008) plen 260               
{0x0004} [hci0] 88.663948
        Name:
012345678901234567890123456789012345678901234567890123456789
        Short name:
< HCI Command: Write Extended Inquiry Response (0x03|0x0052) plen 241
 torvalds#111 [hci0] 88.663977
        FEC: Not required (0x00)
        Name (complete):
012345678901234567890123456789012345678901234567890123456789
        TX power: 12 dBm
        Device ID: Bluetooth SIG assigned (0x0001)
          Vendor: Google (224)
          Product: 0xc405
          Version: 0.5.6 (0x0056)
        16-bit Service UUIDs (complete): 7 entries
          Generic Access Profile (0x1800)
          Generic Attribute Profile (0x1801)
          Device Information (0x180a)
          A/V Remote Control (0x110e)
          A/V Remote Control Target (0x110c)
          Handsfree Audio Gateway (0x111f)
          Audio Source (0x110a)
> HCI Event: Command Complete (0x0e) plen 4    torvalds#112 [hci0] 88.664874
      Write Extended Inquiry Response (0x03|0x0052) ncmd 1
        Status: Success (0x00)

    (am from https://patchwork.kernel.org/patch/11687367/)
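
As background, each EIR field is encoded as [length][type][data], with type 0x09 for
the Complete Local Name and 0x08 for the Shortened Local Name. The sketch below shows
how a configurable cap on the advertised name could be applied; eir_append_name() and
its signature are hypothetical, only the field encoding follows the specification:

  #include <stdint.h>
  #include <string.h>

  #define EIR_NAME_SHORT    0x08
  #define EIR_NAME_COMPLETE 0x09

  static uint8_t *eir_append_name(uint8_t *eir, const char *name,
                                  size_t eir_max_name_len)
  {
          size_t len = strlen(name);
          uint8_t type = EIR_NAME_COMPLETE;

          if (len > eir_max_name_len) {   /* does not fit: shorten it */
                  len = eir_max_name_len;
                  type = EIR_NAME_SHORT;
          }

          *eir++ = len + 1;               /* field length includes the type byte */
          *eir++ = type;
          memcpy(eir, name, len);
          return eir + len;
  }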

Reviewed-by: Sonny Sasaka <sonnysasaka@chromium.org>
Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Alain Michaud <alainm@chromium.org>

Backport notes: HDEV_PARAM_U16 is changed from two parameters to one
parameter.

BUG=none
TEST=build

Signed-off-by: Zhengping Jiang <jiangzp@google.com>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Oct 1, 2024
This change adds support for a configurable eir_max_name_len for
platforms which require a complete name longer than 48 bytes in the EIR.

From bluetoothctl:
[bluetooth]# system-alias
012345678901234567890123456789012345678901234567890123456789
Changing 012345678901234567890123456789012345678901234567890123456789
succeeded
[CHG] Controller DC:71:96:69:02:89 Alias:
012345678901234567890123456789012345678901234567890123456789

From btmon:
< HCI Command: Write Local Name (0x03|0x0013) plen 248     torvalds#109
[hci0] 88.567990
        Name:
012345678901234567890123456789012345678901234567890123456789
> HCI Event: Command Complete (0x0e) plen 4  torvalds#110 [hci0] 88.663854
      Write Local Name (0x03|0x0013) ncmd 1
        Status: Success (0x00)
@ MGMT Event: Local Name Changed (0x0008) plen 260               
{0x0004} [hci0] 88.663948
        Name:
012345678901234567890123456789012345678901234567890123456789
        Short name:
< HCI Command: Write Extended Inquiry Response (0x03|0x0052) plen 241
 torvalds#111 [hci0] 88.663977
        FEC: Not required (0x00)
        Name (complete):
012345678901234567890123456789012345678901234567890123456789
        TX power: 12 dBm
        Device ID: Bluetooth SIG assigned (0x0001)
          Vendor: Google (224)
          Product: 0xc405
          Version: 0.5.6 (0x0056)
        16-bit Service UUIDs (complete): 7 entries
          Generic Access Profile (0x1800)
          Generic Attribute Profile (0x1801)
          Device Information (0x180a)
          A/V Remote Control (0x110e)
          A/V Remote Control Target (0x110c)
          Handsfree Audio Gateway (0x111f)
          Audio Source (0x110a)
> HCI Event: Command Complete (0x0e) plen 4    torvalds#112 [hci0] 88.664874
      Write Extended Inquiry Response (0x03|0x0052) ncmd 1
        Status: Success (0x00)

    (am from https://patchwork.kernel.org/patch/11687367/)

Reviewed-by: Sonny Sasaka <sonnysasaka@chromium.org>
Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Alain Michaud <alainm@chromium.org>

Backport notes: HDEV_PARAM_U16 is changed from two parameters to one
parameter.

BUG=none
TEST=build

Signed-off-by: Zhengping Jiang <jiangzp@google.com>