
Dev39 #112

Closed
wants to merge 2 commits into from

Conversation

tgsergeant

No description provided.

@tgsergeant tgsergeant closed this Aug 5, 2014
@tgsergeant tgsergeant deleted the dev39 branch August 5, 2014 04:51
@tgsergeant
Author

I'm really sorry, this was created by mistake. Please ignore!

cyndis pushed a commit to cyndis/linux that referenced this pull request Aug 5, 2014
Set the disconnected flag in struct usbhid when a USB device is removed, and check
the flag before sending URB requests. This prevents a kernel panic when a HID
driver calls hid_hw_request() after the USB device has been removed.

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
 IP: [<ffffffff8161746f>] hid_submit_ctrl+0x7f/0x290
 PGD 0
 Oops: 0002 [#1] PREEMPT SMP
 CPU: 2 PID: 39 Comm: khubd Tainted: G          IO  3.16.0-rc5+ torvalds#112
 Hardware name: Microsoft Corporation Surface Pro 2/Surface Pro 2, BIOS 2.03.0250 09/06/2013
 task: ffff880118aba6e0 ti: ffff8800daf80000 task.ti: ffff8800daf80000
 RIP: 0010:[<ffffffff8161746f>]  [<ffffffff8161746f>] hid_submit_ctrl+0x7f/0x290
 RSP: 0018:ffff8800daf83750  EFLAGS: 00010086
 RAX: 0000000080000300 RBX: ffff88003f60c000 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff880117f78000
 RBP: ffff8800daf83788 R08: 0000000000000001 R09: 0000000000000001
 R10: 0000000000000001 R11: 0000000000000000 R12: ffff880117f78000
 R13: ffff88003f11a290 R14: 000000000000000c R15: ffff880091cb3ab8
 FS:  0000000000000000(0000) GS:ffff88011b000000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000058 CR3: 0000000001c11000 CR4: 00000000001407e0
 Stack:
  ffff880117f3dcd0 ffff880117f78000 ffff88003f60c000 ffff880117f78000
  ffff880117f78000 ffff88003f11a290 0000000000000000 ffff8800daf837b0
  ffffffff81617707 ffff880117f78000 ffff88003f60c000 0000000000000013
 Call Trace:
  [<ffffffff81617707>] usbhid_restart_ctrl_queue+0x87/0x140
  [<ffffffff81617a88>] usbhid_submit_report+0x2c8/0x370
  [<ffffffff81617b4a>] usbhid_request+0x1a/0x30
  [<ffffffffa020edfb>] sensor_hub_set_feature+0x8b/0xd0 [hid_sensor_hub]
  [<ffffffffa02d9084>] hid_sensor_power_state+0x84/0x110 [hid_sensor_trigger]
  [<ffffffffa02d9129>] hid_sensor_data_rdy_trigger_set_state+0x19/0x20 [hid_sensor_trigger]
  [<ffffffffa034d5b7>] iio_triggered_buffer_predisable+0xa7/0xb0 [industrialio]
  [<ffffffffa034cc4a>] iio_disable_all_buffers+0x3a/0xc0 [industrialio]
  [<ffffffffa03487d3>] iio_device_unregister+0x53/0x80 [industrialio]
  [<ffffffffa026c06a>] hid_accel_3d_remove+0x2a/0x50 [hid_sensor_accel_3d]
  [<ffffffff814f433d>] platform_drv_remove+0x1d/0x40
  [<ffffffff814f18bf>] __device_release_driver+0x7f/0xf0
  [<ffffffff814f1955>] device_release_driver+0x25/0x40
  [<ffffffff814f121c>] bus_remove_device+0x11c/0x1a0
  [<ffffffff814ed7d6>] device_del+0x136/0x1e0
  [<ffffffff81512190>] ? mfd_cell_disable+0x80/0x80
  [<ffffffff814f41d1>] platform_device_del+0x21/0xc0
  [<ffffffff814f4282>] platform_device_unregister+0x12/0x30
  [<ffffffff815121d3>] mfd_remove_devices_fn+0x43/0x50
  [<ffffffff814ed3e3>] device_for_each_child+0x43/0x70
  [<ffffffff81512105>] mfd_remove_devices+0x25/0x30
  [<ffffffffa020ebd7>] sensor_hub_remove+0x87/0x140 [hid_sensor_hub]
  [<ffffffff81607c5b>] hid_device_remove+0x6b/0xd0
  [<ffffffff814f18bf>] __device_release_driver+0x7f/0xf0
  [<ffffffff814f1955>] device_release_driver+0x25/0x40
  [<ffffffff814f121c>] bus_remove_device+0x11c/0x1a0
  [<ffffffff814ed7d6>] device_del+0x136/0x1e0
  [<ffffffff81607d47>] hid_destroy_device+0x27/0x60
  [<ffffffff81616972>] usbhid_disconnect+0x22/0x50
  [<ffffffff81568597>] usb_unbind_interface+0x77/0x2b0
  [<ffffffff814f18bf>] __device_release_driver+0x7f/0xf0
  [<ffffffff814f1955>] device_release_driver+0x25/0x40
  [<ffffffff814f121c>] bus_remove_device+0x11c/0x1a0
  [<ffffffff814ed7d6>] device_del+0x136/0x1e0
  [<ffffffff81565cd1>] usb_disable_device+0x91/0x2a0
  [<ffffffff8155b046>] usb_disconnect+0x96/0x2e0
  [<ffffffff8155d74a>] hub_thread+0xb5a/0x1840

Signed-off-by: Reyad Attiyat <reyad.attiyat@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
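
As a rough illustration of the guard described in that commit message, here is a minimal kernel-style sketch; the struct, lock, and function names are invented for the example, and only usb_submit_urb() is the real API:

  /* Illustrative only: set a flag on disconnect and check it before submitting. */
  struct example_usbhid {
          spinlock_t lock;
          bool disconnected;
  };

  static void example_disconnect(struct example_usbhid *hid)
  {
          spin_lock_irq(&hid->lock);
          hid->disconnected = true;               /* device is gone, stop new I/O */
          spin_unlock_irq(&hid->lock);
  }

  static int example_submit_request(struct example_usbhid *hid, struct urb *urb)
  {
          if (hid->disconnected)                  /* bail out instead of touching freed state */
                  return -ENODEV;
          return usb_submit_urb(urb, GFP_ATOMIC); /* real USB submission API */
  }
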
swarren pushed a commit to swarren/linux-tegra that referenced this pull request Oct 16, 2014
WARNING: 'retuned' may be misspelled - perhaps 'returned'?
torvalds#112: FILE: fs/ocfs2/file.c:2329:
+		 * If generic_file_buffered_write() retuned a synchronous error

total: 0 errors, 1 warnings, 121 lines checked

./patches/ocfs2-do-not-fallback-to-buffer-i-o-write-if-fill-holes.patch has style problems, please review.

If any of these errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Weiwei Wang <wangww631@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
aryabinin pushed a commit to aryabinin/linux that referenced this pull request Oct 27, 2014
WARNING: 'retuned' may be misspelled - perhaps 'returned'?
torvalds#112: FILE: fs/ocfs2/file.c:2329:
+		 * If generic_file_buffered_write() retuned a synchronous error

total: 0 errors, 1 warnings, 121 lines checked

./patches/ocfs2-do-not-fallback-to-buffer-i-o-write-if-fill-holes.patch has style problems, please review.

If any of these errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Weiwei Wang <wangww631@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
tobetter pushed a commit to tobetter/linux that referenced this pull request Sep 22, 2015
Remove excessive kernel log of amvideocap driver
0day-ci pushed a commit to 0day-ci/linux that referenced this pull request Jun 4, 2016
Commit e8d975e ("fixing infinite OPEN loop in 4.0 stateid recovery")
introduced access to state after it was just potentially freed by
nfs4_put_open_state leading to a random data corruption somewhere.

BUG: unable to handle kernel paging request at ffff88004941ee40
IP: [<ffffffff813baf01>] nfs4_do_reclaim+0x461/0x740
PGD 3501067 PUD 3504067 PMD 6ff37067 PTE 800000004941e060
Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
Modules linked in: loop rpcsec_gss_krb5 acpi_cpufreq tpm_tis joydev i2c_piix4 pcspkr tpm virtio_console nfsd ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops floppy serio_raw virtio_blk drm
CPU: 6 PID: 2161 Comm: 192.168.10.253- Not tainted 4.7.0-rc1-vm-nfs+ torvalds#112
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff8800463dcd00 ti: ffff88003ff48000 task.ti: ffff88003ff48000
RIP: 0010:[<ffffffff813baf01>]  [<ffffffff813baf01>] nfs4_do_reclaim+0x461/0x740
RSP: 0018:ffff88003ff4bd68  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffffff81a49900 RCX: 00000000000000e8
RDX: 00000000000000e8 RSI: ffff8800418b9930 RDI: ffff880040c96c88
RBP: ffff88003ff4bdf8 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff880040c96c98
R13: ffff88004941ee20 R14: ffff88004941ee40 R15: ffff88004941ee00
FS:  0000000000000000(0000) GS:ffff88006d000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff88004941ee40 CR3: 0000000060b0b000 CR4: 00000000000006e0
Stack:
 ffffffff813baad5 ffff8800463dcd00 ffff880000000001 ffffffff810e6b68
 ffff880043ddbc88 ffff8800418b9800 ffff8800418b98c8 ffff88004941ee48
 ffff880040c96c90 ffff880040c96c00 ffff880040c96c20 ffff880040c96c40
Call Trace:
 [<ffffffff813baad5>] ? nfs4_do_reclaim+0x35/0x740
 [<ffffffff810e6b68>] ? trace_hardirqs_on_caller+0x128/0x1b0
 [<ffffffff813bb7cd>] nfs4_run_state_manager+0x5ed/0xa40
 [<ffffffff813bb1e0>] ? nfs4_do_reclaim+0x740/0x740
 [<ffffffff813bb1e0>] ? nfs4_do_reclaim+0x740/0x740
 [<ffffffff810af0d1>] kthread+0x101/0x120
 [<ffffffff810e6b68>] ? trace_hardirqs_on_caller+0x128/0x1b0
 [<ffffffff818843af>] ret_from_fork+0x1f/0x40
 [<ffffffff810aefd0>] ? kthread_create_on_node+0x250/0x250
Code: 65 80 4c 8b b5 78 ff ff ff e8 fc 88 4c 00 48 8b 7d 88 e8 13 67 d2 ff 49 8b 47 40 a8 02 0f 84 d3 01 00 00 4c 89 ff e8 7f f9 ff ff <f0> 41 80 26 7f 48 8b 7d c8 e8 b1 84 4c 00 e9 39 fd ff ff 3d e6
RIP  [<ffffffff813baf01>] nfs4_do_reclaim+0x461/0x740
 RSP <ffff88003ff4bd68>
CR2: ffff88004941ee40

Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
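
For illustration only, a minimal sketch of the general pattern that avoids this class of use-after-free: read whatever is still needed from the state before the put that may free it. The types and helpers below are invented to show the ordering and are not the actual NFSv4 recovery fix:

  /* Illustrative pattern only, not the actual NFSv4 recovery code. */
  static void example_reclaim_one(struct example_state *state)
  {
          struct example_owner *owner = state->owner;   /* copy what we still need */

          example_put_state(state);     /* may drop the last reference and free it */
          /* 'state' must not be dereferenced past this point; continue with the copy */
          example_continue_reclaim(owner);
  }
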
0day-ci pushed a commit to 0day-ci/linux that referenced this pull request Jun 15, 2016
Commit e8d975e ("fixing infinite OPEN loop in 4.0 stateid recovery")
introduced access to state after it was just potentially freed by
nfs4_put_open_state leading to a random data corruption somewhere.

BUG: unable to handle kernel paging request at ffff88004941ee40
IP: [<ffffffff813baf01>] nfs4_do_reclaim+0x461/0x740
PGD 3501067 PUD 3504067 PMD 6ff37067 PTE 800000004941e060
Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
Modules linked in: loop rpcsec_gss_krb5 acpi_cpufreq tpm_tis joydev i2c_piix4 pcspkr tpm virtio_console nfsd ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops floppy serio_raw virtio_blk drm
CPU: 6 PID: 2161 Comm: 192.168.10.253- Not tainted 4.7.0-rc1-vm-nfs+ torvalds#112
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff8800463dcd00 ti: ffff88003ff48000 task.ti: ffff88003ff48000
RIP: 0010:[<ffffffff813baf01>]  [<ffffffff813baf01>] nfs4_do_reclaim+0x461/0x740
RSP: 0018:ffff88003ff4bd68  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffffff81a49900 RCX: 00000000000000e8
RDX: 00000000000000e8 RSI: ffff8800418b9930 RDI: ffff880040c96c88
RBP: ffff88003ff4bdf8 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff880040c96c98
R13: ffff88004941ee20 R14: ffff88004941ee40 R15: ffff88004941ee00
FS:  0000000000000000(0000) GS:ffff88006d000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff88004941ee40 CR3: 0000000060b0b000 CR4: 00000000000006e0
Stack:
 ffffffff813baad5 ffff8800463dcd00 ffff880000000001 ffffffff810e6b68
 ffff880043ddbc88 ffff8800418b9800 ffff8800418b98c8 ffff88004941ee48
 ffff880040c96c90 ffff880040c96c00 ffff880040c96c20 ffff880040c96c40
Call Trace:
 [<ffffffff813baad5>] ? nfs4_do_reclaim+0x35/0x740
 [<ffffffff810e6b68>] ? trace_hardirqs_on_caller+0x128/0x1b0
 [<ffffffff813bb7cd>] nfs4_run_state_manager+0x5ed/0xa40
 [<ffffffff813bb1e0>] ? nfs4_do_reclaim+0x740/0x740
 [<ffffffff813bb1e0>] ? nfs4_do_reclaim+0x740/0x740
 [<ffffffff810af0d1>] kthread+0x101/0x120
 [<ffffffff810e6b68>] ? trace_hardirqs_on_caller+0x128/0x1b0
 [<ffffffff818843af>] ret_from_fork+0x1f/0x40
 [<ffffffff810aefd0>] ? kthread_create_on_node+0x250/0x250
Code: 65 80 4c 8b b5 78 ff ff ff e8 fc 88 4c 00 48 8b 7d 88 e8 13 67 d2 ff 49 8b 47 40 a8 02 0f 84 d3 01 00 00 4c 89 ff e8 7f f9 ff ff <f0> 41 80 26 7f 48 8b 7d c8 e8 b1 84 4c 00 e9 39 fd ff ff 3d e6
RIP  [<ffffffff813baf01>] nfs4_do_reclaim+0x461/0x740
 RSP <ffff88003ff4bd68>
CR2: ffff88004941ee40

Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
torvalds pushed a commit that referenced this pull request Jun 29, 2016
Commit e8d975e ("fixing infinite OPEN loop in 4.0 stateid recovery")
introduced access to state after it was just potentially freed by
nfs4_put_open_state leading to a random data corruption somewhere.

BUG: unable to handle kernel paging request at ffff88004941ee40
IP: [<ffffffff813baf01>] nfs4_do_reclaim+0x461/0x740
PGD 3501067 PUD 3504067 PMD 6ff37067 PTE 800000004941e060
Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
Modules linked in: loop rpcsec_gss_krb5 acpi_cpufreq tpm_tis joydev i2c_piix4 pcspkr tpm virtio_console nfsd ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops floppy serio_raw virtio_blk drm
CPU: 6 PID: 2161 Comm: 192.168.10.253- Not tainted 4.7.0-rc1-vm-nfs+ #112
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff8800463dcd00 ti: ffff88003ff48000 task.ti: ffff88003ff48000
RIP: 0010:[<ffffffff813baf01>]  [<ffffffff813baf01>] nfs4_do_reclaim+0x461/0x740
RSP: 0018:ffff88003ff4bd68  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffffff81a49900 RCX: 00000000000000e8
RDX: 00000000000000e8 RSI: ffff8800418b9930 RDI: ffff880040c96c88
RBP: ffff88003ff4bdf8 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff880040c96c98
R13: ffff88004941ee20 R14: ffff88004941ee40 R15: ffff88004941ee00
FS:  0000000000000000(0000) GS:ffff88006d000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff88004941ee40 CR3: 0000000060b0b000 CR4: 00000000000006e0
Stack:
 ffffffff813baad5 ffff8800463dcd00 ffff880000000001 ffffffff810e6b68
 ffff880043ddbc88 ffff8800418b9800 ffff8800418b98c8 ffff88004941ee48
 ffff880040c96c90 ffff880040c96c00 ffff880040c96c20 ffff880040c96c40
Call Trace:
 [<ffffffff813baad5>] ? nfs4_do_reclaim+0x35/0x740
 [<ffffffff810e6b68>] ? trace_hardirqs_on_caller+0x128/0x1b0
 [<ffffffff813bb7cd>] nfs4_run_state_manager+0x5ed/0xa40
 [<ffffffff813bb1e0>] ? nfs4_do_reclaim+0x740/0x740
 [<ffffffff813bb1e0>] ? nfs4_do_reclaim+0x740/0x740
 [<ffffffff810af0d1>] kthread+0x101/0x120
 [<ffffffff810e6b68>] ? trace_hardirqs_on_caller+0x128/0x1b0
 [<ffffffff818843af>] ret_from_fork+0x1f/0x40
 [<ffffffff810aefd0>] ? kthread_create_on_node+0x250/0x250
Code: 65 80 4c 8b b5 78 ff ff ff e8 fc 88 4c 00 48 8b 7d 88 e8 13 67 d2 ff 49 8b 47 40 a8 02 0f 84 d3 01 00 00 4c 89 ff e8 7f f9 ff ff <f0> 41 80 26 7f 48 8b 7d c8 e8 b1 84 4c 00 e9 39 fd ff ff 3d e6
RIP  [<ffffffff813baf01>] nfs4_do_reclaim+0x461/0x740
 RSP <ffff88003ff4bd68>
CR2: ffff88004941ee40

Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
CkNoSFeRaTU pushed a commit to CkNoSFeRaTU/linux that referenced this pull request Aug 25, 2016
laijs pushed a commit to laijs/linux that referenced this pull request Feb 13, 2017
Revert "lkl: Fix missing screen_info during compilation."
fengguang pushed a commit to 0day-ci/linux that referenced this pull request May 11, 2017
Shubham was recently asking on netdev why in arm64 JIT we don't multiply
the index for accessing the tail call map by 8. That led me into testing
out arm64 JIT wrt tail calls and it turned out I got a NULL pointer
dereference on the tail call.

The buggy access is at:

  prog = array->ptrs[index];
  if (prog == NULL)
      goto out;

  [...]
  00000060:  d2800e0a  mov x10, #0x70 // torvalds#112
  00000064:  f86a682a  ldr x10, [x1,x10]
  00000068:  f862694b  ldr x11, [x10,x2]
  0000006c:  b40000ab  cbz x11, 0x00000080
  [...]

The code triggering the crash is f862694b. x1 at the time contains the
address of the bpf array, x10 offsetof(struct bpf_array, ptrs). Meaning,
above we load the pointer to the program at map slot 0 into x10. x10
can then be NULL if the slot is not occupied, which we later on try to
access with a user given offset in x2 that is the map index.

Fix this by emitting the following instead:

  [...]
  00000060:  d2800e0a  mov x10, #0x70 // torvalds#112
  00000064:  8b0a002a  add x10, x1, x10
  00000068:  d37df04b  lsl x11, x2, #3
  0000006c:  f86b694b  ldr x11, [x10,x11]
  00000070:  b40000ab  cbz x11, 0x00000084
  [...]

This basically adds the offset of ptrs to the base address of the bpf
array we got, and we later access the map at an index * 8 offset
relative to that. The tail call map itself is basically one large area
with metadata at the head followed by the array of prog pointers.
This makes tail calls work again; tested on Cavium ThunderX ARMv8.

Fixes: ddb5599 ("arm64: bpf: implement bpf_tail_call() helper")
Reported-by: Shubham Bansal <illusionist.neo@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
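
In C terms, the fixed instruction sequence above computes roughly the following lookup; this is an illustrative sketch rather than the JIT source, and the helper name is invented:

  /* Rough C equivalent of the fixed sequence (illustrative, not the JIT source). */
  static struct bpf_prog *example_tail_call_lookup(struct bpf_array *array, u64 index)
  {
          void *base = (void *)array + offsetof(struct bpf_array, ptrs); /* add x10, x1, x10 */

          /* index the prog pointer array with an 8-byte stride: lsl x11, x2, #3 + ldr */
          return *(struct bpf_prog **)(base + (index << 3));
          /* the caller branches out if the returned pointer is NULL (cbz x11, ...) */
  }
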
fengguang pushed a commit to 0day-ci/linux that referenced this pull request May 11, 2017
Shubham was recently asking on netdev why in arm64 JIT we don't multiply
the index for accessing the tail call map by 8. That led me into testing
out arm64 JIT wrt tail calls and it turned out I got a NULL pointer
dereference on the tail call.

The buggy access is at:

  prog = array->ptrs[index];
  if (prog == NULL)
      goto out;

  [...]
  00000060:  d2800e0a  mov x10, #0x70 // torvalds#112
  00000064:  f86a682a  ldr x10, [x1,x10]
  00000068:  f862694b  ldr x11, [x10,x2]
  0000006c:  b40000ab  cbz x11, 0x00000080
  [...]

The code triggering the crash is f862694b. x1 at the time contains the
address of the bpf array, x10 offsetof(struct bpf_array, ptrs). Meaning,
above we load the pointer to the program at map slot 0 into x10. x10
can then be NULL if the slot is not occupied, which we later on try to
access with a user given offset in x2 that is the map index.

Fix this by emitting the following instead:

  [...]
  00000060:  d2800e0a  mov x10, #0x70 // torvalds#112
  00000064:  8b0a002a  add x10, x1, x10
  00000068:  d37df04b  lsl x11, x2, #3
  0000006c:  f86b694b  ldr x11, [x10,x11]
  00000070:  b40000ab  cbz x11, 0x00000084
  [...]

This basically adds the offset of ptrs to the base address of the bpf
array we got, and we later access the map at an index * 8 offset
relative to that. The tail call map itself is basically one large area
with metadata at the head followed by the array of prog pointers.
This makes tail calls work again; tested on Cavium ThunderX ARMv8.

Fixes: ddb5599 ("arm64: bpf: implement bpf_tail_call() helper")
Reported-by: Shubham Bansal <illusionist.neo@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
fengguang pushed a commit to 0day-ci/linux that referenced this pull request May 21, 2017
GIT dac94e29110cd606dec37673644caf2cf6fd1dde

commit 7e1b9521f5a8356553f5e58b07952bf346632ea4
Author: Colin Ian King <colin.king@canonical.com>
Date:   Sat Mar 11 19:09:45 2017 +0000

    dm cache: handle kmalloc failure allocating background_tracker struct
    
    Currently there is no kmalloc failure check on the allocation of
    the background_tracker struct in btracker_create(), and so a NULL return
    will lead to a NULL pointer dereference.  Add a NULL check.
    
    Detected by CoverityScan, CID#1416587 ("Dereference null return value")
    
    Fixes: b29d4986d ("dm cache: significant rework to leverage dm-bio-prison-v2")
    Signed-off-by: Colin Ian King <colin.king@canonical.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
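
A minimal sketch of the pattern the commit describes, with invented names (only kmalloc()/GFP_KERNEL are the real API): check the allocation before touching it and let the caller handle NULL.

  struct example_tracker {
          unsigned int max_work;
          /* ... */
  };

  static struct example_tracker *example_tracker_create(unsigned int max_work)
  {
          struct example_tracker *b = kmalloc(sizeof(*b), GFP_KERNEL);

          if (!b)                 /* allocation failed */
                  return NULL;    /* caller must handle the NULL return */

          b->max_work = max_work;
          return b;
  }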

commit 83345d51a49a4b3f3b4a08a5db644dae438b0189
Author: Tin Huynh <tnhuynh@apm.com>
Date:   Wed May 17 11:25:34 2017 +0700

    i2c: xgene: Set ACPI_COMPANION_I2C
    
    With ACPI, i2c-core requires ACPI companion to be set in order for it
    to create slave device.
    This patch sets the ACPI companion accordingly.
    
    Signed-off-by: Tin Huynh <tnhuynh@apm.com>
    Signed-off-by: Wolfram Sang <wsa@the-dreams.de>

commit 88ad60c23a394b2f8bf1e570c756f415435d1d35
Author: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Date:   Tue May 16 14:07:24 2017 +0200

    i2c: mv64xxx: don't override deferred probing when getting irq
    
    There is no reason to use platform_get_irq() for non-DT probing and
    irq_of_parse_and_map() for DT probing. Indeed, platform_get_irq()
    works fine for both.
    
    In addition, using platform_get_irq() properly returns -EPROBE_DEFER
    when the interrupt controller is not yet available, so instead of
    inventing our own error code (-ENXIO), return the one provided by
    platform_get_irq().
    
    Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
    Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
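
An illustrative probe sketch of the idea above (function name invented): return whatever platform_get_irq() reports, including -EPROBE_DEFER, instead of substituting -ENXIO.

  static int example_probe(struct platform_device *pdev)
  {
          int irq = platform_get_irq(pdev, 0);

          if (irq < 0)
                  return irq;     /* may be -EPROBE_DEFER; the driver core will retry */

          /* ... request the interrupt and finish probing ... */
          return 0;
  }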

commit 13840d38016203f0095cd547b90352812d24b787
Author: Mikulas Patocka <mpatocka@redhat.com>
Date:   Sun Apr 30 17:32:28 2017 -0400

    dm bufio: make the parameter "retain_bytes" unsigned long
    
    Change the type of the parameter "retain_bytes" from unsigned to
    unsigned long, so that on 64-bit machines the user can set more than
    4GiB of data to be retained.
    
    Also, change the type of the variable "count" in the function
    "__evict_old_buffers" to unsigned long.  The assignment
    "count = c->n_buffers[LIST_CLEAN] + c->n_buffers[LIST_DIRTY];"
    could result in unsigned long to unsigned overflow and that could result
    in buffers not being freed when they should.
    
    While at it, avoid division in get_retain_buffers().  Division is slow,
    we can change it to shift because we have precalculated the log2 of
    block size.
    
    Cc: stable@vger.kernel.org
    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
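
A tiny illustrative helper showing the division-to-shift idea, assuming log2(block size) has been precalculated; the names are invented and this is not the dm-bufio code.

  static unsigned long example_get_retain_buffers(unsigned long retain_bytes,
                                                  unsigned char block_size_bits)
  {
          return retain_bytes >> block_size_bits;   /* == retain_bytes / block_size */
  }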

commit 6f61dd3aa35179043f1fcdb0965c5d56278ab724
Author: Kees Cook <keescook@chromium.org>
Date:   Fri May 12 14:52:34 2017 -0700

    efi-pstore: Fix read iter after pstore API refactor
    
    During the internal pstore API refactoring, the EFI vars read entry was
    accidentally made to update a stack variable instead of the pstore
    private data pointer. This corrects the problem (and removes the now
    needless argument).
    
    Fixes: 125cc42baf8a ("pstore: Replace arguments for read() API")
    Signed-off-by: Kees Cook <keescook@chromium.org>

commit 8b671f906c2debc4f2393438c4e7668936522e99
Author: Shannon Nelson <shannon.nelson@oracle.com>
Date:   Mon May 15 10:51:08 2017 -0700

    ldmvsw: stop the clean timer at beginning of remove
    
    Stop the clean timer earlier to be sure there's no asynchronous
    interference while stopping the port.
    
    Orabug: 25748241
    
    Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit b18e5e86b44be0dca399d8e2383f97c8077392ce
Author: Thomas Tai <thomas.tai@oracle.com>
Date:   Mon May 15 10:51:07 2017 -0700

    ldmvsw: unregistering netdev before disable hardware
    
    When running LDom binding/unbinding test, kernel may panic
    in ldmvsw_open(). It is more likely that because we're removing
    the ldc connection before unregistering the netdev in vsw_port_remove(),
    we set up a window of time where one process could be removing the
    device while another trying to UP the device. This also sometimes causes
    vio handshake error due to opening a device without closing it completely.
    We should unregister the netdev before we disable the "hardware".
    
    Orabug: 25980913, 25925306
    
    Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
    Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit ca9df7ede41afd006d74fd6f09f36d909d0eaad7
Author: Miroslav Lichvar <mlichvar@redhat.com>
Date:   Mon May 15 16:04:36 2017 +0200

    net: netcp: fix check of requested timestamping filter
    
    The driver doesn't support timestamping of all received packets and
    should return error when trying to enable the HWTSTAMP_FILTER_ALL
    filter.
    
    Cc: WingMan Kwok <w-kwok2@ti.com>
    Cc: Richard Cochran <richardcochran@gmail.com>
    Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
    Acked-by: Richard Cochran <richardcochran@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit f98e0eb68008aff9824d1c4dad7276c8bab83ca5
Author: Christoph Hellwig <hch@lst.de>
Date:   Mon May 15 17:28:38 2017 +0200

    dm mpath: multipath_clone_and_map must not return -EIO
    
    Since 412445ac ("dm: introduce a new DM_MAPIO_KILL return value"), the
    clone_and_map_rq methods must not return errno values, so fix it up
    to properly return DM_MAPIO_KILL, instead of the -EIO value that snuck
    in due to a conflict between two patches.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>

commit 18a482f5245cc875755090853e84283512b3e6bd
Author: Christoph Hellwig <hch@lst.de>
Date:   Mon May 15 17:28:37 2017 +0200

    dm mpath: don't return -EIO from dm_report_EIO
    
    Instead just turn the macro into a helper for the warning message.
    This removes an unnecessary assignment and will allow the next commit to
    fix a place where -EIO is the wrong return value.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>

commit ece0728037b15f4d31198f12b359104bcb5db4c8
Author: Christoph Hellwig <hch@lst.de>
Date:   Mon May 15 17:28:36 2017 +0200

    dm rq: add a missing break to map_request
    
    We don't want to bug when receiving a DM_MAPIO_KILL value..
    
    Fixes: 412445ac ("dm: introduce a new DM_MAPIO_KILL return value")
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>

commit 0377a07c7a035e0d033cd8b29f0cb15244c0916a
Author: Joe Thornber <ejt@redhat.com>
Date:   Mon May 15 09:45:40 2017 -0400

    dm space map disk: fix some book keeping in the disk space map
    
    When decrementing the reference count for a block, the free count wasn't
    being updated if the reference count went to zero.
    
    Cc: stable@vger.kernel.org
    Signed-off-by: Joe Thornber <ejt@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>

commit 91bcdb92d39711d1adb40c26b653b7978d93eb98
Author: Joe Thornber <ejt@redhat.com>
Date:   Mon May 15 09:43:05 2017 -0400

    dm thin metadata: call precommit before saving the roots
    
    These calls were the wrong way round in __write_initial_superblock.
    
    Cc: stable@vger.kernel.org
    Signed-off-by: Joe Thornber <ejt@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>

commit 66eb9f86e50547ec2a8ff7a75997066a74ef584b
Author: Mahesh Bandewar <maheshb@google.com>
Date:   Fri May 12 17:03:39 2017 -0700

    ipv6: avoid dad-failures for addresses with NODAD
    
    Every address gets added with TENTATIVE flag even for the addresses with
    IFA_F_NODAD flag and dad-work is scheduled for them. During this DAD process
    we realize it's an address with NODAD and complete the process without
    sending any probe. However, the TENTATIVE flag stays on the
    address long enough to cause misinterpretation when we receive an NS.
    While processing the NS, if the address has the TENTATIVE flag, we mark it
    DADFAILED and end up with an address that was originally configured as NODAD
    but now has DADFAILED.
    
    We can't avoid scheduling dad_work for addresses with NODAD but we can
    avoid adding TENTATIVE flag to avoid this racy situation.
    
    Signed-off-by: Mahesh Bandewar <maheshb@google.com>
    Acked-by: David Ahern <dsahern@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
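
A small illustrative sketch of the flag handling described above; the helper is hypothetical, while IFA_F_NODAD and IFA_F_TENTATIVE are the real flags. Only addresses that will actually perform DAD get marked TENTATIVE.

  static u32 example_initial_ifa_flags(u32 cfg_flags)
  {
          u32 flags = cfg_flags;

          if (!(cfg_flags & IFA_F_NODAD))
                  flags |= IFA_F_TENTATIVE;

          return flags;
  }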

commit aa4ad88cfcd4ee45f527fb982140576711e3b501
Author: Mintz, Yuval <Yuval.Mintz@cavium.com>
Date:   Sun May 14 12:21:23 2017 +0300

    qed: Fix uninitialized data in aRFS infrastructure
    
    The current memset is using an incorrect type of variable, causing the
    upper half of the structure to be left uninitialized and causing:
    
      ethernet/qlogic/qed/qed_init_fw_funcs.c: In function 'qed_set_rfs_mode_disable':
      ethernet/qlogic/qed/qed_init_fw_funcs.c:993:3: error: '*((void *)&ramline+4)' is used uninitialized in this function [-Werror=uninitialized]
    
    Fixes: d51e4af5c209 ("qed: aRFS infrastructure support")
    Reported-by: Arnd Bergmann <arnd@arndb.de>
    Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
    Reviewed-by: Arnd Bergmann <arnd@arndb.de>
    Signed-off-by: David S. Miller <davem@davemloft.net>
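
An illustrative reduction of the bug class described above (struct and function names invented): a memset() sized by the wrong type clears only part of the structure.

  struct example_ramline {
          u32 lo;
          u32 hi;
  };

  static void example_clear(struct example_ramline *ramline)
  {
          /* buggy form: memset(ramline, 0, sizeof(u32)) clears only 'lo' */
          memset(ramline, 0, sizeof(*ramline));   /* clears the whole structure */
  }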

commit 8c977f5a856a7276450ddf3a7b57b4a8815b63f9
Author: Julia Lawall <julia.lawall@lip6.fr>
Date:   Fri May 12 22:54:23 2017 +0800

    mdio: mux: fix device_node_continue.cocci warnings
    
    Device node iterators put the previous value of the index variable, so an
    explicit put causes a double put.
    
    In particular, of_mdiobus_register can fail before doing anything
    interesting, so one could view it as a no-op from the reference count
    point of view.
    
    Generated by: scripts/coccinelle/iterators/device_node_continue.cocci
    
    CC: Jon Mason <jon.mason@broadcom.com>
    Signed-off-by: Julia Lawall <julia.lawall@lip6.fr>
    Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
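
An illustrative sketch of the iterator rule behind the warning (helper names invented): the child-node iterator drops the reference to the previous node on each step, so an explicit of_node_put() before a continue double-puts.

  static void example_scan_children(struct device_node *parent)
  {
          struct device_node *child;

          for_each_available_child_of_node(parent, child) {
                  if (!example_child_usable(child))   /* hypothetical check */
                          continue;                   /* no of_node_put() here */
                  example_register_child(child);      /* hypothetical helper  */
          }
  }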

commit d19b183cdc1fa3d70d6abe2a4c369e748cd7ebb8
Author: Douglas Caetano dos Santos <douglascs@taghos.com.br>
Date:   Fri May 12 15:19:15 2017 -0300

    net/packet: fix missing net_device reference release
    
    When using a TX ring buffer, if an error occurs processing a control
    message (e.g. invalid message), the net_device reference is not
    released.
    
    Fixes c14ac9451c348 ("sock: enable timestamping using control messages")
    Signed-off-by: Douglas Caetano dos Santos <douglascs@taghos.com.br>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 4762010f09ac0453f613df345c5281e7f2dec510
Author: yuval.shaia@oracle.com <yuval.shaia@oracle.com>
Date:   Fri May 12 09:10:51 2017 +0300

    net/mlx4_core: Use min3 to select number of MSI-X vectors
    
    Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
    Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 70957eaecc2e43308e403c80293bec3d59632412
Author: Vlad Yasevich <vyasevich@gmail.com>
Date:   Thu May 11 11:09:52 2017 -0400

    macvlan: Fix performance issues with vlan tagged packets
    
    Macvlan always turns on offload features that have a software
    fallback (NETIF_F_GSO_SOFTWARE).  This allows much higher guest-guest
    communications over macvtap.
    
    However, macvtap does not turn on these features for vlan tagged traffic.
    As a result, depending on the HW that macvtap is configured on, the
    performance of guest-guest communication over a vlan is very
    inconsistent.  If the HW supports TSO/UFO over vlans, then the
    performance will be fine.  If not, the performance will suffer
    greatly since the VM may continue using TSO/UFO, and will force the host
    to segment the traffic and possibly overflow the macvtap queue.
    
    This patch adds the always on offloads to vlan_features.  This
    makes sure that any vlan tagged traffic between 2 guest will not
    be segmented needlessly.
    
    Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
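
A minimal illustrative setup snippet for the change described above (not the macvlan source; NETIF_F_GSO_SOFTWARE is the real feature mask): mirror the software-fallback offloads onto vlan_features so VLAN-tagged traffic keeps them.

  static void example_setup(struct net_device *dev)
  {
          dev->features      |= NETIF_F_GSO_SOFTWARE;
          dev->vlan_features |= NETIF_F_GSO_SOFTWARE;   /* also for VLAN-tagged traffic */
  }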

commit 9fce894d03a98ec8e8e8106a964644633d2772ee
Author: Peter Rosin <peda@axentia.se>
Date:   Mon May 15 09:03:50 2017 +0200

    i2c: mux: only print failure message on error
    
    As is, a failure message is printed unconditionally, which is confusing.
    And noisy.
    
    Fixes: 8d4d159f25a7 ("i2c: mux: provide more info on failure in i2c_mux_add_adapter")
    Signed-off-by: Peter Rosin <peda@axentia.se>

commit a36d4637e4a06be067b8e327a0b1118bb2a73cb8
Author: Peter Rosin <peda@axentia.se>
Date:   Mon May 15 18:48:55 2017 +0200

    i2c: mux: reg: rename label to indicate what it does
    
    That maintains sanity if it is ever called from some other spot, and
    also makes the label names coherent.
    
    Signed-off-by: Peter Rosin <peda@axentia.se>

commit 68118e0e73aa3a6291c8b9eb1ee708e05f110cea
Author: Peter Rosin <peda@axentia.se>
Date:   Sun May 7 07:16:30 2017 +0200

    i2c: mux: reg: put away the parent i2c adapter on probe failure
    
    It is only prudent to let go of resources that are not used.
    
    Fixes: b3fdd32799d8 ("i2c: mux: Add register-based mux i2c-mux-reg")
    Signed-off-by: Peter Rosin <peda@axentia.se>

commit 66c25f6e31766a9ec19c2bdc7f5f69f9c59bafd7
Author: Niklas Cassel <niklas.cassel@axis.com>
Date:   Mon May 15 10:56:06 2017 +0200

    net: stmmac: use correct pointer when printing normal descriptor ring
    
    There are two pointers in sysfs_display_ring,
    one that increments if using normal dma descriptors,
    another if using extended dma descriptors.
    
    When printing the normal dma descriptors, the wrong pointer is used,
    thus the printed descriptor addresses are incorrect.
    
    Signed-off-by: Niklas Cassel <niklas.cassel@axis.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit fb317002ab4419ae7e068bee6897f2d5745aa3b9
Author: Heiko Carstens <heiko.carstens@de.ibm.com>
Date:   Tue May 9 12:50:53 2017 +0200

    s390/virtio: change virtio_feature_desc:features type to __le32
    
    The feature member of virtio_feature_desc contains little endian
    values, given that it contents will be converted with
    le32_to_cpu(). The "wrong" __u32 type leads to the sparse warnings
    below.
    In order to avoid them, use the correct __le32 type instead.
    
    drivers/s390/virtio/virtio_ccw.c:749:14: warning: cast to restricted __le32
    drivers/s390/virtio/virtio_ccw.c:762:28: warning: cast to restricted __le32
    
    Acked-by: Halil Pasic <pasic@linux.vnet.ibm.com>
    Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
    Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
    Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
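
An illustrative sketch of the type change (struct and function names invented): declaring little-endian wire data as __le32 lets le32_to_cpu() type-check cleanly under sparse.

  struct example_feature_desc {
          __le32 features;        /* little-endian on the wire */
          __u8   index;
  };

  static u32 example_read_features(const struct example_feature_desc *d)
  {
          return le32_to_cpu(d->features);   /* no "cast to restricted __le32" warning */
  }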

commit 2e63309507c818e8b631a03f02c363031c007fb7
Author: Joe Thornber <ejt@redhat.com>
Date:   Thu May 11 09:09:04 2017 -0400

    dm cache policy smq: don't do any writebacks unless IDLE
    
    If there are no clean blocks to be demoted the writeback will be
    triggered at that point.  Preemptively writing back can hurt high IO
    load scenarios.
    
    Signed-off-by: Joe Thornber <ejt@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>

commit 49b7f768900f4084a65c3689d955b2fceac39e53
Author: Joe Thornber <ejt@redhat.com>
Date:   Thu May 11 09:07:16 2017 -0400

    dm cache: simplify the IDLE vs BUSY state calculation
    
    Drop the MODERATE state since it wasn't buying us much.
    
    Also, in check_migrations(), prepare for the next commit ("dm cache
    policy smq: don't do any writebacks unless IDLE") by deferring to the
    policy to make the final decision on whether writebacks can be
    serviced.
    
    Signed-off-by: Joe Thornber <ejt@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>

commit 701e03e4e180f0cd97d4139a32e2b2d879d12da2
Author: Joe Thornber <ejt@redhat.com>
Date:   Thu May 11 08:22:31 2017 -0400

    dm cache: track all IO to the cache rather than just the origin device's IO
    
    IO tracking used to throttle writebacks when the origin device is busy.
    
    Even if all the IO is going to the fast device, writebacks can
    significantly degrade performance.  So track all IO to gauge whether the
    cache is busy or not.
    
    Otherwise, synthetic IO tests (e.g. fio) that might send all IO to the
    fast device wouldn't cause writebacks to get throttled.
    
    Signed-off-by: Joe Thornber <ejt@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>

commit 6cf4cc8f8b3b7bc9e3c04a7eab44b985d50029fc
Author: Joe Thornber <ejt@redhat.com>
Date:   Thu May 11 07:48:18 2017 -0400

    dm cache policy smq: stop preemptively demoting blocks
    
    It causes a lot of churn if the working set's size is close to the fast
    device's size.
    
    Signed-off-by: Joe Thornber <ejt@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>

commit 4d44ec5ab751be63c5d348f13294304d87baa8c3
Author: Joe Thornber <ejt@redhat.com>
Date:   Thu May 11 05:11:06 2017 -0400

    dm cache policy smq: put newly promoted entries at the top of the multiqueue
    
    This stops entries bouncing in and out of the cache quickly.
    
    Signed-off-by: Joe Thornber <ejt@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>

commit 78c45607b909fb384c47c134d89b39285a6a8b45
Author: Joe Thornber <ejt@redhat.com>
Date:   Thu May 11 05:09:38 2017 -0400

    dm cache policy smq: be more aggressive about triggering a writeback
    
    If there are no clean entries to demote we really want to writeback
    immediately.
    
    Signed-off-by: Joe Thornber <ejt@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>

commit a8cd1eba6135e086109e2b94bf96deb17456ede8
Author: Joe Thornber <ejt@redhat.com>
Date:   Thu May 11 05:07:34 2017 -0400

    dm cache policy smq: only demote entries in bottom half of the clean multiqueue
    
    Heavy IO load may mean there are very few clean blocks in the cache, and
    we risk demoting entries that get hit a lot.
    
    Signed-off-by: Joe Thornber <ejt@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>

commit 072792dcdfc8d5f91a26050e5665285f50afebf5
Author: Joe Thornber <ejt@redhat.com>
Date:   Thu May 11 06:14:16 2017 -0400

    dm cache: fix incorrect 'idle_time' reset in IO tracker
    
    Some bios have no payload (eg, a FLUSH), don't reset the idle_time when
    these come in.
    
    Signed-off-by: Joe Thornber <ejt@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>

commit 508541146af18e43072e41a31aa62fac2b01aac1
Author: Yishai Hadas <yishaih@mellanox.com>
Date:   Tue Apr 25 10:39:57 2017 +0300

    net/mlx5: Use underlay QPN from the root name space
    
    Root flow table is dynamically changed by the underlying flow steering
    layer, and IPoIB/ULPs have no idea what will be the root flow table in
    the future, hence we need a dynamic infrastructure to move Underlay QPs
    with the root flow table.
    
    Fixes: b3ba51498bdd ("net/mlx5: Refactor create flow table method to accept underlay QP")
    Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
    Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
    Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
    Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

commit 5360fd473c23f18b42722cdb13a1c6ec7acd96ff
Author: Saeed Mahameed <saeedm@mellanox.com>
Date:   Thu May 4 17:53:32 2017 +0300

    net/mlx5e: IPoIB, Only support regular RQ for now
    
    IPoIB doesn't support striding RQ at the moment, for this
    we need to explicitly choose non striding RQ in IPoIB init,
    even if the HW supports it.
    
    Fixes: 8f493ffd88ea ("net/mlx5e: IPoIB, RX steering RSS RQTs and TIRs")
    Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

commit 20b6a1c78280dbeb45214c463cf9cbccb3665146
Author: Saeed Mahameed <saeedm@mellanox.com>
Date:   Tue May 9 16:40:46 2017 +0300

    net/mlx5e: Fix setup TC ndo
    
    Fail-safe support patches introduced a trivial bug,
    setup tc callback is doing a wrong check of the netdevice state,
    the fix is simply to invert the condition.
    
    Fixes: 6f9485af4020 ("net/mlx5e: Fail safe tc setup")
    Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

commit e3c19503712d6360239b19c14cded56dd63c40d7
Author: Gal Pressman <galp@mellanox.com>
Date:   Wed Apr 19 14:35:15 2017 +0300

    net/mlx5e: Fix ethtool pause support and advertise reporting
    
    Pause bit should set when RX pause is on, not TX pause.
    Also, setting Asym_Pause is incorrect, and should be turned off.
    
    Fixes: 665bc53969d7 ("net/mlx5e: Use new ethtool get/set link ksettings API")
    Signed-off-by: Gal Pressman <galp@mellanox.com>
    Cc: kernel-team@fb.com
    Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

commit b383b544f2666d67446b951a9a97af239dafed5d
Author: Gal Pressman <galp@mellanox.com>
Date:   Mon Apr 3 15:11:22 2017 +0300

    net/mlx5e: Use the correct pause values for ethtool advertising
    
    Query the operational pause from firmware (PFCC register) instead of
    always passing zeros.
    
    Fixes: 665bc53969d7 ("net/mlx5e: Use new ethtool get/set link ksettings API")
    Signed-off-by: Gal Pressman <galp@mellanox.com>
    Cc: kernel-team@fb.com
    Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

commit 67b4c889cc835a2a6e2ff4e20544a33e37e2875d
Author: Steve French <smfrench@gmail.com>
Date:   Fri May 12 20:59:10 2017 -0500

    [CIFS] Minor cleanup of xattr query function
    
    Some minor cleanup of cifs query xattr functions (will also make
    SMB3 xattr implementation cleaner as well).
    
    Signed-off-by: Steve French <steve.french@primarydata.com>

commit 4328fea77ca30ef6af938ae3f263a3d055a9c0e6
Author: Karim Eshapa <karim.eshapa@gmail.com>
Date:   Fri May 12 01:53:38 2017 +0200

    fs: cifs: transport: Use time_after for time comparison
    
    Use time_after kernel macro for time comparison
    that has safety check.
    
    Signed-off-by: Karim Eshapa <karim.eshapa@gmail.com>
    Signed-off-by: Steve French <smfrench@gmail.com>
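
For illustration, time_after() compares jiffies values safely across wrap-around; a minimal sketch with an invented helper name:

  static bool example_has_timed_out(unsigned long deadline)
  {
          return time_after(jiffies, deadline);   /* true once 'deadline' has passed */
  }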

commit cd1230070ae1c12fd34cf6a557bfa81bf9311009
Author: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Date:   Fri May 12 17:59:32 2017 +0200

    SMB2: Fix share type handling
    
    In fs/cifs/smb2pdu.h, we have:
    #define SMB2_SHARE_TYPE_DISK    0x01
    #define SMB2_SHARE_TYPE_PIPE    0x02
    #define SMB2_SHARE_TYPE_PRINT   0x03
    
    Knowing that, with the current code, the SMB2_SHARE_TYPE_PRINT case can
    never trigger and printer share would be interpreted as disk share.
    
    So, test the ShareType value for equality instead.
    
    Fixes: faaf946a7d5b ("CIFS: Add tree connect/disconnect capability for SMB2")
    Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
    Acked-by: Aurelien Aptel <aaptel@suse.com>
    Signed-off-by: Steve French <smfrench@gmail.com>
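
A minimal illustrative sketch of the corrected check; the helper and out-parameters are invented, while the SMB2_SHARE_TYPE_* defines are as quoted above. Since the share types are enumerated values (1, 2, 3), test with == rather than &.

  static void example_classify_share(u8 share_type, bool *is_pipe, bool *is_print)
  {
          *is_pipe  = (share_type == SMB2_SHARE_TYPE_PIPE);
          *is_print = (share_type == SMB2_SHARE_TYPE_PRINT);  /* can now actually match */
  }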

commit ecdcf622eb74b52cebde1387a7a1852a787d8050
Author: Joe Perches via samba-technical <samba-technical@lists.samba.org>
Date:   Sun May 7 01:31:47 2017 -0700

    cifs: cifsacl: Use a temporary ops variable to reduce code length
    
    Create an ops variable to store tcon->ses->server->ops and cache
    indirections and reduce code size a trivial bit.
    
    $ size fs/cifs/cifsacl.o*
       text    data     bss     dec     hex filename
       5338     136       8    5482    156a fs/cifs/cifsacl.o.new
       5371     136       8    5515    158b fs/cifs/cifsacl.o.old
    
    Signed-off-by: Joe Perches <joe@perches.com>
    Acked-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
    Signed-off-by: Steve French <smfrench@gmail.com>

commit 1c4d5f51a812a82de97beee24f48ed05c65ebda5
Author: Neil Horman <nhorman@tuxdriver.com>
Date:   Fri May 12 12:00:01 2017 -0400

    vmxnet3: ensure that adapter is in proper state during force_close
    
    There are several paths in vmxnet3, where settings changes cause the
    adapter to be brought down and back up (vmxnet3_set_ringparam among
    them).  Should part of the reset operation fail, these paths call
    vmxnet3_force_close, which enables all napi instances prior to calling
    dev_close (with the expectation that vmxnet3_close will then properly
    disable them again).  However, vmxnet3_force_close neglects to clear
    VMXNET3_STATE_BIT_QUIESCED prior to calling dev_close.  As a result
    vmxnet3_quiesce_dev (called from vmxnet3_close), returns early, and
    leaves all the napi instances in a enabled state while the device itself
    is closed.  If a device in this state is activated again, napi_enable
    will be called on already enabled napi_instances, leading to a BUG halt.
    
    The fix is to simply ensure that the QUIESCED bit is cleared in
    vmxnet3_force_close to allow quiescence to be completed properly on close.
    
    Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
    CC: Shrikrishna Khare <skhare@vmware.com>
    CC: "VMware, Inc." <pv-drivers@vmware.com>
    CC: "David S. Miller" <davem@davemloft.net>
    Signed-off-by: David S. Miller <davem@davemloft.net>
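
An illustrative ordering sketch of the fix described above (the struct and field usage are simplified; clear_bit() and dev_close() are the real kernel APIs): clear the QUIESCED bit before dev_close() so the close path can quiesce the device properly.

  static void example_force_close(struct vmxnet3_adapter *adapter)
  {
          clear_bit(VMXNET3_STATE_BIT_QUIESCED, &adapter->state);
          dev_close(adapter->netdev);     /* napi instances now get disabled properly */
  }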

commit 42e6cae1b35b186650f5abc7b24c20ab6986c5a0
Author: Bert Kenward <bkenward@solarflare.com>
Date:   Fri May 12 17:18:50 2017 +0100

    sfc: revert changes to NIC revision numbers
    
    The revision enum values (eg EFX_REV_HUNT_A0) form part of our API,
     and are included in ethtool. If these are inconsistent then ethtool
     will print garbage for a register dump (ethtool -d).
    
    Fixes: 5a6681e22c14 ("sfc: separate out SFC4000 ("Falcon") support into new sfc-falcon driver")
    Signed-off-by: Edward Cree <ecree@solarflare.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit b12ca80ca145cecadf841ba27cc061c510cd97ca
Author: Johan Hovold <johan@kernel.org>
Date:   Fri May 12 12:13:26 2017 +0200

    net: ch9200: add missing USB-descriptor endianness conversions
    
    Add the missing endianness conversions to a debug statement printing
    the USB device-descriptor idVendor and idProduct fields during probe.
    
    Signed-off-by: Johan Hovold <johan@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 75cf067953d5ee543b3bda90bbfcbee5e1f94ae8
Author: Johan Hovold <johan@kernel.org>
Date:   Fri May 12 12:11:13 2017 +0200

    net: irda: irda-usb: fix firmware name on big-endian hosts
    
    Add missing endianness conversion when using the USB device-descriptor
    bcdDevice field to construct a firmware file name.
    
    Fixes: 8ef80aef118e ("[IRDA]: irda-usb.c: STIR421x cleanups")
    Cc: stable <stable@vger.kernel.org>     # 2.6.18
    Cc: Nick Fedchik <nfedchik@atlantic-link.com.ua>
    Signed-off-by: Johan Hovold <johan@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 9fc3e4dc67fde953fd26adb3cea8f92597810b64
Author: Gustavo A. R. Silva <garsilva@embeddedor.com>
Date:   Thu May 11 22:11:29 2017 -0500

    net: dsa: mv88e6xxx: add default case to switch
    
    Add default case to switch in order to avoid any chance of using an
    uninitialized variable _low_, in case s->type does not match any of
    the listed case values.
    
    Addresses-Coverity-ID: 1398130
    Suggested-by: Andrew Lunn <andrew@lunn.ch>
    Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit dbc2b5e9a09e9a6664679a667ff81cff6e5f2641
Author: Xin Long <lucien.xin@gmail.com>
Date:   Fri May 12 14:39:52 2017 +0800

    sctp: fix src address selection if using secondary addresses for ipv6
    
    Commit 0ca50d12fe46 ("sctp: fix src address selection if using secondary
    addresses") has fixed a src address selection issue when using secondary
    addresses for ipv4.
    
    Now sctp ipv6 also has the similar issue. When using a secondary address,
    sctp_v6_get_dst tries to choose the saddr which has the most same bits
    with the daddr by sctp_v6_addr_match_len. It may make some cases not work
    as expected.
    
    hostA:
      [1] fd21:356b:459a:cf10::11 (eth1)
      [2] fd21:356b:459a:cf20::11 (eth2)
    
    hostB:
      [a] fd21:356b:459a:cf30::2  (eth1)
      [b] fd21:356b:459a:cf40::2  (eth2)
    
    route from hostA to hostB:
      fd21:356b:459a:cf30::/64 dev eth1  metric 1024  mtu 1500
    
    The expected path should be:
      fd21:356b:459a:cf10::11 <-> fd21:356b:459a:cf30::2
    But addr[2] matches addr[a] more bits than addr[1] does, according to
    sctp_v6_addr_match_len. It causes the path to be:
      fd21:356b:459a:cf20::11 <-> fd21:356b:459a:cf30::2
    
    This patch is to fix it with the same way as Marcelo's fix for sctp ipv4.
    As no ip_dev_find for ipv6, this patch is to use ipv6_chk_addr to check
    if the saddr is in a dev instead.
    
    Note that for backwards compatibility, it will still do the addr_match_len
    check here when no optimal is found.
    
    Reported-by: Patrick Talbert <ptalbert@redhat.com>
    Signed-off-by: Xin Long <lucien.xin@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
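
A rough illustrative sketch of the selection preference described above, not the actual sctp_v6_get_dst(): the candidate struct and match-length helper are invented, while ipv6_chk_addr() is the real API. Prefer a source address that lives on the egress device, and only fall back to the longest-prefix match.

  struct example_candidate {
          struct in6_addr a;      /* one locally bound source address */
  };

  static struct example_candidate *
  example_pick_saddr(struct net *net, struct net_device *egress_dev,
                     struct example_candidate *cand, int n,
                     const struct in6_addr *daddr)
  {
          struct example_candidate *best = NULL;
          int best_len = -1, len, i;

          for (i = 0; i < n; i++) {
                  /* prefer a source address configured on the egress device */
                  if (ipv6_chk_addr(net, &cand[i].a, egress_dev, 0))
                          return &cand[i];

                  /* otherwise remember the longest-prefix match as a fallback */
                  len = example_addr_match_len(&cand[i].a, daddr);  /* hypothetical */
                  if (len > best_len) {
                          best_len = len;
                          best = &cand[i];
                  }
          }
          return best;
  }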

commit df0c8d911abf6ba97b2c2fc3c5a12769e0b081a3
Author: Florian Fainelli <f.fainelli@gmail.com>
Date:   Thu May 11 11:24:16 2017 -0700

    net: phy: Call bus->reset() after releasing PHYs from reset
    
    The API convention makes it that a given MDIO bus reset should be able
    to access PHY devices in its reset() callback and perform additional
    MDIO accesses in order to bring the bus and PHYs in a working state.
    
    Commit 69226896ad63 ("mdio_bus: Issue GPIO RESET to PHYs.") broke that
    contract by first calling bus->reset() and then release all PHYs from
    reset using their shared GPIO line, so restore the expected
    functionality here.
    
    Fixes: 69226896ad63 ("mdio_bus: Issue GPIO RESET to PHYs.")
    Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 6832a333ed4a7cc4fcb170c045d1d96d0976fdd4
Author: David S. Miller <davem@davemloft.net>
Date:   Thu May 11 19:30:02 2017 -0700

    bpf: Handle multiple variable additions into packet pointers in verifier.
    
    We must accumulate into reg->aux_off rather than use a plain assignment.
    
    Add a test for this situation to test_align.
    
    Reported-by: Alexei Starovoitov <ast@fb.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 844cf763fba654436d3a4279b6a672c196cf1901
Author: Jon Paul Maloy <jon.maloy@ericsson.com>
Date:   Thu May 11 20:28:15 2017 +0200

    tipc: make macro tipc_wait_for_cond() smp safe
    
    The macro tipc_wait_for_cond() is embedding the macro sk_wait_event()
    to fulfil its task. The latter, in turn, is evaluating the stated
    condition outside the socket lock context. This is problematic if
    the condition is accessing non-trivial data structures which may be
    altered by incoming interrupts, as is the case with the cong_links()
    linked list, used by socket to keep track of the current set of
    congested links. We sometimes see crashes when this list is accessed
    by a condition function at the same time as a SOCK_WAKEUP interrupt
    is removing an element from the list.
    
    We fix this by expanding selected parts of sk_wait_event() into the
    outer macro, while ensuring that all evaluations of a given condition
    are performed under socket lock protection.
    
    Fixes: commit 365ad353c256 ("tipc: reduce risk of user starvation during link congestion")
    Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
    Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit ad990dbe6d3ac3af1f5f4484b1126b9fc601e98a
Author: Andy Gospodarek <andy@greyhouse.net>
Date:   Thu May 11 15:52:30 2017 -0400

    samples/bpf: run cleanup routines when receiving SIGTERM
    
    Shahid Habib noticed that when xdp1 was killed from a different console the xdp
    program was not cleaned-up properly in the kernel and it continued to forward
    traffic.
    
    Most of the applications in samples/bpf cleanup properly, but only when getting
    SIGINT.  Since kill defaults to using SIGTERM, add support to cleanup when the
    application receives either SIGINT or SIGTERM.
    
    Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
    Reported-by: Shahid Habib <shahid.habib@broadcom.com>
    Acked-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: Daniel Borkmann <daniel@iogearbox.net>
    Signed-off-by: David S. Miller <davem@davemloft.net>
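
A minimal userspace sketch of the change described above: install the same cleanup handler for both SIGINT and SIGTERM so a default kill (SIGTERM) also cleans up. The handler body is a placeholder.

  #include <signal.h>
  #include <stdlib.h>
  #include <unistd.h>

  static void example_cleanup(int sig)
  {
          /* detach the XDP program / free maps here, then exit */
          exit(0);
  }

  int main(void)
  {
          signal(SIGINT,  example_cleanup);
          signal(SIGTERM, example_cleanup);   /* plain 'kill' now cleans up too */

          /* ... attach program and poll ... */
          for (;;)
                  pause();
  }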

commit d2be3667f3769b3c60aa294ef7f2b03d1b16559c
Author: Colin Ian King <colin.king@canonical.com>
Date:   Thu May 11 19:29:40 2017 +0100

    ethernet: aquantia: remove redundant checks on error status
    
    The error status err is initialized to zero and then checked
    several times to see if it is less than zero even though it has not
    been updated.  It may seem that err should be assigned to the
    return code of the call to the various *offload_en_set calls and
    then we check for failure, however, these functions are void and
    never actually return any status.
    
    Since these error checks are redundant we can remove these
    as well as err and the error exit label err_exit.
    
    Detected by CoverityScan, CID#1398313 and CID#1398306 ("Logically
    dead code")
    
    Signed-off-by: Colin Ian King <colin.king@canonical.com>
    Reviewed-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
    Acked-by: Pavel Belous <pavel.belous@aquantia.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 69a73e744db1a57175039e3d1e6ef3913e816eb8
Author: David S. Miller <davem@davemloft.net>
Date:   Thu May 11 21:41:09 2017 -0400

    bpf: Remove commented out debugging hack in test_align.
    
    Reported-by: Alexander Alemayhu <alexander@alemayhu.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 33c16bfd5bc00bbb9823d25af46c496966e0ac0d
Author: Chopra, Manish <Manish.Chopra@cavium.com>
Date:   Thu May 11 07:12:48 2017 -0700

    qlcnic: Update version to 5.3.66
    
    Bumping up the version as couple of fixes added after 5.3.65
    
    Signed-off-by: Manish Chopra <manish.chopra@cavium.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit f9c3fe2f43343be7972226c2e736e7a68e383dc1
Author: Chopra, Manish <Manish.Chopra@cavium.com>
Date:   Thu May 11 07:12:47 2017 -0700

    qlcnic: Fix link configuration with autoneg disabled
    
    Currently driver returns error on speed configurations
    for 83xx adapter's non XGBE ports, due to this link doesn't
    come up on the ports using 1000Base-T as a connector with
    autoneg disabled. This patch fixes this with initializing
    appropriate port type based on queried module/connector
    types from hardware before any speed/autoneg configuration.
    
    Signed-off-by: Manish Chopra <manish.chopra@cavium.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit d86b5672b1adb98b4cdd6fbf0224bbfb03db6e2e
Author: Vitaly Kuznetsov <vkuznets@redhat.com>
Date:   Thu May 11 13:58:06 2017 +0200

    xen-netfront: avoid crashing on resume after a failure in talk_to_netback()
    
    Unavoidable crashes in netfront_resume() and netback_changed() after a
    previous fail in talk_to_netback() (e.g. when we fail to read MAC from
    xenstore) were discovered. The failure path in talk_to_netback() does
    unregister/free for netdev but we don't reset drvdata and we try accessing
    it after resume.
    
    Fix the bug by removing the whole xen device completely with
    device_unregister(), this guarantees we won't have any calls into netfront
    after a failure.
    
    Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit cb395b2010879a8461aa1b1c37025769708c32cf
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed May 10 21:59:28 2017 -0700

    net: sched: optimize class dumps
    
    In commit 59cc1f61f09c ("net: sched: convert qdisc linked list to
    hashtable") we missed the opportunity to considerably speed up
    tc_dump_tclass_root() if a qdisc handle is provided by user.
    
    Instead of iterating all the qdiscs, use qdisc_match_from_root()
    to directly get the one we look for.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Jiri Kosina <jkosina@suse.cz>
    Cc: Jamal Hadi Salim <jhs@mojatatu.com>
    Cc: Cong Wang <xiyou.wangcong@gmail.com>
    Cc: Jiri Pirko <jiri@resnulli.us>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit b451e5d24ba6687c6f0e7319c727a709a1846c06
Author: Yuchung Cheng <ycheng@google.com>
Date:   Wed May 10 17:01:27 2017 -0700

    tcp: avoid fragmenting peculiar skbs in SACK
    
    This patch fixes a bug in splitting an SKB during SACK
    processing. Specifically if an skb contains multiple
    packets and is only partially sacked in the higher sequences,
    tcp_match_sack_to_skb() splits the skb and marks the second fragment
    as SACKed.
    
    The current code further attempts rounding up the first fragment
    to MSS boundaries. But it misses a boundary condition when the
    rounded-up fragment size (pkt_len) is exactly the skb size.  Splitting
    such an skb is pointless and causes a kernel warning and aborts
    the SACK processing. This patch universally checks such over-split
    before calling tcp_fragment to prevent these unnecessary warnings.
    
    Fixes: adb92db857ee ("tcp: Make SACK code to split only at mss boundaries")
    Signed-off-by: Yuchung Cheng <ycheng@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
    Acked-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
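
An illustrative guard for the boundary case described above (helper name invented): if rounding pkt_len up to an MSS boundary reaches the full skb length, there is nothing to split.

  static bool example_should_split(unsigned int pkt_len, const struct sk_buff *skb)
  {
          return pkt_len < skb->len;   /* splitting at skb->len would be pointless */
  }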

commit f6ba8d33cfbb46df569972e64dbb5bb7e929bfd9
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu May 11 15:24:41 2017 -0700

    netem: fix skb_orphan_partial()
    
    I should have known that lowering skb->truesize was dangerous :/
    
    In case packets are not leaving the host via a standard Ethernet device,
    but looped back to local sockets, bad things can happen, as reported
    by Michael Madsen ( https://bugzilla.kernel.org/show_bug.cgi?id=195713 )
    
    So instead of tweaking skb->truesize, let's change skb->destructor
    and keep a reference on the owner socket via its sk_refcnt.
    
    Fixes: f2f872f9272a ("netem: Introduce skb_orphan_partial() helper")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reported-by: Michael Madsen <mkm@nabto.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
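
A rough sketch of the idea described above: instead of shrinking skb->truesize,
keep the owner socket alive by taking a reference on sk_refcnt and switching to
a lightweight destructor. The function name is illustrative and the details are
simplified from the actual fix:

  static void skb_orphan_partial_sketch(struct sk_buff *skb)
  {
          if (skb->destructor == sock_wfree) {
                  struct sock *sk = skb->sk;

                  /* Hold the owner socket so swapping the destructor is safe. */
                  if (atomic_inc_not_zero(&sk->sk_refcnt))
                          skb->destructor = sock_efree;   /* sock_efree drops the ref */
          } else {
                  skb_orphan(skb);
          }
  }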

commit d67b9cd28c1d7f82c2e5e727731ea7c89b23a0a8
Author: Daniel Borkmann <daniel@iogearbox.net>
Date:   Fri May 12 01:04:46 2017 +0200

    xdp: refine xdp api with regards to generic xdp
    
    While working on the iproute2 generic XDP frontend, I noticed that
    as of right now it's possible to have a native *and* a generic XDP
    program loaded at the same time when a driver supports native XDP.
    
    The intended model for generic XDP from b5cdae3291f7 ("net: Generic
    XDP") is, however, that only one out of the two can be present at
    once which is also indicated as such in the XDP netlink dump part.
    The main rationale for generic XDP is to ease accessibility (in
    case a driver does not yet have XDP support) and to generically
    provide a semantical model as an example for driver developers
    wanting to add XDP support. The generic XDP option for an XDP
    aware driver can still be useful for comparing and testing both
    implementations.
    
    However, the intent was never to have a second XDP processing stage
    or layer with exactly the same functionality as the first, native
    stage. The only reason for one would be a partial fallback for future
    XDP features that are not yet supported in the native implementation,
    and we probably shouldn't strive for such a fallback anyway, but instead
    encourage native feature support in the first place. Given there's
    currently no such fallback issue or use case, let's not go there yet
    if we don't need to.
    
    Therefore, change semantics for loading XDP and bail out if the
    user tries to load a generic XDP program when a native one is
    present and vice versa. Another alternative to bailing out would
    be to handle the transition from one flavor to another gracefully,
    but that would require to bring the device down, exchange both
    types of programs, and bring it up again in order to avoid a tiny
    window where a packet could hit both hooks. Given this complicates
    the logic for just a debugging feature in the native case, I went
    with the simpler variant.
    
    For the dump, remove IFLA_XDP_FLAGS that was added with b5cdae3291f7
    and reuse IFLA_XDP_ATTACHED for indicating the mode. Dumping all
    or just a subset of flags that were used for loading the XDP prog
    is suboptimal in the long run since not all flags are useful for
    dumping and if we start to reuse the same flag definitions for
    load and dump, then we'll waste bit space. What we really just
    want is to dump the mode for now.
    
    Current IFLA_XDP_ATTACHED semantics are: nothing was installed (0),
    a program is running at the native driver layer (1). Thus, add a
    mode that says that a program is running at generic XDP layer (2).
    Applications will handle this fine, in that older binaries will
    just indicate that something is attached at the XDP layer; effectively
    this is similar to the IFLA_XDP_FLAGS attr we would have had,
    modulo the redundancy.
    
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 0489df9a430e9607de8130a6bc4bf4c02f96eaf1
Author: Daniel Borkmann <daniel@iogearbox.net>
Date:   Fri May 12 01:04:45 2017 +0200

    xdp: add flag to enforce driver mode
    
    After commit b5cdae3291f7 ("net: Generic XDP") we automatically fall
    back to a generic XDP variant if the driver does not support native
    XDP. Allow for an option where the user can specify that always the
    native XDP variant should be selected and in case it's not supported
    by a driver, just bail out.
    
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 0a5539f66133a02b24f9cc43da5b84b7e6f3f436
Author: David S. Miller <davem@davemloft.net>
Date:   Thu May 11 12:00:50 2017 -0700

    bpf: Provide a linux/types.h override for bpf selftests.
    
    We do not want to use the architecture's types.h header when
    building BPF programs, which are always 64-bit.
    
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 18b3ad90b64e9893297357608abddd26170730eb
Author: David S. Miller <davem@davemloft.net>
Date:   Wed May 10 11:43:51 2017 -0700

    bpf: Add verifier test case for alignment.
    
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Acked-by: Daniel Borkmann <daniel@iogearbox.net>

commit 91045f5e5238a6d2ad21d41d7e86e2fa65f90d76
Author: David S. Miller <davem@davemloft.net>
Date:   Wed May 10 11:42:48 2017 -0700

    bpf: Add bpf_verify_program() to the library.
    
    This allows a test case to load a BPF program and unconditionally
    acquire the verifier log.
    
    It also allows specification of the strict alignment flag.
    
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Acked-by: Daniel Borkmann <daniel@iogearbox.net>

commit e07b98d9bffe410019dfcf62c3428d4a96c56a2c
Author: David S. Miller <davem@davemloft.net>
Date:   Wed May 10 11:38:07 2017 -0700

    bpf: Add strict alignment flag for BPF_PROG_LOAD.
    
    Add a new field, "prog_flags", and an initial flag value
    BPF_F_STRICT_ALIGNMENT.
    
    When set, the verifier will enforce strict pointer alignment
    regardless of the setting of CONFIG_EFFICIENT_UNALIGNED_ACCESS.
    
    The verifier, in this mode, will also use a fixed value of "2" in
    place of NET_IP_ALIGN.
    
    This facilitates test cases that will exercise and validate this part
    of the verifier even when run on architectures where alignment doesn't
    matter.
    
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Acked-by: Daniel Borkmann <daniel@iogearbox.net>
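
A minimal userspace sketch of requesting strict alignment checking through the
new prog_flags field; it assumes a 64-bit build and a trivial program type, and
omits error handling:

  #include <linux/bpf.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static int load_prog_strict(const struct bpf_insn *insns, unsigned int insn_cnt,
                              char *log, unsigned int log_size)
  {
          union bpf_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.prog_type  = BPF_PROG_TYPE_SOCKET_FILTER;
          attr.insns      = (unsigned long)insns;
          attr.insn_cnt   = insn_cnt;
          attr.license    = (unsigned long)"GPL";
          attr.log_buf    = (unsigned long)log;
          attr.log_size   = log_size;
          attr.log_level  = 2;                        /* per-insn state dumps */
          attr.prog_flags = BPF_F_STRICT_ALIGNMENT;   /* enforce strict checks */

          return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
  }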

commit c5fc9692d101d1318b0f53f9f691cd88ac029317
Author: David S. Miller <davem@davemloft.net>
Date:   Wed May 10 11:25:17 2017 -0700

    bpf: Do per-instruction state dumping in verifier when log_level > 1.
    
    If log_level > 1, do a state dump every instruction and emit it in
    a more compact way (without a leading newline).
    
    This will facilitate more sophisticated test cases which inspect the
    verifier log for register state.
    
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Acked-by: Daniel Borkmann <daniel@iogearbox.net>

commit d1174416747d790d750742d0514915deeed93acf
Author: David S. Miller <davem@davemloft.net>
Date:   Wed May 10 11:22:52 2017 -0700

    bpf: Track alignment of register values in the verifier.
    
    Currently if we add only constant values to pointers we can fully
    validate the alignment, and properly check if we need to reject the
    program on !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS architectures.
    
    However, once an unknown value is introduced we only allow byte sized
    memory accesses which is too restrictive.
    
    Add logic to track the known minimum alignment of register values,
    and propagate this state into registers containing pointers.
    
    The most common paradigm that makes use of this new logic is computing
    the transport header using the IP header length field.  For example:
    
            struct ethhdr *ep = skb->data;
            struct iphdr *iph = (struct iphdr *) (ep + 1);
            struct tcphdr *th;
     ...
            n = iph->ihl;
            th = ((void *)iph + (n * 4));
            port = th->dest;
    
    The existing code will reject the load of th->dest because it cannot
    validate that the alignment is at least 2 once "n * 4" is added to
    the packet pointer.
    
    In the new code, the register holding "n * 4" will have a reg->min_align
    value of 4, because any value multiplied by 4 will be at least 4 byte
    aligned.  (actually, the eBPF code emitted by the compiler in this case
    is most likely to use a shift left by 2, but the end result is identical)
    
    At the critical addition:
    
            th = ((void *)iph + (n * 4));
    
    The register holding 'th' will start with reg->off value of 14.  The
    pointer addition will transform that reg into something that looks like:
    
            reg->aux_off = 14
            reg->aux_off_align = 4
    
    Next, the verifier will look at the th->dest load, and it will see
    a load offset of 2, and first check:
    
            if (reg->aux_off_align % size)
    
    which will pass because aux_off_align is 4.  reg_off will be computed:
    
            reg_off = reg->off;
     ...
                    reg_off += reg->aux_off;
    
    plus we have off==2, and it will thus check:
    
            if ((NET_IP_ALIGN + reg_off + off) % size != 0)
    
    which evaluates to:
    
            if ((NET_IP_ALIGN + 14 + 2) % size != 0)
    
    On strict alignment architectures, NET_IP_ALIGN is 2, thus:
    
            if ((2 + 14 + 2) % size != 0)
    
    which passes.
    
    These pointer transformations and checks work regardless of whether
    the constant offset or the variable with known alignment is added
    first to the pointer register.
    
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Acked-by: Daniel Borkmann <daniel@iogearbox.net>
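
The arithmetic walked through above can be reproduced stand-alone; the snippet
below simply hard-codes the values from the example and is not verifier code:

  #include <stdio.h>

  int main(void)
  {
          int net_ip_align  = 2;  /* strict-alignment architectures          */
          int reg_off       = 14; /* reg->off + reg->aux_off in the example  */
          int off           = 2;  /* offsetof(struct tcphdr, dest)           */
          int size          = 2;  /* 16-bit load of th->dest                 */
          int aux_off_align = 4;  /* known minimum alignment of "n * 4"      */

          if (aux_off_align % size)
                  printf("reject: variable part may be misaligned\n");
          else if ((net_ip_align + reg_off + off) % size)
                  printf("reject: misaligned access\n");
          else
                  printf("accept: (2 + 14 + 2) %% 2 == 0\n");
          return 0;
  }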

commit d8b54110ee944de522ccd3531191f39986ec20f9
Author: Daniel Borkmann <daniel@iogearbox.net>
Date:   Thu May 11 01:53:15 2017 +0200

    bpf, arm64: fix faulty emission of map access in tail calls
    
    Shubham was recently asking on netdev why in arm64 JIT we don't multiply
    the index for accessing the tail call map by 8. That led me into testing
    out arm64 JIT wrt tail calls and it turned out I got a NULL pointer
    dereference on the tail call.
    
    The buggy access is at:
    
      prog = array->ptrs[index];
      if (prog == NULL)
          goto out;
    
      [...]
      00000060:  d2800e0a  mov x10, #0x70 // #112
      00000064:  f86a682a  ldr x10, [x1,x10]
      00000068:  f862694b  ldr x11, [x10,x2]
      0000006c:  b40000ab  cbz x11, 0x00000080
      [...]
    
    The code triggering the crash is f862694b. x1 at the time contains the
    address of the bpf array, x10 offsetof(struct bpf_array, ptrs). Meaning,
    above we load the pointer to the program at map slot 0 into x10. x10
    can then be NULL if the slot is not occupied, which we later on try to
    access with a user given offset in x2 that is the map index.
    
    Fix this by emitting the following instead:
    
      [...]
      00000060:  d2800e0a  mov x10, #0x70 // #112
      00000064:  8b0a002a  add x10, x1, x10
      00000068:  d37df04b  lsl x11, x2, #3
      0000006c:  f86b694b  ldr x11, [x10,x11]
      00000070:  b40000ab  cbz x11, 0x00000084
      [...]
    
    This basically adds the offset to ptrs to the base address of the bpf
    array we got and we later on access the map with an index * 8 offset
    relative to that. The tail call map itself is basically one large area
    with meta data at the head followed by the array of prog pointers.
    This makes tail calls work again, tested on Cavium ThunderX ARMv8.
    
    Fixes: ddb55992b04d ("arm64: bpf: implement bpf_tail_call() helper")
    Reported-by: Shubham Bansal <illusionist.neo@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 5b6cb43b4d625b04a4049d727a116edbfe5cf0f4
Author: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Date:   Wed May 10 10:28:05 2017 -0700

    net: ethernet: ti: netcp_core: return error while dma channel open issue
    
    Fix the error path for a DMA channel open failure. Also, there is no need
    to check the output for NULL if NULL is never returned.
    
    Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit ebccc7397e4a49ff64c8f44a54895de9d32fe742
Author: Ursula Braun <ubraun@linux.vnet.ibm.com>
Date:   Wed May 10 19:07:54 2017 +0200

    s390/qeth: add missing hash table initializations
    
    commit 5f78e29ceebf ("qeth: optimize IP handling in rx_mode callback")
    added new hash tables, but failed to initialize them.
    
    Fixes: 5f78e29ceebf ("qeth: optimize IP handling in rx_mode callback")
    Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
    Reviewed-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
    Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 25e2c341e7818a394da9abc403716278ee646014
Author: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Date:   Wed May 10 19:07:53 2017 +0200

    s390/qeth: avoid null pointer dereference on OSN
    
    Access card->dev only after checking whether it's valid.
    
    Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
    Reviewed-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 2d2ebb3ed0c6acfb014f98e427298673a5d07b82
Author: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Date:   Wed May 10 19:07:52 2017 +0200

    s390/qeth: unbreak OSM and OSN support
    
    commit b4d72c08b358 ("qeth: bridgeport support - basic control")
    broke the support for OSM and OSN devices as follows:
    
    As OSM and OSN are L2 only, qeth_core_probe_device() does an early
    setup by loading the l2 discipline and calling qeth_l2_probe_device().
    In this context, adding the l2-specific bridgeport sysfs attributes
    via qeth_l2_create_device_attributes() hits a BUG_ON in fs/sysfs/group.c,
    since the basic sysfs infrastructure for the device hasn't been
    established yet.
    
    Note that OSN actually has its own unique sysfs attributes
    (qeth_osn_devtype), so the additional attributes shouldn't be created
    at all.
    For OSM, add a new qeth_l2_devtype that contains all the common
    and l2-specific sysfs attributes.
    When qeth_core_probe_device() does early setup for OSM or OSN, assign
    the corresponding devtype so that the ccwgroup probe code creates the
    full set of sysfs attributes.
    This allows us to skip qeth_l2_create_device_attributes() in case
    of an early setup.
    
    Any device that can't do early setup will initially have only the
    generic sysfs attributes, and when it's probed later
    qeth_l2_probe_device() adds the l2-specific attributes.
    
    If an early-setup device is removed (by calling ccwgroup_ungroup()),
    device_unregister() will - using the devtype - delete the
    l2-specific attributes before qeth_l2_remove_device() is called.
    So make sure to not remove them twice.
    
    What complicates the issue is that qeth_l2_probe_device() and
    qeth_l2_remove_device() are also called on a device when its
    layer2 attribute changes (i.e. its layer mode is switched).
    For early-setup devices this wouldn't work properly - we wouldn't
    remove the l2-specific attributes when switching to L3.
    But switching the layer mode doesn't actually make any sense;
    we already decided that the device can only operate in L2!
    So just refuse to switch the layer mode on such devices. Note that
    OSN doesn't have a layer2 attribute, so we only need to special-case
    OSM.
    
    Based on an initial patch by Ursula Braun.
    
    Fixes: b4d72c08b358 ("qeth: bridgeport support - basic control")
    Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 9111e7880ccf419548c7b0887df020b08eadb075
Author: Ursula Braun <ubraun@linux.vnet.ibm.com>
Date:   Wed May 10 19:07:51 2017 +0200

    s390/qeth: handle sysfs error during initialization
    
    When setting up the device from within the layer discipline's
    probe routine, creating the layer-specific sysfs attributes can fail.
    Report this error back to the caller, and handle it by
    releasing the layer discipline.
    
    Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
    [jwi: updated commit msg, moved an OSN change to a subsequent patch]
    Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit b60161668199ac62011c024adc9e66713b9554e7
Author: Jon Mason <jon.mason@broadcom.com>
Date:   Wed May 10 11:20:27 2017 -0400

    mdio: mux: Correct mdio_mux_init error path issues
    
    There is a potential unnecessary refcount decrement on error path of
    put_device(&pb->mii_bus->dev), as it is possible to avoid the
    of_mdio_find_bus() call if mux_bus is specified by the calling function.
    
    The same put_device() is not called in the error path if the
    devm_kzalloc of pb fails.  This caused the variable used in the
    put_device() to be changed, as the pb pointer was obviously not set up.
    
    There is an unnecessary of_node_get() on child_bus_node if the
    of_mdiobus_register() is successful, as the
    for_each_available_child_of_node() automatically increments this.
    Thus the refcount on this node will always be +1 more than it should be.
    
    There is no of_node_put() on child_bus_node if the of_mdiobus_register()
    call fails.
    
    Finally, a devm_kfree() of pb is missing in the error path.  While this
    might not be technically necessary, it was present in other parts of the
    function, so I am adding it where necessary to make it uniform.
    
    Signed-off-by: Jon Mason <jon.mason@broadcom.com>
    Fixes: f20e6657a875 ("mdio: mux: Enhanced MDIO mux framework for integrated multiplexers")
    Fixes: 0ca2997d1452 ("netdev/of/phy: Add MDIO bus multiplexer support.")
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 83eaddab4378db256d00d295bda6ca997cd13a52
Author: WANG Cong <xiyou.wangcong@gmail.com>
Date:   Tue May 9 16:59:54 2017 -0700

    ipv6/dccp: do not inherit ipv6_mc_list from parent
    
    Like commit 657831ffc38e ("dccp/tcp: do not inherit mc_list from parent")
    we should clear ipv6_mc_list etc. for IPv6 sockets too.
    
    Cc: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
    Acked-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 0fe20fafd1791f993806d417048213ec57b81045
Author: Colin Ian King <colin.king@canonical.com>
Date:   Tue May 9 17:19:42 2017 +0100

    netxen_nic: set rcode to the return status from the call to netxen_issue_cmd
    
    Currently rcode is being initialized to NX_RCODE_SUCCESS and later it
    is checked to see if it is not NX_RCODE_SUCCESS which is never true. It
    appears that there is an unintentional missing assignment of rcode from
    the return of the call to netxen_issue_cmd() that was dropped in
    an earlier fix, so add it in.
    
    Detected by CoverityScan, CID#401900 ("Logically dead code")
    
    Fixes: 2dcd5d95ad6b2 ("netxen_nic: fix cdrp race condition")
    Signed-off-by: Colin Ian King <colin.king@canonical.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 8d66c30b12ed3cb533696dea8b9a9eadd5da426a
Author: Stefan Wahren <stefan.wahren@i2se.com>
Date:   Tue May 9 15:40:38 2017 +0200

    net: qca_spi: Fix alignment issues in rx path
    
    The qca_spi driver causes alignment issues on ARM devices.
    So fix this by using netdev_alloc_skb_ip_align().
    
    Signed-off-by: Stefan Wahren <stefan.wahren@i2se.com>
    Fixes: 291ab06ecf67 ("net: qualcomm: new Ethernet over SPI driver for QCA7000")
    Signed-off-by: David S. Miller <davem@davemloft.net>

commit 1a4a5bf52a4adb477adb075e5afce925824ad132
Author: Gao Feng <gfree.wind@vip.163.com>
Date:   Tue May 9 18:27:33 2017 +0800

    driver: vrf: Fix one possible use-after-free issue
    
    The current code only handles the case where the skb is dropped; it can
    hit a use-after-free when NF_HOOK returns 0, which means the skb was
    stolen by a netfilter rule or hook.

    When a netfilter rule or hook steals the skb and returns NF_STOLEN, the
    skb is taken over by that rule and no other module may ever touch it
    again. The rule may queue the skb or free it directly.

    Now use nf_hook instead of NF_HOOK to get the netfilter verdict, and
    check the return value of nf_hook. Only when it equals 1 may the skb
    proceed; otherwise reset the skb pointer to NULL.

    By the way, because vrf_rcv_finish is an empty function, there is no need
    to invoke it even when nf_hook returns 1. But vrf_rcv_finish does need to
    be modified to deal with the NF_STOLEN case.
    
    There are two cases when skb is stolen.
    1. The skb is stolen and freed directly.
       There is nothing we need to do, and vrf_rcv_finish isn't invoked.
    2. The skb is queued and reinjected again.
       vrf_rcv_finish is invoked as the okfn, so the skb needs to be
       freed there.
    
    Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
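
A minimal sketch of the pattern described above: call nf_hook() directly and
only let the skb continue when the hook returns 1 (accepted); on drop or steal
the caller must forget the pointer. The helper name is illustrative:

  static struct sk_buff *vrf_rcv_nfhook_sketch(u8 pf, unsigned int hook,
                                               struct sk_buff *skb,
                                               struct net_device *dev)
  {
          struct net *net = dev_net(dev);

          if (nf_hook(pf, hook, net, NULL, skb, dev, NULL, vrf_rcv_finish) != 1)
                  skb = NULL;     /* dropped or stolen: never touch it again */

          return skb;
  }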

commit efc0c21c9ea786d6f019d7df7b4e3932f3578d90
Author: Elena Reshetova <elena.reshetova@intel.com>
Date:   Thu Mar 2 12:23:45 2017 +0100

    s390: convert debug_info.ref_count from atomic_t to refcount_t
    
    refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This helps avoid accidental refcount
    overflows that might lead to use-after-free situations.
    
    Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
    Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: David Windsor <dwindsor@gmail.com>
    Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
    Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
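
A generic before/after sketch of the conversion described above (not the s390
debug code itself); unlike atomic_t, the refcount API saturates on overflow
instead of wrapping:

  #include <linux/refcount.h>
  #include <linux/slab.h>

  struct obj {
          refcount_t ref_count;                       /* was: atomic_t ref_count */
  };

  static void obj_init(struct obj *o)
  {
          refcount_set(&o->ref_count, 1);             /* was: atomic_set(..., 1) */
  }

  static void obj_get(struct obj *o)
  {
          refcount_inc(&o->ref_count);                /* was: atomic_inc() */
  }

  static void obj_put(struct obj *o)
  {
          if (refcount_dec_and_test(&o->ref_count))   /* was: atomic_dec_and_test() */
                  kfree(o);
  }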

commit de1892b887eeb85ce458a93979c2108e6f329618
Author: Steve French <smfrench@gmail.com>
Date:   Thu May 4 07:54:04 2017 -0500

    Don't delay freeing mids when blocked on slow socket write of request
    
    When processing responses, and in particular when freeing mids (DeleteMidQEntry),
    which is very important since it also frees the associated buffers (cifs_buf_release),
    we can block for a long time if writes to the socket are slow due to low memory
    or networking issues.

    We can block in send (smb request) waiting for memory, and be blocked in processing
    responses (which could free memory if we let it) - since they both grab the
    server->srv_mutex.

    In practice, in the DeleteMidQEntry case there is no reason we need to
    grab the srv_mutex, so remove it around DeleteMidQEntry; this allows
    us to free memory faster.
    
    Signed-off-by: Steve French <steve.french@primarydata.com>
    Acked-by: Pavel Shilovsky <pshilov@microsoft.com>

commit 560d388950ceda5e7c7cdef7f3d9a8ff297bbf9d
Author: Rabin Vincent <rabinv@axis.com>
Date:   Wed May 3 17:17:21 2017 +0200

    CIFS: silence lockdep splat in cifs_relock_file()
    
    cifs_relock_file() can perform a down_write() on the inode's lock_sem even
    though it was already performed in cifs_strict_readv().  Lockdep complains
    about this.  AFAICS, there is no problem here, and lockdep just needs to be
    told that this nesting is OK.
    
     =============================================
     [ INFO: possible recursive locking detected ]
     4.11.0+ #20 Not tainted
     ---------------------------------------------
     cat/701 is trying to acquire lock:
      (&cifsi->lock_sem){++++.+}, at: cifs_reopen_file+0x7a7/0xc00
    
     but task is already holding lock:
      (&cifsi->lock_sem){++++.+}, at: cifs_strict_readv+0x177/0x310
    
     other info that might help us debug this:
      Possible unsafe locking scenario:
    
            CPU0
            ----
       lock(&cifsi->lock_sem);
       lock(&cifsi->lock_sem);
    
      *** DEADLOCK ***
    
      May be due to missing lock nesting notation
    
     1 lock held by cat/701:
      #0:  (&cifsi->lock_sem){++++.+}, at: cifs_strict_readv+0x177/0x310
    
     stack backtrace:
     CPU: 0 PID: 701 Comm: cat Not tainted 4.11.0+ #20
     Call Trace:
      dump_stack+0x85/0xc2
      __lock_acquire+0x17dd/0x2260
      ? trace_hardirqs_on_thunk+0x1a/0x1c
      ? preempt_schedule_irq+0x6b/0x80
      lock_acquire+0xcc/0x260
      ? lock_acquire+0xcc/0x260
      ? cifs_reopen_file+0x7a7/0xc00
      down_read+0x2d/0x70
      ? cifs_reopen_file+0x7a7/0xc00
      cifs_reopen_file+0x7a7/0xc00
      ? printk+0x43/0x4b
      cifs_readpage_worker+0x327/0x8a0
      cifs_readpage+0x8c/0x2a0
      generic_file_read_iter+0x692/0xd00
      cifs_strict_readv+0x29f/0x310
      generic_file_splice_read+0x11c/0x1c0
      do_splice_to+0xa5/0xc0
      splice_direct_to_actor+0xfa/0x350
      ? generic_pipe_buf_nosteal+0x10/0x10
      do_splice_direct+0xb5/0xe0
      do_sendfile+0x278/0x3a0
      SyS_sendfile64+0xc4/0xe0
      entry_SYSCALL_64_fastpath+0x1f/0xbe
    
    Signed-off-by: Rabin Vincent <rabinv@axis.com>
    Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
    Signed-off-by: Steve French <smfrench@gmail.com>

commit d04a4c76f71dd5335f8e499b59617382d84e2b8d
Author: Heiko Carstens <heiko.carstens@de.ibm.com>
Date:   Thu May 4 09:42:22 2017 +0200

    s390: move _text symbol to address higher than zero
    
    The perf tool assumes that kernel symbols are never present at address
    zero. In fact it assumes that if functions that map symbols to addresses
    return zero, the symbol was not found.
    
    Given that s390's _text symbol has historically been located at address
    zero, this yields at least a couple of false errors and warnings in one of
    perf's test cases about missing symbols ("perf test 1").
    
    To fix this simply move the _text symbol to address 0x200, just behind
    the initial psw and channel program located at the beginning of the
    kernel image. This is now hard coded within the linker script.
    
    I tried a nicer solution which moves the initial psw and channel
    program into its own section. However, that would move the symbols
    within the "real" head.text section to different addresses, since the
    ".org" statements within head.S are relative to the head.text
    section. If there is a new section in front, everything else will be
    moved…
fengguang pushed a commit to 0day-ci/linux that referenced this pull request Jun 12, 2017
Wire up the existing arm64 support for SMBIOS tables (aka DMI) for ARM as
well, by moving the arm64 init code to drivers/firmware/efi/arm-runtime.c
(which is shared between ARM and arm64), and adding a asm/dmi.h header to
ARM that defines the mapping routines for the firmware tables.

This allows userspace to access these tables to discover system information
exposed by the firmware. It also sets the hardware name used in crash
dumps, e.g.:

  Unable to handle kernel NULL pointer dereference at virtual address 00000000
  pgd = ed3c0000
  [00000000] *pgd=bf1f3835
  Internal error: Oops: 817 [#1] SMP THUMB2
  Modules linked in:
  CPU: 0 PID: 759 Comm: bash Not tainted 4.10.0-09601-g0e8f38792120-dirty #112
  Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
  ^^^

NOTE: This does *NOT* enable or encourage the use of DMI quirks, i.e., the
      practice of identifying the platform via DMI to decide whether
      certain workarounds for buggy hardware and/or firmware need to be
      enabled. This would require the DMI subsystem to be enabled much
      earlier than we do on ARM, which is non-trivial.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Russell King <rmk+kernel@armlinux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-efi@vger.kernel.org
Link: http://lkml.kernel.org/r/20170602135207.21708-14-ard.biesheuvel@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
wzyy2 pushed a commit to wzyy2/linux that referenced this pull request Jun 19, 2017
(1) use the cpu id that bl31 delivers;
(2) sp_el0 should point to a kernel address in EL1 mode.

On ARM64 the kernel uses sp_el0 to store current_thread_info(), and
we see a problem: when a fiq occurs, the cpu is in EL1 mode but sp_el0
points to a userspace address. If we read 'current_thread_info()->cpu'
(or anything else) at that moment, it leads to an error.

We find the above situation happens when saving/restoring cpu context
between system mode and user mode under heavy load.
In 'ret_fast_syscall()', for example, the kernel restores the user-mode
context, but the fiq occurs before the 'eret' instruction, which causes
the above situation.

Assembly code:

ffffff80080826c8 <ret_fast_syscall>:

...skipping...

ffffff80080826fc:       d503201f        nop
ffffff8008082700:       d5384100        mrs     x0, sp_el0
ffffff8008082704:       f9400c00        ldr     x0, [x0,#24]
ffffff8008082708:       d5182000        msr     ttbr0_el1, x0
ffffff800808270c:       d5033fdf        isb
ffffff8008082710:       f9407ff7        ldr     x23, [sp,#248]
ffffff8008082714:       d5184117        msr     sp_el0, x23
ffffff8008082718:       d503201f        nop
ffffff800808271c:       d503201f        nop
ffffff8008082720:       d5184035        msr     elr_el1, x21
ffffff8008082724:       d5184016        msr     spsr_el1, x22
ffffff8008082728:       a94007e0        ldp     x0, x1, [sp]
ffffff800808272c:       a9410fe2        ldp     x2, x3, [sp,#16]
ffffff8008082730:       a94217e4        ldp     x4, x5, [sp,#32]
ffffff8008082734:       a9431fe6        ldp     x6, x7, [sp,#48]
ffffff8008082738:       a94427e8        ldp     x8, x9, [sp,#64]
ffffff800808273c:       a9452fea        ldp     x10, x11, [sp,#80]
ffffff8008082740:       a94637ec        ldp     x12, x13, [sp,#96]
ffffff8008082744:       a9473fee        ldp     x14, x15, [sp,#112]
ffffff8008082748:       a94847f0        ldp     x16, x17, [sp,#128]
ffffff800808274c:       a9494ff2        ldp     x18, x19, [sp,#144]
ffffff8008082750:       a94a57f4        ldp     x20, x21, [sp,#160]
ffffff8008082754:       a94b5ff6        ldp     x22, x23, [sp,#176]
ffffff8008082758:       a94c67f8        ldp     x24, x25, [sp,#192]
ffffff800808275c:       a94d6ffa        ldp     x26, x27, [sp,#208]
ffffff8008082760:       a94e77fc        ldp     x28, x29, [sp,#224]
ffffff8008082764:       f9407bfe        ldr     x30, [sp,#240]
ffffff8008082768:       9104c3ff        add     sp, sp, #0x130
ffffff800808276c:       d69f03e0        eret

Change-Id: I071e899f8a407764e166ca0403199c9d87d6ce78
Signed-off-by: chenjh <chenjh@rock-chips.com>
commodo pushed a commit to commodo/linux that referenced this pull request Jun 19, 2018
…lise-support

Add adrv9009 talise support
fengguang pushed a commit to 0day-ci/linux that referenced this pull request Jun 22, 2018
On fast hosts or malicious bots, we trigger a DCCP_BUG() which
seems excessive.

syzbot reported :

BUG: delta (-6195) <= 0 at net/dccp/ccids/ccid3.c:628/ccid3_hc_rx_send_feedback()
CPU: 1 PID: 18 Comm: ksoftirqd/1 Not tainted 4.18.0-rc1+ #112
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
 ccid3_hc_rx_send_feedback net/dccp/ccids/ccid3.c:628 [inline]
 ccid3_hc_rx_packet_recv.cold.16+0x38/0x71 net/dccp/ccids/ccid3.c:793
 ccid_hc_rx_packet_recv net/dccp/ccid.h:185 [inline]
 dccp_deliver_input_to_ccids+0xf0/0x280 net/dccp/input.c:180
 dccp_rcv_established+0x87/0xb0 net/dccp/input.c:378
 dccp_v4_do_rcv+0x153/0x180 net/dccp/ipv4.c:654
 sk_backlog_rcv include/net/sock.h:914 [inline]
 __sk_receive_skb+0x3ba/0xd80 net/core/sock.c:517
 dccp_v4_rcv+0x10f9/0x1f58 net/dccp/ipv4.c:875
 ip_local_deliver_finish+0x2eb/0xda0 net/ipv4/ip_input.c:215
 NF_HOOK include/linux/netfilter.h:287 [inline]
 ip_local_deliver+0x1e9/0x750 net/ipv4/ip_input.c:256
 dst_input include/net/dst.h:450 [inline]
 ip_rcv_finish+0x823/0x2220 net/ipv4/ip_input.c:396
 NF_HOOK include/linux/netfilter.h:287 [inline]
 ip_rcv+0xa18/0x1284 net/ipv4/ip_input.c:492
 __netif_receive_skb_core+0x2488/0x3680 net/core/dev.c:4628
 __netif_receive_skb+0x2c/0x1e0 net/core/dev.c:4693
 process_backlog+0x219/0x760 net/core/dev.c:5373
 napi_poll net/core/dev.c:5771 [inline]
 net_rx_action+0x7da/0x1980 net/core/dev.c:5837
 __do_softirq+0x2e8/0xb17 kernel/softirq.c:284
 run_ksoftirqd+0x86/0x100 kernel/softirq.c:645
 smpboot_thread_fn+0x417/0x870 kernel/smpboot.c:164
 kthread+0x345/0x410 kernel/kthread.c:240
 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Cc: dccp@vger.kernel.org
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Sep 15, 2022
WARNING: line length of 108 exceeds 100 columns
#97: FILE: tools/testing/selftests/vm/mremap_test.c:136:
+	char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

WARNING: Missing a blank line after declarations
#98: FILE: tools/testing/selftests/vm/mremap_test.c:137:
+	char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	munmap(start + page_size, page_size);

ERROR: space required before the open parenthesis '('
#107: FILE: tools/testing/selftests/vm/mremap_test.c:146:
+	while(getline(&line, &len, fp) != -1) {

ERROR: space required after that ',' (ctx:VxV)
#108: FILE: tools/testing/selftests/vm/mremap_test.c:147:
+		char *first = strtok(line,"- ");
 		                         ^

ERROR: space required after that ',' (ctx:VxV)
#110: FILE: tools/testing/selftests/vm/mremap_test.c:149:
+		char *second = strtok(NULL,"- ");
 		                          ^

WARNING: Missing a blank line after declarations
#112: FILE: tools/testing/selftests/vm/mremap_test.c:151:
+		void *second_val = (void *) strtol(second, NULL, 16);
+		if (first_val == start && second_val == start + 3 * page_size) {

total: 3 errors, 3 warnings, 113 lines checked

NOTE: For some of the reported defects, checkpatch may be able to
      mechanically convert to the typical style using --fix or --fix-inplace.

./patches/mm-add-merging-after-mremap-resize.patch has style problems, please review.

NOTE: If any of the errors are false positives, please report
      them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Jakub Matěna <matenajakub@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
staging-kernelci-org pushed a commit to kernelci/linux that referenced this pull request Sep 16, 2022
WARNING: line length of 108 exceeds 100 columns
torvalds#97: FILE: tools/testing/selftests/vm/mremap_test.c:136:
+	char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

WARNING: Missing a blank line after declarations
torvalds#98: FILE: tools/testing/selftests/vm/mremap_test.c:137:
+	char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	munmap(start + page_size, page_size);

ERROR: space required before the open parenthesis '('
torvalds#107: FILE: tools/testing/selftests/vm/mremap_test.c:146:
+	while(getline(&line, &len, fp) != -1) {

ERROR: space required after that ',' (ctx:VxV)
torvalds#108: FILE: tools/testing/selftests/vm/mremap_test.c:147:
+		char *first = strtok(line,"- ");
 		                         ^

ERROR: space required after that ',' (ctx:VxV)
torvalds#110: FILE: tools/testing/selftests/vm/mremap_test.c:149:
+		char *second = strtok(NULL,"- ");
 		                          ^

WARNING: Missing a blank line after declarations
torvalds#112: FILE: tools/testing/selftests/vm/mremap_test.c:151:
+		void *second_val = (void *) strtol(second, NULL, 16);
+		if (first_val == start && second_val == start + 3 * page_size) {

total: 3 errors, 3 warnings, 113 lines checked

NOTE: For some of the reported defects, checkpatch may be able to
      mechanically convert to the typical style using --fix or --fix-inplace.

./patches/mm-add-merging-after-mremap-resize.patch has style problems, please review.

NOTE: If any of the errors are false positives, please report
      them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Jakub Matěna <matenajakub@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Sep 16, 2022
WARNING: line length of 108 exceeds 100 columns
torvalds#97: FILE: tools/testing/selftests/vm/mremap_test.c:136:
+	char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

WARNING: Missing a blank line after declarations
torvalds#98: FILE: tools/testing/selftests/vm/mremap_test.c:137:
+	char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	munmap(start + page_size, page_size);

ERROR: space required before the open parenthesis '('
torvalds#107: FILE: tools/testing/selftests/vm/mremap_test.c:146:
+	while(getline(&line, &len, fp) != -1) {

ERROR: space required after that ',' (ctx:VxV)
torvalds#108: FILE: tools/testing/selftests/vm/mremap_test.c:147:
+		char *first = strtok(line,"- ");
 		                         ^

ERROR: space required after that ',' (ctx:VxV)
torvalds#110: FILE: tools/testing/selftests/vm/mremap_test.c:149:
+		char *second = strtok(NULL,"- ");
 		                          ^

WARNING: Missing a blank line after declarations
torvalds#112: FILE: tools/testing/selftests/vm/mremap_test.c:151:
+		void *second_val = (void *) strtol(second, NULL, 16);
+		if (first_val == start && second_val == start + 3 * page_size) {

total: 3 errors, 3 warnings, 113 lines checked

NOTE: For some of the reported defects, checkpatch may be able to
      mechanically convert to the typical style using --fix or --fix-inplace.

./patches/mm-add-merging-after-mremap-resize.patch has style problems, please review.

NOTE: If any of the errors are false positives, please report
      them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Jakub Matěna <matenajakub@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Sep 19, 2022
WARNING: line length of 108 exceeds 100 columns
torvalds#97: FILE: tools/testing/selftests/vm/mremap_test.c:136:
+	char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

WARNING: Missing a blank line after declarations
torvalds#98: FILE: tools/testing/selftests/vm/mremap_test.c:137:
+	char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	munmap(start + page_size, page_size);

ERROR: space required before the open parenthesis '('
torvalds#107: FILE: tools/testing/selftests/vm/mremap_test.c:146:
+	while(getline(&line, &len, fp) != -1) {

ERROR: space required after that ',' (ctx:VxV)
torvalds#108: FILE: tools/testing/selftests/vm/mremap_test.c:147:
+		char *first = strtok(line,"- ");
 		                         ^

ERROR: space required after that ',' (ctx:VxV)
torvalds#110: FILE: tools/testing/selftests/vm/mremap_test.c:149:
+		char *second = strtok(NULL,"- ");
 		                          ^

WARNING: Missing a blank line after declarations
torvalds#112: FILE: tools/testing/selftests/vm/mremap_test.c:151:
+		void *second_val = (void *) strtol(second, NULL, 16);
+		if (first_val == start && second_val == start + 3 * page_size) {

total: 3 errors, 3 warnings, 113 lines checked

NOTE: For some of the reported defects, checkpatch may be able to
      mechanically convert to the typical style using --fix or --fix-inplace.

./patches/mm-add-merging-after-mremap-resize.patch has style problems, please review.

NOTE: If any of the errors are false positives, please report
      them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Jakub Matěna <matenajakub@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Sep 22, 2022
WARNING: line length of 108 exceeds 100 columns
torvalds#97: FILE: tools/testing/selftests/vm/mremap_test.c:136:
+	char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

WARNING: Missing a blank line after declarations
torvalds#98: FILE: tools/testing/selftests/vm/mremap_test.c:137:
+	char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	munmap(start + page_size, page_size);

ERROR: space required before the open parenthesis '('
torvalds#107: FILE: tools/testing/selftests/vm/mremap_test.c:146:
+	while(getline(&line, &len, fp) != -1) {

ERROR: space required after that ',' (ctx:VxV)
torvalds#108: FILE: tools/testing/selftests/vm/mremap_test.c:147:
+		char *first = strtok(line,"- ");
 		                         ^

ERROR: space required after that ',' (ctx:VxV)
torvalds#110: FILE: tools/testing/selftests/vm/mremap_test.c:149:
+		char *second = strtok(NULL,"- ");
 		                          ^

WARNING: Missing a blank line after declarations
torvalds#112: FILE: tools/testing/selftests/vm/mremap_test.c:151:
+		void *second_val = (void *) strtol(second, NULL, 16);
+		if (first_val == start && second_val == start + 3 * page_size) {

total: 3 errors, 3 warnings, 113 lines checked

NOTE: For some of the reported defects, checkpatch may be able to
      mechanically convert to the typical style using --fix or --fix-inplace.

./patches/mm-add-merging-after-mremap-resize.patch has style problems, please review.

NOTE: If any of the errors are false positives, please report
      them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Jakub Matěna <matenajakub@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Sep 25, 2022
WARNING: line length of 108 exceeds 100 columns
torvalds#97: FILE: tools/testing/selftests/vm/mremap_test.c:136:
+	char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

WARNING: Missing a blank line after declarations
torvalds#98: FILE: tools/testing/selftests/vm/mremap_test.c:137:
+	char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	munmap(start + page_size, page_size);

ERROR: space required before the open parenthesis '('
torvalds#107: FILE: tools/testing/selftests/vm/mremap_test.c:146:
+	while(getline(&line, &len, fp) != -1) {

ERROR: space required after that ',' (ctx:VxV)
torvalds#108: FILE: tools/testing/selftests/vm/mremap_test.c:147:
+		char *first = strtok(line,"- ");
 		                         ^

ERROR: space required after that ',' (ctx:VxV)
torvalds#110: FILE: tools/testing/selftests/vm/mremap_test.c:149:
+		char *second = strtok(NULL,"- ");
 		                          ^

WARNING: Missing a blank line after declarations
torvalds#112: FILE: tools/testing/selftests/vm/mremap_test.c:151:
+		void *second_val = (void *) strtol(second, NULL, 16);
+		if (first_val == start && second_val == start + 3 * page_size) {

total: 3 errors, 3 warnings, 113 lines checked

NOTE: For some of the reported defects, checkpatch may be able to
      mechanically convert to the typical style using --fix or --fix-inplace.

./patches/mm-add-merging-after-mremap-resize.patch has style problems, please review.

NOTE: If any of the errors are false positives, please report
      them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Jakub Matěna <matenajakub@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Sep 27, 2022
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Oct 26, 2022
zsmalloc has 255 size classes. Size classes contain a number of zspages,
which store objects of the same size. A zspage can consist of up to four
physical pages. The optimal zspage size is calculated for each size class
during zsmalloc pool creation.

As a reasonable optimization, zsmalloc merges size classes that have
similar characteristics: number of pages per zspage and number of
objects zspage can store.

For example, let's look at the following size classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
..
   94  1536           0            0             0          0          0                3        0
  100  1632           0            0             0          0          0                2        0
..

Size classes torvalds#95-99 are merged with size class torvalds#100. That is, each time
we store an object of, say, 1568 bytes, instead of using class torvalds#96 we
end up storing it in size class torvalds#100. Class torvalds#100 is for objects of
1632 bytes in size, hence every 1568-byte object wastes 1632 - 1568 = 64 bytes.
Class torvalds#100 zspages consist of 2 physical pages and can hold 5 objects.
When we need to store, say, 13 objects of size 1568, we end up allocating
three zspages; in other words, 6 physical pages.

However, if we look closer at size class torvalds#96 (which should hold objects
of size 1568 bytes) and trace get_pages_per_zspage():

    pages per zspage      wasted bytes     used%
           1                  960           76
           2                  352           95
           3                 1312           89
           4                  704           95
           5                   96           99

We'd notice that the optimal zspage configuration for this class is when it
consists of 5 physical pages, but currently we never let zspages consist of
more than 4 pages. A 5-page class torvalds#96 configuration would
store 13 objects of size 1568 in a single zspage, allocating 5 physical
pages, as opposed to 6 physical pages that class torvalds#100 will allocate.
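
The table above can be reproduced with a few lines of user-space C. This is only a sketch of the packing arithmetic, not the kernel code; it assumes a 4096-byte page and that objects may span page boundaries within a zspage:

#include <stdio.h>

#define PAGE_SIZE	4096

int main(void)
{
	const int size = 1568;	/* the 1568-byte class discussed above */
	int pages;

	for (pages = 1; pages <= 5; pages++) {
		int zspage = pages * PAGE_SIZE;
		int objs = zspage / size;		/* objects per zspage */
		int wasted = zspage - objs * size;	/* unusable tail */

		printf("%d pages: %2d objects, %4d wasted, %d%% used\n",
		       pages, objs, wasted, (zspage - wasted) * 100 / zspage);
	}
	return 0;
}

For 5 pages this gives 96 wasted bytes and 99% utilization, matching the last row of the trace above.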

A higher order zspage for class torvalds#96 also changes its key characteristics:
pages per zspage and objects per zspage. As a result, classes torvalds#96 and torvalds#100
are not merged anymore, which gives us a more compact zsmalloc.

Of course the described effect does not apply only to size classes torvalds#96 and
torvalds#100. We still merge classes, but less often. In other words, classes are
grouped in a more compact way, which decreases memory wastage:

zspage order               # unique size classes
     2                                69
     3                               123
     4                               191

Let's take a closer look at the bottom of /sys/kernel/debug/zsmalloc/zram0/classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  254  4096           0            0             0          0          0                1        0
...

For exactly the same reason - a maximum of 4 pages per zspage - the last non-huge
size class is torvalds#202, which stores objects of size 3264 bytes. Any object
larger than 3264 bytes, hence, is considered to be huge and lands in size
class torvalds#254, which uses a whole physical page to store every object. To put
it slightly differently - objects in huge classes don't share physical pages.
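
The 3264-byte boundary is just packing arithmetic: with at most 4 pages per zspage, class torvalds#202 can still place 5 objects into a 4-page zspage (5 x 3264 = 16320 <= 16384), i.e. more than one object per page, whereas for the next class size up (3280 bytes, going by the 16-byte steps visible in the tables), no zspage of up to 4 pages ever fits more than one object per page, so everything above 3264 bytes falls into the huge class.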

3264 bytes is too low of a watermark and we have too many huge classes:
classes from torvalds#203 to torvalds#254. Similarly to size class torvalds#96 above, higher order
zspages change key characteristics for some of those huge size classes and
thus those classes become normal classes, where stored objects share physical
pages.

Hence yet another consequence of higher order zspages: they move the huge
size class watermark up, leave us with fewer huge classes and let us store
large objects in a more compact way.

For order 3, huge class watermark becomes 3632 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  211  3408           0            0             0          0          0                5        0
  217  3504           0            0             0          0          0                6        0
  222  3584           0            0             0          0          0                7        0
  225  3632           0            0             0          0          0                8        0
  254  4096           0            0             0          0          0                1        0
...

For order 4, huge class watermark becomes 3840 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  206  3328           0            0             0          0          0               13        0
  207  3344           0            0             0          0          0                9        0
  208  3360           0            0             0          0          0               14        0
  211  3408           0            0             0          0          0                5        0
  212  3424           0            0             0          0          0               16        0
  214  3456           0            0             0          0          0               11        0
  217  3504           0            0             0          0          0                6        0
  219  3536           0            0             0          0          0               13        0
  222  3584           0            0             0          0          0                7        0
  223  3600           0            0             0          0          0               15        0
  225  3632           0            0             0          0          0                8        0
  228  3680           0            0             0          0          0                9        0
  230  3712           0            0             0          0          0               10        0
  232  3744           0            0             0          0          0               11        0
  234  3776           0            0             0          0          0               12        0
  235  3792           0            0             0          0          0               13        0
  236  3808           0            0             0          0          0               14        0
  238  3840           0            0             0          0          0               15        0
  254  4096           0            0             0          0          0                1        0
...
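
The new watermarks follow from the same arithmetic. At order 3 the largest zspage is 8 x 4096 = 32768 bytes, and 3632 is the largest class size that still beats one object per page: 9 x 3632 = 32688 <= 32768, which is exactly the 8-page, 9-object configuration of class torvalds#225 above, while 3648-byte objects only fit 8 to an 8-page zspage and therefore stay huge. At order 4 the limit is 16 x 4096 = 65536 bytes and the watermark moves to 3840: 16 x 3840 = 61440 fills the 15-page class torvalds#238 exactly, whereas 3856-byte objects no longer beat one object per page even with 16 pages.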

TESTS
=====

1) ChromeOS memory pressure test
=============================================================================

Our standard memory pressure test, designed with reproducibility in mind.

zram is configured as a swap device with the lzo-rle compression algorithm.
We captured /sys/block/zram0/mm_stat after every test and rebooted the
device.

Columns per (Documentation/admin-guide/blockdev/zram.rst)

orig_data_size compr_data_size mem_used_total mem_limit mem_used_max same_pages pages_compacted huge_pages
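
Not part of the test harness, but an illustrative user-space sketch that reads the eight counters in the order listed above (it assumes this eight-column mm_stat layout and a zram0 device) could look like:

#include <stdio.h>

int main(void)
{
	unsigned long long orig, compr, used, limit, max, same, compacted, huge;
	FILE *fp = fopen("/sys/block/zram0/mm_stat", "r");

	if (!fp)
		return 1;
	if (fscanf(fp, "%llu %llu %llu %llu %llu %llu %llu %llu",
		   &orig, &compr, &used, &limit, &max,
		   &same, &compacted, &huge) != 8) {
		fclose(fp);
		return 1;
	}
	fclose(fp);
	printf("orig=%llu compr=%llu used=%llu max=%llu huge=%llu\n",
	       orig, compr, used, max, huge);
	return 0;
}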

ORDER 2 (BASE) zspage

10353639424 2981711944 3166896128        0 3543158784   579494   825135   123707
10168573952 2932288347 3106541568        0 3499085824   565187   853137   126153
9950461952 2815911234 3035693056        0 3441090560   586696   748054   122103
9892335616 2779566152 2943459328        0 3514736640   591541   650696   119621
9993949184 2814279212 3021357056        0 3336421376   582488   711744   121273
9953226752 2856382009 3025649664        0 3512893440   564559   787861   123034
9838448640 2785481728 2997575680        0 3367219200   573282   777099   122739

ORDER 3 zspage

9509138432 2706941227 2823393280        0 3389587456   535856  1011472    90223
10105245696 2882368370 3013095424        0 3296165888   563896  1059033    94808
9531236352 2666125512 2867650560        0 3396173824   567117  1126396    88807
9561812992 2714536764 2956652544        0 3310505984   548223   827322    90992
9807470592 2790315707 2908053504        0 3378315264   563670  1020933    93725
10178371584 2948838782 3071209472        0 3329548288   548533   954546    90730
9925165056 2849839413 2958274560        0 3336978432   551464  1058302    89381

ORDER 4 zspage

9444515840 2613362645 2668232704        0 3396759552   573735  1162207    83475
10129108992 2925888488 3038351360        0 3499597824   555634  1231542    84525
9876594688 2786692282 2897006592        0 3469463552   584835  1290535    84133
10012909568 2649711847 2801512448        0 3171323904   675405   750728    80424
10120966144 2866742402 2978639872        0 3257815040   587435  1093981    83587
9578790912 2671245225 2802270208        0 3376353280   545548  1047930    80895
10108588032 2888433523 2983960576        0 3316641792   571445  1290640    81402

First, we establish that order 3 and 4 don't cause any statistically
significant change in `orig_data_size` (the number of bytes we store during
the test); in other words, larger zspages don't cause regressions.

T-test for order 3:

x order-2-stored
+ order-3-stored
+-----------------------------------------------------------------------------+
|+ +  +                     +  x   x  +  x   x         +    x+               x|
| |________________________AM__|_________M_____A____|__________|              |
+-----------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   7 9.8384486e+09 1.0353639e+10 9.9532268e+09 1.0021519e+10 1.7916718e+08
+   7 9.5091384e+09 1.0178372e+10 9.8074706e+09 9.8026344e+09 2.7856206e+08
No difference proven at 95.0% confidence

T-test for order 4:

x order-2-stored
+ order-4-stored
+-----------------------------------------------------------------------------+
|                                                         +                   |
|+          +                     x  +x    xx  x +       ++   x              x|
|              |__________________|____A____M____M____________|_|             |
+-----------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   7 9.8384486e+09 1.0353639e+10 9.9532268e+09 1.0021519e+10 1.7916718e+08
+   7 9.4445158e+09 1.0129109e+10  1.001291e+10 9.8959249e+09 2.7947784e+08
No difference proven at 95.0% confidence

Next we establish that there is a statistically significant improvement
in `mem_used_total` metrics.

T-test for order 3:

x order-2-usedmem
+ order-3-usedmem
+-----------------------------------------------------------------------------+
|+         +        +       x ++        x  + xx x       +       x            x|
|        |_________________A__M__|____________|__A________________|           |
+-----------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   7 2.9434593e+09 3.1668961e+09 3.0256497e+09 3.0424532e+09      73235062
+   7 2.8233933e+09 3.0712095e+09 2.9566525e+09 2.9426185e+09      84630851
Difference at 95.0% confidence
	-9.98347e+07 +/- 9.21744e+07
	-3.28139% +/- 3.02961%
	(Student's t, pooled s = 7.91383e+07)
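
(The reported numbers follow directly from the summary rows: the difference is simply the difference of the two means, 2.9426185e9 - 3.0424532e9 ~= -9.98e7 bytes, roughly 95 MiB less memory used on average, and the +/- 9.22e7 half-width is t(0.975, 12 d.o.f.) x pooled-s x sqrt(2/7) ~= 2.179 x 7.91e7 x 0.535.)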

T-test for order 4:

x order-2-usedmem
+ order-4-usedmem
+-----------------------------------------------------------------------------+
|                    +                                 x                      |
|+                   +              +      x    ++ x   x *          x        x|
|             |__________________A__M__________|_____|_M__A__________|        |
+-----------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   7 2.9434593e+09 3.1668961e+09 3.0256497e+09 3.0424532e+09      73235062
+   7 2.6682327e+09 3.0383514e+09 2.8970066e+09 2.8814248e+09 1.3098053e+08
Difference at 95.0% confidence
	-1.61028e+08 +/- 1.23591e+08
	-5.29272% +/- 4.0622%
	(Student's t, pooled s = 1.06111e+08)

Order 3 zspages also show statistically significant improvement in
`mem_used_max` metrics.

T-test for order 3:

x order-2-maxmem
+ order-3-maxmem
+-----------------------------------------------------------------------------+
|+   +     + x+        x  +   + +             x                x    x        x|
|    |________M__A_________|_|_____________________A___________M____________| |
+-----------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   7 3.3364214e+09 3.5431588e+09 3.4990858e+09 3.4592294e+09      80073158
+   7 3.2961659e+09 3.3961738e+09 3.3369784e+09 3.3481822e+09      39840377
Difference at 95.0% confidence
	-1.11047e+08 +/- 7.36589e+07
	-3.21017% +/- 2.12934%
	(Student's t, pooled s = 6.32415e+07)

Order 4 zspages, on the other hand, do not show any statistically significant
improvement in `mem_used_max` metrics.

T-test for order 4:

x order-2-maxmem
+ order-4-maxmem
+-----------------------------------------------------------------------------+
|+                 +           +   x     x +   +        x     +     *  x     x|
|              |_______________________A___M________________A_|_____M_______| |
+-----------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   7 3.3364214e+09 3.5431588e+09 3.4990858e+09 3.4592294e+09      80073158
+   7 3.1713239e+09 3.4995978e+09 3.3763533e+09 3.3554221e+09 1.1609062e+08
No difference proven at 95.0% confidence

Overall, with a sufficient level of confidence, order 3 zspages appear to be
beneficial for this particular use-case and data patterns.

Rather expectedly, we also observed lower huge_pages counts (the mm_stat
column above) when zsmalloc is configured with order 3 and order 4 zspages,
for the reason already explained.

2) Synthetic test
=============================================================================

Test untars linux-6.0.tar.xz and compiles the kernel.

zram is configured as a block device with an ext4 file system and the
lzo-rle compression algorithm. We captured /sys/block/zram0/mm_stat after
every test and rebooted the VM.

orig_data_size compr_data_size mem_used_total mem_limit mem_used_max same_pages pages_compacted huge_pages

ORDER 2 (BASE) zspage

1691791360 628086729 655171584        0 655171584       60        0    34043
1691787264 628089196 655175680        0 655175680       60        0    34046
1691803648 628098840 655187968        0 655187968       59        0    34047
1691795456 628091503 655183872        0 655183872       60        0    34044
1691799552 628086877 655183872        0 655183872       60        0    34047

ORDER 3 zspage

1691803648 627792993 641794048        0 641794048       60        0    33591
1691787264 627779342 641708032        0 641708032       59        0    33591
1691811840 627786616 641769472        0 641769472       60        0    33591
1691803648 627794468 641818624        0 641818624       59        0    33592
1691783168 627780882 641794048        0 641794048       61        0    33591

ORDER 4 zspage

1691803648 627726635 639655936        0 639655936       60        0    33435
1691811840 627733348 639643648        0 639643648       61        0    33434
1691795456 627726290 639614976        0 639614976       60        0    33435
1691803648 627730458 639688704        0 639688704       60        0    33434
1691811840 627727771 639688704        0 639688704       60        0    33434

Order 3 and order 4 show statistically significant improvement in
`mem_used_max` metrics.

T-test for order 3:

x order-2-maxmem
+ order-3-maxmem
+--------------------------------------------------------------------------+
|+                                                                        x|
|+                                                                        x|
|+                                                                        x|
|++                                                                       x|
|A|                                                                       A|
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.4170803e+08 6.4181862e+08 6.4179405e+08 6.4177684e+08     42210.666
Difference at 95.0% confidence
	-1.34038e+07 +/- 44080.7
	-2.04581% +/- 0.00672802%
	(Student's t, pooled s = 30224.5)

T-test for order 4:

x order-2-maxmem
+ order-4-maxmem
+--------------------------------------------------------------------------+
|+                                                                        x|
|+                                                                        x|
|+                                                                        x|
|+                                                                        x|
|+                                                                        x|
|A                                                                        A|
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.3961498e+08  6.396887e+08 6.3965594e+08 6.3965839e+08     31408.602
Difference at 95.0% confidence
	-1.55222e+07 +/- 33126.2
	-2.36915% +/- 0.00505604%
	(Student's t, pooled s = 22713.4)

This test tends to benefit more from order 4 zspages, due to the test's data
patterns.

zsmalloc object distribution analysis
=============================================================================

Order 2 (4 pages per zspage) tends to put many objects in size class 2048,
which is merged with size classes torvalds#112-torvalds#125:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            0          6146       6146       1756                2        0
    74  1216           0            1          4560       4552       1368                3        0
    76  1248           0            1          2938       2934        904                4        0
    83  1360           0            0         10971      10971       3657                1        0
    91  1488           0            0         16126      16126       5864                4        0
    94  1536           0            1          5912       5908       2217                3        0
   100  1632           0            0         11990      11990       4796                2        0
   107  1744           0            1         15771      15768       6759                3        0
   111  1808           0            1         10386      10380       4616                4        0
   126  2048           0            0         45444      45444      22722                1        0
   144  2336           0            0         47446      47446      27112                4        0
   151  2448           1            0         10760      10759       6456                3        0
   168  2720           0            0         10173      10173       6782                2        0
   190  3072           0            1          1700       1697       1275                3        0
   202  3264           0            1           290        286        232                4        0
   254  4096           0            0         34051      34051      34051                1        0
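
This grouping is again a consequence of the packing arithmetic: class sizes grow in 16-byte steps (32 + 16 * index, consistent with the tables above), so classes torvalds#112 (1824 bytes) through torvalds#125 (2032 bytes) end up with exactly the same characteristics as torvalds#126 (2048 bytes) at order 2 - one page per zspage holding two objects, since 2 x 2048 = 4096 - and are therefore merged into torvalds#126; an object of 1824 bytes stored there wastes 2048 - 1824 = 224 bytes.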

Order 3 (8 pages per zspage) changed pool characteristics and unmerged
some of the size classes, which resulted in fewer objects being put into
size class 2048, because lower size classes are now available
for more compact object storage:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            1          2996       2994        856                2        0
    72  1184           0            1          1632       1609        476                7        0
    73  1200           1            0          1445       1442        425                5        0
    74  1216           0            0          1510       1510        453                3        0
    75  1232           0            1          1495       1479        455                7        0
    76  1248           0            1          1456       1451        448                4        0
    78  1280           0            1          3040       3033        950                5        0
    79  1296           0            1          1584       1571        504                7        0
    83  1360           0            0          6375       6375       2125                1        0
    84  1376           0            1          1817       1796        632                8        0
    87  1424           0            1          6020       6006       2107                7        0
    88  1440           0            1          2108       2101        744                6        0
    89  1456           0            1          2072       2064        740                5        0
    91  1488           0            1          4169       4159       1516                4        0
    92  1504           0            1          2014       2007        742                7        0
    94  1536           0            1          3904       3900       1464                3        0
    95  1552           0            1          1890       1873        720                8        0
    96  1568           0            1          1963       1958        755                5        0
    97  1584           0            1          1980       1974        770                7        0
   100  1632           0            1          6190       6187       2476                2        0
   103  1680           0            0          6477       6477       2667                7        0
   104  1696           0            1          2256       2253        940                5        0
   105  1712           0            1          2356       2340        992                8        0
   107  1744           1            0          4697       4696       2013                3        0
   110  1792           0            1          7744       7734       3388                7        0
   111  1808           0            1          2655       2649       1180                4        0
   114  1856           0            1          8371       8365       3805                5        0
   116  1888           1            0          5863       5862       2706                6        0
   117  1904           0            1          2955       2942       1379                7        0
   118  1920           0            1          3009       2997       1416                8        0
   126  2048           0            0         25276      25276      12638                1        0
   128  2080           0            1          6060       6052       3232                8        0
   129  2096           1            0          3081       3080       1659                7        0
   134  2176           0            1         14835      14830       7912                8        0
   135  2192           0            1          2769       2758       1491                7        0
   137  2224           0            1          5082       5077       2772                6        0
   140  2272           0            1          7236       7232       4020                5        0
   144  2336           0            1          8428       8423       4816                4        0
   147  2384           0            1          5316       5313       3101                7        0
   151  2448           0            1          5445       5443       3267                3        0
   155  2512           0            0          4121       4121       2536                8        0
   158  2560           0            1          2208       2205       1380                5        0
   160  2592           0            0          1133       1133        721                7        0
   168  2720           0            0          2712       2712       1808                2        0
   177  2864           1            0          1100       1098        770                7        0
   180  2912           0            1           189        183        135                5        0
   184  2976           0            1           176        166        128                8        0
   190  3072           0            0           252        252        189                3        0
   197  3184           0            1           198        192        154                7        0
   202  3264           0            1           100         96         80                4        0
   211  3408           0            1           210        208        175                5        0
   217  3504           0            1            98         94         84                6        0
   222  3584           0            0           104        104         91                7        0
   225  3632           0            1            54         50         48                8        0
   254  4096           0            0         33591      33591      33591                1        0

Note that the huge size class watermark is now above 3632 bytes, and a number
of new normal classes are available that previously were merged with the huge
class.
For instance, size class torvalds#211 holds 210 objects of size 3408 and uses 175
physical pages, while previously for those objects we would have used 210
physical pages.
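
(With 5 pages per zspage, one zspage of class torvalds#211 is 5 x 4096 = 20480 bytes and holds 6 objects of 3408 bytes, wasting only 32 bytes; the 210 objects above therefore sit in 35 such zspages, i.e. the 175 pages shown, instead of the 210 pages a one-object-per-page huge class would need.)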

Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Oct 27, 2022
jonhunter pushed a commit to jonhunter/linux that referenced this pull request Oct 28, 2022
zsmalloc has 255 size classes.  Size classes contain a number of zspages,
which store objects of the same size. A zspage can consist of up to four
physical pages. The optimal zspage size is calculated for each size class
during zsmalloc pool creation.

As a reasonable optimization, zsmalloc merges size classes that have
similar characteristics: number of pages per zspage and number of objects
zspage can store.

For example, let's look at the following size classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
..
   94  1536           0            0             0          0          0                3        0
  100  1632           0            0             0          0          0                2        0
..

Size classes torvalds#95-99 are merged with size class torvalds#100. That is, each time
we store an object of, say, 1568 bytes, instead of using class torvalds#96 we
end up storing it in size class torvalds#100. Class torvalds#100 is for objects of
1632 bytes in size, hence every 1568-byte object wastes 1632 - 1568 = 64 bytes.
Class torvalds#100 zspages consist of 2 physical pages and can hold 5 objects.
When we need to store, say, 13 objects of size 1568, we end up allocating
three zspages; in other words, 6 physical pages.

However, if we look closer at size class torvalds#96 (which should hold objects
of size 1568 bytes) and trace get_pages_per_zspage():

    pages per zspage      wasted bytes     used%
           1                  960           76
           2                  352           95
           3                 1312           89
           4                  704           95
           5                   96           99

We'd notice that the optimal zspage configuration for this class is when it
consists of 5 physical pages, but currently we never let zspages consist of
more than 4 pages. A 5-page class torvalds#96 configuration would
store 13 objects of size 1568 in a single zspage, allocating 5 physical
pages, as opposed to 6 physical pages that class torvalds#100 will allocate.

A higher order zspage for class torvalds#96 also changes its key characteristics:
pages per zspage and objects per zspage. As a result, classes torvalds#96 and torvalds#100
are not merged anymore, which gives us a more compact zsmalloc.

Of course the described effect does not apply only to size classes torvalds#96 and
torvalds#100. We still merge classes, but less often. In other words, classes are
grouped in a more compact way, which decreases memory wastage:

zspage order               # unique size classes
     2                                69
     3                               123
     4                               191

Let's take a closer look at the bottom of /sys/kernel/debug/zsmalloc/zram0/classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  254  4096           0            0             0          0          0                1        0
...

For exactly the same reason - a maximum of 4 pages per zspage - the last non-huge
size class is torvalds#202, which stores objects of size 3264 bytes. Any object
larger than 3264 bytes, hence, is considered to be huge and lands in size
class torvalds#254, which uses a whole physical page to store every object. To put
it slightly differently - objects in huge classes don't share physical pages.

3264 bytes is too low of a watermark and we have too many huge classes:
classes from torvalds#203 to torvalds#254. Similarly to size class torvalds#96 above, higher order
zspages change key characteristics for some of those huge size classes and
thus those classes become normal classes, where stored objects share physical
pages.

Hence yet another consequence of higher order zspages: they move the huge
size class watermark up, leave us with fewer huge classes and let us store
large objects in a more compact way.

For order 3, huge class watermark becomes 3632 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  211  3408           0            0             0          0          0                5        0
  217  3504           0            0             0          0          0                6        0
  222  3584           0            0             0          0          0                7        0
  225  3632           0            0             0          0          0                8        0
  254  4096           0            0             0          0          0                1        0
...

For order 4, huge class watermark becomes 3840 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  206  3328           0            0             0          0          0               13        0
  207  3344           0            0             0          0          0                9        0
  208  3360           0            0             0          0          0               14        0
  211  3408           0            0             0          0          0                5        0
  212  3424           0            0             0          0          0               16        0
  214  3456           0            0             0          0          0               11        0
  217  3504           0            0             0          0          0                6        0
  219  3536           0            0             0          0          0               13        0
  222  3584           0            0             0          0          0                7        0
  223  3600           0            0             0          0          0               15        0
  225  3632           0            0             0          0          0                8        0
  228  3680           0            0             0          0          0                9        0
  230  3712           0            0             0          0          0               10        0
  232  3744           0            0             0          0          0               11        0
  234  3776           0            0             0          0          0               12        0
  235  3792           0            0             0          0          0               13        0
  236  3808           0            0             0          0          0               14        0
  238  3840           0            0             0          0          0               15        0
  254  4096           0            0             0          0          0                1        0
...

TESTS
=====

Test untars linux-6.0.tar.xz and compiles the kernel.

zram is configured as a block device with an ext4 file system and the
lzo-rle compression algorithm. We captured /sys/block/zram0/mm_stat after
every test and rebooted the VM.

orig_data_size compr_data_size mem_used_total mem_limit mem_used_max same_pages pages_compacted huge_pages

ORDER 2 (BASE) zspage

1691791360 628086729 655171584        0 655171584       60        0    34043
1691787264 628089196 655175680        0 655175680       60        0    34046
1691803648 628098840 655187968        0 655187968       59        0    34047
1691795456 628091503 655183872        0 655183872       60        0    34044
1691799552 628086877 655183872        0 655183872       60        0    34047

ORDER 3 zspage

1691803648 627792993 641794048        0 641794048       60        0    33591
1691787264 627779342 641708032        0 641708032       59        0    33591
1691811840 627786616 641769472        0 641769472       60        0    33591
1691803648 627794468 641818624        0 641818624       59        0    33592
1691783168 627780882 641794048        0 641794048       61        0    33591

ORDER 4 zspage

1691803648 627726635 639655936        0 639655936       60        0    33435
1691811840 627733348 639643648        0 639643648       61        0    33434
1691795456 627726290 639614976        0 639614976       60        0    33435
1691803648 627730458 639688704        0 639688704       60        0    33434
1691811840 627727771 639688704        0 639688704       60        0    33434

Order 3 and order 4 show statistically significant improvement in
`mem_used_max` metrics.

T-test for order 3:

x order-2-maxmem
+ order-3-maxmem
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.4170803e+08 6.4181862e+08 6.4179405e+08 6.4177684e+08     42210.666
Difference at 95.0% confidence
	-1.34038e+07 +/- 44080.7
	-2.04581% +/- 0.00672802%
	(Student's t, pooled s = 30224.5)

T-test for order 4:

x order-2-maxmem
+ order-4-maxmem
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.3961498e+08  6.396887e+08 6.3965594e+08 6.3965839e+08     31408.602
Difference at 95.0% confidence
	-1.55222e+07 +/- 33126.2
	-2.36915% +/- 0.00505604%
	(Student's t, pooled s = 22713.4)

This test tends to benefit more from order 4 zspages, due to the test's data
patterns.

zsmalloc object distribution analysis
=============================================================================

Order 2 (4 pages per zspage) tends to put many objects in size class 2048,
which is merged with size classes torvalds#112-torvalds#125:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            0          6146       6146       1756                2        0
    74  1216           0            1          4560       4552       1368                3        0
    76  1248           0            1          2938       2934        904                4        0
    83  1360           0            0         10971      10971       3657                1        0
    91  1488           0            0         16126      16126       5864                4        0
    94  1536           0            1          5912       5908       2217                3        0
   100  1632           0            0         11990      11990       4796                2        0
   107  1744           0            1         15771      15768       6759                3        0
   111  1808           0            1         10386      10380       4616                4        0
   126  2048           0            0         45444      45444      22722                1        0
   144  2336           0            0         47446      47446      27112                4        0
   151  2448           1            0         10760      10759       6456                3        0
   168  2720           0            0         10173      10173       6782                2        0
   190  3072           0            1          1700       1697       1275                3        0
   202  3264           0            1           290        286        232                4        0
   254  4096           0            0         34051      34051      34051                1        0

Order 3 (8 pages per zspage) changed the pool characteristics and unmerged
some of the size classes, which resulted in fewer objects being put into
size class 2048, because lower size classes are now available for more
compact object storage:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            1          2996       2994        856                2        0
    72  1184           0            1          1632       1609        476                7        0
    73  1200           1            0          1445       1442        425                5        0
    74  1216           0            0          1510       1510        453                3        0
    75  1232           0            1          1495       1479        455                7        0
    76  1248           0            1          1456       1451        448                4        0
    78  1280           0            1          3040       3033        950                5        0
    79  1296           0            1          1584       1571        504                7        0
    83  1360           0            0          6375       6375       2125                1        0
    84  1376           0            1          1817       1796        632                8        0
    87  1424           0            1          6020       6006       2107                7        0
    88  1440           0            1          2108       2101        744                6        0
    89  1456           0            1          2072       2064        740                5        0
    91  1488           0            1          4169       4159       1516                4        0
    92  1504           0            1          2014       2007        742                7        0
    94  1536           0            1          3904       3900       1464                3        0
    95  1552           0            1          1890       1873        720                8        0
    96  1568           0            1          1963       1958        755                5        0
    97  1584           0            1          1980       1974        770                7        0
   100  1632           0            1          6190       6187       2476                2        0
   103  1680           0            0          6477       6477       2667                7        0
   104  1696           0            1          2256       2253        940                5        0
   105  1712           0            1          2356       2340        992                8        0
   107  1744           1            0          4697       4696       2013                3        0
   110  1792           0            1          7744       7734       3388                7        0
   111  1808           0            1          2655       2649       1180                4        0
   114  1856           0            1          8371       8365       3805                5        0
   116  1888           1            0          5863       5862       2706                6        0
   117  1904           0            1          2955       2942       1379                7        0
   118  1920           0            1          3009       2997       1416                8        0
   126  2048           0            0         25276      25276      12638                1        0
   128  2080           0            1          6060       6052       3232                8        0
   129  2096           1            0          3081       3080       1659                7        0
   134  2176           0            1         14835      14830       7912                8        0
   135  2192           0            1          2769       2758       1491                7        0
   137  2224           0            1          5082       5077       2772                6        0
   140  2272           0            1          7236       7232       4020                5        0
   144  2336           0            1          8428       8423       4816                4        0
   147  2384           0            1          5316       5313       3101                7        0
   151  2448           0            1          5445       5443       3267                3        0
   155  2512           0            0          4121       4121       2536                8        0
   158  2560           0            1          2208       2205       1380                5        0
   160  2592           0            0          1133       1133        721                7        0
   168  2720           0            0          2712       2712       1808                2        0
   177  2864           1            0          1100       1098        770                7        0
   180  2912           0            1           189        183        135                5        0
   184  2976           0            1           176        166        128                8        0
   190  3072           0            0           252        252        189                3        0
   197  3184           0            1           198        192        154                7        0
   202  3264           0            1           100         96         80                4        0
   211  3408           0            1           210        208        175                5        0
   217  3504           0            1            98         94         84                6        0
   222  3584           0            0           104        104         91                7        0
   225  3632           0            1            54         50         48                8        0
   254  4096           0            0         33591      33591      33591                1        0
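
To make the effect of the merging concrete, consider a hypothetical 1850-byte
object (the size is illustrative, not taken from the trace). Under order 2
the smallest class that fits is the merged 2048-byte class #126, wasting
198 bytes per object; under order 3 the unmerged 1856-byte class #114 fits it
with only 6 bytes of waste:

/* Minimal sketch: per-object waste for a hypothetical 1850-byte object.
 * Slot sizes are taken from the class tables above: under order 2 the next
 * class above 1808 bytes is 2048 (#126), under order 3 it is 1856 (#114). */
#include <stdio.h>

int main(void)
{
	unsigned int obj = 1850;
	unsigned int order2_slot = 2048;
	unsigned int order3_slot = 1856;

	printf("order 2: %u-byte slot, %u bytes wasted\n",
	       order2_slot, order2_slot - obj);
	printf("order 3: %u-byte slot, %u bytes wasted\n",
	       order3_slot, order3_slot - obj);
	return 0;
}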

Note that the huge size watermark is above 3632 bytes and a number of new
normal classes are available that previously were merged with the huge class.
For instance, size class #211 holds 210 objects of size 3408 and uses 175
physical pages, while previously those objects would have used 210 physical
pages.
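
As a cross-check of those figures (assuming 4 KiB pages): an order-3 zspage
for class #211 spans 5 physical pages (20480 bytes) and so holds 6 objects of
3408 bytes, hence 210 objects need 35 zspages, i.e. 175 pages, whereas the
huge class stores one object per page, i.e. 210 pages. A minimal sketch:

/* Minimal sketch: reproduce the 175-vs-210 page figure for 210 objects
 * of size 3408 stored in order-3 class #211 (5 pages per zspage). */
#include <stdio.h>

int main(void)
{
	unsigned int page_size = 4096;		/* assumed 4 KiB pages */
	unsigned int obj_size = 3408;		/* class #211 */
	unsigned int pages_per_zspage = 5;	/* order-3 value from the table */
	unsigned int nr_objs = 210;

	unsigned int objs_per_zspage = pages_per_zspage * page_size / obj_size;
	unsigned int zspages = (nr_objs + objs_per_zspage - 1) / objs_per_zspage;

	printf("normal class: %u physical pages\n", zspages * pages_per_zspage);
	printf("huge class:   %u physical pages\n", nr_objs);
	return 0;
}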

Link: https://lkml.kernel.org/r/20221027042651.234524-3-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Alexey Romanov <avromanov@sberdevices.ru>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Oct 29, 2022
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Nov 1, 2022
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Nov 1, 2022
   222  3584           0            0           104        104         91                7        0
   225  3632           0            1            54         50         48                8        0
   254  4096           0            0         33591      33591      33591                1        0

Note that the huge size class watermark is now above 3632 bytes and a number
of new normal classes are available that previously were merged with the huge
class. For instance, size class #211 holds 210 objects of size 3408 bytes and
uses 175 physical pages, while previously those objects would have used 210
physical pages.
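
The page counts follow from simple arithmetic. Below is a small user-space
sketch, given for illustration only (it assumes a PAGE_SIZE of 4096 bytes and
plugs in the class #211 parameters quoted above; it is not kernel code), that
computes the pages needed to store the same 210 objects in the order-3 class
versus one page per object in the huge class:

#include <stdio.h>

#define PAGE_SIZE 4096UL

/* Pages needed when objects share zspages of pages_per_zspage pages each. */
static unsigned long class_pages_needed(unsigned long obj_size,
					unsigned long pages_per_zspage,
					unsigned long nr_objs)
{
	unsigned long objs_per_zspage = pages_per_zspage * PAGE_SIZE / obj_size;
	unsigned long zspages = (nr_objs + objs_per_zspage - 1) / objs_per_zspage;

	return zspages * pages_per_zspage;
}

int main(void)
{
	/* Size class #211: 3408-byte objects, 5 pages per zspage (order-3 pool). */
	printf("order-3 class #211: %lu pages for 210 objects\n",
	       class_pages_needed(3408, 5, 210));	/* prints 175 */
	/* Huge class: every object occupies a whole physical page. */
	printf("huge class:         %lu pages for 210 objects\n", 210UL);
	return 0;
}

A 5-page zspage holds 6 objects of 3408 bytes (20480 / 3408, rounded down), so
210 objects fit into 35 zspages, i.e. 175 physical pages, instead of one page
per object.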

Link: https://lkml.kernel.org/r/20221031054108.541190-3-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Alexey Romanov <avromanov@sberdevices.ru>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Nov 2, 2022
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Nov 3, 2022
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Nov 5, 2022
zsmalloc has 255 size classes.  Size classes contain a number of zspages,
which store objects of the same size.  zspage can consist of up to four
physical pages.  The exact (most optimal) zspage size is calculated for
each size class during zsmalloc pool creation.

As a reasonable optimization, zsmalloc merges size classes that have
similar characteristics: number of pages per zspage and number of objects
zspage can store.

For example, let's look at the following size classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
..
   94  1536           0            0             0          0          0                3        0
  100  1632           0            0             0          0          0                2        0
..

Size classes torvalds#95-99 are merged with size class torvalds#100. That is, each time
we store an object of size, say, 1568 bytes instead of using class torvalds#96
we end up storing it in size class torvalds#100. Class torvalds#100 is for objects of
1632 bytes in size, hence every 1568 bytes object wastes 1632-1568 bytes.
Class torvalds#100 zspages consist of 2 physical pages and can hold 5 objects.
When we need to store, say, 13 objects of size 1568 we end up allocating
three zspages; in other words, 6 physical pages.

However, if we'll look closer at size class torvalds#96 (which should hold objects
of size 1568 bytes) and trace get_pages_per_zspage():

    pages per zspage      wasted bytes     used%
           1                  960           76
           2                  352           95
           3                 1312           89
           4                  704           95
           5                   96           99

We'd notice that the most optimal zspage configuration for this class is
when it consists of 5 physical pages, but currently we never let zspages
to consists of more than 4 pages. A 5 page class torvalds#96 configuration would
store 13 objects of size 1568 in a single zspage, allocating 5 physical
pages, as opposed to 6 physical pages that class torvalds#100 will allocate.

A higher order zspage for class torvalds#96 also changes its key characteristics:
pages per-zspage and objects per-zspage. As a result classes torvalds#96 and torvalds#100
are not merged anymore, which gives us more compact zsmalloc.

Of course the described effect does not apply only to size classes torvalds#96 and
We still merge classes, but less often so. In other words classes are grouped
in a more compact way, which decreases memory wastage:

zspage order               # unique size classes
     2                                69
     3                               123
     4                               191

Let's take a closer look at the bottom of /sys/kernel/debug/zsmalloc/zram0/classes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  254  4096           0            0             0          0          0                1        0
...

For exactly same reason - maximum 4 pages per zspage - the last non-huge
size class is torvalds#202, which stores objects of size 3264 bytes. Any object
larger than 3264 bytes, hence, is considered to be huge and lands in size
class torvalds#254, which uses a whole physical page to store every object. To put
it slightly differently - objects in huge classes don't share physical pages.

3264 bytes is too low of a watermark and we have too many huge classes:
classes from torvalds#203 to torvalds#254. Similarly to class size torvalds#96 above, higher order
zspages change key characteristics for some of those huge size classes and
thus those classes become normal classes, where stored objects share physical
pages.

Hence yet another consequence of higher order zspages: we move the huge
size class watermark with higher order zspages, have less huge classes and
store large objects in a more compact way.

For order 3, huge class watermark becomes 3632 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  211  3408           0            0             0          0          0                5        0
  217  3504           0            0             0          0          0                6        0
  222  3584           0            0             0          0          0                7        0
  225  3632           0            0             0          0          0                8        0
  254  4096           0            0             0          0          0                1        0
...

For order 4, huge class watermark becomes 3840 bytes:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
  202  3264           0            0             0          0          0                4        0
  206  3328           0            0             0          0          0               13        0
  207  3344           0            0             0          0          0                9        0
  208  3360           0            0             0          0          0               14        0
  211  3408           0            0             0          0          0                5        0
  212  3424           0            0             0          0          0               16        0
  214  3456           0            0             0          0          0               11        0
  217  3504           0            0             0          0          0                6        0
  219  3536           0            0             0          0          0               13        0
  222  3584           0            0             0          0          0                7        0
  223  3600           0            0             0          0          0               15        0
  225  3632           0            0             0          0          0                8        0
  228  3680           0            0             0          0          0                9        0
  230  3712           0            0             0          0          0               10        0
  232  3744           0            0             0          0          0               11        0
  234  3776           0            0             0          0          0               12        0
  235  3792           0            0             0          0          0               13        0
  236  3808           0            0             0          0          0               14        0
  238  3840           0            0             0          0          0               15        0
  254  4096           0            0             0          0          0                1        0
...

TESTS
=====

Test untars linux-6.0.tar.xz and compiles the kernel.

zram is configured as a block device with ext4 file system, lzo-rle
compression algorithm. We captured /sys/block/zram0/mm_stat after
every test and rebooted the VM.

orig_data_size       mem_used_total     mem_used_max       pages_compacted
          compr_data_size         mem_limit         same_pages       huge_pages

ORDER 2 (BASE) zspage

1691791360 628086729 655171584        0 655171584       60        0    34043
1691787264 628089196 655175680        0 655175680       60        0    34046
1691803648 628098840 655187968        0 655187968       59        0    34047
1691795456 628091503 655183872        0 655183872       60        0    34044
1691799552 628086877 655183872        0 655183872       60        0    34047

ORDER 3 zspage

1691803648 627792993 641794048        0 641794048       60        0    33591
1691787264 627779342 641708032        0 641708032       59        0    33591
1691811840 627786616 641769472        0 641769472       60        0    33591
1691803648 627794468 641818624        0 641818624       59        0    33592
1691783168 627780882 641794048        0 641794048       61        0    33591

ORDER 4 zspage

1691803648 627726635 639655936        0 639655936       60        0    33435
1691811840 627733348 639643648        0 639643648       61        0    33434
1691795456 627726290 639614976        0 639614976       60        0    33435
1691803648 627730458 639688704        0 639688704       60        0    33434
1691811840 627727771 639688704        0 639688704       60        0    33434

Order 3 and order 4 show statistically significant improvement in
`mem_used_max` metrics.

T-test for order 3:

x order-2-maxmem
+ order-3-maxmem
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.4170803e+08 6.4181862e+08 6.4179405e+08 6.4177684e+08     42210.666
Difference at 95.0% confidence
	-1.34038e+07 +/- 44080.7
	-2.04581% +/- 0.00672802%
	(Student's t, pooled s = 30224.5)

T-test for order 4:

x order-2-maxmem
+ order-4-maxmem
    N           Min           Max        Median           Avg        Stddev
x   5 6.5517158e+08 6.5518797e+08 6.5518387e+08  6.551806e+08     6730.4157
+   5 6.3961498e+08  6.396887e+08 6.3965594e+08 6.3965839e+08     31408.602
Difference at 95.0% confidence
	-1.55222e+07 +/- 33126.2
	-2.36915% +/- 0.00505604%
	(Student's t, pooled s = 22713.4)

This test tends to benefit more from order 4 zspages, due to test's data
patterns.

zsmalloc object distribution analysis
=============================================================================

Order 2 (4 pages per zspage) tends to put many objects in size class 2048,
which is merged with size classes torvalds#112-torvalds#125:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            0          6146       6146       1756                2        0
    74  1216           0            1          4560       4552       1368                3        0
    76  1248           0            1          2938       2934        904                4        0
    83  1360           0            0         10971      10971       3657                1        0
    91  1488           0            0         16126      16126       5864                4        0
    94  1536           0            1          5912       5908       2217                3        0
   100  1632           0            0         11990      11990       4796                2        0
   107  1744           0            1         15771      15768       6759                3        0
   111  1808           0            1         10386      10380       4616                4        0
   126  2048           0            0         45444      45444      22722                1        0
   144  2336           0            0         47446      47446      27112                4        0
   151  2448           1            0         10760      10759       6456                3        0
   168  2720           0            0         10173      10173       6782                2        0
   190  3072           0            1          1700       1697       1275                3        0
   202  3264           0            1           290        286        232                4        0
   254  4096           0            0         34051      34051      34051                1        0

Order 3 (8 pages per zspage) changed pool characteristics and unmerged
some of the size classes, which resulted in less objects being put into
size class 2048, because there are lower size classes are now available
for more compact object storage:

class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
...
    71  1168           0            1          2996       2994        856                2        0
    72  1184           0            1          1632       1609        476                7        0
    73  1200           1            0          1445       1442        425                5        0
    74  1216           0            0          1510       1510        453                3        0
    75  1232           0            1          1495       1479        455                7        0
    76  1248           0            1          1456       1451        448                4        0
    78  1280           0            1          3040       3033        950                5        0
    79  1296           0            1          1584       1571        504                7        0
    83  1360           0            0          6375       6375       2125                1        0
    84  1376           0            1          1817       1796        632                8        0
    87  1424           0            1          6020       6006       2107                7        0
    88  1440           0            1          2108       2101        744                6        0
    89  1456           0            1          2072       2064        740                5        0
    91  1488           0            1          4169       4159       1516                4        0
    92  1504           0            1          2014       2007        742                7        0
    94  1536           0            1          3904       3900       1464                3        0
    95  1552           0            1          1890       1873        720                8        0
    96  1568           0            1          1963       1958        755                5        0
    97  1584           0            1          1980       1974        770                7        0
   100  1632           0            1          6190       6187       2476                2        0
   103  1680           0            0          6477       6477       2667                7        0
   104  1696           0            1          2256       2253        940                5        0
   105  1712           0            1          2356       2340        992                8        0
   107  1744           1            0          4697       4696       2013                3        0
   110  1792           0            1          7744       7734       3388                7        0
   111  1808           0            1          2655       2649       1180                4        0
   114  1856           0            1          8371       8365       3805                5        0
   116  1888           1            0          5863       5862       2706                6        0
   117  1904           0            1          2955       2942       1379                7        0
   118  1920           0            1          3009       2997       1416                8        0
   126  2048           0            0         25276      25276      12638                1        0
   128  2080           0            1          6060       6052       3232                8        0
   129  2096           1            0          3081       3080       1659                7        0
   134  2176           0            1         14835      14830       7912                8        0
   135  2192           0            1          2769       2758       1491                7        0
   137  2224           0            1          5082       5077       2772                6        0
   140  2272           0            1          7236       7232       4020                5        0
   144  2336           0            1          8428       8423       4816                4        0
   147  2384           0            1          5316       5313       3101                7        0
   151  2448           0            1          5445       5443       3267                3        0
   155  2512           0            0          4121       4121       2536                8        0
   158  2560           0            1          2208       2205       1380                5        0
   160  2592           0            0          1133       1133        721                7        0
   168  2720           0            0          2712       2712       1808                2        0
   177  2864           1            0          1100       1098        770                7        0
   180  2912           0            1           189        183        135                5        0
   184  2976           0            1           176        166        128                8        0
   190  3072           0            0           252        252        189                3        0
   197  3184           0            1           198        192        154                7        0
   202  3264           0            1           100         96         80                4        0
   211  3408           0            1           210        208        175                5        0
   217  3504           0            1            98         94         84                6        0
   222  3584           0            0           104        104         91                7        0
   225  3632           0            1            54         50         48                8        0
   254  4096           0            0         33591      33591      33591                1        0

Note, the huge size watermark is above 3632 and there are a number of new
normal classes available that previously were merged with the huge class.
For instance, size class #211 holds 210 objects of size 3408 and uses 175
physical pages, while previously for those objects we would have used 210
physical pages.
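
As a sanity check of that page math, the following standalone C snippet (a
hedged sketch, not kernel code; the constants come from the order-3 table
above and PAGE_SIZE is assumed to be 4096) reproduces the 175-page figure
for class #211:

  /*
   * Hedged sketch, not kernel code: reproduce the class #211 page math
   * from the table above, assuming PAGE_SIZE == 4096 and the order-3
   * limit of 8 pages per zspage.
   */
  #include <stdio.h>

  int main(void)
  {
          const int page_size = 4096;
          const int class_size = 3408;      /* size class #211 */
          const int pages_per_zspage = 5;   /* from the order-3 table */
          int objs_per_zspage = pages_per_zspage * page_size / class_size; /* 6 */
          int objects = 210;
          int zspages = (objects + objs_per_zspage - 1) / objs_per_zspage; /* 35 */

          printf("normal class pages: %d\n", zspages * pages_per_zspage);  /* 175 */
          printf("huge class pages:   %d\n", objects);                     /* 210 */
          return 0;
  }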

Link: https://lkml.kernel.org/r/20221031054108.541190-3-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Alexey Romanov <avromanov@sberdevices.ru>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
jonhunter pushed a commit to jonhunter/linux that referenced this pull request Nov 7, 2022
jonhunter pushed a commit to jonhunter/linux that referenced this pull request Nov 8, 2022
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Nov 9, 2022
gatieme pushed a commit to gatieme/linux that referenced this pull request Nov 24, 2022
ANBZ: #112

commit bd74048 upstream

If we hit an earlier error path in io_uring_create(), then we will have
accounted memory, but not set ctx->{sq,cq}_entries yet. Then when the
ring is torn down in error, we use those values to unaccount the memory.

Ensure we set the ctx entries before we're able to hit a potential error
path.
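
To make the ordering concrete, here is a minimal userspace sketch (hedged,
with made-up names; not the actual io_uring code): the entry counts are
recorded immediately after the memory is accounted, so an error path that
unaccounts based on those counts balances out even when it runs early.

  /*
   * Hedged sketch with hypothetical names, not the real io_uring code:
   * record the entry counts right after accounting, before any error
   * path can run, so error-path unaccounting sees the same values.
   */
  #include <stdio.h>
  #include <stdlib.h>

  struct ring_ctx {
          unsigned sq_entries;
          unsigned cq_entries;
  };

  static long accounted;

  static void ring_unaccount(struct ring_ctx *ctx)
  {
          /* would subtract 0 if the counts had not been set yet */
          accounted -= (long)ctx->sq_entries + ctx->cq_entries;
  }

  static struct ring_ctx *ring_create(unsigned sq, unsigned cq, int fail_early)
  {
          struct ring_ctx *ctx = calloc(1, sizeof(*ctx));

          accounted += (long)sq + cq;     /* memory accounted first...  */
          ctx->sq_entries = sq;           /* ...so set the counts now,  */
          ctx->cq_entries = cq;           /* before any bail-out below  */

          if (fail_early) {               /* an "earlier error path"    */
                  ring_unaccount(ctx);
                  free(ctx);
                  return NULL;
          }
          return ctx;
  }

  int main(void)
  {
          ring_create(128, 256, 1);
          printf("accounted after failed create: %ld\n", accounted);  /* 0 */
          return 0;
  }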

Cc: stable@vger.kernel.org
Reported-by: Tomáš Chaloupka <chalucha@gmail.com>
Tested-by: Tomáš Chaloupka <chalucha@gmail.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Damenly pushed a commit to Damenly/linux that referenced this pull request Jul 25, 2023
Signed-off-by: nascs <nascs@radxa.com>
mj22226 pushed a commit to mj22226/linux that referenced this pull request Jul 30, 2023
Signed-off-by: nascs <nascs@radxa.com>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Nov 27, 2023
With latest upstream llvm18, the following test cases failed:
  $ ./test_progs -j
  torvalds#13/2    bpf_cookie/multi_kprobe_link_api:FAIL
  torvalds#13/3    bpf_cookie/multi_kprobe_attach_api:FAIL
  torvalds#13      bpf_cookie:FAIL
  torvalds#77      fentry_fexit:FAIL
  torvalds#78/1    fentry_test/fentry:FAIL
  torvalds#78      fentry_test:FAIL
  torvalds#82/1    fexit_test/fexit:FAIL
  torvalds#82      fexit_test:FAIL
  torvalds#112/1   kprobe_multi_test/skel_api:FAIL
  torvalds#112/2   kprobe_multi_test/link_api_addrs:FAIL
  ...
  torvalds#112     kprobe_multi_test:FAIL
  torvalds#356/17  test_global_funcs/global_func17:FAIL
  torvalds#356     test_global_funcs:FAIL

Further analysis shows llvm upstream patch [1] is responsible
for the above failures. For example, for function bpf_fentry_test7()
in net/bpf/test_run.c, without [1], the asm code is:
  0000000000000400 <bpf_fentry_test7>:
     400: f3 0f 1e fa                   endbr64
     404: e8 00 00 00 00                callq   0x409 <bpf_fentry_test7+0x9>
     409: 48 89 f8                      movq    %rdi, %rax
     40c: c3                            retq
     40d: 0f 1f 00                      nopl    (%rax)
and with [1], the asm code is:
  0000000000005d20 <bpf_fentry_test7.specialized.1>:
    5d20: e8 00 00 00 00                callq   0x5d25 <bpf_fentry_test7.specialized.1+0x5>
    5d25: c3                            retq
and <bpf_fentry_test7.specialized.1> is called instead of <bpf_fentry_test7>
and this caused test failures for #13/#77 etc. except #356.

For test case #356/17, with [1] (progs/test_global_func17.c),
the main prog looks like:
  0000000000000000 <global_func17>:
       0:       b4 00 00 00 2a 00 00 00 w0 = 0x2a
       1:       95 00 00 00 00 00 00 00 exit
which passed verification while the test itself expects a verification
failure.

Let us add 'barrier_var' style asm code in both places to prevent
function specialization which caused selftests failure.

  [1] llvm/llvm-project#72903
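
For reference, barrier_var() is a compiler barrier provided by libbpf's bpf_helpers.h.
The sketch below shows the general idiom (the exact macro definition may vary between
versions, and the function name here is purely illustrative): an empty asm statement
with the variable as an in/out operand makes the value opaque to the optimizer, so the
argument can no longer be assumed constant and the function cannot be specialized away.

  /* barrier_var-style compiler barrier: "+r" tells the compiler the value
   * may change, defeating constant propagation and specialization. */
  #define barrier_var(var) asm volatile("" : "+r"(var))

  __attribute__((noinline)) long keep_me_generic(long arg)
  {
          barrier_var(arg);       /* 'arg' is now opaque to the optimizer */
          return arg;
  }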

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
alobakin pushed a commit to alobakin/linux that referenced this pull request Nov 27, 2023
With latest upstream llvm18, the following test cases failed:

  $ ./test_progs -j
  #13/2    bpf_cookie/multi_kprobe_link_api:FAIL
  #13/3    bpf_cookie/multi_kprobe_attach_api:FAIL
  #13      bpf_cookie:FAIL
  torvalds#77      fentry_fexit:FAIL
  torvalds#78/1    fentry_test/fentry:FAIL
  torvalds#78      fentry_test:FAIL
  torvalds#82/1    fexit_test/fexit:FAIL
  torvalds#82      fexit_test:FAIL
  torvalds#112/1   kprobe_multi_test/skel_api:FAIL
  torvalds#112/2   kprobe_multi_test/link_api_addrs:FAIL
  [...]
  torvalds#112     kprobe_multi_test:FAIL
  torvalds#356/17  test_global_funcs/global_func17:FAIL
  torvalds#356     test_global_funcs:FAIL

Further analysis shows llvm upstream patch [1] is responsible for the above
failures. For example, for function bpf_fentry_test7() in net/bpf/test_run.c,
without [1], the asm code is:

  0000000000000400 <bpf_fentry_test7>:
     400: f3 0f 1e fa                   endbr64
     404: e8 00 00 00 00                callq   0x409 <bpf_fentry_test7+0x9>
     409: 48 89 f8                      movq    %rdi, %rax
     40c: c3                            retq
     40d: 0f 1f 00                      nopl    (%rax)

... and with [1], the asm code is:

  0000000000005d20 <bpf_fentry_test7.specialized.1>:
    5d20: e8 00 00 00 00                callq   0x5d25 <bpf_fentry_test7.specialized.1+0x5>
    5d25: c3                            retq

... and <bpf_fentry_test7.specialized.1> is called instead of <bpf_fentry_test7>
and this caused test failures for #13/#77 etc. except #356.

For test case #356/17, with [1] (progs/test_global_func17.c), the main prog
looks like:

  0000000000000000 <global_func17>:
       0:       b4 00 00 00 2a 00 00 00 w0 = 0x2a
       1:       95 00 00 00 00 00 00 00 exit

... which passed verification while the test itself expects a verification
failure.

Let us add 'barrier_var' style asm code in both places to prevent function
specialization which caused selftests failure.

  [1] llvm/llvm-project#72903

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20231127050342.1945270-1-yonghong.song@linux.dev
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Dec 25, 2023
If CONFIG_BPF_JIT_ALWAYS_ON is not set and bpf_jit_enable is 0, there
exist 6 failed tests.

  [root@linux bpf]# echo 0 > /proc/sys/net/core/bpf_jit_enable
  [root@linux bpf]# ./test_verifier | grep FAIL
  torvalds#107/p inline simple bpf_loop call FAIL
  torvalds#108/p don't inline bpf_loop call, flags non-zero FAIL
  torvalds#109/p don't inline bpf_loop call, callback non-constant FAIL
  torvalds#110/p bpf_loop_inline and a dead func FAIL
  torvalds#111/p bpf_loop_inline stack locations for loop vars FAIL
  torvalds#112/p inline bpf_loop call in a big program FAIL
  Summary: 505 PASSED, 266 SKIPPED, 6 FAILED

The test log shows that callbacks are not allowed in non-JITed programs,
interpreter doesn't support them yet, thus these tests should be skipped
if jit is disabled, just return -ENOTSUPP instead of -EINVAL for pseudo
calls in fixup_call_args().

With this patch:

  [root@linux bpf]# echo 0 > /proc/sys/net/core/bpf_jit_enable
  [root@linux bpf]# ./test_verifier | grep FAIL
  Summary: 505 PASSED, 272 SKIPPED, 0 FAILED
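
Paraphrasing the change rather than quoting the verifier: when the JIT is disabled and
a program relies on callbacks that the interpreter cannot run, the verifier should
report the kernel-internal ENOTSUPP (524), which the selftests translate into a skip,
rather than EINVAL. An illustrative sketch with made-up names, not the verbatim
fixup_call_args() code:

  #define ENOTSUPP 524    /* kernel-internal "operation is not supported" */

  static int check_pseudo_call(int jit_enabled, int uses_callbacks)
  {
          if (uses_callbacks && !jit_enabled)
                  return -ENOTSUPP;       /* was -EINVAL before this patch */
          return 0;
  }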

Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
logic10492 pushed a commit to logic10492/linux-amd-zen2 that referenced this pull request Jan 18, 2024
Kaz205 pushed a commit to Kaz205/linux that referenced this pull request Feb 5, 2024
[ Upstream commit b16904f ]

With latest upstream llvm18, the following test cases failed:

  $ ./test_progs -j
  torvalds#13/2    bpf_cookie/multi_kprobe_link_api:FAIL
  torvalds#13/3    bpf_cookie/multi_kprobe_attach_api:FAIL
  torvalds#13      bpf_cookie:FAIL
  torvalds#77      fentry_fexit:FAIL
  torvalds#78/1    fentry_test/fentry:FAIL
  torvalds#78      fentry_test:FAIL
  torvalds#82/1    fexit_test/fexit:FAIL
  torvalds#82      fexit_test:FAIL
  torvalds#112/1   kprobe_multi_test/skel_api:FAIL
  torvalds#112/2   kprobe_multi_test/link_api_addrs:FAIL
  [...]
  torvalds#112     kprobe_multi_test:FAIL
  torvalds#356/17  test_global_funcs/global_func17:FAIL
  torvalds#356     test_global_funcs:FAIL

Further analysis shows llvm upstream patch [1] is responsible for the above
failures. For example, for function bpf_fentry_test7() in net/bpf/test_run.c,
without [1], the asm code is:

  0000000000000400 <bpf_fentry_test7>:
     400: f3 0f 1e fa                   endbr64
     404: e8 00 00 00 00                callq   0x409 <bpf_fentry_test7+0x9>
     409: 48 89 f8                      movq    %rdi, %rax
     40c: c3                            retq
     40d: 0f 1f 00                      nopl    (%rax)

... and with [1], the asm code is:

  0000000000005d20 <bpf_fentry_test7.specialized.1>:
    5d20: e8 00 00 00 00                callq   0x5d25 <bpf_fentry_test7.specialized.1+0x5>
    5d25: c3                            retq

... and <bpf_fentry_test7.specialized.1> is called instead of <bpf_fentry_test7>
and this caused test failures for #13/#77 etc. except #356.

For test case #356/17, with [1] (progs/test_global_func17.c), the main prog
looks like:

  0000000000000000 <global_func17>:
       0:       b4 00 00 00 2a 00 00 00 w0 = 0x2a
       1:       95 00 00 00 00 00 00 00 exit

... which passed verification while the test itself expects a verification
failure.

Let us add 'barrier_var' style asm code in both places to prevent function
specialization which caused selftests failure.

  [1] llvm/llvm-project#72903

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20231127050342.1945270-1-yonghong.song@linux.dev
Signed-off-by: Sasha Levin <sashal@kernel.org>
1054009064 pushed a commit to 1054009064/linux that referenced this pull request Feb 5, 2024
[ Upstream commit b16904f ]

With latest upstream llvm18, the following test cases failed:

  $ ./test_progs -j
  torvalds#13/2    bpf_cookie/multi_kprobe_link_api:FAIL
  torvalds#13/3    bpf_cookie/multi_kprobe_attach_api:FAIL
  torvalds#13      bpf_cookie:FAIL
  torvalds#77      fentry_fexit:FAIL
  torvalds#78/1    fentry_test/fentry:FAIL
  torvalds#78      fentry_test:FAIL
  torvalds#82/1    fexit_test/fexit:FAIL
  torvalds#82      fexit_test:FAIL
  torvalds#112/1   kprobe_multi_test/skel_api:FAIL
  torvalds#112/2   kprobe_multi_test/link_api_addrs:FAIL
  [...]
  torvalds#112     kprobe_multi_test:FAIL
  torvalds#356/17  test_global_funcs/global_func17:FAIL
  torvalds#356     test_global_funcs:FAIL

Further analysis shows llvm upstream patch [1] is responsible for the above
failures. For example, for function bpf_fentry_test7() in net/bpf/test_run.c,
without [1], the asm code is:

  0000000000000400 <bpf_fentry_test7>:
     400: f3 0f 1e fa                   endbr64
     404: e8 00 00 00 00                callq   0x409 <bpf_fentry_test7+0x9>
     409: 48 89 f8                      movq    %rdi, %rax
     40c: c3                            retq
     40d: 0f 1f 00                      nopl    (%rax)

... and with [1], the asm code is:

  0000000000005d20 <bpf_fentry_test7.specialized.1>:
    5d20: e8 00 00 00 00                callq   0x5d25 <bpf_fentry_test7.specialized.1+0x5>
    5d25: c3                            retq

... and <bpf_fentry_test7.specialized.1> is called instead of <bpf_fentry_test7>
and this caused test failures for #13/#77 etc. except #356.

For test case #356/17, with [1] (progs/test_global_func17.c), the main prog
looks like:

  0000000000000000 <global_func17>:
       0:       b4 00 00 00 2a 00 00 00 w0 = 0x2a
       1:       95 00 00 00 00 00 00 00 exit

... which passed verification while the test itself expects a verification
failure.

Let us add 'barrier_var' style asm code in both places to prevent function
specialization which caused selftests failure.

  [1] llvm/llvm-project#72903

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20231127050342.1945270-1-yonghong.song@linux.dev
Signed-off-by: Sasha Levin <sashal@kernel.org>
rodriguezst pushed a commit to rodriguezst/linux that referenced this pull request May 17, 2024
This change adds support for a configurable eir_max_name_len for
platforms which require a complete name longer than 48 bytes in the EIR.

From bluetoothctl:
[bluetooth]# system-alias
012345678901234567890123456789012345678901234567890123456789
Changing 012345678901234567890123456789012345678901234567890123456789
succeeded
[CHG] Controller DC:71:96:69:02:89 Alias:
012345678901234567890123456789012345678901234567890123456789

From btmon:
< HCI Command: Write Local Name (0x03|0x0013) plen 248     torvalds#109
[hci0] 88.567990
        Name:
012345678901234567890123456789012345678901234567890123456789
> HCI Event: Command Complete (0x0e) plen 4  torvalds#110 [hci0] 88.663854
      Write Local Name (0x03|0x0013) ncmd 1
        Status: Success (0x00)
@ MGMT Event: Local Name Changed (0x0008) plen 260               
{0x0004} [hci0] 88.663948
        Name:
012345678901234567890123456789012345678901234567890123456789
        Short name:
< HCI Command: Write Extended Inquiry Response (0x03|0x0052) plen 241
 torvalds#111 [hci0] 88.663977
        FEC: Not required (0x00)
        Name (complete):
012345678901234567890123456789012345678901234567890123456789
        TX power: 12 dBm
        Device ID: Bluetooth SIG assigned (0x0001)
          Vendor: Google (224)
          Product: 0xc405
          Version: 0.5.6 (0x0056)
        16-bit Service UUIDs (complete): 7 entries
          Generic Access Profile (0x1800)
          Generic Attribute Profile (0x1801)
          Device Information (0x180a)
          A/V Remote Control (0x110e)
          A/V Remote Control Target (0x110c)
          Handsfree Audio Gateway (0x111f)
          Audio Source (0x110a)
> HCI Event: Command Complete (0x0e) plen 4    torvalds#112 [hci0] 88.664874
      Write Extended Inquiry Response (0x03|0x0052) ncmd 1
        Status: Success (0x00)

    (am from https://patchwork.kernel.org/patch/11687367/)
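
As background, each EIR field is encoded as [length][type][data], with type 0x09 for
the Complete Local Name and 0x08 for the Shortened Local Name. The sketch below shows
how a configurable cap on the advertised name could be applied; eir_append_name() and
its signature are hypothetical, only the field encoding follows the specification:

  #include <stdint.h>
  #include <string.h>

  #define EIR_NAME_SHORT    0x08
  #define EIR_NAME_COMPLETE 0x09

  static uint8_t *eir_append_name(uint8_t *eir, const char *name,
                                  size_t eir_max_name_len)
  {
          size_t len = strlen(name);
          uint8_t type = EIR_NAME_COMPLETE;

          if (len > eir_max_name_len) {   /* does not fit: shorten it */
                  len = eir_max_name_len;
                  type = EIR_NAME_SHORT;
          }

          *eir++ = len + 1;               /* field length includes the type byte */
          *eir++ = type;
          memcpy(eir, name, len);
          return eir + len;
  }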

Reviewed-by: Sonny Sasaka <sonnysasaka@chromium.org>
Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Alain Michaud <alainm@chromium.org>

Backport notes: HDEV_PARAM_U16 is changed from two parameters to one
parameter.

BUG=none
TEST=build

Signed-off-by: Zhengping Jiang <jiangzp@google.com>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Oct 1, 2024
This change adds support for a configurable eir_max_name_len for
platforms which require a complete name longer than 48 bytes in the EIR.

From bluetoothctl:
[bluetooth]# system-alias
012345678901234567890123456789012345678901234567890123456789
Changing 012345678901234567890123456789012345678901234567890123456789
succeeded
[CHG] Controller DC:71:96:69:02:89 Alias:
012345678901234567890123456789012345678901234567890123456789

From btmon:
< HCI Command: Write Local Name (0x03|0x0013) plen 248     torvalds#109
[hci0] 88.567990
        Name:
012345678901234567890123456789012345678901234567890123456789
> HCI Event: Command Complete (0x0e) plen 4  torvalds#110 [hci0] 88.663854
      Write Local Name (0x03|0x0013) ncmd 1
        Status: Success (0x00)
@ MGMT Event: Local Name Changed (0x0008) plen 260               
{0x0004} [hci0] 88.663948
        Name:
012345678901234567890123456789012345678901234567890123456789
        Short name:
< HCI Command: Write Extended Inquiry Response (0x03|0x0052) plen 241
 torvalds#111 [hci0] 88.663977
        FEC: Not required (0x00)
        Name (complete):
012345678901234567890123456789012345678901234567890123456789
        TX power: 12 dBm
        Device ID: Bluetooth SIG assigned (0x0001)
          Vendor: Google (224)
          Product: 0xc405
          Version: 0.5.6 (0x0056)
        16-bit Service UUIDs (complete): 7 entries
          Generic Access Profile (0x1800)
          Generic Attribute Profile (0x1801)
          Device Information (0x180a)
          A/V Remote Control (0x110e)
          A/V Remote Control Target (0x110c)
          Handsfree Audio Gateway (0x111f)
          Audio Source (0x110a)
> HCI Event: Command Complete (0x0e) plen 4    torvalds#112 [hci0] 88.664874
      Write Extended Inquiry Response (0x03|0x0052) ncmd 1
        Status: Success (0x00)

    (am from https://patchwork.kernel.org/patch/11687367/)

Reviewed-by: Sonny Sasaka <sonnysasaka@chromium.org>
Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Alain Michaud <alainm@chromium.org>

Backport notes: HDEV_PARAM_U16 is changed from two parameters to one
parameter.

BUG=none
TEST=build

Signed-off-by: Zhengping Jiang <jiangzp@google.com>