Make a branch for us to place our additional patches etc. onto #2

Closed
grahamwhaley opened this issue Nov 16, 2017 · 2 comments

@grahamwhaley

Our master branch is tracking the mainline kernel - we need a branch onto which we apply our config and small set of patches.
We should probably define a naming scheme for the branches. Possibly just appending 'kata' to the end of the kernel version/tag will work - so we would be on branch 'v4.13.13-kata' when #1 lands for instance?

@devimc

devimc commented Nov 16, 2017

sameo pushed a commit that referenced this issue Nov 17, 2017
commit 67f8a8c upstream.

Aneesh Kumar reported seeing host crashes when running recent kernels
on POWER8.  The symptom was an oops like this:

Unable to handle kernel paging request for data at address 0xf00000000786c620
Faulting instruction address: 0xc00000000030e1e4
Oops: Kernel access of bad area, sig: 11 [#1]
LE SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in: powernv_op_panel
CPU: 24 PID: 6663 Comm: qemu-system-ppc Tainted: G        W 4.13.0-rc7-43932-gfc36c59 #2
task: c000000fdeadfe80 task.stack: c000000fdeb68000
NIP:  c00000000030e1e4 LR: c00000000030de6c CTR: c000000000103620
REGS: c000000fdeb6b450 TRAP: 0300   Tainted: G        W        (4.13.0-rc7-43932-gfc36c59)
MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24044428  XER: 20000000
CFAR: c00000000030e134 DAR: f00000000786c620 DSISR: 40000000 SOFTE: 0
GPR00: 0000000000000000 c000000fdeb6b6d0 c0000000010bd000 000000000000e1b0
GPR04: c00000000115e168 c000001fffa6e4b0 c00000000115d000 c000001e1b180386
GPR08: f000000000000000 c000000f9a8913e0 f00000000786c600 00007fff587d0000
GPR12: c000000fdeb68000 c00000000fb0f000 0000000000000001 00007fff587cffff
GPR16: 0000000000000000 c000000000000000 00000000003fffff c000000fdebfe1f8
GPR20: 0000000000000004 c000000fdeb6b8a8 0000000000000001 0008000000000040
GPR24: 07000000000000c0 00007fff587cffff c000000fdec20bf8 00007fff587d0000
GPR28: c000000fdeca9ac0 00007fff587d0000 00007fff587c0000 00007fff587d0000
NIP [c00000000030e1e4] __get_user_pages_fast+0x434/0x1070
LR [c00000000030de6c] __get_user_pages_fast+0xbc/0x1070
Call Trace:
[c000000fdeb6b6d0] [c00000000139dab8] lock_classes+0x0/0x35fe50 (unreliable)
[c000000fdeb6b7e0] [c00000000030ef38] get_user_pages_fast+0xf8/0x120
[c000000fdeb6b830] [c000000000112318] kvmppc_book3s_hv_page_fault+0x308/0xf30
[c000000fdeb6b960] [c00000000010e10c] kvmppc_vcpu_run_hv+0xfdc/0x1f00
[c000000fdeb6bb20] [c0000000000e915c] kvmppc_vcpu_run+0x2c/0x40
[c000000fdeb6bb40] [c0000000000e5650] kvm_arch_vcpu_ioctl_run+0x110/0x300
[c000000fdeb6bbe0] [c0000000000d6468] kvm_vcpu_ioctl+0x528/0x900
[c000000fdeb6bd40] [c0000000003bc04c] do_vfs_ioctl+0xcc/0x950
[c000000fdeb6bde0] [c0000000003bc930] SyS_ioctl+0x60/0x100
[c000000fdeb6be30] [c00000000000b96c] system_call+0x58/0x6c
Instruction dump:
7ca81a14 2fa50000 41de0010 7cc8182a 68c60002 78c6ffe2 0b060000 3cc2000a
794a3664 390610d8 e9080000 7d485214 <e90a0020> 7d435378 790507e1 408202f0
---[ end trace fad4a342d0414aa2 ]---

It turns out that what has happened is that the SLB entry for the
vmemmap region hasn't been reloaded on exit from a guest, and it has
the wrong page size.  Then, when the host next accesses the vmemmap
region, it gets a page fault.

Commit a25bd72 ("powerpc/mm/radix: Workaround prefetch issue with
KVM", 2017-07-24) modified the guest exit code so that it now only clears
out the SLB for hash guests.  The code tests the radix flag and puts the
result in a non-volatile CR field, CR2, and later branches based on CR2.

Unfortunately, the kvmppc_save_tm function, which gets called between
those two points, modifies all the user-visible registers in the case
where the guest was in transactional or suspended state, except for a
few which it restores (namely r1, r2, r9 and r13).  Thus the hash/radix indication in CR2 gets corrupted.

This fixes the problem by re-doing the comparison just before the
result is needed.  For good measure, this also adds comments next to
the call sites of kvmppc_save_tm and kvmppc_restore_tm pointing out
that non-volatile register state will be lost.
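
A conceptual C sketch of the idea (the real fix is PowerPC assembly that
caches the flag in CR2; every name below is illustrative, not a kernel
symbol):

    /* Sketch only: re-do the radix/hash comparison after the call that can
     * clobber the cached result, instead of trusting the cached value. */
    static void guest_exit_path(struct vm *vm)
    {
        bool radix = vm_is_radix(vm);   /* cached early (CR2 in the real code) */

        save_transactional_state(vm);   /* may clobber non-volatile state */

        radix = vm_is_radix(vm);        /* fix: re-do the comparison here */
        if (!radix)
            clear_slb();                /* only hash guests need this */
    }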

Fixes: a25bd72 ("powerpc/mm/radix: Workaround prefetch issue with KVM")
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
sameo pushed a commit that referenced this issue Nov 17, 2017
commit 656d61c upstream.

printk_ratelimit() invokes ___ratelimit() which may invoke a normal
printk() (pr_warn() in this particular case) to warn about suppressed
output.  Given that printk_ratelimit() may be called from anywhere, that
pr_warn() is dangerous - it may end up deadlocking the system.  Fix
___ratelimit() by using deferred printk().
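
A minimal sketch of the kind of change described, assuming the warning about
suppressed callbacks is the printk() call in question (the exact patch may
differ):

    #include <linux/printk.h>
    #include <linux/ratelimit.h>

    /* printk_deferred() queues the message and flushes it from a safe
     * context, so ___ratelimit() never re-enters the console path while
     * scheduler or console locks are already held. */
    static void warn_suppressed(struct ratelimit_state *rs, const char *func)
    {
        /* before: pr_warn("%s: %d callbacks suppressed\n", func, rs->missed); */
        printk_deferred(KERN_WARNING "%s: %d callbacks suppressed\n",
                        func, rs->missed);
    }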

Sasha reported the following lockdep error:

 : Unregister pv shared memory for cpu 8
 : select_fallback_rq: 3 callbacks suppressed
 : process 8583 (trinity-c78) no longer affine to cpu8
 :
 : ======================================================
 : WARNING: possible circular locking dependency detected
 : 4.14.0-rc2-next-20170927+ #252 Not tainted
 : ------------------------------------------------------
 : migration/8/62 is trying to acquire lock:
 : (&port_lock_key){-.-.}, at: serial8250_console_write()
 :
 : but task is already holding lock:
 : (&rq->lock){-.-.}, at: sched_cpu_dying()
 :
 : which lock already depends on the new lock.
 :
 :
 : the existing dependency chain (in reverse order) is:
 :
 : -> #3 (&rq->lock){-.-.}:
 : __lock_acquire()
 : lock_acquire()
 : _raw_spin_lock()
 : task_fork_fair()
 : sched_fork()
 : copy_process.part.31()
 : _do_fork()
 : kernel_thread()
 : rest_init()
 : start_kernel()
 : x86_64_start_reservations()
 : x86_64_start_kernel()
 : verify_cpu()
 :
 : -> #2 (&p->pi_lock){-.-.}:
 : __lock_acquire()
 : lock_acquire()
 : _raw_spin_lock_irqsave()
 : try_to_wake_up()
 : default_wake_function()
 : woken_wake_function()
 : __wake_up_common()
 : __wake_up_common_lock()
 : __wake_up()
 : tty_wakeup()
 : tty_port_default_wakeup()
 : tty_port_tty_wakeup()
 : uart_write_wakeup()
 : serial8250_tx_chars()
 : serial8250_handle_irq.part.25()
 : serial8250_default_handle_irq()
 : serial8250_interrupt()
 : __handle_irq_event_percpu()
 : handle_irq_event_percpu()
 : handle_irq_event()
 : handle_level_irq()
 : handle_irq()
 : do_IRQ()
 : ret_from_intr()
 : native_safe_halt()
 : default_idle()
 : arch_cpu_idle()
 : default_idle_call()
 : do_idle()
 : cpu_startup_entry()
 : rest_init()
 : start_kernel()
 : x86_64_start_reservations()
 : x86_64_start_kernel()
 : verify_cpu()
 :
 : -> #1 (&tty->write_wait){-.-.}:
 : __lock_acquire()
 : lock_acquire()
 : _raw_spin_lock_irqsave()
 : __wake_up_common_lock()
 : __wake_up()
 : tty_wakeup()
 : tty_port_default_wakeup()
 : tty_port_tty_wakeup()
 : uart_write_wakeup()
 : serial8250_tx_chars()
 : serial8250_handle_irq.part.25()
 : serial8250_default_handle_irq()
 : serial8250_interrupt()
 : __handle_irq_event_percpu()
 : handle_irq_event_percpu()
 : handle_irq_event()
 : handle_level_irq()
 : handle_irq()
 : do_IRQ()
 : ret_from_intr()
 : native_safe_halt()
 : default_idle()
 : arch_cpu_idle()
 : default_idle_call()
 : do_idle()
 : cpu_startup_entry()
 : rest_init()
 : start_kernel()
 : x86_64_start_reservations()
 : x86_64_start_kernel()
 : verify_cpu()
 :
 : -> #0 (&port_lock_key){-.-.}:
 : check_prev_add()
 : __lock_acquire()
 : lock_acquire()
 : _raw_spin_lock_irqsave()
 : serial8250_console_write()
 : univ8250_console_write()
 : console_unlock()
 : vprintk_emit()
 : vprintk_default()
 : vprintk_func()
 : printk()
 : ___ratelimit()
 : __printk_ratelimit()
 : select_fallback_rq()
 : sched_cpu_dying()
 : cpuhp_invoke_callback()
 : take_cpu_down()
 : multi_cpu_stop()
 : cpu_stopper_thread()
 : smpboot_thread_fn()
 : kthread()
 : ret_from_fork()
 :
 : other info that might help us debug this:
 :
 : Chain exists of:
 :   &port_lock_key --> &p->pi_lock --> &rq->lock
 :
 :  Possible unsafe locking scenario:
 :
 :        CPU0                    CPU1
 :        ----                    ----
 :   lock(&rq->lock);
 :                                lock(&p->pi_lock);
 :                                lock(&rq->lock);
 :   lock(&port_lock_key);
 :
 :  *** DEADLOCK ***
 :
 : 4 locks held by migration/8/62:
 : #0: (&p->pi_lock){-.-.}, at: sched_cpu_dying()
 : #1: (&rq->lock){-.-.}, at: sched_cpu_dying()
 : #2: (printk_ratelimit_state.lock){....}, at: ___ratelimit()
 : #3: (console_lock){+.+.}, at: vprintk_emit()
 :
 : stack backtrace:
 : CPU: 8 PID: 62 Comm: migration/8 Not tainted 4.14.0-rc2-next-20170927+ #252
 : Call Trace:
 : dump_stack()
 : print_circular_bug()
 : check_prev_add()
 : ? add_lock_to_list.isra.26()
 : ? check_usage()
 : ? kvm_clock_read()
 : ? kvm_sched_clock_read()
 : ? sched_clock()
 : ? check_preemption_disabled()
 : __lock_acquire()
 : ? __lock_acquire()
 : ? add_lock_to_list.isra.26()
 : ? debug_check_no_locks_freed()
 : ? memcpy()
 : lock_acquire()
 : ? serial8250_console_write()
 : _raw_spin_lock_irqsave()
 : ? serial8250_console_write()
 : serial8250_console_write()
 : ? serial8250_start_tx()
 : ? lock_acquire()
 : ? memcpy()
 : univ8250_console_write()
 : console_unlock()
 : ? __down_trylock_console_sem()
 : vprintk_emit()
 : vprintk_default()
 : vprintk_func()
 : printk()
 : ? show_regs_print_info()
 : ? lock_acquire()
 : ___ratelimit()
 : __printk_ratelimit()
 : select_fallback_rq()
 : sched_cpu_dying()
 : ? sched_cpu_starting()
 : ? rcutree_dying_cpu()
 : ? sched_cpu_starting()
 : cpuhp_invoke_callback()
 : ? cpu_disable_common()
 : take_cpu_down()
 : ? trace_hardirqs_off_caller()
 : ? cpuhp_invoke_callback()
 : multi_cpu_stop()
 : ? __this_cpu_preempt_check()
 : ? cpu_stop_queue_work()
 : cpu_stopper_thread()
 : ? cpu_stop_create()
 : smpboot_thread_fn()
 : ? sort_range()
 : ? schedule()
 : ? __kthread_parkme()
 : kthread()
 : ? sort_range()
 : ? kthread_create_on_node()
 : ret_from_fork()
 : process 9121 (trinity-c78) no longer affine to cpu8
 : smpboot: CPU 8 is now offline

Link: http://lkml.kernel.org/r/20170928120405.18273-1-sergey.senozhatsky@gmail.com
Fixes: 6b1d174 ("ratelimit: extend to print suppressed messages on release")
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Borislav Petkov <bp@suse.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
sameo pushed a commit that referenced this issue Nov 17, 2017
commit ab31fd0 upstream.

v4.10 commit 6f2ce1c ("scsi: zfcp: fix rport unblock race with LUN
recovery") extended accessing parent pointer fields of struct
zfcp_erp_action for tracing.  If an erp_action has never been enqueued
before, these parent pointer fields are uninitialized and NULL. Examples
are zfcp objects freshly added to the parent object's children list,
before enqueueing their first recovery subsequently. In
zfcp_erp_try_rport_unblock(), we iterate such list. Accessing erp_action
fields can cause a NULL pointer dereference.  Since the kernel can read
from lowcore on s390, it does not immediately cause a kernel page
fault. Instead it can cause hangs on trying to acquire the wrong
erp_action->adapter->dbf->rec_lock in zfcp_dbf_rec_action_lvl()
                      ^bogus^
while holding already other locks with IRQs disabled.

Real life example from attaching lots of LUNs in parallel on many CPUs:

crash> bt 17723
PID: 17723  TASK: ...               CPU: 25  COMMAND: "zfcperp0.0.1800"
 LOWCORE INFO:
  -psw      : 0x0404300180000000 0x000000000038e424
  -function : _raw_spin_lock_wait_flags at 38e424
...
 #0 [fdde8fc90] zfcp_dbf_rec_action_lvl at 3e0004e9862 [zfcp]
 #1 [fdde8fce8] zfcp_erp_try_rport_unblock at 3e0004dfddc [zfcp]
 #2 [fdde8fd38] zfcp_erp_strategy at 3e0004e0234 [zfcp]
 #3 [fdde8fda8] zfcp_erp_thread at 3e0004e0a12 [zfcp]
 #4 [fdde8fe60] kthread at 173550
 #5 [fdde8feb8] kernel_thread_starter at 10add2

zfcp_adapter
 zfcp_port
  zfcp_unit <address>, 0x404040d600000000
  scsi_device NULL, returning early!
zfcp_scsi_dev.status = 0x40000000
0x40000000 ZFCP_STATUS_COMMON_RUNNING

crash> zfcp_unit <address>
struct zfcp_unit {
  erp_action = {
    adapter = 0x0,
    port = 0x0,
    unit = 0x0,
  },
}

zfcp_erp_action is always fully embedded into its container object. Such
container object is never moved in its object tree (only add or delete).
Hence, erp_action parent pointers can never change.

To fix the issue, initialize the erp_action parent pointers before
adding the erp_action container to any list and thus before it becomes
accessible from outside of its initializing function.

In order to also close the time window between zfcp_erp_setup_act()
memsetting the entire erp_action to zero and setting the parent pointers
again, drop the memset and instead explicitly initialize individually
all erp_action fields except for parent pointers. To be extra careful
not to introduce any other unintended side effect, even keep zeroing the
erp_action fields for list and timer. Also double-check with
WARN_ON_ONCE that erp_action parent pointers never change, so we get to
know when we would deviate from previous behavior.
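
A simplified, self-contained illustration of the init-before-publish pattern
described above (structure and function names are made up; this is not the
zfcp patch itself):

    #include <assert.h>

    struct adapter;
    struct port;

    struct erp_action {
        struct adapter *adapter;    /* parent pointers: set exactly once ... */
        struct port *port;          /* ... before the object is published   */
    };

    struct unit {
        struct erp_action erp_action;
    };

    static void unit_init(struct unit *u, struct adapter *a, struct port *p)
    {
        /* Set parent pointers before the unit is added to any parent list,
         * so list walkers never observe them as NULL. */
        u->erp_action.adapter = a;
        u->erp_action.port = p;
    }

    static void erp_setup_act(struct unit *u, struct adapter *a, struct port *p)
    {
        /* No blanket memset of erp_action here; only re-assert the
         * invariant that the parent pointers never change. */
        assert(u->erp_action.adapter == a);
        assert(u->erp_action.port == p);
    }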

Signed-off-by: Steffen Maier <maier@linux.vnet.ibm.com>
Fixes: 6f2ce1c ("scsi: zfcp: fix rport unblock race with LUN recovery")
Reviewed-by: Benjamin Block <bblock@linux.vnet.ibm.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
sameo pushed a commit that referenced this issue Nov 17, 2017
commit 2eb9eab upstream.

syzkaller with KASAN reported an out-of-bounds read in
asn1_ber_decoder().  It can be reproduced by the following command,
assuming CONFIG_X509_CERTIFICATE_PARSER=y and CONFIG_KASAN=y:

    keyctl add asymmetric desc $'\x30\x30' @s

The bug is that the length of an ASN.1 data value isn't validated in the
case where it is encoded using the short form, causing the decoder to
read past the end of the input buffer.  Fix it by validating the length.
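
A simplified illustration of the missing bounds check (dp is the current
offset into the input, datalen the input size; this is not the verbatim
lib/asn1_decoder.c change):

    #include <stdbool.h>
    #include <stddef.h>

    /* With a short-form length octet, 'len' is the value length itself; it
     * is only valid if that many bytes actually remain in the buffer. */
    static bool short_form_value_fits(size_t dp, size_t datalen,
                                      unsigned char len)
    {
        return dp <= datalen && len <= datalen - dp;
    }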

The bug report was:

    BUG: KASAN: slab-out-of-bounds in asn1_ber_decoder+0x10cb/0x1730 lib/asn1_decoder.c:233
    Read of size 1 at addr ffff88003cccfa02 by task syz-executor0/6818

    CPU: 1 PID: 6818 Comm: syz-executor0 Not tainted 4.14.0-rc7-00008-g5f479447d983 #2
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
     __dump_stack lib/dump_stack.c:16 [inline]
     dump_stack+0xb3/0x10b lib/dump_stack.c:52
     print_address_description+0x79/0x2a0 mm/kasan/report.c:252
     kasan_report_error mm/kasan/report.c:351 [inline]
     kasan_report+0x236/0x340 mm/kasan/report.c:409
     __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:427
     asn1_ber_decoder+0x10cb/0x1730 lib/asn1_decoder.c:233
     x509_cert_parse+0x1db/0x650 crypto/asymmetric_keys/x509_cert_parser.c:89
     x509_key_preparse+0x64/0x7a0 crypto/asymmetric_keys/x509_public_key.c:174
     asymmetric_key_preparse+0xcb/0x1a0 crypto/asymmetric_keys/asymmetric_type.c:388
     key_create_or_update+0x347/0xb20 security/keys/key.c:855
     SYSC_add_key security/keys/keyctl.c:122 [inline]
     SyS_add_key+0x1cd/0x340 security/keys/keyctl.c:62
     entry_SYSCALL_64_fastpath+0x1f/0xbe
    RIP: 0033:0x447c89
    RSP: 002b:00007fca7a5d3bd8 EFLAGS: 00000246 ORIG_RAX: 00000000000000f8
    RAX: ffffffffffffffda RBX: 00007fca7a5d46cc RCX: 0000000000447c89
    RDX: 0000000020006f4a RSI: 0000000020006000 RDI: 0000000020001ff5
    RBP: 0000000000000046 R08: fffffffffffffffd R09: 0000000000000000
    R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000000
    R13: 0000000000000000 R14: 00007fca7a5d49c0 R15: 00007fca7a5d4700

Fixes: 42d5ec2 ("X.509: Add an ASN.1 decoder")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
sameo pushed a commit that referenced this issue Nov 17, 2017
commit 624f5ab upstream.

syzkaller reported a NULL pointer dereference in asn1_ber_decoder().  It
can be reproduced by the following command, assuming
CONFIG_PKCS7_TEST_KEY=y:

        keyctl add pkcs7_test desc '' @s

The bug is that if the data buffer is empty, an integer underflow occurs
in the following check:

        if (unlikely(dp >= datalen - 1))
                goto data_overrun_error;

This results in the NULL data pointer being dereferenced.

Fix it by checking for 'datalen - dp < 2' instead.

Also fix the similar check for 'dp >= datalen - n' later in the same
function.  That one possibly could result in a buffer overread.
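
An illustration of the rewritten comparison (not the verbatim patch): keeping
the subtraction as 'datalen - dp' avoids the unsigned underflow that
'dp >= datalen - 1' hits when datalen is 0.

    #include <stdbool.h>
    #include <stddef.h>

    /* True if at least 'need' bytes remain at offset dp. */
    static bool have_remaining(size_t dp, size_t datalen, size_t need)
    {
        return dp <= datalen && datalen - dp >= need;
    }

The old 'dp >= datalen - 1' check becomes !have_remaining(dp, datalen, 2);
the later 'dp >= datalen - n' check is rewritten the same way with its own
byte count.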

The NULL pointer dereference was reproducible using the "pkcs7_test" key
type but not the "asymmetric" key type because the "asymmetric" key type
checks for a 0-length payload before calling into the ASN.1 decoder but
the "pkcs7_test" key type does not.

The bug report was:

    BUG: unable to handle kernel NULL pointer dereference at           (null)
    IP: asn1_ber_decoder+0x17f/0xe60 lib/asn1_decoder.c:233
    PGD 7b708067 P4D 7b708067 PUD 7b6ee067 PMD 0
    Oops: 0000 [#1] SMP
    Modules linked in:
    CPU: 0 PID: 522 Comm: syz-executor1 Not tainted 4.14.0-rc8 #7
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.3-20171021_125229-anatol 04/01/2014
    task: ffff9b6b3798c040 task.stack: ffff9b6b37970000
    RIP: 0010:asn1_ber_decoder+0x17f/0xe60 lib/asn1_decoder.c:233
    RSP: 0018:ffff9b6b37973c78 EFLAGS: 00010216
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000021c
    RDX: ffffffff814a04ed RSI: ffffb1524066e000 RDI: ffffffff910759e0
    RBP: ffff9b6b37973d60 R08: 0000000000000001 R09: ffff9b6b3caa4180
    R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
    R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
    FS:  00007f10ed1f2700(0000) GS:ffff9b6b3ea00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000000 CR3: 000000007b6f3000 CR4: 00000000000006f0
    Call Trace:
     pkcs7_parse_message+0xee/0x240 crypto/asymmetric_keys/pkcs7_parser.c:139
     verify_pkcs7_signature+0x33/0x180 certs/system_keyring.c:216
     pkcs7_preparse+0x41/0x70 crypto/asymmetric_keys/pkcs7_key_type.c:63
     key_create_or_update+0x180/0x530 security/keys/key.c:855
     SYSC_add_key security/keys/keyctl.c:122 [inline]
     SyS_add_key+0xbf/0x250 security/keys/keyctl.c:62
     entry_SYSCALL_64_fastpath+0x1f/0xbe
    RIP: 0033:0x4585c9
    RSP: 002b:00007f10ed1f1bd8 EFLAGS: 00000216 ORIG_RAX: 00000000000000f8
    RAX: ffffffffffffffda RBX: 00007f10ed1f2700 RCX: 00000000004585c9
    RDX: 0000000020000000 RSI: 0000000020008ffb RDI: 0000000020008000
    RBP: 0000000000000000 R08: ffffffffffffffff R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000216 R12: 00007fff1b2260ae
    R13: 00007fff1b2260af R14: 00007f10ed1f2700 R15: 0000000000000000
    Code: dd ca ff 48 8b 45 88 48 83 e8 01 4c 39 f0 0f 86 a8 07 00 00 e8 53 dd ca ff 49 8d 46 01 48 89 85 58 ff ff ff 48 8b 85 60 ff ff ff <42> 0f b6 0c 30 89 c8 88 8d 75 ff ff ff 83 e0 1f 89 8d 28 ff ff
    RIP: asn1_ber_decoder+0x17f/0xe60 lib/asn1_decoder.c:233 RSP: ffff9b6b37973c78
    CR2: 0000000000000000

Fixes: 42d5ec2 ("X.509: Add an ASN.1 decoder")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
@grahamwhaley
Author

Closing this now we have the branch.

fidencio pushed a commit that referenced this issue Oct 12, 2022
Each thread executing in an enclave is associated with a Thread Control
Structure (TCS). The test enclave contains two hardcoded TCS. Each TCS
contains meta-data used by the hardware to save and restore thread specific
information when entering/exiting the enclave.

The two TCS structures within the test enclave share their SSA (State Save
Area), resulting in the threads clobbering each other's data. Fix this by
providing each TCS with its own SSA area.

Additionally, there is an 8K stack space whose address is computed from the
enclave entry point. This computation is correct for TCS #1, which starts at
the first address inside the enclave, but results in out-of-bounds memory
accesses when entering via TCS #2. Split the 8K stack space into two separate
pages with an offset symbol between them so that the current enclave entry
calculation can continue to be used for both threads.

While using the enclave with multiple threads requires these fixes, the
impact is not apparent because every test up to this point enters the
enclave from the first TCS.

More detail about the stack fix:
-------------------------------
Before this change the test enclave (test_encl) looks as follows:

.tcs (2 pages):
(page 1) TCS #1
(page 2) TCS #2

.text (1 page)
One page of code

.data (5 pages)
(page 1) encl_buffer
(page 2) encl_buffer
(page 3) SSA
(page 4 and 5) STACK
encl_stack:

As shown above there is a symbol, encl_stack, that points to the end of the
.data segment (pointing to the end of page 5 in .data) which is also the
end of the enclave.

The enclave entry code computes the stack address by adding encl_stack to
the pointer to the TCS that entered the enclave. When entering at TCS #1
the stack is computed correctly but when entering at TCS #2 the stack
pointer would point to one page beyond the end of the enclave and a #PF
would result when TCS #2 attempts to enter the enclave.

The fix involves moving the encl_stack symbol between the two stack pages.
Doing so enables the stack address computation in the entry code to compute
the correct stack address for each TCS.
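
A worked example of the address computation described above (illustrative
values only; pages are 4K, TCS #1 at enclave offset 0x0000, TCS #2 at offset
0x1000):

    #include <stdint.h>

    /* Entry code: stack pointer = TCS address + encl_stack offset. */
    static uint64_t stack_for(uint64_t enclave_base, uint64_t tcs_offset,
                              uint64_t encl_stack_offset)
    {
        return enclave_base + tcs_offset + encl_stack_offset;
    }

    /* Old layout (encl_stack at the very end of .data, i.e. the enclave end):
     *   TCS #1: base + 0x0000 + encl_stack -> end of enclave (in bounds)
     *   TCS #2: base + 0x1000 + encl_stack -> one page past the enclave -> #PF
     * New layout (encl_stack between the two stack pages): each TCS lands at
     * the top of its own in-bounds stack page. */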

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/a49dc0d85401db788a0a3f0d795e848abf3b1f44.1636997631.git.reinette.chatre@intel.com
fidencio pushed a commit that referenced this issue Oct 12, 2022
During a PCI bus rescan, adding new devices involves two notifiers:
1. dmar_pci_bus_notifier()
2. iommu_bus_notifier()
The current code sets #1 to a low priority (INT_MIN), which results in #2
being invoked first. As a result, the struct device pointer cannot be found
in the DRHD search for the new device's DMAR/IOMMU. Subsequently, the
device is put under the "catch-all" IOMMU instead of the correct one.

This could cause system hang when device TLB invalidation is sent to the
wrong IOMMU. Invalidation timeout error or hard lockup can be observed.

This patch fixes the issue by setting a higher priority for
dmar_pci_bus_notifier, so that the DRHD search for a new device finds the
correct IOMMU.
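
An illustrative sketch of the change (the exact priority value in the real
patch may differ; anything above the default 0 makes the DMAR notifier run
before the IOMMU one when a device is added):

    #include <linux/notifier.h>

    static struct notifier_block dmar_pci_bus_nb = {
        .notifier_call = dmar_pci_bus_notifier,
        .priority      = 1,     /* was INT_MIN, which ran it last */
    };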

Reported-by: Zhang, Bernice <bernice.zhang@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
fidencio pushed a commit that referenced this issue Oct 12, 2022
So that virt_to_phys can work properly when converting the virtual
address of report_data to a physical address. This is aimed at fixing an
unaccepted-memory issue when calling TDCALL to get the tdreport data.

The calltrace is like:

[   27.952508] Kernel panic - not syncing: #VE due to access to unaccepted memory. GPA: 0x349438580
[   27.957943] CPU: 3 PID: 964 Comm: test_tdx_attest Tainted: G                 Y 5.15.0-01502-gc4b2b600cc06-dirty #2
[   27.960756] Call Trace:
[   27.961440]  dump_stack_lvl+0x33/0x42
[   27.962452]  panic+0xed/0x2bf
[   27.963287]  tdx_handle_virtualization_exception.cold.23+0xc/0xc
[   27.964923]  ? tdx_get_ve_info.part.18+0x19/0x60
[   27.966202]  exc_virtualization_exception+0x214/0x3f0
[   27.967631]  asm_exc_virtualization_exception+0x16/0x50
[   27.969071] RIP: 0010:__tdx_module_call+0x10/0x40
[   27.970363] Code: cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 b8 01 00 00 00 c3 cc cc cc cc cc 41 54 41 51 48 89 f8 4d 89 c1 49 89 c8 48 89 f1 <66> 0f 01 cc 41 5c 48 85 c0 75 22 4d 85 e4 74 1d 49 89 0c 24 49 89
[   27.975423] RSP: 0018:ff6fb40f8075be38 EFLAGS: 00010246
[   27.976857] RAX: 0000000000000004 RBX: 0000000000000000 RCX: 00000001050a1000
[   27.978802] RDX: 0000000349438580 RSI: 00000001050a1000 RDI: 0000000000000004
[   27.980784] RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000
[   27.982716] R10: ff6fb40f8075bd08 R11: ff6fb40f8075bd00 R12: 00000001050a1000
[   27.984664] R13: 0000000349438580 R14: 0000000000000003 R15: 0000000000000000
[   27.986595]  __trace_tdx_module_call.constprop.20+0x3a/0x190
[   27.988139]  ? _printk+0x58/0x6f
[   27.989046]  tdx_mcall_tdreport+0x32/0x90
[   27.990151]  tdx_attest_ioctl.cold.1+0x7f/0xe7 [intel_tdx_attest]
[   27.991839]  __x64_sys_ioctl+0x81/0xc0
[   27.992887]  do_syscall_64+0x3f/0x90
[   27.993876]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   27.995311] RIP: 0033:0x7f6993c5834b
[   27.996298] Code: 0f 1e fa 48 8b 05 55 4b 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 25 4b 0d 00 f7 d8 64 89 01 48
[   28.001397] RSP: 002b:00007ffe33d3ace8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[   28.003448] RAX: ffffffffffffffda RBX: 00007ffe33d3b1b0 RCX: 00007f6993c5834b
[   28.005383] RDX: 00007ffe33d3acf0 RSI: 00000000c0085401 RDI: 0000000000000003
[   28.007578] RBP: 00007ffe33d3acf0 R08: 0000000000000003 R09: 0000000000000003
[   28.009541] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffe33d3b170
[   28.011509] R13: 00007ffe33d3b1b0 R14: 0000000000000000 R15: 0000000000000000
[   28.013762] Kernel Offset: 0x1b000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[   28.020274] ---[ end Kernel panic - not syncing: #VE due to access to unaccepted memory. GPA: 0x349438580 ]---

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
fidencio pushed a commit that referenced this issue Oct 12, 2022
A lock: param->lock
B lock: fparam->lock
C lock: pasid_mutex

Thread #1: prq report, holds A lock, and tries to hold B lock
Thread #2: page response, holds B lock, and tries to hold C lock
Thread #3: unbind_gpasid (could be bind_gpasid or intel_svm_free_async_fn as well),
           holds C lock, and tries to hold A lock.

Deadlock happens when #1 holds A lock, #2 holds B lock, and #3 holds C lock.

PRQ report:
     A lock    =>      B lock        unlock   =>     unlock
       |                 |             |               |
       |                 +-------------+               |
       +-----------------------------------------------+

Page response:
     B lock    =>      C lock        unlock   =>     unlock
       |                  |             |              |
       |                  +-------------+              |
       +-----------------------------------------------+

Unbind_gpasid:
     C lock   =>        A lock        unlock  =>     unlock
       |                  |             |              |
       |                  +-------------+              |
       +-----------------------------------------------+

This fix moves the acquisition of the A lock in Thread #3 outside of the
C lock's protection (see the sketch after the diagrams). To demonstrate this
fully, bind_gpasid is also drawn explicitly. After the fix, the locking
sequence is as below:

Bind_gpasid:
                                                    {only for bind failure}
    A lock      unlock  =>   C lock     unlock  =>  A lock       unlock
      |           |           |          |            |            |
      +-----------+           +----------+            +------------+

PRQ report:
    A lock   =>    B lock        unlock   =>    unlock
      |              |             |              |
      |              +-------------+              |
      +-------------------------------------------+

Page response:
    B lock   =>    C lock        unlock   =>    unlock
      |              |             |              |
      |              +-------------+              |
      +-------------------------------------------+

Unbind_gpasid:
    C lock       unlock  =>    A lock     unlock
      |            |             |          |
      +------------+             +----------+
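
A pseudo-structure of the fixed unbind path (lock names map to the A/C labels
above and are illustrative; the placeholder comments stand in for the real
work):

    #include <linux/mutex.h>

    static DEFINE_MUTEX(a_lock);        /* param->lock  (A) */
    static DEFINE_MUTEX(c_lock);        /* pasid_mutex  (C) */

    static void unbind_gpasid_fixed(void)
    {
        mutex_lock(&a_lock);            /* A taken first, outside C */
        /* ... drop the per-device fault data for this PASID ... */
        mutex_unlock(&a_lock);

        mutex_lock(&c_lock);            /* C no longer nests A inside it */
        /* ... tear down the PASID binding ... */
        mutex_unlock(&c_lock);
    }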

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
fidencio pushed a commit that referenced this issue Jan 23, 2023
Without this patch, lockdep complains as follows and disables itself.
 =============================
 [ BUG: Invalid wait context ]
 5.18.0-rc1 #171 Not tainted
 -----------------------------
 swapper/0/1 is trying to lock:
 ffffffff978f7978 (&port_lock_key){....}-{3:3}, at: serial8250_console_write+0x557/0x610
 other info that might help us debug this:
 context-{5:5}
 3 locks held by swapper/0/1:
  #0: ffffffff95a9ab48 (rcu_tasks.cbs_gbl_lock){....}-{2:2}, at: cblist_init_generic+0x29/0x310
  #1: ffffffff95a86d60 (console_lock){+.+.}-{0:0}, at: vprintk_emit+0x153/0x200
  #2: ffffffff959a66c0 (console_owner){....}-{0:0}, at: console_unlock+0x27a/0x620
 stack backtrace:
 CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.18.0-rc1-yamahata-kvm-upstream+ #171 eb656e2c6a2533f980e80fcf0fe9314dd3ea2a22
 Hardware name: Intel Corporation EAGLESTREAM/EAGLESTREAM, BIOS EGSDCRB1.SYS.0066.D24.2110072326 10/07/2021
 Call Trace:
  <TASK>
  show_stack+0x52/0x58
  dump_stack_lvl+0x5b/0x82
  dump_stack+0x10/0x12
  check_wait_context.cold+0xc0/0xfe
  __lock_acquire+0x222/0x990
  lock_acquire.part.0+0x126/0x2f0
  ? serial8250_console_write+0x557/0x610
  ? rcu_read_unlock+0x50/0x50
  ? check_prev_add+0x1270/0x1270
  ? __this_cpu_preempt_check+0x13/0x20
  lock_acquire+0x8f/0x180
  ? serial8250_console_write+0x557/0x610
  _raw_spin_lock_irqsave+0x43/0x60
  ? serial8250_console_write+0x557/0x610
  serial8250_console_write+0x557/0x610
  ? serial8250_config_port+0x210/0x210
  ? rcu_read_unlock+0x50/0x50
  ? __lock_release+0x14d/0x2d0
  ? record_print_text+0x206/0x260
  ? console_unlock+0x2a5/0x620
  ? __this_cpu_preempt_check+0x13/0x20
  univ8250_console_write+0x4b/0x60
  call_console_drivers.constprop.0+0x180/0x250
  console_unlock+0x2d0/0x620
  ? console_unlock+0x27a/0x620
  ? devkmsg_read+0x420/0x420
  ? __down_trylock_console_sem+0x7c/0xc0
  ? vprintk_emit+0x153/0x200
  vprintk_emit+0x1a4/0x200
  vprintk_default+0x1d/0x20
  vprintk+0x4e/0x60
  _printk+0xb2/0xe3
  ? record_print_text.cold+0x11/0x11
  ? rwlock_bug.part.0+0x60/0x60
  ? lock_acquire+0x8f/0x180
  ? cblist_init_generic+0x29/0x310
  ? do_raw_spin_lock+0x11e/0x1b0
  ? do_raw_spin_lock+0x109/0x1b0
  cblist_init_generic.cold+0x44/0x52
  rcu_init_tasks_generic+0x15/0x164
  kernel_init_freeable+0x103/0x1ba
  ? rest_init+0x340/0x340
  kernel_init+0x1f/0x170
  ? rest_init+0x340/0x340
  ret_from_fork+0x1f/0x30
  </TASK>

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
fidencio pushed a commit that referenced this issue Jan 23, 2023
UEFI Specification version 2.9 introduces the concept of memory
acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual Machine
platform.

There are several ways the kernel can deal with unaccepted memory:

 1. Accept all the memory during the boot. It is easy to implement and
    it doesn't have runtime cost once the system is booted. The downside
    is very long boot time.

    Acceptance can be parallelized across multiple CPUs to keep it
    manageable (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to
    saturate memory bandwidth and does not scale beyond that point.

 2. Accept a block of memory on the first use. It requires more
    infrastructure and changes in page allocator to make it work, but
    it provides good boot time.

    On-demand memory acceptance means latency spikes every time the kernel
    steps onto a new memory block. The spikes will go away once the workload
    data set size stabilizes or all memory gets accepted.

 3. Accept all memory in background. Introduce a thread (or multiple)
    that gets memory accepted proactively. It will minimize the time the
    system experiences latency spikes on memory allocation while keeping
    boot time low.

    This approach cannot function on its own. It is an extension of #2:
    background memory acceptance requires functional scheduler, but the
    page allocator may need to tap into unaccepted memory before that.

    The downside of the approach is that these threads also steal CPU
    cycles and memory bandwidth from the user's workload and may hurt
    user experience.

Implement #2 for now. It is a reasonable default. Some workloads may
want to use #1 or #3 and they can be implemented later based on user's
demands.

Support of unaccepted memory requires a few changes in core-mm code:

  - memblock has to accept memory on allocation;

  - page allocator has to accept memory on the first allocation of the
    page;

Memblock change is trivial.

The page allocator is modified to accept pages on the first allocation.
The new page type (encoded in the _mapcount) -- PageUnaccepted() -- is
used to indicate that the page requires acceptance.

Architecture has to provide two helpers if it wants to support
unaccepted memory:

 - accept_memory() makes a range of physical addresses accepted.

 - range_contains_unaccepted_memory() checks whether anything within the
   range of physical addresses requires acceptance (see the prototype
   sketch below).
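
A prototype sketch of the two architecture hooks described above (signatures
assumed from the description; the final patch may differ):

    #include <linux/types.h>

    /* Accept (make usable) all memory in the physical range [start, end). */
    void accept_memory(phys_addr_t start, phys_addr_t end);

    /* Return true if any part of [start, end) still needs to be accepted. */
    bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end);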

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>	# memblock
Reviewed-by: David Hildenbrand <david@redhat.com>
fidencio pushed a commit that referenced this issue Feb 23, 2023
Without this patch, lockdep complains as follows and disables itself.
 =============================
 [ BUG: Invalid wait context ]
 5.18.0-rc1 #171 Not tainted
 -----------------------------
 swapper/0/1 is trying to lock:
 ffffffff978f7978 (&port_lock_key){....}-{3:3}, at: serial8250_console_write+0x557/0x610
 other info that might help us debug this:
 context-{5:5}
 3 locks held by swapper/0/1:
  #0: ffffffff95a9ab48 (rcu_tasks.cbs_gbl_lock){....}-{2:2}, at: cblist_init_generic+0x29/0x310
  #1: ffffffff95a86d60 (console_lock){+.+.}-{0:0}, at: vprintk_emit+0x153/0x200
  #2: ffffffff959a66c0 (console_owner){....}-{0:0}, at: console_unlock+0x27a/0x620
 stack backtrace:
 CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.18.0-rc1-yamahata-kvm-upstream+ #171 eb656e2c6a2533f980e80fcf0fe9314dd3ea2a22
 Hardware name: Intel Corporation EAGLESTREAM/EAGLESTREAM, BIOS EGSDCRB1.SYS.0066.D24.2110072326 10/07/2021
 Call Trace:
  <TASK>
  show_stack+0x52/0x58
  dump_stack_lvl+0x5b/0x82
  dump_stack+0x10/0x12
  check_wait_context.cold+0xc0/0xfe
  __lock_acquire+0x222/0x990
  lock_acquire.part.0+0x126/0x2f0
  ? serial8250_console_write+0x557/0x610
  ? rcu_read_unlock+0x50/0x50
  ? check_prev_add+0x1270/0x1270
  ? __this_cpu_preempt_check+0x13/0x20
  lock_acquire+0x8f/0x180
  ? serial8250_console_write+0x557/0x610
  _raw_spin_lock_irqsave+0x43/0x60
  ? serial8250_console_write+0x557/0x610
  serial8250_console_write+0x557/0x610
  ? serial8250_config_port+0x210/0x210
  ? rcu_read_unlock+0x50/0x50
  ? __lock_release+0x14d/0x2d0
  ? record_print_text+0x206/0x260
  ? console_unlock+0x2a5/0x620
  ? __this_cpu_preempt_check+0x13/0x20
  univ8250_console_write+0x4b/0x60
  call_console_drivers.constprop.0+0x180/0x250
  console_unlock+0x2d0/0x620
  ? console_unlock+0x27a/0x620
  ? devkmsg_read+0x420/0x420
  ? __down_trylock_console_sem+0x7c/0xc0
  ? vprintk_emit+0x153/0x200
  vprintk_emit+0x1a4/0x200
  vprintk_default+0x1d/0x20
  vprintk+0x4e/0x60
  _printk+0xb2/0xe3
  ? record_print_text.cold+0x11/0x11
  ? rwlock_bug.part.0+0x60/0x60
  ? lock_acquire+0x8f/0x180
  ? cblist_init_generic+0x29/0x310
  ? do_raw_spin_lock+0x11e/0x1b0
  ? do_raw_spin_lock+0x109/0x1b0
  cblist_init_generic.cold+0x44/0x52
  rcu_init_tasks_generic+0x15/0x164
  kernel_init_freeable+0x103/0x1ba
  ? rest_init+0x340/0x340
  kernel_init+0x1f/0x170
  ? rest_init+0x340/0x340
  ret_from_fork+0x1f/0x30
  </TASK>

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
fidencio pushed a commit that referenced this issue Feb 23, 2023
UEFI Specification version 2.9 introduces the concept of memory
acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual Machine
platform.

There are several ways the kernel can deal with unaccepted memory:

 1. Accept all the memory during the boot. It is easy to implement and
    it doesn't have runtime cost once the system is booted. The downside
    is very long boot time.

    Acceptance can be parallelized across multiple CPUs to keep it
    manageable (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to
    saturate memory bandwidth and does not scale beyond that point.

 2. Accept a block of memory on the first use. It requires more
    infrastructure and changes in page allocator to make it work, but
    it provides good boot time.

    On-demand memory acceptance means latency spikes every time the kernel
    steps onto a new memory block. The spikes will go away once the workload
    data set size stabilizes or all memory gets accepted.

 3. Accept all memory in background. Introduce a thread (or multiple)
    that gets memory accepted proactively. It will minimize the time the
    system experiences latency spikes on memory allocation while keeping
    boot time low.

    This approach cannot function on its own. It is an extension of #2:
    background memory acceptance requires functional scheduler, but the
    page allocator may need to tap into unaccepted memory before that.

    The downside of the approach is that these threads also steal CPU
    cycles and memory bandwidth from the user's workload and may hurt
    user experience.

Implement #2 for now. It is a reasonable default. Some workloads may
want to use #1 or #3 and they can be implemented later based on user's
demands.

Support of unaccepted memory requires a few changes in core-mm code:

  - memblock has to accept memory on allocation;

  - page allocator has to accept memory on the first allocation of the
    page;

Memblock change is trivial.

The page allocator is modified to accept pages on the first allocation.
The new page type (encoded in the _mapcount) -- PageUnaccepted() -- is
used to indicate that the page requires acceptance.

Architecture has to provide two helpers if it wants to support
unaccepted memory:

 - accept_memory() makes a range of physical addresses accepted.

 - range_contains_unaccepted_memory() checks whether anything within the
   range of physical addresses requires acceptance.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>	# memblock
Reviewed-by: David Hildenbrand <david@redhat.com>