Make a branch for us to place our additional patches etc. onto #2

Closed
grahamwhaley opened this issue Nov 16, 2017 · 2 comments

@grahamwhaley

Our master branch is tracking the mainline kernel - we need a branch onto which we apply our config and small set of patches.
We should probably define a naming scheme for the branches. Possibly just appending 'kata' to the end of the kernel version/tag will work - so we would be on branch 'v4.13.13-kata' when #1 lands for instance?

@devimc

devimc commented Nov 16, 2017

sameo pushed a commit that referenced this issue Nov 17, 2017
commit 67f8a8c upstream.

Aneesh Kumar reported seeing host crashes when running recent kernels
on POWER8.  The symptom was an oops like this:

Unable to handle kernel paging request for data at address 0xf00000000786c620
Faulting instruction address: 0xc00000000030e1e4
Oops: Kernel access of bad area, sig: 11 [#1]
LE SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in: powernv_op_panel
CPU: 24 PID: 6663 Comm: qemu-system-ppc Tainted: G        W 4.13.0-rc7-43932-gfc36c59 #2
task: c000000fdeadfe80 task.stack: c000000fdeb68000
NIP:  c00000000030e1e4 LR: c00000000030de6c CTR: c000000000103620
REGS: c000000fdeb6b450 TRAP: 0300   Tainted: G        W        (4.13.0-rc7-43932-gfc36c59)
MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24044428  XER: 20000000
CFAR: c00000000030e134 DAR: f00000000786c620 DSISR: 40000000 SOFTE: 0
GPR00: 0000000000000000 c000000fdeb6b6d0 c0000000010bd000 000000000000e1b0
GPR04: c00000000115e168 c000001fffa6e4b0 c00000000115d000 c000001e1b180386
GPR08: f000000000000000 c000000f9a8913e0 f00000000786c600 00007fff587d0000
GPR12: c000000fdeb68000 c00000000fb0f000 0000000000000001 00007fff587cffff
GPR16: 0000000000000000 c000000000000000 00000000003fffff c000000fdebfe1f8
GPR20: 0000000000000004 c000000fdeb6b8a8 0000000000000001 0008000000000040
GPR24: 07000000000000c0 00007fff587cffff c000000fdec20bf8 00007fff587d0000
GPR28: c000000fdeca9ac0 00007fff587d0000 00007fff587c0000 00007fff587d0000
NIP [c00000000030e1e4] __get_user_pages_fast+0x434/0x1070
LR [c00000000030de6c] __get_user_pages_fast+0xbc/0x1070
Call Trace:
[c000000fdeb6b6d0] [c00000000139dab8] lock_classes+0x0/0x35fe50 (unreliable)
[c000000fdeb6b7e0] [c00000000030ef38] get_user_pages_fast+0xf8/0x120
[c000000fdeb6b830] [c000000000112318] kvmppc_book3s_hv_page_fault+0x308/0xf30
[c000000fdeb6b960] [c00000000010e10c] kvmppc_vcpu_run_hv+0xfdc/0x1f00
[c000000fdeb6bb20] [c0000000000e915c] kvmppc_vcpu_run+0x2c/0x40
[c000000fdeb6bb40] [c0000000000e5650] kvm_arch_vcpu_ioctl_run+0x110/0x300
[c000000fdeb6bbe0] [c0000000000d6468] kvm_vcpu_ioctl+0x528/0x900
[c000000fdeb6bd40] [c0000000003bc04c] do_vfs_ioctl+0xcc/0x950
[c000000fdeb6bde0] [c0000000003bc930] SyS_ioctl+0x60/0x100
[c000000fdeb6be30] [c00000000000b96c] system_call+0x58/0x6c
Instruction dump:
7ca81a14 2fa50000 41de0010 7cc8182a 68c60002 78c6ffe2 0b060000 3cc2000a
794a3664 390610d8 e9080000 7d485214 <e90a0020> 7d435378 790507e1 408202f0
---[ end trace fad4a342d0414aa2 ]---

It turns out that what has happened is that the SLB entry for the
vmemmap region hasn't been reloaded on exit from a guest, and it has
the wrong page size.  Then, when the host next accesses the vmemmap
region, it gets a page fault.

Commit a25bd72 ("powerpc/mm/radix: Workaround prefetch issue with
KVM", 2017-07-24) modified the guest exit code so that it now only clears
out the SLB for hash guests.  The code tests the radix flag and puts the
result in a non-volatile CR field, CR2, and later branches based on CR2.

Unfortunately, the kvmppc_save_tm function, which gets called between
those two points, modifies all the user-visible registers in the case
where the guest was in transactional or suspended state, except for a
few which it restores (namely r1, r2, r9 and r13).  Thus the hash/radix indication in CR2 gets corrupted.

This fixes the problem by re-doing the comparison just before the
result is needed.  For good measure, this also adds comments next to
the call sites of kvmppc_save_tm and kvmppc_restore_tm pointing out
that non-volatile register state will be lost.
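
A conceptual C sketch of the idea (the real fix is PowerPC assembly that
caches the flag in CR2; every name below is illustrative, not a kernel
symbol):

    /* Sketch only: re-do the radix/hash comparison after the call that can
     * clobber the cached result, instead of trusting the cached value. */
    static void guest_exit_path(struct vm *vm)
    {
        bool radix = vm_is_radix(vm);   /* cached early (CR2 in the real code) */

        save_transactional_state(vm);   /* may clobber non-volatile state */

        radix = vm_is_radix(vm);        /* fix: re-do the comparison here */
        if (!radix)
            clear_slb();                /* only hash guests need this */
    }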

Fixes: a25bd72 ("powerpc/mm/radix: Workaround prefetch issue with KVM")
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
sameo pushed a commit that referenced this issue Nov 17, 2017
commit 656d61c upstream.

printk_ratelimit() invokes ___ratelimit() which may invoke a normal
printk() (pr_warn() in this particular case) to warn about suppressed
output.  Given that printk_ratelimit() may be called from anywhere, that
pr_warn() is dangerous - it may end up deadlocking the system.  Fix
___ratelimit() by using deferred printk().
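
A minimal sketch of the kind of change described, assuming the warning about
suppressed callbacks is the printk() call in question (the exact patch may
differ):

    #include <linux/printk.h>
    #include <linux/ratelimit.h>

    /* printk_deferred() queues the message and flushes it from a safe
     * context, so ___ratelimit() never re-enters the console path while
     * scheduler or console locks are already held. */
    static void warn_suppressed(struct ratelimit_state *rs, const char *func)
    {
        /* before: pr_warn("%s: %d callbacks suppressed\n", func, rs->missed); */
        printk_deferred(KERN_WARNING "%s: %d callbacks suppressed\n",
                        func, rs->missed);
    }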

Sasha reported the following lockdep error:

 : Unregister pv shared memory for cpu 8
 : select_fallback_rq: 3 callbacks suppressed
 : process 8583 (trinity-c78) no longer affine to cpu8
 :
 : ======================================================
 : WARNING: possible circular locking dependency detected
 : 4.14.0-rc2-next-20170927+ #252 Not tainted
 : ------------------------------------------------------
 : migration/8/62 is trying to acquire lock:
 : (&port_lock_key){-.-.}, at: serial8250_console_write()
 :
 : but task is already holding lock:
 : (&rq->lock){-.-.}, at: sched_cpu_dying()
 :
 : which lock already depends on the new lock.
 :
 :
 : the existing dependency chain (in reverse order) is:
 :
 : -> #3 (&rq->lock){-.-.}:
 : __lock_acquire()
 : lock_acquire()
 : _raw_spin_lock()
 : task_fork_fair()
 : sched_fork()
 : copy_process.part.31()
 : _do_fork()
 : kernel_thread()
 : rest_init()
 : start_kernel()
 : x86_64_start_reservations()
 : x86_64_start_kernel()
 : verify_cpu()
 :
 : -> #2 (&p->pi_lock){-.-.}:
 : __lock_acquire()
 : lock_acquire()
 : _raw_spin_lock_irqsave()
 : try_to_wake_up()
 : default_wake_function()
 : woken_wake_function()
 : __wake_up_common()
 : __wake_up_common_lock()
 : __wake_up()
 : tty_wakeup()
 : tty_port_default_wakeup()
 : tty_port_tty_wakeup()
 : uart_write_wakeup()
 : serial8250_tx_chars()
 : serial8250_handle_irq.part.25()
 : serial8250_default_handle_irq()
 : serial8250_interrupt()
 : __handle_irq_event_percpu()
 : handle_irq_event_percpu()
 : handle_irq_event()
 : handle_level_irq()
 : handle_irq()
 : do_IRQ()
 : ret_from_intr()
 : native_safe_halt()
 : default_idle()
 : arch_cpu_idle()
 : default_idle_call()
 : do_idle()
 : cpu_startup_entry()
 : rest_init()
 : start_kernel()
 : x86_64_start_reservations()
 : x86_64_start_kernel()
 : verify_cpu()
 :
 : -> #1 (&tty->write_wait){-.-.}:
 : __lock_acquire()
 : lock_acquire()
 : _raw_spin_lock_irqsave()
 : __wake_up_common_lock()
 : __wake_up()
 : tty_wakeup()
 : tty_port_default_wakeup()
 : tty_port_tty_wakeup()
 : uart_write_wakeup()
 : serial8250_tx_chars()
 : serial8250_handle_irq.part.25()
 : serial8250_default_handle_irq()
 : serial8250_interrupt()
 : __handle_irq_event_percpu()
 : handle_irq_event_percpu()
 : handle_irq_event()
 : handle_level_irq()
 : handle_irq()
 : do_IRQ()
 : ret_from_intr()
 : native_safe_halt()
 : default_idle()
 : arch_cpu_idle()
 : default_idle_call()
 : do_idle()
 : cpu_startup_entry()
 : rest_init()
 : start_kernel()
 : x86_64_start_reservations()
 : x86_64_start_kernel()
 : verify_cpu()
 :
 : -> #0 (&port_lock_key){-.-.}:
 : check_prev_add()
 : __lock_acquire()
 : lock_acquire()
 : _raw_spin_lock_irqsave()
 : serial8250_console_write()
 : univ8250_console_write()
 : console_unlock()
 : vprintk_emit()
 : vprintk_default()
 : vprintk_func()
 : printk()
 : ___ratelimit()
 : __printk_ratelimit()
 : select_fallback_rq()
 : sched_cpu_dying()
 : cpuhp_invoke_callback()
 : take_cpu_down()
 : multi_cpu_stop()
 : cpu_stopper_thread()
 : smpboot_thread_fn()
 : kthread()
 : ret_from_fork()
 :
 : other info that might help us debug this:
 :
 : Chain exists of:
 :   &port_lock_key --> &p->pi_lock --> &rq->lock
 :
 :  Possible unsafe locking scenario:
 :
 :        CPU0                    CPU1
 :        ----                    ----
 :   lock(&rq->lock);
 :                                lock(&p->pi_lock);
 :                                lock(&rq->lock);
 :   lock(&port_lock_key);
 :
 :  *** DEADLOCK ***
 :
 : 4 locks held by migration/8/62:
 : #0: (&p->pi_lock){-.-.}, at: sched_cpu_dying()
 : #1: (&rq->lock){-.-.}, at: sched_cpu_dying()
 : #2: (printk_ratelimit_state.lock){....}, at: ___ratelimit()
 : #3: (console_lock){+.+.}, at: vprintk_emit()
 :
 : stack backtrace:
 : CPU: 8 PID: 62 Comm: migration/8 Not tainted 4.14.0-rc2-next-20170927+ #252
 : Call Trace:
 : dump_stack()
 : print_circular_bug()
 : check_prev_add()
 : ? add_lock_to_list.isra.26()
 : ? check_usage()
 : ? kvm_clock_read()
 : ? kvm_sched_clock_read()
 : ? sched_clock()
 : ? check_preemption_disabled()
 : __lock_acquire()
 : ? __lock_acquire()
 : ? add_lock_to_list.isra.26()
 : ? debug_check_no_locks_freed()
 : ? memcpy()
 : lock_acquire()
 : ? serial8250_console_write()
 : _raw_spin_lock_irqsave()
 : ? serial8250_console_write()
 : serial8250_console_write()
 : ? serial8250_start_tx()
 : ? lock_acquire()
 : ? memcpy()
 : univ8250_console_write()
 : console_unlock()
 : ? __down_trylock_console_sem()
 : vprintk_emit()
 : vprintk_default()
 : vprintk_func()
 : printk()
 : ? show_regs_print_info()
 : ? lock_acquire()
 : ___ratelimit()
 : __printk_ratelimit()
 : select_fallback_rq()
 : sched_cpu_dying()
 : ? sched_cpu_starting()
 : ? rcutree_dying_cpu()
 : ? sched_cpu_starting()
 : cpuhp_invoke_callback()
 : ? cpu_disable_common()
 : take_cpu_down()
 : ? trace_hardirqs_off_caller()
 : ? cpuhp_invoke_callback()
 : multi_cpu_stop()
 : ? __this_cpu_preempt_check()
 : ? cpu_stop_queue_work()
 : cpu_stopper_thread()
 : ? cpu_stop_create()
 : smpboot_thread_fn()
 : ? sort_range()
 : ? schedule()
 : ? __kthread_parkme()
 : kthread()
 : ? sort_range()
 : ? kthread_create_on_node()
 : ret_from_fork()
 : process 9121 (trinity-c78) no longer affine to cpu8
 : smpboot: CPU 8 is now offline

Link: http://lkml.kernel.org/r/20170928120405.18273-1-sergey.senozhatsky@gmail.com
Fixes: 6b1d174 ("ratelimit: extend to print suppressed messages on release")
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Borislav Petkov <bp@suse.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
sameo pushed a commit that referenced this issue Nov 17, 2017
commit ab31fd0 upstream.

v4.10 commit 6f2ce1c ("scsi: zfcp: fix rport unblock race with LUN
recovery") extended accessing parent pointer fields of struct
zfcp_erp_action for tracing.  If an erp_action has never been enqueued
before, these parent pointer fields are uninitialized and NULL. Examples
are zfcp objects freshly added to the parent object's children list,
before enqueueing their first recovery subsequently. In
zfcp_erp_try_rport_unblock(), we iterate such list. Accessing erp_action
fields can cause a NULL pointer dereference.  Since the kernel can read
from lowcore on s390, it does not immediately cause a kernel page
fault. Instead it can cause hangs on trying to acquire the wrong
erp_action->adapter->dbf->rec_lock in zfcp_dbf_rec_action_lvl()
                      ^bogus^
while holding already other locks with IRQs disabled.

Real life example from attaching lots of LUNs in parallel on many CPUs:

crash> bt 17723
PID: 17723  TASK: ...               CPU: 25  COMMAND: "zfcperp0.0.1800"
 LOWCORE INFO:
  -psw      : 0x0404300180000000 0x000000000038e424
  -function : _raw_spin_lock_wait_flags at 38e424
...
 #0 [fdde8fc90] zfcp_dbf_rec_action_lvl at 3e0004e9862 [zfcp]
 #1 [fdde8fce8] zfcp_erp_try_rport_unblock at 3e0004dfddc [zfcp]
 #2 [fdde8fd38] zfcp_erp_strategy at 3e0004e0234 [zfcp]
 #3 [fdde8fda8] zfcp_erp_thread at 3e0004e0a12 [zfcp]
 #4 [fdde8fe60] kthread at 173550
 #5 [fdde8feb8] kernel_thread_starter at 10add2

zfcp_adapter
 zfcp_port
  zfcp_unit <address>, 0x404040d600000000
  scsi_device NULL, returning early!
zfcp_scsi_dev.status = 0x40000000
0x40000000 ZFCP_STATUS_COMMON_RUNNING

crash> zfcp_unit <address>
struct zfcp_unit {
  erp_action = {
    adapter = 0x0,
    port = 0x0,
    unit = 0x0,
  },
}

zfcp_erp_action is always fully embedded into its container object. Such
container object is never moved in its object tree (only add or delete).
Hence, erp_action parent pointers can never change.

To fix the issue, initialize the erp_action parent pointers before
adding the erp_action container to any list and thus before it becomes
accessible from outside of its initializing function.

In order to also close the time window between zfcp_erp_setup_act()
memsetting the entire erp_action to zero and setting the parent pointers
again, drop the memset and instead explicitly initialize individually
all erp_action fields except for parent pointers. To be extra careful
not to introduce any other unintended side effect, even keep zeroing the
erp_action fields for list and timer. Also double-check with
WARN_ON_ONCE that erp_action parent pointers never change, so we get to
know when we would deviate from previous behavior.
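
A simplified, self-contained illustration of the init-before-publish pattern
described above (structure and function names are made up; this is not the
zfcp patch itself):

    #include <assert.h>

    struct adapter;
    struct port;

    struct erp_action {
        struct adapter *adapter;    /* parent pointers: set exactly once ... */
        struct port *port;          /* ... before the object is published   */
    };

    struct unit {
        struct erp_action erp_action;
    };

    static void unit_init(struct unit *u, struct adapter *a, struct port *p)
    {
        /* Set parent pointers before the unit is added to any parent list,
         * so list walkers never observe them as NULL. */
        u->erp_action.adapter = a;
        u->erp_action.port = p;
    }

    static void erp_setup_act(struct unit *u, struct adapter *a, struct port *p)
    {
        /* No blanket memset of erp_action here; only re-assert the
         * invariant that the parent pointers never change. */
        assert(u->erp_action.adapter == a);
        assert(u->erp_action.port == p);
    }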

Signed-off-by: Steffen Maier <maier@linux.vnet.ibm.com>
Fixes: 6f2ce1c ("scsi: zfcp: fix rport unblock race with LUN recovery")
Reviewed-by: Benjamin Block <bblock@linux.vnet.ibm.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
sameo pushed a commit that referenced this issue Nov 17, 2017
commit 2eb9eab upstream.

syzkaller with KASAN reported an out-of-bounds read in
asn1_ber_decoder().  It can be reproduced by the following command,
assuming CONFIG_X509_CERTIFICATE_PARSER=y and CONFIG_KASAN=y:

    keyctl add asymmetric desc $'\x30\x30' @s

The bug is that the length of an ASN.1 data value isn't validated in the
case where it is encoded using the short form, causing the decoder to
read past the end of the input buffer.  Fix it by validating the length.
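
A simplified illustration of the missing bounds check (dp is the current
offset into the input, datalen the input size; this is not the verbatim
lib/asn1_decoder.c change):

    #include <stdbool.h>
    #include <stddef.h>

    /* With a short-form length octet, 'len' is the value length itself; it
     * is only valid if that many bytes actually remain in the buffer. */
    static bool short_form_value_fits(size_t dp, size_t datalen,
                                      unsigned char len)
    {
        return dp <= datalen && len <= datalen - dp;
    }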

The bug report was:

    BUG: KASAN: slab-out-of-bounds in asn1_ber_decoder+0x10cb/0x1730 lib/asn1_decoder.c:233
    Read of size 1 at addr ffff88003cccfa02 by task syz-executor0/6818

    CPU: 1 PID: 6818 Comm: syz-executor0 Not tainted 4.14.0-rc7-00008-g5f479447d983 #2
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
     __dump_stack lib/dump_stack.c:16 [inline]
     dump_stack+0xb3/0x10b lib/dump_stack.c:52
     print_address_description+0x79/0x2a0 mm/kasan/report.c:252
     kasan_report_error mm/kasan/report.c:351 [inline]
     kasan_report+0x236/0x340 mm/kasan/report.c:409
     __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:427
     asn1_ber_decoder+0x10cb/0x1730 lib/asn1_decoder.c:233
     x509_cert_parse+0x1db/0x650 crypto/asymmetric_keys/x509_cert_parser.c:89
     x509_key_preparse+0x64/0x7a0 crypto/asymmetric_keys/x509_public_key.c:174
     asymmetric_key_preparse+0xcb/0x1a0 crypto/asymmetric_keys/asymmetric_type.c:388
     key_create_or_update+0x347/0xb20 security/keys/key.c:855
     SYSC_add_key security/keys/keyctl.c:122 [inline]
     SyS_add_key+0x1cd/0x340 security/keys/keyctl.c:62
     entry_SYSCALL_64_fastpath+0x1f/0xbe
    RIP: 0033:0x447c89
    RSP: 002b:00007fca7a5d3bd8 EFLAGS: 00000246 ORIG_RAX: 00000000000000f8
    RAX: ffffffffffffffda RBX: 00007fca7a5d46cc RCX: 0000000000447c89
    RDX: 0000000020006f4a RSI: 0000000020006000 RDI: 0000000020001ff5
    RBP: 0000000000000046 R08: fffffffffffffffd R09: 0000000000000000
    R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000000
    R13: 0000000000000000 R14: 00007fca7a5d49c0 R15: 00007fca7a5d4700

Fixes: 42d5ec2 ("X.509: Add an ASN.1 decoder")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
sameo pushed a commit that referenced this issue Nov 17, 2017
commit 624f5ab upstream.

syzkaller reported a NULL pointer dereference in asn1_ber_decoder().  It
can be reproduced by the following command, assuming
CONFIG_PKCS7_TEST_KEY=y:

        keyctl add pkcs7_test desc '' @s

The bug is that if the data buffer is empty, an integer underflow occurs
in the following check:

        if (unlikely(dp >= datalen - 1))
                goto data_overrun_error;

This results in the NULL data pointer being dereferenced.

Fix it by checking for 'datalen - dp < 2' instead.

Also fix the similar check for 'dp >= datalen - n' later in the same
function.  That one possibly could result in a buffer overread.
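
An illustration of the rewritten comparison (not the verbatim patch): keeping
the subtraction as 'datalen - dp' avoids the unsigned underflow that
'dp >= datalen - 1' hits when datalen is 0.

    #include <stdbool.h>
    #include <stddef.h>

    /* True if at least 'need' bytes remain at offset dp. */
    static bool have_remaining(size_t dp, size_t datalen, size_t need)
    {
        return dp <= datalen && datalen - dp >= need;
    }

The old 'dp >= datalen - 1' check becomes !have_remaining(dp, datalen, 2);
the later 'dp >= datalen - n' check is rewritten the same way with its own
byte count.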

The NULL pointer dereference was reproducible using the "pkcs7_test" key
type but not the "asymmetric" key type because the "asymmetric" key type
checks for a 0-length payload before calling into the ASN.1 decoder but
the "pkcs7_test" key type does not.

The bug report was:

    BUG: unable to handle kernel NULL pointer dereference at           (null)
    IP: asn1_ber_decoder+0x17f/0xe60 lib/asn1_decoder.c:233
    PGD 7b708067 P4D 7b708067 PUD 7b6ee067 PMD 0
    Oops: 0000 [#1] SMP
    Modules linked in:
    CPU: 0 PID: 522 Comm: syz-executor1 Not tainted 4.14.0-rc8 #7
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.3-20171021_125229-anatol 04/01/2014
    task: ffff9b6b3798c040 task.stack: ffff9b6b37970000
    RIP: 0010:asn1_ber_decoder+0x17f/0xe60 lib/asn1_decoder.c:233
    RSP: 0018:ffff9b6b37973c78 EFLAGS: 00010216
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000021c
    RDX: ffffffff814a04ed RSI: ffffb1524066e000 RDI: ffffffff910759e0
    RBP: ffff9b6b37973d60 R08: 0000000000000001 R09: ffff9b6b3caa4180
    R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
    R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
    FS:  00007f10ed1f2700(0000) GS:ffff9b6b3ea00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000000 CR3: 000000007b6f3000 CR4: 00000000000006f0
    Call Trace:
     pkcs7_parse_message+0xee/0x240 crypto/asymmetric_keys/pkcs7_parser.c:139
     verify_pkcs7_signature+0x33/0x180 certs/system_keyring.c:216
     pkcs7_preparse+0x41/0x70 crypto/asymmetric_keys/pkcs7_key_type.c:63
     key_create_or_update+0x180/0x530 security/keys/key.c:855
     SYSC_add_key security/keys/keyctl.c:122 [inline]
     SyS_add_key+0xbf/0x250 security/keys/keyctl.c:62
     entry_SYSCALL_64_fastpath+0x1f/0xbe
    RIP: 0033:0x4585c9
    RSP: 002b:00007f10ed1f1bd8 EFLAGS: 00000216 ORIG_RAX: 00000000000000f8
    RAX: ffffffffffffffda RBX: 00007f10ed1f2700 RCX: 00000000004585c9
    RDX: 0000000020000000 RSI: 0000000020008ffb RDI: 0000000020008000
    RBP: 0000000000000000 R08: ffffffffffffffff R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000216 R12: 00007fff1b2260ae
    R13: 00007fff1b2260af R14: 00007f10ed1f2700 R15: 0000000000000000
    Code: dd ca ff 48 8b 45 88 48 83 e8 01 4c 39 f0 0f 86 a8 07 00 00 e8 53 dd ca ff 49 8d 46 01 48 89 85 58 ff ff ff 48 8b 85 60 ff ff ff <42> 0f b6 0c 30 89 c8 88 8d 75 ff ff ff 83 e0 1f 89 8d 28 ff ff
    RIP: asn1_ber_decoder+0x17f/0xe60 lib/asn1_decoder.c:233 RSP: ffff9b6b37973c78
    CR2: 0000000000000000

Fixes: 42d5ec2 ("X.509: Add an ASN.1 decoder")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
@grahamwhaley
Author

Closing this now we have the branch.

fidencio pushed a commit that referenced this issue Oct 12, 2022
Each thread executing in an enclave is associated with a Thread Control
Structure (TCS). The test enclave contains two hardcoded TCS. Each TCS
contains meta-data used by the hardware to save and restore thread specific
information when entering/exiting the enclave.

The two TCS structures within the test enclave share their SSA (State Save
Area), resulting in the threads clobbering each other's data. Fix this by
providing each TCS with its own SSA area.

Additionally, there is an 8K stack space whose address is computed from the
enclave entry point. This computation is correct for TCS #1, which starts at
the first address inside the enclave, but results in out-of-bounds memory
accesses when entering via TCS #2. Split the 8K stack space into two separate
pages with an offset symbol between them so that the current enclave entry
calculation can continue to be used for both threads.

While using the enclave with multiple threads requires these fixes, the
impact is not apparent because every test up to this point enters the
enclave from the first TCS.

More detail about the stack fix:
-------------------------------
Before this change the test enclave (test_encl) looks as follows:

.tcs (2 pages):
(page 1) TCS #1
(page 2) TCS #2

.text (1 page)
One page of code

.data (5 pages)
(page 1) encl_buffer
(page 2) encl_buffer
(page 3) SSA
(page 4 and 5) STACK
encl_stack:

As shown above there is a symbol, encl_stack, that points to the end of the
.data segment (pointing to the end of page 5 in .data) which is also the
end of the enclave.

The enclave entry code computes the stack address by adding encl_stack to
the pointer to the TCS that entered the enclave. When entering at TCS #1
the stack is computed correctly but when entering at TCS #2 the stack
pointer would point to one page beyond the end of the enclave and a #PF
would result when TCS #2 attempts to enter the enclave.

The fix involves moving the encl_stack symbol between the two stack pages.
Doing so enables the stack address computation in the entry code to compute
the correct stack address for each TCS.
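
A worked example of the address computation described above (illustrative
values only; pages are 4K, TCS #1 at enclave offset 0x0000, TCS #2 at offset
0x1000):

    #include <stdint.h>

    /* Entry code: stack pointer = TCS address + encl_stack offset. */
    static uint64_t stack_for(uint64_t enclave_base, uint64_t tcs_offset,
                              uint64_t encl_stack_offset)
    {
        return enclave_base + tcs_offset + encl_stack_offset;
    }

    /* Old layout (encl_stack at the very end of .data, i.e. the enclave end):
     *   TCS #1: base + 0x0000 + encl_stack -> end of enclave (in bounds)
     *   TCS #2: base + 0x1000 + encl_stack -> one page past the enclave -> #PF
     * New layout (encl_stack between the two stack pages): each TCS lands at
     * the top of its own in-bounds stack page. */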

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/a49dc0d85401db788a0a3f0d795e848abf3b1f44.1636997631.git.reinette.chatre@intel.com
fidencio pushed a commit that referenced this issue Oct 12, 2022
During a PCI bus rescan, adding new devices involves two notifiers:
1. dmar_pci_bus_notifier()
2. iommu_bus_notifier()
The current code sets #1 to a low priority (INT_MIN), which results in #2
being invoked first. As a result, the struct device pointer cannot be found
in the DRHD search for the new device's DMAR/IOMMU. Subsequently, the
device is put under the "catch-all" IOMMU instead of the correct one.

This could cause system hang when device TLB invalidation is sent to the
wrong IOMMU. Invalidation timeout error or hard lockup can be observed.

This patch fixes the issue by setting a higher priority for
dmar_pci_bus_notifier, so that the DRHD search for a new device finds the
correct IOMMU.
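
An illustrative sketch of the change (the exact priority value in the real
patch may differ; anything above the default 0 makes the DMAR notifier run
before the IOMMU one when a device is added):

    #include <linux/notifier.h>

    static struct notifier_block dmar_pci_bus_nb = {
        .notifier_call = dmar_pci_bus_notifier,
        .priority      = 1,     /* was INT_MIN, which ran it last */
    };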

Reported-by: Zhang, Bernice <bernice.zhang@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
fidencio pushed a commit that referenced this issue Oct 12, 2022
So that virt_to_phys can work properly when converting the virtual
address of report_data to a physical address. This is aimed at fixing an
unaccepted-memory issue when calling TDCALL to get the tdreport data.

The calltrace is like:

[   27.952508] Kernel panic - not syncing: #VE due to access to unaccepted memory. GPA: 0x349438580
[   27.957943] CPU: 3 PID: 964 Comm: test_tdx_attest Tainted: G                 Y 5.15.0-01502-gc4b2b600cc06-dirty #2
[   27.960756] Call Trace:
[   27.961440]  dump_stack_lvl+0x33/0x42
[   27.962452]  panic+0xed/0x2bf
[   27.963287]  tdx_handle_virtualization_exception.cold.23+0xc/0xc
[   27.964923]  ? tdx_get_ve_info.part.18+0x19/0x60
[   27.966202]  exc_virtualization_exception+0x214/0x3f0
[   27.967631]  asm_exc_virtualization_exception+0x16/0x50
[   27.969071] RIP: 0010:__tdx_module_call+0x10/0x40
[   27.970363] Code: cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 b8 01 00 00 00 c3 cc cc cc cc cc 41 54 41 51 48 89 f8 4d 89 c1 49 89 c8 48 89 f1 <66> 0f 01 cc 41 5c 48 85 c0 75 22 4d 85 e4 74 1d 49 89 0c 24 49 89
[   27.975423] RSP: 0018:ff6fb40f8075be38 EFLAGS: 00010246
[   27.976857] RAX: 0000000000000004 RBX: 0000000000000000 RCX: 00000001050a1000
[   27.978802] RDX: 0000000349438580 RSI: 00000001050a1000 RDI: 0000000000000004
[   27.980784] RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000
[   27.982716] R10: ff6fb40f8075bd08 R11: ff6fb40f8075bd00 R12: 00000001050a1000
[   27.984664] R13: 0000000349438580 R14: 0000000000000003 R15: 0000000000000000
[   27.986595]  __trace_tdx_module_call.constprop.20+0x3a/0x190
[   27.988139]  ? _printk+0x58/0x6f
[   27.989046]  tdx_mcall_tdreport+0x32/0x90
[   27.990151]  tdx_attest_ioctl.cold.1+0x7f/0xe7 [intel_tdx_attest]
[   27.991839]  __x64_sys_ioctl+0x81/0xc0
[   27.992887]  do_syscall_64+0x3f/0x90
[   27.993876]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   27.995311] RIP: 0033:0x7f6993c5834b
[   27.996298] Code: 0f 1e fa 48 8b 05 55 4b 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 25 4b 0d 00 f7 d8 64 89 01 48
[   28.001397] RSP: 002b:00007ffe33d3ace8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[   28.003448] RAX: ffffffffffffffda RBX: 00007ffe33d3b1b0 RCX: 00007f6993c5834b
[   28.005383] RDX: 00007ffe33d3acf0 RSI: 00000000c0085401 RDI: 0000000000000003
[   28.007578] RBP: 00007ffe33d3acf0 R08: 0000000000000003 R09: 0000000000000003
[   28.009541] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffe33d3b170
[   28.011509] R13: 00007ffe33d3b1b0 R14: 0000000000000000 R15: 0000000000000000
[   28.013762] Kernel Offset: 0x1b000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[   28.020274] ---[ end Kernel panic - not syncing: #VE due to access to unaccepted memory. GPA: 0x349438580 ]---

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
fidencio pushed a commit that referenced this issue Oct 12, 2022
A lock: param->lock
B lock: fparam->lock
C lock: pasid_mutex

Thread #1: prq report, holds A lock, and tries to hold B lock
Thread #2: page response, holds B lock, and tries to hold C lock
Thread #3: unbind_gpasid (could be bind_gpasid or intel_svm_free_async_fn as well),
           holds C lock, and tries to hold A lock.

Deadlock happens when #1 holds A lock, #2 holds B lock, and #3 holds C lock.

PRQ report:
     A lock    =>      B lock        unlock   =>     unlock
       |                 |             |               |
       |                 +-------------+               |
       +-----------------------------------------------+

Page response:
     B lock    =>      C lock        unlock   =>     unlock
       |                  |             |              |
       |                  +-------------+              |
       +-----------------------------------------------+

Unbind_gpasid:
     C lock   =>        A lock        unlock  =>     unlock
       |                  |             |              |
       |                  +-------------+              |
       +-----------------------------------------------+

This fix moves the acquisition of the A lock in Thread #3 outside of the
C lock's protection (see the sketch after the diagrams). To demonstrate this
fully, bind_gpasid is also drawn explicitly. After the fix, the locking
sequence is as below:

Bind_gpasid:
                                                    {only for bind failure}
    A lock      unlock  =>   C lock     unlock  =>  A lock       unlock
      |           |           |          |            |            |
      +-----------+           +----------+            +------------+

PRQ report:
    A lock   =>    B lock        unlock   =>    unlock
      |              |             |              |
      |              +-------------+              |
      +-------------------------------------------+

Page response:
    B lock   =>    C lock        unlock   =>    unlock
      |              |             |              |
      |              +-------------+              |
      +-------------------------------------------+

Unbind_gpasid:
    C lock       unlock  =>    A lock     unlock
      |            |             |          |
      +------------+             +----------+
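
A pseudo-structure of the fixed unbind path (lock names map to the A/C labels
above and are illustrative; the placeholder comments stand in for the real
work):

    #include <linux/mutex.h>

    static DEFINE_MUTEX(a_lock);        /* param->lock  (A) */
    static DEFINE_MUTEX(c_lock);        /* pasid_mutex  (C) */

    static void unbind_gpasid_fixed(void)
    {
        mutex_lock(&a_lock);            /* A taken first, outside C */
        /* ... drop the per-device fault data for this PASID ... */
        mutex_unlock(&a_lock);

        mutex_lock(&c_lock);            /* C no longer nests A inside it */
        /* ... tear down the PASID binding ... */
        mutex_unlock(&c_lock);
    }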

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
fidencio pushed a commit that referenced this issue Jan 23, 2023
Without this patch, lockdep complains as follows and disables itself.
 =============================
 [ BUG: Invalid wait context ]
 5.18.0-rc1 #171 Not tainted
 -----------------------------
 swapper/0/1 is trying to lock:
 ffffffff978f7978 (&port_lock_key){....}-{3:3}, at: serial8250_console_write+0x557/0x610
 other info that might help us debug this:
 context-{5:5}
 3 locks held by swapper/0/1:
  #0: ffffffff95a9ab48 (rcu_tasks.cbs_gbl_lock){....}-{2:2}, at: cblist_init_generic+0x29/0x310
  #1: ffffffff95a86d60 (console_lock){+.+.}-{0:0}, at: vprintk_emit+0x153/0x200
  #2: ffffffff959a66c0 (console_owner){....}-{0:0}, at: console_unlock+0x27a/0x620
 stack backtrace:
 CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.18.0-rc1-yamahata-kvm-upstream+ #171 eb656e2c6a2533f980e80fcf0fe9314dd3ea2a22
 Hardware name: Intel Corporation EAGLESTREAM/EAGLESTREAM, BIOS EGSDCRB1.SYS.0066.D24.2110072326 10/07/2021
 Call Trace:
  <TASK>
  show_stack+0x52/0x58
  dump_stack_lvl+0x5b/0x82
  dump_stack+0x10/0x12
  check_wait_context.cold+0xc0/0xfe
  __lock_acquire+0x222/0x990
  lock_acquire.part.0+0x126/0x2f0
  ? serial8250_console_write+0x557/0x610
  ? rcu_read_unlock+0x50/0x50
  ? check_prev_add+0x1270/0x1270
  ? __this_cpu_preempt_check+0x13/0x20
  lock_acquire+0x8f/0x180
  ? serial8250_console_write+0x557/0x610
  _raw_spin_lock_irqsave+0x43/0x60
  ? serial8250_console_write+0x557/0x610
  serial8250_console_write+0x557/0x610
  ? serial8250_config_port+0x210/0x210
  ? rcu_read_unlock+0x50/0x50
  ? __lock_release+0x14d/0x2d0
  ? record_print_text+0x206/0x260
  ? console_unlock+0x2a5/0x620
  ? __this_cpu_preempt_check+0x13/0x20
  univ8250_console_write+0x4b/0x60
  call_console_drivers.constprop.0+0x180/0x250
  console_unlock+0x2d0/0x620
  ? console_unlock+0x27a/0x620
  ? devkmsg_read+0x420/0x420
  ? __down_trylock_console_sem+0x7c/0xc0
  ? vprintk_emit+0x153/0x200
  vprintk_emit+0x1a4/0x200
  vprintk_default+0x1d/0x20
  vprintk+0x4e/0x60
  _printk+0xb2/0xe3
  ? record_print_text.cold+0x11/0x11
  ? rwlock_bug.part.0+0x60/0x60
  ? lock_acquire+0x8f/0x180
  ? cblist_init_generic+0x29/0x310
  ? do_raw_spin_lock+0x11e/0x1b0
  ? do_raw_spin_lock+0x109/0x1b0
  cblist_init_generic.cold+0x44/0x52
  rcu_init_tasks_generic+0x15/0x164
  kernel_init_freeable+0x103/0x1ba
  ? rest_init+0x340/0x340
  kernel_init+0x1f/0x170
  ? rest_init+0x340/0x340
  ret_from_fork+0x1f/0x30
  </TASK>

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
fidencio pushed a commit that referenced this issue Jan 23, 2023
UEFI Specification version 2.9 introduces the concept of memory
acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual Machine
platform.

There are several ways the kernel can deal with unaccepted memory:

 1. Accept all the memory during the boot. It is easy to implement and
    it doesn't have runtime cost once the system is booted. The downside
    is very long boot time.

    Acceptance can be parallelized across multiple CPUs to keep it
    manageable (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to
    saturate memory bandwidth and does not scale beyond that point.

 2. Accept a block of memory on the first use. It requires more
    infrastructure and changes in page allocator to make it work, but
    it provides good boot time.

    On-demand memory acceptance means latency spikes every time the kernel
    steps onto a new memory block. The spikes will go away once the workload
    data set size stabilizes or all memory gets accepted.

 3. Accept all memory in background. Introduce a thread (or multiple)
    that gets memory accepted proactively. It will minimize the time the
    system experiences latency spikes on memory allocation while keeping
    boot time low.

    This approach cannot function on its own. It is an extension of #2:
    background memory acceptance requires functional scheduler, but the
    page allocator may need to tap into unaccepted memory before that.

    The downside of the approach is that these threads also steal CPU
    cycles and memory bandwidth from the user's workload and may hurt
    user experience.

Implement #2 for now. It is a reasonable default. Some workloads may
want to use #1 or #3 and they can be implemented later based on user's
demands.

Support of unaccepted memory requires a few changes in core-mm code:

  - memblock has to accept memory on allocation;

  - page allocator has to accept memory on the first allocation of the
    page;

Memblock change is trivial.

The page allocator is modified to accept pages on the first allocation.
The new page type (encoded in the _mapcount) -- PageUnaccepted() -- is
used to indicate that the page requires acceptance.

Architecture has to provide two helpers if it wants to support
unaccepted memory:

 - accept_memory() makes a range of physical addresses accepted.

 - range_contains_unaccepted_memory() checks whether anything within the
   range of physical addresses requires acceptance (see the prototype
   sketch below).
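
A prototype sketch of the two architecture hooks described above (signatures
assumed from the description; the final patch may differ):

    #include <linux/types.h>

    /* Accept (make usable) all memory in the physical range [start, end). */
    void accept_memory(phys_addr_t start, phys_addr_t end);

    /* Return true if any part of [start, end) still needs to be accepted. */
    bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end);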

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>	# memblock
Reviewed-by: David Hildenbrand <david@redhat.com>
fidencio pushed a commit that referenced this issue Feb 23, 2023
Without this patch, lockdep complains as follows and disables itself.
 =============================
 [ BUG: Invalid wait context ]
 5.18.0-rc1 #171 Not tainted
 -----------------------------
 swapper/0/1 is trying to lock:
 ffffffff978f7978 (&port_lock_key){....}-{3:3}, at: serial8250_console_write+0x557/0x610
 other info that might help us debug this:
 context-{5:5}
 3 locks held by swapper/0/1:
  #0: ffffffff95a9ab48 (rcu_tasks.cbs_gbl_lock){....}-{2:2}, at: cblist_init_generic+0x29/0x310
  #1: ffffffff95a86d60 (console_lock){+.+.}-{0:0}, at: vprintk_emit+0x153/0x200
  #2: ffffffff959a66c0 (console_owner){....}-{0:0}, at: console_unlock+0x27a/0x620
 stack backtrace:
 CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.18.0-rc1-yamahata-kvm-upstream+ #171 eb656e2c6a2533f980e80fcf0fe9314dd3ea2a22
 Hardware name: Intel Corporation EAGLESTREAM/EAGLESTREAM, BIOS EGSDCRB1.SYS.0066.D24.2110072326 10/07/2021
 Call Trace:
  <TASK>
  show_stack+0x52/0x58
  dump_stack_lvl+0x5b/0x82
  dump_stack+0x10/0x12
  check_wait_context.cold+0xc0/0xfe
  __lock_acquire+0x222/0x990
  lock_acquire.part.0+0x126/0x2f0
  ? serial8250_console_write+0x557/0x610
  ? rcu_read_unlock+0x50/0x50
  ? check_prev_add+0x1270/0x1270
  ? __this_cpu_preempt_check+0x13/0x20
  lock_acquire+0x8f/0x180
  ? serial8250_console_write+0x557/0x610
  _raw_spin_lock_irqsave+0x43/0x60
  ? serial8250_console_write+0x557/0x610
  serial8250_console_write+0x557/0x610
  ? serial8250_config_port+0x210/0x210
  ? rcu_read_unlock+0x50/0x50
  ? __lock_release+0x14d/0x2d0
  ? record_print_text+0x206/0x260
  ? console_unlock+0x2a5/0x620
  ? __this_cpu_preempt_check+0x13/0x20
  univ8250_console_write+0x4b/0x60
  call_console_drivers.constprop.0+0x180/0x250
  console_unlock+0x2d0/0x620
  ? console_unlock+0x27a/0x620
  ? devkmsg_read+0x420/0x420
  ? __down_trylock_console_sem+0x7c/0xc0
  ? vprintk_emit+0x153/0x200
  vprintk_emit+0x1a4/0x200
  vprintk_default+0x1d/0x20
  vprintk+0x4e/0x60
  _printk+0xb2/0xe3
  ? record_print_text.cold+0x11/0x11
  ? rwlock_bug.part.0+0x60/0x60
  ? lock_acquire+0x8f/0x180
  ? cblist_init_generic+0x29/0x310
  ? do_raw_spin_lock+0x11e/0x1b0
  ? do_raw_spin_lock+0x109/0x1b0
  cblist_init_generic.cold+0x44/0x52
  rcu_init_tasks_generic+0x15/0x164
  kernel_init_freeable+0x103/0x1ba
  ? rest_init+0x340/0x340
  kernel_init+0x1f/0x170
  ? rest_init+0x340/0x340
  ret_from_fork+0x1f/0x30
  </TASK>

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
fidencio pushed a commit that referenced this issue Feb 23, 2023
UEFI Specification version 2.9 introduces the concept of memory
acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual Machine
platform.

There are several ways the kernel can deal with unaccepted memory:

 1. Accept all the memory during the boot. It is easy to implement and
    it doesn't have runtime cost once the system is booted. The downside
    is very long boot time.

    Acceptance can be parallelized across multiple CPUs to keep it
    manageable (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to
    saturate memory bandwidth and does not scale beyond that point.

 2. Accept a block of memory on the first use. It requires more
    infrastructure and changes in page allocator to make it work, but
    it provides good boot time.

    On-demand memory acceptance means latency spikes every time the kernel
    steps onto a new memory block. The spikes will go away once the workload
    data set size stabilizes or all memory gets accepted.

 3. Accept all memory in background. Introduce a thread (or multiple)
    that gets memory accepted proactively. It will minimize the time the
    system experiences latency spikes on memory allocation while keeping
    boot time low.

    This approach cannot function on its own. It is an extension of #2:
    background memory acceptance requires functional scheduler, but the
    page allocator may need to tap into unaccepted memory before that.

    The downside of the approach is that these threads also steal CPU
    cycles and memory bandwidth from the user's workload and may hurt
    user experience.

Implement #2 for now. It is a reasonable default. Some workloads may
want to use #1 or #3 and they can be implemented later based on user's
demands.

Support of unaccepted memory requires a few changes in core-mm code:

  - memblock has to accept memory on allocation;

  - page allocator has to accept memory on the first allocation of the
    page;

Memblock change is trivial.

The page allocator is modified to accept pages on the first allocation.
The new page type (encoded in the _mapcount) -- PageUnaccepted() -- is
used to indicate that the page requires acceptance.

Architecture has to provide two helpers if it wants to support
unaccepted memory:

 - accept_memory() makes a range of physical addresses accepted.

 - range_contains_unaccepted_memory() checks whether anything within the
   range of physical addresses requires acceptance.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>	# memblock
Reviewed-by: David Hildenbrand <david@redhat.com>