Failure to create namespaces; pfn failed to enable + kernel oops #76

Closed
patr-geary-smci opened this issue Nov 20, 2018 · 23 comments

Comments

@patr-geary-smci

This is all against Intel DCP DIMMs that were provisioned via ipmctl. I've tried mainline ipmctl builds, as well as the latest 4.18.18-200.fc28.x86_64 Fedora kernel, to no avail.

[root@redacted ~]# ndctl create-namespace -r region0
{
"dev":"namespace0.0",
"mode":"fsdax",
"map":"dev",
"size":"248.00 GiB (266.29 GB)",
"uuid":"f0ba1c10-0cf2-4572-bfec-7f8e5e4098f7",
"raw_uuid":"f9c4a0a8-df1c-4389-89d4-6d5c8bac80d7",
"sector_size":512,
"blockdev":"pmem0",
"numa_node":0
}

[root@redacted ~]# ndctl create-namespace -r region1
libndctl: ndctl_pfn_enable: pfn1.0: failed to enable
Error: namespace1.0: failed to enable

failed to create namespace: No such device or address

I'm seeing oopses spit out by the kernel (this may not be the exact one; I just grabbed the first one out of messages with a matching pfn):

Nov 16 11:16:53 localhost kernel: nd_pmem pfn1.0: namespace1.0 alignment collision, truncate 67108864 bytes
Nov 16 11:17:12 localhost kernel: pmem1: detected capacity change from 0 to 266285875200
Nov 16 11:17:18 localhost kernel: WARNING: CPU: 55 PID: 2942 at arch/x86/mm/init_64.c:792 add_pages+0x5a/0x60
Nov 16 11:17:18 localhost kernel: Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6t$
$er acpi_pad xfs libcrc32c ast i2c_algo_bit drm_kms_helper mlx5_core ttm drm i40e crc32c_intel mlxfw devlink
Nov 16 11:17:18 localhost kernel: CPU: 55 PID: 2942 Comm: ndctl Not tainted 4.17.19-200.fc28.x86_64 #1
Nov 16 11:17:18 localhost kernel: Hardware name: Supermicro Super Server/X11DPU, BIOS 3.0 10/20/2018
Nov 16 11:17:18 localhost kernel: RIP: 0010:add_pages+0x5a/0x60
Nov 16 11:17:18 localhost kernel: RSP: 0018:ffffaeb44ed43c80 EFLAGS: 00010282
Nov 16 11:17:18 localhost kernel: RAX: 00000000fffffff4 RBX: 000000000ed78000 RCX: 0000000000000200
Nov 16 11:17:18 localhost kernel: RDX: 0000000000000200 RSI: 00000000000fc1fe RDI: 0000000000000000
Nov 16 11:17:18 localhost kernel: RBP: 0000000003f08000 R08: ffffa0c63c200000 R09: 00000000000001fe
Nov 16 11:17:18 localhost kernel: R10: ffff9ff071301700 R11: ffff9ff071301d60 R12: ffffa086b6e2e0b0
Nov 16 11:17:18 localhost kernel: R13: ffffa086b6e2e0c0 R14: 0000000003f00000 R15: 0000003f08000000
Nov 16 11:17:18 localhost kernel: FS:  00007fca295e3d00(0000) GS:ffff9ff0dbdc0000(0000) knlGS:0000000000000000
Nov 16 11:17:18 localhost kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 16 11:17:18 localhost kernel: CR2: 00007f3416c43f44 CR3: 00000017b08c8001 CR4: 00000000007606e0
Nov 16 11:17:18 localhost kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 16 11:17:18 localhost kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Nov 16 11:17:18 localhost kernel: PKRU: 55555554
Nov 16 11:17:18 localhost kernel: Call Trace:
Nov 16 11:17:18 localhost kernel: devm_memremap_pages+0x2e4/0x440
Nov 16 11:17:18 localhost kernel: pmem_attach_disk+0x1c6/0x5e0 [nd_pmem]
Nov 16 11:17:18 localhost kernel: ? devm_nsio_enable+0xb8/0x100
Nov 16 11:17:18 localhost kernel: nvdimm_bus_probe+0x64/0x120
Nov 16 11:17:18 localhost kernel: driver_probe_device+0x2da/0x450
Nov 16 11:17:18 localhost kernel: bind_store+0xed/0x160
Nov 16 11:17:18 localhost kernel: kernfs_fop_write+0x116/0x190
Nov 16 11:17:18 localhost kernel: __vfs_write+0x36/0x170
Nov 16 11:17:18 localhost kernel: ? selinux_file_permission+0xf0/0x130
Nov 16 11:17:18 localhost kernel: vfs_write+0xa5/0x1a0
Nov 16 11:17:18 localhost kernel: ksys_write+0x4f/0xb0
Nov 16 11:17:18 localhost kernel: do_syscall_64+0x5b/0x160
Nov 16 11:17:18 localhost kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Nov 16 11:17:18 localhost kernel: RIP: 0033:0x7fca288c7ef4
Nov 16 11:17:18 localhost kernel: RSP: 002b:00007ffcaf9f2d08 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
Nov 16 11:17:18 localhost kernel: RAX: ffffffffffffffda RBX: 0000000001ad7870 RCX: 00007fca288c7ef4
Nov 16 11:17:18 localhost kernel: RDX: 0000000000000007 RSI: 0000000001ad7870 RDI: 0000000000000008
Nov 16 11:17:18 localhost kernel: RBP: 0000000000000007 R08: 0000000000000006 R09: 0000000000000005
Nov 16 11:17:18 localhost kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000008
Nov 16 11:17:18 localhost kernel: R13: 00007fca295e3c28 R14: 0000000000000001 R15: 0000000001ad7870
Nov 16 11:17:18 localhost kernel: Code: 3b 15 0b 5e 99 01 76 20 48 89 15 02 5e 99 01 48 89 15 0b 5e 99 01 48 c1 e2 0c 48 03 15 70 8b 11 01 48 89 15 61 5d 99 01 5b 5d c3 <0f> 0b eb bc 66 90 0f 1f 44 00 00 41 56 45 89 c6 41 55 49 $
Nov 16 11:17:18 localhost kernel: ---[ end trace b470acdc7eea493e ]---
Nov 16 11:17:18 localhost kernel: nd_pmem: probe of pfn3.0 failed with error -12
Nov 16 11:17:19 localhost kernel: nd_pmem pfn0.0: namespace0.0 alignment collision, truncate 67108864 bytes
Nov 16 11:17:19 localhost kernel: ------------[ cut here ]------------

This looks very similar to issue 39; I will attempt it with the alignment forced once I swap all the hardware back in. I'm posting this now, even lacking data, since issue 39 was closed with "We have alignment fixes."

Interestingly, it does not do this for all DIMMs; it seems to depend entirely on how I lay out the physical DIMMs.
This is all Intel Optane DCP. The 2-2-2 configurations I have don't seem to have this issue, but the 2-1-1 symmetric population does. I've seen it happen on either IMC, but it seems to be more frequent on IMC0. Asymmetric population does not appear to have the issue either. Additionally, I've only seen this when the DIMMs are in 0% Memory Mode (all storage); 50% ratios do not have this problem.

I'll bump the issue once I swap all the hardware back in and try with forced alignment.

@patr-geary-smci changed the title from "Failure to create namespaces;" to "Failure to create namespaces; pfn failed to enable + kernel oops" on Nov 20, 2018
@djbw
Member

djbw commented Nov 20, 2018

There is a known limitation that requires namespaces to be 128MB aligned, but obviously the kernel is not figuring out the proper alignment. Can you send the output of "cat /proc/iomem" on this system and "ndctl list -R"?

@patr-geary-smci
Author

iomem:

# cat /proc/iomem 
00000000-00000fff : Reserved
00001000-0009a7ff : System RAM
0009a800-0009ffff : Reserved
000a0000-000bffff : PCI Bus 0000:00
000c0000-000c7fff : Video ROM
  000c4000-000c7fff : PCI Bus 0000:00
000c8000-000c8dff : Adapter ROM
000e0000-000fffff : Reserved
  000f0000-000fffff : System ROM
00100000-69f0efff : System RAM
69f0f000-6c00efff : Reserved
  6afcf018-6afcf018 : APEI ERST
  6afcf01c-6afcf021 : APEI ERST
  6afcf028-6afcf039 : APEI ERST
  6afcf040-6afcf04c : APEI ERST
  6afcf050-6afd104f : APEI ERST
6c00f000-6c12bfff : System RAM
6c12c000-6ceb2fff : ACPI Non-volatile Storage
6ceb3000-6f2bbfff : Reserved
6f2bc000-6f7fffff : System RAM
6f800000-8fffffff : Reserved
  80000000-8fffffff : PCI MMCONFIG 0000 [bus 00-ff]
90000000-9d7fffff : PCI Bus 0000:00
  90000000-901fffff : PCI Bus 0000:01
  9c000000-9d0fffff : PCI Bus 0000:02
    9c000000-9d0fffff : PCI Bus 0000:03
      9c000000-9cffffff : 0000:03:00.0
      9d000000-9d01ffff : 0000:03:00.0
  9d100000-9d17ffff : 0000:00:17.0
    9d100000-9d17ffff : ahci
  9d180000-9d1fffff : 0000:00:11.5
    9d180000-9d1fffff : ahci
  9d200000-9d20ffff : 0000:00:14.0
    9d200000-9d20ffff : xhci-hcd
  9d210000-9d213fff : 0000:00:1f.2
  9d214000-9d215fff : 0000:00:17.0
    9d214000-9d215fff : ahci
  9d216000-9d217fff : 0000:00:11.5
    9d216000-9d217fff : ahci
  9d218000-9d2180ff : 0000:00:17.0
    9d218000-9d2180ff : ahci
  9d219000-9d2190ff : 0000:00:11.5
    9d219000-9d2190ff : ahci
  9d21a000-9d21afff : 0000:00:05.4
  9d7fc000-9d7fcfff : dmar7
9d800000-aaffffff : PCI Bus 0000:17
  9d800000-9d9fffff : PCI Bus 0000:19
  9da00000-9dbfffff : PCI Bus 0000:1a
  a8000000-aa0fffff : PCI Bus 0000:18
    a8000000-a8ffffff : 0000:18:00.1
      a8000000-a8ffffff : i40e
    a9000000-a9ffffff : 0000:18:00.0
      a9000000-a9ffffff : i40e
    aa000000-aa007fff : 0000:18:00.1
      aa000000-aa007fff : i40e
    aa008000-aa00ffff : 0000:18:00.0
      aa008000-aa00ffff : i40e
  aae00000-aaefffff : PCI Bus 0000:18
    aae00000-aae7ffff : 0000:18:00.1
    aae80000-aaefffff : 0000:18:00.0
  aaf00000-aaf00fff : 0000:17:05.4
  aaffc000-aaffcfff : dmar4
ab000000-b87fffff : PCI Bus 0000:3a
  ab000000-ab1fffff : PCI Bus 0000:3b
  ab200000-ab3fffff : PCI Bus 0000:3c
  ab400000-ab5fffff : PCI Bus 0000:3d
  ab600000-ab7fffff : PCI Bus 0000:3e
  b8700000-b8700fff : 0000:3a:05.4
  b87fc000-b87fcfff : dmar5
b8800000-c5ffffff : PCI Bus 0000:5d
  c5f00000-c5f00fff : 0000:5d:05.4
  c5ffc000-c5ffcfff : dmar6
c6000000-d37fffff : PCI Bus 0000:80
  d3700000-d3700fff : 0000:80:05.4
  d37fc000-d37fcfff : dmar0
d3800000-e0ffffff : PCI Bus 0000:85
  d3800000-d39fffff : PCI Bus 0000:86
  d3a00000-d3bfffff : PCI Bus 0000:87
  d3c00000-d3dfffff : PCI Bus 0000:88
  d3e00000-d3ffffff : PCI Bus 0000:89
  e0f00000-e0f00fff : 0000:85:05.4
  e0ffc000-e0ffcfff : dmar1
e1000000-ee7fffff : PCI Bus 0000:ae
  ee700000-ee700fff : 0000:ae:05.4
  ee7fc000-ee7fcfff : dmar2
ee800000-fbffffff : PCI Bus 0000:d7
  fbd00000-fbefffff : PCI Bus 0000:d8
    fbd00000-fbdfffff : 0000:d8:00.1
    fbe00000-fbefffff : 0000:d8:00.0
  fbf00000-fbf00fff : 0000:d7:05.4
  fbffc000-fbffcfff : dmar3
fd000000-fe7fffff : Reserved
  fd000000-fdabffff : pnp 00:05
  fdad0000-fdadffff : pnp 00:05
  fdb00000-fdffffff : pnp 00:05
    fdc6000c-fdc6000f : iTCO_wdt
  fe000000-fe00ffff : pnp 00:05
  fe010000-fe010fff : PCI Bus 0000:00
    fe010000-fe010fff : 0000:00:1f.5
  fe011000-fe01ffff : pnp 00:05
  fe036000-fe03bfff : pnp 00:05
  fe03d000-fe3fffff : pnp 00:05
  fe410000-fe7fffff : pnp 00:05
fec00000-fecfffff : PNP0003:00
  fec00000-fec003ff : IOAPIC 0
  fec01000-fec013ff : IOAPIC 1
  fec08000-fec083ff : IOAPIC 2
  fec10000-fec103ff : IOAPIC 3
  fec18000-fec183ff : IOAPIC 4
  fec20000-fec203ff : IOAPIC 5
  fec28000-fec283ff : IOAPIC 6
  fec30000-fec303ff : IOAPIC 7
  fec38000-fec383ff : IOAPIC 8
fed00000-fed003ff : HPET 0
  fed00000-fed003ff : PNP0103:00
fed12000-fed1200f : pnp 00:01
fed12010-fed1201f : pnp 00:01
fed1b000-fed1bfff : pnp 00:01
fed20000-fed44fff : Reserved
fed45000-fed8bfff : pnp 00:01
fee00000-feefffff : pnp 00:01
  fee00000-fee00fff : Local APIC
ff000000-ffffffff : Reserved
  ff000000-ffffffff : pnp 00:01
100000000-187bffffff : System RAM
187c000000-967bffffff : Persistent Memory
967c000000-ae7bffffff : System RAM
  a837000000-a837c031d0 : Kernel code
  a837c031d1-a83839617f : Kernel data
  a838960000-a838aadfff : Kernel bss
ae7c000000-12c7bffffff : Persistent Memory
12c7c000000-12c9bffffff : Reserved
  12c9bffc000-12c9bffcfff : ndbus0
  12c9bffd000-12c9bffdfff : ndbus0
  12c9bffe000-12c9bffefff : ndbus0
  12c9bfff000-12c9bffffff : ndbus0
380000000000-383fffffffff : PCI Bus 0000:00
  380000000000-3800001fffff : PCI Bus 0000:01
  380000200000-3800002000ff : 0000:00:1f.4
  383ffff00000-383ffff03fff : 0000:00:04.7
    383ffff00000-383ffff03fff : ioatdma
  383ffff04000-383ffff07fff : 0000:00:04.6
    383ffff04000-383ffff07fff : ioatdma
  383ffff08000-383ffff0bfff : 0000:00:04.5
    383ffff08000-383ffff0bfff : ioatdma
  383ffff0c000-383ffff0ffff : 0000:00:04.4
    383ffff0c000-383ffff0ffff : ioatdma
  383ffff10000-383ffff13fff : 0000:00:04.3
    383ffff10000-383ffff13fff : ioatdma
  383ffff14000-383ffff17fff : 0000:00:04.2
    383ffff14000-383ffff17fff : ioatdma
  383ffff18000-383ffff1bfff : 0000:00:04.1
    383ffff18000-383ffff1bfff : ioatdma
  383ffff1c000-383ffff1ffff : 0000:00:04.0
    383ffff1c000-383ffff1ffff : ioatdma
  383ffff20000-383ffff20fff : 0000:00:16.4
  383ffff21000-383ffff21fff : 0000:00:16.1
  383ffff22000-383ffff22fff : 0000:00:16.0
    383ffff22000-383ffff22fff : mei_me
  383ffff23000-383ffff23fff : 0000:00:14.2
384000000000-387fffffffff : PCI Bus 0000:17
  384000000000-3840001fffff : PCI Bus 0000:19
  384000200000-3840003fffff : PCI Bus 0000:1a
388000000000-38bfffffffff : PCI Bus 0000:3a
  388000000000-3880001fffff : PCI Bus 0000:3b
  388000200000-3880003fffff : PCI Bus 0000:3c
  388000400000-3880005fffff : PCI Bus 0000:3d
  388000600000-3880007fffff : PCI Bus 0000:3e
38c000000000-38ffffffffff : PCI Bus 0000:5d
390000000000-393fffffffff : PCI Bus 0000:80
  393ffff00000-393ffff03fff : 0000:80:04.7
    393ffff00000-393ffff03fff : ioatdma
  393ffff04000-393ffff07fff : 0000:80:04.6
    393ffff04000-393ffff07fff : ioatdma
  393ffff08000-393ffff0bfff : 0000:80:04.5
    393ffff08000-393ffff0bfff : ioatdma
  393ffff0c000-393ffff0ffff : 0000:80:04.4
    393ffff0c000-393ffff0ffff : ioatdma
  393ffff10000-393ffff13fff : 0000:80:04.3
    393ffff10000-393ffff13fff : ioatdma
  393ffff14000-393ffff17fff : 0000:80:04.2
    393ffff14000-393ffff17fff : ioatdma
  393ffff18000-393ffff1bfff : 0000:80:04.1
    393ffff18000-393ffff1bfff : ioatdma
  393ffff1c000-393ffff1ffff : 0000:80:04.0
    393ffff1c000-393ffff1ffff : ioatdma
394000000000-397fffffffff : PCI Bus 0000:85
  394000000000-3940001fffff : PCI Bus 0000:86
  394000200000-3940003fffff : PCI Bus 0000:87
  394000400000-3940005fffff : PCI Bus 0000:88
  394000600000-3940007fffff : PCI Bus 0000:89
398000000000-39bfffffffff : PCI Bus 0000:ae
39c000000000-39ffffffffff : PCI Bus 0000:d7
  39fffc000000-39ffffffffff : PCI Bus 0000:d8
    39fffc000000-39fffdffffff : 0000:d8:00.1
      39fffc000000-39fffdffffff : mlx5_core
    39fffe000000-39ffffffffff : 0000:d8:00.0
      39fffe000000-39ffffffffff : mlx5_core

Before configuring any namespaces:
ndctl list -R

# ndctl list -Ru
[
  {
    "dev":"region1",
    "size":"252.00 GiB (270.58 GB)",
    "available_size":"252.00 GiB (270.58 GB)",
    "type":"pmem",
    "iset_id":"0x7e76da9046418a22",
    "persistence_domain":"memory_controller"
  },
  {
    "dev":"region3",
    "size":"252.00 GiB (270.58 GB)",
    "available_size":"252.00 GiB (270.58 GB)",
    "type":"pmem",
    "iset_id":"0x549eda90f5458a22",
    "persistence_domain":"memory_controller"
  },
  {
    "dev":"region0",
    "size":"252.00 GiB (270.58 GB)",
    "available_size":"252.00 GiB (270.58 GB)",
    "type":"pmem",
    "iset_id":"0xb88ada907f438a22",
    "persistence_domain":"memory_controller"
  },
  {
    "dev":"region2",
    "size":"252.00 GiB (270.58 GB)",
    "available_size":"252.00 GiB (270.58 GB)",
    "type":"pmem",
    "iset_id":"0xaa76da9064418a22",
    "persistence_domain":"memory_controller"
  }
]


# ndctl create-namespace -r region0
{
  "dev":"namespace0.0",
  "mode":"fsdax",
  "map":"dev",
  "size":"248.00 GiB (266.29 GB)",
  "uuid":"1e5ef4f8-d534-48f6-882d-a05afc17d682",
  "raw_uuid":"fa550750-cfe2-4eb7-b759-74329d7561b8",
  "sector_size":512,
  "blockdev":"pmem0",
  "numa_node":0
}

--messages--
[  120.172835] nd_pmem pfn0.0: namespace0.0 alignment collision, truncate 67108864 bytes
[  163.232962] pmem0: detected capacity change from 0 to 266285875200

After namespace creation failure and oops:

# ndctl create-namespace -r region1
libndctl: ndctl_pfn_enable: pfn1.0: failed to enable
  Error: namespace1.0: failed to enable

failed to create namespace: No such device or address

--messages--
[  283.340139] nd_pmem pfn1.0: namespace1.0 alignment collision, truncate 67108864 bytes
[  283.340866] ------------[ cut here ]------------
[  283.340867] nd_pmem pfn1.0: Conflicting mapping in same section
[  283.340877] WARNING: CPU: 25 PID: 2966 at kernel/memremap.c:188 devm_memremap_pages+0xa4/0x440
[  283.340878] Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp scsi_transport_srp intel_rapl dax_pmem nd_pmem device_dax nd_btt skx_edac iTCO_wdt iTCO_vendor_support rpcrdma sunrpc rdma_ucm x86_pkg_temp_thermal intel_powerclamp ib_iser coretemp ib_umad rdma_cm ib_ipoib iw_cm kvm_intel kvm irqbypass libiscsi crct10dif_pclmul scsi_transport_iscsi crc32_pclmul ib_cm ghash_clmulni_intel intel_cstate
[  283.340916]  intel_uncore intel_rapl_perf joydev mlx5_ib ib_uverbs i40iw ib_core ipmi_ssif mei_me ioatdma mei i2c_i801 wmi dca lpc_ich ipmi_si ipmi_devintf ipmi_msghandler pcc_cpufreq nfit acpi_power_meter acpi_pad xfs libcrc32c ast i2c_algo_bit drm_kms_helper mlx5_core ttm drm i40e crc32c_intel mlxfw devlink
[  283.340936] CPU: 25 PID: 2966 Comm: ndctl Not tainted 4.18.18-200.fc28.x86_64 #1
[  283.340936] Hardware name: Supermicro Super Server/X11DPU, BIOS 3.0 10/20/2018
[  283.340938] RIP: 0010:devm_memremap_pages+0xa4/0x440
[  283.340938] Code: 89 c7 48 85 c0 74 6c 48 8b 5d 50 48 85 db 74 5d 48 89 ef e8 fe 85 3e 00 48 89 da 48 c7 c7 b8 71 0d bf 48 89 c6 e8 36 c0 eb ff <0f> 0b 49 8b bf 80 00 00 00 48 8b 47 08 a8 03 0f 85 89 00 00 00 65 
[  283.340957] RSP: 0018:ffffb4910fce7c98 EFLAGS: 00010282
[  283.340958] RAX: 0000000000000000 RBX: ffff89bbcf133f58 RCX: 0000000000000006
[  283.340959] RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff8a51da856930
[  283.340960] RBP: ffff89bbbab17410 R08: 0000000000000044 R09: 0000000000000004
[  283.340960] R10: 0000000000000000 R11: 0000000000000001 R12: ffff89bb8ef1f6b0
[  283.340961] R13: ffff89bb8ef1f6c0 R14: 0000009677ffffff R15: ffff89bb90cca6b0
[  283.340963] FS:  00007f8420031d00(0000) GS:ffff8a51da840000(0000) knlGS:0000000000000000
[  283.340963] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  283.340964] CR2: 00007f841f3b3fe0 CR3: 000000ad5a8c6002 CR4: 00000000007606e0
[  283.340965] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  283.340966] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  283.340966] PKRU: 55555554
[  283.340967] Call Trace:
[  283.340973]  ? devres_add+0x2f/0x40
[  283.340978]  pmem_attach_disk+0x1fd/0x660 [nd_pmem]
[  283.340980]  ? devm_nsio_enable+0xb8/0x100
[  283.340983]  nvdimm_bus_probe+0x69/0x170
[  283.340987]  driver_probe_device+0x2da/0x450
[  283.340989]  bind_store+0xf1/0x180
[  283.340994]  kernfs_fop_write+0x116/0x190
[  283.340998]  __vfs_write+0x36/0x190
[  283.341002]  ? selinux_file_permission+0xf0/0x130
[  283.341003]  vfs_write+0xa5/0x1a0
[  283.341006]  ksys_write+0x4f/0xb0
[  283.341011]  do_syscall_64+0x5b/0x160
[  283.341016]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  283.341018] RIP: 0033:0x7f841f315ef4
[  283.341018] Code: 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 8d 05 f1 3a 2d 00 8b 00 85 c0 75 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 41 54 49 89 d4 55 48 89 f5 53 
[  283.341037] RSP: 002b:00007fffe0f77778 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  283.341038] RAX: ffffffffffffffda RBX: 000000000071c870 RCX: 00007f841f315ef4
[  283.341039] RDX: 0000000000000007 RSI: 000000000071c870 RDI: 0000000000000008
[  283.341040] RBP: 0000000000000007 R08: 0000000000000006 R09: 0000000000000005
[  283.341041] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000008
[  283.341041] R13: 00007f8420031c28 R14: 0000000000000001 R15: 000000000071c870
[  283.341042] ---[ end trace 2870480c67ed832a ]---
[  283.341079] nd_pmem: probe of pfn1.0 failed with error -12
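
For readers decoding the numbers in the logs above: the kernel reports failures as negative errno values, so the `-12` in "probe of pfn1.0 failed with error -12" and the "No such device or address" text printed by ndctl can both be looked up with Python's standard `errno` module. This is only a lookup sketch; the helper name `errno_name` is made up for illustration, and the numbers assume Linux errno numbering.

```python
import errno
import os


def errno_name(kernel_rc):
    """Map a negative kernel return code (e.g. -12) to its symbolic errno name."""
    return errno.errorcode.get(-kernel_rc, "UNKNOWN")


if __name__ == "__main__":
    # -12 is the code from "nd_pmem: probe of pfn1.0 failed with error -12".
    print(errno_name(-12), "-", os.strerror(12))
    # ndctl's "No such device or address" is the message text for ENXIO.
    print(errno.ENXIO, "-", os.strerror(errno.ENXIO))
```

So the probe failure is an ENOMEM coming back from the memory hot-add path, while ndctl itself reports ENXIO once the pfn device fails to enable.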

@patr-geary-smci
Author

patr-geary-smci commented Nov 20, 2018

Tried manual alignment (4K and 2M) to no avail. Both failed with the same message, and both on the same region.

-Edit: Removed 128M; it wasn't supported when I tested.

@patr-geary-smci
Author

I started going through the source. 1G Alignment worked.

@djbw
Member

djbw commented Nov 21, 2018

As root can you send the output of:

for i in /sys/bus/nd/devices/region*/resource; do echo $i: $(cat $i); done

I fear this configuration may be placing two regions within 128MB of each other.

@patr-geary-smci
Author

patr-geary-smci commented Nov 21, 2018

# for i in /sys/bus/nd/devices/region*/resource; do echo $i: $(cat $i); done
/sys/bus/nd/devices/region0/resource: 0x187c000000
/sys/bus/nd/devices/region1/resource: 0x577c000000
/sys/bus/nd/devices/region2/resource: 0xae7c000000
/sys/bus/nd/devices/region3/resource: 0xed7c000000

@djbw
Member

djbw commented Nov 21, 2018

Ugh, yes, those are 64MB alignments. That is why 1GB aligned namespaces work and 2MB does not.
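
The arithmetic behind that observation can be reproduced with a short sketch. It takes the four region base addresses posted above, computes how far each one sits into a 128 MiB section, and shows where the start would land if rounded up to 1 GiB. The helper names (`offset_within`, `align_up`) are illustrative only and are not part of ndctl or the kernel.

```python
MIB = 1 << 20
GIB = 1 << 30
SECTION_SIZE = 128 * MIB  # section size discussed in this thread

# Region base addresses as reported from /sys/bus/nd/devices/region*/resource.
REGION_BASES = {
    "region0": 0x187C000000,
    "region1": 0x577C000000,
    "region2": 0xAE7C000000,
    "region3": 0xED7C000000,
}


def offset_within(addr, alignment):
    """Distance of addr past the previous multiple of alignment."""
    return addr % alignment


def align_up(addr, alignment):
    """Round addr up to the next multiple of alignment."""
    return (addr + alignment - 1) // alignment * alignment


if __name__ == "__main__":
    for name, base in REGION_BASES.items():
        off = offset_within(base, SECTION_SIZE)
        bumped = align_up(base, GIB)
        print(f"{name}: base={base:#x} offset_in_section={off // MIB} MiB "
              f"1GiB_aligned_start={bumped:#x} "
              f"section_aligned={offset_within(bumped, SECTION_SIZE) == 0}")
```

Every base sits 64 MiB (67108864 bytes, the same figure as the "truncate" messages in the logs) into a section. The bases are already 2 MiB aligned, so a 2M alignment leaves the start in the middle of a section, whereas rounding up to 1 GiB always lands on a section boundary because 1 GiB is a multiple of 128 MiB.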

@patr-geary-smci
Author

Is this something I should be pointing out to the ipmctl Intel folks, or is this going to be a kernel "issue"?

I actually haven't delved into the spec for any of this stuff, so I'm not sitting in a good place to draw a conclusion.

@djbw
Member

djbw commented Nov 21, 2018

Linux needs 128MB alignment for each adjacent namespace. There isn't a fix because BIOS has no visibility into, or responsibility for, Linux alignment constraints. Going forward, Linux will eventually gain the capability to support fsdax mode with namespaces that collide within a section (128MB). Until then, the only workarounds are "raw" mode (not useful) or requiring fsdax namespaces to be created with "--align=1GB".

We faced something similar with section collisions with System RAM, but in that case we could interrogate the collision ahead of time. As it stands, we don't find out about this collision until it's too late. I'll try to think of something more clever, but the solution may devolve to just teaching the tooling to require large alignments.
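
To make the section collision concrete, here is a rough sketch that checks which regions in this report touch the same 128 MiB section. It assumes each region is 252 GiB, as shown in the `ndctl list -Ru` output earlier, and uses the region base addresses posted above. The function names are hypothetical, and this is not how the kernel performs the check.

```python
SECTION_SHIFT = 27  # 2**27 bytes = 128 MiB
REGION_SIZE = 252 << 30  # 252 GiB per region, from the ndctl list output

REGIONS = [
    ("region0", 0x187C000000),
    ("region1", 0x577C000000),
    ("region2", 0xAE7C000000),
    ("region3", 0xED7C000000),
]


def section_span(start, size):
    """Return (first, last) section index touched by [start, start + size)."""
    return start >> SECTION_SHIFT, (start + size - 1) >> SECTION_SHIFT


def find_section_collisions(regions, size):
    """List pairs of regions whose address ranges touch a common section."""
    spans = [(name, *section_span(base, size)) for name, base in regions]
    collisions = []
    for i, (a, a_first, a_last) in enumerate(spans):
        for b, b_first, b_last in spans[i + 1:]:
            if a_first <= b_last and b_first <= a_last:
                collisions.append((a, b))
    return collisions


if __name__ == "__main__":
    for a, b in find_section_collisions(REGIONS, REGION_SIZE):
        print(f"{a} and {b} share a 128 MiB section")
```

With these inputs, region0 ends exactly where region1 begins (0x577c000000), which is 64 MiB into a section, so the two regions share that section. The same holds for region2 and region3.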

@djbw
Member

djbw commented Nov 21, 2018

Hold off on contacting ipmctl folks until I exhaust a possible kernel workaround for this issue...

@patr-geary-smci
Author

If it's BIOS related, I can start a conversation with the members of our BIOS team responsible for integrating Intel's code.

Interestingly I did successfully configure goals and namespaces under the BIOS and I experienced no such issues, although the region size was.... different. I didn't prod at the region ranges though, I just saw size differences once allocated.

I didn't dig into it since the whole starting point for me is handling validation of these things under Linux.

@djbw
Member

djbw commented Nov 21, 2018

Certainly the BIOS can fix it by making sure each NFIT SPA range is at least 128MB aligned.

@djbw
Member

djbw commented Nov 21, 2018

...there's just no specification that requires that.

@patr-geary-smci
Author

No spec = big problem

I'll talk to some folks, do some reading and see if Intel specified the alignment.

@djbw
Member

djbw commented Nov 21, 2018

No, Intel does not specify the alignment; this is an odd and specific quirk of the current Linux memory mapping implementation. This is why "raw" namespaces do not have this problem.

@patr-geary-smci
Author

Let me ask around. Might be some time before I get a good answer, and even then it might only apply to SMCI.

@djbw
Member

djbw commented Nov 24, 2018

Can you give this kernel patch a try:

https://patchwork.kernel.org/patch/10696565/

Even if you get this changed in your particular BIOS I still think the kernel should carry this fix.

Also I'd like to give you credit for the report if you're comfortable with your email address in the public kernel history. Just respond with a tag like:

Reported-by: Your Name <your.name@email.address>

Thanks!

fengguang pushed a commit to 0day-ci/linux that referenced this issue Nov 24, 2018
Commit cfe30b8 "libnvdimm, pmem: adjust for section collisions with
'System RAM'" enabled Linux to workaround occasions where platform
firmware arranges for "System RAM" and "Persistent Memory" to collide
within a single section boundary. Unfortunately, as reported in this
issue [1], platform firmware can inflict the same collision between
persistent memory regions.

The approach of interrogating iomem_resource does not work in this
case because platform firmware may merge multiple regions into a single
iomem_resource range. Instead provide a method to interrogate regions
that share the same parent bus.

This is a stop-gap until the core-MM can grow support for hotplug on
sub-section boundaries.

[1]: pmem/ndctl#76

Fixes: cfe30b8 ("libnvdimm, pmem: adjust for section collisions with...")
Cc: <stable@vger.kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
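
The point in the commit message about merged iomem_resource ranges can be seen in the data from this issue. The sketch below parses the two "Persistent Memory" lines from the `/proc/iomem` output posted earlier and compares their sizes against the 252 GiB region size reported by `ndctl list -Ru`. The parser is a minimal illustration with made-up helper names, not the kernel's logic.

```python
REGION_SIZE = 252 << 30  # 252 GiB per region, from the ndctl list output

# The two "Persistent Memory" entries from the /proc/iomem dump in this issue.
IOMEM_LINES = [
    "187c000000-967bffffff : Persistent Memory",
    "ae7c000000-12c7bffffff : Persistent Memory",
]


def parse_iomem_line(line):
    """Split a '/proc/iomem' line into (start, end, name); end is inclusive."""
    span, name = line.split(" : ", 1)
    start_hex, end_hex = span.strip().split("-")
    return int(start_hex, 16), int(end_hex, 16), name.strip()


def range_size(line):
    """Size in bytes of the inclusive address range on an iomem line."""
    start, end, _ = parse_iomem_line(line)
    return end - start + 1


if __name__ == "__main__":
    for line in IOMEM_LINES:
        size = range_size(line)
        print(f"{line}: {size >> 30} GiB = {size // REGION_SIZE} regions")
```

Each iomem entry is 504 GiB, that is, two 252 GiB regions reported as one range, so the boundary between region0 and region1 (and between region2 and region3) is not visible from iomem alone.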
@patr-geary-smci
Author

patr-geary-smci commented Nov 25, 2018 via email

@patr-geary-smci
Author

Yeesh; look at that mess from my phone's mailer.

Anyways; tested. Works fine.

@djbw
Member

djbw commented Nov 27, 2018

Thanks @patr-geary-smci, much appreciated. Mind sharing your contact details so I can get the patch credit correct? All I can see on your GitHub profile is "patr-geary-smci".

@patr-geary-smci
Author

patrickg@supermicro.com

avagin pushed a commit to avagin/linux that referenced this issue Nov 30, 2018
Commit cfe30b8 "libnvdimm, pmem: adjust for section collisions with
'System RAM'" enabled the libnvdimm core to workaround occasions where
platform firmware arranges for "System RAM" and "Persistent Memory" to
collide within a single section boundary. Unfortunately, as reported in
this issue [1], platform firmware can inflict the same collision between
persistent memory regions.

The approach of interrogating iomem_resource does not work in this
case because platform firmware may merge multiple regions into a single
iomem_resource range. Instead provide a method to interrogate regions
that share the same parent bus.

This is a stop-gap until the core-MM can grow support for hotplug on
sub-section boundaries.

[1]: pmem/ndctl#76

Fixes: cfe30b8 ("libnvdimm, pmem: adjust for section collisions with...")
Cc: <stable@vger.kernel.org>
Reported-by: Patrick Geary <patrickg@supermicro.com>
Tested-by: Patrick Geary <patrickg@supermicro.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Whissi pushed a commit to Whissi/linux-stable that referenced this issue Dec 13, 2018
commit ae86cbf upstream.

Commit cfe30b8 "libnvdimm, pmem: adjust for section collisions with
'System RAM'" enabled Linux to workaround occasions where platform
firmware arranges for "System RAM" and "Persistent Memory" to collide
within a single section boundary. Unfortunately, as reported in this
issue [1], platform firmware can inflict the same collision between
persistent memory regions.

The approach of interrogating iomem_resource does not work in this
case because platform firmware may merge multiple regions into a single
iomem_resource range. Instead provide a method to interrogate regions
that share the same parent bus.

This is a stop-gap until the core-MM can grow support for hotplug on
sub-section boundaries.

[1]: pmem/ndctl#76

Fixes: cfe30b8 ("libnvdimm, pmem: adjust for section collisions with...")
Cc: <stable@vger.kernel.org>
Reported-by: Patrick Geary <patrickg@supermicro.com>
Tested-by: Patrick Geary <patrickg@supermicro.com>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
woodsts pushed a commit to woodsts/linux-stable that referenced this issue Dec 13, 2018
isjerryxiao pushed a commit to isjerryxiao/Amlogic_s905-kernel that referenced this issue Dec 15, 2018
Emerald-Phoenix pushed a commit to Emerald-Phoenix/linux that referenced this issue Jan 1, 2019
gregmarsden pushed a commit to oracle/linux-uek that referenced this issue Jan 25, 2019
(cherry picked from commit ae86cbf)
Orabug: 29168389
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Larry Bassel <larry.bassel@oracle.com>
jpuhlman pushed a commit to MontaVista-OpenSourceTechnology/linux-mvista-2.4 that referenced this issue Feb 4, 2019
Source: linux-mvista-2.4
MR: 96580, 00000
Type: Integration
Disposition: Merged from linux-mvista-2.4
ChangeID: 4794d94b44c4903514a878b9097e99417a5e6dbe
Description:

commit ae86cbfef3818300f1972e52f67a93211acb0e24 upstream.

Signed-off-by: Armin Kuster <akuster@mvista.com>
Signed-off-by: Jeremy Puhlman <jpuhlman@mvista.com>
jpuhlman pushed a commit to MontaVista-OpenSourceTechnology/linux-mvista-2.4 that referenced this issue Feb 7, 2019
Source: Kernel.org
MR: 96580
Type: Security Fix
Disposition: Backport from git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable linux-4.14.y
ChangeID: 4794d94b44c4903514a878b9097e99417a5e6dbe
Description:

commit ae86cbfef3818300f1972e52f67a93211acb0e24 upstream.

Signed-off-by: Armin Kuster <akuster@mvista.com>
jpuhlman pushed a commit to MontaVista-OpenSourceTechnology/linux-mvista-2.4 that referenced this issue Mar 5, 2019
Source: Kernel.org
MR: 96580
Type: Security Fix
Disposition: Backport from git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable linux-4.14.y
ChangeID: 4794d94b44c4903514a878b9097e99417a5e6dbe
Description:

commit ae86cbfef3818300f1972e52f67a93211acb0e24 upstream.

Signed-off-by: Armin Kuster <akuster@mvista.com>
randomblame pushed a commit to randomblame/android_kernel_xiaomi_sm8150 that referenced this issue Apr 13, 2019
commit ae86cbfef3818300f1972e52f67a93211acb0e24 upstream.

randomblame pushed a commit to randomblame/android_kernel_xiaomi_sm8150 that referenced this issue Apr 27, 2019
commit ae86cbfef3818300f1972e52f67a93211acb0e24 upstream.

rgushchin pushed a commit to rgushchin/linux that referenced this issue Jun 11, 2019
Patch series "mm: Sub-section memory hotplug support", v9.

The memory hotplug section is an arbitrary / convenient unit for memory
hotplug.  'Section-size' units have bled into the user interface
('memblock' sysfs) and cannot be changed without breaking existing
userspace.  The section-size constraint, while mostly benign for typical
memory hotplug, has wreaked and continues to wreak havoc with
'device-memory' use cases, persistent memory (pmem) in particular.  Recall
that pmem uses devm_memremap_pages(), and subsequently arch_add_memory(),
to allocate a 'struct page' memmap for pmem.  However, it does not use the
'bottom half' of memory hotplug, i.e. it never marks pmem pages online and
never exposes the userspace memblock interface for pmem.  This leaves an
opening to redress the section-size constraint.

To date, the libnvdimm subsystem has attempted to inject padding to
satisfy the internal constraints of arch_add_memory().  Beyond
complicating the code, leading to bugs [2], wasting memory, and limiting
configuration flexibility, the padding hack breaks when the platform
changes the physical memory alignment of pmem from one boot to the next.
Device failure (intermittent or permanent) and physical reconfiguration
are events that can cause the platform firmware to change the physical
placement of pmem on a subsequent boot, and device failure is an everyday
event in a data-center.

It turns out that sections are only a hard requirement of the user-facing
interface for memory hotplug, and with a bit more infrastructure,
sub-section arch_add_memory() support can be added for kernel-internal
usages like devm_memremap_pages().  Here is an analysis of the design
assumptions in the current code and how they are addressed in the new
implementation:

Current design assumptions:

- Sections that describe boot memory (early sections) are never
  unplugged / removed.

- pfn_valid(), in the CONFIG_SPARSEMEM_VMEMMAP=y case, devolves to a
  valid_section() check

- __add_pages() and helper routines assume all operations occur in
  PAGES_PER_SECTION units.

- The memblock sysfs interface only comprehends full sections

New design assumptions:

- Sections are instrumented with a sub-section bitmask to track (on x86)
  individual 2MB sub-divisions of a 128MB section.

- Partially populated early sections can be extended with additional
  sub-sections, and those sub-sections can be removed with
  arch_remove_memory(). With this in place we no longer lose usable memory
  capacity to padding.

- pfn_valid() is updated to look deeper than valid_section() to also check the
  active-sub-section mask. This indication is in the same cacheline as
  the valid_section() so the performance impact is expected to be
  negligible. So far the lkp robot has not reported any regressions.

- Outside of the core vmemmap population routines which are replaced,
  other helper routines like shrink_{zone,pgdat}_span() are updated to
  handle the smaller granularity. Core memory hotplug routines that deal
  with online memory are not touched.

- The existing memblock sysfs user api guarantees / assumptions are
  not touched since this capability is limited to !online
  !memblock-sysfs-accessible sections.
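
As a rough illustration of the sub-section bitmask idea in the list above, here is a simplified userspace model, not the kernel implementation. It assumes the x86 numbers quoted in the text (4KB pages, 2MB sub-sections, 128MB sections); `toy_section`, `subsection_idx()`, `subsection_mask()` and `toy_pfn_valid()` are hypothetical names made up for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed x86 geometry: 4KB pages, 2MB sub-sections, 128MB sections. */
#define PAGE_SHIFT 12
#define SUBSECTION_SHIFT 21
#define SECTION_SIZE_BITS 27
#define PAGES_PER_SUBSECTION (UINT64_C(1) << (SUBSECTION_SHIFT - PAGE_SHIFT))
#define PAGES_PER_SECTION (UINT64_C(1) << (SECTION_SIZE_BITS - PAGE_SHIFT))

/* Hypothetical model of one section: a valid flag plus 64 sub-section bits. */
struct toy_section {
	bool valid;
	uint64_t subsection_map;
};

/* Which of the 64 sub-sections of its section a pfn falls into. */
static unsigned int subsection_idx(uint64_t pfn)
{
	return (unsigned int)((pfn & (PAGES_PER_SECTION - 1)) /
			      PAGES_PER_SUBSECTION);
}

/* Bitmask of the sub-sections covered by [pfn, pfn + nr_pages). */
static uint64_t subsection_mask(uint64_t pfn, uint64_t nr_pages)
{
	/* This sketch only handles ranges that stay inside one section. */
	assert(nr_pages > 0);
	assert(pfn / PAGES_PER_SECTION ==
	       (pfn + nr_pages - 1) / PAGES_PER_SECTION);

	unsigned int first = subsection_idx(pfn);
	unsigned int last = subsection_idx(pfn + nr_pages - 1);
	uint64_t mask = 0;

	for (unsigned int i = first; i <= last; i++)
		mask |= UINT64_C(1) << i;
	return mask;
}

/* Look "deeper" than the section-valid flag: also test the sub-section bit. */
static bool toy_pfn_valid(const struct toy_section *ms, uint64_t pfn)
{
	if (!ms->valid)
		return false;
	return (ms->subsection_map >> subsection_idx(pfn)) & 1;
}
```

In this model, populating the first 4MB of a section sets bits 0 and 1 of the map, so a pfn inside that 4MB is reported valid while the next pfn in the same section is not, even though the section as a whole is marked valid.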

Meanwhile the issue reports continue to roll in from users that do not
understand when and how the 128MB constraint will bite them.  The current
implementation relied on being able to support at least one misaligned
namespace, but that immediately falls over on any moderately complex
namespace creation attempt.  Beyond the initial problem of 'System RAM'
colliding with pmem, and the unsolvable problem of physical alignment
changes, Linux is now being exposed to platforms that collide pmem ranges
with other pmem ranges by default [3].  In short, devm_memremap_pages()
has pushed the venerable section-size constraint past the breaking point,
and the simplicity of section-aligned arch_add_memory() is no longer
tenable.

These patches are exposed to the kbuild robot on my libnvdimm-pending
branch [4], and a preview of the unit test for this functionality is
available on the 'subsection-pending' branch of ndctl [5].

[2]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com
[3]: pmem/ndctl#76
[4]: https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=libnvdimm-pending
[5]: pmem/ndctl@7c59b4867e1c


This patch (of 12):

Towards enabling memory hotplug to track partial population of a section,
introduce 'struct mem_section_usage'.

A pointer to a 'struct mem_section_usage' instance replaces the existing
pointer to a 'pageblock_flags' bitmap.  Effectively it adds one more
'unsigned long' beyond the 'pageblock_flags' (usemap) allocation to house
a new 'subsection_map' bitmap.  The new bitmap enables the memory
hot{plug,remove} implementation to act on incremental sub-divisions of a
section.
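
The layout described above can be pictured with the following userspace sketch. It is an approximation under stated assumptions, not a copy of the kernel definition: the `_sketch` suffix, `usage_size()` and `usage_alloc()` are hypothetical, and 64 sub-sections per section is the x86 figure (128MB / 2MB).

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumption: 64 sub-sections per section (128MB section / 2MB sub-section). */
#define SUBSECTIONS_PER_SECTION 64
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

/*
 * Sketch of the idea: the sub-section bitmap sits in front of the existing
 * pageblock flags (usemap), which stay a variable-length tail.
 */
struct mem_section_usage_sketch {
	unsigned long subsection_map[BITS_TO_LONGS(SUBSECTIONS_PER_SECTION)];
	unsigned long pageblock_flags[]; /* flexible array member */
};

/* Total allocation: fixed header plus however many usemap bytes are needed. */
static size_t usage_size(size_t usemap_bytes)
{
	return sizeof(struct mem_section_usage_sketch) + usemap_bytes;
}

/* Zeroed allocation, so no sub-section starts out marked as populated. */
static struct mem_section_usage_sketch *usage_alloc(size_t usemap_bytes)
{
	return calloc(1, usage_size(usemap_bytes));
}
```

On a platform with a 64-bit `unsigned long`, the header is exactly one word, which matches the "one more 'unsigned long'" wording in the text.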

The default SUBSECTION_SHIFT is chosen to keep the 'subsection_map' no
larger than a single 'unsigned long' on the major architectures. 
Alternatively an architecture can define ARCH_SUBSECTION_SHIFT to override
the default PMD_SHIFT.  Note that PowerPC needs to use
ARCH_SUBSECTION_SHIFT to work around PMD_SHIFT being a non-constant
expression there.

The primary motivation for this functionality is to support platforms that
mix "System RAM" and "Persistent Memory" within a single section, or
multiple PMEM ranges with different mapping lifetimes within a single
section.  The section restriction for hotplug has caused an ongoing saga
of hacks and bugs for devm_memremap_pages() users.

Beyond the fixups to teach existing paths how to retrieve the 'usemap'
from a section, and updates to the usemap allocation path, there are no
expected behavior changes.

Link: http://lkml.kernel.org/r/155977187407.2443951.16503493275720588454.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
cdown pushed a commit to cdown/linux that referenced this issue Jul 4, 2019
Patch series "mm: Sub-section memory hotplug support", v10.

The memory hotplug section is an arbitrary / convenient unit for memory
hotplug.  'Section-size' units have bled into the user interface
('memblock' sysfs) and cannot be changed without breaking existing
userspace.  The section-size constraint, while mostly benign for typical
memory hotplug, has wreaked and continues to wreak havoc with
'device-memory' use cases, persistent memory (pmem) in particular.  Recall
that pmem uses devm_memremap_pages(), and subsequently arch_add_memory(),
to allocate a 'struct page' memmap for pmem.  However, it does not use the
'bottom half' of memory hotplug, i.e. it never marks pmem pages online and
never exposes the userspace memblock interface for pmem.  This leaves an
opening to redress the section-size constraint.

To date, the libnvdimm subsystem has attempted to inject padding to
satisfy the internal constraints of arch_add_memory().  Beyond
complicating the code, leading to bugs [2], wasting memory, and limiting
configuration flexibility, the padding hack breaks when the platform
changes the physical memory alignment of pmem from one boot to the next.
Device failure (intermittent or permanent) and physical reconfiguration
are events that can cause the platform firmware to change the physical
placement of pmem on a subsequent boot, and device failure is an everyday
event in a data-center.

It turns out that sections are only a hard requirement of the user-facing
interface for memory hotplug, and with a bit more infrastructure,
sub-section arch_add_memory() support can be added for kernel-internal
usages like devm_memremap_pages().  Here is an analysis of the design
assumptions in the current code and how they are addressed in the new
implementation:

Current design assumptions:

- Sections that describe boot memory (early sections) are never
  unplugged / removed.

- pfn_valid(), in the CONFIG_SPARSEMEM_VMEMMAP=y case, devolves to a
  valid_section() check

- __add_pages() and helper routines assume all operations occur in
  PAGES_PER_SECTION units.

- The memblock sysfs interface only comprehends full sections

New design assumptions:

- Sections are instrumented with a sub-section bitmask to track (on x86)
  individual 2MB sub-divisions of a 128MB section.

- Partially populated early sections can be extended with additional
  sub-sections, and those sub-sections can be removed with
  arch_remove_memory(). With this in place we no longer lose usable memory
  capacity to padding.

- pfn_valid() is updated to look deeper than valid_section() to also check the
  active-sub-section mask. This indication is in the same cacheline as
  the valid_section() so the performance impact is expected to be
  negligible. So far the lkp robot has not reported any regressions.

- Outside of the core vmemmap population routines which are replaced,
  other helper routines like shrink_{zone,pgdat}_span() are updated to
  handle the smaller granularity. Core memory hotplug routines that deal
  with online memory are not touched.

- The existing memblock sysfs user api guarantees / assumptions are
  not touched since this capability is limited to !online
  !memblock-sysfs-accessible sections.

Meanwhile the issue reports continue to roll in from users that do not
understand when and how the 128MB constraint will bite them.  The current
implementation relied on being able to support at least one misaligned
namespace, but that immediately falls over on any moderately complex
namespace creation attempt.  Beyond the initial problem of 'System RAM'
colliding with pmem, and the unsolvable problem of physical alignment
changes, Linux is now being exposed to platforms that collide pmem ranges
with other pmem ranges by default [3].  In short, devm_memremap_pages()
has pushed the venerable section-size constraint past the breaking point,
and the simplicity of section-aligned arch_add_memory() is no longer
tenable.

These patches are exposed to the kbuild robot on a subsection-v10 branch
[4], and a preview of the unit test for this functionality is available
on the 'subsection-pending' branch of ndctl [5].

[2]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com
[3]: pmem/ndctl#76
[4]: https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=subsection-v10
[5]: pmem/ndctl@7c59b4867e1c


This patch (of 13):

Towards enabling memory hotplug to track partial population of a section,
introduce 'struct mem_section_usage'.

A pointer to a 'struct mem_section_usage' instance replaces the existing
pointer to a 'pageblock_flags' bitmap.  Effectively it adds one more
'unsigned long' beyond the 'pageblock_flags' (usemap) allocation to house
a new 'subsection_map' bitmap.  The new bitmap enables the memory
hot{plug,remove} implementation to act on incremental sub-divisions of a
section.

SUBSECTION_SHIFT is defined as a global constant instead of a
per-architecture value like SECTION_SIZE_BITS in order to allow cross-arch
compatibility of subsection users.  Specifically, a common subsection size
allows for the possibility that persistent memory namespace configurations
can be made compatible across architectures.
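
To make the consequence of a fixed sub-section size concrete, the sketch below computes how many sub-sections a section holds for a given section size. It assumes a global SUBSECTION_SHIFT of 21 (2MB, the x86 figure quoted earlier in this message); the helper names are hypothetical, and the per-architecture SECTION_SIZE_BITS values mentioned afterwards are illustrative assumptions, not taken from this thread.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Assumption: one global 2MB sub-section size shared by every architecture. */
#define SUBSECTION_SHIFT 21

/* Number of sub-sections in a section of 2^section_size_bits bytes. */
static unsigned long subsections_per_section(unsigned int section_size_bits)
{
	assert(section_size_bits >= SUBSECTION_SHIFT);
	return 1UL << (section_size_bits - SUBSECTION_SHIFT);
}

/* How many 'unsigned long' words a bitmap with one bit per sub-section needs. */
static size_t subsection_map_longs(unsigned int section_size_bits)
{
	size_t bits_per_long = sizeof(unsigned long) * CHAR_BIT;

	return (subsections_per_section(section_size_bits) + bits_per_long - 1) /
	       bits_per_long;
}
```

Under these assumptions a 128MB section (27 bits) holds 64 sub-sections, a 16MB section (24 bits) holds 8, and a 1GB section (30 bits) holds 512, so the bitmap size varies per architecture while the 2MB granularity seen by namespace users stays the same.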

The primary motivation for this functionality is to support platforms that
mix "System RAM" and "Persistent Memory" within a single section, or
multiple PMEM ranges with different mapping lifetimes within a single
section.  The section restriction for hotplug has caused an ongoing saga
of hacks and bugs for devm_memremap_pages() users.

Beyond the fixups to teach existing paths how to retrieve the 'usemap'
from a section, and updates to the usemap allocation path, there are no
expected behavior changes.

Link: http://lkml.kernel.org/r/156092349845.979959.73333291612799019.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Wei Yang <richardw.yang@linux.intel.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>	[ppc64]
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Qian Cai <cai@lca.pw>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
xdarklight pushed a commit to xdarklight/linux that referenced this issue Jul 9, 2019
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
avagin pushed a commit to avagin/linux that referenced this issue Jul 12, 2019
Patch series "mm: Sub-section memory hotplug support", v10.

The memory hotplug section is an arbitrary / convenient unit for memory
hotplug.  'Section-size' units have bled into the user interface
('memblock' sysfs) and can not be changed without breaking existing
userspace.  The section-size constraint, while mostly benign for typical
memory hotplug, has and continues to wreak havoc with 'device-memory' use
cases, persistent memory (pmem) in particular.  Recall that pmem uses
devm_memremap_pages(), and subsequently arch_add_memory(), to allocate a
'struct page' memmap for pmem.  However, it does not use the 'bottom half'
of memory hotplug, i.e.  never marks pmem pages online and never exposes
the userspace memblock interface for pmem.  This leaves an opening to
redress the section-size constraint.

To date, the libnvdimm subsystem has attempted to inject padding to
satisfy the internal constraints of arch_add_memory().  Beyond
complicating the code, leading to bugs [2], wasting memory, and limiting
configuration flexibility, the padding hack is broken when the platform
changes this physical memory alignment of pmem from one boot to the next.
Device failure (intermittent or permanent) and physical reconfiguration
are events that can cause the platform firmware to change the physical
placement of pmem on a subsequent boot, and device failure is an everyday
event in a data-center.

It turns out that sections are only a hard requirement of the user-facing
interface for memory hotplug and with a bit more infrastructure
sub-section arch_add_memory() support can be added for kernel internal
usages like devm_memremap_pages().  Here is an analysis of the current
design assumptions in the current code and how they are addressed in the
new implementation:

Current design assumptions:

- Sections that describe boot memory (early sections) are never
  unplugged / removed.

- pfn_valid(), in the CONFIG_SPARSEMEM_VMEMMAP=y, case devolves to a
  valid_section() check

- __add_pages() and helper routines assume all operations occur in
  PAGES_PER_SECTION units.

- The memblock sysfs interface only comprehends full sections

New design assumptions:

- Sections are instrumented with a sub-section bitmask to track (on x86)
  individual 2MB sub-divisions of a 128MB section.

- Partially populated early sections can be extended with additional
  sub-sections, and those sub-sections can be removed with
  arch_remove_memory(). With this in place we no longer lose usable memory
  capacity to padding.

- pfn_valid() is updated to look deeper than valid_section() to also check the
  active-sub-section mask. This indication is in the same cacheline as
  the valid_section() so the performance impact is expected to be
  negligible. So far the lkp robot has not reported any regressions.

- Outside of the core vmemmap population routines which are replaced,
  other helper routines like shrink_{zone,pgdat}_span() are updated to
  handle the smaller granularity. Core memory hotplug routines that deal
  with online memory are not touched.

- The existing memblock sysfs user api guarantees / assumptions are
  not touched since this capability is limited to !online
  !memblock-sysfs-accessible sections.

Meanwhile the issue reports continue to roll in from users that do not
understand when and how the 128MB constraint will bite them.  The current
implementation relied on being able to support at least one misaligned
namespace, but that immediately falls over on any moderately complex
namespace creation attempt.  Beyond the initial problem of 'System RAM'
colliding with pmem, and the unsolvable problem of physical alignment
changes, Linux is now being exposed to platforms that collide pmem ranges
with other pmem ranges by default [3].  In short, devm_memremap_pages()
has pushed the venerable section-size constraint past the breaking point,
and the simplicity of section-aligned arch_add_memory() is no longer
tenable.

These patches are exposed to the kbuild robot on a subsection-v10 branch
[4], and a preview of the unit test for this functionality is available
on the 'subsection-pending' branch of ndctl [5].

[2]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com
[3]: pmem/ndctl#76
[4]: https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=subsection-v10
[5]: pmem/ndctl@7c59b4867e1c

This patch (of 13):

Towards enabling memory hotplug to track partial population of a section,
introduce 'struct mem_section_usage'.

A pointer to a 'struct mem_section_usage' instance replaces the existing
pointer to a 'pageblock_flags' bitmap.  Effectively it adds one more
'unsigned long' beyond the 'pageblock_flags' (usemap) allocation to house
a new 'subsection_map' bitmap.  The new bitmap enables the memory
hot{plug,remove} implementation to act on incremental sub-divisions of a
section.

SUBSECTION_SHIFT is defined as a global constant instead of a
per-architecture value like SECTION_SIZE_BITS in order to allow cross-arch
compatibility of subsection users.  Specifically, a common subsection size
allows for the possibility that persistent memory namespace configurations
be made compatible across architectures.

The primary motivation for this functionality is to support platforms that
mix "System RAM" and "Persistent Memory" within a single section, or
multiple PMEM ranges with different mapping lifetimes within a single
section.  The section restriction for hotplug has caused an ongoing saga
of hacks and bugs for devm_memremap_pages() users.

Beyond the fixups to teach existing paths how to retrieve the 'usemap'
from a section, and updates to the usemap allocation path, there are no
expected behavior changes.

Link: http://lkml.kernel.org/r/156092349845.979959.73333291612799019.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Wei Yang <richardw.yang@linux.intel.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>	[ppc64]
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Qian Cai <cai@lca.pw>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
avagin pushed a commit to avagin/linux that referenced this issue Jul 15, 2019
Patch series "mm: Sub-section memory hotplug support", v10.

The memory hotplug section is an arbitrary / convenient unit for memory
hotplug.  'Section-size' units have bled into the user interface
('memblock' sysfs) and cannot be changed without breaking existing
userspace.  The section-size constraint, while mostly benign for typical
memory hotplug, has wreaked and continues to wreak havoc with
'device-memory' use cases, persistent memory (pmem) in particular.  Recall
that pmem uses devm_memremap_pages(), and subsequently arch_add_memory(),
to allocate a 'struct page' memmap for pmem.  However, it does not use the
'bottom half' of memory hotplug, i.e. it never marks pmem pages online and
never exposes the userspace memblock interface for pmem.  This leaves an
opening to redress the section-size constraint.

To date, the libnvdimm subsystem has attempted to inject padding to
satisfy the internal constraints of arch_add_memory().  Beyond
complicating the code, leading to bugs [2], wasting memory, and limiting
configuration flexibility, the padding hack is broken when the platform
changes the physical memory alignment of pmem from one boot to the next.
Device failure (intermittent or permanent) and physical reconfiguration
are events that can cause the platform firmware to change the physical
placement of pmem on a subsequent boot, and device failure is an everyday
event in a data-center.
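
To make the cost of the padding approach concrete, the arithmetic can be
sketched in a few lines of userspace C.  This is an illustration only, not
the libnvdimm implementation: SECTION_SIZE is assumed to be the 128MB x86
section size quoted in this series, and section_align_down(),
section_align_up() and section_padding() are hypothetical helper names.

```c
#include <stdint.h>

/* Assumed for illustration: a 128MB memory hotplug section (x86). */
#define SECTION_SIZE (128ULL << 20)

/* Round addr down to a section boundary (SECTION_SIZE is a power of 2). */
static uint64_t section_align_down(uint64_t addr)
{
	return addr & ~(SECTION_SIZE - 1);
}

/* Round addr up to a section boundary. */
static uint64_t section_align_up(uint64_t addr)
{
	return (addr + SECTION_SIZE - 1) & ~(SECTION_SIZE - 1);
}

/*
 * Bytes of padding needed in front of and behind [start, start + size)
 * so that the padded range covers whole sections only.
 */
static uint64_t section_padding(uint64_t start, uint64_t size)
{
	uint64_t end = start + size;

	return (start - section_align_down(start)) +
	       (section_align_up(end) - end);
}
```

For example, a 128MB range that starts 64MB into a section needs 64MB of
padding on each side.  Because the result depends on the physical start
address, a platform that moves the range between boots changes the amount
of padding required.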

It turns out that sections are only a hard requirement of the user-facing
interface for memory hotplug, and with a bit more infrastructure,
sub-section arch_add_memory() support can be added for kernel-internal
usages like devm_memremap_pages().  Here is an analysis of the design
assumptions in the current code and how they are addressed in the new
implementation:

Current design assumptions:

- Sections that describe boot memory (early sections) are never
  unplugged / removed.

- pfn_valid(), in the CONFIG_SPARSEMEM_VMEMMAP=y case, devolves to a
  valid_section() check.

- __add_pages() and helper routines assume all operations occur in
  PAGES_PER_SECTION units.

- The memblock sysfs interface only comprehends full sections.

New design assumptions:

- Sections are instrumented with a sub-section bitmask to track (on x86)
  individual 2MB sub-divisions of a 128MB section.

- Partially populated early sections can be extended with additional
  sub-sections, and those sub-sections can be removed with
  arch_remove_memory(). With this in place we no longer lose usable memory
  capacity to padding.

- pfn_valid() is updated to look deeper than valid_section() to also check the
  active-sub-section mask. This indication is in the same cacheline as
  the valid_section() so the performance impact is expected to be
  negligible. So far the lkp robot has not reported any regressions.

- Outside of the core vmemmap population routines which are replaced,
  other helper routines like shrink_{zone,pgdat}_span() are updated to
  handle the smaller granularity. Core memory hotplug routines that deal
  with online memory are not touched.

- The existing memblock sysfs user api guarantees / assumptions are
  not touched since this capability is limited to !online
  !memblock-sysfs-accessible sections.
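
The sub-section bitmask and the deeper pfn_valid() check described in the
list above can be sketched as a small userspace model.  This is not the
kernel code: 'struct toy_section' and the toy_*() helpers are hypothetical
stand-ins, and the constants assume x86-64 with 4KB pages, a 128MB section
and a 2MB sub-section, which gives 64 sub-sections per section (one bit
each in a 64-bit mask).

```c
#include <stdbool.h>
#include <stdint.h>

/* Values assumed for x86-64 with 4KB pages, per the sizes quoted above. */
#define PAGE_SHIFT		12
#define SECTION_SIZE_BITS	27	/* 128MB section */
#define SUBSECTION_SHIFT	21	/* 2MB sub-section */

#define PFN_SECTION_SHIFT	(SECTION_SIZE_BITS - PAGE_SHIFT)
#define PFN_SUBSECTION_SHIFT	(SUBSECTION_SHIFT - PAGE_SHIFT)
#define PAGES_PER_SECTION	(1ULL << PFN_SECTION_SHIFT)
#define SUBSECTIONS_PER_SECTION	(1ULL << (SECTION_SIZE_BITS - SUBSECTION_SHIFT))

/* Hypothetical stand-in for per-section state; not the kernel's struct. */
struct toy_section {
	bool valid;			/* plays the role of valid_section() */
	uint64_t subsection_map;	/* one bit per 2MB sub-section */
};

/* Index, within its section, of the sub-section that contains pfn. */
static unsigned int subsection_index(uint64_t pfn)
{
	return (unsigned int)((pfn & (PAGES_PER_SECTION - 1)) >>
			      PFN_SUBSECTION_SHIFT);
}

/* Mark the sub-section containing pfn as populated. */
static void toy_activate(struct toy_section *ms, uint64_t pfn)
{
	ms->valid = true;
	ms->subsection_map |= 1ULL << subsection_index(pfn);
}

/* Section-level check first, then the finer-grained sub-section bit. */
static bool toy_pfn_valid(const struct toy_section *ms, uint64_t pfn)
{
	if (!ms->valid)
		return false;
	return (ms->subsection_map & (1ULL << subsection_index(pfn))) != 0;
}
```

In this model a section can be valid while most of its pfns are not, which
is the distinction a section-only check cannot express.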

Meanwhile the issue reports continue to roll in from users that do not
understand when and how the 128MB constraint will bite them.  The current
implementation relied on being able to support at least one misaligned
namespace, but that immediately falls over on any moderately complex
namespace creation attempt.  Beyond the initial problem of 'System RAM'
colliding with pmem, and the unsolvable problem of physical alignment
changes, Linux is now being exposed to platforms that collide pmem ranges
with other pmem ranges by default [3].  In short, devm_memremap_pages()
has pushed the venerable section-size constraint past the breaking point,
and the simplicity of section-aligned arch_add_memory() is no longer
tenable.

These patches are exposed to the kbuild robot on a subsection-v10 branch
[4], and a preview of the unit test for this functionality is available
on the 'subsection-pending' branch of ndctl [5].

[2]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com
[3]: pmem/ndctl#76
[4]: https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=subsection-v10
[5]: pmem/ndctl@7c59b4867e1c

This patch (of 13):

Towards enabling memory hotplug to track partial population of a section,
introduce 'struct mem_section_usage'.

A pointer to a 'struct mem_section_usage' instance replaces the existing
pointer to a 'pageblock_flags' bitmap.  Effectively it adds one more
'unsigned long' beyond the 'pageblock_flags' (usemap) allocation to house
a new 'subsection_map' bitmap.  The new bitmap enables the memory
hot{plug,remove} implementation to act on incremental sub-divisions of a
section.
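
As a rough illustration of that layout, the sketch below places a
'subsection_map' bitmap in front of a trailing 'pageblock_flags' array.
It is a userspace approximation under stated assumptions, not the patch
itself: 'struct toy_section_usage' and toy_usage_alloc() are hypothetical
names, and the constants assume a 128MB section with 2MB sub-sections, so
the bitmap occupies a single 'unsigned long' on a 64-bit build.

```c
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

#define SECTION_SIZE_BITS	27	/* assumed: 128MB section (x86-64) */
#define SUBSECTION_SHIFT	21	/* assumed: 2MB sub-section */
#define SUBSECTIONS_PER_SECTION	(1UL << (SECTION_SIZE_BITS - SUBSECTION_SHIFT))

#define BITS_PER_LONG		(sizeof(unsigned long) * CHAR_BIT)
#define BITS_TO_LONGS(n)	(((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Illustrative layout only; field names follow the description above. */
struct toy_section_usage {
	unsigned long subsection_map[BITS_TO_LONGS(SUBSECTIONS_PER_SECTION)];
	unsigned long pageblock_flags[];	/* the pre-existing usemap */
};

/* Allocate a zeroed usage record with room for usemap_longs flag words. */
static struct toy_section_usage *toy_usage_alloc(size_t usemap_longs)
{
	return calloc(1, sizeof(struct toy_section_usage) +
			 usemap_longs * sizeof(unsigned long));
}
```

Keeping both bitmaps in one allocation means a single pointer in the
section still reaches the usemap, now at a fixed offset past the new
bitmap.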

SUBSECTION_SHIFT is defined as a global constant instead of a
per-architecture value like SECTION_SIZE_BITS in order to allow cross-arch
compatibility of subsection users.  Specifically, a common subsection size
allows for the possibility that persistent memory namespace configurations
be made compatible across architectures.

The primary motivation for this functionality is to support platforms that
mix "System RAM" and "Persistent Memory" within a single section, or
multiple PMEM ranges with different mapping lifetimes within a single
section.  The section restriction for hotplug has caused an ongoing saga
of hacks and bugs for devm_memremap_pages() users.

Beyond the fixups to teach existing paths how to retrieve the 'usemap'
from a section, and updates to the usemap allocation path, there are no
expected behavior changes.

Link: http://lkml.kernel.org/r/156092349845.979959.73333291612799019.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Wei Yang <richardw.yang@linux.intel.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>	[ppc64]
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Qian Cai <cai@lca.pw>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
avagin pushed a commit to avagin/linux that referenced this issue Jul 16, 2019
avagin pushed a commit to avagin/linux that referenced this issue Jul 17, 2019
mrchapp pushed a commit to mrchapp/linux that referenced this issue Jul 18, 2019
mrchapp pushed a commit to mrchapp/linux that referenced this issue Jul 19, 2019
Patch series "mm: Sub-section memory hotplug support", v10.

The memory hotplug section is an arbitrary / convenient unit for memory
hotplug.  'Section-size' units have bled into the user interface
('memblock' sysfs) and cannot be changed without breaking existing
userspace.  The section-size constraint, while mostly benign for typical
memory hotplug, has wreaked and continues to wreak havoc with
'device-memory' use cases, persistent memory (pmem) in particular.  Recall
that pmem uses devm_memremap_pages(), and subsequently arch_add_memory(),
to allocate a 'struct page' memmap for pmem.  However, it does not use the
'bottom half' of memory hotplug, i.e. it never marks pmem pages online and
never exposes the userspace memblock interface for pmem.  This leaves an
opening to redress the section-size constraint.

To date, the libnvdimm subsystem has attempted to inject padding to
satisfy the internal constraints of arch_add_memory().  Beyond
complicating the code, leading to bugs [2], wasting memory, and limiting
configuration flexibility, the padding hack is broken when the platform
changes the physical memory alignment of pmem from one boot to the next.
Device failure (intermittent or permanent) and physical reconfiguration
are events that can cause the platform firmware to change the physical
placement of pmem on a subsequent boot, and device failure is an everyday
event in a data-center.
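
To make the cost of section alignment concrete, here is a small arithmetic sketch in Python (not kernel code). It assumes the 128MB x86 section size mentioned in this series; the helper name and the example addresses are hypothetical, and it does not reproduce the actual libnvdimm padding logic.

```python
SECTION_SIZE = 128 << 20  # 128MB section size on x86, per this series


def bytes_lost_to_alignment(start, size, align=SECTION_SIZE):
    """Return (head, tail): the bytes before the first and after the last
    align-sized boundary inside the physical range [start, start + size).

    Hypothetical helper for illustration only.
    """
    end = start + size
    aligned_start = -(-start // align) * align  # round start up
    aligned_end = (end // align) * align        # round end down
    if aligned_end <= aligned_start:
        raise ValueError("range does not contain a full aligned unit")
    return aligned_start - start, end - aligned_end


# A 1GB range that firmware placed 64MB past a section boundary.
head, tail = bytes_lost_to_alignment((4 << 30) + (64 << 20), 1 << 30)
print(head, tail)  # 67108864 67108864, i.e. 64MB at each end
```

If firmware moves the same range by 64MB on the next boot, both values drop to zero, which illustrates why padding chosen at namespace creation time cannot stay correct across such a change.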

It turns out that sections are only a hard requirement of the user-facing
interface for memory hotplug and with a bit more infrastructure
sub-section arch_add_memory() support can be added for kernel internal
usages like devm_memremap_pages().  Here is an analysis of the design
assumptions in the current code and how they are addressed in the
new implementation:

Current design assumptions:

- Sections that describe boot memory (early sections) are never
  unplugged / removed.

- pfn_valid(), in the CONFIG_SPARSEMEM_VMEMMAP=y case, devolves to a
  valid_section() check

- __add_pages() and helper routines assume all operations occur in
  PAGES_PER_SECTION units.

- The memblock sysfs interface only comprehends full sections

New design assumptions:

- Sections are instrumented with a sub-section bitmask to track (on x86)
  individual 2MB sub-divisions of a 128MB section.

- Partially populated early sections can be extended with additional
  sub-sections, and those sub-sections can be removed with
  arch_remove_memory(). With this in place we no longer lose usable memory
  capacity to padding.

- pfn_valid() is updated to look deeper than valid_section() to also check the
  active-sub-section mask. This indication is in the same cacheline as
  the valid_section() so the performance impact is expected to be
  negligible. So far the lkp robot has not reported any regressions.

- Outside of the core vmemmap population routines which are replaced,
  other helper routines like shrink_{zone,pgdat}_span() are updated to
  handle the smaller granularity. Core memory hotplug routines that deal
  with online memory are not touched.

- The existing memblock sysfs user api guarantees / assumptions are
  not touched since this capability is limited to !online
  !memblock-sysfs-accessible sections.
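
The deeper pfn_valid() check described in the list above can be sketched as follows. This is a Python model, not the kernel implementation: the dictionary stands in for the section array, the shift constants assume x86 with 4KB pages, and all names are chosen for illustration.

```python
PAGE_SHIFT = 12        # 4KB pages (x86 assumption)
SECTION_SHIFT = 27     # 128MB sections (x86)
SUBSECTION_SHIFT = 21  # 2MB sub-sections

PFN_SECTION_SHIFT = SECTION_SHIFT - PAGE_SHIFT
PFN_SUBSECTION_SHIFT = SUBSECTION_SHIFT - PAGE_SHIFT
SUBSECTIONS_PER_SECTION = 1 << (SECTION_SHIFT - SUBSECTION_SHIFT)  # 64


def pfn_valid(subsection_maps, pfn):
    """Model of a pfn_valid() that also consults the sub-section mask.

    subsection_maps maps a section number to an integer bitmask of its
    active sub-sections; a missing key means the section is not valid.
    """
    section_nr = pfn >> PFN_SECTION_SHIFT
    if section_nr not in subsection_maps:  # the section-level check
        return False
    idx = (pfn >> PFN_SUBSECTION_SHIFT) & (SUBSECTIONS_PER_SECTION - 1)
    return bool((subsection_maps[section_nr] >> idx) & 1)


# Section 1 exists, but only its second 2MB sub-section is populated.
maps = {1: 0b10}
first_pfn = 1 << PFN_SECTION_SHIFT  # first pfn of section 1
print(pfn_valid(maps, first_pfn), pfn_valid(maps, first_pfn + 512))
```

A section-level check alone would report every pfn in section 1 as valid; the extra bit test is what lets a partially populated section be described accurately.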

Meanwhile the issue reports continue to roll in from users that do not
understand when and how the 128MB constraint will bite them.  The current
implementation relied on being able to support at least one misaligned
namespace, but that immediately falls over on any moderately complex
namespace creation attempt.  Beyond the initial problem of 'System RAM'
colliding with pmem, and the unsolvable problem of physical alignment
changes, Linux is now being exposed to platforms that collide pmem ranges
with other pmem ranges by default [3].  In short, devm_memremap_pages()
has pushed the venerable section-size constraint past the breaking point,
and the simplicity of section-aligned arch_add_memory() is no longer
tenable.

These patches are exposed to the kbuild robot on a subsection-v10 branch
[4], and a preview of the unit test for this functionality is available
on the 'subsection-pending' branch of ndctl [5].

[2]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com
[3]: pmem/ndctl#76
[4]: https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=subsection-v10
[5]: pmem/ndctl@7c59b4867e1c

This patch (of 13):

Towards enabling memory hotplug to track partial population of a section,
introduce 'struct mem_section_usage'.

A pointer to a 'struct mem_section_usage' instance replaces the existing
pointer to a 'pageblock_flags' bitmap.  Effectively it adds one more
'unsigned long' beyond the 'pageblock_flags' (usemap) allocation to house
a new 'subsection_map' bitmap.  The new bitmap enables the memory
hot{plug,remove} implementation to act on incremental sub-divisions of a
section.
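
As a rough check on why one extra 'unsigned long' is enough on x86, the sketch below computes the bitmask for a sub-section-aligned pfn range inside a single section. It is a Python illustration under the 128MB section and 2MB sub-section sizes stated in the cover letter and an assumed 64-bit 'unsigned long'; the function name is hypothetical and the error handling is an assumption of this sketch.

```python
PAGE_SHIFT = 12
SECTION_SHIFT = 27
SUBSECTION_SHIFT = 21
PAGES_PER_SECTION = 1 << (SECTION_SHIFT - PAGE_SHIFT)        # 32768
PAGES_PER_SUBSECTION = 1 << (SUBSECTION_SHIFT - PAGE_SHIFT)  # 512
SUBSECTIONS_PER_SECTION = PAGES_PER_SECTION // PAGES_PER_SUBSECTION
BITS_PER_LONG = 64  # assumed width of 'unsigned long'


def subsection_mask(start_pfn, nr_pages):
    """Bitmask of the sub-sections covered by a pfn range in one section."""
    if nr_pages <= 0:
        raise ValueError("range must not be empty")
    if start_pfn % PAGES_PER_SUBSECTION or nr_pages % PAGES_PER_SUBSECTION:
        raise ValueError("range must be sub-section aligned")
    offset = start_pfn % PAGES_PER_SECTION
    if offset + nr_pages > PAGES_PER_SECTION:
        raise ValueError("range must not cross a section boundary")
    first = offset // PAGES_PER_SUBSECTION
    count = nr_pages // PAGES_PER_SUBSECTION
    return ((1 << count) - 1) << first


# Hot-add 4MB starting 2MB into a section, then hot-remove it again.
usage = 0
mask = subsection_mask(PAGES_PER_SECTION + 512, 1024)
usage |= mask   # add: sub-sections 1 and 2 become active
usage &= ~mask  # remove: the section is empty again
```

With these sizes a section has exactly 64 sub-sections, so a fully populated section sets every bit of a 64-bit word and nothing more.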

SUBSECTION_SHIFT is defined as a global constant instead of a per-architecture
value like SECTION_SIZE_BITS in order to allow cross-arch compatibility of
subsection users.  Specifically a common subsection size allows for the
possibility that persistent memory namespace configurations be made
compatible across architectures.

The primary motivation for this functionality is to support platforms that
mix "System RAM" and "Persistent Memory" within a single section, or
multiple PMEM ranges with different mapping lifetimes within a single
section.  The section restriction for hotplug has caused an ongoing saga
of hacks and bugs for devm_memremap_pages() users.

Beyond the fixups to teach existing paths how to retrieve the 'usemap'
from a section, and updates to the usemap allocation path, there are no
expected behavior changes.

Link: http://lkml.kernel.org/r/156092349845.979959.73333291612799019.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Wei Yang <richardw.yang@linux.intel.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>	[ppc64]
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Qian Cai <cai@lca.pw>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
nathanchance pushed a commit to ClangBuiltLinux/linux that referenced this issue Jul 19, 2019
Patch series "mm: Sub-section memory hotplug support", v10.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mdeejay pushed a commit to BeastRoms-Devices/kernel_xiaomi_raphael that referenced this issue Aug 16, 2019
commit ae86cbfef3818300f1972e52f67a93211acb0e24 upstream.

Commit cfe30b8 "libnvdimm, pmem: adjust for section collisions with
'System RAM'" enabled Linux to work around occasions where platform
firmware arranges for "System RAM" and "Persistent Memory" to collide
within a single section boundary. Unfortunately, as reported in this
issue [1], platform firmware can inflict the same collision between
persistent memory regions.

The approach of interrogating iomem_resource does not work in this
case because platform firmware may merge multiple regions into a single
iomem_resource range. Instead provide a method to interrogate regions
that share the same parent bus.
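
The kind of check this implies can be sketched in Python as follows. This is not the code from the patch: the function names are hypothetical, the sibling ranges are assumed to be supplied by the caller as (start, size) pairs, and the 128MB section size is the x86 value discussed elsewhere in this thread.

```python
SECTION_SIZE = 128 << 20  # x86 section size (assumption of this sketch)


def sections_of(start, size):
    """First and last section numbers touched by [start, start + size)."""
    return start // SECTION_SIZE, (start + size - 1) // SECTION_SIZE


def shares_section(region, siblings):
    """True if any sibling range touches a section that region also touches.

    region and each sibling are (start, size) tuples in bytes.
    """
    first, last = sections_of(*region)
    for sibling in siblings:
        sib_first, sib_last = sections_of(*sibling)
        if sib_first <= last and first <= sib_last:
            return True
    return False


MB = 1 << 20
region_a = (0, 192 * MB)         # ends halfway through section 1
region_b = (192 * MB, 192 * MB)  # starts halfway through section 1
print(shares_section(region_a, [region_b]))  # True
```

Taking the sibling ranges as explicit input mirrors the point above: once firmware has merged adjacent regions into one iomem_resource range, the boundary between them is no longer visible there.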

This is a stop-gap until the core-MM can grow support for hotplug on
sub-section boundaries.

[1]: pmem/ndctl#76

Fixes: cfe30b8 ("libnvdimm, pmem: adjust for section collisions with...")
Cc: <stable@vger.kernel.org>
Reported-by: Patrick Geary <patrickg@supermicro.com>
Tested-by: Patrick Geary <patrickg@supermicro.com>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
mdeejay pushed a commit to BeastRoms-Devices/kernel_xiaomi_cepheus that referenced this issue Aug 22, 2019
commit ae86cbfef3818300f1972e52f67a93211acb0e24 upstream.

Flatty11 pushed a commit to Flatty11/SM-G9750 that referenced this issue Oct 10, 2019
commit ae86cbfef3818300f1972e52f67a93211acb0e24 upstream.

freeza-inc pushed a commit to freeza-inc/bm-galaxy-note10-sd855-pie that referenced this issue Dec 24, 2019
commit ae86cbfef3818300f1972e52f67a93211acb0e24 upstream.

Inkypen79 pushed a commit to Inkypen79/kernel_xiaomi_andromeda_old that referenced this issue Jan 15, 2020
commit ae86cbf upstream.

@yanglongkanglo

Is this problem solved?
I've seen many PRs linked to this issue, but it hasn't been closed yet.

@patr-geary-smci
Author

Yeah, the patch and upstream kernel stuff fixed it.

I didn't see any alerts until you asked on Friday; I'm sure it got memory-holed when I got sidetracked before closing. I'm surprised no one came and smacked me for leaving an issue open for so long.

krazey pushed a commit to krazey/android_kernel_motorola_exynos9610 that referenced this issue Apr 23, 2022
commit ae86cbfef3818300f1972e52f67a93211acb0e24 upstream.
