Grain Size

Hüseyin Tuğrul BÜYÜKIŞIK edited this page Apr 9, 2017 · 10 revisions

Grain size is the smallest non-partitionable group of workitems for calculations in the Cekirdekler API and OpenCL. Each device must be given at least one grain of workitems, and the total work size must be an exact multiple of the grain size. At each execution of a compute method with the same compute-id, the load balancer trades grains between devices: devices holding excess grains give them up to devices that need more.

  • Single GPU, no pipelining: grain size = local (workgroup) size. The user must choose a local size that is an exact divisor of the global size.
  • Single GPU, N-blob pipelining: grain size = local size * N. Global size must be an exact multiple of this value.
  • M GPUs, no pipelining: grain size = local (workgroup) size. This is the number of workitems moved between devices when balancing. Global size must be an exact multiple of this value and at least local size * M.
  • M GPUs, N blobs for pipelining: pipelining blobs are per device. Grain size = local size * N, and this is the unit swapped between devices for load balancing. Global size must be an exact multiple of this value and at least local size * N * M.

When the global size equals this minimum, every device holds exactly one grain, so the work can only be split equally: moving a grain would leave one device with no work at all.

If pipelining is enabled, the number of pipeline blobs (N) has to be a multiple of 4 (minimum 4).
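
The rules above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not part of the Cekirdekler API: the function names `grain_size`, `min_global_size` and `check_global_size` are hypothetical, and `blobs=0` is used here as an assumed way of saying "pipelining disabled".

```python
def grain_size(local_size, blobs=0):
    """Smallest tradable unit of work (hypothetical helper).

    blobs == 0 means pipelining is disabled (an assumption of this sketch).
    """
    if blobs:
        # With pipelining, N must be a multiple of 4 and at least 4.
        if blobs < 4 or blobs % 4 != 0:
            raise ValueError("blob count must be a multiple of 4 (minimum 4)")
        return local_size * blobs
    return local_size


def min_global_size(local_size, devices=1, blobs=0):
    """Every device needs at least one grain."""
    return grain_size(local_size, blobs) * devices


def check_global_size(global_size, local_size, devices=1, blobs=0):
    """Return a list of rule violations (empty list means the setup is valid)."""
    problems = []
    grain = grain_size(local_size, blobs)
    if global_size % grain != 0:
        problems.append("global size is not an exact multiple of grain size")
    if global_size < grain * devices:
        problems.append("global size is below grain size * number of devices")
    return problems


if __name__ == "__main__":
    # 3 GPUs, 1024 workitems, local size 256, no pipelining: valid.
    print(check_global_size(1024, 256, devices=3))
    # Same setup with 4 blobs: grain is 1024, so 3 devices need 3072 workitems.
    print(check_global_size(1024, 256, devices=3, blobs=4))
```

An empty list means the chosen global size, local size, device count and blob count satisfy the constraints listed above; it says nothing about how well the load can be balanced.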

Example:

  • 3 GPUs, 1024 workitems total, 256 local size:

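    • load balancing: bad (there are only 4 grains for 3 devices, so one device always holds 512 workitems and the other two hold 256 each)
    • pipelining: not possible (even the minimum of 4 blobs gives a grain size of 1024, so 3 devices would need at least 3072 workitems)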
  • 3 GPUs, 1024 workitems total, 64 local size:

    • load balancing: good (a device can have any multiple of 64 between 64 and 896)
    • pipelining: can have 4 blobs per device, but with poor load balancing, since the grain size is now 256 (still an exact divisor of 1024) and only 4 grains are shared by 3 devices
  • 3 GPUs, 16384 workitems total, 128 local size

    • load balancing: fine grained
    • pipelining: can have 8 blobs per device with acceptable load balancing (grain size 1024, 16 grains), or 32 blobs per device with poor load balancing (grain size 4096, only 4 grains).
  • 5 GPUs, 1M (1024 * 1024 = 1,048,576) workitems total, 256 local size

    • load balancing: fine grained
    • pipelining: can have 64 blobs per device, which makes the grain size 16k (16,384), about 1.56 percent of the total work. Load balancing is still fine grained: each device has a minimum of 16k workitems and trades multiples of 16k workitems with other devices.
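
The numbers in these examples can be reproduced with a short Python sketch. It is not Cekirdekler code: `balance_summary` is a hypothetical helper, `blobs=0` is assumed to mean "no pipelining", and "1M" is taken as 1024 * 1024 so that it is an exact multiple of the grain size.

```python
def balance_summary(global_size, local_size, devices, blobs=0):
    """Summarize how coarse load balancing is for a given setup.

    Returns None when the setup breaks the grain-size rules
    (not an exact multiple, or fewer grains than devices).
    """
    grain = local_size * blobs if blobs else local_size
    grains = global_size // grain
    if global_size % grain != 0 or grains < devices:
        return None
    return {
        "grain": grain,
        "grains": grains,
        # Every device keeps at least one grain ...
        "min_per_device": grain,
        # ... so one device can hold at most what the others leave over.
        "max_per_device": global_size - (devices - 1) * grain,
        "grain_percent": 100.0 * grain / global_size,
    }


if __name__ == "__main__":
    setups = [
        (1024, 256, 3, 0),
        (1024, 256, 3, 4),
        (1024, 64, 3, 0),
        (1024, 64, 3, 4),
        (16384, 128, 3, 8),
        (16384, 128, 3, 32),
        (1024 * 1024, 256, 5, 64),
    ]
    for setup in setups:
        print(setup, balance_summary(*setup))
```

The fewer grains there are per device, the coarser the balancing: a setup with 4 grains on 3 devices can only shift a quarter of the work at a time, while 64 grains on 5 devices can shift it in steps of about 1.56 percent.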