GitHub - alexei-matveev/dlb: Dynamic Load Balancing with MPI

Branches Tags
Name		Name	Last commit message	Last commit date
Latest commit History 417 Commits
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README		README
dlb.f90		dlb.f90
dlb.h		dlb.h
dlb_assert_failed.f90		dlb_assert_failed.f90
dlb_common.f90		dlb_common.f90
dlb_impl_rma.f90		dlb_impl_rma.f90
dlb_impl_static.f90		dlb_impl_static.f90
dlb_impl_thread_common.f90		dlb_impl_thread_common.f90
dlb_impl_thread_multiple.f90		dlb_impl_thread_multiple.f90
dlb_impl_thread_single.f90		dlb_impl_thread_single.f90
dlb_mpi.f90		dlb_mpi.f90
main.f90		main.f90
module.mk		module.mk
test.f90		test.f90
thread_wrapper.c		thread_wrapper.c
Repository files navigation

# DYNAMIC LOAD BALANCING #

The library created in this sub-folder (called libdlb.a) takes care of
dynamical load balancing. The  library works solely on task identifier
(task IDs). It is the  responsibility of the calling program to launch
the actual tasks belonging to these identifiers.

The library is written in  Fortran and is requiring a Fortran compiler
and  an MPI  2 library.  The MPI  2 communication  of this  library is
supposed  to work  independent  of  the remaining  program  and it  is
therefore   not   necessary   to   synchronize  it   with   the   main
program. Additional requirements are dependent on the variant which is
chosen.

There  are three  different cases  for the  distribution of  the tasks
possible,  additional there are  4 different  variants of  the library
possible. For the latter the  choice should be related to the hardware
and software on which the library should be used. They can be combined
with every task distribution.

# ACTING OF DLB #

DLB takes  care about the available  job numbers. On  request it gives
several of these to a processor to work on. It takes care that each of
them is given back exactly once.   If a processor runs out of jobs, he
can, for DLB_VARIANT  > 0, ask the other processors  to give them some
of theirs.

# CASES OF DISTRIBUTION #

There are three different cases for the task distribution possible. They
differ in the information on the tasks or in the starting distribution.

1. All jobs are equal in name, each processor starts with a
   consecutive number of tasks.

2. Each job gets a special color by initialization, it is expected for
   efficiency that jobs with the same color are having consecutive
   numbers.

3. The jobs are equal in name but the initial distribution of the
   tasks is such that for n processors one of them will get every n'th
   task.  This interface was designed for tasks, which are ordered
   after size or expected cost, starting with the larger ones.

There are 8 different functions altogether, where two are common ones,
while of the other each two belong to one of the cases.

# DIFFERENT VARIANTS #

Which of the different variants  of the DLB implementations is used is
selected by DLB_VARIANT.

DLB_VARIANT  = 0:  is a  static distribution,  where  no communication
    between the processors is used. Intended mainly for debugging.

DLB_VARIANT = 1:  is a remote memory access (RMA)  method. When RMA is
    really asynchronous, or  done by the hardware, this  method can be
    very effective.

DLB_VARIANT  = 2 and  DLB_VARIANT =  3: Both  methods use  threads for
    making the DLB algorithm work.  They need a higher level of thread
    safety  than  the normal  MPI_INIT().   They  should  be run  with
    MPI_INIT_THREAD,  where  the  variant   2  needs  at  least  level
    MPI_THREAD_SERIALIZED, while  variant 3 needs MPI_THREAD_MULTIPLE.
    It starts some threads to handle the MPI communication between the
    processors. Variant  3 needs a  good thread handling  for blocking
    MPI  calls. If  this  thread alternates  nicely  with the  working
    thread this  method should  be better than  variant 2,  where this
    alternation  is done  by hand.  But  MPI has  often an  aggressive
    polling by blocking calls, where  the variant 3 easily doubles the
    time requirement.  The variants need also POSIX Threads available.


# INTERFACE #

All functions  have to  be called on  all processors.  The  two common
functions are:  dlb_init() and dlb_finalize().  They  have to surround
the use of the other functions.  When calling dlb_init() MPI has to be
already  initialized,  also dlb_finalize()  has  to  be called  before
MPI_FINALIZE().   Other  than  the   MPI  counterparts  there  is  the
possibility to finalize and initialize  DLB in between some uses.  The
function  dlb_init()  has the  argument  world,  which  should be  the
communicator  of MPI, of  which DLB  makes a  private copy.   Thus the
calculations with DLB should be surrounded by:

    call dlb_init(world)
    ...
    call dlb_finalize()

The calculation  itself is done  by the two functions  dlb_setup() and
dlb_give_more()  or their  counterparts with  colors dlb_setup_color()
and  dlb_give_more_color()   or  an  additional   round  robin  method
dlb_setup_rr() and dlb_give_more_rr(). It  is possible to have several
of  such  calculations to  be  performed  in  between the  two  common
functions, one  might also choose  different interfaces for  them. The
only restriction  is that  the new calculation  should not  be started
inside  the previous  one  (there  must be  the  possibility that  one
calculation has finished before the  next starts) and one must not mix
interfaces during one calculation.

The general interface:

dlb_setup() will take the number  of available jobs as an argument. It
will already perform a static distribution.

dlb_give_more() has two arguments:  the first, MAXJOBS, stands for the
number of  job indices it wants  at once.  The second  argument is the
interval jobs, which  the processor should now work  on.  The work the
processor   should   then   do    is   between   jobs(1)   +   1   and
jobs(2). dlb_give_more() is a function, which will return true as long
as there a  jobs and false if  there are none. In the  last case there
will be also jobs(1) >= jobs(2). If it returns true there is always at
least one job.  But there may be also fewer, if the processor has only
few, this  has nothing to do  with the amount of  jobs, all processors
have.  If it returns false,  all processors have terminated (except in
case  DLB_VARIANT  = 0).   Otherwise  it  will  wait in  the  function
dlb_give_more() if the termination has not yet been confirmed.

Usage:

    call dlb_setup(NJOBS)

    do while (dlb_give_more (MAXJOBS, jobs))
      ...

Here NJOBS is  integer of idlb_kind, MAXJOBS is  integer of idlb_kind,
jobs(2) is  an integer  array of idlb_kind  DLB provides  the specific
integer kind idlb_kind,  which is currently an 8-byte  integer. But as
this is  able to change and in  order to have the  code always running
with the current setting make your integer of the idlb_kind.

The variant with colors:

The  difference of  the color  case, is  that here  the color  is also
considered.  In  dlb_setup_color() one gives a  distribution, with the
colors, thus  it gets an integer  array containing the  number of jobs
for  the corresponding  color (for  the i'th  element the  color  i is
used).  dlb_give_more_color() will give  in addition to the jobs their
color.  Furthermore  it ensures that all  jobs it gives  back have the
same color.

Usage:

    call dlb_setup_color(distr)

    do while (dlb_give_more_color (MAXJOBS, color, jobs))
      ...

Here distr is integer(idlb_kind)  array of size (many_colors), MAXJOBS
and color are  integers of the idlb_kind, jobs(2)  is integer array of
idlb_kind.

The variant with round robin start distribution:

The third  case is using a  round robin over the  regions.  Thus every
process  gets every n'th  task, where  n is  the number  of processes.
There  is   a  significant  difference   between  dlb_give_more()  and
dlb_give_more_rr():  the output  jobs  is of  different  size and  has
different meaning.  For the case  of dlb_give_more() the slice of task
IDs, included in it are the  tasks with jobs(1) + 1: jobs(2): jobs(3),
where jobs(3) is the stride for the succeeding job IDs.

Usage:

    call dlb_setup_rr(NJOBS)

    do while (dlb_give_more_rr (MAXJOBS, jobs))
      ...

Additional functions:

Additionally  there  is a  function  for  printing  statistics. It  is
possible to not print the output directly (see output_level below) but
also     to    give    them     summarized    with     the    function
dlb_print_statistics(). The output of  this function is independent on
the  output_level  of the  DLB  module but  the  amount  of output  is
specified by a  variable, given directly to it.   This functions need
some reuse functions,  thus it could cause some  overhead. In case the
function   gets  level   =   0   this  step   is   also  omitted   and
dlb_print_statistics()  will  return   immediately.   For  the  static
back-end  of DLB  only level  0 and  1 give  reasonable  results.  the
statistic function expects to get an integer with 4 bytes.

The output  specified with the  level is always including  all smaller
levels:

level  | new output
0      | None
1      | SUM(time spend in dlb_give_more())  [last time separated]
2      | About how long waited for new tasks
3      | Time of last two batches spend outside DLB
4      | Statistics about task length (time between dlb_give_more()
         calls) + complete time between dlb_setup() and the finish of
         DLB

Usage:

    call dlb_print_statistics(level)

Here level is an integer with 4 bytes.

# DLB AND THREAD SAFETY #

Different  implementations   have  different  requirements  concerning
thread  safety,  especially  as  some  of  the  implementations  build
explicitly  on them.  MPI  provides some  different MPI_THREAD_LEVELS,
which  show what could  be expected  in that  regard from  the current
implementation.  Each DLB variant  can show the required thread level,
as it  is stored  in the variable  DLB_THREAD_REQUIRED, which  is also
handed over to the general DLB  wrapper.  But be aware that the thread
level is only for the usage of DLB. It is assumed that DLB will be the
only one to do some message  passing in there. If the code, which uses
DLB  should have  also some  thread level  requirements one  of course
needs to use the higher one, which should cause no problems to DLB. If
the parts of the code using  DLB contain also some message passing one
even might  to raise the level  again. This of course  does not affect
MPI_THREAD_SINGLE  and might not  even affect  MPI_THREAD_FUNNELED but
MPI_THREAD_SERIALIZED should then be raised to MPI_THREAD_MULTIPLE.


# EXAMPLE #

    use mpi
    use dlb
    integer :: ierr, prov, i
    integer(idlb_kind) :: NJOBS, MAXJOBS, jobs(2)
    integer(idlb_kind) :: distr(4), color

    call MPI_INIT_THREADS(MPI_THREAD_MULTIPLE, prov, ierr)

    call dlb_init(MPI_COMM_WORLD)

    NJOBS = 20
    MAXJOBS = 2

    call dlb_setup(NJOBS)
    do while (dlb_give_more (MAXJOBS, jobs))
       do i = jobs(1) + 1, jobs(2)
          print *, "Doing job", i
       enddo
    enddo

    MAXJOBS = 3
    distr(1) = 10
    distr(2) = 13
    distr(3) = 1
    distr(4) = 9

    call dlb_setup_color(distr)
    do while (dlb_give_more_color (MAXJOBS, color, jobs))
       do i = jobs(1) + 1, jobs(2)
          print *, "Doing job", i, "of color", color
       enddo
    enddo

    call dlb_setup_rr(NJOBS)
    do while (dlb_give_more_rr (MAXJOBS, jobs))
       do i = jobs(1) + 1, jobs(2), jobs(3)
          print *, "Doing job", i
       enddo
    enddo

    call dlb_finalize()

    call MPI_FINALIZE(ierr)

# COMPILATION #

If DLB is used external from the ParaGauss repository set DLB_EXTERNAL
= 1 on top of the DLB Makefile.

## OUTPUT LEVEL ##

DLB is able  to run without producing output. But  it may also provide
some  informations and  statistics  up to  very detailed  informations
about its  behavior.  What it  will print of  them is selected  by the
output level.  The output  level is defined by parameter OUTPUT_BORDER
in  the Makefile.  This output  is independent  to the  output  of the
print_statistics function. Be aware that every process creates its own
output independent  of the  others (expect for  OUTPUT_BORDER<=1) thus
only use them if you know that the IO routines are thread save.

OUTPUT_BORDER = 0: no output at all
OUTPUT_BORDER = 1: at initialization it will be printed which DLB
	           variant is used
OUTPUT_BORDER = 2: the DLB variants will also provide some more
                   statistics with summarized informations about the
                   run of DLB
OUTPUT_BORDER > 2: additional output will appear with time stamps at
                   some selected places