A parallel programming system implements dynamic load balancing to
distribute processing workload to available processors in a parallel
computer. A preprocessor in the system converts a nested parallel program
into sequential code that executes on the processors of the parallel
computer, together with calls to a message passing interface for
communication among the processors. When processing a nested parallel program, the
preprocessor inserts a test function to evaluate the computational cost of
a function call. At runtime, processors evaluate the test function to
determine whether to ship a function call to another processor. This
approach enables processors to offload function calls to other available
processors in cases where it is more efficient to incur the cost of
shipping the function call and receiving the results than it is to process
the function call on the original processor.
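
The runtime decision described above can be illustrated with a minimal sketch. The abstract does not specify the cost model, so this example assumes three things. First, a compute-cost estimate for each call is available before the call runs. Second, each message costs a fixed latency plus a per-byte transfer cost. Third, the runtime knows how many processors are idle. The names `CallEstimate`, `Link`, `should_ship`, and `dispatch` are hypothetical. No real message passing occurs; a real system would send the call and receive the results through its message passing interface. Following the abstract, the test compares what the original processor pays to ship the call and receive the results against what it would pay to run the call itself. The remote processor's compute time is excluded because it overlaps with other work on the original processor.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


# Hypothetical cost model: all costs are in the same abstract time units.
@dataclass
class CallEstimate:
    compute_cost: float  # estimated time to execute the call locally
    args_bytes: int      # size of the arguments that would be shipped
    result_bytes: int    # size of the results that would be returned


@dataclass
class Link:
    latency: float   # assumed fixed startup cost per message
    per_byte: float  # assumed transfer cost per byte


def transfer_cost(link: Link, nbytes: int) -> float:
    """Assumed linear cost of sending one message of nbytes."""
    return link.latency + link.per_byte * nbytes


def should_ship(est: CallEstimate, link: Link, idle_processors: int) -> bool:
    """Hypothetical test function: ship the call only if some processor is
    idle and sending the arguments plus receiving the results costs the
    original processor less than executing the call itself."""
    if idle_processors == 0:
        return False
    shipping = transfer_cost(link, est.args_bytes) + transfer_cost(link, est.result_bytes)
    return shipping < est.compute_cost


def dispatch(fn: Callable[[Any], Any], arg: Any, est: CallEstimate,
             link: Link, idle_processors: int) -> Tuple[str, Optional[Any]]:
    """Run the call locally or mark it as shipped. This is a placeholder for
    the real send/receive that a message passing runtime would perform."""
    if should_ship(est, link, idle_processors):
        return ("shipped", None)
    return ("local", fn(arg))


link = Link(latency=5.0, per_byte=0.01)
big = CallEstimate(compute_cost=1000.0, args_bytes=800, result_bytes=800)
small = CallEstimate(compute_cost=2.0, args_bytes=800, result_bytes=800)

print(should_ship(big, link, idle_processors=1))    # True: 26 < 1000
print(should_ship(small, link, idle_processors=1))  # False: 26 >= 2
print(should_ship(big, link, idle_processors=0))    # False: no idle processor
```

The threshold behaves as the abstract describes. Expensive calls are offloaded, while cheap calls stay local because shipping them would cost more than doing the work. Here, shipping either call costs 26 units (13 per message).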