Clusters and distributed systems offer fault tolerance and high performance through load sharing. When all n computers are up and running, we would like the load to be evenly distributed among the computers. When one or more computers break down, the load on these computers must be redistributed to other computers in the system. The redistribution is determined by the recovery scheme. The recovery scheme is governed by a sequence of integers modulo n. Each sequence guarantees minimal load on the computer that has maximal load even when the most unfavorable combinations of computers go down. We calculate the best possible such recovery schemes for any number of crashed computers by an exhaustive search, where brute force testing is avoided by a mathematical reformulation of the problem and a branch-and-bound algorithm. The search nevertheless has a high complexity. Optimal sequences, and thus a corresponding optimal bound, are presented for a maximum of twenty one computers in the distributed system or cluster.
Consider a distributed system consisting of n computers connected by a number of identical broadcast channels. All computers may receive messages from all channels. We distinguish between two kinds of systems: systems in which the computers may send on any channel (dynamic allocation) and system where the send port of each computer is statically allocated to a particular channel. A distributed task (application) is executed on the distributed system. A task performs execution as well as communication between its subtasks. We compare the completion time of the communication for such a task using dynamic allocation and k(d) channels with the completion time using static allocation and k(s) channels. Some distributed tasks will benefit very much from allowing dynamic allocation, whereas others will work fine with static allocation. In this paper we define optimal upper and lower bounds on the gain (or loss) of using dynamic allocation and k(d) channels compared to static allocation and k(s) channels. Our results show that, for some tasks, the gain of permitting dynamic allocation is substantial, e.g. when k(s) = k(d) = 3, there are tasks which will complete 1.89 times faster using dynamic allocation compared to using the best possible static allocation, but there are no tasks with a higher such ratio.