| Authors | F. O. Sem-Jacobsen and T. Skeie | 
| Editors | P. S. Al. | 
| Title | Maintaining Quality of Service With Dynamic Fault Tolerance in Fat Trees | 
| Affiliation | Communication Systems | 
| Status | Published | 
| Publication Type | Proceedings, refereed | 
| Year of Publication | 2008 | 
| Conference Name | International Conference on High Performance Computing (HiPC) | 
| Volume | 1 | 
| Pagination | 451-464 | 
| Date Published | December | 
| Publisher | Springer-Verlag | 
| Place Published | Berlin | 
| ISBN Number | 3-54089893-x | 
| Abstract | A very important ingredient in the computing landscape is Utility Computing Data Centres (UCDCs), large-scale computing systems that offer computational services to concurrently running applications. In a UCDC, virtual servers containing a subset of the available resources are dynamically created to fulfil user demands. Typically, each virtual server will have its own service level agreement, which should to the largest extent be unaffected by the behaviour of all other virtual servers in the system. As UCDC systems increase in size and the mean time between failures decreases, it is becoming an increasingly important challenge to expediently tolerate failures (dynamically), while distributing the effects of the failure amongst the virtual servers according to their service level agreements. In this paper we propose and evaluate a strategy for offering predictable service in fat trees experiencing faults, by reprioritising packets. The strategy is able to distribute the effect of network faults in order to satisfy a number of quality of service demands. These may include guaranteeing that high-priority packets not encountering the fault are unaffected by the fault event, guaranteeing high network throughput for all high-priority traffic, or ensuring that the negative effects of the fault are evenly and fairly spread throughout the network. We find that which demands to favour depends on the computer system and the characteristics of the applications it is running, and that in the presence of a moderate number of faults it is to some degree possible to meet the demands. | 
| Citation Key | Simula.ND.60 |