| Authors | R. Peñaranda, E. G. Gran, T. Skeie, M. E. Gómez and P. Lopez |
| Editors | P. J. Garcia and J. Escudero-Sahuquillo |
| Title | A New Fault-Tolerant Routing Methodology for KNS Topologies |
| Affiliation | Communication Systems |
| Status | Published |
| Publication Type | Proceedings, refereed |
| Year of Publication | 2016 |
| Conference Name | 2nd IEEE International Workshop on High-Performance Interconnection Networks in the Exascale and Big-Data Era (HiPINEB) |
| Pagination | 1-8 |
| Date Published | 03/2016 |
| Publisher | IEEE |
| ISBN Number | 978-1-5090-2121-5 |
| Abstract | Exascale computing systems are being built with thousands of nodes. A key component of these systems is the interconnection network. The high number of components significantly increases the probability of failure. If failures occur in the interconnection network, they may isolate a large fraction of the machine. For this reason, an efficient fault-tolerant mechanism is needed to keep the system interconnected, even in the presence of faults. A topology recently proposed for these large systems is the hybrid KNS family, which provides good performance and connectivity at a reduced hardware cost. This paper presents a fault-tolerant routing methodology for the KNS topology that degrades performance gracefully in the presence of faults and tolerates a reasonably large number of faults without disabling any healthy node. In order to tolerate network failures, the methodology uses a simple mechanism: for some source-destination pairs, and only if necessary, packets are forwarded to the destination node through a set of intermediate nodes (without being ejected from the network), which allows faults to be avoided. The evaluation results show that the methodology tolerates a large number of faults. Furthermore, the methodology offers graceful performance degradation. For instance, performance degrades by only 1% for a 2D network with 1024 nodes and 1% faulty links. |
| DOI | 10.1109/HIPINEB.2016.9 |
| Citation Key | 23935 |

