CUED Publications database

SoftRM: Self-organized fault-tolerant resource management for failure detection and recovery in NoC based many-cores

Tsoutsouras, V and Masouros, D and Xydis, S and Soudris, D (2017) SoftRM: Self-organized fault-tolerant resource management for failure detection and recovery in NoC based many-cores. ACM Transactions on Embedded Computing Systems, 16. ISSN 1539-9087

Full text not available from this repository.

Abstract

Many-core systems are envisioned to meet the ever-increasing demand for more powerful computing systems. To provide the necessary computing power, the number of Processing Elements integrated on-chip increases, and NoC-based infrastructures are adopted to address interconnection scalability. The advent of these new architectures surfaces the need for more sophisticated, distributed resource management paradigms, which, in addition to the extreme integration scaling, make the new systems more prone to errors manifested in both hardware and software. In this work, we highlight the need for run-time resource management to be enhanced with fault tolerance features and propose SoftRM, a resource management framework that can dynamically adapt to permanent failures in a self-organized, workload-aware manner. Self-organization allows the resource management agents to recover from a failure in a coordinated way by electing a new agent to replace the failed one, while workload awareness optimizes this choice according to the status of each core. We evaluate the proposed framework on the Intel Single-chip Cloud Computer (SCC), a NoC-based many-core system, and customize it to achieve minimum interference with the resource allocation process. We show that its workload-aware features manage to utilize free resources in more than 90% of the conducted experiments. Comparison with relevant state-of-the-art fault-tolerant frameworks shows a decrease of up to 67% in the overhead imposed on application execution.
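
The recovery step described in the abstract, electing a replacement agent with a workload-aware choice, can be illustrated with a minimal sketch. This is not the SoftRM algorithm from the paper. The `Core` record, the `load` metric (tasks running on a core), and the tie-break on lowest core id are all assumptions made for illustration. The sketch only shows the general idea of preferring free cores when replacing a failed manager.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Core:
    # Hypothetical per-core status record; field names are illustrative only.
    core_id: int
    alive: bool = True
    is_manager: bool = False
    load: int = 0  # assumed workload metric: number of tasks running on the core


def elect_replacement(cores):
    """Return the core chosen to take over a failed manager's role, or None."""
    # Only live cores that do not already hold a manager role may be elected.
    candidates = [c for c in cores if c.alive and not c.is_manager]
    if not candidates:
        return None
    # Lowest load wins, so idle cores are preferred; the lowest id breaks ties
    # deterministically (an assumption, so that every agent picks the same core).
    return min(candidates, key=lambda c: (c.load, c.core_id))


cores = [
    Core(0, alive=False, is_manager=True),  # the failed manager
    Core(1, load=2),
    Core(2, load=0),
    Core(3, load=0),
    Core(4, is_manager=True, load=1),
]
chosen = elect_replacement(cores)
print(chosen.core_id)  # 2
```

Because the rule depends only on state that every surviving agent is assumed to share, each agent can evaluate it locally and reach the same result. A real distributed implementation would also need failure detection and agreement on that shared state, which this sketch omits.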

Item Type: Article
Date Deposited: 06 Mar 2019 20:05
Last Modified: 10 Apr 2021 01:09
DOI: 10.1145/3126562