Math and Computer Science Faculty Working Papers

Exploring Process Groups for Reliability, Availability and Serviceability of Terascale Computing Systems

Daniel I. Okunbor, Fayetteville State University
Christian Engelmann, Oak Ridge National Laboratory
Stephen L. Scott, Oak Ridge National Laboratory

Document Type

Article

Abstract

This paper presents various aspects of reliability, availability and serviceability (RAS) systems as they relate to group communication services, including reliable and totally ordered multicast/broadcast, virtual synchrony, and failure detection. While availability, particularly high availability using replication-based architectures, has recently received a surge of research interest, much still has to be done to understand the basic concepts underlying RAS systems, especially in the high-end and high-performance computing (HPC) communities. Various attributes of group communication services and a prototype of symmetric active replication, following ideas used in the Newtop protocol, are discussed. We explore the application of group communication services to RAS for HPC, laying the groundwork for an integrated model.

Recommended Citation

Okunbor, Daniel I.; Engelmann, Christian; and Scott, Stephen L., "Exploring Process Groups for Reliability, Availability and Serviceability of Terascale Computing Systems" (2006). Math and Computer Science Faculty Working Papers. 1.
https://digitalcommons.uncfsu.edu/macsc_wp/1
