ABSTRACT: Consistency of replicated copies is difficult to maintain and recover during multiple failures of sites and network communication in a distributed database system. Transaction processing must continue as long as a single copy is available. But in a multiple failure environment, each operational site must make correct decisions about which copy to update and which one will be updated by the recovery system. This requires refreshing the copies on failed sites that missed the updates and doing this correctly while other transactions are updating and some more sites are either failing or recovering. This problem has been classified as the "replicated copy control problem." In this paper, we present several ideas that are necessary to attack and manage this problem. We introduce the ideas of session numbers, nominal session vectors, fail locks, and view serializability and discuss their role in transaction processing on operational, recovering, and partitioned sites. We have experimented with many of these ideas in a prototype system called RAID and we present the implementation issues. There is little overhead associated with our approach if no failures occur.
Key words and phrases: replicated databases, multiple site failures, consistency control of replicated copies, distributed databases