Data Replication 101

Basic Database Replication Strategies

There are two basic replication strategies: master-slave and master-master.

Master-Slave Replication

Master-slave is the easier of the two to design and implement because one database, the master, does all of the work and simply sends its changes to the slave database at a designated interval. In the event that the master dies, the slave takes over. The problem with the master-slave model is that the slave sits idle until it is promoted to master, so half of the computing resources go unused at all times.
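The interval-based hand-off and the promotion step can be illustrated with a small in-memory sketch. It is not a real database API: the names Node, MasterSlavePair, replicate, and fail_over are hypothetical and exist only for illustration. The sketch assumes changes are buffered on the master and shipped in one batch at each interval.

```python
class Node:
    """A stand-in for one database server (hypothetical, in-memory only)."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True


class MasterSlavePair:
    def __init__(self):
        self.master = Node("A")
        self.slave = Node("B")
        self.pending = []  # changes made since the last replication run

    def write(self, key, value):
        # All user work goes to the master; the slave stays idle.
        if not self.master.alive:
            raise RuntimeError("master is down; promote the slave first")
        self.master.data[key] = value
        self.pending.append((key, value))

    def replicate(self):
        # Called once per designated interval: ship buffered changes.
        for key, value in self.pending:
            self.slave.data[key] = value
        self.pending.clear()

    def fail_over(self):
        # Promote the slave by swapping roles. Anything still buffered
        # never reached the slave, so it is returned as lost work.
        lost = list(self.pending)
        self.pending.clear()
        self.master, self.slave = self.slave, self.master
        return lost
```

In this sketch, a write made after the last replicate() call is missing from the promoted slave. That is one reason the length of the replication interval matters.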

Master-Master Replication

The master-master model is designed to address the unused resource problem. In the master-master model, both computer systems and both databases are working for users at the same time. If one system fails, the remaining system takes over the entire load of the combined systems. It runs at a slower speed until more resources are added to it or until the failed system has recovered.
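The load-sharing behavior can be sketched as a simple round-robin router. This is only one possible work delegation technique, and the Router class and its method names are hypothetical. While both systems are healthy they alternate requests. Once one is marked as failed, every request lands on the survivor.

```python
class Router:
    """Round-robin request router over two masters (illustrative only)."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._counter = 0

    def mark_failed(self, node):
        self.healthy.discard(node)

    def mark_recovered(self, node):
        self.healthy.add(node)

    def route(self):
        # Pick the next healthy node in order; with one survivor,
        # that node receives the entire combined load.
        live = [n for n in self.nodes if n in self.healthy]
        if not live:
            raise RuntimeError("no healthy system available")
        node = live[self._counter % len(live)]
        self._counter += 1
        return node


router = Router(["A", "B"])
before = [router.route() for _ in range(4)]  # requests alternate between A and B
router.mark_failed("B")
after = [router.route() for _ in range(4)]   # every request now goes to A
```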

Master-Master Challenges

The problem with the master-master model is complexity. The master-master replication strategy is an order of magnitude more difficult to design and implement. Architects must create their own conflict resolution mechanisms as well as their own work delegation techniques.

Because the data exists in two different computer systems at the same time, and users can access either system at any time, nothing stops two different users from changing the same piece of data or, worse yet, some related, nested data. If the combined changes violate business logic that either system would have enforced had both actions been attempted on one system, both systems must reconcile the data and inform the affected users.

The violation would not even be detected until the systems attempted to push their changes to each other, at which point the replication would fail because the incoming changes violate business logic. The failure would leave the two systems out of sync and could cause data inconsistencies in other parts of the system until the problems are found and corrected. If the synchronization interval were set to minutes, as opposed to seconds, the affected users might no longer be on the system when the failure occurs.
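A small sketch makes the late-detected conflict concrete. The business rule here (an account balance may never go negative) and the names Replica, ReplicationConflict, and synchronize are assumptions chosen for illustration; they do not come from any particular product. Each withdrawal passes its local check, and the violation only appears when the replicas exchange changes.

```python
class ReplicationConflict(Exception):
    """Raised when a replicated change violates the business rule."""


class Replica:
    def __init__(self, name, balance):
        self.name = name
        self.balance = balance
        self.outbox = []  # local changes not yet pushed to the peer

    def withdraw(self, amount):
        # Local check: only this replica's own view of the balance is used.
        if self.balance - amount < 0:
            raise ValueError("insufficient funds")
        self.balance -= amount
        self.outbox.append(amount)

    def apply_remote(self, amount):
        # The same rule, applied to a change arriving from the peer.
        if self.balance - amount < 0:
            raise ReplicationConflict(f"{self.name} cannot apply withdrawal of {amount}")
        self.balance -= amount


def synchronize(a, b):
    """Push each replica's outbox to its peer and collect rejected changes."""
    conflicts = []
    for source, target in ((a, b), (b, a)):
        for amount in source.outbox:
            try:
                target.apply_remote(amount)
            except ReplicationConflict:
                # Record the rejected change; it still needs reconciliation.
                conflicts.append((source.name, target.name, amount))
        source.outbox.clear()
    return conflicts


a = Replica("A", 100)
b = Replica("B", 100)
a.withdraw(80)  # accepted on A: 80 of 100 is available
b.withdraw(70)  # accepted on B: 70 of 100 is available
conflicts = synchronize(a, b)  # both pushes are rejected by the peer
```

After synchronize runs in this sketch, neither change has been applied on the other side, so the two replicas hold different balances. The conflicts list is what a reconciliation step would use to repair the data and notify the affected users.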