What is the CAP theorem
The CAP theorem is a well-known principle in distributed systems, where C, A, and P stand for Consistency, Availability, and Partition Tolerance respectively. The theorem says that a distributed data system can guarantee only two of these three characteristics at any given time: “Consistency and Partition Tolerance (CP)”, “Availability and Partition Tolerance (AP)”, or “Consistency and Availability (CA)”. In a distributed system, the network can fail at any time and we have no control over that, so partition tolerance is effectively a mandatory property. In practice, that leaves only two options for a distributed system: AP or CP.
Some databases, such as MongoDB, favor consistency over availability, while others, such as Cassandra, favor high availability. A few databases, such as Aerospike, can support either consistency or availability, but not both at the same time. Understanding the CAP theorem is important for choosing the database best suited to a given use case or business requirement.
Distributed System
Now, what is a distributed system? In a distributed system, multiple computers or computing devices within a network (LAN or WAN) each execute specific components of the software. The distributed components share information and coordinate with each other to produce the final result efficiently.
Distributed systems evolved because a single machine has limits. We can't keep increasing the load on one machine beyond a certain point, but the load can be spread across multiple machines, a single point of failure can be avoided, and multiple processes can run in parallel. The capacity of a distributed cluster can also be increased horizontally, i.e. by adding new nodes.
In a distributed database system, we partition the data into multiple chunks and write them to different computers to speed up reads and writes. Distributed systems also write the same data to multiple computers (data replication) to overcome a single point of failure. At the same time, a distributed system carries some risks. For example, a user may try to read from a server that has not yet received the replicated data because of network issues or time delays, or from a server that is detached from the network. That's why it is important to understand the business requirement and decide which characteristic the distributed system needs: Consistency or Availability. A distributed database system must also support partition tolerance, which means the system keeps operating even if there is a network failure between the nodes.
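The partitioning and replication described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the node names, the replication factor of 2, and the helper functions replicas_for, write, and read are all hypothetical, and real databases use more sophisticated placement schemes.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2", "node-3"]  # hypothetical node names
REPLICATION_FACTOR = 2  # assumption: every key is stored on two nodes


def replicas_for(key, nodes=NODES, rf=REPLICATION_FACTOR):
    """Pick the nodes that hold `key`: a hashed start node plus rf-1 neighbours."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(rf)]


def write(storage, key, value):
    """Store the value on every replica responsible for the key."""
    for node in replicas_for(key):
        storage.setdefault(node, {})[key] = value


def read(storage, key, down=frozenset()):
    """Read from the first responsible replica that is still reachable."""
    for node in replicas_for(key):
        if node not in down:
            return storage[node][key]
    raise RuntimeError("no reachable replica for " + key)
```

Because each key lives on two nodes, a read still succeeds when one of them is down; only when every replica of a key is unreachable does the read fail.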
Consistency in CAP theorem
In a distributed system, consistency means that all nodes in the cluster return the same value to any client at a given time. Any node of the cluster should return the last value written to any node of the cluster, so data written to one node must be replicated immediately to all other nodes. A cluster that guarantees consistency avoids stale reads, i.e. reads that return an outdated value.
Example of consistency:
In a distributed cluster, two nodes, Node-X and Node-Y, are connected to each other.
At time t1, we write the number 1 to Node-X, and it is replicated to Node-Y.
At time t2, there is a network error and Node-X is unable to connect to Node-Y. At the same time, we write the number 2 to Node-X, which is not propagated to Node-Y because of the network error. In this scenario, a distributed system that supports consistency will not allow any read from Node-Y.
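The scenario above can be simulated with a minimal Python sketch. The classes CPPrimary and CPReplica, the connected flag, and the Unavailable exception are hypothetical names invented for this illustration; a real CP database detects the partition and refuses requests through its own replication protocol.

```python
class Unavailable(Exception):
    """Raised when a node refuses a request to protect consistency."""


class CPReplica:
    """Plays the role of Node-Y."""

    def __init__(self):
        self.value = None
        self.connected = True  # False simulates the network error

    def read(self):
        # Cut off from Node-X, this node cannot know whether its copy
        # is still the latest, so it refuses to answer.
        if not self.connected:
            raise Unavailable("Node-Y cannot confirm it has the latest value")
        return self.value


class CPPrimary:
    """Plays the role of Node-X, which accepts the writes."""

    def __init__(self, replica):
        self.value = None
        self.replica = replica

    def write(self, value):
        self.value = value
        if self.replica.connected:
            self.replica.value = value  # replicate while the link is up

    def read(self):
        return self.value


node_y = CPReplica()
node_x = CPPrimary(node_y)
node_x.write(1)           # t1: 1 is replicated to Node-Y
node_y.connected = False  # t2: network error between the nodes
node_x.write(2)           # t2: 2 reaches only Node-X
```

After t2, node_x.read() returns 2, while node_y.read() raises Unavailable instead of returning the outdated 1. The system gives up availability on Node-Y to keep its answers consistent.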
Availability in CAP theorem
Availability means that every node in the cluster always returns a value to the client, but there is no guarantee that it is the latest value written to the cluster. This can happen when the latest data has not yet reached all nodes of the cluster because of network issues or time delays.
Example of availability
In a distributed cluster, two nodes, Node-X and Node-Y, are connected to each other.
At time t1, we write the number 1 to Node-X, and it is replicated to Node-Y.
At time t2, there is a network error and Node-X is unable to connect to Node-Y. At the same time, we write the number 2 to Node-X, which is not propagated to Node-Y because of the network error. In this scenario, a distributed system that supports availability will allow reads from Node-X as well as Node-Y. However, the retrieved values will differ: 1 from Node-Y and 2 from Node-X. Thus, the system stays available but gives no guarantee of consistency in the response data.
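The same two-node scenario can be sketched for an availability-first system. The APCluster class and its partitioned flag are hypothetical and exist only for this illustration; real AP databases add mechanisms to reconcile the diverged values once the network heals.

```python
class APCluster:
    """Two nodes that always answer reads, even during a network error."""

    def __init__(self):
        self.values = {"Node-X": None, "Node-Y": None}
        self.partitioned = False  # True simulates the network error

    def write(self, node, value):
        self.values[node] = value
        if not self.partitioned:
            # The link is up, so the write reaches every node.
            for other in self.values:
                self.values[other] = value

    def read(self, node):
        # Never refuses: returns whatever this node currently holds,
        # which may be stale during a partition.
        return self.values[node]


cluster = APCluster()
cluster.write("Node-X", 1)   # t1: 1 is replicated to Node-Y
cluster.partitioned = True   # t2: network error between the nodes
cluster.write("Node-X", 2)   # t2: 2 reaches only Node-X
```

Reading from Node-X now returns 2 and reading from Node-Y returns 1. Both requests succeed, but the two nodes disagree until the partition heals and the values are reconciled.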
Partition Tolerance in CAP theorem
CAP theorem databases examples
RDBMS
Relational databases such as MySQL and Oracle mostly run on a single high-capacity machine that can only be scaled vertically. So, these database systems support CA characteristics.
CAP Theorem MongoDB
Consistency – By default, MongoDB is a strongly consistent database. Consistency may weaken if the primary node fails.
https://www.mongodb.com/docs/manual/core/read-isolation-consistency-recency/#std-label-causal-consistency https://dzone.com/articles/mongodb-consistency-levels-cappaclec-theorem
Availability – MongoDB can be configured for high availability using a “Replica Set”.
https://www.mongodb.com/docs/manual/replication/#replication-in-mongodb https://www.mongodb.com/docs/v6.0/core/replica-set-high-availability/
CAP Theorem HBase
HBase is a NoSQL columnar database developed to run on the Hadoop ecosystem. It is used to hold large volumes of data that can be accessed randomly. Internally, HBase uses the Hadoop Distributed File System (HDFS) to store the data.
CAP Theorem Kudu
Kudu is a NoSQL columnar database built to run on the Hadoop ecosystem. It was developed by Cloudera and is now freely available under an Apache license (https://kudu.apache.org/).
Consistency – Kudu supports CP characteristics.
https://kudu.apache.org/faq.html#consistency-and-cap-theorem