
Introduction to Consistent Hashing Algorithm and PHP Code Implementation

Author: JIYIK  Last Updated: 2025/03/19

When designing a distributed system architecture, in order to improve the system's load capacity, different data needs to be distributed to different service nodes. Therefore, a distribution mechanism is needed, which is actually an algorithm, to achieve this function. Here we use the Consistent Hashing algorithm.

Before formally introducing the Consistent Hashing algorithm, let's first look at a simple hash algorithm, which is to select nodes by taking the remainder. The specific steps are as follows:

1. Create a hash table based on the number of nodes in the cluster.
2. Compute an integer hash value from the key name, and take the remainder of that hash value divided by the number of nodes.
3. Use that remainder to look up the node in the hash table.

Suppose a cluster has n server nodes, numbered 0, 1, 2, ..., n-1, and a piece of data (key, value) needs to be stored. How do we choose the server node? Following the steps above, we compute the hash value of the key and take the remainder modulo n (the number of nodes); the result identifies the node we want. Expressed as a formula: num = hash(key) % n, where hash() is a function that computes an integer hash value. There are certain requirements on hash(): if it is well designed, the computed num values are evenly distributed over 0, 1, 2, ..., n-1, so that all server nodes are used instead of the data piling up on one or a few of them. The specific implementation of hash() is not the focus of this article.
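
To make this concrete, here is a minimal PHP sketch of remainder-based selection. crc32() is used as the hash function purely for illustration; the node names and key are hypothetical.

$nodes = array('node0', 'node1', 'node2'); // n = 3 service nodes
$key   = 'user:1001';                      // a hypothetical key
// On 64-bit PHP, crc32() returns a non-negative integer,
// so the remainder is always a valid index into $nodes.
$num = crc32($key) % count($nodes);        // num = hash(key) % n
echo $key . '=>' . $nodes[$num], "<br />";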

Although this remainder method is simple, it causes big problems when applied to a real production system. Suppose we have 23 service nodes. By the method above, a key maps to each node with probability 1/23. Now add one node: hash(key) % 23 becomes hash(key) % 24, so a key keeps its original node with probability only 1/24 and is reassigned to a different node with probability 23/24. Similarly, when a node is removed, a key is reassigned with probability 22/23.
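
This effect is easy to check empirically. The following sketch is a hypothetical experiment, not part of the original article: it counts how many of 100000 generated keys map to a different node when the cluster grows from 23 to 24 nodes under the remainder method.

// Count keys that change nodes when n goes from 23 to 24.
// Roughly 23/24 (about 95.8%) of them are expected to move.
$moved = 0;
$total = 100000;
for ($i = 0; $i < $total; $i++) {
    $hash = crc32('key_' . $i); // non-negative on 64-bit PHP
    if ($hash % 23 != $hash % 24) {
        $moved++;
    }
}
echo 'fraction moved: ' . ($moved / $total), "<br />";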

Given this, we need a way to avoid, or at least reduce, the drop in hit rate when scaling horizontally. That method is the algorithm we are going to introduce: the Consistent Hashing algorithm.

To understand how Consistent Hashing works, imagine the unit interval [0, 1) wrapped evenly around a circle in the clockwise direction.

Suppose there are n service nodes, numbered 0, 1, 2, ..., n-1. We need a hash() function to compute each node's hash value. If the chosen hash() function returns values in the range [0, R), then the formula v = hash(node) / R yields a value v distributed in the unit interval [0, 1). In this way, our service nodes can be placed on the circle.

Of course, using the unit interval [0, 1) as the circle is just one option. There are other ways to draw the circle, for example using the interval [0, 2^32) directly; in that case the chosen hash() function must return values between 0 and 2^32 - 1.

Here we still take [0, 1) as an example.
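
Placing a single node on that circle then takes one line of code. A minimal sketch, assuming crc32() as the hash function (so R = 2^32):

// Map one service node onto the unit circle [0, 1).
// crc32() returns a value in [0, 2^32), so dividing by 2^32 lands in [0, 1).
$node = '192.168.5.201';
$v = crc32($node) / pow(2, 32);
echo $node . ' sits at position ' . $v . ' on the circle', "<br />";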

Let's take 3 service nodes as an example.

Suppose these three nodes fall at scattered positions on the circle. Now suppose we have a piece of data (key, value) that needs to be stored. The next step is to map this key onto the circle using the same method.

Then, starting from the key's position on the circle, we search clockwise for a service node; the first one we encounter is where the data is stored. In our example, suppose that node is service node 1, so the data is stored on node 1.

Similarly, when other (key, value) pairs need to be stored, the service node is selected in the same way as above.

Now let's see whether this method solves the horizontal scaling problem mentioned at the beginning.

Suppose we now add a new service node 3 to the circle.

If node 3 lands between key1 and the node key1 previously mapped to, then only key1 changes its storage node; all other data still finds its original node. In general, for a cluster of n service nodes, when one node is added, a given piece of data changes its storage node with probability only 1/(n+1), which is far smaller than with the remainder method. Removing a node works the same way: only the data that was stored on the removed node, a fraction of about 1/n, has to be reassigned (to the next node clockwise). That probability is also very small.

Here is a simple PHP implementation of this process.

$nodes = array('192.168.5.201','192.168.5.102','192.168.5.111');
$keys = array('onmpw', 'jiyi', 'onmpw_key', 'jiyi_key', 'www','www_key','key1');
$buckets = array(); // hash dictionary of the nodes
$maps = array();    // stores the mapping between keys and nodes
/**
 * Build the node dictionary -- place the nodes on the circle
 * of the unit interval [0, 1)
 */
foreach ($nodes as $node) {
    $crc = crc32($node) / pow(2, 32); // CRC value, scaled into [0, 1)
    $buckets[] = array('index' => $crc, 'node' => $node);
}

/*
 * Sort the buckets by index
 */
sort($buckets);
/*
 * Hash each key to find its position on the circle,
 * then walk clockwise from that position to the first service node.
 */
foreach ($keys as $key) {
    $flag = false; // whether a service node has been found
    $crc = crc32($key) / pow(2, 32); // hash value of the key
    for ($i = 0; $i < count($buckets); $i++) {
        if ($buckets[$i]['index'] > $crc) {
            /*
             * Because buckets is already sorted, the first node whose
             * index is greater than the key's hash is the node we want.
             */
            $maps[$key] = $buckets[$i]['node'];
            $flag = true;
            break;
        }
    }
    if (!$flag) {
        // Nothing found: wrap around and use the first node in buckets
        $maps[$key] = $buckets[0]['node'];
    }
}
foreach ($maps as $key => $val) {
    echo $key . '=>' . $val, "<br />";
}

The result of running this code is as follows

onmpw=>192.168.5.102
jiyi=>192.168.5.201
onmpw_key=>192.168.5.201
jiyi_key=>192.168.5.102
www=>192.168.5.201
www_key=>192.168.5.201
key1=>192.168.5.111
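
One implementation detail worth noting: sort($buckets) compares the nested arrays element by element, which happens to order them by 'index' because that is the first element inserted into each bucket. An explicit comparator states the intent more clearly; a possible alternative (a stylistic sketch, not required by the algorithm):

// Sort buckets explicitly by their position on the circle.
usort($buckets, function ($a, $b) {
    if ($a['index'] == $b['index']) return 0;
    return ($a['index'] < $b['index']) ? -1 : 1;
});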

Then we add a service node and modify the code as follows

$nodes = array('192.168.5.201','192.168.5.102','192.168.5.111','192.168.5.11');

The rest of the code remains unchanged. Running it again produces the following output:

onmpw=>192.168.5.102
jiyi=>192.168.5.201
onmpw_key=>192.168.5.11
jiyi_key=>192.168.5.102
www=>192.168.5.201
www_key=>192.168.5.201
key1=>192.168.5.111

We can see that only onmpw_key was remapped to a different service node; all the other keys kept their original nodes.

Compared with the remainder method, the hit rate after scaling is greatly improved. Does this solve the problem we encountered earlier?

Actually, not yet, because the hash values are not distributed all that uniformly. In a real system the service nodes may end up clustered together on the circle, so that most keys map to one or a few nodes while the remaining nodes go unused. Even if this is not a fatal problem, why waste even a single server?

So this clustering concentrates the data on a few service nodes and wastes the others. How do we solve it? The approach people came up with is to create virtual nodes for each node. What does that mean? For each node j, create m replicas. Each of these m replicas gets a different hash value from the hash() function, but every one of them stores the information of node j itself. These virtual nodes are then scattered around the circle. For example, with two service nodes and three virtual nodes copied for each, all of these points (virtual nodes included) are spread around the circle.

With the virtual nodes in place, the service nodes appear evenly distributed around the circle. In summary, it is just a small refinement of the method above: copy a number of virtual nodes for each node.

Therefore, our code does not need much modification. To keep it easy to read, the entire program is listed here.

$nodes = array('192.168.5.201','192.168.5.102','192.168.5.111');
$keys = array('onmpw', 'jiyi', 'onmpw_key', 'jiyi_key', 'www','www_key','key1');
// Added variable (a change from the first version)
$replicas = 160; // number of virtual replicas per node
$buckets = array(); // hash dictionary of the nodes
$maps = array();    // stores the mapping between keys and nodes
/**
 * Build the node dictionary -- place the nodes (and their replicas)
 * on the circle of the unit interval [0, 1)
 */
foreach ($nodes as $node) {
    // Changed: create $replicas virtual nodes for each physical node
    for ($i = 1; $i <= $replicas; $i++) {
        $crc = crc32($node . '.' . $i) / pow(2, 32); // CRC value, scaled into [0, 1)
        $buckets[] = array('index' => $crc, 'node' => $node);
    }
}
/*
 * Sort the buckets by index
 */
sort($buckets);
/*
 * Hash each key to find its position on the circle,
 * then walk clockwise from that position to the first service node.
 */
foreach ($keys as $key) {
    $flag = false; // whether a service node has been found
    $crc = crc32($key) / pow(2, 32); // hash value of the key
    for ($i = 0; $i < count($buckets); $i++) {
        if ($buckets[$i]['index'] > $crc) {
            /*
             * Because buckets is already sorted, the first node whose
             * index is greater than the key's hash is the node we want.
             */
            $maps[$key] = $buckets[$i]['node'];
            $flag = true;
            break;
        }
    }
    if (!$flag) {
        // Nothing found: wrap around and use the first node in buckets
        $maps[$key] = $buckets[0]['node'];
    }
}
foreach ($maps as $key => $val) {
    echo $key . '=>' . $val, "<br />";
}

The changes are marked with comments in the code; as you can see, they are relatively minor.
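
To see how much the virtual nodes even out the distribution, you can count how many of a batch of generated keys land on each physical node. The sketch below is a hypothetical measurement harness; pick_node() is a helper introduced here for illustration, wrapping the same clockwise lookup loop used above:

// Count the key distribution across physical nodes on the ring.
function pick_node($crc, $buckets) {
    for ($i = 0; $i < count($buckets); $i++) {
        if ($buckets[$i]['index'] > $crc) {
            return $buckets[$i]['node'];
        }
    }
    return $buckets[0]['node']; // wrapped around the circle
}

$counts = array();
for ($i = 0; $i < 10000; $i++) {
    $node = pick_node(crc32('key_' . $i) / pow(2, 32), $buckets);
    $counts[$node] = isset($counts[$node]) ? $counts[$node] + 1 : 1;
}
print_r($counts); // with $replicas = 160 the counts should be roughly equal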

At this point, I believe you have a clearer understanding of Consistent Hashing. Hash algorithms of this kind are widely used, for example in memcached clusters and Nginx load balancing, so understanding them is of great help to us.

If anything in the description of the algorithm above is unclear or inaccurate, you are welcome to point it out.

