Storage load balance¶

Users can use the BALANCE statement to balance the distribution of partitions and Raft leaders, or clear some Storage servers for easy maintenance. For details, see BALANCE.
Danger
The BALANCE commands migrate data and balance the partition distribution by creating and executing a set of subtasks. Do not stop any machine in the cluster or change its IP address until all the subtasks finish. Otherwise, the follow-up subtasks fail.
Balance partition distribution¶

Enterprise only

Only the Enterprise Edition supports balancing the partition distribution.
Note
If the current graph space already has a failed data balance job, you cannot start a new one; you can only restore the failed job. If the job keeps failing, stop it first, and then start a new data balance job.
The BALANCE DATA statement starts a job that evenly distributes the partitions of the current graph space across all Storage servers. It migrates data and balances the partition distribution by creating and executing a set of subtasks.
Example¶

Take scaling out Nebula Graph as an example: after a new Storage host is added to the cluster, there are no partitions on the new host.
- Run SHOW HOSTS to check the partition distribution.

  nebula> SHOW HOSTS;
  +-----------------+------+-----------+----------+--------------+-----------------------+------------------------+-------------+
  | Host            | Port | HTTP port | Status   | Leader count | Leader distribution   | Partition distribution | Version     |
  +-----------------+------+-----------+----------+--------------+-----------------------+------------------------+-------------+
  | "192.168.8.101" | 9779 | 19669     | "ONLINE" | 0            | "No valid partition"  | "No valid partition"   | "3.1.0-ent" |
  | "192.168.8.100" | 9779 | 19669     | "ONLINE" | 15           | "basketballplayer:15" | "basketballplayer:15"  | "3.1.0-ent" |
  +-----------------+------+-----------+----------+--------------+-----------------------+------------------------+-------------+
- Enter the graph space basketballplayer, and then run BALANCE DATA to balance the distribution of all partitions.

  nebula> USE basketballplayer;
  nebula> BALANCE DATA;
  +------------+
  | New Job Id |
  +------------+
  | 2          |
  +------------+
- Run SHOW JOB <job_id> with the returned job ID to check the status of the job.

  nebula> SHOW JOB 2;
  +------------------------+------------------------------------------+-------------+---------------------------------+---------------------------------+-------------+
  | Job Id(spaceId:partId) | Command(src->dst)                        | Status      | Start Time                      | Stop Time                       | Error Code  |
  +------------------------+------------------------------------------+-------------+---------------------------------+---------------------------------+-------------+
  | 2                      | "DATA_BALANCE"                           | "FINISHED"  | "2022-04-12T03:41:43.000000000" | "2022-04-12T03:41:53.000000000" | "SUCCEEDED" |
  | "2, 1:1"               | "192.168.8.100:9779->192.168.8.101:9779" | "SUCCEEDED" | 2022-04-12T03:41:43.000000      | 2022-04-12T03:41:53.000000      | "SUCCEEDED" |
  | "2, 1:2"               | "192.168.8.100:9779->192.168.8.101:9779" | "SUCCEEDED" | 2022-04-12T03:41:43.000000      | 2022-04-12T03:41:53.000000      | "SUCCEEDED" |
  | "2, 1:3"               | "192.168.8.100:9779->192.168.8.101:9779" | "SUCCEEDED" | 2022-04-12T03:41:43.000000      | 2022-04-12T03:41:53.000000      | "SUCCEEDED" |
  | "2, 1:4"               | "192.168.8.100:9779->192.168.8.101:9779" | "SUCCEEDED" | 2022-04-12T03:41:43.000000      | 2022-04-12T03:41:53.000000      | "SUCCEEDED" |
  | "2, 1:5"               | "192.168.8.100:9779->192.168.8.101:9779" | "SUCCEEDED" | 2022-04-12T03:41:43.000000      | 2022-04-12T03:41:53.000000      | "SUCCEEDED" |
  | "2, 1:6"               | "192.168.8.100:9779->192.168.8.101:9779" | "SUCCEEDED" | 2022-04-12T03:41:43.000000      | 2022-04-12T03:41:43.000000      | "SUCCEEDED" |
  | "2, 1:7"               | "192.168.8.100:9779->192.168.8.101:9779" | "SUCCEEDED" | 2022-04-12T03:41:43.000000      | 2022-04-12T03:41:53.000000      | "SUCCEEDED" |
  | "Total:7"              | "Succeeded:7"                            | "Failed:0"  | "In Progress:0"                 | "Invalid:0"                     | ""          |
  +------------------------+------------------------------------------+-------------+---------------------------------+---------------------------------+-------------+
- Wait until all the subtasks finish and the load balancing job ends, then run SHOW HOSTS to make sure the partitions are evenly distributed.

  Note

  BALANCE DATA does not balance the leader distribution. To balance the leaders, see Balance leader distribution below.

  nebula> SHOW HOSTS;
  +-----------------+------+-----------+----------+--------------+----------------------+------------------------+-------------+
  | Host            | Port | HTTP port | Status   | Leader count | Leader distribution  | Partition distribution | Version     |
  +-----------------+------+-----------+----------+--------------+----------------------+------------------------+-------------+
  | "192.168.8.101" | 9779 | 19669     | "ONLINE" | 7            | "basketballplayer:7" | "basketballplayer:7"   | "3.1.0-ent" |
  | "192.168.8.100" | 9779 | 19669     | "ONLINE" | 8            | "basketballplayer:8" | "basketballplayer:8"   | "3.1.0-ent" |
  +-----------------+------+-----------+----------+--------------+----------------------+------------------------+-------------+
If any subtask fails, run RECOVER JOB <job_id> to restart the job. If redoing the load balancing still does not solve the problem, ask for help in the Nebula Graph community.
Stop a load balancing job¶

To stop a load balancing job, run STOP JOB <job_id>.
- If there is no running load balancing job, an error is returned.
- If there is a running load balancing job, Job stopped is returned.
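For example, to stop the data balance job started above (a minimal sketch, assuming the job ID returned earlier is 2):

nebula> STOP JOB 2;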
Note
STOP JOB <job_id> does not stop the subtasks that are already running. Instead, it cancels all the follow-up subtasks and sets their status to INVALID, then waits for the running subtasks to finish and sets their status to SUCCEEDED or FAILED based on the results. You can run SHOW JOB <job_id> to check the status of the stopped job.
Restore a load balancing job¶

To restore a load balancing job, run RECOVER JOB <job_id>.
Note
- Failed jobs can be restored.
- For a stopped job, Nebula Graph checks whether a failed job or a finished job of the same type started after the start time of the stopped job. If so, the stopped job cannot be restored. For example, given stopped job1 -> finished job2 -> stopped job3, only job3 can be restored; job1 cannot.
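For example, to restore the stopped job above (a minimal sketch, again assuming job ID 2):

nebula> RECOVER JOB 2;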
Migrate partitions¶

To migrate the partitions on specified Storage hosts and scale in the cluster, use the BALANCE DATA REMOVE <ip:port> [,<ip>:<port> ...] command.

For example, to migrate the partitions on 192.168.8.100:9779, run the following command:
nebula> BALANCE DATA REMOVE 192.168.8.100:9779;
nebula> SHOW HOSTS;
+-----------------+------+-----------+----------+--------------+-----------------------+------------------------+-------------+
| Host            | Port | HTTP port | Status   | Leader count | Leader distribution   | Partition distribution | Version     |
+-----------------+------+-----------+----------+--------------+-----------------------+------------------------+-------------+
| "192.168.8.101" | 9779 | 19669     | "ONLINE" | 15           | "basketballplayer:15" | "basketballplayer:15"  | "3.1.0-ent" |
| "192.168.8.100" | 9779 | 19669     | "ONLINE" | 0            | "No valid partition"  | "No valid partition"   | "3.1.0-ent" |
+-----------------+------+-----------+----------+--------------+-----------------------+------------------------+-------------+
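If more than one host needs to be cleared, list the hosts separated by commas. A minimal sketch (192.168.8.102:9779 is a hypothetical extra host, not part of the cluster shown above):

nebula> BALANCE DATA REMOVE 192.168.8.100:9779,192.168.8.102:9779;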
Note
This command only migrates the partitions; it does not remove the Storage hosts from the cluster. To remove the Storage hosts, see Manage Storage hosts.
Balance leader distribution¶

Users can use the BALANCE LEADER command to balance the leader distribution.

Example¶
nebula> BALANCE LEADER;
Users can run SHOW HOSTS to check the result.
nebula> SHOW HOSTS;
+------------------+------+-----------+----------+--------------+-----------------------------------+------------------------+---------+
| Host             | Port | HTTP port | Status   | Leader count | Leader distribution               | Partition distribution | Version |
+------------------+------+-----------+----------+--------------+-----------------------------------+------------------------+---------+
| "192.168.10.100" | 9779 | 19669     | "ONLINE" | 4            | "basketballplayer:3"              | "basketballplayer:8"   | "3.1.0" |
| "192.168.10.101" | 9779 | 19669     | "ONLINE" | 8            | "basketballplayer:3"              | "basketballplayer:8"   | "3.1.0" |
| "192.168.10.102" | 9779 | 19669     | "ONLINE" | 3            | "basketballplayer:3"              | "basketballplayer:8"   | "3.1.0" |
| "192.168.10.103" | 9779 | 19669     | "ONLINE" | 0            | "basketballplayer:2"              | "basketballplayer:7"   | "3.1.0" |
| "192.168.10.104" | 9779 | 19669     | "ONLINE" | 0            | "basketballplayer:2"              | "basketballplayer:7"   | "3.1.0" |
| "192.168.10.105" | 9779 | 19669     | "ONLINE" | 0            | "basketballplayer:2"              | "basketballplayer:7"   | "3.1.0" |
+------------------+------+-----------+----------+--------------+-----------------------------------+------------------------+---------+
Caution
In Nebula Graph 3.1.0, switching leaders causes a large number of short-term request errors (Storage Error E_RPC_FAILURE). For solutions, see FAQ.