
Storage load balancing

You can use BALANCE statements to balance the distribution of partitions and Raft leaders, or to clear the partitions off some Storage hosts for maintenance. For details, see BALANCE.

Danger

The BALANCE commands migrate data and balance the partition distribution by creating and executing a set of subtasks. DO NOT stop any machine in the cluster or change its IP address until all the subtasks finish; otherwise the follow-up subtasks will fail.

Balance partition distribution

Enterprise only

Only the Enterprise Edition supports balancing the partition distribution.

Note

If a balance job in the current graph space has failed, you cannot start a new one; you can only recover the failed job. If the job keeps failing, stop it first, and then start a new balance job.

The BALANCE DATA statement starts a job that evenly distributes the partitions of the current graph space across all Storage hosts. It migrates data and balances the partition distribution by creating and executing a set of subtasks.

Example

Take scaling out NebulaGraph as an example: after a new Storage host joins the cluster, it holds no partitions.

  1. Run SHOW HOSTS to check the partition distribution.

    nebula> SHOW HOSTS;
    +-----------------+------+----------+--------------+-----------------------+------------------------+---------+
    | Host            | Port | Status   | Leader count | Leader distribution   | Partition distribution | Version |
    +-----------------+------+----------+--------------+-----------------------+------------------------+---------+
    | "192.168.8.101" | 9779 | "ONLINE" | 0            | "No valid partition"  | "No valid partition"   | "3.4.3" |
    | "192.168.8.100" | 9779 | "ONLINE" | 15           | "basketballplayer:15" | "basketballplayer:15"  | "3.4.3" |
    +-----------------+------+----------+--------------+-----------------------+------------------------+---------+
    
  2. Enter the graph space basketballplayer and run BALANCE DATA to evenly distribute all partitions.

    nebula> USE basketballplayer;
    nebula> BALANCE DATA;
    +------------+
    | New Job Id |
    +------------+
    | 25         |
    +------------+
    
  3. Run SHOW JOB <job_id> with the returned job ID to check the job status.

    nebula> SHOW JOB 25;
    +------------------------+-------------------+------------+----------------------------+----------------------------+-------------+
    | Job Id(spaceId:partId) | Command(src->dst) | Status     | Start Time                 | Stop Time                  | State       |
    +------------------------+-------------------+------------+----------------------------+----------------------------+-------------+
    | 25                     | "DATA_BALANCE"    | "FINISHED" | 2023-01-17T06:24:35.000000 | 2023-01-17T06:24:35.000000 | "SUCCEEDED" |
    | "Total:0"              | "Succeeded:0"     | "Failed:0" | "In Progress:0"            | "Invalid:0"                | ""          |
    +------------------------+-------------------+------------+----------------------------+----------------------------+-------------+
    
  4. Wait until all subtasks finish and the load balancing process ends, then run SHOW HOSTS to make sure the partitions are evenly distributed.

    Note

    BALANCE DATA does not balance the leader distribution. To balance leaders, see Balance leader distribution.

    nebula> SHOW HOSTS;
    +-----------------+------+----------+--------------+----------------------+------------------------+---------+
    | Host            | Port | Status   | Leader count | Leader distribution  | Partition distribution | Version |
    +-----------------+------+----------+--------------+----------------------+------------------------+---------+
    | "192.168.8.101" | 9779 | "ONLINE" | 7            | "basketballplayer:7" | "basketballplayer:7"   | "3.4.3" |
    | "192.168.8.100" | 9779 | "ONLINE" | 8            | "basketballplayer:8" | "basketballplayer:8"   | "3.4.3" |
    +-----------------+------+----------+--------------+----------------------+------------------------+---------+
    

If any subtask fails, run RECOVER JOB <job_id>. If re-running the balance job still does not solve the problem, ask for help in the NebulaGraph community.
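The start-then-poll loop in the steps above is easy to automate. Below is a minimal Python sketch; the `execute` callable is a placeholder for whatever client you use (for example, a nebula3-python session wrapper) and is assumed to return only the job's status string.

```python
import time

def wait_for_balance(execute, job_id, poll_interval=0.0, max_polls=100):
    """Poll SHOW JOB until the balance job reaches a terminal state.

    `execute` is a placeholder callable: it runs an nGQL statement and
    returns the job's status string ("RUNNING", "FINISHED", "FAILED",
    "STOPPED"). Wire it to a real client session in production.
    """
    for _ in range(max_polls):
        status = execute(f"SHOW JOB {job_id}")
        if status in ("FINISHED", "FAILED", "STOPPED"):
            return status
        time.sleep(poll_interval)  # avoid hammering the Meta service
    raise TimeoutError(f"job {job_id} still running after {max_polls} polls")

# Simulated run: the job reports RUNNING twice, then FINISHED.
statuses = iter(["RUNNING", "RUNNING", "FINISHED"])
print(wait_for_balance(lambda stmt: next(statuses), 25))  # FINISHED
```

On a `FAILED` result, the caller can issue RECOVER JOB and poll again.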

Stop a balance job

To stop a balance job, run STOP JOB <job_id>.

  • If no balance job is running, an error is returned.
  • If a balance job is running, Job stopped is returned.

Note

STOP JOB <job_id> does not stop the subtasks that are already being executed, but cancels all the follow-up subtasks by setting their status to INVALID. It then waits for the executing subtasks to finish, and sets their status to SUCCEEDED or FAILED based on the result. You can run SHOW JOB <job_id> to check the status of the stopped job.
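These semantics amount to a simple status transition. The following Python sketch is an illustrative model, not NebulaGraph source code; "QUEUED" is a stand-in label for a subtask that has not started yet.

```python
def stop_job(subtask_statuses):
    """Model of STOP JOB semantics: subtasks that have not started are
    cancelled (set to INVALID); subtasks already running, or already
    finished, are left alone and settle to SUCCEEDED/FAILED on their own.
    "QUEUED" is an assumed label for a not-yet-started subtask.
    """
    return [
        "INVALID" if status == "QUEUED" else status
        for status in subtask_statuses
    ]

# One subtask succeeded, one is still running, two have not started yet.
print(stop_job(["SUCCEEDED", "RUNNING", "QUEUED", "QUEUED"]))
# ['SUCCEEDED', 'RUNNING', 'INVALID', 'INVALID']
```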

Recover a balance job

To recover a balance job, run RECOVER JOB <job_id>.

Note

  • A failed job can be recovered.
  • For a stopped job, NebulaGraph checks whether a failed or finished job of the same type started after the stopped job's start time; if one exists, the stopped job cannot be recovered. For example, given stopped job1 -> finished job2 -> stopped job3, only job3 can be recovered, while job1 cannot.
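The recoverability rule above can be written as a small predicate. This Python sketch is an illustrative reading of the rule, not NebulaGraph source; jobs of the same type are modeled as (job_id, start_time, status) tuples.

```python
def can_recover(jobs, target_id):
    """A failed job is always recoverable. A stopped job is recoverable
    only if no failed or finished job of the same type started after it.
    `jobs` is a list of (job_id, start_time, status) tuples, all of the
    same job type (an assumed in-memory model of the job history).
    """
    _, target_start, target_status = next(j for j in jobs if j[0] == target_id)
    if target_status == "FAILED":
        return True
    if target_status != "STOPPED":
        return False
    # Unrecoverable if any failed/finished job started later.
    return not any(
        start > target_start and status in ("FAILED", "FINISHED")
        for _, start, status in jobs
    )

# The example from the note: stopped job1 -> finished job2 -> stopped job3.
history = [(1, 10, "STOPPED"), (2, 20, "FINISHED"), (3, 30, "STOPPED")]
print(can_recover(history, 1), can_recover(history, 3))  # False True
```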

Migrate partitions

To scale in a cluster, you can migrate the partitions away from specified Storage hosts with BALANCE DATA REMOVE <ip>:<port> [,<ip>:<port> ...].

For example, to migrate the partitions on 192.168.8.100:9779, run the following commands:

nebula> BALANCE DATA REMOVE 192.168.8.100:9779;
nebula> SHOW HOSTS;
+-----------------+------+----------+--------------+-----------------------+------------------------+---------+
| Host            | Port | Status   | Leader count | Leader distribution   | Partition distribution | Version |
+-----------------+------+----------+--------------+-----------------------+------------------------+---------+
| "192.168.8.101" | 9779 | "ONLINE" | 15           | "basketballplayer:15" | "basketballplayer:15"  | "3.4.3" |
| "192.168.8.100" | 9779 | "ONLINE" | 0            | "No valid partition"  | "No valid partition"   | "3.4.3" |
+-----------------+------+----------+--------------+-----------------------+------------------------+---------+

Note

This command only migrates partitions; it does not remove the Storage host from the cluster. To remove a Storage host, see Manage Storage hosts.

Balance leader distribution

You can use the BALANCE LEADER command to balance the leader distribution.

Example

nebula> BALANCE LEADER;

You can run SHOW HOSTS to check the result.

nebula> SHOW HOSTS;
+------------------+------+----------+--------------+-----------------------------------+------------------------+---------+
| Host             | Port | Status   | Leader count | Leader distribution               | Partition distribution | Version |
+------------------+------+----------+--------------+-----------------------------------+------------------------+---------+
| "192.168.10.100" | 9779 | "ONLINE" | 4            | "basketballplayer:3"              | "basketballplayer:8"   | "3.4.3" |
| "192.168.10.101" | 9779 | "ONLINE" | 8            | "basketballplayer:3"              | "basketballplayer:8"   | "3.4.3" |
| "192.168.10.102" | 9779 | "ONLINE" | 3            | "basketballplayer:3"              | "basketballplayer:8"   | "3.4.3" |
| "192.168.10.103" | 9779 | "ONLINE" | 0            | "basketballplayer:2"              | "basketballplayer:7"   | "3.4.3" |
| "192.168.10.104" | 9779 | "ONLINE" | 0            | "basketballplayer:2"              | "basketballplayer:7"   | "3.4.3" |
| "192.168.10.105" | 9779 | "ONLINE" | 0            | "basketballplayer:2"              | "basketballplayer:7"   | "3.4.3" |
+------------------+------+----------+--------------+-----------------------------------+------------------------+---------+

Caution

In NebulaGraph 3.4.3, leader switching causes a short burst of request errors (Storage Error E_RPC_FAILURE). For how to handle such errors, see the FAQ.
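Until the leaders settle, clients can shield themselves from these transient errors with a retry wrapper. Below is a minimal Python sketch; the error-matching string and the zero-argument callable are assumptions to adapt to your client, not part of the official NebulaGraph client API.

```python
import time

def with_retry(fn, retries=3, backoff=0.1):
    """Retry a query that fails with the transient E_RPC_FAILURE error
    raised while leaders are switching. `fn` is any zero-argument
    callable that issues the query and raises RuntimeError on failure
    (an assumed error model; adapt to your client's exception type).
    """
    for attempt in range(retries):
        try:
            return fn()
        except RuntimeError as err:
            # Re-raise immediately for other errors, or once retries run out.
            if "E_RPC_FAILURE" not in str(err) or attempt == retries - 1:
                raise
            time.sleep(backoff * 2 ** attempt)  # exponential backoff

# Simulated flaky query: fails once during a leader switch, then succeeds.
calls = {"n": 0}
def flaky_query():
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("Storage Error E_RPC_FAILURE")
    return "ok"

print(with_retry(flaky_query, backoff=0.0))  # ok
```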


Last update: September 4, 2023