查询 NebulaGraph 监控指标¶

NebulaGraph 支持多种方式查询服务的监控指标，本文将介绍最基础的方式，即通过 HTTP 端口查询。

监控指标结构说明¶

NebulaGraph 的每个监控指标都由三个部分组成，中间用英文句号（.）隔开，例如num_queries.sum.600。不同的 NebulaGraph 服务支持查询的监控指标也不同。指标结构的说明如下。

类别	示例	说明
指标名称	`num_queries`	简单描述指标的含义。
统计类型	`sum`	指标统计的方法。当前支持 SUM、AVG、RATE 和 P 分位数（P75、P95、P99、P999）。
统计时间	`600`	指标统计的时间范围，当前支持 5 秒、60 秒、600 秒和 3600 秒，分别表示最近 5 秒、最近 1 分钟、最近 10 分钟和最近 1 小时。

图空间监控指标¶

Graph 服务支持一系列基于图空间的监控指标（Space Level Metrics），对不同图空间的数据分别记录。

如需开启图空间监控指标，先在 Graph 服务的配置文件中将enable_space_level_metrics参数的值修改为true，再启动 NebulaGraph。修改配置的详细方式参见配置管理。

Note

图空间指标只能通过查询所有监控指标的形式查询到，例如curl -G "http://192.168.8.40:19559/stats"，返回结果中以{space=space_name}的形式包含图空间名称，例如num_active_queries{space=basketballplayer}.sum.5=0。

通过 HTTP 端口查询监控指标¶

语法¶

curl -G "http://<ip>:<port>/stats?stats=<metric_name_list> [&format=json]"

选项	说明
`ip`	服务器的 IP 地址，可以在安装目录内查看配置文件获取。
`port`	服务器的 HTTP 端口，可以在安装目录内查看配置文件获取。默认情况下，Meta 服务端口为 19559，Graph 服务端口为 19669，Storage 服务端口为 19779。
`metric_name_list`	监控指标名称，多个监控指标用英文逗号（,）隔开。
`&format=json`	将结果以 JSON 格式返回。

Note

如果 NebulaGraph 服务部署在容器中，需要执行docker-compose ps命令查看映射到容器外部的端口，然后通过该端口查询。

示例¶

查询单个监控指标

查询 Graph 服务中，最近 10 分钟的请求总数。

$ curl -G "http://192.168.8.40:19669/stats?stats=num_queries.sum.600"
num_queries.sum.600=400

查询多个监控指标

查询 Meta 服务中，最近 1 分钟的心跳平均延迟和最近 10 分钟 P99 心跳（1%最慢的心跳）的平均延迟。

$ curl -G "http://192.168.8.40:19559/stats?stats=heartbeat_latency_us.avg.60,heartbeat_latency_us.p99.600"
heartbeat_latency_us.avg.60=281
heartbeat_latency_us.p99.600=985

查询监控指标并以 JSON 格式返回

查询 Storage 服务中，最近 10 分钟新增的点数量，并以 JSON 格式返回结果。

$ curl -G "http://192.168.8.40:19779/stats?stats=num_add_vertices.sum.600&format=json"
[{"value":1,"name":"num_add_vertices.sum.600"}]

查询服务器的所有监控指标

不指定查询某个监控指标时，会返回该服务器上所有的监控指标。

$ curl -G "http://192.168.8.40:19559/stats"
heartbeat_latency_us.avg.5=304
heartbeat_latency_us.avg.60=308
heartbeat_latency_us.avg.600=299
heartbeat_latency_us.avg.3600=285
heartbeat_latency_us.p75.5=652
heartbeat_latency_us.p75.60=669
heartbeat_latency_us.p75.600=651
heartbeat_latency_us.p75.3600=642
heartbeat_latency_us.p95.5=930
heartbeat_latency_us.p95.60=963
heartbeat_latency_us.p95.600=933
heartbeat_latency_us.p95.3600=929
heartbeat_latency_us.p99.5=986
heartbeat_latency_us.p99.60=1409
heartbeat_latency_us.p99.600=989
heartbeat_latency_us.p99.3600=986
num_heartbeats.rate.5=0
num_heartbeats.rate.60=0
num_heartbeats.rate.600=0
num_heartbeats.rate.3600=0
num_heartbeats.sum.5=2
num_heartbeats.sum.60=40
num_heartbeats.sum.600=394
num_heartbeats.sum.3600=2364
...

监控指标说明¶

Graph¶

参数	说明
`num_active_queries`	活跃的查询语句数的变化数。计算公式：时间范围内开始执行的语句数减去执行完毕的语句数。
`num_active_sessions`	活跃的会话数的变化数。计算公式：时间范围内登录的会话数减去登出的会话数。例如查询num_active_sessions.sum.5，过去 5 秒中登录了 10 个会话数，登出了 30 个会话数，那么该指标值就是`-20`（10-30）。
`num_aggregate_executors`	聚合（Aggregate）算子执行时间。
`num_auth_failed_sessions_bad_username_password`	因用户名密码错误导验证失败的会话数量。
`num_auth_failed_sessions_out_of_max_allowed`	因为超过`FLAG_OUT_OF_MAX_ALLOWED_CONNECTIONS`参数导致的验证登录的失败的会话数量。
`num_auth_failed_sessions`	登录验证失败的会话数量。
`num_indexscan_executors`	索引扫描（IndexScan）算子执行时间。
`num_killed_queries`	被终止的查询数量。
`num_opened_sessions`	服务端建立过的会话数量。
`num_queries`	查询次数。
`num_query_errors_leader_changes`	因查询错误而导致的 Leader 变更的次数。
`num_query_errors`	查询错误次数。
`num_reclaimed_expired_sessions`	服务端主动回收的过期的会话数量。
`num_rpc_sent_to_metad_failed`	Graphd 服务发给 Metad 的 RPC 请求失败的数量。
`num_rpc_sent_to_metad`	Graphd 服务发给 Metad 服务的 RPC 请求数量。
`num_rpc_sent_to_storaged_failed`	Graphd 服务发给 Storaged 服务的 RPC 请求失败的数量。
`num_rpc_sent_to_storaged`	Graphd 服务发给 Storaged 服务的 RPC 请求数量。
`num_sentences`	Graphd 服务接收的语句数。
`num_slow_queries`	慢查询次数。
`num_sort_executors`	排序（Sort）算子执行时间。
`optimizer_latency_us`	优化器阶段延迟时间。
`query_latency_us`	查询延迟时间。
`slow_query_latency_us`	慢查询延迟时间。
`num_queries_hit_memory_watermark`	达到内存水位线的语句的数量。

Meta¶

参数	说明
`commit_log_latency_us`	Raft 协议中 Commit 日志的延迟时间。
`commit_snapshot_latency_us`	Raft 协议中 Commit 快照的延迟时间。
`heartbeat_latency_us`	心跳延迟时间。
`num_heartbeats`	心跳次数。
`num_raft_votes`	Raft 协议中投票的次数。
`transfer_leader_latency_us`	Raft 协议中转移 Leader 的延迟时间。
`num_agent_heartbeats`	AgentHBProcessor 心跳次数。
`agent_heartbeat_latency_us`	AgentHBProcessor 延迟时间。
`replicate_log_latency_us`	Raft 复制日志至大多数节点的延迟。
`num_send_snapshot`	Raft 发送快照至其他节点的次数。
`append_log_latency_us`	Raft 复制日志到单个节点的延迟。
`append_wal_latency_us`	Raft 写入单条 WAL 的延迟。
`num_grant_votes`	Raft 投票给其他节点的次数。
`num_start_elect`	Raft 发起投票的次数。

Storage¶

参数	说明
`add_edges_latency_us`	添加边的延迟时间。
`add_vertices_latency_us`	添加点的延迟时间。
`commit_log_latency_us`	Raft 协议中 Commit 日志的延迟时间。
`commit_snapshot_latency_us`	Raft 协议中 Commit 快照的延迟时间。
`delete_edges_latency_us`	删除边的延迟时间。
`delete_vertices_latency_us`	删除点的延迟时间。
`get_neighbors_latency_us`	查询邻居延迟时间。
`get_dst_by_src_latency_us`	通过起始点获取终点的延迟时间。
`num_get_prop`	GetPropProcessor 执行的次数。
`num_get_neighbors_errors`	GetNeighborsProcessor 执行出错的次数。
`num_get_dst_by_src_errors`	GetDstBySrcProcessor 执行出错的次数。
`get_prop_latency_us`	GetPropProcessor 执行的延迟时间。
`num_edges_deleted`	删除的边数量。
`num_edges_inserted`	插入的边数量。
`num_raft_votes`	Raft 协议中投票的次数。
`num_rpc_sent_to_metad_failed`	Storaged 服务发给 Metad 服务的 RPC 请求失败的数量。
`num_rpc_sent_to_metad`	Storaged 服务发给 Metad 服务的 RPC 请求数量。
`num_tags_deleted`	删除的 Tag 数量。
`num_vertices_deleted`	删除的点数量。
`num_vertices_inserted`	插入的点数量。
`transfer_leader_latency_us`	Raft 协议中转移 Leader 的延迟时间。
`lookup_latency_us`	LookupProcessor 执行的延迟时间。
`num_lookup_errors`	LookupProcessor 执行时出错的次数。
`num_scan_vertex`	ScanVertexProcessor 执行的次数。
`num_scan_vertex_errors`	ScanVertexProcessor 执行时出错的次数。
`update_edge_latency_us`	UpdateEdgeProcessor 执行的延迟时间。
`num_update_vertex`	UpdateVertexProcessor 执行的次数。
`num_update_vertex_errors`	UpdateVertexProcessor 执行时出错的次数。
`kv_get_latency_us`	Getprocessor 的延迟时间。
`kv_put_latency_us`	PutProcessor 的延迟时间。
`kv_remove_latency_us`	RemoveProcessor 的延迟时间。
`num_kv_get_errors`	GetProcessor 执行出错次数。
`num_kv_get`	GetProcessor 执行次数。
`num_kv_put_errors`	PutProcessor 执行出错次数。
`num_kv_put`	PutProcessor 执行次数。
`num_kv_remove_errors`	RemoveProcessor 执行出错次数。
`num_kv_remove`	RemoveProcessor 执行次数。
`forward_tranx_latency_us`	传输延迟时间。
`scan_edge_latency_us`	ScanEdgeProcessor 执行的延迟时间。
`num_scan_edge_errors`	ScanEdgeProcessor 执行时出错的次数。
`num_scan_edge`	ScanEdgeProcessor 执行的次数。
`scan_vertex_latency_us`	ScanVertexProcessor 执行的延迟时间。
`num_add_edges`	添加边的次数。
`num_add_edges_errors`	添加边时出错的次数。
`num_add_vertices`	添加点的次数。
`num_start_elect`	Raft 发起投票的次数
`num_add_vertices_errors`	添加点时出错的次数。
`num_delete_vertices_errors`	删除点时出错的次数。
`append_log_latency_us`	Raft 复制日志到单个节点的延迟。
`num_grant_votes`	Raft 投票给其他节点的次数。
`replicate_log_latency_us`	Raft 复制日志到大多数节点的延迟。
`num_delete_tags`	删除 Tag 的次数。
`num_delete_tags_errors`	删除 Tag 时出错的次数。
`num_delete_edges`	删除边的次数。
`num_delete_edges_errors`	删除边时出错的次数。
`num_send_snapshot`	发送快照的次数。
`update_vertex_latency_us`	UpdateVertexProcessor 执行的延迟时间。
`append_wal_latency_us`	Raft 写入单条 WAL 的延迟。
`num_update_edge`	UpdateEdgeProcessor 执行的次数。
`delete_tags_latency_us`	删除 Tag 的延迟时间。
`num_update_edge_errors`	UpdateEdgeProcessor 执行时出错的次数。
`num_get_neighbors`	GetNeighborsProcessor 执行的次数。
`num_get_dst_by_src`	GetDstBySrcProcessor 执行的次数。
`num_get_prop_errors`	GetPropProcessor 执行时出错的次数。
`num_delete_vertices`	删除点的次数。
`num_lookup`	LookupProcessor 执行的次数。
`num_sync_data`	Storage 同步 Drainer 数据的次数。
`num_sync_data_errors`	Storage 同步 Drainer 数据出错的次数。
`sync_data_latency_us`	Storage 同步 Drainer 数据的延迟时间。

图空间级别监控指标¶

Note

图空间级别监控指标是动态创建的, 只有当图空间内触发该行为时，对应的指标才会创建，用户才能查询到。

参数	说明
`num_active_queries`	当前正在执行的查询数。
`num_queries`	查询次数。
`num_sentences`	Graphd 服务接收的语句数。
`optimizer_latency_us`	优化器阶段延迟时间。
`query_latency_us`	查询延迟时间。
`num_slow_queries`	慢查询次数。
`num_query_errors`	查询报错语句数量。
`num_query_errors_leader_changes`	因查询错误而导致的 Leader 变更的次数。
`num_killed_queries`	被终止的查询数量。
`num_aggregate_executors`	聚合（Aggregate）算子执行时间。
`num_sort_executors`	排序（Sort）算子执行时间。
`num_indexscan_executors`	索引扫描（IndexScan）算子执行时间。
`num_auth_failed_sessions_bad_username_password`	因用户名密码错误导验证失败的会话数量。
`num_auth_failed_sessions`	登录验证失败的会话数量。
`num_opened_sessions`	服务端建立过的会话数量。
`num_queries_hit_memory_watermark`	达到内存水位线的语句的数量。
`num_reclaimed_expired_sessions`	服务端主动回收的过期的会话数量。
`num_rpc_sent_to_metad_failed`	Graphd 服务发给 Metad 的 RPC 请求失败的数量。
`num_rpc_sent_to_metad`	Graphd 服务发给 Metad 服务的 RPC 请求数量。
`num_rpc_sent_to_storaged_failed`	Graphd 服务发给 Storaged 服务的 RPC 请求失败的数量。
`num_rpc_sent_to_storaged`	Graphd 服务发给 Storaged 服务的 RPC 请求数量。
`slow_query_latency_us`	慢查询延迟时间。

单进程指标¶

Graph、Meta 和 Storage 服务都有自身的单进程指标。

参数	说明
`context_switches_total`	上下文切换数量。
`cpu_seconds_total`	基于用户时间和系统时间的 CPU 使用量。
`memory_bytes_gauge`	使用的内存字节数。
`open_filedesc_gauge`	文件描述符的数量。
`read_bytes_total`	读取的字节数量。
`write_bytes_total`	写入的字节数量。

最后更新: 2025年4月21日